Tuesday, 30 June 2009

New Challenges and Opportunities in Network Biology

Trey Ideker *****

Richard Karp and Lee Hood are his scientific inspirations.
Wants to reconstruct pathways from multiple biological sources.

If you know something about a pathway you can perturb it systematically in silico and in vivo/vitro. The starting point was the galactose metabolism pathway and its transcriptional control. 3 core genes. 1000 other genes were perturbed in the mutants. Transcriptional interactions were available first, now we have protein-protein interactions and we also had KEGG/metabolism.
Coloured networks - states, expression profiles, phosphorylation etc. Extract sub-networks from the main network based on colour - pathway extraction.

Used phenotype for colouring the invasion of HIV.

What is the goal for the next 10 years? (What do networks actually mean?). Cell is a hairball inside in terms of its interactions.

Developed PathBLAST and NetworkBLAST using the analogy of genome assembly for network assembly. Align the orthologues between species to align the network. Works at the protein family level - removes paralogues. They are not the best algorithms now.

Moves to gene relation networks based on synthetic lethality or phenotype changes for double mutants. Gene interaction networks are logical networks (Booleans). Very little overlap between physical and genetic networks. Gene interactions are between subgraphs that are physical interactions.


ChIP-chip looking at DNA damage caused by stress and DNA repair. Finding where a TF binds by immunoprecipitation - identify the fragments that were bound to the immunoglobulin purified TFs. Problems of drift in that binding might not be significant - non-functional binding. Validated by checking downstream loss in gene deletions but only 10% of those sites identified from ChIP-chip are verified. Spurious interactions take place close to telomeres, closer than 25 kilo-base-pairs. This is condition specific so adding rapamycin can perturb it. Possibly indicates sequestration effects that make the TFs non-functional.

Using it for disease classification. Diagnose breast cancer metastasis using expression profiles. 300 patients 1/3 of whom became metastatic. Area under ROC is 65%. Heterogeneity is a problem. Little overlap between different gene sets from different studies.
Look at sequentiality or connections in protein-protein interactions.


Cancer might be the perturbation out of homeostasis.


Web 09

Bioinformatics in an Undergraduate Programme

Kam Dahlquist, Murli Nair *****

Part of the curriculum development movement - the need to use problem based learning and the need to have a progressive curriculum that follows what will happen in real life. How to build problems and hands-on learning into the biology curriculum, including statistics. Based around the ideas from the http://bioquest.org frameworks.

Online reports about the problems of including quantitative and computational skills into the biology curriculum.
"Bioinformatics is Biology and we cannot just have 2 pages in the textbook" Murli Nair, IUSB.
http://genomebiology.com/2008/9/12/114 All Biologists are Bioinformaticians

Monday, 29 June 2009

Network Based Prediction of Metabolic Enzyme Sub-cellular Localisation

Shira Mintz-Oron***
Clear and easy to understand talk but the method seems very complex. Is it necessary to go to the flux level?

GFP and microscopy are the experimental techniques but they are costly and difficult in higher eukaryotes.

Localisation is by motifs, composition, homology etc.
Here predict localisation based on metabolic networks.
Prior knowledge of the localisation of a subset of the network is needed. The aim is to minimise the number of membrane transport reactions (thermodynamic cost).
Use constraint based modelling to predict the flux rate at a steady state, predict the flux rates under constraints of mass balance, thermodynamic constraints and enzymatic capacity (rate bounds). Maximise biomass production.

  1. Divide the data into the localised and the unlocalised (unknowns) enzymes.
  2. Put all the enzymes into all compartments and build all the transport reactions. Limit enzyme activity for known localised reactions to the compartments in which they are known to be active.
  3. Give a unit penalty for transport reactions - want a low penalty so that some transport is allowed.
Fuzzy classification so gets scores for more than one compartment and so this means that a distribution across multiple compartments is possible.
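The transport-minimisation idea in the steps above can be illustrated with a much simpler toy than the flux model described in the talk. The sketch below is only a stand-in under strong assumptions: it ignores fluxes, mass balance and biomass entirely, treats the pathway as a linear chain of hypothetical enzymes (E1 to E4), and charges one unit penalty whenever consecutive enzymes sit in different compartments. The fraction of equally cheap assignments that place an enzyme in a compartment stands in for the fuzzy score.

```python
from itertools import product

def predict_compartments(pathway, known, compartments):
    """Brute-force search for the assignments with the fewest transport steps.

    pathway: enzymes in reaction order (hypothetical linear chain).
    known: dict of enzyme -> compartment for the already localised enzymes.
    """
    unknown = [e for e in pathway if e not in known]
    best_cost, best = None, []
    for choice in product(compartments, repeat=len(unknown)):
        assignment = dict(known)
        assignment.update(zip(unknown, choice))
        # One unit penalty each time a metabolite must cross a membrane.
        cost = sum(1 for a, b in zip(pathway, pathway[1:])
                   if assignment[a] != assignment[b])
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [assignment]
        elif cost == best_cost:
            best.append(assignment)
    return best_cost, best

def compartment_scores(best, enzyme):
    """Fraction of optimal assignments placing the enzyme in each compartment."""
    counts = {}
    for assignment in best:
        c = assignment[enzyme]
        counts[c] = counts.get(c, 0) + 1
    return {c: n / len(best) for c, n in counts.items()}

# Made-up example: E1 and E4 are localised, E2 and E3 are unknown.
pathway = ["E1", "E2", "E3", "E4"]
known = {"E1": "cytosol", "E4": "mitochondrion"}
cost, best = predict_compartments(pathway, known, ("cytosol", "mitochondrion"))
e2_scores = compartment_scores(best, "E2")
```

In this toy case three assignments tie at one transport step, so E2 gets a score for both compartments rather than a single hard call.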

Using Side Effects of Medicines to Identify Drug Targets

Michael Kuhn **
Phenotype data - what side effects are caused by a drug - from clinical trials - text mined (1000)
Targets of drugs are also available as a dataset - 500
750 drugs used to test whether drugs with similar side effects have similar targets.

  • Synonyms in phenotype data are a problem - use ontologies (COSTART) to cluster the same concepts in the side effects.
  • Some side effects are very common - parent terms of more specific side effects in the ontology - these become non-predictive. Use a log frequency weight.
  • Some side effects are correlated - use Gerstein-Sonnhammer-Chothia weights (from HMM).

Use shuffling to normalise the score and get the side effect similarity. Also measure chemical similarity. Low chemical similarity means a low chance of sharing targets, and the same holds for side effect similarity - both chemical and side effect similarity are needed for an effective description. Side effect similarity is much more effective at predicting shared targets than chemical similarity.
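To make the log frequency weighting concrete, here is a minimal sketch. It assumes a weighted Jaccard score (my choice, not necessarily the one used in the talk) and leaves out the correlation weights and the shuffling-based normalisation. The drug names and side effects are invented for illustration.

```python
import math

def rarity_weights(drug_effects):
    """Weight each side effect by log(number of drugs / drugs showing it).

    A side effect seen with every drug gets weight 0 and so cannot
    contribute to the similarity.
    """
    n = len(drug_effects)
    counts = {}
    for effects in drug_effects.values():
        for effect in effects:
            counts[effect] = counts.get(effect, 0) + 1
    return {effect: math.log(n / c) for effect, c in counts.items()}

def weighted_similarity(a, b, weights):
    """Weighted Jaccard: shared weight divided by total weight."""
    shared = sum(weights[e] for e in a & b)
    total = sum(weights[e] for e in a | b)
    return shared / total if total > 0 else 0.0

# Hypothetical data: nausea is reported for every drug.
drugs = {
    "drug_a": {"nausea", "tremor"},
    "drug_b": {"nausea", "tremor", "rash"},
    "drug_c": {"nausea", "rash"},
    "drug_d": {"nausea"},
}
weights = rarity_weights(drugs)
sim_ab = weighted_similarity(drugs["drug_a"], drugs["drug_b"], weights)
sim_ad = weighted_similarity(drugs["drug_a"], drugs["drug_d"], weights)
```

Here drug_a and drug_d share only the ubiquitous side effect, so their similarity is zero, while drug_a and drug_b share a rarer one and score higher.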


Can create a drug-drug network connected when they share a target which will have the same phenotype. Rabeprazole is an exception as it is not a nervous system drug but a stomach drug.

Side effects also occur with the placebo as well as with the drug.

On targets and off targets are treated equally in the model - so they assume the same off target is affected in the case of having the same side effects.


Clustered Alignments of Gene-Expression Time Series Data

Adam Smith ***
Want to align time series to compare treatments so you can find causative effects. Ultimately would be good to search a database for similar effects to find genes that are operating together.

You get warping so you need to align equivalent points. Use splines to create continuous series from discrete data.
Shorting alignments trim series to the same features of maxima and minima.
SCOW - efficient method for aligning time series.

  • Most extensive fitting is dynamic programming to minimise Euclidean distances between two time series being aligned (Sakoe and Chiba 1978)
  • Parametric time warping (Eilers 2005) approximate warping to a parabolic or linear warp.
  • Segment based warping (Smith et al 2008) - alignment score for different segments
  • COW correlation optimised warping - points of discontinuity are called knots - where there is a break and a warp.
  • SCOW - shorting correlation optimised warping.
Evaluated on the EDGE toxicology database: 216 observations, 1600 genes, time points from 6 to 96 hours.
Trying to match query to find the most similar treatment profiles in the database.
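For reference, the dynamic programming baseline in the list above can be sketched in a few lines. This is plain dynamic time warping with an optional band on how far the two time indices may drift apart, not COW or SCOW. The `window` parameter name and the toy series are my own assumptions.

```python
import math

def dtw_distance(x, y, window=None):
    """Dynamic time warping distance between two numeric series.

    Minimises the summed squared differences over all monotone alignments,
    then returns the square root. `window` limits how far the two time
    indices may drift apart (None means no limit).
    """
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)
    # The band must be at least wide enough to reach the final cell.
    window = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return math.sqrt(cost[n][m])

# The second series is the first with its opening time point held longer.
profile = [0, 1, 2, 1, 0]
delayed = [0, 0, 1, 2, 1, 0]
d_warped = dtw_distance(profile, delayed)
```

Because the delayed profile is only a time-warped copy, the distance comes out as zero, which is the behaviour you want when matching a query against treatment profiles sampled at different rates.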

For clustering pick an average time series alignment and then the extremes above and below before continuing to add the other time series distances to each of the clusters. By using clusters the alignments are improved.

Modelling Stochasticity and Robustness in Gene Regulatory Networks

Abhishek Garg ****

Takes a Boolean modelling approach. Stochastic Boolean Modelling

Environmental input changes the nature of the differentiated cell. Stochastic behaviour is common in biological models.

Robustness maintains functionality over perturbations. In gene networks do they move between steady states (can they change attractor), and give different cell type.

Stochasticity can be applied to the nodes or the functions.
Not Probabilistic Boolean Networks - Datta group.
Use Boolean functions AND, OR etc.

Stochasticity in nodes flips the output using a probability distribution.
Kauffman, Willanda etc lots of literature.
Over-represents noise by placing it at the end and not at the inputs/intermediates.

The more interactions are involved the more stochastic the behaviour, so allosteric and protein localisation effects have high noise.
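A minimal sketch of the "stochasticity in nodes" variant, where each node's Boolean output is flipped with a fixed probability after its rule is evaluated. The two-gene mutual repression network, the synchronous update and the parameter names are my assumptions for illustration, not the model from the talk.

```python
import random

def step(state, rules, flip_prob, rng):
    """One synchronous update; each node output is flipped with flip_prob."""
    new_state = {}
    for node, rule in rules.items():
        value = rule(state)
        if rng.random() < flip_prob:
            value = not value  # noise applied at the node output
        new_state[node] = value
    return new_state

def simulate(state, rules, flip_prob, steps, seed=0):
    """Return the list of states visited, starting with the initial state."""
    rng = random.Random(seed)
    trajectory = [dict(state)]
    for _ in range(steps):
        state = step(state, rules, flip_prob, rng)
        trajectory.append(state)
    return trajectory

# Hypothetical toggle switch: A and B repress each other.
rules = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
}
start = {"A": True, "B": False}
quiet = simulate(start, rules, flip_prob=0.0, steps=5)
noisy = simulate(start, rules, flip_prob=0.1, steps=200)
```

Without noise the network stays in its starting attractor. With noise you can count how often a trajectory ends up in the other steady state (A off, B on), which is one simple way to put a number on robustness.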

Modelling Ecological and Genetic Diversity in Bacteria

Eric Alm ***
Used adaptML to look at the partitioning of bacteria by habitat and seasonal variation. This is an MC based method.

Ecological preferences are sufficient to produce speciation - what was defined as one species was found to have 17 different habitat locations and seasonal variations. Habitats are defined by net sizes or species in which they are present. Move from fish to squid means adaptation to the mucus membranes and so acquired new features.

Go from zooplankton to small particle associated. Looking for patterns of gene expression between different lifestyles. Did 20 genomes - how do you analyse a population of genomes?

  1. Where are the recombination break-points?
    • Inconsistency of a phylogenetic tree at a point with phylogenetic trees from preceding bases.

    • McDonald-Kreitman Test

Information and Biology

Pierre-Henri Gouyon ***** (cannot give this enough stars this is fundamental)

Takes Richard Lewontin's view - Triple Helix - Genes, Epigenetics and Environment.

Biodiversity - treasure but also a problem - devil is in the detail.
Biology is a young science - anthropomorphism and vitalism are difficult to escape, hard to make it Galilean like physics. Born really in 1859 with Darwin.
  • Mandrake is human seed grown from the ground.
  • Linnaeus is the first bioinformatician.
"eternal law of reproduction and multiplication within the limits of their proper types" Linnaeus

Diversity is the variety of species - is biodiversity just counting the species and what differences warrant new species - in insects differences are much smaller.
  • Cuvier - extinctions destroy invariability, there is no transformation.
  • Lamarck - species transform, they never become extinct.
Darwin considered individuals and not species. Product of his environment and seeing the struggles around him. All are as evolved as the others.

"no clear line of demarcation has as yet been drawn between species and sub-species" Darwin.

It is a continuum - this is very important. We have become bean counters and not integrative biologists.

Galton developed heritability and Weismann kills off acquired characteristics - amputees do not give birth to amputees. I do not pass on what I have "learnt", only what was given to me. I am like a tube through which something passes so the only effect an individual can have is over the number of progeny.
  • Bateson pupil of Galton - did not like Maths
  • Pearson was the maths student.
Systematics and genetics converge with coalescence. Diversity needs to reflect distances as well as number of leaves.

Epigenetics - what the reading system adds to pass on from one generation to the next.

"Individuals are contingent artefacts invented by genes to be reproduced."
What information produces the individual?
No linear relationships - DNA does not go to RNA like carbon + oxygen goes to carbon dioxide. So the flow is not linear or free from weight.

Genes are agents and not limited to the matter - genes are not equal to DNA.

The same response can be triggered from many different causes so how do you quantify information without context? You need to be able to read a book for the information to be transformed into action. Recipes do not make dishes. Need recipe, chef (reading system) and the environment - his oven/kitchen.

The Triple Helix

Nature is not moral or immoral it is indifferent. This was the shock of Darwin.

Eugenics Ada Juke case - sex perverts and illegitimacy were held to be heritable. It was common to all geneticists of the time - they all believed they could do the most for mankind, scientists' sense of superiority. Ended at Nuremberg so now we look at the human rights.

  • Not determined by our genes but our environment - left wing.
  • Innate is more important - right wing.
Exchange horses and riders so each rider rides each horse over the same course. Contingent over the horses and the riders. Results cannot be extrapolated to new populations - only valid for the same sets.

Need a proper definition of biological information beyond Shannon.

Sunday, 28 June 2009

BioPathways SIG II - ISMB/ECCB 2009

Using FunCoup to predict disease genes

Eric Sonnhammer *
Build networks combining all biological data: PPI, coexpression, phylogenetic profiles (for co-evolution), domain interactions, subcellular co-localisation, TF binding sites, miRNA targeting.

Uses orthology to relate between organisms. Data is compared vs "training sets" in a Bayesian Framework. Using Gold Standard datasets HPRD for PPI, KEGG metabolic and signalling pathways. Uses log odds ratios. Networks are cut at a confidence of 0.2; the number of interactions grows exponentially with confidence level (why does this worry me?). Lots of support for interactions comes from orthologous species.
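As a reminder of how log odds ratios combine evidence, here is a generic naive Bayes sketch. It is not FunCoup's actual scoring scheme: the evidence types, probabilities and prior are invented, and treating the evidence types as independent is an assumption.

```python
import math

def log_odds(p_given_linked, p_given_unlinked):
    """Log likelihood ratio for one piece of evidence.

    p_given_linked: probability of seeing the evidence for gold standard pairs.
    p_given_unlinked: probability of seeing it for background pairs.
    """
    return math.log(p_given_linked / p_given_unlinked)

def posterior(total_llr, prior):
    """Convert a summed log odds score into a posterior probability."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * math.exp(total_llr)
    return odds / (1 + odds)

# Hypothetical evidence for one gene pair (all numbers made up).
evidence = {
    "ppi": log_odds(0.30, 0.05),
    "coexpression": log_odds(0.40, 0.20),
    "colocalisation": log_odds(0.50, 0.50),  # uninformative, contributes 0
}
total = sum(evidence.values())
confidence = posterior(total, prior=0.01)
keep_link = confidence >= 0.2  # cutoff taken from the notes above
```

Because the scores add, support transferred from orthologues in other species is just more terms in the sum, which is why it can dominate the final confidence.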


Query from a few starting genes. Searches across species show orthologues. Examples for Parkinsons and Alzheimers from his paper.


The methods can only predict associations to disease if the genes are part of the Large Central Component - isolated genes are not within the network.



S Schbath ****
Finding motifs with unusual statistics. Can run on a set of words not just words as letters. Restriction sites are not common or DNA would be cleaved too often - e.g. EcoRI sites. Chi-motif is very common as it protects DNA from enzymic degradation. Promoter regions are also uncommon.

Chi motifs are species specific. Skewed so you can check for the sequence against the reverse complement to check levels of skew between strands.

Need Gaussian statistics with high word frequencies and Poisson based models with low frequencies. This allows you to use z-tests for short words which are frequent. Set the distribution in the command line.
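A rough sketch of the two regimes, assuming the simplest possible background model: independent letters with frequencies taken from the sequence itself. It ignores the overlap corrections and Markov backgrounds that a real word-statistics tool would use, and the function names are mine.

```python
import math
from collections import Counter

def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def expected_count(seq, word):
    """Expected count if letters were drawn independently."""
    freqs = Counter(seq)
    n = len(seq)
    p = 1.0
    for letter in word:
        p *= freqs[letter] / n
    return (n - len(word) + 1) * p

def z_score(observed, expected, positions):
    """Gaussian approximation, suited to frequent (short) words.

    Uses a binomial variance and ignores word overlaps (an assumption).
    Undefined when the expected count is zero.
    """
    p = expected / positions
    return (observed - expected) / math.sqrt(expected * (1 - p))

def poisson_tail(observed, expected):
    """P(X >= observed) for a Poisson count, suited to rare (long) words."""
    below = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                for i in range(observed))
    return 1.0 - below

seq = "ACGTACGT"
exp_ac = expected_count(seq, "AC")
obs_ac = count_overlapping(seq, "AC")
z_ac = z_score(obs_ac, exp_ac, len(seq) - 1)
```

A large positive z-score marks an over-represented word and a large negative one an avoided word. For rare words the Poisson tail gives a p-value directly.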

Can compare distributions of words between two regions, organisms etc.

  • H0 equally exceptional in both sequences
  • H1 more exceptional in the first sequence
  • Adds -seq2 to the command line.

Saturday, 27 June 2009


I am interested in high performance computing for both data-mining and implementing stochastic models. An exciting new possibility is the use of Graphics Processing Units for parallel processing.

The Biomanycores Project

nVidia released a beta version of a compiler for their cards in May which makes development easier. CUDA is available as open source with extensive documentation but the preferred method will be OpenCL when this becomes more widely available as this will not be hardware dependent.


Supported by the nVidia Professor Partnership.

An early example of code was the implementation of Smith-Waterman in SWcuda (~10 papers on implementations in 2007-2009). Can be built into any of the Bio* projects which ensures portability (Java, Python, Perl) but requires CUDA or OpenCL SDKs.

Incorporating GPUs into the R statistical environment

Josh Buchner ***
nVidia cards programmed with CUDA: GTX 260 and 295.
New project (2 months) designed for exploratory data analysis of large-scale biomedical datasets. Implemented the Granger causality test and the Pearson Correlation Coefficient.
  • gpuGranger
  • gpuCor
  • gpuHClust
  • gpuDistClust
  • gpuMi
  • gpuSolve
  • Also built an interface to SVMs but currently only available in Linux and Mac versions.

BioPathways SIG - ISMB/ECCB 2009

An organisational disaster. We have all been given the wrong schedule. We have been given last year's schedule and the abstracts for this year's talks! Now I have to redo my plans for today so that I can go between pathways and BOSC and see what I want to see.

I use a five star rating for how useful I found the talk, but remember that is always personal!

Systems Biology Graphical Notation

Nicolas Le Novère **

SBGN is a support for creating views of networks as there are many different possible representations and a useful tool cannot constrain the representation you wish to create.

Rule-based modelling: Model perturbations and resolution

Russell Harmer *****

Excellent simple technique for creating rule-based agent models that only depends on defining the entities and the relationships between them. This allows many species to be grouped and simplified and prevents a combinatorial explosion. The use of hierarchies of agents removes the problem of a second combinatorial explosion when you are trying to deal with mutations and modified species. It is simple to adapt as a verbal model and there is none of the maths of ODEs but it gives a useful representation that improves our knowledge. This demonstrates that verbal models are still important in biology and that maths still has limitations where there is complexity.

Topological network alignment uncovers biological function and phylogeny

Tijana Milenkovic ***

A breathless talk but it kept me awake. The aim of the work is to use network topology to discover biological features without using sequence information. Current methods align the sequences at nodes and, when they are homologous, align the nodes, so they do not depend only on network topology. This includes a heuristic to compare networks which is a nice problem to work on in itself. If you have a metric for the distances between networks then you can construct phylogenetic trees.

The heuristic starts from a seed node that is aligned between the two networks based on the graphlet degree vector of that node i.e. how many triangles and rhomboids are found centred on that node. The ultimate finding is that the phylogeny using the topological alignment distance agrees with that from sequence methods in fungi and protists.
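To show what a graphlet degree vector counts, here is a sketch restricted to the graphlets on two and three nodes: plain degree, end of an induced 2-edge path, middle of an induced 2-edge path, and triangle. The full vector also covers larger graphlets (including the four-node shapes mentioned above), which are left out here. The function names and the toy network are mine.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def small_graphlet_degrees(adj, v):
    """Return (degree, path_end, path_middle, triangles) for node v."""
    nbrs = adj[v]
    triangles = 0
    path_middle = 0
    for a, b in combinations(sorted(nbrs), 2):
        if b in adj[a]:
            triangles += 1    # v, a, b are all connected
        else:
            path_middle += 1  # a - v - b with no a - b edge
    path_end = 0
    for u in nbrs:
        for w in adj[u]:
            if w != v and w not in nbrs:
                path_end += 1  # v - u - w with no v - w edge
    return (len(nbrs), path_end, path_middle, triangles)

# Toy network: a triangle a-b-c with a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
vec_c = small_graphlet_degrees(adj, "c")  # (3, 0, 2, 1)
vec_d = small_graphlet_degrees(adj, "d")  # (1, 2, 0, 0)
```

Nodes with similar vectors sit in similar local wiring, so a seed pair can be chosen by comparing vectors between the two networks without looking at sequence.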

Question: someone asked how the metabolic networks of the fungi were reconstructed. This uses sequence homology and so there is an indirect element of sequence homology in the networks that are being compared. The questioner brought a laugh from the audience by saying this is why the tree based on sequence homology is the same as that from the new program.

But this is wrong: the phylogenetic tree based on sequence is likely to be based on rRNA and not amino acids. Phylogeny is different for different genes and so you would get a different phylogeny for different parts of the network!

The possible weakness of the work is that adding the sequence removes the degeneracy problem to make sure you have aligned the right nodes in the two networks to each other. Then from these "seeds" you can extend it to align nodes where there is little sequence identity that might suggest duplication and recruitment events. Never forget the biology.

Graphlet paper
Natasa Przulj

Other views