
Olivier Bastien

Theoretical advances and numerical methods for genomes comparisons. Application to the Plasmodium falciparum/Arabidopsis thaliana genomes and proteomes comparison

Published on 21 April 2006

Thesis presented April 21, 2006

Malaria is a major threat to humankind, with roughly half a billion people infected. Recently, one of the best-known attributes of plant cells, a relic chloroplast termed the apicoplast, was discovered within the cells of apicomplexan parasites and appears to hold vital functions unique to plants. It is therefore now accepted that the "plant side" of the parasite harbors innovative targets for intervention using molecules with herbicidal properties. To that end, the release of the complete genome of Plasmodium falciparum paved the way for the search for innovative plant-related protein targets.
A first step in searching for such targets is the genome-scale pairwise comparison between a plant model, such as Arabidopsis thaliana, and P. falciparum. The first release of the P. falciparum genome identified 5,268 predicted proteins, of which 60 % lack sufficient similarity to proteins in other organisms to justify a functional assignment. A singular feature of the P. falciparum genome was put forward to explain this prediction failure: its A+T richness (82 %), which is known to influence the distribution of amino acids in proteins. To account for this feature, we developed a new scoring scheme that extends the BLOSUM model, the non-symmetric dirAtPf matrices, which take into account the difference in the global distribution of amino acids in proteins between the two species.
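The idea of a non-symmetric substitution matrix can be illustrated with a minimal sketch. It assumes a BLOSUM-style log-odds score in which each species contributes its own background amino-acid composition; this is an illustration of the general principle only, not the construction of the dirAtPf matrices, and the function name, the two-letter alphabet and all frequencies below are hypothetical toy values.

```python
import math

def asymmetric_log_odds(pair_freq, bg_a, bg_b):
    """Log-odds scores (in bits) with a separate background per species.

    pair_freq[(a, b)]: observed frequency of residue a (species A)
                       aligned with residue b (species B).
    bg_a, bg_b:        background residue frequencies in each species.
    Because bg_a and bg_b differ, score(a, b) need not equal score(b, a).
    """
    return {
        (a, b): math.log2(q / (bg_a[a] * bg_b[b]))
        for (a, b), q in pair_freq.items()
    }

# Toy two-letter alphabet; every number here is invented for illustration.
toy_pair_freq = {("K", "K"): 0.3, ("K", "L"): 0.1,
                 ("L", "K"): 0.2, ("L", "L"): 0.4}
toy_bg_a = {"K": 0.4, "L": 0.6}   # hypothetical composition, species A
toy_bg_b = {"K": 0.5, "L": 0.5}   # hypothetical composition, species B

toy_scores = asymmetric_log_odds(toy_pair_freq, toy_bg_a, toy_bg_b)
```

In this toy case the score for K aligned with L differs from the score for L aligned with K, which is the property that a symmetric BLOSUM matrix cannot express.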
A further contribution to sequence analysis theory is a mathematical demonstration that provides a single-linkage clustering criterion for genome-scale comparison. This demonstration relies on the Z-value computation and the Bienaymé-Chebyshev theorem.
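The ingredients named above can be sketched as follows, under stated assumptions. A Z-value is estimated by comparing an observed score with scores obtained after shuffling one sequence; the Bienaymé-Chebyshev inequality then bounds the probability of a deviation of at least Z standard deviations by 1/Z², whatever the score distribution; and pairs whose Z-value exceeds a threshold are linked into single-linkage clusters. The toy identity score stands in for a real alignment score, and all function names, the number of shuffles and the threshold are hypothetical choices rather than those of the thesis.

```python
import random
import statistics

def identity_score(a, b):
    """Toy score: count of matching positions (stand-in for an alignment score)."""
    return sum(x == y for x, y in zip(a, b))

def z_value(a, b, score=identity_score, n_shuffles=200, seed=0):
    """Monte Carlo Z-value: (observed - mean) / std over shuffles of b."""
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        shuffled.append(score(a, "".join(letters)))
    mu = statistics.mean(shuffled)
    sigma = statistics.pstdev(shuffled)
    if sigma == 0:
        return float("inf") if observed > mu else 0.0
    return (observed - mu) / sigma

def chebyshev_bound(z):
    """Distribution-free bound on P(|S - mu| >= z * sigma), valid for z > 0."""
    return 1.0 if z <= 0 else min(1.0, 1.0 / z ** 2)

def single_linkage_clusters(names, pair_z, z_min):
    """Link every pair with Z >= z_min and return the connected components."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), z in pair_z.items():
        if z >= z_min:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

For example, a threshold of Z = 8 corresponds to a Chebyshev bound of 1/64, so under this sketch every link in a cluster carries a probability of at most about 1.6 % of arising from a score deviation of that size by chance.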
We re-examined the estimate of sequence "dissemblance within assessed resemblance" as a source for divergence time calculation and evolutionary reconstruction. We sought the probabilistic, statistical and geometric rules that an optimal alignment score must obey with respect to the recently demonstrated TULIP theorem. We used these rules as a framework of constraints to build a geometric representation of a space of probably homologous proteins and to define a theoretically explicit measure of protein proximity. Finally, we constrained the topology associated with this geometric space by i) respecting the protein clock and derived phylogenetic models and ii) taking into account the lineages that separate sequences from the ancestral diverging events. This unified model, called the TULIP topological space, reconciles concepts from different fields of protein science that had not yet been explicitly connected. Because the spatial geometry and topology of probably homologous proteins, built from pairwise alignments, are univocal, applications include the reconstruction of univocal classification trees. The power of this elaborate topological spatial representation is illustrated by comparison with phylogenetic reconstructions obtained from multiple alignments.

Sequence alignments, genomic comparison, Malaria, Plasmodium falciparum
