Prof. Gaston Gonnet

Where Did SARS Come From? Analyzing the Evolutionary History of the SARS Virus

Computational Biochemistry and Computer Algebra

Traditionally, phylogenetic trees are computed on the basis of multiple sequence alignments. For species with low sequence similarity and a high degree of gene rearrangement, such as RNA viruses, reliable multiple sequence alignments are very difficult if not impossible to obtain. We apply a different approach that uses the k-nucleotide composition as an evolutionary signal. We studied extensively several methods of creating a phylogenetic tree from a k-nucleotide composition vector. The combination of odds ratios, Euclidean distance and trinucleotides gave the best phylogenetic trees when compared to a known tree of eleven vertebrate species. The developed methods were also used to classify the SARS virus. Our results suggest that the SARS virus belongs to group I of the coronaviruses.

The project presented here was a collaboration between our group and Prof. Dr. Mathias Ackermann and Dr. Kurt Tobler, both from the Institute of Virology at the University of Zurich.

Authors: Peter von Rohr, Markus Friberg, Gina Cannarozzi and Gaston Gonnet, January 27, 2004

Evolutionary origin of SARS Virus heavily debated

Since the outbreak of SARS (Severe Acute Respiratory Syndrome), the evolutionary origin of the virus causing the disease has been heavily debated. Opinions range from the virus having developed independently to it being a close evolutionary relative of other coronaviruses. Some people even believe in the so-called “Far-Out Theory”, according to which the SARS virus has its origins in outer space. The evolutionary history of the SARS virus matters for the development of vaccines and treatments: if the virus can be shown to be related to known viruses, knowledge about existing vaccines and treatments would speed up the development of new ones tremendously.

In a collaboration with researchers from the Institute of Virology at the University of Zurich, our group (Computational Biochemistry Research Group) has contributed to shedding some light on the evolutionary origin of the SARS virus.

Phylogenetic trees as graphical representation of evolution

Inferring the evolutionary history of a given set of taxonomic species from molecular sequence data (mostly proteins, sometimes DNA) is a long-studied problem in Computational Biology. In the simplest case of the problem, we are given a set of species, each represented by one molecular sequence. The sequences have to be homologous, which means that they are related by common ancestry. The species are clustered according to the similarity of their sequences. The result of such an evolutionary analysis is usually displayed as a so-called phylogenetic tree. A phylogenetic tree is a special type of binary tree with unlabeled internal nodes. The leaves of the tree correspond to the species that are analyzed. An example of a phylogenetic tree is given in Figure 2.

Multiple sequence alignment required

The most common methods for building a phylogenetic tree are based on multiple sequence alignment (MSA). A multiple sequence alignment is an array in which each sequence occupies one row. Each column contains those characters of the sequences (either amino acids or nucleotides) that are believed to be derived from the same codon in the common ancestor of all species. Either complete genomes or deduced amino acid sequences of coding regions are compared. Both approaches suffer from disadvantages.

A biological phenomenon called gene rearrangement, in which the order of the genes on the DNA molecule changes, further complicates or completely destroys multiple sequence alignments. Gene rearrangements are very common in viruses. Finally, finding good multiple sequence alignments for long sequences and many organisms is computationally very intensive and sometimes even impossible. In summary, a multiple sequence alignment is a very good source of information for constructing phylogenetic trees, but computing the multiple sequence alignment itself can be a very difficult problem.

In the case of coding nucleic acids, the analysis has to be limited to the specific parts that encode related proteins. Naturally, we have to know the coding regions, and we have to make an arbitrary choice of which regions to compare. Different regions can lead to different trees. If the sequences are very different in length or show a low degree of similarity, the resulting multiple sequence alignment can be of low quality, leading to unreliable trees.

k-nucleotide frequencies provide an evolutionarily conserved signal

The main focus of this project was to infer the evolutionary history of 49 RNA viruses, including the SARS virus, by reconstructing a phylogenetic tree. Because the degree of sequence similarity among the RNA viruses is rather small and gene rearrangements occur quite frequently in these viruses, we did not reconstruct the phylogenetic tree via a multiple sequence alignment but instead chose an alternative route. It has been known for quite some time that dinucleotide composition is similar in related organisms and different in unrelated organisms, and thus can be used as a measure of distance between two organisms. Dinucleotide composition is the frequency of occurrence of pairs of adjacent nucleotides in the DNA of an organism, e.g. how many times does the dinucleotide “AC” occur in this genome? Instead of just looking at dinucleotides, we extended the approach to k-nucleotides, that is, to the frequencies of runs of k consecutive bases in the DNA sequence. Using the k-nucleotide frequencies as an evolutionary signal, we studied different ways of reconstructing a phylogenetic tree, as shown in Figure 1 (below).
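To make the notion of k-nucleotide composition concrete, here is a minimal Python sketch. It is not the code used in the project (that is provided as a bio-recipe); the function name `kmer_frequencies`, the use of overlapping windows, and the decision to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k k-nucleotides in seq.

    Assumptions of this sketch: windows overlap (step of one base),
    RNA 'U' is treated as 'T', and windows containing any character
    other than A, C, G, T are ignored.
    """
    seq = seq.upper().replace("U", "T")
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all k-nucleotides in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-nucleotides")
    return {m: counts[m] / total for m in kmers}


if __name__ == "__main__":
    # Toy sequence, not real genome data.
    freqs = kmer_frequencies("ACACGT", 2)
    print(freqs["AC"])  # "AC" fills 2 of the 5 dinucleotide windows
```

For k = 2 this yields a 16-component composition vector per genome, and for k = 3 (trinucleotides) a 64-component vector.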

Figure 1: Alternative ways of computing a phylogenetic tree from k-nucleotide frequencies

We either computed distances between the viruses directly from the k-nucleotide frequencies, or first transformed the frequencies to odds ratios or Poisson deviates and derived the distance measures from the transformed values. As distance measures we used Euclidean distance, relative entropy, the chi-square test statistic, or a likelihood-based score. From the pairwise distance matrix for all of the viruses, the topology and the branch lengths of the phylogenetic tree were estimated by an iterative least-squares approach. The quality of each combination of k-nucleotides and method was evaluated against a known phylogenetic tree of eleven vertebrate species. The combination of odds ratios, Euclidean distance and trinucleotides resulted in the most accurate phylogenetic trees. More details on the methods, and the code performing all these computations, are available in the form of a bio-recipe.
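The best-performing combination (odds ratios, Euclidean distance, trinucleotides) can be sketched in Python as follows. The text above does not spell out the odds-ratio formula, so this sketch assumes a common simple form: the observed k-nucleotide frequency divided by the product of the frequencies of its individual bases. The function names and the toy sequences are hypothetical, and the tree-fitting step (iterative least squares) is not shown.

```python
import math
from collections import Counter
from itertools import product


def odds_ratios(seq, k=3):
    """Odds ratio of each k-nucleotide: observed / expected frequency.

    Assumed definition (not taken from the article): the expected
    frequency is the product of the single-base frequencies.
    """
    seq = seq.upper().replace("U", "T")
    base_counts = Counter(b for b in seq if b in "ACGT")
    n_bases = sum(base_counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if n_bases == 0 or total == 0:
        raise ValueError("sequence too short or without valid bases")
    base_freq = {b: base_counts[b] / n_bases for b in "ACGT"}
    ratios = {}
    for m in kmers:
        expected = math.prod(base_freq[b] for b in m)
        observed = counts[m] / total
        # If a base never occurs, expected is 0; this sketch
        # arbitrarily sets the ratio to 0.0 in that case.
        ratios[m] = observed / expected if expected > 0 else 0.0
    return ratios


def euclidean(u, v):
    """Euclidean distance between two odds-ratio vectors (dicts)."""
    return math.sqrt(sum((u[m] - v[m]) ** 2 for m in u))


def distance_matrix(seqs, k=3):
    """Pairwise distances for a dict mapping name -> sequence."""
    names = list(seqs)
    vecs = {n: odds_ratios(seqs[n], k) for n in names}
    return names, [[euclidean(vecs[a], vecs[b]) for b in names]
                   for a in names]


if __name__ == "__main__":
    # Toy sequences for illustration only, not real viral genomes.
    toy = {"a": "ACGTACGTAC", "b": "AAAACCCCGGGGTTTT", "c": "ACGTACGTAC"}
    names, d = distance_matrix(toy)
    for name, row in zip(names, d):
        print(name, [round(x, 3) for x in row])
```

The resulting symmetric matrix is the kind of input from which a tree topology and branch lengths would then be estimated.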

Where did SARS come from?

For all combinations of methods and k-nucleotides, the SARS virus was consistently clustered into the group of coronaviruses, which are denoted by the single circles towards the right end of Figure 2.

Figure 2: Unrooted tree of 49 RNA viruses based on the Euclidean distance of odds ratios using trinucleotides

However, the relative topological position of the SARS virus within the coronaviruses differed slightly when different combinations of methods and k-nucleotides were used. To position the SARS virus more consistently within the group of coronaviruses, the analysis was repeated for the seven coronaviruses only. The resulting trees all had the same topology, no matter which method or k-nucleotide composition was used (Figure 3). The bottom line is that the SARS virus is consistently classified together with TGEV, HCoV-229E, and PEDV into group I of the coronaviruses.

Figure 3: Unrooted tree of seven coronaviruses based on the Euclidean distance of odds ratios using trinucleotides
 


© 2012 ETH Zurich | Imprint | 27 June 2006