next up previous contents
Next: Availability and Contact

The DARWIN Manual ©

G. Gonnet and M. Hallett

22 July 2000

The Evolution of Darwin

Darwin  is an easy to use interpreted computer language especially tailored to research in the biosciences. Its purpose is to serve as a biochemists' workbench where researchers can explore molecular sequence data quickly and easily. The Darwin  project began in 1991 and reflects much of the research done in the CBRG (Computational Biochemistry Research Group) at the ETH-Zurich. Darwin itself shares much in common with the language MAPLE which is designed for symbolic computation. Broadly speaking, it consists of two parts: the libraries and the kernel.

The libraries correspond closely to what one expects from a software package: a pre-defined set of functions offered by the system. The libraries reflect current and past trends in our research efforts but also incorporate many algorithms from the literature, particularly those related to sequence comparison, phylogenetic tree construction, multiple sequence alignment and secondary structure prediction. The libraries themselves are written in the Darwin  language and are therefore easy to read even for novice programmers. We briefly describe the current contents of the libraries in the next section.

The kernel of Darwin  is responsible for the lower-level operations in the system: executing commands and libraries, memory management, input/output, communication with the operating system, load balancing, etc. Darwin  itself is written in C although this source code is not made publically available. The kernel also contains critical routines, that is, routines which must be performed efficiently due to the number of times they are called or the complexity of the routines themselves. These include (but are not limited too) routines for pairwise alignment, all versus all alignments, and tree construction. Although the kernel is not modifiable, one can execute native code (that is, user designed code written in C, JAVA, etc.) from within Darwin .

Since Darwin  is a computer language, it allows one to go beyond the fixed set of routines offered in the kernel and library. The language itself is a high-level interpreted language equipped with lists, sets, general data structures, and a robust collection of basic mathematical functions allowing the user to quickly prototype new ideas. Darwin  programs are relatively fast even when compared to optimized C code. A moderately experienced user of the system will be able to modify existing libraries (written in the Darwin  language) when necessary or create new libraries appropriate to whichever problem they are currently exploring. Below is a small example of a Darwin  session:

unix:  darwin                           
  Darwin: Sequence Searching Facility                
  Version 2.0, August 1998                                     
    (c) E.T.H. Zurich                
 DB:=ReadDb('SwissProt38'):         
  Peptide file(SwissProt38(54714687),       
  77977 entries, 28268293 aminoacids)             
 printf( '\nIdentification: %s', 
               Entry( 1 )['ID']); 
  Identification: 100K_RAT                                 
 printf( 'Accession Number: %s', 
               Entry(1)['AC'] );      
  Accession Number: Q62671;  
 CreateDayMatrices():                  
 res := AlignOneAll( 1, DB, DM, 120 ): 
 length( res );                        
  19
 hisofar := 0:  index := 0:
 for i from 2 to length( res ) do     
     if (res[i,Sim]  hisofar) then
           hisofar := res[i,Sim]:  
           index := i:
           fi:
     od:
 printf('\nMost similar is %d with 
      score %5.2f', index, hisofar );
Most similar is 11 with score 301.13

The above program loads the SwissProt v. 38 dataset (ReadDb), then prints out the identification and accession tags for the first entry. After creating the GCB extended Dayhoff matrices, the first entry is compared against all other entries in SwissProt (AlignOneAll). In this example, the alignment is performed at a PAM distance of 250(variable DM) and all signficant matches are stored in the variable res. A significant match here is defined as any match with a similarity score greater than or equal to 120.1 We search through the 19 such matches for the alignment which induced the highest similarity score.

Although large, the libraries distributed with Darwin  are far from complete (computational biology travels simply too fast to make ``keeping up'' viable). Users are invited to submit new libraries for inclusion in future releases of the system.

Combining algorithms both from the literature and research local to the CBRG, our system allows a flexibility that no previous system has offered. This flexibility is an absolute necessity as we enter an age where the analysis of complete genomes will be commonplace. We believe that the power of Darwin  remains largely untapped although over 400 research groups have experimented with our software.

Below is a brief description of the contents of the Darwin  language, libraries and built-in routines. We note however that this description is not complete; there are other libraries for many discrete and continuous mathematical optimization problems plus other smaller tools for manipulating sequence data. The manual contains more information on these topics.

Basic Mathematical Operations The system includes operations for sets, lists, trigometric functions, combinatorial graphs, long integers, real and complex numbers, likelihood/probability calculations, matrices (including LLL decompositions, Gaussian eliminations, Givens eliminations, singular value decompositions, eigenvalue/eigenvector computation, Gram-Schmidt decompositions and linear regressions), control of input/output, and interacting with the operating system.

Pairwise Alignment Darwin  comes equipped with routines to align peptide sequences versus peptide sequences, or nucleotide versus peptide sequences []. The alignment routines are based on the full dynamic programming approach using the GCB matrices. See []. The system can also perform parametric alignments which seek to find the PAM distance which maximizes similarity score. Local alignments, global alignments, and cost-free ``end gap'' alignments are all possible.

Dataset Conversions The system includes routines for converting raw Swiss-Prot [] or EMBL [] flat-files to the Darwin  format. The libraries also include a parser shell which can be easily modified to parse any flat-file.

All versus All Routines One of the most useful features of Darwin  is its ability to perform large scale comparisons of genomes; that is, the alignment of every sequence in a dataset with every other sequence in the dataset. To this end, Darwin  automatically generates a patricia tree2 when a sequence dataset is loaded and provides various built-in (i.e. located in the kernel) functions for performing fast alignments. Also, Darwin  is capable of distributing a large set of jobs over an intra-net, can control the computation of these jobs on the foreign machines, and collect the results.

Peptide Transition Matrices The following peptide transition matrices are built into Darwin : Dayhoff, GCB [], BLOSUM [], amongst others. New matrices can be computed from sample data. There are also routines for converting PAM distance to and from percent identity.

Protein Identification via Peptide Mass Darwin  contains routines for protein identification by aligning the masses of small collections of peptides after N- or C- terminal digestion against either a nucleotide and peptide dataset [].

Phylogenetic Tree and Multiple Sequence Alignment Construction

Historically, tree construction in Darwin  has been based on distance matrices and the system contains various related routines: tree topology construction algorithms (Neighbour joining [], clustering methods [], amongst others), least squares fits to tree topologies, and local optimization routines. There are now routines for tree construction based on circular orders, a new method developed in []. Multiple sequence alignments [] are created relative to a phylogenetic tree and the system includes several methods for scoring the quality of the alignment including a novel method developed in [].

Statistics and Visualization This system includes routines for drawing histograms, dot plots, and bar graphs. The system can also draw unrooted trees, rooted trees, split trees, and combinatorial graphs. There are a large number of routines for producing random permutations, combinations, distributions and specific biological objects such as sequences, trees, and multiple sequence alignments.

This manual describes the Darwin  language and all of the basic functionality of the language including the basic commands, constructors, data types, built-in data structures, and descriptors for all library functions. DARWIN V. 1 suffered from a somewhat scattered and nonintuitive naming scheme for its predefined functions. In order to make Darwin  more usable, we have adopted a standardized naming convention3. Furthermore, a substantial subset of the manual is available via on-line help from within Darwin  and the remainder is available via the WWW [5]. Lastly, a large number of bugs reported by our user base have been fixed.

This book is divided into three parts. Part I - An Introduction to Darwin has been designed to familiarize even the most computer illiterate amongst us with the basic Darwin environment. We have to tried to write to the biologist/biochemist who perhaps has had a first year Introduction to Computer Science course and who has distant memories of for-loops and recursion stored somewhere deep in the recess of their subconscious. An attempt has been made to use simple terminology, only giving those definitions we deem important in later chapters. For new users, we recommend that Part I be read ``in one sitting'' beginning at Chapter 1 - Exploring the Basics and following the discussion and examples through to the end of Chapter 11 - A Guide to Debugging.

An experienced programmer may find that Part I need only be skimmed in order to familiarize themself with the pecularities of the Darwin system. Once comfortable with the language, users may find that it acts as a short reference guide for looking up commands ``on the fly''. Towards this end, we have attempted to make each chapter self-contained.

Chapter 1 - Exploring the Basics provides a basic session with Darwin designed to give new users a feeling of how to interact with the system. Chapter 2 through to Chapter 7 provides a more in depth tour of the basic Darwin language with a focus on the most commonly used commands and routines built into the kernel. We recommend new users become familiar with the topics covered in these chapters before venturing into Part II.

The more esoteric and application specific routines are discussed in Chapter [*] - Genetic Databases, Chapter 8 - Randomization, Statistics and Visualization, Chapter [*] - Producing HTML Code, Chapter 13 - Darwin's Interprocessor Skills, and Chapter 14 - Calling External Functions.

Chapter [*] - Genetic Databases provides an in depth look at how Darwin builds, stores and manipulates genetic databases. In some sense, these data structures are the cornerstone of the system and a fluency with their manipulation will greatly ease the difficulty of programming in Darwin. Chapter 8 - Randomization, Statistics and Visualization contains an overview of the randomization and statistics functions followed with an explanation of the basic primitives available for graphing and plotting information. These are used extensively throughout later chapters, most notably Chapter 16 - Dayhoff Matrices and Mutation Matrices, Chapter 17 - Coping with Insertions and Deletions, Chpater [*] - Generating Random Sequences, and Chapter 22 - Phylogenetic Trees.

An indepth reading of Chapter 9 - Overloading, Polymorphism and Object Orientation, Chapter 12 - Measuring Performance and Chapter [*] - Producing HTML Code should be postponed until the reader feels he/she is particularly comfortable with the system.

Chapter 13 Darwin's Interprocessor Skills  explains the mechanisms built into Darwin for interprocessor communication. These routines allow users to fragment large computationally intensive jobs into smaller pieces which can be distributed automatically to other processors. A complete understanding of this topic is not necessary for one to proceed into later chapters with the exception of the latter half of Chapter [*] - All against All where a program is given which performs an exhausive matching of a set of amino acid sequences.

Each chapter in Part II Darwin and Problems from Biochemistry  examines a different bioinformatic problem. Every chapter contains (1) a statement of the problem, (2) a discussion concerning any biologic assumptions we make about the data, (3) an explanation of how we model the problem mathematically, (4) a description of the algorithm, (5) a Darwin implementation, (6) a discussion about the accuracy and efficiency of our algorithm,and (7) a short guide to the literature. In this manner, we tour the Darwin libraries motivating each routine and data structure with a concrete example. Beyond the understanding of the Darwin libraries, we hope such a presentation gives users

Part [*] - The Reference Guide provides a complete list of the Darwin built-in functions and libraries. It is structured into sections containing routines related by function (e.g. Mathematical Functions, Input/Output Functions, string Searching Functions, and so forth).

The appendices contain some general material including a short introduction to statistics and dynamic programming. For those readers unfamiliar with the mathematics underlying the models we use, these chapters will provide a deeper understanding of our methods. All of the examples and programs used throughout this manual are available via the world wide web (WWW) or by ftp (file transfer protocol). The COMPUTATIONAL BIOCHEMISTRY RESEARCH GROUP at ETH-Zürich maintains a web cite at:

http://cbrg.inf.ethz.ch/
The home page located at this address contains links to this manual and the example code files. If you do not have access to a web browsers, the ftp address
ftp inf.ethz.ch
user: anonymous
also mirrors these files.



 
next up previous contents
Next: Availability and Contact
Gina Cannarozzi
2001-07-05