
The DARWIN Manual ©

G. Gonnet - M. Hallett

29 August 1997

The Evolution of Darwin

This section is reserved for a short description written by Gonnet and Benner about the early beginnings of Darwin: the motivation, who worked on it, etc.

The following can be completed at the end.

Darwin was designed to be a workbench where biochemists could forge tools to explore genetic data quickly and easily. The workbench is a partially interpreted, general-purpose language containing many built-in routines and data structures (tools from which tools can be built) especially tailored to the wants and needs of the bioinformatics community.

Darwin provides a flexible structure to hold a complete or partial genetic database. It is general enough to hold DNA, RNA or amino acid sequences and allows an unlimited amount of annotation information to be kept alongside each entry.

(Example of a SwissProt entry in Darwin) (Example of an EMBL entry in Darwin)

Dayhoff matrices

(Example of a Dayhoff matrix)

PAM distance

One of the first motivations for the Darwin system was to allow for the complete "self matching" of the entire SwissProt database. The goal was to compare every entry (some 30,000 at the time) against every other entry. It is a simple task to create pairwise comparisons of entries in Darwin.
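To give a feeling for the size of such a self-matching task, the following is a minimal sketch in Python (not Darwin code). It assumes a toy scoring function, identity_score, that simply counts identical positions; this is a stand-in for a real alignment score and is not the method Darwin uses. The sequence names and contents are hypothetical.

```python
from itertools import combinations


def identity_score(a: str, b: str) -> int:
    """Toy stand-in for an alignment score: count positions that match."""
    return sum(1 for x, y in zip(a, b) if x == y)


def all_against_all(entries: dict[str, str]) -> dict[tuple[str, str], int]:
    """Score every unordered pair of entries exactly once."""
    return {
        (id_a, id_b): identity_score(entries[id_a], entries[id_b])
        for id_a, id_b in combinations(sorted(entries), 2)
    }


def pair_count(n: int) -> int:
    """Number of unordered pairs among n entries: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical toy sequences, not taken from SwissProt.
toy = {"seq1": "MKTAYIAK", "seq2": "MKTAHIAK", "seq3": "MRTAYLAK"}
scores = all_against_all(toy)
```

For 30,000 entries, pair_count gives 449,985,000 comparisons, which is why an exhaustive matching is a substantial computation even when each individual comparison is simple to state.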

(Example of an AA-AA comparison)

(AA-AA comparisons with PAM distances)

(example of PAM distance related to Dayhoff matrices)

The built-in Darwin pairwise comparison algorithms are based on the method of maximum likelihood. In broad terms, this methodology is based on the belief that we can test whether two sequences are truly homologous by testing the hypothesis

Gene finding

(Example of a DNA-AA comparison)

Phylogenetic Trees

(Example of a Phylogenetic tree - rooted)

Multiple Sequence alignments

(Example of an MSA)

Probabilistic ancestral sequences

(Example of a PAS)

Secondary Structure

(Example from SAINT)

Molecular Weight Traces

(Example of a Molecular weight trace)

Programming language

(Examples of the programming language)

End of Sample Session

This book is divided into three parts. Part - An Introduction to Darwin has been designed to familiarize even the most computer-illiterate amongst us with the basic Darwin environment. We have tried to write for the biologist or biochemist who has perhaps taken a first-year Introduction to Computer Science course and who has distant memories of for-loops and recursion stored somewhere deep in the recesses of their subconscious. An attempt has been made to use simple terminology, giving only those definitions we deem important in later chapters. For new users, we recommend that this part be read "in one sitting", beginning at Chapter - Exploring the Basics and following the discussion and examples through to the end of Chapter - A Guide to Debugging.

An experienced programmer may find that this part need only be skimmed to become familiar with the peculiarities of the Darwin system. Once comfortable with the language, users may find that it acts as a short reference guide for looking up commands "on the fly". Towards this end, we have attempted to make each chapter self-contained.

Chapter - Exploring the Basics provides a basic session with Darwin designed to give new users a feeling for how to interact with the system. Chapter through to Chapter provide a more in-depth tour of the basic Darwin language, with a focus on the most commonly used commands and routines built into the kernel. We recommend new users become familiar with the topics covered in these chapters before venturing into Part .

The more esoteric and application-specific routines are discussed in Chapter - Genetic Databases, Chapter - Randomization, Statistics and Visualization, Chapter - Producing HTML Code, Chapter - Darwin's Interprocessor Skills, and Chapter - Calling External Functions.

Chapter - Genetic Databases provides an in-depth look at how Darwin builds, stores and manipulates genetic databases. In some sense, these data structures are the cornerstone of the system, and fluency with their manipulation will greatly ease the difficulty of programming in Darwin. Chapter - Randomization, Statistics and Visualization contains an overview of the randomization and statistics functions, followed by an explanation of the basic primitives available for graphing and plotting information. These are used extensively throughout later chapters, most notably Chapter - Dayhoff Matrices and Mutation Matrices, Chapter - Coping with Insertions and Deletions, Chapter - Generating Random Sequences, and Chapter - Phylogenetic Trees.

An in-depth reading of Chapter - Overloading, Polymorphism and Object Orientation, Chapter - Measuring Performance and Chapter - Producing HTML Code should be postponed until the reader feels particularly comfortable with the system.

Chapter - Darwin's Interprocessor Skills explains the mechanisms built into Darwin for interprocessor communication. These routines allow users to fragment large, computationally intensive jobs into smaller pieces which can be distributed automatically to other processors. A complete understanding of this topic is not necessary to proceed into later chapters, with the exception of the latter half of Chapter - All against All, where a program is given which performs an exhaustive matching of a set of amino acid sequences.
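The general idea of fragmenting a job can be sketched in a few lines of Python (not Darwin code, and not Darwin's interprocessor routines). The sketch below assumes a round-robin split and uses a thread pool from the Python standard library as a stand-in for separate processors. The function names, the toy scoring rule and the sequences are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def split_into_chunks(items: list, n_chunks: int) -> list[list]:
    """Deal items round-robin into n_chunks lists of near-equal size."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def score_chunk(chunk: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
    """Toy unit of work: count identical positions for each pair in the chunk."""
    return [(a, b, sum(x == y for x, y in zip(a, b))) for a, b in chunk]


def run_distributed(pairs: list[tuple[str, str]], n_workers: int):
    """Split the pair list, score each piece in its own worker, merge the results."""
    chunks = split_into_chunks(pairs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(score_chunk, chunks))
    return sorted(result for part in parts for result in part)


# Hypothetical toy sequences.
sequences = ["MKTAYIAK", "MKTAHIAK", "MRTAYLAK", "MKSAYIAR"]
pairs = list(combinations(sequences, 2))
results = run_distributed(pairs, n_workers=3)
```

Because each pair is scored independently of the others, the pieces can be processed in any order and on any worker; only the final merge needs to see all of the results.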

Each chapter in Part - Darwin and Problems from Biochemistry examines a different bioinformatic problem. Every chapter contains (1) a statement of the problem, (2) a discussion concerning any biological assumptions we make about the data, (3) an explanation of how we model the problem mathematically, (4) a description of the algorithm, (5) a Darwin implementation, (6) a discussion about the accuracy and efficiency of our algorithm, and (7) a short guide to the literature. In this manner, we tour the Darwin libraries, motivating each routine and data structure with a concrete example. Beyond the understanding of the Darwin libraries, we hope such a presentation gives users

an understanding of some of the classic problems from bioinformatics,
an understanding of the underlying biochemistry involved in these problems,
an understanding of the mathematical model upon which these algorithms are predicated,
an understanding of how the algorithms work, and
a conceptual overview of how to structure programs in Darwin.

Part - The Reference Guide provides a complete list of the Darwin built-in functions and libraries. It is structured into sections containing routines related by function (e.g. Mathematical Functions, Input/Output Functions, String Searching Functions, and so forth).

The appendices contain some general material, including a short introduction to statistics and dynamic programming. For those readers unfamiliar with the mathematics underlying the models we use, these appendices will provide a deeper understanding of our methods. All of the examples and programs used throughout this manual are available via the World Wide Web (WWW) or by ftp (file transfer protocol). The Computational Biochemistry Research Group at ETH-Zürich maintains a web site at:

http://cbrg.inf.ethz.ch/

The home page located at this address contains links to this manual and the example code files. If you do not have access to a web browser, the ftp address

ftp inf.ethz.ch
user: anonymous

also mirrors these files.

Our group at ETH-Zürich and the University of Florida at Gainesville continue to add code to the Darwin system, and we regularly make this new code available via the above web site. Readers are encouraged to submit their code to our algorithms repository. If you feel you have a particularly useful, novel or simply better algorithm for a problem, please send us e-mail at the address below.

If you should have any comments, suggestions or questions about Darwin, we can be reached by e-mail at

darwin@inf.ethz.ch

We appreciate any feedback which will help us improve both the system and the manual.


Gaston Gonnet
1998-09-15