**G. Gonnet - M. Hallett**

**29 August 1997**

**The Evolution of Darwin**

This section is reserved for a short description written by Gonnet and Benner about the early beginnings of Darwin: the motivation, who worked on it, etc.

**The following can be completed at the end.**

Darwin was designed to be a workbench where biochemists could
forge tools to explore genetic data quickly and easily. The workbench
is a *partially interpreted general purpose language* containing
many built-in routines and data structures (tools from which tools
can be built) especially tailored to the wants and needs of the
bioinformatics community.

Darwin provides a flexible structure to hold a complete or partial genetic database. It is general enough to hold DNA, RNA or amino acid sequences and allows an unlimited amount of annotation information to be kept alongside each entry.

(Example of a SwissProt entry in Darwin) (Example of an EMBL entry in Darwin)

Dayhoff matrices

(Example of a Dayhoff matrix)

PAM distance

One of the first motivations for the Darwin system was to allow for
the complete "self matching" of the entire SwissProt database. The
goal was to compare every entry (some 30,000 at the time) against
every other entry. It is a simple task to create *pairwise
comparisons* of entries in Darwin.
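To give a sense of the scale of such a self-match, the following is a minimal sketch in Python rather than Darwin. It assumes that each unordered pair of distinct entries is compared exactly once; the names `all_pairs` and `pair_count` and the placeholder entry labels are hypothetical and are not part of Darwin.

```python
from itertools import combinations


def all_pairs(entries):
    """Yield every unordered pair of distinct entries exactly once.

    Hypothetical helper for illustration only; not a Darwin routine.
    """
    return combinations(entries, 2)


def pair_count(n):
    """Number of unordered pairs among n entries: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Placeholder labels standing in for database entries (assumed data).
entries = ["SEQ_A", "SEQ_B", "SEQ_C", "SEQ_D"]
pairs = list(all_pairs(entries))

print(len(pairs))          # 6 pairs for 4 entries
print(pair_count(30000))   # 449985000 pairs for some 30,000 entries
```

Under this assumption, a database of some 30,000 entries requires roughly 450 million pairwise comparisons, which is why the cost of a single comparison matters so much.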

(Example of an AA-AA comparison)

(AA-AA comparisons with PAM distances)

(Example of PAM distance related to Dayhoff matrices)

The built-in Darwin pairwise comparison algorithms are based on the
method of *maximum likelihood*. In broad terms, this methodology
is based on the belief that we can test whether two sequences are *truly*
homologous by testing the hypothesis

Gene finding

(Example of a DNA-AA comparison)

Phylogenetic Trees

(Example of a Phylogenetic tree - rooted)

Multiple Sequence alignments

(Example of an MSA)

Probabilistic ancestral sequences

(Example of a PAS)

Secondary Structure

(Example from SAINT)

Molecular Weight Traces

(Example of a Molecular weight trace)

programming language

(Examples of the programming language)

**End of Sample Session**

This book is divided into three parts. Part - *An
Introduction to Darwin* has been designed to familiarize even the least
computer-literate among us with the basic Darwin environment.
We have tried to write for the biologist or biochemist who has perhaps
taken a first-year Introduction to Computer Science course and who has
distant memories of for-loops and recursion stored somewhere deep in
the recesses of their subconscious.
An attempt has been made to use simple
terminology, giving only those definitions we deem important in later
chapters. For new users, we recommend that this part be read
"in one sitting", beginning at Chapter - *Exploring the Basics* and following the discussion and examples
through to the end of Chapter - *A Guide to
Debugging*.

An experienced programmer may find that this part need only be skimmed to become familiar with the peculiarities of the Darwin system. Once comfortable with the language, users may find that it acts as a short reference guide for looking up commands "on the fly". To this end, we have attempted to make each chapter self-contained.

Chapter - *Exploring the Basics*
provides a basic session with Darwin designed to
give new users a *feel* for how to interact with the system.
Chapter through to Chapter
provide a more in-depth tour of the basic
Darwin language with a focus on the most commonly used commands and
routines built into the
kernel. We recommend that new
users become familiar with the topics covered in these chapters before
venturing into Part .

The more esoteric and application-specific routines
are discussed in Chapter - *Genetic Databases*,
Chapter - *Randomization, Statistics and
Visualization*, Chapter - *Producing HTML Code*,
Chapter - *Darwin's Interprocessor Skills*,
and Chapter - *Calling External Functions*.

Chapter - *Genetic Databases* provides an in-depth look at how
Darwin builds, stores and manipulates genetic databases. In some
sense, these data structures are the
cornerstone of the system, and fluency with their manipulation will
greatly ease the difficulty of programming in Darwin.
Chapter - *Randomization, Statistics and Visualization*
contains an overview of the randomization and statistics functions,
followed by an explanation of the basic primitives available for graphing and
plotting information. These are used extensively throughout later
chapters, most notably
Chapter - *Dayhoff Matrices and Mutation
Matrices*, Chapter - *Coping with Insertions and
Deletions*, Chapter - *Generating Random
Sequences*, and Chapter - *Phylogenetic Trees*.

An in-depth reading of Chapter - *Overloading, Polymorphism and
Object Orientation*,
Chapter - *Measuring Performance* and
Chapter - *Producing HTML Code* should be postponed
until the reader feels particularly comfortable with the system.

Chapter - *Darwin's Interprocessor Skills*
explains the mechanisms built into
Darwin for interprocessor communication. These routines allow users
to fragment large, computationally intensive jobs into smaller pieces
which can be distributed automatically to other processors. A
complete understanding of this topic is not necessary to
proceed to later chapters, with the exception of the latter half of
Chapter - *All against All*,
where a program is given which performs an exhaustive matching of a set of
amino acid sequences.
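The idea of fragmenting a large job can be illustrated independently of Darwin. The sketch below, written in Python, deals the pairwise comparisons of a small set of sequences round-robin into one work list per processor. It is an assumption-laden illustration only: the function `partition` is hypothetical, the sequences are represented by integer indices, and nothing here reflects how Darwin's own interprocessor routines distribute work.

```python
from itertools import combinations


def partition(tasks, n_workers):
    """Deal tasks round-robin into n_workers lists of nearly equal size.

    Hypothetical helper for illustration only; not a Darwin routine.
    """
    chunks = [[] for _ in range(n_workers)]
    for k, task in enumerate(tasks):
        chunks[k % n_workers].append(task)
    return chunks


# Five sequences, identified by index, give 10 unordered pairs to compare.
tasks = list(combinations(range(5), 2))

# Split the 10 comparisons among 3 (assumed) processors.
chunks = partition(tasks, 3)

for worker, chunk in enumerate(chunks):
    print(worker, chunk)
```

Each pair appears in exactly one work list, so the pieces can be processed independently and their results merged afterwards.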

Each chapter in Part - *Darwin and Problems from
Biochemistry*
examines a different bioinformatics problem. Every chapter contains
(1) a statement of the problem, (2) a discussion of any
biological assumptions we make about the data,
(3) an explanation of how we
model the problem mathematically, (4) a description of the algorithm,
(5) a Darwin implementation, (6) a discussion of the accuracy and
efficiency of our algorithm, and (7) a short guide to the literature.
In this manner, we tour the Darwin libraries, motivating each routine
and data structure with a concrete example.
Beyond an understanding of the Darwin libraries, we hope such a
presentation gives users

- an understanding of some of the classic problems from bioinformatics,
- an understanding of the underlying biochemistry involved in these problems,
- an understanding of the mathematical model upon which these algorithms are predicated,
- an understanding of how the algorithms work, and
- a conceptual overview of how to structure programs in Darwin.

The appendices contain some general material, including a short introduction to statistics and dynamic programming. For those readers unfamiliar with the mathematics underlying the models we use, these appendices provide a deeper understanding of our methods. All of the examples and programs used throughout this manual are available via the World Wide Web (WWW) or by ftp (file transfer protocol). The Computational Biochemistry Research Group at ETH-Zürich maintains a web site at:

Our groups at ETH-Zürich and the University of Florida at Gainesville continue to add code to the Darwin system, and we regularly make this new code available via the above web site. Readers are encouraged to submit their code to our algorithms repository. If you feel you have a particularly useful, novel or simply better algorithm for a problem, please send us e-mail at the address below.

If you should have any comments, suggestions or questions about
Darwin, we can be reached by e-mail at

- Contents
- List of Tables
- List of Figures
- An Introduction to Darwin
- Exploring the Basics
- Advanced String Manipulation
- Procedures
- Lists and Arrays Revisited
- Structured Types
- Iteration and Recursion
- Input/Output
- Genetic Databases
- Randomization, Statistics and Visualization
- Polymorphism
- System Commands
- A Guide to Debugging
- Measuring Performance
- Producing HTML Code
- Darwin's Interprocessor Skills
- Calling External Functions

- Darwin and Problems from Biochemistry
- Point Accepted Mutations and Dayhoff Matrices
- Insertions and Deletions
- The Pairwise Comparison of Amino Acid Sequences
- Searching for Genes
- All versus All
- Phylogenetic Trees
- Phylogenetic Trees
- Multiple Sequence Alignments
- Probabilistic Ancestral Sequences
- Predicting Secondary Structure
- Random Sequence Generation
- Searching with Fragment Sequences
- Molecular Weight Traces

- The Reference Guide
- Programming Tools and Functions
- Sets, Lists, Arrays and Strings
- Searching Functions
- Graphic Functions
- Statistical Functions
- Randomization Functions
- Mathematical Functions
- The Absolute Value Function
- The Arc-tangent Function
- The Cosine Function
- The Error Function
- The Complementary Error Function
- The Inverse Complementary Error Function
- The Exponential Function
- The Factorial Function
- The Floor Function
- Gaussian Elimination
- The Natural Logarithm Function
- The Logarithm base 10 Function
- The Maximum Function
- Maximize Function
- The Minimum Function
- Minimize Function
- Minimize2D Function
- DisconMinimize Function
- The Modulus Function
- The Rounding Function
- The Sine Function
- The Square Root Function
- Summation
- The Tangent Function
- Matrix Transposition
- The Truncate Function
- The Riemann Zeta Function
- The Zip Function

- Printing Routines
- Input and Output Functions
- Nucleic, Amino Acid and Genetic Code Functions
- Dayhoff and Mutation Matrix Functions
- Nucleic Peptide Matching Functions
- Pairwise Matching Functions
- All against All Functions
- Graph and Tree Functions
- Tree Construction Functions
- Multiple Sequence Alignment and Probabilistic Ancestral Sequence Functions
- Secondary Structure Prediction Functions
- Enzyme Digestion Functions
- System Functions and Variables
- HTML Functions
- Interprocessor Functions
- External Function Call Functions
- Types, Operators and Expressions
- Error Messages

- The Appendices
- Bibliography