What is an alignment?

An alignment attempts to represent the evolutionary relationship between two protein sequences by placing them side-by-side so that codons encoding amino acids paired in an alignment have arisen from a single codon in a single ancestral gene, at least with the highest probability. An alignment shows what amino acids substitutions have been accepted since two proteins diverged from their common ancestor. In principle, an alignment is "correct" if it correctly represents actual events in the historical past; a correct alignment matches amino acid codons that are descendent from a single codon in an ancestral protein, correctly reconstructs ancestral sequences, and indicates substitutions, insertions and deletions as they actually occurred during historical evolution. Proving that an alignment is correct is essentially impossbile, of course. In general, the accuracy of an alignment is judged by a score that represents the probability that an alignment is correct.

Organisms that are closer together in evolutionary terms should have sequences which are more it similar than organisms which are further apart. If two residues are aligned the implication is that both residues decended from the same residue in a common ancestor. To align the sequences means to put the residue pairs together, inserting gaps where necessary to illustrate the evolutionary relationship between the sequences.

The following three sequences are taken from triosephosphate isomerase sequences in rice, mosquito, human and monkey.

Rice:     CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCWVKKGVSA
Mosquito: MNGDKASIADLCKVLTTGPLNADTEVVVGCPAPYLTLARSQLPDSVCVAAQNCYKVPKISP
Human:    MNGRKQSLGELIGTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNISP
Monkey:   MNGRKQNLGELIGTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNISP

[Human, Mosquito]
lengths=61,61 simil=228.0, PAM_dist=80.8265, offsets=-936189416,-936189584,
  identity=49.2%, similarity=11.5%
MNGRKQSLGELIGTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNISP
|||.|.|:.!|...|....:.||||||.. |..|:.:||.:|...!.|||||||||..|||
MNGDKASIADLCKVLTTGPLNADTEVVVGCPAPYLTLARSQLPDSVCVAAQNCYKVPKISP

[Rice, Mosquito]
lengths=55,53 simil=117.7, PAM_dist=117.766, offsets=-936190011,-936189583,
  identity=36.4%, similarity=18.2%
NGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCW
||....!..!.|!|..|.!.:.  .||||. | .!|.:.!|||...! ||||||!
NGDKASIADLCKVLTTGPLNAD__TEVVVGCPAPYLTLARSQLPDSVCVAAQNCY

[Rice, Human]
lengths=61,59 simil=132.9, PAM_dist=122.615, offsets=-936190011,-936189415,
  identity=31.1%, similarity=26.2%
NGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCWVKKGVS
||..:.:.:!!..||..:!.:.  .|||.:||..!!...!.:|.|:|.||||||!....!|
NGRKQSLGELIGTLNAAKVPAD__TEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNIS

[Human, Monkey]
lengths=61,61 simil=731.3, PAM_dist=3.25498, offsets=-936189416,-936189200,
  identity=98.4%, similarity=0.0%
MNGRKQSLGELIGTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNISP
||||||.||||||||||||||||||||||||||||||||||||||||||||||||||||||
MNGRKQNLGELIGTLNAAKVPADTEVVCAPPTAYIDFARQKLDPKIAVAAQNCYKVTNISP

There are several things to notice about the alignments. The alignments consist of:

1.
a header which includes information about the lengths of the strings, measures of evolutionary distance (similarity score and PAM distance), the location of the strings in the database (offsets), the percentage of residues that are identical, and the similarity.
2.
the two sequences being matched
3.
a string of characters which gives a visual indication about the strength of the match. The closer a character is to a vertical line, the stronger the match between the two characters. The order of the charcters from strongest match to weakest match is: | vertical line (an exact match), ! exclamation point, : colon, . period, space
4.
the alignments can also contain gaps which are indicated by an underscore (_). The aligment of the sequences from humans and rice contains a gap of length two at positions 23 and 24. We conjecture that there was an ancestor of rice and humans which had a protein sequence for triosephosphate isomerase. We dont know what this sequence was but the alignment suggests that either the DV was not present in the ancestral sequence and was added to the rice sequence or the DV was present in the ancestral sequence and was deleted from the human sequence. It is conceivable that only the D was present in the ancestral sequence and that the D was deleted from the human sequence and the V was added to the rice sequence. This requires two insertion/deletion events while the first two models require only 1 insertion/deletion event so we consider the first two models more plausible. We can make a better estimation of the ancestral sequence by looking at more than two members of the family but that we leave for later.

Types of Alignments

There are several types of alignments that we will be examining.

1.
Proteins aligned with proteins - the above example
2.
DNA/DNA or DNA/RNA alignments. DNA/DNA alignments are differnt from protein alignments because of the following reason. A codon which codes for an amino acid sequence in a protein is made of 3 bases B1,B2,B3. As you can see from the Genetic Code,
the influence of the 3 bases is not the same. In fact, quite often the 3rd base doenst influence the amino acid at all (proline, glycine, alanine, arginine, ). The proteins are evolving under functional constraints- the selective pressure is on the protein , not the DNA. Therefore, if a base changes in the 3rd position and it doesnt affect the amino acid that is being encoded there is no selective pressure on that base to be conserved. The organism will not die, natural selection will not take place. So, if we are aligning coding DNA then we can not assume that a mutation of B1 to B1' occurs with the same frequency as B3 to B3'. Therefore matching 1 coding base at a time for coding DNA is not good. Non-coding DNA or DNA/RNA doesnt make a difference. xxx
3.
Multiple Sequence Alignments- We usually restrict this to proteins only. We start with pairwise alignments and try to align the whole family.
4.
string editing - This is a classic problem in computer science. Take two words and trasnform them with the minimum number of steps. The score for a transformation is the nubmer of steps it takes to get from one to the other.

How do we Align?
We need a theory that will explain how sequences mutate to find the optimal solution. What is the optimal solution? The optimal solution is that you align such that the matched characters at one position in the two different strings being aligned have come from a common ancestor. When the two are the same, we conjecture that they were the same in the common ancestor with a high probability. When the two are different, we conjecture that one or both have mutated since the time of the common ancestor.

ACDEF species 1

A??EF  ancestor

ALEEF species 2

Mutations happen at random at the DNA level. When the mutated DNA is transcribed into a protein, the function of the protein may be changed. If it isn't changed, then the organism will continue to live. If the change is harmful to the organism, then it will die and not pass the faulty DNA to the next generation or its decendents will be weaker and eventually removed by natural selection. The mutation may be beneficial to the organism and as a result, it is better suited to live than the other organims. In this case, natural selection should increase the frequency of this gene as the generations pass. {it The DNA mistakes happen at random but the mutations that are accepted by natural selection are far from random. The databases only reflect the sequences that have been accepted by natural selection over millions of years of evolution. If the organism does not survive because of a deleterious mutation, this has the same effect as the change not happening or as the change happening with a lower probability.

Markovian Model of Evolution

We need a model of evolution to develop a scoring mechanism. The chosen model is a first order Markovian model in that it assumes that subsequent amino acid substitutions in a protein occur with a probability independent of previous substitutions, that substitutions occur independently at different positions in the polypeptide chain, and that a single substitutions matrix can represent the probability of amino acid substitution at any and all positions in a protein.

This model is, of course, an approximation. Real proteins adopt three-dimensional conformations where amino acids distant in the sequence come in contact and therefore interact. Thus, residues in a protein sequence need not undergo substitutions independent of substitution at other positions in the protein. Likewise, biological function constrains the types of amino acid substitutions that are acceptable to natural selection. Therefore, amino acids need not suffer mutation independently, either in sequence or in time.

It is well known that amino acids near an active site are more conserved than expected under the Markov model. It is also accepted that positions on the surface of a folded structure tolerate more variation than the positions inside.

Back mutations are also more probable than the corresponding mutation. For example if we have a F -> G -> F mutation, the G -> F mutation is more probable than a random G-> F mutation because we know that at one point in time the F was already accepted by natural selection (the protein was working with the F in the first place so mutating back to F has a high probability of not killing the organism).

To summarize the Markovian model-

1.
the mutations depend only on the amino acid itself (they are independent of their neighbors)
2.
we use a single mutation matrix to describe the substitutions
3.
a deletion is treated the same as an amino acid.

Scoring the Alignment

A mutation matrix for amino acids is a 20 x 20 matrix that contains the probability of mutation of each amino acid into the others. Here is the Mutation Matrix for 1 PAM evolutionary distance.

It is important to note here that the scoring matrix is defined for a specific PAM distance. This is most easily seen by considering the diagonal and off-diagonal elements. In a matrix describing the alignment of two closesly related proteins, the diagonal terms are large relative to the off diagonal terms; more amino acids have been conserved than have been replaced. In contrast, in a matrix describing two more distantly related proteins, the diagonal terms are small relative to the off-diagonal terms; many more amino acids have been replaced than have been conserved. Here is the the 100 PAM unit Mutation Matrix for comparison to the 1 PAM matrix. The diagonal elements are still bigger than the off-diagonal elements but the probability of a mutation is much greater.

In the one PAM mutation matrix V->R has a low value and F-> Y has a higher value meaning that the mutation of a F (phenylalanine) into a Y (tyrosine) has a higher probability. The columns should add to one. We also accept a "21st amino acid" which is nothing (a gap representing an insertion or a deltion).

Each mutation matrix is associated with some amount of mutation. For historical reasons the unit of mutation is called a PAM (point accepted mutation) unit. The 1 PAM mutation matrix mutates amino acids with probability of 1%. On the average, 1 PAM unit is 1% mutation. For a sequence of 100 amino acids, 1 will be mutated. We can easily calculate the 100 PAM mutation matrix from the 1 PAM mutation matrix it is, M100. We calculate the other mutation matrices from the 1 PAM mutation matrix because 1 PAM is small enough to ignore back mutations- ie the probability of a back mutation (the same residue being mutated twice) at 1 PAM. is .01 * .01 or .0001 which is small enough to ignore. This is a sufficinetly good approximation.

As seen in the 1 PAM unit Mutation Matrix , the mutation matrix is not symmetic. This is because the amino acids occur with different frequencies.

100 PAM units separate most of the animals for functional proteins. 250 is the limit - beyond that it is hard to separate the signal from the noise.

Minf is not so different from M250. Minf should only reflect the frequencies of the amino acids in nature.

Sequence Alignment - Dayhoff Matrix

For reasons both historical and algebraic, the mutation matrix is transformed into a new matrix termed a Dayhoff matrix (in honor of the first author, Margaret O. Dayhoff). The Dayhoff matrix is related to the mutation matrix The elements of the Dayhoff matrix are 10 times the logarithms of probabilities that the indexed amino acids will be paired in an alignment by reasons of common ancestry divided by the probabilities that the pairing would occur by chance. Here is how the elements of the Dayhoff Matrix are calculated from Mutation matrices. The probability of alignment by random chance is just the frequencies of the amino acids in the database. The probability of aligment by reasons of ancestry is calculated by summing over all amino acids the probability that that amino acid mutated into the ones at this position in the alignment. (see slide) We sum over all amino acids because we dont know what X was. We have a Dayhoff entry for each pair of amino acids. We align sequences such that this probability is maximized. ie We maximize the probability of having evolved from a common ancestor ( a maximum likelihood alignment) against the null hypothesis of being randomly aligned.

Here is the 250 PAM Dayhoff Matrix.

How do we find a sequence alignment such that for

ACDEFG
|  ||
ALEEFHG

DAA + DCL +DDE +DEE +DFF ...

is maximized?

Here is where we coincide with string editing. In string editing every change counts as one but here every change has a value that we pull from the Dayhoff Matrix. Here we have a more complicated function that is given by a matrix.

We solve it with Dynamic Programming.

Dynamic Programming -Global Alignment

Align two sequences (words) ENOUGH and GENUG.

Each box will have the best possible score for aligning the two sequences to that point.

The box that we are looking at has the best score to that point. The box to its right contains the score if the letter at the top of that column is deleted. To calculate the score we add the cost of deleting that letter to the score in the box that we are looking at. The box below the one we are looking at contains the score if we delete the letter for that row. To calculate the score we add the cost of deleting that letter to the score in our box. The box diagonally to the right and below contains the score for a matching the letters at the top of the column and the row. To calculate the score for that box we add the cost of a match to the score in our box and add it to the new box.

We use the Similarity Matrix for this. If the similarity matrix contains costs then we want to minimze the score. If the similarity matrix contains similarities then we want to maximize the score. The best match between Genug and Enough has value 1 with the arbitrary scoring matrix we decided on at the outset. Here is the final slide of the dynamic programming of ENOUGH and GENUG

There are two phases of dynamic programming.

1.
compute the optimum
2.
find the match
Only at the end (backtracking) can you find the best match.

The global alignment uses both whole strings to score.

Dynamic Programming -Global Alignment - Cost free end gaps

Sometimes proteins often have tails that dont align so we consider the case that we dont want to penalize at all for deleting at the very ends. In this case we take the largest value from the last column and row. To eliminate the penalty for deleting at the beginning, we put 0 in the first column and row (ie there is no penalty for deleting letters at the ends), recalculate and then follow the traceback and see where it ends.

Dynamic Programming -Local Alignment

The next variation is the Local Alignment where we let any substring of string one match with any substring of string two and look for the highest score. You might use this when you want to match a domain in one protein with the same domain in another protein, knowing that the rest of the two proteins dont match.

We do this in the following manner. If at any point we calculate a negative number, we toss it out and replace it by a zero and cut the alignment there. A negative score means youre better off starting from scratch so you chop and restart. Then you take the biggest number from the last row and column to find the end.




next up previous
Next: About this document ...
Gina Cannarozzi
2000-11-07