Next: The Pairwise Comparison of Up: Insertions and Deletions Previous: Insertions and Deletions

Gap Probabilities and Penalties

Our algorithms for aligning amino acid sequences are based on the test of whether the two sequences evolved from a common ancestral sequence or not. This model allowed us to derive scores (used by the dynamic programming algorithm) from the mutation probabilities. In this section, we derive gap penalty costs from the probabilities of a gap occuring.

The gap model we use here is the one which is implicit in the vast majority of the present literature. This assumes that gaps are accidents at the DNA level which produce a random insertion or deletion of k amino acids with probability q(k). This probability is independent of the amino acids being inserted or deleted and of any neighbouring amino acids. It is a natural assumption that this probability also depends on the PAM distance t between the two sequences, in particular, gaps in similar sequences are less frequent than gaps in distant ones. Then

$\begin{displaymath}Pr \{ k{\rm -gap\ at\ distance\ }t \} = q(k,t) \end{displaymath}$

Gap penalties used with dynamic programming are typically of the form a+bk. In some cases the values of a and b depend on the PAM distance t. This gap penalty function used for dynamic programming will be denoted by d(k,t). For example

d(k,t) = a+bk

The linear gap penalty function is very popular, in part because it is possible to compute dynamic programming alignments very efficiently using Gotoh's algorithm[18].

The relation between d(k,t) and q(k,t) is simply

$\begin{displaymath}d(k,t) = 10 \log_{10} q(k,t) \end{displaymath}$

This relation is not well known, so we will prove it here. Readers not interested in the mathematical theory of alignments should skip the rest of this section.

The purpose of aligning sequences with dynamic programming is to find the alignment which maximizes the probability of the two sequences having evolved from a common ancestor as opposed to being just random sequences. Let A and B be two sequences with lengths n_A and n_B. Let A_i be the i^th amino acid in A (similarly for B). Let f_i be the frequency of amino acid i as defined in the previous section. The probability that the two sequences are random is

$\begin{displaymath}P_R = Pr \{ {\rm random} \} = \prod_{i=1}^{n_A} f_{A_i} \times \prod_{i=1}^{n_B} f_{B_i} \end{displaymath}$

For example, given an alignment

$\begin{displaymath}\begin{array}{cccccc} A_1 & A_2 & - & - & - & A_3 \\ X_1 &... ...& X_5 & X_6 \\ B_1 & B_2 & B_3 & B_4 & B_5 & B_6 \end{array} \end{displaymath}$

where X denotes the unknown ancestral parent sequence. The probability of this alignment being the consequence of having a common ancestor is

$\begin{displaymath}P_A = Pr \{ \mbox{A,B evolved from X} \} = \end{displaymath}$

$\begin{displaymath}\sum_{X_1=1}^{20} f_{X_1} (M^{t/2})_{A_1X_1} (M^{t/2})_{B_1X_... ...2=1}^{20} f_{X_2} (M^{t/2})_{A_2X_2} (M^{t/2})_{B_2X_2} \times \end{displaymath}$

for all the possible X₁ which evolved into A₁ on one side and into B₁ on the other, and similarly for X₂

$\begin{displaymath}\times \ q(3,t) f_{B_3} f_{B_4} f_{B_5} \ \times \end{displaymath}$

where this term denotes the probability of a gap of length 3 on one sequence times the probability of the sequence B₃B₄B₅ on the other and

$\begin{displaymath}\times \sum_{X_6=1}^{20} f_{X_6} (M^{t/2})_{A_3X_6} (M^{t/2})_{B_6X_6} \end{displaymath}$

for the alignment of A₃ against B₆. For this alignment the quotient of the probabilities is

$\begin{displaymath}\frac{P_A}{P_R} = \frac{(M^t)_{B_1A_1}}{f_{B_1}} \times \frac... ...}{f_{B_2}} \times q(3,t) \times \frac{(M^t)_{B_6A_3}}{f_{B_6}} \end{displaymath}$

once property grouped and simplified. To find the alignment which maximizes this probability we transform the maximization of a product (of all positive values) into the maximization of a sum by computing the logarithm of each term. To find the alignment which maximizes a sum, we use the dynamic programming algorithm. For historical reasons and to work with simpler numbers, these logarithms are multiplied by 10, which has no effect on the maximization process. Notice that the terms like

$\begin{displaymath}D[A_1,B_1] = 10 \log_{10} ( \frac{(M^t)_{B_1A_1}}{f_{B_1}} ) \end{displaymath}$

are exactly the values of the standard Dayhoff matrices. The deletion cost for the above example, under dynamic programming, must be

$\begin{displaymath}d(3,t) = 10 \log_{10} q(3,t) \end{displaymath}$

It is obvious how to generalize this result for any number or length of gaps.

The relationship can be applied in two directions: either we can estimate the probabilities from a sample and compute the appropriate gap penalties or we can compute the probabilities that are implied by a gap penalty function. A linear gap penalty function implies an exponential distribution of probabilities.

$\begin{displaymath}d(k,t) = a+bk = 10 \log_{10} q(k,t) \end{displaymath}$