next up previous contents
Next: Computing expected scores Up: Searching with Fragment Sequences Previous: Searching with Fragment Sequences

Generating random scores

The first approach consists of matching random sequences against the fragment and studying the distribution of the similarity score. A large number (say 100) random amino acid strings are generated, the best possible match between the given fragment and the test sequence is computed, and then the mean and standard deviation of the similarities are calculated.

Simple univariate statistics  can be collected using internal Darwin functions. To collect these statistics we must initialize the corresponding counters with the structure Stat. 

> ByRefShake := Stat('similarity score with LocalAlign'):
> ByDynProgr := Stat('similarity score with DynProg'):
The argument to Stat can be any name, a printable definition is convenient.

The following statements produce the random test sequences, match them to the given fragment, and update the statistical information using the updatew  function for both a refined match against a typical random sequence and exactly against a random sequence of the same length.

> SeqPos := GetOffset(SearchSeq):
Error, DB must be assigned a database

> for i to 100 do
Set SeqPos to be like a database pointer to the given fragment and run a loop 100 times. The next statements compute a random sequence. The length of this random sequence affects the quality of the result. The longer the sequence, the better the chances of finding a good match. To solve this problem we will select a random entry in the database and use its length as the length of the random sequence. In this way, our random sequences will have the same length distribution as the database.
>     RandEntry := round( Rand()*DB[TotEntries] + 0.5 );
>     RandLen := length( Strings(Sequence(Entry(RandEntry)))[1]);
>     RandSeq := CreateRandSeq(RandLen,AF);
Now we can compute the refined match and collect its similarity
>     m := LocalAlign( Match(SeqPos,GetOffset(RandSeq)), DM);
>     update(ByRefShake,m[Sim]);
and secondly, with a sequence of the same length, we compute an exact match
>     n := DynProg(SearchSeq,CreateRandSeq(Len,AF),DM,Len,Len);
>     update(ByDynProgr,n[1]);
>     od:
Error, cannot apply selector on object

We can display or use the information in two different ways. First we use the stats  function which prints all the information. Internal constants define the intervals to be at the 95% confidence level.  For the refined-shaked matches, the similarity score is characterized by

> print(ByRefShake);
Error, (in Stats_print) division by zero

The second method uses selectors on the Stat structure. Mean,  Variance,  VarVariance  (the variance of the reported variance), Skewness,  Excess,  Number,  MeanVar  (the mean and its 95% confidence interval as a name), Minimum,  and Maximum.  The function ProbStdDev takes a probability as an argument and returns the number of standard deviations  away from the mean that will give this probability. The function ProbStdDev can be defined in terms of the inverse complementary error function  erfcinv  as:

> ProbStdDev :=  p -> sqrt(2)*erfcinv(2*p):
The values for 0.025 and 0.001 are well known:
> ProbStdDev(0.025), ProbStdDev(0.001);
1.9600, 3.0902
To expect 10 random matches, we should set the probability to be 10/n, where n is the number of entries in the database. The goal should then be
> Z := ProbStdDev( 10 / DB[TotEntries] );
Error, cannot apply selector on object
standard deviations away from the mean random score.

From this data we can compute an approximation to the goal:

> Goal := ByRefShake[Mean] + Z*sqrt(ByRefShake[Variance]);
Error, (in Stats_select) division by zero

The last piece of information needed is the maximum possible similarity score. This can be computed by adding all the corresponding diagonal entries of the Dayhoff matrix for each amino acid in the fragment:

> MaxGoal := 0:
> for i to Len do
>     MaxGoal := MaxGoal + DM[Sim,AToInt(SearchSeq[i]),AToInt(SearchSeq[i])]
>     od:
> MaxGoal;
45.9342
Since the goal is relatively far from the maximum possible score, we are assured that the interval is not too narrow.


next up previous contents
Next: Computing expected scores Up: Searching with Fragment Sequences Previous: Searching with Fragment Sequences
Gaston Gonnet
1998-09-15