Simple univariate statistics
can be collected using internal Darwin functions.
To collect these statistics we must initialize
the corresponding counters with the structure
Stat
.
> ByRefShake := Stat('similarity score with LocalAlign'): > ByDynProgr := Stat('similarity score with DynProg'):The argument to
Stat
can be any name, a printable
definition is convenient.
The following statements produce the random test sequences,
match them to the given fragment,
and update the statistical information
using the updatew
function for both a refined match against a typical random
sequence and exactly against a random sequence of
the same length.
> SeqPos := GetOffset(SearchSeq): Error, DB must be assigned a database > for i to 100 doSet
SeqPos
to be like a database pointer to
the given fragment and run a loop 100 times.
The next statements compute a random sequence.
The length of this random sequence affects the
quality of the result.
The longer the sequence, the better the chances
of finding a good match.
To solve this problem we will select a random
entry in the database and use its length as the length
of the random sequence.
In this way, our random sequences will have the same
length distribution as the database.> RandEntry := round( Rand()*DB[TotEntries] + 0.5 ); > RandLen := length( Strings(Sequence(Entry(RandEntry)))[1]); > RandSeq := CreateRandSeq(RandLen,AF);Now we can compute the refined match and collect its similarity
> m := LocalAlign( Match(SeqPos,GetOffset(RandSeq)), DM); > update(ByRefShake,m[Sim]);and secondly, with a sequence of the same length, we compute an exact match
> n := DynProg(SearchSeq,CreateRandSeq(Len,AF),DM,Len,Len); > update(ByDynProgr,n[1]); > od: Error, cannot apply selector on object
We can display or use the information in two different ways.
First we use the stats
function which prints all the information.
Internal constants define the intervals to be at the 95%
confidence level.
For the refined-shaked matches, the similarity score is
characterized by
> print(ByRefShake); Error, (in Stats_print) division by zero
The second method uses selectors on the Stat
structure.
Mean
,
Variance
,
VarVariance
(the variance of the
reported variance), Skewness
,
Excess
,
Number
,
MeanVar
(the mean and its 95%
confidence interval as a name),
Minimum
, and Maximum
.
The function ProbStdDev
takes a probability
as an argument and returns the number of standard
deviations away from the mean
that will give this probability.
The function ProbStdDev
can be defined in terms
of the inverse complementary error
function
erfcinv
as:
> ProbStdDev := p -> sqrt(2)*erfcinv(2*p):The values for 0.025 and 0.001 are well known:
> ProbStdDev(0.025), ProbStdDev(0.001); 1.9600, 3.0902To expect 10 random matches, we should set the probability to be 10/n, where n is the number of entries in the database. The goal should then be
> Z := ProbStdDev( 10 / DB[TotEntries] ); Error, cannot apply selector on objectstandard deviations away from the mean random score.
From this data we can compute an approximation to the goal:
> Goal := ByRefShake[Mean] + Z*sqrt(ByRefShake[Variance]); Error, (in Stats_select) division by zero
The last piece of information needed is the maximum possible similarity score. This can be computed by adding all the corresponding diagonal entries of the Dayhoff matrix for each amino acid in the fragment:
> MaxGoal := 0: > for i to Len do > MaxGoal := MaxGoal + DM[Sim,AToInt(SearchSeq[i]),AToInt(SearchSeq[i])] > od: > MaxGoal; 45.9342Since the goal is relatively far from the maximum possible score, we are assured that the interval is not too narrow.