Exploiting sparse statistics for sequence-based prediction of the effect of mutations
Abstract:
Recent work showed that the statistics of amino acid triplets and
quadruplets in sequences of folded proteins differ significantly from
those in randomly generated sequences.
These statistics were used to assign a score to each sequence and
to predict whether the sequence is likely to fold.
The present paper extends the statistics to higher multiplets and
suggests a way to treat multiplets that were not found
in the set of folded proteins.
In particular, foldability predictions were made along the lines of the
previous work using pentuplet statistics, and a way was found to combine
the quadruplet and pentuplet statistics to improve the foldability predictions.
A different, simpler score was defined for hextuplets and
heptuplets and was used to predict the direction of the stability change
of a protein upon mutation.
With the best score combination, the accuracy of the prediction was 73.4%.
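
As an illustration of the kind of multiplet statistics discussed above, the following minimal Python sketch counts overlapping k-residue multiplets in a reference set of sequences and lists the multiplets of a query sequence that never occur in that set. The function names (multiplets, count_multiplets, unseen_multiplets) and the toy sequences are hypothetical, and the sketch does not reproduce the scores defined in the paper; it only shows how sparse the counts become and how unseen multiplets can be detected.

```python
from collections import Counter


def multiplets(sequence, k):
    """Return all overlapping length-k windows of an amino acid sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_multiplets(sequences, k):
    """Count how often each length-k multiplet occurs in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(multiplets(seq, k))
    return counts


def unseen_multiplets(sequence, reference_counts, k):
    """List the multiplets of `sequence` that are absent from the reference counts."""
    return [m for m in multiplets(sequence, k) if m not in reference_counts]


# Toy data (assumption): two short "folded" sequences and one point mutant.
folded = ["ACDEFG", "CDEFGH"]
pentuplet_counts = count_multiplets(folded, 5)
missing = unseen_multiplets("ACDKFG", pentuplet_counts, 5)
```

In this toy example a single substitution (E to K) makes every pentuplet of the mutant absent from the reference set, which is why the treatment of unseen multiplets matters more as the multiplet length grows.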