Description of the program Pspace: a program to plan the
covering of a protein space and to search for orthologs
Mihaly Mezei
Department of Pharmacological Sciences,
Icahn School of Medicine at Mount Sinai
New York, NY 10029
Mihaly.Mezei@mssm.edu
March 27, 2006.
The program is run interactively.
At the start it opens a log file. After that,
it repeatedly offers the user
a choice of one of the following functions:
- Select and read a scoring matrix and set the gap-penalty values for
initiating and extending a gap.
The 66 matrices provided by the database
AAindex,
Version 3.0 (Kawashima, S., Ogata, H., and Kanehisa, M.;
AAindex: amino acid index database.
Nucleic Acids Res. 27, 368-369 (1999))
are included in the distribution.
- Read a set of sequences
(in FASTA format)
either as first or as second set (for possible ortholog search).
A Postscript plot of the distribution of pairwise % identities and
alignment scores is also generated.
- Check a set for redundancy:
cluster the sequences in a set by % homology and select the one in the
'middle' as the representative of the cluster.
The 'middle' is defined as the sequence whose lowest homology
with the rest of the cluster members is the highest.
- Initialize the weight calculation,
assuming that no structure has been determined
for the proteins represented by the sequences in the set
- Add sequences representing proteins with known structure to a set
- Add sequences representing proteins with unknown structure to a set
- Change the status of selected proteins from unknown to known
- Define residues of special emphasis in sequences already
in the database and specify a different percentage identity threshold
for this subset of residues
- Find a subset of sequences that covers the whole set
using one of the four algorithms:
- Greedy and coordinated:
Determine structure of the protein with the highest weight in the set U
- Stochastic and coordinated:
Determine structures of proteins from the set U
with a probability proportional to a weight associated with each protein
- Random and coordinated:
Determine structures of proteins from the set U
with uniform probability considering only proteins whose weight is positive
- Random and uncoordinated:
Determine structures of proteins from the set U
with uniform probability considering all proteins in the set U
- Match the sequences on the two sets
with one of the following algorithms (the list of matches will be written
to a file with extension .mat):
- For each sequence S1 in one set, list all sequences in the other set
that are within a user-defined percentage (default: 5%) of the
best match to S1
(sequences in the second set may appear on more than one list)
- For each sequence S1 in one set, list all sequences in the other set
that are within a user-defined percentage (default: 5%) of the
best match to S1
and are better matched to S1 than to any other sequence in
the first set (sequences in the second set may appear on only one list)
- Match sequences in an optimal way (maximize the minimum match score)
using the optimization procedure called the Hungarian method
- Report the weights assigned to the sequences in a set on the log file
- Report the content of the whole database
(sequences, pairwise scores, weights) in the log file
- Save the database
- Restore the database
- Exit
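Two of the rules in the list above are compact enough to illustrate in a few lines. The following is a minimal Python sketch, not code taken from Pspace (which is written in Fortran 77). It shows the 'middle' representative of a cluster, and the first matching rule (all sequences within a given percentage of the best match). The function names, the dictionary layout and the sample numbers are hypothetical. The sketch also assumes that scores are positive and that "within 5%" means a score of at least 95% of the best score, which the description above does not spell out.

```python
def cluster_representative(members, similarity):
    """Return the member whose lowest similarity to the other members is highest."""
    if len(members) == 1:
        return members[0]

    def worst_case(candidate):
        # Lowest similarity between the candidate and any other cluster member.
        return min(similarity[candidate][other]
                   for other in members if other != candidate)

    return max(members, key=worst_case)


def near_best_matches(scores, percent=5.0):
    """Return the names whose score is within `percent` of the best score.

    Assumption: scores are positive and "within" is measured relative
    to the best score.
    """
    best = max(scores.values())
    cutoff = best * (1.0 - percent / 100.0)
    return sorted(name for name, score in scores.items() if score >= cutoff)


# Hypothetical pairwise similarities (in %) for a three-member cluster.
similarity = {
    "A": {"B": 80.0, "C": 60.0},
    "B": {"A": 80.0, "C": 75.0},
    "C": {"A": 60.0, "B": 75.0},
}
representative = cluster_representative(["A", "B", "C"], similarity)

# Hypothetical match scores of one sequence S1 against a second set.
matches = near_best_matches({"X": 200.0, "Y": 195.0, "Z": 150.0})

print(representative, matches)
```

In this toy cluster the lowest similarities are 60 for A, 75 for B and 60 for C, so B is chosen as the representative. With the default 5% threshold the cutoff is 190, so X and Y are listed as matches.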
Compilation of the program
The program is written in Fortran 77. Its size is governed by
the following parameters, which are established in the first line
of the program (the number in braces is the value set in the source code):
- MAXAA {30} - maximum number of different amino acids
- MAXRES {2000} - maximum number of residues in a sequence
- MAXSEQ {1000} - maximum number of sequences in one set
- MAXNG {100} - maximum number of sequences in the 'vicinity'
of a sequence
- MAXDB {1000} - maximum number of sequences in the database
- MAXSPAV {20} - average number of special residues
per sequence in the database
It should be compiled at the highest optimization level for maximum speed.
For example, with the g77 compiler the program can be compiled
with
g77 -O4 -o pspace.exe pspace.f