July 27, 2020.
I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II. RUNNING THE PROGRAM - INPUT
III. THE OUTPUT ON THE TERMINAL
IV. FILE FORMATS
V. EXAMPLES
VI. INSTALLATION
VII. CHANGING DIMENSION (ARRAY SIZES)
I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II. RUNNING THE PROGRAM - INPUT
The program is run from the command line by issuing the following command:
fold -op operation -di directive -di directive ...
where operation is one of the following:
Code Explanation of the operation
GOR1 SS prediction with GOR-1 method
GOR2 SS prediction with GOR-2 method
CF01 SS prediction with Chou-Fasman-1 method
CF02 SS prediction with Chou-Fasman-2 method
CF03 SS prediction with Chou-Fasman-3 method
ADJS AA-AA correlation as a function of sequence distance
ADFB Forward and backward AA neighbor propensities comparison
TRIP Triplet frequency distribution generation
QUAD Quadruplet frequency distribution generation
PENT Pentuplet frequency distribution generation
HEXA Hextuplet frequency distribution generation
HEPT Heptuplet frequency distribution generation
SC03 Triplet score calculation for a protein chain
SC04 Quadruplet score calculation for a protein chain
SC34 Triplet-Quadruplet score calculation for a protein chain
SC05 Pentuplet score calculation for a protein chain
CORT Calculation of correlation between triplet score and Tmelt
PLOT Generating GNUplot input from several output files
OVRL Calculation of overlaps among distributions
NRDS Calculation of residue number/chain distributions
PRFO Predict the likelihood of foldability for a set of sequences
FILT Filter a sequence set for percent identity
FILI Continue filtering a sequence set for percent identity
MUTA Mutation effects prediction
PCOV n-tuplet coverage as a function of sample size
CHPR Chameleon propensity calculation
RCIF Reads a list of PDB-CIF file names, a file with seq & SS info
and the following input directives have been implemented:
Directive  Explanation; default; operation(s)
-op  Operation (see above); default: Help
-da  (READ|URAND|WRAND): Sequence list source; default: READ
-in  Input file name
-it  Melting temperature input file name; operation(s): CORT
-ch  Chameleon list input file; operation(s): CHPR
-ou  Output file name
-ml  Minimum sequence length to use; default: 20
-ns  Number of sequences to generate; default: 100000; operation(s): URAND WRAND
-sl  Length of sequences to generate; default: 200; operation(s): URAND WRAND
-ni  Number of neighbor distance increments to use; default: 20; operation(s): ADJS ADFB
-ka  (DATA|GENERIC|LOCAL): AA propensity source; default: GENERIC
-sf  (SSPR|PDB|PIR): Sequence input format; default: SSPR
-sc  (LIN|LN): Score type (linear or logarithmic); default: LN; operation(s): TRIP QUAD
-ad  Adjacency frequency data file name (w/o extension); default: trip.dat; operation(s): ADJS
-fd  Asymmetry frequency data file name (w/o extension); default: trip.dat; operation(s): ADFB
-td  Triplet frequency data file name (w/o extension); default: trip.dat; operation(s): TRIP SC03 SC34 PRFO MUTA
-qd  Quadruplet frequency data file name (w/o extension); default: quad.dat; operation(s): QUAD SC04 SC34 PRFO MUTA
-pd  Pentuplet frequency data file name (w/o extension); default: pent.dat; operation(s): PENT SC05 PRFO MUTA
-6d  Hextuplet frequency data file name (w/o extension); default: hexa.dat; operation(s): HEXA MUTA
-7d  Heptuplet frequency data file name (w/o extension); default: hept.dat; operation(s): HEPT MUTA
-md  File of mutations; operation(s): MUTA
-ts  Significancy threshold for adjacency scores; default: 1.2; operation(s): SCAD SCFB PRFO
-fs  First sequence to use; default: 1
-ls  Last sequence to use; default: MAXSEQ
-ss  (H|S|L): Secondary structure to limit analysis
-rl  Ratio limit for adjacency gnuplot input; default: 1.15; operation(s): ADJS
-st  Initial value on the X axis for plot input; default: -1.0; operation(s): PLOT
-ic  Increment on the X axis for plot input; default: 0.02; operation(s): PLOT
-lf  Name of the file listing the files to extract distributions from; operation(s): PLOT OVRL PRFO
-pm  Minimum percent identity to keep; operation(s): FILT FILI
-gp  Gap penalty for sequence alignment; default: -12.0; operation(s): FILT FILI
-ep  Penalty for gap extension in sequence alignment; default: -1.0; operation(s): FILT FILI
-na  (WANN|NANN): SS annotations will or will not be copied to the filtered list; default: WANN; operation(s): FILT FILI
-nr  Progress report and checkpoint file writing frequency; default: 10000; operation(s): FILT FILI
-sa  (ARIT|GEOM|C45F): Score averaging: arithmetic mean | geometric mean | quad with pent corr.; default: ARIT; operation(s): PRFO
-lu  Folding prediction uncertainty; default: 1.2; operation(s): PRFO
-us  Distribution type to use for foldability prediction (ADJS|TRIP|QUAD|PENT); operation(s): PRFO
-uw  (1|2): use uniform (1) or propensity-weighted (2) random reference distribution; default: 2; operation(s): PRFO
-sd  Distribution ratio limit for folding prediction; default: 1.0; operation(s): PRFO
-sd  Random number seed; default: 1357
-sn  Tuplets are defined by steps 1, 2, ...; default: 1; operation(s): TRIP QUAD PENT HEXA HEPT
-nf  facnull: probability of n-tuplets not found is set to facnull/20^n; default: 0.5; operation(s): TRIP QUAD PENT HEXA HEPT
-hd  Minimum number of trailing/leading HIS to consider His tags and remove; default: 3
-iv  (NONE|ECH1|ECH2|ECH3): Output level; default: ECH1
-hp  Help (this and the previous table)
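The command-line form described above can be sketched with a small Python helper that assembles the argument list for one run. This is only an illustration under stated assumptions: it assumes each directive is given as a hyphen, its two-character key, and then its value; the helper name build_fold_command and the file names are hypothetical; only the operation codes and directive keys are taken from the tables in this section.

```python
# Operation codes copied from the table of operations in this manual.
OPERATIONS = {
    "GOR1", "GOR2", "CF01", "CF02", "CF03", "ADJS", "ADFB", "TRIP", "QUAD",
    "PENT", "HEXA", "HEPT", "SC03", "SC04", "SC34", "SC05", "CORT", "PLOT",
    "OVRL", "NRDS", "PRFO", "FILT", "FILI", "MUTA", "PCOV", "CHPR", "RCIF",
}


def build_fold_command(operation, directives, executable="fold"):
    """Return the argument list for one run (hypothetical helper).

    Assumes every directive is written as '-xx value', where xx is the
    two-character key from the directive table.
    """
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation code: {operation}")
    command = [executable, "-op", operation]
    for key, value in directives.items():
        command += [f"-{key}", str(value)]
    return command


# Hypothetical file names; the directive keys come from the table above.
cmd = build_fold_command(
    "TRIP",
    {"in": "sequences.pir", "sf": "PIR", "ou": "trip.out", "ml": 30},
)
print(" ".join(cmd))
```

The helper only builds and prints the command; it does not run the program. Checking the operation code before building the list catches misspelled codes early.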
III. THE OUTPUT OF THE PROGRAM
The program prints summary information about the data used on the terminal: the files read and written, and the number of sequences checked and processed. The output written to the file specified by the -ou directive depends on the operation selected. In each case it includes the various parameter choices (either set explicitly or used by default). In most cases it also includes the AA propensity distribution.
Most distributions are gathered in 100-element bins. The number in each bin is printed and a simple histogram is generated in the output file.
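The binning and text histogram can be illustrated with a short Python sketch. It is not the program's own code. It assumes that "100-element bins" means 100 equal-width bins, and it uses the range -1.0 to 1.0 only because that matches the defaults of the -st and -ic directives (start -1.0, increment 0.02). The function names, the '*' bar character, and the sample scores are all hypothetical.

```python
def bin_counts(values, lo, hi, nbins=100):
    """Count how many values fall into each of nbins equal-width bins on [lo, hi]."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v <= hi:
            # A value equal to hi is assigned to the last bin.
            index = min(int((v - lo) / width), nbins - 1)
            counts[index] += 1
    return counts


def text_histogram(counts, lo, hi, max_bar=50):
    """Return one line per non-empty bin: lower edge, count and a bar of '*'."""
    width = (hi - lo) / len(counts)
    peak = max(max(counts, default=0), 1)
    lines = []
    for i, n in enumerate(counts):
        if n == 0:
            continue
        bar = "*" * max(1, round(n * max_bar / peak))
        lines.append(f"{lo + i * width:8.3f} {n:8d} {bar}")
    return lines


# Made-up scores, used only to exercise the two functions.
scores = [-0.95, -0.5, -0.5, 0.0, 0.25, 0.25, 0.25, 1.0]
counts = bin_counts(scores, -1.0, 1.0)
print("\n".join(text_histogram(counts, -1.0, 1.0)))
```

Scaling the bars to the largest bin keeps the histogram within a fixed line width regardless of the sample size.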
IV. FILE FORMATS
The program can read the protein information in two different formats:
VI. INSTALLATION
The distribution includes an executable (compiled with Intel Fortran under
Linux); for other architectures it may be necessary to compile the program.
The program has to be compiled with a Fortran compiler, e.g.:
f77 -o fold fold.f
to obtain the executable fold.
Some compilers fail due to a so-called 'relocation error'.
When using the Intel Fortran compiler (ifort), adding the compiler options
-mcmodel=medium -shared-intel
solved the problem. With some of the other compilers (but not the GNU compiler)
the compiler option -fpic was found to solve the problem.
By default, the heptamer option is disabled because the program may need larger arrays than some compilers (e.g., the GNU compiler) are able to handle. To enable heptamer calculation, change the parameter IUSE7 from 1 to 20.
If the program is to be compiled as Fortran95, the Simulaid distribution contains a preprocessor, f77tof95.f and f77tof95_f95.f (the second is the Fortran95 version of the first), that changes the syntax to conform to Fortran95 requirements.
VII. CHANGING DIMENSION (ARRAY SIZES)
The sizes of the arrays are established with parameter statements throughout the code. Several symbols are used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some others. Below is a list of these symbols (the program checks for violations).