Description of the program FOLD: exploration of protein sequence-structure relations

Mihaly Mezei

Department of Pharmacological Sciences,
Icahn School of Medicine at Mount Sinai,
New York, NY 10029

Mihaly.Mezei@mssm.edu

July 27, 2020.

References:
M. Mezei, On predicting foldability of a protein from its sequence, Proteins, 88, 355-365 (2020). DOI:10.1002/prot.25811
M. Mezei, Exploiting sparse statistics for sequence-based prediction of the effect of mutations, Algorithms, 12, 214 (2019). DOI:10.3390/a12100214
M. Mezei, Foldability and chameleon propensity of fold-switching protein sequences, Proteins, accepted (2020). DOI:

 
 
I.  DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II.  RUNNING THE PROGRAM - INPUT
III.  THE OUTPUT OF THE PROGRAM
IV.  FILE FORMATS
V.  EXAMPLES
VI.  INSTALLATION
VII.  CHANGING DIMENSIONS (ARRAY SIZES)

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

II. RUNNING THE PROGRAM - INPUT

The program is run from the command line by issuing the following command:

fold -op operation -di directive -di directive ...
where operation is one of the following:

Code Explanation of the operation
GOR1 SS prediction with GOR-1 method
GOR2 SS prediction with GOR-2 method
CF01 SS prediction with Chou-Fasman-1 method
CF02 SS prediction with Chou-Fasman-2 method
CF03 SS prediction with Chou-Fasman-3 method
ADJS AA-AA correlation as a function of sequence distance
ADFB Forward and backward AA neighbor propensities comparison
TRIP Triplet frequency distribution generation
QUAD Quadruplet frequency distribution generation
PENT Pentuplet frequency distribution generation
HEXA Hextuplet frequency distribution generation
HEPT Heptuplet frequency distribution generation
SC03 Triplet score calculation for a protein chain
SC04 Quadruplet score calculation for a protein chain
SC34 Triplet-Quadruplet score calculation for a protein chain
SC05 Pentuplet score calculation for a protein chain
CORT Calculation of correlation between triplet score and Tmelt
PLOT Generating GNUplot input from several output files
OVRL Calculation of overlaps among distributions
NRDS Calculation of residue number/chain distributions
PRFO Predict the likelihood of foldability for a set of sequences
FILT Filter a sequence set for percent identity
FILI Continue filtering a sequence set for percent identity
MUTA Mutation effects prediction
PCOV n-tuplet coverage as a function of sample size
CHPR Chameleon propensity calculation
RCIF Reads a list of PDB-CIF file names and writes a file with seq & SS info

and the following input directives have been implemented:

Directive  Explanation                                                                                 Default   Operation(s)
-op        Operation (see above)                                                                       Help
-da        (READ|URAND|WRAND): Sequence list source                                                    READ
-in        Input file name
-it        Melting temperature input file name                                                                   CORT
-ch        Chameleon list input file                                                                             CHPR
-ou        Output file name
-ml        Minimum sequence length to use                                                              20
-ns        Number of sequences to generate                                                             100000    URAND WRAND
-sl        Length of sequences to generate                                                             200       URAND WRAND
-ni        Number of neighbor distance increments to use                                               20        ADJS ADFB
-ka        (DATA|GENERIC|LOCAL): AA propensity source                                                  GENERIC
-sf        (SSPR|PDB|PIR): Sequence input format                                                       SSPR
-sc        (LIN|LN): Score type (linear or logarithmic)                                                LN        TRIP QUAD
-ad        Adjacency frequency data file name (w/o extension)                                          trip.dat  ADJS
-fd        Asymmetry frequency data file name (w/o extension)                                          trip.dat  ADFB
-td        Triplet frequency data file name (w/o extension)                                            trip.dat  TRIP SC03 SC34 PRFO MUTA
-qd        Quadruplet frequency data file name (w/o extension)                                         quad.dat  QUAD SC04 SC34 PRFO MUTA
-pd        Pentuplet frequency data file name (w/o extension)                                          pent.dat  PENT SC05 PRFO MUTA
-6d        Hextuplet frequency data file name (w/o extension)                                          hexa.dat  HEXA MUTA
-7d        Heptuplet frequency data file name (w/o extension)                                          hept.dat  HEPT MUTA
-md        File of mutations                                                                                     MUTA
-ts        Significance threshold for adjacency scores                                                 1.2       SCAD SCFB PRFO
-fs        First sequence to use                                                                       1
-ls        Last sequence to use                                                                        MAXSEQ
-ss        (H|S|L): Secondary structure to limit analysis
-rl        Ratio limit for adjacency gnuplot input                                                     1.15      ADJS
-st        Initial value on the X axis for plot input                                                  -1.0      PLOT
-ic        Increment on the X axis for plot input                                                      0.02      PLOT
-lf        Name of the file listing the files to extract distributions from                                      PLOT OVRL PRFO
-pm        Minimum percent identity to keep                                                                      FILT FILI
-gp        Gap penalty for sequence alignment                                                          -12.0     FILT FILI
-ep        Penalty for gap extension in sequence alignment                                             -1.0      FILT FILI
-na        (WANN|NANN): SS annotations will or will not be copied to the filtered list                 WANN      FILT FILI
-nr        Progress report and checkpoint file writing frequency                                       10000     FILT FILI
-sa        (ARIT|GEOM|C45F): Score averaging: arithmetic mean | geometric mean | quad with pent corr.  ARIT      PRFO
-lu        Folding prediction uncertainty                                                              1.2       PRFO
-us        Distribution type to use for foldability prediction (ADJS|TRIP|QUAD|PENT)                             PRFO
-uw        (1|2): Random reference distribution: uniform (1) or propensity-weighted (2)                2         PRFO
-sd        Distribution ratio limit for folding prediction                                             1.0       PRFO
-sd        Random number seed                                                                          1357
-sn        Tuplets are defined by steps 1, 2, ...                                                      1         TRIP QUAD PENT HEXA HEPT
-nf        facnull: probability of n-tuplets not found is set to facnull/20^n                          0.5       TRIP QUAD PENT HEXA HEPT
-hd        Minimum number of trailing/leading HIS to consider a His tag and remove                     3
-iv        (NONE|ECH1|ECH2|ECH3): Output level                                                         ECH1
-hp        Help (this and the previous table)
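
As an illustration of what the -sn and -nf directives control, the following Python sketch counts n-tuplets in a sequence and assigns a fallback probability to tuplets that were never observed. It is not the program's Fortran code. The function names are hypothetical, and two readings are assumptions: that the -sn step is the spacing between the residues forming a tuplet, and that the fallback probability is facnull divided by 20 to the power n (the number of possible n-tuplets over 20 amino acids).

```python
from collections import Counter


def count_tuplets(seq, n=3, step=1):
    """Count n-tuplets whose residues are `step` positions apart.

    Assumption: `step` plays the role of the -sn directive.
    """
    span = (n - 1) * step  # distance between first and last residue
    counts = Counter()
    for i in range(len(seq) - span):
        # Slice with a stride so that step=2 picks residues i, i+2, i+4, ...
        counts[seq[i:i + span + 1:step]] += 1
    return counts


def tuplet_probability(counts, tuplet, facnull=0.5):
    """Observed relative frequency, or facnull/20**n if never observed.

    Assumption: this mirrors the -nf directive (default 0.5).
    """
    found = counts.get(tuplet, 0)
    if found == 0:
        return facnull / 20 ** len(tuplet)
    return found / sum(counts.values())


counts = count_tuplets("ACDACD", n=3, step=1)
p_seen = tuplet_probability(counts, "ACD")    # seen twice out of four triplets
p_unseen = tuplet_probability(counts, "WWW")  # never seen, uses the fallback
```

With 20 amino acids there are 8000 possible triplets but 1.28 billion possible heptuplets, which is why a fallback for unobserved tuplets matters more as n grows.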

III. THE OUTPUT OF THE PROGRAM

The program prints to the terminal a summary of the data used: the files read and written, and the number of sequences checked and processed. The output written to the file specified by the -ou directive depends on the operation selected. In each case it includes the parameter choices (whether set explicitly or used by default). In most cases it also prints the AA propensity distribution.

Most distributions are gathered in 100-element bins. The number in each bin is printed and a simple histogram is written to the output file.
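
The binning and text histogram can be pictured with the short Python sketch below. It only illustrates the idea and does not reproduce the program's output format. The function names, the bar character, and the bar width are hypothetical, and it assumes 100 equal-width bins starting at a chosen lower bound.

```python
import math


def bin_values(values, start, width, nbins=100):
    """Sort values into `nbins` equal-width bins beginning at `start`."""
    counts = [0] * nbins
    for v in values:
        # floor (not int) so values just below `start` fall outside bin 0
        k = math.floor((v - start) / width)
        if 0 <= k < nbins:
            counts[k] += 1
    return counts


def histogram_lines(counts, start, width, max_bar=50):
    """Return one text line per non-empty bin: lower edge, count, bar."""
    peak = max(counts, default=0)
    lines = []
    for k, c in enumerate(counts):
        if c == 0:
            continue
        bar = "*" * max(1, round(max_bar * c / peak))
        lines.append(f"{start + k * width:8.2f} {c:8d} {bar}")
    return lines


scores = [-0.99, -0.99, 0.01, 0.995, 1.5, -1.01]
counts = bin_values(scores, start=-1.0, width=0.02)
lines = histogram_lines(counts, start=-1.0, width=0.02)
```

Values outside the binned range (here 1.5 and -1.01) are simply skipped in this sketch; how the program treats them is not specified here.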

IV. FILE FORMATS

The program can read the protein information in two different formats:

V. EXAMPLES

  1. Filtering a sequence set for similarity:
    fold -op FILT -da READ -in ss.txt -pm 50 -ou ss_nr50.out
    The filtered set (maximum 50% similarity) will be in the file ss_50.flt

  2. Secondary structure prediction with the GOR-1 method on the filtered sequence set:
    fold -op GOR1 -da READ -in ss_50.flt -ml 20 -sf SSPR -ou gor1_nr50.out

  3. Secondary structure prediction on a random sequence set:
    fold -op GOR1 -da WRAND -ns 100000 -sl 200 -ou gor1_wrand.out

  4. Generating GNUplot input to plot distributions:
    ls gor1*.out > gor1.list
    fold -op PLOT -st 1 -ic 1 -lf gor1.list -ou gor1.plot

  5. Generating triplet and quadruplet (relative) propensities:
    fold -op TRIP -da READ -in ss_50.flt -sc LN -td trip_nr50.dat -ou trip_nr50.out
    fold -op QUAD -da READ -in ss_50.flt -sc LN -qd quad_nr50.dat -ou quad_nr50.out

  6. Calculating triplet and quadruplet scores on an input sequence set:
    fold -op SC03 -da READ -in disprot.pir -sf PIR -sc LN -td trip_nr50.dat -ou trip_idp_score.out
    fold -op SC04 -da READ -in disprot.pir -sf PIR -sc LN -qd quad_nr50.dat -ou quad_idp_score.out

  7. Calculating triplet and quadruplet scores on a random sequence set:
    fold -op SC03 -da URAND -sc LN -td trip_nr50.dat -ns 100000 -sl 200 -ou trip_urand_score.out
    fold -op SC04 -da URAND -sc LN -qd quad_nr50.dat -ns 100000 -sl 200 -ou quad_urand_score.out

  8. Generating GNUplot input to plot sequence score distributions:
    ls trip*score.out > trip.list
    fold -op PLOT -st -1.0 -ic 0.02 -lf trip.list -ou trip.plot

    ls quad*score.out > quad.list
    fold -op PLOT -st -1.0 -ic 0.02 -lf quad.list -ou quad.plot

  9. Calculating adjacency propensities:
    fold -op ADJS -in ss_nr50.txt -ni 10 -ou adjs.out

  10. Calculating adjacency asymmetry propensities:
    fold -op ADFB -in ss_nr50.txt -ni 10 -ou adfb.out

  11. Calculating correlation between triplet scores and input melting temperatures:
    fold -op CORT -in trip_ss_nr50_score.out -it Tm_data_full.txt -ou TM_corr.out

  12. Calculating overlaps between triplet or quadruplet score distributions:
    fold -op OVRL -lf trip.list
    fold -op OVRL -lf quad.list

  13. Calculating triplet scores on selected secondary structure elements only:
    fold -op TRIP -in ss_nr50.txt -ss H -td trip_nr50_H.dat -ou trip_nr50_H.out
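
Example 12 compares score distributions. One common way to quantify the overlap of two binned distributions is to normalize each to unit sum and add up the bin-wise minima, which gives 0 for disjoint and 1 for identical distributions. The Python sketch below shows that definition only. Whether the OVRL operation uses this exact measure is an assumption, and the function names are hypothetical.

```python
def normalize(counts):
    """Scale bin counts so that they sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty distribution")
    return [c / total for c in counts]


def overlap(counts_a, counts_b):
    """Sum of bin-wise minima of two normalized distributions.

    Assumption: both distributions use the same bins.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("distributions must have the same number of bins")
    pa, pb = normalize(counts_a), normalize(counts_b)
    return sum(min(a, b) for a, b in zip(pa, pb))


disjoint = overlap([1, 1, 0, 0], [0, 0, 1, 1])
same_shape = overlap([2, 2], [5, 5])
partial = overlap([1, 3], [3, 1])
```

Because the counts are normalized first, two sets with different numbers of sequences (for example a PDB-derived set and 100000 random sequences) can be compared directly.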

VI. INSTALLATION

The distribution includes an executable (compiled with Intel Fortran under Linux); for other architectures it may be necessary to compile the program. It has to be compiled with a Fortran compiler, e.g.:
f77 -o fold fold.f
to obtain the executable fold

Some compilers fail with a so-called 'relocation error'. When using the Intel Fortran compiler (ifort), adding the compiler options
-mcmodel=medium -shared-intel
solved the problem. With some of the other compilers (but not the GNU compiler) the compiler option -fpic was found to solve the problem.

By default, the heptamer option is disabled because the program may need larger arrays than some compilers (e.g., the GNU compiler) can handle. To enable heptamer calculation, change the parameter IUSE7 from 1 to 20.

If the program is to be compiled with Fortran95, the Simulaid distribution includes a preprocessor, f77tof95.f and f77tof95_f95.f (the second is the Fortran95 version of the first), that changes the syntax to conform to Fortran95 requirements.

VII. CHANGING DIMENSIONS

The sizes of the arrays are established with parameter statements throughout the code. Several symbols are used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some others. Below is a list of these symbols (the program checks for violations).

IMPORTANT: Parameter statements for most symbols occur in several places in the program. When a change is required, it has to be carried out at ALL occurrences!