Description of the program FOLD: exploration of protein sequence-structure relations

Mihaly Mezei

Department of Pharmacological Sciences,
Icahn School of Medicine at Mount Sinai,
New York, NY 10029

July 27, 2020.

References:
M. Mezei, On predicting foldability of a protein from its sequence, Proteins, 88, 355-365 (2020). DOI:10.1002/prot.25811
M. Mezei, Exploiting sparse statistics for sequence-based prediction of the effect of mutations, Algorithms, 12, 214 (2019). DOI:10.3390/a12100214
M. Mezei, Foldability and chameleon propensity of fold-switching protein sequences, Proteins, accepted (2020). DOI:

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

II. RUNNING THE PROGRAM - INPUT

III. THE OUTPUT ON THE TERMINAL

IV. FILE FORMATS

V. EXAMPLES

VI. INSTALLATION

VII. CHANGING DIMENSION (ARRAY SIZES)


I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

II. RUNNING THE PROGRAM - INPUT

The program is run from the command line by issuing the following command:

fold -op operation -di directive -di directive ...
where operation is one of the following:

Code	Explanation of the operation
GOR1	SS prediction with GOR-1 method
GOR2	SS prediction with GOR-2 method
CF01	SS prediction with Chou-Fasman-1 method
CF02	SS prediction with Chou-Fasman-2 method
CF03	SS prediction with Chou-Fasman-3 method
ADJS	AA-AA correlation as a function of sequence distance
ADFB	Forward and backward AA neighbor propensities comparison
TRIP	Triplet frequency distribution generation
QUAD	Quadruplet frequency distribution generation
PENT	Pentuplet frequency distribution generation
HEXA	Hextuplet frequency distribution generation
HEPT	Heptuplet frequency distribution generation
SC03	Triplet score calculation for a protein chain
SC04	Quadruplet score calculation for a protein chain
SC34	Triplet-Quadruplet score calculation for a protein chain
SC05	Pentuplet score calculation for a protein chain
CORT	Calculation of correlation between triplet score and Tmelt
PLOT	Generating GNUplot input from several output files
OVRL	Calculation of overlaps among distributions
NRDS	Calculation of residue number/chain distributions
PRFO	Predict the likelihood of foldability for a set of sequences
FILT	Filter a sequence set for percent identity
FILI	Continue filtering a sequence set for percent identity
MUTA	Mutation effects prediction
PCOV	n-tuplet coverage as a function of sample size
CHPR	Chameleon propensity calculation
RCIF	Reads a list of PDB-CIF file names, a file with seq & SS info

and the following input directives have been implemented:

Directive	Explanation	Default	Operation(s)
-op	Operation (see above)	Help
-da	(READ|URAND|WRAND): Sequence list source	READ
-in	Input file name
-it	Melting temperature input file name		CORT
-ch	Chameleon list input file		CHPR
-ou	Output file name
-ml	Minimum sequence length to use	20
-ns	Number of sequences to generate	100000	URAND WRAND
-sl	Length of sequences to generate	200	URAND WRAND
-ni	Number of neighbor distance increments to use	20	ADJS ADFB
-ka	(DATA|GENERIC|LOCAL): AA propensity source	GENERIC
-sf	(SSPR|PDB|PIR): Sequence input format	SSPR
-sc	(LIN|LN): Score type (linear or logarithmic)	LN	TRIP QUAD
-ad	Adjacency frequency data file name (w/o extension)	trip.dat	ADJS
-fd	Asymmetry frequency data file name (w/o extension)	trip.dat	ADFB
-td	Triplet frequency data file name (w/o extension)	trip.dat	TRIP SC03 SC34 PRFO MUTA
-qd	Quadruplet frequency data file name (w/o extension)	quad.dat	QUAD SC04 SC34 PRFO MUTA
-pd	Pentuplet frequency data file name (w/o extension)	pent.dat	PENT SC05 PRFO MUTA
-6d	Hextuplet frequency data file name (w/o extension)	hexa.dat	HEXA MUTA
-7d	Heptuplet frequency data file name (w/o extension)	hept.dat	HEPT MUTA
-md	File of mutations		MUTA
-ts	Significance threshold for adjacency scores	1.2	SCAD SCFB PRFO
-fs	First sequence to use	1
-ls	Last sequence to use	MAXSEQ
-ss	(H|S|L): Secondary structure to limit analysis
-rl	Ratio limit for adjacency gnuplot input	1.15	ADJS
-st	Initial value on the X axis for plot input	-1.0	PLOT
-ic	Increment on the X axis for plot input	0.02	PLOT
-lf	Name of the file listing the files to extract distributions from		PLOT OVRL PRFO
-pm	Minimum percent identity to keep		FILT FILI
-gp	Gap penalty for sequence alignment	-12.0	FILT FILI
-ep	Penalty for gap extension in sequence alignment	-1.0	FILT FILI
-na	(WANN|NANN): SS annotations will or will not be copied to the filtered list	WANN	FILT FILI
-nr	Progress report and checkpoint file writing frequency	10000	FILT FILI
-sa	(ARIT|GEOM|C45F): Score averaging: arithmetic mean | geometric mean | quad with pent corr.	ARIT	PRFO
-lu	Folding prediction uncertainty	1.2	PRFO
-us	Distribution type to use for foldability prediction (ADJS|TRIP|QUAD|PENT)		PRFO
-uw	(1|2): use uniform (1) or propensity-weighted (2) random reference distribution	2	PRFO
-sd	Distribution ratio limit for folding prediction	1.0	PRFO
-sd	Random number seed	1357
-sn	Tuplets are defined by steps 1, 2, ...	1	TRIP QUAD PENT HEXA HEPT
-nf	facnull: the probability of n-tuplets not found is set to facnull/20ⁿ	0.5	TRIP QUAD PENT HEXA HEPT
-hd	Minimum number of trailing/leading HIS residues to treat as a His tag and remove	3
-iv	(NONE|ECH1|ECH2|ECH3): Output level	ECH1
-hp	Help (this and the previous table)
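
To illustrate the -sn and -nf directives, the following Python sketch (not taken from the program source) extracts n-tuplets whose residues are separated by a fixed step and assigns the fallback probability facnull/20ⁿ to tuplets that were never observed. The function names, the reading of "steps" as a fixed residue spacing, and the use of plain relative frequencies are assumptions made for this example.

```python
from collections import Counter

def stepped_tuplets(seq, n, step=1):
    """Return all n-tuplets made of residues i, i+step, ..., i+(n-1)*step."""
    span = (n - 1) * step          # distance between first and last residue
    return [seq[i:i + span + 1:step] for i in range(len(seq) - span)]

def tuplet_probabilities(seqs, n, step=1):
    """Relative frequency of every n-tuplet seen in a list of sequences."""
    counts = Counter(t for s in seqs for t in stepped_tuplets(s, n, step))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def probability(table, tuplet, facnull=0.5):
    """Look up a tuplet; unseen tuplets get facnull / 20**n (cf. -nf)."""
    return table.get(tuplet, facnull / 20 ** len(tuplet))

# Toy input: a single made-up sequence.
table = tuplet_probabilities(["ACDACD"], n=3)
print(stepped_tuplets("ACDEFG", 3, step=2))
print(probability(table, "ACD"), probability(table, "WWW"))
```

With the default facnull of 0.5, an unseen triplet is treated as half as likely as it would be under a uniform distribution over the 20³ possible triplets.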

III. THE OUTPUT OF THE PROGRAM

The program prints summary information about the data used on the terminal: files read and written, and the number of sequences checked and processed. The output written to the file specified by the -ou directive depends on the operation selected. In each case it includes the various parameter choices (either set explicitly or used by default). In most cases it also prints the AA propensity distribution.

Most distributions are gathered in 100-element bins. The number in each bin is printed and a simple histogram is generated in the output file.
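
The binning described above can be pictured with a short Python sketch. It is only an illustration: the function names, the handling of out-of-range values (they are ignored here), and the layout of the printed histogram are assumptions and do not reproduce the program's actual output format.

```python
def bin_counts(values, start, increment, nbins=100):
    """Count how many values fall into each of nbins equal-width bins."""
    counts = [0] * nbins
    for v in values:
        k = int((v - start) // increment)   # index of the bin containing v
        if 0 <= k < nbins:                  # assumption: out-of-range values are skipped
            counts[k] += 1
    return counts

def text_histogram(counts, start, increment, width=40):
    """One line per bin: lower bin edge, count, and a bar of asterisks."""
    peak = max(counts) or 1                 # avoid division by zero for empty data
    lines = []
    for k, c in enumerate(counts):
        bar = "*" * round(width * c / peak)
        lines.append(f"{start + k * increment:8.3f} {c:8d} {bar}")
    return lines

# Small made-up data set with 4 bins so the printout stays short.
counts = bin_counts([0.1, 0.15, 0.5, 0.95], start=0.0, increment=0.25, nbins=4)
for line in text_histogram(counts, start=0.0, increment=0.25):
    print(line)
```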

IV. FILE FORMATS

The program can read the protein information in two different formats:

SSPR: Legacy PDB - the format used by the ss.txt file from the Protein Data Bank. For each protein chain, the one-letter AA codes specify the sequence and one-letter codes specify the secondary structure.
PIR: A file containing the sequence in PIR format (one title line starting with ">", followed by lines with the one-letter amino acid codes).
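
For readers preparing their own input, the following Python sketch reads records laid out as described for the PIR option above: a title line starting with ">" followed by sequence lines. It follows only that description; the function name is hypothetical, and the stripping of a trailing "*" (used as a terminator in some PIR files) is an assumption rather than documented behavior of the program.

```python
def read_pir(lines):
    """Parse title/sequence records; returns a list of (title, sequence)."""
    records = []
    title, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            if title is not None:             # close the previous record
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Assumption: a trailing '*' marks the end of a sequence.
            chunks.append(line.rstrip("*"))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

# Made-up example records.
example = [">P1 example", "ACDEF", "GHIKL", "", ">P2", "MNPQ*"]
print(read_pir(example))
```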

V. EXAMPLES

Filtering a sequence set for similarity:
fold -op FILT -da READ -in ss.txt -pm 50 -ou ss_nr50.out
The filtered set (maximum 50% similarity) will be in the file ss_50.flt
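
What such a filter does conceptually can be sketched in Python as follows. This is not the program's algorithm: it uses a global alignment with a single linear gap penalty (the program has separate -gp and -ep penalties), scores matches as 1 and mismatches as 0, normalizes by the length of the shorter sequence, and keeps sequences greedily in input order. All of these choices and all names are assumptions.

```python
def percent_identity(a, b, gap=-12.0, match=1.0, mismatch=0.0):
    """Percent identity from a global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back through the matrix, counting identical aligned pairs.
    i, j, same = n, m, 0
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                same += 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return 100.0 * same / min(n, m)

def filter_by_identity(seqs, limit=50.0):
    """Keep a sequence only if it is at most `limit` % identical to all kept ones."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, k) <= limit for k in kept):
            kept.append(s)
    return kept

# Made-up sequences: the second one duplicates the first and is dropped.
print(filter_by_identity(["ACDEFGHIKL", "ACDEFGHIKL", "WWWWWWWWWW"]))
```

Every candidate is compared against all sequences kept so far, so the cost grows quickly with the size of the set; this is presumably why the program offers progress reports, checkpoint files (-nr) and the FILI operation to continue an interrupted run.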
Secondary structure prediction with the GOR-1 method on the filtered sequence set:
fold -op GOR1 -da READ -in ss_50.flt -ml 20 -sf SSPR -ou gor1_nr50.out
Secondary structure prediction on a random sequence set:
fold -op GOR1 -da WRAND -ns 100000 -sl 200 -ou gor1_wrand.out
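
The difference between a uniform and a weighted random sequence set can be illustrated with the Python sketch below. It assumes that URAND draws each of the 20 amino acids with equal probability and that WRAND weights them by an amino acid composition; how the program obtains its weights is not shown here, so the composition() helper and all names are hypothetical. The seed 1357 mirrors the default random number seed listed among the directives.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequences(count, length, weights=None, seed=1357):
    """Generate `count` random sequences; weights=None means uniform sampling."""
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))
            for _ in range(count)]

def composition(seqs):
    """Amino acid counts of a sequence set, ordered as AMINO_ACIDS."""
    counts = Counter("".join(seqs))
    return [counts[a] for a in AMINO_ACIDS]

uniform = random_sequences(5, 200)                    # URAND-like
weights = composition(["AAAC"])                       # made-up reference set
weighted = random_sequences(3, 50, weights=weights)   # WRAND-like
print(uniform[0][:20], weighted[0][:20])
```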
Generating GNUplot input to plot distributions:
ls gor1*.out > gor1.list
fold -op PLOT -st 1 -ic 1 -lf gor1.list -ou gor1.plot
Generating triplet and quadruplet (relative) propensities:
fold -op TRIP -da READ -in ss_50.flt -sc LN -td trip_nr50.dat -ou trip_nr50.out
fold -op QUAD -da READ -in ss_50.flt -sc LN -qd quad_nr50.dat -ou quad_nr50.out
Calculating triplet and quadruplet scores on an input sequence set:
fold -op SC03 -da READ -in disprot.pir -sf PIR -sc LN -td trip_nr50.dat -ou trip_idp_score.out
fold -op SC04 -da READ -in disprot.pir -sf PIR -sc LN -qd quad_nr50.dat -ou quad_idp_score.out
Calculating triplet and quadruplet scores on a random sequence set:
fold -op SC03 -da URAND -sc LN -td trip_nr50.dat -ns 100000 -sl 200 -ou trip_urand_score.out
fold -op SC04 -da URAND -sc LN -qd quad_nr50.dat -ns 100000 -sl 200 -ou quad_urand_score.out
Generating GNUplot input to plot sequence score distributions:
ls trip*score.out > trip.list
fold -op PLOT -st -1.0 -ic 0.02 -lf trip.list -ou trip.plot
ls quad*score.out > quad.list
fold -op PLOT -st -1.0 -ic 0.02 -lf quad.list -ou quad.plot
Calculating adjacency propensities:
fold -op ADJS -in ss_nr50.txt -ni 10 -ou adjs.out
Calculating adjacency asymmetry propensities:
fold -op ADFB -in ss_nr50.txt -ni 10 -ou adfb.out
Calculating correlation between triplet scores and input melting temperatures:
fold -op CORT -in trip_ss_nr50_score.out -it Tm_data_full.txt -ou TM_corr.out
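
As a reminder of what a score/temperature correlation measures, here is a minimal Python sketch of the Pearson correlation coefficient. Whether CORT reports exactly this coefficient is an assumption, and the numbers in the example are arbitrary values chosen only to show the two limiting cases.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Arbitrary illustrative values, not measured data.
print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly correlated
print(pearson([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated
```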
Calculating overlaps between triplet or quadruplet score distributions:
fold -op OVRL -lf trip.list
fold -op OVRL -lf quad.list
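
One common way to quantify the overlap of two binned distributions is to normalize both and sum the bin-wise minima, which gives 0 for disjoint distributions and 1 for identical ones. The Python sketch below shows this measure; it is an assumption that OVRL uses this particular definition, and the function name is hypothetical.

```python
def overlap(p, q):
    """Overlap of two histograms defined over the same bins, after normalization."""
    sp, sq = sum(p), sum(q)
    return sum(min(a / sp, b / sq) for a, b in zip(p, q))

# Made-up bin counts.
print(overlap([1, 1, 0, 0], [0, 0, 1, 1]))   # disjoint distributions
print(overlap([2, 2], [1, 1]))               # same shape, different totals
print(overlap([1, 1, 0, 0], [0, 1, 1, 0]))   # partial overlap
```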
Calculating triplet scores on selected secondary structure elements only:
fold -op TRIP -in ss_nr50.txt -ss H -td trip_nr50_H.dat -ou trip_nr50_H.out

VI. INSTALLATION

The distribution includes an executable compiled with Intel Fortran under Linux. For other architectures it may be necessary to compile the program with a Fortran compiler, e.g.:
f77 -o fold fold.f
to obtain the executable fold.

Some compilers fail due to a so-called 'relocation error'. When using the Intel Fortran compiler (ifort), adding the compiler options
-mcmodel=medium -shared-intel
solved the problem. With some of the other compilers (but not the GNU compiler) the compiler option -fpic was found to solve the problem.

By default, the heptamer option is disabled as the program may need larger arrays than what the compiler is able to handle (e.g., the GNU compiler). To enable heptamer calculation, change the parameter IUSE7 from 1 to 20.

If the program is to be compiled with Fortran95, the Simulaid distribution includes a preprocessor, f77tof95.f and f77tof95_f95.f (the second is the Fortran95 version of the first), that changes the syntax to conform to Fortran95 requirements.

VII. CHANGING DIMENSIONS

The sizes of the arrays are established with parameter statements throughout the code. Several symbols are used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some others. Below is a list of these symbols (the program checks for violations).

MAXAA {10000}: maximum number of AAs per protein chain
MAXINC {20}: largest AA sequence distance to calculate adjacency statistics
MAXSEQ {2000000}: maximum number of protein chains to process
MAXAASEQ {200000000}: maximum number of residues (sum over all sequences) to process

IMPORTANT: Parameter statements for most symbols occur in several places in the program. When a change is required, it has to be carried out at ALL occurrences!
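
Because a missed occurrence is easy to overlook, a small helper script can list and update every occurrence of a symbol. The Python sketch below is one possible approach, not part of the distribution. It assumes that each assignment has the form SYMBOL=integer on a single line containing the word PARAMETER; statements continued over several lines would need extra handling, and the sample source text is invented for the demonstration.

```python
import re

def parameter_values(source, symbol):
    """Return (line number, value) for each 'symbol=value' on a PARAMETER line."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\s*=\s*(\d+)", re.IGNORECASE)
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if "PARAMETER" in line.upper():
            hits.extend((number, int(m.group(1))) for m in pattern.finditer(line))
    return hits

def set_parameter(source, symbol, value):
    """Return the source text with every occurrence of symbol set to value."""
    pattern = re.compile(rf"(\b{re.escape(symbol)}\s*=\s*)\d+", re.IGNORECASE)
    out = []
    for line in source.splitlines():
        if "PARAMETER" in line.upper():
            line = pattern.sub(lambda m: m.group(1) + str(value), line)
        out.append(line)
    return "\n".join(out)

# Invented fragment standing in for the Fortran source.
sample = """      PARAMETER (MAXAA=10000,MAXSEQ=2000000)
      PARAMETER (MAXAASEQ=200000000)
      SUBROUTINE X
      PARAMETER (MAXAA=10000)"""

print(parameter_values(sample, "MAXAA"))
updated = set_parameter(sample, "MAXAA", 20000)
print(parameter_values(updated, "MAXAA"))
```

Listing the occurrences before and after the change makes it easy to confirm that all of them now carry the same value, and that a symbol sharing a prefix (MAXAASEQ in the sample) was left untouched.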

