Description of the program CHAM: search for chameleon sequences in
the Protein Data Bank (PDB).
Mihaly Mezei
Department of Pharmacological Sciences,
Icahn School of Medicine at Mount Sinai,
New York, NY 10029
Mihaly.Mezei@mssm.edu
Aug. 03, 2018.
Reference: M. Mezei, Revisiting chameleon sequences in the Protein Data Bank,
Algorithms 11, (2018).
DOI:
I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II. RUNNING THE PROGRAM - INPUT
The program can be run either interactively or from the command line.
When the program is run interactively, the user will be prompted for the
following items:
- <lmin> and <lmax>: chameleon search will be performed for all lengths L with
<lmin> <= L <= <lmax>
- <root>: output files named <root>.res, <root>.dtl and
<root>.log will be written containing the results, details of
the calculation and error/debug messages, resp.
- <information source>: either the name of the file containing the
file names of all PDB or mmCIF files or the file with the sequences and
secondary structure annotations
- file format: one of <pdb|cif|ANN> specifying legacy PDB,
PDBx/mmCIF or FASTA sequence with DSSP annotation.
The default is ANN.
- When the ANN format is chosen the user has the option to specify a
<mapping file> that contains the
mapping of each amino acid to the classes chosen.
- <debug level>: 0, 1 or 2 (default: 0). When it is 1 or 2,
debug information will be written to the file <root>.log
Running from the command line involves issuing the following command:
cham -tc argument -tc argument ...
where tc is one of
mn, mx, ro, db, lf, ff, mp, wa.
The respective arguments and their default values are listed below.
- -mn minimum chameleon length [5]
- -mx maximum chameleon length [24]
- -ro file name root [chameleon]
- -db debug level [0]
- -lf list of files/sequences [ss.txt]
- -ff file format (pdb|cif|ass) [ass]
- -mp residue mapping file
- -wa list of sequences and annotations written (mmCIF input only)
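As an illustration, the command line can also be assembled in a script. The sketch below is in Python and is not part of the distribution: build_cham_command is a hypothetical helper, it uses only the keys and defaults listed above, and it assumes that the executable is called cham and that -mp and -wa take a file name argument like the other keys.

```python
import shlex

# Defaults as given in the option list above.
DEFAULTS = {
    "mn": "5",          # minimum chameleon length
    "mx": "24",         # maximum chameleon length
    "ro": "chameleon",  # output file name root
    "db": "0",          # debug level
    "lf": "ss.txt",     # list of files/sequences
    "ff": "ass",        # file format
}
# Keys without a documented default (assumed to take a file name).
OPTIONAL = {"mp", "wa"}


def build_cham_command(**overrides):
    """Return the argument list for a cham run (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL
    if unknown:
        raise ValueError(f"unknown key(s): {sorted(unknown)}")
    options = {**DEFAULTS, **{k: str(v) for k, v in overrides.items()}}
    command = ["cham"]
    for key, value in options.items():
        command += [f"-{key}", value]
    return command


if __name__ == "__main__":
    # Chameleon lengths 6 to 10; output goes to run1.res, run1.dtl, run1.log.
    print(shlex.join(build_cham_command(mn=6, mx=10, ro="run1")))
```

The returned list can be passed to subprocess.run unchanged; spelling out the defaults makes the run reproducible even if the defaults of a later version differ.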
III. THE OUTPUT ON THE TERMINAL
Besides writing the results to the <root>.res file, the program
also prints to the terminal information on the progress of the database scan
and a summary of the results
(e.g., number of sequences checked, number of chameleons found).
The results printed on the <root>.res file
include the following items:
- Overall statistics about the occurrence of each amino acid in the database,
in helices and in sheets
- The distribution of helix and sheet lengths
- For each chameleon length searched for:
  - The number of chameleons found
  - The sequence and a helix-sheet pair where this chameleon occurs
  - A list of all occurrences of this chameleon (excluding multiple copies
    in the same protein)
  - Chameleon propensity statistics for this chameleon length and
    the same statistics cumulatively
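To make the per-length items concrete, the following Python sketch shows one simple way to find chameleons of a given length L. It is an illustration only and not the algorithm implemented in cham: it assumes that each protein is given as a sequence string plus a DSSP-style annotation string of equal length, that only H counts as helix and only E as sheet, and that every residue of an L-residue window must carry the same annotation. The function name and the input layout are hypothetical.

```python
from collections import defaultdict


def find_chameleons(proteins, length):
    """Map each chameleon sequence of the given length to its occurrences.

    proteins: dict of protein id -> (sequence, annotation), two strings of
    equal length; 'H' marks helix and 'E' marks sheet (an assumption of
    this sketch).
    Returns a dict: sequence -> (ids with a helix copy, ids with a sheet copy).
    """
    helix = defaultdict(set)  # subsequence -> proteins where it is all helix
    sheet = defaultdict(set)  # subsequence -> proteins where it is all sheet
    for pid, (seq, ann) in proteins.items():
        if len(seq) != len(ann):
            raise ValueError(f"{pid}: sequence and annotation lengths differ")
        for start in range(len(seq) - length + 1):
            window = ann[start:start + length]
            if window == "H" * length:
                helix[seq[start:start + length]].add(pid)
            elif window == "E" * length:
                sheet[seq[start:start + length]].add(pid)
    # A chameleon is seen at least once as helix and at least once as sheet.
    # Storing protein ids in sets counts multiple copies in the same
    # protein only once.
    return {
        s: (sorted(helix[s]), sorted(sheet[s]))
        for s in helix.keys() & sheet.keys()
    }


if __name__ == "__main__":
    toy = {
        "P1": ("AAVKLIVGG", "CCHHHHHCC"),
        "P2": ("GGVKLIVAA", "CCEEEEECC"),
    }
    for L in range(4, 7):
        print(L, find_chameleons(toy, L))
```

Running the search once per length between the minimum and the maximum reproduces the structure of the report: a count per length, one helix-sheet pair per chameleon, and the list of proteins in which it occurs.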
IV. FILE FORMATS
The program can read the protein information in three different formats:
legacy PDB, PDBx/mmCIF, and FASTA sequences with DSSP annotation.
V. INSTALLATION
The distribution includes an executable
(compiled with Intel Fortran under Linux); for other architectures it may be
necessary to compile the program.
The program has to be compiled with a Fortran compiler, e.g.:
f77 -o cham cham.f
to obtain the executable cham
Some compilers fail with a so-called 'relocation error' when optimization
at a level higher than one is requested.
When using the Intel Fortran compiler (ifort), adding the compiler options
-mcmodel=medium -shared-intel
solved the problem. With some of the other compilers (but not the GNU compiler)
the compiler option
-fpic was found to solve the problem.
If the program is to be compiled with Fortran95, the Simulaid distribution
contains a preprocessor, f77tof95.f and f77tof95_f95.f (the
second one is the Fortran95 version of the first one), that changes
the syntax to conform to Fortran95 requirements.
VI. CHANGING DIMENSIONS
The sizes of the arrays are established with parameter statements
throughout the code. Several symbols are used for this purpose.
There are certain relations between these symbols,
so changing one of them is likely to require changes in some others.
Below is a list of these symbols (the program checks for violations).
- MAXSEQ {20000000}: maximum number of L-residue sequences
- MAXRES {100000}: maximum number of residues in a protein
- MAXCHAM {1000000}: maximum total number of chameleons found
(over all lengths L)
- MAXLEN {24}: maximum length of a chameleon to search for
NOTE: when changing MAXLEN, all character*25 occurrences
have to be changed by replacing 25 with MAXLEN+1
- MAXHS {125}: maximum helix or sheet length
(for length statistics calculation)
IMPORTANT: Parameter statements for most symbols occur in several places
in the program. When a change is required, it has to be carried out
at ALL occurrences!
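Because the character*25 declarations have to follow MAXLEN everywhere, the replacement is easy to miss by hand. The Python sketch below shows one way to apply it to the whole source text in a single pass. It is a hypothetical helper, not part of the distribution, and it assumes that the declarations are written literally as character*25 (in either case, possibly with blanks around the asterisk). The parameter statements that set MAXLEN itself still have to be edited separately.

```python
import re


def update_char_width(source, old_maxlen, new_maxlen):
    """Replace character*(old_maxlen+1) by character*(new_maxlen+1).

    Returns a tuple: (new source text, number of replacements made).
    """
    pattern = re.compile(
        rf"(character\s*\*\s*){old_maxlen + 1}\b", flags=re.IGNORECASE
    )
    # \g<1> keeps the matched 'character*' prefix exactly as written.
    return pattern.subn(rf"\g<1>{new_maxlen + 1}", source)


if __name__ == "__main__":
    # Raise MAXLEN from 24 to 32 in cham.f and write the result to a new file.
    with open("cham.f") as f:
        text = f.read()
    new_text, count = update_char_width(text, 24, 32)
    with open("cham_maxlen32.f", "w") as f:
        f.write(new_text)
    print(f"{count} declarations changed")
```

The word boundary after the number keeps longer widths such as character*250 untouched, and the replacement count gives a quick check that every occurrence was found.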