Description of the program CHAM: search for chameleon sequences in the Protein Data Bank (PDB).

Mihaly Mezei

Department of Pharmacological Sciences,
Icahn School of Medicine at Mount Sinai,
New York, NY 100102

Aug. 03, 2018.

Reference: M. Mezei, Revisiting chameleon sequences in the Protein Data Bank, Algorithms 11 (2018). DOI:

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

II. RUNNING THE PROGRAM - INPUT

III. THE OUTPUT ON THE TERMINAL

IV. FILE FORMATS

V. INSTALLATION

VI. CHANGING DIMENSIONS (ARRAY SIZES)

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

II. RUNNING THE PROGRAM - INPUT

The program can be run either interactively or from the command line.

When the program is run interactively, the user will be prompted for the following items:

<lmin> and <lmax>: the chameleon search will be performed for all lengths L with
<lmin> <= L <= <lmax>
<root>: output files named <root>.res, <root>.dtl and <root>.log will be written, containing the results, the details of the calculation and the error/debug messages, respectively.
<information source>: either the name of the file containing the file names of all PDB or mmCIF files, or the name of the file with the sequences and secondary structure annotations.
<file format>: one of <pdb|cif|ANN>, specifying legacy PDB, PDBx/mmCIF or FASTA sequence with DSSP annotation. The default is ANN.
When the ANN format is chosen, the user has the option to specify a <mapping file> that contains the mapping of each amino acid to the classes chosen.
<debug level>: 0, 1 or 2 (default: 0). When it is 1 or 2, debug information will be written to the file <root>.log.

Running from the command line involves issuing the following command:

cham -tc argument -tc argument ...
where tc is one of mn, mx, ro, db, lf, ff, mp, wa. The respective arguments and their default values are listed below.

-mn minimum chameleon length [5]
-mx maximum chameleon length [24]
-ro file name root [chameleon]
-db debug level [0]
-lf list of files/sequences [ss.txt]
-ff file format (pdb|cif|ass) [ass]
-mp residue mapping file
-wa list of sequences and annotations written (mmCIF input only)
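
As an illustration, the command line can also be assembled by a script. The following is a minimal Python sketch, not part of the CHAM distribution: the helper name build_cham_command, the input file name pdbfiles.lst and the assumption that the executable is called cham are all hypothetical. Only the keys with a documented argument are handled; -mp and -wa are left out.

```python
# Hypothetical helper (not part of CHAM): assemble a cham command line
# from the two-character keys documented above.
KEYS_WITH_ARGUMENT = ("mn", "mx", "ro", "db", "lf", "ff")

def build_cham_command(options, executable="cham"):
    """Return the argument list for running cham with the given options.

    options maps a two-character key (without the dash) to its argument.
    Keys that are left out fall back to the program's own defaults.
    """
    command = [executable]
    for key, value in options.items():
        if key not in KEYS_WITH_ARGUMENT:
            raise ValueError(f"unsupported key: -{key}")
        command += [f"-{key}", str(value)]
    return command

# Search for chameleons of length 6 to 10 in the legacy PDB files listed
# in pdbfiles.lst (a made-up name); output goes to run1.res, run1.dtl
# and run1.log.
example = build_cham_command(
    {"mn": 6, "mx": 10, "ro": "run1", "lf": "pdbfiles.lst", "ff": "pdb"}
)
print(" ".join(example))
```

The resulting list can be passed to subprocess.run if the run is to be started from Python.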

III. THE OUTPUT ON THE TERMINAL

Besides the results written to the <root>.res file, the program prints on the terminal information on the progress of the database scan and a summary of the results (e.g., number of sequences checked, number of chameleons found). The results written to the <root>.res file include the following items:

Overall statistics about the occurrence of each amino acid in the database, in helices and in sheets
The distribution of helix and sheet lengths
For each chameleon length searched for
- The number of chameleons found
- The sequence and a helix-sheet pair where this chameleon occurs
- A list of all occurrences of this chameleon (excluding multiple copies in the same protein)
- Chameleon propensity statistics for this chameleon length and the same statistics cumulatively
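
To make the listed items concrete, here is a minimal Python sketch of the kind of search the output describes: for each length L it collects the L-residue segments that lie entirely in a helix and those that lie entirely in a sheet, and reports the segments found in both sets. It is an illustration under stated assumptions, not the algorithm implemented in cham: the function name find_chameleons, the use of the letters H and E to mark helix and sheet residues, and the toy data are all assumptions.

```python
from collections import defaultdict

def find_chameleons(chains, lmin, lmax, helix="H", sheet="E"):
    """chains maps a chain label to a (sequence, annotation) pair.

    Returns {L: {segment: (helix_chains, sheet_chains)}} for every
    segment of length lmin <= L <= lmax that is fully helical in at
    least one place and fully in a sheet in at least one place.
    """
    result = {}
    for length in range(lmin, lmax + 1):
        in_helix = defaultdict(set)
        in_sheet = defaultdict(set)
        for label, (seq, ann) in chains.items():
            for start in range(len(seq) - length + 1):
                segment = seq[start:start + length]
                states = set(ann[start:start + length])
                if states == {helix}:
                    in_helix[segment].add(label)
                elif states == {sheet}:
                    in_sheet[segment].add(label)
        # Storing labels in sets counts each chain only once per segment.
        common = in_helix.keys() & in_sheet.keys()
        result[length] = {
            seg: (sorted(in_helix[seg]), sorted(in_sheet[seg]))
            for seg in sorted(common)
        }
    return result

# Toy data, invented for illustration only.
toy = {
    "1AAA:A": ("GSVKALVEG", "  HHHHHH "),
    "2BBB:A": ("TVKALVT", " EEEEE "),
}
hits = find_chameleons(toy, 5, 6)
print(hits)
```

In the toy data the segment VKALV is helical in one chain and part of a strand in the other, so it is the only hit at L = 5; no hit exists at L = 6.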

IV. FILE FORMATS

The program can read the protein information in three different formats:

pdb: Legacy PDB - the sequence and secondary structure information are read from the SEQRES and the HELIX and SHEET records, respectively.
cif: PDBx/mmCIF - the sequence and secondary structure information are read from structure files in the new PDB format, using the pdbx_seq_one_letter_code and pdbx_strand_id records, respectively.
ANN: A file containing the sequence and secondary structure annotation for all structures. Each chain has a separate entry in the following format:
```
>PDBID:chainid:sequence
lines with the 1-character residue label
>PDBID:chainid:secstr
same number of lines as above with the secondary structure annotation
```
The secondary structure can be annotated with the following characters:
- H: alpha helix
- G: 3-helix (3/10 helix)
- I: 5-helix (pi helix)
- S: bend
- B: residue in isolated beta-bridge
- T: hydrogen bonded turn
- E: extended strand, participates in beta ladder
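
The entry layout described above can be read with a few lines of Python. The sketch below is only an illustration, not code from the CHAM distribution: the function name read_ann and the sample entry are invented, and it assumes that a residue with none of the listed annotations appears as a blank in the secstr lines, which the description above does not state.

```python
def read_ann(lines):
    """Parse ANN-style lines into {(pdb_id, chain_id): {"sequence", "secstr"}}."""
    entries = {}
    key = kind = None
    for raw in lines:
        # Remove only the line ending: blanks inside a secstr line are
        # kept because they are assumed to stand for unannotated residues.
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            pdb_id, chain_id, kind = line[1:].split(":")[:3]
            if kind not in ("sequence", "secstr"):
                raise ValueError(f"unexpected header: {line}")
            key = (pdb_id, chain_id)
            entries.setdefault(key, {"sequence": "", "secstr": ""})
        elif key is not None:
            # Continuation lines are concatenated in order.
            entries[key][kind] += line
    return entries

# Invented sample entry.
sample = [
    ">1ABC:A:sequence",
    "MKVLAAGIV",
    ">1ABC:A:secstr",
    " HHHH EEE",
]
parsed = read_ann(sample)
print(parsed[("1ABC", "A")])
```

A useful sanity check after parsing is that the sequence and secstr strings of each chain have the same length.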
When the -mp option is used, the program also asks for a file name specifying the mapping of the 20 amino acids to different amino acid classes. The file has to list the 20 one-character amino acid symbols in one line and the symbols they are mapped to in the next line.
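
The effect of such a two-line mapping can be sketched in Python as follows. This is an illustration only: the function name read_mapping and the class symbols (h, p, +, -) are invented, and the sketch assumes that each class is denoted by a single character and that any blanks between symbols can be ignored.

```python
def read_mapping(lines):
    """Build a str.translate table from the two lines of a mapping file."""
    # Drop any whitespace between the one-character symbols (assumption).
    symbols, classes = ("".join(line.split()) for line in lines[:2])
    if len(symbols) != len(classes):
        raise ValueError("the two lines must list the same number of symbols")
    if len(set(symbols)) != 20:
        raise ValueError("expected the 20 distinct amino acid symbols")
    return str.maketrans(symbols, classes)

# Invented example classes: h = hydrophobic, p = polar, + and - = charged.
mapping_lines = [
    "ACDEFGHIKLMNPQRSTVWY",
    "hp--hp+h+hhppp+pphhh",
]
table = read_mapping(mapping_lines)
mapped = "MKVLAAGIV".translate(table)
print(mapped)
```

With a mapping of this kind, two segments count as the same sequence when their residues fall into the same classes position by position.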
When the -wa option is used, the program also asks for a file name to write the sequence and annotation extracted from the mmCIF files read; the format is the same as that of the file read with -ff ANN.

V. INSTALLATION

While the distribution includes an executable (compiled with Intel Fortran under Linux), on other architectures it may be necessary to compile the program. It has to be compiled with a Fortran compiler, e.g.,
f77 -o cham cham.f
to obtain the executable cham.

Some compilers fail with a so-called 'relocation error' when optimization at a level higher than one is requested. With the Intel Fortran compiler (ifort), adding the compiler options
-mcmodel=medium -shared-intel
solved the problem. With some of the other compilers (but not the GNU compiler), the option -fpic was found to solve the problem.

If the program is to be compiled as Fortran95, the Simulaid distribution contains a preprocessor, f77tof95.f and f77tof95_f95.f (the second is the Fortran95 version of the first), that changes the syntax to conform to Fortran95 requirements.

VI. CHANGING DIMENSIONS (ARRAY SIZES)

The sizes of the arrays are established with parameter statements throughout the code. Several symbols are used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some of the others. Below is a list of these symbols (the program checks for violations).

MAXSEQ {20000000}: maximum number of L-residue sequences
MAXRES {100000}: maximum number of residues in a protein
MAXCHAM {1000000}: maximum number of the total number (of all lengths L) of chameleons found
MAXLEN {24}: maximum length of a chameleon to search for
NOTE: when changing MAXLEN, all character*25 occurrences have to be changed by replacing 25 with MAXLEN+1
MAXHS {125}: maximum helix or sheet length (for length statistics calculation)

IMPORTANT: Parameter statements for most symbols occur in several places in the program. When a change is required, it has to be carried out at ALL occurrences!
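
Since every occurrence has to be changed, it can help to list them before editing. The Python sketch below is a hypothetical aid, not part of the distribution: the function names are invented, and the sample lines are made-up Fortran used only to exercise the search, not an excerpt from cham.f. It assumes that each parameter statement fits on one line.

```python
import re

def parameter_lines(source_lines, symbol):
    """Line numbers of parameter statements that mention symbol.

    Fortran is case-insensitive, so the match ignores case. Comment
    lines starting with c or * in column 1 are not matched.
    """
    pattern = re.compile(
        rf"^\s*parameter\b.*\b{re.escape(symbol)}\b", re.IGNORECASE
    )
    return [n for n, text in enumerate(source_lines, start=1)
            if pattern.search(text)]

def character_lines(source_lines, length):
    """Line numbers containing character*<length> (e.g. MAXLEN+1 = 25)."""
    pattern = re.compile(rf"\bcharacter\s*\*\s*{length}\b", re.IGNORECASE)
    return [n for n, text in enumerate(source_lines, start=1)
            if pattern.search(text)]

# Made-up source lines for illustration.
sample_source = [
    "      parameter (MAXLEN=24,MAXHS=125)",
    "      character*25 seq",
    "      parameter (maxlen=24)",
    "      dimension lhist(MAXHS)",
    "c     MAXLEN is the longest chameleon searched for",
]
maxlen_hits = parameter_lines(sample_source, "MAXLEN")
char_hits = character_lines(sample_source, 25)
print(maxlen_hits, char_hits)
```

For a real run the lines would be read from the source file, e.g. with open("cham.f").readlines().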