Description of the program SIMULAID: a collection of utilities
designed
to help setting up and analyze molecular simulations.
Mihaly Mezei
Department of Pharmacological Sciences,
Icahn Sinai School of Medicine at Mount Sinai,
New York, NY 10029
Mihaly.Mezei@mssm.edu
Apr. 17, 2023.
Reference: M. Mezei, Simulaid: a simulation facilitator and analysis program,
J. Comp. Chem., 31, 2658-2668 (2010).
DOI:10.1002/jcc.21551
0. GENERAL INTRODUCTION TO SIMULATION SETUPS
In general, a simulation requires three different types of
information:
- Description of the interaction energy terms
- Description of the molecules in the system
- Description of the boundary conditions.
For analysis, the
- Trajectory (history) file
may also be needed.
Simulaid will help in various aspects of establishing these. In order to
help clarify the various functionalities of Simulaid, a brief
description is given of the various components of these
descriptions. Frequent reference will be made to the PDB (Protein
Data Bank) format that is considered the 'lingua franca' of
molecular simulations (although it has a number of 'dialects')
and to formats used by the simulation packages Amber, Charmm,
Gromos/Gromacs, Discover, Macromodel, and MMC.
0.1. Description of the interaction energy terms
Most simulation packages rely on the concept of
atomtype for
the specification of the interaction potential, supplemented by a
partial charge. It is generally assumed that force
constants, dispersion and exhange repulsion terms are
transferable to a larger extent than the partial charges.
The atomtypes can be defined by alphanumeric labels (Amber,
Charmm,
Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS)
or both
(MMC). In most cases, a parameter file contains the data
necessary
0. GENERAL INTRODUCTION TO SIMULATION SETUPS
In general, a simulation requires three different types of
information:
- Description of the interaction energy terms
- Description of the molecules in the system
- Description of the boundary conditions.
For analysis, the
- Trajectory (history) file
may also be needed.
Simulaid will help in various aspects of establishing these. In order to
help clarify the various functionalities of Simulaid, a brief
description is given of the various components of these
descriptions. Frequent reference will be made to the PDB (Protein
Data Bank) format that is considered the 'lingua franca' of
molecular simulations (although it has a number of 'dialects')
and to formats used by the simulation packages Amber, Charmm,
Gromos/Gromacs, Discover, Macromodel, and MMC.
0.1. Description of the interaction energy terms
Most simulation packages rely on the concept of
atomtype for
the specification of the interaction potential, supplemented by a
partial charge. It is generally assumed that force
constants, dispersion and exhange repulsion terms are
transferable to a larger extent than the partial charges.
The atomtypes can be defined by alphanumeric labels (Amber,
Charmm,
Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS)
or both
(MMC). In most cases, a parameter file contains the data
necessary
to extract the various coefficients to be used in the energy
expressions
(Amber, Charmm, Gromos/Gromacs, Discover) but they may be built
into the
program source code (Macromodel) or both (MMC).
0.2. Description of the molecules in the system
Most simulation packages use a hierarchical approach to define
the assembly of atoms to be simulated.
- The largest unit is called a segment in Charmm, a
chain in
PDB. MMC has two segments: the solvent and the
solute. Segments or chains are likely to be different
(macro)molecules, although they don't have to be.
- Each segment is generally broken into residues. A
residue is usually a repeating unit that may occur several times
at different places (i.e., an amino or nucleic acid).
- In some instances (Charmm, Discover) residues are further
partioned into groups whose total charge is likely to be
an integer (in most cases zero).
- In any event, each residue involves a collection of
atoms.
Simulaid can read molecular structures in the following formats:
- Charmm .CRD
- Charmm .CRD, extended format
- Brookhaven PDB
- Charmm PDB
- Autodock3 .pdbqs
- Autodock4 .pdbqt
- Macromodel
- MMC .slt
- Gromos/Gromacs
- Tripos .mol2
- Schrodinger Maestro .mae
- PDBx/mmCIF .cif
- InsightII .car
- NXYZ (atomic number and coordinates)
- SXYZ (chemical symbol and coordinates)
- SXYZRQ (chemical symbol, coordinates, atomic radius and charge)
- Grasp .crg
Depending on the input format used, Simulaid can also write in
most of the formats listed above.
It can not write a .mae or a .cif file and it can only write a .mol2 file
if the input was also in .mol2 format.
Since residues are considered frequently occuring molecular
units, several packages (Amber, Charmm, Discover) use a so-called
residue topology file (RTF) for their definition. For amino and
nucleic acids the PDB nomenclature is followed, but only
approximately. This standard includes definitions for amino acid
and nucleic acid residues. These definitions consist of a
three-letter residue name and atomnames that can be up to four
characters long.
Simulaid can read topology files in the following formats:
To arrive at the fully defined system, most packages rely on
an annotated coordinate file containing the Cartesian coordinates
of the atoms in the system as well as on the RTF and parameter
files. For Amber, the coordinate file is in the PDB format while
for Charmm it is in the so-called CRD format. Once the list of
residues and atoms are known to Amber or Charmm, they will turn
to their RTF file where the combination of residue names and
atomnames will provide for each atom their atomtypes and partial
charges. The .car file of Discover, the .dat file of Macromodel,
the .mol2 file of Tripos,
as well as the .slt file of MMC, however will contain not only
the residue names and atomnames, but the atom types and partial
charges as well.
The record formats of the formatted input records can be
listed by Simulaid.
The RTF file of Amber and Charmm also specifies the bonding
pattern. Macromodel and Tripos, on the other hand, require the bond list in
the coordinate file while MMC will deduce the bonding pattern from the
coordinates themselves.
Segments are usually defined by the input stream.
0.3. Description of the boundary conditions.
The simulated system may be a droplet in vacuum or a unit cell
of a periodic system. Each program has to be explicitly
instructed in the input stream as to the size and shape of the periodic cell - the options vary
from package to package. Charmm uses an additional file of the
type .img that specifies the centers of all surrounding cells,
but most packages use built-in routines for the different cell
types allowed.
0.4. Trajectories.
Simulaid can read trajectories in the following formats:
- Charmm DCD
- Amber
- MMC (both binary and ASCII formats)
- Macromodel
- Stack of PDB files
- Stack of Charmm CRD files
If a trajectory is in multiple files then the
file names should be derived from the name of the first segment.
Assuming that the trajectory starts with file x.dcd,
subsequent segments should be called x_2.dcd, x_3.dcd, etc.
NOTE on binary files: it is important to compile Simulad with the
same (type) of compilers as the program that generated the trajectory
was compiled with. Binary files generated with Fortran-77 and Fortran-90
are different; similarly, they can be different for 32-bit and 64-bit systems.
0.5. Clustering options.
Simulaid has implemented several clustering algorithms.
- Single link clustering: For a given threshold distance for any pair
of clusters the shortest distance between representatives of each cluster
is larger than the threshold. For any member of a cluster (of more than one
members) there is at least one other cluster member within the threshold
distance.
- K-means/K-medoids clustering: with the number of clusters set by the user,
the program iteratively assigns the objects to the cluster whose center is
the nearest to it.
The center of the cluster can be established in two ways:
- K-means: Calculate the center-of-mass of each cluster and set the center there.
This is meaningful when the coordinates of the nodes are also known
and the distances represent the cartesian distance between these nodes.
- K-medoids: Pick the member of each cluster as its center whose distance from the
member farthes from is is the minimum. This option is useful when the
distances are not representing cartesian distances.
- A clustering based on maximum number of neighbors: Pick the (an) object
that has the most number of neighbors (i.e., within threshold distance);
and call a cluster and remove it and it neighbors from the list.
Adjust the neighbor counts and repeat until the list is exhausted.
- Density-based clusterings (REF1).
These are a variant of single link clustering
where the bond definition is tied to the neighborhood:
- the number of common neighbors is not less than an input minimum (CNN)
- as above and also the nodes have to be within a maximum distance
- both nodes have a minimum number of neighbors and are
within a maximum distance (DBSCAN)
- Clustering into cliques, i.e., clusters where every pair of members
in a cluster is within a threshold distance.
For a given threshold this clustering is not unique since the clustering
is necessarily a partitioning but there can be cliques that overlap.
Single-link and maximum neighbor based clustering can be run either
by specifying a distance cutoff, resulting in an unspecified number of
clusters or by specifying the number of clusters
in which case the cutoff is udjusted until this number of cluster results.
Furthermore, single-link clustering can be used to do a hierarchical
clustering using several different distance cutoffs. In this case clusters
obtained with a larger cutoff are further clustered with the next smaller
cutoff.
I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
I.0. Online help:
Any prompt containing a '?' within the bracketed field
(e.g., 'Do you want ... [y/n/?] ?' or 'Number of ... [100,?]=')
will respond with an explanation if the user types just a '?'.
Some of the prompts asking for a selection from a list also
has a help option that can be activated by typing a '?'.
I.1. Operations performed by several functions:
Online tips:
Any prompt containing a '+' within the bracketed field
(e.g., 'Do you want ... [y/n/+] ?')
will provide a tip if the user types just a '+'.
In each case, the prompt is repeated after the explanation has been printed.
For translation/rotation and trajectory conversions Simulaid
offers the option to reset the system into the periodic cell.
The use of periodic boundary conditions may be restricted to
two or one dimensions; i.e., the resetting is applied only in a selected
coordinae plane or along a selected axis.
This option also selects periodic images of each solute molecule
in such a way that the solute appears as compact as possible.
Since Simulaid considers each solute segment a single molecule,
this may result in moving groups of molecules (e.g., lipids)
always together.
Thus, for functions that may involve resetting the system into
a PBC cell, Simulaid offers the users the option to define a number
of residues as molecular residues, i.e., groups of atoms that
are to be reset into the PBC cell together but
independent of the rest of the solute.
Furthermore, residues can also be defined as ions - these
are also treated individually but they are not moved around only to
keep them in the simulation cell, but not in an
attempt to make the solute more compact,
I.2. Specific operations performed by Simulaid:
- Geometry optimizations
- Find the sphere of smallest radius still enclosing a
set of atoms and translate the atoms to the center of this
smallest enclosing sphere (REF2). This
allows the solvation of a solute with the smallest number of
solvents providing a solvation layer of given minimum thickness.
It is possible to input a solvated system where the solvents are
excluded from the sphere calculations but will be subsequently
translated along with the atoms included (i.e., the solute);
solvents outside a layer of user-specified thickness will be
eliminated.
- Find the orientation of the solute molecule so that
the volume of the cube or rectangle enclosing it is the smallest
(REF3). Smallest enclosing cube is
optimal for a Delphi run, smallest enclosing rectangle is optimal
for grand-canonical runs where a bulk solvent layer is required.
The user is asked if cube or rectangle is required. The
optimization is performed starting from the input orientation as
well as from a user defined number of random orientations to
provide several local minima of the optimization problem from
which the best will be chosen.
Option is provided for limiting the optimization to rotations around
one of the coordinate axes.
- Find the orientation of the solute molecule that
maximizes the smallest image-image distance (REF1). The optimization is performed
starting from the input orientation as well as from a user
defined number of random orientations to provide several local
minima of the optimization problem. The user has a choice of
optimizing the orientation in a periodic
simulation cell defined on input or the construction of the
smallest possible periodic simulation cell
with a given minimum image-image distance.
Again, option is provided for limiting the optimization
to rotations around one of the coordinate axes.
For the minimization of the cellsize the initial cell
dimensions are obtained from the maximum extension of the
molecule in the X, Y, and Z direction for cubic, rectangular and
hexagonal prism cells, from the smallest enclosing sphere for a
face-centered cubic, hexagonal close packed, or truncated
octahedron cell and from input for a cell defined by a Charmm
image file. Once the best orientation is found, the cell's
dimensions are reduced until the smallest image-image distance
reaches a user-defined minimum. Comparing the runs with different
starting orientations, the program choses the one with the
smallest volume.
- Perform cleanup operations.
Cleanup operations include:
- ensuring that atom and residue numbers are consecutive
(starting from user-defined atom and residue numbers);
- ensuring that all atoms of a residue are in a contiguous
stretch;
- ensuring that all PDB files have a chain (segment) ID. When
none present in the input file, the first chain id will be 'A'.
Every time a 'TER' record is encountered in the input file the
segment id will be changed to the next upper-case letter,
followed by the next lower case letters and finally, the digits
'0' to '9'.
- ensuring that Macromodel files have both 1-letter and
3-letter residue names.
- For PDB output files, it gives an option to write
CONECT records as well
- for MMC's .slt and Autodock's .pdbqs or .pdbqt
files it provides an option
to adjust charges to result in integral residue charges.
- Perform a variety of data
conversions.
- Conversions changing file formats
- Rearrange atoms within a residue according to a
template. Atomnames within a residue can be sorted and selected
to conform to a prescribed template. This may be useful, for
example, when a trajectory file is created from configurations
that don't conform to the RTF file used by the program reading
the trajectory, or when the configuration contains extra atoms
(such as lone pairs). The atoms and their desired order is
specified by a template file and/or by
the RTF file (in Amber, Charmm or Gromacs formats only)
describing the residues in the desired order. Currently, template
files are provided for Amber, Amber94, and Charmm22. If the
residue is not found in the template or the number of atoms in
the template is less than the number in the structure or if the
correspondence between the two atomsets is not one-to-one then
that residue will be left unchanged. The names of the template
files (found in the installation directory)
are listed by the program. Again, they can be modified by the
user and placed in their own directory to allow the treatment of
non-standard, patched or other terminal residues.
- Convert to a different file syntax. Conversions to
most of the recognized input formats can be requested. The
following limitations apply to these conversions:
- If the input format was PDB, the charges will be left as
zero;
- Macromodel output may select generic atomtypes for the
potential function specification when it fails to recognize the
bonding pattern around an atom
- For Macromodel output, the program may not be able to
determine the proper bond orders. The bond orders will be
especially problematic if the input file does not contain
hydrogens
- InsightII .car files will have atom symbols for potential
types
- Can not convert to Grasp .crg format - use the Create a Grasp .crg file option for that
- Can not convert to Tripos .mol2 format
- Create an input configuration for the MMC Monte Carlo
program. The MMC Monte Carlo program's input requires
coordinates, charges and potential type information for each
atom. There are three options to establish atomtypes and partial
charges:
- For Macromodel input and Insight .car file input (assuming
that the latter has been prepared with AMBER potential types) the
program will use the Amber codes;
- For Charmm or Amber input the atom codes can be established
from the residue and atomnames from the Charmm PSF file or the
Amber top file corresponding to the input structure.
- If no PSF or top file is available,
a conversion files
can be used: one of amb_prot_ua.dat, amb_all.dat,
amb_all94.dat, and cha_all22.dat are provided
in the distribution or a file prepared by the user.
Terminal residues may have to be corrected manually,
though. The Charmm conversion file also has the protonated ASP
and GLU residue information, using residue names ASX and GLX,
respectively.
- For Charmm, Amber and Gromacs the conversion information can
be also extracted from multiple RTF file(s) - the user will be prompted
for the filenames.
- Convert trajectories.
This is currently limited to Charmm, Amber, MMC and Macromodel formats.
During conversions, the user has the option to
- rearrange the atoms within residues.
- place only selected configurations on the output file.
- center the solute molecule(s)
in a periodic cell.
- rotate the whole system for each frame around a selected axis
or by an inputted rotation matrix.
- Split an MMC trajectory generated in the grand-canonical ensemble
into Amber trajectories with fixed number of solvents.
- Create an input for Amsol 3.5. Amsol requires the
specification of atomtypes according to its own convention
besides the atomic coordinates and atomic numbers. This
functionality determines these codes based on the molecule's
connectivity table.
- Create an input for Gaussian's ONIOM option.
Besides the X, Y, Z coordinates of each atom,
this option requires the atomic number, the Amber atom type,
partial charge of the atoms as well as the treatment category (H, M, or L).
This option works analogously to the input for creating input for MMC:
it can use the built-in list for deducing types and charges from atom names
and/or topology file(s) in Amber, Charmm or Gromos/Gromacs format.
The H, M, and L regions can be selected by user provided lists or by
spheres of different radii drawn around a user-selected atom.
For the latter, option is provided to extend the list specified by the
inner spheres to avoid breaking non C-C bonds.
Optionally, it also produces a connectivity input, with bond orders
deduced from the coordinates, just like for conversion to Macromodel
format.
- List the positions of the various items in a coordinate
file record for formatted input types.
- Conversions changing atom and or residue names
- 'Regularize' PDB atomnames. Brookhaven specification
uses the first two characters of the name as the right-adjusted
chemical symbol and put trailing numeric digits to the place of
the first space for hydrogens. This functionality enforces this
convention.
- Undo PDB regularization.This functionality moves all
leading numeric characters found in atomnames to the end of the
name.
- Left-adjust atom names.
- Convert residue and atomnames from different
conventions. Unfortunately, PDB atomnames (and even file
syntax) follow many different conventions and are slightly
different in the various software systems in use. This
functionality modifies atom and residue names
based on the source of the names and the software system in use.
Conversion of the atom and residue names are governed by a file called pdb_nam.dat (found in the
installation directory). Residues or
atomnames not in the file will be left unchanged. At present, the
file contains data to interconvert protein and nucleic acid
residues for the Amber, Amber94, Charmm, Moil, IUPAC, Macromodel,
ChemX and PDB conventions. Conversions from Macromodel would
allow for dropping lone-pair 'atoms' and the change of names of
the form C0* to their corresponding PDB-type names only. As a
result, conversions to Macromodel convention have been disabled.
When converting to PDB convention, the one selected by InsightII
will be used. When converting from PDB convention, a list of
known variants will be examined (most of these variants were
compiled by Dr. D. Strahs). If there is a copy of the conversion file
pdb_nam.dat in the local directory (the directory Simulaid
is run from), it will be used instead, so users can add
additional conversions to the existing residues or add new
residues. A local copy of pdb_nam.dat can also be used for
making selected changes by eliminating or changing conversion
information from the 'master' file.
- Trajectory - structure file conversions
- Unpack a collection of
configurations into single files. This functionality takes a
large file containing a number of configurations and separates
them into single configurations.
There are several options of selecting the files to be unpacked.
An input file XXX.YYY will
result in files XXX.1.YYY, XXX.2.YYY, etc.
- Unpack a trajectory into
single configurations, either in a single file or in separate
files. The selection options and the
naming convention of the new files are the same as for
unpacking configurations (see above).
The format of the unpacked file can be specified by the user.
For PDB output, option is given mark eachs structure as MODEL.
- Convert (pack) a series of configurations into a trajectory
file. This option will create a Charmm, Amber or MMC binary
trajectory file and put the input configuration as well as all
subsequent configurations into this file as consecutive
snapshots.
When packing into MMC binary format, Simulaid will search for
energy information if the input structures are in PDB or Charmm CRD format
s follows:
- In Charmm .CRD files, Checking the title records for
the string '* energy=' or '* Energy=', or '* E=', followed by the energy
- In PDB files, Checking the REMARK records for the string
'REMARK energy=' or 'REMARK Energy=', or 'REMARK E=', followed by
the energy
- Edit the conformation
(translate/rotate, add/delete atoms, etc)
- Edit the configuration. Editing
allows to
- select (keep only one) chain/segment
- excise (drop only one) chain/segment
- keep only the backbone
- keep only the alpha carbons
- drop all aliphatic (bonded to carbon) hydrogens
(and add its charge to the carbon's)
- select atom range to keep
- select atom range to drop
- select residue range to keep
- select residue range to drop
- select residue name to drop
-
Translate and/or rotate the inputted structure
- The following translations can be performed
- Translate by a user-specified vector.
- Translate by a user-specified vector and
reset the molecules into a
user-specified periodic cell.
- Center the solute molecule(s) in a user-specified periodic cell.
For more than one solute molecules, the images resulting in the most
compact solute will be selected.
Translation is useful when switching
between sytems with the origin at the corner (e.g., Amber) and at
the center (e.g., Charmm and MMC) or when a simulation
lets molecules move away from the periodic cell.
- Rotation can be performed
around a user-specified axis by a user-specied angle
or by a user-specified rotation matrix
- Scale the coordinates (e.g., change units)
- Separate molecules (to obtain a clearer picture)
- Swap the chirality of a selected atom
- Permute the coordinates of the configuration read.
- Modify the configuration. Modifying
allows to
- replace an atom
- add an atom
When replacing a terminal atom (i.e., on with only one bonded neighbor)
the bond length can be changed. When adding a new atom X, the program
will let the user establish a bonded sequence of heavy atoms,
R3-R2-R1-X and the user will have to specify the R1-X distance,
R2-R1-X angle and the R3-R2-R1-X torsion angle.
- Replace the coordinates of a configuration.
This option allows the replacement of just the coordinates in a structure
without changing the rest of the file.
The replacement can be
- simply a copy of an other structure
- the coordinates of the original structure translated and rotated to
obtain the best overlay to an inputted reference structure
The source of the coordinates or the reference structure can
be in any format except .mol2 for the system to be changed.
- Overlay a structure on a reference structure
- Transform a peptide to retro-inverso. This option
essentially swaps the petide >C=O with the paptide >NH.
It is recommended to do a short minimization on the transformed structure,
especially when prolenes are involved.
- Replace the charges in a configuration.
For now,
both the input and the replacing structure must be in Autodock's .pdbqt format.
- Extract the CA carbons of the protein backbone into a PDB file
- Change the (solvent) water molecules to their gas
phase geometry. Minimizations or molecular dynamics may have been
run without restraining the solvents to their experimental
geometry. This option 'fixes' these solvents.
- Eliminate solvents outside a simulation cell.
Simulation setup may involve the 'soaking' of a solute in some
solvent bath. This option allows the carving out of a periodic
cell of any of the above mentioned types.
- Reverse the optimization.
This option transforms back the structure in the original position/orientation.
The translation/rotation information is read from the comment lines
in the input structure file as saved there by Simulaid.
- Create a full unit cell from the asymmetric unit.
This option extracts from the PDB file the transformation information that
defines the symmetry-related images of the asymmetric unit and performs
these transformations.
- Create biological oligomers from the asymmetric unit.
This option extracts from the PDB file the transformation information that
defines the symmetry-related images forming the biological oligomers
and performs these transformations.
- Gather all crystal contacts of the asymmetric unit.
This option gathers all copies of the asymmetric unit that are in contact
with the first copy.
Each copy may be related to the first asymmetric unit by
symmetry transformation and/or unit cell translation.
- Create the coordinates of all atoms in the neighboring
PBC cells
- Miscellaneous structure-derived files, analyses
(sequence, Charmm RTF, Amber energy decomposition, UHBD dat, etc)
- Extract the sequence list
from the inputted structure. For structures carrying residue
information this function will extract the list of the residues
in a user selectable format:
- Charmm segment definition format
- PDB (SEQRES) format
- PIR format (first line: title; subsequent lines:
one letter residue codes without any separator).
- GCG - the proprietary format of the Wisconsin GCG program
- Convert sequences
- Create a Grasp .crg file from RTF file(s)
- Create a UHBD .dat file from RTF and parameter file(s)
(for now, only in Charmm format)
- Print the centers of neigboring image cells and cell vertices.
The cell centers will written as a pdb file called imagecell.pdb
and the vertices in an other PDB file called pbcvertices.pdb,
complete with CONECT statements to draw the cell polygon's edges.
- Write a PDB file for a user-defined rectangle
- Generate torsion angle input for Monte Carlo
calculations:
- Torsion angle list in Macromodel (Bmin) format (general or
protein (i.e., with rigid peptide bond))
- Torsion angle list in
MMC format
for proteins
(backbone with rigid peptide bond and sidechains or sidechains only)
or for a general molecule.
- Create Charmm topology file residue records for
residues specified by the user.
When the input is in InsightII .car format, the atom types are
also converted, using the algorithm (and routines) of the program
Intocham.
- Extract and tabulate results for selected properties from the large
output tables created by the script mm_pbsa.pl
that is part of the Amber suite of programs analyzing Amber trajectories.
Simulaid will recognize if the input tables contain residue-residue
contributions or residue-molecule contributions.
The table colums are the selected energy contribution terms, except for
the case when one property is selected to tabulate from from residue-residue
tables - in that case the table printed will have both rows and colums
as residues, the range of which is specified by the user.
In addition, correlations between selected properties can also be calculated.
- Perform a number of conformation analyses on the
solute.
The analyses can be done on the configuration read or on a trajectory
specified by the user.
When analyzing a trajectory, options are provided for
analyzing only selected configurations.
The time axis of plots of the evolution of various properties may be
labeled by the frame number, or in units of time (ps, ns, or ms).
The following analyses are implemented:
- General clustering
- Cluster atoms. This option takes the input structure, and
clusters
the atoms based on their distances.
This function can be used for a data set representing a density distribution
- the clustering will delineate the regions of different density maxima.
- Cluster based on an input matrix
This option reads in a matrix representing distances and performs clustering
on it. The input is assumed to be in the following form:
Size of the matrix (N)
N times:
Row number, Row name
N numbers, representing one row of the matrix (separated by spaces)
This matrix can be transformed before clustering using several formulae:
- R(i,j) = 1 - R(i,j)
- R(i,j) = R(i,j)^N
- R(i,j) = R(i,j)^(1/N)
- R(i,j) = (1- R(i,j))^N
- R(i,j) = (1- R(i,j))^(1/N)
For K-means and K-medoid clustering it can produce an inertia plot that
helps deciding on the number of clusters to request.
- Start logging the keyboard input.
When logging is turned on, the program writes all keyboard input on a file
that can be later used as input to repeat the run:
simulaid < <logfile>
Allow Simulaid to make the input
predictable, otherwise you have to remove or
rename all files crated before 'replaying' the log file.
- Make the interactive quiz predictable.
In certain cases, Simulaid has extra questions
or omits some questions, depending on the data. This can confuse runs
that use a log file as input. To avoid this problem, when this option
is set, the extra qestions will assume default answers and the questions that
may be eliminated in some cases will not be.
II.RUNNING THE PROGRAM - INPUT
The program is run interactively: the user is first asked to select the
operation to be performed then (if needed) the
- name of the input STRUCTURE file.
Input syntax can be
- (.CRD) Charmm CRD, either in the original (80 characters/atoms,
4-character names) to the extended format introduced with Charmm 32 (132
characters/atoms, 8-character names, 10-character integers)
- (.pdb) Original Protein Data Bank format - there are
several variants:
- Brookhaven format
- Charmm PDB (Charmm PDB differs from Brookhaven PDB mainly
in the placement of the segment ID)
- Autodock PDBQS - an extension of the PDB format containig
partial charge and solvation parameters (used by Autoock 3)
- Autodock PDBQT - an extension of the PDB format containig additional
partial charge (used by Autoock 4)
- Macromodel
- Macromodel/Xcluster (for more than one conformations, a
Macromodel format conformation is followed by only x,y,z
information as provided by Xcluster)
- (.slt) MMC solute description file format
- (.gro) Gromos/Gromacs coordinate file format
- (.mol2) Tripos Mol2 coordinate file format (only for input)
- (.car) Insight car format
- Free format as defined by the Insight format filenxyz.frm (number of atoms in the first line,
followed by atomic number and x,y,z coordinates in each of the
successive lines)
- Free format as defined by the Insight format filesxyz.frm (like nxyz.frm, except that chemical
symbols are expected instead of atomic numbers)
- Free format as defined by the Insight format filesxyzqr.frm (like sxyz, but appended by charge
and residue number).
- (.crg) Grasp charge file
- If the extension of the input file is neither of the extensions
specified above the program asks the
format of the input data.
- type of calculation to be run,
- name of the output file.
Whenever relevant, it asks the
- simulation cell shape. The
following (periodic) boundary conditions are recognized:
- cubic
- rectangular
- hexagonal prism (both regular or irregular hexagon),
with user-specified coordinate axis along the prism axis
and with user-specified coordiante axis going through the hexagon vertex
(e.g., X axis along the prism axis and vertices
of the hexagon are on the Z axis - as used by MMC)
- face-centered cubic
- truncated octahedron, X axis normal to a square face - as used by Charmm
- truncated octahedron, X axis normal to a hexgon face - as used by Amber
- hexagonal close-packed
- image-file based
- sphere (used only for trimming solvents)
- no boundary conditions
- residue name of the solvent (default: HOH for PDB,
WTR for Insight and TIP3 otherwise)
- number of solvent atoms for solvents other than TIP3,
WTR or HOH.
- For runs optimizing position or orientation it offers the option of excluding
hydrogens.
- Sphere optimization asks the width
of the extra solvent layer - this width will be used for
volume calculations and (when relevant) excluding excess
solvents.
- Orienting run asks the minimum
image-image distance to be achieved. The final cell size will
be determined with this value in sight.
- Trajectory conversions will use the
input configuration to define the order of the atoms in the input
trajectory and may be given a second configuration file to
establish the target trajectory's atom order.
Creation of a Grasp .crg file involves specifying the RTF
file name(s). Note that the Grasp format limits atom names to
four characters and residue names to three characters. If either
of these limits is exceded, the program will truncate the names
and print a warning. Also, the program requires a valid 'input'
filename, but it will be ignored.
For most runs with conversions or with
cleanup the program asks the number of
additional structures to process in the same way. This allows
the processing of a batch of files with one run. For Charmm
format output, the segment id of the additional structures will
be appended by 1,2, etc., as long as there is 'room' for this
additional information in the four-digit sequence id (i.e., the
original segment id contained enough blanks).
Additional input
structures may be either in the same file or in separate files -
you will be asked which. Each file may contain several structures
- you will be asked how many. If they are in separate files, and
the first file was called ether A.B or A.1.B then
the remaining files have to be called A.2.B, A.3.B,
etc. Similar naming convention applies to the output files.
The output file(s) will contain the modified molecule. Except
for runs requesting a transformation of syntax, it will use the
same syntax type as the input file. The header comments for
Charmm and PDB formats will be extended with a description of the
transformation(s) done by the program and/or the results of the
optimization.
Running Simulaid on a batch of files
Simulaid has been succesfully run from a c-shell script on a batch of files.
The script requires a copy of all the answers to the interactive quiz
durig the Simulaid run. This can be obtained by running Simulaid
and turn on on input logging.
It is advisable to set the input
predictable in this case.
This list has to be put into the
script file between the simulaid << fin command and the
terminator string fin. In the example below all .pdb
files in the directory are converted from PDB atom/residue name
convention to Charmm's convention.
Also, the generation of the list of files to be processed has
to be adapted to each case.
#! /bin/csh -f
#
# script to convert from PDB name convention to Charmm convention
#
set files = `ls *.pdb`
# The variable $file lists all the .pdb files in this directory
foreach file ($files)
set file1 = `echo $file | sed s/\.pdb/_ch\.pdb/`
# Generated a new file name from the original .pdb file name to use as output
echo Reading $file and writing file $file1
simulaid << fin >> junk
p
n
f
$file
b
$file1
y
9
3
fin
end
#rm junk
exit
III. THE OUTPUT ON THE TERMINAL
The terminal output will include the title cards and the
number of atoms found in the input file (the number of hydrogens
also if they are not to be considered) for all types of runs.
- For enclosing sphere calculation:
-
The initial radius found based on the center determined by
shifting along the X, Y, and Z axes by
(X(min)+X(max))/2, (Y(min)+Y(max))/2,
(Z(min)+Z(max))/2, respectively, where
X(min), X(max) are the smallest and largest X
coordinates, etc.;
- The optimized radius;
- The volume of the initial and optimized sphere and the
number of waters required to fill it in.
- The volume of the initial and optimized spheres increased by
the water layer width and the number of waters required to fill
it in.
- The vector shifting the molecule to the center of the
smallest sphere.
- If solvents were present in the input file, the number of
them remaining that is within the layer of specified width around
the optimized sphere.
- For minimizing the bounding box:
- The initial bounding box sixe
- Messages describing the progress of optimization.
- The smallest bounding boxes found in each minimization try
- The optimal bounding box sixe
- The vector shifting the molecule's COM to (0,0,0) and
the rotation matrix to the optimal orientation
- For orientation optimization:
- The shortest distance between atoms on different images in
the original and in the new orientation. The optimization will be
performed starting from the input orientation as well as from a
user-defined number of different random orientations.
- Messages describing the progress of optimization.
- The cell parameter(s), like the edge lengths, corresponding
to the cell shape used in the calculation.
- The cell volume and the number of waters required to fill it
in (without the solute present).
- The vector shifting the molecule's COM to (0,0,0) and
the rotation matrix to the optimal orientation
- For solvent geometry fixing calculations:
- The number of solvents found.
- For elimination of waters outside a simulation cell:
- The number of waters kept and the number of non-water atoms
found.
- The cleanup operations, performed
automatically at the end of the run or at special request. Note,
that automatically invoked cleanup operations start the residue
and atomnumbers from one.
- The number of atoms found detached from their residue
- A message if resequencing was or was not needed
- Conversion output
- For each type of conversion a variety of warnings or error
messages may be obtained indicating that the input file does not
conform to some assumptions or the templates used are not
prepared for the particular input. These messages should be
heeded.
- For residue-name bearing input a sequence list can be written
to a file.
- The analysis option provides a menu from which a choice
should be made and ask if trajectory scan is requested.
After the analysis requested is performed
the program returns to this menu allowing the user to do an other one
or to stop.
If additional configurations are read from the input file,
the last analysis will be repeated.
- Each type of analysis prints the name of the output file
(generated from the input file name by adding a special suffix)
where the requested information is written.
- Various statistics generated over the lists are also printed
on the terminal.
Furthermore, several internal consistency checks are
implemented to increase the chance of detecting any error that
might have remained in the program. Failing such a test results
in a message preceded by PROGRAM ERROR. This always indicates a
bug in the program and should be brought to the attention of the
author, together with the input file that was processed when the
error occurred.
IV. FILE FORMATS
- The name conversion file
pdb_nam.dat is of the following syntax:
- Part 1: nconv,ncol (free format)
- nconv: Number of name conventions included.
- ncol: number of data columns in the file.
- Part 2: 8-character names of the conventions (in one line):
"Amber ", "Amber94 ", "Charmm ",
"Moil ", "IUPAC ",
"Macromod","ChemX ", " PDB "
PDB has to be the last convention.
- Part 3: Data for residue name conversions. Each line has
ncol entries.
- The first entry is a generic residue name.
- Entries two to nconv+1 are the actual residue
names corresponding to possible variants in the various
conventions.
- Entries from nconv+2 to ncol are possible
additional PDB variants. A blank entry means that conversions to
that convention of the corresponding atoms will retain the
original name. An entry of "****" means that conversions to that
convention will result in dropping those atoms.
- This section is concluded with a line for residue name
"DONE".
-
Part 4: Data for atom name conversions. Each line has ncol
entries. The first entry is a generic residue name as defined in
Part 3. Entries two to nconv+1 are the atom names
corresponding to the variants in the various conventions.
Entries from nconv+2 to ncol are possible
additional PDB variants. Entries for PDB varians have to conform
to the PDB convention requiring that atom names for elements with
single-character chemical id be in the second position -
leaving the first one either blank or filled with a number (e.g.,
"_OD1" or "2HA1").
There are a number of entries with blank generic residue names
- they represent names that convert to each other for any
residue. This section is also concluded with a line for residue
name "DONE".
- The template file for the orders of
the atoms within each residue has 8-character entries in the form
"RESN ","ATNM ". The current
versions have been directly generated from the RTF files of Amber
and Charmm.
- The file containing the
information needed to convert to MMC Monte Carlo input has
8-character entries "RESN ",
"ATNM ", "PFCD ",
"Charge","GROUPa b" where
- RESN is the residue name
- ATNM is the atom name
- PFCD is the left-adjusted atom-type name
- Charge is the partial charge given with decimal point
- GROUPa b gives information about the carous charge groups
within a residue. For each group, a=1 for the first atom of the
group and a=0 for the rest; b is the sequence number of the group
within the residue (i.e., b=1,...,b=2,..., etc).
Again, they have been extracted from the RTF files of Amber
and Charmm, respectively. Note that the atom order template file
has the same syntax as the MMC conversion file for the residue
and atomnames thus an MMC conversion file can be also used for
atom order template file.
V. INSTALLATION
The distribution contains the source code and some data files.
Installation of the program requires the following steps:
- Putting these data files into a directory of your choice,
referred to as the installation directory.
- Changing in the block data
- the string MYPATH to the path of the installation directory
- the fourth number in the data ldatapaths statement (00)
to the number of characters in /MYPATH/
- Removing the C@**
(i.e., deleting the first four characters in the line)
comments from lines corresponding to your environment
in the subroutine datprt (near the end of the source code) for
the printing of the date on output files as follows:
- "C@UG" --> "" - SGI IRIX Fortran compiler
- "C@G7" --> "" - g77 Fortran compiler
- "C@AB" --> "" - Absoft Fortran compiler
- "C@EF" --> "" - Intel Fortran compiler
- "C@HP" --> "" - Hewlett-Packard Fortran compiler
- "C@AX" --> "" - IBM AIX Fortran compiler
- The subroutine datprt checks if the program is executed on
certain hostnames, assumed to be the headnodes of a cluster.
The hostnames (e.g., minerva2 in the distribution copy) should be
replaced with the head node hostname where Simulaid is installed.
This will enable Simulaid to warn the user that
longer runs should be sent to a compute node.
- Compile the program with Fortran, e.g.,:
f77 -o simulaid.bin simulaid.f
to obtain the
executable simulaid.bin
Some compilers fail due to a so-called 'relocation error' when optimizing
at higher levels or specifying larger dimensions (see below) than the default.
When using the Intel Fortran compiler (ifort), adding the compiler directives
-mcmodel=medium -share_intel
solved the problem. With some of the other compiler (but not the GNU compiler)
the compilation key
-fpic was found to solve the problem.
- If the program is to be compiled with Fortran95 then there
is a preprocessor in the distribution f77tof95.f and f77tof95_f95.f (the
second one is the Fortran95 version of the first one) that changes
the syntax to conform to Fortran95 requirements.
The distribution directory also contains the files
nxyz.frm, sxyz.frm and sxyzrq.frm that give
the decription of the data format for some of the InsightII
free-formats. The Charmm image files for the non-rectangular
boundary conditions, hexy.img, hexz.img,
fcc.img, to.img, and hcp.img are also given.
VI. CHANGING DIMENSIONS
The sizes of the arrays are established with parameter statements
throughout the code. Several symbols user used for this purpose.
There are certain relations between these symbols,
so changing one of them is likely to require changes in some others.
Below is a list of these symbols and their relations that have to be
maintained (the program checks for violations).
IMPORTANT: Parameter statements for most symbols occur several places
in the program. When a change is required, it has to be carried out
at ALL occurences!
- MAXPHI: maximum number of
Delphi gridpoints
Also, MAXPHI3 is the memory used for 2D-data - see below
the parameter definitions depending on MAXPHI
- MAXREC: Maximum number of records to read
> maximum number of atoms
must be divisible by 20
- MAXRSD: Maximum number of residues
- MAXNEIG: Maximum number of neighbors of an atom in the neighbor list
(7+2*MAXNEIG)*MAXREC < MAXPHI3
- MAXBONDS: Maximum number of bonds (hydrogen, hydrophobic or salt bridge)
- MAX2D: maximum number of frames for the
2-D RMSD map
MAX2D < MAXBONDS
4*MAXBONDS+MAX2D+2*MAX2D2 < MAXPHI3
- MAXCV: maximum number of residues for the
CV-based domain map
8*MAXCV+5*MAXCV2 < MAXPHI3
- MAXRCORR: maximum number of residues for the
residue correlation map
2*MAXRCORR*2+16*MAXRCORR < MAXPHI3
- MAXCONN: maximum number of residues for the
residue distance map
3*MAXCONN2-11*MAXCONN < MAXPHI3
The current version uses MAXREC=200000, MAXRSD=70000, MAXNEIG=25,
MAXPHI=400, MAXBONDS=10000, MAX2D=5000, MAXCV=2000, MAXRCORR=5000 .
Valid values for a larger version (ca. 1.1 Gb) are
MAXPHI=640, MAX2D=10000, MAXCV=800, MAXRCORR=10000.
In additions to the symbols above, there are other size parameters
whose value can be chosen independently (the current value is in braces):
- MAXFRAMES {50000}: the number of trajectory frames to store for analysis
- MAXCOPY {600}: the number of different data (e.g., torsion angles) per frame
to store for analysis
- MAXHX {50}: the maximum number of residues per protein helix
- MAXNHX {16}: the maximum number of helices for a single TRAJHELIX run
- MAXBRDIGEATOM {8000}: Maximum number of bridge anchor atoms:
- MAXBRDIGETYPE {500}: Maximum number of bridge destination atoms:
- MAXBRIDGELEN {4}: Maximum number of H bonds in a bridge:
- MAXDDISTR {200}: Maximum number of atom pairs to calculate the distance distribution
- MAXDDBIN {20}: The number of distance bins in the distance distributions
- MAXCDLIST {2000}: The total number of
atom pairs that can be involved
in distance distribution calculation
VII. IRIS GL GRAPHICS VERSION
Earlier versions of Simulaid included optional code for graphics output.
Since support for Iris GL stoppped in the early 90s, the graphics code
has been removed from Simulaid.
The latest version with this code is saved as simulaid_IGL.f and simulaid_IGL.html.
VIII. REFERENCES
1. O. Lemke and B.G. Keller,
Density-based cluster algorithms for the identification of core sets,
J. Chem. Phys.,145, 164104 (2016).
DOI:10.1063/1.4965440.
2. M. Mezei, Optimal Position of the Solute
for Simulations,
J. Comp. Chem. 18,812-815 (1997).
DOI:10.1002/(SICI)1096-987X(19970430)18:6<812::AID-JCC6>3.0.CO;2-V.
3. M. Mezei, Finding of the smallest
enclosing cube to improve molecular modeling,
Information Newsletter for Computer Simulation of Condensed
Phases, CCP5, Daresbury Lab., No 47, (2000).
4. J. Cornette, K. B. Cease, H. Margalit, J. L. Spouge,
J. A. Berzofsky and C. DeLisi,
Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins,
J. Mol. Biol., 195, 659-685 (1987).
DOI:10.1016/0022-2836(87)90189-6.
5. M. Mezei, A new method for mapping macromolecular topography,
J. Molec. Modeling and Simul.,
21, 463-472 (2003).
DOI:10.1016/S1093-3263(02)00203-6.
6. Adam D. Schuyler, Heather A. Carlson, and Eva L. Feldman,
Computational Methods for Predicting Sites of Functionally Important Dynamics
J. Phys. Chem. B,113, 6613-6622 (2009)
10.1021/jp808736c
7. I. Visiers, B.B. Braunheim, and H. Weinstein,
Prokink: a protocol for numerical evaluation of helix
distortions by proline.
Prot. Engrg., 13, 603-606 (2000).
DOI:10.1093/protein/13.9.603
8. M. Mezei and M. Filizola,
TRAJELIX: A computational tool for the geometric characterization of
protein helices during molecular dynamics simulations.
J. Computer-Aided Molecular Design, 20, 97-107 (2006).
DOI:10.1007/s10822-006-9039-1.
9. D. Cremer and J.A. Pople,
General definition of ring puckering coordinates,
J. Am. Chem. Soc., 97, 1354 (1975).
DOI:10.1021/ja00839a011.
10. C. Altona and M. Sundaralingam,
Conformational analysis of the sugar ring in nucleosides and nucleotides.
New description using the concept of pseudorotation
J. Am. Chem. Soc., 94, 8205 (1972).
DOI:10.1021/ja00778a043.
11. W. Kabsch and C. Sander,
Dictionary of protein secondary structure:
Pattern recognition of hydrogen-bonded and geometrical features.
Biopolymers, 22, 2577-2637 (1983).
DOI:10.1002/bip.360221211.