Description of the program Dockres:
Summary of the results of docking a library to a target by
Autodock-4, Autodock-Vina, eHiTS, PLANTS, GOLD, DOCK or Glide
Mihaly Mezei
Department of Pharmacological Sciences
Icahn School of Medicine at Mount Sinai
New York, NY 10029
Mihaly.Mezei@mssm.edu
Aug. 14, 2020
The program Dockres scans the result of
Autodock
(Version 4) or
Autodock-Vina or
eHiTS or
PLANTS or
GOLD or
DOCK or
Glide
docking runs with a series of ligands.
It gathers the top binders and displays a variety
of statistics, both on the ligand set and on the top binding poses.
The ligands selected can be filtered by a number of criteria.
File convention
Dockres uses a one letter code for the screening software used:
- A: Autodock-4
- V: Autodock-Vina
- E: eHiTS
- P: PLANTS
- G: GOLD
- D: DOCK
- L: Glide
This code is included in the names of the various directories and files
created by Dockres.
Input of the program
Besides the structure file for the target macromolecule
(of the form macro.pdb*, or (for GOLD) macro.mol2)
Dockres assumes the availability of the following files (the notation
macro stands for the name of the macromolecule file's name without
the .pdbqs .pdbqt, .pdb or .mol2 extension):
Except for DOCK and Glide, a file called macro_<sw>.dir
listing the docking result files (for PLANTS the directories the docking result of
each ligand are)
where <sw> is a one letter code for the screening software used.
It can be created with the script getdir.csh
(or by the the user with a text editor) prior to running Dockres.
Format of the file macro_<sw>.dir:
- First record (optional): the name of the
Autdock grid-parameter file or the eHiTS clip file (including the path
relative to the current directory). Omit for Autodock-Vina and GOLD.
- Second to last records:
the sequence number of the ligand, followed by
the name of a ligand docking result file
(.dlg for Autodock-4, .pdbqt for Autodock-Vina, .sdf for eHiTS,
ligand name as the directory name for PLANTS or
.mol2 files for GOLD, again including the path)
- one record for each ligand docked
For example, docking with Autodock-4 ligands
ligx.mol2, ligy.mol2 and ligz.mol2 to macomolecule
mm.pdbqt will result in files
ligx.mm.dlg, ligy.mm.dlg, and ligz.mm.dlg.
Thus the user has to prepare a file called mm.dir,
with the following content
mm.gpf
1 ligx.mm.dlg
2 ligy.mm.dlg
3 ligz.mm.dlg
Note that for GOLD, Dockres assumes that each pose is
in a sperate file in the result directory and they are of the form
gold_soln_<structure>#l_#p.mol2 where #l is the
ligand number and #p is the pose number.
In addition, Dockres needs
- For Autodock-4
- The grid-parameter file (the one with the .gpf extension,
used as the input to the Autogrid run)
- Optionally a file with the flexible part of the
target with .pdbqt extension (default name: macro_flex.pdbqt).
- For Autodock-Vina
- optionally a file with the flexible part of the
target with .pdbqt extension (default name: macro_flex.pdbqt).
- For eHiTS runs
- the clip file (in .pdb format)
Dockres can be run both interactively from a terminal or in batch mode,
specifying the run parameters as command-line options.
The terminal inputs can also be logged to a file that can be used to rerun
a job, possibly after editing it.
When compiled with the
parallel code included it has to be run
in batch mode (with concomittant restrictions, vide infra).
In interactive mode it starts with asking (possibly a subset of the)
for the following information:
- The docking software used
- The existence (and the name, if applicable) of a file converting the
ligand ID numbers to a different number (e.g., from eMolecules parent ID to
eMolecules version ID).
- The macromolecule file name (without the .pdb* or .mol2 extension),
macro.
For Autodock 4 it will ask for the existence of flexible part.
If the answer is positive, it will ask for the flexible .pdbqt file 's name.
- If the checkpoint file macro_<sw>.ckp is present
then the program
will ask if it is to be used or deleted.
If use is requested, it will read its contents and
skip to the
result summary segment.
- The user is given the option to change the thresholds for
hydrogen bonds, salt bridges and hydrophobic bonds.
- For pdb file input, the program asks for the availability of
partial charges.
- When charges are available, the program gives the option to still use
atom-type based bond definitions (in that case the program will not
check for salt bridges)
- The number of top poses per ligand to
consider candidates for the top scoring list (to 'extract')
- The use or ignoring of the clustering of poses done by
Autodock-4
- The amount of information per ligand to be printed on the output file
macro_<sw>.res
- Optionally, the atom number of the center of binding site
- When a binding site is specified, the user has the option to
select a
ligand atomtype
to use for the distance calculation
(otherwise the ligand atom nearest
to the binding site atom will be used)
- The size of the region around each ligand pose where target atoms
will be searched for contact. This region is defined by finding the smallest
rectangle around the ligand (aligned with the coordinate axes) and
extending it in every direction ('padding')
by a user-defined number (default: 7 A).
- Optionally, a target ligand file to which the poses can be compared.
Once this information is given, the docking result files are read and the data
is extracted from each.
Besides the coordinates of the pose, the program extracts the docking score.
For Autodock-4 and eHiTS the user has to select between the wo scores
calculated by the docking program.
This extraction may take some time - for larger libraries the
program periodically will print a report of the progress.
Once the data is gathered, a checkpoint file is written and the result
summary starts.
The result summary starts with printing on the terminal
the list of the top-scoring poses,
the number of poses in the top-score ranges,
and a plot showing the distribution
of the location of poses over the macromolecule's residues.
The program then gives the user the option to
- For Autodock, rescore the free energies based on the multiplicity
of each pose
- Limit the number of poses listed for the same ligand
- Calculate the RMSD between different top-scoring poses of the same ligand
- Extract docked poses. The default is to extract a PDB file
with the macromolecule and the selected top-scoring poses, but the
user can specify the list of poses to include as well.
For rigid macromolecule a single file can be generated
with the different poses added to the macromolecule
as additional residues (residue name LIG, chain id L).
Extracted poses can be saved separately as single files or a
NMR-style file containg the extracted ligand poses as MODELs
(note tha VMD refuses to read such a file if the ligands ar not
all identical but Pymol cab read it fine).
For flexible macromolecule, each pose will result in a complete
file with the macromolecule and the ligand whose name will be
a comination of the name of the macromolecule,
the ligand and the pose number.
- Generate the same distribution restricted to a set of ligands
specified by the user
- Repeating the calculation of the various statistics
At this point, additional filtering criteria can be added:
- New value of the number of poses/ligand to extract
- List and/or range of residue numbers of the macromolecule
atom nearest to the docked pose
- Minimum and maximum charge of the ligand
- Minimum number of hydrogen bonds, salt bridges and hydrophobic bonds
- If a target ligand file was specified at the outset of the run,
the minimum number of contacts between the ligand and the target and
the maximum average contact distance between the ligand and the target.
If no (more) repetition of calculations are requested, the program proceeds
to the last stage where it offers the user the following options:
- Extract the ligand files of top scoring ligands into a separate directory
- Generate a list of top-scoring ligands
- Calculate the Tanimoto coefficent matrix for top-scoring ligands
based on the similarity of the target atoms or residues in contact
with the ligand.
The matrix will be written on a file called
macro_<sw>.tan the format Simulaid can read it for
clustering.
- Calculate the number of hydrogen bonds, salt bridges and hydrophobic
bonds the top ligands form with each solute atom or residue
- Generate grid maps for the distribution of ligand atom types in
contact with the target
- Extract the SMILES strings corresponding to the top-scoring ligands
from a file of the SMILES strings in the library used
For each of the options above the user can specify the number of top-scoring
ligands to use and decide if different isomers/tautomers of the same ligand
should be included.
In batch mode the following information can be specified:
A possible batch run call can be
> dockres -mm hemoglobin -sw eHiTS -np 20 -ol 2 -ib 99
The first two items (-mm and -sw) are compulsory;
preferably, they should be the first two items, allowing all warning and
error messages to be printed on the log file
macro_<sw>.res.
For the rest of the input that can be specified in interactive mode
the default values are used.
Batch run with flexible macromolecule has not yet been implemented.
Some inputs currently can only be provided in the interactive mode
(e.g., hydrogen-bond thresholds, filtering options).
To use a non-default option for which no command-line input is implemented,
an interactive run is required that can be started from the checkpoint file.
It will not be CPU intensive since the time-consuming data gathering
has been completed already.
Output of the program
Dockres will create the following files:
A log file called
macro_<sw>.res where all result will be printed.
If it is already present, it will write instead to
macro_<sw>_N.res where N is the smallest integer such that
no file with that number exists.
A file called macro_<sw>.ckp containig all the information gathered
allowing the repeated extraction of data with different filtering
criteria without having to perform the time-consuming scan of the
.dlg files
Additional output files can be created optionally -
see above the batch directives -nx, -py -tl, -ta,
-tr, -cs, -gr and -gc. For example,
-nx creates a PDB file(s) containing extracted ligand poses with the
macromolecule.
The file macro_<sw>.res will contain
- The docking grid description
- The information requested about the docking of each ligand
- The distribution of a number of molecular properties in the ligand set:
- Number of hydrogen-bond donors
- Number of hydrogen-bond acceptors
- Number of rotatable bonds
- Number of NO2 groups
- Number of rings
- Molecular weight
- Charge
- For each chemical element occuring in the ligand set
the number of such atoms
A typical example of such output (for information in the number of
rotatable bonds in a ligand) is
Distribution of number of rotatable bonds over 11 ligands
Average= 4.1818 S.D.= 1.6414
.00 .00 .18 .18 .18 .36 .00 .00 .09 .00
+----+----+----+----+----+----+----+----+----+----+
1.00 | |
.90 | |
.80 | |
.70 | |
.60 | |
.50 | |
.40 | |****| |
.30 | |****| |
.20 | |****|****|****|****| |
.10 | |****|****|****|****| |****| |
+----+----+----+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7 8 9
Here the X axis is the value of the property for which the distribution is
calculated;
the Y axis is the fraction of ligands having a particular value of the property;
the numbers on the top give the actual fractions.
In this example, the highest column is for 5. This means that most ligands
in this library have 5 rotatable bonds.
The hight of the column is at 0.4, meaning that between 30% and 40% of
the ligands have 5 rotatable bonds - the actual number is 36%, shown on top.
The number on top shows 0.36, meaning
that 36%
- For both score types extracted (energy and free energy or energy and score)
- The number of poses having docking score in 0.5 kcal/mol
intervals within 5 kcal/mol of the best score.
This list helps deciding the number of top-scoring poses to extract -
this number has to be specified at this point.
- The list of top-scoring ligands, giving
- Score (energy or free energy (Autodock);
energy or inhibition constant (Autodock);
energy or score (eHiTS); fitness or free energy (GOLD))
- The number of times this pose was found (multiplicity)
- Ligand identifier as deduced from the file name read from
the file macro_<sw>.dir
- Ligand sequence number in the file macro_<sw>.dir
- Ligand molecular weight
- Number of hydrogen-bond donors of the ligand molecule
- Number of hydrogen-bond acceptors of the ligand molecule
- Number of rotatable bonds of the ligand molecule
- Number of NO2 groups in the ligand molecule
- Number of rings in the ligand molecule
- Number of ligand-macromolecule hydrogen-bonds
- Total charge of the ligand molecule
- The macromolecule atom (index, name, residue number) nearest to the
ligand
- The chemical formula of the ligand
- The list of ligand-target contact atom pairs. Contact is defined as
a pair of target ligand atoms that are mutually proximal
(i.e., if the ligand atom nearest to target atom iat is ial
then iat is the nearest target atom to ial)
- If requested (whenever more then one pose per ligand was allowed
to enter the top-scoring list), the RMSD between pairs of poses
of the same ligand that are on the top-scoring list.
- The distribution of the number of poses with respect to the
macromolecule residue. A typical example (shown only for residues 51-150)
looks like this:
+---------+---------+---------+---------+---------+-
10| |
| |
| |
| * |
| * |
| * |
| * |
| * * |
|** * * |
1|** * * |
+---------+---------+---------+---------+---------+-
51 100
+---------+---------+---------+---------+---------+-
10| * |
| * |
| * |
| * |
| * |
| * |
| * |
| * |
| * |
1| * |
+---------+---------+---------+---------+---------+-
101 150
The height of the column is proportional to the number of ligands docked to
the residue represented by the X axis.
The residues to which a ligand is docked can be assigned by using the residue
where the closes contact is or using all residues that include atoms on the
contact list (vide supra).
- The distribution of the docking free energies or eHiTS scores
at the macromolecule residues.
This is represented by a plot as shown below (from the same run the example
above is taken):
Largest count= 7
+---------+---------+---------+---------+---------+-
-4.40| 2 |
|1 2 |
|17 |
| 7 |
| |
|1 4 |
| |
| |
| |
-7.15| |
+---------+---------+---------+---------+---------+-
51 100
+---------+---------+---------+---------+---------+-
-4.40| |
| |
| |
| |
| |
| 4 |
| M |
| 1 |
| 1 |
-7.15| 1 |
+---------+---------+---------+---------+---------+-
101 150
Here again the X axis represent the residue number, the Y axis the docking
energy or free energy.
Whenever a number appears, it indicates that ligands were docked to
the corresponding residue, having the corresponding docking (free) energy.
The number (between 0 and 9) is proportional to the number of ligands
in that category. M (instead of a digit) shows the residue/energy combination
with the largest number of members (the value is given before the plots);
the digits 0 - 9 represent proportionally smaller number of members.
Compilation of the program
The program is written in Fortran 77. Its size is governed by
the parameters (the number between the braces is the value set
in the source code), established in the first line of the program
- MAXMACA {30000} - maximum number of atoms in the macromolecule
- MAXMACR {5000} - maximum number of residues in the macromolecule
- MAXLINES {500000} - maximum number of lines in the log file of a single
ligand
- MAXDOCKSYN {7} - maximum number of docking syntaxes allowed
- MAXMOL {1000000} - maximum number of ligands
- MAXPOSE {15} - maximum number of poses per ligand to extract
from the log files
- MAXLIG {20000} - maximum number of atoms per ligand
- MAXMATCH {01} - maximum number of contacts per ligand;
if contact statistics is generated set it to 30
- MAXGRID {128} - maximum number of grid points in a contact map
(in one dimension)
- MAXNODE {1024} - maximum number of CPUs to use in the parallel mode
- MAXSMIEXTRACT {1024} - maximum number of SMILESstrings to extract
There are some constraints on the parameters:
- MAXMACA > MAXLIG
- MAXMOL > MAXMACA
- 256 > MAXPOSE
- 7*MAXPOSE*MAXMOL > 14*MAXGRID^3
Several of the parameter definitions are repeated in different subroutines.
If any one is changed, all occurrences have to be changed similarly
The program uses several arrays of size MAXMOL*MAXPOSE, dominating the memory
requirement.
It should be compiled at the highest optimization level for maximum speed.
For example, using the f77 compiler the compilation can be executed
by
f77 -O4 -o dockres.exe dockres.f
Some compilers fail due to a so-called 'relocation error' when optimizing
at levels higher than one is asked.
When using the Intel Fortran compiler (ifort), adding the compiler directives
-mcmodel=medium -share_intel
solved the problem. With some of the other compiler (but not the GNU compiler)
the compilation key
-fpic was found to solve the problem.
The optional parallelization is using the MPI library.
Note, that this requires running in batch mode,
with the concomittant restrictions.
Furthemore, the parallel version does not work for DOCK or Glide.
To compile Dockres with the parallel code included, first remove the 'C@DM'
string from the source code:
> cat dockres.f | sed 'C@DM'd > dockres_mpi.f
> f77mpi -o dockres -O4 dockres_mpi.f
The name of the MPI-enabled compiler may be different in your system and
additional libraries may also be needed to be invoked.
For parallelized runs, the parameter MAXMOL can be set to less than
the total number of ligands - it should be just large enough to hold
data for Nmolec/NCPU.
In this case, however, the program stops after writing the checkpoint file
and a separate single-CPU run, compiled with the parameter MAXMOL
set large enough to hold all ligands should be used to print/save the results.
This option is useful for distributed memory systems where the
majority of nodes have relatively small memory.
Note that if Fortran-90 is used for one compilation, then it should be used
for the version reading the checkpoint file as well, otherwise
the binary files will be incompatible.