Description of the program Dockres:
Summary of the results of docking a library to a target by Autodock-4, Autodock-Vina, eHiTS, PLANTS, GOLD, DOCK or Glide

Mihaly Mezei

Department of Pharmacological Sciences
Icahn School of Medicine at Mount Sinai
New York, NY 10029

Aug. 14, 2020

The program Dockres scans the result of Autodock (Version 4) or Autodock-Vina or eHiTS or PLANTS or GOLD or DOCK or Glide docking runs with a series of ligands. It gathers the top binders and displays a variety of statistics, both on the ligand set and on the top binding poses. The ligands selected can be filtered by a number of criteria.

File convention

Dockres uses a one letter code for the screening software used:

A: Autodock-4
V: Autodock-Vina
E: eHiTS
P: PLANTS
G: GOLD
D: DOCK
L: Glide

This code is included in the names of the various directories and files created by Dockres.

Input of the program

Besides the structure file for the target macromolecule (of the form macro.pdb*, or (for GOLD) macro.mol2) Dockres assumes the availability of the following files (the notation macro stands for the name of the macromolecule file's name without the .pdbqs .pdbqt, .pdb or .mol2 extension):

Except for DOCK and Glide, a file called macro_<sw>.dir listing the docking result files (for PLANTS the directories the docking result of each ligand are) where <sw> is a one letter code for the screening software used. It can be created with the script getdir.csh (or by the the user with a text editor) prior to running Dockres.

Format of the file macro_<sw>.dir:

First record (optional): the name of the Autdock grid-parameter file or the eHiTS clip file (including the path relative to the current directory). Omit for Autodock-Vina and GOLD.
Second to last records: the sequence number of the ligand, followed by the name of a ligand docking result file (.dlg for Autodock-4, .pdbqt for Autodock-Vina, .sdf for eHiTS, ligand name as the directory name for PLANTS or .mol2 files for GOLD, again including the path) - one record for each ligand docked

For example, docking with Autodock-4 ligands ligx.mol2, ligy.mol2 and ligz.mol2 to macomolecule mm.pdbqt will result in files ligx.mm.dlg, ligy.mm.dlg, and ligz.mm.dlg. Thus the user has to prepare a file called mm.dir, with the following content

mm.gpf
1 ligx.mm.dlg
2 ligy.mm.dlg
3 ligz.mm.dlg

Note that for GOLD, Dockres assumes that each pose is in a sperate file in the result directory and they are of the form gold_soln_<structure>#l_#p.mol2 where #l is the ligand number and #p is the pose number.

In addition, Dockres needs

For Autodock-4
- The grid-parameter file (the one with the .gpf extension, used as the input to the Autogrid run)
- Optionally a file with the flexible part of the target with .pdbqt extension (default name: macro_flex.pdbqt).
For Autodock-Vina
- optionally a file with the flexible part of the target with .pdbqt extension (default name: macro_flex.pdbqt).
For eHiTS runs
- the clip file (in .pdb format)

Dockres can be run both interactively from a terminal or in batch mode, specifying the run parameters as command-line options. The terminal inputs can also be logged to a file that can be used to rerun a job, possibly after editing it. When compiled with the parallel code included it has to be run in batch mode (with concomittant restrictions, vide infra).

In interactive mode it starts with asking (possibly a subset of the) for the following information:

The docking software used
The existence (and the name, if applicable) of a file converting the ligand ID numbers to a different number (e.g., from eMolecules parent ID to eMolecules version ID).
The macromolecule file name (without the .pdb* or .mol2 extension), macro. For Autodock 4 it will ask for the existence of flexible part. If the answer is positive, it will ask for the flexible .pdbqt file 's name.
If the checkpoint file macro_<sw>.ckp is present then the program will ask if it is to be used or deleted. If use is requested, it will read its contents and skip to the result summary segment.
The user is given the option to change the thresholds for hydrogen bonds, salt bridges and hydrophobic bonds.
For pdb file input, the program asks for the availability of partial charges.
When charges are available, the program gives the option to still use atom-type based bond definitions (in that case the program will not check for salt bridges)
The number of top poses per ligand to consider candidates for the top scoring list (to 'extract')
The use or ignoring of the clustering of poses done by Autodock-4
The amount of information per ligand to be printed on the output file macro_<sw>.res
Optionally, the atom number of the center of binding site
When a binding site is specified, the user has the option to select a ligand atomtype to use for the distance calculation (otherwise the ligand atom nearest to the binding site atom will be used)
The size of the region around each ligand pose where target atoms will be searched for contact. This region is defined by finding the smallest rectangle around the ligand (aligned with the coordinate axes) and extending it in every direction ('padding') by a user-defined number (default: 7 A).
Optionally, a target ligand file to which the poses can be compared.

Once this information is given, the docking result files are read and the data is extracted from each. Besides the coordinates of the pose, the program extracts the docking score. For Autodock-4 and eHiTS the user has to select between the wo scores calculated by the docking program. This extraction may take some time - for larger libraries the program periodically will print a report of the progress. Once the data is gathered, a checkpoint file is written and the result summary starts.

The result summary starts with printing on the terminal the list of the top-scoring poses, the number of poses in the top-score ranges, and a plot showing the distribution of the location of poses over the macromolecule's residues. The program then gives the user the option to

For Autodock, rescore the free energies based on the multiplicity of each pose
Limit the number of poses listed for the same ligand
Calculate the RMSD between different top-scoring poses of the same ligand
Extract docked poses. The default is to extract a PDB file with the macromolecule and the selected top-scoring poses, but the user can specify the list of poses to include as well. For rigid macromolecule a single file can be generated with the different poses added to the macromolecule as additional residues (residue name LIG, chain id L). Extracted poses can be saved separately as single files or a NMR-style file containg the extracted ligand poses as MODELs (note tha VMD refuses to read such a file if the ligands ar not all identical but Pymol cab read it fine). For flexible macromolecule, each pose will result in a complete file with the macromolecule and the ligand whose name will be a comination of the name of the macromolecule, the ligand and the pose number.
Generate the same distribution restricted to a set of ligands specified by the user
Repeating the calculation of the various statistics
At this point, additional filtering criteria can be added:
- New value of the number of poses/ligand to extract
- List and/or range of residue numbers of the macromolecule atom nearest to the docked pose
- Minimum and maximum charge of the ligand
- Minimum number of hydrogen bonds, salt bridges and hydrophobic bonds
- If a target ligand file was specified at the outset of the run, the minimum number of contacts between the ligand and the target and the maximum average contact distance between the ligand and the target.

If no (more) repetition of calculations are requested, the program proceeds to the last stage where it offers the user the following options:

Extract the ligand files of top scoring ligands into a separate directory
Generate a list of top-scoring ligands
Calculate the Tanimoto coefficent matrix for top-scoring ligands based on the similarity of the target atoms or residues in contact with the ligand. The matrix will be written on a file called macro_<sw>.tan the format Simulaid can read it for clustering.
Calculate the number of hydrogen bonds, salt bridges and hydrophobic bonds the top ligands form with each solute atom or residue
Generate grid maps for the distribution of ligand atom types in contact with the target
Extract the SMILES strings corresponding to the top-scoring ligands from a file of the SMILES strings in the library used

For each of the options above the user can specify the number of top-scoring ligands to use and decide if different isomers/tautomers of the same ligand should be included.

In batch mode the following information can be specified:

help The program lists the possible online arguments with a brief explanation and stops
-sw software : the docking software, one of ATD4, ATDV, eHiTS, PLANTS, GOLD, DOCK or Glide
-mm macro : the macromolecule name (without the.pdb*)
-ne number of poses to extract from the log files
-nl number of top poses {20} to list with detailed info
-nd maximum number of poses per ligand to list.
-sc Score to use (1 or 2): for programs that generate two scores this option selects the one to sort by and to list
{2,1,2,2,1,1,2 for ATD4, ATDV, eHiTS, PLANTS, GOLD, DOCK or Glide, resp.}
-st score_threshold {1000.0} - poses with scores greater then this threshold specified will be filtered out
-td Number of top DOCK ligands {0] to consider
-rc 1: repeat the run starting from the checkpoint file macro_<sw>.ckp
-ic 'yes' or 'no' {yes} - use or don't use Autodock clustering
-ol Output level {1} : number between 0 and 3, higher number results in more detailed output from the result scan.
-cv Ligand ID conversion file name, a file with a title line and the input and output ligand ID numbers, comma separated
-se 0 or 1: {0} When 1 is read the will be aborted run if a ligand with error was found
-ib Binding site information - the solute atom index from which ligand distances will be gathered for later filtering
-cb Binding site center - the x, y, and z coordinates of the putative binding site

-lt Binding site type - the ligand atom type for which ligand target distances will be gathered for later filtering. If not specified, the nearest ligand atom will be used.

Ligand type number list:
 H :H-C  = 1 C :>C<  =11 N :>N<  =21 O :C-O-H=31 O :C-O-C=41 O-:C-On =51
 H :H-N  = 2 C :>C=  =12 N :-N<  =22 O :N-O-H=32 O :C-O-N=42 O-:P-On =52
 H :H-O  = 3 C :C=-C =13 N :-N=  =23 O :O-O-H=33 O :C-O-O=43 O-:S-On =53
 H :H-P  = 4 C :C=-N =14 N :-N-=C=24 O :P-O-H=34 O :C-O-P=44 O-:*On* =54
 H :H-S  = 5 C :Carom=15 N :*N*  =25 O :S-O-H=35 O :C-O-S=45
 H :H*   = 6 C :*C*  =16             O :C=O  =36 O :*OX* =46
                                     O :P=O  =37
                                     O :S=O  =38             P :P*   =58
                                                             S :S*   =59
                                                             **:*    =60

-rb Binding site distance limit {100000} - ligand poses farther than this limit will be filtered out
-tg targetfile: read the structure in the file targetfile; the top scoring ligand poses can be compared to it.
-il ligand #: the target ligand is identical with the ligand # in the library - the RMSD between the docked poses and the target ligand will be calculated.
-bb Minimum backbone length - peptide ligands shorter than this number (in A) will be dropped
-we ligand padding {7.0} - the value to add to the ligand coordinates when searching for nearby macromolecule atoms
-if Induced-fit docking {0} - When present for a Glide run, the input file is assumed to be the result of an induced-fit docking
-nx number of poses {0} to save in the PDB file
-py 0 or 1: {0} create a Pymol run file to accompany the PDB file (when the input is 1)
-ls : extract top poses
- -ls c: Single file extracted, ligands are added as separate residues
- -ls s: For each ligand, a separate file is extracted with the macromolecule present in each
- -ls m: Single file extracted, ligands are added as NMR-type MODELs
- -ls l directory name: Separate files are created for the target and each selected ligand; ligand files will be prefixed by their rank and put into directory name
- -ls n: No ligand-target complex is extracted (default)
-sd 0 or 1: {1} When 1 is read the name in the.sdf file (first line) will be used to list with detailed info
-un 0 or 1: {1} When 1 is read the -sd_name record in the input .mae file will be used as the ligand name (Glide only)
-ui 0 or 1: {1} When 1 is read the -sd_ID record in the input .mae file will be added to the ligand name (Glide only)
-ns 0 or 1: {0} When 1 is read the ligand poses to be listed will not be sorted by score. This is useful when all ligands are listed.
-fu free energy units {kcal/mol}, either kcal/mol or Ki: the free energy scores can either printed in kcal/mol or as inhibition contant Ki (concentration); Ki=exp(FE/(kT)).
-tl number of top scoring ligand ids to list in a file called macro_<sw>.lst.
-lx nligextract directory/file name: put the top nligextract ligand files into directory name (for non-Glide runs) or into file name (for Glide runs).
-ll Ligand list file name: read the list of ligands to extract from the docked pose file (for Glide only).
-sx number of top scoring ligand SMILES strings output file name library SMILES string file will extract the asked number of files to the otput file, assuming that the library file has the SMILES strings followed by the ligand names (one line per ligand)
-ta number of top scoring ligands to use for calculating the Tanimoto similarity matrix based on ligand-macromolecule atom contacts, written to macro_<sw>.tan.
-tr number of top scoring ligands to use for calculating the Tanimoto similarity matrix based on ligand-macromolecule residue contacts, written to macro_<sw>.tan.
-sm 0 or 1: {1} Combine the isomers/tautomers into a single entry in the file macro_<sw>.lst.
-cs Number of top ligands to use for binding-type statistics
-px Number of poses per ligands to use for binding-type statistics and/or contact maps
-OS iPtoOS 0 or 1: {0} replace P with OS (S-bound oxygen) for binding-type statistics and/or contact maps
-rn rngmax: {4.0} Residues within rngmax Å of each ligand pose listed will be printed
-gr nliggrid, gridspace, origin_x, origin_y, origin_z ng_x, ng_y, ng_z
- nliggrid is the number of ligands to use for the map,
- gridspace the spacing of the grid (in Å)
- origin_x, origin_y, origin_z are the coordinates of the lower left corner of the grid
- ng_x, ng_y, ng_z are the number of grid points in the X, Y, and Z directions, resp.
-gc , same arguments as for -gr but origin is now the center of the box
-no 0, or 1 or 2 or 3:
- 0: maps are the raw counts;
- 1: maps are normalized by the number of occurrence of each type in the contact list;
- 2: maps are normalized by the number of occurrence of each type in all of the ligands used
- 3: Each gridpoint is normalized by the total number of contacts at that gridpoint

A possible batch run call can be

> dockres -mm hemoglobin -sw eHiTS -np 20 -ol 2 -ib 99

The first two items (-mm and -sw) are compulsory; preferably, they should be the first two items, allowing all warning and error messages to be printed on the log file macro_<sw>.res. For the rest of the input that can be specified in interactive mode the default values are used.

Batch run with flexible macromolecule has not yet been implemented.

Some inputs currently can only be provided in the interactive mode (e.g., hydrogen-bond thresholds, filtering options). To use a non-default option for which no command-line input is implemented, an interactive run is required that can be started from the checkpoint file. It will not be CPU intensive since the time-consuming data gathering has been completed already.

Output of the program

Dockres will create the following files:

A log file called macro_<sw>.res where all result will be printed. If it is already present, it will write instead to macro_<sw>_N.res where N is the smallest integer such that no file with that number exists. A file called macro_<sw>.ckp containig all the information gathered allowing the repeated extraction of data with different filtering criteria without having to perform the time-consuming scan of the .dlg files

Additional output files can be created optionally - see above the batch directives -nx, -py -tl, -ta, -tr, -cs, -gr and -gc. For example, -nx creates a PDB file(s) containing extracted ligand poses with the macromolecule.

The file macro_<sw>.res will contain

The docking grid description
The information requested about the docking of each ligand

The distribution of a number of molecular properties in the ligand set:

Number of hydrogen-bond donors
Number of hydrogen-bond acceptors
Number of rotatable bonds
Number of NO₂ groups
Number of rings
Molecular weight
Charge

For each chemical element occuring in the ligand set the number of such atoms

A typical example of such output (for information in the number of rotatable bonds in a ligand) is

 Distribution of number of rotatable bonds over      11 ligands
 Average=    4.1818 S.D.=    1.6414
 
            .00  .00  .18  .18  .18  .36  .00  .00  .09  .00
          +----+----+----+----+----+----+----+----+----+----+
    1.00  |                                                 |
     .90  |                                                 |
     .80  |                                                 |
     .70  |                                                 |
     .60  |                                                 |
     .50  |                                                 |
     .40  |                        |****|                   |
     .30  |                        |****|                   |
     .20  |         |****|****|****|****|                   |
     .10  |         |****|****|****|****|         |****|    |
          +----+----+----+----+----+----+----+----+----+----+
             0    1    2    3    4    5    6    7    8    9

Here the X axis is the value of the property for which the distribution is calculated; the Y axis is the fraction of ligands having a particular value of the property; the numbers on the top give the actual fractions. In this example, the highest column is for 5. This means that most ligands in this library have 5 rotatable bonds. The hight of the column is at 0.4, meaning that between 30% and 40% of the ligands have 5 rotatable bonds - the actual number is 36%, shown on top. The number on top shows 0.36, meaning that 36%

For both score types extracted (energy and free energy or energy and score)

The number of poses having docking score in 0.5 kcal/mol intervals within 5 kcal/mol of the best score. This list helps deciding the number of top-scoring poses to extract - this number has to be specified at this point.
The list of top-scoring ligands, giving
- Score (energy or free energy (Autodock); energy or inhibition constant (Autodock); energy or score (eHiTS); fitness or free energy (GOLD))
- The number of times this pose was found (multiplicity)
- Ligand identifier as deduced from the file name read from the file macro_<sw>.dir
- Ligand sequence number in the file macro_<sw>.dir
- Ligand molecular weight
- Number of hydrogen-bond donors of the ligand molecule
- Number of hydrogen-bond acceptors of the ligand molecule
- Number of rotatable bonds of the ligand molecule
- Number of NO₂ groups in the ligand molecule
- Number of rings in the ligand molecule
- Number of ligand-macromolecule hydrogen-bonds
- Total charge of the ligand molecule
- The macromolecule atom (index, name, residue number) nearest to the ligand
- The chemical formula of the ligand
- The list of ligand-target contact atom pairs. Contact is defined as a pair of target ligand atoms that are mutually proximal (i.e., if the ligand atom nearest to target atom iat is ial then iat is the nearest target atom to ial)
If requested (whenever more then one pose per ligand was allowed to enter the top-scoring list), the RMSD between pairs of poses of the same ligand that are on the top-scoring list.

The distribution of the number of poses with respect to the macromolecule residue. A typical example (shown only for residues 51-150) looks like this:

 
         +---------+---------+---------+---------+---------+-
       10|                                                  |
         |                                                  |
         |                                                  |
         | *                                                |
         | *                                                |
         | *                                                |
         | *                                                |
         | *                          *                     |
         |**                *         *                     |
        1|**                *         *                     |
         +---------+---------+---------+---------+---------+-
          51                                             100
 
         +---------+---------+---------+---------+---------+-
       10|                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
         |                        *                         |
        1|                        *                         |
         +---------+---------+---------+---------+---------+-
          101                                            150

The height of the column is proportional to the number of ligands docked to the residue represented by the X axis. The residues to which a ligand is docked can be assigned by using the residue where the closes contact is or using all residues that include atoms on the contact list (vide supra).

The distribution of the docking free energies or eHiTS scores at the macromolecule residues. This is represented by a plot as shown below (from the same run the example above is taken):

 Largest count= 7
 
         +---------+---------+---------+---------+---------+-
    -4.40|                            2                     |
         |1                           2                     |
         |17                                                |
         | 7                                                |
         |                                                  |
         |1                 4                               |
         |                                                  |
         |                                                  |
         |                                                  |
    -7.15|                                                  |
         +---------+---------+---------+---------+---------+-
          51                                             100
 
         +---------+---------+---------+---------+---------+-
    -4.40|                                                  |
         |                                                  |
         |                                                  |
         |                                                  |
         |                                                  |
         |                        4                         |
         |                        M                         |
         |                        1                         |
         |                        1                         |
    -7.15|                        1                         |
         +---------+---------+---------+---------+---------+-
          101                                            150

Here again the X axis represent the residue number, the Y axis the docking energy or free energy. Whenever a number appears, it indicates that ligands were docked to the corresponding residue, having the corresponding docking (free) energy. The number (between 0 and 9) is proportional to the number of ligands in that category. M (instead of a digit) shows the residue/energy combination with the largest number of members (the value is given before the plots); the digits 0 - 9 represent proportionally smaller number of members.

Compilation of the program

The program is written in Fortran 77. Its size is governed by the parameters (the number between the braces is the value set in the source code), established in the first line of the program

MAXMACA {30000} - maximum number of atoms in the macromolecule
MAXMACR {5000} - maximum number of residues in the macromolecule
MAXLINES {500000} - maximum number of lines in the log file of a single ligand
MAXDOCKSYN {7} - maximum number of docking syntaxes allowed
MAXMOL {1000000} - maximum number of ligands
MAXPOSE {15} - maximum number of poses per ligand to extract from the log files
MAXLIG {20000} - maximum number of atoms per ligand
MAXMATCH {01} - maximum number of contacts per ligand; if contact statistics is generated set it to 30
MAXGRID {128} - maximum number of grid points in a contact map (in one dimension)
MAXNODE {1024} - maximum number of CPUs to use in the parallel mode
MAXSMIEXTRACT {1024} - maximum number of SMILESstrings to extract

There are some constraints on the parameters:

MAXMACA > MAXLIG
MAXMOL > MAXMACA
256 > MAXPOSE
7*MAXPOSE*MAXMOL > 14*MAXGRID^3

Several of the parameter definitions are repeated in different subroutines. If any one is changed, all occurrences have to be changed similarly

The program uses several arrays of size MAXMOL*MAXPOSE, dominating the memory requirement.

It should be compiled at the highest optimization level for maximum speed. For example, using the f77 compiler the compilation can be executed by

f77 -O4 -o dockres.exe dockres.f

Some compilers fail due to a so-called 'relocation error' when optimizing at levels higher than one is asked. When using the Intel Fortran compiler (ifort), adding the compiler directives
-mcmodel=medium -share_intel
solved the problem. With some of the other compiler (but not the GNU compiler) the compilation key -fpic was found to solve the problem.

The optional parallelization is using the MPI library. Note, that this requires running in batch mode, with the concomittant restrictions. Furthemore, the parallel version does not work for DOCK or Glide. To compile Dockres with the parallel code included, first remove the 'C@DM' string from the source code:

> cat dockres.f | sed 'C@DM'd > dockres_mpi.f

> f77mpi -o dockres -O4 dockres_mpi.f

The name of the MPI-enabled compiler may be different in your system and additional libraries may also be needed to be invoked.

For parallelized runs, the parameter MAXMOL can be set to less than the total number of ligands - it should be just large enough to hold data for N_molec/N_CPU. In this case, however, the program stops after writing the checkpoint file and a separate single-CPU run, compiled with the parameter MAXMOL set large enough to hold all ligands should be used to print/save the results. This option is useful for distributed memory systems where the majority of nodes have relatively small memory.

Note that if Fortran-90 is used for one compilation, then it should be used for the version reading the checkpoint file as well, otherwise the binary files will be incompatible.

Description of the program Dockres: Summary of the results of docking a library to a target by Autodock-4, Autodock-Vina, eHiTS, PLANTS, GOLD, DOCK or Glide

Mihaly Mezei

Description of the program Dockres:
Summary of the results of docking a library to a target by Autodock-4, Autodock-Vina, eHiTS, PLANTS, GOLD, DOCK or Glide