SHREC 2021: Surface-based protein domains retrieval

Envisioned task

The aim of this track is to assess the performance of shape retrieval methods on a dataset of related multi-domain protein surfaces.

Domains are structural as well as functional subunits of proteins that can exist independently of the rest of the protein. Proteins are usually made of two or more such domains, and domains are the level at which protein interactions and functions are studied. Comparing proteins at the domain level for similarities is a common task in structural biology, biochemistry and drug discovery. Proteins can be described as non-rigid surfaces representing their solvent-excluded surface (SES) as defined by Connolly (Connolly, J Appl Cryst., 1983). Additional biologically relevant information, such as electrostatics, can be provided to further describe these molecular shapes.

This track proposes a set of surfaces representing the conformational space of 10 query domains extracted from the PFAM database (El-Gebali et al., NAR, 2019), as well as 554 surfaces of multi-domain proteins. Compared to the previous Protein Shape Retrieval contests, we focus on evaluating how well methods retrieve 10 individual domains among a set of 554 multi-domain protein surfaces.

Dataset and Ground Truth

Ten individual domains involved in protein-protein (7 domains) or protein-DNA (3 domains) interactions will be extracted from the PFAM database, and a representative structure of each of these domains will be provided to the participants as a query for the retrieval task.

Six hundred and three single-domain or multi-domain protein surfaces will be provided to the participants in two versions: a shape-only file and a shape+electrostatics file (provided as .ply files). Each protein will include at least one of the query domains, meaning that several proteins will match several queries.

The structures were retrieved and protonated using propka (Søndergaard et al., JCTC, 2011; Olsson et al., JCTC, 2011). All solvent-excluded surfaces (SES) were calculated using EDTSurf (Xu et al., PLoS One, 2009); atomic partial charges were computed using APBS (Jurrus et al., Protein Sci, 2018).

The participants are asked to produce a distance-to-the-query dissimilarity matrix for each query, using the shape-only version, the shape+electrostatics version, or both. The ground truth is derived from the PFAM database classification; only the family level of the database is used to generate the ground truth and will be analyzed for the final report.
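
Because a protein may contain several query domains, the ground truth is naturally a query-by-target relevance table rather than a single class label per protein. The following is a minimal sketch of such a table, assuming each protein is annotated with the set of family identifiers it contains; the function name and the labels FAM_A and FAM_B are hypothetical placeholders, not actual PFAM entries or the organizers' code.

```python
def relevance_matrix(query_families, protein_families):
    """Return a table where entry [q][t] is True when target protein t
    contains the family of query q (hypothetical helper)."""
    return [
        [family in families for families in protein_families]
        for family in query_families
    ]


# Hypothetical annotations: two queries, three target proteins.
queries = ["FAM_A", "FAM_B"]
proteins = [{"FAM_A"}, {"FAM_A", "FAM_B"}, {"FAM_B"}]

truth = relevance_matrix(queries, proteins)
for family, row in zip(queries, truth):
    print(family, row)
```

In this toy example the second protein is relevant to both queries, which is the multi-match situation described above.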

Here is the protocol used to compute the electrostatics at the surface of the proteins:

we removed all heteroatoms (water molecules, ligands, etc.) and unwanted chains/atoms

we protonated the resulting PDB structure with pdb2pqr, using propka to compute the pKa values of the ionizable groups at pH=7.2. At this step, we also generated the input file for the next step (apbs). Here is the command line we used: pdb2pqr30 --ff CHARMM --pdb-output fn.out.pdb --whitespace --apbs-input --titration-state-method propka --with-ph 7.2 -o 7.2 -q fn.pdb fn.pqr

we then computed the electrostatics with apbs, using the input file and the fn.pqr file generated at the previous step

we used EDTSurf to compute the surface mesh, discarding the inner protein cavities, with the following command line: EDTSurf -p 1.4 -s 4 -h 2 -i fn.out.pdb -o fn.ply

the apbs suite includes a conversion tool called multivalue, which we used to compute the electrostatics value at each surface point. It takes as input the dx file generated by apbs and the shape file from EDTSurf (be careful, multivalue requires a specific input format!)

finally, we gathered all these data into the final files *.electro.ply
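
To apply the two command lines quoted above to many structures, they can be assembled programmatically. The sketch below only rebuilds the pdb2pqr30 and EDTSurf invocations exactly as given in the protocol; the function names and the dry-run switch are assumptions for illustration, and the apbs and multivalue steps are left out because their exact arguments are not stated here.

```python
import shlex
import subprocess


def build_commands(stem):
    """Return the pdb2pqr30 and EDTSurf command lines for one structure,
    where `stem` is the file name without extension (e.g. "fn")."""
    pdb2pqr = [
        "pdb2pqr30", "--ff", "CHARMM",
        "--pdb-output", f"{stem}.out.pdb",
        "--whitespace", "--apbs-input",
        "--titration-state-method", "propka",
        "--with-ph", "7.2", "-o", "7.2", "-q",
        f"{stem}.pdb", f"{stem}.pqr",
    ]
    edtsurf = [
        "EDTSurf", "-p", "1.4", "-s", "4", "-h", "2",
        "-i", f"{stem}.out.pdb", "-o", f"{stem}.ply",
    ]
    return [pdb2pqr, edtsurf]


def run_pipeline(stem, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands(stem):
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


run_pipeline("fn")  # dry run: prints the two command lines without executing them
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and the dry-run default makes it easy to inspect the commands before launching a long batch.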


Standard metrics from previous shape retrieval experiments will be used: precision-recall evaluation, nearest neighbor, first tier and second tier, mean average precision and confusion matrix. The participants are expected to return their results as a distance matrix file in binary format.
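
For participants who want to check their own results before submission, several of these metrics can be computed from one row of the dissimilarity matrix and the matching ground-truth row. This is a minimal sketch, not the official evaluation code. It assumes that the queries are not part of the target set, so the first tier covers the top R targets and the second tier the top 2R targets, where R is the number of relevant targets for the query; other conventions exist.

```python
def retrieval_scores(distances, relevant):
    """Score a single query.

    distances: dissimilarity of each target to the query (smaller is closer).
    relevant:  booleans, True where the target contains the query domain.
    """
    # Rank targets from most to least similar; sorted() is stable on ties.
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    hits = [bool(relevant[i]) for i in order]
    n_rel = sum(hits)
    if n_rel == 0:
        raise ValueError("query has no relevant target")
    # 1-based ranks at which relevant targets are retrieved.
    ranks = [rank for rank, hit in enumerate(hits, start=1) if hit]
    return {
        "nearest_neighbor": float(hits[0]),
        "first_tier": sum(hits[:n_rel]) / n_rel,
        "second_tier": sum(hits[:2 * n_rel]) / n_rel,
        "average_precision": sum((k + 1) / r for k, r in enumerate(ranks)) / n_rel,
    }


# Toy example: four targets, two of which are relevant.
scores = retrieval_scores([0.1, 0.5, 0.2, 0.9], [True, False, False, True])
print(scores)
```

Averaging the per-query average precision over all queries gives the mean average precision.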

It is important that participants provide runtimes and hardware specifications for their calculations, since this is critical information for processing large datasets, notably in this particular context of molecular surfaces.

Scheduled timeline (subject to minor revision)