SHREC 2021: Surface-based protein domains retrieval

Envisioned task

The aim of this track is to assess the performance of shape retrieval methods on a dataset of related multi-domain protein surfaces.

Domains are structural and functional sub-units of proteins that can exist independently of the rest of the protein. Proteins are usually made of two or more such domains, which are the level at which protein interactions and functions are studied. Comparing proteins at the domain level to detect similarities is a common task in structural biology, biochemistry and drug discovery. Proteins can be described as non-rigid surfaces representing their solvent-excluded surface (SES) as defined by Connolly (Connolly, J Appl Cryst, 1983). Additional, biologically relevant information, such as electrostatics, can be provided to further describe these molecular shapes.

This track proposes a set of surfaces representing the conformational space of 10 query domains extracted from the PFAM database (El-Gebali et al., NAR, 2019), as well as 554 surfaces of multi-domain proteins. Compared to the previous Protein Shape Retrieval contests, we focus on evaluating how well methods retrieve 10 individual domains among a set of 554 multi-domain protein surfaces.

Dataset and Ground Truth

Ten individual domains involved in protein-protein (7 domains) or protein-DNA (3 domains) interactions will be extracted from the PFAM database, and a representative structure of each of these domains will be provided to the participants as a query for the retrieval task.

Six hundred and three single-domain or multi-domain protein surfaces will be provided to the participants in two versions: a shape-only file and a shape+electrostatics file (both provided as .ply files). Each protein will include at least one of the query domains, meaning that several proteins will match several queries.

The structures were retrieved and protonated using propka (Sondergaard et al., JCTC, 2011; Olsson et al., JCTC, 2011). All solvent-excluded surfaces (SES) were calculated using EDTSurf (Xu et al., PLoS One, 2009); atomic partial charges were computed using APBS (Jurrus et al., Protein Sci, 2018).

The participants are asked to produce a distance-to-the-query dissimilarity matrix for each query, using the shape-only version, the shape+electrostatics version, or both. The ground truth is derived from the PFAM database classification; only the family level of the database is used to generate the ground truth, and only this level will be analyzed for the final report.
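As an illustration of the expected output, here is a minimal sketch of how a distance-to-the-query dissimilarity matrix could be assembled. It assumes that each surface has already been reduced to a fixed-length numeric descriptor and that the Euclidean distance is used; both are placeholders chosen for the example, not requirements of the track. The layout (one row per query, one column per target surface) and the function names are likewise assumptions.

```python
import math


def descriptor_distance(a, b):
    """Euclidean distance between two fixed-length descriptors (placeholder metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(query_descriptors, target_descriptors):
    """Return one row per query and one column per target surface (assumed layout)."""
    return [
        [descriptor_distance(q, t) for t in target_descriptors]
        for q in query_descriptors
    ]


# Toy descriptors standing in for real per-surface features.
queries = [[0.0, 0.0], [1.0, 1.0]]
targets = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
matrix = dissimilarity_matrix(queries, targets)
```

With the real data, the matrix would have one row for each of the 10 queries and one column for each target surface. Any dissimilarity measure can replace the placeholder, as long as smaller values mean more similar surfaces.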

ply_electro.tar.gz

ply_shape.tar.gz

queries_ply_electro.tar.gz

queries_ply_shape.tar.gz

SHREC2021_shape.cla

SHREC2021_electro.cla

Protocol

Here is the protocol used to compute the electrostatics at the surface of the proteins:

we removed any heteroatoms (water molecules, ligands, etc.) and unwanted chains/atoms

we protonated the resulting PDB structure with pdb2pqr, using propka to compute the pKa values of the ionizable groups at pH=7.2. At this step, we also generated the input file for the next step (apbs). Here is the command line we used: pdb2pqr30 --ff CHARMM --pdb-output fn.out.pdb --whitespace --apbs-input fn.in --titration-state-method propka --with-ph 7.2 -o 7.2 -q fn.pdb fn.pqr

we then computed the electrostatics with apbs, using the files fn.in and fn.pqr generated at the previous step

we used EDTSurf to compute the surface mesh, discarding the inner protein cavities, with the following command line: EDTSurf -p 1.4 -s 4 -h 2 -i fn.out.pdb -o fn.ply

the apbs suite includes a conversion tool called multivalue, which we used to compute the electrostatics value at each surface point. It takes as input the dx file generated by apbs and the shape file from EDTSurf (be careful, multivalue requires a specific input format!)

finally, we gathered all these data into the final files *.electro.ply
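For participants who want to reproduce the protocol on other structures, the two command lines given above can be generated per input file with a small helper. This is a sketch only: the function names and the `stem` parameter are hypothetical, the commands are built but not executed, and the apbs and multivalue steps are left out because their exact command lines are not given here.

```python
import shlex


def pdb2pqr_command(stem, ph=7.2):
    """Build the pdb2pqr30 command quoted in the protocol for the file <stem>.pdb."""
    return [
        "pdb2pqr30", "--ff", "CHARMM",
        "--pdb-output", f"{stem}.out.pdb",
        "--whitespace",
        "--apbs-input", f"{stem}.in",
        "--titration-state-method", "propka",
        "--with-ph", str(ph), "-o", str(ph), "-q",
        f"{stem}.pdb", f"{stem}.pqr",
    ]


def edtsurf_command(stem):
    """Build the EDTSurf command quoted in the protocol for the file <stem>.out.pdb."""
    return [
        "EDTSurf", "-p", "1.4", "-s", "4", "-h", "2",
        "-i", f"{stem}.out.pdb", "-o", f"{stem}.ply",
    ]


# Print the commands for a structure named fn.pdb, without running them.
for command in (pdb2pqr_command("fn"), edtsurf_command("fn")):
    print(shlex.join(command))
```

Each list can be passed to subprocess.run once the corresponding tool is installed and available on the PATH.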

Evaluation

Standard metrics from previous shape retrieval experiments will be used: precision-recall evaluation, nearest neighbor, first-tier and second-tier, mean average precision and confusion matrix. The participants are expected to return their results as a distance matrix file in binary format.
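To make these metrics concrete, the sketch below computes nearest neighbor, first-tier, second-tier and average precision for a single query from one row of a distance matrix. It follows the usual definitions, assuming that the query itself is not part of the target set, so the tier sizes are taken as the number of relevant targets and twice that number. The function names are hypothetical, and the organizers' evaluation code may differ in details such as tie handling.

```python
def rank_targets(distances):
    """Return target indices sorted from most to least similar."""
    return sorted(range(len(distances)), key=lambda i: distances[i])


def retrieval_metrics(distances, relevant):
    """Compute per-query retrieval scores.

    distances: one row of the dissimilarity matrix (one value per target).
    relevant: non-empty set of target indices that belong to the query's class.
    """
    ranking = rank_targets(distances)
    n_rel = len(relevant)
    hits = [1 if idx in relevant else 0 for idx in ranking]

    found = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            precision_sum += found / rank  # precision at each relevant rank

    return {
        "nearest_neighbor": float(hits[0]),
        "first_tier": sum(hits[:n_rel]) / n_rel,
        "second_tier": sum(hits[:2 * n_rel]) / n_rel,
        "average_precision": precision_sum / n_rel,
    }


# Toy example: four targets, of which targets 0 and 3 are relevant.
scores = retrieval_metrics([0.1, 0.9, 0.3, 0.5], {0, 3})
```

Mean average precision is then the mean of the average_precision values over all queries.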

It is important for the participants to provide runtimes and hardware specifications for their calculations, since this is critical information for processing large datasets, notably in this particular context of molecular surfaces.
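One possible way to collect this information is sketched below, using only the Python standard library. The report fields and the `timed_run` helper are illustrative assumptions, not a required reporting format, and details such as GPU model or memory size would have to be added by hand.

```python
import os
import platform
import time


def timed_run(task):
    """Run task() and return its result with wall-clock time and basic machine info."""
    start = time.perf_counter()
    result = task()
    seconds = time.perf_counter() - start
    report = {
        "seconds": seconds,
        "platform": platform.platform(),  # operating system and version
        "machine": platform.machine(),    # CPU architecture, e.g. x86_64
        "cpu_count": os.cpu_count(),      # number of logical CPUs
    }
    return result, report


# Toy workload standing in for a descriptor or distance computation.
result, report = timed_run(lambda: sum(i * i for i in range(1000)))
```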

Scheduled timeline (subject to minor revision)

Jan 27, 2021 - The dataset is made available on shrec2021.drugdesign.fr. The participants are allowed to run their calculations.
Feb 15, 2021 - Registration deadline. Registration must be sent to Matthieu Montès and Florent Langenfeld.
March 05, 2021 - Submission deadline of the results to the organizers. Each participant is allowed 1 result matrix for each of the shape-only and shape+electrostatics problems. A brief summary to be included in the track report is written by each participant and submitted alongside the results.
March 08, 2021 - The organizers circulate the evaluation of all participants of the track, and release the ground truth.
March 10, 2021 - The organizers send a draft of the track report to the participants for reviews, comments and feedback.
March 15, 2021 - The track report is submitted for review.
April 15, 2021 - Reviews done, first stage decision on acceptance or rejection.
May 15, 2021 - First revision done.
June 15, 2021 - Second stage decision on acceptance or rejection.
June 30, 2021 - Final revision.
July 05, 2021 - Final decision on acceptance.
September, 2021 - Publication online in Computers & Graphics journal.