An overview and invitation to contribute to ChEMBL curation with PPDMs

PPDMs has been in the making for more than a year and is a follow-up to a conference paper we published in 2012. As in 2012, our objective is to map small molecule binding sites to protein domains, the structural units that form recurring building blocks in the evolution of proteins. An application note describing PPDMs is just out in Bioinformatics.

Mapping small molecule binding to protein domains

The mapping facilitates the functional interpretation of small molecule-protein interactions - if you understand which domain in a protein is targeted, you are in a better position to anticipate the downstream effect. Mapping small molecule binding to protein domains also provides a technical advantage to machine-learning approaches that incorporate protein sequence information as a descriptor to predict small molecule bioactivity. Reducing the sequence descriptor to the part that mediates small molecule binding increases the informative content of the descriptor. This is best exemplified by the domain-poisoning problem, illustrated below.

Result of a hypothetical query: the rat Tyrosine-protein phosphatase Syp (P35235) is used as input to a BLAST search against the ChEMBL target dictionary, and one of the hits is the rat Tyrosine-protein kinase SYK (Q64725). The significant E-value for this hit results from high-scoring alignments of the SH2 domains. At the same time, the overlap between the small molecules binding the two proteins is expected to be low.
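The domain-poisoning effect can be sketched in a few lines of Python. This is a toy illustration, not how PPDMs or BLAST work: the domain architectures and binding-domain assignments below are assumptions made for the example, and a Jaccard index over domain names stands in for sequence similarity.

```python
# Toy illustration of domain poisoning.
# ASSUMPTIONS: the architectures and binding domains below are illustrative,
# and Jaccard similarity over domain names is a stand-in for sequence similarity.

ARCHITECTURE = {
    "P35235": ["SH2", "SH2", "Y_phosphatase"],  # rat phosphatase Syp
    "Q64725": ["SH2", "SH2", "Pkinase_Tyr"],    # rat kinase SYK
}

# Assumed domain that mediates small molecule binding in each protein.
BINDING_DOMAIN = {
    "P35235": "Y_phosphatase",
    "Q64725": "Pkinase_Tyr",
}


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def full_similarity(x, y):
    """Similarity when the whole protein is used as the descriptor."""
    return jaccard(ARCHITECTURE[x], ARCHITECTURE[y])


def reduced_similarity(x, y):
    """Similarity when only the binding domain is used as the descriptor."""
    return jaccard([BINDING_DOMAIN[x]], [BINDING_DOMAIN[y]])


full = full_similarity("P35235", "Q64725")
reduced = reduced_similarity("P35235", "Q64725")
print(f"whole-protein descriptor: {full:.2f}")
print(f"binding-domain descriptor: {reduced:.2f}")
```

With the whole protein as descriptor, the shared SH2 domains make the two targets look related; with the descriptor reduced to the assumed binding domain, the spurious similarity disappears.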

A simple heuristic

For individual experiments, it is often trivial to decide which domain was targeted. For example, medicinal chemists know whether their compound is a kinase inhibitor or one of a handful of SH2 inhibitors. This knowledge, while easily gleaned by the expert, is implicit and cannot be accessed programmatically. Hence we were motivated to implement a solution that assigns the targeted domain across as many measured bioactivities as possible.

Our initial implementation of mapping small molecules to protein domains consisted of a simple heuristic: identify domains with known small molecule interactions and use these domains as a look-up when mapping measured bioactivities to protein domains. This process is illustrated in the figure below.

A catalogue of validated domains was extracted from assays against single-domain proteins (step 1, 2) and projected onto measured bioactivities in ChEMBL (step 3). Three possible outcomes are: i) A successful mapping if exactly one of the Pfam-A domain models from the catalogue matches the sequence; ii) No mapping if none of the Pfam-A domain models from the catalogue match the sequence; iii) A conflicting mapping if multiple domain models from the catalogue match the sequence.
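The three steps and three outcomes can be written down as a minimal Python sketch. It is an illustration under stated assumptions rather than the PPDMs implementation: the target identifiers, the `build_catalogue` and `map_activity` helpers, and the input layout (a dict from target to its list of Pfam-A domain names) are all hypothetical, and a "single-domain protein" is taken to mean a target with exactly one domain match.

```python
# Minimal sketch of the look-up heuristic.
# ASSUMPTIONS: target names, helper names and the input layout are hypothetical.


def build_catalogue(architectures, assayed_targets):
    """Steps 1-2: collect domains from assayed targets with exactly one domain."""
    catalogue = set()
    for target in assayed_targets:
        domains = architectures.get(target, [])
        if len(domains) == 1:
            catalogue.add(domains[0])
    return catalogue


def map_activity(target_domains, catalogue):
    """Step 3: classify one target as mapped, unmapped or conflict."""
    hits = sorted(set(target_domains) & catalogue)
    if len(hits) == 1:
        return "mapped", hits      # outcome i)
    if not hits:
        return "unmapped", hits    # outcome ii)
    return "conflict", hits        # outcome iii)


# Hypothetical targets and their Pfam-A domain names.
architectures = {
    "T_DHFR": ["DHFR"],
    "T_TS": ["Thymidylate_synt"],
    "T_KINASE": ["Pkinase"],
    "T_DHFR_TS": ["DHFR", "Thymidylate_synt"],
    "T_SH2_KINASE": ["SH2", "Pkinase"],
    "T_SH2_ONLY": ["SH2"],
}

catalogue = build_catalogue(architectures, ["T_DHFR", "T_TS", "T_KINASE"])

for target in ("T_SH2_KINASE", "T_SH2_ONLY", "T_DHFR_TS"):
    status, hits = map_activity(architectures[target], catalogue)
    print(target, status, hits)
```

In this toy data set the SH2-plus-kinase target maps cleanly to the kinase domain, the SH2-only target stays unmapped because SH2 is not in the catalogue, and the DHFR-TS-like target is flagged as a conflict.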

Despite its simplicity, this method works surprisingly well, because protein domains that are relevant to drug discovery are prioritised in Pfam-A model curation. Another contributing factor is the conservative route taken by many drug discovery projects, which focus on targets in well-characterised protein families. However, as illustrated by the cases labelled ii) and iii), some constellations are not covered by the simple heuristic.

A public platform to review and improve mappings

Measured activities in ChEMBL falling into category iii) from the illustration above amount to only a fraction of the total but often reflect interesting biology. DHFR-TS, for example, is a multi-functional enzyme that combines a DHFR and a Thymidylate_synt domain; it occurs in the bikonts, a group that includes Trypanosoma and Plasmodium. In humans (and all metazoa), these domains occur as separate enzymes.

Small molecule inhibitors exist for both domains, DHFR (yellow, with Pyrimethamine) and Thymidylate synthase (blue, with Deoxyuridine monophosphate).

We built PPDMs as a platform to resolve such cases. PPDMs aggregates information that supports manual mapping assignments based on medicinal chemistry knowledge. New mappings can be committed to the PPDMs logs and then transferred to the ChEMBL database in future releases.

The Conflicts section of the website summarises the conflicts (cases in category iii, as discussed above) that were encountered when the mapping was applied to measured activities in the ChEMBL database, and offers an interface to resolve them.

The Evidence section provides the full catalogue of domains for which we found evidence of small molecule binding. Evidence for the majority of domains in this list is provided in the form of measured bioactivities in ChEMBL, while in a few cases we provide a reference to the literature. These are cases where well-known domains occur exclusively in multi-domain architectures, such as 7tm_2 and 7tm_3. The catalogue can be downloaded in full from this section.

PPDMs also provides logs of individual assignments - these can be queried by date, user and comments left when the assignment was made. A log of all assigned mappings can be downloaded from this section. Another way to review assigned mappings is through the Resolved section, where assignments are grouped by domain architecture.

We invite everyone with an interest in the matter to sign up with PPDMs, whether it's simply for playing around, resolving remaining conflicts, or reviewing existing assignments. Please get in touch and we'll sort out a login for you!

felix
