Wednesday, 30 January 2013

Paper: A Ligand’s-Eye View of Protein Similarity

Gerard and I have just had a News & Views published in Nature Methods (link to the pdf is here). It is a commentary on a paper by Lin et al., which uses metrics derived from pharmacological similarity to cluster proteins; there are some interesting differences from the clustering of the same proteins by sequence similarity. Here's the N&V, and below it is the paper discussed (pdf link here).

%A G. Van Westen 
%A J.P. Overington
%D 2013 
%T A Ligand’s-Eye View of Protein Similarity
%J Nature Methods
%V 10
%P 116-117 
%O doi:10.1038/nmeth.2339

%A H. Lin
%A M.F. Sassano
%A B.L. Roth
%A B.K. Shoichet
%T A pharmacological organization of G protein-coupled receptors
%J Nature Methods
%V 10
%P 140-146
%D 2013
%O doi:10.1038/nmeth.2324


ChEMBL_15 Released

We are pleased to announce the release of ChEMBL_15. This version of the database was prepared on 23rd January 2013 and contains:

1,434,432 compound records
1,254,575 compounds (of which 1,251,913 have mol files)
10,509,572 activities
679,259 assays
9,570 targets
48,735 documents
17 activity data sources

You can download the data from the ChEMBL FTP site.

Please see chembl_15_release_notes.txt for full details of all changes in this release, including important schema changes!

Data changes since the last release:
We have made several major changes/additions to the data in ChEMBL_15:

  • Incorporation of data from the USP Dictionary of USAN and International Drug Names.
  • Incorporation of monoclonal antibody clinical candidates and sequences.
  • Creation of targets for protein complexes and protein families.
  • Standardisation of activity data and identification of potential issues.
  • Annotation of predicted compound binding domains for a subset of activity data.

These data sets are described in more detail in the release notes and will also be the subject of future blog posts. In addition, we have incorporated new data from the following sources:

  • Open TG-GATEs
  • TP-search transporter database
  • MMV Malaria Box screening data
  • GSK Tuberculosis screening data
  • GSK deposited supplementary data
  • DNDi Trypanosoma brucei screening data
  • Harvard malaria screening data
  • WHO-TDR malaria screening data

Database changes since the last release:
This release of ChEMBL contains major changes to the schema and data model, particularly around the representation of protein targets. 

Please see the release notes, ERD and schema documentation for more details of these changes. We will also run a series of webinars over the coming weeks, describing the new schema and the changes.

Interface changes since the last release:
New data tables have been introduced to display search results and bioactivity data. These tables allow users to customise the display and choose which columns they want to include. By default, a standard set of columns is included in the view, but additional columns can be added by clicking on the show/hide button above the table.

A BLAST search for biotherapeutic drugs has been included on the 'Ligand Search' tab (formerly 'compound search'), allowing retrieval of protein drugs by sequence similarity.

The 'Browse Drugs' tab now includes information for monoclonal antibody clinical candidates and compounds with USANs in addition to approved drugs. Additional fields have been added and drug icons have been divided into two sets representing structure-specific information (green) and product-specific information (blue) - the latter are shown only for approved drugs.

(btw the picture above is built from ChEMBL assay descriptions - thanks to George)

Tuesday, 29 January 2013

ChEMBL 15 Schema Changes

ChEMBL_15 will be released this week. As mentioned previously, there will be some major schema changes. For many users, the most significant of these will be:

1) Removal of protein-specific information (e.g., sequences/accessions) from the target_dictionary to a separate 'component_sequences' table. The target_dictionary now includes entries for protein complexes, protein families and other 'group' targets. These then link to their protein components via the target_components table.

2) Removal of the assay2target table. Each assay now links only to a single target (though this target may consist of multiple proteins in the case of a protein complex/family). Information previously included on the assay2target table (tid, confidence_score etc) is now on the assays table.
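As a rough illustration of how the new tables fit together, here is a minimal sketch using an in-memory SQLite database. Only the table names and the tid and confidence_score columns come from the description above; every other column name (pref_name, component_id, accession, assay_id) and all the sample rows are assumptions made up for the example, so please check the schema documentation for the real definitions.

```python
import sqlite3

# Toy schema: table names follow the post, most column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, pref_name TEXT);
CREATE TABLE component_sequences (component_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE target_components (tid INTEGER, component_id INTEGER);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER, confidence_score INTEGER);

-- One hypothetical protein complex made of two protein components.
INSERT INTO target_dictionary VALUES (1, 'Example protein complex');
INSERT INTO component_sequences VALUES (10, 'ACC_A'), (11, 'ACC_B');
INSERT INTO target_components VALUES (1, 10), (1, 11);
-- The assay links straight to its single target; there is no assay2target table.
INSERT INTO assays VALUES (100, 1, 7);
""")


def proteins_for_assay(assay_id):
    """Return the accessions of all protein components of an assay's target."""
    rows = conn.execute(
        """
        SELECT cs.accession
        FROM assays a
        JOIN target_components tc ON tc.tid = a.tid
        JOIN component_sequences cs ON cs.component_id = tc.component_id
        WHERE a.assay_id = ?
        ORDER BY cs.accession
        """,
        (assay_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(proteins_for_assay(100))
```

The point of the sketch is that an assay now reaches its proteins in two hops (assays to target_components to component_sequences), whether the target is a single protein or a complex.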

We have provided a diagram and documentation of the new schema on the chembl ftp site:
ChEMBL_15 release documentation

Please take some time to familiarise yourselves with the changes before integrating the new dataset. Further information will be provided in the release notes, and we will be running a webinar in the next few weeks to explain the changes.

Monday, 28 January 2013

UniChem Released

For data managers of chemistry resources, the maintenance of structure-based links to other chemistry resources can be a tedious chore. The job is all the more burdensome knowing that your counterparts in other chemistry-based resources are essentially duplicating your efforts, in order to keep their links to your resource updated.

In an attempt to remove this duplication of effort, and automate the processes involved, we have developed UniChem, which is described in a recent publication.

Getting structure-based links out of UniChem can be achieved either via the web interface or the web services. For automated updating, using the web services is often the best choice. The current set of web service methods has been designed to allow users several options for how they might obtain links data. Two possibilities are detailed below.

One such option would be to use the following methods. First, query UniChem for all valid src_ids using the ‘GetSrcIds’ method. Then iterate through this list and retrieve, using the ‘GetSourceInfo’ method, all the details of these sources that you require (e.g. the ‘base-url’ for constructing links). Lastly, iterate through the src_id list once more, this time retrieving all the mappings from your source to each of the other sources, using the ‘GetMapping’ method. Combining the results of the second and third queries gives you all the mappings from your compound identifiers to the URLs for the compounds in the other sources. These data can be stored locally, then queried and incorporated into a compound page when required. Periodic refreshes of these local tables, by repeating the above process, would be required to pick up UniChem updates.
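The combining step can be sketched in a few lines of Python. This is only an illustration: the web service calls themselves are not shown, and the function name, the shape of the two input dictionaries, the example.org URLs and the identifiers are all assumptions standing in for whatever ‘GetSourceInfo’ and ‘GetMapping’ actually return.

```python
def build_link_table(source_info, mappings):
    """Combine per-source details with per-source mappings into a link table.

    source_info: {src_id: {"base_url": str}}          (assumed shape)
    mappings:    {src_id: [(my_id, their_id), ...]}   (assumed shape)
    Returns {my_id: [url, ...]}, ready to be stored locally.
    """
    links = {}
    for src_id, pairs in mappings.items():
        base_url = source_info[src_id]["base_url"]
        for my_id, their_id in pairs:
            links.setdefault(my_id, []).append(base_url + their_id)
    return links


# Hypothetical data standing in for the results of the second and third queries.
source_info = {
    "2": {"base_url": "https://example.org/source2/"},
    "3": {"base_url": "https://example.org/source3/"},
}
mappings = {
    "2": [("MY0001", "S2-42")],
    "3": [("MY0001", "S3-7"), ("MY0002", "S3-8")],
}

link_table = build_link_table(source_info, mappings)
print(link_table["MY0001"])
```

Rebuilding this table on a schedule is what the periodic refresh described above amounts to.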

Alternatively, you may wish to create links more dynamically, using, for example, the ‘GetVerboseSrcCpdIdsFromInchiKey’ method. With this method, compound web pages can be populated with all links as the page is requested, by querying UniChem on the fly with the InChIKey. This single query returns a list of the sources that contain valid compound links. For each source, a keyed list describes information such as the ‘base-url’. One of the keys (‘src-compound_id’) maps to an array of compound identifiers in that source. Combining the ‘base-url’ with each of these identifiers gives the required links. See the example of this method in the link immediately above.
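Here is a minimal sketch of that last combining step, assuming the response has already been fetched and parsed into a list of dictionaries. The key names are taken as written in the paragraph above and may not match the service output exactly; the function name, URLs and identifiers are invented for the example.

```python
def links_from_verbose_response(sources):
    """Build one URL per compound id for every source in the parsed response."""
    links = []
    for source in sources:
        base_url = source["base-url"]
        for compound_id in source["src-compound_id"]:
            links.append(base_url + compound_id)
    return links


# Hypothetical parsed response for a single InChIKey query.
response = [
    {"base-url": "https://example.org/a/", "src-compound_id": ["A1"]},
    {"base-url": "https://example.org/b/", "src-compound_id": ["B7", "B9"]},
]

page_links = links_from_verbose_response(response)
print(page_links)
```

Because everything needed to build the links arrives in one response, nothing has to be stored locally, at the cost of one UniChem query per page view.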

Sunday, 27 January 2013

New Drug Approvals 2012 - Pt. XXXV - Elvitegravir/Cobicistat/Emtricitabine/Tenofovir disoproxil fumarate (STRIBILD®)



ATC Code : J05AR09
On August 27, the FDA approved STRIBILD® as a complete regimen for the treatment of Human Immunodeficiency Virus-1 (HIV-1) infection in adults who are antiretroviral treatment-naïve. STRIBILD® is a combination of an HIV-1 integrase strand transfer inhibitor (INSTI), Elvitegravir; a pharmacokinetic enhancer, Cobicistat; and two nucleos(t)ide analog HIV-1 reverse transcriptase (RT) inhibitors (NRTIs), Emtricitabine and Tenofovir disoproxil.

Acquired immunodeficiency syndrome (AIDS) is a disease of the human immune system caused by HIV infection, in which progressive failure of the immune system allows life-threatening opportunistic infections and cancers to thrive. HIV infects and kills vital cells of the immune system such as T helper cells (specifically CD4+ T cells), macrophages and dendritic cells. When CD4+ T cell numbers decline below a critical level, cell-mediated immunity is lost and the body becomes progressively more susceptible to opportunistic infections.

HIV/AIDS is a global pandemic. As of 2012 approximately 34 million people have HIV worldwide. Of these approximately 16.8 million are women and 3.4 million are less than 15 years old. For more information on the disease epidemiology or any other information on HIV/AIDS, check Wikipedia or UNAIDS.

The management of HIV/AIDS typically includes the use of antiretroviral drugs, which are medications for the treatment of HIV infection. Antiretroviral drugs restrain the growth and reproduction of HIV and are broadly classified by the phase of the retrovirus life-cycle that they inhibit.

The life-cycle of HIV (all steps 1 to 6) can be as short as about 1.5 days, and HIV lacks proofreading enzymes. Together, these cause the virus to mutate very rapidly, resulting in high genetic variability. When antiretroviral drugs are used improperly, multi-drug resistant (MDR) strains can become the dominant genotypes. This led to the development of combination therapy, in which several drugs from different classes of antiretrovirals, typically three or four, are taken together; the approach is known as highly active antiretroviral therapy (HAART).

In recent years, many such complex regimens have been developed as fixed-dose combinations. Other examples of fixed-dose combination drugs approved by the FDA for HIV treatment can be found here. One such combination drug is STRIBILD®, a fixed-dose combination of Elvitegravir, Cobicistat, Emtricitabine and Tenofovir DF. Elvitegravir, emtricitabine and tenofovir directly suppress viral reproduction. Cobicistat increases the effectiveness of the combination by inhibiting liver enzymes that metabolise the other components. In this regimen, Elvitegravir and Cobicistat are new molecular entities (NMEs); the other two, emtricitabine (prescribing info.) and tenofovir (prescribing info.), are previously approved NRTI drugs.

Elvitegravir (Research Code: GS-9137; ChEMBL: CHEMBL204656; PubChem: CID 5277135; ChemSpider: 4441060) inhibits the strand transfer activity of HIV-1 integrase, an HIV-1 encoded enzyme that is required for viral replication. Inhibition of integrase prevents the integration of HIV-1 DNA into host genomic DNA, blocking the formation of the HIV-1 provirus and propagation of the viral infection. Elvitegravir does not inhibit human topoisomerases I or II.

IUPAC Name : 6-(3-Chloro-2-fluorobenzyl)-1-[(2S)-1-hydroxy-3-methylbutan-2-yl]-7-methoxy-4-oxo-1,4-dihydroquinoline-3-carboxylic acid
Canonical SMILES : COc1cc2N(C=C(C(=O)O)C(=O)c2cc1Cc3cccc(Cl)c3F)[C@H](CO)C(C)C
Standard InChI : 1S/C23H23ClFNO5/c1-12(2)19(11-27)26-10-16(23(29)30)22(28)15-8-14(20(31-3)9-18(15)26)7-13-5-4-6-17(24)21(13)25/h4-6,8-10,12,19,27H,7,11H2,1-3H3,(H,29,30)/t19-/m1/s1

Cobicistat (PubChem: CID 25151504; ChemSpider: 25084912) is a selective, mechanism-based inhibitor of cytochromes P450 of the CYP3A subfamily. Inhibition of CYP3A-mediated metabolism by cobicistat enhances the systemic exposure of CYP3A substrates, such as elvitegravir, where bioavailability is limited and half-life is shortened by CYP3A-dependent metabolism.

IUPAC Name : 1,3-thiazol-5-ylmethyl [(2R,5R)-5-{[(2S)-2-[(methyl{[2-(propan-2-yl)-1,3-thiazol-4-yl]methyl}carbamoyl)amino]-4-(morpholin-4-yl)butanoyl]amino}-1,6-diphenylhexan-2-yl]carbamate
Canonical SMILES : CC(C)c1nc(CN(C)C(=O)N[C@@H](CCN2CCOCC2)C(=O)N[C@H](CC[C@H](Cc3ccccc3)NC(=O)OCc4cncs4)Cc5ccccc5)cs1
Standard InChI : 1S/C40H53N7O5S2/c1-29(2)38-43-34(27-53-38)25-46(3)39(49)45-36(16-17-47-18-20-51-21-19-47)37(48)42-32(22-30-10-6-4-7-11-30)14-15-33(23-31-12-8-5-9-13-31)44-40(50)52-26-35-24-41-28-54-35/h4-13,24,27-29,32-33,36H,14-23,25-26H2,1-3H3,(H,42,48)(H,44,50)(H,45,49)/t32-,33-,36+/m1/s1

The recommended dose of STRIBILD is one tablet administered orally once a day; each tablet contains 150 mg of elvitegravir, 150 mg of cobicistat, 200 mg of emtricitabine, and 300 mg of tenofovir disoproxil fumarate. Peak plasma concentrations were observed 4 hrs post-dose for elvitegravir (Cmax of 1.7 ± 0.4) and 3 hrs post-dose for cobicistat (Cmax of 1.1 ± 0.4). Elvitegravir is 98-99% bound to human plasma proteins, whereas cobicistat is 97-98% bound. The median terminal plasma half-life of elvitegravir is 12.9 hrs, with 94.8% and 6.7% of the administered dose excreted in feces and urine, respectively. Cobicistat has a plasma half-life of 3.5 hrs, with 86.2% and 8.2% of the administered dose excreted in feces and urine.

Full prescribing information can be found here.

The license holder is GILEAD, and the product website is

MMV 11th Call for proposals - H2L and LO for Malaria Drug Discovery

Many readers of the ChEMBL-og are interested in drug discovery against neglected and rare diseases. One of the great things for us in this field is the opening up of data: there was the almost simultaneous release of primary HTS data from GSK, Novartis and St. Jude in 2011 and, more recently, the results of a GSK HTS for TB. Having these data publicly available means that many smart people can analyse them. Pooling data in this way is also effectively equivalent to running the assay against a far larger compound set, and allows more powerful cheminformatics analysis to identify chemical series, preliminary SAR, etc. Many of these datasets are available in our ChEMBL-NTD and ChEMBL-Malaria archives - and we know 2013 will be a great year for more data just like this! All these data are available for download in the exact form supplied by the depositor, with no accounts/passwords, no lock-in to a software infrastructure and no restrictions - just as it should be. Free the Data to set the World Free of Disease!

One of our partners, Medicines for Malaria Venture (MMV), has recently announced an opportunity to get some real funding to take these data forward to real drugs. Further details of the call are here.

The call is for projects in the hit-to-lead (H2L) and lead optimization (LO) stages for new families of molecules specifically addressing the key priorities of the malaria eradication agenda: transmission blocking via the human host, and prevention of P. vivax relapse through killing of liver stage hypnozoites or reactivating them so as to be killed in the blood stages. In addition, proposals are sought for novel chemical series with a long half-life (ideally > 10 hours in rodents) and confirmed in vivo efficacy that could have potential for well tolerated P. falciparum chemoprophylaxis or asexual blood stage treatment in humans. Any proposals based on existing chemotypes must clearly address known issues.

The deadline for applications is 12 noon CET March 15th,

Saturday, 26 January 2013

New Drug Approvals 2012 - Pt. XXXIV - RaxibacumabTM

ATC Code:
Wikipedia: Raxibacumab

On December 14th 2012 the FDA approved Raxibacumab for the treatment of inhalation anthrax, a form of anthrax caused by the inhalation of anthrax spores. The drug is also approved to prevent inhalation anthrax when alternative therapies are not available or appropriate. Raxibacumab is a 146 kDa monoclonal antibody designed to neutralize the toxin secreted by Bacillus anthracis. The FDA granted raxibacumab fast track designation, priority review, and orphan product designation.

Bacillus anthracis toxin (anthrax toxin) is a secreted three-protein exotoxin. It consists of two enzyme components, lethal factor (LF, PDB 1PWU), a bacterial endopeptidase, and edema factor (EF, PDB 1PWW), a bacterial adenylate cyclase, combined with one cell-binding protein, protective antigen (PA, PDB 1ACC). The individual components are non-toxic; it is the combination of the enzyme components with the cell-binding protein that makes them toxic. PA, in the form of an 83 kDa protein, binds to the anthrax toxin receptor. Upon binding, a 20 kDa fragment is cleaved off the protein. The remaining protein (PA63) self-assembles into a ring-shaped oligomer. This oligomer acts as a pore precursor through which the enzymatic components enter the cell. EF, an 88 kDa protein, acts as a Ca2+- and calmodulin-dependent adenylate cyclase, raising cAMP levels (up to 200-fold in CHO cells) and disturbing water homeostasis in the cell, which in turn disturbs signalling pathways and immune function. LF, an 89 kDa protein, is a Zn2+-dependent endopeptidase that cleaves mitogen-activated protein kinase kinases (MAPKKs). This leads to altered signalling pathways and apoptosis.

(Image adapted from

For ethical reasons, the efficacy of Raxibacumab has not been tested in humans but in monkeys and rabbits. Safety trials were conducted in 326 healthy human volunteers.

Raxibacumab is available as a single-use vial which contains 1700 mg/34 mL (50 mg/mL) raxibacumab injection. Raxibacumab is administered as a single dose of 40 mg/kg intravenously over 2 hours and 15 minutes after dilution in 0.9% Sodium Chloride Injection, USP (normal saline) to a final volume of 250 mL. 

The PK of raxibacumab is linear over the dose range of 1 to 40 mg/kg following single IV dosing in humans. Following single IV administration of raxibacumab 40 mg/kg in healthy male and female human subjects, the mean Cmax and AUCinf were 1020.3 ± 140.6 mcg/mL and 15845.8 ± 4333.5 mcg·day/mL, respectively. Mean raxibacumab steady-state volume of distribution was greater than plasma volume, suggesting some tissue distribution. Clearance values were much smaller than the glomerular filtration rate, indicating that there is virtually no renal clearance of raxibacumab.

The license holder is GlaxoSmithKline and the prescribing information can be found here.

Monday, 21 January 2013

New Drug Approvals 2012 - Pt. XXXII - Bedaquiline (SirturoTM)

ATC Code: J04AK05
Wikipedia: Bedaquiline

On December 28, the FDA approved Bedaquiline (as the fumarate salt; tradename: Sirturo; Research Code: R-403323 (for Bedaquiline Fumarate), R-207910 and TMC-207 (for Bedaquiline)), a novel, first-in-class diarylquinoline antimycobacterial drug indicated for the treatment of pulmonary multi-drug resistant tuberculosis (MDR-TB) as part of combination therapy in adults.

Tuberculosis is an infectious disease caused by the mycobacterium Mycobacterium tuberculosis, which usually affects the lungs. MDR-TB occurs when M. tuberculosis becomes resistant to the two most powerful first-line anti-TB drugs, Isoniazid (ChEMBL: CHEMBL64) and Rifampin (ChEMBL: CHEMBL374478). Bedaquiline is the first anti-TB drug that works by inhibiting mycobacterial adenosine 5'-triphosphate (ATP) synthase (for Uniprot_IDs, click here), an enzyme essential for the replication of the mycobacteria.

ATP is the most commonly used energy currency of cells in most organisms. ATP synthase produces ATP from adenosine diphosphate (ADP) and inorganic phosphate using energy from a transmembrane proton-motive force generated by respiration. The image above depicts a model of the mycobacterial ATP synthase. ATP synthase has two major structural domains, F0 and F1, that act as a biological rotary motor. The F1 domain is composed of subunits α3 (Uniprot: P63673), β3 (Uniprot: P63677), γ (Uniprot: P63671), δ and ε (Uniprot: P63662); the F0 domain includes one a subunit (Uniprot: P63654), two b subunits (Uniprot: P63656) and 9 to 12 c subunits (Uniprot: P63691) arranged in a symmetrical disk. The F0 and F1 domains are linked by central stalks (subunits γ and ε) and peripheral stalks (subunits b and δ). The proton-motive force fuels the rotation of the transmembrane disk and the central stalk, which in turn modulates the nucleotide affinity in the catalytic β subunit, leading to the production of ATP.

It has been shown that mutation in the atpE gene, which encodes the c subunit of the mycobacterial ATP synthase, confers resistance to Bedaquiline. This suggests that Bedaquiline binds crucially to this target (although almost certainly other components of the complex are required for a competent binding site), inhibiting the proton pump of M. tuberculosis and therefore interfering with the rotation properties of the transmembrane disk, leading to ATP depletion.
>ATPL_MYCTU ATP synthase subunit c
Another notable feature is the high specificity of Bedaquiline for mycobacteria. This is due to the fact that there is very limited sequence similarity between the mycobacterial and human atpE proteins.

Bedaquiline is a diarylquinoline antimycobacterial drug, which displays both planar hydrophobic moieties and hydrogen-bonding acceptor and donor groups. It has a molecular weight of 555.50 Da (671.58 for the fumarate salt), an ALogP of 6.93, 4 hydrogen-bond acceptors and 1 hydrogen-bond donor, and is therefore not fully rule-of-five compliant.
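To make the rule-of-five statement concrete, here is a small sketch that applies the usual Lipinski thresholds (molecular weight at most 500, logP at most 5, at most 10 hydrogen-bond acceptors and at most 5 donors) to the values quoted above. The function name is ours, and the ALogP value is simply used in place of the logP of the original rule.

```python
def rule_of_five_violations(mol_weight, logp, h_acceptors, h_donors):
    """Return a list describing each Lipinski rule-of-five criterion that fails."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if logp > 5:
        violations.append("logP > 5")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    return violations


# Values for Bedaquiline (free base) as quoted in the text above.
bedaquiline = rule_of_five_violations(555.50, 6.93, 4, 1)
print(bedaquiline)
```

With these inputs the weight and lipophilicity criteria fail, while the acceptor and donor counts pass.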

Name: (1R, 2S)-1-(6-bromo-2-methoxy-3-quinolinyl)-4-(dimethylamino)-2-(1-naphthalenyl)-1-phenyl-2-butanol
Canonical Smiles: COc1nc2ccc(Br)cc2cc1[C@@H](c3ccccc3)[C@@](O)(CCN(C)C)c4cccc5ccccc45
InChI: InChI=1S/C32H31BrN2O2/c1-35(2)19-18-32(36,28-15-9-13-22-10-7-8-14-26(22)28)30(23-11-5-4-6-12-23)27-21-24-20-25(33)16-17-29(24)34-31(27)37-3/h4-17,20-21,30,36H,18-19H2,1-3H3/t30-,32-/m1/s1
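As a quick sanity check on the molecular weight, the formula layer of the InChI can be turned into an average molecular weight with a few lines of Python. This is a sketch only: the function names are ours, the atomic weight table covers just the elements needed here, and the parser assumes a simple formula with no charges or multiple components.

```python
import re

# Standard atomic weights, rounded; only the elements needed for this example.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Br": 79.904}


def formula_from_inchi(inchi):
    """Return the molecular formula, i.e. the layer after the first '/'."""
    return inchi.split("/")[1]


def molecular_weight(formula):
    """Sum atomic weights for a simple formula such as 'C32H31BrN2O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[element] * (int(count) if count else 1)
    return total


# Only the start of the InChI is needed; the connectivity layers are cut off here.
inchi_prefix = "InChI=1S/C32H31BrN2O2/c1-35(2)19-18-32"
formula = formula_from_inchi(inchi_prefix)
weight = molecular_weight(formula)
print(formula, round(weight, 1))
```

The result comes out at about 555.5, in line with the molecular weight quoted for the free base; small differences in the second decimal place depend on the atomic weights used.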

The recommended dosage of Bedaquiline is 400 mg once daily for 2 weeks followed by 200 mg 3 times per week for 22 weeks with food.

Bedaquiline shows a volume of distribution of approximately 164 L and plasma protein binding of > 99.9%. Bedaquiline is primarily subjected to oxidative metabolism by CYP3A4, leading to the formation of the N-monodesmethyl metabolite (M2), which is 4 to 6 times less active in terms of antimycobacterial potency. It is mainly eliminated in feces, and the mean terminal half-life T1/2 of Bedaquiline and M2 is approximately 5.5 months.

The license holder is Janssen Therapeutics and the full prescribing information of Bedaquiline can be found here.


Friday, 18 January 2013

DjangoCon - Vote for ChEMBL!!!!

We have a talk entered for DjangoCon Europe 2013 - and there is a vote underway for this - the title of the talk you may wish to vote for is "Do you feel the chemistry? Developing scientific applications with Django." which is, you probably agree, a pretty interesting subject. You will need a github account to vote - but being hip cats you'll have one already.

I must point out that this post has nothing to do with the smash-hit blockbuster film Django Unchained from the superstar director Quentin Tarantino (but search engines may well be too stupid to realise this and bump the rank of this post).

Update: Voting is now closed.


Wednesday, 16 January 2013

Where should you/can you publish your ChEMBL research?

Well, we've got to about 125 citations(1) for the main ChEMBL database paper so far, which for a year is a pretty good haul, we think. Given this reasonably big number, we thought it would be appropriate to analyse where the use of ChEMBL makes its way into the published literature - or what is our 'research user community'(2). A simple way to analyse this is to look at papers that cite ChEMBL, grouped by journal. The graph is below - it's a classic log-normal/power law style frequency-class distribution.

So J Chemical Information & Modelling (JCIM) is about 20% of all citations, and could indicate that the biggest early impact of ChEMBL is in the development of novel methods for compound design - which was one of our hopes for what our work and the ChEMBL data could achieve - better, safer drugs, quicker! Then there's the database community in Nucleic Acids Research (this is quite an unusual journal for comp chemists and modellers, but it is the de facto (and highest profile) place to publish "resource" papers in the life sciences, and it's a completely Open journal (3)) - so the data is being used and integrated elsewhere; then J Med Chem - the premier medicinal chemistry journal, and so on. It is also notable that ChEMBL has contributed centrally to two Nature full Articles this year (covered in earlier posts) - and given how infrequently chemistry makes the pages of the mighty Nature and Science, this is great news for us, and probably good for the entire community with respect to profile and awareness of the field!!
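For anyone wanting to repeat this kind of breakdown on their own citation list, the grouping itself is a one-liner with a counter. The sketch below uses a made-up toy list of journal names, not our actual citation data, and the function name is ours.

```python
from collections import Counter


def journal_shares(citing_journals):
    """Rank journals by number of citing papers, with each journal's percentage share."""
    counts = Counter(citing_journals)
    total = sum(counts.values())
    return [(journal, n, 100.0 * n / total) for journal, n in counts.most_common()]


# Toy input: one journal name per citing paper (hypothetical, not real data).
toy_citations = [
    "J Chem Inf Model",
    "Nucleic Acids Res",
    "J Chem Inf Model",
    "J Med Chem",
    "Nucleic Acids Res",
]

ranked = journal_shares(toy_citations)
for journal, n, share in ranked:
    print(journal, n, share)
```

Plotting the counts in rank order gives the kind of frequency-class distribution described above.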

It's interesting to see the strong trend to JCIM - this probably means that they have a receptive set of reviewers and know how to route stuff to the right people (of course if they then reject 95% of all ChEMBL citing papers that's not such good news).

So what next - it got us thinking about how we would expect ChEMBL to impact the field/literature long term - it's really really unlikely that papers that use methods and further integrated data and discover drugs will ever cite the ChEMBL NAR paper. But we will try and track the ripples that ChEMBL makes over time......

A few notes.
1) Citation data is from Google Scholar. But c'mon google - give us an API.
2) We know that many people who use ChEMBL are not really interested in publishing, that they are not free to publish their work, or that they don't have the time to publish, alongside all the other junk they have to deal with.
3) Open, Closed, Gold, Green, Good, Evil, Cow, Horse..... The ChEMBL NAR paper itself (the one that has the 125 citations analysed above) is Open Access, and the entire ChEMBL database team is solely funded by The Wellcome Trust (including my position), so we are under the obligations of their requirements for Open Access publishing. We cannot of course influence where researchers publish use of ChEMBL (and there are many publications that use the ChEMBL data that do not cite our papers :( ), but they will be under their own funders' requirements - and remember that not all research is tax-payer (or similar) funded, so not all funders are as motivated to worry about Open Access, especially if it is yet another cost. So unfortunately, not all the papers that use ChEMBL are Open Access. But if you can, publish all your research and reviews Open Access - go on, it will make you smile and dogs in the street will like you!

Update - I've (jpo) done a bit of editing on this post overnight - I rushed it yesterday to catch a train, and thought that some additional context and comment was required.

jpo and francis

Reminder: Pipeline Pilot Cambridgeshire UGM

This is a gentle reminder for the Cambridgeshire Pipeline Pilot Users Group Meeting that will take place on Thursday 17th January 2013 (aka tomorrow), at 3pm here at the ChEMBL HQ.

This is the agenda for the meeting:

1. Welcome and Host talk:  George Papadatos + Gerard van Westen
      Cool things with Pipeline Pilot and ChEMBL
2. Peter Woollard (GSK)
    Using Pipeline Pilot for computational biology capabilities, where it has helped the most and where it is less used.
3. Richard Carter (Oxford Nanopore Technology):
      PP on a memory stick
4. Mike Cherry (Accelrys) :
      Repetitive Data Flow
5. Question and Answer session including:
   - how people have found NGS components  and TAC components
6. Willem van Hoorn (Accelrys)
      Matched Molecular Pairs
7. Adrian Stevens (Accelrys)
      Upcoming chemistry components in PP9.0

There's still time so if you fancy attending, drop us a line.


Monday, 14 January 2013

Paper: UniChem

We have just had a paper published on UniChem - simple name, simple functionality, but we love it, and it has become the way that we map ChEMBL to other data sources and keep things linked in real time, and also keep the ChEMBL molecule tables manageable. It's published in the Open Access Journal of Cheminformatics.

There is an interface on the above UniChem link, but for most use we anticipate REST web services access - details are on the link above.

The link to the provisional pdf is here.

One of the jolly blog pixies is writing a blog post showing some use cases for UniChem - and I have a lovely thing called "Chive" to tell you about in a few weeks!

%T UniChem: a unified chemical structure cross-referencing and identifier tracking system
%A J. Chambers
%A M. Davies
%A A. Gaulton
%A A. Hersey
%A S. Velankar
%A R. Petryszak
%A J. Hastings
%A L. Bellis
%A S. McGlinchey
%A J.P. Overington
%J Journal of Cheminformatics 
%D 2013
%V 5
%O doi:10.1186/1758-2946-5-3

Saturday, 12 January 2013

New Drug Approvals 2012 - Pt. XXXIII - Apixaban (ELIQUIS®)

ATC code : B01AF02
Wikipedia : Apixaban

On December 28, the FDA approved Apixaban (Trade Name: ELIQUIS®; ChEMBL: CHEMBL231779; KEGG: D03213; ChemSpider: 8358471; DrugBank: DB07828; PubChem: CID 10182969) as an anticoagulant for prevention of venous thromboembolism and related events, indicated to reduce the risk of stroke and systemic embolism in patients with non-valvular atrial fibrillation.

Atrial fibrillation (AF) is the most common cardiac arrhythmia (irregular heart beat). The American College of Cardiology (ACC), American Heart Association (AHA) and European Society of Cardiology (ESC) distinguish several classes of AF, one of which is non-valvular AF: AF in the absence of rheumatic mitral valve disease, a prosthetic heart valve, or mitral valve repair (that is, AF not caused by a heart valve problem). AF increases the risk of stroke, which can be up to seven times that of the average population, and is one of the major cardiogenic risk factors for stroke. For instance, in patients with inappropriate or abnormal blood clotting (a coagulation disorder), clots can form in the heart and easily find their way to the brain, resulting in stroke.

Coagulation (thrombogenesis) is the process by which blood forms clots. The coagulation cascade has two pathways that lead to fibrin formation: the intrinsic pathway and the extrinsic pathway. Each pathway is a series of reactions in which a zymogen of a serine protease and its glycoprotein co-factor are activated to become active components that then catalyze the next reaction in the cascade, ultimately resulting in cross-linked fibrin. Apixaban belongs to the direct factor Xa inhibitor ('xaban') class of anticoagulant drugs, which act directly on Factor X (FX) in the coagulation cascade without antithrombin as a mediator.

Apixaban is a reversible and selective active-site inhibitor of Factor Xa (FXa). It does not require antithrombin III for antithrombotic activity. Apixaban inhibits free and clot-bound FXa, and prothrombinase activity. Apixaban has no direct effect on platelet aggregation, but indirectly inhibits platelet aggregation induced by thrombin. By inhibiting FXa, apixaban decreases thrombin generation and thrombus development.

The PDBe entry (PDBe : 2p16) for the crystal structure for human Factor X (chain A & chain L) in complex with Apixaban (blue-green - molecule shaped) is shown above.

IUPAC Name : 1-(4-methoxyphenyl)-7-oxo-6-[4-(2-oxopiperidin-1-yl)phenyl]-4,5,6,7-tetrahydro-1H-pyrazolo[3,4-c]pyridine-3-carboxamide
Canonical SMILES : COc1ccc(cc1)n2nc(C(=O)N)c3CCN(C(=O)c23)c4ccc(cc4)N5CCCCC5=O
Standard InChI : 1S/C25H25N5O4/c1-34-19-11-9-18(10-12-19)30-23-20(22(27-30)24(26)32)13-15-29(25(23)33)17-7-5-16(6-8-17)28-14-3-2-4-21(28)31/h5-12H,2-4,13-15H2,1H3,(H2,26,32)

Apixaban is available for oral administration at doses of 2.5 mg and 5 mg. It displays prolonged absorption, with bioavailability of ~50% for doses up to 10 mg. Plasma protein binding was estimated to be ~87% and Vss is ~21 liters. Apixaban is metabolized mainly via CYP3A4, with minor contributions from CYP1A2, CYP2C8, CYP2C9, CYP2C19 and CYP2J2. Approximately 25% of Apixaban is recovered in urine and faeces. Despite a short clearance half-life of about 6 hrs, the apparent half-life is 12 hrs, due to the prolonged absorption phase; renal excretion accounts for 27% of the clearance.

Apixaban comes with a boxed warning about the risks of discontinuing the drug and how to manage them. One other direct factor Xa inhibitor, Rivaroxaban (ChEMBL : CHEMBL198362, ATC code : B01AX06, PubChem : CID6433119), was approved by the FDA in 2011 as the "first in class" FXa inhibitor (see one of our old blog posts, here); it carries a similar boxed warning, along with one for spinal/epidural hematoma in surgical settings.

The license holder is Bristol-Myers Squibb, and the product website is

Full prescribing information can be found here.


Privacy and the ChEMBL Database

Privacy is pretty important - for example, in the picture above I have protected the privacy of two colleagues, as I think I should ;) In fact I've even made sure that the black box securing their identities is not a layer on the image that can be trivially removed.....

Chemistry is a little different to some other areas of life-science research, and there is a little more caution applied typically in the use of 'public' database systems by people working on chemical structures - primarily because of patenting and novelty. There are probably similar privacy/security concerns over sequence data too - and in ChEMBL we've covered that too. I'm not going to drift on to what constitutes a 'publication', and all that sort of stuff since 1) I'm not qualified, 2) I don't have the time (and 1) anyway), and 3) it attracts trolls (and 1) and 2) anyway).

I have been asked for a talk through on the usage and query privacy of ChEMBL as part of the great OpenPHACTS project for some time; so here it is - to make it clear - I'm not an expert, but I do worry about these things, and I read a lot. Any feedback or suggestions would be great in the comment section.

ChEMBL is hosted on production machines at a pair of physically separated load-balanced Class 3 data centers in London. These are pretty close to one of the main Internet backbones in the UK, so reliability, latency and throughput are pretty good. The ChEMBL database and application are automatically loaded from a staging system at Hinxton. Once they leave our staging area, we can't access the production data/server at all; in fact only a small number of named staff, using all sorts of access control and logging mechanisms, can get into the machine rooms.

You may have noted that we use https: on the ChEMBL url above - even if you try and force use of http: to access the server, it will switch you over to https: (go on try it, I told you so). This ensures 1) that the server you access really is the genuine ChEMBL server (you should see a little lock in the corner of your browser), and 2) that the traffic between your client and our server is encrypted, so no one can simply sit on the same network as you, listening to all your queries. So this is pretty secure; the TLS standard used by https: is relied on by essentially everyone who implements secure and private web sites. It takes a little care to actually get https: to work properly - a common reason for non-validation (so the little padlock doesn't appear) is the use of http: links on the nominally https: source page, or http: links to third party sites such as advertisers.
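
The http:-on-an-https:-page problem is easy to check for mechanically. Below is a rough sketch using only the Python standard library. It assumes that the "http: links" that matter are embedded resources (anything with a src attribute, plus link tags such as stylesheets) rather than ordinary clickable links, and the example page and its URLs are invented for illustration.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collects embedded resources that are fetched over plain http:."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            # Rough rule: any 'src' attribute, or 'href' on a <link> tag,
            # is treated as an embedded resource. Ordinary <a href> links are ignored.
            is_resource = name == "src" or (tag == "link" and name == "href")
            if is_resource and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_insecure_resources(html: str):
    """Return (tag, url) pairs for resources referenced over plain http:."""
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure


# A made-up page: one insecure third-party script, everything else is fine.
page = """
<html><head>
<link rel="stylesheet" href="https://example.org/style.css">
<script src="http://ads.example.com/tracker.js"></script>
</head><body>
<img src="/images/logo.png">
<a href="http://example.com/">an ordinary link</a>
</body></html>
"""

print(find_insecure_resources(page))
```

Here only the third-party script is reported; the relative image path and the plain hyperlink are left alone. Real browsers apply more detailed rules than this, so treat it as a first-pass check only.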

We don't (currently) have a green bar in the browser for this https: service - the green bar (or something similar depending on your browser) comes from the use of an Extended Validation Certificate (EVC). For these, you and your Certification Authority need to do a little more paperwork, and then spend a little more money. There is no difference in the technical security - the little padlock is the mark of security, not the green bar; the bar just shows that the certificate authority has done some more work to validate that you really physically are who you say you are and so on. At the moment, sites like PayPal and so forth have EVCs, but they will no doubt spread, as the public starts to associate only sites with a green url bar with 'enhanced' security, and assume that the green thing is The Mark of website safety.

We do not use accounts to access the ChEMBL website - there is no need for the things we do - any personalisation is done via cookies saved on your machine in the cookies folder (we have an Institutional cookie policy too, that describes what cookies we will write on your computer). It is not straightforward to implement good password systems, as many large professional internet companies have amply shown (LinkedIn - I'm thinking of you!), and for us we don't need them for ChEMBL, so we haven't bothered.

There is also an Institutional Privacy Policy which covers a broad range of personal type data across all our activities (including recruitment, etc).

There is an Institutional Terms of Use for all institute resources. There is usage logging performed on the servers for internally reviewing the use of our services, or for spotting of problems (like DOS attacks, innocent scripting that can look like a DOS (Ben ;) )) and to collect statistics (like total usage, distribution of users, etc), to track enhanced usage following interface/data addition (this makes us feel good sometimes, it's nice to know our things are used). This data is all private, and is forbidden from being shared other than at aggregate level with third parties/collaborators.

The ChEMBL web application is written to not store any user queries (chemical structures or sequences, or text queries), other than storage required for application and database performance - so for example some automatically flushed, short-lifetime caches that are part of Oracle, and as I've said above, we don't have access to these anyway on the production servers.

We do not run Google Analytics on our ChEMBL application (but some of the Institute's services do, and we do on the ChEMBL-og) - it is tempting to use GA for the fancy plots and maps, but what it means is that a third party (Google for GA) will be seeing all the query IP source addresses and url strings. Google already know enough about me, they don't also need to know I have a late night penchant for 4-amino-anilines as well.

So, if I was to extract some general principles from the above:
  • Use https: for everything - there's no real cost over http:, and make sure it validates!
  • Have a clear and easy to find Terms of Use.
  • Have a clear and easy to find license for any data.
  • Have a cookie policy and explain to users what the cookies you use are.
  • Have a privacy policy.
  • Keep your security certificates up to date.
  • Do not store any user queries for later analysis.
  • Think carefully before placing a user account system on your software - does it really need it? If you do need to implement one (for example, your application has user-uploaded data, has complex long-running queries, or stores intermediate results), read widely and plan defensively before you do.
  • If you use third party analytics tools, make sure that your users know this, and if privacy is a concern to you, make sure you're also familiar with their ToU.
  • If you deploy things 'on the cloud' - read the agreement and T&Cs that you have with the company for your use of their services. Usually they do a very good job of dodging any responsibility, and sometimes grant themselves rights you would not expect. (We don't use third party cloud provision for any of our services - but we do use the cloud for some data entry portals. For these we're not doing anything that really requires great privacy, since once we've entered the data, we give it away anyway). And once you've read the T&Cs, read them again.
  • ChEMBL is typically "tighter" than our Institute policies, but I think it's too confusing to make this specifically clear.....
Update - two things, 1) we do have a privacy policy specific to ChEMBL on our page and 2) The readers of the ChEMBL-og are very smart people, really you are. My attempt at protecting the privacy of one of the fellas above was woeful - I left his name badge in plain view! Doh! Sorry.

Friday, 11 January 2013

Paper: Fuelling Open-Source Drug Discovery: 177 Small-Molecule Leads against Tuberculosis

As announced last year, some of our collaborators in GSK Tres Cantos have just published the results of a large antimycobacterial phenotypic screening campaign against Mycobacterium bovis BCG with hit confirmation in M. tuberculosis H37Rv. After the screening and in silico cascade, a set of 177 potent non-cytotoxic H37Rv hits was identified, providing a plethora of diverse potential starting points for new synthetic lead-generation activities to the global scientific community.

The dataset is hosted in ChEMBL and can be downloaded from here with a short description here.

%T Fueling Open-Source Drug Discovery: 177 Small-Molecule Leads against Tuberculosis
%A L. Ballell
%A R.H. Bates
%A R.J. Young
%A D. Alvarez-Gomez
%A E. Alvarez-Ruiz
%A V. Barroso
%A D. Blanco
%A B. Crespo
%A J. Escribano
%A R. González
%A S. Lozano
%A S. Huss
%A A. Santos-Villarejo
%A J.J. Martín-Plaza
%A A. Mendoza
%A M.J. Rebollo-Lopez
%A M. Remuiñan-Blanco
%A J.L. Lavandera
%A E. Pérez-Herran
%A F.J. Gamo-Benito
%A J.F. García-Bustos
%A D. Barros
%A J.P. Castro
%A N. Cammack
%J ChemMedChem


New Drug Approvals 2012 - Pt. XXXI - Lomitapide (Juxtapid™)

ATC Code: C10AX12
Wikipedia: Lomitapide

On December 21st, the FDA approved Lomitapide (Tradename: Juxtapid; Research Codes: BMS-201038-04, BMS-201038, AEGR-733), a Microsomal triglyceride transfer protein (MTP) inhibitor, as a complement to a low-fat diet and other lipid-lowering treatments, in patients with homozygous familial hypercholesterolemia (HoFH).

Familial hypercholesterolemia is a genetic disorder characterised by high levels of cholesterol-rich low-density lipoproteins (LDL-C) in the blood. The condition is generally attributed to a mutation in the gene for the LDL receptor (LDLR), which mediates the endocytosis of LDL-C.

Lomitapide, through the inhibition of the microsomal triglyceride transfer protein in the liver, prevents the assembly of apolipoprotein B-containing lipoproteins, which is required for the formation of LDLs, and so helps to lower circulating LDL-C levels.

The Microsomal triglyceride transfer protein, which resides in the lumen of the endoplasmic reticulum, is a heterodimer composed of the microsomal triglyceride transfer protein large subunit (Uniprot: P55157; ChEMBL: CHEMBL2569), and the protein disulfide isomerase. Lomitapide binds to the large subunit.

>MTP_HUMAN Microsomal triglyceride transfer protein large subunit

There are no known 3D structures for this protein.

Lomitapide (IUPAC: N-(2,2,2-trifluoroethyl)-9-{4-[4-({[4'-(trifluoromethyl)biphenyl-2-yl]carbonyl}amino)piperidin-1-yl]butyl}-9H-fluorene-9-carboxamide; Canonical SMILES: FC(F)(F)CNC(=O)C1(CCCCN2CCC(CC2)NC(=O)c3ccccc3c4ccc(cc4)C(F)(F)F)c5ccccc5c6ccccc16; PubChem: 9853053; Chemspider: 8028764; ChEMBL: CHEMBL354541; Standard InChI Key: MBBCVAKAJPKAKM-UHFFFAOYSA-N) is a synthetic compound with a molecular weight of 693.7 Da, nine hydrogen bond acceptors, two hydrogen bond donors, and an ALogP of 7.79. The compound is therefore not compliant with the rule of five.
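
The rule-of-five conclusion can be spelled out with a few lines of code. This is a minimal sketch using the commonly quoted thresholds (molecular weight of at most 500 Da, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors, logP of at most 5), with the property values taken from the paragraph above. The function name is hypothetical, and using ALogP in place of the calculated logP of the original rule is an assumption.

```python
def rule_of_five_violations(mol_weight: float, hbd: int, hba: int, logp: float) -> list:
    """Return the list of rule-of-five criteria that a compound breaks."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if hbd > 5:
        violations.append("more than 5 hydrogen bond donors")
    if hba > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if logp > 5:
        violations.append("logP > 5")
    return violations


# Lomitapide values as quoted above (ALogP used as the logP estimate).
lomitapide = rule_of_five_violations(mol_weight=693.7, hbd=2, hba=9, logp=7.79)
print(len(lomitapide), "violations:", lomitapide)
```

Lomitapide breaks two of the four criteria (molecular weight and logP), while its donor and acceptor counts are within limits. A compound is usually regarded as non-compliant once it breaks more than one criterion.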

Lomitapide is available in capsule form and the recommended starting daily dose is 5 mg, with the possibility of gradually increasing it, based on acceptable safety and tolerability, up to a maximum of 60 mg. It has an apparent volume of distribution of 985-1292 L, upon oral administration of a single 60-mg dose, and its absolute bioavailability is 7%. Lomitapide binds extensively to plasma proteins (99.8%). The mean terminal half-life (t1/2) of lomitapide is 39.7 hours, and it is mainly metabolised by CYP3A4. This reliance on CYP3A4 for metabolism leads to multiple opportunities for drug-drug interactions with both CYP3A4 inhibitors and inducers; therefore, when combining lomitapide with other lipid-lowering therapies, e.g. statins, a dose adjustment is required.

Lomitapide has been given a black box warning due to an increase in transaminase (alanine aminotransferase [ALT] and/or aspartate aminotransferase [AST]) levels after exposure to the drug.

The license holder for Juxtapid™ is Aegerion Pharmaceuticals, and the full prescribing information can be found here.