Tuesday, 27 October 2015

Advanced keyword and structure searches with SureChEMBL

Previously in the SureChEMBL series, we described how to access SureChEMBL data in bulk, offline and locally. So, you may ask, what is the point in using the SureChEMBL web interface? Well, how about the unprecedented functionality that allows you to submit very granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus - at the same time?

Let’s see each one separately first:

Lucene-powered keyword searching

You may use the main text box for simple keyword-based patent searches, such as ‘Apple’, ‘diabetes’ or even 'chocolate cake' (the patent corpus as a recipe book is a new use-case here). You will get a lot of results and probably a lot of noise. With Lucene fields, you can slice and dice a query by indicating specific patent sections and bibliographic metadata, such as date/year of filing or publication, assignee, patent classification code, patent authority, etc. For example, to search for the term ‘diabetes’ only in the abstract of patents, you can search with:

ab:diabetes

where ab is the Lucene query field for abstract. For a full list of Lucene query fields, see here. Furthermore, you can combine these fields with boolean operators (AND, OR, NOT, always in UPPER case) and brackets. For example, to find US patents published in 2014 which also mention the word ‘diabetes’ in the title or abstract, you could search with:

(ttl:diabetes OR ab:diabetes) AND pdyear:2014 AND pnctry:US

or even limit it to more med-chem relevant patent hits by using the appropriate IPC hierarchical classification codes:

(ttl:diabetes OR ab:diabetes) AND ic:(C07D AND (A61K OR A61P)) AND pdyear:2014 AND pnctry:US

Is that all? No, you could also use wildcards, such as * and ?, as well as proximity searches:

(ttl:diabet* OR ab:diabet*) AND pdyear:2014 AND pnctry:US
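The queries above follow a regular pattern, so they can be assembled programmatically. The following is a minimal sketch in Python that only builds query strings; the helper names (field, grouped, any_of, all_of) are hypothetical and not part of SureChEMBL, and the resulting string would still need to be pasted into the search box.

```python
def field(name, value):
    """Return one Lucene clause such as ab:diabetes; quote values containing spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{name}:{value}"


def grouped(name, expression):
    """Apply one field to a bracketed sub-expression, e.g. ic:(C07D AND ...)."""
    return f"{name}:({expression})"


def any_of(*clauses):
    """Join clauses with OR inside brackets."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND (operators must be upper case)."""
    return " AND ".join(clauses)


# Wildcard query from the text: 2014 US patents mentioning diabet* in title or abstract.
query = all_of(
    any_of(field("ttl", "diabet*"), field("ab", "diabet*")),
    field("pdyear", "2014"),
    field("pnctry", "US"),
)

# IPC-restricted variant, reusing the same building blocks.
ipc = grouped("ic", all_of("C07D", any_of("A61K", "A61P")))
medchem_query = all_of(
    any_of(field("ttl", "diabetes"), field("ab", "diabetes")),
    ipc,
    field("pdyear", "2014"),
    field("pnctry", "US"),
)

print(query)  # (ttl:diabet* OR ab:diabet*) AND pdyear:2014 AND pnctry:US
```

Building the string from small pieces keeps the brackets balanced and the operators in upper case, which are the two easiest things to get wrong when typing long queries by hand.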

A couple of things worth pointing out here:
1) in the way described above, you may search not only the chemically-annotated (EP, US, WO, JP patents) or chemically-relevant corpus but any patent within SureChEMBL’s broad coverage, such as French, German, British, Chinese, Australian, Canadian, etc., patents about any topic:

pa:"Apple Inc" AND ab:vehicle AND pnctry:CN

For such cases, just remember to check the 'All authorities' box on the right-hand side panel.
2) If the Lucene query syntax seems too complicated, almost the same functionality is available via a more user-friendly field-based widget called Fielded Search:

ChemAxon-powered structure searching

To begin with, SureChEMBL provides basic substructure and similarity searches against its current set of 17 million chemical structures, powered by ChemAxon’s JChem technology. Some of you may have noticed that we have recently done some refurbishment around the sketchers and we now provide the latest MarvinJS sketcher as the sole source of structure input. We also removed the manual entry box, as it is superseded by the functionality described below. Behind the scenes, we use the native ChemAxon inter-conversion functionality to ensure maximum compatibility and minimum information loss during structure conversions. The good news is that you can input a structure in several ways (besides sketching it from scratch), e.g. SMILES, SMARTS, CML, InChI, Molfile and IUPAC/trivial name. Just click and paste your string into the MarvinJS sketcher, or open the import dialogue and paste it there, or even upload a file. More importantly, you may now take advantage of more advanced query features, such as (NOT) atom and bond lists, explicit hydrogens, as well as the Markush-friendly position variation and repetition ranges.

For example, this is a query that combines atom, not atom, and bond lists, as well as explicit hydrogens to control substitution:

Or this one, which combines position variation and linker repetition range:

Again, don't forget that you have additional control over the MW range of the search hits, as well as their exact location in the patent document (title, abstract, claims, description, images/molfiles).

Combined keyword and structure searching

Finally, as mentioned in the beginning, you can easily submit combined keyword and structure queries, such as this one:

To our knowledge, there's no other freely available patent searching resource or interface out there providing this type of functionality but we're happy to stand corrected...

As usual, for any questions or feedback, drop us a line.

George and Nathan

Monday, 19 October 2015

Is ChEMBL down or is it just me?

Have you ever wondered whether your favorite resource of bioactive molecule data is down, or whether some temporary network issue is making it unavailable from your end? There are many online tools and websites that can help in such cases. We, however, now provide a much better solution: the ChEMBL status page.

As you may notice, the status page is hosted on GitHub, so it is outside of the EBI infrastructure. This means that even when ChEMBL core websites are down, you should still be able to see the status page (assuming that GitHub is online, which is a quite reasonable assumption, despite occasional incidents). We've placed a link to the status page at the bottom of the left-side navigation menu on the main ChEMBL web page, as it provides some useful information even when everything is fine.

The status page presents information about the health of ChEMBL's most critical resources (main web interface, REST API, ADME SARfari, SureChEMBL, UniChem and more) along with cumulative availability data grouped by time (last day, week, month, year and all time). As you can see from the data presented on the status page, we have some pretty impressive availability rates: more than 99% for every monitored resource!

For those of you interested in the technical details: we use a service called Uptime Robot to collect availability data. Uptime Robot allows you to define up to 50 monitors (each monitoring a single URL) for free. It also provides an API to retrieve the collected data and present or share it online without having to visit the Uptime Robot webpage.
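As a rough illustration of what retrieving that data could look like, here is a minimal Python sketch. It assumes the request parameters and response shape of Uptime Robot's v2 getMonitors endpoint (custom_uptime_ratios, friendly_name, custom_uptime_ratio), which should be checked against the current API documentation; it is not the code behind the ChEMBL status page, and the monitor names and percentages in the sample payload are invented for illustration only.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.uptimerobot.com/v2/getMonitors"  # assumed v2 endpoint
WINDOWS = (1, 7, 30, 365)  # days: last day, week, month, year


def fetch_monitors(api_key):
    """POST to the API and return the decoded JSON (defined but not called here)."""
    body = urllib.parse.urlencode({
        "api_key": api_key,
        "format": "json",
        "custom_uptime_ratios": "-".join(str(days) for days in WINDOWS),
    }).encode()
    request = urllib.request.Request(API_URL, data=body)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarise(payload):
    """Map each monitor name to {window_in_days: uptime_percent}."""
    summary = {}
    for monitor in payload.get("monitors", []):
        # Ratios arrive as one dash-separated string, in the order requested.
        ratios = [float(r) for r in monitor["custom_uptime_ratio"].split("-")]
        summary[monitor["friendly_name"]] = dict(zip(WINDOWS, ratios))
    return summary


# Invented sample payload in the assumed response shape (not real ChEMBL data).
sample = {
    "stat": "ok",
    "monitors": [
        {"friendly_name": "REST API", "custom_uptime_ratio": "100.000-99.950-99.900-99.800"},
        {"friendly_name": "SureChEMBL", "custom_uptime_ratio": "100.000-100.000-99.700-99.500"},
    ],
}

summary = summarise(sample)
for name, ratios in summary.items():
    print(name, ratios)
```

Keeping the network call separate from the parsing step means the summary logic can be tried out on a saved response without an API key.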

There is an open source JavaScript widget called Upscuits, which provides a nice overview of the data collected by Uptime Robot. Since the widget is written in JavaScript, it can be hosted in any environment that serves static pages. The ChEMBL team uses GitHub for hosting our open source repositories anyway, so GitHub Pages was an obvious choice.

We have been using the ChEMBL GitHub Organisation page for quite some time to mirror posts from this blog (we use Jekyll to do this), so creating another simple website with a status dashboard provided by Upscuits/Uptime Robot was a breeze. We hope the new page will help diagnose any availability issues that may occur.

Tuesday, 13 October 2015

Paper: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Our collaborators at GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents and includes patent chemistry sources derived by automated text-mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful, detailed overview of SureChEMBL's chemistry annotation specifications.

While conducting this study, we realised that this task is far from trivial for several reasons: 
  • The patent corpus is inherently noisy, ambiguous and error-rich.
  • There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents.
  • Not all the chemistry found in a patent document is of equal importance.
  • Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue.
  • There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted from relevant full text patent documents. Recently, there have been several attempts towards text-mining standards provided by BioCreative and publications such as this one, which offer position and type of chemical named entities but not converted structures.  
  • The commercial patent chemistry vendors do not disclose their extraction specifications, which makes any comparisons even harder.

Here is the background:

First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

%T Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents
%A S. Senger
%A L. Bartek
%A G. Papadatos
%A A. Gaulton

%J Journal of Cheminformatics
%D 2015
%O doi:10.1186/s13321-015-0097-z

George and Anna