Why We Need a Medical Patentome Array

Patent databases are proliferating. Every IP document published by the major offices (and by an increasing number of smaller ones) can be downloaded free of charge, because patent law demands disclosure of information in exchange for granting exclusivity. In addition to all the usual Boolean searches for inventors, assignees and IPC classification codes, the public databases maintained by the U.S., European and PCT offices allow you to search the full text. Google Patents offers bitmap scans of U.S. patents back to the late 18th century and has put everything granted by the USPTO since 1976 through an optical character recognition (OCR) process; the European ESPACENET has been doing the same using other resources.

Then there are dozens of vendors who provide value-added services on a subscription basis. In most cases these are more or less ingenious applications of semantic search technologies, which have been evolving like mad since the millennium. SureChem offers automatically extracted chemical information. Entire meta-services, such as Intellogist, exist solely to keep track of this landscape and to comment on the products’ features.

And yet none of these services comes close to making full use of the scientific and technical information in IP documents in the way, and to the extent, that medical researchers need. This is because all patent databases are designed around the needs of intellectual property professionals, and those needs are very different from the needs of people who want to treat patent documents the same way as peer-reviewed papers, namely as a source of scientific and technical information.

Working off OCRed paper documents has severe technical limitations. Look at how chemical names are rendered: no chemistry software will make sense of them, because no OCR software has dictionaries or correction algorithms for chemistry. Go back to the 1980s and you have typewritten patents; the OCR output is so error-riddled even for normal text that you barely get an idea of the content. You have chemical formulas drawn with stencils or even by hand, and reaction schemes that combine all of this. A good example of the gibberish that results can be seen here. Mind you, all that is for English documents. Imagine the patent is, let’s say, in French and you feed the OCR output into Google Translate, as the PCT’s PatentScope database offers to do…

But now imagine for a moment that all IP documents relevant to a particular well-defined medical field have been identified and collated. Imagine the buried, distorted and implicit information they contain captured in fully corrected form, with the chemistry and pharmacology extracted, annotated, tagged and hyperlinked, both within the database and to outside sources of scientific literature and chemistry. Now let all the cutting-edge semantic and chemistry software work on that. Imagine the prior art granularity such a “patentome” would provide. Imagine how many questions it could answer!

But hasn’t IBM done such a thing for millions of patents last year (as reported e.g. here)? Well, what Big Blue did with its Strategic IP Insight Platform was simply what it does best: applying its hardware superiority to Big Data. The output is actually quite interesting, and will be useful to anybody who needs to emulate an entry-level patent analyst with no particular specialization. It’s impressive only in the same way as IBM’s boast that its SyNAPSE project had developed a “cognitive chip” that allowed it to simulate a cat brain by 2009, with the simulation of a macaque brain now within immediate reach. That type of PR came perilously close to a scam (and not a few cognitive scientists say it was one), but of course the media picked up on it.

Don’t get me wrong: one day we will have artificial intelligence systems that will accomplish all this, and more if they actually achieve semi-sentience (which I guess won’t happen until about 2025). Even then these machines will need to train on material that has been compiled and value-enhanced by human experts.

Until then we need true, high-granularity patentomes with perfect data quality for well-defined but overlapping areas of the medical sciences. With its THIRDSPACE project, H.M. Pharma Consultancy has developed a plan for how this could be accomplished through qualified crowdsourcing, and we are progressing towards an exemplary patentome of ocular pharmacology and biotechnology. That will ultimately allow perfect navigation of the ophthalmology patent space, with interconnections to the outside online universe. Stay tuned while we work on it!