Our THIRDSPACE Project: The Elephant In The Background

With all the drug repurposing work and truckloads of other tasks from our customers sitting on our (mostly virtual) desks, H. M. Pharma Consultancy hasn’t reported on its patent database project for a long time now. It’s not that we have pushed it into the back seat: we are all multitaskers, constantly running THIRDSPACE data collection and data cleaning work as tightly coordinated background processes.

What drives us to build a next-generation patent database? Primarily, it is the desire to access all published information. You would think that patents would be the most accessible segment of all, with everything the major patent offices publish being freely available online. In theory this is true; in practice… well, to a degree.

In our times every document, and of course every patent application, is “digitally born” – created on a word processor. The majority of national patent offices have electronic filing procedures in place. Does this mean that at least current submissions are available to the public in directly machine-readable form – in some XML format perhaps? For Japan and Korea the answer is yes; for China, a “sometimes.” But elsewhere – most especially in Europe – the lowest common denominator dictates workflow at the patent offices. This means that even if a submission was made electronically using machine-readable PDFs, these will simply be printed, as are the scanner-generated bitmap PDFs that can also be filed. These hard copies are then OCRed along with the submissions that arrive in paper form to begin with. What finally goes online is a bitmap scan supplemented with uncorrected OCR output, which can be anything from pretty exact to almost undecipherable, depending on the page content. The World Intellectual Property Organization, which for the most part depends on what the national patent offices send, has established a 49-paragraph recommendation for page layouts that facilitate optical character recognition.

The problem is compounded when the patent language is not English, and the situation is especially bad with Asian documents. Sure, online machine translation from Google and Microsoft is available right at the Patentscope website. Here is what the algorithms can do to a patent application for a periodontitis drug:
 
At first sight it would seem as if Microsoft’s performance were much better than Google’s. But what is a “dwarf deodorant”? Or a “Zhang of addition”?

Now remember: You are human; you have intuition, and you might even know enough about periodontitis to figure out (in a fashion) what is being discussed here. A text mining algorithm has no such capabilities. If you hand it gibberish of this sort, its output will be… well, distorted. This will remain so until we finally have limited artificial intelligences to work with – twenty or thirty years from now perhaps.

Until then we will need systems like our THIRDSPACE project, which clean up all this mess manually – with back-breaking, painstaking expert work – and add more value by preparing meaningful abstracts, coding the patent documents for MeSH keywords (so that you can search alongside Medline/PubMed), and coding target disease conditions according to the WHO ICD-10 classification. We also extract chemical information and make it machine-readable. The final result: sets of a few thousand patents each which have true content for a focused field, such as ophthalmology. (This is what we are working on now.)

Imagine the implications: all patent information available on all levels, at your fingertips. More comprehensive for patent documents than PubMed is for peer-reviewed papers.

THIRDSPACE. It’s growing in our machines. Within H.M. Pharma Consultancy we are already making good use of it on behalf of our customers.