The problem with Big Data is that it is so Big! This is especially true in healthcare and drug development. It is difficult to see across all the good clinical and scientific work going on around the world and to understand, for any given disease area, where we are headed and how we will get there.

It’s about time we moved from tackling pharmaceutical data manually to using the power of natural language processing (NLP) and machine learning. NLP is still at an early stage in the pharma industry, and it has strong potential to find trends and insights buried within data. NLP and machine learning algorithms open up opportunities to analyze not only the text-heavy data in clinical trials, but also whitepapers from medical journals published across the globe.

When you have unstructured, messy data to clean and organize, NLP is a natural approach for distilling the information into something we can digest and apply. NLP techniques can strip punctuation, stray whitespace, and unused numbers from text and boil it down to the most meaningful data and categories for us to analyze. Machine learning offers algorithms for several tasks on textual data, including classification, clustering, auto-summarizing, and many more. These are helpful, for instance, when we want to classify clinical trials into different patient segments, or use clustering to explore the data.
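To make the cleaning step concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not the pipeline we used: the function name `clean_text` and the sample sentence are made up for this example, and the rule of dropping every standalone number is a deliberate simplification.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space,
# so "double-blind" becomes "double blind" rather than "doubleblind".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text: str) -> str:
    """Hypothetical helper: lowercase, strip punctuation and numbers, tidy whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    # Drop tokens made up only of digits. This is a simplification: in real
    # clinical text a number such as the "3" in "Phase 3" may carry meaning.
    text = re.sub(r"\b\d+\b", " ", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return " ".join(text.split())

sample = "  Phase 3 trial:   120 patients, randomized (double-blind)!  "
print(clean_text(sample))
```

Which numbers count as "unused" depends on the question being asked, so in practice that rule would likely need to be tuned for each data set.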

We have explored three NLP approaches for summarizing large bodies of scientific literature in order to determine how well each works at parsing out common themes and trends. An approach that can quickly distill common themes from a large volume of studies would have far-reaching applications in the pharmaceutical industry and beyond. The three approaches we applied to this problem were as follows:

1)  Classification – Letting the Experts tell us what to look for

2)  Clustering – Letting the data tell the story

3)  Auto Summarizing – Distilling copious amounts of text information

In the posts that follow, we will describe our findings for each technique and our plans for continuing to work on this problem.

Spoiler alert: none of these approaches is perfectly suited on its own to the task described above, but each has an element of success that can be built upon. We believe that reaching this goal will require combining several techniques, including some we have not yet tried and perhaps some that have not yet been created.

We are happy to share our successes as well as our failures here to contribute to the larger conversation about using the power of NLP.