One of the most common and powerful approaches in NLP is supervised classification: content experts label each data segment for a portion of the dataset, and a model learns from those labels so they can be applied to the rest of the dataset. Some key questions need to be answered when applying this approach in different environments. For example, how many “expert” labels do we need to create before the classification works effectively? How can we evaluate this in advance? Are we limiting ourselves to extracting from the data only what we already believe to be true? We will discuss that last question in our following post on clustering.

We applied the classification technique to one of our client’s projects to determine patient segments in clinical trials. This information is not necessarily recorded in the clinical trial information directly, and when it is, it can sit in any one of a dozen data elements that are rarely labeled consistently. Imagine the patient segments as a tree structure: the root is the initial discovery of the disease prior to treatment, and the first level of branches is “yes” or “no” to initial treatment. The child nodes of those first-level branches are segments A, B, C, D and E.

Going one level deeper, if the patients within the trial have gone through prior treatment, they can belong to either segment A or C. When the patients have not gone through prior treatment, they can be classified as B, D or E.

With thousands of trials and the several categories mentioned above, classifying trials manually becomes a challenging task. To automate the process, we built a text classification model on top of NLP data preprocessing methods. Historical trials data that had previously been categorized into patient segments (labels) enabled our predictive model to learn how to categorize new, unlabeled data. In our initial test, we used 80% of the historical trials data as a training set, and the model accurately predicted the patient segment across the five branches for the remaining 20% of trials, which served as the test set. The accuracy of our model varied depending on how many pre-labeled trials were included in the training set.
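To make the 80/20 setup concrete, here is a minimal sketch using only the Python standard library. It is not our production model: the Naive Bayes classifier, the whitespace tokenizer, and the ten example records are all assumptions chosen for illustration, and the segment labels attached to the example texts are invented.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real preprocessing would do more."""
    return text.lower().split()


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle labeled records and split them into training and held-out sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_naive_bayes(records):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in records:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled trials: (free-text description, expert-assigned segment).
records = [
    ("relapsed after prior therapy", "A"),
    ("previously treated refractory disease", "A"),
    ("newly diagnosed treatment naive", "B"),
    ("untreated newly diagnosed adults", "B"),
    ("second line after progression", "C"),
    ("progression on prior regimen", "C"),
    ("maintenance in untreated patients", "D"),
    ("maintenance following induction", "D"),
    ("watch and wait no therapy", "E"),
    ("observation without therapy", "E"),
]

train, held_out = train_test_split(records)            # 8 training, 2 held out
model = train_naive_bayes(train)
results = [(predict(model, text), label) for text, label in held_out]
```

Each pair in `results` holds the predicted and the expert-assigned segment for a held-out trial, which is the raw material for the accuracy figure discussed later in this post.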

These initial results were a good start and prompted us to keep improving. We refined our approach by first using this technique to classify trials into two super-segments that group the five segments, and then applying a more refined approach within each of those two super-segments.
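The two-stage idea can be sketched as a coarse classifier followed by one fine classifier per super-segment. Everything below is an assumption for illustration: we assume the two super-segments correspond to the "prior treatment" and "no prior treatment" branches of the tree described earlier, and the keyword rules are throwaway placeholders standing in for trained models, not the real meaning of the segments.

```python
# Assumed grouping of the five segments into two super-segments.
SUPER_SEGMENTS = {
    "prior_treatment": ["A", "C"],
    "no_prior_treatment": ["B", "D", "E"],
}


def two_stage_predict(text, coarse_classifier, fine_classifiers):
    """Pick a super-segment first, then a segment within that super-segment."""
    super_segment = coarse_classifier(text)
    segment = fine_classifiers[super_segment](text)
    if segment not in SUPER_SEGMENTS[super_segment]:
        raise ValueError(f"{segment!r} is not part of {super_segment!r}")
    return super_segment, segment


# Placeholder stand-ins for trained models (hypothetical keyword rules).
def toy_coarse(text):
    lowered = text.lower()
    return "no_prior_treatment" if "treatment-naive" in lowered else "prior_treatment"


def toy_fine_prior(text):
    return "C" if "second-line" in text.lower() else "A"


def toy_fine_no_prior(text):
    lowered = text.lower()
    if "newly diagnosed" in lowered:
        return "B"
    if "maintenance" in lowered:
        return "D"
    return "E"


FINE_CLASSIFIERS = {
    "prior_treatment": toy_fine_prior,
    "no_prior_treatment": toy_fine_no_prior,
}

result = two_stage_predict(
    "Treatment-naive, newly diagnosed adults", toy_coarse, FINE_CLASSIFIERS
)
```

One appeal of this design is that each fine classifier only has to separate two or three segments instead of five, so it can focus on the vocabulary that matters within its own branch.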

Our results improved, but one key finding was that the keywords we thought would settle the segment assignments were not helpful. Take “metastatic”: you would think that word would give a very clear indication of the segment assignment, but in fact it did not. It is the words surrounding the term that make the difference. Searching for the term “metastatic” alone does not work because “not metastatic” is also a very common phrase. Using n-grams, sequences of n adjacent words, is important in sorting through these issues.
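A small sketch shows why n-grams help with the “metastatic” problem. The helper names and the two example sentences are ours, made up for illustration, and the whitespace tokenizer is a simplification.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Collect unigrams up to max_n-grams from a whitespace-tokenized text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


positive = ngram_features("patients with metastatic disease")
negative = ngram_features("patients with not metastatic disease")

# The unigram "metastatic" appears in both texts and cannot tell them apart.
shared_unigram = "metastatic" in positive and "metastatic" in negative

# The bigram "not metastatic" appears only in the second text.
distinguishing_bigram = "not metastatic" in negative and "not metastatic" not in positive
```

With unigrams alone, both sentences look the same with respect to “metastatic”; the bigram feature is what lets a classifier learn that they point in opposite directions.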

It became clear that the word groupings and terms used around the key anchor words are extremely important in determining these segments accurately. We are currently considering additional techniques akin to sentiment analysis for solving these types of problems. The context in which the anchor term is found is as important as the term itself. Essentially, the approach would identify the anchor terms and then analyze the words surrounding each one to determine its meaning, or “sentiment”.
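As a rough sketch of what analyzing the words around an anchor term could look like, the snippet below pulls a window of tokens around each occurrence of the anchor and checks the preceding words for a negation cue. This is only one simple way to approximate the idea, not a technique we have settled on; the cue list, the window size, and the function names are all assumptions.

```python
# Hypothetical negation cues; a real list would be tuned on labeled data.
NEGATION_CUES = {"not", "non", "no", "without"}


def anchor_contexts(text, anchor, window=2):
    """Return (words_before, words_after) for each occurrence of the anchor."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    contexts = []
    for i, token in enumerate(tokens):
        if token == anchor:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            contexts.append((before, after))
    return contexts


def anchor_polarity(text, anchor, window=2):
    """Classify the anchor as 'absent', 'negated', or 'affirmed' in the text."""
    contexts = anchor_contexts(text, anchor, window)
    if not contexts:
        return "absent"
    if any(NEGATION_CUES & set(before) for before, _ in contexts):
        return "negated"
    return "affirmed"


affirmed = anchor_polarity("Patients with metastatic breast cancer", "metastatic")
negated = anchor_polarity("Disease is not metastatic at enrollment", "metastatic")
absent = anchor_polarity("Early-stage disease only", "metastatic")
```

A rule like this is brittle on its own, but the polarity it produces could be fed to the classifier as one more feature alongside the n-grams.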

Accuracy is another topic for conversation. For a binary classifier, accuracy is calculated as the sum of true positives and true negatives divided by the total population; with several segments, this generalizes to the number of correctly classified trials divided by the total number of trials. In applying this process, though, we also identified errors made by the “experts” in their initial classification, so how accurate are they? We believe, however, that accuracy will improve over time as our labeled training dataset grows and as other techniques are integrated to support this classification approach. The client is happy, but we are determined to improve our approach.
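For reference, the accuracy calculation can be written out in a few lines. The counts and labels in the examples are made up purely to show the arithmetic and are not results from our project.

```python
def binary_accuracy(tp, tn, fp, fn):
    """(true positives + true negatives) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty population")
    return (tp + tn) / total


def multiclass_accuracy(true_labels, predicted_labels):
    """Share of items whose predicted segment matches the expert label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("accuracy is undefined for an empty population")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Illustrative numbers only.
example_binary = binary_accuracy(tp=40, tn=45, fp=5, fn=10)           # 85 of 100 correct
example_multi = multiclass_accuracy(list("ABCD"), list("ABCE"))       # 3 of 4 correct
```

Note that both functions treat the expert labels as ground truth, so any errors in the experts' own classification cap how meaningful the resulting number can be.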

Check out our Case Studies for more examples of how Ozmosi can develop solutions for your data needs. 

 
