Using NLP Data to Classify Patient Segments in Clinical Trial Data

One of the most common and powerful approaches in NLP is supervised classification: content experts label each data segment for a portion of the dataset, and a model learns from those labels and applies them to the rest of the dataset. Some key questions need to be answered when applying this approach in different environments. For example, how many “expert” labels do we need before the classification works effectively? How can we evaluate this in advance? Are we limiting ourselves to extracting from the data only what we already believe to be true? We will discuss that last question in our next post on clustering.

We applied the classification technique to one of our clients’ projects to determine patient segments in clinical trials. This information is not necessarily recorded in the clinical trial information directly, and when it is, it can sit in any one of a dozen data elements that are rarely labeled consistently. Imagine the patient segments as a tree structure: the root is the initial discovery of the disease prior to treatment, and the first level of branches is “yes” or “no” to initial treatment. The child nodes of the first-level branches are A, B, C, D and E.

Going one level deeper, if the patients within the trial have gone through prior treatment, they can belong to either segment A or C. If the patients have not gone through prior treatment, they can be classified as B, D or E.
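
The tree described above can be written down as a small lookup table. This is only a minimal sketch: the names `SEGMENT_TREE`, `candidate_segments` and `branch_of` are hypothetical, and it assumes the “yes” branch holds segments A and C while the “no” branch holds B, D and E.

```python
from typing import Tuple

# Hypothetical encoding of the segment tree (an assumption, not our production schema).
# Keys answer "prior treatment?"; values are the leaf segments under that branch.
SEGMENT_TREE = {
    True: ("A", "C"),        # patients who have gone through prior treatment
    False: ("B", "D", "E"),  # patients with no prior treatment
}


def candidate_segments(prior_treatment: bool) -> Tuple[str, ...]:
    """Return the leaf segments reachable from one first-level branch."""
    return SEGMENT_TREE[prior_treatment]


def branch_of(segment: str) -> bool:
    """Return which first-level branch (prior treatment or not) holds a segment."""
    for prior_treatment, leaves in SEGMENT_TREE.items():
        if segment in leaves:
            return prior_treatment
    raise ValueError(f"unknown segment: {segment!r}")


print(candidate_segments(True))
print(branch_of("D"))
```

Keeping the tree in one place like this makes it easy to check that a predicted segment is consistent with what is known about prior treatment.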

With thousands of trials and the several categories mentioned above, classifying trials manually is a challenging task. To automate the process, we built a text classification model using NLP data preprocessing methods. Historical trial data that had previously been categorized into patient segments (labels) enabled our predictive model to learn how to categorize new, unlabeled data. In our initial test, we used 80% of the historical trial data as a training set and accurately predicted the patient segment across the five branches for the remaining 20% of trials. The accuracy of our model varied depending on how many pre-labeled trials were included in the training set.
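
To make the 80/20 workflow concrete, here is a rough, standard-library-only sketch. The post does not say which model we used, so the multinomial Naive Bayes classifier below is a stand-in chosen for brevity; the trial descriptions, the two labels and every name in the code (`ROWS`, `NaiveBayes`, `train_test_split`) are invented for illustration.

```python
import math
import random
from collections import Counter, defaultdict

# Invented placeholder data: (trial text, segment label). Not real trial records.
ROWS = [
    ("relapsed after prior chemotherapy", "A"),
    ("previously treated relapsed patients", "A"),
    ("progressed after prior therapy", "A"),
    ("prior chemotherapy required relapsed disease", "A"),
    ("previously treated patients progressed", "A"),
    ("newly diagnosed untreated patients", "B"),
    ("treatment naive newly diagnosed", "B"),
    ("no prior therapy newly diagnosed", "B"),
    ("untreated patients treatment naive", "B"),
    ("newly diagnosed disease untreated", "B"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple preprocessor)."""
    return text.lower().split()


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle labeled rows and split them into training and held-out sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, rows):
        self.label_counts = Counter(label for _, label in rows)
        self.word_counts = defaultdict(Counter)
        for text, label in rows:
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior for this label
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train, test = train_test_split(ROWS)           # 80% train, 20% held out
model = NaiveBayes().fit(train)
correct = sum(model.predict(text) == label for text, label in test)
print(f"held-out accuracy: {correct / len(test):.2f}")
```

Re-running the split with a smaller `train_fraction` is one simple way to see how accuracy changes with the number of pre-labeled trials.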

These initial results were a good start and prompted us to keep improving. We refined our approach by first using this technique to split the five segments into two super-segments, and then applying a more refined classification within each super-segment.
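
A two-stage setup of this kind might look roughly like the sketch below. It assumes the two super-segments correspond to the prior-treatment branches (A and C versus B, D and E), which the post does not state outright, and the keyword rules are stand-ins for trained models. All names are hypothetical.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]

# Assumed grouping of the five segments into two super-segments.
SUPER_SEGMENTS = {
    "prior_treatment": {"A", "C"},
    "no_prior_treatment": {"B", "D", "E"},
}


def classify_two_stage(text: str, coarse: Classifier, fine: Dict[str, Classifier]) -> str:
    """Pick a super-segment first, then let that group's model pick the segment."""
    group = coarse(text)
    segment = fine[group](text)
    if segment not in SUPER_SEGMENTS[group]:
        raise ValueError(f"{segment!r} is not a member of {group!r}")
    return segment


# Stand-in models: real ones would be trained classifiers, not keyword rules.
def coarse_model(text: str) -> str:
    treated = "previously treated" in text.lower()
    return "prior_treatment" if treated else "no_prior_treatment"


fine_models = {
    "prior_treatment": lambda text: "C" if "refractory" in text.lower() else "A",
    "no_prior_treatment": lambda text: "B",
}

print(classify_two_stage("Previously treated, refractory disease", coarse_model, fine_models))
```

Each second-stage model only has to separate two or three segments, which is an easier problem than separating all five at once.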

Our results improved, but one key finding was that the keywords we expected to settle the segment assignments were not helpful. Take “metastatic”: you would think that word would give a very clear indication of the segment assignment, but in fact it did not. The words surrounding the term make the difference. For instance, searching for “metastatic” alone would not work because “not metastatic” is also a very common phrase. Using n-grams (sequences of adjacent words) is important in sorting through these issues.
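
The “metastatic” problem is easy to reproduce. The sketch below builds unigram and bigram features for two invented sentences; the function names and the example text are illustrative assumptions, not our production preprocessing.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Collect all n-grams from length 1 up to max_n for one piece of text."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats


positive = features("patients with metastatic disease")
negative = features("patients with not metastatic disease")

# The unigram appears in both texts, so it cannot separate them.
print("metastatic" in positive, "metastatic" in negative)
# The bigram appears only in the negated text.
print("not metastatic" in positive, "not metastatic" in negative)
```

With unigrams alone the two sentences look nearly identical to a model; adding bigrams gives it a feature that exists only in the negated case.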

It became clear that the word groupings and terms used around the key anchor words are extremely important in determining these segments accurately. We are currently considering additional techniques akin to sentiment analysis for these types of problems. The context in which the anchor term is found is as important as the term itself. Essentially, the approach would identify the anchor terms and then analyze the words surrounding each term to determine its meaning, or “sentiment”.
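
One simple way to prototype the anchor-plus-context idea is to look at a small window of words around each anchor term. This is only a sketch under stated assumptions: the negation cue list, the window size and the function names are all hypothetical, and a real system would need far more robust handling (for example, hyphenated forms such as “non-metastatic” are not matched here).

```python
# Assumed, non-exhaustive list of words that flip the meaning of an anchor term.
NEGATION_CUES = {"not", "no", "non", "without"}


def anchor_contexts(text, anchor, window=2):
    """Return (words before, words after) for each occurrence of the anchor."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    contexts = []
    for i, token in enumerate(tokens):
        if token == anchor:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            contexts.append((before, after))
    return contexts


def anchor_polarity(text, anchor, window=2):
    """Classify an anchor term as 'negated', 'affirmed', or 'absent'."""
    contexts = anchor_contexts(text, anchor, window)
    if not contexts:
        return "absent"
    negated = any(NEGATION_CUES & set(before) for before, _ in contexts)
    return "negated" if negated else "affirmed"


print(anchor_polarity("Patients must not have metastatic disease.", "metastatic"))
print(anchor_polarity("Adults with metastatic disease.", "metastatic"))
```

The resulting polarity could then be fed to the classifier as an extra feature alongside the n-grams.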

Accuracy is another topic for conversation. The accuracy of a predictive model is calculated as the sum of true positives and true negatives divided by the total population. In applying this process, however, we also identified errors made by the “experts” in their initial classification, so how accurate are they? We believe that accuracy will improve over time as our labeled training dataset grows and as other techniques are integrated to support this classification approach. The client is happy, but we are determined to improve our approach.
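
The accuracy calculation itself is short. The sketch below shows the two-class form described above and the equivalent multi-class form (correct predictions divided by total predictions) that applies when there are five segments. The counts and labels are made-up numbers for illustration only.

```python
def accuracy(tp, tn, fp, fn):
    """Binary accuracy: (true positives + true negatives) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


def multiclass_accuracy(y_true, y_pred):
    """Share of predictions that match the expert label, for any number of classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up counts and labels, purely for illustration.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
print(multiclass_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "E"]))
```

Note that both functions treat the expert labels as ground truth, so any errors in the original labeling show up as model errors.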

Check out our Case Studies for more examples of how Ozmosi can develop solutions for your data needs.


