Using NLP Data to Classify Patient Segments in Clinical Trial Data

One of the most common and powerful approaches in NLP gives content experts the opportunity to label each data segment for a portion of the dataset; a model then learns from those labels and applies them to the rest of the dataset. Some key questions need to be answered when applying this approach in different environments. For example, how many “expert” labels do we need to create before the classification works effectively? How can we evaluate this in advance? Are we limiting ourselves to extracting from the data only what we already believe to be true? We will discuss that last question in our following post on clustering.

We applied the classification technique to one of our client’s projects to determine patient segments in clinical trials. This information is not necessarily added to the clinical trial record directly, and if it is, it can sit in any one of a dozen data elements that are rarely labeled consistently. Imagine the patient segments as a tree structure: the root is the initial discovery of the disease prior to treatment, and the first level of branches is “yes” or “no” to initial treatment. The child nodes of the first-level branches are A, B, C, D and E.

One level deeper, patients in a trial who have gone through prior treatment can belong to either segment A or C. Patients who have not gone through prior treatment can be classified as B, D or E.

With thousands of trials and the several categories mentioned above, classifying such trials manually becomes a challenging task. To automate this process, we built a text classification model using NLP data preprocessing methods. Historical trials data that had previously been categorized into patient segments (labels) enabled our predictive model to learn how to categorize new, unlabeled data. In our initial test, we used 80% of the historical trials data as a training set and then predicted the patient segment, across the five branches, for the remaining 20% of trials. The accuracy of our model varied depending on how many pre-labeled trials were included in the training set.
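To make the 80/20 setup concrete, here is a minimal sketch in Python. It assumes a plain bag-of-words Naive Bayes classifier, which is a stand-in and not necessarily the model we used; the function names and the toy trial descriptions are hypothetical and reduced to two segments for brevity.

```python
import math
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real preprocessing would do more."""
    return text.lower().split()

def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle labeled records and split them into training and held-out sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_naive_bayes(train):
    """Count documents and words per label (multinomial Naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in train:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, made-up trial descriptions with expert labels.
RECORDS = [
    ("patients previously treated with chemotherapy", "A"),
    ("relapsed after prior therapy", "A"),
    ("refractory to prior treatment", "A"),
    ("progressed following previous chemotherapy", "A"),
    ("prior therapy failed", "A"),
    ("treatment naive patients", "B"),
    ("newly diagnosed untreated patients", "B"),
    ("untreated newly diagnosed disease", "B"),
    ("first line treatment naive", "B"),
    ("naive to treatment newly diagnosed", "B"),
]

train_set, held_out = split_train_test(RECORDS, train_fraction=0.8)
model = train_naive_bayes(train_set)
correct = sum(predict(model, text) == label for text, label in held_out)
print(f"held-out accuracy: {correct}/{len(held_out)}")
```

Rerunning the split with different values of `train_fraction` is one simple way to see how accuracy changes with the number of pre-labeled trials.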

These initial results were a good start and prompted us to keep improving. We refined our approach by first using this technique to split the trials into two super-segments and then applying a more refined classification within each of those two super-segments.
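The two-stage idea can be sketched as follows. This is an illustration only: the mapping of segments to super-segments is an assumption based on the tree described earlier (prior treatment versus no prior treatment), and the helper names and stand-in classifiers are hypothetical.

```python
# Assumed grouping of the five segments into two super-segments.
SUPER_SEGMENT_OF = {
    "A": "prior", "C": "prior",
    "B": "no_prior", "D": "no_prior", "E": "no_prior",
}

def build_stage_datasets(records):
    """Derive training sets for both stages from (text, segment) records.

    Stage 1 learns the super-segment; stage 2 has one dataset per
    super-segment containing only the segments that belong to it.
    """
    stage1 = [(text, SUPER_SEGMENT_OF[seg]) for text, seg in records]
    stage2 = {}
    for text, seg in records:
        stage2.setdefault(SUPER_SEGMENT_OF[seg], []).append((text, seg))
    return stage1, stage2

def two_stage_predict(text, super_classifier, sub_classifiers):
    """Pick the super-segment first, then the segment within it."""
    super_segment = super_classifier(text)
    return sub_classifiers[super_segment](text)

# Stand-in classifiers (hypothetical keyword rules) just to show the wiring;
# in practice each would be a trained text classifier.
def demo_super(text):
    return "prior" if "previously treated" in text else "no_prior"

demo_subs = {
    "prior": lambda text: "C" if "refractory" in text else "A",
    "no_prior": lambda text: "B",
}

print(two_stage_predict("previously treated refractory disease", demo_super, demo_subs))
```

Each second-stage model only has to separate two or three segments, which is an easier problem than separating all five at once.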

Our results improved, but one key finding was that the keywords we thought would help us solve the segment assignments were not helpful. Take “metastatic”: you would think that word would give a very clear indication of the segment assignment, but in fact it did not. It is the words surrounding that term that make the difference. For instance, searching for the term “metastatic” alone would not work, because “not metastatic” is also a very common term. Using n-grams (sequences of n adjacent words) is important in sorting through these issues.
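A small sketch shows why n-grams help with the “metastatic” problem. The `ngrams` helper and the two example phrases are hypothetical; the point is only that the single word appears in both texts while the two-word feature appears in one.

```python
def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = text.lower().split()
    features = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

positive = ngrams("metastatic disease")
negated = ngrams("not metastatic disease")

# The unigram cannot tell the two texts apart...
print("metastatic" in positive, "metastatic" in negated)
# ...but the bigram only occurs in the negated text.
print("not metastatic" in positive, "not metastatic" in negated)
```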

It became clear that the word groupings and terms used around the key anchor words are extremely important in determining these segments accurately. We are currently considering additional techniques akin to sentiment analysis for solving these types of problems. The context in which the anchor term is found is as important as the term itself. Essentially, the approach would identify the anchor terms and then analyze the words surrounding each one to determine its meaning, or “sentiment”.
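One simple way to picture the anchor-term idea is a context window around each occurrence of the anchor. The sketch below is an assumption about how such an approach might start, not a description of a finished method; the function names, the window size and the short negation list are all hypothetical.

```python
PUNCTUATION = ".,;:()"
NEGATIONS = {"not", "non", "no", "without"}  # hypothetical, far from complete

def context_windows(text, anchor, window=2):
    """Return (left_words, right_words) for each occurrence of the anchor."""
    tokens = [t.strip(PUNCTUATION) for t in text.lower().split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    windows = []
    for i, tok in enumerate(tokens):
        if tok == anchor:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            windows.append((left, right))
    return windows

def anchor_negation_flags(text, anchor, window=2):
    """For each occurrence, report whether a negation word precedes it."""
    return [bool(NEGATIONS & set(left))
            for left, _ in context_windows(text, anchor, window)]

text = "patients with not metastatic disease at baseline"
print(context_windows(text, "metastatic"))
print(anchor_negation_flags(text, "metastatic"))
```

A fuller version would score the surrounding words with a trained model instead of a fixed word list, which is where the analogy to sentiment analysis comes in.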

Accuracy is another topic for conversation. The accuracy of a classification model is calculated as the sum of true positives and true negatives divided by the total population. In applying this process, however, we also identified errors made by the “experts” in their initial classification, so how accurate are they? We believe that accuracy will improve over time as our labeled training dataset grows and as other techniques are integrated to support this classification approach. The client is happy, but we are determined to improve our approach.
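The accuracy calculation itself is short. The sketch below shows both the two-class form from the definition above and the equivalent multi-class form (correct predictions over all predictions); the function names and the example numbers are made up for illustration.

```python
def binary_accuracy(tp, tn, fp, fn):
    """(true positives + true negatives) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total

def accuracy(predicted, actual):
    """Share of predictions that match the expert label, for any number of classes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no predictions to score")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Made-up example: 3 of 4 predicted segments match the expert labels.
print(accuracy(["A", "B", "C", "A"], ["A", "B", "D", "A"]))
```

Note that this measures agreement with the expert labels, so any errors in those labels carry straight into the score.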

Check out our Case Studies for more examples of how Ozmosi can develop solutions for your data needs.
