One of the most common and powerful approaches in NLP provides the content experts an opportunity to label each data segment for a portion of the dataset and then analyze these labels to apply to the rest of the dataset. Some key questions need to be answered when applying this approach in different environments. For example, how many “expert” labels do we need to create before the classification works effectively? How can we evaluate this in advance? Are we limiting ourselves to only extracting from the data what we already believe to be true? We will discuss that last one in our following post on clustering.
We applied the classification technique to one of our client’s projects to determine patient segments in clinical trials. This information is not necessarily added to the clinical trial information directly and if it is, the information can be in any one of a dozen data elements that are rarely labeled consistently. Imagine patients’ segments as a tree structure with the root being initial discovery of the disease prior to treatment, with the first level of branches as “yes” or “no” to initial treatment”. The child nodes for the first level branches are A, B, C, D and E.
To classify deeper, if the patients within the trial were considered prior segment A, they can then belong to either segment A or C. When the patients have not gone through prior treatment, they can then be classified as B, D or E.
With thousands of trials and the several categories mentioned above, it becomes a challenging task to classify such trials manually. Therefore, to automate this process, we built a text classification model using NLP data preprocessing methods. By using historical trials data that had been previously categorized into patient segments (labels), this enabled our predictive model to learn how to categorize new, unlabeled data. In our initial test, we used 80% of the historical trials data as a training set and accurately predicted the correct patient segment across the five branches in 20% of the remaining trials. The accuracy for our model varied depending on how many pre-labeled trials were included in the training set.
These initial results were a good start and prompted us to continue to improve. We further refined our approach by first breaking the five segments into two super-segments, using this technique and then applying a more refined approach on those two super-segments.
Our results improved, but one key finding was the keywords we thought would help us solve problems of the segment assignments were not helpful. For example, “metastatic”; you would think that word would provide a very clear understanding of the segments assignment, but in fact it did not. It is all the words surrounding that term that make the difference. For instance, searching for the term “metastatic” alone would not work because, “not metastatic”, is also a very common term. Use of n-grams is important in sorting through these issues.
It became clear that word groupings and terms used around the key anchor words are extremely important in determining these segments accurately. We are currently considering additional techniques akin to sentiment analysis for solving these types of problems. The context of where the anchor term is found is equally as important as the term itself. Essentially, the approach would identify the anchor terms and then analyze the words surrounding that term to determine the meaning / “sentiment”.
Accuracy is another topic for conversation as well. The accuracy for any predictive model is calculated by taking the sum of true positives and negatives divided by total population. Though, in applying this process, we also identified errors made by the “experts” in their initial classification; so how accurate are they? We believe however, that the accuracy will improve over time with a bigger labeled dataset as our training dataset grows and integration of other techniques are added to support this classification approach. The client is happy, but we are determined to improve our approach.