In our previous post we described a technique for assigning categories to data, based on input from content experts captured in a “training database”. This technique is effective for condensing large, text-heavy data into specific categories for summaries and improved visualization. However, it will not uncover new insights or trends, because it imposes a preconceived and finite set of options: in other words, what we already know.

The following describes our approach to using clustering techniques for exploring text-heavy data. We applied this technique to two different datasets: scientific journals and presidential speeches.

Dataset #1 – Scientific Journals

For the scientific journals, we analyzed over five hundred whitepapers on a specific disease area to identify trends and key topics for that disease area over the last few years. To accomplish this task, we created term-document matrices, an NLP technique that builds a matrix of word frequencies, in this case from the abstracts of the whitepapers. From these matrices we identified trends in research focus across products and approaches, and we could track how these evolved.
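
To make the idea of a term-document matrix concrete, here is a minimal sketch using only the Python standard library. The three abstracts are invented for illustration, and the helper names `tokenize` and `term_document_matrix` are hypothetical. The sketch puts one document per row and one term per column; transposing it gives the terms-as-rows orientation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical abstracts, invented purely for illustration.
abstracts = [
    "Insulin therapy improves glucose control",
    "Glucose monitoring and insulin dosing",
    "Gene therapy trial design",
]

vocabulary, matrix = term_document_matrix(abstracts)

for abstract, row in zip(abstracts, matrix):
    print(row, abstract)
```

On a real corpus one would usually also remove stop words and use a sparse matrix, since most entries are zero.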

We further explored this data using text clustering and scatter plots of the resulting clusters. Text clustering does not require a labeled dataset; it is a way to explore the data and let the data tell the story in the form of clusters. Scatter plots of the clusters are handy for visualizing and understanding the analysis. We applied a clustering algorithm called K-means, which operates on the term-document matrices and where K is the number of clusters, chosen manually. This approach still needs more exploration since, as you can imagine, the results vary depending on the number of clusters we are “looking for”.
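
The sketch below illustrates K-means in plain Python and shows how the grouping changes with K. It is a simplified illustration, not the implementation used for the analysis: the rows are invented term counts, and the function name `k_means`, the random initialization, and the fixed seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Column-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(rows, k, iterations=100, seed=0):
    """Cluster rows into k groups and return one cluster label per row."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random (an assumption of this sketch).
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels


# Hypothetical term-count rows: the first two and last two documents
# share vocabulary with each other but not across pairs.
rows = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]

two = k_means(rows, k=2)
three = k_means(rows, k=3)
print("K=2:", two)
print("K=3:", three)
```

With K=2 the two natural pairs are recovered; with K=3 one of the pairs stays together and the other is split, which is the sensitivity to K described above.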

The abstracts have many characteristics (terms), some of which may be redundant. Some might wonder how the terms or the relevant characteristics in each cluster can be represented in a scatter plot. To extract the relevant characteristics and visualize them, we used Principal Component Analysis (PCA). PCA builds new characteristics as linear combinations of the existing ones, choosing the combinations that best summarize the data; we then cluster similar words using the clustering algorithm mentioned above.

Dataset #2 – Presidential Speeches

We applied a similar technique to a set of presidential campaign speeches by Barack Obama, Donald Trump, Jeb Bush, Hillary Clinton, and Ted Cruz. Here, we felt confident that we should see certain tendencies and similarities, and that the outcomes would help us calibrate our approach.

We created a 2D visualization of the term frequency-inverse document frequency (TF-IDF) matrix built from the word counts in the five candidates’ speeches. Note that the TF-IDF matrix is very wide, as it has 5 rows (one for each candidate) and N columns (where N is the number of unique unfiltered words from all the speeches). It would be impossible for us to visualize an N-dimensional space where N is in the thousands; therefore, we must make use of dimensionality reduction techniques.
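
As an illustration of how such a matrix is built, the sketch below computes TF-IDF weights with the standard library. It assumes one common variant (term frequency as a share of the document's words, multiplied by the natural log of the number of documents divided by the number of documents containing the term); other variants add smoothing or normalization. The speaker names and texts are invented placeholders, not excerpts from the speeches.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_matrix(documents):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(documents)
    # Document frequency: how many documents contain each term.
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    matrix = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero for empty documents
        matrix.append(
            [(c[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, matrix


# Hypothetical speakers and texts, invented purely for illustration.
speeches = {
    "Speaker A": "jobs and taxes and jobs",
    "Speaker B": "taxes and health care",
    "Speaker C": "border wall and jobs",
}

vocabulary, matrix = tf_idf_matrix(list(speeches.values()))

for name, row in zip(speeches, matrix):
    print(name, [round(w, 3) for w in row])
```

A word used by every speaker ("and" here) gets a weight of zero, while a word used by only one speaker is weighted up, which is why TF-IDF highlights what distinguishes each set of speeches.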

Principal component analysis (PCA) is one such technique. Its general idea is to map a high-dimensional space to a low-dimensional one (in our case, 2D). It does this by finding the directions in the N-dimensional space that have the highest variance and redefining the axes to lie along these directions (as opposed to the original axes we would normally use when plotting a graph). In this way, the new axes (in our case PC1 and PC2) are linear combinations of the original N dimensions. Although not critical to the discussion, we mention that after finding PC1, PCA chooses PC2 as the direction of greatest remaining variance that is orthogonal (i.e., at a right angle) to PC1. This also makes the plot read naturally, since we are used to graphs whose x and y axes sit at 90 degrees to one another.
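
The following sketch shows the mechanics on a tiny invented matrix. It finds PC1 and PC2 with power iteration and deflation on the covariance matrix, a simple stand-in chosen so the example needs no external libraries; library implementations typically use a singular value decomposition instead. The data and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each column's mean so the data is centered on the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_eigenvector(matrix, rng, iterations=200):
    """Power iteration: repeatedly multiply a random start vector and normalize."""
    v = [rng.random() for _ in matrix]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break  # the matrix has no variance left to find
        v = [x / norm for x in w]
    return v


def pca_2d(rows, seed=0):
    """Return (pc1, pc2, scores), where scores are the rows projected to 2D."""
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    pc1 = leading_eigenvector(cov, rng)
    variance1 = dot(pc1, [dot(row, pc1) for row in cov])
    # Deflation: remove PC1's variance so the next leading direction is PC2.
    deflated = [
        [cov[i][j] - variance1 * pc1[i] * pc1[j] for j in range(d)]
        for i in range(d)
    ]
    pc2 = leading_eigenvector(deflated, rng)
    scores = [[dot(row, pc1), dot(row, pc2)] for row in centered]
    return pc1, pc2, scores


# Hypothetical weights: most variation lies along the diagonal of the first
# two columns, a little lies across it, and the third column is constant.
rows = [
    [4.0, 4.0, 1.0],
    [2.0, 2.0, 1.0],
    [5.0, 5.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.5, 2.5, 1.0],
    [2.5, 3.5, 1.0],
]

pc1, pc2, scores = pca_2d(rows)
print("PC1:", [round(x, 3) for x in pc1])
print("PC2:", [round(x, 3) for x in pc2])
print("PC1 . PC2:", round(dot(pc1, pc2), 6))
```

PC1 points along the diagonal of the first two columns, PC2 lies at a right angle to it, and the constant third column contributes to neither. The sign of each component is arbitrary, so a plot may appear mirrored between runs or libraries without changing the distances between points.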

We can see from the graph that Bush, Clinton, and Obama are quite near to each other (with Bush and Obama being the closest), while Trump and Cruz appear far away, off in different directions. Several factors could explain this pattern; below we list a few, ordered by significance.

1) First, we note that this entire graph is based on the words used in the speeches; the terms and their frequencies are the only information we have extracted, so the analysis does not consider semantic meaning.

2) Bush, Clinton, and Obama are likely close to each other because they all ran a “typical” campaign. They discussed the common key points that presidential campaigns have discussed in the past, and therefore the terms they were using are similar, resulting in them appearing close in the 2D representation.

3) Trump and Cruz were widely regarded as atypical Republican candidates. Thus, they discussed and focused on not-so-typical things in their speeches (e.g., building a wall on the border with Mexico). Their separation can be seen on both the x and y axes.

4) Clinton and Obama appear close together since they are both typical Democrats and discuss the same content, so the words they use are similar. This also brings in Bush, who was a typical Republican. Although the three would have differing opinions, they would be discussing the same content/topics and using the same words. Using the ideas from (1), we can see how this would bring Bush closer to Clinton and Obama.

We interpret PC1 as reflecting policies and issues, and PC2 as reflecting character: how candidates characterize themselves and each other in their campaign speeches.

After further analysis, such as finding the top and bottom 30 words that drove Policies/Issues (PC1) and Character (PC2) for these campaign speeches, we found that the top 30 words came mostly from the Democrats’ and Jeb Bush’s speeches, while the bottom 30 came from Donald Trump’s and Ted Cruz’s.
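
One way to find such words, assuming they are ranked by their loading (weight) on a principal component, is sketched below. The vocabulary and loading values are invented for illustration and are not results from the analysis; `extreme_terms` is a hypothetical helper name.

```python
def extreme_terms(vocabulary, loadings, n=30):
    """Return the n terms with the highest and the n with the lowest loadings."""
    ranked = sorted(zip(vocabulary, loadings), key=lambda pair: pair[1], reverse=True)
    top = [term for term, _ in ranked[:n]]
    bottom = [term for term, _ in ranked[-n:]][::-1]  # most negative first
    return top, bottom


# Hypothetical vocabulary and PC1 loadings, invented purely for illustration.
vocabulary = ["economy", "health", "jobs", "wall", "border", "taxes"]
pc1_loadings = [0.42, 0.35, 0.10, -0.51, -0.44, -0.02]

top, bottom = extreme_terms(vocabulary, pc1_loadings, n=2)
print("Top:", top)
print("Bottom:", bottom)
```

Because the sign of a principal component is arbitrary, "top" and "bottom" can swap between runs; what matters is which words sit at opposite ends of the axis.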


The results from the above analysis reinforced our confidence in this technique as a way to summarize information and highlight themes. The most exciting part of this approach is that it could be applied to develop “fingerprints” or themes. For example, each candidate showed separation from the other candidates on the axes above, but their own speeches scored extremely close to each other, highlighting the consistency and reliability that we would hope to see in an analysis of language and themes. This could be applied more broadly to a body of scientific research to identify core themes in the current research and perhaps even to identify missing themes. More work and research are needed to explore the use of this technique further, but for now we will continue to track the thinking and trends in our scientific literature using techniques like these.
