In our previous post we described a technique for assigning categories to data, based on input from content experts captured in a “training database”. That technique is effective for condensing large, text-heavy data into specific categories that are easier to summarize and visualize. It will not, however, uncover new insights or trends, because it imposes a preconceived and finite set of options; in other words, what we already know.
The following describes our approach to using clustering techniques for exploring text-heavy data. We applied this technique to two different datasets: scientific journals and presidential speeches.
Dataset #1 – Scientific Journals
For the scientific journals, we analyzed over five hundred whitepapers on a specific disease area to identify trends and key topics for that disease area over the last few years. To accomplish this, we created term-document matrices, an NLP technique that builds a matrix of word frequencies, in this case from the abstracts of whitepapers available on pubmed.gov. From these matrices we identified trends in research focus across products and approaches, and we could track how those trends evolved.
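To make the idea of a term-document matrix concrete, here is a minimal sketch using only the Python standard library. The three "abstracts" are invented placeholders rather than real PubMed text, and the function names `tokenize` and `term_document_matrix` are our own for illustration; a real pipeline would typically also remove stop words and normalize word forms.

```python
import re
from collections import Counter

# Hypothetical stand-ins for abstracts; not real PubMed content.
abstracts = [
    "Gene therapy improves outcomes in early trials",
    "Early trials of antibody therapy show improved outcomes",
    "Imaging biomarkers predict disease progression",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

vocabulary, tdm = term_document_matrix(abstracts)
for term, row in zip(vocabulary, tdm):
    print(f"{term:12s} {row}")
```

Each row is a term and each column a document, so a row such as `therapy [1, 1, 0]` says the word occurs once in each of the first two abstracts and not at all in the third.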
We further explored this data using text clustering and scatter plots of the resulting clusters. Text clustering does not require a labeled dataset; it is a way to explore the data and let the data tell the story in the form of clusters. Scatter plots of the clusters are handy for visualizing and understanding the analysis. We applied a clustering algorithm called k-means to the term-document matrices, where k is the number of clusters and must be chosen manually. This approach still needs more exploration since, as you can imagine, the results vary depending on the number of clusters we are “looking for”.
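The sketch below shows the mechanics of k-means on a tiny, made-up document-term count matrix, and how the grouping changes with the chosen k. It is an illustration under stated assumptions, not the code used for our analysis: the data is hypothetical, and the deterministic "farthest-first" seeding in `farthest_first` is a simplification we chose so the example is reproducible (library implementations such as scikit-learn's `KMeans` use randomized seeding by default).

```python
import numpy as np

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at row 0,
    then repeatedly add the row farthest from all centroids chosen so far."""
    chosen = [0]
    while len(chosen) < k:
        dists = np.linalg.norm(points[:, None, :] - points[chosen][None, :, :], axis=2)
        chosen.append(int(dists.min(axis=1).argmax()))
    return points[chosen].copy()

def k_means(points, k, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        # Assign every row to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its rows; leave it in place if it has none.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Hypothetical counts: 6 documents (rows) x 4 terms (columns).
doc_term = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 2, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 4, 3],
], dtype=float)

for k in (2, 3):
    labels, _ = k_means(doc_term, k)
    print(k, labels.tolist())
```

With k = 2 the first three documents land in one cluster and the last three in the other; asking for k = 3 forces a further split, which is exactly the sensitivity to k described above.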
The abstracts produce a large number of characteristics (one per term), many of which are redundant. Some might wonder how the terms, or the relevant characteristics of each cluster, can be represented in a scatter plot. To extract the relevant characteristics and visualize them, we used Principal Component Analysis (PCA). PCA builds new characteristics as linear combinations of the existing ones, choosing the combinations that best summarize the data; we then cluster similar words using the clustering algorithm mentioned above.
Dataset #2 – Presidential Speeches
We applied the techniques described above to a set of presidential campaign speeches by Barack Obama, Donald Trump, Jeb Bush, Hillary Clinton, and Ted Cruz. Here, we felt confident that we should see certain tendencies and similarities, and that the outcomes would help us calibrate our approach.
We created a 2D visualization of the term frequency-inverse document frequency (TF-IDF) matrix built from the word counts in the five candidates’ speeches. Note that the TF-IDF matrix is very wide: it has 5 rows (one for each candidate) and N columns (where N is the number of unique unfiltered words across all the speeches). It would be impossible for us to visualize an N-dimensional space where N is in the thousands; therefore, we must make use of dimensionality reduction techniques.
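As a rough illustration of how such a matrix is put together, the sketch below computes one common TF-IDF variant (term frequency as a share of the document's words, multiplied by log(N / document frequency)) with the standard library. The speaker names and texts are hypothetical toy data, not real quotes, and library implementations often use slightly different smoothing and normalization, so the exact weights would differ.

```python
import math
import re
from collections import Counter

# Toy stand-ins for per-candidate speech text (hypothetical, not real quotes).
speeches = {
    "speaker_a": "jobs economy health care jobs people",
    "speaker_b": "economy health care education people",
    "speaker_c": "border wall trade jobs people",
}

def tf_idf(docs):
    """Return (vocabulary, rows); rows[name][j] is the weight of vocabulary[j]."""
    counts = {
        name: Counter(re.findall(r"[a-z]+", text.lower()))
        for name, text in docs.items()
    }
    vocabulary = sorted(set().union(*counts.values()))
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = {term: sum(term in c for c in counts.values()) for term in vocabulary}
    rows = {}
    for name, c in counts.items():
        total = sum(c.values())
        rows[name] = [(c[t] / total) * math.log(n_docs / df[t]) for t in vocabulary]
    return vocabulary, rows

vocabulary, rows = tf_idf(speeches)
for name, weights in rows.items():
    print(name, [round(w, 3) for w in weights])
```

A word that every speaker uses (here "people") gets a weight of zero, while a word used by only one speaker (here "wall") is weighted up. That is the property that lets distinctive vocabulary drive the separation between speakers.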
Principal component analysis (PCA) is one such technique. Its general idea is to map a high-dimensional space to a low-dimensional one (in our case, 2D). It does this by finding the directions in the N-dimensional space along which the data has the highest variance and redefining the axes to lie along these directions (rather than along the original axes we would normally use when plotting a graph). The new axes (in our case PC1 and PC2) are therefore linear combinations of the original N dimensions. Although it is not critical here, we note that PC2 is always orthogonal (i.e., at a right angle) to PC1, so the two components map naturally onto the x and y axes of a graph, which we are used to seeing at 90 degrees to one another.
We can see from the graph that Bush, Clinton, and Obama are quite near to each other (with Bush and Obama being the closest), while Trump and Cruz sit far away, off in different directions. Several factors could explain this pattern; we list a few below, ordered by significance.
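The projection described here can be sketched in a few lines of NumPy using the singular value decomposition of the mean-centered matrix. This is a minimal illustration, not our production code: the 5 x 6 matrix is random placeholder data standing in for a TF-IDF matrix, and `pca_2d` is a name we made up for the example.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto the first two principal components.

    Returns (scores, components): scores has one (PC1, PC2) pair per row,
    and components holds the two directions in the original space.
    """
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered from most to least variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:2]
    scores = centered @ components.T
    return scores, components

# Hypothetical 5 x 6 weight matrix (5 documents, 6 terms).
rng = np.random.default_rng(0)
weights = rng.random((5, 6))

scores, components = pca_2d(weights)
print(scores.shape)  # (5, 2)
# The dot product of PC1 and PC2 is zero up to rounding: the axes are orthogonal.
print(np.dot(components[0], components[1]))
```

Each row of `scores` is a point that can be placed directly on a 2D scatter plot, and each row of `components` lists how strongly every original term contributes to that axis.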
1) First, we note that this entire graph is based on the words used during their speeches; thus, the terms and frequency are the only information we have extracted, which does not consider the semantic meaning.
2) Bush, Clinton, and Obama are likely close to each other because they all ran a “typical” campaign. They discussed the common key points that presidential campaigns have discussed in the past, and therefore the terms they were using are similar, resulting in them appearing close in the 2D representation.
3) Trump and Cruz were both widely regarded as atypical Republican candidates. They discussed and focused on not-so-typical things in their speeches (e.g., building a wall on the border with Mexico). This can be seen on both the x and y axes.
4) Clinton and Obama appear close together since they are both typical Democrats and discuss the same content, so the words they use are similar. This also brings in Bush, who was a typical Republican. Although he would have held differing opinions, he would have been discussing the same topics and using the same words. Given point (1), we can see how this brings Bush closer to Clinton and Obama.
PC1 tells us about policies and issues. PC2 denotes character: how candidates characterize themselves and each other in their campaign speeches.
After further analysis, such as finding the top and bottom 30 words that drove Policies/Issues (PC1) and Character (PC2) for these campaign speeches, we found that the top 30 words came mostly from the Democrats’ and Jeb Bush’s speeches, while the bottom 30 belonged to Donald Trump and Ted Cruz.
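One way to find the words that drive a component is to sort the vocabulary by each term's loading on that component and read off both ends. The sketch below assumes a loading vector is already available from PCA; the vocabulary and the numbers in `pc1` are invented for illustration, and `extreme_terms` is a hypothetical helper name.

```python
import numpy as np

def extreme_terms(component, vocabulary, n=30):
    """Return the n terms with the largest loadings and the n with the smallest."""
    order = np.argsort(component)  # indices from the smallest to the largest loading
    bottom = [vocabulary[i] for i in order[:n]]
    top = [vocabulary[i] for i in order[::-1][:n]]
    return top, bottom

# Hypothetical vocabulary and PC1 loadings, for illustration only.
vocabulary = ["economy", "health", "jobs", "trade", "wall", "border"]
pc1 = np.array([0.52, 0.47, 0.10, -0.15, -0.48, -0.50])

top, bottom = extreme_terms(pc1, vocabulary, n=2)
print(top)     # ['economy', 'health']
print(bottom)  # ['border', 'wall']
```

Keep in mind that the sign of a principal component is arbitrary, so "top" and "bottom" only mean opposite ends of the same axis; which end is labeled positive can flip between runs or libraries.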
The results of this analysis reinforced our confidence in the technique as a way to summarize information and highlight themes. The most exciting aspect of the approach is that it could be applied to develop “fingerprints” or themes. For example, each candidate was separated from the other candidates on the axes above, but each candidate’s own speeches scored extremely close to one another, showing the kind of consistency and reliability that we would hope to see in an analysis of language and themes. The same idea could be applied more broadly to a body of scientific research to identify core themes in current research and perhaps even to identify missing themes. More work and research are needed to explore this technique further, but for now we will continue to track the thinking and trends in our scientific literature using techniques like these.