Finding the Narrative with Natural Language Processing

When I first started studying data science, one of the areas I was most excited to learn was natural language processing. “Unsupervised machine learning” certainly has a mystical ring to it, and, coming from finance roles in startups, working with text data felt like the polar opposite of what I was used to.

That finance experience did, however, teach me the importance of process and storytelling with data. Because finance touches everything, you get a sense of how all the different parts of the business fit together, how data flows between them, and how to effectively communicate performance and strategy to a range of non-technical audiences.

There’s no shortage of articles detailing how various topic modeling algorithms work, or of tutorials on how to code them up (I relied heavily on both while working through this project). What felt missing, however, was a high-level framework for fitting them into a comprehensive workflow and for distilling findings and insights into a compelling narrative.

With this in mind, I’ll be outlining my process for topic modeling and creating a few insightful visualizations using the Russian Troll Tweets dataset:

  1. Text preprocessing
  2. Topic discovery with pyLDAvis
  3. “Semi-Supervised” topic modeling with CorEx
  4. Creating visualizations

If you’d like to see exactly how I put this together, you can find the code for this project on my GitHub.

Text Preprocessing

In general, text preprocessing should include lowercasing all words, removing punctuation and stop words, and stemming or lemmatization. When working with tweets, in addition to the normal preprocessing tasks, we also have to consider hashtags, acronyms, retweet syntax (‘RT @scrapfishies:…’), emojis, and other elements.

Should hashtags be segmented (divided into their unique words) or kept as a single concatenated string? Well, I’d argue that it depends on the hashtag. As an example, the #blacklivesmatter hashtag was used frequently in this corpus, and segmenting it would give us 3 distinct tokens: ‘black’, ‘lives’, and ‘matter’. Unless we use bigrams or trigrams when tokenizing, we might lose some of the hashtag’s potency, which could affect the themes that emerge when topic modeling. What about a #donaldtrump or #hillaryclinton hashtag? Should these be segmented or left concatenated? Instead of drawing hard lines, my guideline is this:

The more salient a hashtag is in the corpus, the stronger the case for leaving it intact.

This is where domain knowledge and exploratory data analysis will be your guide. Spend time reviewing hashtags and consider segmenting less salient ones with the help of a word segmentation library or by using a dictionary and pandas’ replace() method for a more manual approach.
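As a rough sketch of the manual approach, the snippet below splits a few hashtags using a hand-built dictionary and leaves every other hashtag as a single token. The mapping, the example hashtags, and the function name `segment_hashtags` are placeholders of mine rather than anything from the project code. With tweets stored in a pandas Series, a similar dictionary could instead be passed to `replace()` with `regex=True`.

```python
import re

# Hypothetical mapping, built by hand after reviewing hashtag frequencies.
# Only the less salient hashtags get an entry here.
HASHTAG_SEGMENTS = {
    "#mondaymotivation": "monday motivation",
    "#thingsmoretrusted": "things more trusted",
}

HASHTAG_RE = re.compile(r"#\w+")


def segment_hashtags(text, segments=HASHTAG_SEGMENTS):
    """Split mapped hashtags into words; keep all other hashtags as one token."""

    def _swap(match):
        tag = match.group(0).lower()
        # Unmapped (salient) hashtags stay concatenated, minus the '#'.
        return segments.get(tag, tag[1:])

    return HASHTAG_RE.sub(_swap, text)


print(segment_hashtags("Happy #MondayMotivation from #BlackLivesMatter"))
```

Keeping the mapping in one dictionary makes it easy to revisit as the topic models reveal which hashtags carry a theme on their own.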

A text preprocessing pipeline for a corpus of tweets could include the following:

  1. Lowercase and remove punctuation, emojis, URLs, and retweet prefixes
  2. Segment hashtags where appropriate
  3. Remove stop words
  4. Lemmatize (I really like spaCy’s lemmatizer for this!)

Remember: this will be an iterative process! As you get into the topic modeling phase, you may find that you need to go back and make tweaks to your pipeline. For example, spaCy’s lemmatizer changed ‘[Mike] Pence’ to ‘penny’, and so I had to account for this in my workflow. I also found that making a basic dictionary of word frequencies was a quick way to get a feel for the prevalence of words and hashtags in the corpus which informed my text preprocessing pipeline.
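To make the steps concrete, here is a minimal, standard-library-only sketch of such a pipeline along with the word-frequency dictionary. It is not the project's actual code: the regular expressions, the tiny stop-word set, and the names `clean_tweet`, `word_frequencies`, and `PROTECTED` are all illustrative assumptions, hashtag segmentation is left out, and lemmatization is a pluggable function (spaCy or anything else) that defaults to doing nothing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on"}
# Tokens the lemmatizer should leave alone (e.g. 'pence' should not become 'penny').
PROTECTED = {"pence"}

RT_RE = re.compile(r"^rt @\w+:?\s*")        # retweet prefix on a lowercased tweet
URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9#\s]")   # punctuation, emojis, other symbols


def clean_tweet(text, lemmatize=lambda token: token):
    """Lowercase, strip noise, drop stop words, then lemmatize each token."""
    text = text.lower()
    text = RT_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [t if t in PROTECTED else lemmatize(t) for t in tokens]


def word_frequencies(tweets):
    """Count cleaned tokens across the corpus for a quick feel of prevalence."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_tweet(tweet))
    return counts
```

Calling `word_frequencies(tweets).most_common(50)` is a quick way to spot frequent hashtags and odd lemmas before moving on to topic modeling.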

Topic Discovery with pyLDAvis

pyLDAvis is an interactive latent Dirichlet allocation (LDA) visualization library for Python that can be a great foundational framework for topic modeling.

Below is a static sample output with 12 topics from the Russian troll tweets dataset:

12-topic pyLDAvis visualization with the Russian troll tweet dataset

On the right, we have the 30 most salient terms for a given topic — as you click through the numbered topics or hover over their corresponding bubbles on the left, these words and their frequencies will change. As you pan through the topics, you may be able to identify the theme by the keywords.

On the left, we can see numbered bubbles of varying sizes placed on a grid. These topics have been projected onto a two-dimensional plane using dimensionality reduction (by default, pyLDAvis uses principal coordinate analysis, a form of multidimensional scaling closely related to PCA). The size of a bubble gives a sense of the topic’s prevalence in the corpus, and the distances between bubbles give a sense of topic similarity. As an example, we can see a lot of overlap in topics 7, 8, and 9, while topic 12 sits alone in the bottom left quadrant (topic 12 represented a bunch of German tweets, which makes sense!).
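For some intuition about what "distance" means here: as far as I understand it, pyLDAvis's default layout starts from the Jensen-Shannon divergence between each pair of topics' word distributions. The sketch below computes that divergence for three made-up topics over a four-word vocabulary. The numbers and names are purely illustrative and are not taken from the troll tweets model.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy topic-term distributions (made-up numbers, each sums to 1).
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]   # similar vocabulary to topic_a
topic_c = [0.0, 0.0, 0.1, 0.9]   # mostly different vocabulary

print(js_divergence(topic_a, topic_b), js_divergence(topic_a, topic_c))
```

Topics that share most of their vocabulary get a small divergence and end up as overlapping bubbles, while a topic with its own vocabulary (like the German tweets) lands far from the rest.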

By adjusting the number of topics with pyLDAvis, we can start to get a feel for the different themes in the corpus. Eventually, you should find a ‘sweet spot’ where topics are distinct enough from one another while maintaining coherent themes.

“Semi-Supervised” Learning with CorEx

Another powerful topic modeling framework is CorEx (a portmanteau of ‘Correlation Explanation’), which allows users to integrate their domain knowledge through the use of “anchor” words. Using these anchor words, we can highlight smaller topics that might be hidden by larger ones as well as tease them out further to get subtopics.

As an example, from pyLDAvis, a few topics emerged: Donald Trump, Hillary Clinton, police violence, race and the Black Lives Matter movement, and tweets in German. But I also remember seeing tweets about Barack Obama and Islam, and I want to get a sense of these as possible themes, too. I can set the following anchors (each anchor can be a single word or a list of words):

anchor_topics = [['donald', 'trump'],
                 ['hillary', 'clinton'],
                 ['merkel', 'muss', 'die', 'ist', 'ich', 'das'],
                 ['police', 'officer', 'shoot'],
                 ['obama', 'barack'],
                 ['muslim', 'islam']]

Next, I’ll tell CorEx to anchor one topic to each of these groups and to discover the remaining topics on its own, for 10 topics in total (n_hidden=10):

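from corextopic import corextopic as ct

# doc_word is assumed to be the document-term matrix built from proc_tweets (not shown here)
topic_model = ct.Corex(n_hidden=10, seed=42)
topic_model.fit(doc_word, words=words, docs=proc_tweets,
                anchors=anchor_topics, anchor_strength=8)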

Plotting their correlation scores, I can see that the Trump and Clinton topics were the strongest in my group of anchored topics, that Black Lives Matter, police violence, and Islam were maybe less prevalent than I suspected, and that tweets about Barack Obama could be a topic worth exploring more.

Similarly, knowing that Donald Trump and Hillary Clinton are the dominant topics in this corpus, I can set several anchors for each to see what subtopics emerge:

anchor_topics = [['donald', 'trump'],
['donald', 'trump'],
['donald', 'trump'],
['donald', 'trump'],
['hillary', 'clinton'],
['hillary', 'clinton'],
['hillary', 'clinton'],
['hillary', 'clinton']]

Which gives me the following subtopic keywords:

Depending on the scope of the project, I might expand the lists of anchor words for each topic to further tease them out, or continue to explore smaller themes in the corpus.

Creating Visualizations

Creating visualizations for natural language processing projects is not the most intuitive task. Word clouds or bar charts depicting word frequencies can be a neat addition to a suite of visualizations, but leave much to be desired in terms of crafting a compelling narrative.

Word cloud depicting vocabulary frequency in Russian troll tweets corpus

In the example of the Russian troll tweets dataset, we have timestamp details about when tweets were made. Plotting tweet frequency over time piques our interest: we can see a period of relatively low activity leading up to May 2016, followed by a sharp spike lasting about a year.

What’s going on during this highly active period? What are they tweeting about? If we align the tweets’ topics by the date they were created, we can start to answer these questions:

From pyLDAvis, I was able to identify half a dozen or so distinct, relevant topics in the corpus. From there, I labeled each tweet with its dominant topic (LDA assumes each document is a mixture of topics and provides a probability distribution over topics for each one). Plotting the topic frequencies over time starts to give us some meaningful insights: we can see that the huge spike in mid-2016 can be mostly attributed to tweets about Trump leading up to the US presidential election that year.
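The labeling step can be sketched in a few lines. This is an illustrative version rather than the project's code: it assumes each tweet comes as a `(timestamp, topic_probabilities)` pair, and the helper names and topic names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def dominant_topic(topic_probs):
    """Index of the most probable topic in a document's topic distribution."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def monthly_topic_counts(rows, topic_names):
    """Count tweets per month and dominant topic.

    rows: iterable of (created_at, topic_probs) pairs.
    """
    counts = defaultdict(Counter)
    for created_at, probs in rows:
        month = created_at.strftime("%Y-%m")
        counts[month][topic_names[dominant_topic(probs)]] += 1
    return counts


# Made-up example rows: two Trump-dominant tweets and one Clinton-dominant tweet.
topic_names = ["trump", "clinton", "general"]
rows = [
    (datetime(2016, 7, 4), [0.8, 0.1, 0.1]),
    (datetime(2016, 7, 9), [0.5, 0.3, 0.2]),
    (datetime(2016, 8, 1), [0.2, 0.7, 0.1]),
]
counts = monthly_topic_counts(rows, topic_names)
```

The resulting month-by-topic counts are the kind of table that can be handed to a plotting library for a stacked area or line chart.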

Let’s say our thesis about this corpus is that these Russian accounts were being used to influence the US election in support of Trump. Plotting topic frequency over time doesn’t tell the entire story, because we’re missing tone: sure, there are tweets about Trump, but are they actually in support of his candidacy?

Applying sentiment analysis to the tweets might give us some insight:

VADER is a rule-based sentiment analysis library designed with social media text in mind, and it was used above on three topics: Clinton and Trump (our focal points), and the “General Twitter” topic (tweets about music, games, holidays, etc.) as a type of baseline or reference point. We quickly see the sharp contrast in tone between the tweets about Clinton and Trump, especially in our period of interest: mid-2016 through the election.
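As a hedged sketch of how scores like these could be produced, the snippet below scores tweets with VADER's compound score and averages it per month and topic. It assumes the `vaderSentiment` package is installed. The helper names and the `(month, topic, compound)` record layout are my own illustration, not the project's code.

```python
from collections import defaultdict
from statistics import mean


def compound_scores(texts):
    """Score each text with VADER's compound score, which ranges from -1 to 1."""
    # Imported inside the function so the helper below works even when
    # the vaderSentiment package is not installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


def mean_sentiment(records):
    """Average compound score per (month, topic) pair.

    records: iterable of (month, topic, compound) tuples.
    """
    grouped = defaultdict(list)
    for month, topic, compound in records:
        grouped[(month, topic)].append(compound)
    return {key: mean(values) for key, values in grouped.items()}
```

VADER is usually run on lightly cleaned text rather than lemmatized tokens, since it uses cues such as capitalization and punctuation, so it may be worth scoring the original tweets instead of the preprocessed ones.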

These two charts together help us construct a narrative that the US presidential candidates were the dominant focus of these accounts during the election year, and that the accounts clearly favored one candidate in particular.

Another great visualization tool is Scattertext, which leverages the spaCy language model to provide rich, interactive scatterplots depicting word frequencies with respect to a label. In the example below, we have our 2016 presidential candidates on either axis and the most prevalent terms associated with each plotted, colored, and labeled in the space.

Scattertext provides a more interactive way to view word frequencies and context by label

Clicking on a word or searching for it in the chart reveals the reference documents — in this case, the tweets containing the selected term.

The clusters of words in the scatterplot can also be revealing. If we look at the most salient terms used in the Clinton topic, we see a very specific theme and tight clustering. The vocabulary is concerned with the various ‘scandals’ surrounding Clinton during this time and some negative hashtags.

Cluster of most salient terms used in the Clinton topic

Conversely, the most salient terms used in the Trump topic are less densely clustered, and more general and slogany: ‘#trumpforpresident’, ‘Republican’, and ‘rally’.

Cluster of most salient terms used in the Trump topic

Again, we can get a sense of the tone of the tweets for these topics, albeit without a sentiment score.

Final thoughts

There are a lot of exciting libraries and frameworks out there for topic modeling and natural language processing, and I’m barely scratching the surface. But hopefully some of you will find this a useful overview of how to put some of them together, and it gives you some ideas for how to present your findings in a coherent and compelling narrative.

I still think unsupervised learning and NLP are pretty magical, and I’m even more excited to continue building my data science toolkit and finding meaning in chaos.

Again, you can find all of the code on my GitHub. If you found this article helpful, please 👏!


