Twitter Bot or Not
Twitter bot detection using supervised machine learning
Whether you’re on Twitter or stay away from social media altogether, these platforms affect us all — from shaping public discourse to entertainment to spreading information.
The existence of bots on these platforms has gained a lot of attention in recent years, and yet many people are still unaware of or misunderstand their presence and purpose on platforms like Twitter.
And so it’s important that we start with a simple working definition of a bot:
A Twitter bot is a software bot that controls a Twitter account via the Twitter API. It may autonomously perform actions like tweeting, retweeting, liking, following, unfollowing, or direct messaging other users.
Bots are designed for a variety of purposes: they can be creative, helpful, informative, and even funny.
There are, of course, more nefarious bots — bots that can spread misinformation or scam other users. The presence of these bots can degrade our experience on these platforms and worse: our trust in one another.
Instead of bucketing all bots into good or bad, I think developing bot awareness is key to preserving social trust and the integrity of these platforms. And so with that in mind, I wanted to create a tool that could help Twitter users and spectators alike become more bot aware: Twitter Bot or Not.
Users can enter any Twitter handle and see the probability of that account being a bot, based on a dozen or so account-level features (more on that to come). These predictions are intended to give the user peace of mind about who they’re following or interacting with.
I developed a supervised machine learning classification model using Python, the sklearn library, and XGBoost, and then built a simple app using Flask that I deployed on Heroku. In this post, I’ll walk through a few of the steps that went into developing the model and the app. All of the code and notebooks can be found on my GitHub.
Dataset and model features
For this project, I used the Twitter Bot Accounts dataset on Kaggle, which has account-level information on approximately 37,000 Twitter users, each labeled as ‘bot’ or ‘human’. Account-level information means we’re not actually looking at tweets, but rather activity information like number of tweets and likes, network information like number of followers and friends, whether or not the user is verified, etc.
When exploring the dataset, right off the bat we can see that bot and human accounts behave differently. For example, humans tweet more frequently than their robot counterparts.
Their networks look different, too: humans have bigger networks, both in terms of the accounts they follow and the accounts that follow them.
I used these account-level details, along with a few other engineered metrics such as overall network size and reach as the features in the model.
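To make the idea of engineered metrics concrete, here is a minimal sketch. The post does not spell out the formulas, so both definitions are assumptions: network size is taken as followers plus friends, and a follower-to-friend ratio stands in for reach. The function name and dictionary keys are hypothetical too.

```python
def add_engineered_features(account):
    """Return a copy of `account` with two illustrative engineered features.

    `account` is a dict of raw counts. The formulas are assumptions made
    for illustration, not the definitions used in the actual model.
    """
    followers = account["followers_count"]
    friends = account["friends_count"]

    enriched = dict(account)  # copy so the caller's dict is not mutated
    # Assumed definition: everyone the account is connected to.
    enriched["network_size"] = followers + friends
    # Assumed stand-in for "reach"; the +1 avoids division by zero.
    enriched["follower_friend_ratio"] = followers / (friends + 1)
    return enriched


example = add_engineered_features({"followers_count": 300, "friends_count": 99})
print(example["network_size"], example["follower_friend_ratio"])
```

Ratios like this are handy for tree-based models because they capture a relationship between two counts in a single split.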
Model scoring and evaluation
I wanted this model to accurately label bots as such, but not at the expense of labeling everything a bot, so I looked for a model that I could tune to balance precision and recall.
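The trade-off can be sketched in plain Python. This is only an illustration of the metrics, not the evaluation code from the project; the toy labels are made up, and the F1 score is shown as one common way to summarize the balance (the post does not say which summary metric was used).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: 1 = bot, 0 = human (made-up data).
y_true = [1, 1, 1, 0, 0, 0]

# A model that calls everything a bot catches every bot (recall 1.0),
# but half of its bot labels are wrong (precision 0.5).
print(precision_recall_f1(y_true, [1, 1, 1, 1, 1, 1]))

# A more balanced model trades a little recall for better precision.
print(precision_recall_f1(y_true, [1, 1, 0, 1, 0, 0]))
```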
I tried a number of classification models: Logistic Regression, K-Nearest Neighbors, Naive Bayes, Decision Tree, and Random Forest from the sklearn library, plus XGBoost. Given the non-linear nature of the data and the number of features, it wasn’t a surprise that the last two performed best ‘out of the box’ in terms of balanced precision and recall scores.
With some additional parameter tuning, XGBoost stood out as the best model.
Building the Flask app
After training on the full dataset, I pickled the final model to be used in a Flask app that would allow others to check the bot-likeliness of any Twitter user.
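The load-and-score step might look roughly like the sketch below. The feature list, function names, and file path are hypothetical, and it assumes the labels were encoded as 0 for human and 1 for bot, so that the second column of `predict_proba` is the bot probability.

```python
import pickle

# Hypothetical feature order; it must match the order used in training.
FEATURE_ORDER = [
    "statuses_count",
    "favourites_count",
    "followers_count",
    "friends_count",
    "verified",
]


def load_model(path):
    """Load a pickled classifier from disk. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def bot_probability(model, features, feature_order=FEATURE_ORDER):
    """Return the predicted probability that one account is a bot.

    `model` is any classifier exposing `predict_proba`; `features` is a
    dict keyed by feature name. Assumes class index 1 means "bot". A real
    model may expect a NumPy array or DataFrame instead of nested lists.
    """
    row = [features[name] for name in feature_order]
    probabilities = model.predict_proba([row])  # one row in, one row out
    return float(probabilities[0][1])
```

In a Flask view, the app would call something like `bot_probability(model, features)` once per search and render the result as a percentage.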
The app itself would be minimal: just a search bar where users could type in a Twitter handle, plus a display of the results. The structure of the app was easily coded up with a few lines of HTML, and I used the Bulma CSS framework to help with a simple and clean style.
To pull live Twitter information via their API, I set up a developer account. The application is straightforward: you provide a few details about who you are and what you intend to do. Within a few hours of submission, I had my access keys.
With the help of Tweepy, a Python library for interfacing with the Twitter API, I was able to pull the same account-level information I needed to make predictions on new, live users.
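As a rough sketch of that step, the function below flattens a user object into the raw account-level counts. The attribute names follow the Twitter v1.1 user object that Tweepy returns; the function name is hypothetical, and the exact set of fields the model used is an assumption based on the features mentioned earlier.

```python
def account_features(user):
    """Flatten a Twitter v1.1-style user object into account-level features.

    `user` is any object with the attributes used below, for example the
    result of tweepy.API(auth).get_user(screen_name="some_handle").
    That call is shown here only as an assumed way to obtain the object.
    """
    return {
        "statuses_count": user.statuses_count,      # number of tweets
        "favourites_count": user.favourites_count,  # number of likes
        "followers_count": user.followers_count,    # accounts following the user
        "friends_count": user.friends_count,        # accounts the user follows
        "verified": int(user.verified),             # bool to 0/1 for the model
    }
```

Because the function only reads attributes, it can be tried out with a simple stand-in object before any API keys are in place.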
If you’re new to Tweepy, here’s a nice chunk of starter code that’ll allow you to access your home timeline, create a Pandas DataFrame of tweets by a specific user or topic, and actually send tweets from Python.
I’m quite pleased with how this project turned out. I think the model does a reasonable job of making predictions on live data and I found myself lurking Twitter feeds playing ‘bot or not’: looking for trollish or bot-like comments and running a search on their handle with the app.
Of course there’s no way to be sure, but it’s pretty fun when your hunch is validated by AI.
I’m extremely interested in using Twitter more in the future and I’m excited to add natural language processing to my toolkit to dig deeper into the data. I see this project as the first piece in a larger body of work that explores networks, bubbles, and the nature of our curated realities on these platforms.
Thanks for reading — please check out the app and let me know your bot-likelihood!