Determining the popularity of our elected officials has typically been the responsibility of polling companies that generate the samples, process the data and deliver the reports. The world of polling, however, is changing. Thanks to the explosion of social media, abundant data and inexpensive tools are available to anyone who wants to draw their own conclusions about pretty much anything that generates public information. My friends and I, all enrolled in the same artificial intelligence course at the University of British Columbia, decided to harness this abundant data and our newly acquired AI knowledge to build a system that could determine the popularity of the President of the United States on any given day based on Twitter traffic.

Data Collection

Twitter’s streaming API was used to capture all tweets containing the term “obama” in real time. To do this, we developed a PHP script using Phirehose that captured the tweets matching our criteria and stored them as daily chunks. The script ran 16 hours per day for 15 days and cleaned the tweets as they were captured, removing URLs and punctuation, which were deemed unnecessary for the purposes of the project. All tweets were converted to lower case, and stop words (is, the, it, etc.) were removed to make post-processing faster and simpler.
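To make the cleaning step concrete, here is a minimal sketch in Python. It is an illustration only and not our original PHP script: the function name `clean_tweet`, the regular expression for URLs and the short stop-word list are all assumptions chosen for this example.

```python
import re
import string

# Hypothetical stop-word list for illustration; the list used in the
# project is not reproduced in this article.
STOP_WORDS = {"is", "the", "it", "a", "an", "and", "at", "of", "to", "in"}

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Translation table that deletes ASCII punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Return the list of lower-case words left after cleaning a tweet."""
    text = URL_PATTERN.sub(" ", text)   # remove URLs first, while "://" is intact
    text = text.lower()                 # convert to lower case
    text = text.translate(PUNCT_TABLE)  # strip ASCII punctuation
    # Split on whitespace and drop stop words.
    return [word for word in text.split() if word not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_tweet("Obama is at the rally! http://t.co/abc #vote"))
```

With this stop-word list, the sample tweet is reduced to the words obama, rally and vote. Note that `string.punctuation` only covers ASCII characters, so typographic quotes and similar symbols would need extra handling.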

Naive Bayes Classification

In order to predict how a word would be related to a popularity level, the system we designed had to be trained. The Rasmussen Report, which provides daily presidential ratings from polling results, was used to train our classifier. Each word was assigned a probability given a class, where each class was one of the rating values provided by the Rasmussen Report. For the data used by our tool, the Rasmussen Report had a maximum rating of -11 and a minimum rating of -18. Given these ratings, the maximum probability for a given word across the range of ratings was assigned to that word. The Naive Bayes classifier was used since we assumed that the words in a tweet were conditionally independent of each other, given the rating, when determining Prob(Rating | Word 1,…,Word n). Tweets were then classified based on the presence of words that, using the Bayes classifier, had become associated with a given popularity rating.
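The training and classification steps described above can be sketched roughly as follows. This is a simplified illustration, not our actual implementation: the class name `NaiveBayesRatingClassifier`, the use of add-one (Laplace) smoothing and the use of log probabilities are assumptions made for the example, and it presumes each training tweet is labelled with the Rasmussen rating for the day it was captured.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRatingClassifier:
    """Multinomial Naive Bayes over words, with ratings as the classes."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumed)
        self.word_counts = defaultdict(Counter)  # rating -> word -> count
        self.tweet_counts = Counter()            # rating -> number of tweets
        self.vocabulary = set()

    def train(self, tokens, rating):
        """Record one cleaned tweet (a list of words) under its rating."""
        self.tweet_counts[rating] += 1
        self.word_counts[rating].update(tokens)
        self.vocabulary.update(tokens)

    def log_score(self, tokens, rating):
        """log P(rating) + sum of log P(word | rating) over the tweet."""
        total_tweets = sum(self.tweet_counts.values())
        score = math.log(self.tweet_counts[rating] / total_tweets)
        total_words = sum(self.word_counts[rating].values())
        denominator = total_words + self.alpha * len(self.vocabulary)
        for word in tokens:
            count = self.word_counts[rating][word]  # 0 for unseen words
            score += math.log((count + self.alpha) / denominator)
        return score

    def predict(self, tokens):
        """Return the rating with the highest score for this tweet."""
        return max(self.tweet_counts, key=lambda r: self.log_score(tokens, r))


if __name__ == "__main__":
    # Toy data invented for illustration; not taken from our collection.
    clf = NaiveBayesRatingClassifier()
    clf.train(["great", "speech"], -11)
    clf.train(["great", "jobs"], -11)
    clf.train(["terrible", "policy"], -18)
    clf.train(["terrible", "economy"], -18)
    print(clf.predict(["great"]))
    print(clf.predict(["terrible", "speech"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer tweets, and the smoothing constant keeps a single unseen word from forcing a rating's probability to zero.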

Results

The results were both predictable and surprising. Despite the small amount of training data and the short collection time frame, our classifier predicted the same presidential approval rating as the Rasmussen Report 64% of the time. While hardly proof that Naive Bayes is effective at classifying Twitter data, these results demonstrated the ability of a simple classification scheme, in combination with plentiful tweet data, to make relatively accurate predictions. However, to dampen enthusiasm about already modest results, the training data was fairly one dimensional. Changes in the presidential approval rating were not large during our training phase, and overall approval ratings were negative for the entire collection period. More data collected over a longer time period, with more variation in the ratings, would likely result in more accurate classification.

Business Use Cases for Classification

Business applications in the web application market that make use of AI-powered classification are not immediately obvious. However, there are many ways that automated classification can be used to aid business functions. Help systems that classify combinations of words as pertaining to a certain problem can provide a smaller, more specific set of solutions to customers, allowing them to troubleshoot issues more quickly. Images can also be classified to automate folder creation or to help generate ads that match a user's interests more closely. And then, of course, there is the continued ability to predict the popularity of anything that generates enough data to train a classifier.

In general, AI classification gives developers another way to serve their customers more detailed and interesting products that more accurately and automatically reflect their interests and needs. As more data becomes available at a low cost to the consumer, the opportunities for developing new and innovative products are enormous.