Twitter Sentiment Classification Using Distant Supervision

This was a project report that looked at using emoticons to create a labeled data set for tweets.

About the Data

The authors noted that tweets are different from many other sources used for sentiment analysis - things like movie reviews - in that:

  • they are character limited (140 characters at the time of the paper; the limit has since doubled)
  • there is a huge amount of data to pull - and it is continuously being generated
  • there is an unusual amount of slang and nonstandard spelling
  • it isn't subject specific - you can filter using the API, but Twitter itself isn't a single-subject service

Using Emoticons as Labels

Using emoticons to decide whether a tweet is positive or negative has the benefit of automatically creating a labeled dataset. But because the emoticons serve as the labels, the authors had to remove them from the training set, which takes away one of the more useful clues to a tweet's sentiment.
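As a rough illustration of the labeling idea, here is a minimal Python sketch. The emoticon lists and the function name label_and_strip are assumptions made for this example, not the authors' code, and the real emoticon mapping may be larger. The sketch assigns a label from the emoticons found, skips tweets with no emoticon or with mixed emoticons, and strips the emoticons from the text so they cannot leak into the training features.

```python
# Illustrative emoticon lists (an assumption; the real mapping may differ).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_and_strip(tweet):
    """Return (text_without_emoticons, label), or None if the tweet is unusable."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    # Skip tweets with no emoticon at all, or with both kinds.
    if has_pos == has_neg:
        return None
    label = "positive" if has_pos else "negative"
    text = tweet
    # Remove every emoticon, since it is the label and not a feature.
    for e in POSITIVE + NEGATIVE:
        text = text.replace(e, "")
    # Collapse whitespace left behind by the removal.
    return " ".join(text.split()), label


print(label_and_strip("I love this :)"))
print(label_and_strip("so sad :( but ok :)"))
```

The first call yields a cleaned text with a positive label, while the second returns None because the tweet mixes both kinds of emoticon.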

Getting and Cleaning the Data

They pulled 100 tweets from the API every 2 minutes until they had 800,000 positive and 800,000 negative tweets (after removing some tweets in pre-processing). The API lets you query by emoticon, so they used ":)" to grab positive tweets (the API matches any known equivalent emoticon) and ":(" for negative tweets. They removed retweets and duplicates, as well as any tweet that contained both positive and negative emoticons. They then replaced usernames with the token USERNAME and URLs with the token URL, and limited runs of consecutively repeated characters to 2.
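The cleaning steps described above could be sketched in Python roughly as follows. The regular expressions, the function names, and the "RT " prefix check for retweets are assumptions made for illustration; the report does not give the exact patterns the authors used.

```python
import re


def clean_tweet(text):
    """Apply the three token-level normalizations to one tweet."""
    # Replace URLs first so an "@" inside a link is not treated as a username.
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)
    # Replace @mentions with a single placeholder token.
    text = re.sub(r"@\w+", "USERNAME", text)
    # Limit any run of 3+ identical characters to exactly 2.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text


def drop_retweets_and_duplicates(tweets):
    """Keep the first copy of each tweet and skip retweets.

    Treating a leading "RT " as the retweet marker is an assumption.
    """
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet.startswith("RT ") or tweet in seen:
            continue
        seen.add(tweet)
        kept.append(tweet)
    return kept


raw = ["RT @a hi", "@bob I'm huuuungry http://t.co/abc", "@bob I'm huuuungry http://t.co/abc"]
print([clean_tweet(t) for t in drop_retweets_and_duplicates(raw)])
```

Note that the repeated-character rule as written here applies to any character, so "..." also becomes ".."; restricting the pattern to letters would be an equally reasonable reading.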