This was a project report that looked at using emoticons to create a labeled data set for tweets.
About the Data
The authors noted that tweets are different from many other sources used for sentiment analysis - things like movie reviews - in that:
- they are character limited (140 characters at the time of the paper; the limit has since doubled to 280)
- there is a huge amount of data to pull - and it is continuously being generated
- there is an unusual amount of slang and nonstandard spelling
- they aren't subject specific - you can filter using the API, but Twitter itself isn't a single-subject service
Using Emoticons as Labels
Using emoticons to decide whether a tweet is positive or negative has the benefit of automatically creating a labeled dataset. But since the emoticons serve as the labels, the authors had to remove them from the training set, which takes away one of the more useful signals for identifying a tweet's sentiment.
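The labeling idea can be sketched roughly as below. This is a minimal illustration, not the authors' code: the emoticon lists and the function name `label_and_strip` are assumptions made for the example. The function assigns a label based on which kind of emoticon is present, then strips the emoticons so they can't leak into the training text.

```python
# Illustrative emoticon sets (assumptions, not the authors' exact lists).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_and_strip(tweet):
    """Return (text_without_emoticons, label), or None if no clean label exists."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds: the tweet can't be labeled.
        return None
    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        # Remove the emoticon so the classifier never sees its own label.
        text = text.replace(emoticon, "")
    # Collapse any whitespace left behind by the removal.
    return " ".join(text.split()), label
```

Tweets with no emoticon, or with both kinds, come back as `None` and would simply be skipped when building the dataset.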
Getting and Cleaning the Data
They pulled 100 tweets from the API every 2 minutes until they had 800,000 positive and 800,000 negative tweets (after removing some tweets in pre-processing). The API lets you query by emoticon, so they used ":)" to grab positive tweets (the API matches any known equivalent emoticon) and ":(" for negative tweets. They removed re-tweets and duplicates, as well as any tweet that contained both positive and negative emoticons. They then replaced usernames with the token USERNAME and URLs with the token URL, and limited the number of consecutively repeated characters to 2 (so "loooove" becomes "loove").
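The token replacement and character-squeezing steps could look something like the sketch below. The regular expressions and the function name `normalize` are assumptions made for illustration; the notes don't say exactly how the authors matched usernames or URLs.

```python
import re

# Hypothetical patterns; the authors' exact matching rules aren't given here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USERNAME_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times


def normalize(tweet):
    """Replace URLs and usernames with tokens and squeeze repeated characters."""
    # Handle URLs first so that "www" isn't squeezed before it can be matched.
    text = URL_RE.sub("URL", tweet)
    text = USERNAME_RE.sub("USERNAME", text)
    # Limit runs of the same character to two.
    text = REPEAT_RE.sub(r"\1\1", text)
    return text


print(normalize("@bob I looooove http://example.com/x"))
```

Collapsing every username and URL into a single token shrinks the vocabulary the classifier has to deal with, and squeezing repeats maps variants like "looove" and "loooooove" onto the same word.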