Today I celebrate the four-year anniversary of my blog, which tries to explain key data science ideas in plain language and highlight topics that keep appearing in mainstream conversations. In my three-year anniversary blog, I wrote about using computer vision in a solar-powered solution to detect crop-destroying pests in sub-Saharan Africa and alert farmers before their crops are infested. What a difference a year makes: today I want to shift gears entirely and talk about one of the most common tasks in natural language processing (NLP) – sentiment analysis.
Sentiment analysis captures opinions, attitudes and emotions (the voice of the customer) by turning spoken or written text from a variety of sources into a form that can be processed computationally. One of the most fascinating problems sentiment analysis tries to solve is quantifying information that is often subjective (someone’s feelings). Knowing people’s reactions is useful in a wide variety of domains, from humanitarian work to marketing.
According to Wikipedia, the most basic task in sentiment analysis is classifying the polarity of a given text: how positive, negative or neutral a document, sentence or part of a sentence is. Beyond basic polarity, sentiment analysis dives deeper into the basic emotional states defined by Ekman, including “anger”, “disgust”, “fear”, “happiness”, “sadness”, “surprise”, or “neutral.” There has been a lot of recent research on detecting emotions in social media Tweets, so let’s look at an example from Axel Schultz et al. titled “A Fine-Grained Sentiment Analysis Approach for Detecting Crisis Related Microposts.”
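To make polarity classification concrete, here is a minimal lexicon-based sketch (the word lists are my own illustrative examples, not from any published lexicon or from the Schultz et al. paper): count positive and negative words and compare.

```python
# Tiny illustrative sentiment lexicons (hypothetical, for demonstration only).
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "awful"}

def polarity(text):
    """Classify text as positive, negative or neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great product"))  # → positive
print(polarity("what a terrible awful day"))  # → negative
```

Real systems use far larger lexicons or trained classifiers, but the core idea is the same: map subjective language onto a computable score.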
The processing pipeline Schultz describes is common to many natural language processing tasks: 1) get the data, 2) pre-process the data (see my March & April 2019 blogs), 3) extract features and 4) classify the text.
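The four steps above can be sketched as a toy pipeline (a sketch of the general pattern, not the paper's implementation; the sample Tweets and the rule-based "classifier" are placeholders I made up):

```python
import re

def get_data():
    # 1) Get the data -- a hardcoded stand-in for pulling Tweets from an API.
    return ["I LOVE this!!!", "so sad today"]

def preprocess(text):
    # 2) Pre-process: lowercase and strip punctuation.
    return re.sub(r"[^a-z\s]", "", text.lower()).strip()

def extract_features(text):
    # 3) Extract features: a simple bag-of-words unigram count.
    feats = {}
    for word in text.split():
        feats[word] = feats.get(word, 0) + 1
    return feats

def classify(features):
    # 4) Classify: a toy rule standing in for a trained model.
    return "positive" if "love" in features else "other"

for tweet in get_data():
    print(classify(extract_features(preprocess(tweet))))
```

In practice step 4 would be a trained model such as Naive Bayes or an SVM, but the shape of the pipeline stays the same.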
Schultz analyzes three datasets of freely available social media data. Dataset one is a random sample of 200 Tweets located in Seattle, Washington on March 6, 2012; dataset two is a random sample of 2,000 Tweets from the same location and date. Extracted features included word unigrams, part-of-speech tags, character trigrams and fourgrams, syntactic features and sentiment features. Nine volunteers manually labeled each Tweet with one of the emotions (“anger”, “disgust”, “fear”, “happiness”, “sadness”, “surprise”, or “neutral”). The feature combinations were evaluated using Naive Bayes, Multinomial Naive Bayes and Support Vector Machine models. Classification quality was measured both between the human annotators and for the machine learning models, using the precision, accuracy and recall metrics often used in data science. The best-performing machine learning model scored 0.658.
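As an illustration of what character trigram and fourgram extraction means (my own sketch, not the authors' implementation): slide a window of fixed length across the text and collect every substring it covers.

```python
def char_ngrams(text, n):
    """Return all contiguous character substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("flood", 3))  # → ['flo', 'loo', 'ood']
print(char_ngrams("flood", 4))  # → ['floo', 'lood']
```

Character n-grams are popular for noisy Tweet text because they still match despite misspellings and creative spellings that break word-level features.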
In my upcoming blogs, I’ll dive a bit more into unigram extraction, part-of-speech tagging, and trigram and fourgram extraction. And feel free to reach out with any questions or ideas. I’d love to hear from you as I continue this blogging journey.