In my February blog, I explained how to use the tokenization technique in Natural Language Processing (NLP) to predict whether a particular Tweet could be geolocated to a particular neighborhood in the city of Caracas, Venezuela. Almost 37,000 Spanish Tweets with latitude and longitude coordinates from Caracas were used to observe reactions to the food shortages within each of the city’s five municipalities from December 2014 to October 2016.
Today I want to delve into the details of the stemming technique, which is the next natural step in turning human language into 0s and 1s the computer can understand. Stemming strips suffixes or prefixes from a word, leaving the root, or stem. I used the NLTK library within the Python programming language to analyze the text from these Tweets. NLTK is a great beginner library and includes common computational linguistic techniques. There are many great blogs out there that will give you code snippets if you want to dive straight in.
This is Tweet # 24 out of 2835 filtered results. It was written at 18:42 on January 4, 2015 in the Baruta municipality. For privacy considerations the author of the Tweet is not shown.
Applying tokenization to the original text of the Tweet, that is, breaking the sentence into a list of individual words and symbols, gave us this result in Python.
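Since the Tweet text itself is withheld for privacy, here is a minimal sketch of that tokenization step on a made-up Spanish sentence (the sentence and variable names are illustrative, not taken from the dataset). I use NLTK's TweetTokenizer here because it keeps hashtags intact as single tokens and requires no extra data downloads; the original analysis may have used a different tokenizer.

```python
from nltk.tokenize import TweetTokenizer

# Illustrative Spanish sentence, not an actual Tweet from the dataset.
tweet = "Hola Caracas, no hay pan en los anaqueles #AnaquelesVacios"

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
print(tokens)
# ['Hola', 'Caracas', ',', 'no', 'hay', 'pan', 'en', 'los', 'anaqueles', '#AnaquelesVacios']
```

Notice that punctuation such as the comma becomes its own token, while the hashtag survives as one unit; both behaviors matter in the cleanup steps that follow.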
There are several types of stemming techniques, including Snowball and Porter. Porter is known for its speed and simplicity, but it was designed for English. Snowball supports several other languages, including Spanish. So if we apply Snowball stemming to the tokenized list above and compare it to Porter stemming of the original Tweet, we will get the following result.
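In code, the comparison looks roughly like this. The word list is illustrative rather than taken from the Tweet; the point is that the Spanish Snowball stemmer knows Spanish verb endings, while the Porter stemmer only knows English rules.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Illustrative Spanish tokens; the real input would be the tokenized Tweet.
tokens = ["comiendo", "venezuela", "renuncia", "nicolas"]

snowball = SnowballStemmer("spanish")
porter = PorterStemmer()

for word in tokens:
    print(word, "->", snowball.stem(word), "|", porter.stem(word))

# The Spanish stemmer strips the verb ending "-iendo":
print(snowball.stem("comiendo"))  # 'com'

# The two stemmers disagree on 'nicolas', as the post notes below:
print(snowball.stem("nicolas"), porter.stem("nicolas"))  # nicol nicola
```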
Even though the Snowball stemmer seems to be more accurate, there is still further processing to do. We have several cases where a couple of words have been combined into one, such as “h.pan”, “anaquelesVaciosEnVenzuela” and “venezuela…renuncia.” From human domain expertise, we know that the word ‘nicolas’ probably refers to Venezuelan President Nicolás Maduro, so the stemming into ‘nicol’ from the Snowball stemmer or ‘nicola’ from the Porter stemmer is not accurate. There is also punctuation such as colons, periods, commas and hashtags remaining in the processed text. Next month I’ll talk about removing stopwords and punctuation from the Tweet text.
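As a small taste of that cleanup, one simple way to drop leftover punctuation tokens is to filter against Python's built-in string.punctuation before stemming. This is just a sketch with an illustrative token list, not the method from the analysis.

```python
import string

# Illustrative tokens with punctuation left over from tokenization.
tokens = ["no", "hay", "pan", ",", "venezuela", ":", "#", "renuncia", "."]

# Keep only tokens that are not a single punctuation character.
cleaned = [t for t in tokens if t not in string.punctuation]
print(cleaned)  # ['no', 'hay', 'pan', 'venezuela', 'renuncia']
```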