Today I want to present the results of a project I began in October that uses social media data mining techniques to analyze the Venezuelan food shortages. In countries that limit free speech, social media is often the most reliable record of events and serves as an alternative form of journalism. Spatiotemporal analysis of location-based social media data offers new ways to describe events. I used almost 37,000 geotagged Spanish-language Tweets from the city of Caracas, Venezuela to observe reactions to the food shortage crisis within each of the city’s five municipalities from December 2014 to October 2016. To test the hypothesis that certain Tweets are particular to a municipality, I used three machine learning classifiers: multinomial naïve Bayes, logistic regression and k-nearest neighbor.
The data used in my research comes from a corpus of Spanish Tweets collected from December 2014 to October 2016. Only geotagged Tweets were collected, using Twitter’s Streaming API (https://dev.twitter.com) to avoid API rate limits. Each Tweet is up to 140 characters of text and is associated with a user id, timestamp, latitude and longitude. The latitude and longitude of each municipality in Caracas, Venezuela (Baruta, Chacao, El Hatillo, Libertador and Sucre) were determined to four decimal places and included in my search criteria.
Tweets about the food crisis make up about 35% of all Tweets from Caracas during the sample period, which suggests how significant the event was to people living in the city. Search terms included #AnaquelesVaciosEnVenezuela / #EmptyShelvesinVenezuela, #VzlaTieneHambre / #VenezuelaisHungry, escasez / scarcity (noun), hambre / hunger (noun), and alimentos / foods (noun). The top words that were not among the five original search terms, shown here as stems, were ‘puebl’ (‘pueblo’/people), ‘distribu’ (‘distribución’/distribution), ‘col’ (‘cola’/line), ‘medicin’ (‘medicina’/medicine), ‘compr’ (‘compra’/buy) and ‘gobiern’ (‘gobierno’/government). The word “Hambre” (hunger) was the most popular term in every municipality except Libertador, where it was as popular as “Alimentos”, “#AnaquelesVaciosEnVenezuela” and “#VzlaTieneHambre”.
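As a rough illustration of how Tweets could be matched against the five search terms, here is a minimal Python sketch. The function names and the accent- and case-insensitive matching are my assumptions for illustration; the actual filtering was done through the Streaming API search criteria and may have behaved differently.

```python
import re
import unicodedata

# The five Spanish search terms listed above.
SEARCH_TERMS = [
    "#AnaquelesVaciosEnVenezuela",
    "#VzlaTieneHambre",
    "escasez",
    "hambre",
    "alimentos",
]


def normalize(text: str) -> str:
    """Lowercase the text and strip accents (assumed matching rule)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def matched_terms(tweet_text: str) -> list[str]:
    """Return the search terms that appear as whole tokens in the Tweet."""
    # A token is a word, optionally prefixed by '#', so hashtags stay intact.
    tokens = set(re.findall(r"#?\w+", normalize(tweet_text)))
    return [term for term in SEARCH_TERMS if normalize(term) in tokens]


# Invented example text, not a Tweet from the corpus.
print(matched_terms("No hay alimentos, solo hambre #VzlaTieneHambre"))
```

Because matching is done on whole tokens, a Tweet that contains only the hashtag #VzlaTieneHambre is not also counted as a match for the plain word hambre.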
The highest monthly count of filtered Tweets, 632, occurred in January 2015, the second month of the studied time period. The count drops considerably to 66 in May 2015 and then fluctuates between 11 and 122 for the remaining 16 months. The drop in Tweets is consistent with prior research into how information diffuses as time elapses after the initial start of an event, also known as immediacy.
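Monthly counts like these can be derived from the Tweet timestamps. The sketch below assumes the timestamps are stored as ISO 8601 strings, which is my assumption; the function names and the sample values are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def tweets_per_month(timestamps: list[str]) -> Counter:
    """Count Tweets per calendar month, keyed as 'YYYY-MM'."""
    counts = Counter()
    for ts in timestamps:
        # Assumes ISO 8601 strings such as '2015-01-03T10:00:00'.
        counts[datetime.fromisoformat(ts).strftime("%Y-%m")] += 1
    return counts


def peak_month(counts: Counter) -> tuple[str, int]:
    """Return the (month, count) pair with the most Tweets."""
    return counts.most_common(1)[0]


# Invented timestamps, not taken from the corpus.
sample = ["2015-01-03T10:00:00", "2015-01-20T08:30:00", "2015-05-02T12:00:00"]
monthly = tweets_per_month(sample)
print(sorted(monthly.items()))
print(peak_month(monthly))
```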
Machine learning classifiers detect patterns in past data and can use them to make predictions. Labels 1 to 6 were manually assigned to each of the municipalities in Caracas (1 = Baruta; 2 = Caracas, all; 3 = Chacao; 4 = El Hatillo; 5 = Libertador; 6 = Sucre). There are a total of 2,820 labeled data points and the largest category (Chacao) has a count of 900. A majority baseline classifier, which always predicts the largest category, would therefore get an accuracy of 900/2820, or 0.319. The multinomial naïve Bayes (MNB) accuracy is 0.383, the logistic regression (LR) accuracy is 0.448 and the k-nearest neighbor (k-NN) accuracy is 0.422. Expressed as a share of each classifier’s own accuracy, the gap over the baseline is 16.7% for MNB, 28.8% for LR and 24.4% for k-NN. Of the three classifiers, logistic regression was the most accurate.
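The arithmetic behind the baseline comparison can be checked with a few lines of Python. The counts and accuracies come from the paragraph above. The formula in `gap_share` is my inference: dividing the gap over the baseline by each classifier's own accuracy reproduces the reported 16.7%, 28.8% and 24.4%, though the formula itself was not stated. The function and variable names are mine.

```python
LABELED_TOTAL = 2820   # labeled data points
LARGEST_CLASS = 900    # count of the largest category (Chacao)

ACCURACY = {"MNB": 0.383, "LR": 0.448, "k-NN": 0.422}


def majority_baseline(largest: int, total: int) -> float:
    """Accuracy of always predicting the largest category."""
    return largest / total


def gap_share(accuracy: float, baseline: float) -> float:
    """Gap over the baseline as a share of the classifier's own accuracy
    (inferred formula; it matches the reported percentages)."""
    return (accuracy - baseline) / accuracy


baseline = majority_baseline(LARGEST_CLASS, LABELED_TOTAL)
print(f"baseline: {baseline:.3f}")
for name, acc in ACCURACY.items():
    print(f"{name}: +{acc - baseline:.3f} absolute, {gap_share(acc, baseline):.1%} gap share")
```

The absolute gaps (about 0.064, 0.129 and 0.103) are the simplest way to compare the classifiers, since they do not depend on which accuracy is used as the denominator.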
Since all three models perform better than the baseline classifier, a reasonable conclusion is that the initial hypothesis holds: the words used in a Tweet in Caracas are signals of its municipality location. This is a significant finding, because it shows that patterns in Tweet text can be detected at the micro, or municipality, level. Machine learning classifiers can also be used with a certain confidence to predict where Tweets might occur.
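To make the idea of words acting as location signals concrete, here is a toy multinomial naïve Bayes classifier in plain Python. It is a minimal sketch and not the implementation used in the project: the class name, the whitespace tokenization, the add-one smoothing and the four training texts are all assumptions or invented examples, and the municipality labels attached to them are arbitrary.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy texts with arbitrary labels, not data from the corpus.
texts = [
    "cola larga para comprar alimentos",
    "no hay medicina ni alimentos",
    "cola en el mercado",
    "escasez de medicina",
]
labels = ["Sucre", "Chacao", "Sucre", "Chacao"]

model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("cola para alimentos"))
print(model.predict("medicina"))
```

In this toy data, "cola" appears only under one label and "medicina" only under the other, so each word pulls the prediction toward the label it was seen with. That is the same mechanism, at a much smaller scale, that lets a classifier beat the majority baseline.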