Automating Twitter Data Annotation Process for Sentiment Analysis
Abstract
Background:
Sentiment analysis algorithms require high-quality annotated data during the training phase. Meeting this requirement typically involves a complex, time-consuming, and costly manual annotation process. To address these challenges, this research proposes an automatic data annotation process for sentiment analysis.
Materials and Methods:
Three semantic orientation measures (Pointwise Mutual Information, Latent Semantic Analysis, and Word2Vec), five classification algorithms (K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Random Forest, and Support Vector Machine), and the NRC lexicon thesaurus are used to automate the annotation of tweets for sentiment analysis.
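As an illustration of what a corpus-based semantic orientation measure computes, the following is a minimal sketch of a PMI-based orientation score. It is not taken from the article: the seed words, the document-level co-occurrence counting, the additive smoothing constant, and the function names `pmi` and `semantic_orientation` are all assumptions made for this example. The score is the summed PMI of a word with positive seed words minus its summed PMI with negative seed words.

```python
import math

def pmi(word, seed, docs, smoothing=0.01):
    """Pointwise mutual information of two words, using document-level
    co-occurrence counts. `docs` is a list of token sets. The additive
    smoothing (an assumption of this sketch) avoids log(0)."""
    n = len(docs)
    n_word = sum(word in d for d in docs)
    n_seed = sum(seed in d for d in docs)
    n_both = sum(word in d and seed in d for d in docs)
    return math.log2(
        ((n_both + smoothing) * n)
        / ((n_word + smoothing) * (n_seed + smoothing))
    )

def semantic_orientation(word, docs, pos_seeds, neg_seeds):
    """Positive values suggest positive polarity, negative values the opposite."""
    pos = sum(pmi(word, s, docs) for s in pos_seeds)
    neg = sum(pmi(word, s, docs) for s in neg_seeds)
    return pos - neg

# Toy corpus (hypothetical): each tweet is reduced to a set of tokens.
docs = [
    {"great", "good"},
    {"great", "good"},
    {"awful", "bad"},
    {"awful", "bad"},
]
so_great = semantic_orientation("great", docs, ["good"], ["bad"])
so_awful = semantic_orientation("awful", docs, ["good"], ["bad"])
```

In this toy corpus, "great" only co-occurs with the positive seed and "awful" only with the negative seed, so the two scores have opposite signs. How such scores are turned into training labels for the classifiers is not specified in the abstract.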
Results:
Tweets were annotated using five classifiers and three semantic measures, forming fifteen combinations. The Inter-Annotator Agreement (IAA) among these combinations was evaluated using Cohen’s Kappa statistic. The combinations Pointwise Mutual Information + Logistic Regression and Pointwise Mutual Information + Naïve Bayes achieved the highest agreement score, 0.7008.
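For reference, Cohen's Kappa corrects the observed agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the proportion of items labelled identically and p_e is the chance agreement derived from each annotator's label frequencies. The sketch below is a minimal, standard-library implementation; the function name `cohens_kappa` and the toy labels are illustrative assumptions, not the article's data or code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label; agreement is total.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example (hypothetical labels from two automatic annotators).
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```

Here the annotators agree on three of four tweets (p_o = 0.75) and chance agreement is 0.5, giving a kappa of 0.5. On the commonly used Landis and Koch scale, values between 0.61 and 0.80, such as the 0.7008 reported above, are usually described as substantial agreement.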
Conclusion:
These results show that corpus-based semantic orientation measures can produce substantial agreement in automatic annotation. However, the approach can still be enhanced through a broader vocabulary, the use of contextual information, and the application of more recent deep learning algorithms.
This work is licensed under a Creative Commons Attribution 4.0 International License.