Please use this identifier to cite or link to this item: http://repository.iitr.ac.in/handle/123456789/15616
Title: A comparison among significance tests and other feature building methods for sentiment analysis: A first study
Authors: Sharma R.
Mondal D.
Bhattacharyya P.
Gelbukh A.
Published in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract: Words that participate in the sentiment (positive or negative) classification decision are known as significant words for sentiment classification. Identification of such significant words as features from the corpus reduces the amount of irrelevant information in the feature set under supervised sentiment classification settings. In this paper, we conceptually study and compare various types of feature building methods, viz., unigrams, TFIDF, Relief, Delta-TFIDF, χ² test and Welch’s t-test for the sentiment analysis task. Unigrams and TFIDF are the classic ways of feature building from the corpus. Relief, Delta-TFIDF and χ² test have recently attracted much attention for their potential use as feature building methods in sentiment analysis. On the contrary, the t-test is the least explored for the identification of significant words from the corpus as features. We show the effectiveness of significance tests over other feature building methods for three types of sentiment analysis tasks, viz., in-domain, cross-domain and cross-lingual. Delta-TFIDF, χ² test and Welch’s t-test compute the significance of the word for classification in the corpus, whereas unigrams, TFIDF and Relief do not observe the significance of the word for classification. Furthermore, significance tests can be divided into two categories, bag-of-words-based tests and distribution-based tests. A bag-of-words-based test observes the total count of the word in different classes to find the significance of the word, while a distribution-based test observes the distribution of the word. In this paper, we substantiate that the distribution-based Welch’s t-test is more accurate than the bag-of-words-based χ² test and Delta-TFIDF in identification of significant words from the corpus. © Springer Nature Switzerland AG 2018.
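Illustrative sketch (not part of the original record): the abstract's distinction between a bag-of-words-based test and a distribution-based test can be shown on toy data. The snippet below is a minimal example under stated assumptions. The per-document counts are invented, the function names are hypothetical, and the χ² statistic is a simplified goodness-of-fit form that compares a word's total counts in the two classes against an even split, which may differ from the exact formulation used in the paper. Welch's t statistic is the standard unequal-variance form.

```python
import math

def chi_square_total_counts(pos_counts, neg_counts):
    """Simplified chi-square on class totals (assumed form, see lead-in).

    Only the total count of the word in each class is used, so the way
    those occurrences are spread over documents is ignored.
    """
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    expected = (pos_total + neg_total) / 2  # even split under the null
    return ((pos_total - expected) ** 2 + (neg_total - expected) ** 2) / expected

def welch_t(pos_counts, neg_counts):
    """Welch's t statistic on per-document counts (unequal variances)."""
    n1, n2 = len(pos_counts), len(neg_counts)
    m1, m2 = sum(pos_counts) / n1, sum(neg_counts) / n2
    # Unbiased sample variances of the per-document counts
    v1 = sum((x - m1) ** 2 for x in pos_counts) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in neg_counts) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical per-document counts of two words in 4 positive and 4 negative reviews.
# Both words have the same class totals (12 positive, 1 negative).
spread_pos, spread_neg = [3, 2, 4, 3], [0, 1, 0, 0]              # used consistently
concentrated_pos, concentrated_neg = [12, 0, 0, 0], [0, 1, 0, 0]  # one document only

chi_spread = chi_square_total_counts(spread_pos, spread_neg)
chi_concentrated = chi_square_total_counts(concentrated_pos, concentrated_neg)
t_spread = welch_t(spread_pos, spread_neg)
t_concentrated = welch_t(concentrated_pos, concentrated_neg)

print(f"chi-square: spread={chi_spread:.3f}, concentrated={chi_concentrated:.3f}")
print(f"Welch t:    spread={t_spread:.3f}, concentrated={t_concentrated:.3f}")
```

In this toy setting the total-count statistic cannot tell the two words apart, while the per-document statistic scores the consistently used word higher than the word whose occurrences all come from one document. The sketch does not guard against zero variance in both classes, which would make the t statistic undefined.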
Citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2017), 3-19
URI: https://doi.org/10.1007/978-3-319-77116-8_1
http://repository.iitr.ac.in/handle/123456789/15616
Issue Date: 2017
Publisher: Springer Verlag
Keywords: Buildings
Classification (of information)
Computational linguistics
Data mining
Sentiment analysis
Text processing
Building methods
Classification decision
Cross-lingual
Different class
Feature sets
Sentiment classification
Significance test
Total counts
Testing
ISBN: 9783319771151
ISSN: 0302-9743
Author Scopus IDs: 55582575200
57204465418
7101803108
Author Affiliations: Sharma, R., Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India
Mondal, D., Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India
Bhattacharyya, P., Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India
Corresponding Author: Sharma, R.; Department of Computer Science and Engineering, Indian Institute of Technology Bombay, India; email: raksha@cse.iitb.ac.in
Appears in Collections: Conference Publications [CS]