Browsing by Author "Singh Bisht, Balwant"

Sentiment Analysis on Hindi-English Code-mix Data
(Indian Statistical Institute, 2023-06-23) Singh Bisht, Balwant
Social media has emerged as a prominent platform for expressing opinions, leading to the development of a distinctive form of language known as code-mix text. This form of language combines words from multiple languages, such as Hindi and English in India. While sentiment analysis techniques have achieved moderate success on English text, they have not attained the same level of effectiveness on code-mix text. In this study, we propose deep learning techniques to address the challenges of sentiment analysis on code-mix Hindi-English (Hinglish) text data. We take a transfer learning approach, leveraging XLM-RoBERTa, a pre-trained cross-lingual large language model. Four distinct approaches are used to train the model for sentiment analysis on a Hinglish dataset. The first approach trains the model on the Hinglish dataset exclusively. The other three approaches use mixed datasets: the second augments the Hinglish dataset with Spanish-English and Marathi-English datasets, the third relies solely on the mixed dataset without Hinglish data, and the fourth excludes the Spanish-English data. The trained models are evaluated on the same Hinglish dataset, and their performance is compared. The results indicate that increasing the training data by arbitrarily combining different kinds of mixed datasets does not yield improvements over previous findings. However, combining data from languages with similar linguistic characteristics can result in better performance. This suggests that the scarcity of data for code-mixed languages can be effectively addressed by using data from similar languages. In conclusion, our study emphasises the ongoing challenge of limited data for code-mixed languages. We demonstrate that augmenting the training data with various mixed datasets does not lead to enhanced performance, but that data from similar languages can be combined to produce better outcomes. These findings provide valuable insights for future research in sentiment analysis of code-mix text.
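
The four training configurations described in the abstract can be sketched as follows. This is only an illustrative sketch, not the author's code: the function name `build_training_sets`, the configuration names, and the toy sentences are all hypothetical, and the composition of the fourth configuration (Hinglish plus Marathi-English) is an assumption, since the abstract states only that it excludes the Spanish-English data.

```python
from typing import Dict, List, Tuple

# One labelled example: (text, sentiment label).
Example = Tuple[str, str]


def build_training_sets(
    hinglish: List[Example],
    spanish_english: List[Example],
    marathi_english: List[Example],
) -> Dict[str, List[Example]]:
    """Assemble the four training configurations (hypothetical names)."""
    return {
        # 1. Hinglish data only.
        "hinglish_only": list(hinglish),
        # 2. Hinglish augmented with Spanish-English and Marathi-English.
        "all_mixed": list(hinglish) + list(spanish_english) + list(marathi_english),
        # 3. Mixed data without any Hinglish examples.
        "mixed_without_hinglish": list(spanish_english) + list(marathi_english),
        # 4. ASSUMPTION: Hinglish plus Marathi-English (Spanish-English excluded).
        "without_spanish_english": list(hinglish) + list(marathi_english),
    }


# Toy placeholder data, invented for illustration only (not from the thesis).
hinglish = [
    ("movie bahut achhi thi", "positive"),
    ("service bilkul bekar hai", "negative"),
]
spanish_english = [("la comida was great", "positive")]
marathi_english = [("ha phone khup slow aahe", "negative")]

training_sets = build_training_sets(hinglish, spanish_english, marathi_english)
for name, examples in training_sets.items():
    print(name, len(examples))
```

In each configuration the model would be fine-tuned on the assembled training set and then evaluated on the same held-out Hinglish data, so that any difference in scores reflects the training mix alone.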