Master's Final Projects in Cognitive Science and Language, Faculty of Philosophy, Universitat de Barcelona, Academic year: 2017-2018, Supervisor: Toni Badia
In this thesis we study the effect of word reordering as preprocessing for
Cross-Lingual Sentiment Analysis. We apply different reorderings to two target
languages (Spanish and Catalan) so that their word order more closely resembles that
of our source language (English). Our original expectation was that a Long
Short Term Memory classifier trained on English data with bilingual word
embeddings would internalize English word order, resulting in poor performance
when tested on a target language with different word order. We hypothesized that
the more closely the word order of a target language resembles that of our
source language, the better our sentiment classifier would perform on that
target language. We tested five sets of transformation rules
for our Part of Speech-based reorderings of Spanish and Catalan, drawn mainly from
two sources: two papers by Crego and Mariño (2006a and 2006b) and our own
empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that
the bilingual word embeddings with which we train our Long Short Term Memory
model do not help the model learn English word order in a way that benefits
cross-lingual use. There is no improvement when the Spanish and Catalan texts
are reordered so that their word order more closely resembles English, and no
significant drop in scores even when they are randomly reordered to the point of
being almost unintelligible. This holds both when classifying between 2 options
(positive, negative) and between 4 (strongly positive, positive, negative,
strongly negative). We also repeated these experiments with two other
classifiers: a Convolutional
Neural Network and a Support Vector Machine. The Convolutional Neural Network
should primarily learn only short-range word order, while the Long Short Term
Memory network should be expected to learn longer-range orderings as well. The
Support Vector Machine does not take word order into account. Subsequently, we
analyzed the prediction biases of these models to see how they affect the reordering
results. Based on this analysis, we conclude that the lack of improvement of the Long
Short Term Memory classifier when fed reordered text is not due to a
problem of prediction bias. In the process of training our models, we use two
bilingual lexicons (English-Spanish and English-Catalan) (Hu and Liu 2004) to
project our bilingual word embeddings between each language pair; these lexicons
contain words that are typically key for analyzing the sentiment of a sentence.
Given the results of the reordering experiments, we conjectured that what
determines how our models classify the sentiment of the target languages is
whether these lexicon words appear in the input sentence. Finally, we tested
different alterations of the target-language corpora to check this conjecture,
and the results appear to support it.
Our main conclusion, therefore, is that Part of Speech-based word reordering of a
target language to make its word order more similar to that of a source language
does not improve the sentiment classification results of a Long Short Term Memory
classifier trained on source-language data with our bilingual word embeddings,
regardless of the granularity of the sentiment labels.
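
To make the two kinds of preprocessing concrete, here is a minimal Python sketch of a Part of Speech-based reordering and of the random-reordering control. It is an illustration only: the single rule (swap a noun followed by an adjective), the tag names, the function names and the example sentence are all hypothetical and are not taken from the thesis's actual rule sets.

```python
import random

# Hypothetical rule set: tag pairs whose tokens are swapped so that
# Spanish/Catalan noun-adjective order becomes English adjective-noun order.
SWAP_PAIRS = {("NOUN", "ADJ")}


def reorder_by_pos(tagged, swap_pairs=SWAP_PAIRS):
    """Swap adjacent tokens whose (tag, tag) pair is listed in swap_pairs."""
    tokens = list(tagged)
    i = 0
    while i < len(tokens) - 1:
        if (tokens[i][1], tokens[i + 1][1]) in swap_pairs:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return tokens


def random_reorder(tagged, seed=0):
    """Control condition: shuffle all tokens, destroying word order."""
    tokens = list(tagged)
    random.Random(seed).shuffle(tokens)
    return tokens


# Assumed input format: a list of (word, POS tag) pairs from any tagger.
sentence = [("una", "DET"), ("película", "NOUN"), ("aburrida", "ADJ")]
reordered = [word for word, _ in reorder_by_pos(sentence)]
shuffled = [word for word, _ in random_reorder(sentence)]
```

Both functions return the same tokens in a different order, so a classifier that relies mainly on which words are present would be expected to score similarly on either output.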