How to Use Machine Learning for Sentiment Analysis with Word Embeddings

Pablo Grill
October 17, 2018

Researchers have tried out different ways of using machine learning to automatically predict the sentiment of a sentence. Predicting sentiment is a typical problem in NLP (Natural Language Processing), and there are many techniques that approach it with different methods of machine learning.

Most of these techniques extract features from the text and feed them to a "traditional" machine learning algorithm. In the last few years, the appearance of word embeddings has made neural networks more common than traditional algorithms because of their superior results.

In this article, I use convolutional neural networks and word embeddings to predict the sentiment of Amazon reviews.

Sentiment analysis allows for the automatic prediction of opinions and emotions.

Amazon Reviews Corpus

When we take a look at the taxonomy of machine learning algorithms, neural networks are classified as supervised learning. In supervised learning, the algorithms learn a function that maps a specific input to a specific output using a set of pre-classified examples called a training set. When the learning phase is complete, we can apply the learned function to classify new examples.

In the case of neural networks, we have a set of "neurons", each of which computes a specific function called the activation function. The neurons are grouped in layers and linked together with specific weights. Each neuron receives its input from the previous layer, applies the activation function, and passes the result on to the next layer.

The number of layers, neurons per layer, and type of activation function make up the neural network architecture.

Given a set of neurons with a specific architecture, the learning algorithm adjusts the values of the weights using the training set.
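To make the idea of layers, weights, and activation functions concrete, here is a minimal sketch of a forward pass written with NumPy. The layer sizes, the random weights, and the helper names (`relu`, `sigmoid`, `forward`) are assumptions made for illustration; they are not the network used later in this article.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Propagate input x through a list of (weights, bias, activation) layers."""
    out = x
    for weights, bias, activation in layers:
        # Weighted sum of the previous layer's outputs, then the activation.
        out = activation(out @ weights + bias)
    return out

rng = np.random.default_rng(0)
# Hypothetical architecture: 4 inputs -> 3 hidden neurons (ReLU) -> 1 output (sigmoid).
layers = [
    (rng.normal(size=(4, 3)), np.zeros(3), relu),
    (rng.normal(size=(3, 1)), np.zeros(1), sigmoid),
]
x = np.array([0.5, -1.0, 2.0, 0.0])
y = forward(x, layers)
print(y.shape)  # (1,)
```

Training consists of adjusting the `weights` and `bias` arrays so that the output gets closer to the labels in the training set.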

A training set allows a supervised learning algorithm to learn a function.

In the case of sentiment analysis, we want to find a function that can take a sentence and determine its sentiment (positive or negative). In order to learn this function using neural networks, however, we need to train the machine with a corpus of sentences and their corresponding sentiments.

In this post, we will use the Amazon Reviews Dataset published on Kaggle.

This dataset is a CSV file that contains a set of 4,000,000 Amazon reviews with the following information:

  • Title
  • Review Text
  • Label

The label is 1 if the review has 1 or 2 stars and 2 if the review has 4 or 5 stars. Reviews with 3 stars (neutral reviews) are not included in the dataset. This is an especially interesting dataset because it contains reviews that cover many topics.

For example, a positive review (with label 2):

Title: A fascinating insight into the life of modern Japanese teens

Text: I thoroughly enjoyed Rising Sons and Daughters. I don't know of any other book that looks at Japanese society from the point of view of its young people poised as they are between their parents' age-old Japanese culture of restraint and obedience to the will of the community, and their peers' adulation of Western culture. True to form, the "New Young" of Japan seem to be creating an "international" blend, as the Ando family demonstrates in this beautifully written book of vignettes of the private lives of members of this family. Steven Wardell is clearly a talented young author, adopted for some of his schooling into this family of four teens, and thus able to view family life in Japan from the inside out. A great read!

And a negative review (with label 1):

Title: ADDONICS PORTABLE CD DRIVE - I am disappointed in its performance

Text: I am disappointed in its performance. It seems underpowered and is constantly trying to read CDs, half the time unsuccessfully. I am going to try to return it to Amazon.
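As a rough sketch of how these rows could be read, the snippet below parses two made-up rows and maps label 1 to 0 (negative) and label 2 to 1 (positive), which is the form a binary classifier expects. The column order "label, title, text" and the helper `load_reviews` are assumptions of this example; check them against the actual file.

```python
import csv
import io

# Two made-up rows in an assumed "label,title,text" column order.
sample = io.StringIO(
    '"2","Great read","I thoroughly enjoyed it."\n'
    '"1","Disappointed","It seems underpowered."\n'
)

def load_reviews(handle):
    """Return (text, target) pairs, with target 0 = negative and 1 = positive."""
    pairs = []
    for label, title, text in csv.reader(handle):
        target = int(label) - 1  # label 1 -> 0, label 2 -> 1
        pairs.append((f"{title}. {text}", target))
    return pairs

reviews = load_reviews(sample)
print(reviews[0])  # ('Great read. I thoroughly enjoyed it.', 1)
```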

This dataset is split into two groups:

  • the "Training Set" containing 3,600,000 reviews

  • the "Testing Set" containing 400,000 reviews

The idea is to use the training set to train our neural network to assess sentiment and then evaluate how well it has learned to do so by using the testing set.


In order to compare the results, I will define a baseline using the Liu and Hu Opinion Lexicon. This lexicon contains a list of words that are classified as positive or negative.

The baseline algorithm is very simple and intuitive. The idea is to count the number of positive and negative words that a sentence has. If the number of positive words is greater than negative, we can assume that the sentiment of the sentence is positive. If we have more negative words, the sentence is assumed negative. However, there are many cases that this algorithm doesn't cover, such as negations.

The baseline algorithm is the following:

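from nltk.corpus import opinion_lexicon
from nltk.tokenize import treebank

# Load the lexicon once into sets so that each lookup is fast.
# Requires the corpus to be available: nltk.download('opinion_lexicon')
POSITIVE_WORDS = set(opinion_lexicon.positive())
NEGATIVE_WORDS = set(opinion_lexicon.negative())

def demo_liu_hu_lexicon(sentence):
    """Classify a sentence by counting its positive and negative words."""
    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    for word in tokenized_sent:
        if word in POSITIVE_WORDS:
            pos_words += 1
        elif word in NEGATIVE_WORDS:
            neg_words += 1

    if pos_words > neg_words:
        return 'Positive'
    elif pos_words < neg_words:
        return 'Negative'
    else:
        # Equal counts, including sentences with no lexicon words at all.
        return 'Neutral'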

When we apply this algorithm to the Amazon dataset, we achieve 50% accuracy.
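To see why negations are a problem for a counting rule, here is a small self-contained sketch. It uses a tiny hypothetical lexicon and plain whitespace splitting instead of the real opinion lexicon and tokenizer, so it only illustrates the logic, not the actual baseline.

```python
# Tiny hypothetical lexicon standing in for the real opinion lexicon.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "disappointed", "terrible"}

def count_based_sentiment(sentence):
    tokens = sentence.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "Positive"
    if pos < neg:
        return "Negative"
    return "Neutral"

print(count_based_sentiment("this is a great book"))   # Positive
print(count_based_sentiment("this book is not good"))  # Positive, although the review is negative
print(count_based_sentiment("it arrived on tuesday"))  # Neutral
```

The second sentence is misclassified because "not" is ignored and "good" still counts as a positive word. The third shows that the rule can also answer "Neutral", a label that does not exist in this dataset.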

Neural Network Algorithms

Neural network algorithms require the input to be expressed as high-dimensional integer vectors (more than 100 dimensions). In order to use a neural network in NLP tasks, we need a way to convert sentences into vectors. The task of converting sentences into vectors is called pre-processing, and there are several ways of accomplishing it.

In this article, I will use a word embedding model to obtain the vectors associated with the sentences. Word embeddings are a mapping between words and vectors built with an automatic algorithm. Word embeddings are very useful in NLP tasks because the generated vectors maintain the semantic relation between words.

For example, the function that transforms Man into Woman will return Queen if we apply it to King.

An illustration of Word2vec

Word embeddings also maintain semantic similarity. Words that have similar semantics will generate vectors that are similar.
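Both properties can be illustrated with a toy example. The vectors below are hand-made, three-dimensional, and chosen only so that the arithmetic works out; real Word2vec vectors have hundreds of dimensions and are learned from text. The helpers `cosine` and `nearest` are names invented for this sketch.

```python
import numpy as np

# Hand-made 3-dimensional vectors, chosen only for illustration.
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.2, -0.5]),
}

def cosine(a, b):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(vectors[w], target))

offset = vectors["woman"] - vectors["man"]  # the "man -> woman" transformation
result = nearest(vectors["king"] + offset, exclude={"king", "man", "woman"})
print(result)  # queen

# Related words are more similar than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))  # True
```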

In the context of this post, I will use a word embedding model provided by Google generated with the Word2vec algorithm from a corpus of news.

To obtain a vector that represents the sentences, we need to pre-process them. The first step is to tokenize each sentence, which will convert them into a list of tokens.

For example:

This sound track was beautiful! It paints the senery [sic] in your mind so well I would recomend [sic] it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

is tokenized in the following way:

['this', 'sound', 'track', 'was', 'beautiful', '!', 'it', 'paints', 'the', 'senery', 'in', 'your', 'mind', 'so', 'well', 'i', 'would', 'recomend', 'it', 'even', 'to', 'people', 'who', 'hate', 'vid', '.', 'game', 'music', '!', 'i', 'have', 'played', 'the', 'game', 'chrono', 'cross', 'but', 'out', 'of', 'all', 'of', 'the', 'games', 'i', 'have', 'ever', 'played', 'it', 'has', 'the', 'best', 'music', '!', 'it', 'backs', 'away', 'from', 'crude', 'keyboarding', 'and', 'takes', 'a', 'fresher', 'step', 'with', 'grate', 'guitars', 'and', 'soulful', 'orchestras', '.', 'it', 'would', 'impress', 'anyone', 'who', 'cares', 'to', 'listen', '!', '^_^']

The next step consists of replacing all the "stop words" with a placeholder token, <UNK>, and preserving only the N most common tokens. Every other token is replaced with <UNK> as well. By maintaining only the N most common tokens, we can ignore those words that occur in few reviews and don't contribute to the learning algorithm.

After running some experiments on the training set with different values of N, I found that N=7000 gave the best results.

This is what a review looks like after replacing the stop words and keeping only the most common tokens:

['<UNK>', '<UNK>', 'finished', 'reading', '<UNK>', '<UNK>', '<UNK>', 'wicked', '<UNK>', '<UNK>', '<UNK>', 'fell', '<UNK>', 'love', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'expected', '<UNK>', 'average', 'romance', 'read', '<UNK>', '<UNK>', 'instead', '<UNK>', 'found', 'one', '<UNK>', '<UNK>', 'favorite', 'books', '<UNK>', '<UNK>', 'time', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'thought', '<UNK>', 'could', 'predict', '<UNK>', 'outcome', '<UNK>', '<UNK>', 'shocked', '<UNK>', '<UNK>', 'writting', '<UNK>', '<UNK>', 'descriptive', '<UNK>', '<UNK>', 'heart', 'broke', '<UNK>', 'julia', "'s", '<UNK>', '<UNK>', '<UNK>', 'felt', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'instead', '<UNK>', '<UNK>', '<UNK>', 'distant', 'reader', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'lover', '<UNK>', 'romance', 'novels', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'must', 'read', '<UNK>', '<UNK>', "n't", 'let', '<UNK>', 'cover', 'fool', '<UNK>', '<UNK>', 'book', '<UNK>', 'spectacular', '<UNK>']
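One possible way to implement this masking step is sketched below. The stop word list is a tiny illustrative one, the helper names are invented for the example, and counting frequencies after dropping stop words is an assumption; the original pre-processing may differ in these details.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "i", "it", "was", "and", "this"}  # tiny illustrative list
UNK = "<UNK>"

def build_vocabulary(tokenized_reviews, n):
    """Keep the n most frequent tokens that are not stop words."""
    counts = Counter(
        token
        for review in tokenized_reviews
        for token in review
        if token not in STOP_WORDS
    )
    return {token for token, _ in counts.most_common(n)}

def mask_tokens(tokens, vocabulary):
    """Replace every token outside the vocabulary with the <UNK> placeholder."""
    return [token if token in vocabulary else UNK for token in tokens]

corpus = [
    ["this", "book", "was", "great"],
    ["i", "love", "this", "great", "book"],
    ["the", "book", "was", "boring"],
]
vocabulary = build_vocabulary(corpus, n=2)
masked = mask_tokens(corpus[0], vocabulary)
print(masked)  # ['<UNK>', 'book', '<UNK>', 'great']
```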

Finally, we must define an embedding matrix and replace words with the corresponding index in the matrix. This method is described in more detail here.

After this step, we will be left with the following:

[7000, 97, 362, 7000, 283, 7000, 7000, 5936, 7000, 7000, 7000, 7000, 313, 7000, 16, 7000, 6, 1359, 7000, 14, 7000, 40, 7000, 552, 7000, 7000, 126, 44, 7000, 7000, 7000, 390, 7000, 126, 7000, 1681, 7000, 7000, 7000, 7000, 7000, 7000, 655, 7000, 7000, 48, 390, 7000, 7000, 7000, 23, 44, 7000, 7000, 6737, 157, 7000, 3762, 7000, 7000, 302, 7000, 7000, 1131, 7000, 7000, 2921, 7000, 3830, 7000, 7000, 7000, 6, 4382, 83, 7000, 1971, 7000, 216, 7000, 7000]
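The sketch below shows one way to build such a matrix and encode a review. It assumes that the N kept tokens get indices 0 to N-1 and that index N is reserved for <UNK>, which matches the 7000 values in the list above. The pretrained vectors, the 4-dimensional size, and the all-zero row for <UNK> are simplifications chosen for this example.

```python
import numpy as np

EMBEDDING_DIM = 4  # 300 for the news-trained vectors; kept small here

# Hypothetical pretrained vectors standing in for the Word2vec model.
pretrained = {
    "book":  np.array([0.1, 0.2, 0.3, 0.4]),
    "great": np.array([0.5, 0.6, 0.7, 0.8]),
}

vocabulary = ["book", "great"]  # the N kept tokens
word_to_index = {word: i for i, word in enumerate(vocabulary)}
unk_index = len(vocabulary)     # index N is reserved for <UNK>

# One row per kept token plus a final all-zero row for <UNK> (an assumption).
embedding_matrix = np.zeros((len(vocabulary) + 1, EMBEDDING_DIM))
for word, index in word_to_index.items():
    embedding_matrix[index] = pretrained[word]

def encode(tokens):
    """Map each token to its row in the embedding matrix."""
    return [word_to_index.get(token, unk_index) for token in tokens]

encoded = encode(["<UNK>", "book", "<UNK>", "great"])
print(encoded)                          # [2, 0, 2, 1]
print(embedding_matrix[encoded].shape)  # (4, 4)
```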

Basic Neural Model

The first approach that I took to predict the sentiment of a review was using a basic neural network with two dense layers.

The neural networks were defined and trained using Keras with a TensorFlow backend.

The architecture of the network was defined as:

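from keras.models import Sequential
from keras.layers import Dense, Flatten
model = Sequential()
model.add(embedding_layer)  # first layer: pretrained word embeddings
model.add(Flatten())        # second layer: concatenate the word vectors
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])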

Where embedding_layer is defined as:

from keras.layers import Embedding
embedding_layer = Embedding(len(text_vector_dict_keys), 300, weights=[embedding_matrix], input_length=MAX_WORDS, trainable=False)

The first layer is an embedding layer. This layer receives a vector with an index and returns a matrix with the corresponding word embeddings.

The second layer is a flatten layer. This layer concatenates all the word vectors into a single vector. If we pad or truncate every review to 500 words and the word embedding model has dimension 300, the result of this layer is a single vector with 500 × 300 = 150,000 components.
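The shapes involved can be checked with a few lines of NumPy. This is only a sketch of what the embedding and flatten layers do to one review. The helper `pad_or_truncate`, the zero-filled embedding matrix, and padding short reviews with the <UNK> index are assumptions made for the example.

```python
import numpy as np

MAX_WORDS = 500      # fixed review length used in the text
EMBEDDING_DIM = 300  # dimension of the word embedding model
UNK_INDEX = 7000     # padding with the <UNK> index is an assumption of this sketch

def pad_or_truncate(indices, length=MAX_WORDS, pad_value=UNK_INDEX):
    """Cut long reviews and pad short ones so that all have the same length."""
    clipped = indices[:length]
    return clipped + [pad_value] * (length - len(clipped))

review = pad_or_truncate([97, 362, 283])
embedding_matrix = np.zeros((UNK_INDEX + 1, EMBEDDING_DIM))  # placeholder values

embedded = embedding_matrix[review]  # embedding lookup: one row per word
flattened = embedded.reshape(-1)     # flatten: rows concatenated end to end

# A dense layer with 250 neurons has one weight per input per neuron, plus biases.
dense_params = flattened.size * 250 + 250

print(embedded.shape)   # (500, 300)
print(flattened.shape)  # (150000,)
print(dense_params)     # 37500250
```

The last number shows why the first dense layer dominates the cost of training this model.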

Finally, the third and fourth layers are dense layers and reduce the dimensions of these vectors to 250 and 1 respectively.

The learning algorithm only fits the weights in the third and fourth layers. The weights in the first layer are not trained because they represent the word embedding model.

This basic neural network reaches 80% accuracy, an improvement of 30 percentage points over the baseline. However, the state of the art in sentiment analysis is over 95% accuracy, so the algorithm still needs improvement.

CNN Neural Model

The next step I took was replacing the basic neural network with a convolutional neural network (CNN). CNNs were designed to process images quickly. The problem with regular networks is that all their layers are fully connected, meaning that if we create a large number of hidden layers, the cost of adjusting the weights is high.

The definition of CNNs and their differences from regular networks can be found here.

In summary, with CNNs we don't connect each neuron to the input vector as a whole. Each neuron only considers a small neighborhood of adjacent vector components. This change greatly reduces the cost of fitting the network because the number of connections between neurons is considerably lower.

CNNs can be used in problems in which the neighborhood is important. This condition is satisfied by images (a change in a pixel affects the image in terms of its neighborhood) and also by sentences, where the meaning of a word depends on the words around it.
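To illustrate what "only considering the neighborhood" means for a sentence, here is a hand-written one-dimensional convolution in NumPy. The function `conv1d_same` and the input values are invented for this sketch. It mimics a single filter with 'same' padding and no activation function, not a full convolutional layer.

```python
import numpy as np

def conv1d_same(sequence, kernel, bias=0.0):
    """Slide a kernel over a (length, channels) sequence with zero padding."""
    kernel_size = kernel.shape[0]
    pad = kernel_size // 2
    padded = np.pad(sequence, ((pad, pad), (0, 0)))  # zeros before and after
    return np.array([
        np.sum(padded[i:i + kernel_size] * kernel) + bias
        for i in range(sequence.shape[0])
    ])

# Six "words", each a 2-dimensional embedding (made-up values).
sequence = np.arange(12, dtype=float).reshape(6, 2)
kernel = np.ones((3, 2))  # one filter spanning 3 neighbouring words

output = conv1d_same(sequence, kernel)
print(output.shape)  # (6,)
print(output[2])     # 27.0, computed from words 1, 2 and 3 only
```

The filter has 3 × 2 weights plus one bias, 7 parameters in total, no matter how long the sentence is. A fully connected layer mapping the same 12 inputs to 6 outputs would need 12 × 6 + 6 = 78.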

Regular Neural Network
Convolutional Neural Network

We can apply a CNN to our problem with the following architecture:

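from keras.models import Sequential
from keras.layers import Conv1D, Dense, Flatten

model = Sequential()
model.add(embedding_layer)  # same pretrained embedding layer as in the basic model
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'))
model.add(Flatten())        # assumed, as in the basic model: one vector per review
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])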

The architecture is similar to the previous solution. The main difference is that we introduced some additional layers with convolutions.
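As a back-of-the-envelope check, the number of trainable weights in both models can be computed by hand. The sketch assumes an input of 500 words with 300-dimensional embeddings and a flatten step before the dense layers in both models; the helper functions are written only for this calculation.

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

def dense_params(inputs, units):
    return inputs * units + units  # weights + biases

MAX_WORDS, EMBEDDING_DIM = 500, 300  # assumed input shape

conv_total = (
    conv1d_params(3, EMBEDDING_DIM, 32)
    + conv1d_params(3, 32, 64)
    + conv1d_params(3, 64, 128)
)
# padding='same' keeps the length at MAX_WORDS, so flattening gives 500 * 128 values.
cnn_total = conv_total + dense_params(MAX_WORDS * 128, 250) + dense_params(250, 1)
dense_only_total = dense_params(MAX_WORDS * EMBEDDING_DIM, 250) + dense_params(250, 1)

print(conv_total)        # 59744
print(cnn_total)         # 16060245
print(dense_only_total)  # 37500501
```

Under these assumptions, the three convolutional layers add fewer than 60,000 weights, while shrinking the input to the first dense layer cuts the total to less than half.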

When I apply this network to the Amazon reviews corpus, it reaches about 90% accuracy. These results are good, and I consider them acceptable.

Conclusions and Ways to Improve

In this article, I have shown that the use of word embeddings to solve NLP problems is a good approach. Word embeddings have properties that we can use to our benefit. The property of maintaining the semantic relations between words is very important.

Also, they let us omit the feature extraction phase. This point is very important because in some NLP problems, the required features are not trivial and the feature extraction requires a lot of research and testing.

Another advantage that word embeddings have is that they make it practical to use neural networks in NLP tasks. Without word embeddings, using neural networks for NLP would be much harder because we would need to extract more than 100 features by hand.

I have also demonstrated how CNNs can be used outside of image processing. CNNs are useful in sentiment analysis, improving the accuracy by 10 percentage points with respect to the basic dense network.

In future work, we could use different word embedding models to obtain better results. For example, we could create a vector model from a review corpus instead of a news corpus because a word embedding trained on sentiment-bearing text should be more accurate for this task than a generic word embedding.

"How to Use Machine Learning for Sentiment Analysis with Word Embeddings" by Pablo Grill is licensed under CC BY SA. Source code examples are licensed under MIT.

Photo by sophilabs.
