Find out if a text is positive or negative
Sentiment analysis is an important application of natural language processing, as it makes it possible to predict what a person thinks given the text she has written.
For example, let's say you own a company and you would like to monitor the opinion of your customers on Twitter. It's fairly easy to detect the tweets in which your company or products are mentioned, and to find out how many times these tweets are liked or retweeted.
But tweets and likes do not mean that people like what you're doing! Maybe they're just destroying the reputation of your company online, or they like a funny tweet in which somebody says your products are really bad.
That's where sentiment analysis is needed: it will tell you whether the tweet is positive or negative for your company, and how you should interpret all these likes and retweets.
This post is the third part of my tutorial series about natural language processing with the yelp dataset. You will learn how to classify its reviews as positive or negative with two different deep neural networks.
First, we will try a simple network consisting of an embedding layer, a dense layer, and a final sigmoid neuron.
Then, we will see how convolutional layers can help us improve performance.
If you don't know the terms used above, you can refer to these tutorials:
A GPU will save you a lot of time, as we want to train fairly complex networks on a large number of events here.
To get access to a GPU for your training, you can simply run this tutorial on the Google Colaboratory platform by clicking this link. Make sure to change the runtime to run on a GPU.
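If you want to make sure that the GPU is actually visible, a quick check (assuming the TensorFlow 1.x setup used in this tutorial) is:
import tensorflow as tf
# should print True if a GPU is available to TensorFlow
print(tf.test.is_gpu_available())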
The other possibility is to use your own machine. Install Anaconda for Python 3.X, TensorFlow, and Keras on Windows or Linux. Then go to the yelp directory and open yelp_simplenet.ipynb.
Now let's initialize our tools:
# the usual stuff:
import matplotlib.pyplot as plt
import numpy as np
import keras
# get reproducible results
from numpy.random import seed
seed(0xdeadbeef)
from tensorflow import set_random_seed
set_random_seed(0xdeadbeef)
# needed to run on a mac:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
The reviews of the yelp dataset come as a very large JSON lines file containing the reviews in plain text, together with the corresponding rating and some more information. The reviews look like this:
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA",
"business_id":"ujmEBvifdJM6h6RLv4wQIg",
"stars":1.0,"useful":6,"funny":1,"cool":0,
"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
This file needs to be preprocessed for machine learning. Indeed, to feed the review text to a neural network, we need to convert it to an array of numbers in some way, a task called encoding.
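To make the idea of encoding concrete, here is a toy sketch, not the actual preprocessing used for this dataset: each known word is mapped to an integer, and unknown words get a special code.
# toy vocabulary: each known word gets an integer code
toy_vocab = {'<unk>': 1, 'horrible': 2, 'service': 3, 'avoid': 4, 'great': 5}

def toy_encode(text):
    '''Convert a piece of text to a list of integer codes.'''
    return [toy_vocab.get(word, toy_vocab['<unk>']) for word in text.lower().split()]

print(toy_encode('Horrible service avoid'))  # prints [2, 3, 4]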
You can follow this tutorial to see how to download the dataset and to do the preprocessing yourself.
Alternatively, you can obtain the necessary files here:
data.h5 contains the dataset. It's basically a large numpy array.
index.pck contains the vocabulary, which will be used to convert the numbers representing the words back into text in case we want to investigate our data.
I'm sorry, but at the moment, I do not know how to access these files directly from Google Colab. So if you want to run there, you will need to first download the files locally, and then upload them to Google Colab (just add a cell to this notebook to do that). This means the files will be transferred twice.
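If you choose Colab, one way to do the upload is the files helper of the google.colab package (standard Colab functionality, nothing specific to this tutorial):
# run this in a Colab cell, then select data.h5 and index.pck
# in the file dialog that appears
from google.colab import files
uploaded = files.upload()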
If you run on your own machine, you should put these files in the same directory as the yelp_simplenet.ipynb notebook.
Now let's open our dataset file. This is an hdf5 file, so we use the h5py package to open it.
import h5py
# datadir = '/data2/cbernet/maldives/yelp_dataset/'
datadir = './'
datafile = datadir+'data.h5'
h5 = h5py.File(datafile, 'r')  # open the file in read-only mode
h5.keys()
We can already use the dataset as a numpy array. h5py will load in memory only the data you need to complete a given operation. For example, here is the shape of the array:
data = h5['reviews']
data.shape
and let's check the first line:
data[0]
At the preprocessing stage, when I created this array, I decided to reserve the first four slots on each line for the number of stars and for the useful, funny, and cool counts. In this example, the reviewer gave 5 stars (the maximum rating) to this company, and somebody considered the review helpful.
After the first four slots come the codes for the review text. I allocated 250 slots for the reviews. If the review contains more than 250 words, it's truncated. If it contains less than 250 words, as is the case here, the unused slots are filled with zeros.
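The preprocessing was done in the previous tutorial, but if you're curious, here is a minimal sketch of how this kind of padding and truncation can be done with keras (the actual preprocessing may differ in its details):
from keras.preprocessing.sequence import pad_sequences
# two encoded reviews of different lengths
encoded_reviews = [[12, 5, 731, 4], [45, 8]]
# pad with zeros at the end, and truncate reviews that are too long
print(pad_sequences(encoded_reviews, maxlen=6, padding='post', truncating='post'))
# [[ 12   5 731   4   0   0]
#  [ 45   8   0   0   0   0]]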
We can decode this review with the vocabulary:
# load the vocabulary object from index.pck
import pickle
with open(datadir+'index.pck', 'rb') as pckf:
    vocab = pickle.load(pckf)
# selecting the text of the first review,
# excluding the first 4 slots
first_review = data[0,4:]
# the decoding returns a list of words,
# and we join the words with spaces
' '.join( vocab.decode(first_review) )
Let's extract the information needed to train our neural networks.
# the reviews
x = data[:, 4:]
# the stars, from which we will
# obtain the labels (see below)
stars = data[:,0]
# additional features we might consider:
useful = data[:,1]
cool = data[:,2]
funny = data[:,3]
Our goal is to predict whether the review text is positive or negative. Therefore, we need to label our examples in two categories: 0 (negative) and 1 (positive). We can use the number of stars to define these categories. For example, we could say that a review with 3 stars or more is positive.
First, let's check the distribution of stars:
plt.hist(stars[:1000], range=(-0.5, 5.5), bins=6)
Please note that the number of stars ranges from 1 to 5, so it's not possible for a reviewer to give no star.
Then, we want to split the dataset in two categories that have roughly the same number of examples.
If we were to define as positive the examples with 3 stars or more, the positive category would be much larger than the negative one.
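You can check this for yourself by computing the fraction of reviews that would end up in the positive category for each choice of boundary (I'm not quoting the numbers here, just the code to get them):
# fraction of positive reviews for two possible boundaries
print('3 stars or more:', np.mean(stars >= 3))
print('4 stars or more:', np.mean(stars >= 4))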
I prefer to define as positive all reviews with 4 stars or more. Technically, here is how to define the targets:
# first fill an array with zeros,
# with the same shape as stars
y = np.zeros_like(stars)
# then write 1 if the number of stars is 4 or 5
y[stars>3.5] = 1
print(y, len(y))
print(stars, len(stars))
As usual, we split the dataset into a training and a test sample. At first, we will use 20,000 examples for the test sample, and "only" 100,000 examples for the training sample:
n_test = 20000
n_train = 100000
x_test = x[:n_test]
y_test = y[:n_test]
x_train = x[n_test:n_train+n_test]
y_train = y[n_test:n_train+n_test]
Our first deep neural network will contain an embedding layer, a Flatten layer, dropout regularization, a small dense layer, and a final sigmoid neuron.
We start by creating an empty model:
model = keras.Sequential()
The first layer will be the embedding layer. Its role is to convert each integer representing a word into a vector in N-dimensional space. In this space, words with similar meaning will be grouped together.
Following the keras documentation, we indicate the number of possible words, the dimension of the embedding space, and the maximum size of the text.
We start with a 2-dimensional embedding space, as we had done in word embedding and simple sentiment analysis. That's a very low number of dimensions: typically, embedding is done in 10 to 100 dimensions.
But as usual, it's good to start small. We will try and increase the number of dimensions of the embedding space later to see if it improves performance.
review_length = len(x_train[0])
model.add(keras.layers.Embedding(len(vocab.words), 2,
input_length=review_length))
The output of the embedding is multidimensional. Indeed, we start with a 1D array with 250 words. Since embedding gives us a two-dimensional vector for each word, the embedding layer spits out an array of shape (250, 2). This 2D array cannot be used directly as input to a dense layer, so we need to flatten it into a 1D array with 500 slots. This is done by the Flatten layer:
model.add(keras.layers.Flatten())
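If you want to check the shapes for yourself, you can print the output shape of the model at this point. With 250 words per review and a 2-dimensional embedding, it should be (None, 500), where None stands for the batch dimension:
print(model.output_shape)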
Then, we add dropout regularization. In a nutshell, the dropout regularization layer randomly drops a fraction of its input values. This forces the network to learn different paths to solve the problem, and helps reduce overfitting. If you want to know a bit more about dropout regularization, check this out.
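If you'd like to see the idea in action, here is a rough numpy sketch of what dropout does at training time (keras rescales the surviving values so that their expected sum is unchanged, which is what the division does here):
# drop each value with a probability of 40%, and rescale the survivors
rng = np.random.RandomState(0)
values = np.ones(10)
kept = rng.uniform(size=10) > 0.4
print(values * kept / (1 - 0.4))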
Here we decide to drop 40% of the values from the Flatten layer:
model.add(keras.layers.Dropout(rate=0.4))
After that, we can add a dense layer, which will analyze the results of the embedding. Again, we start small, with only 5 neurons. We will see later if performance can be improved by increasing the number of neurons.
model.add(keras.layers.Dense(5))  # 5 neurons, linear activation by default
And finally, we end with a dense layer consisting of a single neuron with a sigmoid activation function . Therefore, this neuron will produce a value between 0 and 1, which is the estimated probability for the example review to be positive.
model.add(keras.layers.Dense(1, activation='sigmoid'))
We can now compile and print the full model:
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
We fit the model on the training dataset:
history = model.fit(x_train,
y_train,
epochs=10,
batch_size=1000,
validation_data=(x_test, y_test),
verbose=1)
We see that we end up with a validation accuracy of about 90%. To have a look at the performance in more detail, we will use the following function:
import matplotlib.pyplot as plt

def plot_accuracy(history, miny=None):
    '''Plot the training and validation accuracy'''
    acc = history.history['acc']
    test_acc = history.history['val_acc']
    epochs = range(len(acc))
    plt.figure()
    plt.plot(epochs, acc, label='training')
    plt.plot(epochs, test_acc, label='validation')
    if miny:
        plt.ylim(miny, 1.0)
    plt.title('accuracy')
    plt.xlabel('epoch')
    plt.legend()

plot_accuracy(history, miny=0.6)
The training accuracy plateaus at 90%, so training further will not help much.
What we see here is that this network underfits the data, meaning that the architecture is not complex enough to fit it. By making the network more complex, the training and testing accuracies can certainly be improved. A visual illustration of underfitting is shown here.
After some tuning, I converged to the following architecture. This network is more complex, so I use the full dataset for training to limit overfitting.
You can now execute the cell below and go grab a coffee.
n_test = 20000
x_test = x[:n_test]
y_test = y[:n_test]
x_train = x[n_test:]
y_train = y[n_test:]
model = keras.Sequential()
model.add(keras.layers.Embedding(len(vocab.words), 128,
input_length=review_length))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dropout(rate=0.4))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
history = model.fit(x_train,
y_train,
epochs=3,
batch_size=1000,
validation_data=(x_test, y_test),
verbose=1)
plot_accuracy(history,miny=0.85)
That's an improvement over the previous attempt, but the training is quite long, and it seems we will not be able to reach 93% classification accuracy on the test sample with this technique. In the next section we try a different strategy.
We introduced convolutional layers when we tuned a deep convolutional network for image recognition.
In that tutorial, a convolutional layer consists of a small window, called the kernel, which scans the image and extracts features at each position. The great advantage of convolutional layers for image recognition is that they can recognize parts of an image wherever these parts appear in the image. Also, convolutional layers consider the pixels within the kernel together, which allows them to find local relationships between these pixels.
In natural language processing, we deal with a sentence, not an image, but we can make the following analogies: the sentence plays the role of the image, the words play the role of the pixels, and the embedding dimensions play the role of the color channels.
In this section, we will introduce a 1D convolutional layer in our network, with a kernel size of 3. The kernel scans the review text: at each step, it moves by one word and looks at three consecutive words.
A group of words like really not good carries a lot of information for our sentiment analysis. The convolutional layer will find it whatever its position in the sentence. Also, it will be easy for the network to understand the meaning of not good. On the contrary, in our previous attempt, not and good were not directly considered together.
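To visualize what the kernel sees, here is a small sketch with a made-up sentence (the real reviews are encoded as integers, but the principle is the same):
# print the groups of 3 words scanned by a kernel of size 3
words = 'the service was really not good at all'.split()
kernel_size = 3
for i in range(len(words) - kernel_size + 1):
    print(words[i:i + kernel_size])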
Let's try. In the example below, the convolutional layer is set up with 64 filters, a kernel size of 3, and a ReLU activation:
n_test = 20000
n_train = 1000000
x_test = x[:n_test]
y_test = y[:n_test]
x_train = x[n_test:n_test+n_train]
y_train = y[n_test:n_test+n_train]
model = keras.Sequential()
model.add(keras.layers.Embedding(len(vocab.words), 64, input_length=250))
model.add(keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dropout(rate=0.4))
model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
history = model.fit(x_train,
y_train,
epochs=3,
batch_size=1000,
validation_data=(x_test, y_test),
verbose=1)
plot_accuracy(history,miny=0.85)
With the convolutional layer, we get the same performance as with our best attempt with a simple dense network. However, please note that we used only one million training examples here, instead of the full dataset, and a smaller embedding space (64 dimensions instead of 128).
In this section, we will optimize our convolutional network further by stacking convolutional layers.
As we have done in Tuning a deep convolutional network for image recognition, we perform max pooling between consecutive convolutional layers, and the layers extract more and more features as we progress through the network.
To avoid overfitting, we use the whole dataset for training except for 20000 events that are kept for testing.
n_test = 20000
x_test = x[:n_test]
y_test = y[:n_test]
x_train = x[n_test:]
y_train = y[n_test:]
model = keras.Sequential()
model.add(keras.layers.Embedding(len(vocab.words), 64, input_length=250))
# the number of filters increases as we go deeper: 16, then 32, then 64
model.add(keras.layers.Conv1D(filters=16, kernel_size=3, padding='same', activation='relu'))
model.add(keras.layers.MaxPooling1D(pool_size=2))  # sequence length: 250 -> 125
model.add(keras.layers.Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(keras.layers.MaxPooling1D(pool_size=2))  # sequence length: 125 -> 62
model.add(keras.layers.Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
history = model.fit(x_train,
y_train,
epochs=4,
batch_size=1000,
validation_data=(x_test, y_test),
verbose=1)
plot_accuracy(history,miny=0.85)
Nice! 93.7% accuracy, and only a tiny bit of overfitting. There is probably some room for optimization, so please let us know in the comments if you manage to do better with convolutional layers.
It's always interesting to look at misclassified examples to get a hint of what's going on and maybe get ideas for further improvements. That's what we're going to do now, with the first 100 examples.
Here are the predictions and the true labels for these samples:
x_sample = x_test[:100]
y_sample = y_test[:100]
preds = model.predict_classes(x_sample)
preds = np.array(preds).flatten()
print('true:')
print(y_sample)
print('predictions:')
print(preds)
Now, we select the misclassified examples, together with the true label and the prediction for these examples:
idx = preds!=y_sample
miscl = x_sample[idx]
miscl_pred = preds[idx]
miscl_true = y_sample[idx]
And we print the first five:
for pred, true, rev in zip(miscl_pred[:5], miscl_true[:5], miscl[:5]):
    rev = rev[rev!=0]  # remove padding
    print(pred, true)
    print(' '.join(vocab.decode(rev)))
    print('\n')
It's not too easy to understand the reviews with all the missing words, especially the stop words like "the", "I", "a", etc. Still, let's try.
So it appears we're not doing so badly: among the 5 misclassified reviews, three are weird, and the last two correspond to borderline cases.
We can build a pandas dataframe to look at the first 5 misclassified reviews. I know that these misclassified reviews are among the first 100 examples, so I will restrict the dataframe to this range:
import pandas as pd
# take the first 4 columns of the first 100 examples.
# give meaningful names to these columns
df = pd.DataFrame(data=data[:100,:4], columns=['stars','useful','funny','cool'])
# add a column to mark misclassified reviews:
df['misc'] = idx[:100]
# print the misclassified lines:
df[df['misc']==True]
We don't learn much, only that these reviews are indeed borderline: they have 3 or 4 stars, and we set the boundary between our negative and positive categories between 3 and 4 stars.
We're done. Now let's go back and wrap up!
In this post, you learnt how to label the yelp reviews as positive or negative, how to classify them with a simple dense network based on an embedding layer, and how to improve the performance with 1D convolutional layers.
We have seen that convolutional layers can really help in natural language processing, and are not restricted to image recognition. With these layers, the network is able to understand the meaning of small groups of words with a relatively small amount of data. Moreover, it is able to do so whatever the position of the group of words within the text.
In the next tutorial, we will try and do the same sentiment analysis with an even more advanced technique, a deep neural network based on an LSTM.
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)