Text Preprocessing For Machine Learning (yelp dataset part 2)

The yelp dataset is large, and it's in text format. Here are detailed explanations and all the code needed to convert it to a numpy array for machine learning.

Introduction

In part 1, we have seen how to access the yelp dataset to do simple text mining with pandas. Today, we're going to convert the yelp text data to a numpy array that can fit in memory and that is suitable for machine learning.

For now, our data is in a pandas DataFrame with two columns:

  • stars: the number of stars given by the user to the business
  • text: the user review text
     stars                                               text
31846    1.0  This place is awful. If quick and dirty brunch...
39095    1.0  My husband and I were so overly disappointed i...
78683    4.0  My review will focus on the Arizona State Fair...

We ultimately want to train our machine to predict whether the review is positive or negative given only the text, but we have two problems to solve:

  • To feed the review text to a neural network, for example, we need to convert it to an array of numbers in some way, a task called encoding;
  • Loading the full dataset in memory is barely doable on a computer with 16 GB of RAM.

Today, we'll write python scripts to solve both of them. You will learn:

  • the basics of text processing in python, with the nltk package;
  • how to do text encoding for machine learning;
  • features of python that can be considered fairly advanced (multiprocessing, organizing a python package in a modular way, algorithmic complexity, memory management);
  • how to manipulate datasets of several GB in python.

My goal is to show you in detail how I approached this problem. The focus is really on using python effectively to overcome hardware limitations. There are certainly ways to do things differently and better, so please feel free to react in the comments below!

Prerequisites:

numpy nltk pandas matplotlib ipython h5py

Divide and Conquer

Our dataset is a 5 GB JSON Lines file, review.json, which contains lines like:

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}

Processing this data is going to be a bit challenging, both in terms of RAM and CPU.

However, given the size of this JSON file, I was confident that it would be possible to find a way to transform this data into a structure that would entirely fit in memory.

To do that, I decided to adopt a tiered processing approach. In this approach, we will process the data in subsequent steps. For each review, we will take the following steps:

  • extract the words from the review text
  • encode words into integers
  • convert the lists of integers to a numpy array suitable to machine learning

The first two steps are CPU intensive, and the last one needs a lot of RAM.

CPU-intensive tasks can be performed in parallel on a multicore processor. On the other hand, tasks that require a lot of RAM may not be parallelized on a single machine, since all cores share the same RAM. For these reasons, I decided to isolate CPU- and RAM-intensive tasks in different processes.

A tiered processing workflow is also quite practical when you are not yet sure about what you want to do. If you need to change something in the later steps, you don't have to redo all of them.

Now, to be able to perform CPU-intensive tasks in parallel on different cores, we need to split our sample into chunks, so that each core can take care of a chunk.

The review.json file has 6685900 lines. We can split it in the following way:

split -l 340000 review.json 

In the -l option, I specify how many lines I want in each chunk. I chose this number to get 20 chunks, since I have 20 cores on my processor.

💡 To find out about the RAM and cores on your computer, check out the first part of this tutorial.

Now split review.json according to the number of cores on your computer.
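If you prefer not to hard-code the chunk size, here is a small sketch that computes the -l value from the number of lines in review.json and the number of cores on your machine (it assumes review.json is in the current directory):

import multiprocessing

# count the lines in review.json
with open('review.json') as f:
    n_lines = sum(1 for _ in f)

# one chunk per core, rounded up so that no line is left out
n_cores = multiprocessing.cpu_count()
lines_per_chunk = n_lines // n_cores + 1
print('split -l {} review.json'.format(lines_per_chunk))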

You will get a number of files from split:

xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj  xak  xal  xam  xan  xao  xap  xaq  xar  xas  xat

Check the contents of one of them:

head -n 1 xab
>
{"review_id":"srnRzrX0sWEigqfyV_3BVQ","user_id":"4eT43qWNh-9Xdy0_TTU1qw","business_id":"9mCX2MZvZP9KgnOUCVod0Q","stars":4.0,"useful":0,"funny":0,"cool":0,"text":"Came within 24 hours of request. Came at the scheduled time. Quickly loaded the items. Polite. A bit expensive, but they provide a useful service. I would call them again.","date":"2016-04-18 00:43:51"}

Text parsing with nltk

In this section, we will process our chunks to extract words from the review text, which is a string. For that, we will use the nltk package. But first, let's try and do it in bare python.

Start ipython, and do:

review = "I went there for a hair cut. Hair wash and stylist was great, but it was very hard to communicate with them since they all spoke chinese and not so good English. The stylist didn't quite understand me while the outcome was not bad. \n\nThe website said $50 for senior stylist but they charged me $60 + tax. Cash only, so I wonder why they charge tax?? Including tip, I ended up paying  $80 just for simple trimming."

A naive way to extract the words of this review would be to split the string:

print(review.split())
['I', 'went', 'there', 'for', 'a', 'hair', 'cut.', 'Hair', 'wash', 'and', 'stylist', 'was', 'great,', 'but', 'it', 'was', 'very', 'hard', 'to', 'communicate', 'with', 'them', 'since', 'they', 'all', 'spoke', 'chinese', 'and', 'not', 'so', 'good', 'English.', 'The', 'stylist', "didn't", 'quite', 'understand', 'me', 'while', 'the', 'outcome', 'was', 'not', 'bad.', 'The', 'website', 'said', '$50', 'for', 'senior', 'stylist', 'but', 'they', 'charged', 'me', '$60', '+', 'tax.', 'Cash', 'only,', 'so', 'I', 'wonder', 'why', 'they', 'charge', 'tax??', 'Including', 'tip,', 'I', 'ended', 'up', 'paying', '$80', 'just', 'for', 'simple', 'trimming.']

Not bad, but we see a couple of issues:

  • punctuation is not handled properly: "tax??" will be considered a different word from "tax"
  • contractions are not understood: "didn't" is not going to be related to "did", nor to "not"

We could improve our splitting until it handles these complex cases, but people have done that already, so there is no need to redo it.

To perform the tokenization (the task of extracting the words), we will instead use the nltk package:

import nltk
# to be done only once:
nltk.download('punkt')
print(nltk.word_tokenize(review))
['I', 'went', 'there', 'for', 'a', 'hair', 'cut', '.', 'Hair', 'wash', 'and', 'stylist', 'was', 'great', ',', 'but', 'it', 'was', 'very', 'hard', 'to', 'communicate', 'with', 'them', 'since', 'they', 'all', 'spoke', 'chinese', 'and', 'not', 'so', 'good', 'English', '.', 'The', 'stylist', 'did', "n't", 'quite', 'understand', 'me', 'while', 'the', 'outcome', 'was', 'not', 'bad', '.', 'The', 'website', 'said', '$', '50', 'for', 'senior', 'stylist', 'but', 'they', 'charged', 'me', '$', '60', '+', 'tax', '.', 'Cash', 'only', ',', 'so', 'I', 'wonder', 'why', 'they', 'charge', 'tax', '?', '?', 'Including', 'tip', ',', 'I', 'ended', 'up', 'paying', '$', '80', 'just', 'for', 'simple', 'trimming', '.']

That's much better! The punctuation and contractions are handled properly.

💡 Don't reinvent the wheel, especially with python. Our goal is to do data science and machine learning, not programming. So if you need to do anything, chances are that there is a package for that already. Just google it.

Now let's see how to do the tokenization for the whole dataset.

Setting your locale (macOS)

Before you follow the rest of this tutorial, a warning: on my macOS machine, the processing of review.json was failing with an error like:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8145: ordinal not in range(128)

To fix this, I had to set this environment variable:

export LANG=en_US.UTF-8

It might not be necessary or advisable in your case, but I just wanted to warn you in case you get the same error.
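Another option, which I did not use in the scripts below, is to make the code independent of the locale by passing an explicit encoding when opening the files. A minimal sketch, assuming review.json is in the current directory:

# open the file with an explicit encoding, so that the locale does not matter
with open('review.json', encoding='utf-8') as ifile:
    first_line = ifile.readline()
print(first_line[:80])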

Tokenizing the whole dataset

Here is what I have written to tokenize the whole dataset. This module, yelp_tokenize.py, as well as the supporting modules parallelize and base, can be found on https://github.com/cbernet/maldives/tree/master/yelp . You can just download the whole repository.

The code should be understandable. If not, feel free to ask questions in the comments; I'll be happy to give more details.

yelp_tokenize.py:

'''Tokenize a JSON lines dataset with nltk
'''

import os 
import json
import nltk
import sys 
nltk.download('punkt')

def output_fname(input_fname):
    return os.path.splitext(input_fname)[0] + '_tok.json'

def process_file(fname, options):
    '''tokenize data in file fname. 
    The output is written to fname_tok.json
    '''
    print('opening', fname)
    ofname = output_fname(fname)
    ifile = open(fname)
    ofile = open(ofname,'w')
    for i, line in enumerate(ifile):
        if i%1000 == 0:
            print(i) 
        if i==options.lines:
            break        
        # convert the json on this line to a dict
        data = json.loads(line) 
        # extract the review text
        text = data['text']
        # tokenize
        words = nltk.word_tokenize(text)
        # convert all words to lower case 
        words = [word.lower() for word in words]
        # updating JSON and writing to output file
        data['text'] = words
        line = json.dumps(data)
        ofile.write(line+'\n')
    ifile.close()
    ofile.close()

def parse_args():
    '''Parse command line arguments.
    See base.setopts for more information
    '''
    from optparse import OptionParser        
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)    
    setopts(parser)
    (options, args) = parser.parse_args()    
    if len(args)!=1:
        parser.print_usage()
        sys.exit(1)
    # pattern should match the files you want to process, 
    # e.g. 'xa?'
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob
    import parallelize
    from multiprocessing import Pool
    
    options, pattern = parse_args()
    olddir = os.getcwd()    
    os.chdir(options.datadir)

    fnames = glob.glob(pattern)
       
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)
            
    os.chdir(olddir)    

An interesting thing to note: two tasks need to be done in several scripts, namely the parallelization over several cores and the definition of some common command line arguments. Indeed, these scripts all have the following in common:

  • they read several input files from a directory
  • they perform some task on each file, possibly returning results

That's why I decided to create a unified interface for the parallelization, and to define the common command line arguments in a single place. I did that in the parallelize and base modules, respectively.
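I won't reproduce these two modules here since you can get them from the repository, but to give you an idea of the interface, here is a minimal sketch of what they could look like. This sketch is inferred from the way the modules are used in the scripts of this post, so the real code may differ in its details:

# parallelize.py (sketch)
from multiprocessing import Pool

def run(function, fnames, nprocesses, *args):
    '''Apply function(fname, *args) to each file in fnames.
    If nprocesses is None, run sequentially in this process;
    otherwise, spawn a pool of nprocesses worker processes.'''
    if nprocesses is None:
        return [function(fname, *args) for fname in fnames]
    with Pool(processes=nprocesses) as pool:
        return pool.starmap(function, [(fname,) + args for fname in fnames])

# base.py (sketch)
def setopts(parser):
    '''Define the command line options common to all scripts.'''
    parser.add_option("-d", "--datadir", dest="datadir", default=".",
                      help="directory containing the data files")
    parser.add_option("-l", "--lines", dest="lines",
                      default=float('inf'), type=int,
                      help="number of lines to process in each file (default: all)")
    parser.add_option("-p", "--parallel", dest="parallel",
                      action="store_true", default=False,
                      help="process the input files in parallel, one process per file")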

To test this script, do:

python yelp_tokenize.py -d <path_to_your_yelp_dataset> 'xaa' -l 1000 

This takes only one input file, xaa, and reads only 1000 lines from this file in a single process.

To run it on the whole dataset in parallel:

python yelp_tokenize.py -d <path_to_your_yelp_dataset> 'xa?' -p  

This takes all files (in the wildcard pattern, ? matches any single character), reads all lines in each file, and spawns one process per file.

After a little time, maybe 10 minutes depending on your computer, you will get new files in your data directory, with a name ending with _tok.json. Check the contents of a file:

head -n 1 xaa_tok.json 
>
{"review_id": "Q1sbwvVQXV2734tPgoKj4Q", "user_id": "hG7b0MtEbXx5QzbzE6C_VA", "business_id": "ujmEBvifdJM6h6RLv4wQIg", "stars": 1.0, "useful": 6, "funny": 1, "cool": 0, "text": ["total", "bill", "for", "this", "horrible", "service", "?", "over", "$", "8gs", ".", "these", "crooks", "actually", "had", "the", "nerve", "to", "charge", "us", "$", "69", "for", "3", "pills", ".", "i", "checked", "online", "the", "pills", "can", "be", "had", "for", "19", "cents", "each", "!", "avoid", "hospital", "ers", "at", "all", "costs", "."], "date": "2013-05-07 04:34:36"}

All reviews have been split into words, and now we need to find out how to convert these words into a format suitable to a machine learning algorithm.

As we have seen in my tutorial about the 1-neuron network, neural nets are just a mathematical function of their input values. Therefore, they need numbers as input, not words.

In the next sections, we will convert our words to numbers, a procedure called encoding.

To do that, we will build a vocabulary, which is an ordered list of all possible words in all reviews. Then, a word can be encoded as a number giving its position in the vocabulary.

Building the vocabulary

Here is a simple example of encoding that you can run in ipython:

review = ["The", "pizza", "is", "excellent", ".", "The", "wine", "is", "not", "."] 
vocabulary = []
index = dict()
encoded_review = []

# building the index:
for word in review:
  if word not in index:
    vocabulary.append(word)
    index[word] = len(vocabulary) - 1

# encoding:
for word in review:
  encoded_review.append(index[word])
print(encoded_review)
[0, 1, 2, 3, 4, 0, 5, 2, 6, 4]

To decode, you can do:

print(' '.join(vocabulary[i] for i in encoded_review))
The pizza is excellent . The wine is not .

A very important thing to note is the use of the index dictionary, which maps each word (key) to its position in the vocabulary (value).

Why did we introduce this index?

Python dictionaries are hash maps. When a new key/value pair is added to the dictionary, the key is hashed (converted to an integer). The hash can then be used to directly find and access the value in memory. Therefore, searching for a key in a dictionary is of O(1) complexity. This means that the time needed for this operation does not depend on the dictionary size (at least to first order).

Python lists are represented by an array. Accessing an element of the list knowing its position in the array is O(1). But if you want to search for a word in a list, you need to loop on the list until you find the word, which is of O(N) complexity. This means that the time needed for a search operation in a list scales linearly with the size of the list, as the quick timing sketch after the list below illustrates.

In the code above, I used both a list and a dictionary:

  • the index is a dictionary because:
    • we want to insert a word in the vocabulary only if it's not already there. So we need to search the index for each word during the construction of the vocabulary, and we need to do that fast.
    • after the vocabulary is built for the whole yelp dataset, we will encode all reviews. For that, we will look up the integer corresponding to each word in the index. That's a massive number of searches, so it must be fast.
  • the vocabulary is a list because:
    • it must be ordered, since the position in the vocabulary represents a word.
    • at decoding time, it's good to have O(1) complexity, though we're not going to decode much in practice.
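To make the difference between dictionary and list searches concrete, here is a quick timing sketch you can run in ipython. The exact numbers depend on your machine, but the dictionary lookup should come out orders of magnitude faster:

import timeit

# a list and a dictionary containing the same 100,000 "words"
words_list = ['word{}'.format(i) for i in range(100000)]
words_dict = {word: i for i, word in enumerate(words_list)}

# searching for the last word 100 times: O(N) in the list, O(1) in the dict
print(timeit.timeit(lambda: 'word99999' in words_list, number=100))
print(timeit.timeit(lambda: 'word99999' in words_dict, number=100))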

Now here is the script I've written to build the vocabulary for the whole yelp dataset.

yelp_vocabulary.py

'''Build the vocabulary for the yelp dataset'''

import sys
import json
from collections import Counter

# stop words are words that occur very frequently,
# and that don't seem to carry information 
# about the quality of the review. 
# we decide to keep 'not', for example, as negation is an important piece of information.
# I also keep '!', which I think might be more frequent in negative reviews, and which is 
# typically used to make a statement stronger (in good or in bad). 
# the period, on the other hand, can probably be considered neutral.
# this could have been done at a later stage as well, 
# but we can do it here as this stage is fast 
stopwords = set(['.', 'i', 'a', 'and', 'the', 'to', 'was', 'it', 'of', 'for', 'in', 'my',
                 'that', 'so', 'do', 'our', ',', 'we', 'you',
                 'are', 'is', 'be', 'me'])

def process_file(fname, options):
    '''process a review JSON lines file and count the occurrence 
    of each word in all reviews.
    returns the counter, which will be used to find the most frequent words
    '''
    print(fname)
    with open(fname) as ifile:
        counter = Counter()
        for i,line in enumerate(ifile):
            if i==options.lines:
                break
            if i%10000==0:
                print(i)            
            data = json.loads(line) 
            # extract what we want
            words = data['text']               
            for word in words:
                if word in stopwords:
                    continue
                counter[word]+=1
        return counter

def parse_args():
    from optparse import OptionParser        
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)    
    setopts(parser)
    parser.add_option("-n", "--nwords",
                      dest="nwords", default=20000, type=int,
                      help="max number of words in vocabulary, default 20000")
    (options, args) = parser.parse_args()    
    if len(args)!=1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob    
    import pprint
    from vocabulary import Vocabulary
    import parallelize


    options, pattern = parse_args()
    
    olddir = os.getcwd()
    os.chdir(options.datadir)

    fnames = glob.glob(pattern)
       
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)

    full_counter = Counter()
    for counter in results:
        full_counter.update(counter)

    vocabulary = Vocabulary(full_counter, n_most_common=options.nwords)
    vocabulary.save('index')
    
    pprint.pprint(full_counter.most_common(200))
    print(len(full_counter))
    print(vocabulary)
    os.chdir(olddir)    

This script is very similar to the tokenize script we discussed above. A few important differences:

  • in process_file, we use a Counter from the collections package to count the number of occurrences of each word in the file. A Counter is basically just a dictionary, with words as keys and the word counts as values (see the short example after this list).
  • at the end of the processing, we sum up the counters to get a counter for all files, and we build our vocabulary from the full counter. The Vocabulary class will be discussed in detail just below.
  • we added a command line option to limit the vocabulary to the most frequent words.
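If you have never used a Counter, here is a small example you can try in ipython. The update method merges the counts, just like full_counter.update(counter) does in the script above:

from collections import Counter

# count the words of two small "files"
c1 = Counter(['pizza', 'pizza', 'good'])
c2 = Counter(['pizza', 'bad'])
print(c1['pizza'])        # 2

# merge the second counter into the first one
c1.update(c2)
print(c1['pizza'])        # 3
print(c1.most_common(1))  # [('pizza', 3)]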

Let's discuss the Vocabulary class:

'''Vocabulary'''

import pickle
import pprint


class Vocabulary(object): 
    '''Vocabulary'''
    
    def __init__(self, counter=None, n_most_common=10000):
        '''Constructor. Provide a counter to build the vocabulary, 
        or use the load class method to read a pre-existing vocabulary 
        from a pickle file.'''
        self.n_most_common = n_most_common
        if counter: 
            self.words, self.index = self._build_index(counter, n_most_common)
        else:
            self.words = None
            self.index = None
            
    def _build_index(self, counter, n_most_common):
        '''takes the most frequent words from the counter. 
        returns: list_of_words, index 
        
        the list of words needs to be ordered, as a word
        will be encoded later on by its position in the list.
        decoding an integer to its corresponding word can then be done 
        with random access.
       
        the index is a dictionary mapping each word to its position.
        it provides random access to the integer corresponding to a word
        during encoding.
        '''
        most_common = counter.most_common(n_most_common)
        words = []
        word_to_index = dict()
        i = 0
        # reserved :
        for tag in ['<PAD>','<UNK>']: 
            words.append(tag)
            word_to_index[tag] = i
            i += 1
        for word, dummy in most_common:
            words.append(word)
            word_to_index[word] = i
            i += 1 
        return words, word_to_index
            
    def save(self, fname): 
        '''Save the vocabulary to a pickle file'''
        with open(fname + '.pck', 'wb') as pckfile:
            pickle.dump(self, pckfile)
    
    @classmethod
    def load(cls, fname):
        '''load a vocabulary from a pickle file 
        and return the vocabulary object'''
        with open(fname + '.pck', 'rb') as pckfile:
            return pickle.load(pckfile)        
        
    def decode(self, list_of_codes):
        '''return the list of words corresponding to a list of codes'''
        return [self.words[i] for i in list_of_codes]
    
    def encode(self, list_of_words): 
        '''return the list of codes corresponding to a list of words.
        unknown words get code 1 (<UNK>)'''
        return [self.index.get(word, 1) for word in list_of_words]
             
    def __str__(self):
        return pprint.pformat(self.words[:20])

This class should be pretty clear, but please note in particular:

  • how pickle is used to save the vocabulary to disk.
  • how a class method is used to load an existing vocabulary from a pickle file;
  • how the _build_index method works. As you can see, we reserve the first two slots in the words list. The first slot will be used later on for padding: below, we will convert each review to a fixed-size array. The reviews that are too long will be truncated, and those that are too short will be padded with 0, which means that we will put 0 in every unused slot at the end of the review. The second slot is used during encoding for words that are unknown (not in the vocabulary).
  • the __str__ magic method, which is called when you do print(vocabulary).

Just try this class in ipython:

from vocabulary import Vocabulary
from collections import Counter
review = ["The", "pizza", "is", "excellent", ".", "The", "wine", "is", "not", "."]
count = Counter(review)
print(count)
vocabulary = Vocabulary(count)
print(vocabulary)
print(vocabulary.encode(review))
print(vocabulary.decode(vocabulary.encode(review)))
Counter({'The': 2, 'is': 2, '.': 2, 'pizza': 1, 'excellent': 1, 'wine': 1, 'not': 1})
['<PAD>', '<UNK>', 'The', 'is', '.', 'pizza', 'excellent', 'wine', 'not']
[2, 5, 3, 6, 4, 2, 7, 3, 8, 4]
['The', 'pizza', 'is', 'excellent', '.', 'The', 'wine', 'is', 'not', '.']

Now run the script to build the index in parallel:

python  yelp_vocabulary.py -d <your_data_dir> 'xa?_tok.json' -p

You will get a file called index.pck in your data directory. It contains the vocabulary object.

We're now ready to encode the whole yelp dataset using our vocabulary.

Encoding the yelp dataset

Here is the encoding script, yelp_encode.py:

import sys
import json
import os

def output_fname(input_fname):
    return  input_fname.split('_')[0] + '_enc.json'

def process_file(fname, options, vocabulary):
    '''encode the review text in a JSON lines file using the vocabulary.
    The output is written to *_enc.json
    '''
    print(fname)
    ofname = output_fname(fname)  
    ifile = open(fname) 
    ofile = open(ofname,'w')
    for i,line in enumerate(ifile): 
        if i==options.lines:
            break
        if i%10000==0:
            print(i)        
        data = json.loads(line) 
        words = data['text']     
        codes = vocabulary.encode(words)
        data['text'] = codes
        line = json.dumps(data)
        ofile.write(line+'\n')        
    ifile.close()
    ofile.close()
    
def parse_args():
    from optparse import OptionParser        
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)    
    setopts(parser)
    (options, args) = parser.parse_args()    
    if len(args)!=1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob    
    import pprint
    from vocabulary import Vocabulary
    import parallelize
    
    options, pattern = parse_args()
    
    olddir = os.getcwd()
    os.chdir(options.datadir)

    vocabulary = Vocabulary.load('index')
        
    fnames = glob.glob(pattern)
    print(fnames)
    
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, 
                              options, vocabulary)

Nothing complicated here; we have already discussed all the important aspects of this script. Now run it to encode the whole yelp dataset in parallel:

python yelp_encode.py -d <your_data_dir> 'xa?_tok.json' -p 

You will get new files in your data directory, with a name ending with _enc.json. Check the contents of a file:

head -n 1 xaa_enc.json 
>
{"review_id": "Q1sbwvVQXV2734tPgoKj4Q", "user_id": "hG7b0MtEbXx5QzbzE6C_VA", "business_id": "ujmEBvifdJM6h6RLv4wQIg", "stars": 1.0, "useful": 6, "funny": 1, "cool": 0, "text": [805, 548, 1, 5, 528, 28, 65, 103, 55, 1, 1, 238, 8079, 264, 11, 1, 4895, 1, 537, 62, 55, 10651, 1, 182, 9011, 1, 1, 737, 753, 1, 9011, 56, 1, 11, 1, 3754, 2964, 276, 2, 978, 1865, 1, 14, 24, 1767, 1], "date": "2013-05-07 04:34:36"}

Note the unknown words, encoded as 1. There is a good fraction of them, but most often they are simply stop words, like "the".

Exercise: Use the Vocabulary class to decode this review, and check the unknown words.

Converting the encoded dataset to a numpy array

In keras and scikit-learn, and in python in general, numpy arrays are used as input for machine learning because they have excellent performance in numerical analysis, especially for matrix operations.

A numpy array is stored in a contiguous section of memory.

It is best to consider that an array has a fixed size along all its axes. Expanding an array is in fact possible, but this operation may involve copying the whole array to a new area of memory, large enough to store the data contiguously. I actually never do that.
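To see why I avoid growing arrays, here is a quick sketch comparing the two approaches on a small example. The timings depend on your machine, but growing the array row by row should come out much slower, since every np.append copies the whole array:

import timeit
import numpy as np

def grow():
    # grow the array one row at a time: each np.append copies everything
    a = np.zeros((0, 10), dtype=np.int16)
    for i in range(2000):
        a = np.append(a, np.ones((1, 10), dtype=np.int16), axis=0)
    return a

def preallocate():
    # allocate the final size once, then fill the rows in place
    a = np.zeros((2000, 10), dtype=np.int16)
    for i in range(2000):
        a[i] = 1
    return a

print(timeit.timeit(grow, number=10))
print(timeit.timeit(preallocate, number=10))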

In our case, we need to have an array with two axes, like an excel table. The first axis (rows) will index the examples (the reviews) and the second axis (columns) will contain the rating (stars) followed by all encoded words in the review text.

Now, we know that the number of rows is fixed: it is the number of lines in our input JSON files. But the review text currently has a variable length: some reviews have a lot of words, and some others only a few.

To deal with this issue, we will impose a maximum review size nwords. Reviews with more than nwords will be truncated, meaning that the last words in the review will just be dropped. If nwords is large enough, that's probably not a big issue since the user had enough space to give her opinion already.

Reviews with fewer than nwords will be padded: all remaining slots on this line of the array will simply be filled with a 0.

Why 0 and not another number? Simply because the neural network will not see this value. Indeed, remember that a neural network is simply a function of its input values. If one of the values is 0, it will not flow through the network. It is the same as setting all weights multiplying this value to 0.
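Here is a small sketch of the truncation and padding logic on toy data, with a hypothetical nwords of 5. The real script below does essentially this, writing directly into a preallocated array (with four extra columns at the start of each row):

import numpy as np

nwords = 5
encoded_reviews = [
    [2, 7, 3],                    # short review: padded with zeros
    [2, 5, 3, 6, 4, 2, 7, 3, 8],  # long review: truncated to nwords
]

padded = np.zeros((len(encoded_reviews), nwords), dtype=np.int16)
for i, codes in enumerate(encoded_reviews):
    truncated = codes[:nwords]
    padded[i, :len(truncated)] = truncated
print(padded)
# [[2 7 3 0 0]
#  [2 5 3 6 4]]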

So we want a numpy array containing the whole yelp review data, that fits entirely in memory. Let's estimate the size of this array.

  • we have about 7 million reviews, so the same number of lines
  • we could limit the size of each review to 500 words, and we need a few more columns to store the number of stars and other features. Still, that's about 500 columns
  • each number takes up space in memory, and we need to know how much. To find out, let's do a simple test:
import numpy as np
a = np.ones(1000)
a.dtype 
# prints dtype('float64')

So by default, every number in a numpy array is a 64-bit float, which takes 8 bytes of memory (since 1 byte = 8 bits).

Then, the size of our numpy array in memory will be 7e6 * 500 * 8 = 28 GB!!! Clearly, this is too large for the RAM of the vast majority of computers. And anyway, it's not reasonable to use so much RAM if there are ways to use less. Let's see what we can do.

First, we want to store integers, not floats. If we code an integer on 64 bits, we can represent up to 2^64 = 18,446,744,073,709,551,616 different values. We certainly do not need that: the star rating goes up to 5, and the encoded words go up to the vocabulary size (~20,000 by default). With 8 bits, we can have 256 different integer values, and with 16 bits, 65,536. So 16 bits is the right size. Compared to 64 bits, we gain a factor of 4. Let's check:

import numpy as np
from sys import getsizeof
print( getsizeof(np.ones(1000)) ) 
print( getsizeof(np.ones(1000, dtype=np.uint16)) )

These printouts give 8096 and 2096 bytes. Apart from a small constant overhead of 96 bytes, we do gain a factor of 4. For the whole yelp dataset, we would go down to 28 / 4 = 7 GB. That's still a bit too much.

To gain another factor of two and go down to 3.5 GB, we are just going to limit the review size to 250 words. Now we're talking.
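For the record, here is this back-of-the-envelope estimate as a snippet, with the four extra columns included and the result expressed in GiB; it is consistent with the size of the data.h5 file we will obtain below:

# ~6.7 million reviews, 250 words + 4 extra columns, 2 bytes per int16 value
n_reviews = 6685900
n_columns = 250 + 4
bytes_per_value = 2
print(n_reviews * n_columns * bytes_per_value / 1024**3)  # about 3.2 GiB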

Here is the script that converts the encoded dataset to a numpy array.

yelp_fillarray.py:

import json
import sys 
import os
import glob
import numpy as np
import h5py

def file_len(fname):
    '''Counts the number of lines in the file with name fname
    '''
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def process_file(fname, options):
    '''process a review JSON lines file 
    and return a numpy array with the data.
    '''
    print(fname)
    ifile = open(fname)
    # limit on the number of words in reviews
    limit = options.nwords
    stop = options.lines
    # creating the numpy array in advance 
    # so that we don't have to resize it later.
    # the total size along the second axis is limit+4 
    # to leave four additional slots for 
    # the rating, useful, funny, cool   
    # all cells are initialized to 0 so that the padding
    # is automatically done. 
    n_features = 4
    all_data = np.zeros((min(file_len(fname),stop), limit+n_features),
                        dtype=np.int16)
    for i,line in enumerate(ifile): 
        if i%10000==0:
            print(i)        
        if i==stop:
            break        
        data = json.loads(line) 
        codes = data['text']  
        # we can decide to keep the unknown words (code=1)
        # or just to drop them (default).
        if not options.keep_unknown:
            codes = [code for code in codes if code!=1]
        # store the rating in the 1st column
        all_data[i,0] = data['stars']
        all_data[i,1] = data['useful']
        all_data[i,2] = data['funny']
        all_data[i,3] = data['cool']
        # store the encoded words afterwards 
        # the review is truncated to limit.
        truncated = codes[:limit]
        all_data[i,n_features:len(truncated)+n_features] = truncated
    ifile.close()
    print(fname,  'done')
    return all_data

def finalize(results):
    # concatenating the numpy arrays for all files
    print('concatenating')
    data = np.concatenate(results)
    print(data) 
    # saving the full numpy array to an hdf5 file
    ofname = 'data.h5'
    print('writing array to {}/{}'.format(ofname,'reviews'))
    h5 = h5py.File(ofname, 'w')
    h5.create_dataset('reviews', data=data) 
    h5.close()    

def parse_args():
    from optparse import OptionParser        
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)    
    setopts(parser)
    parser.add_option("-u", "--keep-unknown",
                      dest="keep_unknown", action="store_true", default=False,
                      help="keep unknown codes")   
    parser.add_option("-n", "--nwords",
                      dest="nwords", default=250, type=int,
                      help="max number of words, default 250")    
    (options, args) = parser.parse_args()    
    if len(args)!=1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern
    
if __name__ == '__main__':
    import os
    import pprint
    import parallelize

    options, pattern = parse_args()
    
    olddir = os.getcwd()
    os.chdir(options.datadir)
        
    fnames = glob.glob(pattern)
    
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)
    finalize(results)
    os.chdir(olddir)

This script is slightly trickier, but heavily commented. I hope it's clear; if not, ask me questions in the comments.

Just run the script. Here, we don't run in parallel mode because this step requires quite a bit of RAM. If you have 16GB RAM or more, you can enable parallel mode by adding the -p option:

 python yelp_fillarray.py  -d <your_data_dir> 'xa*_enc.json' 

The script creates a file called data.h5 in your data directory, with a size of 3.2 GB, as estimated.

We're now going to open this file and do a bit of analysis on our newly created numpy array.

Text mining with numpy

To analyse your data.h5 file, please download this jupyter notebook.

In [2]:
import numpy as np
import h5py
import matplotlib.pyplot as plt

Let's start by opening the h5 file that contains our data

In [3]:
datadir = '/data2/cbernet/maldives/yelp_dataset'
h5 = h5py.File(datadir+'/data.h5', 'r')
d = h5['reviews']
d.shape
Out[3]:
(6685900, 254)

The data is as expected: 6685900 reviews, and for each of them, 254 values:

  • rating
  • useful
  • funny
  • cool
  • 250 encoded words.

We can have a look at the first entry. It is a positive one (5 stars in the first column) and one person found it useful. The review text is relatively short, and followed by the padding zero values.

In [4]:
print(d[0])
[    5     1     0     0   696    26    39  3348    26  1523    44   336
    64    14   153  5179  2731    24    72   172  4377   125   257  3044
  6568 10127  8410     3    33   277   219   501  8900     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0]

To decode this review, we are going to use our Vocabulary class.

In [5]:
from vocabulary import Vocabulary
vocab = Vocabulary.load(datadir+'/index')
rev1 = d[0,4:]  # skip the first 4 columns: stars, useful, funny, cool
rev1 = rev1[rev1!=0]
' '.join(vocab.decode(rev1))
Out[5]:
'helped out when locked out apartment he quick got at price lowest comparison all other area definately recommend top master situations requiring locksmith they get job done quickly effectively'

For what we want to do next, we load the whole dataset in memory; it's going to be faster:

In [6]:
d = d[:]

Let's plot the distribution of words in the reviews:

In [7]:
reviews = d[:,4:]
word_counts = np.count_nonzero(reviews, axis=1)
# the assignment to _ avoids a long useless printout
_ = plt.hist(word_counts,range=(-0.5,300.5),bins=301)

Since we have truncated our reviews at a maximum length of 250, all reviews which had more words end up in the last bin.

Now let's plot the rating distribution:

In [8]:
plt.hist(d[:,0], range=(0.5,5.5), bins=5)
Out[8]:
(array([1002159.,  542394.,  739280., 1468985., 2933082.]),
 array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5]),
 <a list of 5 Patch objects>)

Finally, we can have a look at the correlation between the review length and, say, the rating or the usefulness. First, we extract the values we want:

In [70]:
stars = d[:,0]
useful = d[:,1]
funny = d[:,2]
cool = d[:,3]

Then we create a dataframe:

In [72]:
import pandas as pd
df = pd.DataFrame({'useful':useful, 'wc':word_counts, 'stars':stars, 'funny':funny, 'cool':cool})
df
Out[72]:
useful wc stars funny cool
0 1 29 5 0 0
1 5 130 3 2 3
2 0 48 5 0 0
3 1 99 5 0 1
4 4 216 3 0 0
5 0 23 5 0 0
6 0 16 5 0 0
7 0 49 3 0 0
8 2 250 1 0 0
9 0 65 5 0 0
10 0 46 1 0 0
11 4 32 5 1 6
12 0 121 4 0 0
13 0 50 4 0 0
14 0 191 3 0 0
15 0 23 5 0 0
16 0 18 5 0 0
17 4 103 5 2 2
18 2 174 4 0 0
19 3 28 4 3 3
20 0 25 5 0 1
21 3 133 5 0 0
22 0 36 5 0 0
23 1 38 5 0 0
24 1 41 5 0 1
25 3 60 2 0 2
26 0 80 5 0 0
27 0 148 5 0 0
28 5 28 1 1 1
29 1 155 5 0 0
... ... ... ... ... ...
6685870 0 48 5 0 0
6685871 0 33 5 0 0
6685872 0 44 1 0 0
6685873 7 202 5 5 4
6685874 0 20 1 0 0
6685875 4 93 2 4 2
6685876 4 82 5 0 0
6685877 1 24 5 0 0
6685878 3 196 5 0 1
6685879 0 23 5 0 0
6685880 0 62 4 0 0
6685881 0 44 5 1 0
6685882 4 32 4 0 1
6685883 0 37 5 0 0
6685884 0 122 1 0 0
6685885 2 68 3 0 0
6685886 0 11 5 0 0
6685887 0 20 5 1 0
6685888 0 13 5 0 0
6685889 0 234 2 0 0
6685890 0 70 3 0 0
6685891 6 81 4 1 8
6685892 0 61 3 0 0
6685893 3 89 2 0 0
6685894 1 117 4 1 1
6685895 0 135 4 0 0
6685896 37 248 4 30 34
6685897 14 198 1 5 2
6685898 1 104 1 0 0
6685899 1 72 3 1 1

6685900 rows × 5 columns

Now, we group by the number of words in the review, and we compute the mean for each of the other columns:

In [74]:
means = df.groupby('wc').mean()

First, let's see how the average rating evolves with the number of words:

In [75]:
means['stars'].plot()
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e5824c748>

After a peak around 15 words, the average rating decreases steadily with the number of words. It looks like the angrier people are, the more they write.

Around 250 words, there is a sharp drop. This is simply due to the fact that we truncated our reviews: all reviews longer than 250 words end up in this last bin, and beyond 250 words the average rating keeps dropping.

Now let's look at the average of the 'funny' score:

In [76]:
means['funny'].plot()
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e58204668>

Long reviews are funnier than short ones. But actually, some short reviews are funny too, and I'm not too sure why.

Exercise: Select funny short reviews, decode them, and read them.

Conclusion

We're finally at the end of this long post. It's been a bit tricky, but we managed to transform the large yelp dataset into a numpy array suitable for machine learning.

In the process, you have learnt how to

  • do natural language processing in python with the nltk package
  • perform word encoding yourself
  • create a numpy array for machine learning

as well as fairly advanced python features like:

  • organizing a python package in a modular way,
  • algorithmic complexity,
  • memory management,
  • multiprocessing.

Next, we will try different machine learning techniques on the yelp dataset, such as a simple dense neural network, and an LSTM (Long Short-Term Memory).

Stay tuned!


Please let me know what you think in the comments! I’ll try and answer all questions.

And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)
