The yelp dataset is large, and it's in text format. Here are detailed explanations and all the code needed to convert it to a numpy array for machine learning.
In part 1, we saw how to access the yelp dataset to do simple text mining with pandas. Today, we're going to convert the yelp text data into a numpy array that fits in memory and is suitable for machine learning.
For now, our data is in a pandas DataFrame with two columns:
stars text
31846 1.0 This place is awful. If quick and dirty brunch...
39095 1.0 My husband and I were so overly disappointed i...
78683 4.0 My review will focus on the Arizona State Fair...
We ultimately want to train our machine to predict whether a review is positive or negative given only its text, but we have two problems to solve: the dataset is too large to be processed in memory as it is, and the text needs to be converted to numbers before it can be fed to a machine learning algorithm.
Today, we'll write python scripts to solve both of these problems. Along the way, you will learn how to process a large dataset in parallel, and how to turn text into something a machine learning algorithm can digest.
My goal is to show you in detail how I approached this problem. The focus is really on using python effectively to overcome hardware limitations. There are certainly ways to do things differently and better, so please feel free to react in the comments below!
Prerequisites:
numpy nltk pandas matplotlib ipython h5py
Our dataset is a 5 GB JSON Lines file, review.json, which contains lines like:
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
Processing this data is going to be a bit challenging, both in terms of RAM and CPU.
However, given the size of this JSON file, I was confident that it would be possible to find a way to transform this data into a structure that would entirely fit in memory.
To do that, I decided to adopt a tiered processing approach, in which the data is processed in successive steps. For each review, we will: (1) tokenize the review text into words, (2) encode each word as an integer with the help of a vocabulary, and (3) store the encoded review in a big numpy array.
The first two steps are CPU intensive, and the last one needs a lot of RAM.
CPU-intensive tasks can be performed in parallel on a multicore processor. On the other hand, tasks that require a lot of RAM cannot easily be parallelized on a single machine, since all cores share the same RAM. For these reasons, I decided to isolate CPU- and RAM-intensive tasks in different processes.
A tiered processing workflow is also quite practical when you are not yet sure about what you want to do. If you need to change something in the later steps, you don't have to redo all of them.
Now, to be able to perform CPU-intensive tasks in parallel on different cores, we need to split our sample into chunks, so that each core can take care of a chunk.
The review.json file has 6685900 lines. We can split it in the following way:
split -l 340000 review.json
In the -l option, I specify how many lines I want in each chunk. I chose this number to get 20 chunks, since I have 20 cores on my processor.
💡 To find out about the RAM and cores on your computer, check out the first part of this tutorial .
Now split review.json according to the number of cores on your computer.
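If you prefer to drive this from python, here is a small sketch that computes the chunk size from the number of cores and calls split for you (it assumes review.json is in the current directory and that the split command is available, as on Linux or macOS):

import math
import multiprocessing
import subprocess

n_lines = 6685900                       # number of lines in review.json
n_cores = multiprocessing.cpu_count()   # number of cores on this machine
chunk_size = math.ceil(n_lines / n_cores)
subprocess.run(['split', '-l', str(chunk_size), 'review.json'], check=True)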
You will get a number of files from split:
xaa xab xac xad xae xaf xag xah xai xaj xak xal xam xan xao xap xaq xar xas xat
Check the contents of one of them:
head -n 1 xab
>
{"review_id":"srnRzrX0sWEigqfyV_3BVQ","user_id":"4eT43qWNh-9Xdy0_TTU1qw","business_id":"9mCX2MZvZP9KgnOUCVod0Q","stars":4.0,"useful":0,"funny":0,"cool":0,"text":"Came within 24 hours of request. Came at the scheduled time. Quickly loaded the items. Polite. A bit expensive, but they provide a useful service. I would call them again.","date":"2016-04-18 00:43:51"}
In this section, we will process our chunks to extract words from the review text, which is a string. For that, we will use the nltk package. But first, let's try and do it in bare python.
Start ipython, and do:
review = "I went there for a hair cut. Hair wash and stylist was great, but it was very hard to communicate with them since they all spoke chinese and not so good English. The stylist didn't quite understand me while the outcome was not bad. \n\nThe website said $50 for senior stylist but they charged me $60 + tax. Cash only, so I wonder why they charge tax?? Including tip, I ended up paying $80 just for simple trimming."
A naive way to extract the words of this review would be to split the string:
print(review.split())
['I', 'went', 'there', 'for', 'a', 'hair', 'cut.', 'Hair', 'wash', 'and', 'stylist', 'was', 'great,', 'but', 'it', 'was', 'very', 'hard', 'to', 'communicate', 'with', 'them', 'since', 'they', 'all', 'spoke', 'chinese', 'and', 'not', 'so', 'good', 'English.', 'The', 'stylist', "didn't", 'quite', 'understand', 'me', 'while', 'the', 'outcome', 'was', 'not', 'bad.', 'The', 'website', 'said', '$50', 'for', 'senior', 'stylist', 'but', 'they', 'charged', 'me', '$60', '+', 'tax.', 'Cash', 'only,', 'so', 'I', 'wonder', 'why', 'they', 'charge', 'tax??', 'Including', 'tip,', 'I', 'ended', 'up', 'paying', '$80', 'just', 'for', 'simple', 'trimming.']
Not bad, but we see a couple of issues: the punctuation stays attached to the words ('cut.', 'great,', 'tax??'), and contractions like "didn't" are not split.
We could improve our splitting so that it eventually handles such cases, but people have done that already, so we don't need to redo it.
To perform the tokenization (the task of extracting the words), we will instead use the nltk package:
import nltk
# to be done only once:
nltk.download('punkt')
print(nltk.word_tokenize(review))
['I', 'went', 'there', 'for', 'a', 'hair', 'cut', '.', 'Hair', 'wash', 'and', 'stylist', 'was', 'great', ',', 'but', 'it', 'was', 'very', 'hard', 'to', 'communicate', 'with', 'them', 'since', 'they', 'all', 'spoke', 'chinese', 'and', 'not', 'so', 'good', 'English', '.', 'The', 'stylist', 'did', "n't", 'quite', 'understand', 'me', 'while', 'the', 'outcome', 'was', 'not', 'bad', '.', 'The', 'website', 'said', '$', '50', 'for', 'senior', 'stylist', 'but', 'they', 'charged', 'me', '$', '60', '+', 'tax', '.', 'Cash', 'only', ',', 'so', 'I', 'wonder', 'why', 'they', 'charge', 'tax', '?', '?', 'Including', 'tip', ',', 'I', 'ended', 'up', 'paying', '$', '80', 'just', 'for', 'simple', 'trimming', '.']
That's much better! The punctuation and contractions are now handled properly.
💡 Don't reinvent the wheel, especially with python. Our goal is to do data science and machine learning, not programming. So if you need to do anything, chances are that there is a package for that already. Just google it.
Now let's see how to do the tokenization for the whole dataset.
Before you follow the rest of this tutorial: In my case on macOS, the processing of review.json was failing with an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8145: ordinal not in range(128)
To fix this, I had to set this environment variable:
export LANG=en_US.UTF-8
It might not be necessary or advisable in your case, but I just wanted to warn you in case you get the same error.
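Another option, assuming the problem comes from the default text encoding used when opening files, is to specify the encoding explicitly in the scripts below. This is a hypothetical variant of the file-opening lines, not what I used:

# force UTF-8 when reading and writing the chunk files
ifile = open('xaa', encoding='utf-8')
ofile = open('xaa_tok.json', 'w', encoding='utf-8')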
Here is what I have written to tokenize the whole dataset. This module, yelp_tokenize.py, as well as the supporting modules parallelize and base, can be found at https://github.com/cbernet/maldives/tree/master/yelp. You can just download the whole repository.
The code should be understandable. If not, feel free to ask questions in the comments, I'll be happy to give more details.
'''Tokenize a JSON lines dataset with nltk
'''

import os
import json
import nltk
import sys

# to be done only once:
nltk.download('punkt')

def output_fname(input_fname):
    return os.path.splitext(input_fname)[0] + '_tok.json'

def process_file(fname, options):
    '''tokenize data in file fname.
    The output is written to fname_tok.json
    '''
    print('opening', fname)
    ofname = output_fname(fname)
    ifile = open(fname)
    ofile = open(ofname, 'w')
    for i, line in enumerate(ifile):
        if i % 1000 == 0:
            print(i)
        if i == options.lines:
            break
        # convert the json on this line to a dict
        data = json.loads(line)
        # extract the review text
        text = data['text']
        # tokenize
        words = nltk.word_tokenize(text)
        # convert all words to lower case
        words = [word.lower() for word in words]
        # update the JSON and write it to the output file
        data['text'] = words
        line = json.dumps(data)
        ofile.write(line + '\n')
    ifile.close()
    ofile.close()

def parse_args():
    '''Parse command line arguments.
    See base.setopts for more information
    '''
    from optparse import OptionParser
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)
    setopts(parser)
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.print_usage()
        sys.exit(1)
    # pattern should match the files you want to process,
    # e.g. 'xa?'
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob
    import parallelize
    from multiprocessing import Pool

    options, pattern = parse_args()
    olddir = os.getcwd()
    os.chdir(options.datadir)
    fnames = glob.glob(pattern)
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)
    os.chdir(olddir)
An interesting thing to note: two tasks appear in several of these scripts, namely the parallelization over several cores and the definition of common command line arguments. That's why I decided to create a unified interface for the parallelization and to define the shared command line arguments in a single place. This is done in the parallelize and base modules, respectively.
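I don't reproduce the parallelize module here; the real implementation is in the repository. Just to give you an idea, its run function could look roughly like this sketch, which maps a function over the chunk files with a multiprocessing Pool:

# rough sketch only, not the actual parallelize module from the repository
from multiprocessing import Pool

def run(func, fnames, nprocesses, *args):
    '''call func(fname, *args) for each file name,
    in parallel if nprocesses is set, and return the list of results'''
    tasks = [(fname,) + args for fname in fnames]
    if nprocesses:
        with Pool(nprocesses) as pool:
            return pool.starmap(func, tasks)
    return [func(*task) for task in tasks]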
To test this script, do:
python yelp_tokenize.py -d <path_to_your_yelp_dataset> 'xaa' -l 1000
This takes only one input file, xaa, and reads only 1000 lines from this file in a single process.
To run it on the whole dataset in parallel:
python yelp_tokenize.py -d <path_to_your_yelp_dataset> 'xa?' -p
This takes all the files (in the wildcard pattern, ? means any single character), reads all lines in each file, and spawns one process per file.
After a little time, maybe 10 minutes depending on your computer, you will get new files in your data directory, with a name ending with _tok.json. Check the contents of a file:
head -n 1 xaa_tok.json
>
{"review_id": "Q1sbwvVQXV2734tPgoKj4Q", "user_id": "hG7b0MtEbXx5QzbzE6C_VA", "business_id": "ujmEBvifdJM6h6RLv4wQIg", "stars": 1.0, "useful": 6, "funny": 1, "cool": 0, "text": ["total", "bill", "for", "this", "horrible", "service", "?", "over", "$", "8gs", ".", "these", "crooks", "actually", "had", "the", "nerve", "to", "charge", "us", "$", "69", "for", "3", "pills", ".", "i", "checked", "online", "the", "pills", "can", "be", "had", "for", "19", "cents", "each", "!", "avoid", "hospital", "ers", "at", "all", "costs", "."], "date": "2013-05-07 04:34:36"}
All reviews have been split into words, and now we need to find out how to convert these words into a format suitable to a machine learning algorithm.
As we have seen in my tutorial about the 1-neuron network, neural nets are just a mathematical function of their input values. Therefore, they need numbers as input, not words.
In the next sections, we will convert our words to numbers, a procedure called encoding.
To do that, we will build a vocabulary, which is an ordered list of all possible words in all reviews. Then, a word can be encoded as a number giving its position in the vocabulary.
Here is a simple example of encoding that you can run in ipython:
review = ["The", "pizza", "is", "excellent", ".", "The", "wine", "is", "not", "."]
vocabulary = []
index = dict()
encoded_review = []
# building the vocabulary and the index:
for word in review:
    if word not in index:
        vocabulary.append(word)
        index[word] = len(vocabulary) - 1
# encoding:
for word in review:
    encoded_review.append(index[word])
print(encoded_review)
[0, 1, 2, 3, 4, 0, 5, 2, 6, 4]
To decode, you can do:
print(' '.join(vocabulary[i] for i in encoded_review))
The pizza is excellent . The wine is not .
A very important thing to note is the use of the index dictionary, which maps each word (the key) to its position in the vocabulary (the value).
Why did we introduce this index?
Python dictionaries are hash maps. When a new key/value pair is added to the dictionary, the key is hashed (converted to an integer). The hash can then be used to directly find and access the value in memory. Therefore, searching for a key in a dictionary is of O(1) complexity. This means that the time needed for this operation does not depend on the dictionary size (at least to first order).
Python lists are represented by an array. Accessing an element of the list knowing its position in the array is O(1). But if you want to search for a word in a list, you need to loop on the list until you find the word, which is of O(N) complexity. This means that the time needed for a search operation in a list scales linearly with the size of the list.
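You can check this quickly in ipython. The exact timings will depend on your machine, but the dictionary lookup should be orders of magnitude faster:

import timeit
words = ['word{}'.format(i) for i in range(100000)]
index = {word: i for i, word in enumerate(words)}
# searching for the last word, which is the worst case for the list:
print(timeit.timeit(lambda: 'word99999' in words, number=100))  # O(N) scan of the list
print(timeit.timeit(lambda: 'word99999' in index, number=100))  # O(1) hash lookup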
In the code above, I used both a list and a dictionary: the vocabulary list, which lets us decode an integer into a word by random access, and the index dictionary, which gives fast access to the integer code of a given word during encoding.
Now here is the script I've written to build the vocabulary for the whole yelp dataset.
'''Build the vocabulary for the yelp dataset'''

import sys
import json
from collections import Counter

# stop words are words that occur very frequently,
# and that don't seem to carry information
# about the quality of the review.
# we decide to keep 'not', for example, as negation is an important piece of information.
# I also keep '!', which I think might be more frequent in negative reviews, and which is
# typically used to make a statement stronger (in good or in bad).
# the period, on the other hand, can probably be considered neutral.
# this filtering could have been done at a later stage as well,
# but we can do it here as this stage is fast.
stopwords = set(['.', 'i', 'a', 'and', 'the', 'to', 'was', 'it', 'of', 'for', 'in', 'my',
                 'that', 'so', 'do', 'our', 'the', 'and', ',', 'my', 'in', 'we', 'you',
                 'are', 'is', 'be', 'me'])

def process_file(fname, options):
    '''process a review JSON lines file and count the occurrence
    of each word in all reviews.
    returns the counter, which will be used to find the most frequent words
    '''
    print(fname)
    with open(fname) as ifile:
        counter = Counter()
        for i, line in enumerate(ifile):
            if i == options.lines:
                break
            if i % 10000 == 0:
                print(i)
            data = json.loads(line)
            # extract what we want
            words = data['text']
            for word in words:
                if word in stopwords:
                    continue
                counter[word] += 1
    return counter

def parse_args():
    from optparse import OptionParser
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)
    setopts(parser)
    parser.add_option("-n", "--nwords",
                      dest="nwords", default=20000, type=int,
                      help="max number of words in vocabulary, default 20000")
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob
    import pprint
    from vocabulary import Vocabulary
    import parallelize

    options, pattern = parse_args()
    olddir = os.getcwd()
    os.chdir(options.datadir)
    fnames = glob.glob(pattern)
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)
    full_counter = Counter()
    for counter in results:
        full_counter.update(counter)
    vocabulary = Vocabulary(full_counter, n_most_common=options.nwords)
    vocabulary.save('index')
    pprint.pprint(full_counter.most_common(200))
    print(len(full_counter))
    print(vocabulary)
    os.chdir(olddir)
This script is very similar to the tokenize script we discussed above. A few important differences: instead of writing an output file, each process returns a Counter with the number of occurrences of each word (stop words are skipped); a new -n option sets the maximum vocabulary size; and at the end, the counters from all processes are merged into a single counter, which is used to build a Vocabulary object that is saved to disk.
Let's discuss the Vocabulary class:
'''Vocabulary'''

import pickle
import pprint

class Vocabulary(object):
    '''Vocabulary'''

    def __init__(self, counter=None, n_most_common=10000):
        '''Constructor. Provide a counter to build the vocabulary,
        or use the load classmethod to load a pre-existing vocabulary
        from a pickle file.'''
        self.n_most_common = n_most_common
        if counter:
            self.words, self.index = self._build_index(counter, n_most_common)
        else:
            self.words = None
            self.index = None

    def _build_index(self, counter, n_most_common):
        '''takes the most frequent words in the counter.
        returns: list_of_words, index

        the list of words needs to be ordered, as a word
        will later be encoded by its position in the list.
        decoding an integer to its corresponding word can then be done
        with random access.

        the index is a dictionary: word -> position in the list.
        it provides random access to the code corresponding to a word
        during encoding.
        '''
        most_common = counter.most_common(n_most_common)
        words = []
        word_to_index = dict()
        i = 0
        # reserved tags:
        for tag in ['<PAD>', '<UNK>']:
            words.append(tag)
            word_to_index[tag] = i
            i += 1
        for word, dummy in most_common:
            words.append(word)
            word_to_index[word] = i
            i += 1
        return words, word_to_index

    def save(self, fname):
        '''Save the vocabulary to a pickle file'''
        with open(fname + '.pck', 'wb') as pckfile:
            pickle.dump(self, pckfile)

    @classmethod
    def load(cls, fname):
        '''load a vocabulary from a pickle file
        and return the vocabulary object'''
        with open(fname + '.pck', 'rb') as pckfile:
            return pickle.load(pckfile)

    def decode(self, list_of_codes):
        '''return the list of words corresponding to a list of codes'''
        return [self.words[i] for i in list_of_codes]

    def encode(self, list_of_words):
        '''return the list of codes corresponding to a list of words.
        unknown words are encoded as 1 (<UNK>)'''
        return [self.index.get(word, 1) for word in list_of_words]

    def __str__(self):
        return pprint.pformat(self.words[:20])
This class should be pretty clear, but please note in particular: the first two entries of the vocabulary are reserved for the special tags <PAD> (used later for padding) and <UNK> (unknown word); the encode method returns 1, the code of <UNK>, for any word that is not in the vocabulary; and the whole object is saved to and loaded from a pickle file.
Just try this class in ipython:
from vocabulary import Vocabulary
from collections import Counter
review = ["The", "pizza", "is", "excellent", ".", "The", "wine", "is", "not", "."]
count = Counter(review)
print(count)
vocabulary = Vocabulary(count)
print(vocabulary)
print(vocabulary.encode(review))
print(vocabulary.decode(vocabulary.encode(review)))
Counter({'The': 2, 'is': 2, '.': 2, 'pizza': 1, 'excellent': 1, 'wine': 1, 'not': 1})
['<PAD>', '<UNK>', 'The', 'is', '.', 'pizza', 'excellent', 'wine', 'not']
[2, 5, 3, 6, 4, 2, 7, 3, 8, 4]
['The', 'pizza', 'is', 'excellent', '.', 'The', 'wine', 'is', 'not', '.']
Now run the script to build the index in parallel:
python yelp_vocabulary.py -d <your_data_dir> 'xa?_tok.json' -p
You will get a file called index.pck in your data directory. It contains the vocabulary object.
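You can check it quickly in ipython, from your data directory. The words 'pizza' and 'great' should be frequent enough to be in the vocabulary, while 'the' and 'was' are stop words and were never counted, so they come out as unknown:

from vocabulary import Vocabulary
vocab = Vocabulary.load('index')
print(vocab)                                           # first 20 words, starting with <PAD> and <UNK>
print(vocab.encode(['the', 'pizza', 'was', 'great']))  # stop words are encoded as 1 (<UNK>)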
We're now ready to encode the whole yelp dataset using our vocabulary.
Here is the encoding script:
import sys
import json
import os

def output_fname(input_fname):
    return input_fname.split('_')[0] + '_enc.json'

def process_file(fname, options, vocabulary):
    '''encode the tokenized reviews in file fname with the vocabulary.
    The output is written to an _enc.json file.
    '''
    print(fname)
    ofname = output_fname(fname)
    ifile = open(fname)
    ofile = open(ofname, 'w')
    for i, line in enumerate(ifile):
        if i == options.lines:
            break
        if i % 10000 == 0:
            print(i)
        data = json.loads(line)
        words = data['text']
        codes = vocabulary.encode(words)
        data['text'] = codes
        line = json.dumps(data)
        ofile.write(line + '\n')
    ifile.close()
    ofile.close()

def parse_args():
    from optparse import OptionParser
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)
    setopts(parser)
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import glob
    import pprint
    from vocabulary import Vocabulary
    import parallelize

    options, pattern = parse_args()
    olddir = os.getcwd()
    os.chdir(options.datadir)
    vocabulary = Vocabulary.load('index')
    fnames = glob.glob(pattern)
    print(fnames)
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses,
                              options, vocabulary)
Nothing complicated here; we have already discussed all the important aspects of this script. Now run it to encode the whole yelp dataset in parallel:
python yelp_encode.py -d <your_data_dir> 'xa?_tok.json' -p
You will get new files in your data directory, with names ending in _enc.json. Check the contents of one of them:
head -n 1 xaa_enc.json
>
{"review_id": "Q1sbwvVQXV2734tPgoKj4Q", "user_id": "hG7b0MtEbXx5QzbzE6C_VA", "business_id": "ujmEBvifdJM6h6RLv4wQIg", "stars": 1.0, "useful": 6, "funny": 1, "cool": 0, "text": [805, 548, 1, 5, 528, 28, 65, 103, 55, 1, 1, 238, 8079, 264, 11, 1, 4895, 1, 537, 62, 55, 10651, 1, 182, 9011, 1, 1, 737, 753, 1, 9011, 56, 1, 11, 1, 3754, 2964, 276, 2, 978, 1865, 1, 14, 24, 1767, 1], "date": "2013-05-07 04:34:36"}
Note the unknown words, encoded as 1. There is a good fraction of them but, most often, they are simply stop words like "the".
Exercise: Use the Vocabulary class to decode this review, and check the unknown words.
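If you want a starting point, something like this should work (run it from your data directory, and adapt the file name if needed):

import json
from vocabulary import Vocabulary

vocab = Vocabulary.load('index')
with open('xaa_enc.json') as ifile:
    data = json.loads(ifile.readline())
print(' '.join(vocab.decode(data['text'])))   # unknown words appear as <UNK>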
In keras and scikit-learn , and in python in general, numpy arrays are used as input for machine learning because they have excellent performance in numerical analysis, especially for matrix operations.
A numpy array is represented in memory by a contiguous section of memory.
It is best to consider that an array has a fixed size along all its axes. Expanding an array is in fact possible, but this operation may involve copying the array data to a new area of memory, large enough to store the data contiguously. In practice, I never do that.
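For instance, "growing" an array with np.append actually allocates a new array and copies the data, which gets slow and memory hungry if done repeatedly:

import numpy as np
a = np.zeros(3, dtype=np.int16)
b = np.append(a, [1, 2])   # allocates a new, larger array and copies the data
print(a.size, b.size)      # 3 5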
In our case, we need to have an array with two axes, like an excel table. The first axis (rows) will index the examples (the reviews) and the second axis (columns) will contain the rating (stars) followed by all encoded words in the review text.
Now, the number of rows is known: it is simply the number of lines in our input JSON files. But the review text currently has a variable length: some reviews have a lot of words, and others only a few.
To deal with this issue, we will impose a maximum review size, nwords. Reviews with more than nwords words will be truncated, meaning that the last words of the review will simply be dropped. If nwords is large enough, that's probably not a big issue, since the user already had enough space to give her opinion.
Reviews with fewer than nwords words will be padded: all remaining slots on this line of the array will simply be filled with a 0.
Why 0 and not another number? Simply because the neural network will not see this value. Indeed, remember that a neural network is simply a function of its input values . If one of the values is 0, it will not flow through the network. It is the same as setting all weights multiplying this value to 0.
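Here is what truncation and padding look like on a toy example, with a hypothetical maximum size of 5 words:

nwords = 5
codes = [12, 7, 853, 2, 44, 9, 31]            # encoded review with 7 words
truncated = codes[:nwords]                    # [12, 7, 853, 2, 44] : the last words are dropped
short = [12, 7, 853]                          # encoded review with 3 words
padded = short + [0] * (nwords - len(short))  # [12, 7, 853, 0, 0] : padded with zeros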
So we want a numpy array containing the whole yelp review data, that fits entirely in memory. Let's estimate the size of this array.
import numpy as np
a = np.ones(1000)
a.dtype
# prints dtype('float64')
So by default, every number in a numpy array is a 64-bit float, which takes 8 bytes of memory (since 1 byte = 8 bits).
Then, with about 7 million reviews and a maximum review length of 500 words, the size of our numpy array in memory would be 7e6 * 500 * 8 = 28 GB! Clearly, this is too large for the RAM of the vast majority of computers. And it's anyway not reasonable to use so much RAM if there are ways to use less. Let's see what we can do.
First, we want to store integers, not floats. If we code an integer on 64 bits, we can have values up to 2 to the power 64 = 18446744073709551616. We certainly do not need that: the star rating goes up to 5, and the encoded words go up to the vocabulary size (~20,000 by default). With 8 bits, we can have 256 different integer values, and with 16 bits, 65536. So 16 bits is the right size. With respect to 64 bits, we gain a factor of 4. Let's check:
import numpy as np
from sys import getsizeof
print( getsizeof(np.ones(1000)) )
print( getsizeof(np.ones(1000, dtype=np.uint16)) )
These printouts give 8096 and 2096 bytes. Apart from a small constant overhead of 96 bytes, we do gain a factor 4. For the whole yelp dataset, we would go down to 28 / 4 = 7 GB. That's still a bit too much.
To gain another factor of two and go down to 3.5 GB, we are simply going to limit the review size to 250 words. Now we're talking.
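Here is the corresponding back-of-the-envelope estimate (the number of reviews is rounded up):

n_reviews = 7e6          # about 6.7 million reviews
n_words = 250            # maximum review length
bytes_per_value = 2      # 16-bit integers
print(n_reviews * n_words * bytes_per_value / 1e9, 'GB')   # 3.5 GB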
Here is the script that converts the encoded dataset to a numpy array.
import json
import sys
import os
import glob
import numpy as np
import h5py

def file_len(fname):
    '''Counts the number of lines in the file with name fname
    '''
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def process_file(fname, options):
    '''process a review JSON lines file
    and return a numpy array with the data.
    '''
    print(fname)
    ifile = open(fname)
    # limit on the number of words in reviews
    limit = options.nwords
    stop = options.lines
    # creating the numpy array in advance
    # so that we don't have to resize it later.
    # the total size along the second axis is limit+4
    # to leave four additional slots for
    # the rating, useful, funny, cool.
    # all cells are initialized to 0 so that the padding
    # is automatically done.
    n_features = 4
    all_data = np.zeros((min(file_len(fname), stop), limit + n_features),
                        dtype=np.int16)
    for i, line in enumerate(ifile):
        if i % 10000 == 0:
            print(i)
        if i == stop:
            break
        data = json.loads(line)
        codes = data['text']
        # we can decide to keep the unknown words (code=1)
        # or just to drop them (default).
        if not options.keep_unknown:
            codes = [code for code in codes if code != 1]
        # store the rating in the 1st column
        all_data[i, 0] = data['stars']
        all_data[i, 1] = data['useful']
        all_data[i, 2] = data['funny']
        all_data[i, 3] = data['cool']
        # store the encoded words afterwards;
        # the review is truncated to limit.
        truncated = codes[:limit]
        all_data[i, n_features:len(truncated) + n_features] = truncated
    ifile.close()
    print(fname, 'done')
    return all_data

def finalize(results):
    # concatenating the numpy arrays for all files
    print('concatenating')
    data = np.concatenate(results)
    print(data)
    # saving the full numpy array to an hdf5 file
    ofname = 'data.h5'
    print('writing array to {}/{}'.format(ofname, 'reviews'))
    h5 = h5py.File(ofname, 'w')
    h5.create_dataset('reviews', data=data)
    h5.close()

def parse_args():
    from optparse import OptionParser
    from base import setopts
    usage = "usage: %prog [options] <file_pattern>"
    parser = OptionParser(usage=usage)
    setopts(parser)
    parser.add_option("-u", "--keep-unknown",
                      dest="keep_unknown", action="store_true", default=False,
                      help="keep unknown codes")
    parser.add_option("-n", "--nwords",
                      dest="nwords", default=250, type=int,
                      help="max number of words, default 250")
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.print_usage()
        sys.exit(1)
    pattern = args[0]
    return options, pattern

if __name__ == '__main__':
    import os
    import pprint
    import parallelize

    options, pattern = parse_args()
    olddir = os.getcwd()
    os.chdir(options.datadir)
    fnames = glob.glob(pattern)
    nprocesses = len(fnames) if options.parallel else None
    results = parallelize.run(process_file, fnames, nprocesses, options)
    finalize(results)
    os.chdir(olddir)
This script is slightly trickier, but heavily commented. I hope it's clear; if not, ask me questions in the comments.
Now run the script. By default, we don't run in parallel mode, because this step requires quite a bit of RAM. If you have 16 GB of RAM or more, you can enable parallel mode by adding the -p option:
python yelp_fillarray.py -d <your_data_dir> 'xa*_enc.json'
The script creates a file called data.h5 in your data directory, with a size of 3.2 GB, as estimated.
We're now going to open this file and do a bit of analysis on our newly created numpy array.
To analyse your data.h5 file, please download this jupyter notebook.
import numpy as np
import h5py
import matplotlib.pyplot as plt
Let's start by opening the h5 file that contains our data:
datadir = '/data2/cbernet/maldives/yelp_dataset'
h5 = h5py.File(datadir+'/data.h5')
d = h5['reviews']
d.shape
The data is as expected: 6685900 reviews and, for each of them, 254 values: the star rating, the useful, funny, and cool counts, followed by up to 250 encoded words.
We can have a look at the first entry. It is a positive one (5 stars in the first column) and one person found it useful. The review text is relatively short, and followed by the padding zero values.
print(d[0])
To decode this review, we are going to use our Vocabulary class.
from vocabulary import Vocabulary
vocab = Vocabulary.load(datadir+'/index')
rev1 = d[0,4:]          # skip the stars, useful, funny, cool columns
rev1 = rev1[rev1!=0]    # drop the padding zeros
' '.join(vocab.decode(rev1))
For what we want to do next, we load the whole dataset in memory; it's going to be faster:
d = d[:]
Let's plot the distribution of words in the reviews:
reviews = d[:,4:]
word_counts = np.count_nonzero(reviews, axis=1)
# the assignment to _ avoids a long useless printout
_ = plt.hist(word_counts,range=(-0.5,300.5),bins=301)
Since we have truncated our reviews at a maximum length of 250 words, all reviews that had more words end up in the bin at 250.
Now let's plot the rating distribution:
plt.hist(d[:,0], range=(0.5,5.5), bins=5)
Finally we can have a look at the correlation between review length and say, the rating or the usefulness. First we extract the values we want:
stars = d[:,0]
useful = d[:,1]
funny = d[:,2]
cool = d[:,3]
Then we create a dataframe:
import pandas as pd
df = pd.DataFrame({'useful':useful, 'wc':word_counts, 'stars':stars, 'funny':funny, 'cool':cool})
df
Now, we group by the number of words in the review, and we compute the mean for each of the other columns:
means = df.groupby('wc').mean()
First, let's see how the average rating evolves with the number of words:
means['stars'].plot()
After a peak around 15 words, the average rating decreases steadily with the number of words. It looks like the more angry people are, the more they write.
Around 250 words, there is a sharp drop. This is simply due to the fact that we have truncated our reviews: all reviews with more than 250 words end up at 250, and for these longer reviews the average rating presumably keeps dropping steadily.
Now let's look at the average of the 'funny' score:
means['funny'].plot()
Long reviews are funnier than short ones. But actually, some short reviews are funny too, and I'm not too sure why.
Exercise: Select funny short reviews, decode them, and read them.
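Here is one possible way to start, reusing the arrays defined above in the notebook (the thresholds are arbitrary, feel free to play with them):

# select reviews with few words and a high funny score:
mask = (word_counts < 20) & (funny > 10)
for row in d[mask][:5]:
    codes = row[4:]
    print(row[2], ' '.join(vocab.decode(codes[codes != 0])))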
We're finally at the end of this long post. It's been a bit tricky, but we managed to transform the large yelp dataset into a numpy array suitable for machine learning.
In the process, you have learnt how to split a large file into chunks, tokenize reviews with nltk, build a vocabulary to encode words as integers, and store the full dataset in a numpy array saved to an HDF5 file, as well as fairly advanced python features, like parallel processing on several cores, fast lookups with dictionaries, and pickling objects to disk.
Next, we will try different machine learning techniques on the yelp dataset, such as a simple dense neural network, and an LSTM (Long Short-Term Memory).
Stay tuned!
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)