Learn the minimum numpy needed to get started with machine learning (1h course)
Die-hard C++ or Fortran users among physicists often say that python is too slow.
True, python is an interpreted language and it is slow.
Even python advocates like me realize it, but we think that the (lack of) speed of python is not really an issue in practice.
And most importantly: in scientific code, the heavy lifting can be delegated to fast compiled libraries like numpy, as we will now see.
In this tutorial, you will understand what numpy is and why it's fast, and learn just what's needed about numpy for the usual machine learning operations.
At the end, you will work on practical, real-life numpy operations, typically encountered in machine-learning projects.
This tutorial is interactive.
To run it on Google Colab, click on this link. You'll be able to execute the code on your own and to modify it.
Prerequisites: Please follow Python Crash Course for Machine Learning first, or be sure to master the basics of python.
The main purpose of numpy is to provide a very efficient data structure called the numpy array, and the tools to manipulate such arrays.
Why is the numpy array so fast?
Because, under the hood, arrays are processed with compiled code, optimized for the CPU. In particular, numpy operations are parallel as they use SIMD (Single Instruction, Multiple Data) instructions.
To see how fast numpy is, we can time it.
Let us create a large list with one million integers, and a numpy array from this list:
import numpy as np
lst = list(range(1000000))
arr = np.array(lst)
arr
Now let's compute the square of all integers, and see how much time it takes.
We start by the list:
%timeit squares = [x**2 for x in lst]
And we do the same for the array:
%timeit squares = arr**2
As you can see, the numpy version is faster by a factor of a few hundred (the exact speedup depends on your machine).
We can in principle loop on the numpy array like this:
%timeit squares = [x**2 for x in arr]
But then, we completely lose the benefits of numpy! Indeed, when we do arr**2
we use the square function of numpy, which is intrinsically parallel. When we loop, we process the elements one by one with basic python. So:
Never ever loop on a numpy array! You'll be tempted to do so, but there should be no exception!
We have seen that numpy arrays are processed by compiled code with SIMD.
For this to work, the elements in a numpy array must be:
- all of the same type;
- of a fixed size in memory, so that they can be stored contiguously.
On the contrary, python lists can contain heterogeneous objects of any type.
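To make this concrete, here is a small illustration of my own (not from the original cells): when the input list mixes types, numpy coerces everything to a single common type.

# a python list can hold objects of different types:
lst = [1, "two", 3.0]
# numpy instead coerces all elements to a common type.
# Here, the integers are promoted to floats:
arr = np.array([1, 2, 3.5])
print(arr.dtype)  # float64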
Here are a few ways to create numpy arrays with different types:
from sys import getsizeof as sizeof
# numpy guesses that it should use integers
x = np.array([0, 1, 2])
print(x.dtype)
# or floats:
x = np.array([0., 1., 2.])
print(x.dtype)
# here we specify a python compatible type,
# interpreted by numpy as int64 (on most platforms)
x = np.array([0., 1., 2.], dtype=int)
print(x.dtype)
# here we specify that we want 8-bit integers
x = np.array([0, 1, 2], dtype=np.int8)
print(x.dtype)
This makes it easy to estimate the size of a numpy array in memory, to see if you're going to blow up your computer before you actually do.
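For example, the itemsize and nbytes array attributes give the size of one element and of the whole array, respectively:

x = np.zeros(1000, dtype=np.int8)
print(x.itemsize)  # size of one element in bytes: 1
print(x.nbytes)    # total size of the data in bytes: 1000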
Exercise
For example, let's consider a sample of 1000 images, each with 200x200 pixels, and 3 color channels per pixel. Each channel value ranges from 0 to 255, and can thus be encoded as an 8-bit integer.
Assuming you store the data of all images in a single numpy array, what would be its size in memory in GB?
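If you want to check your estimate afterwards (spoiler ahead, so compute it by hand first!), one possibility is to allocate the array and ask numpy for its size with the nbytes attribute shown above:

# this actually allocates the array, so it temporarily uses the corresponding RAM:
images = np.zeros((1000, 200, 200, 3), dtype=np.int8)
print(images.nbytes / 1e9, 'GB')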
I call element-wise operations all the operations that affect the array elements while preserving the array shape.
All the usual operators are implemented in numpy, for arrays. For example:
x = np.array(range(5))
x**2
Note that these operators, in numpy, are element-wise:
x+1
The element-wise equivalents of the functions in the python math
package are available directly from the numpy
package, with the same name, e.g.:
np.exp(x)
Finally, binary operators are available:
x = np.array([0, 1, 2])
y = np.array([1, 2, 3])
x+y
x*y
So far, we have only seen arrays with a single dimension. But often, more dimensions are used.
Multidimensional arrays can be created from a list of lists, e.g.:
x = np.array([[0, 1], [2, 3], [4,5]])
x
The shape attribute gives us the length of each dimension:
x.shape
In this case, we have 3 rows of 2 numbers. The first dimension is the outermost dimension, and the second one the innermost dimension.
Let's take the example of a 2x2 pixels image, with 3 color channels (red, green, blue) in each pixel:
x = np.array(
[
[ [1,2,3], [4,5,6], ],
[ [7,8,9], [10,11,12]]
]
)
print(x)
print(x.shape)
To visualize numpy arrays more easily, I often think of the innermost dimension separately. For example, here, we have a 2x2 pixel array, with a sub-array of size 3 in each pixel.
And as a final example, let's consider a "column vector":
x = np.array([
[0],
[1],
[2],
[3]
])
print(x)
print(x.shape)
As you can see, the "column vector" has two dimensions, which may be counterintuitive. There is a single number (a scalar) on the innermost dimension.
Please note that in numpy, a dimension can also be called an "axis".
Very often, numpy arrays of a given shape are built by initializing all elements to a fixed number, or a random number. For example:
np.zeros((2,3))
np.ones(3)
np.ones_like(x)
np.random.rand(2,2)
Many more random sampling tools are available.
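For instance (two examples of my own, beyond what this tutorial strictly needs):

# 5 samples from a gaussian of mean 0 and standard deviation 1:
print(np.random.normal(0, 1, 5))
# 5 random integers between 0 (included) and 10 (excluded):
print(np.random.randint(0, 10, 5))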
Here is a 1D array:
x = np.arange(10) + 1
x
Elements can be accessed directly, using their index in the array (the index starts at 0):
x[1]
And, as usual in python sequences, negative indices start from the end:
x[-2]
The array can be modified in place:
print(id(x))
x[1] = 0
print(id(x))
print(x)
For multidimensional arrays, basic indexing is done by specifying a comma-separated list of indices:
x = np.zeros((2,3))
x[0,1] = 1
x
Indexing can be used to select array elements according to a mask.
Let's create our 1D array again:
x = np.arange(10) + 1
x
To create a mask, we evaluate a boolean expression for each element in the array. For example, to find all even numbers:
x%2 == 0
What does this expression mean?
Since x
is a numpy array, x%2
is a numpy element-wise operation that evaluates %2
on all elements of the array, and returns a new array with the results:
xmod = x%2
xmod
Then, we select even numbers by asking the modulo to be equal to zero. Again, the ==
operator is applied to a numpy array, so it is an element-wise operation:
mask = (xmod == 0)
mask
With this mask, we select even numbers and return a new array:
x[mask]
Note that, unlike basic slicing, this kind of advanced indexing returns a copy of the selected elements, not a view: the data is copied to the new array.
It is possible to invert the mask:
x[~mask]
And of course it's possible to build masks on the fly, which is what is typically done:
x[x%2==0]
In python data science, boolean indexing is used extensively to select data by applying thresholds on chosen variables.
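For instance, here is a small sketch of the kind of threshold selection you will meet again and again (the variable names are mine):

# keep only the entries above a threshold:
data = np.random.normal(0, 1, 1000)
selected = data[data > 0.5]
print(selected.shape)  # roughly 300 entries survive the cut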
In basic python, a slice is defined by three fields, start:stop:step. It allows us to select a subsequence of elements in a sequence (which, by essence in python, is 1D):
lst = list(range(1, 10))
print(lst)
lst[1::2]
We selected elements:
- starting at index 1, which holds the second element, 2;
- up to the end of the list, since stop is not specified;
- taking every second element, since step is 2.

Exercise:
Play with the slice definition in the cell above, changing start, stop, and step to see how the selection evolves.
In the example above, we decided not to specify stop. This is possible for all fields:
print( lst[:5:] )
print( lst[::2] )
print( lst[3::] )
print( lst[::] )
The notation for these can and should be simplified to:
print( lst[:5] )
print( lst[::2] )
print( lst[3:] )
print( lst[:] )
Exercise:
Could you simplify these expressions further by removing more colons? What would happen if you did? Test your hypotheses in the cell above.
Numpy slicing is a simple generalization of python slicing to multiple dimensions. To test it, we create a 2D matrix with 4 rows and 5 columns. For this, we use the reshape method that will be discussed in the next section:
x = np.arange(20).reshape(4,5)
x
We can now use the slice notation on any field of the multidimensional index. Here are a few examples:
x[:, 1]
x[:, :2]
x[:, ::2]
x[:, ::-1]
In the previous examples, we have been working on the last dimension, while preserving the first dimension. For this, we needed to add a :
as the first indexing item, to select everything along the first dimension.
But this can become painful when there are more than 2 dimensions. For example, consider this 3D array:
y = np.arange(8).reshape((2,2,2))
y
It's a 2x2x2 cube. If you want to select the first elements along the last dimension, you can either do:
y[:,:,0]
Or use the ellipsis notation:
y[...,0]
The ellipsis adds as many :
as missing dimensions in the indexing statement. In this way, you don't need to keep track of the number of dimensions, and you type less. But most importantly, your code becomes independent of the number of dimensions in the array. For example in this case, y[...,0]
selects the first elements along the last dimension, whatever the number of dimensions of y:
y = np.arange(8).reshape(2,4)
print(y)
print(y[...,0])
Reshaping an array consists of re-organizing the array data into a new array with a different shape.
First, reshaping makes it possible to easily create an array with a given shape, as we have done before:
x = np.arange(20).reshape(4,5)
x
We created a flat 1D array with integers ranging from 0 to 19, and immediately reshaped it into a 2D array of shape (4,5).
Often, one needs to flatten an array. This can be done in several ways:
y = x.flatten()
y
x.ravel()
x.reshape(-1)
The output is the same, so what's the difference between flatten, ravel, and reshape?
First, flatten always returns a copy of the array.
On the other hand, ravel attempts to return a view on the original array, and only copies the data when it's needed. This means that if you modify the output of ravel, you might modify the original array. Also, it means that you should use ravel on very big arrays to save memory, and that ravel will generally be faster than flatten.
reshape is a more flexible function, as you can reshape to any compatible shape. Just like ravel, it returns a view and only performs a copy when necessary. I suggest using either flatten or ravel when you want to flatten an array, to make your code more explicit.
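Here is a small experiment of my own that makes the copy/view difference visible:

a = np.arange(6).reshape(2, 3)
v = a.ravel()    # a view on a (no copy performed here)
v[0] = 100       # ...so modifying v also modifies a
print(a)
f = a.flatten()  # always a copy
f[0] = -1        # ...so a is left untouched
print(a)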
We often need to turn an array of shape (N,) into a "column vector" of shape (N,1). Let's consider an array of shape (5,):
y = np.arange(5)
print(y)
print(y.shape)
We turn it into a columnar array of shape (5,1) with the c_
object (note that it's not a function!):
coly = np.c_[y]
print(coly)
print(coly.shape)
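Note that np.c_ is not the only way to do this; for example, reshape and np.newaxis give the same result:

print(y.reshape(-1, 1).shape)   # (5, 1)
print(y[:, np.newaxis].shape)   # (5, 1)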
You should know enough about numpy to deal with most numpy operations encountered in machine learning projects.
We will now work on real-life examples so that you can hone your numpy skills!
Let's consider a practical supervised machine learning example, the classification of dog and cat pictures. The machine is presented with examples so that it can learn. Each example consists of:
- an image of a dog or a cat;
- a target, e.g. 0 for dogs and 1 for cats.
When presented with a cat image, the machine will output a value between 0 and 1. We'll compare this value to the target, and update the machine parameters so that next time, it can get closer to the solution.
The training dataset (the set of examples used for training) is typically stored in numpy arrays. For example, one could use two numpy arrays:
- an array of shape (10000, 200, 200, 3) for the images, assuming we have 10000 color pictures of 200x200 pixels;
- an array of shape (10000,) or (10000, 1) for the targets.

The exact structure of the dataset will depend on where you get it or how you build it, and sometimes you will have to change this structure to provide the dataset to the machine, depending on how the machine is implemented.
As a practical case, let's use the simplified MNIST handwritten digit dataset included in the scikit-learn machine learning package. We first load the dataset:
from sklearn import datasets
digits = datasets.load_digits()
We get some object called digits
. We can use the type
and dir
built-in functions to see what it is:
print( type(digits) )
print( dir(digits) )
Ok, Bunch
is some unknown class in scikit-learn, but its attributes have rather explicit names. Let's print some more information:
print( type(digits.images) )
print( type(digits.target) )
Two numpy arrays! You should start to feel at home. Let's print their shape:
print( digits.images.shape )
print( digits.target.shape )
Exercises
# your code here
# your code here
Accuracy is a measure of performance for classification algorithms. It is defined as the probability to correctly classify an example, and is calculated as
$$ a = 1 - \frac{M}{N}, $$
where $N$ is the total number of examples, and $M$ the number of misclassified examples.
After training, the accuracy is computed on a test dataset separate from the training dataset. In practice, the whole test dataset is sent as a numpy array to the machine, which produces a numpy array containing the results of the evaluation for each example in the test dataset. The accuracy is then computed from this output and the labels for the test dataset.
Let us assume we're still working on the MNIST handwritten digits dataset and that the test dataset contains 10 examples, with the following labels:
true_labels = np.random.randint(0, 10, 10)
true_labels
And let's assume that the output of our machine for these 10 examples is:
results = np.array([6, 7, 3, 6, 0, 7, 3, 1, 6, 5])
results
Exercise:
Compute the accuracy in per cent. For this, note that you can sum up all entries in a numpy array x
by doing x.sum()
. No loop allowed!
# your code here
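If you get stuck, here is one possible solution (spoiler, so try on your own first!). It uses the fact that == is element-wise and that True counts as 1 in a sum:

# element-wise comparison gives a boolean array,
# whose sum is the number of correctly classified examples:
n_correct = (results == true_labels).sum()
accuracy = 100. * n_correct / len(true_labels)
print(accuracy)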
Neural networks don't work well with big numbers.
So usually, the neural network inputs are rescaled to be of the order of unity before being fed to the neural network.
Let's consider again the MNIST dataset, which is already loaded in this notebook as digits
, and here is the first image:
digits.images[0]
Exercise
Rescale the images so that the pixel values are of the order of unity. What is the maximum pixel value in digits.images? No loop allowed!
# your code here
We already know that neural networks don't like to deal with large numbers. And it's also not good to have widely different shapes for the distributions of the input variables.
So usually, we standardize the input variables. A common choice is to set the mean of each input distribution to 0, and its standard deviation to 1.
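In other words, each input variable $x$ is replaced by its standardized version (this is the standard definition, written here for reference):
$$ z = \frac{x - \mu}{\sigma}, $$
where $\mu$ and $\sigma$ are the mean and standard deviation of $x$ over the dataset.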
Let's build a toy dataset with 1000 entries and two input variables following a gaussian probability density function.
First, we generate the dataset.
For this, we draw a random sample of 1000 events for each variable. Note the large difference in mean and sigma:
x = np.random.normal(100, 5, 1000)
y = np.random.normal(10, 1, 1000)
Then, we stack the two variables in a single numpy array:
dataset = np.c_[x, y]
print(dataset.shape)
dataset
And finally, we check that the mean and standard deviation of each variable is as expected:
print(np.mean(dataset, axis=0))
print(np.std(dataset, axis=0))
By using axis=0
, we asked for the mean and standard deviation to be computed along the first axis, which is the outermost dimension. Therefore, numpy computed these quantities for both columns separately.
Take enough time to think about this, and to understand how you move in the 2D array when you stride along the outermost dimension.
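If the axis argument is still unclear, here is a small example of my own to play with:

m = np.arange(6).reshape(2, 3)
print(m)
print(m.sum(axis=0))  # collapse the first axis: one result per column, shape (3,)
print(m.sum(axis=1))  # collapse the second axis: one result per row, shape (2,)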
Exercise
Rescale the dataset so that the mean and standard deviation of both variables are set to 0 and 1, respectively. But remember: no loop allowed!
# your code here
# your code here
By convention, TensorFlow considers that the color channels in an image are ordered as red, green, blue (RGB) along the innermost dimension. On the contrary, Caffe, another deep learning library, assumes colors to be in the reverse order (BGR).
Let's assume you want to do transfer learning, and that the deep neural network you want to use has been pre-trained with Caffe. But your images are RGB... Obviously, you need to convert your images from RGB to BGR so that the network can make use of its knowledge of colors.
In this exercise, you will find out how to do this on a single image.
First, let's load the "face" example image:
import scipy.misc
import matplotlib.pyplot as plt
face = scipy.misc.face()
plt.imshow(face)
Exercise
Hint: you will have to make good use of the slice notation. Also, try to use the ellipsis.
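If you want to check your solution, here is one possible approach (spoiler!), combining the ellipsis with a negative step:

# reverse the innermost (color) dimension to go from RGB to BGR:
bgr = face[..., ::-1]
plt.imshow(bgr)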
Well done! You're now an experienced numpy user!
Your new skills will prove useful, even outside the field of machine learning.
Just one word of caution for the future: don't reinvent the wheel. It's highly probable that what you'll need has already been implemented by others using numpy, so don't hesitate to have a look at existing scientific python packages before starting to code, such as scipy, pandas, or scikit-learn, which all build on numpy.
You can now head over to Matplotlib for Machine Learning to learn how to make your first plots!
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week, I promise!).