Python Crash Course for Machine Learning

Running this tutorial

This tutorial is designed to be interactive.

To run it, the easiest way is to use Google Colab by clicking on this link. You'll be able to execute the code on your own, and to modify it.

If you want to run it on your own computer, first follow Install Anaconda for Machine Learning and Data Science in Python, then start a Jupyter notebook. The source notebook can be found on GitHub here if you want to download it, but you can also just copy and paste the cells into an empty notebook.

Introduction

At the moment, python is ruling the fields of data science and machine learning, after a long period of domination by R. So we think that you need to know this language to be able to use machine learning efficiently.

But what if you've never used python?

In this tutorial, we'll assume you're not a professional software developer, and we'll give you the basics that are necessary for machine learning.

We hope that we'll give you a starting point and the motivation to learn more about this wonderful language. We can assure you that you won't regret it.

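Here's the outline:

  • Getting help
  • Variables and objects
  • Conditions
  • Loops
  • Basic input and output
  • Functions
  • Python data structures
  • The zen of python
  • What now?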

Getting help

It's impossible to cover the entirety of python in this short course, and there are many subjects that won't be covered. For example, we will explain what a list is, but not how to sort one.

And, after a few days, you will probably have forgotten some of the concepts or specific commands explained here. That's completely normal, and not a problem. Don't worry about this!

But it's essential for you to know how to find more information about python when you need it.

The python community is extremely large, so the fastest source of information is Google. If you just type "python doc sort list" on Google, you'll get your answer in a couple of seconds.

Generally speaking, the python documentation is extremely good, so make sure to use it.

And if you want to dig further after this crash course, you could start with the python tutorial. It will take you a day or so, but it's definitely worth it.
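
You can also read documentation without leaving the notebook. As a minimal sketch: every built-in function or method carries its own documentation string in the __doc__ attribute. Here we look at the list sorting method mentioned above (the variable name doc is just for illustration):

```python
# Every built-in function or method carries its own documentation string.
doc = list.sort.__doc__
print(doc)

# In a notebook or an interactive session, help(list.sort) shows the
# same text with a bit more formatting.
```

The exact wording of the text depends on your python version.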

Variables and objects

You've probably heard about object-oriented programming, and it might sound complicated to you.

Don't worry, we're not going to do any object-oriented programming in this course. But we'll use objects all the time, because in python, everything is an object. An integer, a string, a list of values, and even a function: all of these things are objects.

Let's start by looking at an integer. With the id built-in function, we can get the unique identifier associated with this integer object:

In [5]:
id(1)
Out[5]:
4561220736

And with the type function, we can get the type of the object:

In [6]:
type(1)
Out[6]:
int
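
The type function works on any object, not just on integers. Here is a short sketch you can paste in a cell; the values are arbitrary examples:

```python
# type works on any object, not just integers
print(type(1.5))        # a floating-point number
print(type('hello'))    # a string
print(type([1, 2, 3]))  # a list of values
print(type(print))      # even a built-in function is an object
```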

Now let's define a variable a, set it to 1, and print it:

In [7]:
a = 1
print(a)
1

In the notebook, calling print is not always necessary to display the value of an object, as we have seen with id and type above. The value of the last expression in a cell is displayed automatically, so in the cell below only a is shown, not b:

In [8]:
b = 2
b
a
Out[8]:
1

But in a python script, or if you're not at the end of a cell in the notebook, you will need to use print.

You might have noticed a big difference with respect to other languages when we defined variable a: we didn't have to declare the type of the variable. In C++, for example, you would write int a = 1;

In python, the type of a variable is inferred from the type of the object that is assigned to the variable, here an integer. So a is of type int:

In [9]:
type(a)
Out[9]:
int
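
Since the type comes from the object and not from the variable, the same variable can be attached to objects of different types over time. A small sketch, using a hypothetical variable called value so that a is left untouched:

```python
value = 1
print(type(value))   # the object is an integer

value = 'one'
print(type(value))   # now the same name refers to a string

value = 1.0
print(type(value))   # and now to a floating-point number
```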

Now let's check the unique identifier of a:

In [10]:
id(a)
Out[10]:
4561220736

That's the identifier of 1! And it's because a and 1 correspond to the exact same object in memory.


Exercise:

Consider this code:

b = 1 
c = b

Can you guess what the ids of b and c will be? Check your hypothesis in the cell below:

In [11]:
# paste the code above, and execute the cell with shift+enter

Important:

In python, variables are called "names". And the process of assigning a value to a variable is called "name binding". In other words, you can see a variable as a label that is attached to an object.

When we do a=1, we attach label a to integer 1. If you attach another label to 1, the same object now carries two labels. When we do the following, we just move label a from 1 to 2:

In [12]:
a = 1 
a = 2

This might sound obvious, but please remember this, as it can lead to many misunderstandings, as we will see later in the section about loops.
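
To make the label picture concrete, here is a small sketch: we attach two labels to the same object, and then move only one of them.

```python
a = 1
b = a                    # attach a second label to the same object
print(id(a) == id(b))    # True: both labels are on the same object

a = 2                    # move label a to another object
print(b)                 # 1: label b did not move
print(id(a) == id(b))    # False: the labels are now on different objects
```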

Conditions

Very often, you'll need to make decisions in your programs, depending on the current situation. Here's an example. Execute the following cell by selecting it and pressing shift+enter.

In [13]:
do_great_the_world = True
if do_great_the_world: 
    print('hello_world')
else: 
    print('get lost')
hello_world

Now turn do_great_the_world to False, and re-execute the cell. That's it, you know how to do conditional execution.
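
When there are more than two cases, you can chain conditions with elif. This is only a sketch: the accuracy variable, its value, and the thresholds are made up for the example.

```python
accuracy = 0.87  # hypothetical score of a model, between 0 and 1

if accuracy >= 0.9:
    verdict = 'great'
elif accuracy >= 0.7:
    # tested only if the first condition is False
    verdict = 'acceptable'
else:
    # executed only if none of the conditions above is True
    verdict = 'needs work'

print(verdict)
```

Only the first branch whose condition is True gets executed, so the order of the conditions matters.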

About code indentation

In python, indentation is important, and is not only there to make the code readable and pretty. When you indent the code (just use tab for that), you specify that you are creating a nested context. For example, in the code above, the line print('hello_world') is "under" the if condition, and will be executed only if do_great_the_world is True.

In most other languages, however, contexts are surrounded by separators, like '{}' in C++. People coming from these other languages often do not like to be forced to indent. But if you get going with python, you will realize that:

  • no separators means faster and easier typing
  • if people don't indent, the code won't work, so you won't have to waste time trying to review unindented code from other people anymore.
  • Text editors and especially IDEs make it very easy and natural to indent properly: often, the editor will indent automatically, and otherwise you just need to press tab (or shift+tab to back-indent).
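
Contexts can be nested by indenting further. In the sketch below, the variable names and values are hypothetical; what matters is the level of indentation of each line:

```python
temperature = 25
raining = False
advice = 'stay inside'

if temperature > 20:
    # one level of indentation: executed only when it is warm
    advice = 'go out'
    if raining:
        # two levels: executed only when it is warm AND raining
        advice = 'go out with an umbrella'

# back at the left margin: executed in every case
print(advice)
```

Try changing raining to True, or temperature to 10, and see which lines are executed.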

Loops

Loops make it possible to repeat an operation many times. That's typically what programs are for, so we need to learn how to do this.

First, we need to define something that we can loop over. Things that can be looped over are called iterables. A list is an iterable, and here's one:

In [14]:
a = [1, 2, 3, 4]
print(a)
[1, 2, 3, 4]

Note how the print function adapts to the object being printed. You can try it on any kind of object, and you will usually get meaningful results.

Now, let's write a loop to compute the squares of all values in the list, and to sum up those squares:

In [15]:
squares = []
sumsq = 0
for x in a: 
    x2 = x**2
    squares.append(x2)
    sumsq += x2
print(squares)
print(sumsq)
[1, 4, 9, 16]
30
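
Lists are not the only iterables. As a short sketch, the same for syntax works on a string, and on range, which we will use again below:

```python
# a string is an iterable of characters
for letter in 'abc':
    print(letter)

# range(3) produces the integers 0, 1 and 2, one at a time
total = 0
for i in range(3):
    total += i
print(total)
```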

What you know at this point about loops is probably enough for most programs, but we're going to have a look at a few syntactic shortcuts.

You can loop in a very compact way by using list comprehensions:

In [16]:
squares = [x**2 for x in a]
squares
Out[16]:
[1, 4, 9, 16]

That can be handy, and you will encounter this kind of construct very often. But don't overuse it. Beginners often tend to use several list comprehensions to compute different values from the same iterable, for example:

In [17]:
squares = [x**2 for x in a]
cubes = [x**3 for x in a]
print(squares)
print(cubes)
[1, 4, 9, 16]
[1, 8, 27, 64]

But this walks through the iterable once per list comprehension, which is wasteful when the iterable is large or expensive to produce. If you need to compute several things from the same iterable, use a single standard loop construct.
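
For the squares and cubes above, a standard loop could look like the following sketch, which goes through the list only once:

```python
a = [1, 2, 3, 4]
squares = []
cubes = []
for x in a:
    # both results are computed during the same pass over the list
    squares.append(x**2)
    cubes.append(x**3)
print(squares)
print(cubes)
```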

Often, you need to get the values from the iterable, and also the index at which the value is stored. For this, use enumerate:

In [18]:
for i, x in enumerate(a): 
    print(i, x)
0 1
1 2
2 3
3 4

Exercise

Let's come back to name binding in the context of loops. In the code below, we loop on a list, and try to add 1 to all elements in the list, but it does not work... do you understand why?

In [19]:
lst = list(range(5))
print(lst)
for x in lst:
    x += 1
print(lst)
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]

Basic input and output

For data science, and especially if the data starts getting large, we will use specific libraries to read data files, and you will see that later when you start using data science libraries such as numpy or pandas.

Here, you will just learn how to read and write basic text files.

Write to a text file by doing:

In [20]:
a = [1, 2, 3, 4]
fname = 'myfile.txt'
# 'w' means that the file is opened in write mode
with open(fname, 'w') as out_file: 
    for i in a: 
        # we convert our integers to strings, 
        # and we write one integer per line
        # note that we need to add a newline 
        # character manually to the string,
        # which just contains the number
        out_file.write(str(i) + '\n')

You can check the contents of the file with a shell command (the ! tells the jupyter notebook that this is a shell command):

In [21]:
! cat myfile.txt
1
2
3
4

And now you can read the file back:

In [22]:
with open(fname) as in_file: 
    for line in in_file: 
        # we remove the trailing '\n':
        line = line.strip()
        print(line)
1
2
3
4
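
What we read from a text file is always a string. If you need numbers, you have to convert each line back, for example with int. Here is a self-contained sketch; the file name numbers.txt and the temporary directory are assumptions made so that the example does not touch your own files:

```python
import os
import tempfile

# hypothetical file, created in a fresh temporary directory
tmp_name = os.path.join(tempfile.mkdtemp(), 'numbers.txt')

with open(tmp_name, 'w') as out_file:
    for i in [1, 2, 3, 4]:
        out_file.write(str(i) + '\n')

values = []
with open(tmp_name) as in_file:
    for line in in_file:
        # strip removes the trailing newline, int converts the text to a number
        values.append(int(line.strip()))

print(values)
print(sum(values))
```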

In both the writing and the reading case, we have used the with statement. This statement makes sure that the file is open only within the context of the with. When the program leaves this context, the file is closed automatically.

It's important to make sure that files are closed when they're not needed anymore, so:

Always use the with statement to open files!
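
You can check this behaviour yourself with the closed attribute of the file object. A minimal sketch, again with a hypothetical file in a temporary directory:

```python
import os
import tempfile

demo_name = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(demo_name, 'w') as out_file:
    out_file.write('hello\n')
    print(out_file.closed)   # False: we are still inside the with block

print(out_file.closed)       # True: the file was closed when leaving the block
```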

Functions

Here is a basic function, that does not take any argument (it has no parameters):

In [23]:
def say_hello():
    print('hello world!')
    
say_hello()
hello world!

To add parameters, just specify them in the function definition:

In [24]:
def say_hello(somebody):
    print('hello {}'.format(somebody))
    
say_hello('world')
say_hello('colin')
hello world
hello colin

You can add any parameter you need, separating them with commas:

In [25]:
def say_hello(a, b): 
    all_people = ' and '.join([a, b])
    print('hello {}'.format(all_people))
    
say_hello('asterix', 'obelix')
hello asterix and obelix

Functions can also return objects: often a single object (maybe a list), or a tuple of objects. To do this, use the return statement:

In [26]:
def square(x):
    return x**2

square(2)
Out[26]:
4
In [27]:
import random

def random_point():
    return random.random(), random.random()

random_point()
Out[27]:
(0.3257339137707528, 0.4697261999006147)

You can define default values for the arguments, like this:

In [28]:
def say_hello(a='world'):
    print('hello {}'.format(a))

say_hello('colin')
say_hello()
hello colin
hello world

Finally, functions can take either unnamed positional arguments or keyword arguments. Keyword arguments must be provided after all positional arguments. For example:

In [29]:
def say_hello(greeting, person='laurel', another_person='hardy'):
    print('{greeting} {person} and {another}'.format(
        greeting=greeting,
        person=person,
        another=another_person
        )
    )
    
say_hello('hi', another_person='colin')
hi laurel and colin

In this example, 'hi' is a positional argument, and another_person is a keyword argument. Of course, positional arguments must be given in the right order! Keyword arguments are interesting because they make it obvious, from outside the function, what the argument is for. They also let you specify only some of the arguments that have a default value.
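
The sketch below shows a few more ways to call such a function. The function make_greeting is a hypothetical variant of say_hello that returns the string instead of printing it. The last call uses try and except, which are not covered in this course, only to show the error without stopping the cell:

```python
def make_greeting(greeting, person='laurel', another_person='hardy'):
    return '{} {} and {}'.format(greeting, person, another_person)

# two positional arguments, given in the order of the definition
print(make_greeting('hi', 'stan'))

# keyword arguments can be given in any order
print(make_greeting('hi', another_person='ollie', person='stan'))

# greeting has no default value, so leaving it out is an error
try:
    make_greeting(person='stan')
except TypeError as err:
    print('TypeError:', err)
```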

Python data structures

Lists

The most widely used data structure in python is certainly the list.

In the following example, we illustrate a few very common list operations:

In [30]:
# create the list, using square brackets
data = ['a', 0, 1, 'b']
# append a single element at the end
data.append('c')
# extend the list with the contents of another list 
data.extend([2, 3])
# print the list length
print('data size', len(data))
# iterate on the list
for elem in data: 
    print(elem)
data size 7
a
0
1
b
c
2
3
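
Other very common operations are accessing elements by index, slicing, and testing whether a value is in the list. A short sketch, starting from the same contents as above:

```python
data = ['a', 0, 1, 'b', 'c', 2, 3]

print(data[0])       # first element, indices start at 0
print(data[-1])      # negative indices count from the end
print(data[1:3])     # slice from index 1 (included) to 3 (excluded): [0, 1]
print('b' in data)   # True: membership test

# lists are mutable, so an element can be replaced in place
data[0] = 'z'
print(data)
```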

Tuples

Tuples are very similar to lists. The difference between the two is that lists are mutable, while tuples are immutable. This means that we can change a list (as we have done when we added elements to our data list above), but we can't change a tuple.

In the following example, we create a tuple, print it, and try to add elements to the tuple:

In [31]:
# create a tuple, using parentheses
tup = (0, 1)
print('length:', len(tup), 'elements:', tup)
tup.append(2)
length: 2 elements: (0, 1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-c72b34954c76> in <module>
      2 tup = (0, 1)
      3 print('length:', len(tup), 'elements:', tup)
----> 4 tup.append(2)

AttributeError: 'tuple' object has no attribute 'append'

Of course, this fails because the tuple is immutable. This might seem like a big drawback of tuples. But it's not! We're not going to enter a long debate about why this is the case, and only give you one recommendation:

if you need to define a sequence that shouldn't or won't be modified, use a tuple.

This will make it easier for you to debug your code, and make it more memory efficient.
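
If you are curious about the memory argument, you can compare the size of a tuple and of a list holding the same values with sys.getsizeof. This is only a rough sketch: the numbers depend on your python version and platform, and getsizeof measures the container itself, not the objects stored in it.

```python
import sys

as_tuple = (0, 1, 2, 3)
as_list = [0, 1, 2, 3]

# sizes in bytes of the two containers
print(sys.getsizeof(as_tuple), sys.getsizeof(as_list))
```

On the standard python implementation (CPython), the tuple should come out smaller than the list.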

Finally, a word about tuple assignment. It's possible to create a tuple by packing values as we have done above. And actually we can omit the parentheses:

In [ ]:
tup = 0, 1
tup

And we can also unpack:

In [ ]:
x, y = tup
print(x, y)

Doing packing and unpacking at the same time, we can initialize several variables in one command:

In [ ]:
x, y = 0, 1
print(x, y)
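
Two common uses of this are swapping two variables without a temporary one, and unpacking the tuple returned by a function. In this sketch, min_max is a hypothetical helper written for the example:

```python
def min_max(values):
    # the two results are packed into a tuple
    return min(values), max(values)

# unpack the returned tuple into two variables
lo, hi = min_max([3, 1, 4])
print(lo, hi)   # 1 4

# swap two variables in one command
x, y = 0, 1
x, y = y, x
print(x, y)     # 1 0
```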

Dictionaries

Dictionaries are very different from lists and tuples. They are mutable, mapping data structures. This means that dictionaries contain elements that are made of a key mapping to a value, and that we can modify existing dictionaries. Here's your first one:

In [ ]:
data = {
    'x': 1,
    'y': 2
}

data

One can add items (or change the value of existing keys!) like this:

In [ ]:
data['x'] = 0
data['z'] = 3
data

And we can access the value of a key like this:

In [ ]:
data['x']
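
A few other dictionary operations come up all the time: testing whether a key exists, getting a value with a fallback, looping over the items, and deleting a key. A self-contained sketch, with the same contents as above:

```python
data = {'x': 0, 'y': 2, 'z': 3}

print('x' in data)         # True: the test is done on the keys
print(data.get('w', -1))   # -1: key 'w' does not exist, so we get the fallback

# loop over the keys and values together
for key, value in data.items():
    print(key, value)

# remove a key and its value
del data['z']
print(data)
```

Note that data['w'] would raise a KeyError, which is why get is handy when you are not sure that the key exists.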

Python programmers often use dictionaries (or lists of dictionaries) to keep track of information. For example, for a phone book:

In [ ]:
[
    {'first_name':'john',
     'last_name':'smith',
     'phone': 1234567 
    },
    {'first_name':'john',
     'last_name':'goodman',
     'phone': 7654321    
    },
    {'first_name':'will',
     'last_name':'smith',
     'phone': 1234851,     
    } 
]

List or dictionary?

At first, deciding which container to use might not be easy. So keep in mind the following facts:

Lists:

  • sequential, meaning that the elements are arranged in a given order;
  • can be sorted;
  • provide very fast access to an element when its index is known, with O(1) complexity
  • checking whether an item is in a list is of complexity O(n), so the time needed scales linearly with the number of elements in the list

Dictionaries:

  • provide very fast access to an element when its key is known, with O(1) complexity
  • checking whether a key is in the dictionary is of complexity O(1).
  • for python < 3.6, dictionaries are inherently unordered. In python 3.6, the insertion order is preserved in CPython (the reference implementation), but only as an implementation detail. For python >= 3.7, preserving the insertion order is guaranteed by the language.
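
You can get a feeling for the difference in membership tests with the timeit module from the standard library. This is a rough sketch and not a proper benchmark: the size n and the number of repetitions are arbitrary, the timings depend on your machine, and lambda is just a compact way to define the small function that timeit calls.

```python
import timeit

n = 100000
as_list = list(range(n))

# a dictionary with the same numbers as keys (the values do not matter here)
as_dict = {}
for i in as_list:
    as_dict[i] = True

# worst case for the list: the element we look for is the last one
target = n - 1

t_list = timeit.timeit(lambda: target in as_list, number=100)
t_dict = timeit.timeit(lambda: target in as_dict, number=100)
print('list: {:.6f} s'.format(t_list))
print('dict: {:.6f} s'.format(t_dict))
```

If you increase n, the time for the list should grow roughly in proportion, while the time for the dictionary should stay about the same.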

Exercise

In the following cases, would you use a list, a dictionary, or a list of tuples? If you use a list, which elements would you store? If you use a dictionary, which key and value would you store?

  • plotting (x,y) data points. In which data structure would you store your points before plotting?
  • a detector has 32 channels, and measures one value in each channel when the measurement is triggered. Which data structure would you use to store the measurements?
  • some channels are bad and need to be masked out. Where would you store the bad channels?
  • each value should be calibrated by a factor that depends on the channel. Where would you store the calibration factors?
  • consider the phone book given above. What would be the complexity for finding the information for a given person? How would you improve the phonebook to make the search very fast?

The zen of python

Python is an extremely flexible language.

For instance, you can go for object-oriented or fully functional programming, or go for a hybrid approach. And if you decide to do object-oriented programming, for example, there are several ways to do so...

In fact, when you need to do something with python, you will certainly find that the most difficult part is choosing how to do it.

When in doubt, follow the Zen of Python (PEP 20):

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
  • Special cases aren't special enough to break the rules.
  • Although practicality beats purity.
  • Errors should never pass silently.
  • Unless explicitly silenced.
  • In the face of ambiguity, refuse the temptation to guess.
  • There should be one-- and preferably only one --obvious way to do it.
  • Although that way may not be obvious at first unless you're Dutch.
  • Now is better than never.
  • Although never is often better than right now.
  • If the implementation is hard to explain, it's a bad idea.
  • If the implementation is easy to explain, it may be a good idea.
  • Namespaces are one honking great idea -- let's do more of those!

As far as style goes, follow PEP 8!
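
By the way, you don't need to remember where to find the Zen of Python: it is shipped with python as a small easter egg, in a standard library module called this.

```python
# importing this module for the first time prints the Zen of Python
import this
```

The text is printed only on the first import in a given session.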

What now?

You should now be ready to get started with numpy, which is essential for scientific computing in python, including machine learning.

Numpy Crash Course for Machine Learning

