Want to get started with machine learning but don't know Python? You're in the right place! (1h course)
At the moment, Python dominates the fields of data science and machine learning, after a long period of domination by R.
So you definitely need to know this language to be able to use machine learning efficiently.
But learning a new language is not that easy, you might think.
Well, Python is special:
In this tutorial, I'll assume you're not a professional software developer, and I'll give you the basics to get you started with data science and machine learning.
I hope that it will give you a starting point and the motivation to learn more about this wonderful language.
You won't regret it.
This tutorial is designed to be interactive.
The easiest way to run it is to use Google Colab by simply clicking on this link. You'll be able to execute the code on your own, and to modify it.
It's impossible to cover the entirety of Python in this short course, and there are many subjects that won't be discussed. For example, we will explain what a list is, but not how to sort one.
And, after a few days, you will probably have forgotten some of the concepts or specific commands explained here. That's completely normal, and not a problem. Don't worry about this!
But it's essential for you to know how to find more information about Python when you need it.
The Python community is extremely large, so the fastest source of information is Google. If you just type "python doc sort list" on Google, you'll get your answer in a couple of seconds.
Generally speaking, the Python documentation is extremely good, so make sure to use it.
And if you want to dig further after this crash course, you could start with the Python tutorial. It will take you a day or so, but it's definitely worth it.
You've probably heard about object-oriented programming, and it might sound complicated to you.
Don't worry, we're not going to do any object-oriented programming in this course. But we'll use objects all the time, because in Python, everything is an object. An integer, a string, a list of values, and even a function: all of these things are objects.
Let's start by looking at an integer. With the built-in id function, we can get the unique identifier associated with this integer object:
id(1)
And with the type function, we can get the type of the object:
type(1)
Now let's define a variable a, set it to 1, and print it:
a = 1
print(a)
In the notebook, as we have done before, calling print is not always necessary to print the value of an object. At the end of a cell, it's enough to evaluate the object like this:
b = 2
b
a
But in a Python script, or if you're not at the end of a cell in the notebook, you will need to use print.
You might have noticed a big difference with respect to other languages when we defined variable a: we didn't have to declare the type of the variable. In C++, for example, you would write int a = 1;
In Python, the type of a variable is inferred from the type of the object that is assigned to the variable, here an integer. So a is of type int:
type(a)
Now let's check the unique identifier of a:
id(a)
That's the identifier of 1! And it's because a and 1 correspond to the exact same object in memory.
Exercise:
Consider this code:
b = 1
c = b
Can you guess what the ids of b and c will be?
Check your hypothesis in the cell below:
# write your code, and execute the cell with shift+enter
Important:
In Python, variables are called "names". And the process of assigning a value to a variable is called "name binding". In other words, you can see a variable as a label that is attached to an object.
When we do a=1, we attach label a to integer 1. If you attach another label to 1, the same object now carries two labels. When we do the following, we just move label a from 1 to 2:
a = 1
a = 2
This might sound obvious, but please remember it, as it can lead to many misunderstandings, as we will see later in the section about loops.
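To make the label picture more concrete, here is a small sketch. It relies on the is operator, which is not used elsewhere in this tutorial: it tells whether two names are bound to the very same object. The names a and b are only there for illustration.

```python
a = 1
b = a          # attach a second label, b, to the object that a refers to
print(a is b)  # True: both names are bound to the same object

a = 2          # move label a to another object; label b is left untouched
print(a, b)    # 2 1
print(a is b)  # False: the two names now refer to different objects
```

Note that moving label a did not change what b refers to.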
Very often, you'll need to take decisions in your programs, depending on the current situation. Here's an example. Execute the following cell by selecting it and pressing shift+Enter.
do_great_the_world = True
if do_great_the_world:
    print('hello_world')
else:
    print('get lost')
Now turn do_great_the_world to False, and re-execute the cell. That's it, you know how to do conditional execution.
In Python, indentation is important, and is not only there to make the code readable and pretty. When you indent the code (just use the Tab key for that), you specify that you are creating a nested context. For example, in the code above, the line print('hello_world') is "under" the if condition, and will be executed only if do_great_the_world is true.
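As a sketch of how nesting works, here is a condition placed inside another condition. The variable names is_raining and have_umbrella are made up for this illustration; each additional level of indentation opens a new nested context.

```python
is_raining = True
have_umbrella = False

if is_raining:
    # this block is executed only if is_raining is true
    if have_umbrella:
        # and this one only if both conditions are true
        message = 'take the umbrella'
    else:
        message = 'stay home'
else:
    message = 'go for a walk'

print(message)  # stay home
```

Try changing the two variables and re-executing the cell to see which branch is taken.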
In most other languages, however, contexts are delimited by separators, like '{}' in C++. People coming from these other languages often do not like being forced to indent. But if you get going with Python, you will realize that:
Loops make it possible to repeat an operation many times. That's typically what programs are for, so we need to learn how to do this.
First, we need to define something that we can loop over. Things that can be looped over are called iterables. A list is an iterable, and here's one:
a = [1, 2, 3, 4]
print(a)
Note how the print function adapts to the object being printed. You can try it on any kind of object, and you will usually get meaningful results.
Now, let's write a loop to compute the squares of all values in the list, and to sum up those squares:
squares = []
sumsq = 0
for x in a:
    x2 = x**2
    squares.append(x2)
    sumsq += x2
print(squares)
print(sumsq)
What you know at this point about loops is probably enough in most programs, but we're going to have a look at a few syntactic shortcuts.
You can loop in a very compact way by using list comprehensions:
squares = [x**2 for x in a]
squares
That can be handy, and you will encounter this kind of construct very often. But don't overuse it. Beginners often tend to use several list comprehensions to compute different values from the same iterable, for example:
squares = [x**2 for x in a]
cubes = [x**3 for x in a]
print(squares)
print(cubes)
But this is neither elegant nor efficient. If you need to compute several things from the same iterable, use a standard loop construct to save CPU time and look like a pro.
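For comparison, here is a minimal sketch of the single-loop version, assuming the same list a as above. The list is traversed only once, and both results are filled along the way.

```python
a = [1, 2, 3, 4]

squares = []
cubes = []
for x in a:
    # one pass over the list, two values computed per element
    squares.append(x**2)
    cubes.append(x**3)

print(squares)  # [1, 4, 9, 16]
print(cubes)    # [1, 8, 27, 64]
```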
Often, you need to get the values from the iterable, and also the index at which the value is stored. For this, use enumerate:
for i, x in enumerate(a):
    print(i, x)
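To see what enumerate actually produces, one option is to turn its output into a list, as in the sketch below. The start parameter, which is optional, sets the first index.

```python
a = [1, 2, 3, 4]

# each item produced by enumerate is an (index, value) pair
pairs = list(enumerate(a))
print(pairs)  # [(0, 1), (1, 2), (2, 3), (3, 4)]

# start counting from 1 instead of 0
for i, x in enumerate(a, start=1):
    print(i, x)
```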
Exercise
Let's come back to name binding in the context of loops. In the code below, we loop over a list, and try to add 1 to all elements in the list, but it does not work... Do you understand why?
lst = list(range(5))
print(lst)
for x in lst:
    x += 1
print(lst)
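Once you have thought about it, you can check your reasoning: inside the loop, x += 1 only moves the label x to a new integer, and the list still refers to the original objects. One possible way to really update the list, sketched here under the assumption that we want to modify it in place, is to assign to each element through its index.

```python
lst = list(range(5))
for i, x in enumerate(lst):
    # rebind the i-th slot of the list, not just the name x
    lst[i] = x + 1
print(lst)  # [1, 2, 3, 4, 5]

# alternatively, a list comprehension builds a new list with the same values
lst2 = [x + 1 for x in range(5)]
print(lst2)  # [1, 2, 3, 4, 5]
```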
For data science, and especially if the data starts getting large, we will use specific libraries to read data files, and you will see that later when you start using data science libraries such as numpy or pandas.
Here, you will just learn how to read and write basic text files.
Write to a text file by doing:
a = [1, 2, 3, 4]
fname = 'myfile.txt'
# 'w' means that the file is opened in write mode
with open(fname, 'w') as out_file:
    for i in a:
        # we convert our integers to strings,
        # and we write one integer per line
        # note that we need to add a newline
        # character manually to the string,
        # which just contains the number
        out_file.write(str(i) + '\n')
You can check the contents of the file with a shell command (the ! tells the Jupyter notebook that this is a shell command):
! cat myfile.txt
And now you can read the file back:
with open(fname) as in_file:
    for line in in_file:
        # we remove the trailing '\n':
        line = line.strip()
        print(line)
In both the writing and reading cases, we have used the with statement. This statement makes sure that the file is open only within the context of the with. When the program leaves the context, the file is closed automatically.
It's important to make sure that files are closed when they're not needed anymore, so:
Always use the with statement to open files!
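If you want to convince yourself that the file is really closed, here is a small sketch that checks the closed attribute of the file object. It writes to a temporary directory (an assumption made here so that none of your files is overwritten).

```python
import os
import tempfile

# write to a temporary directory so that no existing file is touched
fname = os.path.join(tempfile.mkdtemp(), 'myfile.txt')

with open(fname, 'w') as out_file:
    out_file.write('hello\n')
    print(out_file.closed)  # False: the file is open inside the with block

print(out_file.closed)      # True: the file was closed when leaving the block
```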
Here is a basic function that does not take any argument (it has no parameters):
def say_hello():
    print('hello world!')
say_hello()
To add parameters, just specify them in the function definition:
def say_hello(somebody):
    print('hello {}'.format(somebody))
say_hello('world')
say_hello('colin')
You can add any parameter you need, separating them with commas:
def say_hello(a, b):
    all_people = ' and '.join([a, b])
    print('hello {}'.format(all_people))
say_hello('asterix', 'obelix')
Functions can also return objects: often a single object (maybe a list), or a tuple of objects. To do this, use the return statement:
def square(x):
    return x**2
square(2)
import random
def random_point():
    return random.random(), random.random()
random_point()
You can define default values for the arguments, like this:
def say_hello(a='world'):
    print('hello {}'.format(a))
say_hello('colin')
say_hello()
Finally, functions can take both unnamed positional arguments and keyword (named) arguments. Keyword arguments must be provided after all positional arguments. For example:
def say_hello(greeting, person='laurel', another_person='hardy'):
    print('{greeting} {person} and {another}'.format(
            greeting=greeting,
            person=person,
            another=another_person
        )
    )
say_hello('hi', another_person='colin')
In this example, 'hi' is a positional argument, and another_person is a keyword argument. Of course, positional arguments are to be given in the right order! Keyword arguments are interesting because they make it obvious from outside the function what the argument is for. They also let you specify only some of the arguments that have a default value.
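The sketch below illustrates these rules with a few calls. To make the results easy to inspect, this variant of say_hello returns the string instead of printing it; that change is an assumption made for the example only.

```python
def say_hello(greeting, person='laurel', another_person='hardy'):
    return '{} {} and {}'.format(greeting, person, another_person)

# keyword arguments can be given in any order
print(say_hello('hi', another_person='colin', person='ben'))  # hi ben and colin

# override only one default; person keeps its default value
print(say_hello('hi', another_person='colin'))  # hi laurel and colin

# positional arguments are matched by position: here person is 'colin'
print(say_hello('hi', 'colin'))  # hi colin and hardy
```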
The most widely used data structure in Python is certainly the list.
In the following example, we illustrate a few very common list operations:
# create the list, using square brackets
data = ['a', 0, 1, 'b']
# append a single element at the end
data.append('c')
# extend the list with the contents of another list
data.extend([2, 3])
# print the list length
print('data size', len(data))
# iterate on the list
for elem in data:
    print(elem)
Tuples are very similar to lists. The difference between the two is that lists are mutable, while tuples are immutable. This means that we can change a list (as we have done when we added elements to our data list above), but we can't change a tuple.
In the following example, we create a tuple, print it, and try to add elements to the tuple:
# create a tuple, using parentheses
tup = (0, 1)
print('length:', len(tup), 'elements:', tup)
tup.append(2)
Of course, this fails because the tuple is immutable. This might seem like a big drawback of tuples. But it's not! We're not going to enter a long debate about why this is the case, and only give you one recommendation:
if you need to define a sequence that shouldn't or won't be modified, use a tuple.
This will make it easier for you to debug your code, and make it more memory efficient.
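Here is a rough sketch of both points. It catches the error raised when trying to append to a tuple, and compares the size of a list and a tuple holding the same values with sys.getsizeof. The exact sizes depend on the Python implementation and version, so take the numbers as an indication only.

```python
import sys

tup = (0, 1)
try:
    tup.append(2)
except AttributeError as err:
    # tuples have no append method, so the attempt fails
    print('error:', err)

# compare the memory used by the two containers themselves
numbers_list = list(range(1000))
numbers_tuple = tuple(numbers_list)
print(sys.getsizeof(numbers_list), sys.getsizeof(numbers_tuple))
```

With the standard CPython interpreter, the tuple should come out slightly smaller than the list.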
Finally, a word about tuple assignment. It's possible to create a tuple by packing values as we have done above. And actually we can omit the parentheses:
tup = 0, 1
tup
And we can also unpack:
x, y = tup
print(x, y)
Doing packing and unpacking at the same time, we can initialize several variables in one command:
x, y = 0, 1
print(x, y)
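A common use of simultaneous packing and unpacking is to swap two variables without a temporary one, as in this short sketch:

```python
x, y = 0, 1
x, y = y, x   # pack (y, x) on the right, then unpack into x and y
print(x, y)   # 1 0
```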
Dictionaries are very different from lists and tuples. They are mutable, mapping data structures. This means that dictionaries contain elements that are made of a key mapping to a value, and that we can modify existing dictionaries. Here's your first one:
data = {
    'x': 1,
    'y': 2
}
data
One can add items (or change the value of existing keys!) like this:
data['x'] = 0
data['z'] = 3
data
And we can access the value of a key like this:
data['x']
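Accessing a key that does not exist with square brackets raises a KeyError. The sketch below shows a few standard ways to deal with this and to loop over a dictionary: the in operator, the get method, and the items method. These are not covered elsewhere in this tutorial, so consider this a pointer for later.

```python
data = {'x': 0, 'y': 2, 'z': 3}

# check whether a key exists before accessing it
print('y' in data)        # True
print('w' in data)        # False

# get returns None (or a default of your choice) instead of raising KeyError
print(data.get('w'))      # None
print(data.get('w', -1))  # -1

# loop over the key/value pairs
for key, value in data.items():
    print(key, value)
```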
Python programmers often use dictionaries (or lists of dictionaries) to keep track of information. For example, for a phone book:
[
    {'first_name': 'john',
     'last_name': 'smith',
     'phone': 1234567
    },
    {'first_name': 'john',
     'last_name': 'goodman',
     'phone': 7654321
    },
    {'first_name': 'will',
     'last_name': 'smith',
     'phone': 1234851,
    }
]
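As a sketch of how such a structure can be used, the code below stores the phone book in a variable and looks up entries by last name. The names phone_book and find_by_last_name are hypothetical and chosen for this illustration only.

```python
phone_book = [
    {'first_name': 'john', 'last_name': 'smith', 'phone': 1234567},
    {'first_name': 'john', 'last_name': 'goodman', 'phone': 7654321},
    {'first_name': 'will', 'last_name': 'smith', 'phone': 1234851},
]

def find_by_last_name(book, last_name):
    # collect all entries whose last name matches
    matches = []
    for entry in book:
        if entry['last_name'] == last_name:
            matches.append(entry)
    return matches

for entry in find_by_last_name(phone_book, 'smith'):
    print(entry['first_name'], entry['phone'])
# john 1234567
# will 1234851
```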
At first, deciding which container to use might not be easy. So keep in mind the following facts:
Lists:
Dictionaries:
Exercise
In the following cases, would you use a list, a dictionary, a list of tuples? If you use a list, which elements would you store? If you use a dictionary, which key and value would you store?
Python is an extremely flexible language.
For instance, you can go for object-oriented or fully functional programming, or go for a hybrid approach. And if you decide to do object-oriented programming, for example, there are several ways to do so...
In fact, when you need to do something with Python, you will certainly find that the most difficult part is choosing how to do it.
When in doubt, follow the Zen of Python (PEP20):
And as far as style goes, follow PEP8!
You should now be ready to get started with numpy, which is essential for scientific computing in Python, including machine learning.
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)