Analyse the COVID-19 data from Johns Hopkins University by yourself with python (pandas, holoviews, matplotlib)
In this post, you will see how to access the COVID-19 dataset provided by Johns Hopkins University, and how to analyse it with python.
You will learn how to access the data, preprocess it with pandas, plot it interactively with holoviews, and fit a simple exponential model to it with scipy.
The dataset can be found in this github repository. To get it, just clone the repository:
git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
It consists of three csv files that are updated daily.
To run this tutorial, first Install Anaconda for Machine Learning and Data Science in Python.
Then, create a conda environment:
conda create -n covid19
Activate it:
conda activate covid19
And install holoviz (you'll also get pandas, matplotlib, and everything else you need):
conda install -c pyviz holoviz
Finally, start your jupyter notebook:
jupyter notebook
The dataset consists of three csv files that are updated daily, one each for the numbers of confirmed cases, deaths, and recoveries.
We create three dataframes, one for each csv file, and we store them in a dictionary:
import pandas as pd
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'
fields = ['Confirmed', 'Deaths', 'Recovered']
dfs = dict()
for field in fields:
    dfs[field] = pd.read_csv(datatemplate.format(field))
Here is what we have for one of them:
dfs['Confirmed'].head()
We need to do a bit of preprocessing before we can analyze these data. Indeed, some countries are split into several states or provinces that we want to sum, and the data for each date is stored in a separate column, while we would rather have one row per date.
Here is the code:
# loop on the dataframe dictionary
for field, df in dfs.items():
    # group by country, to sum on states
    df = df.groupby('Country/Region', as_index=False).sum()
    # turn each measurement column into a separate line,
    # and store the results in a new dataframe
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 value_name='counts')
    # keep track of the quantity that is measured
    # either Confirmed, Deaths, or Recovered
    df['quantity'] = field
    # change column names
    df.columns = ['country', 'lat', 'lon', 'date', 'counts', 'quantity']
    # replace the dataframe in the dictionary
    dfs[field] = df
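To see what this groupby + melt combination does in isolation, here is a minimal sketch on a synthetic table mimicking the JHU wide format (the country names, coordinates, and date columns below are made up for illustration):

```python
import pandas as pd

# synthetic wide-format table: one row per province, one column per date
wide = pd.DataFrame({
    'Country/Region': ['France', 'France', 'Italy'],
    'Lat': [1.0, 2.0, 3.0],
    'Long': [4.0, 5.0, 6.0],
    '3/1/20': [1, 2, 10],
    '3/2/20': [3, 4, 20],
})

# sum the provinces of each country
grouped = wide.groupby('Country/Region', as_index=False).sum()

# one row per (country, date) instead of one column per date
tidy = grouped.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                    value_name='counts')
print(tidy.shape)  # (4, 5): 2 countries x 2 dates, 5 columns
```

The two France rows are summed into one, and the date columns become a `variable` column with one row per date.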
Now, we can concatenate the three dataframes and look at the results:
dfall = pd.concat(dfs.values())
dfall['date'] = pd.to_datetime(dfall['date'])
dfall.head()
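The pd.to_datetime step matters because the JHU column headers are US-style month/day/year strings; pandas parses them into proper timestamps (the sample strings below are made up to match that format):

```python
import pandas as pd

# the JHU date columns look like '3/1/20', i.e. month/day/year
dates = pd.to_datetime(pd.Series(['3/1/20', '3/15/20']))
print(dates.iloc[0])  # 2020-03-01 00:00:00
```

With real timestamps, the dates sort correctly and plot on a proper time axis.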
We can now visualize these data interactively with holoviews. First, import it and load the bokeh plotting backend:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
Then, we need to create a holoviews dataset from our dataframe. Holoviews datasets are just a way to label and give meaning to your data. For example, here, we have a single measurement, counts. This measurement is done in different countries at different dates, for different quantities: Confirmed, Deaths, and Recovered. So we provide country, date, and quantity as "key dimensions", and counts as "value dimension":
ds = hv.Dataset(dfall, kdims=['country', 'date', 'quantity'],
                vdims=['counts'])
Currently, the dataset contains all countries. Let's select a few European countries for simplicity:
dsel = ds.select(country=['France','Italy', 'Switzerland'])
Finally, we can project the dataset into a set of curves. Here, we plot counts as a function of date, and the other key dimensions are automatically provided as drop-down menus:
curves = dsel.to(hv.Curve, 'date', 'counts')
curves.opts(width=400)
We can also overlay the curves for all countries:
curves.overlay('country').opts(legend_position='top_left')
And for all quantities:
curves.overlay('quantity').opts(legend_position='top_left')
We can also do bar plots, e.g.:
bars = dsel.to(hv.Bars, ['country','quantity'], 'counts')
bars.opts(width=400, stacked=True)
Beware, the y-axis scale changes over time in this one... I haven't found a way to fix it. At least, it allows us to see that France started early but managed its first cases very well.
We can also do a grouped bar plot:
bars.opts(width=400, height=400, stacked=False, xrotation=90)
In the absence of external change (confinement measures, summer, a vaccine, etc), the number of confirmed cases in a given country should follow an exponential, at least at the beginning of the epidemic.
Indeed, each person has a given probability to infect other people.
So if there are $n(x)$ people infected at day $x$, the additional number of infected $\Delta n(x)$ between day $x$ and day $x+1$ is proportional to $n$:
$$\Delta n = a n$$

So:

$$\frac{\Delta n}{n} = a$$

And by integrating on $x$:

$$\ln(n) = ax$$

Then, we exponentiate:

$$n = e^{ax}$$

Now, the infection does not start on the same day in all countries. So we add a parameter to our model to allow for a delay of the exponential:

$$n = e^{a(x-b)}$$

For a more visual (and much better) explanation, please have a look at this excellent video from 3blue1brown: Exponential growth and epidemics.
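One way to make the parameter $a$ concrete is through the doubling time: if $n = e^{a(x-b)}$, the counts double every $\ln(2)/a$ days. A quick check with a hypothetical growth rate (the value 0.35 below is made up, not fitted to any data):

```python
import numpy as np

a = 0.35                        # hypothetical growth rate, per day
doubling_time = np.log(2) / a   # days for the counts to double
print(round(doubling_time, 1))  # 2.0
```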
We will now extract the parameters a and b for a given country by fitting the model to the data from this country.
Let's start by fitting the number of confirmed cases as a function of time for France.
We select the subset of our dataframe with these data:
sel = dfall[(dfall['country'] == 'France') &
            (dfall['quantity'] == 'Confirmed')]
sel.head()
The counts column contains the number of cases. Let's store this array of values as y for our fit (this is just a shortcut):
y = sel['counts']
We want to fit this as a function of the day. In the dataframe, the day is a timestamp, not a plain number, so we can't fit with it directly. So we just create an array of days starting at zero, with the right length:
import numpy as np
x = np.arange(len(y))
x[:10]
Then, we define the function that we want to fit:
expo = lambda x, a, b : np.exp( a*(x-b) )
The lambda statement creates a function object that, given x, a, and b, returns the value of our exponential. To take the exponential, we use np.exp, which is the numpy exponential function. This function can take a numpy array and efficiently exponentiate all elements in the array. Therefore, our expo function can also take a numpy array, and will also perform well.
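For readers less familiar with lambda, the expression above is equivalent to an ordinary function definition:

```python
import numpy as np

def expo(x, a, b):
    # exponential growth with rate a, shifted by b days
    return np.exp(a * (x - b))

print(expo(0, 1.0, 0.0))  # 1.0
```

Because np.exp is vectorized, passing an array of days returns an array of values in one call.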
After doing that, we can just call expo:
expo(np.arange(5), 1, 0)
We're now ready to perform the fit:
from scipy.optimize import curve_fit
results = curve_fit(expo, x, y)
results
The first element of the results tuple contains the fitted a and b parameters. We can now plot the data points together with the fitted function.
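As a sanity check, curve_fit should recover known parameters when run on synthetic, noiseless data (the values 0.3 and 5 below are made up for this test, not fitted to any real counts):

```python
import numpy as np
from scipy.optimize import curve_fit

expo = lambda x, a, b: np.exp(a * (x - b))

# noiseless synthetic data generated with known parameters
xs = np.arange(30)
ys = expo(xs, 0.3, 5.0)

# unpack the results tuple: fitted parameters and their covariance matrix
pars, cov = curve_fit(expo, xs, ys, p0=(0.1, 0.0))
print(pars)  # close to [0.3, 5.0]
```

The p0 argument gives the optimizer a starting point, which helps convergence for exponentials.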
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.plot(x, expo(x, *results[0]))
We have evaluated the function for each value of x in the dataset. If we want to extrapolate out of the data points, we need more values of x. We get them by creating a linspace:
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
Finally, here is a small function that compares different countries in a given time range:
def plot(countries, xrange,
         dtype='Confirmed',
         yrange=None,
         yscale='linear'):
    '''plot the covid-19 data with an exponential fit.
    - countries: list of countries
    - xrange: fit range, e.g. (30,55)
    - yscale: log or linear
    - yrange: well, y range, useful in log mode.
    '''
    xmin, xmax = xrange
    linx = np.linspace(xmin, xmax, 101)
    colors = ['blue', 'red', 'orange', 'green']
    for i, country in enumerate(countries):
        color = colors[i]
        sel = dfall[(dfall['country'] == country) &
                    (dfall['quantity'] == dtype)]
        yp = sel['counts'].iloc[xmin:xmax+1]
        xp = np.arange(len(yp)) + xmin
        plt.scatter(xp, yp, label=country,
                    alpha=0.7, marker='.', color=color)
        pars, cov = curve_fit(expo, xp, yp)
        f = expo(linx, *pars)
        plt.plot(linx, f,
                 color=color, alpha=0.3)
    plt.legend(loc='upper left')
    plt.xlabel('days')
    plt.ylabel('confirmed cases')
    plt.yscale(yscale)
    if yrange:
        plt.ylim(*yrange)
    plt.grid()
plt.figure(dpi=90, figsize=(8,4))
plot(['France', 'Italy', 'Switzerland'],
     dtype='Confirmed',
     xrange=(30, 56),
     yscale='log')
We have extrapolated by only a week, which is probably reasonable. Confinement has just been enforced in Italy, and its effects will come later. In the other countries, as far as I know, not much has been done yet.
Of course, the further you extrapolate, the less accurate the prediction.
Extreme care must be taken when extrapolating a model out of the data points.
First of all, we're now fitting the number of cases in France as a function of time. But the French population is limited, so the exponential is clearly not going to work forever.
As a matter of fact, the larger the fraction of infected people, the more difficult it is for infected people to find new people to infect. In other words, at some point, our parameter a, which is the rate of infection and is currently fixed, is going to start decreasing as a function of time. The question is when... And we just have no clue for now.
Also, confinement measures will make it more difficult for people to infect others, which will also reduce a. This is certainly what's happening in China, where almost a billion people are currently confined. But if confinement ends for some reason, the exponential growth will kick in again.
Again, for a more detailed discussion, please have a look at the video from 3blue1brown.
The other reason why extrapolations are dangerous is that they are quite sensitive to the value of the last fitted point:
# the previous fit:
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
# multiplying the last count by 1.2:
y_p = y.copy()
y_p.iloc[-1] = y.iloc[-1]*1.2
plt.scatter(x, y_p)
results_p = curve_fit(expo, x, y_p)
plt.plot(linx, expo(linx, *results_p[0]))
plt.xlim(30, 50)
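The same experiment can be run end to end on synthetic data to quantify the effect (the parameters 0.25 and 10 below are hypothetical, not the real French counts): bumping the last point by 20% and refitting visibly inflates the extrapolation one week past the data.

```python
import numpy as np
from scipy.optimize import curve_fit

expo = lambda x, a, b: np.exp(a * (x - b))

xs = np.arange(41)
ys = expo(xs, 0.25, 10.0)   # hypothetical "data"

ys_p = ys.copy()
ys_p[-1] *= 1.2             # bump the last count by 20%

pars, _ = curve_fit(expo, xs, ys, p0=(0.2, 5.0))
pars_p, _ = curve_fit(expo, xs, ys_p, p0=(0.2, 5.0))

# ratio of the two extrapolations one week past the last point
ratio = expo(47, *pars_p) / expo(47, *pars)
print(ratio)  # larger than 1
```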
Reduce a: Stay at home if possible and wash your hands!
You might have noticed that something important is missing in the plots of this post: uncertainties! And indeed, without uncertainties, a plot cannot be taken seriously.
Here, I wanted to insist on the technical python tools. But in the next post, I'll show you how to take uncertainties into account.
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week, I promise).