Analyse the COVID-19 data from Johns Hopkins University by yourself with python (pandas, holoviews, matplotlib)
In this post, you will see how to access the COVID-19 dataset provided by Johns Hopkins University, and how to analyse it with python.
You will learn how to access the data, preprocess it with pandas, plot it interactively with holoviews, and fit a simple exponential model to it with scipy.
The dataset can be found in this github repository. To get it, just clone the repository:
git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
It consists of three csv files that are updated daily.
To run this tutorial, first Install Anaconda for Machine Learning and Data Science in Python.
Then, create a conda environment:
conda create -n covid19
Activate it:
conda activate covid19
And install holoviz (you'll also get pandas, matplotlib, and everything else you need):
conda install -c pyviz holoviz
Finally, start your jupyter notebook:
jupyter notebook
The dataset consists of three csv files that are updated daily, one each for the numbers of confirmed cases, deaths, and recoveries.
We create three dataframes, one for each csv file, and we store them in a dictionary:
import pandas as pd
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'
fields = ['Confirmed', 'Deaths', 'Recovered']
dfs = dict()
for field in fields:
    dfs[field] = pd.read_csv(datatemplate.format(field))
Here is what we have for one of them:
dfs['Confirmed'].head()
We need to do a bit of preprocessing before we can analyze these data. Indeed, some countries are split into several states or provinces that we want to sum, and the data for each date is stored in a separate column, while we would rather have one row per date.
Here is the code:
# loop on the dataframe dictionary
for field, df in dfs.items():
    # group by country, to sum on states
    df = df.groupby('Country/Region', as_index=False).sum()
    # turn each measurement column into a separate line,
    # and store the results in a new dataframe
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 value_name='counts')
    # keep track of the quantity that is measured
    # either Confirmed, Deaths, or Recovered
    df['quantity'] = field
    # change column names
    df.columns = ['country', 'lat', 'lon', 'date', 'counts', 'quantity']
    # replace the dataframe in the dictionary
    dfs[field] = df
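To see what this groupby + melt combination does in isolation, here is a minimal sketch on a synthetic table mimicking the JHU wide format (the country names, coordinates, and date columns below are made up for illustration):

```python
import pandas as pd

# synthetic wide-format table: one row per province, one column per date
wide = pd.DataFrame({
    'Country/Region': ['France', 'France', 'Italy'],
    'Lat': [1.0, 2.0, 3.0],
    'Long': [4.0, 5.0, 6.0],
    '3/1/20': [1, 2, 10],
    '3/2/20': [3, 4, 20],
})

# sum the provinces of each country
grouped = wide.groupby('Country/Region', as_index=False).sum()

# one row per (country, date) instead of one column per date
tidy = grouped.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                    value_name='counts')
print(tidy.shape)  # (4, 5): 2 countries x 2 dates, 5 columns
```

The two France rows are summed into one, and the date columns become a `variable` column with one row per date.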
Now, we can concatenate the three dataframes and look at the results:
dfall = pd.concat(dfs.values())
dfall['date'] = pd.to_datetime(dfall['date'])
dfall.head()
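The pd.to_datetime step matters because the JHU column headers are US-style month/day/year strings; pandas parses them into proper timestamps (the sample strings below are made up to match that format):

```python
import pandas as pd

# the JHU date columns look like '3/1/20', i.e. month/day/year
dates = pd.to_datetime(pd.Series(['3/1/20', '3/15/20']))
print(dates.iloc[0])  # 2020-03-01 00:00:00
```

With real timestamps, the dates sort correctly and plot on a proper time axis.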
We can now visualize these data interactively with holoviews. First, import it and load the bokeh plotting backend:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
Then, we need to create a holoviews dataset from our dataframe. Holoviews datasets are just a way to label and give meaning to your data. For example, here, we have a single measurement, counts. This measurement is done in different countries at different dates, for different quantities: Confirmed, Deaths, and Recovered. So we provide country, date, and quantity as "key dimensions", and counts as "value dimension":
ds = hv.Dataset(dfall, kdims=['country', 'date', 'quantity'],
                vdims=['counts'])
Currently, the dataset contains all countries. Let's select a few European countries for simplicity:
dsel = ds.select(country=['France','Italy', 'Switzerland'])
Finally, we can project the dataset into a set of curves. Here, we plot counts as a function of date, and the other key dimensions are automatically provided as drop-down menus:
curves = dsel.to(hv.Curve, 'date', 'counts')
curves.opts(width=400)
We can also overlay the curves for all countries:
curves.overlay('country').opts(legend_position='top_left')
And for all quantities:
curves.overlay('quantity').opts(legend_position='top_left')
We can also do bar plots, e.g.:
bars = dsel.to(hv.Bars, ['country','quantity'], 'counts')
bars.opts(width=400, stacked=True)
Beware, the y-axis scale changes over time in this one... I haven't found a way to fix it. At least, it allows us to see that France started early but managed its first cases very well.
We can also do a grouped bar plot:
bars.opts(width=400, height=400, stacked=False, xrotation=90)
In the absence of external change (confinement measures, summer, a vaccine, etc), the number of confirmed cases in a given country should follow an exponential, at least at the beginning of the epidemic.
Indeed, each person has a given probability to infect other people.
So if there are $n(x)$ people infected at day $x$, the additional number of infected $\Delta n(x)$ between day $x$ and day $x+1$ is proportional to $n$:
$$\Delta n = a n$$

So:

$$\frac{\Delta n}{n} = a$$

And by integrating on $x$:

$$\ln(n) = ax$$

Then, we exponentiate:

$$n = e^{ax}$$

Now, the infection does not start on the same day in all countries. So we add a parameter to our model to allow for a delay of the exponential:

$$n = e^{a(x-b)}$$

For a more visual (and much better) explanation, please have a look at this excellent video from 3blue1brown: Exponential growth and epidemics.
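One way to make the parameter $a$ concrete is through the doubling time: if $n = e^{a(x-b)}$, the counts double every $\ln(2)/a$ days. A quick check with a hypothetical growth rate (the value 0.35 below is made up, not fitted to any data):

```python
import numpy as np

a = 0.35                        # hypothetical growth rate, per day
doubling_time = np.log(2) / a   # days for the counts to double
print(round(doubling_time, 1))  # 2.0
```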
We will now extract the parameters a and b for a given country by fitting the model to the data from this country.
Let's start by fitting the number of confirmed cases as a function of time for France.
We select the subset of our dataframe with these data:
sel = dfall[(dfall['country'] == 'France') &
            (dfall['quantity'] == 'Confirmed')]
sel.head()
The counts column contains the number of cases. Let's store this array of values as y for our fit (this is just a shortcut):
y = sel['counts']
We want to fit this as a function of the day. In the dataframe, the day is a timestamp, not a plain number, so we can't fit with it directly. So we just create an array of days starting at zero, with the right length:
import numpy as np
x = np.arange(len(y))
x[:10]
Then, we define the function that we want to fit:
expo = lambda x, a, b : np.exp( a*(x-b) )
The lambda statement creates a function object that, given x, a, and b, returns the value of our exponential. To take the exponential, we use np.exp, which is the numpy exponential function. This function can take a numpy array and efficiently exponentiate all elements in the array. Therefore, our expo function can also take a numpy array, and will also perform well.
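For readers less familiar with lambda, the expression above is equivalent to an ordinary function definition:

```python
import numpy as np

def expo(x, a, b):
    # exponential growth with rate a, shifted by b days
    return np.exp(a * (x - b))

print(expo(0, 1.0, 0.0))  # 1.0
```

Because np.exp is vectorized, passing an array of days returns an array of values in one call.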
After doing that, we can just call expo:
expo(np.arange(5), 1, 0)
We're now ready to perform the fit:
from scipy.optimize import curve_fit
results = curve_fit(expo, x, y)
results
The first element of the results tuple contains the fitted a and b parameters. We can now plot the data points together with the fitted function.
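As a sanity check, curve_fit should recover known parameters when run on synthetic, noiseless data (the values 0.3 and 5 below are made up for this test, not fitted to any real counts):

```python
import numpy as np
from scipy.optimize import curve_fit

expo = lambda x, a, b: np.exp(a * (x - b))

# noiseless synthetic data generated with known parameters
xs = np.arange(30)
ys = expo(xs, 0.3, 5.0)

# unpack the results tuple: fitted parameters and their covariance matrix
pars, cov = curve_fit(expo, xs, ys, p0=(0.1, 0.0))
print(pars)  # close to [0.3, 5.0]
```

The p0 argument gives the optimizer a starting point, which helps convergence for exponentials.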
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.plot(x, expo(x, *results[0]))
We have evaluated the function for each value of x in the dataset. If we want to extrapolate out of the data points, we need more values of x. We get them by creating a linspace:
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
Finally, here is a small function that compares different countries in a given time range:
def plot(countries, xrange,
         dtype='Confirmed',
         yrange=None,
         yscale='linear'):
    '''plot the covid-19 data with an exponential fit.
    - countries: list of countries
    - xrange: fit range, e.g. (30,55)
    - yscale: log or linear
    - yrange: well, y range, useful in log mode.
    '''
    xmin, xmax = xrange
    linx = np.linspace(xmin, xmax, 101)
    colors = ['blue', 'red', 'orange', 'green']
    for i, country in enumerate(countries):
        color = colors[i]
        sel = dfall[(dfall['country'] == country) &
                    (dfall['quantity'] == dtype)]
        yp = sel['counts'].iloc[xmin:xmax+1]
        xp = np.arange(len(yp)) + xmin
        plt.scatter(xp, yp, label=country,
                    alpha=0.7, marker='.', color=color)
        pars, cov = curve_fit(expo, xp, yp)
        f = expo(linx, *pars)
        plt.plot(linx, f,
                 color=color, alpha=0.3)
    plt.legend(loc='upper left')
    plt.xlabel('days')
    plt.ylabel('confirmed cases')
    plt.yscale(yscale)
    if yrange:
        plt.ylim(*yrange)
    plt.grid()
plt.figure(dpi=90, figsize=(8,4))
plot(['France', 'Italy', 'Switzerland'],
     dtype='Confirmed',
     xrange=(30, 56),
     yscale='log')
We have extrapolated by only a week, which is probably reasonable. Confinement has just been enforced in Italy, and its effects will come later. In the other countries, as far as I know, not much has been done yet.
Of course, the further you extrapolate, the less accurate the prediction.
Extreme care must be taken when extrapolating a model out of the data points.
First of all, we're now fitting the number of cases in France as a function of time. But the French population is limited, so the exponential is clearly not going to work forever.
As a matter of fact, the larger the fraction of infected people, the more difficult it is for infected people to find new people to infect. In other words, at some point, our parameter a, which is the rate of infection and is currently fixed, is going to start decreasing as a function of time. The question is when... And we just have no clue for now.
Also, confinement measures will make it more difficult for people to infect others, which will also reduce a. This is certainly what's happening in China, where almost a billion people are currently confined. But if confinement ends for some reason, the exponential growth will kick in again.
Again, for a more detailed discussion, please have a look at the video from 3blue1brown.
The other reason why extrapolations are dangerous is that they are quite sensitive to the value of the last fitted point:
# the previous fit:
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
# multiplying the last count by 1.2:
y_p = y.copy()
y_p.iloc[-1] = y.iloc[-1]*1.2
plt.scatter(x, y_p)
results_p = curve_fit(expo, x, y_p)
plt.plot(linx, expo(linx, *results_p[0]))
plt.xlim(30, 50)
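The same experiment can be run end to end on synthetic data to quantify the effect (the parameters 0.25 and 10 below are hypothetical, not the real French counts): bumping the last point by 20% and refitting visibly inflates the extrapolation one week past the data.

```python
import numpy as np
from scipy.optimize import curve_fit

expo = lambda x, a, b: np.exp(a * (x - b))

xs = np.arange(41)
ys = expo(xs, 0.25, 10.0)   # hypothetical "data"

ys_p = ys.copy()
ys_p[-1] *= 1.2             # bump the last count by 20%

pars, _ = curve_fit(expo, xs, ys, p0=(0.2, 5.0))
pars_p, _ = curve_fit(expo, xs, ys_p, p0=(0.2, 5.0))

# ratio of the two extrapolations one week past the last point
ratio = expo(47, *pars_p) / expo(47, *pars)
print(ratio)  # larger than 1
```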
Reduce a: Stay at home if possible and wash your hands!
You might have noticed that something important is missing in the plots of this post: uncertainties! And indeed, without uncertainties, a plot cannot be taken seriously.
Here, I wanted to insist on the technical python tools. But in the next post, I'll show you how to take uncertainties into account.
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week, I promise).