COVID-19 Analysis

Analyze the COVID-19 data from Johns Hopkins University yourself with Python (pandas, holoviews, matplotlib)

In this post, you will see how to access the COVID-19 dataset provided by Johns Hopkins University, and how to analyze it with Python.

You will learn how to:

  • preprocess the dataset with pandas
  • create interactive plots with holoviews
  • fit a simple exponential model to the data to predict the number of cases in the near future

The COVID-19 Dataset from Johns Hopkins University

The dataset can be found in this GitHub repository. To get it, just clone the repository:

git clone https://github.com/CSSEGISandData/COVID-19.git

cd COVID-19

It consists of three CSV files that are updated daily.

Installation

To run this tutorial, first Install Anaconda for Machine Learning and Data Science in Python.

Then, create a conda environment:

conda create -n covid19

Activate it:

conda activate covid19

And install holoviz (you'll also get pandas, matplotlib, and everything you need):

conda install -c pyviz holoviz

Finally, start your jupyter notebook:

jupyter notebook

Preprocessing the COVID-19 Dataset from Johns Hopkins University

The dataset consists of three CSV files that are updated daily:

  • time_series_19-covid-Confirmed.csv
  • time_series_19-covid-Deaths.csv
  • time_series_19-covid-Recovered.csv

We create three dataframes, one for each csv file, and we store them in a dictionary:

In [1]:
import pandas as pd
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'

fields = ['Confirmed', 'Deaths', 'Recovered']
dfs = dict()
for field in fields: 
    dfs[field] = pd.read_csv(datatemplate.format(field))

Here is what we have for one of them:

In [2]:
dfs['Confirmed'].head()
Out[2]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/2/20 3/3/20 3/4/20 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20
0 NaN Thailand 15.0000 101.0000 2 3 5 7 8 8 ... 43 43 43 47 48 50 50 50 53 59
1 NaN Japan 36.0000 138.0000 2 1 2 2 4 4 ... 274 293 331 360 420 461 502 511 581 639
2 NaN Singapore 1.2833 103.8333 0 1 3 3 4 5 ... 108 110 110 117 130 138 150 150 160 178
3 NaN Nepal 28.1667 84.2500 0 0 0 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
4 NaN Malaysia 2.5000 112.5000 0 0 0 3 4 4 ... 29 36 50 50 83 93 99 117 129 149

5 rows × 54 columns

We need to do a bit of preprocessing before we can analyze these data. Indeed:

  • The numbers of cases are stored in columns, one per date, which is not practical for analysis and display. We'd rather have one row per measurement.
  • The numbers of confirmed cases, deaths, and recovered patients are currently stored in different dataframes. We would prefer to have all the information in a single dataframe.
  • The column names are too long and painful to type.

Here is the code:

In [4]:
# loop on the dataframe dictionary
for field, df in dfs.items():
    # group by country, summing over the states
    # (note that Lat and Long get summed too, so they are no longer
    # meaningful coordinates; we won't use them in this post)
    df = df.groupby('Country/Region', as_index=False).sum()
    # turn each measurement column into a separate line, 
    # and store the results in a new dataframe
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 value_name='counts')
    # keep track of the quantity that is measured 
    # either Confirmed, Deaths, or Recovered
    df['quantity'] = field
    # change column names 
    df.columns =  ['country', 'lat', 'lon', 'date', 'counts', 'quantity']
    # replace the dataframe in the dictionary
    dfs[field] = df
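To see what melt does on its own, here is a toy frame with the same layout as the CSV files (the values are made up):

```python
import pandas as pd

# hypothetical mini-frame, same layout as the CSV files
df = pd.DataFrame({
    'Country/Region': ['Thailand', 'Japan'],
    'Lat': [15.0, 36.0],
    'Long': [101.0, 138.0],
    '1/22/20': [2, 2],
    '1/23/20': [3, 1],
})
long_df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                  value_name='counts')
print(long_df)
```

Each date column becomes a set of rows, with the column name stored in a variable column (which the loop above renames to date).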

Now, we can concatenate the three dataframes and look at the results:

In [5]:
dfall = pd.concat(dfs.values())
dfall['date'] = pd.to_datetime(dfall['date'])
dfall.head()
Out[5]:
country lat lon date counts quantity
0 Afghanistan 33.0000 65.0000 2020-01-22 0 Confirmed
1 Albania 41.1533 20.1683 2020-01-22 0 Confirmed
2 Algeria 28.0339 1.6596 2020-01-22 0 Confirmed
3 Andorra 42.5063 1.5218 2020-01-22 0 Confirmed
4 Argentina -38.4161 -63.6167 2020-01-22 0 Confirmed

Displaying COVID-19 Cases vs Time

For the display, I decided to use holoviews, which is a fast way to get nice interactive plots, with bokeh as a powerful backend.

First, we initialize holoviews:

In [6]:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')

Then, we need to create a holoviews dataset from our dataframe. Holoviews datasets are just a way to label and give meaning to your data. For example, here, we have a single measurement, counts. This measurement is done in different countries at different dates, for different quantities: Confirmed, Deaths, and Recovered.

So we provide country, date, and quantity as "key dimensions", and counts as "value dimension":

In [7]:
ds = hv.Dataset(dfall, kdims=['country', 'date', 'quantity'], 
                vdims=['counts'])

Currently, the dataset contains all countries. Let's select a few European countries for simplicity:

In [10]:
dsel = ds.select(country=['France','Italy', 'Switzerland'])

Finally, we can project the dataset into a set of curves.

Here, we plot counts as a function of date, and the other key dimensions are automatically provided as drop down menus:

In [11]:
curves = dsel.to(hv.Curve, 'date', 'counts')
curves.opts(width=400)
Out[11]:

We can also overlay the curves for all countries:

In [12]:
curves.overlay('country').opts(legend_position='top_left')
Out[12]:

And for all quantities:

In [13]:
curves.overlay('quantity').opts(legend_position='top_left')
Out[13]:

We can also make bar plots, for example:

In [14]:
bars = dsel.to(hv.Bars, ['country','quantity'], 'counts')
bars.opts(width=400, stacked=True)
Out[14]:

Beware: the y-axis scale changes over time in this plot, and I haven't found a way to fix it. Still, it lets us see that France started early but managed its first cases very well.

We can also do a grouped bar plot:

In [15]:
bars.opts(width=400, height=400, stacked=False, xrotation=90)
Out[15]:

Statistical Modelling and Extrapolation

A simple exponential model of the epidemic

In the absence of external change (confinement measures, summer, a vaccine, etc), the number of confirmed cases in a given country should follow an exponential, at least at the beginning of the epidemic.

Indeed, each person has a given probability to infect other people.

So if there are $n(x)$ people infected at day $x$, the additional number of infected $\Delta n(x)$ between day $x$ and day $x+1$ is proportional to $n$:

$$\Delta n = a n$$

So:

$$\frac{\Delta n}{n} = a $$

And by integrating over $x$ (dropping the integration constant, which we will reintroduce as a delay parameter below):

$$\ln(n) = ax$$

Then, we exponentiate:

$$n = e^{ax}$$

Now, the infection does not start on the same day in all countries. So we add a parameter to our model to allow for a delay of the exponential:

$$n = e^{a(x-b)}$$
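A convenient way to read the parameter $a$ is through the doubling time: setting $e^{aT} = 2$ gives $T = \ln(2)/a$. For example, with a growth rate of about 0.27 per day (the order of magnitude we will fit for France below):

```python
import numpy as np

a = 0.27            # growth rate per day, roughly what we fit for France below
T = np.log(2) / a   # doubling time in days
print(round(T, 1))  # -> 2.6
```

So at such a rate, the number of cases doubles roughly every two and a half days.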

For a more visual (and much better) explanation, please have a look at this excellent video from 3blue1brown: Exponential growth and epidemics.

Fitting the model

We will now extract the parameters a and b for a given country by fitting the model to the data from this country.

Let's start by fitting the number of confirmed cases as a function of time for France.

We select the subset of our dataframe with these data:

In [16]:
sel = dfall[ (dfall['country']=='France') &
             (dfall['quantity']=='Confirmed')]
sel.head()
Out[16]:
country lat lon date counts quantity
39 France 82.1984 -123.6697 2020-01-22 0 Confirmed
153 France 82.1984 -123.6697 2020-01-23 0 Confirmed
267 France 82.1984 -123.6697 2020-01-24 2 Confirmed
381 France 82.1984 -123.6697 2020-01-25 3 Confirmed
495 France 82.1984 -123.6697 2020-01-26 3 Confirmed

The counts column contains the number of cases. Let's store this array of values as y for our fit (this is just a shortcut):

In [17]:
y = sel['counts']

We want to fit this as a function of the day. In the dataframe, each day is a timestamp, not a number, so we can't fit with it directly. Since there is one row per day, we just create an array of day numbers starting at zero, with the right length:

In [18]:
import numpy as np
x = np.arange(len(y))
x[:10]
Out[18]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
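This works because the dataset has exactly one row per day, with no gaps. A slightly more robust variant (just a sketch) derives the day numbers from the timestamps themselves, so that missing days would show up as gaps in x:

```python
import pandas as pd

# hypothetical dates with a missing day, to show the behavior
dates = pd.Series(pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-25']))
x = (dates - dates.min()).dt.days.to_numpy()
print(x)  # -> [0 1 3]: the missing day shows up as a gap
```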

Then, we define the function that we want to fit:

In [19]:
expo = lambda x, a, b : np.exp( a*(x-b) )

The lambda expression creates a function object that, given x, a, and b, returns the value of our exponential. To take the exponential, we use np.exp, the numpy exponential function, which can efficiently exponentiate all elements of a numpy array at once. Therefore, our expo function can also take a numpy array, and will perform just as well.

After doing that, we can just call expo:

In [20]:
expo(np.arange(5), 1, 0)
Out[20]:
array([ 1.        ,  2.71828183,  7.3890561 , 20.08553692, 54.59815003])

We're now ready to perform the fit:

In [21]:
from scipy.optimize import curve_fit
results = curve_fit(expo, x, y)
results
Out[21]:
(array([ 0.2695074 , 20.27067366]),
 array([[3.95487585e-05, 4.01039595e-03],
        [4.01039595e-03, 4.08498869e-01]]))
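The second array is the covariance matrix of the fit; the square roots of its diagonal estimate the one-sigma uncertainties on a and b. A self-contained sketch of how to extract them (with synthetic data, so that the snippet stands alone):

```python
import numpy as np
from scipy.optimize import curve_fit

expo = lambda x, a, b: np.exp(a * (x - b))

# synthetic data generated from the model, for illustration only
rng = np.random.default_rng(0)
x = np.arange(40)
y = expo(x, 0.27, 20.0) * rng.normal(1.0, 0.05, size=len(x))

pars, cov = curve_fit(expo, x, y, p0=(0.2, 15.0))
a, b = pars
da, db = np.sqrt(np.diag(cov))  # one-sigma parameter uncertainties
print(f'a = {a:.3f} +/- {da:.3f}')
print(f'b = {b:.1f} +/- {db:.1f}')
```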

The first element of the results tuple contains the fitted a and b parameters. We can now plot the data points together with the fitted function.

In [22]:
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.plot(x, expo(x, *results[0]))
Out[22]:
[<matplotlib.lines.Line2D at 0x101f2cf7b8>]

We have evaluated the function for each value of x in the dataset. If we want to extrapolate out of the data points, we need more values of x. We get them by creating a linspace:

In [23]:
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
Out[23]:
[<matplotlib.lines.Line2D at 0x1020054400>]

Finally, here is a small function that compares different countries in a given time range:

In [34]:
def plot(countries, xrange,
         dtype='Confirmed',
         yrange=None,
         yscale='linear'): 
    '''Plot the COVID-19 data with an exponential fit.
    - countries: list of countries
    - xrange: fit range, e.g. (30,55)
    - dtype: Confirmed, Deaths, or Recovered
    - yrange: y range, useful in log mode
    - yscale: log or linear
    '''
    xmin, xmax = xrange
    linx = np.linspace(xmin, xmax, 101)    
    colors = ['blue', 'red', 'orange', 'green']
    for i, country in enumerate(countries): 
        color = colors[i]
        sel = dfall[ (dfall['country']==country) &
                 (dfall['quantity']==dtype)]
        yp = sel['counts'][xmin:xmax+1]
        xp = np.arange(len(yp))+xmin
        plt.scatter(xp, yp, label=country, 
                    alpha=0.7, marker='.', color=color)
        pars, cov = curve_fit(expo, xp, yp)
        f = expo(linx, *pars)
        plt.plot(linx, f, 
                 color=color, alpha=0.3)
    plt.legend(loc='upper left')
    plt.xlabel('days')
    plt.ylabel(dtype.lower())
    plt.yscale(yscale)
    if yrange: 
        plt.ylim(*yrange)
    plt.grid()


plt.figure(dpi=90, figsize=(8,4))
plot(['France', 'Italy', 'Switzerland'], 
     dtype='Confirmed',
     xrange=(30, 56),
     yscale='log')

We have extrapolated by only a week, which is probably reasonable. Confinement has just been enforced in Italy, and its effects will come later. In the other countries, as far as I know, not much has been done yet.

Of course, the more you extrapolate, the less accurate is the prediction.

A Word of Caution

Extreme care must be taken when extrapolating a model out of the data points.

First of all, we're now fitting the number of cases in France as a function of time. But the French population is limited, so the exponential is clearly not going to work forever.

As a matter of fact, the larger the fraction of infected people, the more difficult it is for infected people to find new people to infect. In other words, at some point, our parameter a, which is the rate of infection and is currently fixed, is going to start decreasing as a function of time. The question is when... And we just have no clue for now.
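This saturation is often modelled with a logistic curve, which grows exponentially at first and flattens out as n approaches the total susceptible population K. A minimal sketch (the parameter values are illustrative, not fitted to any data):

```python
import numpy as np

def logistic(x, K, a, b):
    '''Logistic growth: roughly exponential for x << b, saturating at K.'''
    return K / (1.0 + np.exp(-a * (x - b)))

x = np.arange(100)
n = logistic(x, K=10_000, a=0.27, b=40)
# early on, n grows roughly like exp(a * x); by the end it is close to K
```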

Also, confinement measures will make it more difficult for people to infect others, which will also reduce a. This is certainly what's happening in China, where almost a billion people are currently confined. But if confinement ends for some reason, the exponential growth will kick in again.

Again, for a more detailed discussion, please have a look at the video from 3blue1brown.

The other reason why extrapolations are dangerous is that they are quite sensitive to the value of the last fitted point:

In [26]:
# the previous fit: 
linx = np.linspace(0,50,101)
plt.scatter(x, y)
plt.plot(linx, expo(linx, *results[0]))
# multiplying the last count by 1.2:
y_p = y.copy()
y_p.iloc[-1] = y.iloc[-1]*1.2
plt.scatter(x, y_p)
results_p = curve_fit(expo, x, y_p)
plt.plot(linx, expo(linx, *results_p[0]))
plt.xlim(30, 50)
Out[26]:
(30, 50)

Conclusion & Outlook

Reduce a: Stay at home if possible and wash your hands!

You might have noticed that something important is missing in the plots of this post: uncertainties! And indeed, without uncertainties, a plot cannot be taken seriously.

Here, I wanted to insist on the technical python tools. But in the next post, I'll show you how to take uncertainties into account.


Please let me know what you think in the comments! I’ll try and answer all questions.

And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one email per week, I promise).
