Using python to analyse effects of COVID-19 on mortality in the UK
By andre
COVID-19 has been with us for almost a year as a news item. The reports indicate, that a lot of people died of the disease. Every lost life is a tragedy, of course. It is important to understand, how exactly the pandemic affected mortality in large populations. Unfortunately, the cause of death attribution to COVID-19 is not always accurate. Sometimes people are recorded as have died of the virus simply because they tested positive, even when they had none of very few real symptoms. On the other hand, some deaths from the infection might very well gone unnoticed. So today I will use some python code and publicly available figures to try and draw some conclusions from the data that we do have and can be reasonably confident about.
As an example I will be looking at the United Kingdom. It has good record keeping of the fundamental demographic figures, such as interest us today, namely overall population size and death counts. The data is available on the web site for the ONS (Office for National Statistics). I use these two sets specifically:
- Estimates of the population for the UK, England and Wales, Scotland and Northern Ireland - the data available there covers years 2006-2020
- Deaths registered monthly in England and Wales - which includes the totals for the whole of the UK. Regional details for areas outside England and Wales are not available, but I don’t need those anyway for this exercise
For December 2020 there are no death numbers published at the time of writing, therefore I approximate those to be the average of the preceding 11 months of the year. There is also no population figure for 2020 as well. The approximation here is the value for 2019, adjusted to the same growth rate as between 2018 and 2019, which is around 0.5%.
With that in mind, let’s look at some results. First of all, we can calculate the yearly death rates for the available years. Here is the data ordered by the death rate per 1000 persons in ascending order:
Date               Death_rate    Change                                                                                                                                                                                                                                                                                                                                                                                          
2011-01-01    7.653723  1.014195                                                                                                                                                                                   
2014-01-01    7.762372  1.009765                                                                                                                                                                                   
2012-01-01    7.838172  1.002689                                                                                                                                                                                   
2010-01-01    7.859246  1.004144                                                                                                                                                                                   
2009-01-01    7.891811  1.001740                                                                                                                                                                                   
2013-01-01    7.905543  1.005257                                                                                                                                                                                   
2019-01-01    7.947101  1.006395                                                                                                                                                                                   
2016-01-01    7.997922  1.009596                                                                                                                                                                                   
2017-01-01    8.074669  1.007443
2015-01-01    8.134768  1.002130
2018-01-01    8.152096  1.008348
2007-01-01    8.220150  1.001750
2008-01-01    8.234535  1.003428
2006-01-01    8.262766  1.083828
2020-01-01    8.955416       NaN
What can we see here? Looks like the year 2011 was a good one - it had the lowest mortality rate. And 2020 does lead this macabre hit parade. By how much? Well, 2006 holds the second place. The change between these two is ~8.4%. This is also the highest sequential change here. 2020 doesn’t look good in that regard either. Let’s take a look at year by year change now, in chronological order:
Date               Death_rate    Change
2006-01-01    8.262766  0.994842
2007-01-01    8.220150  1.001750
2008-01-01    8.234535  0.958380
2009-01-01    7.891811  0.995874
2010-01-01    7.859246  0.973850
2011-01-01    7.653723  1.024099
2012-01-01    7.838172  1.008595
2013-01-01    7.905543  0.981890
2014-01-01    7.762372  1.047975                                                                         
2015-01-01    8.134768  0.983178
2016-01-01    7.997922  1.009596
2017-01-01    8.074669  1.009589
2018-01-01    8.152096  0.974854
2019-01-01    7.947101  1.126878
2020-01-01    8.955416       NaN
We can see clearly that 2020 still displays the largest increase in mortality rates year on year. It shot 12% vs 2019. The second largest increase during the investigated period was between 2014-2015, at roughly 5%. If you were in the sceptics camp until now, doubting even the existence of the virus, it might be a good time to re-evaluate your position. This is under the assumption, that you at least trust the ONS numbers.
As we know, the new coronavirus might had been spreading around the world as early as winter of 2019. However, the sharp increase in reported case numbers only started appearing in March 2020 and later. Quite possible, that the first quarter of 2020 wasn’t that special after all. Therefore, let’s look at the data split into quarter periods, again, starting with the death rate ordering:
Date               Death_rate    Change
2013-07-01    1.751359  1.014003
2011-07-01    1.775883  1.000595
2018-07-01    1.776940  1.000676
2009-07-01    1.778142  1.002361
2012-07-01    1.782340  1.004786
2010-07-01    1.790870  1.001598
2020-07-01    1.793731  1.009587
...................
2010-01-01    2.143884  1.003103
2019-01-01    2.150537  1.006831
2016-01-01    2.165228  1.010686
2008-10-01    2.188365  1.005275
2020-10-01    2.199910  1.002848
2008-01-01    2.206174  1.001151
2013-01-01    2.208713  1.009871
2009-01-01    2.230516  1.001872
2020-01-01    2.234692  1.016805
2007-01-01    2.272246  1.020122
2006-01-01    2.317969  1.004696
2017-01-01    2.328853  1.025557
2015-01-01    2.388372  1.037103
2018-01-01    2.476987  1.100968
2020-04-01    2.727083       NaN
Very curious! It looks like the second quarter of 2020 was a really bad time to be in the UK, if you valued your earthly existence - it is the clear “winner” of this ranking. However, the first quarter, as I suspected, was nothing special and it languishes at the 7th place. The last quarter is even further behind, at 11th! Of course, let’s remember that the figures for December 2020 are an approximation for now, so we will have to wait until the real numbers are published. The third quarter was the most interesting - it is 7th place from the end, i.e. pretty much in line with mortality for a reasonably good year. Of course, it was summer, and death rates fluctuate between seasons. It is possible, however, looking at all the differences between the quarters of 2020, that the prevention measures, such as lockdowns and international travel reduction, did help bringing mortality down.
As a last bit of analysis, let’s take a look at monthly death rates figures:
MIN                          
            Death_rate    Change
Date                            
2009-08-01    0.539764  1.024146
2012-09-01    0.552798  1.001645
2020-08-01    0.553707  1.004080
2015-08-01    0.555967  1.001263
2014-08-01    0.556669  1.000151
   MAX                          
            Death_rate    Change
Date                            
2010-12-01    0.769510  1.001251
2014-12-01    0.770472  1.000178
2010-01-01    0.770609  1.000648
2018-03-01    0.771108  1.006122
2013-04-01    0.775829  1.005169
2020-05-01    0.779839  1.012964
2011-01-01    0.789948  1.021680
2019-01-01    0.807075  1.000654
2006-01-01    0.807602  1.001690
2006-03-01    0.808967  1.014109
2007-01-01    0.820381  1.005836
2013-01-01    0.825169  1.020424
2008-01-01    0.842022  1.002754
2020-01-01    0.844342  1.026698
2008-12-01    0.866883  1.002076
2017-01-01    0.868683  1.017757
2009-01-01    0.884108  1.057791
2015-01-01    0.935201  1.032566
2018-01-01    0.965658  1.359262
2020-04-01    1.312581       NaN
Once again we see, that the first quarter, and specifically April of 2020 was a rather sinister time in recent British history. However, other months of that year weren’t that bad, at least as far as mortality was concerned.
Conclusion
In this post I demonstrated how using publicly available data and a bit of python code we can analyse even such complex and ongoing phenomenon as a global pandemic. It is worth emphasising, that I myself have no training in either epidemiology nor in serious statistical analysis, therefore I am not trying to do too much of deep investigation into figures I extract from the data. There are plenty of considerations that should be taken into consideration as well, but I am not in a position to do so in the framework of a short publication. People with more data, experience and/or patience will derive more interesting insights. And I did not even attempt to look at the human cost or the economical consequences - for this more diverse data and models will be needed. Nonetheless, I hope you consider this article as a useful exercise and will use the demonstrated techniques for your own research into matters large and small.
The source code
Below I attach the python 3 script I use to generate my data. It can serve as an example for manipulating Excel files in the language, as well as working with some popular scientific libraries, such as NumPy and Pandas. This code is published under the GPLv3 license agreement.
#!/usr/bin/env python
import enum
import re
import os
import os.path
import datetime as dt
from enum import unique, auto
from typing import List, Dict
import numpy as np
import pandas as pd
import openpyxl as xl
from openpyxl.worksheet.worksheet import Worksheet
@unique
class DataPointType(enum.IntEnum):
    Yearly = auto()
    Quarterly = auto()
    Monthly = auto()
class YearlyStats:
    def __init__(self, year: int, population: int):
        self.year = year
        self.population = population
        self.months = np.zeros(12, int)
    def add_month(self, month: int, value: int):
        self.months[month] = value
    def year_total(self) -> int:
        return int(self.months.sum())
    def monthly_average(self, exclude_zeros: bool = True) -> int:
        if exclude_zeros:
            count = np.count_nonzero(self.months)
        else:
            count = len(self.months)
        return self.year_total() / count
    def fill_zeros(self):
        avg = self.monthly_average()
        for i in range(len(self.months)):
            if self.months[i] == 0:
                print(f"Adjusting {self.year}/{i+1} to {avg}")
                self.months[i] = avg
    def death_rate(self) -> float:
        return float(self.year_total()) / (self.population / 1000.0)
    def data_points(self, dp_type: DataPointType):
        if dp_type == DataPointType.Yearly:
            return [(dt.date(self.year, 1, 1), self.death_rate())]
        elif dp_type == DataPointType.Monthly:
            slices = []
            for i in range(len(self.months)):
                slices.append((dt.date(self.year, i+1, 1), self._death_rate(self.months[i])))
            return slices
        elif dp_type == DataPointType.Quarterly:
            slices = []
            for i in range(4):
                slices.append((dt.date(self.year, i*3+1, 1), self._death_rate(self.months[i*3 : (i+1)*3])))
            return slices
        else:
            raise Exception("Unsupported type")
    def _death_rate(self, data) -> float:
        if isinstance(data, int):
            sum = data
        else:
            sum = data.sum()
        return float(sum) / (self.population / 1000.0)
    def __str__(self):
        return f"{self.year}: {self.year_total():,} {self.death_rate():.3}"
# Data source: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence
# Can covert .xls files into .xslx with LibreOffice:
# libreoffice --headless --convert-to xlsx --outdir . *.xls
class UkDataExtractor:
    year_reg = re.compile(r'.*\b(\d{4})\b.*')
    # Data from:
    # ukpopulationestimates18382018.xlsx
    # ukmidyearestimates20192020ladcodes.xls
    # For 2020 the estimate is based on 2019 with the same increase as between 2018 and 2019
    population_raw = """
Mid-2020    67,160,028
Mid-2019    66,796,807
Mid-2018	66,435,550
Mid-2017	66,040,229
Mid-2016	65,648,054
Mid-2015	65,110,034
Mid-2014	64,596,752
Mid-2013	64,105,654
Mid-2012	63,705,030
Mid-2011	63,285,145
Mid-2010	62,759,456
Mid-2009	62,260,486
Mid-2008	61,823,772
Mid-2007	61,319,075
Mid-2006	60,826,967
Mid-2005	60,413,276
Mid-2004	59,950,364
Mid-2003	59,636,662
Mid-2002	59,365,677
Mid-2001	59,113,016
Mid-2000	58,886,100
    """
    def __init__(self):
        self.years: List[YearlyStats] = []
        self.population: Dict[int, int] = dict()
        for line in self.population_raw.strip().splitlines():
            year, population = line.split()
            year = year.split("-")[1]
            self.population[int(year)] = int(population.replace(",", ""))
    def load(self, path):
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                fullname = os.path.join(dirpath, filename)
                self.process_file(fullname)
        for year in self.years:
            year.fill_zeros()
        self.years.sort(key=lambda x: x.death_rate())
    def process_file(self, path: str):
        if not path.endswith(".xlsx"):
            return
        def is_total_row_indicator(cell_value: str) -> bool:
            indicators = ["england, wales and elsewhere", "total registrations"]
            if cell_value is None or not isinstance(cell_value, str):
                return False
            for ind in indicators:
                if ind in cell_value.lower():
                    return True
            return False
        wb = xl.open(path, read_only=True, data_only=True)
        for name in wb.sheetnames:
            match = self.year_reg.match(name)
            if match is not None:
                year = int(match.group(1))
                print(f"{path}: Year {year} in {name}")
                sheet: Worksheet = wb[name]
                totals = None
                for row in sheet.rows:
                    if totals is None:
                        month = 0
                        for cell in row:
                            value = cell.value
                            if is_total_row_indicator(value):
                                totals = YearlyStats(year, self.population[year])
                            elif totals is not None:
                                if value is not None:
                                    totals.add_month(month, value)
                                    month += 1
                    else:
                        break
                    # if year == 12013:
                    #    print(tuple([cell.value for cell in row]))
                self.years.append(totals)
                break
def main():
    extractor = UkDataExtractor()
    extractor.load("uk")
    dr_name = "Death_rate"
    for dp_type in DataPointType:
        data_points = []
        for year in extractor.years:
            data_points.extend(year.data_points(dp_type))
        df = pd.DataFrame(data_points, columns=["Date", dr_name])
        df.set_index("Date", inplace=True)
        print()
        print("####", dp_type)
        values = df.sort_values(by=["Date"])#df.sort_values(by=[dr_name])
        values["Change"] = values[dr_name].shift(-1) / values[dr_name]
        if len(values) > 100:
            print("   MIN   ")
            print(values.head())
            print("   MAX   ")
            print(values.tail(20))
        else:
            print(values)
if __name__ == '__main__':
    main()