Using Python to analyse the effects of COVID-19 on mortality in the UK
By andre
COVID-19 has been with us as a news item for almost a year. The reports indicate that a lot of people have died of the disease. Every lost life is a tragedy, of course, but it is also important to understand how exactly the pandemic affected mortality in large populations. Unfortunately, the attribution of deaths to COVID-19 is not always accurate. Sometimes people are recorded as having died of the virus simply because they tested positive, even when they had no or very few real symptoms. On the other hand, some deaths from the infection might very well have gone unnoticed. So today I will use some Python code and publicly available figures to try and draw some conclusions from the data that we do have and can be reasonably confident about.
As an example I will be looking at the United Kingdom. It has good record keeping of the fundamental demographic figures that interest us today, namely overall population size and death counts. The data is available on the website of the ONS (Office for National Statistics). I use these two datasets specifically:
- Estimates of the population for the UK, England and Wales, Scotland and Northern Ireland - the data available there covers the years 2006-2020
- Deaths registered monthly in England and Wales - which also includes the totals for the whole of the UK. Regional details for areas outside England and Wales are not available, but I don't need those for this exercise anyway.
No death numbers for December 2020 have been published at the time of writing, so I approximate them as the average of the preceding 11 months of the year. There is also no population figure for 2020 yet. The approximation here is the value for 2019, adjusted by the same growth rate as between 2018 and 2019, which is around 0.5%.
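The population extrapolation above is simple arithmetic. As a sketch, using the ONS mid-year estimates for 2018 and 2019 (the same values hard-coded in the script at the end of this post):

```python
# ONS mid-year UK population estimates for 2018 and 2019.
pop_2018 = 66_435_550
pop_2019 = 66_796_807

# Extrapolate 2020 by applying the same growth rate as between 2018 and 2019.
growth = pop_2019 / pop_2018          # ~1.0054, i.e. roughly 0.5% growth
pop_2020 = round(pop_2019 * growth)   # roughly 67.16 million
```

The December 2020 death count is approximated the same way the script's fill_zeros() does it: any missing month gets the average of the known months of that year.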
With that in mind, let's look at some results. First of all, we can calculate the yearly death rates for the available years. Here is the data ordered by death rate per 1000 persons, in ascending order:
Date Death_rate Change
2011-01-01 7.653723 1.014195
2014-01-01 7.762372 1.009765
2012-01-01 7.838172 1.002689
2010-01-01 7.859246 1.004144
2009-01-01 7.891811 1.001740
2013-01-01 7.905543 1.005257
2019-01-01 7.947101 1.006395
2016-01-01 7.997922 1.009596
2017-01-01 8.074669 1.007443
2015-01-01 8.134768 1.002130
2018-01-01 8.152096 1.008348
2007-01-01 8.220150 1.001750
2008-01-01 8.234535 1.003428
2006-01-01 8.262766 1.083828
2020-01-01 8.955416 NaN
What can we see here? It looks like 2011 was a good one - it had the lowest mortality rate. And 2020 does lead this macabre hit parade. By how much? Well, 2006 holds second place, and the change between these two is ~8.4%. This is also the highest sequential change in the table, so 2020 doesn't look good in that regard either. Let's take a look at the year-by-year change now, in chronological order:
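To make the table columns concrete, here is a minimal pandas sketch of how Death_rate and Change can be derived. The deaths and population figures below are illustrative round numbers, not the exact ONS values:

```python
import pandas as pd

# Illustrative yearly deaths and mid-year population (approximate magnitudes).
df = pd.DataFrame(
    {"Deaths": [530_000, 604_000], "Population": [66_796_807, 67_160_028]},
    index=pd.to_datetime(["2019-01-01", "2020-01-01"]),
)

# Death rate per 1000 persons.
df["Death_rate"] = df["Deaths"] / (df["Population"] / 1000)

# Change on a row is the *next* period's rate divided by this one's,
# matching the shift(-1) trick in the full script at the end of the post.
# The last row has no successor, hence the NaN in the tables.
df["Change"] = df["Death_rate"].shift(-1) / df["Death_rate"]
```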
Date Death_rate Change
2006-01-01 8.262766 0.994842
2007-01-01 8.220150 1.001750
2008-01-01 8.234535 0.958380
2009-01-01 7.891811 0.995874
2010-01-01 7.859246 0.973850
2011-01-01 7.653723 1.024099
2012-01-01 7.838172 1.008595
2013-01-01 7.905543 0.981890
2014-01-01 7.762372 1.047975
2015-01-01 8.134768 0.983178
2016-01-01 7.997922 1.009596
2017-01-01 8.074669 1.009589
2018-01-01 8.152096 0.974854
2019-01-01 7.947101 1.126878
2020-01-01 8.955416 NaN
We can see clearly that 2020 displays the largest year-on-year increase in mortality rates: it shot up by almost 13% versus 2019. The second largest increase during the investigated period, between 2014 and 2015, was roughly 5%. If you were in the sceptics' camp until now, doubting even the existence of the virus, it might be a good time to re-evaluate your position - under the assumption, of course, that you at least trust the ONS numbers.
As we know, the new coronavirus might have been spreading around the world as early as the winter of 2019. However, the sharp increase in reported case numbers only started appearing in March 2020 and later. It is quite possible that the first quarter of 2020 wasn't that special after all. Therefore, let's look at the data split into quarterly periods, again starting with the ordering by death rate:
Date Death_rate Change
2013-07-01 1.751359 1.014003
2011-07-01 1.775883 1.000595
2018-07-01 1.776940 1.000676
2009-07-01 1.778142 1.002361
2012-07-01 1.782340 1.004786
2010-07-01 1.790870 1.001598
2020-07-01 1.793731 1.009587
...................
2010-01-01 2.143884 1.003103
2019-01-01 2.150537 1.006831
2016-01-01 2.165228 1.010686
2008-10-01 2.188365 1.005275
2020-10-01 2.199910 1.002848
2008-01-01 2.206174 1.001151
2013-01-01 2.208713 1.009871
2009-01-01 2.230516 1.001872
2020-01-01 2.234692 1.016805
2007-01-01 2.272246 1.020122
2006-01-01 2.317969 1.004696
2017-01-01 2.328853 1.025557
2015-01-01 2.388372 1.037103
2018-01-01 2.476987 1.100968
2020-04-01 2.727083 NaN
Very curious! It looks like the second quarter of 2020 was a really bad time to be in the UK, if you valued your earthly existence - it is the clear "winner" of this ranking. The first quarter, however, as I suspected, was nothing special: it languishes in 7th place from the deadly end of the list, and the last quarter is even further behind, at 11th! Of course, let's remember that the figures for December 2020 are an approximation for now, so we will have to wait until the real numbers are published. The third quarter is the most interesting - it sits 7th from the opposite, low-mortality end of the ranking, i.e. pretty much in line with a reasonably good year. Of course, it was summer, and death rates fluctuate between seasons. Looking at all the differences between the quarters of 2020, though, it is possible that the prevention measures, such as lockdowns and reduced international travel, did help bring mortality down.
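For readers wanting to reproduce the quarterly split with pandas alone, one option is `resample`; my script instead sums the raw monthly counts directly. The monthly rate values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# A hypothetical monthly death-rate series for one year.
months = pd.date_range("2020-01-01", periods=12, freq="MS")
monthly = pd.Series(np.linspace(0.84, 0.60, 12), index=months)

# Death rates over the same population are additive across sub-periods,
# so a quarter's rate is just the sum of its three monthly rates.
quarterly = monthly.resample("QS").sum()  # labels: Jan, Apr, Jul, Oct
```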
As a last bit of analysis, let's take a look at the monthly death rate figures:
MIN
Death_rate Change
Date
2009-08-01 0.539764 1.024146
2012-09-01 0.552798 1.001645
2020-08-01 0.553707 1.004080
2015-08-01 0.555967 1.001263
2014-08-01 0.556669 1.000151
MAX
Death_rate Change
Date
2010-12-01 0.769510 1.001251
2014-12-01 0.770472 1.000178
2010-01-01 0.770609 1.000648
2018-03-01 0.771108 1.006122
2013-04-01 0.775829 1.005169
2020-05-01 0.779839 1.012964
2011-01-01 0.789948 1.021680
2019-01-01 0.807075 1.000654
2006-01-01 0.807602 1.001690
2006-03-01 0.808967 1.014109
2007-01-01 0.820381 1.005836
2013-01-01 0.825169 1.020424
2008-01-01 0.842022 1.002754
2020-01-01 0.844342 1.026698
2008-12-01 0.866883 1.002076
2017-01-01 0.868683 1.017757
2009-01-01 0.884108 1.057791
2015-01-01 0.935201 1.032566
2018-01-01 0.965658 1.359262
2020-04-01 1.312581 NaN
Once again we see that the second quarter, and specifically April of 2020, was a rather sinister time in recent British history. However, other months of that year weren't that bad, at least as far as mortality was concerned.
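As an aside, pandas can pull such extremes out directly with `nsmallest`/`nlargest`, instead of sorting the whole frame and slicing with `head()`/`tail()` as my script does. The values here are a hypothetical subset of the monthly rates:

```python
import pandas as pd

# A few illustrative monthly death rates per 1000 persons.
rates = pd.Series(
    [0.54, 0.55, 0.78, 1.31, 0.84],
    index=pd.to_datetime(
        ["2009-08-01", "2012-09-01", "2020-05-01", "2020-04-01", "2020-01-01"]
    ),
)

lowest = rates.nsmallest(2)   # mildest months, ascending
highest = rates.nlargest(2)   # deadliest months, descending
```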
Conclusion
In this post I demonstrated how, using publicly available data and a bit of Python code, we can analyse even such a complex and ongoing phenomenon as a global pandemic. It is worth emphasising that I have no training in either epidemiology or serious statistical analysis, so I am not trying to dig too deeply into the figures I extract from the data. There are plenty of other factors that should be taken into account as well, but I am not in a position to do so within the framework of a short publication. People with more data, experience and/or patience will derive more interesting insights. And I did not even attempt to look at the human cost or the economic consequences - that would require more diverse data and models. Nonetheless, I hope you find this article a useful exercise and will use the demonstrated techniques for your own research into matters large and small.
The source code
Below I attach the Python 3 script I used to generate the data. It can serve as an example of manipulating Excel files in the language, as well as of working with some popular scientific libraries such as NumPy and Pandas. This code is published under the GPLv3 license.
#!/usr/bin/env python
import enum
import re
import os
import os.path
import datetime as dt
from enum import unique, auto
from typing import List, Dict

import numpy as np
import pandas as pd
import openpyxl as xl
from openpyxl.worksheet.worksheet import Worksheet


@unique
class DataPointType(enum.IntEnum):
    Yearly = auto()
    Quarterly = auto()
    Monthly = auto()


class YearlyStats:
    def __init__(self, year: int, population: int):
        self.year = year
        self.population = population
        self.months = np.zeros(12, int)

    def add_month(self, month: int, value: int):
        self.months[month] = value

    def year_total(self) -> int:
        return int(self.months.sum())

    def monthly_average(self, exclude_zeros: bool = True) -> float:
        if exclude_zeros:
            count = np.count_nonzero(self.months)
        else:
            count = len(self.months)
        return self.year_total() / count

    def fill_zeros(self):
        # Approximate missing months (e.g. December 2020) with the
        # average of the months that do have data
        avg = self.monthly_average()
        for i in range(len(self.months)):
            if self.months[i] == 0:
                print(f"Adjusting {self.year}/{i+1} to {avg}")
                self.months[i] = avg

    def death_rate(self) -> float:
        return float(self.year_total()) / (self.population / 1000.0)

    def data_points(self, dp_type: DataPointType):
        if dp_type == DataPointType.Yearly:
            return [(dt.date(self.year, 1, 1), self.death_rate())]
        elif dp_type == DataPointType.Monthly:
            slices = []
            for i in range(len(self.months)):
                slices.append((dt.date(self.year, i + 1, 1), self._death_rate(self.months[i])))
            return slices
        elif dp_type == DataPointType.Quarterly:
            slices = []
            for i in range(4):
                slices.append((dt.date(self.year, i * 3 + 1, 1), self._death_rate(self.months[i * 3:(i + 1) * 3])))
            return slices
        else:
            raise Exception("Unsupported type")

    def _death_rate(self, data) -> float:
        # `data` is either a single month's count or a numpy slice of months
        if isinstance(data, int):
            total = data
        else:
            total = data.sum()
        return float(total) / (self.population / 1000.0)

    def __str__(self):
        return f"{self.year}: {self.year_total():,} {self.death_rate():.3}"


# Data source: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence
# Can convert .xls files into .xlsx with LibreOffice:
# libreoffice --headless --convert-to xlsx --outdir . *.xls
class UkDataExtractor:
    year_reg = re.compile(r'.*\b(\d{4})\b.*')

    # Data from:
    # ukpopulationestimates18382018.xlsx
    # ukmidyearestimates20192020ladcodes.xls
    # For 2020 the estimate is based on 2019 with the same increase as between 2018 and 2019
    population_raw = """
    Mid-2020 67,160,028
    Mid-2019 66,796,807
    Mid-2018 66,435,550
    Mid-2017 66,040,229
    Mid-2016 65,648,054
    Mid-2015 65,110,034
    Mid-2014 64,596,752
    Mid-2013 64,105,654
    Mid-2012 63,705,030
    Mid-2011 63,285,145
    Mid-2010 62,759,456
    Mid-2009 62,260,486
    Mid-2008 61,823,772
    Mid-2007 61,319,075
    Mid-2006 60,826,967
    Mid-2005 60,413,276
    Mid-2004 59,950,364
    Mid-2003 59,636,662
    Mid-2002 59,365,677
    Mid-2001 59,113,016
    Mid-2000 58,886,100
    """

    def __init__(self):
        self.years: List[YearlyStats] = []
        self.population: Dict[int, int] = dict()
        for line in self.population_raw.strip().splitlines():
            year, population = line.split()
            year = year.split("-")[1]
            self.population[int(year)] = int(population.replace(",", ""))

    def load(self, path):
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                fullname = os.path.join(dirpath, filename)
                self.process_file(fullname)
        for year in self.years:
            year.fill_zeros()
        self.years.sort(key=lambda x: x.death_rate())

    def process_file(self, path: str):
        if not path.endswith(".xlsx"):
            return

        def is_total_row_indicator(cell_value: str) -> bool:
            indicators = ["england, wales and elsewhere", "total registrations"]
            if cell_value is None or not isinstance(cell_value, str):
                return False
            for ind in indicators:
                if ind in cell_value.lower():
                    return True
            return False

        wb = xl.open(path, read_only=True, data_only=True)
        for name in wb.sheetnames:
            match = self.year_reg.match(name)
            if match is not None:
                year = int(match.group(1))
                print(f"{path}: Year {year} in {name}")
                sheet: Worksheet = wb[name]
                totals = None
                for row in sheet.rows:
                    if totals is None:
                        # Scan the row for the totals marker; the monthly
                        # counts follow it in the same row
                        month = 0
                        for cell in row:
                            value = cell.value
                            if is_total_row_indicator(value):
                                totals = YearlyStats(year, self.population[year])
                            elif totals is not None:
                                if value is not None:
                                    totals.add_month(month, value)
                                    month += 1
                    else:
                        break
                    # if year == 2013:
                    #     print(tuple([cell.value for cell in row]))
                self.years.append(totals)
                break


def main():
    extractor = UkDataExtractor()
    extractor.load("uk")
    dr_name = "Death_rate"
    for dp_type in DataPointType:
        data_points = []
        for year in extractor.years:
            data_points.extend(year.data_points(dp_type))
        df = pd.DataFrame(data_points, columns=["Date", dr_name])
        df.set_index("Date", inplace=True)
        print()
        print("####", dp_type)
        # Sort chronologically; switch to by=[dr_name] to order by death rate
        values = df.sort_values(by=["Date"])
        # Change on a row is the next row's rate divided by this one's
        values["Change"] = values[dr_name].shift(-1) / values[dr_name]
        if len(values) > 100:
            print(" MIN ")
            print(values.head())
            print(" MAX ")
            print(values.tail(20))
        else:
            print(values)


if __name__ == '__main__':
    main()