McMaster Team Digitizes 100 Years of Canadian Disease Data

McMaster University

Researchers at McMaster University have developed a new database that brings together more than 100 years of historical epidemiological data from across Canada, which will help to predict future patterns of infectious disease.

The database is the culmination of a 25-year project led by mathematician David Earn. The project began when he uncovered two boxes of handwritten documents in a neglected storage area at the Ontario Ministry of Health. The boxes contained 50 years of weekly infectious disease incidence reports, from 1939 to 1989.

It was exactly the sort of thing Earn hoped to unearth during his visit — historical public health data that could help contextualize current and future infectious disease outbreaks.

"Initially, the Ministry said that they couldn't provide the data — that they didn't have the time to search through their archives for us," recalls Earn, a professor in McMaster's Department of Mathematics and Statistics. "So, I offered to come to Toronto and look through their files myself, if they would let me. I basically begged, insisting on the value of the historical records, and I wouldn't let it go. Eventually, I guess I became too much of a nuisance and they relented."

The documents uncovered that day catalyzed a massive retrospective research project that has culminated in a complete, province-by-province inventory of Canadian infectious disease records.

The result, published today in PLOS Global Public Health, is what Earn describes as a "genuinely beautiful dataset" that strings together more than 100 years of historical epidemiological information.

Altogether, the new database — the Canadian Notifiable Disease Incidence Dataset, or "CANDID" — contains more than a million infectious disease incidence counts that date back as far as 1903.

The dataset, which is now publicly accessible, captures weekly, monthly, and quarterly case numbers for diseases like poliomyelitis, hepatitis, tuberculosis, whooping cough, influenza, rubella, mumps, measles, and many others, and tracks their spread in each province and territory across time.

"Data like these reveal the speed and shape of outbreaks and recurrent epidemics of the past, and allow us to test models that predict patterns of spread," Earn says. "This new dataset can be leveraged to understand the ecology and evolution of infectious disease across Canada's history, and to help us prepare for emerging and re-emerging diseases in the future."

In fact, Earn's team has already used the database to better understand the spatial and temporal incidence of polio and whooping cough across several decades of Canadian history.

While the new study was 25 years in the making, Earn says it really accelerated in 2021, when a large pandemic-related NSERC network grant allowed him to recruit Steven Walker, a former McMaster postdoctoral fellow, to his team.

Walker, who re-joined McMaster as a data scientist in Earn's group, was tasked with curating, cleaning, and harmonizing the troves of data that Earn and his associates had previously unearthed from libraries, public health offices, and provincial and federal agencies based all across Canada.

"We would start with scans of handwritten or typewritten documents and manually transcribe them into Microsoft Excel to ensure that we had functional replicas of every original document," Walker explains. "But the replicas aren't conducive to data analysis, due to inconsistent formatting, so we've also been developing flexible data structures that are more convenient for analysis and discovery."
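The step Walker describes, moving from a spreadsheet replica to a structure suited to analysis, can be illustrated with a minimal sketch. This is not the team's actual pipeline: the column names (`province`, `year`, `week_1`, ...), the helper `to_long`, and all numbers below are hypothetical, invented only to show what reshaping a "wide" transcription into one record per observation might look like.

```python
# Hypothetical illustration only: the layout, names, and counts are invented
# and are not taken from CANDID or from the team's own tools.

# A "wide" transcription: one row per province-year, one column per week.
wide_rows = [
    {"province": "Ontario", "year": 1950, "week_1": 12, "week_2": 15, "week_3": None},
    {"province": "Quebec", "year": 1950, "week_1": 7, "week_2": None, "week_3": 9},
]


def to_long(rows, id_fields=("province", "year"), prefix="week_"):
    """Return one record per province-year-week from wide rows."""
    records = []
    for row in rows:
        for key, value in row.items():
            if not key.startswith(prefix):
                continue  # identifier column, not a weekly count
            if value is None:
                continue  # blank cell; a real pipeline might keep it as missing
            record = {field: row[field] for field in id_fields}
            record["week"] = int(key[len(prefix):])
            record["cases"] = value
            records.append(record)
    return records


long_records = to_long(wide_rows)
for record in long_records:
    print(record)
```

In the long form, every count carries its own place and time, so records transcribed from differently formatted source documents can be combined and filtered in the same way.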

Earn, a member of the Michael G. DeGroote Institute for Infectious Disease Research, hopes that the new dataset, and the herculean efforts to assemble it, will help spur important changes to Canada's current infectious disease reporting standards. He notes that the public release of infectious disease data is arguably worse now than it was at any point during the 20th century, including the pre-digital era.

In fact, the Public Health Agency of Canada today issues only annual, nationally aggregated incidence counts, not weekly or regional information, which limits opportunities for important studies of epidemic patterns, seasonal effects, and geographic variation.

Earn says that the reduced resolution in today's data is due in large part to patient privacy protection — a critically important consideration, but one that Earn believes can be maintained even with increased sharing of useful data.

"It is extremely important to protect patient privacy, and our federal, provincial, and territorial agencies have developed protocols for data release that aim to ensure privacy is protected," he says. "But there is no individual-level information in aggregate counts of infectious disease cases, and no identifying information can be extracted from these data. I think that current data release protocols should be thoughtfully and carefully reconsidered, so that they still prioritize privacy, but also allow for the release of more useful information, which could help us to prepare for future outbreaks — to the benefit of all Canadians."

In the meantime, Earn's group encourages epidemiologists in Canada and elsewhere to use CANDID to study the patterns of disease incidence, to learn from historical surveillance efforts, and to strengthen public health preparedness.
