Advanced Analytics for Disease Forecasting - A Comparative Analysis of Statistical and Machine Learning Methods
Infectious diseases continue to evolve and, as seen during the recent coronavirus disease (COVID-19) pandemic, remain a serious threat to the health of human populations. To keep pace with this evolution, surveillance methods must similarly advance. This thesis explores the application of statistical and machine learning methods to disease forecasting and outbreak detection. Methods were applied to temporal and spatio-temporal data from infectious disease surveillance in Canada. The infectious diseases of interest were seasonal influenza, COVID-19, and Lyme disease. Accurately forecasting the timing and magnitude of peak seasonal influenza incidence is important for public health preparedness. In 2019, COVID-19 emerged and posed a substantial risk to the health of Canadians. COVID-19 incidence data provide an opportunity to evaluate the ability of statistical and machine learning methods to model emerging diseases. The expanding geographic range of Lyme disease is also a growing concern for Canadian public health. As Lyme disease incidence has been linked to changing weather patterns, projecting its incidence under the climate scenarios of Representative Concentration Pathways (RCP) 4.5 and 8.5 is of interest. Seasonal Autoregressive Integrated Moving Average (SARIMA) models were shown to outperform artificial neural networks in forecasting seasonal influenza activity in Canada. However, when applied to COVID-19 incidence in the public health units of Toronto and Wellington-Dufferin-Guelph, random forest outperformed several statistical learning models. Additionally, machine learning accurately forecasted spatio-temporal Lyme disease incidence in Ontario, whereas Bayesian statistical models fitted to the same dataset did not converge. Endemic-Epidemic modeling performed well on measures of power of detection, sensitivity, specificity, and timeliness when detecting simulated COVID-19 outbreaks in spatio-temporal data structures.
The Farrington Flexible (FF) algorithm required tuning before demonstrating robust performance. Results indicate that both statistical and machine learning methods are valuable for disease surveillance. Machine learning is a flexible tool that displayed strong forecasting performance across different data structures. With advances in computational power and the availability of “big data”, machine learning will continue to play an important role in disease forecasting. However, the “black box” problem of machine learning makes it poorly suited to explanatory purposes. Therefore, traditional statistical models should still be applied to identify possible risk factors for disease incidence.