Statistical and Data-Driven Models to Explain and Predict Bacterial Contamination of Private Wells in the Wellington-Dufferin-Guelph Public Health Region
This thesis explores statistical and data-driven models to explain and predict bacterial contamination of privately owned drinking-water wells in the Wellington-Dufferin-Guelph Public Health region. Logistic regression models with stepwise selection based on the Akaike Information Criterion were used to identify well characteristics associated with bacterial contamination. Random forest models were compared across combinations of up- and down-sampling procedures, the number of predictors considered at each split, and the statistical metric on which the algorithm was optimized, in order to build a predictive model for bacterial contamination from hydrogeological predictor variables. Overall, the age of the well, treatment systems on the well, point sources of contamination near the well, and the season in which the water sample was tested were associated with bacterial contamination. Predictive models that used down-sampling, considered the maximum number of predictors at each split, and were optimized on Kappa during model training improved the prediction accuracy for contaminated wells.
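As a minimal illustration of two ideas named above, down-sampling and Cohen's Kappa, the following Python sketch balances a binary outcome by randomly discarding majority-class rows and then computes Kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. It is not the code used in the thesis: the function names `downsample` and `cohen_kappa`, the 0/1 coding of the outcome, and the toy labels are assumptions made for this example only.

```python
import random


def downsample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size.

    Hypothetical helper: labels are assumed to be coded 0/1.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa for binary 0/1 labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: share of cases where prediction matches truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal rates of each class.
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    p_e = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    if p_e == 1.0:
        return 0.0  # degenerate case: only one class present in both vectors
    return (p_o - p_e) / (1 - p_e)


# Toy, made-up labels: 2 "contaminated" samples out of 10.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
rows = list(range(len(labels)))

balanced_rows, balanced_labels = downsample(rows, labels)

# A classifier that always predicts "not contaminated" is right 80% of the
# time on these labels, yet agrees with the truth no better than chance.
always_negative = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_negative)) / len(labels)
kappa = cohen_kappa(labels, always_negative)
```

On imbalanced data such as this toy example, accuracy rewards a model that never flags the rare class, whereas Kappa does not, which is one common reason to optimize on Kappa when the outcome of interest is uncommon.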