A study of autocalibration of complex environmental models using machine learning approaches
This thesis investigates machine learning methods for the autocalibration of environmental process models of intermediate complexity, meaning models with individual simulation times of the order of 1-30 minutes and tens of calibration parameters with varying distributions. A gridded approach, with a number of machine learning algorithms for post-processing, was tested against popular genetic, evolutionary and statistical calibration algorithms. The probably approximately correct (PAC) learning hypothesis was tested on a well-documented watershed. For the test cases, the gridded approach based on directed random search with PAC reduction outperformed generally accepted genetic and stochastic algorithms by at least 10%. An experiment tested the hypothesis that PAC learning improves monotonically with increasing numbers of simulations over a discrete set of calibration parameters, with positive results. A simple multi-objective calculation for flow, together with nitrogen and phosphorus transport, was also tested. It showed that a model calibrated separately for best flow values does not necessarily produce a good result for total nitrogen (TN) and total phosphorus (TP) loads. On the other hand, when flow, TN and TP loads are calibrated simultaneously, acceptable calibration results are obtained for all three at once. One outcome is a table-driven agent program in which the processes of calibration and production are separated from each other. An alternative, implicit approach was also tested for practicality: the parameter set was used to populate output values, and the calibration was then derived by searching the output calibration space.
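The contrast between flow-only and simultaneous calibration, and the separation of simulation runs from the later search over their outputs, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two-parameter `toy_model`, the `OBSERVED` targets, uniform random sampling, relative error as the fit measure, and the worst of the three errors as the joint score.

```python
import random

# Hypothetical observed targets (illustrative values, not thesis data).
OBSERVED = {"flow": 6.0, "TN": 3.0, "TP": 0.3}

def toy_model(a, b):
    # Stand-in for one simulation run; a real run would take minutes.
    # Flow depends only on a, while TN and TP also depend on b.
    return {"flow": 10.0 * a, "TN": 4.0 * a + 2.0 * b, "TP": 1.0 * b}

def rel_error(outputs, name):
    # Relative absolute error of one output against its observed value.
    return abs(outputs[name] - OBSERVED[name]) / OBSERVED[name]

def worst_error(outputs):
    # Joint score: the largest relative error over flow, TN and TP.
    return max(rel_error(outputs, name) for name in OBSERVED)

def build_table(n_runs, seed=0):
    # Run stage: sample parameter sets and store (parameters, outputs) rows.
    rng = random.Random(seed)
    table = []
    for _ in range(n_runs):
        a, b = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        table.append(((a, b), toy_model(a, b)))
    return table

# Search stage: calibration is a lookup over the stored outputs only.
table = build_table(2000)
flow_best = min(table, key=lambda row: rel_error(row[1], "flow"))
joint_best = min(table, key=lambda row: worst_error(row[1]))

for label, (params, outputs) in (("flow only", flow_best), ("joint", joint_best)):
    errors = {name: round(rel_error(outputs, name), 3) for name in OBSERVED}
    print(label, [round(p, 3) for p in params], errors)
```

Because the flow-only search ignores `b` in this toy setup, the row it selects may fit flow closely while leaving TN and TP poorly matched. The joint search trades a little flow accuracy for a bounded error on all three outputs.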
The models tested are "gold standard" nonpoint source surface water pollution models and lake circulation models, whose primary purpose is to estimate the impacts of sediment and nutrient transport on Lake Winnipeg. Runs were performed using a Monte Carlo approach on a clustered supercomputer, a powerful server and a regular desktop computer. A real watershed with an outflow into a large lake, with complex calibration needs, was used as the test case for experiments with linkages of more than one model, and with models requiring simulation times of the order of hours. Together, the experiments pointed to machine learning and data mining as promising candidates for autocalibration and preliminary assessments, or even as efficient substitutes for the standard genetic or statistical algorithms currently in practice.