Data Integration from Multiple Historical Sources to Study Canadian Casualties of WWI
Longitudinal data (data that observe the same entities at different points in time) are of interest to historians and social scientists because they create opportunities to study populations over time. In this thesis, we construct longitudinal data by integrating data from four historical sources to study Canadian casualties of World War I. Because labeled data are unavailable for two of the three linkage tasks and our application has a low tolerance for false matches, we develop a simple stepwise deterministic strategy to integrate the four datasets. For the one linkage task where labeled data are available, we compare this strategy with a linkage approach that incorporates a Support Vector Machine. With the longitudinal dataset constructed, we demonstrate its utility by performing a multivariate regression analysis to determine the factors that influenced a Canadian soldier's likelihood of survival in World War I. Our findings indicate that a cautious stepwise deterministic strategy that incorporates approximate comparisons and domain knowledge can perform on par with a linkage approach that incorporates a supervised learning algorithm, without requiring labeled data. The regression analysis reveals several patterns of historical importance in early 20th-century Canada that warrant further historical investigation.