A Pipeline for Recognition of Trophic Information in Primary Literature
This thesis investigates the use of Natural Language Processing methods for the automated extraction and classification of trophic information from primary literature. First, it explores the use of character bigrams in training machine learning models for scientific name identification. It then describes the composition and testing of the overall trophic analysis pipeline, which combines an open information extraction tool, dictionary-based methods, rule-based methods and a machine learning model. Finally, it outlines potential future directions, such as the incorporation of noun phrases and document-level analysis. The results demonstrate that input format has a large influence on the retrieval of information from primary literature, and that open information extraction tools can quickly filter simple relations in text but that long-distance relations remain difficult to locate.