A distributional approach to text segmentation of coherent documents

Vasak, Joe
University of Guelph

A naturally occurring text is generally coherent: it has an underlying hierarchical organization in which a set of interrelated topics contributes to a few common themes, and some topics may be supported by subtopics. Text segmentation divides a document into a set of segments that roughly correspond to topics and/or subtopics. Most current text segmentation systems are cohesion-based: they rely on word repetition to measure the similarity between two regions of text. Such methods are poorly suited to coherent text, where transitions are subtle and difficult to detect because the interrelated topics are organized hierarchically and contribute to one or a few common themes. In this thesis, we propose two new measures for reducing the overlapped words between segments and a new text segmentation algorithm with extensions for refining the segment boundaries. We also conduct experiments on natural coherent documents to demonstrate the effectiveness of the proposed solutions.
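To make the cohesion-based baseline described above concrete, the following is a minimal sketch of how word repetition can be used to score the similarity between two adjacent regions of text. It is not the method proposed in the thesis. The function names (`tokenize`, `cosine`, `gap_similarities`, `lowest_similarity_gap`), the `window` parameter, and the choice of cosine similarity over raw word counts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(sentences, window=2):
    """Score every gap between adjacent sentences.

    For the gap before sentence i, compare the word counts of up to
    `window` sentences on the left with up to `window` on the right.
    `window` is a hypothetical tuning parameter for this sketch.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def lowest_similarity_gap(sentences, window=2):
    """Return the index of the sentence that starts after the weakest gap."""
    sims = gap_similarities(sentences, window)
    return min(range(len(sims)), key=sims.__getitem__) + 1


if __name__ == "__main__":
    sentences = [
        "the cat sat on the mat",
        "the cat chased the mouse",
        "stocks fell on the market",
        "the market stocks recovered",
    ]
    print(gap_similarities(sentences, window=1))
    print(lowest_similarity_gap(sentences, window=1))
```

In this toy input the two topics share almost no vocabulary, so the lowest-scoring gap falls between them. In a coherent document, where interrelated topics reuse much of the same vocabulary, the scores across true boundaries would be expected to stay high, which is the weakness of repetition-based measures that the abstract points to.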

text segmentation, coherent documents, overlapped words, text segmentation algorithm, segment boundaries