Sibling-first data organization for efficient XML data processing
XML is becoming one of the most important structures for data exchange. Despite its many advantages, the structure of XML poses several major obstacles to processing large documents. The algorithms used in operating systems and databases, such as caching and prefetching, are linear in nature, whereas XML data is non-linear, and this mismatch makes XML processing more costly. In addition to verbosity, parsing the depth-first (DF) structure of XML documents imposes a significant overhead on processing applications, including search engines. Recent research on XML query processing has shown that sibling clustering can improve performance significantly. However, the existing methods are limited in several respects, including the processing of very large documents. In this research, a better data organization for native XML databases, named sibling-first (SF), has been developed that significantly improves performance in large data processing. SF uses an embedded index for fast access to child nodes. It also compresses documents by eliminating extra data from the original DF format. The converted SF documents can be processed for XPath queries without being parsed. SF storage has been implemented both in virtual memory and as an on-disk format. Experimental results with real data show that significantly higher performance can be achieved when XPath queries are run on very large SF documents.
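To make the idea of sibling clustering with an embedded child index more concrete, the following is a minimal sketch and not the SF format itself, whose actual layout is not specified here. It assumes a breadth-first flattening in which the children of each node are stored contiguously and each record carries the position of its first child and its child count. The names `to_sibling_first` and `select`, and the record layout, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import deque


def to_sibling_first(root):
    """Flatten an element tree so that each node's children are contiguous.

    Each record is [tag, text, first_child, child_count], where first_child
    is the position of the first child in the same list (-1 if none).
    This layout is an assumption made for illustration only.
    """
    records = [[root.tag, (root.text or "").strip(), -1, 0]]
    queue = deque([(root, 0)])
    while queue:
        elem, pos = queue.popleft()
        children = list(elem)
        if not children:
            continue
        # Embedded index: remember where this node's sibling run starts.
        records[pos][2] = len(records)
        records[pos][3] = len(children)
        for child in children:
            queue.append((child, len(records)))
            records.append([child.tag, (child.text or "").strip(), -1, 0])
    return records


def select(records, path):
    """Evaluate a simple child-axis path such as '/lib/book/title'."""
    steps = [s for s in path.split("/") if s]
    if not steps or records[0][0] != steps[0]:
        return []
    current = [0]
    for step in steps[1:]:
        matches = []
        for pos in current:
            _, _, first, count = records[pos]
            # Siblings are adjacent, so one contiguous scan covers them all.
            for i in range(first, first + count):
                if records[i][0] == step:
                    matches.append(i)
        current = matches
    return [records[i][1] for i in current]


doc = ("<lib><book><title>A</title></book>"
       "<book><title>B</title><year>2001</year></book></lib>")
records = to_sibling_first(ET.fromstring(doc))
print(select(records, "/lib/book/title"))
```

In this sketch the conversion step parses the document once; afterwards a query walks only the flat record list, jumping to each sibling run through the stored offset, which loosely mirrors the claim that converted documents can be queried without being parsed again. Only child-axis steps are handled; predicates, attributes, and other XPath axes are out of scope.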