dc.description.abstract | The Web, with its vast, heterogeneous, and dynamic content, poses significant challenges for applying classical database technologies. The lack of structure, the presence of hyperlinks, and the absence of centralized control make traditional data modeling, querying, and processing approaches inadequate. This thesis presents DIASPORA, a web database system designed to address these challenges through an integrated solution encompassing data modeling, query language design, and distributed query processing.
DIASPORA introduces a graph-based data model that captures both the content and the hyperlink structure of web documents. It supports traditional formats such as HTML as well as emerging semantic formats such as XML. The model infers semantic relationships from markup tags and element values, so graph construction is fully automatic.
A declarative query language allows users to specify keyword-based hints and hyperlink predicates, facilitating both content-based and structure-aware querying. DIASPORA’s most novel feature is its fully distributed query processing mechanism, which contrasts with conventional centralized approaches. Queries are shipped across web nodes, processed locally, and results are returned without requiring coordination from a master site.
The system addresses key challenges in distributed query processing, including query completion, rewriting, termination, and result transmission. A Java-based prototype has been implemented and tested on IISc campus websites, demonstrating significant improvements in query quality and resource efficiency.
DIASPORA is positioned to support a wide range of web applications, including search engine indexing, site mapping, and fine-grained querying of XML documents. Its distributed architecture also opens avenues for mining user queries to enhance public and commercial web services. | |