Indexing for Datalog Materialisation with Leapfrog Triejoin

Philipp HanischPhilipp Hanisch

Download

Philipp Hanisch
Indexing for Datalog Materialisation with Leapfrog Triejoin
Diploma Thesis, October 2021

Details
Bibtex

KurzfassungAbstract
Datalog is a well-understood relational query language and there are efficient reasoners based on different concepts and technologies. Nevertheless, the search for faster and more efficient reasoners continues. A promising approach for further performance
improvements is the so-called leapfrog triejoin by Veldhuizen. Leapfrog triejoin is a variable-oriented join algorithm, similar to an ordered merge join, and it computesthe matches of a Datalog rule as a result trie following a previously defined variable order. Leapfrog triejoin is worst-case optimal w.r.t. the AGM bound, which provides a tight bound on the maximum result size of full conjunctive queries, and empirical evaluations suggest that leapfrog triejoin is promising for real-world problems, too.
The literature about leapfrog triejoin focuses on its application to a single join or, respectively, Datalog rule. Thus, there are both practical and theoretical insights, e.g., on constructing the necessary data structures and the complexity bounds of leapfrog triejoin. Unfortunately, applying leapfrog triejoin to whole Datalog programs during materialisation yields new challenges. In this thesis, we concentrate on these challenges as well as on potential solutions.
As leapfrog triejoin requires suitable data structures for each rule, it is prone to be inefficient when considering the rules of a program in isolation, as this might introduce avoidable redundancy. Moreover, the choice of ‘good’ variable orders is crucial for the performance of leapfrog triejoin. Hence, we propose and discuss criteria for the quality of variable orders. Unfortunately, finding good variable orders is challenging: we show the NP-completeness of two decision problems that are tightly related to this task. For finding good variable orders, we propose an optimal approach based on Answer Set Programming as well as a heuristic approach. Furthermore, we show the trade-off between the optimality of the ASP approach and the required time in an evaluation.
Moreover, we generalise leapfrog triejoin to use partial variable orders. We discuss the resulting generalisation of the data structures, i.e., f-representations instead of tries, and we show how to efficiently prepare them for leapfrog triejoin. To realise the potential of considering independent variables separately and the more succinct representation of relations, we adapt leapfrog triejoin for partial variable orders. We show that, on a set of benchmarks, partial variable orders are of higher quality than total ones w.r.t. our optimisation criteria.
Bemerkung: Note: Supervised by Markus Krötzsch
Forschungsgruppe:Research Group: Wissensbasierte SystemeKnowledge-Based Systems

@misc{H2021,
  author = {Philipp Hanisch},
  title  = {Indexing for Datalog Materialisation with Leapfrog Triejoin},
  year   = {2021},
  month  = {October}
}