Efficient Large Outer Joins over MapReduce

Long Cheng, Spyros Kotoulas
Efficient Large Outer Joins over MapReduce
Proc. 22nd International European Conference on Parallel Processing (Euro-Par'16), 334-346, August 2016. Springer
  • Abstract
    Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we study several current algorithms and techniques used in large outer joins. We find that some of them could meet performance bottlenecks in the presence of data skew, while others could be complex and incur significant coordination overheads when applied to the MapReduce framework. In this light, we propose a new algorithm, called POPI (Partial Outer join & Partial Inner join), which targets efficient processing of large outer joins and, most importantly, is lightweight and adapted to the processing model of MapReduce. We implement our method in Pig and evaluate its performance on a Hadoop cluster of up to 256 cores and datasets of 1 billion tuples. Experimental results show that our method is scalable, robust and outperforms current implementations, at least in the case of high skew.
  • Project: DIAMOND, HAEC B08
  • Research Group: Knowledge-Based Systems
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-43659-3_25.
@inproceedings{CK2016,
  author    = {Long Cheng and Spyros Kotoulas},
  title     = {Efficient Large Outer Joins over {MapReduce}},
  booktitle = {Proc. 22nd International European Conference on Parallel
               Processing (Euro-Par'16)},
  publisher = {Springer},
  year      = {2016},
  month     = {August},
  pages     = {334--346},
  doi       = {10.1007/978-3-319-43659-3_25}
}