Efficient Skew Handling for Outer Joins in a Cloud Computing Environment

From International Center for Computational Logic

Long Cheng, Spyros Kotoulas
Efficient Skew Handling for Outer Joins in a Cloud Computing Environment
IEEE Transactions on Cloud Computing, 6(2):558 - 571, 2018
  • Abstract
    Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real-world datasets are characterized by data skew, leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins, especially in the extremely popular Cloud computing environment. Conventional approaches to the problem, such as ones based on hash redistribution, often lead to load balancing problems, while duplication-based approaches incur significant overhead in terms of network communication. In this paper, we propose a new approach for efficient skew handling in outer joins over a Cloud computing environment. We present an efficient implementation of our approach over the Spark framework. We evaluate the performance of our approach on a 192-core system with large test datasets in excess of 100GB and with varying skew. Experimental results show that our approach is scalable and, at least in cases of high skew, significantly faster than the state-of-the-art.
  • Project: DIAMOND, HAEC B08
  • Research Group: Knowledge-Based Systems
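
The load-balancing problem that the abstract attributes to hash redistribution can be illustrated with a toy sketch. This is not the paper's method or its Spark implementation; the data, the worker count, and the `hash_partition` helper are all hypothetical, chosen only to show how a single frequent join key concentrates work on one node.

```python
from collections import Counter


def hash_partition(keys, num_workers):
    """Count how many tuples each worker receives when join keys are
    redistributed by key modulo the number of workers (hypothetical helper)."""
    load = Counter()
    for key in keys:
        # Non-negative integer keys, so key % num_workers stands in for a hash.
        load[key % num_workers] += 1
    return [load[w] for w in range(num_workers)]


# Assumed toy data: 1,000 tuples over 100 distinct integer keys.
uniform = [i % 100 for i in range(1000)]
# Assumed skewed data: key 7 alone accounts for 600 of the 1,000 tuples.
skewed = [7] * 600 + [i % 100 for i in range(400)]

uniform_load = hash_partition(uniform, 4)
skewed_load = hash_partition(skewed, 4)

print("uniform:", uniform_load)
print("skewed: ", skewed_load)
```

With uniform keys each of the four workers receives the same number of tuples, whereas in the skewed case the worker that owns key 7 receives most of the input. Duplication-based schemes avoid that hot spot by replicating one side of the join to every worker, which is where the network overhead mentioned in the abstract comes from.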
@article{CK2018,
  author  = {Long Cheng and Spyros Kotoulas},
  title   = {Efficient Skew Handling for Outer Joins in a Cloud Computing
             Environment},
  journal = {IEEE Transactions on Cloud Computing},
  volume  = {6},
  number  = {2},
  year    = {2018},
  pages   = {558--571},
  doi     = {10.1109/TCC.2015.2487965}
}