19 Parallel Hash Join Algorithms

Background

  • Perform a join between two relations on multiple threads simultaneously to speed up up operation.
  • This is about hash join.
  • Nest-loop jion is suitable for OLTP.
  • Nest-loop join on index with a small number of target tuples is similar to hash join.

Parallel Hash Join

  • hash join phase (say R jion S)
    1. partition (optional) (divide R, S into sets using a hash on join key)
    2. build (scan R and create a hash table)
    3. probe (for each tuple of S, probe by join key in R’s hash table)
  • non blocking partition
    • incremental partitioning
    • scan once and generate output on the fly
    • approach 1. (shared partition)multiple threads update the same global set of partitions
    • approach 2. (private partition) each thread has its own partition and need to merge
  • blocking partitioning(radix)
    1. scan R and get a histogram of number of tuples per hash key for the radix at some offset
    2. ‘Use this histogram to determine output offsets by computing the prefix sum.’
    3. ‘Scan R again and partition them according to the hash key.’
  • build phase
    • hash function
      • speed vs collision rate
      • murmur hash
      • google city hash
      • google farm hash
      • clhash
    • hash scheme
      • handle collision
      • chained hashing
      • liner hashing
      • robin hood hashing
      • cukoo hashing
  • probing phase
    • hash tuple’s join key of S and probe into hash table built from R
    • bloom filter on keys of R can speed up this probing (if matching selectivity is low, otherwise probing is needed anyway and looking up in bloom filter is a waste)

Hash Functions

mentioned in build phase

Hashing Schemes

mentioned in build phase

Evaluation

  • no partitioning proforms a little bit slow than radix hashing
  • non-blocking partitioning is slow in uniform/skew data inputs than no partioning/ radix partitioning method
  • liner hasing is good in most curcumstance