19 Parallel Hash Join Algorithms
19 Parallel Hash Join Algorithms
Background
- Perform a join between two relations on multiple threads simultaneously to speed up up operation.
- This is about hash join.
- Nest-loop jion is suitable for OLTP.
- Nest-loop join on index with a small number of target tuples is similar to hash join.
Parallel Hash Join
- hash join phase (say R jion S)
- partition (optional) (divide R, S into sets using a hash on join key)
- build (scan R and create a hash table)
- probe (for each tuple of S, probe by join key in R’s hash table)
- non blocking partition
- incremental partitioning
- scan once and generate output on the fly
- approach 1. (shared partition)multiple threads update the same global set of partitions
- approach 2. (private partition) each thread has its own partition and need to merge
- blocking partitioning(radix)
- scan R and get a histogram of number of tuples per hash key for the radix at some offset
- ‘Use this histogram to determine output offsets by computing the prefix sum.’
- ‘Scan R again and partition them according to the hash key.’
- build phase
- hash function
- speed vs collision rate
- murmur hash
- google city hash
- google farm hash
- clhash
- hash scheme
- handle collision
- chained hashing
- liner hashing
- robin hood hashing
- cukoo hashing
- hash function
- probing phase
- hash tuple’s join key of S and probe into hash table built from R
- bloom filter on keys of R can speed up this probing (if matching selectivity is low, otherwise probing is needed anyway and looking up in bloom filter is a waste)
Hash Functions
mentioned in build phase
Hashing Schemes
mentioned in build phase
Evaluation
- no partitioning proforms a little bit slow than radix hashing
- non-blocking partitioning is slow in uniform/skew data inputs than no partioning/ radix partitioning method
- liner hasing is good in most curcumstance