20 Parallel Sort-Merge Join Algorithms
20 Parallel Sort-Merge Join Algorithms
SIMD background
- single instruction multiple data = SIMD
- MMX, 3DNow!, SSE, SSE2, SSE3, AVX (since 1990s)
- this saves instructions compared to SISD approach
- speed things up when algorithm can be vectorized
- implementation is a manual process (colmunbia db may be a good example)
- SIMD has restrictions on data alignment
- gathering into/ scattering from registers can be tricky
Paralelel sort-merge join
sort-merge join
- phase 1: sort tuples of R & S based on the join key
- phase 2: scan the tuples & outer relation R only needs to be scanned once
parallel sort-merge join goals
- speed up the sorting part
- utilize as many CPU cores as possible
- be mindful of NUMA boundaries
- use SIMD instructions where applicable
procedure (R join S)
- phase 1 : patitioning(optional)
- implicit partitioning
- data was partitioned on the join key when it was loaded into the database
- no extra pass over the data is needed
- explicit partitioning
- divide only the outer relation and redistribute among the different CPU cores
- radix sort could be used but actually no in memory database does this
- implicit partitioning
- phase 2 : sort
- process can be described in 3 levels: register/cache size/beyond cache size
- registers can be sorted using special instructions/ sorting networks
- data that fits in cache size can be sorted by bitonic merge network(more like specific hardware)
- multiway merging for large data set
- usually one sorting process combines these three levels of sorting
- phase 3 : merge
- multi-way sort-merge(sort locally, redistribute, sort locally)
- each core does level1&2 sorting locally
- do level 3 sorting across multiple cores(redistribute & sorting locally, or say sort one chunk a time)
- do this both for inner/outer table
- multi-pass sort-merge(sort locally, merge globally)
- not redistributing tuples but using multi-pass naive merge on sorted runs(or say get one result tuple one time)
- others parts is just like multi-way sort-merge
- massively parallel sort-merge(outer:redistribute&sort, inner:sort, cross partition merge outer & inner table)
- outer table are range-partitioned and redistributed to cores
- each core do sorting in parallel on their partition
- inner table are not redistributed like outer table
- each core sorts its local data
- merge phase is between entire sorted run of outer table and a segment of inner table
evaluation
* SIMD sort is 2.5-3x faster than C++ STL sort(quick sort) * M-way sort-merge join uses least cycles per output tuple & has the highest through put * MPSM performs worst #### project 3 code review guidelines [jekyll-docs]: http://jekyllrb.com/docs/home [jekyll-gh]: https://github.com/jekyll/jekyll [jekyll-talk]: https://talk.jekyllrb.com/
- multi-way sort-merge(sort locally, redistribute, sort locally)