21 Vectorized Query Execution Part I

vectorization

the process of converting an algorithm’s scalar implementation that processes a single pair of operands at a time , to a vector implementation that process one operation on multiple pairs of operands at once. this is about parallelism of a single core.

some concepts

multi-core cpu with superscalar and out-of-order execution and high power consumption, i.e. Xeon
many integrate cores (mic) use a large number of low-power cores with low power consumption i.e. Xeon Phi
SIMD : a class of instructions that allow cpu to perform the same operation on multiple data points simultaneously
- intel:MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512
- powerpc:altivec
- arm:neon
  streaming simd extension(SSE)
instruction category
- data movement(move in/out)
- arithmetic operation
- logical instruction
- comparision instruction
- shuffle instruction(move between simd regs)
- miscellaneous
  - transform data between x86 and simd regs
  - cache control:move data from simd reg to memory, bypassing cpu cache
    what about GPU ?
moving data back and forth between DRAM and GPU over PCI-E is slow
major cermercial database systems don’t support this
co-processer that cache-coherent with cpu may change current trend
how to achieve vectorization?
automatic vectorization
- compiler identifies possible rewrite of loops into vectorized operations
- this only works for simple loops
compiler hints
- provide compiler with additional information to let it know that the code is safe to be vectorized.
- approach#1 explicit info about memory locations
  - use ‘restrict’ key work in C++ to tell the compiler that the arrays(say char* restrict a1, char* restrict a2) have distinct locations in memory
- approach#2 tell compiler to ignore vector dependencies
  - use ‘#pragma ivdep’ before the for loop
- these two approaches are generally supported by major compilers
explicit vectorization
- use cpu instructions explicitly
  vectorization direction
approach#1:horizontal
- perform operation on elements of a single vector
- ‘1 2 3 4’ =>add=>10

approach#2:vertical

perform operation on elements of each vector

‘1

4’ add ‘5

8’ => ‘6

12’

selective load/store/gather/scatter can be building blocks for algorithms [jekyll-docs]: http://jekyllrb.com/docs/home [jekyll-gh]: https://github.com/jekyll/jekyll [jekyll-talk]: https://talk.jekyllrb.com/

21 Vectorized Query Execution Part I

vectorization

some concepts

streaming simd extension(SSE)

what about GPU ?

how to achieve vectorization?

vectorization direction