21 Vectorized Query Execution Part I

vectorization

the process of converting an algorithm’s scalar implementation that processes a single pair of operands at a time , to a vector implementation that process one operation on multiple pairs of operands at once. this is about parallelism of a single core.

some concepts

  • multi-core cpu with superscalar and out-of-order execution and high power consumption, i.e. Xeon
  • many integrate cores (mic) use a large number of low-power cores with low power consumption i.e. Xeon Phi
  • SIMD : a class of instructions that allow cpu to perform the same operation on multiple data points simultaneously
    • intel:MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512
    • powerpc:altivec
    • arm:neon

      streaming simd extension(SSE)

  • instruction category
    • data movement(move in/out)
    • arithmetic operation
    • logical instruction
    • comparision instruction
    • shuffle instruction(move between simd regs)
    • miscellaneous
      • transform data between x86 and simd regs
      • cache control:move data from simd reg to memory, bypassing cpu cache

        what about GPU ?

  • moving data back and forth between DRAM and GPU over PCI-E is slow
  • major cermercial database systems don’t support this
  • co-processer that cache-coherent with cpu may change current trend

    how to achieve vectorization?

  • automatic vectorization
    • compiler identifies possible rewrite of loops into vectorized operations
    • this only works for simple loops
  • compiler hints
    • provide compiler with additional information to let it know that the code is safe to be vectorized.
    • approach#1 explicit info about memory locations
      • use ‘restrict’ key work in C++ to tell the compiler that the arrays(say char* restrict a1, char* restrict a2) have distinct locations in memory
    • approach#2 tell compiler to ignore vector dependencies
      • use ‘#pragma ivdep’ before the for loop
    • these two approaches are generally supported by major compilers
  • explicit vectorization
    • use cpu instructions explicitly

      vectorization direction

  • approach#1:horizontal
    • perform operation on elements of a single vector
    • ‘1 2 3 4’ =>add=>10
  • approach#2:vertical
    • perform operation on elements of each vector
    • ‘1 2 3 4’ add ‘5 6 7 8’ => ‘6 8 10 12’
  • selective load/store/gather/scatter can be building blocks for algorithms [jekyll-docs]: http://jekyllrb.com/docs/home [jekyll-gh]: https://github.com/jekyll/jekyll [jekyll-talk]: https://talk.jekyllrb.com/