CPU Dispatch in ClickHouse
Overview
In this post, I will describe how vectorization works, what CPU dispatch is, how to find places for CPU dispatch optimizations and how we use CPU dispatch in ClickHouse.
First, let’s place our problem. Hardware vendors constantly add new instructions to the instruction set of modern CPUs. And we often want to use the latest instructions for optimizations, and the most important are SIMD instructions. But the main issue with that is the compatibility. For example, if your program is compiled with AVX2 instruction set, and your CPU supports only SSE4.2, then if it runs such program, you will get an illegal instruction signal.
Also, an important thing to note is that some algorithms can be specifically designed to work with modern instructions, for example, SIMD JSON.
To improve performance and save compatibility, parts of your code can be compiled for different instruction sets, and then in runtime program can dispatch execution to the most performant variant.
For examples in this post I will use clang-15.
Vectorization basics
Vectorization is an optimization where your data is processed using vector operations instead of scalar operations. Modern CPUs have specific instructions that allow to process your data in vectors using SIMD instructions. Such optimization can be performed manually, or compilers can perform automatic vectorization.
Let’s consider such code example:
void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size) { for (size_t i = 0; i < size; ++i) { c[i] = b[i] + a[i]; } }
We have a plus
function that takes 3 pointers to a
, b
and c
arrays and size
of these arrays. This function computes the sum of a
and b
array
elements and writes it into array c
.
If we compile this code without loop unrolling specifying option fno-unroll-loops
and with AVX2 support specifying option -mavx2
. There will be the following assembly:
$ /usr/bin/clang++-15 -mavx2 -fno-unroll-loops -O3 -S vectorization_example.cpp # %bb.0: testq %rcx, %rcx je .LBB0_7 # %bb.1: cmpq $4, %rcx jae .LBB0_3 # %bb.2: xorl %r8d, %r8d jmp .LBB0_6 .LBB0_3: movq %rcx, %r8 andq $-4, %r8 xorl %eax, %eax .p2align 4, 0x90 .LBB0_4: # =>This Inner Loop Header: Depth=1 vmovdqu (%rdi,%rax,8), %ymm0 vpaddq (%rsi,%rax,8), %ymm0, %ymm0 vmovdqu %ymm0, (%rdx,%rax,8) addq $4, %rax cmpq %rax, %r8 jne .LBB0_4 # %bb.5: cmpq %rcx, %r8 je .LBB0_7 .p2align 4, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 movq (%rdi,%r8,8), %rax addq (%rsi,%r8,8), %rax movq %rax, (%rdx,%r8,8) incq %r8 cmpq %r8, %rcx jne .LBB0_6 .LBB0_7: vzeroupper retq
In the final assembly, there are two loops. Vectorized loop that processes 4 elements at a time:
.LBB0_4: # =>This Inner Loop Header: Depth=1 vmovdqu (%rdi,%rax,8), %ymm0 vpaddq (%rsi,%rax,8), %ymm0, %ymm0 vmovdqu %ymm0, (%rdx,%rax,8) addq $4, %rax cmpq %rax, %r8 jne .LBB0_4
Scalar loop:
.LBB0_6: # =>This Inner Loop Header: Depth=1 movq (%rdi,%r8,8), %rax addq (%rsi,%r8,8), %rax movq %rax, (%rdx,%r8,8) incq %r8 cmpq %r8, %rcx jne .LBB0_6
At the function assembly beginning, there is a check depending on the arrays size based on which we choose the loop:
# %bb.1: cmpq $4, %rcx jae .LBB0_3 # %bb.2: xorl %r8d, %r8d jmp .LBB0_6
Additionally, an important thing to note is vzeroupper
instruction. Compiler inserts it to avoid the penalty of mixing SSE and VEX AVX instructions. You can read more about it
in Agner Fog Optimizing subroutines in assembly language: An optimization guide for x86 platforms. 13.2 Mixing VEX and SSE code.
Another important thing to note is if there is no __restrict
modifier specified for pointers, the compiler may not vectorize the loop. Or it will vectorize the loop but put
a couple of runtime checks at the beginning of the function to make sure that the arrays do not overlap.
Additionally, if we compile this example without fno-unroll-loops
and look at the generated loop, we will see that compiler unrolled vectorized loop, which now processes 16 elements at a time.
.LBB0_4: # =>This Inner Loop Header: Depth=1 vmovdqu (%rdi,%rax,8), %ymm0 vmovdqu 32(%rdi,%rax,8), %ymm1 vmovdqu 64(%rdi,%rax,8), %ymm2 vmovdqu 96(%rdi,%rax,8), %ymm3 vpaddq (%rsi,%rax,8), %ymm0, %ymm0 vpaddq 32(%rsi,%rax,8), %ymm1, %ymm1 vpaddq 64(%rsi,%rax,8), %ymm2, %ymm2 vpaddq 96(%rsi,%rax,8), %ymm3, %ymm3 vmovdqu %ymm0, (%rdx,%rax,8) vmovdqu %ymm1, 32(%rdx,%rax,8) vmovdqu %ymm2, 64(%rdx,%rax,8) vmovdqu %ymm3, 96(%rdx,%rax,8) addq $16, %rax cmpq %rax, %r8 jne .LBB0_4
There is a very useful tool that can help you identify places where the compiler does or does not perform vectorization to avoid assembly checking.
You can add -Rpass=loop-vectorize
, -Rpass-missed=loop-vectorize
and -Rpass-analysis=loop-vectorize
options to clang. There are similar
options for gcc.
If we compile our example with these options, there will be the following output:
$ /usr/bin/clang++-15 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3 vectorization_example.cpp:7:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize] for (size_t i = 0; i < size; ++i) {
Now let’s consider another example:
class SumFunction { public: void sumIf(int64_t * values, int8_t * filter, size_t size); int64_t sum = 0; }; void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size) { for (size_t i = 0; i < size; ++i) { sum += filter[i] ? 0 : values[i]; } }
/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp ... vectorization_example.cpp:28:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize] for (size_t i = 0; i < size; ++i) {
In case the compiler cannot perform vectorization. There are two possible scenarios:
- You can try to modify your code, so it can be vectorized. In some complex scenarios, you may need to redesign your data representation. I highly encourage you to check the LLVM documentation and gcc documentation that can help you understand in which cases auto-vectorization can or cannot be performed.
- You can vectorize the loop manually using intrinsics. This option is less preferred because of the additional maintenance.
To fix the issue in our example, we need to make a local sum inside the function:
class SumFunction { public: void sumIf(int64_t * values, int8_t * filter, size_t size); int64_t sum = 0; }; void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size) { int64_t local_sum = 0; for (size_t i = 0; i < size; ++i) { local_sum += filter[i] ? 0 : values[i]; } sum += local_sum; }
Such code example is vectorized by the compiler:
/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp vectorization_example.cpp:31:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize] for (size_t i = 0; i < size; ++i) {
In the resulting assembly vectorized loop looks like this:
.LBB0_5: # =>This Inner Loop Header: Depth=1 vmovd (%rdx,%rax), %xmm5 # xmm5 = mem[0],zero,zero,zero vmovd 4(%rdx,%rax), %xmm6 # xmm6 = mem[0],zero,zero,zero vmovd 8(%rdx,%rax), %xmm7 # xmm7 = mem[0],zero,zero,zero vmovd 12(%rdx,%rax), %xmm1 # xmm1 = mem[0],zero,zero,zero vpcmpeqb %xmm5, %xmm8, %xmm5 vpmovsxbq %xmm5, %ymm5 vpcmpeqb %xmm6, %xmm8, %xmm6 vpmovsxbq %xmm6, %ymm6 vpcmpeqb %xmm7, %xmm8, %xmm7 vpmovsxbq %xmm7, %ymm7 vpcmpeqb %xmm1, %xmm8, %xmm1 vpmaskmovq -96(%r8,%rax,8), %ymm5, %ymm5 vpmovsxbq %xmm1, %ymm1 vpmaskmovq -64(%r8,%rax,8), %ymm6, %ymm6 vpaddq %ymm0, %ymm5, %ymm0 vpmaskmovq -32(%r8,%rax,8), %ymm7, %ymm5 vpaddq %ymm2, %ymm6, %ymm2 vpmaskmovq (%r8,%rax,8), %ymm1, %ymm1 vpaddq %ymm3, %ymm5, %ymm3 vpaddq %ymm4, %ymm1, %ymm4 addq $16, %rax cmpq %rax, %r9 jne .LBB0_5
CPU Dispatch basics
CPU dispatch is a technique when there are multiple compiled versions of your code for different CPU features, and in runtime, your program detects which CPU features your machine has and uses the most performant version in runtime. The most important instruction sets you want to check are SSE42, AVX, AVX2, and AVX512.
To implement CPU dispatch, first, we need to use CPUID instruction to check if particular features are supported for the current CPU.
You can call cpuid
instruction using inline assembly or use cpuid.h
header where these functions are defined:
/* x86-64 uses %rbx as the base register, so preserve it. */ #define __cpuid(__leaf, __eax, __ebx, __ecx, __edx) \ __asm(" xchgq %%rbx,%q1\n" \ " cpuid\n" \ " xchgq %%rbx,%q1" \ : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \ : "0"(__leaf)) #define __cpuid_count(__leaf, __count, __eax, __ebx, __ecx, __edx) \ __asm(" xchgq %%rbx,%q1\n" \ " cpuid\n" \ " xchgq %%rbx,%q1" \ : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \ : "0"(__leaf), "2"(__count)) #endif
Next, to check if some CPU feature is supported, you need to check Intel Software Optimization Reference Manual Chapter 5 manual for specific instructions. For example, for SSE42:
bool hasSSE42() { uint32_t eax = 0; uint32_t ebx = 0; uint32_t ecx = 0; uint32_t edx = 0; __cpuid(0x1, eax, ebx, ecx, edx); return (ecx >> 20) & 1ul; }
Now we need to compile our function with different instructions. In clang, there is target attribute that can do exactly that. In gcc, there is the same attribute. For example:
void plusDefault(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size) { for (size_t i = 0; i < size; ++i) { c[i] = a[i] + b[i]; } } __attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2"))) void plusAVX2(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size) { for (size_t i = 0; i < size; ++i) { c[i] = a[i] + b[i]; } } __attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2,avx512f"))) void plusAVX512(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size) { for (size_t i = 0; i < size; ++i) { c[i] = a[i] + b[i]; } }
In this example, we compile our plus
function additionally for AVX2 and AVX512. In the final assembly, we can check that compiler uses AVX2 for vectorized
loop of plusAVX2
function:
... .globl _Z8plusAVX2PlS_S_m # -- Begin function _Z8plusAVX2PlS_S_m ... .LBB4_4: # =>This Inner Loop Header: Depth=1 vmovdqu (%rsi,%rax,8), %ymm0 vmovdqu 32(%rsi,%rax,8), %ymm1 vmovdqu 64(%rsi,%rax,8), %ymm2 vmovdqu 96(%rsi,%rax,8), %ymm3 vpaddq (%rdi,%rax,8), %ymm0, %ymm0 vpaddq 32(%rdi,%rax,8), %ymm1, %ymm1 vpaddq 64(%rdi,%rax,8), %ymm2, %ymm2 vpaddq 96(%rdi,%rax,8), %ymm3, %ymm3 vmovdqu %ymm0, (%rdx,%rax,8) vmovdqu %ymm1, 32(%rdx,%rax,8) vmovdqu %ymm2, 64(%rdx,%rax,8) vmovdqu %ymm3, 96(%rdx,%rax,8) addq $16, %rax cmpq %rax, %r8 jne .LBB4_4 ...
and AVX512 for vectorized loop of plusAVX512
:
... .globl _Z10plusAVX512PlS_S_m # -- Begin function _Z10plusAVX512PlS_S_m ... .LBB5_4: # =>This Inner Loop Header: Depth=1 vmovdqu64 (%rsi,%rax,8), %zmm0 vmovdqu64 64(%rsi,%rax,8), %zmm1 vmovdqu64 128(%rsi,%rax,8), %zmm2 vmovdqu64 192(%rsi,%rax,8), %zmm3 vpaddq (%rdi,%rax,8), %zmm0, %zmm0 vpaddq 64(%rdi,%rax,8), %zmm1, %zmm1 vpaddq 128(%rdi,%rax,8), %zmm2, %zmm2 vpaddq 192(%rdi,%rax,8), %zmm3, %zmm3 vmovdqu64 %zmm0, (%rdx,%rax,8) vmovdqu64 %zmm1, 64(%rdx,%rax,8) vmovdqu64 %zmm2, 128(%rdx,%rax,8) vmovdqu64 %zmm3, 192(%rdx,%rax,8) addq $32, %rax cmpq %rax, %r8 jne .LBB5_4 ...
Now we have everything to perform CPU dispatch:
void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size) { if (hasAVX512()) { plusAVX512(a, b, c, size); } else if (hasAVX2()) { plusAVX2(a, b, c, size); } else { plusDefault(a, b, c, size); } }
In this example, we created plus
function that dispatches to concrete implementation based on the available instruction set. Such CPU dispatch method is also called
dispatch on each call. There are other methods that you can read about in Agner Fog Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. 13.1 CPU dispatch strategies.
Dispatch on each call is the most flexible method because it will allow you to work with template functions and class member functions or choose implementation based on some statistics collected in runtime. The only downside is branching, although this overhead is negligible if your function performs a lot of work.
CPU Dispatch Optimization Places
Now how to find places where SIMD optimizations can be applied?
- If you know what loops are hot in your program, you can try to apply CPU dispatch for them.
- If you have performance tests, you can compile your program with AVX, AVX2, AVX512 and compare performance reports to figure out which places can be optimized in your program using CPU dispatch.
This technique with performance tests can be applied not only to CPU dispatch but to many other useful optimizations. The main idea is to compile your code with different configurations (compilers, compiler options, libraries, allocators), and if there is performance improvement in some places, you can manually optimize them. For example:
- Try different allocators and different libraries.
- Try different compiler options (loop unrolling, inline threshold).
- Enable AVX/AVX2/AVX512 for build.
CPU Dispatch in ClickHouse
First, I want to show how we designed our dispatch framework in ClickHouse.
enum class TargetArch : UInt32 { Default = 0, /// Without any additional compiler options. SSE42 = (1 << 0), /// SSE4.2 AVX = (1 << 1), AVX2 = (1 << 2), AVX512F = (1 << 3), AVX512BW = (1 << 4), AVX512VBMI = (1 << 5), AVX512VBMI2 = (1 << 6), }; /// Runtime detection. bool isArchSupported(TargetArch arch);
We define an enum TargetArch
for the target architecture, and inside isArchSupported
function we use CPUID instruction set checks that we already discussed.
Then we define a bunch of BEGIN_INSTRUCTION_SET_SPECIFIC_CODE
sections that apply target
attribute to the whole code block.
For example, for clang:
# define BEGIN_AVX512F_SPECIFIC_CODE \ _Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,\ popcnt,avx,avx2, avx512f\"))), apply_to=function)") \ # define BEGIN_AVX2_SPECIFIC_CODE \ _Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,\ popcnt, avx,avx2\"))), apply_to=function)") \ \ # define END_TARGET_SPECIFIC_CODE \ _Pragma("clang attribute pop")
Then for each instruction set, we define a separate namespace TargetSpecific::INSTRUCTION_SET
. Examples for AVX2 and AVX512:
#define DECLARE_AVX2_SPECIFIC_CODE(...) \ BEGIN_AVX2_SPECIFIC_CODE \ namespace TargetSpecific::AVX2 { \ DUMMY_FUNCTION_DEFINITION \ using namespace DB::TargetSpecific::AVX2; \ __VA_ARGS__ \ } \ END_TARGET_SPECIFIC_CODE #define DECLARE_AVX512F_SPECIFIC_CODE(...) \ BEGIN_AVX512F_SPECIFIC_CODE \ namespace TargetSpecific::AVX512F { \ DUMMY_FUNCTION_DEFINITION \ using namespace DB::TargetSpecific::AVX512F; \ __VA_ARGS__ \ } \ END_TARGET_SPECIFIC_CODE
And it can be used like this:
DECLARE_DEFAULT_CODE ( int funcImpl() { return 1; } ) // DECLARE_DEFAULT_CODE DECLARE_AVX2_SPECIFIC_CODE ( int funcImpl() { return 2; } ) // DECLARE_AVX2_SPECIFIC_CODE /// Dispatcher function int dispatchFunc() { #if USE_MULTITARGET_CODE if (isArchSupported(TargetArch::AVX2)) return TargetSpecific::AVX2::funcImpl(); #endif return TargetSpecific::Default::funcImpl(); }
By default USE_MULTITARGET_CODE
is 1. Additionally in each TargetSpecific::INSTRUCTION_SET
namespace, we define compile time BuildArch
constant:
DECLARE_SSE42_SPECIFIC_CODE( constexpr auto BuildArch = TargetArch::SSE42; /// NOLINT ) // DECLARE_SSE42_SPECIFIC_CODE DECLARE_AVX_SPECIFIC_CODE( constexpr auto BuildArch = TargetArch::AVX; /// NOLINT ) // DECLARE_AVX_SPECIFIC_CODE DECLARE_AVX2_SPECIFIC_CODE( constexpr auto BuildArch = TargetArch::AVX2; /// NOLINT ) // DECLARE_AVX2_SPECIFIC_CODE DECLARE_AVX512F_SPECIFIC_CODE( constexpr auto BuildArch = TargetArch::AVX512F; /// NOLINT ) // DECLARE_AVX512F_SPECIFIC_CODE
This is useful in case we have almost the same function, and we want to add only a couple of compile time branches for different instruction sets. For example:
#define DECLARE_SSE42_AVX2_CODE(...) \ DECLARE_DEFAULT_CODE (__VA_ARGS__) \ DECLARE_SSE42_SPECIFIC_CODE (__VA_ARGS__) \ DECLARE_AVX2_SPECIFIC_CODE (__VA_ARGS__) \ DECLARE_SSE42_AVX2_CODE ( int funcImpl() { if constexpr (BuildArch == TargetArch::Default) { return 0; } else if constexpr (BuildArch == TargetArch::SSE42) { return 1; } else if constexpr (BuildArch == TargetArch::AVX2) { return 2; } return 3; } ) /// Dispatcher function int dispatchFunc() { #if USE_MULTITARGET_CODE if (isArchSupported(TargetArch::AVX2)) return TargetSpecific::AVX2::funcImpl(); else if (isArchSupported(TargetArch::SSE42)) return TargetSpecific::SSE42::funcImpl(); #endif return TargetSpecific::Default::funcImpl(); }
The examples above work well with standalone functions, but when we have class member functions, they do not work because member functions cannot be wrapped into the namespace. For such cases, we have another bunch of macros. We need to insert a specific attribute before the class member function name. And we also need to generate functions with different names, ideally with suffixes like SSE42, AVX2, AVX512.
We can split the function into the header with MULTITARGET_FUNCTION_HEADER
macros, function name, function body with MULTITARGET_FUNCTION_BODY
macros to insert specific attributes before the function name. For example, for AVX2 and SSE42 it can look like this:
/// Function header #define MULTITARGET_FUNCTION_HEADER(...) __VA_ARGS__ /// Function body #define MULTITARGET_FUNCTION_BODY(...) __VA_ARGS__ #define MULTITARGET_FUNCTION_AVX2_SSE42(FUNCTION_HEADER, name, FUNCTION_BODY) \ FUNCTION_HEADER \ \ AVX2_FUNCTION_SPECIFIC_ATTRIBUTE \ name##AVX2 \ FUNCTION_BODY \ \ FUNCTION_HEADER \ \ SSE42_FUNCTION_SPECIFIC_ATTRIBUTE \ name##SSE42 \ FUNCTION_BODY \ \ FUNCTION_HEADER \ \ name \ FUNCTION_BODY \
We use CPU dispatch in computationally intensive places, for example, in hashing, geometry functions, string processing functions, random numbers generation functions, unary functions, and aggregate functions.
For example, let’s see how we use CPU dispatch in aggregate functions. In ClickHouse, if there is GROUP BY without keys, for example SELECT sum(value), avg(value) FROM test_table
,
in such case, the aggregate functions process data directly in a batch. For sum
function, there is the following implementation:
template <typename Value> void NO_INLINE addManyImpl(const Value * __restrict ptr, size_t start, size_t end) { ptr += start; size_t count = end - start; const auto * end_ptr = ptr + count; /// Loop T local_sum{}; while (ptr < end_ptr) { Impl::add(local_sum, *ptr); ++ptr; } Impl::add(sum, local_sum); }
After we wrap this loop into our dispatch framework, the function code will look like this:
MULTITARGET_FUNCTION_AVX2_SSE42( MULTITARGET_FUNCTION_HEADER( template <typename Value> void NO_SANITIZE_UNDEFINED NO_INLINE ), addManyImpl, MULTITARGET_FUNCTION_BODY((const Value * __restrict ptr, size_t start, size_t end) { ptr += start; size_t count = end - start; const auto * end_ptr = ptr + count; /// Loop T local_sum{}; while (ptr < end_ptr) { Impl::add(local_sum, *ptr); ++ptr; } Impl::add(sum, local_sum); }))
Now we need to add dispatch based on the currently available CPU instruction set:
template <typename Value> void NO_INLINE addMany(const Value * __restrict ptr, size_t start, size_t end) { #if USE_MULTITARGET_CODE if (isArchSupported(TargetArch::AVX2)) { addManyImplAVX2(ptr, start, end); return; } else if (isArchSupported(TargetArch::SSE42)) { addManyImplSSE42(ptr, start, end); return; } #endif addManyImpl(ptr, start, end); }
Here is a small part of the performance report after this optimization was applied:
Query | Old (s) | New (s) | Ratio of speedup(-) or slowdown(+) | Relative difference (new - old) / old |
---|---|---|---|---|
SELECT sum(toNullable(toUInt8(number))) FROM numbers(100000000) | 0.110 | 0.077 | -1.428x | -0.300 |
SELECT sum(number) FROM numbers(100000000) | 0.044 | 0.035 | -1.228x | -0.185 |
SELECT sumOrNull(number) FROM numbers(100000000) | 0.044 | 0.036 | -1.226x | -0.183 |
SELECT avg(number) FROM numbers(100000000) | 0.416 | 0.341 | -1.219x | -0.180 |
In general, such optimization for sum
and avg
aggregate functions improves performance in 1.2-1.6 times. Similar optimizations can be applied to other aggregate functions as well.
Now let’s take a look at CPU dispatch optimization in unary functions:
template <typename A, typename Op> struct UnaryOperationImpl { using ResultType = typename Op::ResultType; using ColVecA = ColumnVectorOrDecimal<A>; using ColVecC = ColumnVectorOrDecimal<ResultType> using ArrayA = typename ColVecA::Container; using ArrayC = typename ColVecC::Container; static void vector(const ArrayA & a, ArrayC & c) { /// Loop Op::apply is template for operation size_t size = a.size(); for (size_t i = 0; i < size; ++i) c[i] = Op::apply(a[i]); } static void constant(A a, ResultType & c) { c = Op::apply(a); } };
In the example, there is a loop that applies some template operation using Op::apply
on elements of array a
and writes result into array c
. After we wrap this loop into our dispatch framework, the loop code will look like this:
MULTITARGET_FUNCTION_WRAPPER_AVX2_SSE42( MULTITARGET_FH(static void NO_INLINE), vectorImpl, MULTITARGET_FB((const ArrayA & a, ArrayC & c) /// NOLINT { /// Loop Op::apply is template for operation size_t size = a.size(); for (size_t i = 0; i < size; ++i) c[i] = Op::apply(a[i]); }))
Now we need to add dispatch based on the currently available CPU instruction set:
static void NO_INLINE vector(const ArrayA & a, ArrayC & c) { #if USE_MULTITARGET_CODE if (isArchSupported(TargetArch::AVX2)) { vectorImplAVX2(a, c); return; } else if (isArchSupported(TargetArch::SSE42)) { vectorImplSSE42(a, c); return; } #endif vectorImpl(a, c); }
Here is a small part of the performance report after this optimization was applied:
Query | Old (s) | New (s) | Ratio of speedup(-) or slowdown(+) | Relative difference (new - old) / old |
---|---|---|---|---|
SELECT roundDuration(toInt32(number))) FROM numbers(100000000) | 1.632 | 0.229 | -7.119x | -0.860 |
SELECT intExp2(toInt32(number)) FROM numbers(100000000) | 0.148 | 0.105 | -1.413x | -0.293 |
SELECT roundToExp2(toUInt8(number)) FROM numbers(100000000) | 0.144 | 0.102 | -1.41x | -0.291 |
In general, such optimization for unary functions improves performance in 1.15-2 times. For some specific functions like roundDuration
such optimization improved performance
in 2-7 times.
Summary
The compiler can vectorize even complex loops using SIMD instructions. Additionally, you can vectorize your code manually or design SIMD-oriented algorithms. But the biggest problem is that if you want to use modern instruction sets, it can decrease your program or library portability. Runtime CPU dispatch can help you eliminate this problem, with the cost of compiling parts of your code multiple times for different architectures.
You can find places to improve performance using performance tests and compiling your code with different configurations. For CPU dispatch optimizations, you can compile your code with AVX, AVX2, AVX512 and manually apply CPU dispatch in places where there are performance improvements.
In ClickHouse, we designed a framework specifically for such optimizations, and we improved performance in a lot of places using it.