Overview

In this post, I will describe how vectorization works, what CPU dispatch is, how to find places for CPU dispatch optimizations and how we use CPU dispatch in ClickHouse.

First, let’s frame the problem. Hardware vendors constantly add new instructions to the instruction sets of modern CPUs, and we often want to use the latest instructions for optimizations; the most important of these are SIMD instructions. The main issue with that is compatibility. For example, if your program is compiled with the AVX2 instruction set but your CPU supports only SSE4.2, then running that program will raise an illegal instruction signal.

Also, an important thing to note is that some algorithms can be specifically designed to work with modern instructions, for example, SIMD JSON.

To improve performance while preserving compatibility, parts of your code can be compiled for different instruction sets, and at runtime the program can dispatch execution to the most performant variant.

For the examples in this post, I will use clang-15.

Vectorization basics

Vectorization is an optimization in which your data is processed with vector operations instead of scalar operations. Modern CPUs have SIMD instructions that process several elements of your data at once. Such optimization can be performed manually, or the compiler can perform automatic vectorization.

Let’s consider the following code example:

void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = b[i] + a[i];
    }
}

We have a plus function that takes three pointers to arrays a, b and c, and the size of these arrays. The function computes the sum of the elements of a and b and writes it into array c.

If we compile this code without loop unrolling (option -fno-unroll-loops) and with AVX2 support (option -mavx2), we get the following assembly:

$ /usr/bin/clang++-15 -mavx2 -fno-unroll-loops -O3 -S vectorization_example.cpp

# %bb.0:
	testq	%rcx, %rcx
	je	.LBB0_7
# %bb.1:
	cmpq	$4, %rcx
	jae	.LBB0_3
# %bb.2:
	xorl	%r8d, %r8d
	jmp	.LBB0_6
.LBB0_3:
	movq	%rcx, %r8
	andq	$-4, %r8
	xorl	%eax, %eax
	.p2align	4, 0x90
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rdx,%rax,8)
	addq	$4, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4
# %bb.5:
	cmpq	%rcx, %r8
	je	.LBB0_7
	.p2align	4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	movq	(%rdi,%r8,8), %rax
	addq	(%rsi,%r8,8), %rax
	movq	%rax, (%rdx,%r8,8)
	incq	%r8
	cmpq	%r8, %rcx
	jne	.LBB0_6
.LBB0_7:
	vzeroupper
	retq

In the final assembly, there are two loops. The vectorized loop processes 4 elements at a time:

.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rdx,%rax,8)
	addq	$4, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4

And the scalar loop:

.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	movq	(%rdi,%r8,8), %rax
	addq	(%rsi,%r8,8), %rax
	movq	%rax, (%rdx,%r8,8)
	incq	%r8
	cmpq	%r8, %rcx
	jne	.LBB0_6

At the beginning of the function, there is a check on the array size that decides which loop to execute:

# %bb.1:
	cmpq	$4, %rcx
	jae	.LBB0_3
# %bb.2:
	xorl	%r8d, %r8d
	jmp	.LBB0_6

Additionally, an important thing to note is the vzeroupper instruction. The compiler inserts it to avoid the penalty of mixing SSE and VEX-encoded AVX instructions. You can read more about it in Agner Fog's Optimizing subroutines in assembly language: An optimization guide for x86 platforms, section 13.2 Mixing VEX and SSE code.

Another important thing to note is that if the __restrict modifier is not specified for the pointers, the compiler may not vectorize the loop, or it will vectorize the loop but add a couple of runtime checks at the beginning of the function to make sure that the arrays do not overlap.
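
For illustration, here is the same function without __restrict; plusNoRestrict is a name introduced only for this sketch. Compiling it with the same options shows the compiler either keeping the scalar loop or guarding the vectorized loop with runtime overlap checks:

void plusNoRestrict(int64_t * a, int64_t * b, int64_t * c, size_t size)
{
    /// Without __restrict the compiler must assume a, b and c may alias.
    for (size_t i = 0; i < size; ++i) {
        c[i] = b[i] + a[i];
    }
}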

Additionally, if we compile the original plus example without -fno-unroll-loops and look at the generated loop, we will see that the compiler unrolled the vectorized loop, which now processes 16 elements at a time:

.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rdi,%rax,8), %ymm0
	vmovdqu	32(%rdi,%rax,8), %ymm1
	vmovdqu	64(%rdi,%rax,8), %ymm2
	vmovdqu	96(%rdi,%rax,8), %ymm3
	vpaddq	(%rsi,%rax,8), %ymm0, %ymm0
	vpaddq	32(%rsi,%rax,8), %ymm1, %ymm1
	vpaddq	64(%rsi,%rax,8), %ymm2, %ymm2
	vpaddq	96(%rsi,%rax,8), %ymm3, %ymm3
	vmovdqu	%ymm0, (%rdx,%rax,8)
	vmovdqu	%ymm1, 32(%rdx,%rax,8)
	vmovdqu	%ymm2, 64(%rdx,%rax,8)
	vmovdqu	%ymm3, 96(%rdx,%rax,8)
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	.LBB0_4

There is a very useful compiler feature that can help you identify places where the compiler does or does not perform vectorization, so you can avoid inspecting the assembly. You can add the -Rpass=loop-vectorize, -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize options to clang. There are similar options for gcc.

If we compile our example with these options, there will be the following output:

$ /usr/bin/clang++-15 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -O3 -c vectorization_example.cpp

vectorization_example.cpp:7:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    for (size_t i = 0; i < size; ++i) {
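
With gcc, similar diagnostics can be requested with the -fopt-info family of options; for example (the exact report format differs from clang's remarks):

$ g++ -mavx2 -fopt-info-vec-optimized -fopt-info-vec-missed -O3 -c vectorization_example.cpp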

Now let’s consider another example:

class SumFunction
{
public:
    void sumIf(int64_t * values, int8_t * filter, size_t size);

    int64_t sum = 0;
};

void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        sum += filter[i] ? 0 : values[i];
    }
}
If we compile this example with the same remark options, the compiler reports that the loop is not vectorized:

/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp

...

vectorization_example.cpp:28:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
        for (size_t i = 0; i < size; ++i) {

If the compiler cannot vectorize a loop, there are two options:

  1. You can try to modify your code so that it can be vectorized. In some complex scenarios, you may need to redesign your data representation. I highly encourage you to check the LLVM documentation and the gcc documentation, which can help you understand in which cases auto-vectorization can or cannot be performed.
  2. You can vectorize the loop manually using intrinsics (see the sketch below). This option is less preferred because of the additional maintenance cost.
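
As a minimal sketch of the second option, here is the earlier plus loop vectorized by hand with AVX2 intrinsics. plusManualAVX2 is a name introduced only for this illustration, and a real implementation would still need CPU dispatch around it:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void plusManualAVX2(const int64_t * a, const int64_t * b, int64_t * c, size_t size)
{
    size_t i = 0;

    /// Process 4 int64_t elements per iteration using 256-bit registers.
    for (; i + 4 <= size; i += 4)
    {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(c + i), _mm256_add_epi64(va, vb));
    }

    /// Scalar tail for the remaining elements.
    for (; i < size; ++i)
        c[i] = a[i] + b[i];
}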

To fix the issue in our example, we need to accumulate into a local sum inside the function:

class SumFunction
{
public:
    void sumIf(int64_t * values, int8_t * filter, size_t size);

    int64_t sum = 0;
};

void SumFunction::sumIf(int64_t * values, int8_t * filter, size_t size)
{
    int64_t local_sum = 0;

    for (size_t i = 0; i < size; ++i) {
        local_sum += filter[i] ? 0 : values[i];
    }

    sum += local_sum;
}

This version is vectorized by the compiler:

/usr/bin/clang++-15 -mavx2 -O3 -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vectorization_example.cpp

vectorization_example.cpp:31:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    for (size_t i = 0; i < size; ++i) {

In the resulting assembly, the vectorized loop looks like this:

.LBB0_5:                                # =>This Inner Loop Header: Depth=1
	vmovd	(%rdx,%rax), %xmm5              # xmm5 = mem[0],zero,zero,zero
	vmovd	4(%rdx,%rax), %xmm6             # xmm6 = mem[0],zero,zero,zero
	vmovd	8(%rdx,%rax), %xmm7             # xmm7 = mem[0],zero,zero,zero
	vmovd	12(%rdx,%rax), %xmm1            # xmm1 = mem[0],zero,zero,zero
	vpcmpeqb	%xmm5, %xmm8, %xmm5
	vpmovsxbq	%xmm5, %ymm5
	vpcmpeqb	%xmm6, %xmm8, %xmm6
	vpmovsxbq	%xmm6, %ymm6
	vpcmpeqb	%xmm7, %xmm8, %xmm7
	vpmovsxbq	%xmm7, %ymm7
	vpcmpeqb	%xmm1, %xmm8, %xmm1
	vpmaskmovq	-96(%r8,%rax,8), %ymm5, %ymm5
	vpmovsxbq	%xmm1, %ymm1
	vpmaskmovq	-64(%r8,%rax,8), %ymm6, %ymm6
	vpaddq	%ymm0, %ymm5, %ymm0
	vpmaskmovq	-32(%r8,%rax,8), %ymm7, %ymm5
	vpaddq	%ymm2, %ymm6, %ymm2
	vpmaskmovq	(%r8,%rax,8), %ymm1, %ymm1
	vpaddq	%ymm3, %ymm5, %ymm3
	vpaddq	%ymm4, %ymm1, %ymm4
	addq	$16, %rax
	cmpq	%rax, %r9
	jne	.LBB0_5

CPU Dispatch basics

CPU dispatch is a technique in which multiple versions of your code are compiled for different CPU features; at runtime, your program detects which features the machine supports and uses the most performant version. The most important instruction sets to check are SSE4.2, AVX, AVX2, and AVX-512.

To implement CPU dispatch, we first need to use the CPUID instruction to check which features the current CPU supports.

You can call the cpuid instruction using inline assembly, or use the cpuid.h header where the corresponding macros are defined:

/* x86-64 uses %rbx as the base register, so preserve it. */
#define __cpuid(__leaf, __eax, __ebx, __ecx, __edx) \
   __asm("  xchgq  %%rbx,%q1\n" \
         "  cpuid\n" \
         "  xchgq  %%rbx,%q1" \
       : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \
       : "0"(__leaf))

#define __cpuid_count(__leaf, __count, __eax, __ebx, __ecx, __edx) \
   __asm("  xchgq  %%rbx,%q1\n" \
         "  cpuid\n" \
         "  xchgq  %%rbx,%q1" \
       : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \
       : "0"(__leaf), "2"(__count))

Next, to check whether a particular CPU feature is supported, you need to look up the corresponding CPUID bit in the Intel manuals (for example, the Intel Software Optimization Reference Manual, Chapter 5). For SSE4.2 the check looks like this:

bool hasSSE42()
{
    uint32_t eax = 0;
    uint32_t ebx = 0;
    uint32_t ecx = 0;
    uint32_t edx = 0;

    __cpuid(0x1, eax, ebx, ecx, edx);

    /// CPUID leaf 0x1: ECX bit 20 indicates SSE4.2 support.
    return (ecx >> 20) & 1ul;
}
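
The dispatch example below also uses hasAVX2 and hasAVX512 helpers. Here is a minimal sketch of how they could look, using the __cpuid_count macro from cpuid.h shown above; a production check should also verify OS support for the wider registers via XGETBV:

bool hasAVX2()
{
    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);

    /// CPUID leaf 0x7, sub-leaf 0: EBX bit 5 indicates AVX2 support.
    return (ebx >> 5) & 1u;
}

bool hasAVX512()
{
    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);

    /// CPUID leaf 0x7, sub-leaf 0: EBX bit 16 indicates AVX-512 Foundation support.
    return (ebx >> 16) & 1u;
}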

Now we need to compile our function for different instruction sets. In clang, there is a target attribute that does exactly that; gcc has the same attribute. For example:

void plusDefault(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

__attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2")))
void plusAVX2(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

__attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2,avx512f")))
void plusAVX512(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

In this example, we additionally compile our plus function for AVX2 and AVX-512. In the final assembly, we can check that the compiler uses AVX2 instructions (the ymm registers) for the vectorized loop of the plusAVX2 function:

...

.globl	_Z8plusAVX2PlS_S_m              # -- Begin function _Z8plusAVX2PlS_S_m

...

.LBB4_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu	(%rsi,%rax,8), %ymm0
	vmovdqu	32(%rsi,%rax,8), %ymm1
	vmovdqu	64(%rsi,%rax,8), %ymm2
	vmovdqu	96(%rsi,%rax,8), %ymm3
	vpaddq	(%rdi,%rax,8), %ymm0, %ymm0
	vpaddq	32(%rdi,%rax,8), %ymm1, %ymm1
	vpaddq	64(%rdi,%rax,8), %ymm2, %ymm2
	vpaddq	96(%rdi,%rax,8), %ymm3, %ymm3
	vmovdqu	%ymm0, (%rdx,%rax,8)
	vmovdqu	%ymm1, 32(%rdx,%rax,8)
	vmovdqu	%ymm2, 64(%rdx,%rax,8)
	vmovdqu	%ymm3, 96(%rdx,%rax,8)
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	.LBB4_4

...

and AVX-512 instructions (the zmm registers) for the vectorized loop of plusAVX512:

...

.globl	_Z10plusAVX512PlS_S_m           # -- Begin function _Z10plusAVX512PlS_S_m

...

.LBB5_4:                                # =>This Inner Loop Header: Depth=1
	vmovdqu64	(%rsi,%rax,8), %zmm0
	vmovdqu64	64(%rsi,%rax,8), %zmm1
	vmovdqu64	128(%rsi,%rax,8), %zmm2
	vmovdqu64	192(%rsi,%rax,8), %zmm3
	vpaddq	(%rdi,%rax,8), %zmm0, %zmm0
	vpaddq	64(%rdi,%rax,8), %zmm1, %zmm1
	vpaddq	128(%rdi,%rax,8), %zmm2, %zmm2
	vpaddq	192(%rdi,%rax,8), %zmm3, %zmm3
	vmovdqu64	%zmm0, (%rdx,%rax,8)
	vmovdqu64	%zmm1, 64(%rdx,%rax,8)
	vmovdqu64	%zmm2, 128(%rdx,%rax,8)
	vmovdqu64	%zmm3, 192(%rdx,%rax,8)
	addq	$32, %rax
	cmpq	%rax, %r8
	jne	.LBB5_4

...

Now we have everything to perform CPU dispatch:

void plus(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    if (hasAVX512()) {
        plusAVX512(a, b, c, size);
    } else if (hasAVX2()) {
        plusAVX2(a, b, c, size);
    } else {
        plusDefault(a, b, c, size);
    }
}

In this example, we created a plus function that dispatches to a concrete implementation based on the available instruction set. This CPU dispatch method is also called dispatch on each call. There are other methods that you can read about in Agner Fog's Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms, section 13.1 CPU dispatch strategies.

Dispatch on each call is the most flexible method because it works with template functions and class member functions, and it lets you choose an implementation based on statistics collected at runtime. The only downside is the branch on each call, although this overhead is negligible if your function performs a lot of work.
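
For example, one of the alternative strategies described there is to select the implementation once and store it in a function pointer, so later calls avoid the repeated CPUID checks. A minimal sketch, assuming the plus variants and CPUID helpers from above; selectPlusImpl and plusDispatchedOnce are names introduced only for this illustration:

using PlusPtr = void (*)(int64_t * __restrict, int64_t * __restrict, int64_t * __restrict, size_t);

static PlusPtr selectPlusImpl()
{
    if (hasAVX512())
        return plusAVX512;
    if (hasAVX2())
        return plusAVX2;
    return plusDefault;
}

void plusDispatchedOnce(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    /// The implementation is chosen once; static local initialization is thread-safe since C++11.
    static const PlusPtr impl = selectPlusImpl();
    impl(a, b, c, size);
}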

CPU Dispatch Optimization Places

Now, how do you find places where SIMD optimizations can be applied?

  1. If you know what loops are hot in your program, you can try to apply CPU dispatch for them.
  2. If you have performance tests, you can compile your program with AVX, AVX2, AVX512 and compare performance reports to figure out which places can be optimized in your program using CPU dispatch.

This technique with performance tests can be applied not only to CPU dispatch but to many other useful optimizations. The main idea is to compile your code with different configurations (compilers, compiler options, libraries, allocators), and if there is a performance improvement in some places, you can optimize them manually. For example:

  1. Try different allocators and different libraries.
  2. Try different compiler options (loop unrolling, inline threshold).
  3. Enable AVX/AVX2/AVX512 for build.

CPU Dispatch in ClickHouse

First, I want to show how we designed our dispatch framework in ClickHouse.

enum class TargetArch : UInt32
{
    Default     = 0,         /// Without any additional compiler options.
    SSE42       = (1 << 0),  /// SSE4.2
    AVX         = (1 << 1),
    AVX2        = (1 << 2),
    AVX512F     = (1 << 3),
    AVX512BW    = (1 << 4),
    AVX512VBMI  = (1 << 5),
    AVX512VBMI2 = (1 << 6),
};

/// Runtime detection.
bool isArchSupported(TargetArch arch);

We define an enum TargetArch for the target architecture, and inside the isArchSupported function we use the CPUID checks that we already discussed.
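
A rough sketch of how isArchSupported could be wired to the CPUID helpers shown earlier; the actual implementation in ClickHouse detects and caches the CPU flags rather than querying CPUID on every call:

bool isArchSupported(TargetArch arch)
{
    switch (arch)
    {
        case TargetArch::Default:
            return true;
        case TargetArch::SSE42:
            return hasSSE42();
        case TargetArch::AVX2:
            return hasAVX2();
        case TargetArch::AVX512F:
            return hasAVX512();
        default:
            /// The remaining AVX-512 subsets would get their own CPUID bit checks.
            return false;
    }
}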

Then we define a set of BEGIN_INSTRUCTION_SET_SPECIFIC_CODE sections that apply the target attribute to a whole block of code.

For example, for clang:

#   define BEGIN_AVX512F_SPECIFIC_CODE \
    _Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,popcnt,avx,avx2,avx512f\"))), apply_to=function)")

#   define BEGIN_AVX2_SPECIFIC_CODE \
    _Pragma("clang attribute push(__attribute__((target(\"sse,sse2,sse3,ssse3,sse4,popcnt,avx,avx2\"))), apply_to=function)")

#   define END_TARGET_SPECIFIC_CODE \
    _Pragma("clang attribute pop")

Then, for each instruction set, we define a separate namespace TargetSpecific::INSTRUCTION_SET. Here are examples for AVX2 and AVX512F:

#define DECLARE_AVX2_SPECIFIC_CODE(...) \
BEGIN_AVX2_SPECIFIC_CODE \
namespace TargetSpecific::AVX2 { \
    DUMMY_FUNCTION_DEFINITION \
    using namespace DB::TargetSpecific::AVX2; \
    __VA_ARGS__ \
} \
END_TARGET_SPECIFIC_CODE

#define DECLARE_AVX512F_SPECIFIC_CODE(...) \
BEGIN_AVX512F_SPECIFIC_CODE \
namespace TargetSpecific::AVX512F { \
    DUMMY_FUNCTION_DEFINITION \
    using namespace DB::TargetSpecific::AVX512F; \
    __VA_ARGS__ \
} \
END_TARGET_SPECIFIC_CODE

And it can be used like this:

DECLARE_DEFAULT_CODE (
    int funcImpl() {
        return 1;
    }
) // DECLARE_DEFAULT_CODE

DECLARE_AVX2_SPECIFIC_CODE (
    int funcImpl() {
        return 2;
    }
) // DECLARE_AVX2_SPECIFIC_CODE

/// Dispatcher function
int dispatchFunc() {
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
        return TargetSpecific::AVX2::funcImpl();
#endif
    return TargetSpecific::Default::funcImpl();
}

By default, USE_MULTITARGET_CODE is set to 1. Additionally, in each TargetSpecific::INSTRUCTION_SET namespace, we define a compile-time BuildArch constant:

DECLARE_SSE42_SPECIFIC_CODE(
    constexpr auto BuildArch = TargetArch::SSE42; /// NOLINT
) // DECLARE_SSE42_SPECIFIC_CODE

DECLARE_AVX_SPECIFIC_CODE(
    constexpr auto BuildArch = TargetArch::AVX; /// NOLINT
) // DECLARE_AVX_SPECIFIC_CODE

DECLARE_AVX2_SPECIFIC_CODE(
    constexpr auto BuildArch = TargetArch::AVX2; /// NOLINT
) // DECLARE_AVX2_SPECIFIC_CODE

DECLARE_AVX512F_SPECIFIC_CODE(
    constexpr auto BuildArch = TargetArch::AVX512F; /// NOLINT
) // DECLARE_AVX512F_SPECIFIC_CODE

This is useful when the function body is almost the same for every instruction set and we only want to add a couple of compile-time branches. For example:

#define DECLARE_SSE42_AVX2_CODE(...) \
DECLARE_DEFAULT_CODE         (__VA_ARGS__) \
DECLARE_SSE42_SPECIFIC_CODE  (__VA_ARGS__) \
DECLARE_AVX2_SPECIFIC_CODE   (__VA_ARGS__)

DECLARE_SSE42_AVX2_CODE (
    int funcImpl() {
        if constexpr (BuildArch == TargetArch::Default) {
            return 0;
        } else if constexpr (BuildArch == TargetArch::SSE42) {
            return 1;
        } else if constexpr (BuildArch == TargetArch::AVX2) {
            return 2;
        }

        return 3;
    }
)

/// Dispatcher function
int dispatchFunc() {
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
        return TargetSpecific::AVX2::funcImpl();
    else if (isArchSupported(TargetArch::SSE42))
        return TargetSpecific::SSE42::funcImpl();
#endif
    return TargetSpecific::Default::funcImpl();
}

The examples above work well with standalone functions, but they do not work with class member functions, because member functions cannot be wrapped in a namespace. For such cases, we have another set of macros. We need to insert a target-specific attribute before the class member function name, and we also need to generate functions with different names, ideally with suffixes like SSE42, AVX2, AVX512.

We split the function into a header (the MULTITARGET_FUNCTION_HEADER macro), the function name, and a body (the MULTITARGET_FUNCTION_BODY macro), so that target-specific attributes can be inserted right before the function name. For example, for AVX2 and SSE4.2 it can look like this:

/// Function header
#define MULTITARGET_FUNCTION_HEADER(...) __VA_ARGS__

/// Function body
#define MULTITARGET_FUNCTION_BODY(...) __VA_ARGS__

#define MULTITARGET_FUNCTION_AVX2_SSE42(FUNCTION_HEADER, name, FUNCTION_BODY) \
    FUNCTION_HEADER \
    \
    AVX2_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##AVX2 \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    SSE42_FUNCTION_SPECIFIC_ATTRIBUTE \
    name##SSE42 \
    FUNCTION_BODY \
    \
    FUNCTION_HEADER \
    \
    name \
    FUNCTION_BODY
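
The AVX2_FUNCTION_SPECIFIC_ATTRIBUTE and SSE42_FUNCTION_SPECIFIC_ATTRIBUTE macros used above wrap the same kind of target attribute as the code-block macros earlier; a rough sketch of how they could be defined:

/// Sketch of the per-function attribute macros, mirroring the block macros above.
#define SSE42_FUNCTION_SPECIFIC_ATTRIBUTE \
    __attribute__((target("sse,sse2,sse3,ssse3,sse4,popcnt")))
#define AVX2_FUNCTION_SPECIFIC_ATTRIBUTE \
    __attribute__((target("sse,sse2,sse3,ssse3,sse4,popcnt,avx,avx2")))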

We use CPU dispatch in computationally intensive places, for example in hashing, geometry functions, string processing functions, random number generation functions, unary functions, and aggregate functions.

For example, let’s see how we use CPU dispatch in aggregate functions. In ClickHouse, if there is a GROUP BY without keys, for example SELECT sum(value), avg(value) FROM test_table, the aggregate functions process data directly in batches. For the sum function, there is the following implementation:

template <typename Value>
void NO_INLINE addManyImpl(const Value * __restrict ptr, size_t start, size_t end)
{
    ptr += start;
    size_t count = end - start;
    const auto * end_ptr = ptr + count;

    /// Loop
    T local_sum{};
    while (ptr < end_ptr)
    {
        Impl::add(local_sum, *ptr);
        ++ptr;
    }
    Impl::add(sum, local_sum);
}

After we wrap this loop into our dispatch framework, the function code will look like this:

MULTITARGET_FUNCTION_AVX2_SSE42(
MULTITARGET_FUNCTION_HEADER(
template <typename Value>
void NO_SANITIZE_UNDEFINED NO_INLINE
), addManyImpl,
MULTITARGET_FUNCTION_BODY((const Value * __restrict ptr, size_t start, size_t end)
{
    ptr += start;
    size_t count = end - start;
    const auto * end_ptr = ptr + count;

    /// Loop
    T local_sum{};
    while (ptr < end_ptr)
    {
        Impl::add(local_sum, *ptr);
        ++ptr;
    }
    Impl::add(sum, local_sum);
}))

Now we need to add dispatch based on the currently available CPU instruction set:

template <typename Value>
void NO_INLINE addMany(const Value * __restrict ptr, size_t start, size_t end)
{
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
    {
        addManyImplAVX2(ptr, start, end);
        return;
    }
    else if (isArchSupported(TargetArch::SSE42))
    {
        addManyImplSSE42(ptr, start, end);
        return;
    }
#endif

    addManyImpl(ptr, start, end);
}

Here is a small part of the performance report after this optimization was applied:

Query                                                             Old (s)   New (s)   Ratio of speedup (-) or slowdown (+)   Relative difference (new - old) / old
SELECT sum(toNullable(toUInt8(number))) FROM numbers(100000000)   0.110     0.077     -1.428x                                -0.300
SELECT sum(number) FROM numbers(100000000)                        0.044     0.035     -1.228x                                -0.185
SELECT sumOrNull(number) FROM numbers(100000000)                  0.044     0.036     -1.226x                                -0.183
SELECT avg(number) FROM numbers(100000000)                        0.416     0.341     -1.219x                                -0.180

In general, this optimization improves the performance of the sum and avg aggregate functions by 1.2-1.6 times. Similar optimizations can be applied to other aggregate functions as well.

Now let’s take a look at CPU dispatch optimization in unary functions:

template <typename A, typename Op>
struct UnaryOperationImpl
{
    using ResultType = typename Op::ResultType;
    using ColVecA = ColumnVectorOrDecimal<A>;
    using ColVecC = ColumnVectorOrDecimal<ResultType>;
    using ArrayA = typename ColVecA::Container;
    using ArrayC = typename ColVecC::Container;

    static void vector(const ArrayA & a, ArrayC & c)
    {
        /// Loop Op::apply is template for operation
        size_t size = a.size();
        for (size_t i = 0; i < size; ++i)
            c[i] = Op::apply(a[i]);
    }

    static void constant(A a, ResultType & c)
    {
        c = Op::apply(a);
    }
};

In this example, there is a loop that applies a template operation Op::apply to the elements of array a and writes the result into array c. After we wrap this loop into our dispatch framework, the loop code will look like this:

MULTITARGET_FUNCTION_AVX2_SSE42(
MULTITARGET_FUNCTION_HEADER(static void NO_INLINE),
vectorImpl,
MULTITARGET_FUNCTION_BODY((const ArrayA & a, ArrayC & c) /// NOLINT
{
    /// Loop Op::apply is template for operation
    size_t size = a.size();
    for (size_t i = 0; i < size; ++i)
        c[i] = Op::apply(a[i]);
}))

Now we need to add dispatch based on the currently available CPU instruction set:

static void NO_INLINE vector(const ArrayA & a, ArrayC & c)
{
#if USE_MULTITARGET_CODE
    if (isArchSupported(TargetArch::AVX2))
    {
        vectorImplAVX2(a, c);
        return;
    }
    else if (isArchSupported(TargetArch::SSE42))
    {
        vectorImplSSE42(a, c);
        return;
    }
#endif

    vectorImpl(a, c);
}

Here is a small part of the performance report after this optimization was applied:

Query                                                           Old (s)   New (s)   Ratio of speedup (-) or slowdown (+)   Relative difference (new - old) / old
SELECT roundDuration(toInt32(number)) FROM numbers(100000000)   1.632     0.229     -7.119x                                -0.860
SELECT intExp2(toInt32(number)) FROM numbers(100000000)         0.148     0.105     -1.413x                                -0.293
SELECT roundToExp2(toUInt8(number)) FROM numbers(100000000)     0.144     0.102     -1.41x                                 -0.291

In general, such optimization improves the performance of unary functions by 1.15-2 times. For some specific functions, like roundDuration, it improved performance by 2-7 times.

Summary

The compiler can vectorize even complex loops using SIMD instructions. Additionally, you can vectorize your code manually or design SIMD-oriented algorithms. The biggest problem is that using modern instruction sets can reduce the portability of your program or library. Runtime CPU dispatch can help you eliminate this problem, at the cost of compiling parts of your code multiple times for different architectures.

You can find places to improve performance by running performance tests while compiling your code with different configurations. For CPU dispatch optimizations, you can compile your code with AVX, AVX2, AVX512 and manually apply CPU dispatch in the places where you see performance improvements.

In ClickHouse, we designed a framework specifically for such optimizations, and we have improved performance in a lot of places using it.