How to do fast sparse int8 multiplications on CPUs to speed up deep learning inference

Ziheng Wang
4 min read · Oct 23, 2021


This Medium blog post is going to be highly technical, and though it has quite a lot to do with deep learning, it is probably beyond the scope of what most deep learning practitioners care about. If you have a model that you’d like to speed up, I’d recommend starting with this article: https://towardsdatascience.com/speeding-up-bert-inference-different-approaches-40863b2954c4. This blog post is kind of an arXiv article, except I am too lazy to write LaTeX.

To fully understand this blog post, you should know (quite) a bit of computer architecture and the latest Intel AVX-512 VNNI instructions, which allow the programmer to perform 64 pairwise int8 multiplications and accumulate adjacent groups of 4 into 16 int32 results with a single 512-bit instruction. For more information on these instructions (or any instructions, for that matter), please refer to: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX_VNNI&text=store.
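To make those counts concrete, here is the corresponding intrinsic and its per-lane semantics. This is a sketch of what the instruction computes, not code from the actual kernel; the variable names are mine.

#include <immintrin.h>

// What one 512-bit vpdpbusd does, per int32 lane i of the destination:
//   dst[i] = src[i] + a[4i]*b[4i] + a[4i+1]*b[4i+1] + a[4i+2]*b[4i+2] + a[4i+3]*b[4i+3]
// where a holds unsigned int8 values and b holds signed int8 values,
// for 64 byte products per instruction.
__m512i dst = _mm512_dpbusd_epi32(src, a, b);  // requires AVX512_VNNI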

These instructions are designed to speed up dense int8 linear algebra. It is difficult to make them work for sparse int8 operations. Let’s take the case of an unstructured sparse matrix A multiplied by a dense matrix B. A naive approach would be to process 8 rows of sparse matrix A at a time, and use some form of gather instruction to pack one of the AVX-512 vector registers from 8 different columns of matrix B. However, this approach has two problems. The first is that if A is highly sparse and has many columns, this gather wastes memory bandwidth, cache capacity and CPU cycles, and will likely be more costly than the actual VNNI multiplication itself. The second is that if different rows of A have a different number of nonzeros, then some rows in the 8-row block will run out of nonzero values to pack!

The above approach can be improved vastly if the sparsity pattern of A is known in advance, which we will assume from now on (as is the case in deep learning inference, where you have a fixed sparse weight matrix): you can presort the rows of A so that rows with a similar number of nonzeros end up together. But you can’t really get rid of the shitty memory access pattern in matrix B. Is there a better approach?

I think there is. In this approach, we will process the sparse matrix one row at a time. Row i of sparse matrix A multiplied by dense matrix B produces row i of the output matrix. This multiplication amounts to a weighted sum of the rows of dense matrix B, where the weights are the values of row i of sparse matrix A. I have illustrated this below. This is slightly different from the inner products that we typically associate with matrix multiplies. Please take time to fully understand this picture.

[Figure: row i of sparse A times dense B, computed as a weighted sum of the rows of B. Make sure you understand this image; you can fill in some numbers in the cells to convince yourself this works.]
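If you prefer code to pictures, here is a plain-C reference of the same idea. This is my own sketch with hypothetical names, not the kernel itself: row i of A is given in a CSR-like form by nnz, cols[] and vals[], B has n_cols_B columns per row, and out_row accumulates row i of the output in int32.

#include <stddef.h>
#include <stdint.h>

// Row i of sparse A times dense B = weighted sum of rows of B,
// where the weights are the nonzeros of row i of A.
void sparse_row_times_dense(const int8_t *vals, const int *cols, int nnz,
                            const int8_t *B, int n_cols_B, int32_t *out_row)
{
    for (int j = 0; j < n_cols_B; j++)
        out_row[j] = 0;
    for (int k = 0; k < nnz; k++) {            // for each nonzero A[i][cols[k]]
        const int8_t *B_row = B + (size_t)cols[k] * n_cols_B;
        for (int j = 0; j < n_cols_B; j++)     // add vals[k] * (row cols[k] of B)
            out_row[j] += (int32_t)vals[k] * (int32_t)B_row[j];
    }
}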

We are going to use VNNI instructions to accelerate this weighted vector sum. The sequence of instructions looks something like this:

1. vpbroadcastd 0(%rsi), %zmm0;

2. vmovdqu8 640(%r8,%r11,1), %xmm24;
   vmovdqu8 2176(%r8,%r11,1), %xmm17;
   vbroadcasti32x4 768(%r8,%r11,1), %ymm24 {%k1};
   vbroadcasti32x4 4864(%r8,%r11,1), %ymm17 {%k1};
   vpermt2d %zmm17, %zmm25, %zmm24;
   vpshufb %zmm26, %zmm24, %zmm24;

3. vpdpbusd %zmm24, %zmm31, %zmm0;

Let’s talk about 1. We know the sparsity pattern of A, so we know which elements in the row are nonzero and what their values are. We will prepack them into groups of four, which we hold as 32-bit values. Then, for each group of 4 in this row, we broadcast that 32-bit value into an AVX-512 register, so the AVX-512 register holds the same 4 int8 weights repeated in all 16 dword lanes.
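A minimal intrinsics sketch of step 1, assuming a hypothetical packed_weights array whose 32-bit entries each hold a group of 4 consecutive int8 nonzeros from row i of A, indexed by a group counter g:

__m512i a_bcast = _mm512_set1_epi32(packed_weights[g]);  // vpbroadcastd
// a_bcast now holds [w0 w1 w2 w3 | w0 w1 w2 w3 | ...], i.e. the same
// 4 int8 weights repeated in all 16 dword lanes.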

Let’s now talk about 2. We know which 4 rows of B these 4 nonzero values weight in this weighted sum, so we need to fetch those 4 rows. We could store the indices in an array somewhere, or we can just bake them into the code, an approach which I describe extensively in my publications SparseDNN (https://arxiv.org/abs/2101.07948) and SparseRT (https://dl.acm.org/doi/10.1145/3410463.3414654). What we are going to do here is load 16 consecutive int8 values from each of the 4 rows into AVX (xmm) registers, and then permute and shuffle them so that they eventually fill an AVX-512 register that matches the one we produced for the weights. For more information on this, please read: https://stackoverflow.com/questions/64409634/4-way-bytewise-interleave-4x-16-byte-vectors-from-memory-with-avx512.
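Here is a plain-intrinsics sketch that produces the same byte layout as the masked-broadcast / vpermt2d / vpshufb sequence in the listing, using ordinary unpack shuffles instead. The function name interleave4x16 and the pointer arguments are my own; the generated kernel uses the tighter sequence shown above.

#include <immintrin.h>
#include <stdint.h>

// p0..p3 point at 16 consecutive int8 values in the 4 selected rows of B.
// The result has, in dword lane k, the bytes [B[r0][j+k], B[r1][j+k],
// B[r2][j+k], B[r3][j+k]], matching the broadcast weight register.
static inline __m512i interleave4x16(const int8_t *p0, const int8_t *p1,
                                     const int8_t *p2, const int8_t *p3)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *)p0);
    __m128i r1 = _mm_loadu_si128((const __m128i *)p1);
    __m128i r2 = _mm_loadu_si128((const __m128i *)p2);
    __m128i r3 = _mm_loadu_si128((const __m128i *)p3);

    // Byte-interleave r0/r1 and r2/r3, then word-interleave the results.
    __m128i lo01 = _mm_unpacklo_epi8(r0, r1);
    __m128i hi01 = _mm_unpackhi_epi8(r0, r1);
    __m128i lo23 = _mm_unpacklo_epi8(r2, r3);
    __m128i hi23 = _mm_unpackhi_epi8(r2, r3);
    __m128i q0 = _mm_unpacklo_epi16(lo01, lo23);  // columns j+0  .. j+3
    __m128i q1 = _mm_unpackhi_epi16(lo01, lo23);  // columns j+4  .. j+7
    __m128i q2 = _mm_unpacklo_epi16(hi01, hi23);  // columns j+8  .. j+11
    __m128i q3 = _mm_unpackhi_epi16(hi01, hi23);  // columns j+12 .. j+15

    // Concatenate the four 128-bit pieces into one 512-bit register.
    __m512i out = _mm512_castsi128_si512(q0);
    out = _mm512_inserti32x4(out, q1, 1);
    out = _mm512_inserti32x4(out, q2, 2);
    out = _mm512_inserti32x4(out, q3, 3);
    return out;
}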

Now 3. We actually perform the VNNI instruction and accumulate the results into the 16 int32 lanes of the accumulator register.
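With the two operands in place, step 3 is a single intrinsic call. The variable names are carried over from the sketches above; note that vpdpbusd treats one multiplicand as unsigned bytes and the other as signed bytes, so which of A and B goes in which slot depends on your quantization scheme.

// acc holds 16 running int32 partial sums for 16 consecutive columns of
// row i of the output; a_bcast and b_packed come from steps 1 and 2.
// Here b_packed is assumed to be the unsigned operand and a_bcast the signed one.
acc = _mm512_dpbusd_epi32(acc, b_packed, a_bcast);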

Note two things. First, since we are processing sparse matrix A one row at a time, there are no problems with different rows having a different number of nonzeros (though having block structure would improve performance; can you see why?). Second, and much more importantly, we are now doing consecutive loads from the dense matrix, even though we are skipping to different rows. This is still a much better access pattern than the almost purely random accesses of the naive approach.

We can go ahead and produce one of these code blocks for every group of four nonzeros in every row, and concatenate all of them to produce our final sparse matrix multiply kernel, tailor-made for sparse matrix A. This is basically what SparseDNN does for floating-point sparse matrices.
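To give a flavor of what the generated code looks like, here is a hypothetical block for a single group of 4 nonzeros, with the packed weights and the 4 row offsets of B baked in as literals. All constants below are made up; interleave4x16 is the helper from the step-2 sketch, and ldb is the row stride of B.

// One generated block: constants are known at code-generation time,
// so there are no index arrays to load at run time.
static inline __m512i one_group(__m512i acc, const int8_t *B, long ldb, long j)
{
    __m512i a_bcast  = _mm512_set1_epi32(0x0312FE05);    // 4 packed int8 weights of A
    __m512i b_packed = interleave4x16(B + 17 * ldb + j,  // row indices 17, 42, 55, 63
                                      B + 42 * ldb + j,  // of B, baked into the code
                                      B + 55 * ldb + j,
                                      B + 63 * ldb + j);
    return _mm512_dpbusd_epi32(acc, b_packed, a_bcast);  // accumulate 16 int32 outputs
}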

How does this approach perform? On pure unstructured sparse int8 matrices, it achieves a 50% speedup at 90% sparsity and breaks even at around 80% sparsity. If you are familiar with the SparseDNN work, this level of speedup is worse than what one can achieve with floating point: in SparseDNN, I was able to achieve a 3x speedup at 90% unstructured sparsity.

So is there a way to improve our sparse int8 kernel? Will adding block structure improve the performance? Please stay tuned for Part 2.
