SIMD Assembly Optimization
July 10, 2017 Erik van Bilsen
We get close to the metal as we demonstrate how you can incorporate assembly-optimized SIMD routines into your Delphi apps. This can be done by using inline assembly code for Intel platforms, or by creating and linking a static library for ARM platforms.
Nowadays, with optimizing and vectorizing compilers, few people still hand-write assembly code and it is becoming a lost art. This is a bit of a shame, since performance-critical parts of your code can still hugely benefit from assembly-optimized routines, especially if those routines can take advantage of Single Instruction, Multiple Data (SIMD) operations. SIMD, as the name implies, can perform operations on multiple pieces of data at the same time using only a single instruction.
Why?
The most advanced C and C++ compilers support “automatic vectorization” and will use SIMD instructions automatically when they see an opportunity to do so. However, those compilers are still not smart enough to outperform (or even come close to the performance of) hand-optimized routines. Delphi is not a vectorizing compiler, so if you want to take advantage of SIMD instructions, you will have to write some assembly code.
At Grijjy, we use hand-optimized SIMD code in a couple of places, from improving the speed of camera and video capture on iOS and Android by up to a factor of 8 to optimizing the core parts of a custom video decoder. All these routines have one thing in common: they operate on large blocks of similar data (images and video). And SIMD instructions are very well suited for this.
But SIMD can be very beneficial for “smaller” problems as well. For example (warning: shameless plug follows), my personal project makes heavy use of SIMD to increase the speed of common vector and matrix operations by a factor of 4-8, and sometimes even more. This can have a big impact on math-heavy 3D visualization applications and games, for example.
We will just scratch the surface of SIMD assembly in this post, since this topic could easily fill many books. But hopefully it is enough to whet your appetite… or reinforce your decision to never touch a line of assembly code in your life. In the latter case, if you are in need of SIMD optimizations, you can always (warning: even more shameless plug follows) enlist our services.
Structure of this article
Nowadays, Delphi supports four different CPU targets: 32- and 64-bit Intel CPUs, and 32- and 64-bit ARM CPUs. Each of these has its own assembly language and set of registers. In this post we will briefly introduce the available registers on each CPU and how they are used to pass parameters and return values. Then we show three example routines that can be optimized using SIMD instructions. We use the same three examples for all CPUs.
Most CPUs nowadays have two sets of registers and two instruction sets: one for “regular” (scalar) operations, and one for SIMD (vector) operations. Since most gains come from vector operations, this article focuses mostly on SIMD registers and instructions; for scalar operations, there is little to gain from hand-written assembly these days.
The code for this article is available in our GitHub repository. It contains a simple cross-platform application that runs the three sample routines using both pure Delphi code and SIMD assembly code. It times these versions and shows how much faster the SIMD version is. It runs on Windows (32-bit and 64-bit), macOS, iOS (32-bit, 64-bit and simulator) and Android. I have not tested it on Linux since I don’t have a Linux compiler (yet).
32-bit Intel Platforms
We will start our exploration with the 32-bit Intel platform. If you are a long-time Delphi programmer, you will be most familiar with this platform, and you may even have written some assembly code for it (or at least looked at it in some RTL units).
Registers
This platform offers eight 32-bit general purpose (scalar) registers called EAX, EBX, ECX, EDX, ESP, EBP, ESI and EDI. Of these, ESP and EBP are usually used to manage the stack, leaving the other six for general use. When using Delphi’s default (register) calling convention, the first three 32-bit parameters of a routine are passed in the registers EAX, ECX and EDX (in that order), and any 32-bit function result is returned in EAX. An assembly routine is free to modify the contents of the EAX, ECX and EDX registers, but must preserve the contents of the other registers.
In addition, the platform offers eight 128-bit SIMD (vector) registers, sequentially numbered XMM0 through XMM7.
The XMM registers can contain multiple pieces of data. For example, a single XMM register can hold 16 individual bytes, 8 individual (16-bit) words, 4 individual (32-bit) integers, 2 individual 64-bit integers, 4 individual Single values or 2 individual Double values.

These XMM registers were introduced with the SSE instruction set. This article assumes your CPU supports SSE2. This is a safe assumption nowadays, since SSE2 has been available since 2001. All 64-bit Intel CPUs support SSE2 as well.
Hello SIMD
Legend has it that the first program you wrote displayed the text “Hello World”. The program after that probably contained a function to add two values together. That is where our exploration of SIMD starts as well, but instead of adding two values, we add 16 pairs of bytes together using a single instruction.
The Delphi version is simple enough:
```delphi
type
  T16Bytes = array [0..15] of Byte;

procedure AddDelphi(const A, B: T16Bytes; out C: T16Bytes);
var
  I: Integer;
begin
  for I := 0 to 15 do
    C[I] := A[I] + B[I];
end;
```
We can use the inline assembler to create an SIMD version of this routine:
```delphi
procedure AddSIMD(const A, B: T16Bytes; out C: T16Bytes);
//                      eax  edx        ecx
asm
  movdqu xmm0, [eax] // Load A into xmm0
  movdqu xmm1, [edx] // Load B into xmm1
  paddb  xmm0, xmm1  // xmm0 := xmm0 + xmm1 (16 times)
  movdqu [ecx], xmm0 // Store xmm0 into C
end;
```
All parameters are of type T16Bytes, which is an array of 16 bytes. When the size of a parameter is greater than the size of a native integer, passing it with a const qualifier will actually pass a pointer to the value instead. For var and out parameters, Delphi always passes a pointer (or reference).
So in this example, the address of A is passed in the EAX register, the address of B in the EDX register and the address of C in the ECX register.
The first MOVDQU instruction here reads as “move the 16 bytes starting at the address in EAX to the XMM0 register”. MOVDQU stands for MOVe Double Quadword Unaligned. A Double Quadword is a 128-bit value, so we are moving all 16 bytes. Likewise, a Quadword is a 64-bit value, a Doubleword is a 32-bit value, a Word is a 16-bit value and a Byte is an 8-bit value.
The U for Unaligned means that the source address (in EAX) does not have to be aligned on a 128-bit (16-byte) boundary. If you know for a fact that the parameters are aligned to 16-byte boundaries, you can use MOVDQA instead, where the A stands for Aligned. This improves performance a bit (but will result in an Access Violation if the data is not aligned). Since it is hard to manually align values in Delphi, you usually want to use MOVDQU instead.
The second MOVDQU instruction does the same and loads B into XMM1.
Next, the PADDB instruction adds the 16 bytes in XMM0 to the 16 bytes in XMM1 and stores the result in XMM0. PADDB stands for Packed ADD Bytes. Most SIMD instructions that work on integer values start with P for Packed, indicating it operates on multiple values at the same time. The B suffix means we are adding bytes together. If we wanted to treat the XMM registers as containing eight 16-bit Word values instead, then we would use the PADDW instruction.
The last line uses MOVDQU again to store the result in C.
On my Windows machine, the SIMD version is 7-10 times faster than the Delphi version. On my Mac, it is 16-19 times faster.