Delphi Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementation

FireWind


Last week, I committed new ASM implementations of our AES-PRNG, AES-CTR and AES-GCM for mORMot 2.
They process eight 128-bit blocks at once in an interleaved fashion, as permitted by the CTR chaining mode. The AES-NI opcodes (aesenc, aesenclast) are used for the AES rounds, and the GMAC of the AES-GCM mode is computed using the pclmulqdq opcode.
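
To illustrate why CTR mode permits this kind of interleaving, here is a minimal Python sketch. It is not mORMot code: block_encrypt is a stand-in keyed function (BLAKE2b, NOT AES), and the names ctr_serial and ctr_batched are hypothetical. The point is only that each keystream block depends on its counter value and nothing else, so eight of them can be computed independently and give the same output as the block-by-block loop.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for one AES block encryption (NOT AES): a keyed 16-byte PRF."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()


def counter_block(iv: int, index: int) -> bytes:
    """128-bit big-endian counter value for block number `index`."""
    return ((iv + index) % (1 << 128)).to_bytes(BLOCK, "big")


def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """XOR data with the keystream (zip stops at the shorter input)."""
    return bytes(a ^ b for a, b in zip(data, keystream))


def ctr_serial(key: bytes, iv: int, data: bytes) -> bytes:
    """Reference CTR loop: one keystream block at a time."""
    out = bytearray()
    for pos in range(0, len(data), BLOCK):
        keystream = block_encrypt(key, counter_block(iv, pos // BLOCK))
        out += xor_bytes(data[pos:pos + BLOCK], keystream)
    return bytes(out)


def ctr_batched(key: bytes, iv: int, data: bytes, lanes: int = 8) -> bytes:
    """Same result, but `lanes` independent keystream blocks per iteration."""
    out = bytearray()
    step = lanes * BLOCK
    for base in range(0, len(data), step):
        first = base // BLOCK
        # No block needs the result of a previous one: this is the property
        # that lets an asm implementation interleave eight AES pipelines.
        keystream = b"".join(
            block_encrypt(key, counter_block(iv, first + lane))
            for lane in range(lanes)
        )
        out += xor_bytes(data[base:base + step], keystream)
    return bytes(out)
```

Chained modes such as OFB cannot be batched this way, because each keystream block there is computed from the previous one.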

Resulting performance is amazing: on my simple Core i3, I reach 2.6 GB/s for aes-128-ctr, and 1.5 GB/s for aes-128-gcm for instance - the first being actually faster than OpenSSL!
AES-CTR is the basic chaining mode used by both our AES-PRNG and our AES-GCM implementations.

mORMot 2​

For mORMot 2, we refactored the SynCrypto.pas unit into [link]:
  • All assembly code has been moved to dedicated [link] and [link] include files;
  • A generic catalog of AES algorithms has been implemented, which lets you look them up by name (e.g. 'aes-128-ctr'), and also switch to the fastest implementation available, e.g. if OpenSSL is enabled;
  • The regression tests have been enhanced, to include validation against test vectors for all modes, and comparison with the OpenSSL reference implementation;
  • A lot of low-level optimizations have been applied, especially targeting x86_64, which is now (sometimes much) faster than the original, highly tuned i386 code - in fact, we focus on x86_64, which is our main target for high-end Linux services compiled with FPC;
  • Still as a stand-alone Delphi/FPC unit, with no external .dll to download, search and load.
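
As a rough illustration of the catalog idea in the list above (lookup by name, then pick the fastest registered implementation), here is a small Python sketch. Everything in it is hypothetical and is not the mORMot API: the AesAlgo tuple, parse_algo_name, the Catalog class, the provider labels and the speed ranks are invented for the example.

```python
from typing import Dict, List, NamedTuple, Tuple

# Mode suffixes as they appear in the benchmark names of this post.
MODES = ("cbc", "cfb", "ofb", "ctr", "gcm", "ctc")
KEY_BITS = (128, 192, 256)


class AesAlgo(NamedTuple):
    key_bits: int
    mode: str


def parse_algo_name(name: str) -> AesAlgo:
    """Split a name such as 'aes-128-ctr' into key size and chaining mode."""
    parts = name.lower().split("-")
    if len(parts) != 3 or parts[0] != "aes" or not parts[1].isdigit():
        raise ValueError(f"not an AES algorithm name: {name!r}")
    algo = AesAlgo(int(parts[1]), parts[2])
    if algo.key_bits not in KEY_BITS or algo.mode not in MODES:
        raise ValueError(f"unsupported AES algorithm: {name!r}")
    return algo


class Catalog:
    """Hypothetical registry: several providers may implement one algorithm."""

    def __init__(self) -> None:
        self._impls: Dict[AesAlgo, List[Tuple[int, str]]] = {}

    def register(self, name: str, provider: str, speed_rank: int) -> None:
        """Record a provider; a lower speed_rank means a faster provider."""
        self._impls.setdefault(parse_algo_name(name), []).append(
            (speed_rank, provider)
        )

    def fastest(self, name: str) -> str:
        """Return the provider with the best rank for the named algorithm."""
        candidates = self._impls.get(parse_algo_name(name))
        if not candidates:
            raise KeyError(name)
        return min(candidates)[1]
```

With such a registry, an optional provider (for instance OpenSSL, when it is present) only needs to register itself with a better rank to be selected for the algorithms where it wins.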
Here are some numbers, extracted from the unit comments, measured during the regression tests over blocks of several sizes (not only huge buffers).

AES-CTR​

On x86_64 we use an 8*128-bit interleaved optimized asm:
  • mormot aes-128-ctr in 1.99ms i.e. 1254390/s or 2.6 GB/s
  • mormot aes-256-ctr in 2.64ms i.e. 945179/s or 1.9 GB/s
It is actually faster than OpenSSL 1.1.1 in our benchmarks
  • openssl aes-128-ctr in 2.23ms i.e. 1121076/s or 2.3 GB/s
  • openssl aes-256-ctr in 2.80ms i.e. 891901/s or 1.8 GB/s
As a reference, the optimized but not interleaved OFB asm is about 3 times slower:
  • mormot aes-128-ofb in 6.88ms i.e. 363002/s or 772.5 MB/s
  • mormot aes-256-ofb in 9.37ms i.e. 266808/s or 567.8 MB/s
On i386, numbers are slower for our classes, which are not interleaved:
  • mormot aes-128-ctr in 10ms i.e. 249900/s or 531.8 MB/s
  • mormot aes-256-ctr in 12.47ms i.e. 200368/s or 426.4 MB/s
  • openssl aes-128-ctr in 3.01ms i.e. 830288/s or 1.7 GB/s
  • openssl aes-256-ctr in 3.52ms i.e. 709622/s or 1.4 GB/s
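
The ratios behind statements like "3 times slower" can be recomputed from the timings listed above. The short Python sketch below only copies the published x86_64 and i386 timings for aes-128 and aes-256 into a table and divides them; the dictionary layout and the ratio helper are illustrative assumptions, not part of any benchmark tool.

```python
# Timings in milliseconds, copied from the AES-CTR lists in this post.
TIMINGS_MS = {
    ("x86_64", "mormot", "aes-128-ctr"): 1.99,
    ("x86_64", "mormot", "aes-256-ctr"): 2.64,
    ("x86_64", "openssl", "aes-128-ctr"): 2.23,
    ("x86_64", "openssl", "aes-256-ctr"): 2.80,
    ("x86_64", "mormot", "aes-128-ofb"): 6.88,
    ("x86_64", "mormot", "aes-256-ofb"): 9.37,
    ("i386", "mormot", "aes-128-ctr"): 10.0,
    ("i386", "mormot", "aes-256-ctr"): 12.47,
    ("i386", "openssl", "aes-128-ctr"): 3.01,
    ("i386", "openssl", "aes-256-ctr"): 3.52,
}


def ratio(slower: tuple, faster: tuple) -> float:
    """How many times longer `slower` took than `faster` on the same test."""
    return TIMINGS_MS[slower] / TIMINGS_MS[faster]


if __name__ == "__main__":
    pairs = [
        (("x86_64", "mormot", "aes-128-ofb"), ("x86_64", "mormot", "aes-128-ctr")),
        (("x86_64", "openssl", "aes-128-ctr"), ("x86_64", "mormot", "aes-128-ctr")),
        (("i386", "mormot", "aes-128-ctr"), ("i386", "openssl", "aes-128-ctr")),
    ]
    for slower, faster in pairs:
        print(slower, "vs", faster, f"-> {ratio(slower, faster):.2f}x")
```

On these figures the interleaved CTR code is roughly 3.5 times faster than OFB on x86_64 and about 12% faster than OpenSSL for aes-128-ctr, while on i386 OpenSSL keeps a lead of a bit more than 3 times.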

AES-GCM​

On x86_64, our TAesGcm class is 8x interleaved for both GMAC and AES-CTR:
  • mormot aes-128-gcm in 3.45ms i.e. 722752/s or 1.5 GB/s
  • mormot aes-256-gcm in 4.11ms i.e. 607385/s or 1.2 GB/s
OpenSSL is faster since it performs GMAC and AES-CTR in a single pass:
  • openssl aes-128-gcm in 2.86ms i.e. 874125/s or 1.8 GB/s
  • openssl aes-256-gcm in 3.43ms i.e. 727590/s or 1.5 GB/s
On i386, numbers are much lower, since it lacks interleaved asm - but still faster than any other Delphi alternative:
  • mormot aes-128-gcm in 15.86ms i.e. 157609/s or 335.4 MB/s
  • mormot aes-256-gcm in 18.23ms i.e. 137083/s or 291.7 MB/s
  • openssl aes-128-gcm in 5.49ms i.e. 455290/s or 0.9 GB/s
  • openssl aes-256-gcm in 6.11ms i.e. 408630/s or 869.6 MB/s
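
For readers wondering how a GMAC can be interleaved at all, given that GHASH is defined as a serial chain, here is a Python sketch of one standard approach: precompute the powers of the hash key H and combine several blocks per step. This is a generic illustration of the algebra, not a description of the mORMot asm; the function names are hypothetical, and a real implementation would do the multiplications with pclmulqdq instead of the bit-by-bit loop used here (which follows the GF(2^128) multiplication of NIST SP 800-38D).

```python
from typing import List

R = 0xE1 << 120  # reduction constant of the GCM field, per NIST SP 800-38D


def gf_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128), GCM bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # bit 0 is the leftmost bit in GCM
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_serial(h: int, blocks: List[int]) -> int:
    """Textbook GHASH chain: Y = (Y xor X) * H for each block X."""
    y = 0
    for x in blocks:
        y = gf_mul(y ^ x, h)
    return y


def ghash_aggregated(h: int, blocks: List[int], lanes: int = 8) -> int:
    """Same value, consuming up to `lanes` blocks per step via powers of H."""
    powers = [h]  # powers[i] holds H^(i+1)
    for _ in range(lanes - 1):
        powers.append(gf_mul(powers[-1], h))
    y = 0
    for base in range(0, len(blocks), lanes):
        group = blocks[base:base + lanes]
        n = len(group)
        # (Y xor X1)*H^n xor X2*H^(n-1) xor ... xor Xn*H: the n products
        # are independent of each other, so they can run in parallel.
        acc = gf_mul(y ^ group[0], powers[n - 1])
        for k in range(1, n):
            acc ^= gf_mul(group[k], powers[n - 1 - k])
        y = acc
    return y
```

Both functions return the same tag input for any block list; the aggregated form simply trades a few precomputed powers of H for independent multiplications.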

Other AES modes​

As you may see from the recent commits, and the numbers in the source code, almost all of our AES classes (e.g. OFB and CFB) have had their performance enhanced, sometimes by a large margin.
A new TAesCtrCrc class has been added. It combines AES-CTR with 4 parallel crc32c checksums, of both the encrypted and the decrypted content.
It results in an [link] with 256 bits of associated authentication, which outperforms AES-GCM in our implementation.
On x86_64 we use an 8*128-bit interleaved optimized asm:
  • mormot aes-128-ctc in 2.58ms i.e. 967492/s or 2 GB/s
  • mormot aes-256-ctc in 3.13ms i.e. 797702/s or 1.6 GB/s
(to be compared with the CTR numbers above, without the 256-bit crc32c MAC computation, at 2.6 GB/s and 1.9 GB/s)
On i386, numbers are lower, because the code is not interleaved:
  • mormot aes-128-ctc in 9.76ms i.e. 256068/s or 544.9 MB/s
  • mormot aes-256-ctc in 12.14ms i.e. 205930/s or 438.2 MB/s
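
To make the "256-bit" figure concrete: four 32-bit crc32c values give 128 bits, and computing them over both the plain and the encrypted content gives 2 * 128 = 256 bits. The Python sketch below shows one possible lane layout. It is an assumption made for illustration, not the actual TAesCtrCrc layout: here each 16-byte block is split into four 4-byte words, and word number n always feeds lane number n. The crc32c function is a plain bitwise implementation of the Castagnoli polynomial, and the names crc128 and mac256 are hypothetical.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), chainable through the `crc` argument."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc128(data: bytes) -> bytes:
    """Four independent crc32c lanes over the 32-bit words of each block.

    Assumed layout (not taken from mORMot): word n of every 16-byte block
    updates lane n, so the four lanes never depend on each other.
    """
    lanes = [0, 0, 0, 0]
    for pos in range(0, len(data), 16):
        block = data[pos:pos + 16]
        for lane in range(4):
            lanes[lane] = crc32c(block[4 * lane:4 * lane + 4], lanes[lane])
    return b"".join(value.to_bytes(4, "little") for value in lanes)


def mac256(plain: bytes, encrypted: bytes) -> bytes:
    """128 bits over the plain content plus 128 bits over the encrypted one."""
    return crc128(plain) + crc128(encrypted)
```

Because the lanes are independent, they can be updated side by side with the hardware crc32 instruction while the AES pipeline is busy, which fits the small overhead shown by the numbers above.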
For internal communication, e.g. for our WebSockets services, it is a very good algorithm, especially for small messages, since it needs less warmup than AES-GCM.
Here are some numbers of our ECDHE stream protocol:
  • efAesCrc128 in 1.57ms i.e. 63,331/s, aver. 15us, 1.1 GB/s
  • efAesCfb128 in 1.66ms i.e. 60,060/s, aver. 16us, 1 GB/s
  • efAesOfb128 in 2.52ms i.e. 39,588/s, aver. 25us, 729.9 MB/s
  • efAesCtr128 in 851us i.e. 117,508/s, aver. 8us, 2.1 GB/s
  • efAesCbc128 in 2.93ms i.e. 34,059/s, aver. 29us, 628 MB/s
  • efAesCrc256 in 2.13ms i.e. 46,926/s, aver. 21us, 865.2 MB/s
  • efAesCfb256 in 2.20ms i.e. 45,330/s, aver. 22us, 835.8 MB/s
  • efAesOfb256 in 3.38ms i.e. 29,507/s, aver. 33us, 544 MB/s
  • efAesCtr256 in 1.09ms i.e. 91,659/s, aver. 10us, 1.6 GB/s
  • efAesCbc256 in 3.33ms i.e. 30,012/s, aver. 33us, 553.3 MB/s
  • efAesGcm128 in 790us i.e. 126,582/s, aver. 7us, 2.2 GB/s
  • efAesGcm256 in 987us i.e. 101,317/s, aver. 9us, 1.8 GB/s
  • efAesCtc128 in 820us i.e. 121,951/s, aver. 8us, 2.1 GB/s
  • efAesCtc256 in 985us i.e. 101,522/s, aver. 9us, 1.8 GB/s
Note that
  • those numbers don't exactly match the other benchmarks, because we don't measure the raw AES encryption performance, but the whole encapsulation in the WebSockets frames protocol, and we test another set of message sizes;
  • the efAesGcm128/efAesGcm256 numbers above automatically used the OpenSSL library on my Ubuntu laptop, since it is faster than our TAesGcm class - so when you don't have OpenSSL installed (which is sometimes tricky on Windows), you could rely on efAesCtc128 as the WebSockets asymmetric encryption protocol.

AES PRNG​

As I wrote above, our TAesPrng class uses internally the AES-CTR mode to generate its random output stream.
The newly introduced asm was very beneficial to its 256-bit AES generator, in terms of performance.
On x86_64, it uses fast hardware AES-NI acceleration, and our 8X interleaved asm:
  • mORMot Random32 in 3.95ms i.e. 25,303,643/s, aver. 0us, 96.5 MB/s
  • mORMot FillRandom in 46us, 2 GB/s
It is actually noticeably faster than OpenSSL with the same 256-bit safety level:
  • OpenSSL Random32 in 288.71ms i.e. 346,363/s, aver. 2us, 1.3 MB/s
  • OpenSSL FillRandom in 240us, 397.3 MB/s
On i386, numbers are similar, except for FillRandom, which is not interleaved:
  • mORMot Random32 in 5.54ms i.e. 18,044,027/s, aver. 0us, 68.8 MB/s
  • mORMot FillRandom in 203us, 469.7 MB/s
  • OpenSSL Random32 in 364.24ms i.e. 274,540/s, aver. 3us, 1 MB/s
  • OpenSSL FillRandom in 371us, 257 MB/s
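
The general shape of a CTR-based generator, with a bulk call and a 32-bit call sharing one keystream, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the class CtrPrng and its method names are hypothetical, the block function is a keyed BLAKE2b stand-in (NOT AES), there is no reseeding or entropy gathering, and it must not be used as a real random source.

```python
import hashlib


class CtrPrng:
    """Toy counter-mode generator: output = F(key, 0) | F(key, 1) | ..."""

    BLOCK = 16  # same block size as AES

    def __init__(self, seed: bytes) -> None:
        # Derive a 256-bit key from the seed, mirroring a 256-bit generator.
        self._key = hashlib.sha256(seed).digest()
        self._counter = 0
        self._buffer = b""

    def _next_block(self) -> bytes:
        """Encrypt the current counter value, then increment the counter."""
        counter = self._counter.to_bytes(self.BLOCK, "big")
        self._counter += 1
        # Stand-in for AES-256 block encryption (NOT AES).
        return hashlib.blake2b(
            counter, key=self._key, digest_size=self.BLOCK
        ).digest()

    def fill_random(self, size: int) -> bytes:
        """Bulk output: the counter blocks are independent, so a real
        implementation can generate several of them at once."""
        while len(self._buffer) < size:
            self._buffer += self._next_block()
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out

    def random32(self) -> int:
        """One 32-bit value taken from the same keystream."""
        return int.from_bytes(self.fill_random(4), "little")
```

The bulk path is where an interleaved CTR implementation pays off, since whole runs of counter blocks are produced per call, whereas a 32-bit call consumes only four bytes of keystream at a time.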

Conclusion​

For years, I have suspected that we wrote the fastest AES library for Delphi and FreePascal. Now we cover even more algorithms (AES-GCM is widely used but not widely implemented in Delphi), and have pushed the performance limits even further!
We can be proud that our library outperforms the OpenSSL 1.1.1 proven codebase for most algorithms, with no .dll dependency.
Open Source rocks!
The next logical step is to work on OpenSSL integration of the TLS layer, which is welcome especially on Linux (mORMot has had a [link] for Windows for years)...
If you wish, you can download the current mORMot 2 source code from [link] and run the [link]. You could share your own numbers!

Stay tuned, and feedback is [link]!