Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementation
Last week, I committed new ASM implementations of our AES-PRNG, AES-CTR and AES-GCM for mORMot 2.
They handle eight 128-bit blocks at once in an interleaved fashion, as permitted by the CTR chaining mode. The AES-NI opcodes (aesenc, aesenclast) are used for the AES rounds, and the GMAC of the AES-GCM mode is computed using the pclmulqdq opcode.
The resulting performance is amazing: on my simple Core i3, I reach for instance 2.6 GB/s for aes-128-ctr and 1.5 GB/s for aes-128-gcm - the first being actually faster than OpenSSL!
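To illustrate why CTR permits this interleaving, here is a minimal Python sketch. It is not the mORMot asm: `block_prf` is a keyed BLAKE2b standing in for a single AES block encryption, and the 12-byte nonce plus 32-bit counter layout is an assumption made for the example. The point is only that each keystream block depends on its counter value and nothing else, so eight of them can be computed before any XOR takes place.

```python
import hashlib

BLOCK = 16  # AES block size in bytes (128 bits)


def block_prf(key: bytes, counter_block: bytes) -> bytes:
    """Stand-in for one AES block encryption (NOT AES): keyed BLAKE2b, 16-byte output."""
    return hashlib.blake2b(counter_block, key=key, digest_size=BLOCK).digest()


def counter_block(nonce: bytes, index: int) -> bytes:
    """Assumed layout: 12-byte nonce followed by a 32-bit big-endian block counter."""
    return nonce + index.to_bytes(4, "big")


def xor(data: bytes, keystream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream))


def ctr_sequential(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """One keystream block at a time."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        ks = block_prf(key, counter_block(nonce, offset // BLOCK))
        out += xor(data[offset:offset + BLOCK], ks)
    return bytes(out)


def ctr_batched(key: bytes, nonce: bytes, data: bytes, lanes: int = 8) -> bytes:
    """Up to `lanes` keystream blocks are produced before any XOR is done."""
    out = bytearray()
    step = lanes * BLOCK
    for base in range(0, len(data), step):
        chunk = data[base:base + step]
        first = base // BLOCK
        n_blocks = -(-len(chunk) // BLOCK)  # ceiling division
        # No block in the batch needs the result of another one.
        keystream = b"".join(
            block_prf(key, counter_block(nonce, first + j)) for j in range(n_blocks)
        )
        out += xor(chunk, keystream)
    return bytes(out)
```

Both functions return the same bytes for the same input, and applying either one twice gives back the original data. In a chained mode like CBC or OFB, each block needs the output of the previous one, so this kind of batching is not available.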
AES-CTR is the basic chaining mode used for:
- AES-CTR as defined by the [...] - see our TAesPrngNist class;
- AES-GCM, which includes a 128-bit GMAC, as used in [...] - see the TAesGcm class;
- and our AES-based pseudorandom number generator (PRNG), as implemented by our TAesPrng class.
mORMot 2
For mORMot 2, we refactored the SynCrypto.pas unit into [...]:
- All assembly code has been moved to dedicated [...] and [...] include files;
- A generic catalog of AES algorithms has been implemented, which allows searching them by name (e.g. 'aes-128-ctr'), and also switching to the fastest implementation available, e.g. if OpenSSL is enabled;
- The regression tests have been enhanced, to include validation against test vectors for all modes, and comparison with the OpenSSL reference implementation;
- A lot of low-level optimizations have been applied, especially targeting x86_64 which is now (sometimes much) faster than the original very tuned i386 code - in fact, we focus on x86_64 which is our main target for Linux high-end services implementation with FPC compilation;
- Still as a stand-alone Delphi/FPC unit, with no external .dll to download, search and load.
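The catalog idea can be sketched in a few lines of Python. This is not the mORMot API: `AesCatalog`, `register`, `find`, `parse_name` and the priority scheme are all hypothetical names invented for the illustration. It only shows the principle of a lookup by name where a faster provider, once registered, takes precedence over the built-in one.

```python
from typing import Callable, Dict, List, Tuple

Factory = Callable[[bytes], object]  # hypothetical: builds a cipher object from a key


def parse_name(name: str) -> Tuple[int, str]:
    """Split a name like 'aes-128-ctr' into (key size in bits, mode)."""
    cipher, bits, mode = name.lower().split("-")
    if cipher != "aes" or bits not in ("128", "192", "256"):
        raise ValueError(f"unsupported algorithm name: {name}")
    return int(bits), mode


class AesCatalog:
    """Hypothetical name -> implementation registry; highest priority wins."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Tuple[int, str, Factory]]] = {}

    def register(self, name: str, provider: str, factory: Factory, priority: int = 0) -> None:
        parse_name(name)  # reject malformed names early
        self._entries.setdefault(name.lower(), []).append((priority, provider, factory))

    def find(self, name: str) -> Tuple[str, Factory]:
        candidates = self._entries.get(name.lower())
        if not candidates:
            raise KeyError(f"no implementation registered for {name}")
        _, provider, factory = max(candidates, key=lambda entry: entry[0])
        return provider, factory
```

With such a registry, a built-in implementation would be registered at priority 0, and an OpenSSL-backed one at a higher priority only when the library has been loaded, so callers asking for 'aes-128-gcm' get whichever is considered fastest without changing their code.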
AES-CTR
On x86_64 we use an 8*128-bit interleaved optimized asm:
- mormot aes-128-ctr in 1.99ms i.e. 1254390/s or 2.6 GB/s
- mormot aes-256-ctr in 2.64ms i.e. 945179/s or 1.9 GB/s
- openssl aes-128-ctr in 2.23ms i.e. 1121076/s or 2.3 GB/s
- openssl aes-256-ctr in 2.80ms i.e. 891901/s or 1.8 GB/s
- mormot aes-128-ofb in 6.88ms i.e. 363002/s or 772.5 MB/s
- mormot aes-256-ofb in 9.37ms i.e. 266808/s or 567.8 MB/s
- mormot aes-128-ctr in 10ms i.e. 249900/s or 531.8 MB/s
- mormot aes-256-ctr in 12.47ms i.e. 200368/s or 426.4 MB/s
- openssl aes-128-ctr in 3.01ms i.e. 830288/s or 1.7 GB/s
- openssl aes-256-ctr in 3.52ms i.e. 709622/s or 1.4 GB/s
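To put the CTR figures side by side, the small Python sketch below divides the mORMot operations per second by the OpenSSL ones, using only the numbers listed above. The grouping into "first" and "second" sets simply follows the order of the lines (the first four CTR lines, then the last four); the helper name `speed_ratio` is made up for the example.

```python
# Operations per second, copied from the AES-CTR lines above.
FIRST_SET = {
    "aes-128-ctr": {"mormot": 1254390, "openssl": 1121076},
    "aes-256-ctr": {"mormot": 945179, "openssl": 891901},
}
SECOND_SET = {
    "aes-128-ctr": {"mormot": 249900, "openssl": 830288},
    "aes-256-ctr": {"mormot": 200368, "openssl": 709622},
}


def speed_ratio(entry: dict) -> float:
    """mORMot throughput relative to OpenSSL (above 1.0 means mORMot is faster)."""
    return entry["mormot"] / entry["openssl"]


for label, results in (("first set", FIRST_SET), ("second set", SECOND_SET)):
    for name, entry in results.items():
        print(f"{label}: {name} mormot/openssl = {speed_ratio(entry):.2f}")
```

A ratio above 1.0 means the mORMot code is ahead, which is the case for both key sizes in the first set and for neither in the second.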
AES-GCM
On x86_64, our TAesGcm class is 8x interleaved for both GMAC and AES-CTR:
- mormot aes-128-gcm in 3.45ms i.e. 722752/s or 1.5 GB/s
- mormot aes-256-gcm in 4.11ms i.e. 607385/s or 1.2 GB/s
- openssl aes-128-gcm in 2.86ms i.e. 874125/s or 1.8 GB/s
- openssl aes-256-gcm in 3.43ms i.e. 727590/s or 1.5 GB/s
- mormot aes-128-gcm in 15.86ms i.e. 157609/s or 335.4 MB/s
- mormot aes-256-gcm in 18.23ms i.e. 137083/s or 291.7 MB/s
- openssl aes-128-gcm in 5.49ms i.e. 455290/s or 0.9 GB/s
- openssl aes-256-gcm in 6.11ms i.e. 408630/s or 869.6 MB/s
Other AES modes
As you may see from the recent commits, and the numbers in the source code, almost all of our AES classes (e.g. OFB and CFB) have had their performance enhanced, sometimes by a large margin.
A new TAesCtrCrc class has been added. It combines AES-CTR with 4 parallel crc32c checksums, of both the encrypted and the decrypted content.
It results in an [...] with 256 bits of associated authentication, which outperforms AES-GCM in our implementation.
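The 256-bit figure follows from the description: 4 checksums of 32 bits over the decrypted content, plus 4 over the encrypted content, gives 8 x 32 = 256 bits. The Python sketch below shows one way such a value could be assembled. It is an assumption-laden illustration, not the TAesCtrCrc code: the lane layout (word k of every 16-byte block feeds checksum k), the little-endian packing and the names `lane`, `crc_lanes` and `tag256` are all hypothetical. Only the crc32c polynomial itself is standard.

```python
def _make_table() -> list:
    """Byte-wise lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def lane(data: bytes, k: int) -> bytes:
    """Assumed layout: the k-th 32-bit word of every 16-byte block."""
    return b"".join(data[i + 4 * k:i + 4 * k + 4] for i in range(0, len(data), 16))


def crc_lanes(data: bytes) -> list:
    """Four independent crc32c values, one per lane (4 x 32 = 128 bits)."""
    return [crc32c(lane(data, k)) for k in range(4)]


def tag256(plain: bytes, cipher: bytes) -> bytes:
    """128 bits over the decrypted content + 128 bits over the encrypted content."""
    words = crc_lanes(plain) + crc_lanes(cipher)
    return b"".join(w.to_bytes(4, "little") for w in words)
```

Because the four lanes do not depend on each other, the four checksums can be updated in parallel while the AES-CTR keystream is being applied, which is presumably where the speed advantage over the GMAC computation comes from.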
On x86_64 we use an 8*128-bit interleaved optimized asm:
- mormot aes-128-ctc in 2.58ms i.e. 967492/s or 2 GB/s
- mormot aes-256-ctc in 3.13ms i.e. 797702/s or 1.6 GB/s
On i386, numbers are lower, because they are not interleaved:
- mormot aes-128-ctc in 9.76ms i.e. 256068/s or 544.9 MB/s
- mormot aes-256-ctc in 12.14ms i.e. 205930/s or 438.2 MB/s
Here are some numbers of our ECDHE stream protocol:
- efAesCrc128 in 1.57ms i.e. 63,331/s, aver. 15us, 1.1 GB/s
- efAesCfb128 in 1.66ms i.e. 60,060/s, aver. 16us, 1 GB/s
- efAesOfb128 in 2.52ms i.e. 39,588/s, aver. 25us, 729.9 MB/s
- efAesCtr128 in 851us i.e. 117,508/s, aver. 8us, 2.1 GB/s
- efAesCbc128 in 2.93ms i.e. 34,059/s, aver. 29us, 628 MB/s
- efAesCrc256 in 2.13ms i.e. 46,926/s, aver. 21us, 865.2 MB/s
- efAesCfb256 in 2.20ms i.e. 45,330/s, aver. 22us, 835.8 MB/s
- efAesOfb256 in 3.38ms i.e. 29,507/s, aver. 33us, 544 MB/s
- efAesCtr256 in 1.09ms i.e. 91,659/s, aver. 10us, 1.6 GB/s
- efAesCbc256 in 3.33ms i.e. 30,012/s, aver. 33us, 553.3 MB/s
- efAesGcm128 in 790us i.e. 126,582/s, aver. 7us, 2.2 GB/s
- efAesGcm256 in 987us i.e. 101,317/s, aver. 9us, 1.8 GB/s
- efAesCtc128 in 820us i.e. 121,951/s, aver. 8us, 2.1 GB/s
- efAesCtc256 in 985us i.e. 101,522/s, aver. 9us, 1.8 GB/s
- those numbers don't exactly match the other benchmarks, because we don't measure the raw AES encryption performance, but the whole encapsulation in the WebSockets frames protocol, and we test another set of message sizes;
- the efAesGcm128/efAesGcm256 numbers above automatically used the OpenSSL library on my Ubuntu laptop, since it is faster than our TAesGcm class - so when you don't have OpenSSL installed (which is sometimes tricky on Windows), you could rely on efAesCtc128 as the WebSockets asymmetric encryption protocol.
AES PRNG
As I wrote above, our TAesPrng class internally uses the AES-CTR mode to generate its random output stream.
The newly introduced asm was very beneficial to its 256-bit AES generator, in terms of performance.
On x86_64, it uses fast hardware AES-NI acceleration, and our 8X interleaved asm:
- mORMot Random32 in 3.95ms i.e. 25,303,643/s, aver. 0us, 96.5 MB/s
- mORMot FillRandom in 46us, 2 GB/s
- OpenSSL Random32 in 288.71ms i.e. 346,363/s, aver. 2us, 1.3 MB/s
- OpenSSL FillRandom in 240us, 397.3 MB/s
- mORMot Random32 in 5.54ms i.e. 18,044,027/s, aver. 0us, 68.8 MB/s
- mORMot FillRandom in 203us, 469.7 MB/s
- OpenSSL Random32 in 364.24ms i.e. 274,540/s, aver. 3us, 1 MB/s
- OpenSSL FillRandom in 371us, 257 MB/s
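The MB/s column of the Random32 lines can be cross-checked from the calls per second, since each call returns 32 bits, i.e. 4 bytes. The Python sketch below does that conversion. It assumes that "MB" in these figures means 2^20 bytes, an assumption which happens to reproduce the listed values; the function name `random32_mb_per_s` is made up for the example.

```python
MIB = 1 << 20          # assumption: the benchmark reports MB as 2**20 bytes
BYTES_PER_CALL = 4     # Random32 returns one 32-bit value per call


def random32_mb_per_s(calls_per_second: int) -> float:
    """Convert a Random32 call rate into a byte throughput in MB/s."""
    return calls_per_second * BYTES_PER_CALL / MIB


# Call rates copied from the four Random32 lines above, in their order of appearance.
for label, rate in (
    ("mORMot", 25_303_643),
    ("OpenSSL", 346_363),
    ("mORMot", 18_044_027),
    ("OpenSSL", 274_540),
):
    print(f"{label} Random32: {random32_mb_per_s(rate):.1f} MB/s")
```

The gap between Random32 and FillRandom in the same group is therefore not a difference in the generator itself: one 4-byte value per call pays the call overhead on every 32 bits, whereas FillRandom produces a whole buffer at once.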
Conclusion
For years, I have suspected that we wrote the fastest AES library for Delphi and FreePascal. Now we cover even more algorithms (AES-GCM is widely used but not widely implemented in Delphi), and have pushed the performance limits even further!
We can be proud that our library outperforms the proven OpenSSL 1.1.1 codebase for most algorithms, with no .dll dependency.
Open Source rocks!
The next logical step is to work on OpenSSL integration of the TLS layer, which is welcome especially on Linux (we have already had a [...] for Windows for years in mORMot)...
If you wish, you can download the current mORMot 2 source code from [...] and run the [...]. You could share your own numbers!
Stay tuned, and feedback is welcome!