OpenSSL hardware acceleration

Voxel

Question to @RMerlin, @sfx2000 and maybe other experts and developers:

I’ve prepared a test build of firmware for the R9000 (Alpine AL-514, Cortex-A15) with hardware acceleration of OpenSSL using the cryptodev engine, /dev/crypto (cryptodev-linux, https://github.com/cryptodev-linux/cryptodev-linux). But the results of the OpenSSL speed test look a bit strange to me. For example:

OpenSSL w/o hardware acceleration (NO cryptodev engine is used):
Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      55305.55k    56667.01k    58129.66k    58706.94k    58594.65k
des cbc          33149.58k    34800.41k    35267.16k    35559.77k    35550.55k

OpenSSL with hardware acceleration (cryptodev IS used):
Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       1313.02k     5265.17k    20732.59k    70701.06k   305261.23k
des-cbc           1315.02k     5248.36k    20742.83k    60788.39k   216607.40k

I.e. for 16/64/256-byte blocks I get degradation instead of acceleration, but for 1K and 8K there is significant acceleration.
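
As a quick sanity check (the arithmetic below just rearranges the figures from the two tables above), the implied time per operation is block size divided by throughput:

```python
# Implied time per operation = block size / throughput.
# Throughput figures are the aes-256 rows above ("1000s of bytes
# per second", i.e. multiply by 1000 to get bytes/second).
hw = {16: 1313.02e3, 64: 5265.17e3, 256: 20732.59e3,
      1024: 70701.06e3, 8192: 305261.23e3}   # cryptodev build
sw16 = 55305.55e3                            # software build, 16 B row

for size, rate in hw.items():
    print(f"cryptodev {size:5d} B: {size / rate * 1e6:5.1f} us/op")
print(f"software     16 B: {16 / sw16 * 1e6:5.2f} us/op")
# With cryptodev, every block from 16 B to 1 KB costs roughly the
# same ~12-15 us per request: a fixed per-call cost dominates, and
# only at 8 KB does the actual encryption time become visible. The
# pure-software path needs only ~0.3 us for a 16 B block.
```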

The question is: does it make sense at all to include hardware acceleration in the firmware? I.e. won't I get a slowdown instead of a speedup for OpenVPN and other programs using OpenSSL…

Thanks for comments/hints.

Voxel.
 
It's been a while since I've looked into the internals of OpenSSL - one thing to consider is data length and memory write/reads... and cryptodev takes a different code path than the plain CPU one does.

In any event, when looking at the 1K/8K values, it looks good...

You'll see similar things on other architectures - x86 is like this when using AESNI instructions...

You can also see this with SHA1, SHA256, SHA512 in OpenSSL - again, it's how we deal with memory write/read.
 
What compile options did you use? There is a warning under OpenSSL that enabling it for digests can cause a performance penalty.

After applying the patches you can add cryptodev support by using the
-DHAVE_CRYPTODEV and -DUSE_CRYPTODEV_DIGESTS flags during compilation.
Note that the latter flag (digests) may induce a performance penalty
in some systems.
 
It's been a while since I've looked into the internals of OpenSSL - one thing to consider is data length and memory write/reads... and cryptodev takes a different code path than the plain CPU one does.

In any event, when looking at the 1K/8K values, it looks good...

You'll see similar things on other architectures - x86 is like this when using AESNI instructions...

You can also see this with SHA1, SHA256, SHA512 in OpenSSL - again, it's how we deal with memory write/read.

Thank you for the reply. AES-NI on x64, I know. But there is no such difference there; Intel Core i5, executed as "openssl speed -evp aes256 -elapsed":
Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     443168.03k   474395.09k   480630.19k   482149.03k   482639.87k

So this is why I am confused...

Voxel.
 
What compile options did you use? There is a warning under OpenSSL that enabling it for digests can cause a performance penalty.

Thanks for the reply. I used those options (as recommended). Full set of options:

Code:
compiler: arm-openwrt-linux-uclibcgnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -I/home/voxel/Netgear/R9000-V1.0.2.34HF_gpl_src/staging_dir/target-arm_uClibc-0.9.33.2_eabi/usr/include -I/home/voxel/Netgear/R9000-V1.0.2.34HF_gpl_src/staging_dir/target-arm_uClibc-0.9.33.2_eabi/include -I/home/voxel/Netgear/R9000-V1.0.2.34HF_gpl_src/staging_dir/toolchain-arm_gcc-4.8.5_uClibc-0.9.33.2_eabi/usr/include -I/home/voxel/Netgear/R9000-V1.0.2.34HF_gpl_src/staging_dir/toolchain-arm_gcc-4.8.5_uClibc-0.9.33.2_eabi/include -DNDEBUG -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DTERMIOS -O3 -pipe -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15 -mfloat-abi=hard -fhonour-copts -fpic -I/home/voxel/Netgear/R9000-V1.0.2.34HF_gpl_src/package/openssl/include -fomit-frame-pointer -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM

Voxel.
 
openssl speed -evp aes256 -elapsed

If you use the envelope, that's yet another code path in OpenSSL ;)

If you want to test engines - first discover what engines are there - "openssl engine"

You'll probably see at least "dynamic", but you might see more, depends on the build and OS/hardware support -

To test a specific engine - you have to define which engine - e.g. "openssl speed -engine rdrand aes-256-cbc"

Anyways, I wouldn't worry about the small block sizes - 8K/16K is more typical for folks looking at OpenSSL performance as part of OVPN..

Take a look at SHA performance - sha256, sha512

SHA is going to be very relevant for OVPN client users...
 
OpenSSL with hardware acceleration (cryptodev IS used):
Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       1313.02k     5265.17k    20732.59k    70701.06k   305261.23k
des-cbc           1315.02k     5248.36k    20742.83k    60788.39k   216607.40k
I.e. for 16/64/256 bytes I've got degradation instead of acceleration. But for 1K and 8K significant acceleration.

Hmmm... now thinking more about this - you're seeing a context switch between kernel and userland, and the smaller block sizes - more switches, and the hit on performance... that's because of the specific call to /dev/crypto

Why - larger blocks, less context switches...

Anyways - not sure that this is a good thing - depends on the application - but since most folks are interested in OpenSSL performance for OpenVPN...

It's going to be better to use what's in the kernel directly - the kernel knows what's there - and /dev/crypto basically enables userland access, but that adds another context switch; considering that OpenVPN is already doing the kernel-userland-kernel hops anyway, this patch just makes things worse...
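
That fixed-cost argument can be made quantitative with a simple model (illustrative only: it assumes time per request = t0 + block/r_raw, and fits t0 and r_raw from the aes-256-cbc numbers earlier in the thread):

```python
# Fit the fixed per-request overhead t0 and the raw engine rate r_raw
# from the 16 B and 8192 B cryptodev measurements, then find the block
# size where the engine overtakes the pure-software rate.
b1, r1 = 16, 1313.02e3        # cryptodev, 16 B (bytes/s)
b2, r2 = 8192, 305261.23e3    # cryptodev, 8192 B
r_sw = 58706.94e3             # software aes-256 cbc, 1024 B row

# 1/throughput = t0/b + 1/r_raw, two equations in two unknowns:
t0 = (1 / r1 - 1 / r2) / (1 / b1 - 1 / b2)
inv_raw = 1 / r2 - t0 / b2

# Break-even block size b where b / (t0 + b * inv_raw) = r_sw:
b_even = t0 / (1 / r_sw - inv_raw)
print(f"fixed overhead   ~{t0 * 1e6:.1f} us per request")
print(f"raw engine rate  ~{1 / inv_raw / 1e6:.0f} MB/s")
print(f"break-even block ~{b_even:.0f} bytes")
# ~12 us of overhead and a break-even around 800 bytes, consistent
# with the measured crossover between 256 B (slower) and 1 KB (faster).
```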

Try not to over optimize - one can go down a path and end up in a not so good place...

What do you see with gnutls, as it can also call /dev/crypto - but you have to build it with the right switches/options... and you can check things there with "gnutls-cli --benchmark-ciphers"
 
If you use the envelope, that's yet another code path in OpenSSL ;)

Yes, but as far as I know OpenVPN uses the envelope (EVP) API; otherwise no acceleration would be used. The same AES-NI, Core i5. Test executed with EVP and w/o EVP:

Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      84306.33k    92011.71k    93360.30k    93791.23k    93883.05k
aes-256-cbc     450932.94k   473753.56k   480789.76k   482333.01k   482820.10k

So w/o EVP, AES-NI just does not work.

If you want to test engines - first discover what engines are there - "openssl engine"

You'll probably see at least "dynamic", but you might see more, depends on the build and OS/hardware support -

This was my main test when I was fighting with this ;). Building /dev/crypto was almost no problem, but forcing OpenSSL to use it was... Only when I saw "cryptodev" in the list of engines did I realize that I had succeeded.

Command: /usr/bin/openssl engine
Code:
(cryptodev) cryptodev engine
(dynamic) Dynamic engine loading support

To test a specific engine - you have to define which engine - e.g. "openssl speed -engine rdrand aes-256-cbc"

That is not needed for cryptodev; it is used automatically when using EVP.

Anyways, I wouldn't worry about the small block sizes - 8K/16K is more typical for folks looking at OpenSSL performance as part of OVPN..

That is what I was interested in. Thank you :)

Take a look at SHA performance - sha256, sha512

SHA is going to be very relevant for OVPN client users...

Similar results. Faster for 8K, slower for 16/64/256. SHA512 for 8K: 162725.89k vs 106359.47k. SHA256 for 8K: 149935.45k vs 139242.15k.
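
For reference, the speedups implied by those 8K figures (just the ratios of the numbers quoted above):

```python
# Cryptodev vs. software throughput at 8 KB blocks, from the figures
# above (units cancel in the ratio).
pairs = {
    "sha512": (162725.89, 106359.47),
    "sha256": (149935.45, 139242.15),
}
for name, (hw, sw) in pairs.items():
    print(f"{name}: {hw / sw:.2f}x")
# sha512 gains ~1.53x at 8 KB, while sha256 is nearly a wash (~1.08x),
# so the digest benefit depends strongly on the algorithm.
```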

OK, thank you. I'll include this solution in the firmware.

Voxel.
 
What I found in the OpenWRT Wiki:

Cryptographic Hardware Accelerators
. . .
Due to the overhead of hardware/DMA transfers and buffer copies between kernel/user space it gives only a good return for packet sizes greater than 256 bytes.
. . .

Voxel.
 
A real test. My home Internet speed is 60/60 Mbps; not so fast, but OK for my needs. The result of testing the OpenVPN server with hardware acceleration (a computer in my office connected to the R9000 located in my home) is 56.37 Mbps download and 55.16 Mbps upload. IMO that is OK.

Voxel.
 
I would take a look at the DD-WRT SVN repo. I believe Kong enabled hardware crypto for some models a while ago, I just don't know if his change was merged upstream or it was only personal development/test code.

DD-WRT's svn is a nightmare to search through due to a totally useless commit log, so good luck finding anything in there :/
 
I would take a look at the DD-WRT SVN repo. I believe Kong enabled hardware crypto for some models a while ago, I just don't know if his change was merged upstream or it was only personal development/test code.

DD-WRT's svn is a nightmare to search through due to a totally useless commit log, so good luck finding anything in there :/

Thanks for the suggestion; I tried to browse this repo half a year ago :).

In general I already know HOW, and it's working. But I am not sure: does it make sense? On the other hand, my VPN now works very fast. Let other users try it with their VPNs. It is better to do and regret than not to do and regret ;)

Voxel.
 
In general I already know HOW, and it's working. But I am not sure: does it make sense?

I thought I saw Kong post openssl speed results that were much faster, so I thought maybe he used different build options, or applied an additional patch.
 
DD-WRT's svn is a nightmare to search through due to a totally useless commit log, so good luck finding anything in there :/

It is "interesting" - it works, but I think this is one that is just "one has to know..." - which isn't very friendly for third parties that are spelunking about for nuggets of insight...

Not the only project that has problems like that...
 
I thought I saw Kong post openssl speed results that were much faster, so I thought maybe he used different build options, or applied an additional patch.

Eric, could you please point me to where you found this info? I failed to find anything except this:

http://www.dd-wrt.com/phpBB2/viewtopic.php?t=287177&postdays=0&postorder=asc&start=0

It's not quite clear whether those results use HW acceleration or not. But my results for 8K with HW are faster:
305261.23k vs 74125.27k on that page.

I did not browse the DD-WRT repo, but extracted files from a DD-WRT binary with HW acceleration, April 27. It looks as if Kong is using another approach for HW acceleration: not cryptodev-linux, but OCF.

Thank you,
Voxel.
 
I can confirm the new firmware, 1.0.2.34HF, has increased my speeds to ~86Mbps down. On 1.0.2.33 it was ~61Mbps down. Looking good!
 
I did not browse the DD-WRT repo, but extracted files from a DD-WRT binary with HW acceleration, April 27. It looks as if Kong is using another approach for HW acceleration: not cryptodev-linux, but OCF.

Should take a deeper look... Kong has done some nice stuff, so has Brainslayer - esp. wrapped around Marvell and the SoC's for the Linksys WRT's.

I can see why they have investigated OCF (which is a port from the BSD world, BTW), as it can be more portable over the long term when supporting multiple SoC's, each having, perhaps, their own crypto accelerators... the challenge of course is drivers that can support the code.

http://ocf-linux.sourceforge.net

Risk with OCF (and perhaps cryptodev) is that it can add an additional context change between kernel and userland - which OVPN, by nature, is already doing...

Remember the packet flow - userland to kernel, back to userland as the ovpn app, and then back to kernel space driver and out... so with openssl/openvpn, make the most of the context change - and let the CPU/Accelerators do the work where they're best at.
 
What would be interesting to see - since OVPN 2.4 supports the GCM methods...

To test this - one does have to use the envelope on OpenSSL...
  • openssl speed -evp aes-128-gcm
  • openssl speed -evp aes-256-gcm

Note - OpenSSL numbers are not linear to OpenVPN throughput...

Braswell N3700 - x86-64 numbers (this CPU does support AESNI) - Braswell is an Intel low-power core - base is 1.6 GHz with Turbo to 2.4 - it's an Airmont core, which is a die shrink of Silvermont's 22 nm

Code:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm      58432.80k   109802.43k   150152.45k   167195.65k   172010.15k
aes-256-gcm      53050.14k    96761.39k   127607.04k   140015.96k   143461.03k

I'd be curious to see how Annapurna Labs Alpine compares - they're very similar...
 
I'd be curious to see how Annapurna Labs Alpine compares - they're very similar...

Code:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-gcm      50597.94k    57245.71k    68604.84k    74406.23k    75961.69k
aes-256-gcm      41875.21k    44864.34k    56065.96k    60561.41k    61500.28k

Voxel.
 
