[RT-AC86U] Accelerated crypto with /dev/crypto

Fitz Mutch

Senior Member
I've been trying to get openssl to use /dev/crypto so that it can reach the accelerated ciphers of the RT-AC86U. These are kernel modules from the mainline Linux kernel, written in 64-bit assembly for the ARMv8 CPU. However, they live in kernelspace, so I need a userspace interface such as /dev/crypto or AF_ALG to access them.
https://github.com/RMerl/asuswrt-me...nd/kernel/linux-4.1/arch/arm64/crypto/Kconfig

Enable /dev/crypto in Asuswrt-Merlin with this patch, then re-build the firmware. This patch automatically downloads the cryptodev-linux source archive and inserts it into the Asuswrt firmware build. I've only tested with RT-AC86U, but it should work for all HND platform routers.
Code:
cd ~/asuswrt-merlin.ng
patch -p1 -i ~/path/to/patches/asuswrt-cryptodev.patch

After the build, look for cryptodev.ko in the image directory, copy it to the router and modprobe it. /dev/crypto will then appear.

To load the accelerated cipher kernel modules on the RT-AC86U:
Code:
for M in /lib/modules/4.1.27/kernel/arch/arm64/crypto/*; do modprobe $(basename $M); done
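
Depending on the modprobe implementation, a file name ending in .ko may not be accepted as a module name. Below is a minimal sketch of the same loop that strips the suffix and quotes the path. The function names module_name and load_modules_in are hypothetical helpers, not part of the firmware.

```shell
#!/bin/sh
# Hypothetical helper: turn a module file path into the bare module name
# by dropping the directory part and the .ko suffix.
module_name() {
    basename "$1" .ko
}

# Load every .ko file found in the directory given as $1.
# Failures are reported on stderr and the loop keeps going.
load_modules_in() {
    for f in "$1"/*.ko; do
        [ -e "$f" ] || continue    # glob did not match anything
        modprobe "$(module_name "$f")" || echo "failed to load $f" >&2
    done
}
```

Usage would be: load_modules_in /lib/modules/4.1.27/kernel/arch/arm64/crypto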

Here are the results from the AES example. Apparently it's not using the accelerated ciphers, but at least /dev/crypto works!
Code:
Got cbc(aes) with driver cbc-aes-iproc
Note: This is not an accelerated cipher
AES Test passed

The AES example picked the cbc-aes-iproc driver, not the ARMv8 Crypto Extensions cipher:
Code:
# cat /proc/crypto | grep -F cbc-aes-iproc
driver       : cbc-aes-iproc
driver       : cbc-aes-iproc
driver       : authenc-hmac-sha512-cbc-aes-iproc
driver       : authenc-hmac-sha384-cbc-aes-iproc
driver       : authenc-hmac-sha256-cbc-aes-iproc
driver       : authenc-hmac-sha224-cbc-aes-iproc
driver       : authenc-hmac-sha1-cbc-aes-iproc
driver       : authenc-hmac-md5-cbc-aes-iproc

The accelerated cipher I want to test via /dev/crypto is "AES core cipher using ARMv8 Crypto Extensions". Does anyone know how to test it with openssl or another program?
Code:
# cat /proc/crypto | grep -F cbc-aes-ce
driver       : cbc-aes-ce
 

Attachments

  • asuswrt-cryptodev.patch.txt
I got cryptodev + Broadcom's crypto driver working a few months ago through manual tinkering. However, I misplaced my notes on how I did it, so I can't help much, but I can at least tell you that it's possible. I even posted test results on the VPN forums here. Prepare to be disappointed if your goal is to improve OpenVPN performance.

What I can tell you is that you want the bcmspu.ko kernel module to be loaded. It adds a new userspace interface which you can read to see how many blocks it has processed, confirming that it's getting used. I also can't remember which block/char device I was reading to get the stats data.
 
Once I'm done with 384.3 (things got delayed because of a Sourceforge data center migration that took longer than expected), 380.70, migrating the RT-AC5300 and RT-AC87U, merging 382_50010 and 384_20379, debugging all of it, taking a look at the AiMesh code, then maybe, just MAYBE I'll find some time to revisit this...
 
I have /dev/spu, it's already loaded. Is it open source?
https://github.com/RMerl/asuswrt-me...5.02hnd/kernel/linux-4.1/config_base.6a#L2827

Also, the CONFIG_BCM_CRYPTODEV setting doesn't appear to do anything; it doesn't create /dev/crypto.

No, bcmspu.ko is closed source. It's in router/hnd_extra/prebuilt/ .

Load the cryptodev.ko module, then the bcmspu.ko module, then check /proc/crypto. It will show entries like this one, confirming that the bcmspu module is used:

Code:
name         : cbc(aes)
driver       : cbc-aes-iproc
module       : bcmspu
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>

I just tracked down that bcmspu stats interface I was talking about:

Code:
admin@Stargate86:/tmp/mnt/sda1# cat /sys/kernel/debug/bcmspu/stats 
Number of SPUs.........1
Number of channels.....1
Current sessions.......0
Session count..........0
Cipher setkey..........0
HMAC setkey............0
AEAD setkey............0
Cipher Ops.............0
Hash Ops...............0
HMAC Ops...............0
AEAD Ops...............0
AEAD fallback Ops......0
Bytes of req data......0
Bytes of resp data.....0
Message send failures..0
Check ICV errors.......0

If you load bcmspu and then tell openssl to use the cryptodev engine, the stats should increase there. Unfortunately I no longer have my special openssl build with engine support enabled to rerun the tests. Basically, what I did was compile the cryptodev module and recompile OpenSSL with engine support. Then I copied /usr/lib/* to the USB stick, overwrote the OpenSSL files with my special build, and bind-mounted that folder on top of /usr/lib/. That allowed me to run tests without having to recompile the whole firmware (enabling engines would conflict with other pieces of the firmware, preventing me from building a complete image).
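
The overlay trick described above could be sketched roughly as follows. This assumes the directory being shadowed is /usr/lib. Both arguments are hypothetical paths: a scratch directory on the USB stick and a directory holding the rebuilt OpenSSL libraries. Setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Sketch only: shadow /usr/lib with a copy that contains a custom OpenSSL build.
#   $1 = scratch directory on the USB stick (hypothetical path)
#   $2 = directory with the rebuilt libssl/libcrypto (hypothetical path)
overlay_usr_lib() {
    usb_lib=$1
    build_lib=$2
    run=${DRY_RUN:+echo}    # with DRY_RUN set, print commands instead of running them
    $run mkdir -p "$usb_lib"
    $run cp -a /usr/lib/. "$usb_lib"/
    $run cp -a "$build_lib"/libssl.so* "$build_lib"/libcrypto.so* "$usb_lib"/
    $run mount -o bind "$usb_lib" /usr/lib
}
```

Running umount /usr/lib afterwards should bring the stock libraries back, since the firmware image itself is never modified.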
 
... confirming that the bcmspu module is used
I included the stats for BCM SPU in my results for the Cryptsetup benchmark.

FYI, the following commands crash my router.
Code:
modprobe -r bcmspu
cat /sys/kernel/debug/bcmspu/stats
 
Here are the results of the Cryptsetup 2.0.1 benchmark on the RT-AC86U.

Cryptsetup uses AF_ALG, which is a socket interface from userspace to the kernel's crypto API.

Linux kernel Cryptographic API (not accelerated)
Code:
modprobe -r cryptodev
modprobe -r bcmspu
for M in /lib/modules/4.1.27/kernel/arch/arm64/crypto/*; do modprobe -r $(basename $M); done
modprobe algif_skcipher
cryptsetup benchmark

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       304818 iterations per second for 256-bit key
PBKDF2-sha256     557753 iterations per second for 256-bit key
PBKDF2-sha512     189959 iterations per second for 256-bit key
PBKDF2-ripemd160  189959 iterations per second for 256-bit key
PBKDF2-whirlpool   18618 iterations per second for 256-bit key
argon2i       4 iterations, 74825 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      4 iterations, 75889 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b    46.9 MiB/s    48.8 MiB/s
    serpent-cbc   128b           N/A           N/A
    twofish-cbc   128b           N/A           N/A
        aes-cbc   256b    35.0 MiB/s    36.8 MiB/s
    serpent-cbc   256b           N/A           N/A
    twofish-cbc   256b           N/A           N/A
        aes-xts   256b    50.3 MiB/s    49.9 MiB/s
    serpent-xts   256b           N/A           N/A
    twofish-xts   256b           N/A           N/A
        aes-xts   512b    36.9 MiB/s    37.3 MiB/s
    serpent-xts   512b           N/A           N/A
    twofish-xts   512b           N/A           N/A

cat /proc/crypto | grep -F -A 10 "aes"

name         : aes
driver       : aes-generic
module       : kernel
priority     : 100
refcnt       : 2
selftest     : passed
internal     : no
type         : cipher
blocksize    : 16
min keysize  : 16
max keysize  : 32

Broadcom SPU driver
Code:
modprobe -r cryptodev
for M in /lib/modules/4.1.27/kernel/arch/arm64/crypto/*; do modprobe -r $(basename $M); done
modprobe bcmspu
modprobe algif_skcipher
cryptsetup benchmark

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       304818 iterations per second for 256-bit key
PBKDF2-sha256     557753 iterations per second for 256-bit key
PBKDF2-sha512     189959 iterations per second for 256-bit key
PBKDF2-ripemd160  189959 iterations per second for 256-bit key
PBKDF2-whirlpool   18512 iterations per second for 256-bit key
argon2i       4 iterations, 75848 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      4 iterations, 76509 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b   216.6 MiB/s   217.5 MiB/s
    serpent-cbc   128b           N/A           N/A
    twofish-cbc   128b           N/A           N/A
        aes-cbc   256b   233.7 MiB/s   231.3 MiB/s
    serpent-cbc   256b           N/A           N/A
    twofish-cbc   256b           N/A           N/A
        aes-xts   256b   215.9 MiB/s   214.6 MiB/s
    serpent-xts   256b           N/A           N/A
    twofish-xts   256b           N/A           N/A
        aes-xts   512b   228.5 MiB/s   228.4 MiB/s
    serpent-xts   512b           N/A           N/A
    twofish-xts   512b           N/A           N/A

cat /proc/crypto | grep -F -A 14 " cbc(aes)"

name         : cbc(aes)
driver       : cbc-aes-iproc
module       : bcmspu
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>

cat /sys/kernel/debug/bcmspu/stats

Number of SPUs.........1
Number of channels.....1
Current sessions.......0
Session count..........1780
Cipher setkey..........1780
HMAC setkey............0
AEAD setkey............0
Cipher Ops.............28480
Hash Ops...............0
HMAC Ops...............0
AEAD Ops...............0
AEAD fallback Ops......0
Bytes of req data......1866465280
Bytes of resp data.....1866465280
Message send failures..0
Check ICV errors.......0

Linux kernel ARM64 Accelerated Cryptographic Algorithms
Code:
modprobe -r cryptodev
modprobe -r bcmspu
for M in /lib/modules/4.1.27/kernel/arch/arm64/crypto/*; do modprobe $(basename $M); done
modprobe algif_skcipher
cryptsetup benchmark

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       308404 iterations per second for 256-bit key
PBKDF2-sha256     557753 iterations per second for 256-bit key
PBKDF2-sha512     189959 iterations per second for 256-bit key
PBKDF2-ripemd160  189959 iterations per second for 256-bit key
PBKDF2-whirlpool   18306 iterations per second for 256-bit key
argon2i       4 iterations, 75648 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      4 iterations, 76496 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b   480.5 MiB/s   608.9 MiB/s
    serpent-cbc   128b           N/A           N/A
    twofish-cbc   128b           N/A           N/A
        aes-cbc   256b   424.0 MiB/s   564.5 MiB/s
    serpent-cbc   256b           N/A           N/A
    twofish-cbc   256b           N/A           N/A
        aes-xts   256b   545.1 MiB/s   550.5 MiB/s
    serpent-xts   256b           N/A           N/A
    twofish-xts   256b           N/A           N/A
        aes-xts   512b   504.8 MiB/s   505.4 MiB/s
    serpent-xts   512b           N/A           N/A
    twofish-xts   512b           N/A           N/A

cat /proc/crypto | grep -F -A 14 " cbc(aes)"

name         : cbc(aes)
driver       : cbc-aes-ce
module       : kernel
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : givcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : eseqiv

--
name         : cbc(aes)
driver       : cbc-aes-neon
module       : aes_neon_blk
priority     : 200
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>

--
name         : cbc(aes)
driver       : cbc-aes-ce
module       : aes_ce_blk
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>
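
A note on reading these entries: as far as I understand it, the kernel crypto API resolves a generic name such as cbc(aes) to the registered implementation with the highest priority. That would explain why cbc-aes-iproc (priority 400) is picked over cbc-aes-ce (300) and cbc-aes-neon (200) whenever bcmspu is loaded. The sketch below lists the candidates for one algorithm name, highest priority first. crypto_impls is a hypothetical helper name.

```shell
#!/bin/sh
# List "priority driver module" for every /proc/crypto record whose name
# equals $1, highest priority first. Reads /proc/crypto-style text on stdin.
crypto_impls() {
    awk -v want="$1" '
        # A record is a run of "key : value" lines ended by a blank line.
        function emit() {
            if (name == want) print prio, driver, module
            name = ""; driver = ""; module = ""; prio = ""
        }
        /^name[ \t]*:/     { name   = $3 }
        /^driver[ \t]*:/   { driver = $3 }
        /^module[ \t]*:/   { module = $3 }
        /^priority[ \t]*:/ { prio   = $3 }
        /^[ \t]*$/         { emit() }
        END                { emit() }
    ' | sort -rn
}
```

On the router: crypto_impls 'cbc(aes)' < /proc/crypto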
 
Broadcom's driver truly shines when you feed it large blocks. I was getting ludicrous results in openssl speed tests with 256 KB blocks.
 
I included the stats for BCM SPU in my results for the Cryptsetup benchmark.

FYI, the following commands crash my router.
Code:
modprobe -r bcmspu
cat /sys/kernel/debug/bcmspu/stats

Odd, those were working fine on my own RT-AC86U when I tested them before posting.
 
I dug out my original cryptodev build info. Will do some tinkering with it tonight.

The part you are missing is compiling OpenSSL with cryptodev support. Here's the Makefile diff (everything is hardcoded for now, would eventually need to be configurable through a target.mak flag):

Code:
@@ -1235,6 +1235,7 @@ obj-$(RTCONFIG_OPENVPN) += ministun
 obj-$(RTCONFIG_DNSSEC) += nettle
 obj-$(RTCONFIG_SAMBA36X) += libiconv-1.14
 obj-$(RTCONFIG_TELENET) += lanauth
+obj-y += cryptodev-linux

 ifneq ($(HND_ROUTER),y)
 obj-y += cstats
@@ -2470,6 +2471,7 @@ openssl/Makefile:
        cd openssl && \
        ./Configure $(HOSTCONFIG) --prefix=/usr --openssldir=/etc --cross-compile-prefix=' ' \
        -ffunction-sections -fdata-sections -Wl,--gc-sections \
+       -I$(TOP)/cryptodev-linux/ -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS \
        shared $(OPENSSL_CIPHERS) no-ssl2 no-ssl3
 #      no-sha0 no-smime no-camellia no-krb5 no-rmd160 no-ripemd \
 #      no-seed no-capieng no-cms no-gms no-gmp no-rfc3779 \
@@ -7021,6 +7023,15 @@ portmap: portmap/Makefile
 portmap-install:
        install -D portmap/portmap $(INSTALLDIR)/portmap/usr/sbin/portmap
        $(STRIP) $(INSTALLDIR)/portmap/usr/sbin/portmap
+
+cryptodev-linux:
+       $(MAKE) -C $@ KERNEL_DIR=$(LINUXDIR) CC=$(CROSS_COMPILE_64)gcc RANLIB=$(CROSS_COMPILE_64)ranlib CFLAGS="-O3" \
+               prefix=$(LINUX_INC_DIR) CCOPTS="-march=armv8-a -mabi=lp64" ARCH="arm64" LD=$(CROSS_COMPILE_64)ld
+#              LDFLAGS="-ffunction-sections -fdata-sections -Wl,--gc-sections" CROSS_COMPILE=$(CROSS_COMPILE_64)
+
+cryptodev-linux-install:
+       @cp -f cryptodev-linux/cryptodev.ko $(INSTALLDIR)/../modules/lib/modules/4.1.27/extra/
+
 # End merlin components

 readline-6.2/Makefile: readline-6.2/configure

Benchmarks will be somewhat different from your results, since OpenSSL already has built-in ARMv7-optimized code.
 
Final tests for tonight...

Code:
admin@Stargate86:/tmp/home/root# time openvpn --test-crypto --secret /tmp/secret --verb 0 --tun-mtu 20000 --cipher aes-256-cbc
Sat Feb 17 04:27:16 2018 disabling NCP mode (--ncp-disable) because not in P2MP client or server mode
real    0m 14.52s
user    0m 14.42s
sys    0m 0.00s
admin@Stargate86:/tmp/home/root# time openvpn --test-crypto --secret /tmp/secret --verb 0 --tun-mtu 20000 --cipher aes-256-cbc --engine cryptodev
Sat Feb 17 04:27:37 2018 disabling NCP mode (--ncp-disable) because not in P2MP client or server mode
real    0m 14.95s
user    0m 13.93s
sys    0m 0.92s
admin@Stargate86:/tmp/home/root# modprobe bcmspu
admin@Stargate86:/tmp/home/root# time openvpn --test-crypto --secret /tmp/secret --verb 0 --tun-mtu 20000 --cipher aes-256-cbc --engine cryptodev
Sat Feb 17 04:28:03 2018 disabling NCP mode (--ncp-disable) because not in P2MP client or server mode
real    0m 16.52s
user    0m 14.16s
sys    0m 0.55s

The first benchmark is "raw" OpenSSL, with its own ARMv7 asm optimisations.

Second is cryptodev + AES-CE (ARM64 + NEON)

Third is Broadcom's engine.

While raw benchmarks might give the edge to the second and third, I suspect the context switch between userspace and the kernel is what kills performance, compared with a 100% userland solution like native OpenSSL (the first benchmark).

If you want the OpenSSL speed tests themselves, here they are (in the same order):

Code:
admin@Stargate86:/tmp/home/root# rmmod bcmspu
admin@Stargate86:/tmp/home/root# rmmod cryptodev
admin@Stargate86:/tmp/home/root# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 32949359 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 19870123 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 7674753 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 2260300 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 298967 aes-256-cbc's in 3.01s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,32) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -march=armv7-a -fomit-frame-pointer -mabi=aapcs-linux -marm -ffixed-r8 -msoft-float -D__ARM_ARCH_7A__ -ffunction-sections -fdata-sections -I/home/merlin/amng.ac86/release/src-rt-5.02hnd/bcmdrivers/broadcom/net/wl/bcm94908/main/src/router/cryptodev-linux/ -DHAVE_CRYPTODEV -DUSE_CRYPTDEV_DIGESTS -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     175146.09k   422487.67k   652736.47k   768952.56k   813667.00k

Code:
admin@Stargate86:/tmp/home/root# modprobe cryptodev
admin@Stargate86:/tmp/home/root# insmod /lib/modules/4.1.27/kernel/crypto/cryptd.ko
insmod: can't insert '/lib/modules/4.1.27/kernel/crypto/cryptd.ko': File exists
admin@Stargate86:/tmp/home/root# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 1834077 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 1759546 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 1535313 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 1029994 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 235820 aes-256-cbc's in 3.01s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,32) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -march=armv7-a -fomit-frame-pointer -mabi=aapcs-linux -marm -ffixed-r8 -msoft-float -D__ARM_ARCH_7A__ -ffunction-sections -fdata-sections -I/home/merlin/amng.ac86/release/src-rt-5.02hnd/bcmdrivers/broadcom/net/wl/bcm94908/main/src/router/cryptodev-linux/ -DHAVE_CRYPTODEV -DUSE_CRYPTDEV_DIGESTS -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       9749.25k    37412.27k   130578.12k   350403.27k   641806.46k

Code:
admin@Stargate86:/tmp/home/root# modprobe bcmspu
admin@Stargate86:/tmp/home/root# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 193326 aes-256-cbc's in 3.02s
Doing aes-256-cbc for 3s on 64 size blocks: 190449 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 183182 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 155124 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 74913 aes-256-cbc's in 3.01s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,32) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -march=armv7-a -fomit-frame-pointer -mabi=aapcs-linux -marm -ffixed-r8 -msoft-float -D__ARM_ARCH_7A__ -ffunction-sections -fdata-sections -I/home/merlin/amng.ac86/release/src-rt-5.02hnd/bcmdrivers/broadcom/net/wl/bcm94908/main/src/router/cryptodev-linux/ -DHAVE_CRYPTODEV -DUSE_CRYPTDEV_DIGESTS -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       1024.24k     4049.41k    15579.60k    52773.08k   203882.82k

Further tests might be needed, but so far I'd say that the native OpenSSL code is more efficient than either kernel-based solution, at least when dealing with OpenSSL + cryptodev.
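
One way to see the fixed per-call cost in the three runs above is to divide the elapsed time by the number of calls. For 16-byte blocks that works out to roughly 0.09 µs per call for native OpenSSL, 1.6 µs through cryptodev with the ARM64 modules, and 15.6 µs through cryptodev with bcmspu. Here is a small sketch that computes this from the "Doing ..." lines of openssl speed output. per_call_us is a hypothetical helper name.

```shell
#!/bin/sh
# For each "Doing <cipher> for 3s on <N> size blocks: <count> ... in <T>s"
# line on stdin, print the block size and the mean time per call in
# microseconds (elapsed seconds divided by the number of calls).
per_call_us() {
    awk '
        /^Doing .* size blocks:/ {
            size  = $6                      # block size in bytes
            count = $9                      # number of operations completed
            secs  = $NF; sub(/s$/, "", secs)
            printf "%d %.2f\n", size, secs / count * 1000000
        }
    '
}
```

Usage: openssl speed -elapsed -evp aes-256-cbc 2>&1 | per_call_us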

Way past bedtime for me, so follow-up will have to wait :)
 
... I'd say that the native OpenSSL code is more efficient than using either kernel-based solutions, at least when dealing with OpenSSL + cryptodev...
I agree, and I think that /dev/crypto is the performance bottleneck. So I'll close with test results from an aarch64 build of openssl compiled with the Asuswrt toolchain, with the "no-asm" option removed. It doesn't use /dev/crypto or bcmspu.

Code:
LD_LIBRARY_PATH=. ./openssl speed -evp aes-256-cbc -elapsed

You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 38602433 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 22192203 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 8100663 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 2358592 aes-256-cbc's in 3.02s
Doing aes-256-cbc for 3s on 8192 size blocks: 307501 aes-256-cbc's in 3.01s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-aarch64-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/aarch64-buildroot-linux-gnu-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -march=armv8-a -fomit-frame-pointer -mabi=lp64 -ffixed-r8 -D__ARM_ARCH_8A__ -ffunction-sections -fdata-sections -O3 -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     205195.66k   471860.79k   688960.04k   799734.51k   836893.09k
 
I think crypto optimization work should be focused on openssl itself for now. Going with armv8 is worth investigating. I'd also like to make sure that NEON gets used by openssl; I'm not sure that's currently the case with armv7.
 
OpenSSL's armv7 ASM code does support NEON.

Compiling OpenSSL as aarch64 is not really realistic however, since all the userspace is 32-bit.
 
Compiling OpenSSL as aarch64 is not really realistic however, since all the userspace is 32-bit.
How do we turn on all the bells and whistles for 32-bit programs? :)
Code:
~/am-toolchains/brcm-arm-hnd/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -dM -E -march=armv8-a+crc -mtune=cortex-a53 -mfpu=crypto-neon-fp-armv8 - < /dev/null | egrep -i '(arm|aarch|neon|crc|crypto)'

#define __ARM_SIZEOF_WCHAR_T 4
#define __ARM_FEATURE_SAT 1
#define __ARM_ARCH_ISA_ARM 1
#define __ARMEL__ 1
#define __ARM_FEATURE_UNALIGNED 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_ARCH_8A__ 1
#define __ARM_SIZEOF_MINIMAL_ENUM 4
#define __ARM_FEATURE_CRYPTO 1
#define __ARM_FEATURE_LDREX 15
#define __ARM_PCS 1
#define __ARM_FEATURE_QBIT 1
#define __ARM_FEATURE_FMA 1
#define __ARM_ARCH_PROFILE 65
#define __ARM_32BIT_STATE 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_ARCH_ISA_THUMB 2
#define __ARM_ARCH 8
#define __arm__ 1
#define __ARM_FEATURE_SIMD32 1
#define __ARM_FEATURE_CRC32 1
#define __ARM_ARCH_EXT_IDIV__ 1
#define __ARM_EABI__ 1
#define __ARM_FEATURE_DSP 1
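
The dump above lists __ARM_FEATURE_CRYPTO and __ARM_FEATURE_CRC32 but contains no __ARM_NEON__ line, even though the egrep pattern would match one. The sketch below makes that kind of check explicit for a "gcc -dM -E" dump read from stdin. The function name and the choice of macros are illustrative assumptions.

```shell
#!/bin/sh
# Report which of a few feature macros appear in a "gcc -dM -E" dump
# read from stdin. The macro list is an illustrative choice.
check_arm_features() {
    dump=$(cat)
    for macro in __ARM_FEATURE_CRYPTO __ARM_FEATURE_CRC32 __ARM_NEON__; do
        if printf '%s\n' "$dump" | grep -q "^#define $macro "; then
            echo "$macro yes"
        else
            echo "$macro no"
        fi
    done
}
```

To use it, pipe the output of the gcc -dM -E command above into check_arm_features instead of egrep.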
 
mtune and fpu are where I would begin; those are two areas where OpenSSL might benefit. It might be worth investigating doing the same for zlib, dropbear and nettle as well, as they're also math-intensive.
 
mtune and fpu is where I would begin, two areas where OpenSSL might benefit. Might be worth investigating doing the same to zlib, dropbear and nettle as well, as they're also math-intensive.
Here are some results. The HND toolchain is 'soft float' only.
Code:
# openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 32781468 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 19823459 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 7710689 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 2274155 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 298980 aes-256-cbc's in 3.01s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -ffunction-sections -fdata-sections -march=armv8-a+crc -mtune=cortex-a53 -mfpu=crypto-neon-fp-armv8 -fomit-frame-pointer -fno-caller-saves -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     174253.65k   421495.47k   655792.82k   773666.02k   813702.38k

src/router/Makefile
Code:
...

openssl/Makefile:
ifeq ($(HND_ROUTER),y)
    cd openssl && \
    ./Configure --prefix=/usr --openssldir=/etc --cross-compile-prefix=' ' \
    shared $(OPENSSL_CIPHERS) no-ssl2 no-ssl3 \
    linux-armv4 -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN \
    -ffunction-sections -fdata-sections -Wl,--gc-sections \
    -march=armv8-a+crc -mtune=cortex-a53 -mfpu=crypto-neon-fp-armv8 \
    -fomit-frame-pointer -fno-caller-saves
else
    cd openssl && \
    ./Configure $(HOSTCONFIG) --prefix=/usr --openssldir=/etc --cross-compile-prefix=' ' \
    -ffunction-sections -fdata-sections -Wl,--gc-sections \
    shared $(OPENSSL_CIPHERS) no-ssl2 no-ssl3
endif

...
 
... Might be worth investigating doing the same to zlib, dropbear and nettle as well, as they're also math-intensive.
Dropbear could easily be compiled for aarch64 because it only depends on the pre-compiled toolchain libraries.
Code:
/tmp/home/root# ldd dropbear
        linux-vdso.so.1 (0x0000007fb2a20000)
        libutil.so.1 => /lib/aarch64/libutil.so.1 (0x0000007fb29e2000)
        libcrypt.so.1 => /lib/aarch64/libcrypt.so.1 (0x0000007fb299c000)
        libc.so.6 => /lib/aarch64/libc.so.6 (0x0000007fb2852000)
        /lib/ld-linux-aarch64.so.1 (0x0000007fb29f5000)

FYI, I run with PROTECTION_SERVER=n, not sure if that made things easier.

For everything else, I recompiled the entire firmware with -march=armv8-a+crc -mtune=cortex-a53 -mfpu=crypto-neon-fp-armv8 -fomit-frame-pointer -fno-caller-saves. The firmware image is about 3-4 MB bigger. It's running fine so far. I have most ASUS features disabled because I don't need all those bells and whistles.
 
Been tinkering with OpenSSL and IPSEC lately. I've gotten Strongswan to use the kernel's af_alg API, allowing it to use bcmspu. Performance was (oddly) slightly behind that of a native OpenSSL build.

My OpenSSL build wasn't fully optimized yet, gonna rebench Strongswan with some more optimizations.
 
Well, this is a surprise. I just discovered that those 300 Mbps of IPSEC throughput were achieved through the BCM crypto engine. Somehow Strongswan is able to use the engine without the need for af_alg support. The 64-bit openssl libraries are mostly used for DH handling so far.
 
