Optimised OpenSSL library for EdgeRouter-X


kvic

I'm amazed by the performance boost over the Ubiquiti stock version, so I thought I would share it with other ER-X users.

Performance below; the percentage figures are the gains over the Ubiquiti stock build. Absolute numbers are still limited by the hardware.
Code:
** openssl speed test

type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  
-------------------------------------------------------------------------- 
sha           3375.42k    10791.47k    26679.31k    41283.24k    51606.86k  
sha256        3182.77k     8147.95k    15476.05k    19808.83k    22353.24k  
aes-128-cbc  12459.23k    13650.56k    13869.23k    13889.54k    13991.94k  
-------------------------------------------------------------------------- 
sha            +16.9%       +29.5%       +45.4%       +58.4%       +74.0%  
sha256         +57.8%       +75.6%       +89.5%       +96.1%        +107%  
aes-128-cbc    +7.36%       +7.53%       +6.62%       +8.36%       +6.66%  
-------------------------------------------------------------------------- 
                  sign/s verify/s
rsa 1024 bits       64.8   1311.4 | +18.0%  +14.3%  
rsa 2048 bits       10.5    393.2 | +16.7%  +14.6%  
rsa 4096 bits        1.6    109.8 | +14.3%  +13.5%  
dsa 1024 bits      134.8    112.9 | +16.8%  +17.5%  
dsa 2048 bits       40.0     34.0 | +17.0%  +17.6%

** openvpn throughput

aes-128-cbc 31.8Mbps (+26.2%)
aes-256-cbc 28.9Mbps (+23.5%)

Follow this how-to to install. Please be reminded: use it at your own risk.
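
For reference, figures like the table above come from the stock openssl speed tool; a sketch of the invocation (the exact algorithm list used here is an assumption, not confirmed):
Code:
# Hash/cipher throughput plus RSA/DSA sign and verify rates
# (algorithm list is illustrative - adjust as needed)
openssl speed sha1 sha256 aes-128-cbc rsa dsa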
 
Odd that the numbers don't really change regardless of the block size on aes-128-cbc. RAM bottleneck perhaps?

Are you running your tests with -evp? This makes OpenSSL use the best optimized code path for your specific hardware.
 
On my mipsel, -evp gives slightly better speed, but within a 2% difference. I've seen people get better results on the same SoC (OpenSSL benchmark on OpenWRT).
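
For anyone who wants to compare the two code paths themselves, a quick sketch (aes-128-cbc chosen only because it is the cipher under discussion; make sure the openssl binary being invoked is the optimised one you want to test):
Code:
# Legacy code path
openssl speed aes-128-cbc
# EVP code path - lets OpenSSL pick its best implementation for the CPU
openssl speed -evp aes-128-cbc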

AES operates on data in 128-bit (16-byte) blocks. A longer input simply takes proportionally more time to process on a single execution unit, so the average rate should remain similar; a 1024-byte buffer, for example, is just 64 of those block operations back to back. I believe that's why the rate is about the same across inputs of varying length.
 

The reason why I found it odd:

Code:
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,32) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.22-binutils-2.25/usr/bin/arm-buildroot-linux-gnueabi-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_HEARTBEATS -DL_ENDIAN -march=armv7-a -mtune=cortex-a53 -fomit-frame-pointer -mabi=aapcs-linux -marm -ffixed-r8 -msoft-float -D__ARM_ARCH_7A__ -ffunction-sections -fdata-sections -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     192589.37k   527732.27k   939901.18k  1189585.58k  1295262.41k

Larger blocks seem to be processed more optimally here.
 

That looks "odd" in the opposite direction, but it's a Cortex-A53. Someone more familiar can possibly offer an explanation, or point out where this reasoning goes wrong.

I heard ARMv8 added "AES-NI"-style instructions that could speed up crypto operations. However, on Intel CPUs with AES-NI, I saw about the same throughput across different input sizes. For example, this one from my blog post last year.
 

It does; that's why OpenSSL performance is so much better on the B53 than on the older A9.
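
A quick way to check whether a given ARM board actually exposes the ARMv8 Cryptography Extensions to userspace (a sketch; the exact output format varies by kernel):
Code:
# Look for the aes/pmull/sha1/sha2 flags in the CPU feature list
grep -m1 Features /proc/cpuinfo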
 
I think the ER-X is probably RAM limited... the numbers suggest as much.

You should see the latest Kaby Lake i5s - I think the 64 MB of eDRAM helps out a bit there (the Iris Pro 640 iGPU has the eDRAM).

type            16 bytes      64 bytes      256 bytes     1024 bytes    8192 bytes
aes-128-cbc   1145303.08k   1238979.50k   1272503.21k   1277990.57k   1280679.94k
aes-128-gcm    652517.09k   1455412.78k   2738444.29k   4252616.02k   5201199.10k

(on AMD64, it's not just AES-NI that jumps the numbers up, it's also that Intel contributed code that leverages SSE4 to move memory around faster)
 
@kvic - numbers look good - saw someone in the comment section with a mips24kc on OpenWrt...

nice work ;)

Code:
type          16 bytes   64 bytes   256 bytes  1024 bytes  8192 bytes  16384 bytes
aes-128-gcm   6133.83k   6951.93k    7182.46k    7243.86k    7261.21k     7250.33k
aes-128-cbc  10307.24k  12817.54k   13634.25k   13859.72k   13953.62k    13912.79k
 

See: look at the Intel aes-128-cbc row above. More recent Intel CPUs with AES-NI demonstrate the same phenomenon: from 16 bytes to 8192 bytes, throughput remains about the same. Unlike the Cortex-A53, where the 8192-byte figure is about six times the 16-byte one.

Interestingly, the Kaby Lake i5 from 1024 bytes to 8192 bytes performs about the same as the A53 at the same input sizes. Is the A53 that good?
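
Putting rough back-of-the-envelope numbers on that scaling, using only the aes-128-cbc figures already posted in this thread:
Code:
# Throughput ratio from 16-byte to 8192-byte buffers
awk 'BEGIN {
  printf "ER-X (MIPS):  %.2fx\n", 13991.94 / 12459.23;       # essentially flat
  printf "Cortex-A53:   %.2fx\n", 1295262.41 / 192589.37;    # scales ~6.7x
  printf "Kaby Lake i5: %.2fx\n", 1280679.94 / 1145303.08;   # flat, like the ER-X
}'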
 

check the effect of 10 ;)

The Intel numbers benefit from bandwidth; the MIPS numbers suggest there's a bottleneck.

Numbers across the range can imply architecture efficiencies (or not) - just saying.

The A53 is commercially successful because it's cheap, but it's not the strongest hand at OpenSSL - the Broadcom B53 numbers benefit from certain things that generic A53s do not - not much different from what we see with other ARM variants like Apple and Qualcomm (not QCA).

To see the benefit of arch differences, just look at the GCM values - Intel scales very nicely there, even without AES.
 

On the ER-X, aes-128-cbc shows consistent performance across different input sizes (from 16 bytes to 8192 bytes). The throughput is about right for a classical MIPS core at its clock speed. I don't see an architectural limitation or a RAM speed limit.
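
As a rough sanity check of "about right for its clock speed" (assuming the ER-X's MT7621 runs at 880 MHz - the clock figure is an assumption here), the 8192-byte result works out to roughly 63 cycles per byte, a plausible cost for table-based AES-CBC on an in-order MIPS core:
Code:
# Cycles-per-byte estimate for the 8192-byte aes-128-cbc figure above
# (assumes an 880 MHz clock - adjust if yours differs)
awk 'BEGIN { printf "%.1f cycles/byte\n", 880e6 / (13991.94 * 1000) }'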

On CBC vs GCM, we went through that last year. We agreed GCM will scale better than CBC on newer CPUs (CBC encryption is inherently serial, since each block is chained to the previous ciphertext, while GCM's counter mode can be parallelised). I have a record from after that conversation: CBC vs GCM.

The mind-boggling part, or what's new in this thread, is that the B53 scales so well on aes-128-cbc (according to RMerlin's numbers). Why is that?

Is the B53 a standard ARM core? Seems not. Or a Broadcom variant of the A53? What added features in the B53 allow even aes-cbc to scale?
 

Consensus is that it's an ARM variant much like Apple or Qualcomm - Broadcom has an architecture license, and they were close to launching a server chip similar to X-Gene...

B53 scales much like the out of order big cores (A15/A17/A57)...
 

That chip contains a crypto module. Maybe the CPU is able to leverage some of it for its AES acceleration, without fully utilizing it the way strongSwan does through the bcmspu kernel module. That would mean it can achieve high performance on large blocks, while the overhead slows things down with small blocks (more context switches to that crypto engine?).

I would have a hard time believing those results if I hadn't measured the effect on OpenVPN.
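
One way to poke at that crypto-module hypothesis from a shell on the box in question (a sketch; module and driver names vary by firmware, bcmspu being just the example mentioned above):
Code:
# Is a hardware crypto driver (e.g. bcmspu) loaded, and what does the kernel crypto API expose?
lsmod | grep -i spu
grep -E '^(name|driver|priority)' /proc/crypto | head -n 30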
 
