Entware-3x for new HND platform (GT-AC5300 and RT-AC86U) with asuswrt-merlin firmware

OpenSSL ranked from best to worst performance on RT-AC86U.
1. Asuswrt-Merlin (armv7)
2. Entware-ng-3x-armv7
3. Entware-ng-3x-armv8

As far as I can see, the Entware versions are compiled to support /dev/crypto, i.e. to use not only assembler acceleration but also hardware acceleration. That is, with these compile options:
Code:
-DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS

If that is really the case, the test should be run with the "-evp" and "-elapsed" options to use /dev/crypto:
Code:
openssl speed -evp aes-256-cbc -elapsed

Merlin's version uses assembler optimization. The options -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS slow OpenSSL down if /dev/crypto is not used.
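A minimal sketch of how one might check both conditions before benchmarking: whether the installed openssl was built with the cryptodev defines, and whether the /dev/crypto node exists at all. The function names built_with_cryptodev and speed_cmd are made up for illustration. "openssl version -f" prints the build flags, and "-elapsed" switches the benchmark to wall-clock timing, which matters once work is handed off to the kernel.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OpenSSL or Entware.

# Succeeds if the flags string in $1 contains the -DHAVE_CRYPTODEV define.
built_with_cryptodev() {
  case " $1 " in
    *" -DHAVE_CRYPTODEV "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the benchmark command to run, depending on whether the
# device node ($1, default /dev/crypto) exists as a character device.
speed_cmd() {
  dev="${1:-/dev/crypto}"
  if [ -c "$dev" ]; then
    echo "openssl speed -evp aes-256-cbc -elapsed"
  else
    echo "openssl speed -evp aes-256-cbc"
  fi
}

# Inspect the local openssl binary, if there is one.
if command -v openssl >/dev/null 2>&1; then
  flags="$(openssl version -f 2>/dev/null)"
  if built_with_cryptodev "$flags"; then
    echo "cryptodev defines present in build flags"
  else
    echo "cryptodev defines not found in build flags"
  fi
fi

speed_cmd
```

If the defines are present but the device node is missing, that would be the situation described above where the options only cost speed.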

Voxel.
 
I know some people are using this version not only for Cortex-A15. Main points: it is compiled with hard float and for neon-vfpv4.
My RT-AC86U cpuinfo has "fp asimd evtstrm aes pmull sha1 sha2 crc32". I want to turn on all the bells and whistles. Have you tried "-mfpu=crypto-neon-fp-armv8" ?
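For anyone wanting to check a feature list like this in a script before picking compiler options, here is a rough sketch. The helper name has_feature is made up, and the feature string is simply the one quoted above. Note that "neon" is included in the loop only to show a miss: this particular list reports "asimd" rather than "neon".

```shell
#!/bin/sh
# Hypothetical helper: succeeds if feature word $2 appears in the
# space-separated feature list $1.
has_feature() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Feature list as quoted in the post. On a live system it could be read with:
#   features="$(grep -m1 -i '^Features' /proc/cpuinfo | cut -d: -f2)"
features="fp asimd evtstrm aes pmull sha1 sha2 crc32"

for f in aes pmull sha1 sha2 neon; do
  if has_feature "$features" "$f"; then
    echo "$f: yes"
  else
    echo "$f: no"
  fi
done
```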
 
You can also test Vortex Entware-3x port that is armv7 -O3 hard float. It was 30% faster in a similar test compared with Entware-ng on IPQ4018 (Asus RT-58AC running lede).
IMO you had my version in mind: Voxel?

Voxel.
 
My RT-AC86U cpuinfo has "fp asimd evtstrm aes pmull sha1 sha2 crc32". I want to turn on all the bells and whistles. Have you tried "-mfpu=crypto-neon-fp-armv8" ?

No, sorry. I prepare a version for Cortex-A15 (32-bit) for users of NETGEAR R7500/R7800/R9000. I do not have any armv8 gadgets.

Voxel.
 
BTW, is there really a /dev/crypto on the Asus RT-58AC and/or RT-86U? I have neither of them to check...

Voxel.
 
When I enabled Broadcom's crypto engine on the RT-AC86U, it seriously reduced OpenVPN performance due to the added context switches (I assume). That's why I keep it disabled.
 
When I enabled Broadcom's crypto engine on the RT-AC86U, it seriously reduced OpenVPN performance due to the added context switches (I assume). That's why I keep it disabled.
That is interesting. I have rather the opposite feedback from users of the R9000 (AL-514, Cortex-A15). For example:

https://www.snbforums.com/threads/custom-firmware-build-for-r9000.40125/#post-335864
(it is just assembler acceleration)

and

https://www.snbforums.com/threads/custom-firmware-build-for-r9000.40125/page-4#post-338655
(this is when the guy tried the version with asm plus /dev/crypto)

So 61/19 vs 93/21 for AL-514 (OpenVPN).
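To put the two pairs in relative terms, a quick calculation. This assumes 61/19 is the asm-only result and 93/21 the asm plus /dev/crypto result, following the order of the links above; the units are not stated in the post, so only the ratios are computed. pct_gain is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: percentage change from old ($1) to new ($2),
# printed to one decimal place.
pct_gain() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

pct_gain 61 93   # first figure of each pair  -> 52.5
pct_gain 19 21   # second figure of each pair -> 10.5
```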

Voxel.
 
No, there is no /dev/crypto (openwrt). Some of my NASes have /dev/encryptfs - hardware encrypted file system support?

If so, it makes no sense to compile OpenSSL with -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS, IMO. As far as I know, these options really slow things down. I would suggest using pure asm acceleration.

/dev/encryptfs != /dev/crypto of course.

Voxel.
 
That is interesting. I have rather the opposite feedback from users of the R9000 (AL-514, Cortex-A15). For example:

https://www.snbforums.com/threads/custom-firmware-build-for-r9000.40125/#post-335864
(it is just assembler acceleration)

and

https://www.snbforums.com/threads/custom-firmware-build-for-r9000.40125/page-4#post-338655
(this is when the guy tried the version with asm plus /dev/crypto)

So 61/19 vs 93/21 for AL-514 (OpenVPN).

Voxel.

Could be because I'm already optimizing OpenSSL and OpenVPN beyond what Netgear does. I ran iperf tests through an OpenVPN tunnel, and performance dropped. Raw openssl speed tests were also slower on small block sizes - see the benchmarks I posted in the VPN sub-forums.

I suspect IPSEC would be where performance improvements could be gained, but I haven't had time to debug the Strongswan implementation on the RT-AC86U to run tests.
 
If so, it makes no sense to compile OpenSSL with -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS, IMO. As far as I know, these options really slow things down. I would suggest using pure asm acceleration.

/dev/encryptfs != /dev/crypto of course.

Voxel.

What I did in my tests is to compile with OpenSSL external engine support, then used such an engine to access the kernel cryptodev API.

I don't remember the exact build-time change I made, though. I had to manually compile specific pieces and copy them to a running router, as the change would prevent some of the other firmware components from running properly.
 
Could be because I'm already optimizing OpenSSL and OpenVPN beyond what Netgear does.
Sorry, I am not talking about what Netgear does. The guy who tested OpenVPN (my links above) tried my version, with optimized OpenVPN and OpenSSL. Netgear still does not enable any OpenSSL acceleration in their stock firmware, not even assembler acceleration, in spite of my hints passed to the NG developers by NETGEAR Guy.

Voxel.
 
What I did in my tests is to compile with OpenSSL external engine support, then used such an engine to access the kernel cryptodev API.

I don't remember the exact build-time change I made, though. I had to manually compile specific pieces and copy them to a running router, as the change would prevent some of the other firmware components from running properly.

I used the kernel's code in /drivers/crypto/al (i.e. the Alpine-specific crypto driver) and cryptodev (http://cryptodev-linux.org/), not OCF.

Voxel.
 
-O3 -pipe -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15

I've run into some math issues with -O3 - just saying... and if you're defining -mcpu, you don't have to include -mtune, as that is implied by -mcpu.

Curious that one would use -pipe - it doesn't actually change the generated code, it just speeds up builds...
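As a small illustration of the -mcpu/-mtune point, here is a sketch that drops a -mtune flag from a flag list when it only repeats the -mcpu value. The helper name dedupe_mtune is made up; a -mtune that names a different core is left alone, since that is a deliberate choice rather than a redundancy.

```shell
#!/bin/sh
# Hypothetical helper: remove -mtune=X from the flag list in $1
# when -mcpu=X with the same value is also present.
dedupe_mtune() {
  flags="$1"
  cpu=""
  out=""
  # First pass: find the -mcpu value, if any.
  for f in $flags; do
    case "$f" in
      -mcpu=*) cpu="${f#-mcpu=}" ;;
    esac
  done
  # Second pass: copy every flag except the redundant -mtune.
  for f in $flags; do
    if [ -n "$cpu" ] && [ "$f" = "-mtune=$cpu" ]; then
      continue
    fi
    out="${out:+$out }$f"
  done
  printf '%s\n' "$out"
}

dedupe_mtune "-O3 -pipe -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mtune=cortex-a15"
# prints: -O3 -pipe -mcpu=cortex-a15 -mfpu=neon-vfpv4
```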

-O2 -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard

Works for A8/A9/A7/A17 chips that I support (not all A9's are supported, due to odd stuff there with licensees - RT-68U for example, has no VFP or NEON...)

A53/A57 - we run in ARMv7-a profiles, like above, as we did some benchmarking, and in most cases there wasn't enough of a change to merit a total rebuild of userland... and the baggage of supporting two different userlands across archs... not worth the effort.

I suspect it'll be fine with A15 as well as it's similar in many ways to A17.
 
I have made similar tests on Realtek RTD1295.

What platform is this on - and what's the clock speeds? Some of the Realtek 1295's can clock way up to around 2GHz or so...

The RTD1295 is generally focused on Android STB's - and mali is a pain outside of Android - but some hobby/enthusiast contributors have made progress there. Otherwise, it's a generic Quad Cortex-A53 - should scale accordingly, it's a bigger/faster little in-order core and in my experience, a53 runs best in ARMv7A mode for the most part.

The big 32-bit OOO dual cores like A9 and quads like A15/A17 will likely outperform it.
 
Works for A8/A9/A7/A17 chips that I support

The reason I treat A17 like A7 is big.LITTLE - the same goes for A15, as those cores can also be teamed up with A7. A8 is a bit of a challenge to support and not a priority, but A9 w/VFP and NEON is...

Tuning for A15/A17 gains a little bit - but not enough to matter...
 
What platform is this on - and what's the clock speeds? Some of the Realtek 1295's can clock way up to around 2GHz or so...
This was run on QNAP TS-128A - https://www.qnap.com/en/product/ts-128a/specs/hardware
Running QNAP's own version of Linux with kernel 4.2.8, CPU clocked at 1.4GHz.

Amlogic S912 (Android + kernel 3.14.29) is much faster (-O2 + aarch64)
Code:
# /opt/bin/openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 6049209 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 1560507 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 395548 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 99237 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 12415 aes-256 cbc's in 3.00s
OpenSSL 1.0.2n  7 Dec 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-gnu-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -I/media/ware4/Entware-3x.2017.12/staging_dir/target-aarch64_cortex-a53_glibc-2.25/opt/include -I/media/ware4/Entware-3x.2017.12/staging_dir/target-aarch64_cortex-a53_glibc-2.25/include -I/media/ware4/Entware-3x.2017.12/staging_dir/toolchain-aarch64_cortex-a53_gcc-6.3.0_glibc-2.25/usr/include -I/media/ware4/Entware-3x.2017.12/staging_dir/toolchain-aarch64_cortex-a53_gcc-6.3.0_glibc-2.25/include -DOPENSSL_SMALL_FOOTPRINT -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DOPENSSL_NO_ERR -DTERMIOS -O2 -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result  -fpic -I/media/ware4/Entware-3x.2017.12/package/libs/openssl/include -ffunction-sections -fdata-sections -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      32262.45k    33290.82k    33753.43k    33872.90k    33901.23k
On Amlogic, the aarch64 openssl test is ~17% faster than the armv7 variant.
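For reference, the summary row in the output above follows directly from the "Doing aes-256 cbc" lines: operations times block size, divided by the elapsed seconds, shown in thousands of bytes per second. A small sketch of that arithmetic (throughput_k is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: $1 = operations completed, $2 = block size in bytes,
# $3 = seconds. Prints throughput in 1000s of bytes per second.
throughput_k() {
  awk -v n="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2fk\n", n * b / s / 1000 }'
}

throughput_k 6049209 16 3.00    # 32262.45k
throughput_k 1560507 64 3.00    # 33290.82k
throughput_k 12415 8192 3.00    # 33901.23k
```

The results match the 16, 64 and 8192 byte columns of the table.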
 
-O2 -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
Works for A8/A9/A7/A17 chips that I support (not all A9's are supported, due to odd stuff there with licensees - RT-68U for example, has no VFP or NEON...)

Well, you know my position :). You call it "over optimization". I do not use -march=armv7-a, preferring -mcpu, which differs from platform to platform. Anyway, I respect the universal, portable multi-platform solution you use. I had to use a similar approach in my past work ("over optimization" vs portability).

Voxel.
 
