Memory Leak related to Strongswan

Sampson

I am currently running 378.50 on my RT-N16. I installed Entware so that I could use Strongswan (v5.2.1) to establish an IPsec VPN tunnel between my home and office. For the most part this is working well, but I recently noticed that memory is consumed in proportion to the amount of tunnel traffic and never seems to be released. While searching, I came across the following thread, which appears to describe the same issue (that user is running Tomato, not Asuswrt-Merlin):

http://www.linksysinfo.org/index.php?threads/memory-leak-ipsec-strongswan.70929/

Based on that, I installed slabtop and confirmed that my case also involves secpath_cache consuming memory that never seems to get released (I've seen it stay occupied for multiple days).

After a reboot, secpath_cache shows a CACHE SIZE of 0K. I'm not sure if it helps, but here is an example of the slabtop output after approximately 100 MB of data transferred over the tunnel:

Code:
Active / Total Objects (% used)    : 70635 / 75743 (93.3%)
 Active / Total Slabs (% used)      : 2364 / 2373 (99.6%)
 Active / Total Caches (% used)     : 86 / 128 (67.2%)
 Active / Total Size (% used)       : 9586.77K / 10112.02K (94.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.13K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 40567  40514  99%    0.03K    359      113      1436K secpath_cache
  8496   8455  99%    0.06K    144       59       576K buffer_head
  4290   4261  99%    0.13K    143       30       572K size-32
  2808   2766  98%    0.05K     36       78       144K sysfs_dir_cache
  2418   2418 100%    0.12K     78       31       312K dentry
  4120   2156  52%    0.09K    103       40       412K flow_cache
  1380   1269  91%    0.08K     30       46       120K vm_area_struct
   840    840 100%    0.13K     28       30       112K size-64
   793    788  99%    0.29K     61       13       244K inode_cache
   765    765 100%    0.22K     45       17       180K skbuff_head_cache
   737    728  98%    4.00K    737        1      2948K size-4096
   840    660  78%    0.16K     35       24       140K filp
   507    501  98%    0.29K     39       13       156K radix_tree_node
   678    470  69%    0.01K      2      339         8K anon_vma
   480    393  81%    0.13K     16       30        64K size-128
   374    369  98%    0.34K     34       11       136K squashfs_inode_cache
   330    315  95%    0.25K     22       15        88K ip_dst_cache
   312    312 100%    0.30K     26       12       104K proc_inode_cache
   304    304 100%    0.50K     38        8       152K size-512
   270    263  97%    0.39K     27       10       108K shmem_inode_cache
   180    177  98%    0.42K     20        9        80K ext3_inode_cache
   150    147  98%    0.13K      5       30        20K size-96
   150    143  95%    0.25K     10       15        40K size-192
   128    128 100%    1.00K     32        4       128K size-1024
   160    127  79%    0.09K      4       40        16K kmem_cache
   120    113  94%    0.25K      8       15        32K jffs2_refblock
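
To track the growth over time rather than eyeballing slabtop, the same counters can be read from /proc/slabinfo. The following is a minimal Python sketch, not something from the original post. It assumes a Python interpreter is available (on the router, or on another machine you copy the file to) and that /proc/slabinfo uses the usual layout where the first four fields are the cache name, active objects, total objects and object size in bytes. The function names are hypothetical, and the SAMPLE line is made up to match the secpath_cache row above (the tunables values in it are placeholders).

```python
# Illustrative sample in /proc/slabinfo layout; the secpath_cache counts mirror
# the slabtop row above, the "tunables" numbers are placeholders.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
secpath_cache      40514  40567     32  113    1 : tunables  120   60    0 : slabdata    359    359      0
"""


def parse_slabinfo(text):
    """Return {cache_name: {"active_objs", "num_objs", "objsize"}}."""
    caches = {}
    for line in text.splitlines():
        # Skip the version banner, the column header and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        active_objs, num_objs, objsize = (int(f) for f in fields[1:4])
        caches[fields[0]] = {
            "active_objs": active_objs,
            "num_objs": num_objs,
            "objsize": objsize,
        }
    return caches


def cache_bytes(caches, name):
    """Bytes held by all objects of one cache (0 if the cache is absent)."""
    entry = caches.get(name)
    if entry is None:
        return 0
    return entry["num_objs"] * entry["objsize"]


def read_slabinfo(path="/proc/slabinfo"):
    """Read the live file; this usually requires root."""
    with open(path) as handle:
        return parse_slabinfo(handle.read())
```

Calling cache_bytes(read_slabinfo(), "secpath_cache") before and after a transfer gives a number that can be logged and compared, which is easier than diffing slabtop screens by hand.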


Anyway, I apologize if this is not the ideal place to post about this issue, but I'm hoping that if that's the case, someone can point me in the right direction.

Thanks for any help.
 
Sorry to revive this old thread...

Dismayed that the OpenVPN client on iOS is treated as a "2nd-class citizen", I'm switching away from OpenVPN to IPSec/IKEv2.

I bumped into this exact memory leak on 378.55. The leak is proportional to how much traffic is sent through the ipsec tunnel.

Does anyone have an idea what causes the leak?
 
Managed to run the KMEMLEAK detector and spotted two sources of leaks. Both seem to come from the xfrm modules:

Code:
unreferenced object 0xcac625a0 (size 80):
  comm "softirq", pid 0, jiffies 5555 (age 232.090s)
  hex dump (first 32 bytes):
    00 00 00 00 88 9a b7 cf 0a 00 01 cd 0f 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 20 01 41 d0  ............ .A.
  backtrace:
    [<c00d5324>] create_object+0xd0/0x200
    [<c038f64c>] kmem_cache_alloc+0xa0/0xdc
    [<c0214ddc>] flow_cache_lookup+0x370/0x3e8
    [<c0285d74>] __xfrm_lookup+0x1e8/0x444
    [<c02a89b8>] udpv6_sendmsg+0x50c/0x9a4
    [<c02693e8>] inet_sendmsg+0xb4/0xe4
    [<c01f3c9c>] sock_sendmsg+0x94/0xa8
    [<c01f5fcc>] sys_sendto+0xb4/0xdc
    [<c0047aa0>] ret_fast_syscall+0x0/0x30
    [<ffffffff>] 0xffffffff

unreferenced object 0xcad41aa0 (size 32):
  comm "softirq", pid 0, jiffies 3403 (age 253.600s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 0c a4 cd c4 17 39 c0  ..............9.
    5c 50 1e c0 b0 51 1e c0 2c 42 01 bf ff ff ff ff  \P...Q..,B......
  backtrace:
    [<c00d5324>] create_object+0xd0/0x200
    [<c038f64c>] kmem_cache_alloc+0xa0/0xdc
    [<c028ac74>] secpath_dup+0x18/0x8c
    [<c028aec4>] xfrm_input+0x38/0x3d4
    [<c0262408>] udp_queue_rcv_skb+0xc0/0x3bc
    [<c0262d34>] __udp4_lib_rcv+0x164/0x64c
    [<c02423a8>] ip_local_deliver_finish+0x104/0x32c
    [<c03958a4>] ip_rcv_finish+0x158/0x530
    [<c02041e4>] __netif_receive_skb+0x484/0x544
    [<c03925f4>] netif_receive_skb+0xe4/0x104
    [<c0392c98>] napi_skb_finish+0x40/0x50
    [<c02049b8>] process_backlog+0x98/0x1e4
    [<c03927d4>] net_rx_action+0xa0/0x168
    [<c039e734>] __do_softirq+0xac/0x144
    [<c0073478>] irq_exit+0x50/0x58
    [<c039e398>] asm_do_IRQ+0x58/0xb4

Google presents lots of noise. Any ideas how to fix?
 

I would check the Linux kernel changelog for that module between 2.6.36 and latest. If it was truly a leak in the module, it must have been fixed years ago.
 
Taking Merlin's excellent advice.....I found the following and backported them.....
Code:
fac53a1 xfrm: Fix xfrm_state_migrate leak
3ff02ff xfrm: Fix memory leak in xfrm_state_update
94790a8 xfrm: fix freed block size calculation in xfrm_policy_fini()

One from 2010, one from 2011 and one from 2013. After that the code has changed too much.....

Compiling a build for my fork now to see if anything breaks :)
 
Thank you both, Your Excellences.

I looked at all xfrm patches from 2010 onwards and some core and ipv4 patches. Sadly I didn't see an honourable mention of such a leak in any of them. Unsurprisingly, people weren't complaining about such issues on the internet either.

I'm thinking it's an issue only in Broadcom-contaminated kernels (but that wouldn't explain why Tomato also had the problem. Perhaps Tomato borrowed a Broadcom kernel..?)

When I saw the following line in the early stage of my deep dive, I thought I saw the end of the tunnel. Sadly that was only the beginning and it didn't cause the leak either.
Long story short, still haven't figured out what caused the leak but the problem is better understood:
  • the memory leak is mainly in "secpath_cache"
  • the leak is proportional to the number of packets transferred through the tunnel. It leaks roughly 32 bytes per forwarded packet.
  • the leak doesn't happen on packets coming out of the tunnel destined for the router itself; there is also no leak for packets originating from the router and going into the tunnel.
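
To get a feel for what roughly 32 bytes per forwarded packet means in practice, here is a back-of-the-envelope estimate in Python. It is only a sketch: the function name is hypothetical, the 1400-byte average packet size is an assumption (real traffic mixes full-size packets and small ACKs), and it assumes every packet counted is a forwarded one that leaks.

```python
def estimated_leak_bytes(traffic_bytes, avg_packet_bytes=1400, leak_per_packet=32):
    """Estimate leaked bytes for a given amount of forwarded tunnel traffic.

    avg_packet_bytes is an assumed average; leak_per_packet is the rough
    figure from the list above.
    """
    packets = traffic_bytes // avg_packet_bytes
    return packets * leak_per_packet


MIB = 1024 * 1024
GIB = 1024 * MIB

# About 2.3 MiB leaked per 100 MiB forwarded, under the assumptions above.
leak_100_mib = estimated_leak_bytes(100 * MIB)

# For hundreds of GiB the estimate exceeds 13 GiB, far more than a home
# router has, so an unfixed leak cannot go unnoticed at that volume.
leak_600_gib = estimated_leak_bytes(600 * GIB)
```

The ratio is what matters: at these assumed sizes about 2.3% of the forwarded volume stays pinned in kernel memory until reboot.
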
When the leak doesn't happen, the "secpath" (as pointed to by "skb->sp") is released when the socket buffer, aka "skb", is released after a packet delivery.

When the leak happens, the "secpath" is not released. BUT we have no reason to suspect that the socket buffer itself is not released. Otherwise the leak would be a million times worse.

So is skb->sp being reset to NULL for forwarded packets somewhere before the socket buffer is released?
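
The hypothesis in that question can be illustrated with a toy model. This is plain Python, not kernel code, and it does not claim to show the actual root cause, which the thread never spells out. The class and function names merely echo the kernel's (SecPath for the sec_path object, secpath_put for the reference drop), and the forwarded_bug flag is a hypothetical stand-in for "something clears skb->sp without dropping the reference".

```python
class SecPath:
    """Stand-in for a reference-counted secpath object."""
    live = 0  # number of secpath objects currently allocated

    def __init__(self):
        self.refcnt = 1
        SecPath.live += 1


def secpath_put(sp):
    """Drop one reference; free the object when the count reaches zero."""
    if sp is None:
        return
    sp.refcnt -= 1
    if sp.refcnt == 0:
        SecPath.live -= 1


class Skb:
    """Stand-in for a socket buffer carrying an optional secpath pointer."""

    def __init__(self, sp=None):
        self.sp = sp


def release_skb(skb):
    """Freeing the skb drops its secpath reference, if it still has one."""
    secpath_put(skb.sp)
    skb.sp = None


def deliver(forwarded_bug=False):
    """Receive one tunnel packet and release it after delivery."""
    skb = Skb(sp=SecPath())
    if forwarded_bug:
        # Hypothetical bug: pointer cleared without a matching secpath_put().
        skb.sp = None
    release_skb(skb)
```

In this model deliver() leaves SecPath.live unchanged, while each deliver(forwarded_bug=True) leaves one secpath behind even though the skb itself is released, which is the pattern described above: one small object lost per forwarded packet.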
 
Hurrah!

Figured out a workable fix. Testing stability..!!
 
Excellent! Keep us posted.

600 GiB of traffic transferred. Looks promising. Still ongoing. Some numbers from my RT-AC56U:

IKEv2 VPN (using IPSec)
  • Download: 78Mbit/s; upload 67Mbit/s
  • About 10% faster speed than OpenVPN
  • At 20% less CPU utilisation (out of 200% - two cores)
Core 1 is mostly at 100% at max throughput. Core 2 is around 15%. Essentially it behaves like a single-core workload, sadly. Otherwise the speed advantage of IPSec over OpenVPN would be much more decisive, I would think.
 
Hurrah!!

Async encrypt/decrypt is available. With it on,
  • Download 88 Mbit/s; upload 88 Mbit/s
  • CPU utilization: core 1 100%; core 2 60%
Praying my AC56U won't crash...let's see.
 
The numbers in my previous posts were taken with an iperf transfer between a host on the WAN and a host on the LAN, with the same ciphers (aes-128) on ipsec and openvpn.

I repeated the speedtest.net test on a client host on the WAN. Download is 84 Mbit/s. Upload hits 90 Mbit/s. Here is a graph of CPU load capturing the test runs:

[Image: 1o5enr.png, CPU load graph for the IPSec test runs]


Compare that with the graph below of the same run over OpenVPN (down/up at 68 Mbit/s):

[Image: 1zgqnuh.jpg, CPU load graph for the same run over OpenVPN]



For a 100/100 WAN connection, I'm surprised IPSec can squeeze extra performance out of a little RT-AC56U. To me it's near wire-speed VPN on a home router that even a prosumer wouldn't buy today.

To sum up for the time being, my fix for the memory leak works very well. For the record, I'm willing to share the fix, and perhaps an IPSec how-to for Strongswan, with the Asuswrt-Merlin community (if there is demand).

The prerequisites are that RMerlin is back in the game or an alternative roadmap to progress Asuswrt-Merlin shows up... (_and_ the GPL license issue in some fork is properly taken care of).
 
@kvic: Wondering if you ever created a how-to for this, as I haven't been able to track one down here on the forums or elsewhere? I would love to simplify testing against my work datacenter from home without having to set up every test device to connect to the VPN there.
 
