XT8 troubleshooting log, cfg_server possibly an issue

Hey folks,

[edit1: all three nodes are running 3.0.0.4.388_21617]
[edit2: as of 2023/10/12, 23285 has finally been stable for me! Currently 73 days uptime]

I have three XT8s, currently using wireless backhaul. Previously I had them on wired backhaul, but I was having serious stability issues (blinking red lights on the satellites), so I reset everything to use wireless backhaul, and everything was stable for a month or two. Now I've been having these symptoms:

1. dnsmasq stops working every 1-4 days. I can still `ping 8.8.8.8` and `dig google.ca @8.8.8.8`, so I know it's not a connectivity issue, just dnsmasq being dead or slow (quick checks collected after this list).
2. the admin web interface gets incredibly slow, with many requests taking 5-60 seconds. The mobile app is affected identically.
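
For the record, this is what that looks like when it happens (nothing new, just the checks from item 1 plus the failing case, in one place):

Code:
ping 8.8.8.8               # fine, so it's not connectivity
dig google.ca @8.8.8.8     # fine, so upstream DNS is reachable
dig google.ca              # times out -- the router's own resolver is the problem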

Recently I enabled SSH on the main router, and after poking around I can see `cfg_server` being sketchy:

1. it uses a lot of RAM -- at the moment it's up to 159MB, maybe half an hour after I last killed it
2. it attempts a lot of TCP connections on port 7788, both from my WAN IP and from the router's LAN IP, to the satellites. They all end up in CLOSE_WAIT.

current breakdown (a one-liner to reproduce these counts is sketched after this list):
source LAN:7788, dest .152 (one of the satellites): 216 sockets
source WAN:7788, dest .152: 229 sockets
source LAN:7788, dest .237 (the other satellite): 470 sockets
source WAN:7788, dest .237: 474 sockets

3. it has a lot of threads, currently 689 -- simple arithmetic puts that at roughly two of those CLOSE_WAIT sockets per thread.
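
Here's roughly the one-liner behind that breakdown (a sketch -- assumes busybox netstat/awk, where the connection state is the sixth column, and that .152/.237 are the satellites):

Code:
netstat -tn | awk '$6 == "CLOSE_WAIT" && $4 ~ /:7788$/ { split($5, peer, ":"); print $4 " <- " peer[1] }' | sort | uniq -c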

When I `killall cfg_server`, the admin web UI gets fast for a minute or two, then goes back to being very slow as cfg_server respawns and spirals out of control again.

I dug a little deeper:

Code:
# cd /proc/$(pidof cfg_server)/fd
# ls -lart |grep socket |wc -l
90
# ls -lart |grep -v socket
dr-xr-xr-x 7 admin root 0 Dec 20 11:11 ..
dr-x------ 2 admin root 0 Dec 20 11:11 .
lr-x------ 1 admin root 64 Dec 20 11:11 9 -> pipe:[118871]
lr-x------ 1 admin root 64 Dec 20 11:11 8 -> pipe:[118870]
lr-x------ 1 admin root 64 Dec 20 11:11 7 -> pipe:[118869]
lr-x------ 1 admin root 64 Dec 20 11:11 5 -> /proc/1/mounts
lrwx------ 1 admin root 64 Dec 20 11:11 2 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 15 -> /var/lock/allwevent.lock
lrwx------ 1 admin root 64 Dec 20 11:11 1 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 0 -> /dev/null
# ps T |grep cfg_server |grep -v grep |wc -l
164

Hypothesis: every one of those 164 threads is contending on /var/lock/allwevent.lock, and the admin web interface backend is also trying to get it, which is why it slows to a crawl. Why are there hundreds of threads? ¯\_(ツ)_/¯

I have a case open waiting for L2, so I'm reluctant to try another firmware at the moment, in hopes that my suffering can help future generations.

googling for `allwevent.lock` yields https://www.snbforums.com/threads/a...ausing-router-not-accessible-via-webui.80469/ which seems fairly on-point.
 
I think contention on `allwevent.lock` is certainly what's making the admin web UI slow. Here's an extremely dirty way of doing something like `lsof`:

Code:
for p in /proc/[0-9]*; do echo $p; cat $p/cmdline; ls -l $p/fd |grep -v 'socket\|null\|mounts\|pipe\|console'; done > ~/procs

If I run this while (slowly) loading a page in the admin web UI, `httpd` is listed as having /var/lock/allwevent.lock open, in addition to the usual `cfg_server`.
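
A slightly more targeted version that skips the dump file and just lists the processes holding the lock open (a sketch; assumes /proc/<pid>/comm exists on this kernel):

Code:
for p in /proc/[0-9]*; do
    ls -l $p/fd 2>/dev/null | grep -q allwevent.lock && echo "$p $(cat $p/comm)"
done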
 
Also note that, unlike in 2-3 of the previous threads mentioning `allwevent.lock`, there's plenty of space on the /var partition -- currently 248MB free.
 
Correction: cfg_server is _listening_ on 7788, the satellites are connecting to _it_.
 
On the satellite nodes I see quite a few sockets to router:7788 in the FIN_WAIT_2 state, which to me indicates that the connection-handling threads in cfg_server aren't exiting and so never close their sockets -- consistent with the CLOSE_WAIT count being ~2x the number of threads:

Code:
# ps T |grep cfg_server |grep -v grep |wc -l
1008

# netstat -tnap |grep cfg_server |grep -v grep |wc -l
1967

# netstat -tnap |grep cfg_server |grep -v grep |grep -v CLOSE_WAIT
tcp        0      0 0.0.0.0:7788            0.0.0.0:*               LISTEN      12305/cfg_server

This instance of cfg_server has been running for a couple of hours, so this seems to be its steady state. Its RAM usage is 268MB, roughly 256KB per thread. I don't think I've seen it grow bigger than that.

ulimit says the max number of file descriptors is 1024; that might be the limiting factor here. I don't think it's what causes the problem, but it might be fencing in the bad behaviour.
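
A more direct check than my shell's ulimit is the limit the running process actually has, plus how many fds it's currently holding (a sketch; assumes a single cfg_server pid):

Code:
grep 'open files' /proc/$(pidof cfg_server)/limits
ls /proc/$(pidof cfg_server)/fd | wc -l    # how close it is to the cap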

I guess I'll try running cfg_server in the foreground and watching its output...
 
After 33 minutes, cfg_server is finally complaining about "too many open files"... I'll try `ulimit -n 10240` and relaunching (sketched after the log below).

(note: the "No such file or directory" errors have been logged every few seconds since boot; they're "normal")
Code:
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
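
The relaunch will be something like this (a sketch from memory -- assumes cfg_server is on the PATH and inherits the raised limit from the shell):

Code:
killall cfg_server
ulimit -n 10240    # raise the per-process fd limit from the default 1024
cfg_server         # run it from the shell again so its output stays visible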
 
Summing up:

I'm pretty sure there's a bug in cfg_server, in which the threads it spawns to handle cfg_client requests from AiMesh satellites don't get shut down when the client disconnects. (yes, I'm a software engineer, can you tell? ;)

Symptoms
  1. Admin web UI is extremely slow, 5-60 seconds per request. Killing cfg_server on the router makes it fast again for a very short time, until cfg_server gets restarted by init

  2. The router's DNS service stops working every 1-3 days, but connectivity is still live, e.g.:
    dig google.ca times out
    ping 8.8.8.8 is fine
    dig google.ca @8.8.8.8 is fine

  3. Very high memory usage
    Code:
    router# ps w |grep cfg_server |grep -v grep
    30614 admin 269m S cfg_server

  4. Very large number of threads
    Code:
    router# ps T |grep cfg_server |grep -v grep |wc -l
    1008

  5. Very large number of sockets left open, all in CLOSE_WAIT state, indicating a dead server process/thread (CLOSE_WAIT sockets are never reaped by the kernel)
    Code:
    router# netstat -tnap |grep cfg_server |grep -vi listen |grep -v grep |wc -l
    1942
    
    router# netstat -tnap |grep cfg_server |grep -vi listen |grep -v grep |grep -v CLOSE_WAIT |wc -l
    0

  6. Client sockets in the FIN_WAIT2 state on the satellites, indicating a dead server process/thread (FIN_WAIT2 sockets are eventually reaped by the kernel)

    Code:
    satellite# netstat -tnap |grep :7788 |grep 'FIN_WAIT2' |wc -l
    21
    satellite# netstat -tnap |grep :7788 |grep -vi 'FIN_WAIT2\|listen' |wc -l
    0
I've tried running a more recent cfg_server binary from https://github.com/gnuton/asuswrt-m.../release/src/router/cfg_mnt/prebuild/RT-AX95Q but it exhibits the same problems.

Hypotheses

The proximate cause of the extreme slowness of the admin web UI is contention on /var/lock/allwevent.lock: every one of the zombie cfg_server threads is either still holding it or reacquiring it in a loop. That in turn is caused by the cfg_server threads not dying when they're supposed to, and I have no idea why THAT happens. The zombie threads also consume file descriptors, which are capped at 1024.
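
For quick re-checking later, here are the key numbers from the symptoms above bundled into one paste (a sketch -- the same commands as above, nothing new):

Code:
echo "threads:    $(ps T | grep cfg_server | grep -v grep | wc -l)"
echo "open fds:   $(ls /proc/$(pidof cfg_server)/fd | wc -l)"
echo "CLOSE_WAIT: $(netstat -tn | grep ':7788' | grep -c CLOSE_WAIT)"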

Next Steps

Next time the DNS stops working, I'll try to remember to check whether dnsmasq is still running, restart it regardless, and see if there's anything in the logs about its failure.
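
Something like this, for next time (a sketch -- assumes the stock `service` wrapper and the usual /tmp/syslog.log location; adjust if yours differ):

Code:
pidof dnsmasq                                # is it even running?
nslookup google.ca 127.0.0.1                 # does the local resolver answer?
grep dnsmasq /tmp/syslog.log | tail -n 20    # anything interesting logged?
service restart_dnsmasq                      # restart just dnsmasq, not the whole router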
 
That's quite a bit of investigation. (You could probably avoid the multiple calls to grep ;)

I loaded up (I believe) that firmware and poked around in it. I preferred the Merlin environment, so I reverted. Uptime on the router has got to be 60 days by now, so that was at least that long ago.

I guess I'm wondering: what's your uptime, especially since wiping the slate clean?
 
I guess I'm wondering: what's your uptime, especially since wiping the slate clean?
Right now, only 16.5 hours... The next time DNS goes down I'll try to check the logs and maybe just restart dnsmasq instead of power-cycling the router...
 
I'm talking about a complete factory reset. As I'd suggested, my time on that firmware was minimal, so if I had that problem it went unnoticed. But I did perform a factory reset via the GUI after loading the (any) firmware.
 
I'm talking about a complete factory reset. As I'd suggested, my time on that firmware was minimal, so if I had that problem it went unnoticed. But I did perform a factory reset via the GUI after loading the (any) firmware.
Oh I see, yeah, I think I did a reset recently, but I'll try it again if I keep seeing problems...

Also, the damnedest thing just happened: the router rebooted itself spontaneously 29 minutes ago, while the satellites have been up for ~1.5 days. Now cfg_server seems to be behaving itself.

Forget about it Jake, it's computertown.
 
For future reference, you can use
Code:
grep '[^]]whatever'
to exclude the grep call itself from the result, without needing a second `grep -v grep`.
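
For example, the thread-count check from earlier becomes a single grep:

Code:
ps T | grep '[^]]cfg_server' | wc -l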
 
EXACT same behavior here: `killall cfg_server` instantly returns the web UI to snappy as heck. What's up with that, Asus? I'm also waiting on L2 support.
 
welp, cfg_server is back to misbehaving again:

Code:
admin@router:/tmp/home/root# netstat -tnap |grep cfg_server |wc -l
1949
admin@router:/tmp/home/root# ps T |grep cfg_server |wc -l
1012
 
Flashing back to 386 solved it for me. 388 is still pretty buggy in myriad ways (for example, syncing the guest network to AiMesh nodes causes random IP dropouts on the guest network from the nodes).
 
22068 was released in the past couple of days; I'll give it a try

edit: so far so good -- 3 hours later, cfg_server is behaving well. I didn't do a factory reset; I just flashed all three units. I'll update here if anything interesting happens.
 
Yesterday I hard-reset all three units, still running 3.0.0.4.388_22068.

A couple differences from last time:
* I set up static DHCP reservations for the satellites' 5GHz-2 radios, on .2 and .3
* I'm back to wired backhaul

At the moment (21 hours of uptime) cfg_server is behaving properly, but a day of stability holds no precedential value for me anymore.
 
