XT8 troubleshooting log, cfg_server possibly an issue

Hey folks,

[edit1: all three nodes are running 3.0.0.4.388_21617]
[edit2: as of 2023/10/12, 23285 has finally been stable for me! Currently 73 days uptime]

I have three XT8s, currently using wireless backhaul. Previously I had them on wired backhaul, but I was having serious stability issues (blinking red lights on the satellites), so I reset everything to use wireless backhaul, and everything was stable for a month or two. Now I've been having these symptoms:

1. dnsmasq stops working every 1-4 days. I can still `ping 8.8.8.8` and `dig google.ca @8.8.8.8`, so I know it's not a connectivity issue, just dnsmasq being dead or slow (quick checks collected after this list).
2. the admin web interface gets incredibly slow, with many requests taking 5-60 seconds. The mobile app is affected identically.
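
For the record, this is what that looks like when it happens (nothing new, just the checks from item 1 plus the failing case, in one place):

Code:
ping 8.8.8.8               # fine, so it's not connectivity
dig google.ca @8.8.8.8     # fine, so upstream DNS is reachable
dig google.ca              # times out -- the router's own resolver is the problem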

Recently I enabled SSH on the main router, and after poking around I can see `cfg_server` being sketchy:

1. it uses a lot of RAM -- at the moment it's up to 159MB, maybe half an hour after I last killed it
2. it attempts a lot of TCP connections on port 7788, both from my WAN IP and from the router's LAN IP, to the satellites. They all end up in CLOSE_WAIT.

current breakdown (a one-liner to reproduce these counts is sketched after this list):
source LAN:7788, dest .152 (one of the satellites): 216 sockets
source WAN:7788, dest .152: 229 sockets
source LAN:7788, dest .237 (the other satellite): 470 sockets
source WAN:7788, dest .237: 474 sockets

3. it has a lot of threads, currently 689 -- simple arithmetic puts that at roughly two of those CLOSE_WAIT sockets per thread.
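
Here's roughly the one-liner behind that breakdown (a sketch -- assumes busybox netstat/awk, where the connection state is the sixth column, and that .152/.237 are the satellites):

Code:
netstat -tn | awk '$6 == "CLOSE_WAIT" && $4 ~ /:7788$/ { split($5, peer, ":"); print $4 " <- " peer[1] }' | sort | uniq -c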

When I `killall cfg_server`, the admin web UI gets fast for a minute or two, then goes back to being very slow as cfg_server respawns and spirals out of control again.

I dug a little deeper:

Code:
# cd /proc/$(pidof cfg_server)/fd
# ls -lart |grep socket |wc -l
90
# ls -lart |grep -v socket
dr-xr-xr-x 7 admin root 0 Dec 20 11:11 ..
dr-x------ 2 admin root 0 Dec 20 11:11 .
lr-x------ 1 admin root 64 Dec 20 11:11 9 -> pipe:[118871]
lr-x------ 1 admin root 64 Dec 20 11:11 8 -> pipe:[118870]
lr-x------ 1 admin root 64 Dec 20 11:11 7 -> pipe:[118869]
lr-x------ 1 admin root 64 Dec 20 11:11 5 -> /proc/1/mounts
lrwx------ 1 admin root 64 Dec 20 11:11 2 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 15 -> /var/lock/allwevent.lock
lrwx------ 1 admin root 64 Dec 20 11:11 1 -> /dev/null
lrwx------ 1 admin root 64 Dec 20 11:11 0 -> /dev/null
# ps T |grep cfg_server |grep -v grep |wc -l
164

Hypothesis: every one of those 164 threads is contending on /var/lock/allwevent.lock, and the admin web interface backend is also trying to get it, which is why it slows to a crawl. Why are there hundreds of threads? ¯\_(ツ)_/¯

I have a case open waiting for L2, so I'm reluctant to try another firmware at the moment, in hopes that my suffering can help future generations.

googling for `allwevent.lock` yields https://www.snbforums.com/threads/a...ausing-router-not-accessible-via-webui.80469/ which seems fairly on-point.
 
I think contention on `allwevent.lock` is certainly what's making the admin web UI slow. Here's an extremely dirty way of doing something like `lsof`:

Code:
for p in /proc/[0-9]*; do echo $p; cat $p/cmdline; ls -l $p/fd |grep -v 'socket\|null\|mounts\|pipe\|console'; done > ~/procs

If I run this while (slowly) loading a page in the admin web UI, `httpd` is listed as having /var/lock/allwevent.lock open, in addition to the usual `cfg_server`.
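
A slightly more targeted version that skips the dump file and just lists the processes holding the lock open (a sketch; assumes /proc/<pid>/comm exists on this kernel):

Code:
for p in /proc/[0-9]*; do
    ls -l $p/fd 2>/dev/null | grep -q allwevent.lock && echo "$p $(cat $p/comm)"
done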
 
Also note that, unlike in 2-3 of the previous threads mentioning `allwevent.lock`, there's plenty of space on the /var partition -- currently 248MB free.
 
Correction: cfg_server is _listening_ on 7788, the satellites are connecting to _it_.
 
On the satellite nodes I see quite a few sockets to router:7788 in the FIN_WAIT_2 state, which to me indicates that the connection-handling threads in cfg_server aren't exiting and so never close their sockets -- consistent with the CLOSE_WAIT count being ~2x the number of threads:

Code:
# ps T |grep cfg_server |grep -v grep |wc -l
1008

# netstat -tnap |grep cfg_server |grep -v grep |wc -l
1967

# netstat -tnap |grep cfg_server |grep -v grep |grep -v CLOSE_WAIT
tcp        0      0 0.0.0.0:7788            0.0.0.0:*               LISTEN      12305/cfg_server

This instance of cfg_server has been running for a couple of hours, so this seems to be its steady state. Its RAM usage is 268MB, roughly 256KB per thread. I don't think I've seen it grow bigger than that.

ulimit says the max number of file descriptors is 1024; that might be the limiting factor here. I don't think it's what causes the problem, but it might be fencing in the bad behaviour.
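
A more direct check than my shell's ulimit is the limit the running process actually has, plus how many fds it's currently holding (a sketch; assumes a single cfg_server pid):

Code:
grep 'open files' /proc/$(pidof cfg_server)/limits
ls /proc/$(pidof cfg_server)/fd | wc -l    # how close it is to the cap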

I guess I'll try running cfg_server in the foreground and watching its output...
 
After 33 minutes, cfg_server is finally complaining about "too many open files"... I'll try `ulimit -n 10240` and relaunching (sketched after the log below).

(note: the "No such file or directory" errors have been logged every few seconds since boot; they're "normal")
Code:
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: No such file or directory
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_from_file: error opening file /tmp/maclist.json: Too many open files
json_object_from_file: error opening file /tmp/wiredclientlist.json: Too many open files
json_object_to_file: error opening file /tmp/wiredclientlist.json: Too many open files
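
The relaunch will be something like this (a sketch from memory -- assumes cfg_server is on the PATH and inherits the raised limit from the shell):

Code:
killall cfg_server
ulimit -n 10240    # raise the per-process fd limit from the default 1024
cfg_server         # run it from the shell again so its output stays visible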
 
Summing up:

I'm pretty sure there's a bug in cfg_server, in which the threads it spawns to handle cfg_client requests from AiMesh satellites don't get shut down when the client disconnects. (yes, I'm a software engineer, can you tell? ;)

Symptoms
  1. Admin web UI is extremely slow, 5-60 seconds per request. Killing cfg_server on the router makes it fast again for a very short time, until cfg_server gets restarted by init

  2. The router's DNS service stops working every 1-3 days, but connectivity is still live, e.g.:
    dig google.ca times out
    ping 8.8.8.8 is fine
    dig google.ca @8.8.8.8 is fine

  3. Very high memory usage
    Code:
    router# ps w |grep cfg_server |grep -v grep
    30614 admin 269m S cfg_server

  4. Very large number of threads
    Code:
    router# ps T |grep cfg_server |grep -v grep |wc -l
    1008

  5. Very large number of sockets left open, all in CLOSE_WAIT state, indicating a dead server process/thread (CLOSE_WAIT sockets are never reaped by the kernel)
    Code:
    router# netstat -tnap |grep cfg_server |grep -vi listen |grep -v grep |wc -l
    1942
    
    router# netstat -tnap |grep cfg_server |grep -vi listen |grep -v grep |grep -v CLOSE_WAIT |wc -l
    0

  6. Client sockets in the FIN_WAIT2 state on the satellites, indicating a dead server process/thread (FIN_WAIT2 sockets are eventually reaped by the kernel)

    Code:
    satellite# netstat -tnap |grep :7788 |grep 'FIN_WAIT2' |wc -l
    21
    satellite# netstat -tnap |grep :7788 |grep -vi 'FIN_WAIT2\|listen' |wc -l
    0
I've tried running a more recent cfg_server binary from https://github.com/gnuton/asuswrt-m.../release/src/router/cfg_mnt/prebuild/RT-AX95Q but it exhibits the same problems.

Hypotheses

The proximate cause of the extreme slowness of the admin web UI is contention on /var/lock/allwevent.lock: every one of the zombie cfg_server threads is either still holding it or reacquiring it in a loop. That in turn is caused by the cfg_server threads not dying when they're supposed to, and I have no idea why THAT happens. The zombie threads also consume file descriptors, which are capped at 1024.
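
For quick re-checking later, here are the key numbers from the symptoms above bundled into one paste (a sketch -- the same commands as above, nothing new):

Code:
echo "threads:    $(ps T | grep cfg_server | grep -v grep | wc -l)"
echo "open fds:   $(ls /proc/$(pidof cfg_server)/fd | wc -l)"
echo "CLOSE_WAIT: $(netstat -tn | grep ':7788' | grep -c CLOSE_WAIT)"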

Next Steps

Next time the DNS stops working, I'll try to remember to check whether dnsmasq is still running, restart it regardless, and see if there's anything in the logs about its failure.
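
Something like this, for next time (a sketch -- assumes the stock `service` wrapper and the usual /tmp/syslog.log location; adjust if yours differ):

Code:
pidof dnsmasq                                # is it even running?
nslookup google.ca 127.0.0.1                 # does the local resolver answer?
grep dnsmasq /tmp/syslog.log | tail -n 20    # anything interesting logged?
service restart_dnsmasq                      # restart just dnsmasq, not the whole router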
 
That's quite a bit of investigation. (You could probably avoid the multiple calls to grep ;)

I loaded up (I believe) that firmware and poked around in it. I preferred the Merlin environment, so I reverted. Uptime on the router has got to be 60 days by now, so that was at least that long ago.

I guess I'm wondering: what's your uptime, especially since wiping the slate clean?
 
I guess I'm wondering: what's your uptime, especially since wiping the slate clean?
Right now, only 16.5 hours... The next time DNS goes down I'll try to check the logs and maybe just restart dnsmasq instead of power-cycling the router...
 
I'm talking about a complete factory reset. As I'd suggested, my time on that firmware was minimal, so if I had that problem it went unnoticed. But I did perform a factory reset via the GUI after loading the (any) firmware.
 
I'm talking about a complete factory reset. As I'd suggested, my time on that firmware was minimal, so if I had that problem it went unnoticed. But I did perform a factory reset via the GUI after loading the (any) firmware.
Oh I see, yeah, I think I did a reset recently, but I'll try it again if I keep seeing problems...

Also, the damnedest thing just happened: the router rebooted itself spontaneously 29 minutes ago, while the satellites have been up for ~1.5 days. Now cfg_server seems to be behaving itself.

Forget about it Jake, it's computertown.
 
For future reference, you can use
Code:
grep '[^]]whatever'
to exclude the grep call itself from the result, without needing a second `grep -v grep`.
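
For example, the thread-count check from earlier becomes a single grep:

Code:
ps T | grep '[^]]cfg_server' | wc -l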
 
EXACT same behavior here: `killall cfg_server` instantly returns the web UI to snappy as heck. What's up with that, Asus? I'm also waiting on L2 support.
 
welp, cfg_server is back to misbehaving again:

Code:
admin@router:/tmp/home/root# netstat -tnap |grep cfg_server |wc -l
1949
admin@router:/tmp/home/root# ps T |grep cfg_server |wc -l
1012
 
Flashing back to 386 solved it for me. 388 is still pretty buggy in myriad ways (for example, syncing the guest network to AiMesh nodes causes random IP dropouts on the guest network from the nodes).
 
22068 was released in the past couple of days; I'll give it a try

edit: so far so good -- 3 hours later, cfg_server is behaving well. I didn't do a factory reset; I just flashed all three units. I'll update here if anything interesting happens.
 
Yesterday I hard-reset all three units, still running 3.0.0.4.388_22068.

A couple differences from last time:
* I set up static DHCP reservations for the satellites' 5GHz-2 radios, on .2 and .3
* I'm back to wired backhaul

At the moment (21 hours of uptime) cfg_server is behaving properly, but a day of stability holds no precedential value for me anymore.
 
