Stuck commands

Tech9 · Jun 17, 2022

SomeWhereOverTheRainBow said:
Has anyone phoned home to Asus with these findings?

Does it break something in stock Asuswrt?

SomeWhereOverTheRainBow · Jun 17, 2022

Tech9 said:
Does it break something in stock Asuswrt?

Needs to be tested, but I don't know how. Does Optware support strace? If so, then I imagine one would only need to follow the earlier troubleshooting procedures from @dave14305 and others to find out.

Tech9 · Jun 17, 2022

If it doesn't:

"The number you are trying to call is unavailable at the moment. Please, try again later."

SomeWhereOverTheRainBow · Jun 17, 2022

Tech9 said:
If it doesn't:

"The number you are trying to call is unavailable at the moment. Please, try again later."

hahaha. For me, that is still the same message because I don't have this issue.

Martinski · Jun 18, 2022

Tech9 said:
Does it break something in stock Asuswrt?

Based on what I've seen on the GT-AC5300 (as shown in the above trace log) which is running the latest stock OEM firmware, the only process that appears to be affected (so far) is "conn_diag" because it's calling the "wl" command fairly frequently. I assume this process is some internal diagnostic utility so it wouldn't be a critical component whose failure may affect the router's networking operations. Note that on the RT-AC86U I have also seen the "cru l" command fail because it makes a call to "nvram get http_username" so it might be safe to say that the same failure could happen on the GT-AC5300. Since there are no 3rd-party add-ons installed on the GT-AC5300, the frequency of "stuck" commands is significantly lower than on a router running RMerlin's F/W which can have any number of add-ons installed & running in the background as cron jobs.

Oracle · Jun 18, 2022

@Martinski, do you mind testing any of our wrapper scripts to see if it would do its job on the GT-AC5300 and prevent stuck commands?

Martinski · Jun 18, 2022

Oracle said:
@Martinski, do you mind testing any of our wrapper scripts to see if it would do its job on the GT-AC5300 and prevent stuck commands?

Well, first let me be clear that I don't own the GT-AC5300 router. It belongs to one of my best friends who has it running at his home, and I would need his consent to run any type of testing. He trusts me enough to give me full access to the router since the pandemic started, when I helped him reconfigure his home network & set up their PCs so that his whole family (wife + 3 kids) could work & attend school remotely, and have conference calls, remote meetings, etc. So while I have complete access to the router via an OpenVPN Server to do regular checks & maintenance (e.g. F/W updates, reconfiguration based on needs, etc.) I cannot do what I please without his prior knowledge & approval.

Now, I pretty much know what my friend is going to say (I've known him since our days in graduate school, so 24+ years). First, he'll say he trusts my judgment as to how to proceed, but he'll also ask if I have already vetted the testing scripts, and I'll say that I have not.

I have followed this thread & fully appreciate the efforts that you and @SomeWhereOverTheRainBow have made to come up with the scripts that address the problems with "stuck" commands. However, my personal "trust, but verify" philosophy & policy does not allow me to simply take the scripts, install them on a router that is *not* my own, and run tests that essentially replace critical commands (nvram & wl). I know that the changes can be easily undone. But the point is that I have not personally vetted the scripts myself by running full tests (e.g. trying as many options of the command syntax: "nvram get", "nvram set", "nvram unset", "nvram save", "nvram restore", etc., etc.) under various script conditions. Until I do that, I cannot put the scripts on someone else's router. I always test & verify full scripts on my own router first. Always. No exceptions. I'm very methodical & meticulous when it comes to my S/W development work, and that unavoidably bleeds into my own personal work, especially if the scripts are also going into someone else's system.

Now, this weekend I don't have any time at all to run tests. I'm fully "booked" for the weekend. Being Father's Day on Sunday, we'll spend it at my parents' home and won't be back until the evening. Today, Saturday, we already have plans. During weekdays, I prefer not to change the router at all and in any way because we need it to operate reliably & consistently for remote work & school (as mentioned in a previous post).

Bottom line, at this point I doubt that I'll be able to run the scripts on my friend's GT-AC5300 router.

Tech9 · Jun 18, 2022

Martinski said:
Based on what I've seen on the GT-AC5300

I don't know if it's related, but I've seen the following stuck on AC86U running Asuswrt:

- Clients List (rare)
- AiProtection logging (rare)
- Web History (common)
- Traffic Analyzer (common)
- when testing 384 code sometimes it turns off on reboot, haven't seen it on 386 though

Mostly TrendMicro components. A reboot fixes it. Perhaps explains why folks set reboot scheduler.

Martinski · Jun 18, 2022

Tech9 said:
I don't know if it's related, but I've seen the following stuck on AC86U running Asuswrt:

- Clients List (rare)
- AiProtection logging (rare)
- Web History (common)
- Traffic Analyzer (common)
- when testing 384 code sometimes it turns off on reboot, haven't seen it on 386 though

Mostly TrendMicro components. A reboot fixes it. Perhaps explains why folks set reboot scheduler.

All of the ASUS router's extra services (AiProtection, Traffic Analyzer/Traffic Monitor, Parental Controls, QoS, AiCloud, AiDisk, etc.) are disabled on my own RT-AC86U as well as on the GT-AC5300, so that may be why I have not observed any of the problems you listed. It's possible that some of those services make "nvram" or "wl" calls, which upon failure to return would affect the behavior of the service.

WRT the Client List, I don't look at it or pay close attention to it at all ever since I realized that it was not consistently accurate or reliable, and not just for the RT-AC86U, but pretty much for every ASUS router that I had my hands on.

Tech9 · Jun 18, 2022

Martinski said:
I have not observed any of the problems you listed

This is only a test router. When I read about issues, I try to replicate.

Martinski said:
I realized that it was not consistently accurate or reliable

Correct. I've seen it stuck completely though, with frozen connection rates.

Oracle · Jun 18, 2022

Martinski said:
Well, first let me be clear that I don't own the GT-AC5300 router. It belongs to one of my best friends who ...

... at this point I doubt that I'll be able to run the scripts on my friend's GT-AC5300 router.

That was a looong way to say "No" but fair enough.

dave14305 · Jun 18, 2022

Definition of BLOVIATE

to speak or write verbosely and windily… See the full definition

www.merriam-webster.com

ColinTaylor · Jun 18, 2022

dave14305 said:
I’m just saying that I don’t understand why eapd with pid 24095 uses nl_pid 24084 and then 24084+32770. Unless the precompiled binaries are broken in that regard.

OK I've done a lot more tracing on my RT-AX86U and I think I understand more about what's happening, but I'm also slightly confused.

I wrote a program (intended to be run on an RT-AC86U) that cross-references the netlink table with the current process list and prints out orphaned entries. Much to my surprise I see the same anomaly that appears on the RT-AC86U. Assuming we can use the lsof inode information to try and identify the originating process (which appears to be the case) I then came up with these processes:

ceventd
cfg_server
mcpd
eapd
wlceventd
debug_monitor
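The cross-referencing program itself was not posted. As a rough sketch of the idea only, the Python below parses text laid out like /proc/net/netlink and flags entries whose Pid column (the netlink port ID, nl_pid) matches no running process. The function names are made up for illustration, the column names are assumed from the mainline kernel's table header, and a port ID is not guaranteed to equal a PID in the first place, so anything it reports is a candidate to investigate rather than proof of an orphan.

```python
import os


def parse_netlink_table(text):
    """Turn /proc/net/netlink-style text into a list of {column: value} dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()  # assumed to include a "Pid" column
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def current_pids():
    """Collect the PIDs of running processes from the numeric names in /proc."""
    return {int(name) for name in os.listdir("/proc") if name.isdigit()}


def orphaned_entries(netlink_text, live_pids):
    """Return table rows whose port ID does not match any live PID."""
    orphans = []
    for row in parse_netlink_table(netlink_text):
        port_id = int(row["Pid"])
        if port_id == 0:
            continue  # port ID 0 is the kernel side of the socket
        if port_id not in live_pids:
            orphans.append(row)
    return orphans
```

On a router with Python available, it could be fed the real table with `orphaned_entries(open("/proc/net/netlink").read(), current_pids())`; the Inode column of each reported row is what would then be matched against lsof output.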

I've only traced a couple of them so far and the issue (unsurprisingly) is that internally they make a call to read nvram. But what makes these processes different is that after setting up netlink to read nvram they clone child processes with netlink still open. This creates a problem that was highlighted over on one of the kernel developer lists. As long as at least one of the child processes is running it will hold open the parent process' netlink connection (even if it's not used). In our case the parent process and many of the clones that it spawned have long since terminated but one still remains.
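The effect described above, a lingering child keeping its parent's descriptor alive, is ordinary descriptor inheritance and can be reproduced without netlink. In this minimal sketch (Linux/Unix only), a pipe stands in for the netlink socket, which is an assumption made purely for illustration: the reader sees end-of-file only once every inherited copy of the write end is closed, which here means when the idle child finally exits. The function name is hypothetical.

```python
import os
import time


def eof_delay_with_inheriting_child(child_lifetime):
    """Seconds the parent waits for EOF while a forked child still holds the pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: inherits write_fd, never uses it, and simply lingers.
        try:
            os.close(read_fd)
            time.sleep(child_lifetime)
        finally:
            os._exit(0)  # exiting is what finally releases the inherited copy
    # Parent: close our own copy of the write end, as a tidy program would.
    os.close(write_fd)
    start = time.monotonic()
    os.read(read_fd, 1)  # blocks until the child's inherited copy is gone too
    elapsed = time.monotonic() - start
    os.close(read_fd)
    os.waitpid(pid, 0)
    return elapsed
```

Calling `eof_delay_with_inheriting_child(2.0)` should return roughly two seconds: the parent did everything right, yet the resource stayed busy for as long as the child lived.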

So this would account for the discrepancy between the pid and the nl_pid. But here's where my confusion comes in... Why am I not experiencing the same "hung" process problem on my router? I ran an infinite loop that read an nvram variable and let it run for a few hours. It never hung.

So what is it that's different about the RT-AC86U? I assume there's a standard Broadcom API to read the nvram. So is this an issue with the API or the user space program (nvram)? From the nvram and wl traces it looks like the error isn't being properly trapped, falls through to the next bit of code and attempts to read the nonexistent socket.

dave14305 · Jun 18, 2022

ColinTaylor said:
So what is it that's different about the RT-AC86U? I assume there's a standard Broadcom API to read the nvram. So is this an issue with the API or the user space program (nvram)? From the nvram and wl traces it looks like the error isn't being properly trapped, falls through to the next bit of code and attempts to read the nonexistent socket.

I was trying to look back in the old Merlin repo or John’s fork for any older “less-closed source” versions of nvram. I didn’t find anything I thought was useful.

I think we’re approaching Broadcom’s dead-end alley.

SomeWhereOverTheRainBow · Jun 18, 2022

dave14305 said:
Definition of BLOVIATE

to speak or write verbosely and windily… See the full definition

www.merriam-webster.com

I prefer the (Google) Oxford definition:

Tech9 · Jun 18, 2022

Do we know if other router models are affected by this bug?

Maverickcdn · Jun 18, 2022

I'll add my observation for reference... my wicens script runs every 10 mins, making several sequential nvram get calls to set variables. Early on I'd find hung calls (including the wl call for eth5 noise) anywhere from weeks to months apart. I now kill wlceventd at startup, as I don't use AiMesh, and I rarely see hung processes anymore, although now it is more often the wl call that gets hung. Since my boot date of Mar 26, I've had 1 hung nvram get and 1 hung wl.

Given how often my script calls nvram get and how rarely those calls hang, my opinion is that there is a timing issue somewhere: if a wl or nvram call is made while the router is doing something else in one of the processes mentioned by ColinTaylor, it will get hung. The test loop scripts from users in the other thread would hang in <10000 iterations; I can routinely go >100000 (15+ mins runtime) with no issue, although sometimes it will hang between 3K-4K iterations.
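The test loop scripts are not reproduced in this part of the thread. As a minimal sketch of the same idea, the Python below runs a command repeatedly with a timeout and reports the first iteration that failed to return in time. The function name, the timeout value and the use of Python instead of a shell script are all assumptions for illustration.

```python
import subprocess


def run_until_hang(cmd, iterations, timeout_s):
    """Run cmd up to `iterations` times; return the 1-based iteration that
    timed out, or None if every run returned within timeout_s seconds."""
    for i in range(1, iterations + 1):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,  # run() kills the child when this expires
                check=False,        # only hangs matter here, not exit codes
            )
        except subprocess.TimeoutExpired:
            return i
    return None
```

On a router this might be called as `run_until_hang(["nvram", "get", "http_username"], 100000, 5.0)`. Because the hung child is killed on timeout, the loop leaves no stuck process behind, which is also roughly what a wrapper script has to arrange.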

I only had cfg_server, mcpd, eapd currently running.

ColinTaylor · Jun 18, 2022

dave14305 said:
I think we’re approaching Broadcom’s dead-end alley.

OK, I managed to set up a situation on my router that creates the pid-in-use problem seen on the RT-AC86U.

Here's @dave14305's strace for nvram get that hangs:

Code:

     0.000084 stat64("/jffs", 0xffb8e7a0) = 0
     0.000055 stat64("/jffs/nvram_war", 0xffb8e7a0) = 0
     0.000117 socket(AF_NETLINK, SOCK_RAW, 0x1f /* NETLINK_??? */) = 3
     0.000078 bind(3, {sa_family=AF_NETLINK, nl_pid=1098, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000073 brk(NULL)                 = 0x42e000
     0.000049 brk(0x44f000)             = 0x44f000
     0.000052 sendmsg(3, {msg_name=0xcffb8e830, msg_namelen=-4659160, msg_iov=NULL, msg_iovlen=0, msg_control=0xf6ec7d74ffb8ee04, msg_controllen=4142697156, msg_flags=MSG_DONTROUTE|MSG_CTRUNC|MSG_PROBE|MSG_TRUNC|MSG_DONTWAIT|MSG_WAITALL|MSG_FIN|MSG_CONFIRM|MSG_ERRQUEUE|MSG_NOSIGNAL|MSG_MORE|MSG_NO_SHARED_FRAGS|MSG_ZEROCOPY|MSG_FASTOPEN|MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT|0x1bb00000}, 0) = 36
     0.000178 recvmsg(3,

And here's my strace:

Code:

     0.000094 stat64("/jffs", 0xffe693f0) = 0
     0.000054 stat64("/jffs/nvram_war", 0xffe693f0) = 0
     0.000103 getpid()                  = 1217
     0.000046 socket(AF_NETLINK, SOCK_RAW, 0x1f /* NETLINK_??? */) = 3<socket:[504128]>
     0.000087 bind(3<socket:[504128]>, {sa_family=AF_NETLINK, nl_pid=1217, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000074 bind(3<socket:[504128]>, {sa_family=AF_NETLINK, nl_pid=1218, nl_groups=00000000}, 12) = 0
     0.000074 openat(AT_FDCWD</jffs/scripts>, "/proc/sys/kernel/pid_max", O_RDONLY) = 4</proc/sys/kernel/pid_max>

Spot the difference?

I don't think this is a problem as such with individual user space programs like nvram or wl or how they're being used, but a coding bug in a common module. My guess is that it's in libnvram.so which is supplied as a prebuilt module for each platform.
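One reading of the two traces: in the working trace the first bind() fails with EADDRINUSE and is retried with the next port ID (1217, then 1218), while in the hanging trace the failure is followed directly by sendmsg(). The closed-source library's actual logic is unknown, so the sketch below only illustrates a retry-on-EADDRINUSE pattern. `bind_with_fallback`, `make_fake_bind` and the attempt limit are invented for the example, and the bind call is injected so the logic can be exercised without a real netlink socket.

```python
import errno


def bind_with_fallback(bind, first_port_id, max_attempts=16):
    """Call bind(port_id), moving to the next port ID while the current one
    is already in use. Returns the port ID that was bound successfully."""
    port_id = first_port_id
    for _ in range(max_attempts):
        try:
            bind(port_id)
            return port_id
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise  # any other failure is reported, not swallowed
            port_id += 1
    # Give up loudly instead of carrying on with an unbound socket.
    raise OSError(errno.EADDRINUSE, "no free netlink port ID found")


def make_fake_bind(taken):
    """Hypothetical stand-in for a socket bind: rejects port IDs in `taken`."""
    def fake_bind(port_id):
        if port_id in taken:
            raise OSError(errno.EADDRINUSE, "Address already in use")
    return fake_bind
```

With port ID 1217 marked as taken, `bind_with_fallback(make_fake_bind({1217}), 1217)` returns 1218, the same step seen in the working trace. The key design point is that every failure path either retries or raises, so the caller never reaches a send on a socket that was not bound.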

dave14305 · Jun 18, 2022

ColinTaylor said:
Spot the difference?

+1

SomeWhereOverTheRainBow · Jun 18, 2022

ColinTaylor said:

OK, I managed to set up a situation on my router that creates the pid-in-use problem seen on the RT-AC86U.

Here's @dave14305's strace for nvram get that hangs:

Code:

     0.000084 stat64("/jffs", 0xffb8e7a0) = 0
     0.000055 stat64("/jffs/nvram_war", 0xffb8e7a0) = 0
     0.000117 socket(AF_NETLINK, SOCK_RAW, 0x1f /* NETLINK_??? */) = 3
     0.000078 bind(3, {sa_family=AF_NETLINK, nl_pid=1098, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000073 brk(NULL)                 = 0x42e000
     0.000049 brk(0x44f000)             = 0x44f000
     0.000052 sendmsg(3, {msg_name=0xcffb8e830, msg_namelen=-4659160, msg_iov=NULL, msg_iovlen=0, msg_control=0xf6ec7d74ffb8ee04, msg_controllen=4142697156, msg_flags=MSG_DONTROUTE|MSG_CTRUNC|MSG_PROBE|MSG_TRUNC|MSG_DONTWAIT|MSG_WAITALL|MSG_FIN|MSG_CONFIRM|MSG_ERRQUEUE|MSG_NOSIGNAL|MSG_MORE|MSG_NO_SHARED_FRAGS|MSG_ZEROCOPY|MSG_FASTOPEN|MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT|0x1bb00000}, 0) = 36
     0.000178 recvmsg(3,

And here's my strace:

Code:

     0.000094 stat64("/jffs", 0xffe693f0) = 0
     0.000054 stat64("/jffs/nvram_war", 0xffe693f0) = 0
     0.000103 getpid()                  = 1217
     0.000046 socket(AF_NETLINK, SOCK_RAW, 0x1f /* NETLINK_??? */) = 3<socket:[504128]>
     0.000087 bind(3<socket:[504128]>, {sa_family=AF_NETLINK, nl_pid=1217, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000074 bind(3<socket:[504128]>, {sa_family=AF_NETLINK, nl_pid=1218, nl_groups=00000000}, 12) = 0
     0.000074 openat(AT_FDCWD</jffs/scripts>, "/proc/sys/kernel/pid_max", O_RDONLY) = 4</proc/sys/kernel/pid_max>

Spot the difference?

I don't think this is a problem as such with individual user space programs like nvram or wl or how they're being used, but a coding bug in a common module. My guess is that it's in libnvram.so which is supplied as a prebuilt module for each platform.

So would this have been an issue created from the blobs upstream of @RMerlin? BTW, superb diagnostic work on your part, @ColinTaylor.
