What's new

wlcsm_create_nl_socket binding netlink socket error

  • SNBForums Code of Conduct

    SNBForums is a community for everyone, no matter what their level of experience.

    Please be tolerant and patient of others, especially newcomers. We are all here to share and learn!

    The rules are simple: Be patient, be nice, be helpful or be gone!

Viktor Jaep

Part of the Furniture
I seem to get these kinds of errors quite frequently while running scripts on my GT-AX6000. When these pop up, it basically stops scripts in their tracks, and locks them until I break out of them. It's random, sometimes happening within a few minutes... other times, many many hours before I see them.

1700539089573.png


Like with my ol' RT-AC86U and it's strange script hang-ups, I guess the GT-AX6000 isn't immune. After a reboot, they stay away for a while, but eventually come back. Has anyone encountered these, and/or know of a way to fix this error? I'm sure it's a bug of sorts... and luckily the timeout command seems to be able to get around them... but it would be nice not to have to use these workarounds. ;)
 
Continuing the conversation here from the other thread post ...

I still can't say for sure. Still trying to determine what exactly is the cause... the PID it references in the error disappears quite quick, and have not been able to catch what its referring to. I did start incorporating more "timeout" wrappers around the nvram/wl commands in response to this, but I've had instances where this error still caused a lock. So either it doesn't care about "timeout" either, or it's hanging for some other reason.

I rebooted my router after many weeks of uptime, and the error has gone away for now... but it will be back at some point. Since it's such an usual error, I thought I'd report it through the chain here...

OK, understood. So you were still in the exploratory phase looking into where the errors might be coming from.

Have you tried running the "CheckStuckProcCmds.sh" script (link below) as a cron job every 5 or 10 minutes for diagnostics purposes on your router? If the errors are due to some rogue "nvram" or "wl" command hangs, the script might catch them and capture some useful information. If not, at least you can eliminate them as a possible source of the errors.


Note that other system commands that depend on nvram or wl calls can also get stuck, and the script is able to capture those cases as well. That's how I found out that some "cru" commands got stuck due to nvram calls being made.

My 2 cents.
 
I seem to get these kinds of errors quite frequently while running scripts on my GT-AX6000. When these pop up, it basically stops scripts in their tracks, and locks them until I break out of them. It's random, sometimes happening within a few minutes... other times, many many hours before I see them.

View attachment 54386

Like with my ol' RT-AC86U and it's strange script hang-ups, I guess the GT-AX6000 isn't immune. After a reboot, they stay away for a while, but eventually come back. Has anyone encountered these, and/or know of a way to fix this error? I'm sure it's a bug of sorts... and luckily the timeout command seems to be able to get around them... but it would be nice not to have to use these workarounds. ;)
Yea, @Martinski is right, you are definitely going to need more than just the one line output to determine the cause of this issue. You may have to run the check script a while just to find what you need. Good luck. I definitely want to hear more about this.
 
Have you tried running the "CheckStuckProcCmds.sh" script (link below) as a cron job every 5 or 10 minutes for diagnostics purposes on your router? If the errors are due to some rogue "nvram" or "wl" command hangs, the script might catch them and capture some useful information. If not, at least you can eliminate them as a possible source of the errors.

Note that other system commands that depend on nvram or wl calls can also get stuck, and the script is able to capture those cases as well. That's how I found out that some "cru" commands got stuck due to nvram calls being made.
Yea, @Martinski is right, you are definitely going to need more than just the one line output to determine the cause of this issue. You may have to run the check script a while just to find what you need. Good luck. I definitely want to hear more about this.
I do still have @Martinski's very helpful checkstuckproccmds.sh script in my back pocket, but didn't think I'd ever need it again after moving away from the troublesome AC86U. ;) I will definitely break this out next time I see this stop error. I just rebooted my router a few weeks ago after getting tired dealing with this... and as usual, it hasn't had a single issue... I'm sure we'll be seeing them again though after all the pounding my router takes. ;) Will definitely post an update once issues return with some results.
 
OK... These events are extremely rare... but I managed to catch something regarding this error with @Martinski's great tool... but I'm not able to make heads or tails out of what caused it. Would love your advice...

Initially, this error showed up right in the middle of doing a routine VPN reset (done via script):
Code:
_wlcsm_create_nl_socket:268: pid:5940 binding netlink socket error!!!

I immediately started a new SSH window, and ran @Martinski's tool, got this output:
Code:
ViktorJp@GT-AX6000-3C88:/jffs/scripts# sh CheckStuckProcCmds.sh
FOUND_12372: [4]
START_12372: [CheckStuckProcCmds.sh]
FOUND_12372: [4]
 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state
 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 5920 13444 ViktorJp S     3508  0.3   1  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/file/update-bl.div reset
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
13444  8962 ViktorJp S     3508  0.3   0  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
 8962  7593 ViktorJp S     3328  0.3   0  0.0 -sh
 7593  2915 ViktorJp S     3704  0.3   2  0.0 dropbear -p 192.168.50.1:22 -j -k
 2915     1 ViktorJp S     3576  0.3   3  0.0 dropbear -p 192.168.50.1:22 -j -k
FOUND_12372: [0][ 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state]
FOUND_12372: [4]
EXIT_12372: OK.

If I'm reading this right... it may have been hung on this?
Code:
FOUND_12372: [0][ 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state]

After the tool ran, the netlink socket error was "killed", and the script resumed!
Code:
_wlcsm_create_nl_socket:268: pid:5940 binding netlink socket error!!!
Killed
 
...
Initially, this error showed up right in the middle of doing a routine VPN reset (done via script):
...

If I'm reading this right... it may have been hung on this?
Code:
FOUND_12372: [0][ 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state]

After the tool ran, the netlink socket error was "killed", and the script resumed!
Code:
_wlcsm_create_nl_socket:268: pid:5940 binding netlink socket error!!!
Killed
Based on the output you posted, you had 4 distinct "nvram" cmd processes stuck at that point. Here is the breakdown:

1) "nvram get vpn_client4_state" call during execution of "vpnmon-r3.sh -reset" script.
Code:
 5930   5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state
 5920  13444 ViktorJp S     3508  0.3   1  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 13444  8962 ViktorJp S     3508  0.3   0  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 8962   7593 ViktorJp S     3328  0.3   0  0.0 -sh
 7593   2915 ViktorJp S     3704  0.3   2  0.0 dropbear -p 192.168.50.1:22 -j -k
 2915      1 ViktorJp S     3576  0.3   3  0.0 dropbear -p 192.168.50.1:22 -j -k

2) "nvram get http_username" call while trying to add a cron job (Diversion_UpdateBL) related to Diversion.
Code:
 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
 5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/file/update-bl.div reset

3) "nvram get http_username" call while trying to delete a cron job (Diversion_LocalBackup) related to Diversion.
Code:
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup

4) "nvram get http_enable" call during execution of "unbound_manager.sh vpn=1" script.
Code:
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start

These stuck commands seem to indicate that the problem found in some older AC-class routers (e.g. RT-AC86U) is not yet completely fixed in some newer AX-class routers. It may take much longer to show up, perhaps based on the frequency of nvram calls being made in a very short time, or it could be something else entirely, but the problem is still there, and your current setup and number of add-on scripts running concurrently probably increases the chances of that happening.

There should be log & trace files created by the "CheckStuckProcCmds.sh" script that have time stamps and may show further details.

The log file (CheckStuckProcCmds.LOG) is located in the "/opt/var/log" directory if you have Entware installed; otherwise, check in the "/tmp/var/tmp" directory.

The trace file may be located in the "/opt/var/log/Trace" or the "/tmp/var/tmp" directory. The trace file name has the following pattern:

StuckProcCmds_{INDEX}_{PID}.TRC.txt
e.g. "StuckProcCmds_00001_18729.TRC.txt"

I'd recommend running the "CheckStuckProcCmds.sh" script manually a couple of more times to make sure that all 4 stuck commands have been killed and are no longer there.

HTH.
 
These stuck commands seem to indicate that the problem found in some older AC-class routers (e.g. RT-AC86U) is not yet completely fixed in some newer AX-class routers. It may take much longer to show up, perhaps based on the frequency of nvram calls being made in a very short time, or it could be something else entirely, but the problem is still there, and your current setup and number of add-on scripts running concurrently probably increases the chances of that happening.
I think you're right... doesn't look like they've ironed out the kinks yet.
There should be log & trace files created by the "CheckStuckProcCmds.sh" script that have time stamps and may show further details.

The log file (CheckStuckProcCmds.LOG) is located in the "/opt/var/log" directory if you have Entware installed; otherwise, check in the "/tmp/var/tmp" directory.
This definitely has some good info:
Code:
2023-Dec-17 12:14:00 PM FOUND: [0] **=OK=**

2024-Jan-21 08:38:11 AM START_12372: [CheckStuckProcCmds.sh]
2024-Jan-21 08:38:11 AM FOUND_12372: [4]
 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state
 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 5920 13444 ViktorJp S     3508  0.3   1  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/file/update-bl.div reset
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
13444  8962 ViktorJp S     3508  0.3   0  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
 8962  7593 ViktorJp S     3328  0.3   0  0.0 -sh
 7593  2915 ViktorJp S     3704  0.3   2  0.0 dropbear -p 192.168.50.1:22 -j -k
 2915     1 ViktorJp S     3576  0.3   3  0.0 dropbear -p 192.168.50.1:22 -j -k
2024-Jan-21 08:38:13 AM FOUND_12372: [0][ 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state]
2024-Jan-21 08:38:13 AM FOUND_12372: [4]
2024-Jan-21 08:38:13 AM EXIT_12372: OK.

2024-Jan-21 08:38:26 AM START_13581: [/jffs/scripts/CheckStuckProcCmds.sh]
2024-Jan-21 08:38:26 AM ARGs_13581: [-setdelay]
2024-Jan-21 08:38:27 AM FOUND_13581: [4]
 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state
 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 5920 13444 ViktorJp S     3508  0.3   1  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/file/update-bl.div reset
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
13444  8962 ViktorJp S     3508  0.3   0  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
 8962  7593 ViktorJp S     3328  0.3   0  0.0 -sh
 7593  2915 ViktorJp S     3704  0.3   2  0.0 dropbear -p 192.168.50.1:22 -j -k
 2915     1 ViktorJp S     3576  0.3   3  0.0 dropbear -p 192.168.50.1:22 -j -k
2024-Jan-21 08:38:28 AM FOUND_13581: [1][ 5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state]
2024-Jan-21 08:38:28 AM PID_13581: [5930][5920], [16 >= 5] secs.
2024-Jan-21 08:38:28 AM CMD_13581: [kill -9 5930][0]
2024-Jan-21 08:38:31 AM FOUND_13581: [3]
2024-Jan-21 08:38:31 AM FOUND_13581: [3]
2024-Jan-21 08:38:31 AM EXIT_13581: OK.

2024-Jan-21 08:38:44 AM START_16741: [/jffs/scripts/CheckStuckProcCmds.sh]
2024-Jan-21 08:38:44 AM ARGs_16741: [-setdelay]
2024-Jan-21 08:38:44 AM FOUND_16741: [3]
 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/file/update-bl.div reset
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
2024-Jan-21 08:38:45 AM FOUND_16741: [1][ 5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username]
2024-Jan-21 08:38:45 AM PID_16741: [5925][5924], [33 >= 5] secs.
2024-Jan-21 08:38:45 AM CMD_16741: [kill -9 5925][0]
2024-Jan-21 08:38:47 AM FOUND_16741: [2]
2024-Jan-21 08:38:47 AM FOUND_16741: [2]
2024-Jan-21 08:38:48 AM EXIT_16741: OK.

2024-Jan-21 08:39:01 AM START_19561: [/jffs/scripts/CheckStuckProcCmds.sh]
2024-Jan-21 08:39:01 AM ARGs_19561: [-setdelay]
2024-Jan-21 08:39:01 AM FOUND_19561: [2]
 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
2024-Jan-21 08:39:02 AM FOUND_19561: [1][ 2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username]
2024-Jan-21 08:39:02 AM PID_19561: [2691][2679], [50 >= 5] secs.
2024-Jan-21 08:39:02 AM CMD_19561: [kill -9 2691][0]
2024-Jan-21 08:39:04 AM FOUND_19561: [1]
2024-Jan-21 08:39:04 AM FOUND_19561: [1]
2024-Jan-21 08:39:04 AM EXIT_19561: OK.

2024-Jan-21 08:39:17 AM START_22254: [/jffs/scripts/CheckStuckProcCmds.sh]
2024-Jan-21 08:39:17 AM ARGs_22254: [-setdelay]
2024-Jan-21 08:39:17 AM FOUND_22254: [1]
 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
 2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
 2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
2024-Jan-21 08:39:18 AM FOUND_22254: [1][ 2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable]
2024-Jan-21 08:39:18 AM PID_22254: [2689][2582], [67 >= 5] secs.
2024-Jan-21 08:39:18 AM CMD_22254: [kill -9 2689][0]
2024-Jan-21 08:39:20 AM FOUND: [0]
2024-Jan-21 08:39:20 AM FOUND: [0]
2024-Jan-21 08:39:21 AM EXIT_22254: OK.

2024-Jan-21 08:54:46 AM FOUND: [0] **=OK=**

The trace file may be located in the "/opt/var/log/Trace" or the "/tmp/var/tmp" directory. The trace file name has the following pattern:

StuckProcCmds_{INDEX}_{PID}.TRC.txt
e.g. "StuckProcCmds_00001_18729.TRC.txt"

Here's the trace:
Code:
2024-01-21 08:39:18  2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable [KILLED]
2024-01-21 08:39:01  2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username [KILLED]
2024-01-21 08:38:44  5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username [KILLED]
2024-01-21 08:38:27  5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state [KILLED]
2024-01-21 08:38:11  5930  5920 ViktorJp S     4008  0.3   0  0.0 nvram get vpn_client4_state
2024-01-21 08:38:11  5925  5924 ViktorJp S     4008  0.3   2  0.0 nvram get http_username
2024-01-21 08:38:11  2691  2679 ViktorJp S     4008  0.3   0  0.0 nvram get http_username
2024-01-21 08:38:11  2689  2582 ViktorJp S     4008  0.3   1  0.0 nvram get http_enable
2024-01-21 08:38:11  5920 13444 ViktorJp S     3508  0.3   1  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
2024-01-21 08:38:11  5924     1 ViktorJp S     3328  0.3   0  0.0 {cru} /bin/sh /usr/sbin/cru a Diversion_UpdateBL 00 2 * * Fri /bin/sh /opt/share/diversion/>
2024-01-21 08:38:11  2679     1 ViktorJp S     3328  0.3   2  0.0 {cru} /bin/sh /usr/sbin/cru d Diversion_LocalBackup
2024-01-21 08:38:11  2582  2348 ViktorJp S     3872  0.3   1  0.0 {unbound_manager} /bin/sh /jffs/addons/unbound/unbound_manager.sh vpn=1
2024-01-21 08:38:11 13444  8962 ViktorJp S     3508  0.3   0  0.0 sh /jffs/scripts/vpnmon-r3.sh -reset
2024-01-21 08:38:11  2348     1 ViktorJp S     3328  0.3   1  0.0 {unbound_DNS_via} /bin/sh /jffs/addons/unbound/unbound_DNS_via_OVPN.sh 1 start
2024-01-21 08:38:11  8962  7593 ViktorJp S     3328  0.3   0  0.0 -sh
2024-01-21 08:38:11  7593  2915 ViktorJp S     3704  0.3   2  0.0 dropbear -p 192.168.50.1:22 -j -k
2024-01-21 08:38:11  2915     1 ViktorJp S     3576  0.3   3  0.0 dropbear -p 192.168.50.1:22 -j -k

I'd recommend running the "CheckStuckProcCmds.sh" script manually a couple of more times to make sure that all 4 stuck commands have been killed and are no longer there.
I ran it a second time immediately after it killed those other 4 stuck processes, and it had found zero.

Thanks for your help and insight with all this! :)
 
You can delay the inevitable hangs by increasing the range of PIDs used by the OS. See @ColinTaylor post from before:
I’ll be very sad if this nonsense starts happening on my AX88U PRO. Thought those days were behind me.
 
You can delay the inevitable hangs by increasing the range of PIDs used by the OS. See @ColinTaylor post from before:
I’ll be very sad if this nonsense starts happening on my AX88U PRO. Thought those days were behind me.
Well,
1706067704085.png

Sorry @dave14305... I'm not seeing these for a good few 3-4 weeks after a reboot... but then they start up again. Maybe I'm just hammering this thing to hard... unreasonable expectations to excessively use nvram get. I dunno.
 
I’ll be very sad if this nonsense starts happening on my AX88U PRO. Thought those days were behind me.
I had to remind myself what we were doing in that thread from two years ago! He's a little script that you can run to see if you're susceptible to that problem. Modify the nvram get command to what you think is an appropriate test. This assumes that Entware's strace is installed.
Code:
#!/bin/sh

# reset pid_max on exit
trap "echo 32768 > /proc/sys/kernel/pid_max; exit" SIGINT

# Increase probability of problem. RESERVED_PIDS=300
echo 1000 > /proc/sys/kernel/pid_max

while true
do
   strace -r -e trace=bind nvram get wan0_enable 2>&1 | grep nl_pid
done
Look for EADDRINUSE lines. If it's working properly you should see the lines following it incrementing the nl_pid number until it's successful:
Rich (BB code):
     0.000044 bind(4, {sa_family=AF_NETLINK, nl_pid=1512, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000067 bind(4, {sa_family=AF_NETLINK, nl_pid=1513, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000064 bind(4, {sa_family=AF_NETLINK, nl_pid=1514, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000056 bind(4, {sa_family=AF_NETLINK, nl_pid=1515, nl_groups=00000000}, 12) = 0
 
Last edited:
I had to remind myself what we were doing in that thread from two years ago! He's a little script that you can run to see if you're susceptible to that problem. Modify the nvram get command to what you think is an appropriate test. This assumes that Entware's strace is installed.
Code:
#!/bin/sh

# reset pid_max on exit
trap "echo 32768 > /proc/sys/kernel/pid_max; exit" SIGINT

# Increase probability of problem. RESERVED_PIDS=300
echo 1000 > /proc/sys/kernel/pid_max

while true
do
   strace -r -e trace=bind nvram get wan0_enable 2>&1 | grep nl_pid
done
Look for EADDRINUSE lines. If it's working properly you should see the lines following it incrementing the nl_pid number until it's successful:
Rich (BB code):
     0.000044 bind(4, {sa_family=AF_NETLINK, nl_pid=1512, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000067 bind(4, {sa_family=AF_NETLINK, nl_pid=1513, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000064 bind(4, {sa_family=AF_NETLINK, nl_pid=1514, nl_groups=00000000}, 12) = -1 EADDRINUSE (Address already in use)
     0.000056 bind(4, {sa_family=AF_NETLINK, nl_pid=1515, nl_groups=00000000}, 12) = 0

@ColinTaylor... how do you manage to come up (or find) scripts like this! My gosh... will give this a go over the weekend and see if I can get my GT-AX6000 to act up again.
 
*Cough* Oh sorry did I revive a thread?

Just ran across this issue while doing some testing in dev of MerlinAU on my GT-AXE11000. Bad luck for the GT models eh?
@Viktor Jaep did you ever find out the root cause for you? It seems random and spotty and hard to duplicate for me.

To be honest I'm not sure what caused it to start on my side, I only have a gut feeling.
I've setup @Martinski's script on a 5 minute cron hoping to catch more detail if it ever happens again.
 
*Cough* Oh sorry did I revive a thread?

Just ran across this issue while doing some testing in dev of MerlinAU on my GT-AXE11000. Bad luck for the GT models eh?
@Viktor Jaep did you ever find out the root cause for you? It seems random and spotty and hard to duplicate for me.

To be honest I'm not sure what caused it to start on my side, I only have a gut feeling.
I've setup @Martinski's script on a 5 minute cron hoping to catch more detail if it ever happens again.
Seems to be the case... luckily I don't see them for weeks after a reboot, but the fact that there's still some issue affecting our routers in this way irks me, even on these newer models. :( @dave14305 & @ColinTaylor do have a way to delay the issue but it will inevitably happen (above), and @Martinski's script will continue to kill those stuck processes and let the system continue... both good bandaid fixes for a serious system stability issue.
 
Last edited:
So what what was the result of your test?
Sorry I forgot to follow-up on that... I had let it run for about 5mins, and didn't see "Address already in use" messages... I figured I would run this again after I started seeing these nl_socket_binding errors again, and see if that would show some different results.
 
Sorry I forgot to follow-up on that... I had let it run for about 5mins, and didn't see "Address already in use" messages... I figured I would run this again after I started seeing these nl_socket_binding errors again, and see if that would show some different results.
That's very surprising. I would have expected you to have seen those messages within a few seconds. Maybe a flaw in my methodology. Can you confirm that all you were seeing was a constant stream of these messages:
Code:
     0.003345 bind(3, {sa_family=AF_NETLINK, nl_pid=701, nl_groups=00000000}, 12) = 0
     0.000294 bind(4, {sa_family=AF_NETLINK, nl_pid=1703, nl_groups=00000000}, 12) = 0
Also, if you don't mind can you post the output of ps, I might need to adjust the scanning range for your model of router.

P.S. You do have strace installed which I said was a prerequisite?
 
That's very surprising. I would have expected you to have seen those messages within a few seconds. Maybe a flaw in my methodology. Can you confirm that all you were seeing was a constant stream of these messages:
Code:
     0.003345 bind(3, {sa_family=AF_NETLINK, nl_pid=701, nl_groups=00000000}, 12) = 0
     0.000294 bind(4, {sa_family=AF_NETLINK, nl_pid=1703, nl_groups=00000000}, 12) = 0
Yep, that's correct... here's a snapshot of some output:
Code:
     0.004689 bind(3, {sa_family=AF_NETLINK, nl_pid=985, nl_groups=00000000}, 12) = 0
     0.003223 bind(3, {sa_family=AF_NETLINK, nl_pid=997, nl_groups=00000000}, 12) = 0
     0.003326 bind(3, {sa_family=AF_NETLINK, nl_pid=302, nl_groups=00000000}, 12) = 0

Also, if you don't mind can you post the output of ps, I might need to adjust the scanning range for your model of router.
Here's the output of ps:
Code:
ViktorJp@GT-AX6000-3C88:/proc/sys/kernel# ps
  PID USER       VSZ STAT COMMAND
    1 ViktorJp 14956 S    /sbin/init
    2 ViktorJp     0 SW   [kthreadd]
    3 ViktorJp     0 IW<  [rcu_gp]
    4 ViktorJp     0 IW<  [rcu_par_gp]
    6 ViktorJp     0 IW<  [kworker/0:0H-kb]
    8 ViktorJp     0 IW<  [mm_percpu_wq]
    9 ViktorJp     0 SW   [ksoftirqd/0]
   10 ViktorJp     0 IW   [rcu_preempt]
   11 ViktorJp     0 IW   [rcu_sched]
   12 ViktorJp     0 IW   [rcu_bh]
   13 ViktorJp     0 SW   [migration/0]
   15 ViktorJp     0 SW   [cpuhp/0]
   16 ViktorJp     0 SW   [cpuhp/1]
   17 ViktorJp     0 SW   [migration/1]
   18 ViktorJp     0 SW   [ksoftirqd/1]
   20 ViktorJp     0 IW<  [kworker/1:0H-kb]
   21 ViktorJp     0 SW   [cpuhp/2]
   22 ViktorJp     0 SW   [migration/2]
   23 ViktorJp     0 SW   [ksoftirqd/2]
   25 ViktorJp     0 IW<  [kworker/2:0H-kb]
   26 ViktorJp     0 SW   [cpuhp/3]
   27 ViktorJp     0 SW   [migration/3]
   28 ViktorJp     0 SW   [ksoftirqd/3]
   30 ViktorJp     0 IW<  [kworker/3:0H-kb]
   31 ViktorJp     0 SW   [kdevtmpfs]
   32 ViktorJp     0 SW   [rcu_tasks_kthre]
   34 ViktorJp     0 SW   [oom_reaper]
   35 ViktorJp     0 IW<  [writeback]
   36 ViktorJp     0 SW   [kcompactd0]
   37 ViktorJp     0 SWN  [ksmd]
   38 ViktorJp     0 IW<  [crypto]
   39 ViktorJp     0 IW<  [kblockd]
   40 ViktorJp     0 SW   [skb_free_task]
   41 ViktorJp     0 SW   [watchdogd]
   42 ViktorJp     0 IW   [kworker/2:1-eve]
   79 ViktorJp     0 SW   [kswapd0]
   82 ViktorJp     0 IW   [kworker/3:1-ubi]
  112 ViktorJp     0 SW   [hwrng]
  114 ViktorJp     0 SW   [spi1]
  115 ViktorJp     0 SW   [irq/49-mmc0]
  117 ViktorJp     0 IW<  [ipv6_addrconf]
  118 ViktorJp     0 SW   [ubi_bgt0d]
  119 ViktorJp     0 IW<  [kworker/2:1H-kb]
  120 ViktorJp     0 IW<  [kworker/3:1H-kb]
  121 ViktorJp     0 IW<  [kworker/1:1H-kb]
  130 ViktorJp     0 IW<  [kworker/0:1H-kb]
  213 ViktorJp     0 SW   [ubifs_bgt0_10]
  229 ViktorJp     0 SW   [ubifs_bgt0_11]
  259 ViktorJp     0 SW   [ubifs_bgt0_13]
  387 ViktorJp     0 SW   [bpm_monitor]
  394 ViktorJp     0 SW   [recycle_sysb]
  400 ViktorJp     0 SW   [fc_evt]
  401 ViktorJp     0 SW   [fc_timer]
  402 ViktorJp     0 SW   [bcmFlwStatsTask]
  409 ViktorJp     0 SW   [bcmsw_rx]
  410 ViktorJp     0 SW   [enet-kthrd]
  418 ViktorJp     0 SW   [irq/54-bcmflex-]
  419 ViktorJp     0 SW   [irq/55-bcmflex-]
  420 ViktorJp     0 SW   [irq/56-bcmflex-]
  421 ViktorJp     0 SW   [irq/57-bcmflex-]
  492 ViktorJp     0 SW   [kbdmf_shell]
  534 ViktorJp  1600 S    {wdtctl} wdtd
  538 ViktorJp  1740 S    hotplug2 --persistent --no-coldplug
  539 ViktorJp  3872 S    dropbear -p 192.168.50.1:22 -j -k
  590 ViktorJp  2052 S    /usr/sbin/envrams
  814 ViktorJp     0 IW<  [cfg80211]
  938 ViktorJp     0 SW   [dhd_watchdog_th]
  939 ViktorJp     0 SW   [wfd0-thrd]
  947 ViktorJp     0 SW   [dhd_watchdog_th]
  948 ViktorJp     0 SW   [wfd1-thrd]
 1066 ViktorJp  3668 S    sh /jffs/scripts/rtrmon.sh -screen
 1462 ViktorJp  3328 S    /sbin/syslogd -m 0 -S -O /jffs/syslog.log -s 256 -l 6
 1464 ViktorJp  3328 S    /sbin/klogd -c 5
 1479 ViktorJp     0 IW   [kworker/3:2-eve]
 1514 ViktorJp 13368 S    /sbin/wanduck
 1732 ViktorJp  3328 S    -sh
 2311 ViktorJp  6412 S    screen -dr 24589
 2390 ViktorJp 13368 S    ledg
 2391 ViktorJp  2396 S    /usr/sbin/haveged -r 0 -w 1024 -d 32 -i 32
 2392 ViktorJp 13368 S    ledbtn
 2399 ViktorJp 13368 S    antled
 2425 ViktorJp 14152 S    nt_monitor
 2426 ViktorJp  7152 S    protect_srv
 2427 ViktorJp 15420 S    /sbin/netool
 2500 ViktorJp  3144 S    /bin/eapd
 2516 ViktorJp 13156 S    nt_center
 2551 ViktorJp  8412 S    hostapd -B /tmp/wl0_hapd.conf
 2560 ViktorJp  5156 S    nt_actMail
 2620 ViktorJp  8412 S    hostapd -B /tmp/wl1_hapd.conf
 2632 ViktorJp  9672 S    wps_pbcd
 2633 ViktorJp 13368 S    wpsaide
 2634 ViktorJp  5116 S    /usr/sbin/wlc_nt
 2635 ViktorJp  5112 S    /usr/sbin/wlc_monitor
 2637 ViktorJp  8000 S    /usr/sbin/awsiot
 2638 ViktorJp 13368 S    dhcpc_lease old 00:18:61:46:14:49 192.168.50.115 myx_001861461448
 2650 ViktorJp  3136 S    /bin/ceventd
 2654 ViktorJp  3116 S    /usr/sbin/debug_monitor /data
 2655 ViktorJp  5396 S    /usr/sbin/wlceventd
 2683 ViktorJp  3328 S    crond -l 9
 2700 ViktorJp 12028 S    httpd -i br0
 2702 ViktorJp  6252 S    vis-dcon
 2704 ViktorJp  9328 S    vis-dcon
 2705 ViktorJp  4928 S    vis-datacollector
 2706 ViktorJp  7360 S    /usr/sbin/infosvr br0
 2708 ViktorJp 13828 S    sysstate
 2709 ViktorJp 13368 S    watchdog
 2710 ViktorJp 13368 S    check_watchdog
 2711 ViktorJp 13368 S    alt_watchdog
 2712 ViktorJp 13368 S    sw_devled
 2713 ViktorJp  4500 S    rstats
 2715 ViktorJp 13368 S    amas_lanctrl
 2763 ViktorJp 15124 S    networkmap --bootwait
 2770 ViktorJp 12736 S    mastiff
 2771 ViktorJp 13368 S    bwdpi_check
 2773 ViktorJp 13368 S    pctime
 2798 ViktorJp 18496 S    roamast
 2800 ViktorJp 20284 S    conn_diag
 2801 ViktorJp 13368 S    amas_ssd_cd
 2815 ViktorJp  3348 S    lldpd -L /usr/sbin/lldpcli -I eth1,eth2,eth3,eth4,eth5,eth6,eth7,wds0.*.*,wds1.*.* -s GT-AX6000
 2818 nobody    3348 S    lldpd -L /usr/sbin/lldpcli -I eth1,eth2,eth3,eth4,eth5,eth6,eth7,wds0.*.*,wds1.*.* -s GT-AX6000
 2852 ViktorJp 18496 S    amas_portstatus
 2857 ViktorJp 14560 S    cfg_server
 2876 ViktorJp 20896 S    amas_lib
 2877 ViktorJp 13368 S    sched_daemon
 2886 ViktorJp  3592 S    dropbear -p 192.168.50.1:22 -j -k
 2889 ViktorJp  5020 S    fsmd
 2964 ViktorJp  3328 S    sh /jffs/scripts/wxmon.sh -screen
 3013 ViktorJp     0 IW<  [uas]
 3014 ViktorJp     0 SW   [scsi_eh_0]
 3015 ViktorJp     0 IW<  [scsi_tmf_0]
 3052 ViktorJp 13368 S    usbled
 3053 ViktorJp  8164 S    usbmuxd
 3308 ViktorJp  3680 S    /usr/sbin/ntp -t -S /sbin/ntpd_synced -p pool.ntp.org -p time.nist.gov -l -I br0
 3353 ViktorJp  2000 S    /bin/mcpd
 3477 ViktorJp     0 SW   [jbd2/sda1-8]
 3478 ViktorJp     0 IW<  [ext4-rsv-conver]
 3675 ViktorJp     0 IW   [kworker/2:0-rcu]
 3797 ViktorJp  3876 S    /usr/sbin/acsd2
 4449 ViktorJp 13368 S    disk_monitor
 4584 ViktorJp  6412 S    screen -dr 26165
 4846 ViktorJp  3328 S    /sbin/udhcpc -i eth0 -p /var/run/udhcpc0.pid -s /tmp/udhcpc_wan -A5 -O33 -O249
 4848 nobody    3256 S    avahi-daemon: running [GT-AX6000-3C88.local]
 4859 ViktorJp  9984 S    /usr/sbin/nmbd -D -s /etc/smb.conf
 4860 ViktorJp  9704 S    /usr/sbin/nmbd -D -s /etc/smb.conf
 4861 ViktorJp 10264 S    /usr/sbin/smbd -D -s /etc/smb.conf
 4863 ViktorJp  1884 S    /usr/sbin/wsdd2 -d -w -i br0 -b sku:GT-AX6000,serial:04421a593c88
10202 ViktorJp 12028 S    asd
10939 ViktorJp 13368 S    pc_block
15109 ViktorJp  3720 S    dropbear -p 192.168.50.1:22 -j -k
15130 ViktorJp  3328 S    -sh
17886 ViktorJp     0 IW   [kworker/1:2-eve]
19223 ViktorJp  3720 S    dropbear -p 192.168.50.1:22 -j -k
19256 ViktorJp  3328 S    -sh
22058 ViktorJp  3328 S    {MonitorCronJobL} /bin/sh ./MonitorCronJobList.sh start 2
23255 ViktorJp     0 IW   [kworker/u8:1]
23370 ViktorJp  3816 S    dropbear -p 192.168.50.1:22 -j -k
23371 ViktorJp  3328 S    -sh
23557 ViktorJp     0 IW   [kworker/u8:2-ev]
24234 ViktorJp     0 IW<  [cifsiod]
24235 ViktorJp     0 IW<  [cifsoplockd]
24589 ViktorJp  6544 S    {screen} SCREEN -dmS rtrmon /jffs/scripts/rtrmon.sh -monitor
24591 ViktorJp  3668 S    {rtrmon.sh} /bin/sh /jffs/scripts/rtrmon.sh -monitor
25132 ViktorJp  3720 S    dropbear -p 192.168.50.1:22 -j -k
25162 ViktorJp  3328 S    -sh
25601 ViktorJp     0 IW   [kworker/0:0-eve]
25619 ViktorJp  3644 S    sh /jffs/scripts/vpnmon-r3.sh -noswitch
25759 ViktorJp     0 IW   [kworker/0:1-ipv]
26165 ViktorJp  6544 S    {screen} SCREEN -dmS wxmon /jffs/scripts/wxmon.sh
26167 ViktorJp  3328 S    {wxmon.sh} /bin/sh /jffs/scripts/wxmon.sh
26208 ViktorJp  3328 S    sh /jffs/scripts/wxmon.sh -monitor
26350 ViktorJp     0 IW   [kworker/u8:0-ev]
26586 ViktorJp  3720 S    dropbear -p 192.168.50.1:22 -j -k
26600 ViktorJp  3328 S    -sh
26913 ViktorJp  3328 S    sh /jffs/scripts/killmon.sh
26917 ViktorJp  3328 S    sh /jffs/scripts/killmon.sh -monitor
27092 ViktorJp     0 IW   [kworker/1:0-eve]
27296 ViktorJp  3720 S    dropbear -p 192.168.50.1:22 -j -k
27332 ViktorJp  3328 S    -sh
28080 ViktorJp  3852 S    sh /jffs/scripts/backupmon.sh -setup
29665 ViktorJp  3644 S    sh /jffs/scripts/vpnmon-r3.sh -noswitch
29666 ViktorJp  5088 S    timeout 60 nc -w3 8.8.8.8 443
29667 ViktorJp  3328 S    nc -w3 8.8.8.8 443
29782 ViktorJp  3196 S    sleep 2
29783 ViktorJp  3328 R    ps
30629 ViktorJp  6860 S    /etc/openvpn/vpnclient1 --cd /etc/openvpn/client1 --config config.ovpn
31687 nobody   70756 S    unbound -c /opt/var/lib/unbound/unbound.conf
31727 ViktorJp  5568 S    /usr/sbin/stubby -g -C /etc/stubby/stubby.yml
31730 nobody   11468 S    dnsmasq --log-async
31731 ViktorJp 10696 S    dnsmasq --log-async

P.S. You do have strace installed which I said was a prerequisite?
Absolutely! :)
 
Thanks for the output @Viktor Jaep. Could you try changing this line in the script as follows:
Code:
echo 2000 > /proc/sys/kernel/pid_max
Hopefully that'll change the scanning range to something that matches your router better.
 
Thanks for the output @Viktor Jaep. Could you try changing this line in the script as follows:
Code:
echo 2000 > /proc/sys/kernel/pid_max
Hopefully that'll change the scanning range to something that matches your router better.
You're welcome! Unfortunately, no change on its behavior after this...
 
You're welcome! Unfortunately, no change on its behavior after this...
OK thanks anyway. Sounds like things work differently on your router. Just as a final sanity check can you post the output of these commands please.
Code:
echo 2000 > /proc/sys/kernel/pid_max
strace -r -e trace=bind nvram get http_username 2>&1 | grep nl_pid
echo 32768 > /proc/sys/kernel/pid_max
 

Similar threads

Latest threads

Sign Up For SNBForums Daily Digest

Get an update of what's new every day delivered to your mailbox. Sign up here!
Top