RT-AX3000 crashing every few days in bcmsw_rx (Tried hw/sw reset/rebuilds and even new replacement router)

seanwo

Occasional Visitor
--------------------
Issue Description
--------------------

I am currently running an RT-AX3000 on 386.3_2 (the same issue occurred with several previous builds).
Every couple of days (sometimes daily) the router crashes (usually in the bcmsw_rx process) and reboots.
I have tried several factory resets (hw & software), disabling features, and even a new identical router, which I bought thinking the original was bad hw.
The crashes remain.

---------------------
Crash log process
---------------------
Crashes usually show for thread <0>:

May 5 00:05:02 crashlog: <0>Internal error: Oops: 17 [#1] PREEMPT SMP ARM
May 5 00:05:02 crashlog: <0>Process bcmsw_rx (pid: 239, stack limit = 0xd778e210)

I will attach a couple of crashlogs to the thread, one from each RT-AX3000 device (old and new).
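For anyone wanting to compare crashlogs across routers, here is a minimal Python sketch that tallies which process and Oops code show up in crashlog lines. It only assumes the two line formats quoted above; the function name `summarize_crashlog` and the regexes are illustrative, not part of any Asuswrt-Merlin tooling.

```python
import re
from collections import Counter

# Patterns based only on the two crashlog lines quoted in this post, e.g.:
#   May 5 00:05:02 crashlog: <0>Internal error: Oops: 17 [#1] PREEMPT SMP ARM
#   May 5 00:05:02 crashlog: <0>Process bcmsw_rx (pid: 239, stack limit = 0xd778e210)
OOPS_RE = re.compile(r"crashlog: <0>Internal error: Oops: (\w+)")
PROCESS_RE = re.compile(r"crashlog: <0>Process (\S+) \(pid: (\d+)")


def summarize_crashlog(lines):
    """Count crashing process names and Oops codes in an iterable of log lines."""
    processes = Counter()
    oops_codes = Counter()
    for line in lines:
        match = PROCESS_RE.search(line)
        if match:
            processes[match.group(1)] += 1
        match = OOPS_RE.search(line)
        if match:
            oops_codes[match.group(1)] += 1
    return processes, oops_codes


sample = [
    "May 5 00:05:02 crashlog: <0>Internal error: Oops: 17 [#1] PREEMPT SMP ARM",
    "May 5 00:05:02 crashlog: <0>Process bcmsw_rx (pid: 239, stack limit = 0xd778e210)",
]
processes, oops_codes = summarize_crashlog(sample)
print(processes.most_common())
print(oops_codes.most_common())
```

To run it over an attached file, pass an open file object instead of `sample`, for example `summarize_crashlog(open("crashlog.new.router.txt"))`.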

-------------------------------------------
What I have tried to resolve the issue
-------------------------------------------

Manual reinstall using:
+ WPS/Power Button Factory Reset
+ Reset Button Factory Reset
+ Software Factory Reset w/ JFFS format on next boot.
+ All of the above in different sequences.
+ ALL HW/SOFTWARE RESETS WITH A SECOND NEW RT-AX3000 ROUTER!

+ Disabling VPN server
+ Disabling VPN client and policy rules

-------------------------
What I have left to try
-------------------------

- Disabling DDNS
- Disabling DNS over TLS
- Disabling secondary time server
- Removing all port forwarding
- Turning off all my connected APs (in case it is some sort of router to AP interaction)

-----------------------------------
Changes from factory defaults
-----------------------------------
Here are the changes I make after I hw reset the latest Merlin builds:

wireless:general
2.4ghz
network name: [my 2.4ghz network name]
wpa pre-shared key: [wpa shared key]
channel bandwidth: 20mhz
control channel: 2
5ghz
network name: [my 5ghz network name]
wpa pre-shared key: [wpa shared key (same as 2.4ghz)]
channel bandwidth: 40mhz
control channel: 52

administration:system
router login name: [admin username]
new password: [new password]
retype password: [new password]
enable jffs custom scripts and configs: yes
secondary ntp server: time.nist.gov
enable ssh: lan only
allow password login: no
authorized keys: [public key]
authentication method: both
https lan port: 1443

administration:firmware upgrade
Schedule check for new firmware availability: no

lan:lan ip
ip address: 192.168.0.1

wan:internet connection
dns privacy protocol: DoT
dns-over-tls profile: strict
address: 1.1.1.1, 853, cloudflare-dns.com
address: 1.0.0.1, 853, cloudflare-dns.com

wan:ddns
enable ddns client: yes
server: custom
hostname: [my afraid.org subdomain]
forced update interval: 10
https/ssl certificate: import your own

wan:virtual server port forwarding
enable port forwarding: on
service: [machine service name for ssh], [external port other than 22], 22, 192.168.0.100, tcp
service: [machine service name for plex], [external port other than 32400], 32400, 192.168.0.100, tcp

guest network:
[network name I use to pin devices to this router via wifi], wpa2-personal, [key], unlimited access, enable

network tools:wake on lan
server: [macaddress2]
server: [macaddress2]
server: [macaddress2]

vpn:vpn server
enable openvpn server: on
keys and certificates: [cut/paste or scp to /jffs/openvpn & cycle vpn server]

lan:dhcp server
[manual assignments for around 15 macaddresses]

[jffs/scripts]
wan-start script (so I know the router rebooted)
ddns-start (for custom afraid.org DDNS registration)

[jffs/openvpn]
vpn client and server keys, ca, crt, etc.

-------
Help :)
-------

Any suggestions, or debugging insight from the crashlogs, that could point me in the right direction would be appreciated.
I noticed one other thread regarding these types of crashes but the suggestion is always hw reset. I have found that the crashes always return regardless of resetting and rebuilding from scratch.
I have spent almost a month trying to track this down with no luck. Not sure what to do next since even new hardware reproduces the issue.
 

Attachments

  • crashlog.new.router.txt
    69.5 KB · Views: 163
  • crashlog.old.router.txt
    67 KB · Views: 152
Progress and another shout out for advice/help...

On firmware versions that have the crashing bug, the router usually crashes within 2-4 days.

Over the past weeks I have narrowed down the version where the crashing bug was introduced:

The bug was introduced in build 386.2_6.

Here is the version testing I did. If the version did not crash for 7 days it was considered stable.

+ 384.19 (14-Aug-2020)
? 386.1 (30-Jan-2021)
+ 386.1_2 (12-Feb-2021)
? 386.2 (2-Apr-2021)
+ 386.2_2 (13-Apr-2021)
+ 386.2_4 (30-Apr-2021)
- 386.2_6 (6-June-2021) <-- problem starts with this version of the firmware
- 386.3 (23-July-2021)
- 386.3_2 (6-Aug-2021)

+ stable for 7 days
- crashed prior to 7 days
? untested
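The narrowing above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool I used: the function name `regression_window` is made up, and it assumes the builds are listed in release order and that once a build is bad, later builds stay bad.

```python
# Results copied from the list above:
# "+" stable for 7 days, "-" crashed prior to 7 days, "?" untested.
RESULTS = [
    ("384.19", "+"),
    ("386.1", "?"),
    ("386.1_2", "+"),
    ("386.2", "?"),
    ("386.2_2", "+"),
    ("386.2_4", "+"),
    ("386.2_6", "-"),
    ("386.3", "-"),
    ("386.3_2", "-"),
]


def regression_window(results):
    """Return (last_good, first_bad) from release-ordered test results.

    Untested ("?") builds are skipped. Either element is None if no such
    build was found before the scan stopped.
    """
    last_good = None
    first_bad = None
    for version, status in results:
        if status == "-":
            first_bad = version
            break  # assumption: every later build is also bad
        if status == "+":
            last_good = version
    return last_good, first_bad


last_good, first_bad = regression_window(RESULTS)
print(f"last good: {last_good}, first bad: {first_bad}")
```

Since no untested build sits between 386.2_4 and 386.2_6, the window cannot be narrowed further by testing more releases.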

This actually makes some sense now!

Here are the changes between 386.2_4 and 386.2_6:

The most likely candidate is the update of the Broadcom binary drivers:

HND502AX-675x: Fragattack patches to wireless drivers (RT-AX56U, RT-AX58U)

Since the crash is always in bcmsw_rx (Broadcom drivers right?):

May 5 00:05:02 crashlog: <0>Internal error: Oops: 817 [#1] PREEMPT SMP ARM
May 5 00:05:02 crashlog: <0>Process bcmsw_rx (pid: 239, stack limit = 0xd77ac210)

Correct me if I am wrong... but all that remains is to figure out what feature or client is triggering the crash.

At first, I thought it might be a feature but given how minimal the setup was that I used on all the firmware version testing above, I think it is probably a client that triggers the crash.

For reference, here are the instructions I followed for all the firmware testing above, which identified the first firmware version this problem showed up in:

[flash firmware]
[hardware wps reset]
[admin ui factory restore]
[format jffs]

lan:lan ip
ip address: 192.168.0.1

administration:system
router login name: [admin username]
new password: [new password]
retype password: [new password]
enable jffs custom scripts and configs: yes
enable ssh: lan only
allow password login: no
authorized keys: [public key]
authentication method: HTTP
http lan port: 80

wan:internet connection
dns server1: 8.8.8.8
dns server2: 8.8.4.4

lan:dhcp server
[add manual assignments]

wireless:general
(2.4ghz)
network name: [network name]
wpa pre-shared key: [key]
channel bandwidth: 20mhz
control channel: 2
(5ghz)
network name: [network name]
wpa pre-shared key: [key]
channel bandwidth: 40mhz
control channel: 52

guest network:
[network name], wpa2-personal, [key], unlimited access, enable

ssh -i ~/.ssh/id_rsa_router admin@192.168.0.1
create: jffs/scripts/wan-start

The wan-start script was set up to send me an email so that I could track whether the router crashed and rebooted.

Outside of any other advice, I now need to put it on the latest build and start removing devices for a week at a time until I can identify the client that triggers the crash. Ugg.

But let's say I figure out what client causes this to happen. Should any client really be able to crash the router?

Is there a hope of getting more stable Broadcom drivers? I am unclear how those are acquired and integrated. My guess is that this may also only be the Broadcom drivers for the RT-AX3000/RT-AX58U.

I am open to suggestions on how to proceed to get this corrected.
 
Hi !

I can't help much with your logs, but I just want to say that I have had a couple of AX3000s since May 2020 (not attached, they are in 2 separate homes). I keep them up to date with the latest stable Merlin and I've never experienced any reboot or instability. I have about 15 to 30 devices connected to each of them: PCs, iPhones, Android phones, smart TV, printer, and other IoT devices.
One runs with Dual-Band smart connect (2.4 and 5) + guest, while the other one runs separate 2.4 and 5 networks + 2 guest networks also 2.4 and 5. Uptime is generally as long as it takes to get to the next stable release, unless there is a power outage.

If this is a Broadcom driver failure, it must be linked to a very specific client or configuration, or could it be a hardware failure ?

Did you check temps and load ? For example do they increase until crash ?

I live in a rather hot area, especially in summer, but my temps are very stable, as well as my cpu/memory load:
  • 2.4 and 5 Ghz around 53-54°
  • CPU around 68-70° in summer when
  • Memory also stable about 70-75% used
  • CPU1 rarely over 50%, CPU2 and 3 rarely over 10% (but I don't spend all day watching them...)
I tried to compare your config and mine. Here are the main differences on mine :
  • 2.4 Ghz : Channel bandwidth: 20/40mhz / Control channel : Auto. Current control channel : 8
  • 5 Ghz : Channel bandwidth: 20/40/80mhz / Control channel : Auto. Current control channel : 36
Also it might not be related but :
  • I don't use DDNS and port forwarding, no access from outside
  • I don't use openVPN at all
  • I don't use wake on lan.
I'm not sure this helps, but hopefully you'll get a clue.
 
Thanks for replying. Temps are fine. CPU load is fine. The router is located in the same position whether it is running a buggy firmware or a stable firmware, so I can rule the environment out. In my minimal setup, I dropped the VPN, DDNS, and other features I was originally using until I get this sorted out. I have about 70 clients, of which 50 are wifi. I have tried two identical routers, both new units, so I have ruled out hardware defects.

As you pointed out, the last features left to try are channel bandwidth and control channel. Beyond that it has to be a client. I have to set the control channel to keep it away from channels used by nearby homes and to keep it distinct from my other APs (not used during the tests).

I have seen a couple of other random posts about bcmsw_rx crashes, so I know it is not just me at this point, but I refuse to live with a router that reboots every 2-4 days :) and I would like to get back to the latest builds at some point. Personally, I know another person with an RT-AX3000 that does not crash either, and we have been trying things to replicate it on his as well, which leads me to the combination of >=386.2_6 and a particular client that triggers it. Nothing relevant appears in the logs right before the crashes. The crash dumps don't seem to help.

@RMerlin any known bugs with the 386.2_6 updated Broadcom binaries mentioned above? Thanks for your reply!
 
I have had the same problems on 2 different RT-AX58U units that I no longer have, and now a brand new 2 week old RT-AX56U running 386.3_2. My latest crash I also posted today. Current crash log points to bcm_mcast_netlink_process_snoop_cfg

May 5 01:05:00 crashlog: <4>^[[0;33;41m[ERROR mcast] bcm_mcast_netlink_process_snoop_cfg,926: interface 22 could not be found^[[0m

Will roll back to older version to see if older Broadcom drivers are more stable.

I went back to 386.2_2 and immediately noticed that the IPv6 issues I had been having are gone on the older build. On 386.3_2 using IPv6 (native, prefix delegation, stateless), no IPv6 leases appear in the System Log - IPv6 page of the GUI. After dirty flashing back to 386.2_2, my devices that support IPv6 are getting IPv6 addresses on the LAN again. So there do at least appear to be some IPv6 bugs in the newer 386.3_2 build.
 
Well, you've done quite some testing already to narrow the problem down... Sorry I couldn't help more. Hopefully someone else will !
 
I am getting the same error as the original poster. I am using an RT-AX58U which I believe is the same model. Currently running 386.3_2.

Code:
May  5 15:05:02 crashlog: <0>Internal error: Oops: 17 [#1] PREEMPT SMP ARM
May  5 15:05:02 crashlog: <4>CPU: 0 PID: 239 Comm: bcmsw_rx Tainted: P           O    4.1.52 #1
 
@sweetlyham (or anyone else who has this crash)... Do you have any of the following wireless devices on your network? I am trying to identify which client might be causing the crash since it seems to point to a client that triggers the problem starting with build 386.2_6:

mac.addresses.png


Worst case scenario, you can do a clean revert to 386.2_4, which is the build just before the one where the problem was introduced.

Thanks in advance for anyone posting a client match who also has this crash on the rt-ax3000/rt-ax58u.
 
A basic design principle is that a client should not be able to cause a restart of a superior node in the network; a basic principle of defensive SW design. It should not be possible for my clients to crash my router. Most of the WIFI chips in our devices are made by just 4 companies: Intel, MediaTek, Qualcomm and Broadcom. These companies do cooperative testing and cross license patents. I would more suspect issues in the design of the Broadcom SW drivers on the router than anything else. Many users have noted that the 384 design base from Asus was really stable, and I would agree with that. The 384 base would run for weeks and weeks without issue. The 386 base is not as stable IMHO. For me personally, 80% of my traffic is on the guest network which Asus redesigned. I suspect but cannot prove that there are interactions in this function with the BCM drivers that is causing my own router restarts.
 
@seanwo I have a Pixel 5 on my network from your list of clients!
Is your Pixel 5 set to randomize its MAC address (i.e., is it switching its MAC address all the time)? Do you use the 2.4ghz or 5.0ghz network, or do you have both configured so that the phone can choose? Any use of a guest network in your case? Thanks!
 
I agree with the principles you state. In the absence of any fix for broadcom sw drivers I can only look for workarounds to stay on the current builds which have security fixes. Outside of finding a work around I will need to stick with the 384 base until progress on broadcom drivers is made to increase stability.
 
You can try enabling the "nightly reboot schedule" under the Admin/System tab.
(Mine: everyday at 4am.)

You can try the test build, here: https://onedrive.live.com/?authkey=!AGY2taGX02nVmWA&id=CCE5625ED3599CE0!1427&cid=CCE5625ED3599CE0
 
Thanks but periodic rebooting does not seem to help. I tried that previously but did not mention it; so it is not like a memory leak. As far as test builds, I am not seeing any bcmdriver updates since the latest stable build so I am not sure that would help if the issues are with the bcmdrivers: https://github.com/RMerl/asuswrt-merlin.ng/compare/386.3_2...master. Maybe I am missing an update that would apply to the issue. If so, could you point me to the commit? Thanks!
 
I have several Echos, Dots, and Fire Sticks, all from Gen 1 up.
I have a few TP-Link Kasa switches.
We have two iPhones, both set to use the phone's MAC address.
I have two LG TVs.
I do have a Texas chipset based device, but the crashes happened before that was added, so that kind of rules it out.
I have a Mac mini.
I did re-flash the alpha 4 over itself on the main router and the two AiMesh nodes (all RT-AX58Us), and the stability of the setup has improved.
In the past I did notice that the one node that uses an Ethernet backhaul, when rebooting after the upgrade, caused the main router to reboot as well. The wireless backhaul node did not.
 
My opinion is that it is the 386 code base that is causing these crashes. I have gone back to 384.18 on my RT-AX56U with a clean flash to see if I can run for 5-7 days without a crash, which I cannot do on the 386 code base (386.3_2 and 386.2_2 both experience 1-2 crashes a day, and I also get crashes on 386.4 alpha1). Some of the crashes look like pure driver issues with the Broadcom drivers, and some look like a possible interaction between Stubby and the mcast driver.
 
I would bet that 386.2_4 would be stable because that was the build before the Broadcom drivers were updated as per the comparison links higher up in the thread. Either way a 384 base should work too.
 
I suggest flashing the latest RMerlin Alpha 1 firmware instead.

Download | Asuswrt-Merlin

After flashing the latest firmware, I would then proceed to follow the suggestions below to verify if the issue is hardware or not (following the steps fully will weed out other setup issues too).

Fully Reset Router and Network

Fully Reset / Best Practice Setup / More
What commit in the alpha speaks to solving the Broadcom driver stability? Every build since 386.2_6 has been unstable for me so far. Compare: https://github.com/RMerl/asuswrt-merlin.ng/compare/386.3_2...master I am not against trying it, but I want to understand why this has been recommended twice. I have flashed so many builds at this point to find the root cause that my head is spinning :)
 
I clean flashed 386.4 alpha1 when I first got the RT-AX56U router 2 weeks ago, and it was immediately crashing after running for 1/2 a day on a brand new router that had never been configured. All of the 386 builds behave the same way on the RT-AX58U (I have had 2 of them) and the RT-AX56U (a brand new one).

The 384 code base was developed over a long period with mainly incremental improvements; that code base is very stable, and I am running a test now to see if it will run on my RT-AX56U for 5-7 days without a crash, which I cannot do on any of the 386 builds. When I went back to 384.18 on the RT-AX56U, I clean flashed and also reformatted JFFS to make it a good test.

Asus did a lot of parallel development with the 386 design base and rolled out a lot of functional changes on the new code base, plus upgrades in the GUI. It is not all stable IMHO. There are a lot of problems with the stability of the Broadcom wireless drivers and also some interactions (some appear to be with Stubby). I assume Asus has some tools to read the registers from the crash logs; all I can do is infer what might be happening by looking at the logs.

It is really easy to miss a router crash if for example you auto-reboot your router every day or don't check your logs very often. My guess is that these problems are widespread but some are not aware it is happening. See https://forums.whirlpool.net.au/archive/9z8jj629 for similar problems on another forum. @RMerlin are you seeing any crashes like this and what are your thoughts on a fix?

May 5 01:05:00 kernel: Booting Linux on physical CPU 0x0
May 5 01:05:00 kernel: Linux version 4.1.52 (merlin@ubuntu-dev) (gcc version 5.5.0 (Buildroot 2017.11.1) ) #1 SMP PREEMPT Fri Aug 6 17:53:13 EDT 2021
May 5 01:05:00 kernel: CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
May 5 01:05:00 kernel: CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
May 5 01:05:00 kernel: Machine model: Broadcom BCM947622
May 5 01:05:00 kernel: bootconsole [earlycon0] enabled
May 5 01:05:00 kernel: Memory policy: Data cache writealloc
May 5 01:05:00 crashlog: <6>device wl0.1 entered promiscuous mode
May 5 01:05:00 kernel: PERCPU: Embedded 10 pages/cpu @dfbc9000 s11852 r8192 d20916 u40960
May 5 01:05:00 crashlog: <4>^[[0;33;41m[ERROR mcast] bcm_mcast_netlink_process_snoop_cfg,926: interface 22 could not be found^[[0m
May 5 01:05:00 kernel: Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048
May 5 01:05:00 kernel: Kernel command line: isolcpus=3 root=ubi:rootfs_ubifs ubi.mtd=0 rootfstype=ubifs console=ttyAMA0 earlyprintk debug irqaffinity=0 pci=pcie_bus_safe
May 5 01:05:00 crashlog: <4>^[[0;33;41m[ERROR mcast] bcm_mcast_netlink_process_snoop_cfg,926: interface 22 could not be found^[[0m
May 5 01:05:00 kernel: PID hash table entries: 2048 (order: 1, 8192 bytes)
May 5 01:05:00 kernel: Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
May 5 01:05:00 kernel: Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
May 5 01:05:00 kernel: Memory: 510144K/524288K available (4724K kernel code, 1722K rwdata, 1284K rodata, 212K init, 414K bss, 14144K reserved, 0K cma-reserved, 0K highmem)
May 5 01:05:00 crashlog: <6>device eth4.501 entered promiscuous mode
May 5 01:05:00 kernel: Virtual kernel memory layout:
May 5 01:05:00 kernel: vector : 0xffff0000 - 0xffff1000 ( 4 kB)
May 5 01:05:00 kernel: fixmap : 0xffc00000 - 0xfff00000 (3072 kB)
May 5 01:05:00 kernel: vmalloc : 0xe0800000 - 0xff000000 ( 488 MB)
May 5 01:05:00 crashlog: <4>^[[0;33;41m[ERROR mcast] bcm_mcast_netlink_process_snoop_cfg,926: interface 22 could not be found^[[0m
May 5 01:05:00 kernel: lowmem : 0xc0000000 - 0xe0000000 ( 512 MB)
May 5 01:05:00 kernel: pkmap : 0xbfe00000 - 0xc0000000 ( 2 MB)
May 5 01:05:00 kernel: modules : 0xbf000000 - 0xbfe00000 ( 14 MB)
May 5 01:05:00 kernel: .text : 0xc0018000 - 0xc05f64c4 (6010 kB)
May 5 01:05:00 kernel: .init : 0xc05f7000 - 0xc062c000 ( 212 kB)
May 5 01:05:00 kernel: .data : 0xc062c000 - 0xc07dab60 (1723 kB)
May 5 01:05:00 crashlog: <4>^[[0;33;41m[ERROR mcast] bcm_mcast_netlink_process_snoop_cfg,926: interface 22 could not be found^
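Since crashes are easy to miss, a rough way to check a saved syslog is to count boot markers and see which boots also carry crashlog lines. The Python sketch below is only an illustration: the function name `count_boots` is made up, and it assumes, as the excerpt above suggests but does not prove, that crashlog lines are written into the syslog right after the "Booting Linux" line of the boot that follows a crash.

```python
BOOT_MARKER = "kernel: Booting Linux on physical CPU"
CRASH_MARKER = "crashlog:"


def count_boots(lines):
    """Return (boots, boots_with_crashlog) for an iterable of syslog lines."""
    boots = 0
    boots_with_crashlog = 0
    in_boot = False          # True once the first boot marker has been seen
    crash_counted = False    # True once the current boot has been counted
    for line in lines:
        if BOOT_MARKER in line:
            boots += 1
            in_boot = True
            crash_counted = False
        elif CRASH_MARKER in line and in_boot and not crash_counted:
            # Assumption: crashlog lines after a boot marker belong to it.
            boots_with_crashlog += 1
            crash_counted = True
    return boots, boots_with_crashlog


sample = [
    "May 5 01:05:00 kernel: Booting Linux on physical CPU 0x0",
    "May 5 01:05:00 crashlog: <6>device wl0.1 entered promiscuous mode",
    "May 5 01:05:00 kernel: Booting Linux on physical CPU 0x0",
    "May 5 01:05:00 kernel: Memory policy: Data cache writealloc",
]
print(count_boots(sample))
```

A boot count higher than the number of reboots you triggered yourself, or any boot that carries crashlog lines, is worth a closer look.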
 
