Sporadic GT-AXE11000 reboots due to kernel panic

Well, that's a nasty Xmas present!
Just had a "Kernel panic - not syncing: BUG!" crash.
The first in many months, and the first ever on 388.5.
 
Code:
May  5 01:05:10 crashlog: <4>BUG: failure at kernel/irq_work.c:141/irq_work_run_list()!
May  5 01:05:10 crashlog: <0>Kernel panic - not syncing: BUG!

Same here. I don't understand why they won't fix such a core issue affecting the premium models. I really cannot have this router rebooting randomly and unpredictably. I'm ready to give up on ASUS and move to a simpler setup of switches and PoE APs from Netgear. It's been many months, maybe years? I don't even remember.

I have a GT-AXE11000 and a GT-AX11000 (without the E), running the same (latest) Merlin firmware with virtually identical configurations. The reboots have been plaguing the AXE for months, on every firmware. I have since disabled everything: QoS, classification, Roaming Assistant, TM IPS. I believe the reboots are now less frequent, but they still happen occasionally. I turned off the reboot scheduler to see how long it will last.

Interesting detail: why does the AXE show Protected Management Frames options while the AX does not? Both are configured identically with WPA2/WPA3.

On the AXE I have PMF set to Capable on 2.4/5 GHz, and I had it set to Required on 6 GHz.

I have now turned 6 GHz off to see whether it or Protected Management Frames cause this, as someone suggested a few posts ago.
 

With 6 GHz turned off, I got 7-8 days of uptime before the same crash. Very disappointing; still not resolved.

Done (no effect, or only longer uptime):
  1. Disabled IPS - DONE, didn't resolve
  2. Disabled all QoS / Traffic Analyzer - DONE, didn't resolve
  3. Disabled 6 GHz - DONE, didn't resolve
Still left to try on the AXE11000 (the one that is crashing):
  1. Turn off WPA3 / disable Protected Management Frames
  2. Disable all IPv6 (anywhere I can find it); apparently this helps with Linux issues
  3. Change NAT back from 'Full Cone' to 'Symmetric'
  4. TCP adjustments (TCP connections limit to 100000, halve all other settings); does this help with the kernel panic?
  5. Turn off WiFi 6 (160 MHz), which reportedly helped a lot
Weirdly, the AX with nearly identical settings doesn't crash (though the AX is in AP mode, while the AXE is in full router mode). That makes me think it isn't related to WPA3 or 160 MHz. But different devices connect to the AX and the AXE, so those can't be ruled out either?

I'd be glad for any pointers: things that have worked, or systems (maybe the AXE16?) that are 100% not affected.
 
I want to add that NAT is actually already Symmetric, both flow cache and runner are on, and there's no swap file configured. I also DO use UPnP (secure mode) because of the children and the Xbox on the network. I vaguely remember @RMerlin saying ages ago that UPnP isn't good; could that be a source of this "not syncing" kernel panic?

To test further, I could set up a batch of static port forwards on the AXE (router, public IP) to my AX (AP, private IP only), enable UPnP on the AX only (mapping private IPs/ports on the AX to the statically forwarded ports on the AXE), and run the AX in full router mode as well, downstream of the AXE. But that would probably cause other issues with running everything as one network. I could also let the switch get the public IP (the AXE gets it at the moment) and put both routers downstream. I'm not sure about either option.

I use UPnP for the Xbox, but also for Chromecast discovery across the whole network, and there may be other casting uses I'm unaware of, so I can't just switch it off. I could, however, try moving that responsibility to the AX and see if the AXE still crashes. Kinda lost here...
 
Based on the log below and the message "BUG: failure at kernel/irq_work.c:141/irq_work_run_list()",
I checked the code in the AXE branch: https://github.com/RMerl/asuswrt-me...-5.02axhnd/kernel/linux-4.1/kernel/irq_work.c

@RMerlin, in newer kernels this check looks like this: "BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT));"
At first sight, CONFIG_PREEMPT_RT does not seem to be set in the Makefile, which makes the right-hand part always true.

So, naively: something turned interrupts on when they should have been off? Is there any way to see what did?
Or would enabling CONFIG_PREEMPT_RT fix the problem? I assume it has further-reaching implications, based on: "The PREEMPT_RT patch has been partially merged into the mainline Linux kernel, starting from version 5.15"

I either need to find the source of the problem or find a new device that doesn't have it (a different Broadcom chip?).


Code:
May  5 01:05:10 crashlog: <4>BUG: failure at kernel/irq_work.c:141/irq_work_run_list()!
May  5 01:05:10 crashlog: <0>Kernel panic - not syncing: BUG!
May  5 01:05:10 crashlog: <4>CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O    4.1.52 #2
May  5 01:05:10 crashlog: <4>Hardware name: Broadcom-v8A (DT)
May  5 01:05:10 crashlog: <0>Call trace:
May  5 01:05:10 crashlog: <4>[<ffffffc000087398>] dump_backtrace+0x0/0x150
May  5 01:05:10 crashlog: <4>[<ffffffc0000874fc>] show_stack+0x14/0x20
May  5 01:05:10 crashlog: <4>[<ffffffc00055c068>] dump_stack+0x90/0xb0
May  5 01:05:10 crashlog: <4>[<ffffffc000559d2c>] panic+0xd8/0x220
May  5 01:05:10 crashlog: <4>[<ffffffc0000f8670>] irq_work_run+0x0/0x48
May  5 01:05:10 crashlog: <4>[<ffffffc0000f8888>] irq_work_tick+0x48/0x68
May  5 01:05:10 crashlog: <4>[<ffffffc0000db554>] update_process_times+0x54/0x70
May  5 01:05:10 crashlog: <4>[<ffffffc0000e9da8>] tick_sched_handle.isra.6+0x28/0x80
May  5 01:05:10 crashlog: <4>[<ffffffc0000e9e44>] tick_sched_timer+0x44/0x90
May  5 01:05:10 crashlog: <4>[<ffffffc0000dbf94>] __run_hrtimer.isra.4+0x4c/0x110
May  5 01:05:10 crashlog: <4>[<ffffffc0000dc37c>] hrtimer_interrupt+0xdc/0x298
May  5 01:05:10 crashlog: <4>[<ffffffc000383998>] arch_timer_handler_phys+0x30/0x40
May  5 01:05:10 crashlog: <4>[<ffffffc0000d1120>] handle_percpu_devid_irq+0x78/0xa0
May  5 01:05:10 crashlog: <4>[<ffffffc0000ccc24>] generic_handle_irq+0x34/0x50
May  5 01:05:10 crashlog: <4>[<ffffffc0000ccf34>] __handle_domain_irq+0x5c/0xb8
May  5 01:05:10 crashlog: <4>[<ffffffc000080c18>] gic_handle_irq+0x38/0x90
May  5 01:05:10 crashlog: <4>Exception stack(0xffffffc03e8cfdc0 to 0xffffffc03e8cfef0)
May  5 01:05:10 crashlog: <4>fdc0: 421c0f78 00002bda 00000000 00000080 3e8cff10 ffffffc0 00381494 ffffffc0
May  5 01:05:10 crashlog: <4>fde0: 421c0f78 00002bda 2c5b40a2 00228f8b 0000d8c7 00000000 14000963 00000000
May  5 01:05:10 crashlog: <4>fe00: 04fc49e8 00000000 00000018 00000000 98686cad 00227e9b 2f6074a8 ffffffc0
May  5 01:05:10 crashlog: <4>fe20: 3e8c3a70 ffffffc0 3e8cfec0 ffffffc0 00000003 00000000 ffeb472c 00000000
May  5 01:05:10 crashlog: <4>fe40: 00000000 00000000 ffeb3cb0 00000000 00013384 00000000 00000000 00000000
May  5 01:05:10 crashlog: <4>fe60: 00424138 ffffffc0 00000000 00000000 00000000 00000000 421c0f78 00002bda
May  5 01:05:10 crashlog: <4>fe80: 3ffd4338 ffffffc0 00000001 00000000 00000001 00000000 4216d11d 00002bda
May  5 01:05:10 crashlog: <4>fea0: 3e8cc000 ffffffc0 0091d000 ffffffc0 0074b000 ffffffc0 3ffd4338 ffffffc0
May  5 01:05:10 crashlog: <4>fec0: 0078d580 ffffffc0 3e8cff10 ffffffc0 0038148c ffffffc0 3e8cff10 ffffffc0
May  5 01:05:10 crashlog: <4>fee0: 00381494 ffffffc0 60000145 00000000
May  5 01:05:10 crashlog: <4>[<ffffffc000083f00>] el1_irq+0x80/0xf8
May  5 01:05:10 crashlog: <4>[<ffffffc000381598>] cpuidle_enter+0x18/0x20
May  5 01:05:10 crashlog: <4>[<ffffffc0000c5d54>] cpu_startup_entry+0x1ec/0x250
May  5 01:05:10 crashlog: <4>[<ffffffc00008d190>] secondary_start_kernel+0x150/0x178
 
Just another crash, this time after only 1 day of uptime :-(
Tried so far:
  1. Disabled IPS - DONE, didn't resolve
  2. Disabled all QoS / Traffic Analyzer / Roaming Assistant - DONE, didn't resolve
  3. Disabled 6 GHz - DONE, didn't resolve
  4. Turned off WPA3 / disabled Protected Management Frames - DONE, didn't resolve
  5. Disabled all IPv6 - DONE, has been off for a long time
  6. NAT on 'Symmetric' - DONE, has been that way for a long time
  7. TCP adjustments - changed from 300k to 250k, didn't do anything; the crash also happened during periods of no load
  8. Turned off WiFi 6 (160 MHz) - DONE, didn't resolve
I just changed WPA3 to Open on the (disabled) 6 GHz band, you never know. I also disabled AX mode on 2.4 GHz; actually, I will deactivate 2.4 GHz altogether and leave only 5 GHz.
If it crashes again, next up:

1. Deactivating UPnP
2. Deactivating flow cache

lol, not much left to disable apart from deactivating all WiFi. Really no idea how to troubleshoot this.

The last log entry before the most recent reset was a UPnP removal 4 hours earlier, so it's unlikely to be connected.
 
I'm giving up on this; ordering an AX88U Pro or GT-AX6000 now... bye bye 6 GHz
 
After another crash, documenting this for posterity. Uptime is 8 hours at the moment.
  1. Also turned off UPnP > no effect
  2. Fixed the WLAN channel > nope (I hoped this would turn off DFS, but apparently 160 MHz has to go too)
Now I really don't have much left to turn off, so these are the final changes:
  1. Turned Traditional QoS ON (is there another way to turn off HW acceleration?)
  2. Switched to 80 MHz to turn off DFS for good
I may never find out how to fix this, as the praised AX6000 is arriving today and this AXE11000 will have to be decommissioned.
 
Got this router a month ago. Currently running the latest beta of Merlin's firmware; uptime is 13 days. I haven't seen a crash so far.
 
Maybe it is defective somehow; my AX (without the E) 11000 has an uptime of many months. But since this exact bug has been posted over and over for years, I thought I might be able to get to the root of the problem... Someone suggested it's due to an old Linux kernel, but ASUS didn't accept the pull request because it would only hide the problem. No idea; I am giving up. Expensive piece of furniture now.
 
Well, this is funny: for the first time I'm getting 2d+ of uptime on the AXE11000, but now I have the AX6000 as well. Maybe I'll just chase the problem all the way down now. I made only 2 changes: turned on traditional QoS (to turn off HW acceleration) and disabled DFS/automatic channel selection.
 
Strange, my AXE100 has run problem-free. I have been running Merlin firmware on the router since it became available. Lucky, I guess.
 
Had another "Kernel panic - not syncing: BUG!" on my AX86S last night.
The second I've had since installing 388.6.
Any more and I might go back to 388.5; I didn't get a single one on that release.
 
I got a few days of uptime with almost everything turned off. The last thing I did was turn 6 GHz back on, and now it's off again because I got another panic. I got the longest uptime when I deactivated 2.4 and 6 GHz and turned off automatic channel selection for 5 GHz. I had also turned off hardware acceleration and pretty much everything else.

Maybe the problem doesn't arise if only one wireless band is on, because there's no need for concurrency. I am fine with just 5 GHz being active, though it's not really what I bought the machine for. If I get it stable again (>5 days of uptime), I'll clear the NVRAM and JFFS and then start re-enabling features one at a time. Kind of like an elimination diet ;-)

I don't know how concurrency is implemented in the wireless handling; it's probably closed source, so I can't speculate further.
 
