computer knocks out entire network at startup

p3ter · Jun 8, 2021

I have a Lenovo gaming PC which is physically connected to a small 4-port switch, which is in turn connected to port 4 on my RT-AC88U (I have given up using ports 5-8 on the router due to a known RT-AC88U problem). I also have an RT-AC68U as an AiMesh node with wired backhaul to port 2 on my RT-AC88U. It's a rather busy network, with 25+ devices continuously connected and 60+ in total.

The router & network is generally very reliable, I have removed all 'fluff' (QoS, Parental Controls, VPN etc) from the Router, and it generally sits at less than 15% CPU load even under heavy network traffic.

But this Lenovo seems to reliably knock out the entire network very soon after booting, or as my son goes into a game, loads a new YouTube video etc. During that 'blackout' period, the responsible computer seems to have good network access, but nobody else does. For other clients it looks like DNS resolution dies first, then after a few seconds you start to see 'you are not connected to the internet' etc. Then after a while (30+ seconds) the network recovers. The severity and duration of the problem vary, and seem to be related to how 'busy' that computer's network connection is at startup. Exactly the same type of task (loading a new game, etc.) tends not to cause problems 30 minutes later - it's only during the first minutes that the unreliability occurs.
The downtime also causes the Mesh node to lose the connection to the primary router, which in turn causes local name resolution failure (can't reach devices connected via the Mesh by their computer names), and the inevitable chunk of association/disassociation.

I have limited that computer's network connection to 100 Mbps on the client adapter (we have a 1 Gbps network, and can get 800 Mbps reliably), so in theory, even if that computer was trying to steal all available bandwidth, there should still be plenty left over... That, combined with the 'self-healing' nature of the problem, makes it hard for me to figure out where to look. Checking the router logs shows nothing noteworthy, and running netstat, ping -t (ping an address continuously) and Task Manager on that computer doesn't show anything amiss at the time the problem is occurring.

Any ideas where to start troubleshooting this?

Thanks!

L&LD · Jun 8, 2021

Is that ISP connection symmetrical 1Gbps up/down?

Is that pc infected? How much would you dread doing a clean Windows install on it again (if it's just for gaming)?

p3ter · Jun 8, 2021

I can certainly take a look at the OS level and look for viruses and rootkits, but my gut feel was that this might be more at the network level... Task Manager doesn't show any strange activity, and it's weird the way it 'sorts itself out' after the initial startup chaos. I was wondering about broadcast storm / switching loop type issues, but frankly don't know how I would look for those... Since local computer names seem to die first, I also wondered if the computer somehow 'competes' with the router, e.g. advertising itself as the master browser and confusing things...

I have some experience with Sysinternals Suite, and will look using TcpView and Process Explorer to see if there is anything weird going on. Will also check virus definitions and do a full scan.

p3ter · Jun 8, 2021

Does it matter if an AiMesh client with Ethernet backhaul is connected to the first router via a switch? Is it possible that connecting via a switch might create a switching loop? I.e. are there two routes to all the devices connected on the same switch as the AiMesh client node, either direct from the first router OR via the second router? Might this cause broadcast packets to loop indefinitely?
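
One way to reason about the "two routes" question is to write the cabling down as a list of links and check whether any link closes a cycle. The following is a minimal union-find sketch in Python; the device names and both link lists are hypothetical examples, not the actual topology from this thread.

```python
def find_loop_links(links):
    """Return the links that close a cycle in an undirected cabling graph."""
    parent = {}

    def find(node):
        # Each device starts as its own group; follow parents to the group root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    loop_links = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this cable adds a second path.
            loop_links.append((a, b))
        else:
            parent[root_a] = root_b
    return loop_links


# Hypothetical cabling: each tuple is one Ethernet cable.
tree_cabling = [
    ("router", "switch"),
    ("switch", "mesh_node"),
    ("switch", "gaming_pc"),
]
# Same cabling plus a second, direct cable from the router to the mesh node.
looped_cabling = tree_cabling + [("router", "mesh_node")]

print(find_loop_links(tree_cabling))
print(find_loop_links(looped_cabling))
```

A single uplink through a switch forms a tree and reports no loop links; a loop only appears when a second physical (or bridged wireless) path joins two devices that are already connected.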

thiggins · Jun 8, 2021

p3ter said:
Does it matter if an AiMesh client with Ethernet backhaul is connected to the first router via a switch? Is it possible that connecting via a switch might create a switching loop?

No. Your Lenovo is Ethernet only (unless there's a Wi-Fi adapter you haven't mentioned).

What does Windows performance monitor tell you about network use on the Lenovo during the blackout period?

What is your ISP bandwidth?

Does the Gaming PC have any sort of gaming software on it that probes for gaming servers at startup? In the old days, some utilities were very impolite and launched dozens of parallel network probes looking for servers.

dosborne · Jun 8, 2021

Might be worthwhile to set up a network sniffer to see what is going on. What specific OS version is running on the PC? It could be trying to take control of the domain, flooding the network, etc.

I'd also try connecting on a different port just to rule out any quirkiness.

I'd also try bypassing the switch or trying another if you have one. I've seen a bad switch kill an entire network before.

p3ter · Jun 9, 2021

Thanks, I have given that computer a good going over, done all the updates, run a couple of complete AV/spyware scans, and disabled a chunk of startup bloat. I also returned the network card settings to default (back to auto-negotiated 1 Gbps instead of manual 100 Mbps).
I have also changed the switch configuration, so that there is only one possible route to every device. Previously it looked like this:

Now it looks like this:

The situation seems to be significantly improved, and I am not getting the complete blackout (at least not during the limited testing I have done), but I have now found that I can closely simulate the problem to a lesser extent using a speedtest from fast.com if I change the test settings to minimum 8 parallel connections and min 30 seconds test.

While running a speed test on the 'problem' computer, I was running a constant ping to my ISP's closest gateway on my other computer, and I see this:

Code:

Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
(START SPEED TEST ON OTHER COMPUTER)
Reply from nnn.nnn.nnn.nnn: bytes=32 time=23ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1246ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=159ms TTL=61
Request timed out.
Reply from nnn.nnn.nnn.nnn: bytes=32 time=3945ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=2ms TTL=61
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from nnn.nnn.nnn.nnn: bytes=32 time=2846ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=3ms TTL=61
(END SPEED TEST ON OTHER COMPUTER)
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
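
To compare runs like the one above without eyeballing them, the ping output can be summarized with a short script. This is a rough sketch in Python that assumes the English Windows `ping -t` output format shown here; the function name and the 100 ms "slow" threshold are arbitrary choices, and the sample is the portion of the log recorded during the speed test.

```python
import re

# Matches "time=23ms" as well as the "time<1ms" form Windows prints for fast replies.
REPLY_RE = re.compile(r"time[=<](\d+)ms")


def summarize_ping(lines, slow_ms=100):
    """Summarize Windows ping output: replies, timeouts, and slow replies."""
    times = []
    timeouts = 0
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            times.append(int(match.group(1)))
        elif "Request timed out" in line:
            timeouts += 1
    total = len(times) + timeouts
    return {
        "replies": len(times),
        "timeouts": timeouts,
        "loss_pct": round(100 * timeouts / total, 1) if total else 0.0,
        "max_ms": max(times) if times else None,
        "slow_replies": sum(1 for t in times if t >= slow_ms),
    }


sample = """\
Reply from nnn.nnn.nnn.nnn: bytes=32 time=23ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1246ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=1ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=159ms TTL=61
Request timed out.
Reply from nnn.nnn.nnn.nnn: bytes=32 time=3945ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=2ms TTL=61
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from nnn.nnn.nnn.nnn: bytes=32 time=2846ms TTL=61
Reply from nnn.nnn.nnn.nnn: bytes=32 time=3ms TTL=61
"""

summary = summarize_ping(sample.splitlines())
print(summary)
```

For the lines logged during the speed test this counts 11 replies and 5 timeouts, with a worst reply of 3945 ms.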

...I then started looking more closely at the ASUS status page - I wondered whether I was missing a 'burst' of peak CPU usage because the router was so busy that the web page was not being updated. So instead I started looking via a terminal connection and the 'top' command. This shows a very clear peak while the speed test is running:

Code:

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    3     2 admin    RW       0  0.0   0 40.6 [ksoftirqd/0]
   38     2 admin    SW       0  0.0   1  6.1 [mtdblock3]
  395     1 admin    R     6800  1.3   1  2.2 nt_center
  599     1 admin    S     7272  1.4   1  1.8 httpd -i br0
  547   391 admin    R     6388  1.2   1  1.8 nt_monitor
  392   391 admin    R     6388  1.2   1  1.7 nt_monitor
  585     1 admin    S     1560  0.3   0  0.6 avahi-daemon: running [router.loca
    1     0 admin    S     8728  1.6   1  0.5 /sbin/preinit
Mem: 170552K used, 344664K free, 3264K shrd, 1048K buff, 14348K cached
CPU:  3.0% usr  5.9% sys  0.0% nic 49.2% idle  0.2% io  0.0% irq 41.4% sirq
Load average: 0.39 0.20 0.15 2/119 30296
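
As a rough illustration of how to read the summary line above, the sketch below (Python, run on a PC against text copied from the router, not on the router itself) splits the `CPU:` line into its fields and shows how much of the non-idle time is softirq work. The function name is made up for this example.

```python
import re


def parse_cpu_line(line):
    """Parse a BusyBox-style top 'CPU:' summary line into {field: percent}."""
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%\s+(\w+)", line)}


cpu_line = "CPU:  3.0% usr  5.9% sys  0.0% nic 49.2% idle  0.2% io  0.0% irq 41.4% sirq"
cpu = parse_cpu_line(cpu_line)
busy = 100.0 - cpu["idle"]
print(f"busy {busy:.1f}%, of which softirq {cpu['sirq']:.1f}%")
```

Almost all of the busy time in this snapshot is `sirq`. The process list only shows CPUs 0 and 1, so if these percentages are averaged over two cores, a figure around 41% would be consistent with one core spending most of its time on softirq handling; that reading is an assumption, not something the output confirms.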

...Googling 'ksoftirqd' gives posts like this one:
https://askubuntu.com/questions/7858/why-is-ksoftirqd-0-process-using-all-of-my-cpu

In some situations IRQs come very very fast one after the other and the operating system cannot finish servicing one before another one arrives. This can happen when a high speed network card receives a very large number of packets in a short time frame.

Because the operating system cannot handle IRQs as they arrive (because they arrive too fast one after the other), the operating system queues them for later processing by a special internal process named ksoftirqd.

If ksoftirqd is taking more than a tiny percentage of CPU time, this indicates the machine is under heavy interrupt load.
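
One way to check whether the softirq load is network receive work is to take two copies of `/proc/softirqs` a few seconds apart (for example with `cat /proc/softirqs` over the terminal connection) and compare the `NET_RX` counters. The sketch below does that comparison in Python on saved text; the two snapshots and their numbers are invented for illustration and are not from this router.

```python
def parse_softirqs(text):
    """Parse the text of /proc/softirqs into {name: [count per CPU]}."""
    rows = {}
    for line in text.splitlines()[1:]:  # the first line is the CPU header
        name, _, counts = line.partition(":")
        if counts.strip():
            rows[name.strip()] = [int(c) for c in counts.split()]
    return rows


def softirq_rate(before, after, name, seconds):
    """Per-CPU softirqs per second for one row between two snapshots."""
    b = parse_softirqs(before)[name]
    a = parse_softirqs(after)[name]
    return [(y - x) / seconds for x, y in zip(b, a)]


# Hypothetical snapshots taken 5 seconds apart.
snapshot_before = """\
                    CPU0       CPU1
       TIMER:      90000      88000
      NET_TX:        120         80
      NET_RX:       1000        500
"""
snapshot_after = """\
                    CPU0       CPU1
       TIMER:      90500      88500
      NET_TX:        150         90
      NET_RX:     251000        700
"""

rates = softirq_rate(snapshot_before, snapshot_after, "NET_RX", seconds=5)
print(rates)
```

A `NET_RX` rate that is very high on one CPU and near zero on the other would suggest that packet reception is being handled by a single core.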

...I think this all tells me that my AC88U cannot cope with symmetric 1Gbps data transfer - or at least, it can cope...

...but only by sacrificing every other connection while it buffers the communication!

follower · Jun 9, 2021

p3ter said:
Any ideas where to start troubleshooting this?

Thanks!

Check this out if it's similar.

AC88U kernel: net_ratelimit: number callbacks suppressed

Sys log. There is no other error messages except following errors. kernel: net_ratelimit: 239568 callbacks suppressed kernel: net_ratelimit: 242672 callbacks suppressed kernel: net_ratelimit: 243971 callbacks suppressed kernel: net_ratelimit: 241160 callbacks suppressed kernel: net_ratelimit...

www.snbforums.com

vaboro · Jun 15, 2021

p3ter said:
Any ideas where to start troubleshooting this?

Do you have STP (Spanning Tree Protocol) enabled?
Unmanaged switches are evil. I have had storms from time to time even with STP enabled until I changed all unmanaged switches to managed ones.
Enable storm protection in switches' settings.

p3ter · Jun 17, 2021

I have unmanaged switches... The main one is a D-Link GO SW‑16G, pretty low-end stuff. Since I made a number of changes in parallel when troubleshooting this I can't be absolutely sure, but I have a gut feel that taking the RT-AC68U AiMesh node off the D-Link Switch and cabling the backhaul directly to the RT-AC88U Router has made the most significant difference.

I feel like I'm in the same place with my home network right now as the owner of an old car... It's running (just about), but it's not worth investing piecemeal in any one area... I've been considering a move to Mikrotik, Ubiquiti or a PC-based router for a while now, and I keep seeing more evidence that the kit I have is simply not making the grade... But working from home at the moment, the overriding principle is "if it ain't broke, don't try to fix it"...

vaboro · Jun 18, 2021

p3ter said:
taking the RT-AC68U AiMesh node off the D-Link Switch and cabling the backhaul directly to the RT-AC88U Router has made the most significant difference

Enabling STP does also make a difference.

p3ter said:
"if it ain't broke, don't try to fix it"...

I always use the same principle myself

John Fitzgerald · Jun 18, 2021

p3ter said:
Any ideas where to start troubleshooting this?

Thanks!

A few things to try from the Lenovo end:

Resetting the network stack

Note that these commands affect all of your networking adapters, both physical and virtual, both used and unused, so you will see some errors when running these commands, where the resets targeted adapters that are not being used. These errors are perfectly normal, and not a cause for concern. Complete each step in order, even if you have done some of these previously, and even if you encounter errors.

1. In the search box on the taskbar, type command prompt, right-click the Command Prompt result, select Run as administrator, and confirm.
2. At the command prompt, enter the commands in steps 3 to 7 (decline restarting your machine until you have entered the final command).
3. Type ipconfig /release and press Enter.
4. Type ipconfig /flushdns and press Enter.
5. Type ipconfig /renew and press Enter. (This will stall for a moment.)
6. Type netsh int ip reset and press Enter. (Don't restart yet.)
7. Type netsh winsock reset and press Enter.
8. Now restart your machine using Start > Power > Restart and test to see if the issue is resolved.

Source: https://www.intel.com/content/www/us/en/support/articles/000058982/ethernet-products.html
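
The reset steps above could also be scripted so they always run in the same order. The following is a hedged sketch in Python, not part of the Intel article: the function name and the `dry_run` switch are made up for this example, it defaults to only printing the commands, and a real run would need Windows and an elevated (Administrator) prompt.

```python
import subprocess

# The five commands from the steps above, in the same order.
RESET_COMMANDS = [
    ["ipconfig", "/release"],
    ["ipconfig", "/flushdns"],
    ["ipconfig", "/renew"],
    ["netsh", "int", "ip", "reset"],
    ["netsh", "winsock", "reset"],
]


def reset_network_stack(dry_run=True):
    """Run the reset commands in order; with dry_run, only print them."""
    results = []
    for cmd in RESET_COMMANDS:
        if dry_run:
            print("would run:", " ".join(cmd))
            results.append((cmd, None))
            continue
        # check=False: the steps say some errors are expected, so keep going.
        completed = subprocess.run(cmd, check=False)
        results.append((cmd, completed.returncode))
    return results


if __name__ == "__main__":
    reset_network_stack(dry_run=True)
```

Calling `reset_network_stack(dry_run=False)` would actually execute the commands; the machine still needs to be restarted afterwards, as in the final step.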

Windows cleanup

From CMD or PowerShell run as Administrator, copy and paste the following and press Enter:

dism.exe /online /cleanup-image /restorehealth

and/or:

dism.exe /online /Cleanup-Image /StartComponentCleanup /ResetBase
