R7800 poor WAN download/upload rate when a Fast Ethernet client is connected at 100Mbps to a Gigabit built-in switch (Workaround available)

sppmaster

Regular Contributor
I'm trying to get more people involved in investigating an issue where the R7800 cannot reach full WAN download/upload speeds when a client is connected to a LAN port at 100Mbps (the bug is easier to reproduce with 2 or more 100Mbps clients).
I first noticed this with OpenWRT firmware, then tried the Netgear stock firmware and Voxel's latest firmware. The issue is present in all of them, although the size of the WAN/LAN speed drop differs slightly between them.
It's hard to notice because it requires a specific setup and several other conditions to be met at the same time.
I've opened a thread here to report this issue.

And here too, with full details and observations, and with several other people confirming the bug.

There is also an issue open on the OpenWRT GitHub issues page.

I know a lot of users here care about their R7800s, so I ask you to read the posts and try to replicate the issue if you can.
I really hope someone will be able to fix this so we can have bug-free firmware for the otherwise excellent R7800.
 

sppmaster

Regular Contributor
I've posted the exact steps to reproduce/trigger the bug.
 

sppmaster

Regular Contributor
To summarize the issue:
The Netgear stock firmware has a bug that causes a huge WAN performance degradation when a 100Mbps LAN client (Smart TV, AndroidTV box, etc.) is connected to any router LAN port and that 100Mbps device transfers data over the LAN.
WiFi is not affected by this bug. Only LAN ports are affected, when they carry WAN and LAN traffic simultaneously and at least one 100Mbps device connected to any LAN port is also transferring data over the LAN. In most cases LAN throughput suffers too, and LAN transfers are interrupted while data is being downloaded from the WAN.

Steps to reproduce the issue.
Here is my network setup

sppmaster_1-1649583843175.png



First I run an iperf3 server on PC (1), which is connected at 1Gbps and acts as the LAN server. Then I run iperf3 as a client with -R (the reverse option) on client device (2), a laptop connected at 100Mbps with a 4-wire cable (that's what triggers the bug). It doesn't matter whether device (2) is connected directly to a router LAN port or to the same port through a second switch (Gigabit or not).
I run network traffic from the PC server (1) to all three 100Mbps clients (2, 3 and 4) at the same time, around 300Mbps in total.
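
For anyone who wants to replicate the LAN load, here is a minimal Python sketch that prints the iperf3 commands for the steps above. The IP address and port numbers are hypothetical placeholders, not taken from the original setup. One server port per client is assumed because, as far as I know, a single iperf3 server instance handles one test at a time.

```python
import shlex

# Hypothetical values; substitute the ones from your own LAN.
SERVER_IP = "192.168.1.10"                   # PC (1), linked at 1Gbps
CLIENT_PORTS = {2: 5201, 3: 5202, 4: 5203}   # device number -> iperf3 port


def server_command(port):
    """iperf3 server listening on one port (run on PC 1)."""
    return ["iperf3", "-s", "-p", str(port)]


def client_command(server_ip, port, seconds=60):
    """iperf3 client in reverse mode (-R): the server sends, the client receives."""
    return ["iperf3", "-c", server_ip, "-p", str(port), "-R", "-t", str(seconds)]


if __name__ == "__main__":
    # Print one server/client command pair per 100Mbps device.
    for device, port in CLIENT_PORTS.items():
        print(f"PC (1):     {shlex.join(server_command(port))}")
        print(f"device ({device}): {shlex.join(client_command(SERVER_IP, port))}")
```

Start the three server commands on PC (1) first, then the matching client command on each 100Mbps device, and run the WAN speed test on PC (1) while all three transfers are active.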

Finally I run a WAN speed test on PC (1), connected at 1Gbps. The WAN down/up speed on PC (1) drops significantly.
When I run the WAN speed test on PC (1), WAN performance is ridiculous, varying from 20-30Mbps to a maximum of 40-50Mbps, with ping increasing to 50-500 ms (normally 1 ms). WAN upload speed is negatively impacted too: I have 700Mbps upload from my ISP but can reach 150-200Mbps at most. Sometimes speeds are a little higher, but all LAN transfers are interrupted during the WAN test.
With this setup, if client device (2) is connected at 1Gbps to LAN port 2 (I use the same laptop and just swap the cable for a 1Gbps one) and it doesn't send any data to any other 100Mbps device over the LAN, then laptop (2) can download/upload from/to the WAN at full Gigabit speeds, even though PC (1) cannot because it is sending data to the other 100Mbps devices.
Evidently the router LAN port through which LAN data is sent to a 100Mbps device on another port cannot send/receive data to/from the WAN at Gigabit Full Duplex speeds in this case.

If I use a second Gigabit switch it becomes a real nightmare, because even more 100Mbps devices are present on the LAN side, and none of the 100Mbps devices connected to the second switch can download from the WAN at more than 1-2Mbps, even if only one of them is involved in LAN traffic (sending data to or receiving data from the PC server connected at 1Gbps).

In the network setup above, if all devices 1, 2, 3 and 4 are connected at 1Gbps, WAN and LAN performance is great. Even if the PC server (1) sends data at 930Mbps over the LAN to 2, 3 and 4 (now all connected at 1Gbps) at the same time, it's still possible to download from the WAN on PC (1) at 930Mbps. We get the expected Gigabit Full Duplex performance: a 930Mbps download (with 1 ms ping!) from the WAN on PC (1) while PC (1) simultaneously sends data over the LAN at 930Mbps to other LAN clients.

I've used two different R7800 routers with three different firmwares with default and custom settings. The results are repeatable every time.

Unfortunately most current TVs and AndroidTV boxes only have 100Mbps NICs.
With another really cheap Gigabit router (20 bucks) I'm able to download and upload from/to the WAN on PC (1) at Gigabit Full Duplex speeds, 930/670 Mbps, with exactly the same network setup and the same tests.
 

xyzzy

Occasional Visitor
It sounds like what's happening is that the ethlan switch is getting backed up with traffic, filling all the buffer space.

I don't know for sure, but there is probably a switch chip on the 4 LAN ports, rather than each one being connected to the SoC individually. It buffers packets when they come in on one port and the port they are supposed to go out on is busy. If you have one device sending traffic at 1Gbps to one that is receiving it at 100Mbps, then the buffer is going to fill up. The sender needs to slow down.

It could be there is some misconfiguration of the switch chip that makes this so bad.

Or maybe 802.3x flow control isn't being used? This is supposed to slow down senders when switches get congested.
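
To put rough numbers on this, here is a small sketch of how quickly a buffer fills when data arrives at 1Gbps and drains at 100Mbps. The 128 KiB buffer size is purely an assumption for illustration; the real figure depends on the switch chip and how its buffers are configured.

```python
def fill_time_ms(buffer_bytes, ingress_mbps, egress_mbps):
    """Milliseconds until the buffer is full, or None if it never fills."""
    surplus_bps = (ingress_mbps - egress_mbps) * 1_000_000
    if surplus_bps <= 0:
        return None  # the port drains at least as fast as data arrives
    return buffer_bytes * 8 / surplus_bps * 1000


# Assumed buffer size, not a value from any datasheet.
ASSUMED_BUFFER_BYTES = 128 * 1024

print(fill_time_ms(ASSUMED_BUFFER_BYTES, 1000, 100))
```

With these assumed numbers the buffer fills in a little over one millisecond, after which the switch has to either drop frames or get the sender to pause.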
 

sfx2000

Part of the Furniture
If you're running the Netgear stock firmware, check whether Quality of Service is enabled and, if it is, check its upload/download settings.
 

HELLO_wORLD

Very Senior Member
I personally believe it is something like that.
It would explain why the problem is the same whatever the firmware used.

Now, if it is what is happening, is there a way to trick the chip (using VLANs) or to fix it programmatically?
 

sppmaster

Regular Contributor
I'm currently using three VLANs with OpenWRT firmware and cannot see a difference.
There are 2 or 3 other very similar issues reported on the OpenWRT GitHub for other router models.
Yesterday a MikroTik user shared his experience: https://forum.mikrotik.com/viewtopic.php?t=185253
It looks similar too. Several users from the OpenWRT community believe this might be a Linux kernel issue.
 

HELLO_wORLD

Very Senior Member
And we are stuck with an old kernel because of NG proprietary code…
 

sppmaster

Regular Contributor
If it's really a kernel issue it's strange no one noticed this.
Maybe it's a kernel driver issue and if someone is able to fix this soon, maybe Netgear can do it too.
 

HELLO_wORLD

Very Senior Member
Wait a second… Isn't OpenWRT based on more recent kernels?
So if they have the same bug, it is probably not kernel related, and as @sppmaster mentioned, others would have noticed it.

Something linked to this particular hardware seems more logical. What is the switch chip used in our R7800?
 

sppmaster

Regular Contributor
I use it currently with
1650477141214.png


Switch: Qualcomm Atheros QCA8337
Look here https://github.com/Deoptim/atheros/blob/master/QCA8337-datasheet.pdf
It looks like a really advanced switch. 362 pages to read. 4k VLANs.
Other switches aren't even close to this.
 

xyzzy

Occasional Visitor
I doubt VLANs really have anything to do with it.

The huge pings and bad performance sound like congestion and full buffers. You ought to be able to wireshark capture the traffic on both ends and find out more. Are packets being sent fast, or slow? Do they go out and not make it to destination or not go out at all? Do they arrive corrupted?

I've seen something like this before, with 100 Mbit hosts on a gigabit network. Not having 802.3x flow control was the problem for me. Take a look at §3.2.3 and §3.1.4 in that datasheet. It sounds like this feature isn't working as it should:
If the link partner supports autonegotiation, the 802.3x full-duplex flow control is auto-negotiated between the remote node and the QCA8337. If the full-duplex flow control is enabled, when the buffer number used for specific port exceeds the per port buffer threshold or total buffer used exceeds global based buffer threshold, the QCA8337 sends out an IEEE 802.3x compliant pause frame to stop the remote device from sending more frames.

There's an open source driver for this chip, drivers/net/dsa/qca8k.c, so it's not impossible to change its config or try to debug it.
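
The pause condition in the datasheet excerpt above can be sketched as a simple check. The function name, parameter names and any values you pass in are made up for illustration; this is not code from the qca8k driver, and it only mirrors the wording of the quoted paragraph.

```python
def should_send_pause(flow_control_enabled,
                      port_buffers_used, port_threshold,
                      total_buffers_used, global_threshold):
    """Return True when the quoted condition for sending a pause frame holds."""
    if not flow_control_enabled:
        # Without negotiated full-duplex flow control no pause frame is sent.
        return False
    # Either the per-port or the global buffer threshold being exceeded is enough.
    return (port_buffers_used > port_threshold
            or total_buffers_used > global_threshold)
```

If flow control was never negotiated on the Gigabit port, the first branch applies and the sender is never told to slow down, which would fit the symptoms described in this thread.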
 

sppmaster

Regular Contributor
I agree with you. Currently I don't have enough time to try anything with Wireshark.
As for debugging the driver, I'm not the right person, although I have some past experience working with debuggers.
But the main point is that the manufacturer should acknowledge and fix the issue once users have found it. Not everyone has the knowledge and the time to do it.
We do what we can so that others who don't have the same options can benefit in the end.
 

xyzzy

Occasional Visitor
Thinking about this more, I don't think it's the switch, or at least not the switch the way I was thinking. It looks like there is only a problem with WAN performance. I don't think the switch connects the LAN ports and the WAN port. The traffic between the two has to go through the SoC, since it wouldn't be much of a firewall or router if it just switched LAN traffic to/from the WAN. The switch chip has 5 ports and 2 links to the SoC, so those two uplinks are ethlan and ethwan. The WAN port is just connected to the ethwan link and the four LAN ports are switched with the ethlan link.

It's still one switch, so there are buffer resources shared between the WAN and LAN sides, even if they are basically two different networks. But if that were the cause, then it should affect LAN traffic too.

I think from your description that you can have your server (1) send data to the 100mbps clients and then have a 1gbps laptop (2) test WAN performance, while the server is sending data to the LAN, and WAN performance is fine for the laptop. But if the server WAN performance was tested, it would be bad.

Try this test:
1650572207329.png


Have the server stream to the 100 mbps clients in all tests. Server is connected to gigabit switch connected to R7800.

Then try 1), WAN performance on the server (which we already know is bad). Then 2), a gigabit laptop directly connected to the R7800 while the server is streaming to the 100 mbps clients, which I think you've done, and WAN performance is good. And finally 3), a gigabit laptop connected to the same gigabit switch as the server. I don't think you've done this.

It makes a difference because in test 2 the WAN traffic is on a different R7800 switch port than the 100 mbps lan traffic, but in test 3, it's on the same R7800 switch port.
 


sppmaster

Regular Contributor
You've correctly understood everything I've written so far. I haven't tried the third test you suggest because it requires a bit of rearrangement of my LAN equipment and cables.
I'll certainly try it. It may take 10 days or more before I'm able to do this.
Unfortunately at the moment I have other personal worries that eat up all of my time and I have to take care of them. I just hope all goes well and I will have a little spare time soon.
 

sppmaster

Regular Contributor
I've benchmarked a Belkin RT3200 with the same tests, with a 100Mbps client connected to a LAN switch port of the RT3200. It still runs the factory firmware. I can hardly reach 200Mbps download and 20Mbps upload. I have 1000/700Mbps from my ISP.
I can only say that I see the same decrease in WAN throughput. The same issue is present as with the R7800. So the bug is probably not tied to specific hardware, as the two routers are based on different SoCs.
 

kamoj

Very Senior Member
I've not followed this thread, but I want to share what was valid 10 years ago.
Maybe it's not true today or in this case, but here is my experience:

If you have Jumbo frames active for any device in the network, all devices need to have it, and
all devices must have the same frame size.
If not, ALL devices will get the same speed as the lowest speed connected device.
So check if the 100 Mbps connection uses Jumbo frames.
 

sppmaster

Regular Contributor
Pinging 192.168.1.12 with 9000 bytes of data:
Packet needs to be fragmented but DF set.
The client .12 is 100Mbps.
It's probably not jumbo frames, because the WAN/LAN performance varies between 20 and 200Mbps.
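
As a side note on the ping test above, here is a small sketch of the arithmetic for the largest ping payload that fits in one unfragmented IPv4 packet. It assumes a 20-byte IPv4 header without options plus the 8-byte ICMP header; the function names are just for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def fits_without_fragmenting(payload_bytes, mtu):
    """True if a ping with this payload can be sent with DF set on this MTU."""
    return payload_bytes <= max_ping_payload(mtu)


print(max_ping_payload(1500))  # standard Ethernet MTU
print(max_ping_payload(9000))  # a common jumbo-frame MTU
```

By this arithmetic a 9000-byte payload would not fit even with a 9000-byte MTU (the limit there is 8972 bytes), so a ping with a smaller payload such as 8972 bytes may be a more conclusive check for jumbo frames.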

I've just contacted Belkin support. They couldn't help much but escalated the issue, and I'm now waiting for a call so I can explain the details to tech support.
 
