LAN LACP/LAGG (802.3ad) Throughput & Balancing Issue

DexterII

Occasional Visitor
I was having an issue with throughput through my RT-AX88U router that I posted about in another thread. Since then I upgraded the router's firmware to Merlin's version 386.3_2, which sparked a new issue, so I figured I'd give this forum a shot.

Just a quick rundown of my setup:

SB8200 -> WAN LAGG Port WAN & 4 -> RT-AX88U -> LAN LAGG Port 1 & 2 -> 2x Thunderbolt to Gigabit Ethernet -> 2018 MacBook Pro

When I connect my computer directly to the modem (also using LAGG) I can hit almost 1200 Mbps. However, when hooked up through the router I only achieve 800 - 930 Mbps.

Using the traffic monitor built into the router's GUI, I can see that LAN 2 is doing all the receiving and LAN 1 is doing most of the sending; sometimes LAN 1 is barely doing anything at all. The bond is good on both ends. This is different behavior from the stock firmware, where the load was balanced (Rx/Tx was almost identical between the two ports).

I've even physically swapped the cables but the behavior remains the same: down on L2 and up on L1. So I think this indicates the router is deciding this behavior.

Is there a balancing setting somewhere? I'm assuming it would have to be set from the command line over SSH, if it exists. I've gone through almost all the menus two or three times now looking for something.

The problem with the stock firmware before was that when it balanced the load between the two ports, it maxed out around 750 Mbps. If I disconnected one of the links it would hit 940 Mbps (about the max for one GbE port).

It's a bit aggravating because I know from hooking up to the modem directly that I could have higher speeds. The router, as powerful as it is, is bottlenecking me here.

Here is the output from bond0, which appears to be the LACP link to my computer. I don't see anything out of the norm here to indicate why it's prioritizing LAN 2. Or is it normal for both eth interfaces to have the same 'Permanent HW addr'?

Also, I'm leaning towards the issue possibly being related to the 'Aggregator selection policy (ad_select)' setting, since it's set to 'count' as opposed to 'stable' or 'bandwidth'. Or perhaps it's something with 'xmit_hash_policy'. But I'm probably totally wrong, since the WAN bond seems to be configured the same way yet seems more balanced (see below).
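
For anyone wanting to inspect these two settings directly, here is a minimal sketch. It assumes the firmware's bonding driver exposes the standard Linux sysfs files under /sys/class/net/<bond>/bonding/, which may not be the case on this router. The helper name bond_opt and the SYSFS_ROOT override are hypothetical and exist only for illustration and testing. As far as the Linux bonding documentation describes it, ad_select chooses between aggregators, while xmit_hash_policy decides which port inside the active aggregator carries each outgoing flow, so with both ports already in Aggregator ID 1 the hash policy is probably the more relevant of the two.

```shell
# Print one bonding option for a bond interface.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) used only for testing.
bond_opt() {
    cat "${SYSFS_ROOT:-/sys}/class/net/$1/bonding/$2"
}

# Example, run on the router over SSH (output depends on the firmware):
#   bond_opt bond0 xmit_hash_policy
#   bond_opt bond0 ad_select
#   bond_opt bond1 xmit_hash_policy
```

Comparing bond0 against bond1 this way would show whether the LAN and WAN bonds really are configured identically.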


This is bond0 (the LAN bond; bond1 is the WAN bond):

Code:
admin@RT-AX88U:/proc/3081/net/bonding# cat ./bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): count
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 9
    Partner Key: 1
    Partner Mac Address: f0:XX:XX:XX:XX:f7

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 32
Permanent HW addr: a8:XX:XX:XX:XX:20
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 16
details actor lacp pdu:
    system priority: 0
    port key: 9
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    oper key: 1
    port priority: 32768
    port number: 4
    port state: 189

Slave Interface: eth4
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 33
Permanent HW addr: a8:XX:XX:XX:XX:20
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 17
Partner Churned Count: 17
details actor lacp pdu:
    system priority: 0
    port key: 9
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 32768
    oper key: 1
    port priority: 32768
    port number: 5
    port state: 189

Since the GUI doesn't show the WAN bond separately, here is the output from ifconfig. The WAN bond (eth0, eth1) appears to be somewhat balanced, but the LAN bond (eth3, eth4) seems to be favoring eth4 for TX and eth3 for RX.

Code:
admin@RT-AX88U:ifconfig
bond0     Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          inet6 addr: fe80::XXXX:XXXX:XXXX:XXXX/64 Scope:Link
          UP BROADCAST RUNNING ALLMULTI MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:274878609111 errors:0 dropped:0 overruns:0 frame:0
          TX packets:292061354549 errors:0 dropped:4294967295 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:270782263975 (252.1 GiB)  TX bytes:283944929931 (264.4 GiB)

bond1     Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          inet addr:73.XX.XX.33  Bcast:73.XXX.XXX.255  Mask:255.255.254.0
          inet6 addr: XXXX:XXX:XXXX:XX:XXXX:XXXX:XXX:XXXX/128 Scope:Global
          inet6 addr: fe80::XXXX:XXXX:XXXX:XXXX/64 Scope:Link
          UP BROADCAST RUNNING ALLMULTI MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:50844481 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4316426183 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:63495128204 (59.1 GiB)  TX bytes:13758449827 (12.8 GiB)

br0       Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: XXXX:XX:XXXX:XXXX::X/64 Scope:Global
          inet6 addr: fe80::XXXX:XXXX:XXXX:XXXX/64 Scope:Link
          UP BROADCAST RUNNING ALLMULTI MULTICAST  MTU:1500  Metric:1
          RX packets:22238119 errors:0 dropped:0 overruns:0 frame:0
          TX packets:50943383 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9233627966 (8.5 GiB)  TX bytes:63414801229 (59.0 GiB)

eth0      Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:33015624 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15677184 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:40099380129 (37.3 GiB)  TX bytes:7656874873 (7.1 GiB)

eth1      Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:17828860 errors:0 dropped:2 overruns:0 frame:0
          TX packets:5781732 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23395748435 (21.7 GiB)  TX bytes:1806611615 (1.6 GiB)

eth2      Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          inet6 addr: fe80::XXXX:XXXX:XXXX:XXXX/64 Scope:Link
          UP BROADCAST RUNNING ALLMULTI MULTICAST  MTU:1500  Metric:1
          RX packets:2428107 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4554473 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:326056859 (310.9 MiB)  TX bytes:3920580547 (3.6 GiB)

eth3      Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          UP BROADCAST RUNNING ALLMULTI SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:463770 errors:0 dropped:0 overruns:0 frame:0
          TX packets:81907 errors:0 dropped:1569 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:152456236 (145.3 MiB)  TX bytes:23226299 (22.1 MiB)

eth4      Link encap:Ethernet  HWaddr A8:XX:XX:XX:XX:20
          UP BROADCAST RUNNING ALLMULTI SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:238397 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3496514 errors:0 dropped:22270 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46868091 (44.6 MiB)  TX bytes:4748829392 (4.4 GiB)
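
To put numbers on the imbalance without reading the whole dump, the per-interface byte counters can be pulled out with a small awk filter. This is only a sketch: it assumes the BusyBox-style ifconfig layout shown above ("RX bytes:N (...)  TX bytes:N (...)" on one line), and the function name iface_bytes is made up for illustration.

```shell
# Read `ifconfig` output on stdin and print "<iface> <rx_bytes> <tx_bytes>" per interface.
iface_bytes() {
    awk '
        /^[^ \t]/ { iface = $1 }          # a non-indented line starts a new interface block
        /RX bytes:/ {
            split($2, rx, ":")            # $2 looks like "bytes:152456236"
            split($6, tx, ":")            # $6 looks like "bytes:23226299"
            print iface, rx[2], tx[2]
        }
    '
}

# Example, run on the router:
#   ifconfig | iface_bytes | grep -E '^eth[34] '
```

Against the output above, the eth3 and eth4 lines make the split obvious: 23226299 TX bytes on eth3 versus 4748829392 on eth4.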

Bottom line: flashing to Merlin has degraded my overall throughput over the LAN-to-WAN LACP links. On stock firmware, every once in a while I would hit 990+ Mbps; it rarely goes above 870 Mbps now. Also, it's primarily using one link for download and roughly 50/50 for upload, instead of both being 50/50 as before on stock firmware.
 

Attachments

  • Screen Shot 2021-08-22 at 1.40.00 AM.png
  • Screen Shot 2021-08-22 at 1.40.17 AM.png
  • Screen Shot 2021-08-22 at 1.42.44 AM.png

Have you checked the CPU and system usage from the CLI, both with and without LAGG configured on the router?
If the drivers are decent, the overhead for LAGG should be small.

Another thing to consider (Merlin is more likely to know the specifics): when comparing OEM firmware and Merlin's fork, they don't always have the same kernel bits, because he gets the closed-source components after Asus has already had time to play with them.
This might mean you are not comparing apples to apples.

Otherwise it sounds like you have done your homework and configured it correctly.

I agree that any tweak you could make to the default load-sharing algorithm would be CLI-based.
With Cisco, I believe the default is per source/destination pair. I do not know how Asus implemented it.
 
I haven't tested CPU usage without LAGG configured. But even with it configured, the CPU doesn't appear to be what's limiting the speed per se: one core barely goes over 60%. Attached is a quick screen clip of the CPU usage chart while running a speed test. It only hit 894 Mbps / 41 Mbps on this test, again using only one link even though the test was conducted using 4 servers simultaneously. The upload test was balanced about 50/50, but the download was split 100/0 between the links.

EDIT - I can confirm this through Wireshark. One link was receiving all the data during the download sequence, while the other wasn't doing much of anything but sending ACKs. Both links were fairly balanced during the upload sequence.
 

Attachments

  • Screen Shot 2021-08-23 at 5.56.12 PM.png

I ran a speed test while connected directly to the modem, bypassing the router. Wireshark showed both ports about evenly active with data. This is proof the router is the culprit.

This also proves LACP is capable of doing what I need. The router either isn't capable of it or isn't configured correctly internally.

I'm going to re-flash the firmware back to stock and factory reset. I don't know what other options I have right now.

EDIT - Can confirm Merlin firmware is causing the imbalance. After going back to stock, both links transfer data evenly. However total speed is degraded to about 700 Mbps.

Will try a factory reset and update results.
 

Attachments

  • Screen Shot 2021-08-27 at 5.58.05 PM.png

UPDATE:

I can't really explain the results but they are good.

After a re-flash of the original stock firmware I noticed both links were sending/receiving evenly, but the combined throughput of the two was only about 700 Mbps.

So I factory reset the router. Then I flashed it back to Merlin firmware and it went back to only receiving on one link. However, now every time I ran a speed test on my Mac, the Ethernet adapters would cut out mid-test. The only way to reset them was either to unplug them from the laptop or to take the interface down and bring it back up with 'ifconfig' in Terminal.

I checked the system logs while the speed test was running and found the kernel complaining about packets being too large and dropping them. Then the device would ultimately fail.

So I played around with the MTU and enabled jumbo frames on the router and the Mac. All of a sudden the adapters weren't crashing and the router was sending on both links. The speed tests also resulted in well over 1000 Mbps throughput.

I noticed the CPU usage on the router during the speed tests was pretty much nothing. This was new. I couldn't explain it.

So I went and disabled Jumbo frames on the router but left it active on the laptop. I expected problems but didn't have any. The results stayed the same, over 1000 Mbps.

I lowered the MTU back to 1500 on the laptop and the NICs started crashing again. I lowered the MTU to 1496 and the crashing stopped, and I was hitting over 1000 Mbps again.

Weird: if I leave the MTU at the default 1500, the NICs crash. But if it's anything other than 1500 it works great, even if the MTU size doesn't make any sense (like 1700).
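
One way to check whether full-size frames actually make it across the link at a given MTU is to ping with the Don't Fragment bit set and a payload sized to fill the packet exactly. The sketch below assumes plain IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header) and the macOS ping, where -D sets the DF bit; on Linux the equivalent is -M do. The helper name icmp_payload_for_mtu is hypothetical.

```shell
# ICMP payload size that yields an IPv4 packet of exactly the given MTU
# (assumes a 20-byte IPv4 header and an 8-byte ICMP header, no IP options).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Hypothetical usage on the Mac; 192.168.1.1 is the router's br0 address shown earlier:
#   ping -D -c 3 -s "$(icmp_payload_for_mtu 1500)" 192.168.1.1
#   ping -D -c 3 -s "$(icmp_payload_for_mtu 1496)" 192.168.1.1
```

If the 1496 probe gets replies and the 1500 probe does not, that would at least narrow the problem down to full-size frames rather than the bond itself.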

Bottom line is I don't know why just toggling on/off Jumbo frames on the router suddenly triggered the router to work correctly. Can anyone explain that?

Also, why do my NICs hate 1500 MTU but love any other value? That's so weird. Attached is the highest speed result I got running through the router last night.

The speed test built into the router is also reporting more correctly now. It used to max out around 650 Mbps, and I would get even higher results with Wi-Fi devices. Now it's reporting over 1000 Mbps.

Some kind of switch was flipped on the back end somewhere all of a sudden. The sudden lack of CPU usage on the router is very suspicious. Somehow the router seems to have become passive to the data.

One thing I noticed after the reset was that the Traffic Analyzer was turned back off. It's still off now. It wasn't off before, but I did toggle it then and it made no difference. I'm afraid to turn it on now.
 

Attachments

  • Screen Shot 2021-08-28 at 2.17.27 AM.png
  • Screen Shot 2021-08-28 at 12.00.25 PM.png
  • Screen Shot 2021-08-28 at 12.00.33 PM.png
  • Screen Shot 2021-08-28 at 12.00.51 PM.png

Don't mean to bump this. Doesn't seem anyone knows the issue.

But just wanted to update with another anomaly.

I just had to change the MTU size on another macOS device (an iMac) on the LAN to prevent its Ethernet adapter from freezing up during a high-speed transfer.

I think this is rather odd, since all the devices had been running fine on the network until this recent update to the router.
 
It is indeed very odd. What I noticed on my RT-AC86U is that it consumed more RAM when Traffic Analyzer was enabled. I didn't check how much it affects CPU usage. Hardware forwarding could make some difference as well. I remember that on my old router I had to enable cut-through forwarding to achieve a higher transfer rate, giving up per-client monitoring.

LAG hashing also affects how traffic is sprayed among the LAG ports. I suppose this is per-flow hashing, not per-packet hashing. The hashing algorithm is likely fixed and not configurable. If there is only a single flow, the traffic will be hashed onto one of the two LAG ports. More flows are better, since there is a higher chance of the flows being hashed more equally between the LAG members.
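
To illustrate per-flow hashing, here is a rough sketch of the layer3+4 formula as the Linux bonding documentation gives it for older drivers: ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count. Whether the RT-AX88U actually uses this formula, or hashes in switch hardware instead, is an assumption; the function names and the example addresses are made up.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1                      # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Pick the slave index for one flow using the older documented layer3+4 formula.
# usage: l34_slave SRC_IP SRC_PORT DST_IP DST_PORT SLAVE_COUNT
l34_slave() {
    sip=$(ip_to_int "$1")
    dip=$(ip_to_int "$3")
    echo $(( (($2 ^ $4) ^ ((sip ^ dip) & 0xffff)) % $5 ))
}

# Two flows between the same hosts that differ only in source port:
#   l34_slave 192.168.1.10 50000 192.168.1.20 443 2
#   l34_slave 192.168.1.10 50001 192.168.1.20 443 2
```

Every packet of a given flow yields the same index, so a single TCP connection can never use more than one link; only a mix of flows with different addresses or ports can spread across both.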
 
