LACP LAG between USG1100 and XGS3700 not working properly
I am trying to set up a LACP LAG between a USG1100 and an XGS3700 switch (three switches in a stack). I want to increase available bandwidth between the switch and the USG for traffic on one particular vlan.
After doing the configuration on both devices, the LAG shows as operational on the two ports of the XGS3700 (1/8 and 1/9) and is active on the USG1100; however, network traffic is not passing properly. This is the first time I have configured a LAG between a USG and a switch, so I may have a configuration issue, but I cannot find it.
My testing (detailed below) leads me to believe that the network traffic issue is with the USG1100 side of the LAG and not the XGS3700 side. But I could be wrong. Sorry the description is so long but I wanted to provide all of the information I had.
Below is information on my setup and what I have done to try to diagnose the issue. I have a config file and a diaginfo file for the USG1100 which I can attach to a PM.
- I am setting up a LACP LAG between a USG1100 (ports 4 and 5) and an XGS3700 (ports 1/8 and 1/9).
- Port 1/8 of the XGS3700 is physically connected to port 4 of the USG1100 and port 1/9 is physically connected to port 5 with patch cables.
- The USG1100 is running V4.35(AAPK.0)ITS-WK46-2019-11-27-191100445D firmware
- The XGS3700 is running V4.30(AAGC.2)_20200115 | 01/15/2020 firmware
- The LACP LAG is being used only for traffic on vlan 2.
- Vlan 2 is “assigned” (base port) to lag0 (ports 4 and 5) on the USG1100.
- Vlan 2 is tagged on ports 1/8 and 1/9 on the XGS3700.
- With both ports physically connected and active on both machines the LACP LAG appears to be set up properly but is not passing traffic properly. Some devices in vlan 2 work fine, others pass traffic intermittently (access internet and other vlans sometimes), and others cannot pass traffic at all.
- On the Port Statistics page of the USG1100, port 5 always shows less than 256 for Rx B/s. Many times (5 second refresh) it shows 0 Rx B/s. Port 4 shows a much higher number (usually in the thousands).
- All configuration changes made to the XGS3700 and USG1100 for testing were done using a PC on vlan 1 (the management vlan - XGS3700 and USG1100 are in vlan 1).
- For testing I started a continuous ping from a PC in vlan 2 to the XGS3700 (its IP in vlan 1) and a continuous ping from the same PC to the USG1100 (using its IP in vlan 1). With both ports up the pings were working, but some other PCs and mobile devices in vlan 2 could not reach the XGS3700, the USG1100, other vlans, or the internet.
- I "downed" (unchecked Active in port setup screen) port 1/9 (part of LAG) on the XGS3700 and pressed Apply. The pings continued to work and all other devices started to function properly (had internet and inter-vlan access). At this point only port 1/8 (connected to USG1100 port 4) was active.
- I then activated port 1/9 again and some of the devices in vlan 2 stopped having internet and inter-vlan access.
- I next downed port 1/8 leaving only port 1/9 as part of the LACP LAG. As soon as I pressed Apply the continuous pings started timing out. None of the devices on vlan 2 had internet or inter-vlan access.
A summary so far:
- With both ports in the LACP LAG active some devices on the vlan assigned to the lag work, some devices work intermittently, and some devices do not work at all.
- With only switch port 1/8 active (port 1/9 down) all devices on the vlan assigned to the lag work properly.
- With only switch port 1/9 active (port 1/8 down) NO devices on the vlan assigned to the lag work properly.
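One possible reading of this summary, offered only as an assumption: if the LAG picks a member link per flow by hashing addresses, then devices whose flows land on the bad link would fail while the rest keep working. The Python sketch below illustrates that idea with a toy hash. The hash function, the MAC addresses (taken from the documentation range) and the device names are all hypothetical, and I have not verified which distribution criteria the XGS3700 or the USG1100 actually use.

```python
# Toy illustration of per-flow link selection in a 2-port LAG.
# Everything here is hypothetical: the hash, the MACs, and the device names.

LINKS = ["1/8", "1/9"]      # switch member ports of the LAG
BROKEN_LINKS = {"1/9"}      # assumption: frames sent on this member are lost

USG_MAC = "00:00:5e:00:53:01"   # documentation-range MAC, not the real one
CLIENTS = {
    "pc-a": "00:00:5e:00:53:10",
    "pc-b": "00:00:5e:00:53:11",
    "phone-c": "00:00:5e:00:53:12",
}


def member_link(src_mac: str, dst_mac: str, n_links: int = 2) -> int:
    """Pick a member index by XOR-ing the last byte of each MAC (toy hash)."""
    src_low = int(src_mac.split(":")[-1], 16)
    dst_low = int(dst_mac.split(":")[-1], 16)
    return (src_low ^ dst_low) % n_links


def reachability(clients: dict, gateway_mac: str) -> dict:
    """Map each client to True if its flow hashes onto a working link."""
    result = {}
    for name, mac in clients.items():
        link = LINKS[member_link(mac, gateway_mac, len(LINKS))]
        result[name] = link not in BROKEN_LINKS
    return result


status = reachability(CLIENTS, USG_MAC)
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'no traffic'}")
```

If something like this is what is happening, it would also fit the intermittent devices, since flows to different destinations could hash onto different member links.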
Next I decided to check the physical patch cables between the USG1100 and the XGS3700.
- I checked the Port Status on ports 1/8 and 1/9. Neither port showed errors.
- I ran a cable test on port 1/8 and 1/9. Both cables passed.
- To be extra cautious I replaced the cable between port 1/9 on the XGS3700 and port 5 on the USG1100 with a known working cable. I still saw the same issues.
- I physically disconnected the cable from switch port 1/9. All devices on vlan 2 could pass traffic.
- I physically moved the cable from switch port 1/8 to port 1/9. A cable now connected port 4 on the USG1100 to port 1/9 on the XGS3700. All devices on vlan 2 could pass traffic.
- I disconnected the cable from port 4 of the USG1100 and tried with a cable from the USG1100 port 5 first to the XGS3700 port 1/8 and then 1/9. Neither setup worked. No devices on vlan 2 could pass traffic.
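To make the pattern in these single-link tests explicit, here is a small Python sketch that tabulates the four port pairings I tried and looks for a port that appears in every failing test and in no passing test. The port labels, type and function names are just my own illustration.

```python
from typing import NamedTuple


class LinkTest(NamedTuple):
    usg_port: str      # USG1100 port used for the single active link
    switch_port: str   # XGS3700 port used for the single active link
    vlan2_ok: bool     # did devices on vlan 2 pass traffic?


# Results of the single-link tests described above.
TESTS = [
    LinkTest("usg-4", "sw-1/8", True),
    LinkTest("usg-4", "sw-1/9", True),
    LinkTest("usg-5", "sw-1/8", False),
    LinkTest("usg-5", "sw-1/9", False),
]


def suspect_ports(tests):
    """Return ports present in every failing test and in no passing test."""
    failing = [{t.usg_port, t.switch_port} for t in tests if not t.vlan2_ok]
    passing = [{t.usg_port, t.switch_port} for t in tests if t.vlan2_ok]
    if not failing:
        return set()
    in_all_failures = set.intersection(*failing)
    seen_working = set().union(*passing)
    return in_all_failures - seen_working


print(suspect_ports(TESTS))
```

Both switch ports and both cables show up in at least one working test, so the only port left as a suspect is port 5 of the USG1100.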
It appeared that the issue was with port 5 of the USG1100. I double checked the configuration but did not see any issues.
Next I decided to use the Packet Capture capability of the USG1100 to gather data for reporting the issue.
- I started with only port 1/9 of the LACP LAG on the XGS3700 active (port 1/8 down) and its patch cable physically connected to port 5 of the USG1100.
- The continuous pings from a PC on vlan 2 to the XGS3700 and the USG1100 were timing out.
- On the USG1100 I went to the Maintenance tab, Diagnostics, Packet Capture, Capture and selected interface 5. I then pressed "Capture".
- As soon as I started the Packet Capture the continuous pings from the PC in vlan 2 to the XGS3700 and the USG1100 started working.
- As soon as I stopped the Packet Capture the continuous pings from the PC in vlan 2 to the XGS3700 and the USG1100 started timing out again.
- This behavior of the pings working only when the packet capture was running is repeatable.
I'm not sure where to go from here. It appears the issue is with the USG1100. I may have a configuration issue, but I don't see it. Please let me know what additional information you would like me to capture and send to you.
For now, I have the LACP LAG up but only port 1/8 on the XGS3700 connected to port 4 on the USG1100. This is the only way I can reliably have traffic flowing in vlan 2.
Thanks for your help.