Problem with SFP+ line card in VSX stack

7. RE: Problem with SFP+ line card in VSX stack

DP-7f7ca3

MVP EXPERT

Posted Sep 11, 2025 06:09 PM

Hi! It's pretty strange that a VMware VSS (no LACP) with just two active-active standalone SFP+ uplinks (again, no LACP involved at all, due to VSS limitations) to VSX members would trigger this issue. We have two VSX pairs (8320-based and 8360-based respectively) running in production with, if I understood the scenario you describe as potentially problematic, a similar setup towards various ESXi hosts: each ESXi host has several VSSes, and each VSS has many Port Groups for VLAN tagging, matching the tagging on the VSX downlinks. We have never experienced a failure in years (from AOS-CX 10.1, yes, on the 8320 VSX starting in 2018, up to AOS-CX 10.13). Hundreds of VMs running concurrently, terabytes upon terabytes of traffic passed over those links. Never a single glitch.

It would be interesting to know the VMware VSS settings and the exact corresponding configuration of the interfaces involved on each VSX member.

Original Message:
Sent: 9/11/2025 4:15:00 PM
From: muhittin
Subject: RE: Problem with SFP+ line card in VSX stack

Hello,

Thank you for sharing more details.

What are the possible triggers when LACP is not present?

1- Since the ESXi Standard vSwitch (VSS) does not support LACP, two 10G uplinks running simultaneously to two different 6400 switches (the VSX pair) do not form a single "logical connection" (LAG) on the physical side.

2- The vSwitch pins each VM's traffic to one uplink; however, when the same VLAN exits through two different chassis, physical switches may see the same MAC on two different ports and experience MAC flapping.

3- The MAC table is not shared between chassis in VSX; therefore, the effect of "the same host's MAC rapidly switching between two chassis" becomes more pronounced.

4- This abnormal flow is a known trigger class that can cause switchd_agent2 to assert and crash with SIGABRT (signal 6) in some AOS-CX versions (depending on the exact build).
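
To make the MAC flapping described in points 2 and 3 concrete, here is a minimal Python sketch that counts how often a MAC/VLAN pair moves between ports. The event tuples, MAC addresses and port names are invented for illustration; this does not parse real AOS-CX output, it only shows the idea of "same MAC seen on a different port than last time".

```python
from collections import defaultdict

def count_mac_moves(events):
    """Count port changes per (mac, vlan).

    events: iterable of (timestamp, mac, vlan, port) learn events,
    assumed to be sorted by timestamp (hypothetical input format).
    """
    last_port = {}
    moves = defaultdict(int)
    for _ts, mac, vlan, port in events:
        key = (mac, vlan)
        # A move is a learn event on a different port than the previous one.
        if key in last_port and last_port[key] != port:
            moves[key] += 1
        last_port[key] = port
    return dict(moves)

# Hypothetical learn events: one VM MAC bouncing between two chassis ports,
# and a second MAC that stays put.
events = [
    (0.0, "00:50:56:aa:bb:01", 10, "1/1/1"),
    (0.5, "00:50:56:aa:bb:01", 10, "2/1/1"),
    (1.0, "00:50:56:aa:bb:01", 10, "1/1/1"),
    (1.2, "00:50:56:aa:bb:02", 10, "1/1/1"),
]
flaps = count_mac_moves(events)
print(flaps)
```

A MAC that never changes port does not appear in the result, so anything listed here is a candidate for the churn discussed above.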

So why does the problem persistently occur on the Primary chassis?

1 - In the VSX architecture, certain control-plane and statistics tasks may be more intensive on the primary; heavy MAC churn combined with the enabled features can more easily lock up the agent on the primary line card.

You mentioned that you are not using LACP. If the "correct design" is not implemented, you may encounter the following:

1- A VSS connected to two different chassis with the same VLANs in active-active mode can spread BUM (broadcast, unknown unicast, multicast) packets along both paths, even if it does not create a loop at L2. This produces STP TCNs and bursts of flooding.

2- Even if the iSCSI vSwitch is separate, when coming out of maintenance, iSCSI sessions and LAN ARPs can explode simultaneously and overload the line card.

On the CX side, the same crash signature can persist for years across multiple minor branches and appear under specific triggers (VSX + MAC churn); the fix may be in a different version or another build of 10.15.

What can be done for a healthy design without LACP?

1- On the VMware side, do not use two active-active 10G uplinks to two chassis at once. For each Port Group:

- Active uplink: Only one 10G (going to Primary)

- Standby uplink: The other 10G (going to Secondary).

- Leave the 1G ports as standby.

- You can balance the load by reversing this order on other ESXi hosts.

- This setting prevents the same VLAN from being active on two chassis at the same time, significantly reducing MAC flapping and flooding.
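
The active/standby layout above, including reversing the order on other hosts, can be sketched in Python as follows. This is only a planning aid under stated assumptions: the host names and uplink names (`vmnic-10g-a` and so on) are hypothetical placeholders, and nothing here talks to vCenter or ESXi.

```python
def teaming_plan(hosts,
                 uplink_to_primary="vmnic-10g-a",
                 uplink_to_secondary="vmnic-10g-b",
                 standby_1g=("vmnic-1g-a", "vmnic-1g-b")):
    """Return a per-host failover order with exactly one active 10G uplink.

    Even-indexed hosts use the uplink towards the Primary chassis as active,
    odd-indexed hosts use the one towards the Secondary, to balance load.
    All uplink names are illustrative placeholders.
    """
    plan = {}
    for index, host in enumerate(hosts):
        if index % 2 == 0:
            active, standby_10g = uplink_to_primary, uplink_to_secondary
        else:
            active, standby_10g = uplink_to_secondary, uplink_to_primary
        plan[host] = {
            "active": [active],
            # The other 10G link comes first in standby, then the 1G ports.
            "standby": [standby_10g, *standby_1g],
        }
    return plan

plan = teaming_plan(["esx01", "esx02", "esx03"])
for host, order in plan.items():
    print(host, order)
```

Alternating by host index is just one simple way to spread the load; any scheme works as long as each Port Group has a single active uplink at a time.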

Of course, we recommend using LACP if possible.

There was also the iSCSI issue:

1- A separate vSwitch with port binding is correct; pin each vmkernel port to a single physical NIC (the other remains standby).

2- Keep iSCSI uplinks on separate VLANs/Subnets and, if possible, separate physical paths; LACP is not required.

I hope TAC resolves the case soon. Please share the solution results with us.

-------------------------------------------

Original Message:
Sent: Sep 11, 2025 01:11 PM
From: Neonium
Subject: Problem with SFP+ line card in VSX stack

I had actually ruled out a software bug, but your explanation sounds very plausible. However, I have seen the error on several different versions. If it were a bug, it would have to have been in the firmware for a very long time, since the error also occurred on a 10.15.x version.

But to add some more information: we do not use LACP in VMware. We add two 10G ports, plus two 1G ports as standby, to the vSwitch. In a separate vSwitch, we have two 10G ports for iSCSI. The first vSwitch is for the LAN VLANs.

I opened a case earlier, but sometimes there are helpful tips in the community.

8. RE: Problem with SFP+ line card in VSX stack

Neonium

Posted Sep 12, 2025 01:46 AM

Which firmware version are you using? I already wrote something about setting up the switches in another post.

I can't configure much in the VSS.
Security
Promiscuous mode -> Reject
MAC address changes -> Accept
Forged transmits -> Accept

Traffic shaping is disabled

Teaming and failover
Load balancing -> Route based on originating virtual port
Network failure detection -> Link status only
Notify switches -> Yes
Failback -> Yes

What I don't think I mentioned is that the two fiber optic cards each carry one iSCSI link and one LAN link. Each network card is patched to a switch. This has always allowed us to ensure that there can be no failure. With the 1G links as standby, it was actually impossible for a failure to occur.

-------------------------------------------

Original Message:
Sent: Sep 11, 2025 06:09 PM
From: parnassus
Subject: Problem with SFP+ line card in VSX stack