Problem with SFP+ line card in VSX stack

  • 1.  Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 09:10 AM
    Edited by Neonium Sep 11, 2025 02:56 PM

    Hello,

    I have a problem with a 6400 VSX stack that is supposed to replace our existing server switches.
    We have 2 copper modules + 1 SFP+ module in there. That works fine so far.
    Now we have put the first two ESX servers on it. As soon as we take an ESX server out of maintenance mode, we get an error on the SFP+ line card of the primary switch.

    Event|1201|LOG_CRIT|LC|1/5|switchd_agent2 crashed due to signal:6
    Event|3207|LOG_ERR|AMM|1/1|Line module 1/5 has failed: Fatal agent crash
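
    For anyone who wants to filter these events out of a larger log, here is a minimal Python sketch that splits the pipe-delimited lines shown above. The field names (event_id, severity, source, location, message) are my own labels for the six columns, not official AOS-CX terminology.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_id: int
    severity: str
    source: str      # assumed label for the 4th column (e.g. "LC", "AMM")
    location: str    # assumed label for the 5th column (e.g. "1/5")
    message: str


def parse_event(line: str) -> Event:
    """Split one 'Event|id|severity|source|location|message' line."""
    # Split on the first five pipes only, so the message may contain '|'.
    parts = line.strip().split("|", 5)
    if len(parts) != 6 or parts[0] != "Event":
        raise ValueError(f"not an event line: {line!r}")
    _, event_id, severity, source, location, message = parts
    return Event(int(event_id), severity, source, location, message)


LINES = [
    "Event|1201|LOG_CRIT|LC|1/5|switchd_agent2 crashed due to signal:6",
    "Event|3207|LOG_ERR|AMM|1/1|Line module 1/5 has failed: Fatal agent crash",
]

events = [parse_event(line) for line in LINES]
critical = [e for e in events if e.severity == "LOG_CRIT"]

for e in events:
    print(e.event_id, e.severity, e.location, e.message)
```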

    The secondary switch is working fine.

    We have already done the following

        Firmware update to the latest version + additional updates
        Installed another line card with the existing GBICs -> same error
        Installed the first line card in another slot with the existing GBICs -> same error
        Other GBICs in the new line card -> same error
        Swapped the connections -> same error (error on the primary)
        Installed a new chassis with new modules -> same error
        Connected another ESX server (identical design) -> same error

    Does anyone have any idea what this could be?



  • 2.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 11:10 AM
    Edited by parnassus Sep 11, 2025 11:26 AM

    Hello, when you write "Firmware update to the latest version + additional updates", what do you mean exactly? Which AOS-CX software build (AOS-CX 10.nn.mmmm) have you updated your VSX to? Are you using an LSR (AOS-CX 10.13 or AOS-CX 10.16) or an SSR (AOS-CX 10.14 or AOS-CX 10.15)? Which one is your VSX running, and which exact build (mmmm)? Thank you for clarifying these details.

    P.S. It could be a software bug or, more precisely, a regression (if you're running your VSX on the very latest LSR or SSR software builds).

    -------------------------------------------



  • 3.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 12:48 PM

    Ah, okay. I didn't know that until recently either. You can use the command "show needed-updates" to display additional updates. These can be explicitly allowed with the command "allow-non-failsafe-updates xx". 

    I currently have 10.16.1006 running in the VSX. Before that, I had several 10.15.x versions.

    If it is a software error, why is only the switch with the primary role affected?

    -------------------------------------------



  • 4.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 11:45 AM

    Hello,

    The two log lines don't actually say "hardware failure"; the software process (switchd_agent2) on the SFP+ line card crashes with SIGABRT (signal:6) and the module reboots due to a "Fatal agent crash." So when triggering traffic/frames arrive, the line card's software crashes. The fact that you see the same error with different line cards/slots/GBICs/chassis and even different ESX strongly indicates that it is a software bug triggered by a specific protocol/configuration combination, not a hardware or GBIC issue. The fact that the secondary switch remains unaffected also suggests that the trigger hits a code path (e.g., VSX/MC-LAG, AGW, LACP sync, etc.) running in the VSX primary role.
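
    As a small illustration of reading the signal number out of that log message, here is a Python sketch. It assumes POSIX signal numbering (where SIGABRT is 6, as on Linux); the function name is made up for this example.

```python
import re
import signal


def crashed_with_abort(message: str) -> bool:
    """Return True if a 'signal:<n>' log message refers to SIGABRT."""
    match = re.search(r"signal:(\d+)", message)
    if match is None:
        raise ValueError(f"no signal number in: {message!r}")
    # Assumption: the switch reports POSIX signal numbers, like this host.
    return int(match.group(1)) == signal.SIGABRT


message = "switchd_agent2 crashed due to signal:6"
is_abort = crashed_with_abort(message)
print("SIGABRT" if is_abort else "other signal")
```

    SIGABRT is what a process raises on itself via abort(), typically after a failed assertion, which is why it points at software rather than at the optics.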

    Control frames such as LLDP/DCBx/ETS/PFC/LACP
    When ESXi comes out of maintenance, the NICs start sending LACPDUs and LLDP frames (and, with some Broadcom/Intel drivers, even DCBx/ETS/PFC TLVs). In certain versions of AOS-CX, specific combinations of these TLVs have been known to trigger defects that crash the line-card agent. Your symptoms (only the SFP+ card, only the primary, crash when ESX comes up) align very well with this.

    VSX MC-LAG + config interaction
    If LACP is enabled on the ESX side, ports are trunked, and there are multiple VLANs/ACLs/QoS/sFlows simultaneously, an assertion may be triggered during programming on the primary, causing the agent to crash. The secondary may not be affected.

    Disable LLDP/DCBx (temporary test):
    Disable LLDP reception/transmission on switch ports (set no lldp receive and/or no lldp transmit on the interface in CX). Set LLDP for vDS to "Listen" or "Disable" in ESXi.
    If the crash stops: LLDP/DCBx/ETS/PFC TLVs are triggering it.

    Simplify LACP:
    Test the ESX uplink with a single link or static LAG (no lacp); also try lacp rate slow.
    If it stops: Bug in LACP/MC-LAG synchronization path.


    Simplify the port:
    Leave only 1-2 necessary VLANs on the ESX trunk; temporarily remove ACL, QoS, sFlow, mirror, 802.1X/port-access, PFC/ETS, etc. from the port and test.
    If it stops: one of these features (especially large ACL/QoS profiles) is triggering it.

    Test by connecting to the secondary:
    Temporarily connect the ESX to the secondary alone (or without MC-LAG) and observe.
    If it only happens on the primary: VSX role-specific flow (AGW/ICCP/LACP sync) is suspect.
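
    The four isolation tests above boil down to a small decision table. Here is a rough Python sketch of it; the test keys are hypothetical names I picked, and the suspects are just the ones listed above.

```python
# Each entry: (test key, suspected trigger if the crash stops under that test).
# The keys are illustrative names, not switch commands.
ISOLATION_TESTS = [
    ("disable_lldp_dcbx", "LLDP/DCBx/ETS/PFC TLVs"),
    ("simplify_lacp", "LACP/MC-LAG synchronization path"),
    ("simplify_port", "a port feature such as ACL/QoS/sFlow"),
    ("secondary_only", "VSX role-specific flow on the primary"),
]


def suspects(crash_stopped: dict[str, bool]) -> list[str]:
    """List the suspected triggers for every test under which the crash stopped."""
    return [
        suspect
        for test, suspect in ISOLATION_TESTS
        if crash_stopped.get(test, False)
    ]


# Example: only moving the host to the secondary made the crash go away.
results = {
    "disable_lldp_dcbx": False,
    "simplify_lacp": False,
    "simplify_port": False,
    "secondary_only": True,
}
print(suspects(results))
```

    Running one test at a time keeps the table meaningful; if two changes are made together, you cannot tell which one stopped the crash.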

    If you can share the following outputs, it may be possible to provide further commentary:

    show logging last 300 / show event recent
    show tech-support 
    show interface 1/1 and 1/5 
    show lacp aggregator
    show vsx status / show vsx brief

    It would be best to open a case with TAC regarding this issue.

    -------------------------------------------



  • 5.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 01:11 PM

    I had actually ruled out a software bug, but your explanation sounds very plausible. However, I have seen the error on several versions. If it is a bug, it would have to have been in the firmware for a very long time, since the error already occurred on a 10.15.x version.

    To add some more information: we do not use LACP in VMware. We add 2 10G ports, plus 2 1G ports as standby, to the vSwitch. In a separate vSwitch, we have 2 10G ports for iSCSI. The first vSwitch is for the LAN VLANs.

    I opened a case earlier, but sometimes there are helpful tips in the community.

    -------------------------------------------



  • 6.  RE: Problem with SFP+ line card in VSX stack
    Best Answer

    Posted Sep 11, 2025 04:15 PM

    Hello,

    Thank you for sharing more details.
     
    What are the possible triggers when LACP is not present?
     
    1- Since ESXi Standard vSwitch (VSS) does not support LACP, if you run two 10G uplinks simultaneously on two different 6400 (VSX pair) switches, a single "logical connection" does not form on the physical side.
    2- The vSwitch pins each VM's traffic to one uplink; however, when the same VLAN exits through two different chassis, physical switches may see the same MAC on two different ports and experience MAC flapping.
    3- The MAC table is not shared between chassis in VSX; therefore, the effect of "the same host's MAC rapidly switching between two chassis" becomes more pronounced.
    4- This abnormal flow is a known trigger class that can cause switchd_agent2 to assert and crash with SIGABRT (signal 6) in some AOS-CX versions (depending on the exact build).
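
    To make "MAC flapping" concrete, here is a minimal Python sketch of how a learning table registers a move each time the same MAC shows up on a different port. The MAC address and the port names are invented for the illustration and do not come from the original poster's setup.

```python
from collections import defaultdict


def count_mac_moves(observations: list[tuple[str, str]]) -> dict[str, int]:
    """Count how often each MAC is re-learned on a different port."""
    last_port: dict[str, str] = {}
    moves: defaultdict[str, int] = defaultdict(int)
    for mac, port in observations:
        previous = last_port.get(mac)
        if previous is not None and previous != port:
            moves[mac] += 1          # same MAC, new port: one MAC move
        last_port[mac] = port
    return dict(moves)


# Hypothetical sequence: one VM MAC seen alternately on a host-facing port
# and on the inter-switch link.
VM_MAC = "00:50:56:aa:bb:cc"
observations = [
    (VM_MAC, "1/5/1"),
    (VM_MAC, "isl"),
    (VM_MAC, "1/5/1"),
    (VM_MAC, "isl"),
]

moves = count_mac_moves(observations)
print(moves)
```

    A steadily growing move count for the same MAC is the kind of churn described in points 2 and 3.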
     
    So why does the problem persistently occur on the Primary chassis?
     
    1 - In the VSX architecture, certain control-plane/statistical tasks may be more intensive on the primary; the combination of heavy MAC churn + features can more easily lock up the agent on the primary line card.
     
    You mentioned that you are not using LACP. If the "correct design" is not implemented, you may encounter the following:
     
    1- A VSS connected to two different chassis with the same VLANs in active-active mode can spread BUM (broadcast/unknown unicast/multicast) packets along both paths, even if it does not create a loop at L2. This can produce STP TCNs and bursts of flooding.
    2- Even if the iSCSI vSwitch is separate, when coming out of maintenance, iSCSI sessions and LAN ARPs can explode simultaneously and overload the line card.
     
    On the CX side, the same crash signature can persist for years across multiple minor branches and appear under specific triggers (VSX + MAC churn); the fix may be in a different version or another build of 10.15.
     
    What can be done for a healthy design without LACP?
     
    1- On the VMware side, do not use two active-active 10G uplinks to two chassis at once. For each Port Group:
     - Active uplink: Only one 10G (going to Primary)
     - Standby uplink: The other 10G (going to Secondary).
     - Leave the 1G ports as standby.
     - You can balance the load by reversing this order on other ESXi hosts.
     - This setting prevents the same VLAN from being active on two chassis at the same time, significantly reducing MAC flapping and flooding.
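
    The alternating active/standby order can be sketched like this in Python. The uplink labels and host names are placeholders for the illustration; in practice you would map them to the real vmnic names of each host.

```python
def teaming_for_host(host_index: int) -> dict[str, list[str]]:
    """Return a failover order with exactly one active 10G uplink.

    Even-numbered hosts use the 10G link to the primary as active,
    odd-numbered hosts use the 10G link to the secondary instead.
    """
    ten_gig = ["10G-to-primary", "10G-to-secondary"]   # placeholder labels
    if host_index % 2 == 1:
        ten_gig.reverse()
    return {
        "active": [ten_gig[0]],
        "standby": [ten_gig[1], "1G-a", "1G-b"],       # 1G ports stay standby
    }


# Hypothetical host names esx01..esx04.
plan = {f"esx{i + 1:02d}": teaming_for_host(i) for i in range(4)}

for host, order in plan.items():
    print(host, order)
```

    Each host still has only one active uplink per Port Group, but across the hosts the load is spread over both chassis.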
     
    Of course, we recommend using LACP if possible.
     
    There was also the iSCSI issue;
     
    1- Separate vSwitch and port binding is correct; pin each vmkernel to a single physical NIC (the other remains standby).
    2- Keep iSCSI uplinks on separate VLANs/Subnets and, if possible, separate physical paths; LACP is not required.
     
    I hope TAC resolves the case soon. Please share the solution results with us.
    -------------------------------------------



  • 7.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 11, 2025 06:09 PM
    Hi! It's pretty strange that a VMware VSS (no LACP) with just two active-active standalone SFP+ uplinks to the VSX members (again, absolutely no LACP involved, due to VSS limitations) would generate this issue.

    We have two VSX pairs (8320-based and 8360-based, respectively) running in production with, if I understood the scenario you described as potentially problematic, a similar setup towards various ESXi hosts. Each ESXi host owns many VSSes, and each VSS owns many Port Groups for VLAN tagging, matching the tagging on the VSX downlinks. We have never experienced a failure in years, from AOS-CX 10.1 (yes, on the 8320 VSX, starting in 2018) up to AOS-CX 10.13. Hundreds of VMs running concurrently, terabytes and terabytes of traffic passed over those links, and never a single glitch.

    It would be interesting to know the VMware VSS settings and the exact corresponding configuration of the involved interfaces on each VSX member.








  • 8.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 12, 2025 01:46 AM

    Which firmware version are you using? I already wrote something about setting up the switches in another post. 

    I can't configure much in the VSS.
    Security
    Promiscuous mode -> Reject
    MAC address changes -> Accept
    Forged transmits -> Accept

    Traffic shaping is disabled
    Teaming and failover
    Load balancing -> Route based on originating virtual port
    Network failure detection -> Link status only
    Notify switches -> Yes
    Failback -> Yes

    What I don't think I mentioned is that each of the two fiber optic cards carries one iSCSI link and one LAN link. Each network card is patched to one switch. This has always protected us against an outage. With the 1G links as standby, an outage was practically impossible.

    -------------------------------------------



  • 9.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 12, 2025 05:16 AM

    I don't know if these will be useful to you, but below are my ESXi-side settings (VMware VSS "vSwitch1", which is the one carrying the VMs):

    Properties
    Standard switch              vSwitch1
    MTU                          9000

    Security
    Promiscuous mode             Reject
    MAC address changes          Reject
    Forged transmits             Reject

    Traffic shaping
    Average bandwidth            --
    Peak bandwidth               --
    Burst size                   --

    Teaming and failover
    Load balancing               Route based on originating virtual port
    Network failure detection    Link status only
    Notify switches              Yes
    Failback                     Yes
    Active adapters              vmnic2, vmnic5
    Standby adapters             --
    Unused adapters              --


    Port Group <redacted> settings (VLAN ID 2030 in this specific case):

    General
    Network label                <redacted>
    VLAN ID                      2030

    Security
    Promiscuous mode             Reject
    MAC address changes          Reject
    Forged transmits             Reject

    Traffic shaping
    Average bandwidth            --
    Peak bandwidth               --
    Burst size                   --

    Teaming and failover
    Load balancing               Route based on originating virtual port
    Network failure detection    Link status only
    Notify switches              Yes
    Failback                     Yes
    Active adapters              vmnic2, vmnic5
    Standby adapters             --
    Unused adapters              --

    Physical adapter settings (vmnic2):

    Properties
    Adapter                      Broadcom NetXtreme E-Series Advanced Dual-port 10Gb SFP+ Ethernet OCP 3.0 Adapter
    Name                         vmnic2
    Location                     PCI 0000:32:00.0
    Driver                       bnxtnet

    Status
    Status                       Connected
    Actual speed, Duplex         10 Gbit/s, Full Duplex
    Configured speed, Duplex     Auto negotiate
    Networks                     <redacted>

    SR-IOV
    Status                       Not supported

    Cisco Discovery Protocol
    Cisco Discovery Protocol is not available on this physical network adapter

    Link Layer Discovery Protocol
    Link Layer Discovery Protocol is not available on this physical network adapter

    Physical adapter setting (vmnic5), as above but with these differences:

    Properties

    Adapter                     Intel(R) Ethernet Controller X710 for 10GbE SFP+
    Name                        vmnic5
    Location                    PCI 0000:17:00.1
    Driver                      i40en

    These are the corresponding interface settings (1/1/28 on VSX Primary for vmnic2 and 1/1/28 on VSX Secondary for vmnic5):

    interface 1/1/28
        description DELL-R750-<redacted>-esxi03-s1p1-vmnic2-vSwitch1-A03
        no shutdown
        mtu 9198
        no routing
        vlan trunk native <redacted> tag
        vlan trunk allowed <redacted>
        spanning-tree bpdu-guard
        spanning-tree port-type admin-edge
        spanning-tree tcn-guard
        loop-protect
        loop-protect vlan <redacted>

    interface 1/1/28
        description DELL-R750-<redacted>-esxi03-s2p2-vmnic5-vSwitch1-A04
        no shutdown
        mtu 9198
        no routing
        vlan trunk native <redacted> tag
        vlan trunk allowed <redacted>
        spanning-tree bpdu-guard
        spanning-tree port-type admin-edge
        spanning-tree tcn-guard
        loop-protect
        loop-protect vlan <redacted>
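
    A quick way to confirm that the two member ports really match apart from the description is to diff the stanzas. The Python sketch below is my own helper, not an AOS-CX tool; it uses a shortened copy of the configuration above, and the description strings are stand-ins.

```python
def normalized(stanza: str) -> set[str]:
    """Return the stanza's settings, ignoring blank lines and the description."""
    lines = (line.strip() for line in stanza.strip().splitlines())
    return {line for line in lines if line and not line.startswith("description ")}


def config_diff(a: str, b: str) -> list[str]:
    """Settings present on one port but not on the other."""
    return sorted(normalized(a) ^ normalized(b))


PRIMARY_PORT = """
interface 1/1/28
    description esxi03-vmnic2
    no shutdown
    mtu 9198
    no routing
    spanning-tree bpdu-guard
    spanning-tree port-type admin-edge
    spanning-tree tcn-guard
    loop-protect
"""

# Same settings, different description (stand-in text).
SECONDARY_PORT = PRIMARY_PORT.replace("esxi03-vmnic2", "esxi03-vmnic5")

# Deliberately broken variant to show what a mismatch looks like.
MISMATCHED_PORT = SECONDARY_PORT.replace("mtu 9198", "mtu 9000")

same = config_diff(PRIMARY_PORT, SECONDARY_PORT)
different = config_diff(PRIMARY_PORT, MISMATCHED_PORT)
print(same)
print(different)
```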

    Cheers, Davide.

    -------------------------------------------



  • 10.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 12, 2025 01:30 AM

    Thanks for your explanation. That sounds very logical. Previously, we had an IRF stack with a 5700 for the fiber optic connections. For the copper connections, we had three 3500yl switches: one for iLO and devices with only one network card, and the other two for devices with two or more network cards.

    We have configured our ESX servers so that we have the following equipment in the servers:

    2x 2-port SFP+ fiber optic cards
    1x 4-port copper card

    There are two exceptions where there are more fiber optic cards, but we need these for network monitoring.

    Under VMware, we have the 2 vSwitches as described. Two VMkernel ports for iSCSI are set up on the iSCSI vSwitch. iSCSI already runs on a dedicated VLAN, but not on a separate one for each connection.

    -------------------------------------------



  • 11.  RE: Problem with SFP+ line card in VSX stack

    Posted Sep 17, 2025 06:57 AM

    Hi! any update? just curious.

    -------------------------------------------


