HP 5412 10GbE Module Issues / Troubleshooting Tricks

  • 1.  HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 20, 2016 04:43 PM

    We recently upgraded the firmware on our 5412Rzl2 (J9851A) from KB_15_17_0007 to KB_16_02_0013.  All operations are normal except for some issues with our single 8-Port 10-GbE SFP+ v2 zl module (J9538A), as follows:

    • Occasional 1-2 second disconnect/reconnects in the switch log from the 10GbE ports
    • DRASTIC, but occasional, slowdown of speeds, which seem to self-correct eventually
    • High numbers of Discard Rxs
    • High numbers of Drop Txs (but not nearly as high as Discard Rxs)

    We are looking for any additional troubleshooting commands or techniques (other than show log and show interface <PORT>) which might yield insight into the issue.

    Right now we're unsure if the issue is:

    • The new firmware (both primary and secondary were updated but we can still rollback one of them)
    • Failing hardware (cables, module, ports, NICs)
    • Drivers (NICs)
    • Hosts

    More Info:

    • All ports in this module are in an unroutable VLAN so there shouldn't be any communication in/out
    • Hosts have Intel X710s (rev 01) NICs (recent but not the latest firmware)
    • I cannot verify if these ports had abnormal Discard or Drop counts before the firmware upgrade to compare to

    Please let me know if I can provide any other details.  Thank you!



  • 2.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 20, 2016 05:52 PM

    I'd suggest logging a case and getting them to escalate to level 2/3; you may find they already know about this and a fix is slated for release. I've seen a few problems on Kx.16, and have already escalated one issue with a PoE software bug, which turned out to be a known bug with an unknown fix release date.

    I wouldn't use Kx.16 in production yet, maybe test it in a lab for now until those early release cycle problems are resolved.

    A useful troubleshooting tool is "show tech", and also have a look at "show instrumentation"

    switch# show instrumentation ?

    • cam       Show internal version-dependent counters for debugging.
    • monitor       Show latest values for monitored parameters.
    • port       Show internal version-dependent counters for debugging the specified port.
    • resptime       Show service response time data for performance sensitive operations registered for response time measurement.
    • routing       Show routing related instrumentation parameters.
    • vlan       Show internal version-dependent counters for debugging the specified VLAN.

    There is also a debug mode I've used in the past, that goes really deep into the "tech support" areas, but it gets quite complicated and probably is deeper than most customers would want to go.

    http://networkgeekstuff.com/networking/procurve-and-hidden-command-line/

    Search for term: edomtset

     

     



  • 3.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 05:08 AM

    I'm curious to know whether (and how) all eight 10GbE ports of the HP 8-port 10-GbE SFP+ v2 zl Module are used concurrently; port oversubscription may or may not play a role in the issue. First of all, start by collecting the status of each transceiver used on those ports. What does the command

    show interfaces transceiver n detail (where n is the port number)

    report?

    If nothing but the firmware changed, then the new firmware is the first suspect one thinks of. But to diagnose that without being biased by the "bad new firmware versus good old firmware" idea (that is, without overlooking other possible causes), you should be reasonably sure that nothing else changed in your environment before you did the firmware upgrade.



  • 4.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 08:39 AM

    Thanks for the replies!  I dug through those commands and didn't find anything particularly useful, although perhaps much of it is past my understanding.  I did check the transceiver statuses, but I'm not sure what to look for.

    The firmware was the only thing which changed, unless you count the loss of network connectivity for the devices.  The devices in question are some VMware hosts and a few NAS devices.  We're going to try a full reboot of everything once we can afford downtime.

    Can anyone explain the differences between firmware versions? (Major vs minor vs incremental) <MAJOR>.<MINOR>.<INCREMENTAL>

    Perhaps an explanation of the three parts of the version number will help me decide which firmware to choose when upgrading...
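For comparing releases while planning an upgrade or rollback, the version strings in this thread can be split into their numeric parts. This is a minimal sketch: the field names follow the poster's own `<MAJOR>.<MINOR>.<INCREMENTAL>` labels, and `FirmwareVersion` and `parse_version` are hypothetical helpers, not anything shipped by HP. What each field actually signifies is defined in HP's release process document, not here.

```python
import re
from typing import NamedTuple


class FirmwareVersion(NamedTuple):
    """Hypothetical container; field names mirror the poster's labels."""
    platform: str      # letter prefix, e.g. "KB"
    major: int
    minor: int
    incremental: int


# Accepts both separators seen in the thread: KB_15_17_0007 and KB.16.02.0013
_VERSION_RE = re.compile(r"^([A-Za-z]+)[._](\d+)[._](\d+)[._](\d+)$")


def parse_version(text: str) -> FirmwareVersion:
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    platform, major, minor, incremental = match.groups()
    return FirmwareVersion(platform.upper(), int(major), int(minor), int(incremental))


old = parse_version("KB_15_17_0007")
new = parse_version("KB.16.02.0013")
print(old)
print(new > old)  # tuple comparison: platform first, then the numeric fields
```

Because the result is a tuple, a list of candidate releases can be ordered with `sorted()` before deciding which one to try next.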

     



  • 5.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 08:47 AM

    Mmm...the best document I've read is the HP ProVision Software Release Process (2015): it should explain exactly what you're looking for...

    Would you mind posting the output of the command above for each of your 10Gb transceiver interfaces (after trimming serial numbers and any other sensitive information about your products/configurations)?



  • 6.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 08:53 AM

    Excellent PDF, thank you!  The output of the show interface transceiver n detail command is identical for all 8 ports, except for the incrementing Interface Index and the Serial Number:

    Transceiver in L1
    Interface Index : 353 (varies)
    Type : SFP+DA7
    Model : J9285B
    Connector Type : Vendor specific
    Wavelength : n/a
    Transfer Distance : 7m (copper),
    Diagnostic Support : None
    Serial Number : <VARIES>



  • 7.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 09:35 AM

    Are those DAC Cables installed correctly (respecting the minimum bend radius, not below 1")?



  • 8.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 10:47 AM

    @parnassus wrote:

    Are those DAC Cables installed correctly (respecting the minimum bend radius, not below 1")?


    It appears that they are all installed with at least this minimum.



  • 9.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 02:37 PM

    We are in the process of root causing an issue for that specific module (J9538A) on ports 4, 5, and 6.  In the meantime there is a configuration option you can disable that should alleviate the symptoms:

    HP-Switch-5406Rzl2(config)# no tcp-push-preserve

    There is a low-level issue causing head-of-line blocking on those ports in the presence of large amounts of TCP traffic with the push bit set.

     

    HP-Switch-5406Rzl2(config)# tcp-push-preserve help
    Usage: [no] tcp-push-preserve

    Description: Enable TCP Push Preserve mode. This mode determines the
    flow of the TCP packets that have the PUSH flag set. When
    this mode is enabled and the egress queue is full, TCP
    packets with the PUSH flag set are queued at the head of the
    ingress queue for egress queue space. This might delay
    subsequent incoming packets in the same queue. When this
    mode is disabled and the egress queue is full, TCP packets
    with the PUSH flag set are dropped from the head of the
    ingress queue.

    By default, this mode is enabled. Disable this mode when a
    large number of TCP packets with the PUSH flag are being
    dropped due to congestion.
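The help text above describes two behaviours for a PUSH-flagged packet at the head of the ingress queue when the egress queue is full: hold it (mode enabled) or drop it (mode disabled). The toy model below is only a sketch of that description, not of the switch's real queueing. It assumes a fixed batch of packets, a single ingress queue, an egress queue that has room for one packet every `period` ticks, and that non-PUSH packets are dropped when there is no room. The function name `simulate` and all of its parameters are made up for illustration.

```python
from collections import deque


def simulate(packets, period, preserve):
    """Toy model of one ingress queue feeding a congested egress queue.

    packets  -- iterable of bools, True meaning the PUSH flag is set
    period   -- egress has room for one packet every `period` ticks (assumed)
    preserve -- True models tcp-push-preserve enabled, False models it disabled
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    ingress = deque(packets)
    stats = {"forwarded": 0, "dropped": 0, "held_ticks": 0, "ticks": 0}
    tick = 0
    while ingress:
        egress_has_room = (tick % period == 0)
        if egress_has_room:
            ingress.popleft()
            stats["forwarded"] += 1
        elif ingress[0] and preserve:
            # PUSH packet stays at the head; every packet behind it waits too.
            stats["held_ticks"] += 1
        else:
            # Egress full: drop the head packet.
            ingress.popleft()
            stats["dropped"] += 1
        tick += 1
    stats["ticks"] = tick
    return stats


batch = [False, True, False, False]  # one PUSH packet in second position
print("preserve on: ", simulate(batch, period=3, preserve=True))
print("preserve off:", simulate(batch, period=3, preserve=False))
```

With the mode enabled, the run takes more ticks because the PUSH packet occupies the head of the queue until egress space appears, which is the "might delay subsequent incoming packets" effect the help text mentions.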



  • 10.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 03:57 PM

    Interesting, Michael.  I think we're going to boot a dormant host on that module with a live DVD, mirror an active port to it, and use wireshark to investigate the actual traffic.  We'll hopefully be able to see what is being dropped or discarded, as well as if any of the TCP packets are indeed using the PSH flag or not.
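For the capture analysis described above, the PSH flag is a single bit (0x08) in the flags byte of the TCP header, at byte offset 13. In Wireshark itself the display filter `tcp.flags.push == 1` selects these packets. The sketch below shows the same check on raw TCP headers; `has_psh`, `psh_fraction`, and `make_header` are hypothetical helpers, and the sample headers are synthetic with arbitrary port numbers, not data from this thread.

```python
import struct

TCP_FLAG_PSH = 0x08  # PSH bit within the TCP flags byte
TCP_FLAG_ACK = 0x10  # ACK bit, used only to build realistic sample headers


def has_psh(tcp_header: bytes) -> bool:
    """Return True if the PSH bit is set in a raw TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    return bool(tcp_header[13] & TCP_FLAG_PSH)


def psh_fraction(headers) -> float:
    """Fraction of the given TCP headers that carry the PSH flag."""
    headers = list(headers)
    if not headers:
        return 0.0
    return sum(has_psh(h) for h in headers) / len(headers)


def make_header(flags: int) -> bytes:
    """Build a minimal 20-byte TCP header with arbitrary ports (sample data)."""
    data_offset = 5 << 4  # header length of 5 32-bit words, no options
    return struct.pack("!HHIIBBHHH", 49152, 8080, 0, 0, data_offset, flags, 65535, 0, 0)


sample = [make_header(TCP_FLAG_PSH | TCP_FLAG_ACK)] + [make_header(TCP_FLAG_ACK)] * 3
print(psh_fraction(sample))  # 0.25
```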

    Our issues also seem low level, and we'll likely end up rolling back firmware: first to the previous version, and then forward to some newer ones (but not the absolute latest 16.02.xxxx).

    I'll update here once we find more results.



  • 11.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 04:09 PM

    Interesting...that global setting was introduced, and enabled by default, in x.15.14.0007 (here); I recall it was already cited in this post. But the OP's switch started from KB.15.17, so it was already running a post-KB.15.14 software version. So is it possible that the setting was already in effect (because it has been enabled by default in every version since KB.15.14.0007) but didn't produce these negative effects on his network until the upgrade to KB.16.02? Why didn't the negative effects show up before, if no other changes (consistently growing traffic, for example) were introduced?



  • 12.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 21, 2016 05:34 PM

    We're still in the process of root causing it so I don't want to speculate here, but that "feature" was introduced with v1 modules many years ago.  The CLI command to enable/disable was added a few years back to address a particular issue at the time.  

    We believe there was a recent change specific to the J9538A modules that made them more susceptible to HOL blocking in some scenarios, which can cause latency and other performance issues.  Disabling tcp-push-preserve is a workaround.

    I will post back when I have more information.  



  • 13.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 24, 2016 10:31 AM

    Is it possible that the no tcp-push-preserve will negatively impact other devices or throughput speed?  As it is a global setting I am hesitant to try it during production hours.  Honestly not sure what other devices might be using that setting.  We did find that about 25% of the TCP packets had the push flag set, for what it's worth.

    We're planning to try a reboot sequence of the hosts and NAS devices, then those with the switch.  If the reboots do not alleviate the issue, we'll proceed with firmware rollback, as follows: 15.17.0007 (last known working).  If that works, we'll try the versions until we hit an issue: 15.17.0013 > 15.18.0013 > 16.01.0010 > 16.02.0013 (current)

     



  • 14.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 24, 2016 02:40 PM

    We believe this issue, specific to the J9538A module ports 4, 5, and 6, was introduced in K/KB.16.01.0008 and K/KB.16.02.0011.  

    Disabling that feature should not negatively impact other devices or throughput.  Disabling it basically configures the switch to drop TCP push traffic as it would any other packet when the egress queue fills up.  The egress queue filling up incorrectly is the issue and "no tcp-push-preserve" helps in that condition.
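When planning which releases to test, the two starting points named above (K/KB.16.01.0008 and K/KB.16.02.0011) can be turned into a quick check. This is a sketch under stated assumptions: `maybe_affected` and `FIRST_AFFECTED_BUILD` are hypothetical names, the check treats every later build on the same 16.01 or 16.02 branch as affected because no fixed release is named here, and it says nothing about branches other than those two.

```python
import re

# (major, minor) branch -> first build believed to contain the issue,
# taken from the statement above. No upper bound is known at this point.
FIRST_AFFECTED_BUILD = {(16, 1): 8, (16, 2): 11}


def maybe_affected(version: str) -> bool:
    """Return True if the version is at or after the first affected build
    on the 16.01 or 16.02 branch. Other branches return False (unknown)."""
    match = re.search(r"(\d+)[._](\d+)[._](\d+)\s*$", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, build = (int(group) for group in match.groups())
    first_bad = FIRST_AFFECTED_BUILD.get((major, minor))
    return first_bad is not None and build >= first_bad


for candidate in ["KB.15.17.0007", "KB.16.01.0010", "KB.16.02.0010", "KB_16_02_0013"]:
    print(candidate, maybe_affected(candidate))
```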

     



  • 15.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 24, 2016 03:27 PM

    Thank you for the information!  We'll try releases before those, and maybe those and later ones as well.  Actually, we're seeing issues on ports besides 4, 5, and 6 (for example, port 2).  Moving the connection from port 2 to port 8 seems to have helped.  Additionally, we're using hardware v2 of module J9538A.



  • 16.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Oct 24, 2016 06:01 PM

    That's interesting.

    AFAIK the 8-port 10-GbE v2 zl Module (J9538A) has two static Channels built in: Channel 1 groups module ports 1, 4, 6 and 8, and Channel 2 groups module ports 2, 3, 5 and 7.

    Each Channel provides a total aggregated bandwidth of 23.4 Gbps (so the total aggregated bandwidth of the entire module is 46.8 Gbps).

    The port-to-Channel assignment on the J9538A Module is fixed, so each Channel's aggregated bandwidth is shared within that specific set of physical ports (specifically, among the ports that are in a "linked" state and therefore active). The Ports versus Channels schema is:

    • Channel 1: Port 1
    • Channel 1: Port 4
    • Channel 1: Port 6
    • Channel 1: Port 8
    • Channel 2: Port 2
    • Channel 2: Port 3
    • Channel 2: Port 5
    • Channel 2: Port 7

    Basically this means that if you need full wire-rate transfer speeds (10 Gbps Full Duplex, so 20 Gbps) you must not connect more than one 10 GbE port per Channel (that is, no more than one of the four ports in each Channel)...that's because the module is oversubscribed (it simply cannot sustain 8 x 10 Gbps = 80 Gbps at wire rate [*]).
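The port-to-Channel arithmetic above can be checked for any set of connected ports. The sketch below uses only the figures quoted in this post (23.4 Gbps per Channel, 20 Gbps for one port at 10 Gbps full duplex, and the port grouping listed above), which the poster himself marks as "AFAIK". The names `PORT_TO_CHANNEL`, `channel_load`, and `oversubscribed_channels` are hypothetical, and the calculation is a worst case that assumes every connected port runs at full wire rate in both directions.

```python
# Port -> Channel mapping as listed in the post above.
PORT_TO_CHANNEL = {1: 1, 4: 1, 6: 1, 8: 1, 2: 2, 3: 2, 5: 2, 7: 2}

CHANNEL_CAPACITY_GBPS = 23.4  # per-Channel aggregate quoted in the post
PORT_DEMAND_GBPS = 20.0       # 10 Gbps full duplex, worst case (assumption)


def channel_load(active_ports):
    """Worst-case demand in Gbps on each Channel for the given active ports."""
    load = {1: 0.0, 2: 0.0}
    for port in set(active_ports):
        load[PORT_TO_CHANNEL[port]] += PORT_DEMAND_GBPS
    return load


def oversubscribed_channels(active_ports):
    """Channels whose worst-case demand exceeds the quoted capacity."""
    load = channel_load(active_ports)
    return sorted(ch for ch, gbps in load.items() if gbps > CHANNEL_CAPACITY_GBPS)


print(oversubscribed_channels([1, 2]))     # one port per Channel
print(oversubscribed_channels([1, 4, 2]))  # two ports share Channel 1
print(oversubscribed_channels(range(1, 9)))
```

Real traffic rarely sits at wire rate on every port at once, so an oversubscribed result here marks where contention is possible, not where it is certain.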

    That said, it's important to know which ports to connect (and which to leave unconnected) so that the connected ports can reach wire speed.

    Another point worth noting is that v2 zl Modules (like the J9538A) get a maximum bandwidth of 40 Gbps (per Slot) when the 5400R zl2 is operating in v2 Compatibility Mode, whereas v3 zl2 Modules (like the J9993A, successor of the J9538A) get a maximum bandwidth of 80 Gbps (per Slot) whether the 5400R zl2 operates in v2 Compatibility Mode or in v3 only Mode (v2 zl Modules are not supported in v3 only Mode).

    [*] Question: do those 80 Gbps refer to Full Duplex or not?

    Edit: it's also interesting to read the J9538A Module related defect report ProVision CR_0000213551 (that particular issue was already declared fixed), in which, as a workaround, HPE advised using SFP+ Transceivers instead of SFP Transceivers on Ports 4, 5 or 6 or, if possible, using the remaining ports 1, 2, 3, 7 or 8 (so it wasn't the cause of the issue we're discussing here!)...



  • 17.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Nov 07, 2016 09:25 AM

    We rolled back from 16.02.13 to 16.02.10 and the issues with discards and drops appear resolved.  We did not use the workaround command to ignore the PUSH flag.

    We also rearranged our connections to more accurately adhere to the channels.  I find it strange that the channels are not 1-4/5-8 OR even/odd ports.



  • 18.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Nov 07, 2016 12:41 PM

    That's good to know.

    Yeah, the channel <--> port binding order looks strange to me too...but that's how it is; look below (from the glorious HP 5400 zl Switches Installation and Getting Started Guide, Manual Part Number: 5998-2998 of June 2013):

    Screenshot_1.png

    or here (specifically sheet 9).

     

     

     

     



  • 19.  RE: HP 5412 10GbE Module Issues / Troubleshooting Tricks

    Posted Nov 10, 2016 02:09 PM

    The ArubaOS-Switch KB.16.02.0014 Release Notes are worth reading, especially regarding:

    • Enhancement: TCP Push Preserve mode is set to disabled by default now.
    • Fix: CR_0000216989, under Switch Module: "Switch performance degrades when using ports 4, 5, or 6 on J9583A switch" (IMHO partially related to the TCP Push Preserve enhancement above, given the "...may improve..." wording of the workaround).