ixgbe with SFP+ ports - wedges with 'tx hang' errors after a number of hours

Description

Hi,

I recently upgraded from TrueNAS CORE to TrueNAS SCALE on my system. Found a few minor issues - but have now run into a major one.

My system is a SuperMicro X10DRU-i+ with the following NIC fitted:

Intel Corporation 82599ES 10-Gigabit SFI/SFP+

The box has 2 * E5-2630L v3 CPUs and 128G of RAM.

After the system has been on for a few hours - the SFP+ network card 'wedges' - in dmesg I see:

[106699.517823] ixgbe 0000:81:00.0 ens1f0: tx hang 9251 detected on queue 16, resetting adapter
[106699.527340] ixgbe 0000:81:00.0 ens1f0: initiating reset due to tx timeout
[106699.535305] ixgbe 0000:81:00.0 ens1f0: Reset adapter
[106699.619463] ixgbe 0000:81:00.0 ens1f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
[106699.619463] ixgbe 0000:81:00.0 ens1f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
[106699.852719] ixgbe 0000:81:00.0: ens1f0: master disable timed out
[106700.196252] ixgbe 0000:81:00.0 ens1f0: detected SFP+: 3
[106700.346639] ixgbe 0000:81:00.0 ens1f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[106700.554819] ixgbe 0000:81:00.0 ens1f0: Detected Tx Unit Hang
Tx Queue <0>
THD, TDT <0>, <1>
next_to_use <1>
next_to_clean <0>
tx_buffer_info[next_to_clean]
time_stamp <10195e200>
jiffies <10195e22b>
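For anyone comparing logs from several boxes, the hang events above can be pulled out of a saved dmesg dump with a short script. This is only a minimal sketch: the function names (`find_tx_hangs`, `jiffies_delta`) are made up for illustration, the regular expression is written against the exact message format shown in this ticket and may need adjusting for other kernel versions, and the number after "tx hang" is assumed here to be a running hang counter. The first hang line in this report is at roughly 29.6 hours of uptime, and the gap between `time_stamp` and `jiffies` in the Tx Unit Hang dump is 0x2b, i.e. 43 jiffies.

```python
import re

# Pattern written against the message format seen in this ticket, e.g.:
# [106699.517823] ixgbe 0000:81:00.0 ens1f0: tx hang 9251 detected on queue 16, resetting adapter
# Assumption: other kernel versions may word this line differently.
TX_HANG_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+ixgbe\s+(?P<pci>[0-9a-f:.]+)\s+(?P<iface>\S+):\s+"
    r"tx hang (?P<hang_no>\d+) detected on queue (?P<queue>\d+)"
)


def find_tx_hangs(dmesg_text):
    """Return one dict per 'tx hang' line found in the given dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        m = TX_HANG_RE.search(line)
        if m:
            events.append({
                "uptime_s": float(m.group("uptime")),   # seconds since boot
                "pci": m.group("pci"),                  # PCI address of the NIC
                "iface": m.group("iface"),              # interface name
                "hang_no": int(m.group("hang_no")),     # assumed: running hang counter
                "queue": int(m.group("queue")),         # Tx queue that hung
            })
    return events


def jiffies_delta(time_stamp_hex, jiffies_hex):
    """Difference between the two hex values printed in the Tx Unit Hang dump."""
    return int(jiffies_hex, 16) - int(time_stamp_hex, 16)


# Two lines copied from the dmesg output in this ticket.
SAMPLE = """\
[106699.517823] ixgbe 0000:81:00.0 ens1f0: tx hang 9251 detected on queue 16, resetting adapter
[106699.527340] ixgbe 0000:81:00.0 ens1f0: initiating reset due to tx timeout
"""

if __name__ == "__main__":
    for event in find_tx_hangs(SAMPLE):
        print(event)
    print("jiffies since time_stamp:", jiffies_delta("10195e200", "10195e22b"))
```

To run it against a live system, the output of `dmesg` could be saved to a file and its contents passed to `find_tx_hangs` in place of `SAMPLE`.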

I've tried an 'ifconfig down / up' - but this doesn't clear it. So far the only fix is to reboot the machine.

This system was running continuously with TrueNAS CORE for a 'long time' - without issue. The switch on the other side of the NIC can't see any issues - there's just no traffic being passed when this happens.

The machine has onboard RJ45 NICs - I'm in the process of switching over to one of those until this is resolved, but could really do with the SFP+ modules working reliably.

Thanks.

Problem/Justification

None

Impact

None

Activity

Alexander Motin April 6, 2022 at 7:52 PM

We haven't touched the Intel NIC drivers supplied with the Linux kernel and don't have much development experience with them, so without reproducing the problem on site it is difficult to say anything other than that it must be between the hardware and the driver. We should release SCALE 22.02.1 in a few weeks, including a slightly updated Linux 5.10 kernel. You may experiment with those to see whether it helps. We've recently switched to the Linux 5.15 kernel in nightly builds of SCALE 22.12, but those are not published yet.

Bonnie Follweiler March 16, 2022 at 3:59 PM

Thank you.

I have moved this ticket into our queue to review.

An engineering representative will update with any further questions or details in the near future.

Tackyone March 15, 2022 at 10:37 PM

Hi,

Debug should be attached. I've had to reconfigure the NICs - 'ens1f0' was the one that wedged, and it no longer has an IP on it. I transferred that IP to one of the onboard RJ45s ('enp1s0f0') as they seem to be ok. 'ens1f1' still has an IP bound, but doesn't have anything critical using it afaik.

Thanks

Bonnie Follweiler March 15, 2022 at 9:10 PM

Thank you for the report.

Can you please attach a debug file to the "Private Attachments" section of this ticket? To generate a debug file on TrueNAS SCALE, log in to the TrueNAS web interface, go to System Settings > Advanced, then click SAVE DEBUG and wait for the file to download to your local system.

Third Party to Resolve

Details

Impact

High

Created March 15, 2022 at 10:31 AM
Updated July 6, 2022 at 8:58 PM
Resolved April 6, 2022 at 7:52 PM