IXGBE Issue under TrueNAS Core - ix Interfaces 'hang' after random time.

Description

I've been using FreeNAS for quite a while now - on a SuperMicro X10DRU-i+ system.

This has:

  • 4 x Intel(R) PRO/10GbE PCI-Express (twisted pair) onboard - not currently in use, shown as ix0, ix1, ix2 and ix3

  • 1 x dual-port Intel(R) PRO/10GbE PCI-Express (SFP+) in a PCI-e slot - in use, shown as ix4 and ix5

After upgrading to TrueNAS Core 12.0, these ix-based NICs randomly "hang" after a period of time.

This wasn't seen under FreeNAS 11.3-U3.2 (or earlier versions).

Unfortunately when this happens:

  • There are no console errors / output.

  • There are no errors logged anywhere I can see

  • The interface stops receiving / transmitting traffic (verified with tcpdump - even 'arp: Who has x tell y' requests for the MAC do not show up in tcpdump when this happens).

Also - the output from 'netstat -b -i ixX' appears unremarkable (i.e. no sudden surge in input/output errors).

I'd suspect something is wedging / getting 'stuck' somewhere.

In the TrueNAS community forums, there are at least two other people with similar issues:

https://www.truenas.com/community/threads/freenas-truenas-core-upgrade-broke-ix-nic.88477/#post-613005

I've tried doing an 'ifconfig ix[0-5] down' and 'up' - to no avail. The only resolution is to reboot the machine.

One interface is part of a LAGG - the other is standalone, both exhibit the problem randomly (i.e. network load doesn't affect it).

In the meantime I've had to revert to FreeNAS 11.3 (which has fixed the issue), so I'm in the process of setting up another identical box for testing. I'll try to update this report with dmesg etc. output once it is running.

Problem/Justification

None

Impact

None

Activity

Tackyone October 28, 2021 at 8:59 AM

Hi,

In the spirit of actually trying to jinx this (on the basis that failing now is better than failing months and months from now :) ), the system is still up and OK. Uptime is now 13 days, and the ix4 interface (which is SFP+ and primary for traffic to / from the box) now shows:

Packets In / Out: 24,427,947,489 / 11,529,900,584

Bytes In / Out: 206,371,818,775,381 / 48,007,244,066,118

So whilst technically it's still too early to tell (I think the longest run before the interface stalling was ~2 weeks) - it's looking ok so far. I'll post back in another week or so...

Tackyone October 15, 2021 at 7:21 PM

Hi,

Mine's upgraded to U6, and I've switched back to the SFP+'s on the card, so just waiting to see what happens. If you get yours into that state again, might be worth seeing if:

sysctl -a | grep ix. | grep ring_state

Shows anything 'stalled' - as that's what seems to happen on mine.

It's reproducible - but the time between occurrences varies a lot (i.e. can be hours, days - I think the longest was ~2 weeks).
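
The check above can be wrapped in a small helper so it can be run repeatedly, or against saved `sysctl -a` output. This is only a sketch: `find_stalled_rings` is a hypothetical name, and it assumes the ring state sysctls for these NICs contain `.ix.<unit>.` in their names and include the word "stalled" (in any letter case) when a ring is wedged, as described in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: print ring_state sysctl lines for ix devices
# that report a stalled ring. Reads "sysctl -a"-style text on stdin.
# Exit status is 0 if at least one stalled ring is found, non-zero otherwise.
find_stalled_rings() {
    grep -E '\.ix\.[0-9]+\.' | grep 'ring_state' | grep -i 'stalled'
}

# Usage on the affected host:
#   sysctl -a | find_stalled_rings
```

Because the function reads from stdin, output captured at the time of a hang can be checked later on another machine.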

꧁•ᏒคᎥនтOʀ ࿐ October 15, 2021 at 6:27 PM

Hi guys!

It's my first post on this, but I have been following the discussions on this issue for some time. Unfortunately, U6 still doesn't resolve the hang problem with "no ping reply (NOP-Out) after 5 seconds; dropping connection". I have an Intel X550-T2 (verified as genuine) and it randomly keeps disconnecting. Ifdown/Ifup doesn't solve the issue, and I suspect it's a driver glitch with the ix0 and ix1 interfaces. Because the storage is in a production environment, I've moved the iSCSI session to a 1Gb port, but this is killing me slowly. I've tried everything without success :(

I hope someone can shed some light on this. Thanks!

Tackyone October 14, 2021 at 1:56 PM

Hi,

I've not had a chance to install U6 yet - I will tonight. It could then be a few days' wait to see if the interface wedges. The LAGG is formed from two NICs - one LAGG partner is on the SFP+ NIC, the other is on the copper/RJ45 NIC, which terminates on a 1Gbit switch - in the hope that if we lose the 10Gig side, the 1Gig will at least work (albeit slower).

When the SFP+ NIC 'wedges' none of this triggers though as the system doesn't see any kind of link transition - it just stops passing traffic in/out.
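
One way to confirm that the link itself still looks healthy while traffic has stopped is to pull the `status:` field out of `ifconfig`. This is a minimal sketch, assuming the usual FreeBSD `ifconfig` output with a `status:` line; `link_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "status:" line from
# ifconfig output read on stdin (e.g. "active" or "no carrier").
link_status() {
    sed -n 's/^[[:space:]]*status:[[:space:]]*//p'
}

# Usage on the affected host:
#   ifconfig ix4 | link_status
```

If this still prints `active` while the interface passes no traffic, that would be consistent with the LAGG never seeing a link transition.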

For clarity - even if I run the system without the LAGG (just in case it was causing issues) - the SFP+ NIC still wedges under U5. At least this time round I know what I'm looking for (i.e. the 'stalled' line).

I'll get the update done ASAP - and see how it goes.

Thanks.

Kristopher Kolpin October 14, 2021 at 1:15 PM

Did the fixes ported from upstream FreeBSD solve your issue? I noticed that this ticket is marked as done without confirmation that the code changes resolved your issue.

Also, can you clarify the configuration of your LAGG? From what I could gather of your setup, the LAGG has two members, but each member is from a different physical NIC. Is that correct?

Complete

Created November 11, 2020 at 10:53 AM
Updated July 1, 2022 at 5:00 PM
Resolved September 14, 2021 at 4:29 PM