Every ~2-7 days the same 3 (or 4) pools become FAULTED and the system goes haywire

Description

There are ~20 pools in our setup: a SuperMicro server with 2*120SSD for the OS inside the server itself, and an external LSI HBA (03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)) with 2 cascaded Huawei storage enclosures, 24 disks each.
Services: SSH replication (zfs recv, receiver side), ~4-7 iSCSI targets for client server backups + MSSQL dumps, ~2-4 NFS exports (file storages)

30% of pools are made from SATA drives, 70% - SAS (no multipath used/configured)

The FAULTED pools are always SATA mirrors (each pair of drives differs in size and vendor; the only thing they have in common is the SATA interface).
Both disks in each pool become unavailable (error count increases) and are "removed" at the same time. After a reboot there is no inconsistency on any of the FAULTED pools.

The HBA itself, the external SAS cables and even the storage enclosures have already been replaced, with no effect. The storage topology (cascading / parallel) makes no difference either.

All FAULTED pools become HEALTHY after reboot without scrub/resilvering

This happens every 2-7 days at random time (no time dependence noticed so far)

Sometimes even the SSH service on the host becomes unavailable. Sometimes only the services related to the FAULTED pools are affected. The GUI works all the time.

Every time, the reboot process gets stuck on unmounting / service shutdown for the affected pools. Only a hard reset helps.

Any help would be much appreciated!

Problem/Justification

None

Impact

None

Attachments

3

Activity

Alexander Motin March 23, 2023 at 7:06 PM

In the debug I see a number of cases where one pool or another was suspended due to I/O errors. It seems that in all cases some disks lost and later restored their connection to the host. Unfortunately we can't provide support for issues of that kind on third-party systems, especially considering that the system is pretty old and the HBA used has long since been discontinued. I'd recommend checking all the hardware, updating firmware versions where available, etc. I can see that the upcoming OpenZFS 2.2, planned for TrueNAS Cobia, should make Linux more aggressive with retries on some errors. The planned migration to a newer Linux kernel in TrueNAS Cobia may also improve something on the HBA driver side, but I don't have anything particular.

Michelle Johnson March 10, 2023 at 4:24 PM

Thank you for your report!

This issue ticket is now in the queue for review. An Engineering representative will update with further details or questions in the near future.

Automation for Jira March 9, 2023 at 9:33 AM

Thank you for submitting this TrueNAS Bug Report! So that we can quickly investigate your issue, please attach a Debug file and any other information related to this issue through our secure and private upload service below. Debug files can be generated in the UI by navigating to System -> Advanced -> Save Debug.

https://ixsystems.atlassian.net/servicedesk/customer/portal/15/group/37/create/153

Hardware failure

Details

Impact

High

Created March 9, 2023 at 9:33 AM
Updated March 23, 2023 at 7:07 PM
Resolved March 23, 2023 at 7:07 PM