Hardware failure
Details
Assignee: Alexander Motin
Reporter: Andrey Matveev
Impact: High
Priority: Low
Created March 9, 2023 at 9:33 AM
Updated March 23, 2023 at 7:07 PM
Resolved March 23, 2023 at 7:07 PM
There are ~20 pools in our setup: a SuperMicro server with 2 x 120 GB SSDs for the OS inside the server itself, and an external LSI HBA (03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)) with 2 cascaded Huawei storage enclosures, 24 disks each.
Services: SSH replication (zfs recv, receiver side), ~4-7 iSCSI targets for client server backups and MSSQL dumps, ~2-4 NFS exports (file storage)
30% of the pools are built from SATA drives and 70% from SAS drives (no multipath is used or configured)
The FAULTED pools are always SATA mirrors (each pair of drives is a different size and vendor; the only common factor is the SATA interface)
Both disks in each pool become unavailable (the error count increases) and are "removed" at the same time. There is no inconsistency on any of the FAULTED pools after reboot
The HBA itself, the external SAS cables and even the storage enclosures have already been replaced, with no effect. The storage topology (cascaded / parallel) makes no difference either
All FAULTED pools become HEALTHY after reboot without scrub/resilvering
This happens every 2-7 days at random time (no time dependence noticed so far)
Sometimes even the SSH service on the host becomes unavailable. Sometimes only the services related to the FAULTED pools are affected. The GUI works all the time
Every time, the reboot process gets stuck on unmounting / service shutdown for the affected pools. Only a hard reset helps
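Since the failures happen at random times every 2-7 days, a timestamped record of pool health might help correlate them with kernel or HBA log entries. Below is a minimal Python sketch, not something from our setup as described above: it assumes `zpool list -H -o name,health` is available on the host, and the function names (`parse_pool_health`, `unhealthy_pools`, `snapshot`) are hypothetical helpers made up for illustration.

```python
import subprocess
from datetime import datetime, timezone


def parse_pool_health(listing: str) -> dict[str, str]:
    """Parse `zpool list -H -o name,health` output into {pool: health}.

    With -H the columns are tab-separated, so pool names that
    contain spaces are kept intact.
    """
    pools = {}
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) == 2:
            name, health = fields
            pools[name] = health.strip()
    return pools


def unhealthy_pools(pools: dict[str, str]) -> list[str]:
    """Return the sorted names of pools whose health is not ONLINE.

    zpool reports a healthy pool as ONLINE; FAULTED, DEGRADED,
    REMOVED and similar states are all treated as unhealthy here.
    """
    return sorted(name for name, health in pools.items() if health != "ONLINE")


def snapshot() -> str:
    """Build one timestamped log line describing the current pool state."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True, text=True, check=True,
    ).stdout
    bad = unhealthy_pools(parse_pool_health(out))
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} unhealthy={','.join(bad) or '-'}"
```

Calling `snapshot()` once a minute (for example from cron) and appending the result to a file on the OS SSDs, not on an affected pool, would give the first minute at which each pool left the ONLINE state.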
Any help would be much appreciated!