Pool keeps failing a specific disk, in a specific RAIDZ2

Description

Related to this post in the forums:

https://www.ixsystems.com/community/threads/pool-keeps-failing-a-specific-disk-in-a-specific-raidz2.87665/

Summary:

Pool with 3 x RAIDZ2 vdevs keeps getting checksum errors on the 5th "slot" of the 3rd RAIDZ2.

Resilvered 4-5 times; it always finishes without error. Then, after a few minutes, the checksum errors start again on the same "slot".
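For anyone following along, a minimal sketch of how the per-device checksum counters could be pulled out of `zpool status` text to watch for the errors coming back after a resilver. The sample text, the pool name `tank`, the `da*` device names and the helper `devices_with_checksum_errors` are all made up for illustration; they are not taken from this system.

```python
# Hypothetical sample in the usual `zpool status` layout; not real output
# from the pool described in this ticket.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          raidz2-0    ONLINE       0     0     0
            da0       ONLINE       0     0     0
            da1       ONLINE       0     0     0
          raidz2-2    DEGRADED     0     0     0
            da2       ONLINE       0     0     0
            da3       DEGRADED     0     0    37  too many errors

errors: No known data errors
"""


def devices_with_checksum_errors(text):
    """Return (name, cksum) pairs for rows whose CKSUM column is not "0"."""
    found = []
    in_config = False
    for line in text.splitlines():
        fields = line.split()
        # The column header marks the start of the device table.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            # A blank line ends the device table.
            in_config = False
            continue
        if len(fields) < 5:
            # Section labels without counters are skipped.
            continue
        name, _state, _read, _write, cksum = fields[:5]
        # Counters are compared as strings so abbreviated values still count.
        if cksum != "0":
            found.append((name, cksum))
    return found


print(devices_with_checksum_errors(SAMPLE))
```

On a live system the text would presumably come from running `zpool status` on the pool instead of from `SAMPLE`.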

This happens with every drive I provide, regardless of how the drive is physically connected (different controller, enclosure, etc.). It seems to be entirely software related; there are no SMART errors.

Compression and dedupe are enabled.

It's running on an oldish SuperMicro server board with redundant LSI SAS controllers.

1 server, 3 enclosures with 2 controllers each, to be exact.

Problem/Justification

None

Impact

None

Activity

Brian Hansen October 26, 2020 at 1:49 PM

So, after digging around I found a single disk reporting errors on mps3, but that was a SMART-related error because that particular disk (an old OCZ SSD) doesn't want to report SMART status.

The main error seems to be a faulty port / connector on one of the controllers in the lower enclosure:

I moved the SAS cable from the lower port to the middle port, no errors!

Moving it back to the lower port and the errors come back.

I'm sorry to have raised this as a bug report, it's something obvious I should not have missed.

Alexander Motin October 19, 2020 at 3:14 PM

If you look at the output of `camcontrol devlist -v`, you'll see that your disks are spread between mps0, mps1 and mps2, about the same as on your diagram, while all the errors I see are only on mps1.
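As a rough illustration of that check, here is a small sketch that groups disks by controller from `camcontrol devlist -v` text. The sample output, the disk model strings and the helper `disks_by_controller` are hypothetical and only follow the general layout of that command's output; they are not from the debug attached to this ticket.

```python
import re
from collections import defaultdict

# Hypothetical sample shaped like `camcontrol devlist -v` output.
SAMPLE = """\
scbus0 on mps0 bus 0:
<ATA EXAMPLE-DISK-A 0001>          at scbus0 target 8 lun 0 (pass0,da0)
<ATA EXAMPLE-DISK-B 0001>          at scbus0 target 9 lun 0 (pass1,da1)
scbus1 on mps1 bus 0:
<ATA EXAMPLE-DISK-C 0001>          at scbus1 target 8 lun 0 (da2,pass2)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus2 on mps2 bus 0:
<ATA EXAMPLE-DISK-D 0001>          at scbus2 target 10 lun 0 (pass3,da3)
"""

# "scbusN on <controller> bus M:" starts a new bus section.
BUS_RE = re.compile(r"^scbus\d+ on (\S+) bus \d+:")
# "<vendor model rev> at scbusN target T lun L (dev,dev)" is one device.
DEV_RE = re.compile(r"^<.*>\s+at scbus\d+ target \S+ lun \S+ \(([^)]*)\)")


def disks_by_controller(text):
    """Map each controller name (e.g. mps1) to the disk devices behind it."""
    groups = defaultdict(list)
    controller = None
    for line in text.splitlines():
        bus = BUS_RE.match(line)
        if bus:
            controller = bus.group(1)
            continue
        dev = DEV_RE.match(line)
        if dev and controller is not None:
            # Drop the passN passthrough names and keep the disk names.
            names = [n for n in dev.group(1).split(",")
                     if n and not n.startswith("pass")]
            groups[controller].extend(names)
    return dict(groups)


for ctrl, disks in sorted(disks_by_controller(SAMPLE).items()):
    print(ctrl, disks)
```

Comparing this mapping with the controller names that appear in the kernel error messages should show whether the failing disks really share one controller.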

Brian Hansen October 19, 2020 at 1:51 PM

Sorry for the delay, I looked into it today and it's quite strange.

All disks show up on mps1, but that's because all disks are reported as being on mps1, even though they are connected through 3 separate HBAs, each to its own backplane / enclosure.

I've tried to illustrate how it's all connected above; it's pretty straightforward, I'd think?

Brian Hansen October 9, 2020 at 11:19 AM

Hmm! I'm on a business trip for the next 10 days; I'll get right back on it when I'm back home.

Alexander Motin October 7, 2020 at 4:24 PM

But in a very brief look at your debug, I saw errors only from mps1.

Cannot Reproduce

Details

Assignee

Reporter

Impact

High

Components

Fix versions

Affects versions

Priority


Created September 28, 2020 at 12:17 PM
Updated July 1, 2022 at 4:54 PM
Resolved October 6, 2020 at 3:38 PM