ZFS data corruption and random reboots during scrubs on TrueNAS 12+

Description

Hi all,

Since 28 November I’ve been going through my worst hardware troubleshooting nightmare ever. I hope someone here can help me get to the bottom of this.

The hardware / software
---------------------------------------
The problem occurs on my 1 year old FreeNAS / TrueNAS server, which has the following hardware:

OS: FreeNAS 11 U3 until TrueNAS 12 U1 (the upgrade to TrueNAS 12 was in early November)
Case: Fractal Design Define R6 USB-C
PSU: Fractal Design ION+ 660W Platinum
Mobo: ASRock Rack X470D4U2-2T
NIC: Intel X550-AT2 (onboard)
CPU: AMD Ryzen 5 3600
RAM: 32GB DDR4 ECC (2x Kingston KSM26ED8/16ME) which was replaced by 64GB DDR4 ECC (2x Samsung M391A4G43MB1-CTD) in the beginning of November
HBA: Dell H310 (= LSI SAS 9211-8i), which was replaced by an LSI SAS 9207-8i in early 2021 as part of the troubleshooting
HDDs: 8x WD Ultrastar DC HC510 10TB (RAID-Z2)
Boot disk: Intel Postville X25-M 160GB
SLOG: Intel Optane 900P 280GB (added beginning of October)​
This server ran for almost a year without issues, until 28 November, and was properly burned in using Memtest86, Prime95, solnet-array-test, badblocks, etc…

The (horror) issue
-----------------------------
I have monthly scrubs scheduled on my server on the 28th of the month. On 28 November I got 6 reboots during the scrub run and it produced 8 CKSUM errors across 5 different HDDs. Before this, scrubs had never given me any issues…
I’ve thoroughly checked the logs and nothing can be found regarding the reboots or data corruption.
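As an aside, since I'll be quoting CKSUM counts a lot: they come from `zpool status`. A tiny shell helper along these lines (the helper name is my own; the pool name `tank` is a placeholder) pulls out the rows with a nonzero CKSUM count:

```shell
# Print the rows of `zpool status` output whose CKSUM column (5th
# field) is nonzero. Note: this matches pool/vdev aggregate rows as
# well as leaf devices, so eyeball which rows are actual disks.
nonzero_cksum() {
  awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED)$/ && NF == 5 && $5 != 0'
}

# On a live system (pool name is a placeholder):
#   zpool status tank | nonzero_cksum
```

Nothing fancy, but it makes comparing 30-50 scrub runs a lot less error-prone than reading the tables by hand.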

Recent hardware / software changes
-----------------------------------------------------------
As you could read earlier, I replaced the RAM and upgraded from FreeNAS 11 to TrueNAS 12 in the same month the errors started occurring. One month before that, I also added an Intel Optane as SLOG (but the scrub on 28 October, after that change, was still problem-free).
So those are of course “likely suspects”…
A downgrade from TrueNAS 12 back to FreeNAS 11 isn’t easy, as I’ve already upgraded my pool’s feature flags, so I would need to destroy my pool and restore all data from my (non-redundant) offline backup (something I’m hesitant to do).

Something weird about the reboots
---------------------------------------------------------
The reboots occur “in pairs” with about 4m30s-4m35s in between. I’ve discovered that I can “prevent” the 2nd reboot by manually rebooting before those 4m30s-4m35s pass.
The 2nd reboot also occurs when the pool is still locked (I have encryption on my pool), so it happens while little or nothing is actually touching the pool.

What I’ve tried so far (unsuccessful)
----------------------------------------------------------

  • Since 28 November I’ve run about 30-50 scrubs for my troubleshooting, at about 12 hours per pass. For each thing I try, I need to run scrub at least twice, because the first run can still turn up CKSUM errors that were created before the change.

  • At first I wasn’t able to reproduce the reboots. Since mid-December I’ve discovered that setting sync=always without the Optane SLOG increases the chance of reboots. But the reboots still occur very randomly (though always in pairs).

  • In Windows, I ran an extended self-test on all 8 HDDs simultaneously, while also running Prime95 blended at the same time. This should stress the HDDs, the PSU, the CPU and the RAM all at once. No reboots occurred and all HDDs were found to be 100% healthy.

  • Also the SMART values of all HDDs are 100% healthy.

  • I’ve confirmed that memory ECC reporting works properly with the 64GB RAM (I had already confirmed this with the 32GB RAM before the issues started occurring, but, just to make sure, I reconfirmed it with the 64GB RAM).

  • I underclocked the RAM (ran it below spec). This didn’t help.

  • I’ve tried triggering the issue on my Intel Optane by creating a single disk pool on the Optane, constantly running scrub / writing data to it, for a whole day. I couldn’t reproduce the issue on the Intel Optane. This shifted my suspicion to the HBA, as the issue only seemed to occur when using a pool on the HBA.

  • I’ve completely removed the Intel Optane from my server. This also did not solve the issue. (Here I did discover the impact of sync=always on the likelihood of reboots.)

  • I’ve forced “Power Supply Idle Control” to “Typical Current Idle” in the BIOS and confirmed that CPU C-States already were disabled. Also this did not help.

  • I’ve upgraded my BIOS and the IPMI and upgraded from TrueNAS12 to TrueNAS12U1. Also this did not help.

  • I’ve re-inserted the HBA and re-attached the SFF-8087 cable, both on the HDD and HBA side. Didn’t help.

  • I’ve (temporarily) added a screamingly loud Delta datacenter 120mm fan. My wife almost kicked me out of the house because of the noise, but, as it again didn’t help, I’m quite sure it is not temperature related.

  • I’ve (temporarily) replaced the SFF-8087 cable with an old one. Didn’t help.

  • I’ve (temporarily) replaced the PSU with an old 850W Seagate PSU. Didn’t help.

  • I’ve bought a new HBA (LSI 9207-8i instead of Dell H310), installed an Intel CPU cooler on it, to make sure it is properly cooled and tried again. Didn’t help.
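For completeness, the single-disk Optane test a few bullets up was roughly along these lines. This is a sketch, not my exact commands; the device name (nvd0), pool name (opttest) and file sizes are placeholders:

```shell
# Create a throwaway single-disk pool on the Optane
# (device and pool names are placeholders).
zpool create opttest nvd0

# Force synchronous writes, since sync=always seemed to raise the
# reboot likelihood on the HBA pool.
zfs set sync=always opttest

# Hammer the pool with writes, then scrub and check for CKSUM errors;
# repeat the write/scrub cycle for a day.
dd if=/dev/urandom of=/mnt/opttest/junk bs=1M count=100000
zpool scrub opttest
zpool status opttest
```

A day of this on the Optane produced zero errors, which is what shifted my suspicion to the HBA path.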

What I’ve tried so far (slightly successful)
------------------------------------------------------------------

  • I’ve moved the HBA from PCI-e slot 6 to PCI-e slot 4. Both of these slots come directly from the CPU (not from the chipset), so it shouldn’t really matter, but it did make a difference… There were noticeably fewer CKSUM errors during the many scrub runs I tried, but some CKSUM errors still occurred (about 1 per run, and once even 0). So I’m getting closer (finally :slight_smile: )

  • I’ve replaced the Ryzen 3600 in my server with my desktop’s Ryzen 9 3900XT and ran scrub 3x without errors with the HBA in slot 6 and 2x without errors with the HBA in slot 4. So it seems like my issue is related to my CPU! I also discovered (I think) a tiny bit of thermal paste covering about 2-3 pins on the side of my Ryzen 3600. So with a soft brush and lots of patience / IPA I cleaned it off.
    Finally, I re-inserted my cleaned Ryzen 3600 and ran scrub some more. The first run completed without CKSUM errors, but on the 2nd run I again got 1 error. So although it certainly seems better than before, my problem still isn’t solved :frowning:

Conclusions so far and questions
------------------------------------------------------

  • It seems like the issue is related to PCI-e / the CPU

  • It could be related to the thermal paste covering the pins, but then it is still very strange that:

      • the problem only started occurring after almost a year

      • cleaning it off didn’t completely solve the problem

  • When googling for PCI-e errors, I found 2 “remarkable” things in other reports related to PCIe errors:

      • PCI-e errors should be detected

      • PCI-e errors can sometimes even be corrected

  • That the problem didn’t occur with my Intel Optane, but only with the HBA, could be related to the Optane being in PCI-e slot 4, while the HBA was in PCI-e slot 6

  • It still blows my mind that the reboots occur in pairs. This smells like a “software issue”, while everything else clearly points at a “hardware issue”. This is the main reason I created this bug report. The fact that NOTHING about these reboots appears in the logs seems like a serious bug to me; it can't be a hardware-only issue with those "reboots-in-pairs".

  • I also wonder why I'm not seeing PCIe errors. I've confirmed that AER is enabled in my BIOS, and in Linux I can clearly see that it is also enabled by the OS:

-bash-5.1# dmesg |grep -i aer
[ 0.869870] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 1.215562] pcieport 0000:00:01.1: AER: enabled with IRQ 26
[ 1.215726] pcieport 0000:00:01.3: AER: enabled with IRQ 27
[ 1.215876] pcieport 0000:00:03.1: AER: enabled with IRQ 28
[ 1.216020] pcieport 0000:00:03.2: AER: enabled with IRQ 29
[ 1.216215] pcieport 0000:00:07.1: AER: enabled with IRQ 31
[ 1.216346] pcieport 0000:00:08.1: AER: enabled with IRQ 32
[ 1.216489] pcieport 0000:00:08.2: AER: enabled with IRQ 33
[ 1.216630] pcieport 0000:00:08.3: AER: enabled with IRQ 34

On TrueNAS, I see no such thing in dmesg. Does TrueNAS support AER?
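(For what it's worth, on the FreeBSD side one can at least check whether a device advertises the AER extended capability with pciconf. The flags should be right, but treat the device selector as a placeholder for your own system:)

```shell
# List PCI devices together with their capabilities; AER appears as an
# "ecap" line on devices that advertise it, along the lines of:
#   ecap 0001[100] = AER ...
pciconf -lc

# Restrict the output to one device (the selector is a placeholder;
# get yours, e.g. the HBA driver instance, from `pciconf -l`):
pciconf -lc mps0
```

Whether the capability is advertised says nothing about whether the OS actually logs AER events, of course, which is exactly my question above.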


Activity

Alexander Motin February 1, 2021 at 8:02 PM

Thanks for the update. The ticket will remain here for anybody to see. Let us know if there is anything substantial for us to debug. There is always a chance of a software component in such issues, but we have to focus on more reproducible ones.

Mastakilla February 1, 2021 at 7:36 PM

Last week I bought a new AMD Ryzen 5 3600 CPU for my TrueNAS server and, just as when I temporarily used my desktop's AMD Ryzen 9 3900XT, it has solved the issues.

This once more confirms that the "root cause" does not seem to be a software issue in TrueNAS or the "still-sometimes-claimed" incompatibility of TrueNAS and AMD platforms. The root cause seems to be a broken CPU.

What still remains incredibly weird to me is that the CPU went from working-without-any-problem to broken, without touching the CPU / cooler and also without thermally stressing the CPU:

  • the cooler that I'm using for this low-TDP CPU is certainly overkill

  • as I'm not yet running any VMs, I strongly doubt that TrueNAS alone is capable of really stressing this CPU.

  • I didn't overclock the CPU

I wasn't really expecting a CPU to fail like this on its own. Normally a CPU is DOA or it works for 10+ years.

Anyway... I guess life is full of surprises

But, back to this bug report: even though TrueNAS / FreeBSD is not the root cause, I still feel something is missing / wrong in how it handled the broken CPU. The reboots occurring in pairs indicate that something software-side is involved, so the fact that they happen without anything in the logs and without a kernel panic seems very wrong to me.

Also, that the data corruption itself occurs without any notification or log entry seems an important defect for a product that has "data security" so high on its priority list. But I understand that this might be a problem that goes beyond the TrueNAS software itself (it might be AMD-FreeBSD related).

As I was unable to collect any "usable" debug information, I do understand that there is probably not much you can do about this right now. But I do hope that you can keep my experiences tracked somewhere, so that if someone else hits a similar issue, perhaps more progress can be made in solving it.

Thanks for looking into this!

Mastakilla January 23, 2021 at 1:33 AM

I tried finding a setting in the BIOS related to "ACPI Error Reporting Interface", but couldn't find it...

I also just swapped CPUs again, to re-confirm that replacing the Ryzen 3600 with my desktop's Ryzen 9 3900XT still solves the issue. Keeping in mind your comment about performance differences potentially hiding bugs, this time I've disabled the 2nd CCD of the 3900XT, so that it only uses the same number of cores / threads as the Ryzen 3600. Because of the higher TDP it is probably still faster than my Ryzen 3600, but at least performance should be closer, making this a better test...

Alexander Motin January 22, 2021 at 4:52 PM

A debug kernel should not remove any work, but adds much more random sanity checking, significantly reducing performance, which sometimes may hide bugs.

Mastakilla January 22, 2021 at 2:41 PM

One strange detail is that, after switching to the debug kernel, scrub didn't even find CKSUM errors from before the switch anymore.

"Normally", after changing something, I first have to do a complete scrub run "to clear CKSUM errors" that were created before the aspect that I've changed. So it takes a second scrub run to see "the result".

But after switching to the debug kernel, there were no CKSUM errors even on the first run (from my past experience, it seems highly unlikely that this is a coincidence).

I'm not sure what to conclude from this... Perhaps scrub does its work less completely / correctly in debug mode?

Cannot Reproduce

Details


Created January 14, 2021 at 9:57 AM
Updated July 1, 2022 at 2:53 PM
Resolved February 1, 2021 at 8:02 PM