CLONE - Data corruption after TrueNAS upgrade

Description

Copied from the TrueNAS forums: https://www.truenas.com/community/threads/freenas-now-truenas-is-no-longer-stable.89445

Subject: FreeNAS (now TrueNAS) is no longer stable

Hello, I'm writing this out of deep concern about TrueNAS/FreeNAS and a move that seems a little irresponsible with regard to quality and testing.

I have three virtualization pools that have relied on FreeNAS for years. One has been running since 2013, another since 2014, and the newest since 2016.

On the 2014 pool, we updated from FreeNAS 11.3-U5 to TrueNAS 12.0-RELEASE three weeks ago, on November 20th. Suddenly we started to discover extreme corruption in the VMs' XFS filesystems. Everything was getting corrupted, including the filesystem superblocks, to the point where xfs_repair could not recover them.

We blamed everyone. We blamed the hypervisor, in this case oVirt 4.4. We blamed the fabric and the network, since this pool uses a Cisco Catalyst 2960X as its core switch, which is not ideal. We blamed the XFS filesystem, due to issues with writeback mode. We didn't even consider blaming TrueNAS. I even opened a discussion on the hypervisor mailing list, but nothing conclusive was found: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2DVB4ULURXWJ5VGHX64FDUZW27F7DY3J

So for the next few days we mainly blamed the network, since some packets were being dropped on the switch. We concluded that the load on the environment had, for whatever reason, increased, and that the drops could be causing the issue. Someone on the mailing list recommended falling back from NFSv4 to NFSv3 for VM storage, because of weird things happening under load. We tried it; the situation improved, but the issue kept happening.

This Monday we had a maintenance window on the 2013 and 2016 pools, so it was upgrade time. We upgraded both pools to TrueNAS 12.0-RELEASE; 12.0-U1 wasn't available yet.

Everything went fine... but this Thursday the mail server on the 2013 pool went down when its iSCSI disk disconnected due to I/O errors. We looked into what had happened, and the result was that the VM was completely trashed: corruption in the filesystem, the operating system, the services, and the databases that held the mailboxes. Other VMs, such as a webserver, are completely trashed too. So it's a disaster scenario.

On the 2016 pool, I've already detected in-place XFS corruption in one VM. As a safety measure, everything was shut down.

So what happened?

All three pools have different equipment and software; the only common denominator is the storage system, which ranged from FreeNAS 11.1 to 11.3. The hypervisors are mixed: oVirt 4.3, oVirt 4.4 and XenServer 7.2. Two of them use iSCSI as the storage backend and one uses NFSv3. The hardware is completely different as well. So, as you can see, TrueNAS is the only piece that's the same.

For now, I've upgraded everything from 12.0-RELEASE to 12.0-U1, in the hope that this will fix these issues.

I don't have any artifact with which to blame FreeNAS/TrueNAS; all I have is my word about what happened on those pools. I never had any issue with FreeNAS/TrueNAS in almost 8 years of running it, but this move to 12.0 seems to have been rushed by iXsystems. TrueNAS generates no logs, no errors, and no health issues on the zpools, nothing, which leads me to believe that the software is in a silently unstable state.

I don't have any options right now: I can't downgrade to 11.3-U5/6/7/etc., since the zpool was upgraded on all three systems. But there's one thing where iX dropped the ball with regard to trusting it to release properly stable versions of TrueNAS.

After the upgrade I noticed that 12.0-RELEASE was built with RC (Release Candidate) code:

OpenZFS 2.0 hadn't even been released yet, which led to confusion. When 12.0-RELEASE was announced, I understood that OpenZFS 2.0 had been released alongside it, but that seems not to be the case, since the announcement was made two days ago, on December 10th: https://www.ixsystems.com/blog/openzfs-2-on-truenas

What did we get running on 12.0-RELEASE?

An OpenZFS snapshot from October 2nd. This is not STABLE at all...

In 12.0-U1 we got the properly released OpenZFS version and a non-RC FreeBSD 12 system, as we would expect from a RELEASE release.

Yeah, so... given the evidence, I cannot conclude anything other than this: 12.0-RELEASE is not STABLE, and it should not have been marketed as STABLE in the first place. Even after upgrading, 12.0-U1 is still marked as a development branch that should not be used in production: https://jira.ixsystems.com/browse/NAS-108580. Yes, it may seem to be a cosmetic bug, but for paying customers TrueNAS 12 isn't available yet. So this whole TrueNAS Core thing leads to extreme confusion. There are clearly two separate branches, the open-source release and the one that iX ships, which is fine, but this should be explained better.

For now I don't even know whether 12.0-U1 will solve the reported issues, or whether 12.0-U1 will be considered stable. Because it's not.

Regarding the original issue, I'm fairly confident that the problems were a consequence of running 12.0-RELEASE. People can blame me for "upgrading it too early" or say "you should have paid for support since your environment is critical", or other nonsense like "you probably don't know how to build proper ZFS systems". But the reality is that none of that applies to this situation.

I know that iX is not responsible for this; this is FOSS software, delivered "as is". This is just an alert to keep running FreeNAS 11.3-U5/6/7/etc. until things get really stable on the 12.0 branch. Keep an eye on the paying customers and watch for when they receive the update; I've read somewhere that this release will be on December 22nd. We hope it will be stable, so people can have a proper Christmas and a good new year.

Thanks for listening.

PS: If there's any artifact that I can generate to help investigate further, I'm totally willing to do it, but I don't know what I could provide. All three pools have now been upgraded to 12.0-U1.
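
One possible artifact of the kind the PS asks about is sketched below. It is not part of the original report. It is a minimal Python sketch that records SHA-256 checksums of files inside a guest and compares them later, which might at least show which files changed without being written to. The function names (build_manifest, diff_manifests) and the assumption that the checked data is otherwise idle are hypothetical, and the sketch does not identify the cause of any corruption.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def diff_manifests(before, after):
    """Return (changed, missing): sorted relative paths whose digest differs
    between the two manifests, and paths present only in the earlier one."""
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    missing = sorted(p for p in before if p not in after)
    return changed, missing
```

Usage would be to call build_manifest on a directory of data that is not expected to change, keep the result, call it again later, and pass both results to diff_manifests; any path reported as changed would be a candidate for closer inspection.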

Problem/Justification

None

Impact

None

Activity

Ryan Moeller December 15, 2021 at 7:36 PM

We're not in a position to revive the async CoW/DMU work at this time.

Ryan Moeller April 9, 2021 at 3:28 PM

Pushing this work to 12.1 for now; there are still issues with the patch, and it's not something I'm comfortable maintaining at this time.

Ryan Moeller April 9, 2021 at 3:24 PM

This issue does not affect 2.1; you are encountering something unrelated.

Paul Veloso April 9, 2021 at 9:55 AM

I can also attest to this having occurred to me with a GELI encrypted pool.

If anyone cares to read about the situation, please see https://www.truenas.com/community/threads/help-upgrade-from-freenas-11-3-u5-to-truenas-12-0-u2-1-attempted-to-upgrade-geli-encrypted-pool-it-failed-cannot-unlock-nor-import-pool.92023/

In short, all the uberblocks on ALL drives are now invalid.

Ryan Moeller February 16, 2021 at 5:26 PM

We are investigating issues with the async CoW work. The commits have been backed out for the time being and will be included again when we have worked out the issues.

Future Consideration

Created January 15, 2021 at 9:36 PM
Updated July 1, 2022 at 4:58 PM
Resolved December 15, 2021 at 7:36 PM