Data corruption after TrueNAS upgrade

Description

Copied and pasted from the TrueNAS forums: https://www.truenas.com/community/threads/freenas-now-truenas-is-no-longer-stable.89445

Subject: FreeNAS (now TrueNAS) is no longer stable

Hello, I'm writing this with deep concerns about TrueNAS/FreeNAS and a move that seems somewhat irresponsible with regard to quality and testing.

I have three virtualization pools that have relied on FreeNAS for years. One has been running since 2013, another since 2014, and the newest since 2016.

On the 2014 pool, we updated from FreeNAS 11.3-U5 to TrueNAS 12.0-RELEASE three weeks ago, on November 20th. Suddenly we started to discover extreme VM corruption within the XFS filesystems. Everything was getting corrupted, including the filesystem superblocks, to the point that xfs_repair could not recover them.

We blamed everyone: the hypervisor (oVirt 4.4 in this case); the fabric and network, since this pool uses a Cisco Catalyst 2960X as its core, which is not ideal; and the XFS filesystem, due to issues in writeback mode. We didn't even consider blaming TrueNAS. I even opened a discussion on the hypervisor mailing list, but nothing conclusive was found: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2DVB4ULURXWJ5VGHX64FDUZW27F7DY3J

So for the next few days we mainly blamed the network, since some packets were being dropped on the switch. We concluded that the load on the environment had, for whatever reason, increased and that the drops could be causing the issue. Someone on the mailing list recommended falling back to NFSv3 for VM storage instead of NFSv4, due to weird things happening under load. We tried it; the situation improved, but the issue kept happening.

This Monday we had maintenance on the pools from 2013 and 2016, so it was upgrade time. We upgraded both pools to TrueNAS 12.0-RELEASE; 12.0-U1 wasn't available yet.

Everything went fine... but on Thursday the mail server on the 2013 pool went down when its iSCSI disk disconnected due to I/O errors. When we looked into what had happened, we found the VM completely trashed: corruption in the filesystem, the operating system, the service and the databases that held the mailboxes. Other VMs, such as a web server, are completely trashed too. So it's a disaster scenario.

Regarding the pool from 2016, I've already detected in-place XFS corruption in one VM. As a safety measure, everything was shut down.

So what happened?

All three pools have different equipment and software; the only common denominator is the storage system, which had been running FreeNAS 11.1 through 11.3. The hypervisors are mixed: oVirt 4.3, oVirt 4.4 and XenServer 7.2; two of them use iSCSI as the storage backend and one uses NFSv3. The hardware is completely different as well. So, as you can see, TrueNAS is the only piece in common.

For now, I've upgraded everything from 12.0-RELEASE to 12.0-U1, in the hope that this will fix these issues.

I don't have any artifact with which to blame FreeNAS/TrueNAS; the only thing I have is my word about what happened on those pools. I never had any issue with FreeNAS/TrueNAS in almost 8 years of running it, but this move to 12.0 seems to have been rushed by iXsystems. There are no logs generated within TrueNAS, no errors, no health issues on the zpools, nothing, which leads me to believe that the software is in a silently unstable state.

I don't have any options right now: I can't downgrade back to 11.3-U5/6/7/etc., since the zpool was upgraded on all three systems. But there's one thing that damaged my trust in iX releasing properly stable versions of TrueNAS.

After the upgrade I noticed that 12.0-RELEASE was built with RC (Release Candidate) code:

Last login: Tue Dec 8 17:24:17 2020
FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS
TrueNAS (c) 2009-2020, iXsystems, Inc.
All rights reserved.
TrueNAS code is released under the modified BSD license with some files copyrighted by (c) iXsystems, Inc.
For more information, documentation, help or support, go here: http://truenas.com
FreeBSD freenas.win.versatushpc.com.br 12.2-RC3 FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS amd64

OpenZFS 2.0 hadn't even been released yet, which led to confusion. When 12.0-RELEASE was announced I understood that OpenZFS 2.0 had been released alongside it, but this seems not to be the case, since the announcement was made two days ago, on December 10th: https://www.ixsystems.com/blog/openzfs-2-on-truenas

What did we get running on 12.0-RELEASE?

root@freenas:~ # pkg info | grep -i zfs
beadm-1.4 Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9 bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020100200 OpenZFS userland for FreeBSD
openzfs-kmod-2020100200 OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202008212020 Python libzfs bindings
py38-zettarepl-0.1_24 Cross-platform ZFS replication solution

OpenZFS snapshot from October 2nd. This is not STABLE at all...
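The snapshot date can be read straight out of the package version string. Below is a minimal sketch of that reading. It assumes the openzfs version suffix is formatted as YYYYMMDD followed by a two-digit counter; that is an interpretation of the output above, not a documented convention, and the function name is hypothetical.

```python
import re
from datetime import date

# Assumption: "openzfs-2020100200" means snapshot 2020-10-02, counter 00.
_VERSION_RE = re.compile(r"^(openzfs(?:-kmod)?)-(\d{4})(\d{2})(\d{2})(\d{2})$")


def openzfs_snapshot_dates(pkg_info_text):
    """Map each openzfs package name to the date encoded in its version."""
    dates = {}
    for line in pkg_info_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = _VERSION_RE.match(fields[0])  # first column is name-version
        if match:
            name, year, month, day, _counter = match.groups()
            dates[name] = date(int(year), int(month), int(day))
    return dates


sample = """\
beadm-1.4 Solaris-like utility to manage Boot Environments on ZFS
openzfs-2020100200 OpenZFS userland for FreeBSD
openzfs-kmod-2020100200 OpenZFS kernel module for FreeBSD
"""
for name, snapshot in openzfs_snapshot_dates(sample).items():
    print(name, snapshot.isoformat())
```

Running the same function over the 12.0-U1 output would, under the same assumption, give a snapshot date of December 1st, 2020.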

In 12.0-U1 we got the properly released OpenZFS version and a non-RC FreeBSD 12 system, as we would expect from a RELEASE release.

FreeBSD freenas.win.versatushpc.com.br 12.2-RELEASE-p2 FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS amd64
root@freenas:~ # pkg info | grep -i zfs
beadm-1.4 Solaris-like utility to manage Boot Environments on ZFS
iohyve-0.7.9 bhyve manager utilizing ZFS and other FreeBSD tools
openzfs-2020120100 OpenZFS userland for FreeBSD
openzfs-kmod-2020120100 OpenZFS kernel module for FreeBSD
py38-libzfs-1.0.202011201432 Python libzfs bindings
py38-zettarepl-0.1_27 Cross-platform ZFS replication solution

Yeah, so... given the evidence, I cannot conclude anything other than this: 12.0-RELEASE is not STABLE. It should not have been marketed as STABLE in the first place. Even after upgrading, 12.0-U1 is still marked as a development branch that should not be used in production: https://jira.ixsystems.com/browse/NAS-108580. Yes, it may seem to be a cosmetic bug, but for paying customers TrueNAS 12 isn't available yet. So this whole TrueNAS Core thing leads to extreme confusion. There are clearly two separate branches, the open source release and the one that iX ships, which is fine, but this should be explained better.

For now I don't even know if 12.0-U1 would solve the reported issues, and if 12.0-U1 will be considered stable. Because it's not.

Regarding the original issue, I'm fairly confident that the problems were a consequence of running 12.0-RELEASE. People can blame me for "upgrading too early" or say "you should have paid for support since your environment is critical", or other nonsense like "you probably don't know how to build proper ZFS systems". But the reality is that none of these applies to the situation.

I know that iX is not responsible for this; this is FOSS software, delivered "as is". This is just an alert to keep running FreeNAS 11.3-U5/6/7/etc. until things get really stable on the 12.0 branch. Keep an eye on the paying customers and watch when they receive the update; I've read somewhere that this release will be on December 22nd. We hope it will be stable, so people can have a proper Christmas and a good new year.

Thanks for listening.

PS: If there's any artifact that I can generate to help investigate further, I'm totally willing to do it, but I don't know what I could provide. All three pools have now been upgraded to 12.0-U1.

Problem/Justification

None

Impact

None

Activity

William Gryzbowski January 22, 2021 at 12:26 PM

20.12-ALPHA is likely affected by this as well.

Ilkka Myller January 22, 2021 at 9:29 AM

Related to this: Do you expect TrueNAS SCALE 20.12-Alpha to be affected by this same async CoW issue? Or is this clearly limited only to OpenZFS 2.0 under BSD kernel?

Michele Noberasco January 21, 2021 at 4:05 PM

Some of my VMs run on storage served by FreeNAS over NFS but, as I wrote, after fsck-ing them they seem fine.

I have, however, seen corruption in database files stored on ZFS and served by FreeNAS over SMB. These have already been recovered and temporarily moved to other storage. I guess I can attempt to move them back to FreeNAS and see what happens now that I have upgraded to 12.0-U1.1.

I also observed corruption on boot pools after upgrading from FreeNAS 11 to TrueNAS Core 12. I recovered from this by reinstalling from scratch and importing the configuration file.

Alexander Motin January 21, 2021 at 2:22 PM

I was able to identify the cause of the corruption. It appeared to be read data corruption, happening when a block absent from ARC is read and partially rewritten at exactly the same time. As Kris said, such an access pattern is very unlikely for simple file sharing and is observed more with VMs, file systems and possibly databases that are unaware of, and do not align to, the underlying ZFS block/record size, and so do a lot of random, misaligned and uncached I/O.
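A minimal sketch of what "misaligned" means for a single write, assuming a fixed record size of 128 KiB (the default ZFS recordsize for datasets; zvols use volblocksize instead, and the real value depends on the pool configuration). The function name is hypothetical. It only identifies which records a write partially covers; it does not reproduce the bug or model the ARC.

```python
def partially_rewritten_records(offset, length, record_size=128 * 1024):
    """Return indices of records that a write touches but does not fully cover.

    Only the first and last records of a write can be partially covered;
    every record in between is overwritten in full.
    """
    if length <= 0:
        return []
    end = offset + length  # exclusive end of the written byte range
    first = offset // record_size
    last = (end - 1) // record_size
    partial = []
    for idx in (first, last):
        rec_start = idx * record_size
        rec_end = rec_start + record_size
        fully_covered = offset <= rec_start and end >= rec_end
        if not fully_covered and idx not in partial:
            partial.append(idx)
    return partial


# An aligned, full-record write leaves no partial records.
print(partially_rewritten_records(0, 128 * 1024))         # []
# A 4 KiB guest filesystem write inside one record is a partial rewrite.
print(partially_rewritten_records(4096, 4096))            # [0]
# An 8 KiB write straddling a record boundary partially rewrites two records.
print(partially_rewritten_records(124 * 1024, 8 * 1024))  # [0, 1]
```

A guest filesystem such as XFS issuing small writes at arbitrary offsets produces partial rewrites like the last two cases almost constantly, which fits the observation that VM workloads were the ones affected.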

Kris Moore January 21, 2021 at 1:42 PM

Thus far we've not seen any sort of corruption in the SMB use-case, or 'big shared storage' in your terms :)

The access pattern was very specific and unlikely to occur with traditional file-based workloads. The thing that seemed to trigger it was running another filesystem on top of ZFS. So if none of the files you hosted on SMB were file-based backing for VMs, it would be extremely unlikely that anything in that case was harmed.

Complete

Created December 12, 2020 at 8:45 PM
Updated July 1, 2022 at 4:59 PM
Resolved January 15, 2021 at 10:09 PM