Description
Steps to Reproduce
Expected Result
Actual Result
Environment
Hardware Health
Error Message (if applicable)
Attachments
Activity
Dink Nasty February 12, 2024 at 4:07 AM
I have tried expanding the partitions manually using parted, but the way the TrueNAS replace operation formatted the drive, it seems to have created the zfs data partition first and then added a second partition after it. I am not able to expand the data partition because I get an error along the lines of “can’t have overlapping partitions.” I don’t understand why the replace operation would create them in this order when my original disks were formatted with the swap(?) partition first and the zfs data partition after it.
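For reference, the kind of inspection and attempted grow described above looks roughly like this at the shell (a sketch only; /dev/sdX and the partition number are placeholders, not the actual devices on this system):

  # Show the partition layout and where the free space sits
  lsblk -o NAME,SIZE,TYPE,FSTYPE,PARTUUID /dev/sdX
  parted /dev/sdX unit s print free

  # Growing with parted only works if the free space sits immediately after
  # the zfs data partition; with another partition in the way, this fails
  # with the "overlapping partitions" error mentioned above
  parted /dev/sdX resizepart 1 100%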
I guess I will have to give manual replacement a try in CLI.
See my previous post for image of lsblk output.
Thanks all for your help.
Dan Brown February 11, 2024 at 10:54 PM (Edited)
Jason, you could always adjust the partitions manually. See:
Or do the drive replacement manually at the CLI; see:
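For anyone following along, a manual CLI replacement is broadly along these lines (a sketch under assumptions: the pool name, device names, and the single-data-partition layout are placeholders, and the partition scheme should be matched to the other members of the vdev):

  # Give the new disk a fresh GPT with one zfs data partition spanning the drive
  sgdisk -Z /dev/sdX
  sgdisk -n 1:0:0 -t 1:BF01 /dev/sdX
  partprobe /dev/sdX

  # Look up the new partition's partuuid, then swap it in for the old member
  lsblk -o NAME,SIZE,PARTUUID /dev/sdX
  zpool replace <pool> <old-member-or-guid> /dev/disk/by-partuuid/<new-partuuid>
  zpool status <pool>    # watch the resilver complete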
Jason DeWeese February 11, 2024 at 10:13 PM (Edited)
Is there another, more manual workaround we can use for now, or are we just stuck at the lower capacity? I also thought I saw the PR tagged with the SCALE-23.10.2 release on GitHub. Will we need to upgrade to 24 to get this fixed? Thank you
Vladimir Vinogradenko February 11, 2024 at 9:35 PM
This behavior will only be changed in 24.10. Since 24.10, every time you replace a drive, it will be formatted up to its maximum capacity (not the lowest drive’s capacity in the pool, as it was before). So you’ll need to replace one of your 6TB drives with a 3TB drive, then replace it back, then repeat this for all 6TB drives.
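In outline, that per-drive round trip looks something like the following (a sketch only; pool and device names are placeholders, each resilver has to finish before the next step, and on 24.10+ the replace-back would normally go through the TrueNAS replace workflow so the new partition is created at full size):

  # Repeat for each 6TB member, one at a time
  zpool replace <pool> <6tb-member> <temporary-3tb-disk>
  zpool status <pool>      # wait for the resilver to finish
  zpool replace <pool> <temporary-3tb-disk> <6tb-disk>
  zpool status <pool>      # wait again, then move on to the next drive

  # Afterwards, check that the vdev picked up the extra capacity
  zpool list -v <pool>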
Dink Nasty February 10, 2024 at 11:02 PM
Sorry for my ignorance but I’m a bit confused on how this fix will be implemented.
Running TrueNAS-SCALE-23.10.0.1
I replaced 6x3TB drives with 6x6TB drives and am having this issue. Will I have to switch back to my 3TB drives, resilvering one at a time, install 23.10.2, and then resilver back to my 6TB drives? Or will I be able to keep my 6TB drives in, update TrueNAS, and hit the Expand button in the GUI to fix the partitions?
Thanks in advance.
I have encountered what I believe to be a bug when attempting to grow a vdev by replacing the existing drives with larger versions. I’ve documented the saga in the forums and have solicited help, but haven’t made any progress. You can find that post here, but I’ll reproduce the salient points directly in the bug.
The system in question is home-built. Supermicro X11SSH-LN4F motherboard, E3-1275v6 processor, 64GB ECC RAM, LSI 9211-8i HBA in IT mode. This system and the drives have been incrementally upgraded over the years, with the storage pool’s history dating back to FreeNAS 9. The storage pool in question, named Tier3, consists of 6 3TB SATA drives connected to the motherboard’s SATA controller and 6 4TB SAS drives connected to the 9211, each configured as a RAID-Z2 and then aggregated into a single pool. There’s nothing overwhelmingly unique about this pool - it is currently sitting at around 93% usage (hangs head in shame).
After getting a couple of SMART errors on one of the 3TB drives and realizing that those drives are pushing 10 years old with more than 80K hours, I figured it was time to fix my capacity issue and proactively replace drives. I purchased 6 20TB Exos drives for this undertaking. The drives were tested using the usual process - SMART short, SMART conveyance, badblocks, and then SMART long. No errors were noted after the process completed (almost 10 days later). I then set about removing one drive at a time, installing a new drive, and resilvering, repeating for a total of 6 times. I then eagerly checked my pool capacity to find that… there was no change. I confirmed that autoexpand was on - it was.
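For context, the checks at that point were along these lines (a sketch; Tier3 is the pool named in this report, /dev/sdX is a placeholder, and the badblocks flags shown are one common destructive burn-in choice rather than the exact invocation used):

  # Burn-in per new drive (the badblocks pass is a destructive write test)
  smartctl -t short /dev/sdX
  smartctl -t conveyance /dev/sdX
  badblocks -wsv /dev/sdX
  smartctl -t long /dev/sdX

  # After the last resilver: is auto-expansion on, and is untapped space reported?
  zpool get autoexpand Tier3
  zpool list -v Tier3    # per-vdev SIZE and EXPANDSZ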
I figured something was confused and I would manually extend via the GUI. That failed abruptly with this error:
The pool remained OK until I rebooted - at that point, I lost a drive. The partuuid still exists, so I tried a zpool online, which failed. I went through a complete wipe to zero of the confused drive, did a replace and resilver, and tried several other things suggested in various forum posts - offlining each drive and then onlining it with the -e flag, exporting and importing the pool, etc. Each step is documented in the forum post I referenced above, along with data from lsblk, zpool list, etc. Nothing has improved matters.
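For reference, those recovery attempts map to roughly these commands (a sketch; device identifiers are placeholders, the zero-wipe is destructive, and export/import would normally be driven from the TrueNAS UI rather than raw zpool):

  # Try to bring the missing member back online and expand it in place
  zpool online -e Tier3 /dev/disk/by-partuuid/<partuuid>

  # Offline/online each member with -e, one at a time, only while the pool is healthy
  zpool offline Tier3 <member>
  zpool online -e Tier3 <member>

  # Export and re-import the pool
  zpool export Tier3
  zpool import Tier3

  # Zero the confused drive, repartition it as in a manual replacement,
  # then replace and resilver it again
  dd if=/dev/zero of=/dev/sdX bs=1M status=progress
  zpool replace Tier3 <old-member-guid> /dev/disk/by-partuuid/<new-partuuid>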
I did observe that the partition on the drive that goes missing after hitting Expand does get properly grown to fill the drive and still shows as a zfs member. I suspect the issue is somewhere in the partition resize process.
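A quick way to double-check that observation (a sketch; the device and partition names are placeholders):

  # Confirm the resized partition fills the disk and still carries ZFS metadata
  lsblk -o NAME,SIZE,FSTYPE,PARTUUID /dev/sdX
  blkid /dev/sdX1
  zdb -l /dev/sdX1    # prints the vdev labels if the member is still intact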
At this point, I have ordered an additional 6 drives. My intention, unless anyone has a better idea, is to bring those new drives online after another testing/burn-in process, create a new pool, replicate my data over, destroy the old pool and wipe the drives, then bring the 6 currently-in-use 20TB drives up as a second vdev in the new pool. This should (I hope) resolve the issue but does result in quite a bit of time consumed.
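In outline, that migration could look something like this (a sketch under assumptions: NewPool and the disk identifiers are placeholders, a recursive snapshot of everything on Tier3 is assumed, and pool creation and the later vdev add would normally go through the TrueNAS UI so it handles partitioning):

  # Replicate the old pool into the new one
  zfs snapshot -r Tier3@migrate
  zfs send -R Tier3@migrate | zfs receive -F NewPool

  # After verifying the copy, retire the old pool and reuse its disks as a second vdev
  zpool destroy Tier3
  zpool add NewPool raidz2 <disk1> <disk2> <disk3> <disk4> <disk5> <disk6>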
If this reaches someone in the near future, I’ll gladly provide any additional information or debugging that I can.