ZFS space efficiency on devices with huge physical blocks

Description

I have a total of 8x 15 TB Intel U.2 NVMe drives (13.97 TiB in TrueNAS).
I created a RAIDZ2 pool with 7 drives + 1 spare.
After pool creation, TrueNAS CORE 13.0-U1.1 shows only 32.3T of available space.
This pool size calculation is off and does not match the size of the disks; I'd expect around 65 TiB.

On TrueNAS SCALE, the pool size is correct: 65 TiB.

I reached out to Kris Moore on Discord and sent him some logs; he agreed that there might be something wrong.

Problem/Justification

None

Impact

None


Activity


Alexander Motin September 26, 2022 at 2:05 PM

The patch is included in the upcoming OpenZFS 2.1.6. If possible, it will be included in SCALE 22.12-BETA2; if not, in the following RC1.

Alexander Motin August 24, 2022 at 6:36 PM

This PR should make ZFS prefer space efficiency over some performance in cases like this:

Alexander Motin August 23, 2022 at 7:03 PM

In the product briefs I see that those NVMes are read-optimized, and the specs measure performance and endurance in terms of 64KB blocks. So it may be that they indeed prefer writes of that size. If that is true, then ZFS and TrueNAS CORE did exactly what they were designed to do. With the sysctl above you can override that if you think it is right, but then I suppose write endurance or performance may suffer; I have no idea by how much. If this will be write-once-read-many storage, you should be fine, since ZFS writes data sequentially when possible; otherwise, once your pool gets fragmented over time, performance may suffer badly.

Alexander Motin August 23, 2022 at 6:37 PM

It was surprising at first, but I think I have an answer. I see that the kernel reports a “stripe size” (AKA physical sector size) of 64KB for those NVMes. Seeing that, ZFS happily increases its ashift value to its maximum of 16 (64KB), which makes RAIDZ2 extremely space inefficient; that is probably why you see such a small projected capacity. The question is why the NVMe driver in the kernel reports such a big stripe size.
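
For a concrete picture, here is a rough sketch of the RAIDZ allocation rule in Python (an approximation for illustration, not the actual OpenZFS code; the helper name raidz_asize is just illustrative), applied to a default 128 KiB record on a 7-disk RAIDZ2:

    # Sketch of RAIDZ allocation: a record is split into ashift-sized
    # sectors, parity sectors are added per stripe row, and the total is
    # padded up to a multiple of (nparity + 1) sectors.
    def raidz_asize(lsize, ndisks, nparity, ashift):
        sector = 1 << ashift
        data = -(-lsize // sector)               # ceil(lsize / sector)
        rows = -(-data // (ndisks - nparity))    # stripe rows needed
        total = data + rows * nparity            # data + parity sectors
        pad = nparity + 1
        return -(-total // pad) * pad * sector   # pad and convert to bytes

    record = 128 * 1024                          # default recordsize
    for ashift in (12, 16):
        alloc = raidz_asize(record, ndisks=7, nparity=2, ashift=ashift)
        print(f"ashift={ashift}: {alloc // 1024} KiB allocated for a "
              f"128 KiB record ({record / alloc:.0%} efficient)")
    # ashift=12: 192 KiB allocated for a 128 KiB record (67% efficient)
    # ashift=16: 384 KiB allocated for a 128 KiB record (33% efficient)

At roughly one-third efficiency instead of two-thirds, about half of the expected usable space disappears, which lines up with the 32.3T vs 65 TiB figures above.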

Could you please provide the outputs of the nvmecontrol identify nvme1, nvmecontrol identify nvme1ns1, nvmecontrol identify -xv nvme1, and nvmecontrol identify -xv nvme1ns1 commands? I suppose the devices report a Preferred Write Granularity or Preferred Write Alignment of that big size.

To work around the issue, you should be able to set sysctl vfs.zfs.max_auto_ashift=12 before the pool creation.
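
As a rough sanity check (assuming ZFS projects usable capacity as if everything were written in 128 KiB blocks, and reusing the 192 KiB and 384 KiB allocations from the sketch above), ashift=12 should bring the projection back in line with what SCALE reports:

    # Approximate projected usable capacity for 7x 13.97 TiB drives in
    # RAIDZ2, scaling raw size by the per-record efficiency for each ashift.
    raw_tib = 7 * 13.97
    print(f"ashift=16: ~{raw_tib * 128 / 384:.1f} TiB")   # ~32.6 TiB
    print(f"ashift=12: ~{raw_tib * 128 / 192:.1f} TiB")   # ~65.2 TiB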

Michelle Johnson August 17, 2022 at 12:23 PM

Thank you for your report!

This issue ticket is now in the queue for review. An Engineering representative will update with further details or questions in the near future.

Complete

Details


Impact

Critical



Created August 16, 2022 at 9:03 PM
Updated February 27, 2025 at 10:23 PM
Resolved September 26, 2022 at 2:05 PM