Improve NVMe timeout handling

Description

Chassis: Supermicro AS-1113S-WN10RT
Mainboard: H11SSW-NT
Disk drives: 6x Intel SSDPE2KX010T8

With FreeNAS 11.2-U3, as soon as there are more than 4 of these drives in the system, any moderate write load on the drives leads to errors like this:

Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
Apr 12 13:42:49 freenas01 nvme6: resetting controller
Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
Apr 12 13:43:35 freenas01 nvme6: resetting controller
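
For anyone trying to compare affected systems, a rough way to count aborts and resets per controller from such log lines could look like the sketch below. It assumes the syslog layout shown in the excerpt above; the names `LINE_RE` and `summarize` are made up for this illustration and are not part of any FreeNAS tooling.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
LOG = """\
Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
Apr 12 13:42:49 freenas01 nvme6: resetting controller
Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
Apr 12 13:43:35 freenas01 nvme6: resetting controller
"""

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> nvme<N>: <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+ [\d:]{8} \S+ (nvme\d+): (.*)$")


def summarize(log_text):
    """Return (aborts, resets) Counters keyed by controller name."""
    aborts = Counter()
    resets = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not an nvme driver message in the assumed layout
        controller, message = match.groups()
        if message.startswith("ABORTED"):
            aborts[controller] += 1
        elif message == "resetting controller":
            resets[controller] += 1
    return aborts, resets


if __name__ == "__main__":
    aborts, resets = summarize(LOG)
    for controller in sorted(set(aborts) | set(resets)):
        print(controller, "aborted commands:", aborts[controller],
              "controller resets:", resets[controller])
```

In practice the text would come from /var/log/messages rather than an embedded string.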

In a discussion on freebsd-stable we came to suspect that the NVMe driver in FreeBSD 11 misses completion interrupts issued by the device when finishing a task and then runs into timeouts.

This leads to the system becoming unresponsive.

Tests with plain FreeBSD without FreeNAS show that 11-STABLE does exhibit the problem while 12-STABLE doesn't.

All hardware components have the latest BIOS/firmware as provided by the vendor.

There have been substantial changes in the NVMe subsystem in FreeBSD >= 12, initially targeting endianness problems on e.g. Sparc64, but some code to specifically deal with missed interrupts was added to nvme_timeout() in nvme_qpair.c. With a 12-STABLE kernel my system logs a "Missing interrupt" every half an hour or so under synthetic write load, but otherwise runs stably. An 11-STABLE system hangs seconds after I start my "dd" jobs.
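
To illustrate the difference in behaviour, here is a toy model of the idea, not the actual driver code: on a timeout, first check whether the device has already posted a completion whose interrupt was lost, and only abort and reset if it has not. The class `ToyQueuePair`, the function `on_timeout` and the `poll_first` switch are all hypothetical names invented for this sketch; the real logic lives in C in nvme_qpair.c and differs in detail.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueuePair:
    """Hypothetical stand-in for an NVMe submission/completion queue pair."""
    outstanding: set = field(default_factory=set)       # command ids still tracked by the driver
    completed_unseen: set = field(default_factory=set)  # finished by the device, interrupt lost
    log: list = field(default_factory=list)

    def process_completions(self):
        """Reap completions the device has already posted; return the count."""
        done = self.completed_unseen & self.outstanding
        self.outstanding -= done
        self.completed_unseen -= done
        return len(done)


def on_timeout(qpair, cid, poll_first):
    """Handle a timeout for command `cid`.

    poll_first=True models the assumed newer behaviour (check for a missed
    interrupt first); poll_first=False models going straight to abort/reset.
    """
    if poll_first:
        qpair.process_completions()
        if cid not in qpair.outstanding:
            # The command had in fact completed; only the interrupt was missed.
            qpair.log.append("Missing interrupt")
            return "recovered"
    qpair.log.append("aborting outstanding i/o")
    qpair.log.append("resetting controller")
    qpair.outstanding.discard(cid)
    return "reset"


if __name__ == "__main__":
    for poll_first in (True, False):
        qp = ToyQueuePair(outstanding={117}, completed_unseen={117})
        print(poll_first, on_timeout(qp, 117, poll_first), qp.log)
```

With polling enabled the model only logs "Missing interrupt" and carries on, which matches what I see on 12-STABLE; without it, the same situation ends in an abort and a controller reset.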

More details can be found in the added links.

Kind regards,
Patrick

Problem/Justification

None

Impact

None

Activity

Alexander Motin May 15, 2019 at 6:32 PM

I propose creating another linked ticket for further work on this if you have useful input. This ticket is already marked done, with the partial workaround provided above.

Patrick M. Hausen May 14, 2019 at 2:06 PM

Just had a crash with that setup on reboot of a bhyve VM. I'll revert to just 4 NVMe drives and 11.2-U4.1 ... will let you know if that works.

Patrick M. Hausen May 6, 2019 at 5:52 PM

The system seems to be working now. Just for documentation purposes:

  • Install FreeNAS 11.2-U3

  • Install a FreeBSD 12-STABLE kernel over the FreeNAS installation (new boot environment)

  • Install a FreeBSD 12-STABLE world the same way

  • Build and install freenas_sysctl.ko for FreeBSD 12

  • Add to loader.conf: zfs_load, if_tap_load, if_bridge_load, if_bnxt_load (the last one is necessary for my HW in any case)

Finally, since the syntax for bhyve(8) changed slightly, apply this patch:

root@freenas01[/]# diff /usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py.orig /usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py
229c229
< args += ['-s', '29,fbuf,vncserver,tcp={}:{},w={},h={},{},{}'.format(vnc_bind, vnc_port, width,
---
> args += ['-s', '29,fbuf,tcp={}:{},w={},h={},{},{}'.format(vnc_bind, vnc_port, width,
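
If someone wants to keep one vm.py working on both kernels, the change could in principle be made conditional instead of patched in place. The following is only a sketch of that idea: the function `fbuf_slot_args` and the `needs_vncserver_token` flag are hypothetical and do not exist in middlewared, and the trailing options of the original format string are left out because the diff line above is cut off after `width,`.

```python
def fbuf_slot_args(vnc_bind, vnc_port, width, height, needs_vncserver_token):
    """Build the bhyve '-s' argument pair for the framebuffer device in slot 29.

    needs_vncserver_token=True reproduces the string from the unpatched
    vm.py line ('fbuf,vncserver,tcp=...'); False reproduces the patched
    form that worked for me on the 12-STABLE kernel and world.
    """
    parts = ["29", "fbuf"]
    if needs_vncserver_token:
        parts.append("vncserver")
    parts.append("tcp={}:{}".format(vnc_bind, vnc_port))
    parts.append("w={}".format(width))
    parts.append("h={}".format(height))
    return ["-s", ",".join(parts)]


if __name__ == "__main__":
    # Example values only; bind address, port and resolution are placeholders.
    print(fbuf_slot_args("0.0.0.0", 5900, 1024, 768, needs_vncserver_token=False))
    print(fbuf_slot_args("0.0.0.0", 5900, 1024, 768, needs_vncserver_token=True))
```

How the flag would be decided (for example from the running kernel version) is deliberately left open here.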

No more missed interrupts. Dashboard does not quite work, but netdata looks OK. VMs work. Lots of alerts about new ZFS feature flags, but I won't upgrade my pool, of course.

Kind regards,
Patrick

Alexander Motin May 2, 2019 at 7:18 PM

"Unfortunately this seems to be a more or less abandoned branch"

Yes, it was. Initial FreeNAS 12 development started in another repo with different build infrastructure. I recreated the OS freenas/12-stable branch yesterday, and we'll start gradually putting all the parts together, but there is too much activity going on at the same time now, so I can't recommend it even for initial testing. 12 is our future; it is just still too distant.

Patrick M. Hausen April 30, 2019 at 8:58 AM

OK, and now after I installed a 12-STABLE world over the FreeNAS installation some system utilities like netstat even stopped dumping core ... hail to boot environments!

Complete

Created April 15, 2019 at 1:17 PM
Updated July 1, 2022 at 4:32 PM
Resolved April 24, 2019 at 5:17 PM