You can always create a new ticket and we link them if indeed related.
Patrick M. Hausen November 21, 2019 at 4:03 PM
Hi!
I tried to create a linked ticket with the new info, as suggested by Alexander. Please close this clone and advise me how to do this. Your JIRA behaves oddly compared to mine: "Create linked issue" is not available, editing the description is not possible, the "{ code }" tag does not work in the description but does in comments, and so on.
Chassis: Supermicro AS-1113S-WN10RT
Mainboard: H11SSW-NT
Disk drives: 6x Intel SSDPE2KX010T8
With FreeNAS 11.2-U3, as soon as there are more than 4 of these drives in the system, any moderate write load on the drives leads to errors like this:
Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
Apr 12 13:42:49 freenas01 nvme6: resetting controller
Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
Apr 12 13:43:35 freenas01 nvme6: resetting controller
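To compare runs, it can help to tally these messages per device. The following is a minimal sketch, assuming the syslog line layout shown above (date, host, device, message); the names LINE_RE and summarize are made up for this illustration and are not part of FreeNAS or FreeBSD.

```python
import re
from collections import Counter

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> nvmeN: <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+(nvme\d+):\s+(.*)$")


def summarize(lines):
    """Count controller resets and aborted commands per NVMe device."""
    resets = Counter()
    aborts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an nvme driver message in the assumed layout
        device, message = match.groups()
        if message == "resetting controller":
            resets[device] += 1
        elif message.startswith("ABORTED - BY REQUEST"):
            aborts[device] += 1
    return resets, aborts


# The log excerpt from this report, used as sample input.
SAMPLE = [
    "Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o",
    "Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176",
    "Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0",
    "Apr 12 13:42:49 freenas01 nvme6: resetting controller",
    "Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o",
    "Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96",
    "Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0",
    "Apr 12 13:43:35 freenas01 nvme6: resetting controller",
]

if __name__ == "__main__":
    resets, aborts = summarize(SAMPLE)
    for device in sorted(set(resets) | set(aborts)):
        print(device, "resets:", resets[device], "aborted commands:", aborts[device])
```

Feeding it the lines of a real log file instead of SAMPLE should work the same way, provided the file uses the same line layout.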
In a discussion on freebsd-stable we came to suspect that the NVMe driver in FreeBSD 11 misses completion interrupts issued by the device when finishing a task and then runs into timeouts.
This leads to the system becoming unresponsive.
Tests with plain FreeBSD without FreeNAS show that 11-STABLE does exhibit the problem while 12-STABLE doesn't.
All hardware components have the latest BIOS/firmware as provided by the vendor.
There have been substantial changes in the NVMe subsystem in FreeBSD >=12, initially targeting endianness problems on e.g. Sparc64, but some code to specifically deal with missed interrupts was added to nvme_timeout() in nvme_qpair.c. With a 12-STABLE kernel my system logs a "Missing interrupt" every half an hour or so under synthetic write load, but otherwise runs stably. An 11-STABLE system hangs seconds after I start my "dd" jobs.
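To illustrate the suspected difference, here is a toy model in Python. It is not the FreeBSD driver code, and the class, method, and flag names are all hypothetical. The assumption is that a timeout handler which first polls the completion queue can notice a command that already finished and only log "Missing interrupt", while a handler without that check goes straight to resetting the controller.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueuePair:
    """Hypothetical stand-in for an NVMe queue pair (not the FreeBSD structure)."""
    outstanding: set = field(default_factory=set)          # submitted, not yet retired
    completion_queue: list = field(default_factory=list)   # finished by the device
    log: list = field(default_factory=list)

    def process_completions(self):
        """Retire every command that is waiting in the completion queue."""
        retired = 0
        while self.completion_queue:
            cid = self.completion_queue.pop(0)
            if cid in self.outstanding:
                self.outstanding.discard(cid)
                retired += 1
        return retired

    def on_timeout(self, cid, poll_before_abort):
        """Timer for command `cid` expired without an interrupt arriving."""
        if poll_before_abort:
            # Assumed behaviour of the newer code path: look at the
            # completion queue before treating the command as lost.
            self.process_completions()
            if cid not in self.outstanding:
                self.log.append("Missing interrupt")
                return "recovered"
        # No check, or the command really did not complete: give up on it.
        self.log.append("resetting controller")
        self.outstanding.discard(cid)
        return "reset"


if __name__ == "__main__":
    # The device finished command 117, but the interrupt never arrived.
    with_check = ToyQueuePair(outstanding={117}, completion_queue=[117])
    without_check = ToyQueuePair(outstanding={117}, completion_queue=[117])
    print(with_check.on_timeout(117, poll_before_abort=True), with_check.log)
    print(without_check.on_timeout(117, poll_before_abort=False), without_check.log)
```

In this model the same lost interrupt costs one log line with the check and a controller reset without it, which would match the difference I see between 12-STABLE and 11-STABLE if the suspicion is right.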
More details can be found in the added links.
Kind regards,
Patrick