System panic on spa_do_crypt_abd() return 5 during high-volume activities

Description

This is similar to what I reported in a while back but now on 12.u5 and more RAM assigned to the system (10 GBs). This system is only intended to store backups of pools from a prod system, hence the low specs.

The issue occurs when exporting Hyper-v VMs (large files) to an iSCSI mounted block device, which is in an encrypted pool in TrueNAS.

I was able to consistently reproduce the issue yesterday (.. system crashed 5 times in a row or so) a few minutes after starting the VM export activity, which in a nutshell equates to writing large files to the block device.

I then enable the kernel debug option in the hope to capture more info with the next crash but after enabling the debug option the system didn't any longer crash.

Aug 17 14:17:48 NAS2 panic: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, &zio->io_bookmark, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, ps
ize, zio->io_abd, eabd, &no_crypt)) failed (0 == 5)
Aug 17 14:17:48 NAS2 cpuid = 1
Aug 17 14:17:48 NAS2 time = 1629173770
Aug 17 14:17:48 NAS2 KDB: stack backtrace:
Aug 17 14:17:48 NAS2 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a1854990
Aug 17 14:17:48 NAS2 vpanic() at vpanic+0x17b/frame 0xfffffe00a18549e0
Aug 17 14:17:48 NAS2 spl_panic() at spl_panic+0x3a/frame 0xfffffe00a1854a40
Aug 17 14:17:48 NAS2 zio_encrypt() at zio_encrypt+0x609/frame 0xfffffe00a1854ad0
Aug 17 14:17:48 NAS2 zio_execute() at zio_execute+0x6a/frame 0xfffffe00a1854b20
Aug 17 14:17:48 NAS2 taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe00a1854b80
Aug 17 14:17:48 NAS2 taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe00a1854bb0
Aug 17 14:17:48 NAS2 fork_exit() at fork_exit+0x7e/frame 0xfffffe00a1854bf0
Aug 17 14:17:48 NAS2 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00a1854bf0
Aug 17 14:17:48 NAS2 — trap 0, rip = 0, rsp = 0, rbp = 0 —
Aug 17 14:17:48 NAS2 KDB: enter: panic
Aug 17 14:17:48 NAS2 ---< >---

Problem/Justification

None

Impact

None

Activity

Show:

Alexander Motin June 15, 2022 at 6:55 PM

Just for reference, I've decided to still plug it on ZFS side just in case some driver in future will report error in a different way: https://github.com/openzfs/zfs/pull/13563 .

Alexander Motin June 14, 2022 at 7:15 PM

I've found a way how this could happen on TrueNAS 12.  AESNI driver there was returning errors in two ways same time, and ENOMEM errors are not handled on one of them, causing this panic.  I was able to reproduce it once by creating extreme memory pressure during 15 minutes of active write.  On 13 through the AESNI code was changed as part of big refactoring, and the scenario should no longer happen.  I've still tried to stress it with memory pressure, but failed to trigger it.  Since we do not plan more 12 releases this may now rest in peace.

Michele Lucini January 10, 2022 at 12:32 AM

Just a quick update to this: The crashes are still there and even seem to have become more frequent with U7. An additional system I use to backup data to started crashing as well (... wasn't crashing previously).

These backup systems are running inside Hyper-V hypervisors with approx. 8GB memory assigned and 4-6 vCPU cores assigned. I can prevent the crashes from occurring by limiting the ARC size to approx. 2 GBs (... I haven't tested other values). I've done this by setting the vfs.zfs.arc_max tunable to 2147483648.

After doing so, the crashes stopped and the backup complete successfully.

Michele Lucini October 15, 2021 at 1:00 AM

FYI, I've just applied u6 and this happened shortly thereafter. Just noticed the fix was moved to u7 but I'll add this note anyways...

 

Oct 15 11:19:22 NAS2 syslog-ng[1571]: syslog-ng starting up; version='3.29.1'
Oct 15 11:19:22 NAS2 panic: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, &zio->io_bookmark, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, psize, zio->io_abd, eabd, &no_crypt)) failed (0 == 5)
Oct 15 11:19:22 NAS2 cpuid = 1
Oct 15 11:19:22 NAS2 time = 1634257076
Oct 15 11:19:22 NAS2 KDB: stack backtrace:
Oct 15 11:19:22 NAS2 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b1efb990
Oct 15 11:19:22 NAS2 vpanic() at vpanic+0x17b/frame 0xfffffe00b1efb9e0
Oct 15 11:19:22 NAS2 spl_panic() at spl_panic+0x3a/frame 0xfffffe00b1efba40
Oct 15 11:19:22 NAS2 zio_encrypt() at zio_encrypt+0x609/frame 0xfffffe00b1efbad0
Oct 15 11:19:22 NAS2 zio_execute() at zio_execute+0x6a/frame 0xfffffe00b1efbb20
Oct 15 11:19:22 NAS2 taskqueue_run_locked() at taskqueue_run_locked+0x144/frame 0xfffffe00b1efbb80
Oct 15 11:19:22 NAS2 taskqueue_thread_loop() at taskqueue_thread_loop+0xb6/frame 0xfffffe00b1efbbb0
Oct 15 11:19:22 NAS2 fork_exit() at fork_exit+0x7e/frame 0xfffffe00b1efbbf0
Oct 15 11:19:22 NAS2 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b1efbbf0
Oct 15 11:19:22 NAS2 — trap 0, rip = 0, rsp = 0, rbp = 0 —
Oct 15 11:19:22 NAS2 KDB: enter: panic
Oct 15 11:19:22 NAS2 ---<<BOOT>>---

Alexander Motin August 23, 2021 at 7:15 PM

Debug kernel is expected to reduces performance due to additional checks for things that should not happen.  So unfortunately it indeed may workaround some problems.

Complete
Pinned fields
Click on the next to a field label to start pinning.

Details

Assignee

Reporter

Time remaining

0m

Components

Fix versions

Affects versions

Priority

Katalon Platform

Created August 18, 2021 at 12:41 AM
Updated July 6, 2022 at 9:02 PM
Resolved June 14, 2022 at 7:15 PM