Kernel panic with NFS traffic
Description
Problem/Justification
Impact
Activity

Rob Lewis, November 18, 2020 at 10:01 PM
The ddrescue run that wrote the file back onto itself, through the qemu-nbd mount and the incorrect ddrescue command, finished without any kernel panics. Commands were run from the Proxmox host server, connecting over NFS.
The .raw file in the file listing is really the ddrescue log file.
I was then able to run qemu-img convert on the vm-103-disk1.qcow2 file, which was actually in raw format, and convert it back to a qcow2 file. The resulting qcow2 file was bootable by the virtual machine. There was some file corruption, but the disk was bootable after running chkdsk.
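For reference, a minimal sketch of that conversion step, assuming standard qemu-img syntax (the recovered output filename is illustrative, not from the ticket):

    # The source file carries a .qcow2 name but holds raw data after the
    # ddrescue overwrite, so the input format is forced to raw.
    qemu-img convert -p -f raw -O qcow2 vm-103-disk1.qcow2 vm-103-disk1-recovered.qcow2

    # Sanity-check the rebuilt image before attaching it to the VM.
    qemu-img check vm-103-disk1-recovered.qcow2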
At this point, from my point of view, the bug seems to be resolved.
If there are any other conditions you would like me to test, please let me know.

Matthew Macy, November 17, 2020 at 8:16 PM
It's a non-debug kmod. '.debug' is just symbols, which I don't think you have any use for unless you get a core dump.
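A minimal sketch of how those symbols would come into play, assuming a FreeBSD-style crash dump in /var/crash and kgdb's add-kld command for module symbols (paths are illustrative):

    # Only useful once a panic has produced a vmcore; kgdb can then resolve
    # addresses inside openzfs.ko using the matching openzfs.ko.debug file.
    kgdb /boot/kernel/kernel /var/crash/vmcore.0
    (kgdb) add-kld /boot/modules/openzfs.ko
    (kgdb) bt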
"In the interest of testing weird cases, I will re-run the ddrescue that wrote the file back onto itself."
Yes. This is exactly what the second panic was caused by, writing to a corrupted file.

Rob Lewis, November 17, 2020 at 3:43 PM
I uncompressed openzfs.ko.debug.gz and uploaded it to /boot/modules along with openzfs.ko, then rebooted.
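Roughly the install step described, for reference (the location the files were downloaded to is not shown in the ticket):

    # Unpack the symbol file and place both modules where the loader looks.
    gunzip openzfs.ko.debug.gz
    cp openzfs.ko openzfs.ko.debug /boot/modules/
    reboot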
With the latest openzfs.ko and openzfs.ko.debug, the system would not boot with the debug kernel enabled. I didn't write down the error, but I believe it was an unrecognized file system error.
Booted with the debug kernel disabled and the system came up fine.
Ran ddrescue on the corrupted file without a kernel panic.
Having said that, looking at my previous post, it appears I made a mistake with the ddrescue command that resulted in the kernel panic: I was actually writing the corrupted file (the qcow2 file) back onto itself, rather than to the new .raw file.
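For context, a sketch of the intended invocation versus one plausible reading of the mix-up (device and file names are illustrative; ddrescue's argument order is infile, outfile, mapfile):

    # Intended: read the nbd-exported qcow2, write a new raw image, and keep
    # a separate map/log file so the run can resume after read errors.
    ddrescue /dev/nbd0 vm-103-disk1.raw vm-103-disk1.map

    # The mistake as described: the corrupted qcow2 ends up in the outfile
    # slot, so the data is written back onto itself and the ".raw" name
    # becomes the ddrescue map/log file instead.
    ddrescue /dev/nbd0 vm-103-disk1.qcow2 vm-103-disk1.raw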
In the interest of testing weird cases, I will re-run the ddrescue that wrote the file back onto itself.

Matthew Macy, November 16, 2020 at 10:45 PM
I updated the write failure path. Please try with the latest. Don't expect it to be any faster, but it should not panic.

Rob Lewis, November 2, 2020 at 9:20 PM (Edited)
Applied latest openzfs.ko, rebooted.
Performed an rsync of the corrupted file. It progressed past the 24% mark, where it would previously have hung. However, there were errors with the rsync and the destination file was discarded.
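The copy step was roughly of this shape (paths and options are illustrative; the exact rsync invocation is not recorded in the ticket):

    # Re-read the corrupted qcow2 end to end; --progress makes the failure
    # point (around the 24% mark) easy to spot.
    rsync -av --progress /mnt/tank/vms/vm-103-disk1.qcow2 /mnt/tank/recovery/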
I then attempted to recover the qcow2 file, using qemu-nbd on one of the Proxmox servers to mount the qcow2 file over NFS, ddrescue to copy it to a raw file, then qemu-img convert to turn it back into a qcow2. This is a process I have performed in the past with a corrupted qcow2 file, using Nexenta Community Edition 3.1.6 as the ZFS-based NFS host. Ddrescue ran into 5 read errors at the 24% mark but continued. The estimated time remaining increased drastically to over 3 hours after hitting the read errors; the initial estimate to complete ddrescue was approximately 50 minutes.
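A sketch of the export step on the Proxmox side, assuming the TrueNAS dataset is NFS-mounted under /mnt/pve (mount point and paths are illustrative; the ddrescue and qemu-img convert steps are sketched in the comments above):

    # Load the nbd driver and expose the qcow2 file as a block device.
    modprobe nbd max_part=8
    qemu-nbd --connect=/dev/nbd0 /mnt/pve/truenas/images/103/vm-103-disk1.qcow2

    # ... run ddrescue from /dev/nbd0 to a raw file, then qemu-img convert ...

    # Detach the export once the copy is done.
    qemu-nbd --disconnect /dev/nbd0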
Ultimately, ddrescue over NFS caused another kernel panic and reboot before it completed. Here is what was left on screen from ddrescue running on one of the Proxmox servers.
At this point, the ddrescue is still running. Enabling the NFS service on TrueNAS results in a kernel panic and reboot.
Full disclosure: I moved the pool disks from the Dell R720xd hardware to a different system and purchased new disks for the Dell R720xd so that I could deploy it for production storage. These new errors are happening on different hardware than this ticket was originally opened for.
Attached is the latest debug file; however, I wasn't running the debug kernel, since I didn't have a debug version of openzfs.ko.
If you need additional data or info, let me know.
TrueNAS 12 RC1, with a 10 Gbps X520 dual-port NIC. Dell R720xd server with 128 GB RAM. Started with a knockoff Intel NIC, a Gtek X520-DA2; now running a refurbished Dell OEM X520 without any improvement.
Serving NFS to a 3-node Proxmox cluster. Was also serving iSCSI to a 2-node Microsoft failover file server cluster, but the iSCSI service on TrueNAS and the VMs running the failover cluster are all currently disabled. This has not reduced the kernel panics.
Initially the server runs fine, but after a period of time it kernel panics and automatically reboots. It will then constantly reboot within a few seconds of displaying the available IP addresses on the console.
If I restart the Proxmox cluster nodes, the panics/reboots stop for a period of time, then start again.
The only way I am able to keep the system from kernel panicking on every boot once the panics start is to disable the switch port for the 10 Gbps card, then, once it is up and running, log in on an IP address assigned to a 1 Gbps port and stop/disable the NFS service. I can then enable the switch port and send traffic over it without issue.
Ran a few iperf3 tests this morning against the 10 Gbps NIC, with the affected server acting as the iperf3 server and the Gtek X520 card installed. The client was a Debian system with a 10 Gbps Gtek X520 NIC. Multiple 5-minute tests ran without any kernel panics; average speed was 7 Gbps.
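For reference, the iperf3 runs were roughly of this shape (the server address is a placeholder):

    # On the TrueNAS box (iperf3 server side):
    iperf3 -s

    # On the Debian client, a five-minute run against the 10 Gbps interface:
    iperf3 -c <truenas-10g-ip> -t 300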
After running these tests, as soon as I re-enabled NFS, a kernel panic occurred within a few seconds.
I originally started discussing this on ixsystems forums, see: https://www.ixsystems.com/community/threads/kernel-panic-possibly-nfs-and-10gbps-nic.88018/#post-609767
A screenshot of the kernel panic and the debug file have been uploaded. Let me know if you need anything else.
Thanks.