Box crashes on load
Activity

Rand__ January 21, 2021 at 9:34 PM
Hm, those look like the kernel panics that I have been getting every couple of reboots (https://jira.ixsystems.com/browse/NAS-108096).
I will keep an eye on the IPMI logs and any future crashes.
Thanks again

Alexander Motin January 21, 2021 at 9:00 PM
Memory errors are not in the logs: since they are fatal, they only show up in kernel dumps, where they look like this:
Errors that happen to be correctable look similar to the first 6 lines of the above and are visible in the logs. I'd also check your IPMI/BMC logs; they should report it as well.

Rand__ January 21, 2021 at 8:30 PM (edited)
Hi Alexander,
thank you very much for looking into this issue.
Can you point me to the relevant log sections for the 2 memory errors? I have had memory errors in the HW log for a long while, and Supermicro told me I could ignore them (they originally deemed them to be a warning about mixing modules of different ranks).
But if these cause crashes with the latest BIOS, I would like to revisit that earlier statement from them...
I updated to U1.1 and will run some tests, thanks a lot.
Kind regards,
Thomas
Edit: It's much more stable. Transferred ~2TB worth of VMs without a single glitch.
Brilliant, thank you very much

Alexander Motin January 21, 2021 at 6:46 PM
Thomas, as I see from the kernel dumps in the provided debug, of the last 5 recorded panics, 2 were caused by an uncorrectable memory error and 3 by a ZFS bug that should already be fixed in the recently released TrueNAS 12.0-U1.1. So please replace your RAM and update the software.

Description
This box regularly crashes when put under higher load via NFSv4.
Environment:
TN12U1 acts as an NFSv4-attached datastore for ESXi 6.7U3.
When I migrate VMs to or from TN and another datastore (vSAN), the TN box crashes after a while. This happens on both reads and writes.
Unfortunately, I have changed a bunch of things since it was last stable:
I replaced 12 SAS3 SSDs with 4 NVMe drives (Dell branded), added an MLX5 NIC (100G connected), and updated to TN 12.1,
so I am not sure which change actually broke things.
My first suspicion was power issues with the NVMe drives, but I now have a 1k PSU running and my power measurement only shows ~200 to 300W usage; I distributed the NVMe drives evenly across the power connectors to prevent overloading individual rails.
The box also seems to reboot randomly at times; I get error messages saying that an unexpected reboot has occurred and that the boot volume was checked OK.
I had these before as well, but attributed them to running an NVMe drive as the system drive, which I have since removed; the OS has been reinstalled and the config restored.
Sometimes (every second or third boot), the system does not boot properly but ends up in a kernel panic. Another reboot usually resolves this.
The system was stable while running 11.U3 with SSDs; reboots with kernel panics have only happened since 12. As mentioned, I am not sure whether the reboots under load are HW related (NVMe) or also TN 12 related.
I tried capturing core files but couldn't find any.
Thanks