Box crashes on load

Description

This box regularly crashes under higher load via NFSv4.

Environment:
TN12U1 acts as an NFSv4-attached datastore for ESXi 6.7U3.
When I migrate VMs between TN and another datastore (vSAN), the TN box crashes after a while. This happens on both reads and writes.

Unfortunately, I changed a bunch of things since it was last stable:
replaced 12 SAS3 SSDs with 4 NVMe drives (Dell branded), added an MLX5 NIC (100G connected), and updated to TN 12.1.
So I am not sure which change actually broke things.

My first suspicion was power issues with the NVMe drives, but I now have a 1 kW PSU running and my power measurement only shows ~200 to 300 W usage. I also distributed the NVMe drives evenly across the power connectors to prevent overloading individual rails.

The box also seems to reboot randomly at times; I get error messages saying that an unexpected reboot has occurred and that the boot volume was checked OK.
I had these before as well, but attributed them to running an NVMe drive as the system drive, which I have since removed; the OS has been reinstalled and the config restored.

Sometimes (every second or third boot), the system does not boot properly but ends up in a kernel panic. Another reboot usually resolves this.

The system was stable while running 11.U3 with SSDs; reboots with kernel panics have only happened since 12. As mentioned, I am not sure whether the reboots under load are HW related (NVMe) or also TN 12 related.

I tried capturing core files but couldn't find any.

Thanks

Problem/Justification

None

Impact

None

Activity

Rand__ January 21, 2021 at 9:34 PM

Hm, those look like the kernel panics that I got every couple of reboots (https://jira.ixsystems.com/browse/NAS-108096).

I will watch the IPMI logs and any future crashes.

Thanks again

Alexander Motin January 21, 2021 at 9:00 PM

Memory errors are not in the logs because they are fatal; they are in the kernel dumps and look like this:

Errors that happen to be correctable look similar to the first 6 lines of the above and are visible in the logs. I'd also check your IPMI/BMC logs; they should report it as well.

Rand__ January 21, 2021 at 8:30 PM
Edited

Hi Alexander,

thank you very much for looking into this issue.

Can you provide me with the relevant log parts for the 2 memory errors? I have had memory errors in the HW log for a long while, and Supermicro told me I could ignore them (they originally deemed them to be a warning about mixing modules of different ranks).

But if these cause crashes with the latest BIOS, I would like to revisit that earlier statement from them...

I updated to U1.1 and will run some tests, thanks a lot.

Kind regards,

Thomas

Edit: It's much more stable. I transferred ~2 TB worth of VMs without a single glitch.

Brilliant, thank you very much.

Alexander Motin January 21, 2021 at 6:46 PM

Thomas, as I see from the kernel dumps in the provided debug, of the last 5 recorded panics, 2 were caused by an uncorrectable memory error and 3 by a ZFS bug that should already be fixed in the recently released TrueNAS 12.0-U1.1. So please replace your RAM and update the software.

Duplicate

Created January 13, 2021 at 2:25 PM
Updated July 1, 2022 at 2:53 PM
Resolved January 21, 2021 at 6:47 PM