FreeNAS 11.2-U6 - Supermicro server crashing

Description

Hello,

About 6 months ago I reported the same issue with FreeNAS 11.1-U7, but the problem still exists on 11.2-U6. I would like to find the root cause and, if it is a hardware issue, learn which part needs to be replaced. For now I have restored the original BIOS and firmware versions that the motherboard shipped with from the vendor.

The problem is that the server crashes periodically (every 10 to 48 hours).
I have tried the latest BIOS/firmware, but it didn't help. I also tried to get more information with the SMCIPMITool "Super Diagnostics Offline", but no errors were reported.

Motherboard: Supermicro X11SLP-F
BIOS: 2.0b (02/26/2018)
Firmware: 01.46 (02/14/2018)
CPU: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Memory: 2x16GB
L2ARC: 240GB (SSD)
Storage: 6x6TB WD Gold

The storage configuration:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:10 with 0 errors on Sat Oct 26 03:45:10 2019
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: vm-pool
 state: ONLINE
  scan: scrub repaired 0 in 0 days 02:17:51 with 0 errors on Sun Oct 27 06:17:52 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    vm-pool                                         ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/8e09230a-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
        gptid/8f15f7cf-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/9094d901-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
        gptid/92b130aa-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/93dd513c-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
        gptid/94ebd8bc-4891-11e8-9fc1-ac1f6b62a617  ONLINE       0     0     0
    cache
      ada1p1                                        ONLINE       0     0     0

errors: No known data errors

The latest crashdump information:

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0xffffc043cabba160
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80fa9423
stack pointer = 0x28:0xfffffe084ce58fd0
frame pointer = 0x28:0xfffffe084ce59000
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 4383 (uwsgi-3.6)
version.txt: FreeBSD 11.2-STABLE #0 r325575+5920981193f(HEAD): Mon Sep 16 23:00:13 UTC 2019
root@nemesis:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/FreeNAS.amd64
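
To compare one crash against another, a trap report like the one above can be turned into key/value pairs. The following is only a minimal Python sketch, assuming the report is available as plain text with one `key = value` pair per line, as pasted here; `parse_trap` is a hypothetical helper name, not part of FreeNAS or FreeBSD.

```python
import re


def parse_trap(text):
    """Parse a pasted fatal-trap report into a dict (hypothetical helper).

    Lines are split at the first '='. Continuation lines that start with '='
    (such as the second 'code segment' line) are skipped.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Fatal trap (\d+): (.+)", line)
        if header:
            fields["trap"] = int(header.group(1))
            fields["description"] = header.group(2)
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key:
            # Note: 'cpuid = 7; apic id = 07' is kept as one value under 'cpuid'.
            fields[key] = value.strip()
    return fields


sample = """Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0xffffc043cabba160
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80fa9423
current process = 4383 (uwsgi-3.6)"""

info = parse_trap(sample)
print(info["trap"], info["fault virtual address"], info["current process"])
```

Running the same function over each crash report makes it easier to see whether the trap number, faulting address, or current process repeats between crashes.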

Crashdumps:

-rw------- 1 root www 685 Oct 27 12:20 info.1
-rw------- 1 root www 74632 Oct 27 12:20 textdump.tar.1.gz
-rw------- 1 root www 493 Oct 27 19:45 info.2
-rw------- 1 root www 74326 Oct 27 19:45 textdump.tar.2.gz
-rw------- 1 root www 492 Oct 28 21:11 info.3
-rw------- 1 root www 66081 Oct 28 21:11 textdump.tar.3.gz
-rw------- 1 root www 500 Oct 29 01:37 info.4
-rw------- 1 root www 71051 Oct 29 01:37 textdump.tar.4.gz
-rw-r--r-- 1 root www 2 Oct 29 15:55 bounds
-rw------- 1 root www 494 Oct 29 15:55 info.0
-rw------- 1 root www 66730 Oct 29 15:55 textdump.tar.0.gz
lrwxr-xr-x 1 root www 6 Oct 29 15:55 info.last -> info.0
lrwxr-xr-x 1 root www 17 Oct 29 15:55 textdump.tar.last.gz -> textdump.tar.0.gz
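
To see quickly whether the five dumps share a panic message, the first line of each archive's panic text can be printed side by side. This is a hedged Python sketch, not an official FreeNAS tool: it assumes each `textdump.tar.N.gz` contains a `panic.txt` member (the FreeBSD textdump format normally includes one alongside `version.txt`), and the crash directory is passed in by the user since its location is not stated above. `read_member` and `summarize` are hypothetical helper names.

```python
import glob
import os
import sys
import tarfile


def read_member(archive_path, member_name="panic.txt"):
    """Return the text of one archive member, or None if it is absent."""
    # "r:*" lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if os.path.basename(member.name) == member_name:
                handle = tar.extractfile(member)
                if handle is not None:
                    return handle.read().decode("utf-8", errors="replace")
    return None


def summarize(crash_dir):
    """Yield (archive path, first line of panic.txt) for numbered dumps."""
    # The pattern matches textdump.tar.0.gz etc., not the .last symlink.
    pattern = os.path.join(crash_dir, "textdump.tar.[0-9]*.gz")
    for path in sorted(glob.glob(pattern)):
        text = read_member(path)
        if text and text.strip():
            yield path, text.strip().splitlines()[0]
        else:
            yield path, "(no panic.txt found)"


if __name__ == "__main__":
    # Assumption: the crash directory is given as the first argument.
    crash_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, first_line in summarize(crash_dir):
        print(path, "->", first_line)
```

Passing `member_name="ddb.txt"` or `"msgbuf.txt"` to `read_member` would pull the debugger output or kernel message buffer instead, if those members are present in the archive.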

I would really like to know what is happening; I have googled a lot, but without success.

Many thanks!!

Problem/Justification

None

Impact

None

Activity

Zoltan Szabo October 29, 2019 at 9:26 PM

Dear Alexander,

Many thanks for your fast reply!
I will check the hardware more thoroughly on site soon. I am just frustrated that I cannot find any "normal" logs that would identify which hardware part(s) are defective. The RAM test was okay: it ran for more than 8 hours without any crash. This is why I requested FreeNAS support: the problem started when I updated to 11.1-U7 and still persists on 11.2-U6.

By the way, thanks a lot!

Alexander Motin October 29, 2019 at 8:00 PM

We can't debug software for a zillion unrelated crashes. It must be hardware. At least try to isolate some specific reproduction scenario.

Alexander Motin October 29, 2019 at 7:57 PM

In the debug provided I see 5 crashes over 2 days, and all of them are completely different. My only guess is that something (hardware or software) corrupts memory content. I would blame RAM first, but as I see it is ECC and I see no ECC errors. I would try to remove or replace hardware components one at a time to localize the problem. If this is a self-built server, I would also carefully check that all hardware components, such as the chipset and all the controllers, are properly cooled.

Zoltan Szabo October 29, 2019 at 4:15 PM

The latest (after the crash) debug file has been attached.

Cannot Reproduce

Created October 29, 2019 at 4:12 PM
Updated July 1, 2022 at 4:43 PM
Resolved October 29, 2019 at 8:00 PM