Nightly crashes (system reboots!)

Description

Every night around 3 AM the system crashes and reboots.

All my NFS shares fail and I have to restart the remote services that depend on them.

Problem/Justification

None

Impact

None

Activity

Jean-Paul Lizotte February 20, 2022 at 5:00 PM

I wasn't willing to go back to a previous version to debug the issue further. Instead, I chose the upgrade path to the Debian-based version (TrueNAS SCALE), and since then the system has become rock solid again.

Sorry for the delay, but I needed to see if any issues would crop up again. The upgrade also solves many other issues, such as letting me run PowerShell scripts and Docker containers directly.

So this can be taken as the final solution.

Cheers!

Alexander Motin December 27, 2021 at 3:16 PM

> It's interesting that you mention that my memory is non ECC because it's supposed to be

In the `dmidecode` output I see "Error Correction Type: None". Looking at the motherboard specification (https://www.ecs.com.tw/he/Product/Motherboard/B85H3-M/specification) I also see "Supports DDR3 1600 non-ECC, Un-buffered SDRAM Memory".

> The reboots and crashes are pretty consistent

Please attach a new debug. Maybe I will see more consistency there. Indeed, I see several panics around 3:30 AM in the last log, but looking back I also see a number of panics at different times.

> But it's been happening exactly since the last update of TrueNAS.

I see you've updated to U5.1, U6, U6.1 and U7. Are you saying the problem appeared only when updating from U6.1, or could it have started earlier? As William said, it should be trivial to load the previous boot environment for verification.

William Gryzbowski December 27, 2021 at 1:47 PM

What if you roll back to the previous version using a boot environment? Does it become stable again?

Did you scrub your boot pool?

Jean-Paul Lizotte December 25, 2021 at 3:46 PM

The reboots and crashes are pretty consistent as for the time period, and I wish I could peer into the dumps as you are able to, so I could help direct you. But it's been happening exactly since the last update of TrueNAS.

I now have a "python3.9.core" file that has been produced. It's 144 MB so I can't upload it here.

Link: https://drive.google.com/file/d/1bFAPiM-Alse2SXrOmLALt4vHyjCXcD1k/view?usp=sharing

Any insight into what was happening on the system at that time, from the core dumps or the debug file, could also help me direct you.

It's interesting that you mention that my memory is non ECC because it's supposed to be. I also did a complete system reseat (not reset) about six months ago, that is, pulling the parts out of the motherboard to clean them and replace the thermal compound on the CPU.

Alexander Motin December 22, 2021 at 8:52 PM

In the debug provided I see 5 kernel dumps, each completely different. I don't know what happens at 3 AM (if it is anything related and not just a routine periodic job), but I suspect some memory corruption here, either hardware or software. Since it is a pretty old desktop system with non-ECC RAM, I'd recommend that you clean out the dust, check the cooling, and run a good long memory test. Otherwise, with all those random errors, I have nowhere to start.

Need additional information

Created December 18, 2021 at 2:33 PM
Updated July 6, 2022 at 8:58 PM
Resolved February 18, 2022 at 2:36 PM