Multiple unexpected restarts

Description

The TrueNAS system has rebooted unexpectedly several times in the last two weeks. Each reboot occurred while the system was under moderate load (oddly, it seems to run fine under high load). The most recent reboot occurred while I was reading a file list over SMB.

Every reboot except the most recent one crashed the Docker service, and recovery required a pool reset.

I looked at the memory information recorded by Netdata, and the system's memory usage increased dramatically with each reboot. Therefore, I suspect the following:

1. A memory leak.
2. The ARC cache, Docker, Incus, or some scheduling policy conflicting with the system.
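
One way to test the ARC suspicion is to compare the current ARC size with total RAM. The following is a minimal sketch, assuming a Linux OpenZFS host that exposes /proc/spl/kstat/zfs/arcstats in the usual "name type data" layout; the function names are hypothetical and not part of TrueNAS.

```python
from pathlib import Path


def parse_arcstats(text: str) -> dict[str, int]:
    """Parse the 'name type data' rows of an OpenZFS arcstats dump."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields ending in a number;
        # the two header rows do not match and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo into field name -> value (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def arc_share(arcstats_text: str, meminfo_text: str) -> float:
    """Return the ARC size as a fraction of total RAM."""
    arc_bytes = parse_arcstats(arcstats_text)["size"]
    total_bytes = parse_meminfo(meminfo_text)["MemTotal"] * 1024
    return arc_bytes / total_bytes


if __name__ == "__main__":
    arc_path = Path("/proc/spl/kstat/zfs/arcstats")
    mem_path = Path("/proc/meminfo")
    if arc_path.exists() and mem_path.exists():
        share = arc_share(arc_path.read_text(), mem_path.read_text())
        print(f"ARC currently uses {share:.1%} of RAM")
    else:
        print("ZFS arcstats not available on this host")
```

Running this periodically and comparing the result with the Netdata graphs would show whether the memory growth is ARC or something else. A large ARC is normal on its own, so this only narrows the suspects.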

Problem/Justification

None

Impact

None

Activity


Bug Clerk, 2 days ago

This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.

Bug Clerk, 2 days ago

Sorry, but this does not look like a bug in TrueNAS. It looks like a culmination of issues specific to your install and/or hardware. The logs are littered with messages like these:

Mar 23 22:31:34 truenas kernel: get_swap_device: Bad swap file entry 1ffffffffffff
Mar 23 22:31:34 truenas kernel: get_swap_device: Bad swap file entry 1ffffffffffff
Mar 23 22:31:34 truenas systemd-coredump[3369607]: Process 3462 (syslog-ng) of user 0 dumped core.

Stack trace of thread 3369489:
#0  0x00007f5f70c24e2e n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

Mar 23 22:31:34 truenas systemd-coredump[3371785]: Process 3369633 (syslog-ng) of user 0 dumped core.

Stack trace of thread 3369650:
#0  0x00007f7546396e2e n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

Mar 23 22:31:34 truenas systemd-coredump[3371887]: Process 3371813 (syslog-ng) of user 0 dumped core.

Stack trace of thread 3371869:
#0  0x00007fb3d1b94e2e n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

Mar 23 22:33:19 truenas kernel: usb 2-1.3: device not accepting address 5, error -71
Mar 23 22:33:19 truenas kernel: usb 2-1.3: device not accepting address 6, error -71
Mar 23 22:33:19 truenas kernel: usb 2-1.3: device not accepting address 7, error -71
Mar 23 22:33:19 truenas kernel: usb 2-1.3: device not accepting address 8, error -71
Mar 23 22:33:19 truenas kernel: usb 2-1-port3: unable to enumerate USB device
Apr  1 22:21:59 truenas kernel: INFO: task txg_sync:2933 blocked for more than 120 seconds.
Apr  1 22:21:59 truenas kernel:       Tainted: P           OE      6.12.15-production+truenas #1
Apr  1 22:21:59 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  1 22:24:00 truenas kernel: INFO: task txg_sync:2933 blocked for more than 241 seconds.
Apr  1 22:24:00 truenas kernel:       Tainted: P           OE      6.12.15-production+truenas #1
Apr  1 22:24:00 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  2 20:04:12 truenas systemd-coredump[2742932]: Process 1506 (asyncio_loop) of user 0 dumped core.

Stack trace of thread 3919:
#0  0x00000000005ee650 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

Apr  2 22:47:14 truenas systemd-coredump[2955435]: Process 11037 (python3) of user 0 dumped core.

Stack trace of thread 304:
#0  0x00007fd470fc5b00 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

This is not happening in a widespread fashion, and we do not have the resources to investigate your specific environment and/or hardware. We suggest investigating your apps usage and the health of the hard drives in your zpools. The fact that txg_sync is “hung” is a bit alarming: it means the hard drives can’t keep up with the data being written and/or are responding extremely slowly, causing a cascading set of failures. We suggest starting simple: turn off all VMs, apps, containers, etc., and try to associate the reboots with a particular workload.
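
To see which of these message types dominate, and on which dates, the log can be tallied by kind. The following is a rough sketch, assuming the log has been saved as text in the same syslog-style format as the excerpt above; the pattern names and the summarize function are hypothetical and cover only the four message types quoted here.

```python
import re
from collections import Counter

# One pattern per message type seen in the excerpt above.
PATTERNS = {
    "coredump": re.compile(
        r"systemd-coredump\[\d+\]: Process \d+ \((?P<name>[^)]+)\) "
        r"of user \d+ dumped core"
    ),
    "hung_task": re.compile(
        r"kernel: INFO: task (?P<name>[^:]+):\d+ blocked for more than \d+ seconds"
    ),
    "bad_swap": re.compile(r"kernel: get_swap_device: Bad swap file entry"),
    "usb_error": re.compile(
        r"kernel: usb [\w.-]+: (?:device not accepting address|unable to enumerate)"
    ),
}


def summarize(lines):
    """Count log lines per message kind, keyed as 'kind' or 'kind:name'."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                name = match.groupdict().get("name")
                counts[f"{kind}:{name}" if name else kind] += 1
                break  # each line belongs to at most one kind
    return counts


if __name__ == "__main__":
    sample = [
        "Apr  1 22:21:59 truenas kernel: INFO: task txg_sync:2933 blocked for more than 120 seconds.",
        "Apr  2 22:47:14 truenas systemd-coredump[2955435]: Process 11037 (python3) of user 0 dumped core.",
    ]
    for key, count in summarize(sample).items():
        print(key, count)
```

Feeding the full log through summarize after stopping one workload at a time gives a simple before/after comparison for each isolation step.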

Linzi Moore, last week

Thanks for your submission! This is in our queue to review now. An engineering representative will update with any further questions or details in the near future.

Andrew Walker, last week (edited)

A userspace memory leak will typically not trigger a reboot. At most you’ll have userspace applications getting killed.
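
Whether applications are in fact being killed for memory can be checked by looking for OOM-killer reports in the kernel log. The following is a minimal sketch, assuming the "Killed process" wording used by recent Linux kernels (older kernels word the message differently); find_oom_kills is a hypothetical helper name.

```python
import re

# Kernel OOM-killer report; the prefix differs for cgroup-limited kills.
OOM_KILL = re.compile(
    r"(?:Out of memory|Memory cgroup out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(lines):
    """Return (pid, process name) for every OOM-killer report found."""
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name")))
    return kills
```

An empty result for the period before a reboot would suggest that memory pressure from a userspace leak is not what is taking the system down.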

Bug Clerk, last week

Thank you for submitting this TrueNAS Bug Report! So that we can quickly investigate your issue, please attach a Debug file and any other information related to this issue through our secure and private upload service below. Debug files can be generated in the UI by navigating to System -> Advanced -> Save Debug.

https://ixsystems.atlassian.net/servicedesk/customer/portal/15/group/37/create/153

Not Applicable

Details

Assignee: TrueNAS Backend Triage
Reporter: Hello World
Labels: COMMUNITY_USER, ready_for_assignment
Components: System
Fix versions: N/A
Affects versions: SCALE-25.04-RC.1 (Fangtooth)
Priority: Low


Created last week

Updated 2 days ago

Resolved 2 days ago
