oom-killer randomly starts every few weeks

Description

As outlined in this community thread:
https://www.truenas.com/community/threads/scale-becomes-unreachable-every-few-weeks.111707/
SCALE systems use up all swap memory, after which oom-killer starts taking down many services; in some cases the middleware and SSH become unresponsive.
In my own experience this occurred both when my system had 32GB of RAM and when it had 64GB; I had assumed more RAM was needed and tried to fix the problem with an upgrade.
Others and I have also tried reinstalling the OS, without resolving the problem.
The most recent occurrence was about 1 hour before I raised this ticket.
Any assistance is appreciated.

Problem/Justification

None

Impact

None

Activity

Grigory Vakulenchuk September 19, 2023 at 6:33 PM

Thanks Caleb. Since this has been impossible to replicate consistently, I am worried that reducing the number of apps would just free enough RAM to prevent this from happening. To test this theory, I think the best way would be to eliminate one app at a time; unfortunately, in my case this would mean running the system for at least two weeks without that app. I am not prepared to do this yet.

Morgan mentions on the community thread that Cobia uses containerd in place of dockerd and that the issue could be reduced or resolved as a result. This might be the best first step in my case, assuming TrueCharts signs off on support for their apps on Cobia. I will upgrade as soon as possible (very likely once the stable version is released).

If there is a leak in dockerd, or if one of the apps is causing a leak, then restarting the entire server on a schedule would be a good workaround for me.

For now, I have set up a shell script that runs via cron every 5 minutes and appends the top 10 pods (from docker stats) and the top 10 processes (from top) to a log file. Hopefully, the next time this happens, the output will capture some useful information.
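
The script itself was not posted. As a rough illustration of the idea only, here is a minimal Python sketch (the original was a shell script) that records the ten containers and ten processes using the most memory. The log path and function names are hypothetical, and ps is used in place of top because its output is easier to parse; the docker stats and ps options are standard ones, but treat the whole thing as an assumption-laden sketch rather than the reporter's actual script.

```python
#!/usr/bin/env python3
"""Append a memory snapshot to a log file; intended to be run from cron."""
import subprocess
from datetime import datetime, timezone

LOG_PATH = "/var/log/mem-snapshot.log"  # hypothetical location, adjust as needed
TOP_N = 10
# Tab-separated fields: memory percentage, memory usage, container name.
DOCKER_FORMAT = "{{.MemPerc}}\t{{.MemUsage}}\t{{.Name}}"


def top_containers(stats_output, n=TOP_N):
    """Return the n containers with the highest memory percentage."""
    rows = []
    for line in stats_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        perc, usage, name = parts
        try:
            rows.append((float(perc.rstrip("%")), usage, name))
        except ValueError:
            continue  # skip rows without a numeric percentage, e.g. "--"
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:n]


def top_processes(ps_output, n=TOP_N):
    """Return the n processes with the largest resident set size (KiB)."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # the command name may contain spaces
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:n]


def run(cmd):
    """Run a command and return its stdout, or '' if the binary is missing."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout


def main():
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    containers = top_containers(
        run(["docker", "stats", "--no-stream", "--format", DOCKER_FORMAT])
    )
    # A trailing "=" on each column name suppresses the ps header line.
    processes = top_processes(run(["ps", "-eo", "rss=,pid=,comm="]))
    with open(LOG_PATH, "a") as log:
        log.write(f"=== {stamp} ===\n")
        for perc, usage, name in containers:
            log.write(f"container {perc:6.2f}% {usage} {name}\n")
        for rss_kib, pid, comm in processes:
            log.write(f"process {rss_kib:>10} KiB pid={pid} {comm}\n")


if __name__ == "__main__":
    main()
```

A crontab entry with the schedule */5 * * * * pointing at wherever the script is saved would run it every 5 minutes. Because the file is opened in append mode, the snapshots taken before the system became unresponsive remain on disk for inspection after a reboot.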

Caleb September 19, 2023 at 10:13 AM

Hi, thanks for the ticket. You have 78 pods running on your system. There is zero chance we can figure out why oom-killer is being invoked, but it’s almost guaranteed to be because of an app you’re running or because you’re simply overloading your box. I can only suggest that you reduce the number of apps you’re running to try to pinpoint which one is causing issues.

Morgan Littlewood September 19, 2023 at 4:44 AM

Another user reported an issue that might be related to dockerd:

Apps stopped responding. Looking at the webui:
- The Apps tab was no longer loading
- The webui reported the RAM as completely used up by "Services"

htop through ssh showed all ram used up by
/usr/bin/dockerd -H fd://
~71% and rising slowly

There were multiple rows with the same command using exactly the same percentage of RAM.
At the top was one with a TIME+ of ~405h (I think; that wouldn't match my current uptime of ~9 days, though).

I killed that process using htop => F9 => SIGKILL

- the ram usage in htop immediately dropped to near zero
- the apps tab in the webui worked again
- the webui responded faster
- ram usage shown in webui returned back to near zero, slowly increasing on zfs cache & services
- all apps had been killed, slowly deploying again.

Automation for Jira September 19, 2023 at 2:46 AM

Thank you for submitting this TrueNAS Bug Report! So that we can quickly investigate your issue, please attach a Debug file and any other information related to this issue through our secure and private upload service below. Debug files can be generated in the UI by navigating to System -> Advanced -> Save Debug.

https://ixsystems.atlassian.net/servicedesk/customer/portal/15/group/37/create/153

Third Party to Resolve
Created September 19, 2023 at 2:46 AM
Updated February 27, 2025 at 9:13 PM
Resolved September 19, 2023 at 10:13 AM