Close stdin/stdout/stderr
Description
Problem/Justification
None
Impact
None
Activity
Bug Clerk November 24, 2023 at 11:31 AM
Bug Clerk November 24, 2023 at 11:29 AM
Bug Clerk November 23, 2023 at 1:03 PM
23.10.1 PR: https://github.com/truenas/zettarepl/pull/289
Complete
Details
Assignee
Unassigned
Reporter
Bug Clerk
Components
Priority

Created November 23, 2023 at 1:01 PM
Updated February 6, 2024 at 4:35 PM
Resolved November 24, 2023 at 11:34 AM
24.04 PR: https://github.com/truenas/zettarepl/pull/288
This PR fixes a fatal memory leak in our TrueNAS CORE setup. I am not 100% sure why it works, so let me tell you how I arrived at this.
The TrueNAS server ran out of its 128GB of memory on a monthly basis, fixable only by a reboot. I finally checked it mid-month and noticed a huge `zettarepl` process.
Digging in, zettarepl would start leaking memory some hours after a restart – but not due to any particular job (so there probably is a kind of race involved).
I noticed that whenever the leaking started, `zettarepl` would grow from five to seven permanent threads. The two extra threads were named like `Thread-17428` and `retention.close_shell`.
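For anyone who wants to check for the same symptom, here is a minimal sketch of listing thread names from inside a Python process. It only uses the standard `threading` module; the helper name `live_thread_names` is made up for illustration and is not part of zettarepl, and this is not necessarily how the threads above were observed.

```python
import threading


def live_thread_names():
    """Return the sorted names of all threads currently alive in this process."""
    return sorted(t.name for t in threading.enumerate())


if __name__ == "__main__":
    # Simulate a lingering helper thread; the name is borrowed from the
    # description above purely for illustration.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, name="retention.close_shell")
    worker.start()
    print(live_thread_names())  # includes "retention.close_shell" while it is blocked
    stop.set()
    worker.join()
    print(live_thread_names())  # the helper thread is gone again
```

Comparing this list shortly after a restart with the list once memory starts growing should show whether extra threads are sticking around.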
Hmm, what the heck is a "close shell" thread ... well, it turns out that it was introduced in #166 to fix a replication hang reported in https://ixsystems.atlassian.net/browse/NAS-110234. It seems like the hang was concealed more than actually fixed and leaving the thread around somehow causes the memory leak ("somehow" => this is the part where I am not 100% sure why).
Anyway, I searched for issues with Paramiko hanging on close and ended up at https://github.com/paramiko/paramiko/issues/2075#issuecomment-1178468092 which states:
> `stdin`, `stdout` and `stderr` objects from `SSHClient.exec_command()` should be closed before closing an instance of `SSHClient` if you invoked `SSHClient.exec_command()`.
Looking for `std*.close()` in zettarepl, I didn't find much, and what I did find was rarely called.
So on a hunch, I added explicit closing of all the fds and hooray, no more leaks.
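To illustrate the pattern from the Paramiko comment, here is a rough sketch rather than the actual zettarepl change. The helper `run_and_close` is a hypothetical name; it assumes `client` behaves like a `paramiko.SSHClient`, i.e. `exec_command()` returns three stream objects and `stdout.channel.recv_exit_status()` is available.

```python
from contextlib import ExitStack


def run_and_close(client, command):
    """Run `command` and close stdin/stdout/stderr even if reading fails.

    `client` is assumed to behave like paramiko.SSHClient (assumption).
    """
    stdin, stdout, stderr = client.exec_command(command)
    with ExitStack() as stack:
        # Register close() for every stream; the callbacks run when the
        # `with` block exits, whether normally or through an exception.
        for stream in (stdin, stdout, stderr):
            stack.callback(stream.close)
        output = stdout.read()
        status = stdout.channel.recv_exit_status()
    return status, output


# Hypothetical usage with a real connection (requires paramiko; host is made up):
#   client = paramiko.SSHClient()
#   client.load_system_host_keys()
#   client.connect("nas.example.com")
#   try:
#       status, output = run_and_close(client, "zfs list")
#   finally:
#       client.close()  # the streams are already closed at this point
```

The point is only the ordering: all three streams are closed before `SSHClient.close()` is called.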
(I think #166 can be reverted after this change, but I have no way to verify the original report. So to keep the PR in "it can't hurt" territory, I didn't include that cleanup.)