Status: Complete
Assignee: Caleb
Reporter: Bug Clerk
Time remaining: 0m
Priority: Low
Created February 2, 2022 at 12:24 PM
Updated July 11, 2022 at 4:43 PM
Resolved February 4, 2022 at 12:56 AM
PR: https://github.com/truenas/middleware/pull/8197
The fact that we even have this makes me squirm in my seat. The "debuggability" that this provides doesn't outweigh the cons that it introduces. I've found 3 major problems.
1. Because we put all websocket results in this queue, a reference is held which prevents the memory from being reclaimed. The ONLY time an entry is released is when the deque is full and a new entry pushes the oldest one out, and the footprint only shrinks if the evicted entry is larger than the one being inserted, so this gets out of hand quite quickly.
2. We were storing the non-serialized results, so the memory usage was, in theory, considerably larger than it should be.
3. We set the deque size to 1000 entries, which is kind of insane: the retained memory can grow to roughly 1000 times the size of a typical result.
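The retention behavior in the list above can be sketched with a bounded `collections.deque`. This is a minimal illustration, not the middleware's actual code: the `Result` class and the `history` name are hypothetical, and the deque is capped at 3 instead of 1000 to keep the example short. The immediate release on eviction relies on CPython's reference counting.

```python
from collections import deque
import weakref


class Result:
    """Hypothetical stand-in for a websocket call result."""

    def __init__(self, payload):
        self.payload = payload


history = deque(maxlen=3)  # the ticket's deque held 1000 entries

first = Result("x" * 10)
ref = weakref.ref(first)   # lets us observe when the object is freed
history.append(first)
del first                  # our name is gone, but the deque still holds a reference

still_alive = ref() is not None   # the deque keeps the result alive

# Only once the deque is full does a new append evict the oldest entry.
for i in range(3):
    history.append(Result(str(i)))

evicted = ref() is None    # the first result is finally released
```

Until the deque fills up and rolls over, nothing stored in it can be freed, no matter how large it is.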
This immediately starts eating memory on large systems (systems with hundreds of hard drives) because every 5 seconds a websocket call is made to `disk.temperatures`. On a system with lots of hard drives, the result is quite large. While a single entry isn't that big a deal, 1000 of them is.
Furthermore, this is used by the webUI team and we capture it in the debug by running `core.get_websocket_messages`. HOWEVER, when you call that method via websocket, it returns the entire contents of the `deque` and then turns around and stores those results in the `deque`, so every call makes the problem worse.
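The feedback loop described here can be reproduced with a toy version of the history. This is a sketch under assumptions: `record` and `get_websocket_messages` are hypothetical helpers, not the middleware's real functions, and the entry layout is invented for illustration.

```python
from collections import deque
import json

messages = deque(maxlen=1000)


def record(method, result):
    """Hypothetical hook that stores every websocket result in the history."""
    messages.append({"method": method, "result": result})


def get_websocket_messages():
    """Return the whole history; the returned value is itself recorded."""
    result = list(messages)
    record("core.get_websocket_messages", result)
    return result


record("disk.temperatures", {"sda": 30})

sizes = []
for _ in range(4):
    get_websocket_messages()
    # Serialized size of the full history after each call
    sizes.append(len(json.dumps(list(messages))))
```

Each call nests a copy of the previous history inside the new entry, so the serialized size of what the call returns keeps climbing.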
To fix, I've done a couple of things:
1. Shrunk the deque to 50 entries. There's no reason to have 1000.
2. If the serialized string of the result is greater than 1MB in size, then instead of storing the result we overwrite it with a string letting the end user know.
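A rough sketch of the two changes, for illustration only. The real change is in the PR linked above. The names `record`, `MAX_ENTRIES`, and `MAX_RESULT_BYTES`, the placeholder wording, and the use of `json.dumps` as the serializer are all assumptions, as is treating 1MB as 1024 * 1024 bytes.

```python
from collections import deque
import json

MAX_ENTRIES = 50                 # down from 1000
MAX_RESULT_BYTES = 1024 * 1024   # assumed meaning of "1MB"

messages = deque(maxlen=MAX_ENTRIES)


def record(method, result):
    """Store a result, or a short placeholder if it serializes to more than 1MB."""
    serialized = json.dumps(result, default=str)
    if len(serialized.encode("utf-8")) > MAX_RESULT_BYTES:
        result = f"<result of {method} omitted: larger than 1MB when serialized>"
    messages.append({"method": method, "result": result})


record("disk.temperatures", {"sda": 30})
record("zfs.snapshot.query", ["snap" * 10] * 100_000)  # several MB serialized

small = messages[0]["result"]   # kept as-is
large = messages[1]["result"]   # replaced by the placeholder string

for _ in range(100):
    record("core.ping", "pong")  # the deque never exceeds MAX_ENTRIES
```

With both limits in place, the history is bounded by roughly 50 entries of at most 1MB each, regardless of how large individual call results are.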
These fixes resolve the massive memory usage seen on a system with 20k snapshots when calling `zfs.snapshot.query` multiple times (calling it 5 times grew the parent process to ~1.4GB of resident memory).