TrueNAS COBIA (BETA) - CPU spikes

Description

Hello,

After upgrading to COBIA (BETA), I observed regular CPU spikes occurring about every minute. My problem is highly similar to , however, I do not see any EDID-related errors (or any errors, for that matter) on the console.

I have attached a screenshot showing how high these CPU spikes can get (in this case, the entire 12% was due to the spike) and that the corresponding command appears to be middlewared (worker).
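To check the "about every minute" observation against numbers rather than a graph, the interval between spikes can be estimated from logged samples. This is a minimal sketch only: the `(timestamp_seconds, cpu_percent)` sample format, the function names, and the threshold are assumptions for illustration, not anything TrueNAS provides.

```python
from statistics import median


def spike_onsets(samples, threshold):
    """Return timestamps where usage rises from below to at/above threshold.

    samples: iterable of (timestamp_seconds, cpu_percent) pairs in time
    order (a hypothetical format chosen for this sketch).
    """
    onsets = []
    above = False
    for ts, usage in samples:
        if usage >= threshold and not above:
            onsets.append(ts)  # first sample of a new spike
        above = usage >= threshold
    return onsets


def median_spike_interval(samples, threshold):
    """Median seconds between consecutive spike onsets, or None if fewer than two."""
    onsets = spike_onsets(samples, threshold)
    if len(onsets) < 2:
        return None
    return median(b - a for a, b in zip(onsets, onsets[1:]))
```

A median interval close to 60 seconds would support the suspicion that a periodic job is responsible rather than random load.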

I have checked the syslogs and the middlewared logs. I did not see anything unusual in the syslogs, but I found the following in the middlewared logs (hopefully they provide some helpful information):

1) pyroute2 fails with the following traceback:

2) I get a weird warning saying that a mountpoint for /var/db/system was not found:

3) I get an error regarding a service called idmap:

My problem seems to also be similar to the post described in (hence my reply on said forum post), suggesting that this may not be a COBIA-exclusive bug (even though I have not experienced it on any previous 22.12.3.x versions using the exact same system in the last 6+ months).

Problem/Justification

None

Impact

None

Attachments

Activity

Gergo September 19, 2023 at 3:15 PM

Thank you @Caleb for your insights.

I was already planning on giving RC a try, so I will do so and see what happens. (Although it may take a week or two, as my current schedule is too tight.)

I will also make sure to report the results.

Caleb September 19, 2023 at 12:33 PM

actually, I just remembered that we did some CPU spike investigation and we found quite a few places where we could improve this. The fixes were merged AFTER BETA.1 was released so there should be some pretty good improvements in RC.1. I will be very curious if you continue to see the CPU spikes after upgrading.

Caleb September 19, 2023 at 12:23 PM

so I went off and looked at some internal systems to see if I could spot this in the wild. I was able to find a system that showed similar behavior, however, this was on a system with 1255 hard drives. The problem exposed there was something that was specific to how large that system was so I don't believe I've found the problem that you're experiencing. It's hard to say what's going on without capturing a flamegraph of the parent middleware process. Capturing a flamegraph of the parent middleware process requires installing a few things to the base OS which I'm not fond of doing on a BETA version. We're getting ready to release RC.1 and we've made quite a few fixes there so I would suggest you upgrade to that version when it's released and open a fresh ticket (with a fresh debug) if this continues. If it continues, we'll probably need to try and schedule a remote session to get on your system and see if we can figure out what's going on.

StubbornGarrett September 16, 2023 at 11:56 PM
Edited

Exact minutely CPU spikes do indeed sound like a problem, especially if we do not know the reason for them. Most resource-intensive processes on my timeline are temporary and triggered for a reason that I can understand. Unfortunately, that is not the case here.

In my case, "top" shows 100% CPU utilization of the "middleware (worker)" process during these spikes, which is also clearly reflected in the CPU temperature with up to 13°C difference. I can only assume that it depends on CPU model (Intel i3-9100 on my system) or other software configuration.
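For anyone who wants a number for a single process without watching "top", per-process CPU usage can be derived from two reads of `/proc/<pid>/stat` on Linux. The following is a rough sketch, not a TrueNAS tool: the helper names are made up for illustration, and the PID of the worker process has to be looked up separately.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split on the LAST ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                elapsed_s: float, ticks_per_s: int) -> float:
    """CPU usage of one process as a percentage of a single core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * elapsed_s)


def sample_pid(pid: int, interval_s: float = 1.0) -> float:
    """Measure one process's CPU usage over interval_s seconds (Linux only)."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    elapsed = time.monotonic() - start
    return cpu_percent(before, after, elapsed, os.sysconf("SC_CLK_TCK"))
```

Calling `sample_pid(pid)` in a loop and logging the result with a timestamp would show whether the 100% readings line up with the one-minute cadence. Note that 100% here means one full core, which on a multi-core CPU shows up as a much smaller share of total CPU in the UI.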

I decided against uploading my debug dump currently, since the original ticket creator can apparently see the file as well and it is from a production system with private data. If I can provide more information or open another ticket, since I am also on Bluefin, please let me know.

Gergo September 14, 2023 at 6:04 PM
Edited

Thank you for your feedback @Caleb!

- The 12% CPU usage is a big problem for me. It is the main reason why I have not powered on the NAS at all since experiencing this problem, except for backups and simple tests to see whether they fix it, and it is why I am thinking of downgrading back to Bluefin. The only reason the attached image displays 33 degrees as the temperature is that there is a slight delay before the temperature is updated in the UI. I attached another image that I took 1 or 2 seconds later, showing 44 degrees. Due to the increased CPU usage, the temperature inside the chassis and most likely the overall power consumption of my system have also increased. In Bluefin, the reported CPU usage was almost constantly 0 and the temperatures were about 30-32 degrees, as opposed to 40+ or sometimes almost 60 degrees now (see the second attached image). In Bluefin, the temperature curve looked like a flat line due to the low demand (essentially nothing is running on the server), but since switching to Cobia it constantly cycles up to 40-50+ degrees (see the second attached image). For comparison, my system displays only 4% CPU usage during heavy file transfers over a 1 Gbit connection, in contrast to the spikes of 12% or higher.

- Thank you for the pyroute2 heads up! I submitted the IPv4 address of my default gateway (router) to the bridge that I created in the UI. I will try experimenting with disabling that bridge for now. (The only reason for that bridge was so that the communication of the VMs does not go out to the wire).

Cannot Reproduce

Details
Assignee
Caleb
Reporter
Gergo
Labels
ready_for_review
Impact
High
Time remaining
0m
Components
Fix versions
N/A
Affects versions
SCALE-23.10-BETA.1
Priority
Low

Created September 10, 2023 at 6:03 PM

Updated September 19, 2023 at 3:15 PM

Resolved September 19, 2023 at 12:23 PM
