Incus VM with PCIe passthru crash
Activity

George Davidson 2 days ago
Thanks for finding the timestamp of when everything fell apart. I went back and examined the metrics just before and after 01:01:50 and discovered a few things. First, let me set the stage for what was happening before the issue. My brother and I both have TrueNAS systems, and we back up to each other over the internet. The VM that crashed was SecurityOnion, a security appliance for network monitoring that receives SPAN traffic from my network switch. So everything that my brother’s TrueNAS sends to my TrueNAS is also written by the SecurityOnion VM as packet captures.
Timeline of events
00:20:00 The scheduled push from my brother's system to mine began. It’s large, maybe 100GiB.
01:00:00 IO for the vdev (sdl & sdm) backing the packet capture zvol stalls. Maybe because the mirrored vdev is backed by SMR drives that had been writing at 30MB/s for 40 minutes, maybe because the VM discarded a bunch of data, which caused a bunch of reads on the vdev and slowed the disk flushes. Probably a little bit of both.
01:01:00 ZFS starts to flush data to the nvme1n1 SLOG because data is not flushing to sdl/sdm in a timely manner. Memory pressure starts to increase here.
01:01:50 The kernel OOM killer kills the VM due to memory pressure.
I see two problems: 1) The IO pressure was not communicated to the VM fast enough for it to react and slow down its disk IO. 2) When memory pressure started to approach the OOM threshold, the ARC wasn’t shrunk to free memory.
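For anyone who wants to check the second point on their own system, here is a minimal sketch that compares the ARC size to total RAM. It assumes a Linux host with OpenZFS, where ARC counters are exposed in /proc/spl/kstat/zfs/arcstats and total memory in /proc/meminfo. The function names are made up for illustration and are not part of any TrueNAS tooling.

```python
def parse_arcstats(text):
    """Parse the text of /proc/spl/kstat/zfs/arcstats into {name: int}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have three columns: name, type, data.
        # The two header rows do not match this shape and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into {name: bytes}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            # Most values carry a "kB" unit; plain counters have none.
            scale = 1024 if len(fields) > 1 and fields[1] == "kB" else 1
            info[name.strip()] = int(fields[0]) * scale
    return info


def arc_fraction(arcstats_text, meminfo_text):
    """Return the current ARC size as a fraction of total RAM."""
    arc_size = parse_arcstats(arcstats_text)["size"]
    mem_total = parse_meminfo(meminfo_text)["MemTotal"]
    return arc_size / mem_total


def live_arc_fraction():
    """Read both files from the running host (requires the ZFS module)."""
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        arc_text = f.read()
    with open("/proc/meminfo") as f:
        mem_text = f.read()
    return arc_fraction(arc_text, mem_text)
```

Calling live_arc_fraction() periodically during a large replication would show whether the ARC shrinks as available memory drops. It only reads counters and changes nothing.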

Morgan Littlewood 2 days ago
If there is another ticket… can we post it here for future reference?

George Davidson 2 days ago
Well there wasn’t a large spike in memory that precipitated this. Here are NetData metrics on the memory usage. There was a replication happening from my brother’s TrueNAS to my TrueNAS, but the available memory was reported as 80GB. If you’d like the full metrics from NetData, I can send them if they might be useful?

William Gryzbowski 2 days ago
That might be. We have another ticket to subtract VM RAM from ARC.

George Davidson 2 days ago
Well, I would call that a bug, since my system has 128GB of RAM and more than 50% of it was in the ARC.
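To make the "subtract VM RAM from ARC" idea concrete, here is a rough sketch of the arithmetic. The VM allocation and host headroom below are hypothetical numbers, not values from this system, and the helper name is invented for illustration. OpenZFS on Linux does have a zfs_arc_max module parameter that takes a byte count, but whether changing it by hand is appropriate on TrueNAS is not something this thread confirms.

```python
GIB = 1024 ** 3


def arc_cap_bytes(total_ram, vm_ram, host_headroom):
    """Return an ARC ceiling that leaves room for VMs and the host.

    All arguments and the result are in bytes.
    """
    cap = total_ram - vm_ram - host_headroom
    if cap <= 0:
        raise ValueError("VM RAM plus headroom exceeds total RAM")
    return cap


# Hypothetical example: 128 GiB host, 32 GiB committed to VMs,
# 8 GiB kept free for the host OS and services.
example_cap = arc_cap_bytes(128 * GIB, 32 * GIB, 8 * GIB)
print(example_cap // GIB, "GiB")  # prints "88 GiB"
```

With those assumed numbers, the ARC would be limited to 88 GiB instead of being allowed to grow as if the VM memory were available to it.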
Details
Assignee: Alexander Motin
Reporter: George Davidson
Priority: Undefined

I'm not sure what happened, but my SecurityOnion VM crashed this morning at 1am. TrueNAS is still up and the HomeAssistant VM is still up, but SecurityOnion was in a powered-off state when I woke up. The only noteworthy thing, I feel, is that this VM is using the PCIe device passthru that was just added in 25.04 RC1. I'm passing in one NIC from a four-port Intel I350 card. I'm mostly submitting this to send you guys debug logs; hopefully they'll make sense to you.
Session ID: 381246e9-9a32-b6e8-80ca-6a31db7cd26c