Thanks for using the TrueNAS Community Edition issue tracker! TrueNAS Enterprise users receive direct support for their reports from our support portal.

NVIDIA GPU not being used by Apps - Assigned VGA Arbitration by TrueNAS?

Duplicate

Description

I can’t get my NVIDIA GTX 1660 working with Plex (transcoding) and Immich (machine learning). I recently upgraded from TrueNAS-Core to Scale.

I can confirm that Plex is not using my GPU for transcoding: my CPU usage spikes considerably when it’s transcoding, and no processes are listed when I run nvidia-smi during a transcode.
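In case it helps, these are the checks I use to watch for transcoder processes on the GPU while a stream is playing (generic nvidia-smi options, nothing TrueNAS-specific; the query field names are my best recollection of the documented ones):

# refresh the full nvidia-smi view every two seconds during a transcode
watch -n 2 nvidia-smi

# or list only compute processes; an empty result means nothing is using the GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv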

When I run Immich’s machine learning workers, I get a continual “ERROR Worker was sent code 139”, which indicates the worker is crashing with SIGSEGV (a segmentation fault).

I think the issue is that my GPU is being claimed by something else and is not available to my apps, as [VGA controller] is listed against the GPU when I run lspci – if I understand the meaning of that correctly.
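As a sanity check on the container side (ElectricEel apps run on Docker), I believe a throwaway CUDA container should show whether containers can see the GPU at all; the image tag below is just an example and assumes the NVIDIA container runtime is already wired up by TrueNAS:

# if this prints the same table as the host's nvidia-smi, container GPU access works
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi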

TrueNAS Scale Version: ElectricEel-24.10.0
Plex Version: 1.0.24
Immich Version: 1.6.24

I do not have any displays connected.

I have followed this post, which details adding the following configuration…

resources:
  gpus:
    nvidia_gpu_selection:
      '0000:07:00.0':
        use_gpu: true
        uuid: ''          <<-- the problem
    use_all_gpus: false

… to the user_config.yaml file, located in the ixVolume at /mnt/.ix-apps/user_config.yaml, and setting the IOMMU and UUID values correctly – which I have done.
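For anyone following along, the PCI address and UUID for that file can be read straight from the driver (standard nvidia-smi query flags, nothing TrueNAS-specific):

# print the PCI bus ID and UUID of every NVIDIA GPU in the system
nvidia-smi --query-gpu=pci.bus_id,gpu_uuid --format=csv,noheader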

I also came across this post. However, I’m able to run the nvidia-smi command without errors.

Interestingly, I don’t have any of the following files on my system:

/etc/modprobe.d/kvm.conf
/etc/modprobe.d/nvidia.conf
/etc/modprobe.d/vfio.conf

My system also doesn’t present any GPUs available for isolation, as shown in the screenshot further below.
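To rule out the card having been isolated for VM passthrough at some point, I believe the middleware config can be queried directly; I’m assuming here that the isolated_gpu_pci_ids field still exists under system.advanced in ElectricEel:

# an empty list means no GPU is isolated for passthrough
midclt call system.advanced.config | jq '.isolated_gpu_pci_ids'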

Is anyone able to point me in the right direction as to what I should do?

— — — Additional Info — — —

nvidia-smi Output

root@truenas[~]# nvidia-smi
Thu Oct 31 12:24:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 28%   43C    P0             N/A /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

 

modprobe Output

root@truenas[~]# modprobe nvidia_current_drm
modprobe: FATAL: Module nvidia_current_drm not found in directory /lib/modules/6.6.44-production+truenas
root@truenas[~]# modprobe nvidia-current
modprobe: FATAL: Module nvidia-current not found in directory /lib/modules/6.6.44-production+truenas

 

lsmod Output

root@truenas[~]# lsmod | grep nvidia
nvidia_uvm           4911104  0
nvidia_drm            118784  0
nvidia_modeset       1605632  1 nvidia_drm
nvidia              60620800  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        249856  4 ast,nvidia_drm
drm                   757760  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
video                  73728  1 nvidia_modeset

 

lspci Output

root@truenas[~]# lspci -v
...
01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER]
        Flags: bus master, fast devsel, latency 0, IRQ 16, IOMMU group 1
        Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        Expansion ROM at f7000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Capabilities: [bb0] Physical Resizable BAR
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

 

No GPUs available to Isolate

image

 

Application Settings

image

Plex Resources

image

 

Plex Transcoding Settings - No specific GPU available

image

 

These are all the IOMMU groups I currently have.

root@truenas[~]# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/7/devices/0000:00:1c.4
/sys/kernel/iommu_groups/5/devices/0000:00:1c.0
/sys/kernel/iommu_groups/3/devices/0000:00:19.0
/sys/kernel/iommu_groups/11/devices/0000:04:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.2
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.3
/sys/kernel/iommu_groups/1/devices/0000:01:00.1
/sys/kernel/iommu_groups/8/devices/0000:00:1d.0
/sys/kernel/iommu_groups/6/devices/0000:00:1c.1
/sys/kernel/iommu_groups/4/devices/0000:00:1a.0
/sys/kernel/iommu_groups/12/devices/0000:05:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:14.0
/sys/kernel/iommu_groups/10/devices/0000:03:00.0
/sys/kernel/iommu_groups/10/devices/0000:02:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.2
/sys/kernel/iommu_groups/9/devices/0000:00:1f.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.3
/sys/kernel/iommu_groups/9/devices/0000:00:1f.6

 

It seems that my GPU at 0000:01:00.0 shares IOMMU group 1 only with its own sub-functions (the audio, USB and UCSI controllers at 0000:01:00.1–.3) and the PCIe root port at 0000:00:01.0, with no unrelated devices in the group.

root@truenas[~]# lspci -Dnn | grep -i NVIDIA
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] [10de:21c4] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation TU116 High Definition Audio Controller [10de:1aeb] (rev a1)
0000:01:00.2 USB controller [0c03]: NVIDIA Corporation TU116 USB 3.1 Host Controller [10de:1aec] (rev a1)
0000:01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 USB Type-C UCSI Controller [10de:1aed] (rev a1)
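For reference, a small loop like this should map every IOMMU group to lspci’s description of its member devices (standard sysfs layout, nothing TrueNAS-specific):

# list each IOMMU group together with a description of every device in it
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done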

 

I also ran the following command:

root@truenas[~]# dmesg | grep -i 'vga\|display\|nvidia'
[    0.211805] pci 0000:01:00.0: vgaarb: setting as boot VGA device
[    0.211805] pci 0000:01:00.0: vgaarb: bridge control possible
[    0.211805] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.211805] pci 0000:03:00.0: vgaarb: setting as boot VGA device (overriding previous)
[    0.211805] pci 0000:03:00.0: vgaarb: bridge control possible
[    0.211805] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    0.211805] vgaarb: loaded
[    0.694443] fb0: EFI VGA frame buffer device
[   12.876613] ast 0000:03:00.0: vgaarb: deactivate vga console
[   12.876769] ast 0000:03:00.0: [drm] Using analog VGA
[   12.907490] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[   12.962552] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input6
[   12.963921] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input7
[   12.963960] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input8
[   12.963992] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input9
[   13.529272] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[   13.530190] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[   13.530340] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   13.576894] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.127.05  Tue Oct  8 03:22:07 UTC 2024
[   13.618688] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.127.05  Tue Oct  8 02:56:05 UTC 2024
[   13.626835] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   13.626838] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[  112.662664] audit: type=1400 audit(1730371701.484:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3350 comm="apparmor_parser"
[  112.663725] audit: type=1400 audit(1730371701.484:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3350 comm="apparmor_parser"
[  164.737708] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  164.794357] nvidia-uvm: Loaded the UVM driver, major device number 237.

This shows both my GPUs.

My other VGA device – my motherboard’s onboard ASPEED AST graphics – is identified as ast 0000:03:00.0.
That device is designated as the boot VGA, while my NVIDIA GPU (at 0000:01:00.0) also appears in the VGA arbitration messages – assuming that’s what the vgaarb and frame buffer references mean.
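To confirm which card the kernel treats as the boot VGA device, I believe the boot_vga attribute in sysfs can be read directly (1 = boot VGA, 0 = not; this is standard PCI sysfs, not TrueNAS-specific):

# print the boot_vga flag for every display-class device
for f in /sys/bus/pci/devices/*/boot_vga; do
    echo "$f: $(cat "$f")"
done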

 

What do I need to do to get TrueNAS to release my NVIDIA GPU so that my apps can actually use it? Assuming that is the issue…

Problem/Justification

None

Impact

None

Attachments

6
  • 31 Oct 2024, 05:41 PM
  • 31 Oct 2024, 04:48 PM
  • 31 Oct 2024, 04:42 PM
  • 31 Oct 2024, 04:42 PM
  • 31 Oct 2024, 04:42 PM
  • 31 Oct 2024, 04:42 PM

Details

Impact

Medium

Created October 31, 2024 at 4:42 PM
Updated October 31, 2024 at 6:28 PM
Resolved October 31, 2024 at 6:28 PM

Activity

Bug Clerk October 31, 2024 at 6:28 PM

This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.

Stavros Kois October 31, 2024 at 6:27 PM

Nice, I was expecting that to work.
I’ve mainly seen existing installations have this issue. You can track progress in the linked ticket if you want to (https://ixsystems.atlassian.net/browse/NAS-132086).

I’ll close this one now, as it’s essentially a duplicate.

Thanks!

Michael Wesley October 31, 2024 at 5:41 PM
Edited

Well, that is extremely interesting.

I just created a new Plex container as requested, and now this shows up:

image-20241031-173650.png

And according to nvidia-smi Plex is now using my GPU.

root@truenas[~]# nvidia-smi
Thu Oct 31 18:38:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 40%   47C    P2              38W / 125W |     374MiB /   6144MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    126273      C   ...lib/plexmediaserver/Plex Transcoder       370MiB  |
+-----------------------------------------------------------------------------------------+

 

I can’t believe it was that simple. Perhaps ElectricEel did fix my issue after all…

I’ll try with Immich as well, although that will take a lot longer as it will need to search through all my photos again.

I guess I’ll just stick with this new Plex container and re-import all my media.

Stavros Kois October 31, 2024 at 5:22 PM

Yes, please. You can try Plex only, without uninstalling your current app.
Just make sure the current Plex app is stopped when you start the second one.

Michael Wesley October 31, 2024 at 5:19 PM
Edited

Hi, I installed the Plex and Immich apps a few days ago, after I updated from TrueNAS-CORE to TrueNAS-SCALE (Dragonfish). I updated to ElectricEel today in the hope that this would solve the problem. As such, the containers were ported over from Dragonfish to ElectricEel. I was experiencing the same issue in Dragonfish.

Do you still want me to try and do a fresh install on ElectricEel?