amdgpu_job_timedout crash while transcoding on the gpu
Description
Activity
Bug Clerk June 16, 2024 at 1:03 PM
This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.
Bug Clerk June 16, 2024 at 1:03 PM
Thanks for the ticket but this is something that we rely pretty heavily on upstream maintaining and fixing. We do not have the resources to investigate very specific hardware configurations. I’d suggest trying 24.04.2 when it releases since it has some Intel ARC GPU fixes (I’m not sure if it has anything related to AMD).
Caspar von Beöczy June 13, 2024 at 10:53 PM
I have uploaded a debug file. In dmesg it appears that a CPU core locked up, which might be causing the slow user interface:
[532941.987322] watchdog: BUG: soft lockup - CPU#3 stuck for 7823s! [kworker/u64:0:3979137]
[532941.987340] Modules linked in: mptcp_diag(E) xsk_diag(E) raw_diag(E) unix_diag(E) af_packet_diag(E) netlink_diag(E) squashfs(E) tcp_diag(E) udp_diag(E) inet_diag(E) rpcsec_gss_krb5(E) wireguard(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) curve25519_x86_64(E) libcurve25519_generic(E) libchacha(E) ip6_udp_tunnel(E) udp_tunnel(E) nf_conntrack_netlink(E) veth(E) nft_log(E) nft_limit(E) xt_limit(E) xt_NFLOG(E) nfnetlink_log(E) xt_physdev(E) xt_multiport(E) xt_addrtype(E) ip_vs_rr(E) dummy(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip_set_hash_ipport(E) xt_nat(E) xt_ipvs(E) xt_set(E) ip_vs(E) ip_set_hash_ip(E) ip_set_hash_net(E) ip_set(E) xt_MASQUERADE(E) nft_chain_nat(E) xt_mark(E) xt_conntrack(E) xt_comment(E) nft_compat(E) nf_tables(E) nfnetlink(E) iptable_filter(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) overlay(E) br_netfilter(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) tls(E) nvme_fabrics(E) nvme_core(E) binfmt_misc(E) bridge(E) stp(E) llc(E)
[532941.987433] ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) dca(E) essiv(E) authenc(E) crypto_null(E) dm_crypt(E) ib_core(E) intel_rapl_msr(E) intel_rapl_common(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) amdgpu(E) crypto_simd(E) cryptd(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) rapl(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_usb_audio(E) drm_exec(E) snd_intel_dspcfg(E) amdxcp(E) eeepc_wmi(E) drm_buddy(E) snd_usbmidi_lib(E) gpu_sched(E) snd_rawmidi(E) asus_wmi(E) snd_hda_codec(E) drm_suballoc_helper(E) snd_seq_device(E) battery(E) drm_display_helper(E) ledtrig_audio(E) sparse_keymap(E) snd_hda_core(E) cec(E) mc(E) platform_profile(E) rc_core(E) snd_hwdep(E) sp5100_tco(E) drm_ttm_helper(E) snd_pcm(E) rfkill(E) pcspkr(E) ccp(E) k10temp(E) acpi_cpufreq(E) watchdog(E) ttm(E) snd_timer(E) wmi_bmof(E) snd(E) drm_kms_helper(E) i2c_algo_bit(E) soundcore(E) joydev(E) button(E) sg(E) evdev(E) nfsd(E)
[532941.987574] auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) loop(E) drm(E) efi_pstore(E) dm_mod(E) configfs(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) raid10(E) sr_mod(E) cdrom(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) sd_mod(E) t10_pi(E) uas(E) usbhid(E) usb_storage(E) hid(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) ahci(E) ahciem(E) xhci_pci(E) libahci(E) r8169(E) crct10dif_pclmul(E) crct10dif_common(E) realtek(E) mdio_devres(E) xhci_hcd(E) libata(E) crc32_pclmul(E) crc32c_intel(E) i2c_piix4(E) libphy(E) scsi_mod(E) usbcore(E) scsi_common(E) usb_common(E) video(E) gpio_amdpt(E) wmi(E) gpio_generic(E)
[532941.987763] CPU: 3 PID: 3979137 Comm: kworker/u64:0 Tainted: P OEL 6.6.29-production+truenas #1
[532941.987773] Hardware name: ASUS System Product Name/PRIME B450M-A II, BIOS 4401 09/04/2023
[532941.987781] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[532941.987798] RIP: 0010:amdgpu_device_rreg.part.0+0x2f/0xe0 [amdgpu]
[532941.988115] Code: 41 54 44 8d 24 b5 00 00 00 00 55 89 f5 53 48 89 fb 4c 3b a7 f8 08 00 00 73 62 83 e2 02 74 21 4c 03 a3 00 09 00 00 45 8b 24 24 <48> 8b 43 08 0f b7 70 3e 66 90 44 89 e0 5b 5d 41 5c e9 9b 45 77 ef
[532941.988130] RSP: 0018:ffffa65b5a87bbf8 EFLAGS: 00000286
[532941.988138] RAX: ffffffffc1e34fe0 RBX: ffff89481cf00000 RCX: 0000000000000003
[532941.988145] RDX: 0000000000000000 RSI: 000000000000ec05 RDI: ffff89481cf00000
[532941.988152] RBP: 000000000000ec05 R08: 0000000000000000 R09: 0000000000000000
[532941.988158] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffffff
[532941.988164] R13: 0000000000000001 R14: ffff894990a81c00 R15: 0000000000000000
[532941.988171] FS: 0000000000000000(0000) GS:ffff895700ac0000(0000) knlGS:0000000000000000
[532941.988179] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[532941.988185] CR2: 0000740d0c3da000 CR3: 0000000287a6e000 CR4: 00000000003506e0
[532941.988192] Call Trace:
[532941.988201] <IRQ>
[532941.988207] ? watchdog_timer_fn+0x1b8/0x220
[532941.988217] ? __pfx_watchdog_timer_fn+0x10/0x10
[532941.988226] ? __hrtimer_run_queues+0x112/0x2b0
[532941.988237] ? hrtimer_interrupt+0xf8/0x230
[532941.988246] ? __sysvec_apic_timer_interrupt+0x50/0x140
[532941.988255] ? sysvec_apic_timer_interrupt+0x6d/0x90
[532941.988264] </IRQ>
[532941.988267] <TASK>
[532941.988271] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[532941.988285] ? amdgpu_device_rreg.part.0+0x2f/0xe0 [amdgpu]
[532941.988639] ? srso_return_thunk+0x5/0x5f
[532941.988652] gfx_v9_0_set_safe_mode+0x65/0xf0 [amdgpu]
[532941.989031] amdgpu_gfx_rlc_enter_safe_mode+0x54/0x70 [amdgpu]
[532941.989392] gfx_v9_0_update_coarse_grain_clock_gating+0x13/0x340 [amdgpu]
[532941.989718] gfx_v9_0_set_clockgating_state+0x9f/0xd0 [amdgpu]
[532941.990047] amdgpu_device_set_cg_state+0x99/0xf0 [amdgpu]
[532941.990325] ? __irq_put_desc_unlock+0x1c/0x50
[532941.990337] amdgpu_device_ip_suspend_phase1+0x27/0xe0 [amdgpu]
[532941.990656] ? srso_return_thunk+0x5/0x5f
[532941.990668] amdgpu_device_ip_suspend+0x1f/0x70 [amdgpu]
[532941.990982] amdgpu_device_pre_asic_reset+0xd3/0x2a0 [amdgpu]
[532941.991255] amdgpu_device_gpu_recover+0x4f6/0xdd0 [amdgpu]
[532941.991525] ? ___drm_dbg+0xa0/0xd0 [drm]
[532941.991579] amdgpu_job_timedout+0x186/0x270 [amdgpu]
[532941.991915] drm_sched_job_timedout+0x85/0x120 [gpu_sched]
[532941.991931] process_one_work+0x17b/0x350
[532941.991942] worker_thread+0x303/0x450
[532941.991951] ? __pfx_worker_thread+0x10/0x10
[532941.991958] kthread+0xe8/0x120
[532941.991964] ? __pfx_kthread+0x10/0x10
[532941.991971] ret_from_fork+0x34/0x50
[532941.991979] ? __pfx_kthread+0x10/0x10
[532941.991984] ret_from_fork_asm+0x1b/0x30
[532941.991997] </TASK>
[532965.777758] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Bug Clerk June 13, 2024 at 10:31 PM
Thank you for submitting this TrueNAS Bug Report! So that we can quickly investigate your issue, please attach a Debug file and any other information related to this issue through our secure and private upload service below. Debug files can be generated in the UI by navigating to System -> Advanced -> Save Debug.
https://ixsystems.atlassian.net/servicedesk/customer/portal/15/group/37/create/153
My TrueNAS SCALE server sometimes becomes unresponsive while transcoding videos in the Jellyfin app. On the server's command line, the following error messages show up:
[524555.173266] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=242024, emitted seq=242026
[524555.174044] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 4086633 thread ffmpeg:cs0 pid 4086634
[524555.174616] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
The server runs SCALE 24.04.1 on a Ryzen 5 2400G and a Radeon RX 580. Both graphics units are passed to the Jellyfin application. This happened while transcoding a 720p .m4v file, which initially ran at 300 fps. When it happens, the system slows to a crawl but does not shut down.
I don’t suspect Jellyfin or ffmpeg to be the cause, but I can’t really tell for sure. (Driver? Hardware? Kernel?)
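For anyone trying to correlate these events across a long uptime, here is a minimal sketch that pulls the amdgpu ring timeouts and soft-lockup reports out of saved dmesg text. It assumes the lines follow the kernel's usual `[timestamp] message` layout and the wording quoted in this ticket; other kernel versions may phrase them differently. The function name `scan_kernel_log` and the `unsignaled` field are hypothetical, chosen only for illustration.

```python
import re

# Patterns modeled on the messages quoted in this ticket (assumption:
# other kernel versions may word these lines differently).
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def scan_kernel_log(text):
    """Return a list of dicts describing ring timeouts and soft lockups."""
    events = []
    for line in text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            events.append({
                "kind": "ring_timeout",
                "time": float(match["ts"]),
                "ring": match["ring"],
                # Sequence numbers emitted to the ring but not yet signaled.
                "unsignaled": int(match["emitted"]) - int(match["signaled"]),
            })
            continue
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "kind": "soft_lockup",
                "time": float(match["ts"]),
                "cpu": int(match["cpu"]),
                "stuck_seconds": int(match["secs"]),
                "task": match["task"],
            })
    return events
```

To use it, save the output of `dmesg` to a file and pass the file's contents to `scan_kernel_log`; comparing the `time` values shows how long after a ring timeout the first soft lockup was reported.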