UNMAP delays possibly causing FC errors

Description

Just upgraded from 11.3-U2 (~180 days of uptime, no issues that I recall regarding this). Now on 11.3-U4.1, it started a scrub after it came up and is causing the systems that run on it to lock up fairly frequently. I am sure I am missing info you want, so please ask me for anything you don't see in the debug or want to know.

Sep 13 23:50:53 The-Archive (0:4:10/10): UNMAP. CDB: 42 00 00 00 00 00 00 03 e8 00
Sep 13 23:50:53 The-Archive (0:4:10/10): Tag: 0x12a6f0, type 1
Sep 13 23:50:53 The-Archive (0:4:10/10): ctl_process_done: 288 seconds
Sep 13 23:50:53 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: isp_handle_platform_ctio: CTIO7[12a6f0] seq 0 nc 1 sts 0x8 flg 0x1 sns 0 resid 0 MID
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: isp_handle_platform_ctio: CTIO7[12a6f0] seq 1 nc 1 sts 0x8 flg 0x8040 sns 0 resid 0 FIN
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: isp_handle_platform_ctio: CTIO7[12a6f0] seq 0 nc 1 sts 0x8 flg 0x1 sns 0 resid 0 MID
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
Sep 13 23:50:54 The-Archive isp1: isp_handle_platform_ctio: CTIO7[12a6f0] seq 1 nc 1 sts 0x8 flg 0x8040 sns 0 resid 0 FIN
Sep 13 23:50:54 The-Archive isp1: CTIO7 completed with Invalid RX_ID 0x12a6f0
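
For what it's worth, the "ctl_process_done: 288 seconds" line says the UNMAP spent almost five minutes inside CTL, far longer than any typical initiator SCSI timeout, so the initiator aborts the exchange and the late completion is what surfaces as the "Invalid RX_ID" messages. One way to look at this on FreeBSD, assuming the stock ctladm(8) is available (output details vary by setup):

    ctladm dumpooa          # dump the order-of-arrival queue: commands still being processed by CTL
    ctladm devlist -v       # list CTL LUNs with their backend options, e.g. whether unmap is enabled per LUN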

Problem/Justification

None

Impact

None


Activity


Alexander Motin October 19, 2020 at 3:08 PM

I am closing this as a duplicate of NAS-107364.

Alexander Motin October 19, 2020 at 3:08 PM

The only errors I see in the last few days are related to SAS, not FC. It looks like your enclosure device disappears and reappears, but for that I'd blame the hardware/firmware. At least for your HBAs there should be newer firmware versions; whether there are any for your expander/backplane/JBOD, I don't know.
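
If it helps, the firmware currently running on mps/mpr-driven LSI HBAs can be read straight from the FreeBSD base utilities (assuming that is the HBA family here; the first adapter is used by default):

    mpsutil show adapter    # SAS2 (mps driver) HBA: board name, firmware and driver versions
    mprutil show adapter    # same for SAS3 (mpr driver) HBAs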

gcs8 October 19, 2020 at 2:26 PM

Latest debug after "clean" scrub

gcs8 October 19, 2020 at 2:25 PM

OK, so I think the theory about dedup + scrub is right: I nuked the dataset/zvol that had dedup enabled and ran a scrub on that pool, and did not see any of the previous errors. I do see some other odd crap in dmesg, though, so I will throw a newer debug up for your viewing pleasure.

This does make me wonder: the deduped dataset/zvol only had about 40G of data, and I'm pretty sure 768G of RAM can hold the DDT for 40G of data, so I wonder whether something was forcing reads of the DDT off disk or whether it really is some hashing issue.
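
As a rough sanity check on that, back-of-the-envelope numbers (assuming the commonly quoted ~320 bytes per unique block; "tank" below is just a placeholder pool name) say the DDT for 40G of unique data should indeed be tiny next to 768G of RAM, and the pool can report the real figures itself:

    # ~40 GiB / 128 KiB records = ~327,680 unique blocks x ~320 B/entry = ~100 MiB of DDT;
    # even at an 8 KiB volblocksize (typical for a zvol) that only grows to ~1.6 GiB.
    zpool status -D tank    # DDT histogram: unique/duplicate entry counts and on-disk/in-core sizes
    zdb -DD tank            # more detailed DDT statistics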

gcs8 October 16, 2020 at 5:27 PM

OK, sorry for the delay; I had to wait for a workload to finish. I moved the dedup dataset off the affected pool and nuked it. The scrub seems to be going OK at the moment; I'm going to watch it for a bit and see what happens with a light load, then introduce some more pressure in a day or so. Will report back.
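
For anyone doing the same, the migrate-and-nuke step is basically a plain send/receive to a non-dedup destination; the pool and dataset names below are only examples:

    zfs snapshot tank/dedup-zvol@migrate
    zfs send tank/dedup-zvol@migrate | zfs receive backup/dedup-zvol
    zfs set dedup=off backup/dedup-zvol    # make sure new writes to the copy are not deduped
    zfs destroy -r tank/dedup-zvol         # only after verifying the received copy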

Resolution: Duplicate

Details


Created September 14, 2020 at 4:03 AM
Updated July 1, 2022 at 3:30 PM
Resolved October 19, 2020 at 3:08 PM