Update from 12.0 to U1 causes command timeouts and multiple zpools to become unavailable

Description

The system is stable running the 12.0 base release. Shortly after updating to U1, I received email alerts for known disk issues related to FAILED SMART self-checks (those disks are in the process of being replaced shortly), such as:

New alerts:

  • Device: /dev/da13 [SAT], ATA error count increased from 9 to 10.

Current alerts:

  • Device: /dev/da3 [SAT], 4 Currently unreadable (pending) sectors.

  • Device: /dev/da3 [SAT], 1 Offline uncorrectable sectors.

  • Device: /dev/da8 [SAT], FAILED SMART self-check. BACK UP DATA NOW!.

  • Device: /dev/da8 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct..

  • Device: /dev/da0 [SAT], not capable of SMART self-check.

  • Device: /dev/da3 [SAT], FAILED SMART self-check. BACK UP DATA NOW!.

  • Device: /dev/da3 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct..

  • Device: /dev/da3 [SAT], Self-Test Log error count increased from 17 to 21.

  • Device: /dev/da13 [SAT], ATA error count increased from 9 to 10.

but sometime shortly after (anywhere from minutes to about an hour) I receive an updated email reporting UNAVAIL pool(s):

New alerts:

  • Pool Pool-01_2x3TB_Z1 state is UNAVAIL : One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
    The following devices are not healthy:
    Disk 13469299449177422213 is REMOVED
    Disk 18225843187637751625 is REMOVED

Current alerts:

  • Device: /dev/da3 [SAT], 4 Currently unreadable (pending) sectors.

  • Device: /dev/da3 [SAT], 1 Offline uncorrectable sectors.

  • Device: /dev/da8 [SAT], FAILED SMART self-check. BACK UP DATA NOW!.

  • Device: /dev/da8 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct..

  • Device: /dev/da0 [SAT], not capable of SMART self-check.

  • Device: /dev/da3 [SAT], FAILED SMART self-check. BACK UP DATA NOW!.

  • Device: /dev/da3 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct..

  • Device: /dev/da3 [SAT], Self-Test Log error count increased from 17 to 21.

  • Device: /dev/da13 [SAT], ATA error count increased from 9 to 10.

  • Pool Pool-01_2x3TB_Z1 state is UNAVAIL: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
    The following devices are not healthy:
    Disk 13469299449177422213 is REMOVED
    Disk 18225843187637751625 is REMOVED

Checking /var/log/messages shows this:

Dec 26 10:12:42 freenas-rms (da7:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 68 4a e9 60 00 00 00 08 00 00 length 4096 SMID 645 Command timeout on target 16(0x0010) 60000 set, 60.202389277 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 16 for SMID 645
Dec 26 10:12:42 freenas-rms (da7:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 68 4a e9 60 00 00 00 08 00 00 length 4096 SMID 645 Aborting command 0xfffffe010db2e2b8
Dec 26 10:12:42 freenas-rms (da7:mps0:0:16:0): WRITE(10). CDB: 2a 00 f5 7a 41 d8 00 04 d0 00 length 630784 SMID 822 Command timeout on target 16(0x0010) 60000 set, 60.211860396 elapsed
Dec 26 10:12:42 freenas-rms (da8:mps0:0:18:0): READ(10). CDB: 28 00 75 95 a2 88 00 00 80 00 length 65536 SMID 816 Command timeout on target 18(0x000a) 60000 set, 60.211873174 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 18 for SMID 816
Dec 26 10:12:42 freenas-rms (da8:mps0:0:18:0): READ(10). CDB: 28 00 75 95 a2 88 00 00 80 00 length 65536 SMID 816 Aborting command 0xfffffe010db3c880
Dec 26 10:12:42 freenas-rms (da8:mps0:0:18:0): READ(10). CDB: 28 00 75 95 a2 08 00 00 80 00 length 65536 SMID 698 Command timeout on target 18(0x000a) 60000 set, 60.340101584 elapsed
Dec 26 10:12:42 freenas-rms (da8:mps0:0:18:0): READ(10). CDB: 28 00 75 95 a1 88 00 00 80 00 length 65536 SMID 489 Command timeout on target 18(0x000a) 60000 set, 60.398693933 elapsed
Dec 26 10:12:42 freenas-rms (da4:mps0:0:13:0): READ(10). CDB: 28 00 75 95 a4 88 00 00 80 00 length 65536 SMID 559 Command timeout on target 13(0x000b) 60000 set, 60.414102026 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 13 for SMID 559
Dec 26 10:12:42 freenas-rms (da4:mps0:0:13:0): READ(10). CDB: 28 00 75 95 a4 88 00 00 80 00 length 65536 SMID 559 Aborting command 0xfffffe010db26f28
Dec 26 10:12:42 freenas-rms (da3:mps0:0:12:0): READ(10). CDB: 28 00 75 95 a4 90 00 00 80 00 length 65536 SMID 602 Command timeout on target 12(0x000c) 60000 set, 60.536237026 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 12 for SMID 602
Dec 26 10:12:42 freenas-rms (da3:mps0:0:12:0): READ(10). CDB: 28 00 75 95 a4 90 00 00 80 00 length 65536 SMID 602 Aborting command 0xfffffe010db2a8f0
Dec 26 10:12:42 freenas-rms (da4:mps0:0:13:0): READ(10). CDB: 28 00 75 95 a4 08 00 00 80 00 length 65536 SMID 697 Command timeout on target 13(0x000b) 60000 set, 60.658369333 elapsed
Dec 26 10:12:42 freenas-rms (da3:mps0:0:12:0): READ(10). CDB: 28 00 75 95 a4 10 00 00 80 00 length 65536 SMID 859 Command timeout on target 12(0x000c) 60000 set, 60.716850790 elapsed
Dec 26 10:12:42 freenas-rms (da5:mps0:0:14:0): WRITE(10). CDB: 2a 00 cf 8e 90 a8 00 00 58 00 length 45056 SMID 446 Command timeout on target 14(0x0011) 60000 set, 60.781890314 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 14 for SMID 446
Dec 26 10:12:42 freenas-rms (da5:mps0:0:14:0): WRITE(10). CDB: 2a 00 cf 8e 90 a8 00 00 58 00 length 45056 SMID 446 Aborting command 0xfffffe010db1d750
Dec 26 10:12:42 freenas-rms (da14:mps0:0:27:0): WRITE(10). CDB: 2a 00 cf 8e 90 48 00 00 58 00 length 45056 SMID 649 Command timeout on target 27(0x0012) 60000 set, 60.904812950 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 27 for SMID 649
Dec 26 10:12:42 freenas-rms (da14:mps0:0:27:0): WRITE(10). CDB: 2a 00 cf 8e 90 48 00 00 58 00 length 45056 SMID 649 Aborting command 0xfffffe010db2e818
Dec 26 10:12:42 freenas-rms (da6:mps0:0:15:0): WRITE(10). CDB: 2a 00 cf 8e 88 60 00 07 e8 00 length 1036288 SMID 583 Command timeout on target 15(0x0013) 60000 set, 60.1029242264 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 15 for SMID 583
Dec 26 10:12:42 freenas-rms (da6:mps0:0:15:0): WRITE(10). CDB: 2a 00 cf 8e 88 60 00 07 e8 00 length 1036288 SMID 583 Aborting command 0xfffffe010db28f68
Dec 26 10:12:42 freenas-rms (da4:mps0:0:13:0): READ(10). CDB: 28 00 75 95 a3 88 00 00 80 00 length 65536 SMID 675 Command timeout on target 13(0x000b) 60000 set, 60.1155205855 elapsed
Dec 26 10:12:42 freenas-rms (da3:mps0:0:12:0): READ(10). CDB: 28 00 75 95 a3 90 00 00 80 00 length 65536 SMID 775 Command timeout on target 12(0x000c) 60000 set, 60.1214052189 elapsed
Dec 26 10:12:42 freenas-rms (da1:mps0:0:10:0): READ(10). CDB: 28 00 fa ec 41 20 00 00 80 00 length 65536 SMID 767 Command timeout on target 10(0x000e) 60000 set, 60.1244367865 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 10 for SMID 767
Dec 26 10:12:42 freenas-rms (da1:mps0:0:10:0): READ(10). CDB: 28 00 fa ec 41 20 00 00 80 00 length 65536 SMID 767 Aborting command 0xfffffe010db386a8
Dec 26 10:12:42 freenas-rms (da2:mps0:0:11:0): READ(10). CDB: 28 00 fa ec 41 20 00 00 80 00 length 65536 SMID 578 Command timeout on target 11(0x000d) 60000 set, 60.1366931409 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 11 for SMID 578
Dec 26 10:12:42 freenas-rms (da2:mps0:0:11:0): READ(10). CDB: 28 00 fa ec 41 20 00 00 80 00 length 65536 SMID 578 Aborting command 0xfffffe010db288b0
Dec 26 10:12:42 freenas-rms (da1:mps0:0:10:0): READ(10). CDB: 28 00 fa ec 41 a0 00 00 80 00 length 65536 SMID 801 Command timeout on target 10(0x000e) 60000 set, 60.1496735489 elapsed
Dec 26 10:12:42 freenas-rms (da2:mps0:0:11:0): READ(10). CDB: 28 00 fa ec 41 a0 00 00 80 00 length 65536 SMID 892 Command timeout on target 11(0x000d) 60000 set, 60.1555541084 elapsed
Dec 26 10:12:42 freenas-rms (da0:mps0:0:9:0): READ(10). CDB: 28 00 fa ec 43 a0 00 00 80 00 length 65536 SMID 810 Command timeout on target 9(0x000f) 60000 set, 60.1599611643 elapsed
Dec 26 10:12:42 freenas-rms mps0: Sending abort to target 9 for SMID 810
Dec 26 10:12:42 freenas-rms (da0:mps0:0:9:0): READ(10). CDB: 28 00 fa ec 43 a0 00 00 80 00 length 65536 SMID 810 Aborting command 0xfffffe010db3c070
Dec 26 10:12:42 freenas-rms (da1:mps0:0:10:0): READ(10). CDB: 28 00 fa ec 42 a0 00 00 80 00 length 65536 SMID 752 Command timeout on target 10(0x000e) 60000 set, 60.1720593886 elapsed
Dec 26 10:12:42 freenas-rms (da2:mps0:0:11:0): READ(10). CDB: 28 00 fa ec 42 a0 00 00 80 00 length 65536 SMID 808 Command timeout on target 11(0x000d) 60000 set, 60.1779450090 elapsed
Dec 26 10:12:42 freenas-rms (da0:mps0:0:9:0): READ(10). CDB: 28 00 fa ec 44 a0 00 00 80 00 length 65536 SMID 426 Command timeout on target 9(0x000f) 60000 set, 60.1826374956 elapsed
Dec 26 10:12:42 freenas-rms (da0:mps0:0:9:0): READ(10). CDB: 28 00 fa ec 44 20 00 00 80 00 length 65536 SMID 700 Command timeout on target 9(0x000f) 60000 set, 60.1884557776 elapsed
Dec 26 10:12:43 freenas-rms (da0:mps0:0:9:0): WRITE(10). CDB: 2a 00 70 40 27 50 00 00 08 00 length 4096 SMID 916 Command timeout on target 9(0x000f) 60000 set, 60.827126265 elapsed
Dec 26 10:12:43 freenas-rms (da2:mps0:0:11:0): WRITE(10). CDB: 2a 00 70 40 27 48 00 00 08 00 length 4096 SMID 854 Command timeout on target 11(0x000d) 60000 set, 60.884782873 elapsed
Dec 26 10:12:43 freenas-rms (da1:mps0:0:10:0): WRITE(10). CDB: 2a 00 70 40 27 48 00 00 08 00 length 4096 SMID 623 Command timeout on target 10(0x000e) 60000 set, 60.943332036 elapsed
Dec 26 10:12:43 freenas-rms (da8:mps0:0:18:0): WRITE(10). CDB: 2a 00 e5 95 a0 f0 00 00 08 00 length 4096 SMID 528 Command timeout on target 18(0x000a) 60000 set, 60.1001827882 elapsed
Dec 26 10:12:43 freenas-rms (da4:mps0:0:13:0): WRITE(10). CDB: 2a 00 e5 95 a0 f0 00 00 08 00 length 4096 SMID 420 Command timeout on target 13(0x000b) 60000 set, 60.1060720636 elapsed
Dec 26 10:12:43 freenas-rms (da3:mps0:0:12:0): WRITE(10). CDB: 2a 00 e5 95 a0 f0 00 00 08 00 length 4096 SMID 883 Command timeout on target 12(0x000c) 60000 set, 60.1119711278 elapsed
Dec 26 10:12:46 freenas-rms (da6:mps0:0:15:0): WRITE(16). CDB: 8a 00 00 00 00 01 b0 4d 9e d8 00 00 00 28 00 00 length 20480 SMID 929 Command timeout on target 15(0x0013) 60000 set, 60.38550836 elapsed
Dec 26 10:12:46 freenas-rms (da5:mps0:0:14:0): WRITE(16). CDB: 8a 00 00 00 00 01 b0 4d 9e d8 00 00 00 30 00 00 length 24576 SMID 386 Command timeout on target 14(0x0011) 60000 set, 60.103505142 elapsed
Dec 26 10:12:46 freenas-rms (da7:mps0:0:16:0): WRITE(16). CDB: 8a 00 00 00 00 01 b0 4d 9e d0 00 00 00 30 00 00 length 24576 SMID 692 Command timeout on target 16(0x0010) 60000 set, 60.169085370 elapsed
Dec 26 10:12:46 freenas-rms (da14:mps0:0:27:0): WRITE(16). CDB: 8a 00 00 00 00 01 b0 4d 9e d0 00 00 00 30 00 00 length 24576 SMID 786 Command timeout on target 27(0x0012) 60000 set, 60.234615252 elapsed
Dec 26 10:12:48 freenas-rms (xpt0:mps0:0:16:0): SMID 1 task mgmt 0xfffffe010daf8158 timed out
Dec 26 10:12:48 freenas-rms mps0: Reinitializing controller
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 16
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 18
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 13
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 12
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 14
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 27
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 15
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 10
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 11
Dec 26 10:12:48 freenas-rms mps0: Unfreezing devq for target ID 9
Dec 26 10:12:48 freenas-rms mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Dec 26 10:12:48 freenas-rms mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Dec 26 10:12:50 freenas-rms mps0: Error reading device 0xa SATA PASSTHRU; iocstatus= 0x47
Dec 26 10:12:56 freenas-rms mps0[3392]: Last message 'Error reading device' repeated 4 times, suppressed by syslog-ng on freenas-rms.rmscomputersystems.com
Dec 26 10:12:56 freenas-rms (da8:mps0:0:18:0): Invalidating pack
Dec 26 10:12:56 freenas-rms da8 at mps0 bus 0 scbus0 target 18 lun 0
Dec 26 10:12:56 freenas-rms da8: <ATA WDC WD30EZRX-00D 0A80> s/n WD-WMC1T1274257 detached
Dec 26 10:12:56 freenas-rms GEOM_MIRROR: Device swap2: provider da8p1 disconnected.
Dec 26 10:12:56 freenas-rms da3 at mps0 bus 0 scbus0 target 12 lun 0
Dec 26 10:12:56 freenas-rms da3: <ATA WDC WD30EZRX-00D 0A80> s/n WD-WMC1T1274257 detached
Dec 26 10:12:56 freenas-rms ses0: da3,pass4,da8 in 'Slot 01', SAS Slot: 1 phys at slot 0
Dec 26 10:12:56 freenas-rms ses0: phy 0: SATA device
Dec 26 10:12:56 freenas-rms ses0: phy 0: parent 5003048000b626bf addr 5003048000b6268c
Dec 26 10:12:57 freenas-rms GEOM_MIRROR: Device swap3: provider da3p1 disconnected.
Dec 26 10:12:57 freenas-rms (da3:mps0:0:12:0): Periph destroyed
Dec 26 10:12:57 freenas-rms da3 at mps0 bus 0 scbus0 target 12 lun 0
Dec 26 10:12:57 freenas-rms da3: <ATA WDC WD30EZRX-00D 0A80> Fixed Direct Access SPC-4 SCSI device
Dec 26 10:12:57 freenas-rms da3: Serial Number WD-WMC1T1274257
Dec 26 10:12:57 freenas-rms da3: 600.000MB/s transfers
Dec 26 10:12:57 freenas-rms da3: Command Queueing enabled
Dec 26 10:12:57 freenas-rms da3: 2861588MB (5860533168 512 byte sectors)
Dec 26 10:12:57 freenas-rms da3: quirks=0x8<4K>
Dec 26 10:12:57 freenas-rms (da8:mps0:0:18:0): Periph destroyed
Dec 26 10:13:56 freenas-rms 1 2020-12-26T10:13:56.670542-05:00 freenas-rms.rmscomputersystems.com smartd 7937 - - Device: /dev/da3 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Dec 26 10:13:56 freenas-rms 1 2020-12-26T10:13:56.993394-05:00 freenas-rms.rmscomputersystems.com smartd 7937 - - Device: /dev/da3 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Dec 26 10:13:57 freenas-rms 1 2020-12-26T10:13:57.370463-05:00 freenas-rms.rmscomputersystems.com smartd 7937 - - Device: /dev/da3 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Dec 26 10:13:57 freenas-rms 1 2020-12-26T10:13:57.855543-05:00 freenas-rms.rmscomputersystems.com smartd 7937 - - Device: /dev/da3 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.

da3 and da8 are in the same Z1 pool (Pool-01_2x3TB_Z1), so I do understand why the pool is pushed offline. The issue is that 12.0 base does NOT have this problem; it is the application of 12.0-U1 that introduced this instability (possibly something within the change to OpenZFS?). I have reverted back to 12.0 base and the issue has disappeared (the da3 and da8 SMART errors are still reported, but I no longer see any command timeouts or pools being pushed offline).
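For reference, the pool and disk state described above can be checked from a TrueNAS shell with standard tools. This is a sketch: the pool name matches this report, but the `da*` device names and grep patterns are examples and will differ per system.

```shell
# Overall pool health; UNAVAIL/REMOVED members appear here with their GUIDs
zpool status -v Pool-01_2x3TB_Z1

# Map da* device names to controller targets and attached hardware
camcontrol devlist

# SMART health and the error counters referenced in the alerts above
smartctl -a /dev/da3 | grep -Ei 'overall-health|Reallocated_Sector|Current_Pending|Offline_Un'

# mps(4) controller messages around the timeouts
grep -E 'mps0|Command timeout' /var/log/messages | tail -n 50
```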

Problem/Justification

None

Impact

None

Activity

Peter Scordamaglia February 11, 2021 at 2:10 AM

Hello. I applied the recommendation in NAS-108881 of setting the hw.mps.max_io_pages loader tunable to 32 and upgraded to 12.0-U2. I will update as the issues do or do not recur. Thanks!
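For anyone following along: on TrueNAS CORE a loader tunable is normally set in the GUI (System → Tunables, Type: Loader), which is equivalent to the stock FreeBSD approach sketched below. The value takes effect at the next boot.

```shell
# /boot/loader.conf entry limiting mps(4) I/O to 32 pages per command,
# per the workaround suggested in NAS-108881
echo 'hw.mps.max_io_pages=32' >> /boot/loader.conf

# After rebooting, confirm the tunable took effect
sysctl hw.mps.max_io_pages
```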

Alexander Motin January 15, 2021 at 7:24 PM

This looks like a possible duplicate of NAS-108881. Could you try my recommendation from it to see whether it helps?

Peter Scordamaglia December 29, 2020 at 1:30 PM
Edited

Update: The system is inaccessible now. I attempted to access it via the GUI, SSH, and the running iocage services.

The GUI comes up just enough to show 'Connecting to TrueNAS ... Make sure the TrueNAS system is powered on and connected to the network'.
SSH prompts for user AND password but stalls (no further movement in the session) just after that.
Existing connections are also in a limbo state. htop seems to be OK, but another SSH session I left connected would not display anything after entering 'cat /var/log/messages'.
iocage services are also not responding (the Intermapper (Linux VM) remote client will not connect; its web interface is inaccessible; Nextcloud (plugin) is inaccessible; Plex allows logins but does not display any selections...)

It's like all services are up until they need to access data residing on the HDs... I am providing this as information (I don't know whether it helps or muddies things). I MIGHT be able to kill ONE thing with htop... I will leave the system as-is (it is in an interesting state, and I am fairly certain we will want to try to get more info).

Peter Scordamaglia December 29, 2020 at 1:53 AM

Two debugs attached: one from 12.0-RELEASE and one from 12.0-U1. I have reasonably confirmed the issue is triggered by data WRITES. I read a few tens of GBs from the drive without issue, but when I duplicated data on that drive, the issues occurred after about 15 GB was written. I will leave this running overnight in case you need more info, but by morning (Eastern TZ) I will most likely need to put this back into production.
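Since sustained writes appear to be the trigger, a minimal reproduction along these lines may help (a sketch: the dataset path and file name are hypothetical; the idea is to push roughly 15-20 GB of writes at the pool while watching the console):

```shell
# Write ~20 GB of incompressible data to a dataset on the affected pool
# (path is hypothetical; adjust to a dataset on Pool-01_2x3TB_Z1)
dd if=/dev/urandom of=/mnt/Pool-01_2x3TB_Z1/timeout-test bs=1m count=20480

# In another session, watch for mps0 command timeouts as the write proceeds
tail -f /var/log/messages | grep -E 'Command timeout|mps0'
```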

Peter Scordamaglia December 28, 2020 at 7:59 PM

Let me find an idle point to update and let it happen. I should be able to get it done and exported in the next 6 hours.

Resolution: Duplicate

Details

Impact: Critical

Created December 28, 2020 at 4:05 PM
Updated July 1, 2022 at 4:59 PM
Resolved January 27, 2021 at 5:34 PM