pool middleware.exceptions wrong gptid
Activity

William Gryzbowski September 26, 2019 at 12:59 PM
It's in the disk_encrypteddisk table of the SQLite database at /data/freenas-v1.db.
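For anyone following along, here is a self-contained sketch of the kind of staleness being described: compare what the cache table records against the labels actually on disk. The column name and schema below are guesses for illustration, not the real disk_encrypteddisk layout.

```python
import sqlite3

# Sketch only: the real schema of disk_encrypteddisk inside
# /data/freenas-v1.db may differ; the column name below is a guess.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE disk_encrypteddisk ("
    "id INTEGER PRIMARY KEY, "
    "encrypted_provider TEXT)"  # cached gptid/... label per member disk
)
conn.execute(
    "INSERT INTO disk_encrypteddisk (encrypted_provider) VALUES (?)",
    ("gptid/11111111-aaaa-bbbb-cccc-222222222222",),
)

# Labels actually present on the disks (roughly what `glabel status` shows).
on_disk_labels = {"gptid/33333333-dddd-eeee-ffff-444444444444"}

cached = {row[0] for row in
          conn.execute("SELECT encrypted_provider FROM disk_encrypteddisk")}
stale = cached - on_disk_labels  # cache rows that no on-disk label matches
print(sorted(stale))  # a non-empty result means the cache is out of sync
```

If the pool member recorded in the cache never shows up among the real glabel output, an unlock attempt has nothing valid to decrypt.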

James September 25, 2019 at 7:11 AM
William,
Being repeatable would definitely make it easier to diagnose and fix.
I am fearful to do anything that might break the pool.
Can you tell me where the disk cache table is stored?
If I'm bored I might spin up a VM and randomly fuzz adding/removing SATA drives.

William Gryzbowski September 23, 2019 at 3:30 PM
The 12-digit number means one of the disks could not decrypt; likely the disk cache table has gone out of sync.
It would be interesting to know how to reproduce it; other than that, it's just a wild guess.

James September 21, 2019 at 5:16 AM
Powered back on FreeNAS. I couldn't mount the pool (see error above).
What I meant to say was: after I powered on FreeNAS, I logged into the Web UI. I went to Storage -> Pools. I couldn't unlock the encrypted pool. I got the error on the unlock attempt.
Unfortunately, I don't have the zpool status output. But it had one disk with gpt/..... and the other was a string of about 12 numbers. It wasn't the disk serial number or part of the gptid, so I have no idea where it got that config for the pool.
That's when I decided to do the risky disconnect/re-import of the pool (which worked fine).
The symptoms are similar to this post:
https://www.ixsystems.com/community/threads/zpool-status-no-longer-showing-gptid-for-one-disk.78845/
Except here it wasn't the disk device; it was seemingly random numbers instead of the gptid.

William Gryzbowski September 18, 2019 at 6:29 PM
Exactly, how could that have even worked if the pool was not imported?
At what point did you do that? Do you have the actual output at the time?
Error from the UI:
Error: Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/tastypie/resources.py", line 219, in wrapper
response = callback(request, *args, **kwargs)
File "./freenasUI/api/resources.py", line 949, in unlock
form.done(obj)
File "./freenasUI/storage/forms.py", line 2847, in done
raise MiddlewareError(msg)
freenasUI.middleware.exceptions.MiddlewareError: [MiddlewareError: Volume could not be imported]
What caused it
I'm not 100% sure.
I shut down the system and added 3 disks. 2 empty new drives, and one windows formatted drive. I 3g-ntfs mounted the windows drive from the shell. Then I unmounted it and removed it from the system as I wanted to pull some files off before wiping it.
An hour later I noticed the web UI stopped working. I couldn't SSH in since I (foolishly) had root disabled and still hadn't given su access to my other account. So I pressed the power button, waited a few seconds, and FreeNAS gracefully shut itself down.
I moved the Windows-formatted disk back into the PC. Powered FreeNAS back on. I couldn't mount the pool (see error above).
The pool is an encrypted + password-protected 2-disk stripe (RAID 0). And it couldn't find the second disk. However, in the UI under Storage -> Disks the disk was clearly there!
Digging into the logs:
cat /var/log/middlewared.log | grep -i fail
middleware.notifier: 1946 Importing moon_pool0 failed with cannot import no such pool or dataset
I checked the pool status. It had a disk that didn't have “gptid/...eli” in the name:
zpool status
I checked to make sure the partitions are really there on the disks:
glabel status
gpart list | grep da
All the disks were there and all in “gptid/...eli” format for the partitions.
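The manual scan above could be mechanized: flag any pool member whose name is not in the expected gptid/<uuid>.eli form. A hedged sketch only; the sample member names are invented, and real code would parse actual zpool status output instead.

```python
import re

# Expected member-name form for an encrypted FreeNAS pool: gptid/<uuid>.eli
GPTID_ELI = re.compile(r"^gptid/[0-9a-f-]+\.eli$")

# Invented sample of pool member names; the bare number mimics the
# "string of about 12 numbers" reported above.
zpool_members = [
    "gptid/5f0c2b1a-9d3e-4a7b-8c6d-1e2f3a4b5c6d.eli",  # healthy member
    "123456789012",                                    # suspect member
]

suspect = [m for m in zpool_members if not GPTID_ELI.match(m)]
print(suspect)  # any output here flags a member worth investigating
```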
Workaround
I exported the geli keys for all the pools again "just in case". Then I disconnected the pool (keeping shares, and not deleting the pool). Then I imported the pool, uploaded my key and typed in my password and everything worked.
I checked all the shares worked (they did). Then I rebooted FreeNAS and made sure everything still worked again after a reboot (worked).
Similar
I had a similar problem when first testing FreeNAS months ago. I was simulating a disk failure. I don't remember exactly, but I was either replacing the same missing disk or putting in a new disk to replace a missing disk. When it failed I gave up and just deleted everything and started from scratch.
Problems:
#1 Somehow the wrong partition id for a disk was written into the pool config for one of the encrypted pools. After reimporting the pool everything worked, so I am not sure what I did that triggered this bug. But it's kind of catastrophic, as it can cause downtime and/or total data loss of a pool.
#2 The UI doesn't gracefully handle missing disks (see error).
#3 /var/log/middlewared.log should be rotated and not merely overwritten every reboot. I went back and checked and nothing before September 9th is in the logs (when I last rebooted).
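For #3, FreeBSD-based systems normally handle this with newsyslog(8). A hypothetical entry, untested on FreeNAS (flag letters per newsyslog.conf(5): J compresses rotated files with bzip2, C creates the log if missing):

```
# logfile                  mode count size(KB) when flags
/var/log/middlewared.log   644  7     1000     *    JC
```

This would keep seven compressed rotations instead of losing everything before the last reboot.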