Manual reboot of active controller via SSH breaks HA on SCALE
Description
Problem/Justification
Impact
Attachments
- 21 Mar 2023, 09:00 PM
Activity
Automation for Jira, March 23, 2023 at 1:31 PM
This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.
Caleb, March 22, 2023 at 12:32 PM
@Andrew Walker could you help me investigate this one please? Here is what I suspect.
1. fenced is running
2. reboot invokes systemd and does its various things related to shutdown.target
3. keepalived systemd unit gets stopped, which sends a BACKUP event
4. we start processing the failover event and begin to export pools
5. at the same time, systemd continues stopping services and the middlewared process gets stopped (more than likely killed)
6. because middlewared gets killed, the processing of the failover event stops
7. fenced is also stopped at some point in this process
8. by this point, other controller has received MASTER event and has started fenced
9. other controller detects that SCSI reservation keys did NOT change on the disks (because fenced was stopped/killed during reboot process)
10. other controller reserves the disks while export of the zpools never completes
11. everything gets hung
I’m not sure what we can do in this scenario. I looked up systemd shutdown.target and the “Conflicts” arguments that can be added to systemd unit files, but I’m curious whether you could help me investigate it.
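The suspected sequence in the numbered list above can be sketched as a toy timing model. This is only an illustration of the race as described in the comment, not TrueNAS or middlewared code: the function name, the field names, and all timings are hypothetical, and the real ordering of events during shutdown has not been confirmed.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    pools_exported: bool       # did the rebooting controller finish the export?
    keys_still_changing: bool  # was fenced still alive when the peer checked?
    peer_reserved_disks: bool  # did the peer take the SCSI reservations?
    hung: bool                 # disks reserved by the peer, pools never exported


def simulate_reboot(export_needs: float, middlewared_killed_at: float,
                    fenced_stopped_at: float, peer_checks_at: float) -> Outcome:
    """Toy model of steps 4-11. All arguments are seconds after `reboot`
    (hypothetical values, chosen only for illustration)."""
    # Steps 4-6: the export finishes only if middlewared survives long enough.
    pools_exported = export_needs <= middlewared_killed_at
    # Steps 7-9: the peer sees changing keys only while fenced is still running.
    keys_still_changing = fenced_stopped_at > peer_checks_at
    # Step 10: stale keys let the peer reserve the disks.
    peer_reserved_disks = not keys_still_changing
    # Step 11: reserved disks plus an unfinished export is the hang.
    hung = peer_reserved_disks and not pools_exported
    return Outcome(pools_exported, keys_still_changing, peer_reserved_disks, hung)


if __name__ == "__main__":
    # Suspected failure: middlewared is killed long before the export can finish.
    print(simulate_reboot(export_needs=30, middlewared_killed_at=5,
                          fenced_stopped_at=6, peer_checks_at=10))
    # Clean handover: the export completes before middlewared goes away.
    print(simulate_reboot(export_needs=3, middlewared_killed_at=5,
                          fenced_stopped_at=6, peer_checks_at=10))
```

In this model the hang disappears if the export finishes before middlewared is stopped, which is why ordering the units during shutdown looks like the thing to investigate.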
I attempted to fail over by running
reboot
in an SSH session on the active controller. Now the passive controller is frozen in this state and HA never becomes healthy again.