Manual reboot of active controller via SSH breaks HA on SCALE

Description

I attempted to trigger a failover by running reboot in an SSH session on the active controller.

Now the passive controller is frozen in this state and HA never becomes healthy again.

Problem/Justification

None

Impact

None

Attachments

21 Mar 2023, 09:00 PM

Activity

Bug Clerk, March 23, 2023 at 1:31 PM

22.12.3 PR: https://github.com/truenas/middleware/pull/10940

Bug Clerk, March 23, 2023 at 1:31 PM

22.12.2 PR: https://github.com/truenas/middleware/pull/10939

Automation for Jira, March 23, 2023 at 1:31 PM

This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.

Bug Clerk, March 22, 2023 at 8:16 PM

23.10 PR: https://github.com/truenas/middleware/pull/10937

Caleb, March 22, 2023 at 12:32 PM

@Andrew Walker could you help me investigate this one please? Here is what I suspect.

1. fenced is running
2. reboot invokes systemd, which does its various things related to shutdown.target
3. the keepalived systemd unit gets stopped, which sends a BACKUP event
4. we start processing the failover event and begin to export pools
5. at the same time, systemd continues stopping services and the middlewared process gets stopped (more than likely killed)
6. because middlewared gets killed, the processing of the failover event stops
7. fenced is also stopped at some point in this process
8. by this point, the other controller has received a MASTER event and has started fenced
9. the other controller detects that the SCSI reservation keys did NOT change on the disks (because fenced was stopped/killed during the reboot process)
10. the other controller reserves the disks while the export of the zpools never completes
11. everything gets hung
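
The suspected sequence above can be sketched as a toy model. This is only an illustration of the race between the pool export and the shutdown of middlewared, not TrueNAS code: the Controller class, the simulate_reboot function, and the export_finishes_before_kill flag are all hypothetical names introduced here, and the model assumes that a running fenced is what keeps the reservation keys changing.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Minimal state for one HA node in this toy model (hypothetical)."""
    name: str
    fenced_running: bool = False
    middlewared_running: bool = False
    pools_exported: bool = False
    holds_reservation: bool = False


def simulate_reboot(export_finishes_before_kill: bool) -> dict:
    """Walk through the suspected steps and report whether the pair hangs."""
    # Step 1: fenced is running on the active controller.
    active = Controller("active", fenced_running=True,
                        middlewared_running=True, holds_reservation=True)
    other = Controller("other", middlewared_running=True)

    # Steps 2-4: keepalived stops, a BACKUP event arrives, the export begins.
    export_started = active.middlewared_running

    # Steps 5-6: systemd stops middlewared; the export only completes
    # if it finished before the process was stopped or killed.
    active.pools_exported = export_started and export_finishes_before_kill
    active.middlewared_running = False

    # Step 7: fenced is stopped too, so the keys stop changing (assumption).
    active.fenced_running = False

    # Step 8: the other controller receives MASTER and starts fenced.
    other.fenced_running = True

    # Steps 9-10: it sees unchanged keys and reserves the disks.
    keys_changing = active.fenced_running
    if other.fenced_running and not keys_changing:
        other.holds_reservation = True
        active.holds_reservation = False

    # Step 11: disks were taken while the export never completed.
    hung = other.holds_reservation and not active.pools_exported
    return {"pools_exported": active.pools_exported,
            "other_holds_reservation": other.holds_reservation,
            "hung": hung}
```

Calling simulate_reboot(False) models the failing case from this ticket, where middlewared is stopped mid-export. Calling simulate_reboot(True) models the case where the export wins the race, and the model then reports no hang.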

I’m not sure what we can do in this scenario. I looked up systemd shutdown.target and the “Conflicts” arguments that can be added to systemd unit service files, but I’m just curious if you could help me investigate it.
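
For reference while looking at unit ordering: systemd stops units in the reverse of their start ordering, so a unit with After=X is stopped before X. The sketch below computes that stop order from a set of After= relationships. The stop_order helper and the single ordering edge in the example are hypothetical and are not taken from the real TrueNAS unit files.

```python
from graphlib import TopologicalSorter


def stop_order(after: dict) -> list:
    """Return a stop order for units, given their After= relationships.

    `after` maps a unit name to the set of units listed in its After=
    setting. Start order puts those units first; stop order is the reverse.
    """
    start_order = list(TopologicalSorter(after).static_order())
    return list(reversed(start_order))


# Hypothetical ordering edge, used only to illustrate the rule.
hypothetical_after = {"keepalived.service": {"middlewared.service"}}
print(stop_order(hypothetical_after))
# prints ['keepalived.service', 'middlewared.service']
```

Ordering only determines when each stop begins. Whether it is enough here depends on whether stopping middlewared waits for an in-flight failover event, which this sketch does not model. Units with no ordering relationship between them are stopped in parallel.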

Complete

Details
Assignee: Andrew Walker
Reporter: Andrew Walker
Impact: High
Components: HA
Fix versions: SCALE-22.12.3, SCALE-23.10-ALPHA.1, SCALE-22.12.2
Priority: Blocker

Created March 21, 2023 at 9:00 PM

Updated March 23, 2023 at 1:45 PM

Resolved March 23, 2023 at 1:45 PM
