All replication hangs until system is rebooted after getting SSHException

Description

There are 6 replication tasks that run every hour, offset by 10 minutes from each other. The offset works around a separate issue: running more than 2 replication jobs at once causes the system to panic.

At some point a job will encounter the following exception:
[2021/03/05 21:20:16] WARNING [retention] [zettarepl.zettarepl] Remote retention failed on : error listing snapshots: SSHException('Timeout opening channel.')

After that point, ALL replication jobs are stuck in WAITING status, claiming the job is already running.

The only way to clear this state is to reboot the system.

Again, I've checked the box to attach a debug, but I suspect the bug affecting debug generation still exists, so it probably won't be attached automatically.

Problem/Justification

None

Impact

None

Activity

Josh Wisely March 9, 2021 at 7:48 PM

That seems like a good idea, but I'd go further in saying that all operations need a timeout, even if that's something like 15m. Just something so that things don't indefinitely hang.
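A minimal sketch of the "every operation gets a timeout" idea, assuming the operation is an external command run from Python. The helper name `run_with_timeout` and the 15-minute figure are illustrative assumptions, not what zettarepl actually does.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (ok, detail).

    subprocess.run kills the child process and raises TimeoutExpired
    if it has not finished within timeout_seconds, so a hung command
    cannot block the caller forever.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds}s"
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout


if __name__ == "__main__":
    # Stand-in for a hung command: sleeps longer than the allowed time.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hung, 1))
    # A real call might look like this (hypothetical, 15-minute guardrail):
    # run_with_timeout(["zfs", "list", "-t", "snapshot", "-H", "-o", "name"], 900)
```

The limit does not have to be tight. Even a generous one turns an indefinite hang into a reported failure.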

Another path could be to look at the SSH connection itself. SSH has keep-alive mechanisms that can detect when a session ceases to be functional. That should cause the SSH session to be torn down, which in turn should cause the command using the SSH session to fail.
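As an illustration of the keep-alive approach, here is a sketch that assumes the OpenSSH command-line client is used for the connection, which may not match how the replication code actually connects. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options; the helper name, host, and values are hypothetical.

```python
import shlex


def ssh_command(host, remote_argv, interval=30, max_missed=3):
    """Build an ssh argv that gives up on an unresponsive server.

    With these options the client sends a keep-alive probe after
    `interval` seconds of silence and disconnects once `max_missed`
    probes in a row go unanswered.
    """
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval}",
        "-o", f"ServerAliveCountMax={max_missed}",
        "-o", "BatchMode=yes",  # never stop to prompt for a password
        host,
        shlex.join(remote_argv),  # quote the remote command as one string
    ]


if __name__ == "__main__":
    # Hypothetical host name; prints the command instead of running it.
    print(ssh_command("backup.example", ["zfs", "list", "-t", "snapshot"]))
```

With the defaults above, a dead session would be torn down after roughly 90 seconds, and the command using it would fail rather than hang.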

Vladimir Vinogradenko March 9, 2021 at 9:48 AM

We decided to add a timeout exclusively for retention queries.

Josh Wisely March 8, 2021 at 6:53 PM

The way this gets stuck makes sense. There are a fair number of snapshots on these systems (in the 1-2k range), which causes the GUI to take ~20s to show the list of snapshots.

I disagree that a sane timeout isn't possible here. As a blanket statement, you can't trust anything and there should always be guardrails. I think it's reasonable to say that if it still appears dead after 5, 10, or 15m, it should be considered to have timed out.

I'm also confused by your statement that there can't be a timeout here when the PR seems to be implementing exactly that.

Bug Clerk March 8, 2021 at 5:48 PM

Vladimir Vinogradenko March 8, 2021 at 4:16 PM

Usually you have retention for 6 replication tasks.

At the end of these logs you have it only five times.

I suspect the remote zfs list hangs indefinitely when listing snapshots for the 6th retention, due to https://jira.ixsystems.com/browse/NAS-109146
We don't have any timeouts for ZFS CLI operations because we can't pick a sane one. We rely on the ZFS CLI to behave properly.

The subsequent replication tasks do not run because we don't run replication tasks during retention, to avoid conflicts and race conditions.

Your retention SSH connection will break if you reboot a malfunctioning remote machine, retention will finish with an error, and replication tasks will resume.

Does this make sense?
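The blocking behaviour described in this comment can be modelled with a toy scheduler. This is only a sketch under stated assumptions: the class `ToyScheduler` and its methods are hypothetical and are not taken from zettarepl. It shows that replication requests wait while retention is running, and that they resume once retention ends, even when it ends with an error.

```python
class ToyScheduler:
    """Hypothetical model: replication is held back while retention runs."""

    def __init__(self):
        self.retention_running = False
        self.pending = []   # tasks that asked to run during retention
        self.resumed = []   # tasks released after the last retention

    def request_replication(self, task):
        if self.retention_running:
            self.pending.append(task)
            return "WAITING"
        return "RUNNING"

    def run_retention(self, list_snapshots):
        """Run retention; list_snapshots stands in for the remote query."""
        self.retention_running = True
        try:
            list_snapshots()
            return "retention ok"
        except Exception as exc:
            return f"retention failed: {exc}"
        finally:
            # Runs on success and on failure alike. If list_snapshots
            # never returns, this point is never reached and every
            # later task stays in WAITING.
            self.retention_running = False
            self.resumed, self.pending = self.pending, []
```

In this model, a query that hangs forever is exactly the case where the `finally` block never runs, which matches the stuck WAITING state in the report.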

Complete

Created March 5, 2021 at 10:45 PM
Updated July 1, 2022 at 5:13 PM
Resolved March 9, 2021 at 7:23 PM