Huge backlog of periodic snapshot replications continuously error out
Description
Problem/Justification
Impact
Activity

Vladimir Vinogradenko September 2, 2020 at 9:19 PM
This issue was fixed in 11.3-U4.1.

Mark Bailey August 31, 2020 at 10:42 PM (edited)
The requested debug has been attached.
Also, everything is now finally synced and running again, but I had to reboot the server numerous times before it completed, because once the max-pipe-exceeded condition develops, most everything (SMB, etc.) quits working until a reboot. The internal default for 'kern.ipc.maxpipekva' is, I think, something like 10000. Prior to massively increasing that value in Tunables, the error condition would appear after only about 30 minutes; setting it to 256 MB got me to about two days before I had to reboot (a sketch of checking and raising this tunable follows this comment). Were this not an issue, I think my replication backlog would have caught up within 48-72 hours (if not sooner), as most snapshots are very small or zero bytes in size.
Others on the forum had suggested that more snapshots were being created than could be pushed out, and over a WAN I suppose that might prove true, but this is all over gigabit LAN. Replication is automatic, so at any given snapshot interval the number and size of snapshots to be replicated is really quite manageable. If some replication setup were not manageable, I would hope the GUI could advise the user of that, but I really don't think running out of pipes should even be a factor here. I even tried disabling all but one replication task, and it still eventually errored out.
Anyway, thanks again very much for taking a deeper look into this! It really surprised me that simply catching back up would prove such an effort.
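A minimal sketch of checking and raising the limit discussed in the comment above, assuming a shell on a FreeBSD-based FreeNAS 11.3 system and that kern.ipc.maxpipekva behaves as on stock FreeBSD (a boot-time loader tunable, expressed in bytes); the 256 MB value is just the one mentioned above, not a recommendation:

  # Read-only sysctls: the boot-time limit and the current pipe KVA usage
  sysctl kern.ipc.maxpipekva
  sysctl kern.ipc.pipekva

  # kern.ipc.maxpipekva cannot be changed at runtime; it is raised via a
  # loader.conf entry, or in the FreeNAS GUI under System -> Tunables with
  # Type = "loader", e.g. for 256 MB:
  #   Variable: kern.ipc.maxpipekva   Value: 268435456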

Bonnie Follweiler August 31, 2020 at 1:06 PM
Thank you for the report. Can you please provide a debug by navigating to System -> Advanced, clicking Save Debug, and uploading the attachment to this ticket?
Please see my forum post below. Sorry, but I really don't want to have to repeat myself here.
https://www.ixsystems.com/community/threads/cant-pin-down-why-i-keep-developing-this-error-with-snapshot-replication-to-2-other-freenases-on-same-lan.87048/
The gist is that a huge backlog of snapshot replications (which can arise for any number of reasons) continuously errors out after exceeding 'kern.ipc.maxpipekva'.
Something in the replication processing of a backlog is opening pipes and not closing them (a rough way to watch for this is sketched at the end of this report). A reboot of the server is the only way to clear the condition, but as replication resumes, it eventually returns to the same error state.
Again, details are in my forum post. Thanks!
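For anyone trying to reproduce this, a rough way to watch the suspected pipe leak while replication catches up, assuming a FreeBSD base system (fstat and sysctl are standard there); the grep on "pipe" is a loose match, and which process actually holds the pipes is not established by this ticket:

  # Current kernel pipe KVA usage; a steady climb toward kern.ipc.maxpipekva
  # while replication runs would suggest pipes are not being released
  while :; do date; sysctl -n kern.ipc.pipekva; sleep 60; done

  # Tally open pipe endpoints by user/command/pid to see which process is
  # accumulating them
  fstat | grep pipe | awk '{print $1, $2, $3}' | sort | uniq -c | sort -rn | head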