Huge backlog of periodic snapshot replications continuously error out
Description
Problem/Justification
Impact
Activity

Vladimir Vinogradenko September 2, 2020 at 9:19 PM
This issue was fixed in 11.3-U4.1.

Mark Bailey August 31, 2020 at 10:42 PM (edited)
The requested debug has been attached.
Also, everything is now finally synced and running again, but I had to reboot the server numerous times before it completed, because once the max-pipe-exceeded condition develops, most everything (SMB, etc.) quits working until a reboot. The internal default for 'kern.ipc.maxpipekva' is, I think, something like 10000. Prior to massively increasing that value in Tunables, the error condition would appear after only about 30 minutes; setting it to 256 MB got me to about two days before I had to reboot (a sketch of checking and raising this tunable follows this comment). Were this not an issue, I think my replication backlog would have caught up within 48-72 hours (if not sooner), as most snapshots are very small or zero bytes in size.
Others on the forum had suggested that more snapshots were being created than could be pushed out, and over a WAN I suppose that might prove true, but this is all over gigabit LAN. Replication is automatic, so at any given snapshot interval the number and size of snapshots to be replicated is really quite manageable. If some replication setup were not manageable, I would hope the GUI could advise the user of that, but I really don't think running out of pipes should even be a factor here. I even tried disabling all but one replication task, and it still eventually errored out.
Anyway, thanks again very much for taking a deeper look into this! It really surprised me that simply catching back up would prove such an effort.
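A minimal sketch of checking and raising the limit discussed in the comment above, assuming a shell on a FreeBSD-based FreeNAS 11.3 system and that kern.ipc.maxpipekva behaves as on stock FreeBSD (a boot-time loader tunable, expressed in bytes); the 256 MB value is just the one mentioned above, not a recommendation:

  # Read-only sysctls: the boot-time limit and the current pipe KVA usage
  sysctl kern.ipc.maxpipekva
  sysctl kern.ipc.pipekva

  # kern.ipc.maxpipekva cannot be changed at runtime; it is raised via a
  # loader.conf entry, or in the FreeNAS GUI under System -> Tunables with
  # Type = "loader", e.g. for 256 MB:
  #   Variable: kern.ipc.maxpipekva   Value: 268435456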

Bonnie Follweiler August 31, 2020 at 1:06 PM
Thank you for the report. Can you please provide a debug by navigating to System -> Advanced, clicking Save Debug, and uploading the attachment to this ticket?
Please see my forum post below. Sorry, but I really don't want to have to repeat myself here.
https://www.ixsystems.com/community/threads/cant-pin-down-why-i-keep-developing-this-error-with-snapshot-replication-to-2-other-freenases-on-same-lan.87048/
The gist is that a huge backlog of snapshot replications (which can arise for any number of reasons) continuously errors out after exceeding 'kern.ipc.maxpipekva'.
Something in the replication processing of a backlog is opening pipes and not closing them (a rough way to watch for this is sketched at the end of this report). A reboot of the server is the only way to clear the condition, but as replication resumes, it eventually returns to the same error state.
Again, details are in my forum post. Thanks!
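For anyone trying to reproduce this, a rough way to watch the suspected pipe leak while replication catches up, assuming a FreeBSD base system (fstat and sysctl are standard there); the grep on "pipe" is a loose match, and which process actually holds the pipes is not established by this ticket:

  # Current kernel pipe KVA usage; a steady climb toward kern.ipc.maxpipekva
  # while replication runs would suggest pipes are not being released
  while :; do date; sysctl -n kern.ipc.pipekva; sleep 60; done

  # Tally open pipe endpoints by user/command/pid to see which process is
  # accumulating them
  fstat | grep pipe | awk '{print $1, $2, $3}' | sort | uniq -c | sort -rn | head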