Replication of dataset fails after big deletion on the source

Description

Hi, I have 200 TB of storage that is replicated to a similar server every night at midnight.

In the past, when I was on FreeNAS, I noticed that sometimes, after big deletions on the source (>10 TB), the next replication would fail ("dataset is busy"). Deletes were pretty slow on the source too, BTW, even using local shell commands.

Now I've upgraded to TrueNAS 12.0-U2.1 and I'm hitting issues with the same cause.

I've deleted about 20 TB of data, and the next night the replication failed, but with a different error:

Replication "My source - Mydest" failed: destination "mydest" contains partially-complete state from "zfs receive -s".
signal received..
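For reference, this error refers to ZFS's resumable-receive feature: an interrupted `zfs receive -s` leaves partial state on the destination, which can either be resumed from its token or discarded. A sketch, assuming hypothetical pool/dataset names (`tank/mydest`); `<token>` is a placeholder for the actual token value:

```shell
# Show the resume token left behind by the interrupted "zfs receive -s"
# (run on the destination):
zfs get -H -o value receive_resume_token tank/mydest

# Option 1: resume the interrupted transfer from that token:
zfs send -t <token> | zfs receive -s tank/mydest

# Option 2: discard the partial state so a fresh replication can start:
zfs receive -A tank/mydest
```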

I've tried re-running the snapshot → replication manually during the next day, but it failed too, with the same error.

Now it is running again on the scheduled midnight job, but I'm afraid it will fail again.

Initially my snapshot retention was set to 2 days; I've set it to 4 in order to buy time.

Will the source delete its own snapshot (the one it still has in common with the destination) even if the replication keeps failing? If that happens, do I need to restart the whole replication from scratch (>150 TB of unsecured data)?
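Incremental replication can only resume while at least one snapshot name still exists on both machines, so one way to check exposure is to compare the snapshot lists on each side. A sketch with hypothetical dataset names:

```shell
# On the source: list snapshots oldest-first with creation times.
zfs list -t snapshot -o name,creation -s creation tank/mysource

# On the destination: same listing; the newest name common to both
# lists is the incremental base the next replication will rely on.
zfs list -t snapshot -o name,creation -s creation tank/mydest
```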

Any idea how to correct the replication issues?

Best regards


Activity


pierre billet May 31, 2022 at 8:26 PM
Edited

Finally, everything went back in order by doing things in this order:

 

- rollback the destination to the latest existing snapshot

- start replication using the PULL method over SSH (not SSH+netcat) and be patient

 

Despite totally erroneous figures in the monitor (stating exabytes of data to sync and 140 snapshots to sync instead of 4), the first snapshot (the one that carries the deletions) finally went through.
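The rollback step above can be sketched as follows, assuming hypothetical dataset and snapshot names (the snapshot name is a placeholder; use whatever the listing actually shows):

```shell
# Drop any "zfs receive -s" leftovers on the destination first, if present:
zfs receive -A tank/mydest

# Find the newest snapshot the destination fully holds:
zfs list -t snapshot -o name -s creation tank/mydest | tail -1

# Roll the destination back to it, discarding any diverged local changes
# (-r also destroys any snapshots newer than the target):
zfs rollback -r tank/mydest@auto-2022-05-30_00-00
```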


pierre billet May 31, 2022 at 1:14 PM
Edited

Here is the log from the pull replication, which times out after one hour and then fails after another hour of retrying every minute.

 

280.log

pierre billet May 31, 2022 at 1:12 PM

After the second hour, the pull replication fails in the same way.

The patch seems to have delayed the issue by one hour.

 

I've attached the log.

pierre billet May 31, 2022 at 12:12 PM

OK, one hour later, the patch you made last year seems to kick in; the zettarepl.log on the destination states this:

If I remember correctly, it will retry every minute for 60 minutes.

 

But the monitoring window starts displaying strange things: initially it was 1 of 4 (4 snapshots to sync), but every time the fix code runs, it adds 4 more.

Now it is 1 of 28.
 

Is that OK? Do I let it run for another hour?

I'm a bit lost.

 

 

pierre billet May 31, 2022 at 11:10 AM

Thanks, I started it. Is there a way to monitor whether something is progressing?

I should probably have set logging to debug.
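One low-effort way to watch progress is to follow the zettarepl log mentioned elsewhere in the thread; the path below is an assumption and may differ between TrueNAS versions:

```shell
# Follow replication activity as it is logged on the destination.
tail -f /var/log/zettarepl.log
```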

 

 

Details

Status: Complete

Impact: Critical
Created April 5, 2021 at 10:55 PM
Updated September 20, 2023 at 1:14 PM
Resolved May 3, 2021 at 2:33 PM