Replication of dataset fails after big deletion on the source

Description

Hi, I have 200 TB of storage that is replicated to a similar server every night at midnight.

In the past, when I was on FreeNAS, I noticed that sometimes, after big deletions on the source (>10 TB), the next replication would fail ("dataset is busy"). Deletes were pretty slow on the source too, BTW, even using local shell commands.

Now I've upgraded to TrueNAS 12.0-U2.1 and I'm hitting issues with the same cause.

I've deleted about 20 TB of data, and the next night the replication failed, but with a different error:

Replication "My source - Mydest" failed: destination "mydest" contains partially-complete state from "zfs receive -s".
signal received..
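For reference, this error refers to ZFS's resumable-receive feature: an interrupted `zfs receive -s` leaves partial state on the destination, which can either be resumed from its token or discarded. A sketch, assuming hypothetical pool/dataset names (`tank/mydest`); `<token>` is a placeholder for the actual token value:

```shell
# Show the resume token left behind by the interrupted "zfs receive -s"
# (run on the destination):
zfs get -H -o value receive_resume_token tank/mydest

# Option 1: resume the interrupted transfer from that token:
zfs send -t <token> | zfs receive -s tank/mydest

# Option 2: discard the partial state so a fresh replication can start:
zfs receive -A tank/mydest
```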

I've tried re-running the snapshot → replication manually during the next day, but it failed too, with the same error.

Now it is running again on the scheduled midnight job, but I'm afraid it will fail again.

Initially my snapshot retention was set to 2 days; I've set it to 4 in order to buy time.

Will the source delete its own snapshot (the one it still has in common with the destination) even if the replication keeps failing? If that happens, do I need to restart the whole replication from scratch (>150 TB of unsecured data)?
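Incremental replication can only resume while at least one snapshot name still exists on both machines, so one way to check exposure is to compare the snapshot lists on each side. A sketch with hypothetical dataset names:

```shell
# On the source: list snapshots oldest-first with creation times.
zfs list -t snapshot -o name,creation -s creation tank/mysource

# On the destination: same listing; the newest name common to both
# lists is the incremental base the next replication will rely on.
zfs list -t snapshot -o name,creation -s creation tank/mydest
```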

Any idea how to correct the replication issues?

Best regards


Activity


pierre billet May 31, 2022 at 8:26 PM
Edited

Finally, everything went back in order by doing things in this order:

 

- rollback the destination to the latest existing snapshot

- start replication using the PULL method over SSH (not SSH+netcat) and be patient

 

Despite totally erroneous figures in the monitor (stating exabytes of data to sync and 140 snapshots to sync instead of 4), the first snapshot (the one that carries the deletions) finally went through.
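The rollback step above can be sketched as follows, assuming hypothetical dataset and snapshot names (the snapshot name is a placeholder; use whatever the listing actually shows):

```shell
# Drop any "zfs receive -s" leftovers on the destination first, if present:
zfs receive -A tank/mydest

# Find the newest snapshot the destination fully holds:
zfs list -t snapshot -o name -s creation tank/mydest | tail -1

# Roll the destination back to it, discarding any diverged local changes
# (-r also destroys any snapshots newer than the target):
zfs rollback -r tank/mydest@auto-2022-05-30_00-00
```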


pierre billet May 31, 2022 at 1:14 PM
Edited

Here is the log from the pull replication, which times out after one hour and then fails after another hour of retrying every minute.

 

280.log

pierre billet May 31, 2022 at 1:12 PM

After the second hour, the pull replication fails in the same way.

The patch seems to have delayed the issue by one hour.

 

I've attached the log.

pierre billet May 31, 2022 at 12:12 PM

OK, one hour later, the patch you made last year seems to kick in; the zettarepl.log on the destination states this:

If I remember correctly, it will retry every minute for 60 minutes.

 

But the monitoring window starts displaying strange things: initially it was 1 of 4 (4 snapshots to sync), but every time the fix code runs, it adds 4 more.

Now it is 1 of 28.
 

Is that OK? Do I let it run for another hour?

I'm a bit lost.

 

 

pierre billet May 31, 2022 at 11:10 AM

Thanks, I started it. Is there a way to monitor whether something is progressing?

I should probably have set logging to debug.
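One low-effort way to watch progress is to follow the zettarepl log mentioned elsewhere in the thread; the path below is an assumption and may differ between TrueNAS versions:

```shell
# Follow replication activity as it is logged on the destination.
tail -f /var/log/zettarepl.log
```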

 

 

Details

Status: Complete

Impact: Critical
Created April 5, 2021 at 10:55 PM
Updated September 20, 2023 at 1:14 PM
Resolved May 3, 2021 at 2:33 PM