Two differently named snapshots of the same file system done at the same time cause ssh replication to fail
Description
Problem/Justification
None
Impact
None
Activity

Ryan Moeller September 10, 2021 at 9:56 AM
Merged for U6 and should be landing upstream soon.

w.kruzel August 24, 2021 at 12:57 PM
Debug as requested.
Apologies for the delay but I was on leave.

iX-Tony August 3, 2021 at 8:35 PM
Thank you for the report, @w.kruzel. Can you please attach a debug file to the "Private Attachments" section of this ticket? To generate a debug file on TrueNAS CORE, log in to the TrueNAS web interface, go to System > Advanced, then click SAVE DEBUG and wait for the file to download to your local system.
Complete
Details
Assignee

Reporter

Labels
Impact
Medium
Time remaining
0m
Components
Fix versions
Priority
Created August 1, 2021 at 9:21 PM
Updated July 6, 2022 at 9:01 PM
Resolved September 10, 2021 at 9:56 AM
Having two snapshot cron jobs (e.g. one @daily and the other @weekly) on the same filesystem trigger two snapshots at exactly the same time results in both being registered as 0B (a possible race condition).
This happened on a Linux box where I set up ZFS replication onto TrueNAS using a pull configuration.
On TrueNAS, the replication worked until it hit this problem: when the zero-byte snapshot was pulled via SSH, the transfer failed.
Below is output from zettarepl.log and a zfs list from the source host.
I am not sure whether this is a bug in TrueNAS or in ZFS, but because of the race condition, replication over SSH fails on these 0B snapshots.
This happened on two different filesystems, both of which had daily and weekly snapshots replicated over SSH.
Local replication works fine.
See the logs below as well.
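A minimal sketch of how the snapshot listing below can be obtained on the source host, assuming standard OpenZFS tooling (the exact command used is not shown in the ticket):
# List all snapshots of the affected dataset with their space usage;
# the colliding daily/weekly snapshots show up with 0B in the USED column.
zfs list -t snapshot -r -o name,used,avail,refer,mountpoint data/jfrog-filestore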
zfs list of snapshots on source host:
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-05-0000 19.4M - 5.84T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-06-0000 18.5M - 6.00T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-07-0000 145M - 6.06T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-08-0000 18.7M - 6.15T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-09-0000 19.1M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-10-0000 18.7M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000 0B - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000 0B - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-12-0000 19.3M - 6.19T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-13-0000 18.6M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-14-0000 167M - 6.32T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-15-0000 19.0M - 6.33T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-16-0000 19.1M - 6.31T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-17-0000 92.9M - 6.34T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-18-0000 0B - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-18-0000 0B - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-19-0000 18.4M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-20-0000 18.5M - 6.32T -
zettarepl.log
[2021/08/01 00:03:00] INFO [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Replication Task 'task_6'>]
[2021/08/01 00:03:00] INFO [Thread-30045] [zettarepl.paramiko.replication_task__task_6] Connected (version 2.0, client OpenSSH_7.6p1)
[2021/08/01 00:03:00] INFO [Thread-30045] [zettarepl.paramiko.replication_task__task_6] Authentication (publickey) successful!
[2021/08/01 00:03:02] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'data/jfrog-filestore' to 'bowl2/replicated/jfrog-filestore' of snapshot='zfs-auto-snap_daily-2021-07-11-0000' incremental_base='zfs-auto-snap_daily-2021-07-10-0000' receive_resume_token=None encryption=False
[2021/08/01 00:11:24] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'data/jfrog-filestore' to 'bowl2/replicated/jfrog-filestore' of snapshot='zfs-auto-snap_weekly-2021-07-11-0000' incremental_base='zfs-auto-snap_daily-2021-07-11-0000' receive_resume_token=None encryption=False
[2021/08/01 00:11:36] ERROR [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error ExecException(134, 'zettarepl: zfs send PID is 28483\n0 == nvlist_lookup_nvlist(nvfs, "snapprops", &snapprops)\nASSERT at libzfs_sendrecv.c:1279:dump_snapshot()cannot receive: failed to read from stream\n')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 162, in run_replication_tasks
run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 252, in run_replication_task_part
run_replication_steps(step_templates, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 560, in run_replication_steps
replicate_snapshots(step_template, incremental_base, snapshots, encryption, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 611, in replicate_snapshots
run_replication_step(step, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 688, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/process_runner.py", line 33, in run
raise self.process_exception
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
self.replication_process.wait()
File "/usr/local/lib/python3.8/site-packages/zettarepl/transport/ssh.py", line 147, in wait
stdout = self.async_exec.wait()
File "/usr/local/lib/python3.8/site-packages/zettarepl/transport/async_exec_tee.py", line 103, in wait
raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: zettarepl: zfs send PID is 28483
0 == nvlist_lookup_nvlist(nvfs, "snapprops", &snapprops)
ASSERT at libzfs_sendrecv.c:1279:dump_snapshot()cannot receive: failed to read from stream
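A rough sketch of what the failing incremental pull boils down to, based on the log above (an assumption for illustration only; zettarepl's actual command line and flags are not reproduced here, and <source-host> is a placeholder):
# Incremental send of the 0B weekly snapshot on top of the daily snapshot
# taken at the same instant; this is the step where the libzfs assertion fires.
ssh <source-host> "zfs send -i data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000 \
    data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000" \
  | zfs receive bowl2/replicated/jfrog-filestore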
Workaround found: delete one of the colliding snapshots on the source machine. The other snapshot then gains a non-zero size and is transferred via SSH correctly.
For future snapshots, stagger the cron timing between the different snapshot schedules.
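A minimal sketch of the workaround on the source machine, assuming the daily snapshot of the colliding pair is the one that can be dropped (either one works):
# Remove one of the two snapshots that were created at the same instant...
zfs destroy data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000
# ...then confirm the surviving snapshot now reports a non-zero USED value
# before retrying the SSH replication.
zfs list -t snapshot -o name,used,refer data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000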
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-05-0000 19.4M - 5.84T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-06-0000 18.5M - 6.00T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-07-0000 145M - 6.06T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-08-0000 18.7M - 6.15T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-09-0000 19.1M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-10-0000 18.7M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000 81.3M - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-12-0000 19.3M - 6.19T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-13-0000 18.6M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-14-0000 167M - 6.32T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-15-0000 19.0M - 6.33T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-16-0000 19.1M - 6.31T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-17-0000 92.9M - 6.34T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-18-0000 18.7M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-19-0000 18.4M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-20-0000 18.5M - 6.32T -
How to reproduce the problem:
Set up two snapshot schedules for the same filesystem to trigger at the same time (a crontab sketch is shown below).
Then try to replicate said snapshots over SSH.
Please check whether the same issue can also occur with TrueNAS as the source machine.
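For illustration, hypothetical crontab entries that reproduce the collision. The snapshot names above suggest zfs-auto-snapshot is in use on the source; plain zfs snapshot calls are shown here as an assumed equivalent:
# Both jobs fire at 00:00; on Sundays the daily and weekly snapshots are
# created at the same instant and both end up registered as 0B.
# (% must be escaped as \% inside a crontab.)
0 0 * * * /sbin/zfs snapshot data/jfrog-filestore@daily-$(date +\%Y-\%m-\%d)-0000
0 0 * * 0 /sbin/zfs snapshot data/jfrog-filestore@weekly-$(date +\%Y-\%m-\%d)-0000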