Two differently named snapshots of the same file system done at the same time cause ssh replication to fail
Description
Problem/Justification
None
Impact
None
Activity

Ryan Moeller September 10, 2021 at 9:56 AM
Merged for U6 and should be landing upstream soon.

w.kruzel August 24, 2021 at 12:57 PM
Debug as requested.
Apologies for the delay but I was on leave.

iX-Tony August 3, 2021 at 8:35 PM
Thank you for the report, @w.kruzel. Can you please attach a debug file to the "Private Attachments" section of this ticket? To generate a debug file on TrueNAS CORE, log in to the TrueNAS web interface, go to System > Advanced, then click SAVE DEBUG and wait for the file to download to your local system.
Complete
Details
Assignee

Reporter

Labels
Impact
Medium
Time remaining
0m
Components
Fix versions
Priority
Created August 1, 2021 at 9:21 PM
Updated July 6, 2022 at 9:01 PM
Resolved September 10, 2021 at 9:56 AM
Having two snapshot cron jobs (e.g. one @daily and the other @weekly) on the same filesystem trigger two snapshots at exactly the same time results in both being registered as 0B (a possible race condition).
This happened on a Linux box where I set up ZFS replication onto TrueNAS using a pull configuration.
On TrueNAS, the replication worked until it hit this problem: when the zero-byte snapshot was pulled via SSH, the transfer failed.
Below is output from zettarepl.log and a zfs list from the source host.
I am not sure whether this is a bug in TrueNAS or in ZFS, but because of the race condition, replication over SSH fails on these 0B snapshots.
This happened on two different filesystems, both of which had daily and weekly snapshots replicated over SSH.
Local replication works fine.
See the logs below as well.
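A minimal sketch of how the snapshot listing below can be obtained on the source host, assuming standard OpenZFS tooling (the exact command used is not shown in the ticket):
# List all snapshots of the affected dataset with their space usage;
# the colliding daily/weekly snapshots show up with 0B in the USED column.
zfs list -t snapshot -r -o name,used,avail,refer,mountpoint data/jfrog-filestore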
zfs list of snapshots on source host:
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-05-0000 19.4M - 5.84T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-06-0000 18.5M - 6.00T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-07-0000 145M - 6.06T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-08-0000 18.7M - 6.15T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-09-0000 19.1M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-10-0000 18.7M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000 0B - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000 0B - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-12-0000 19.3M - 6.19T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-13-0000 18.6M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-14-0000 167M - 6.32T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-15-0000 19.0M - 6.33T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-16-0000 19.1M - 6.31T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-17-0000 92.9M - 6.34T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-18-0000 0B - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-18-0000 0B - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-19-0000 18.4M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-20-0000 18.5M - 6.32T -
zettarepl.log
[2021/08/01 00:03:00] INFO [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Replication Task 'task_6'>]
[2021/08/01 00:03:00] INFO [Thread-30045] [zettarepl.paramiko.replication_task__task_6] Connected (version 2.0, client OpenSSH_7.6p1)
[2021/08/01 00:03:00] INFO [Thread-30045] [zettarepl.paramiko.replication_task__task_6] Authentication (publickey) successful!
[2021/08/01 00:03:02] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'data/jfrog-filestore' to 'bowl2/replicated/jfrog-filestore' of snapshot='zfs-auto-snap_daily-2021-07-11-0000' incremental_base='zfs-auto-snap_daily-2021-07-10-0000' receive_resume_token=None encryption=False
[2021/08/01 00:11:24] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'data/jfrog-filestore' to 'bowl2/replicated/jfrog-filestore' of snapshot='zfs-auto-snap_weekly-2021-07-11-0000' incremental_base='zfs-auto-snap_daily-2021-07-11-0000' receive_resume_token=None encryption=False
[2021/08/01 00:11:36] ERROR [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error ExecException(134, 'zettarepl: zfs send PID is 28483\n0 == nvlist_lookup_nvlist(nvfs, "snapprops", &snapprops)\nASSERT at libzfs_sendrecv.c:1279:dump_snapshot()cannot receive: failed to read from stream\n')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 162, in run_replication_tasks
run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 252, in run_replication_task_part
run_replication_steps(step_templates, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 560, in run_replication_steps
replicate_snapshots(step_template, incremental_base, snapshots, encryption, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 611, in replicate_snapshots
run_replication_step(step, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 688, in run_replication_step
ReplicationProcessRunner(process, monitor).run()
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/process_runner.py", line 33, in run
raise self.process_exception
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
self.replication_process.wait()
File "/usr/local/lib/python3.8/site-packages/zettarepl/transport/ssh.py", line 147, in wait
stdout = self.async_exec.wait()
File "/usr/local/lib/python3.8/site-packages/zettarepl/transport/async_exec_tee.py", line 103, in wait
raise ExecException(exit_event.returncode, self.output)
zettarepl.transport.interface.ExecException: zettarepl: zfs send PID is 28483
0 == nvlist_lookup_nvlist(nvfs, "snapprops", &snapprops)
ASSERT at libzfs_sendrecv.c:1279:dump_snapshot()cannot receive: failed to read from stream
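A rough sketch of what the failing incremental pull boils down to, based on the log above (an assumption for illustration only; zettarepl's actual command line and flags are not reproduced here, and <source-host> is a placeholder):
# Incremental send of the 0B weekly snapshot on top of the daily snapshot
# taken at the same instant; this is the step where the libzfs assertion fires.
ssh <source-host> "zfs send -i data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000 \
    data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000" \
  | zfs receive bowl2/replicated/jfrog-filestore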
Workaround found: delete one of the colliding snapshots on the source machine. The other snapshot then gains a non-zero size and is transferred via SSH correctly.
For future snapshots, stagger the cron timing between the different snapshot schedules.
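A minimal sketch of the workaround on the source machine, assuming the daily snapshot of the colliding pair is the one that can be dropped (either one works):
# Remove one of the two snapshots that were created at the same instant...
zfs destroy data/jfrog-filestore@zfs-auto-snap_daily-2021-07-11-0000
# ...then confirm the surviving snapshot now reports a non-zero USED value
# before retrying the SSH replication.
zfs list -t snapshot -o name,used,refer data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000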
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-05-0000 19.4M - 5.84T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-06-0000 18.5M - 6.00T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-07-0000 145M - 6.06T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-08-0000 18.7M - 6.15T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-09-0000 19.1M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-10-0000 18.7M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-11-0000 81.3M - 6.22T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-12-0000 19.3M - 6.19T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-13-0000 18.6M - 6.23T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-14-0000 167M - 6.32T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-15-0000 19.0M - 6.33T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-16-0000 19.1M - 6.31T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-17-0000 92.9M - 6.34T -
data/jfrog-filestore@zfs-auto-snap_weekly-2021-07-18-0000 18.7M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-19-0000 18.4M - 6.30T -
data/jfrog-filestore@zfs-auto-snap_daily-2021-07-20-0000 18.5M - 6.32T -
How to reproduce the problem:
Set up two snapshot schedules for the same filesystem to trigger at the same time (a crontab sketch is shown below).
Then try to replicate said snapshots over SSH.
Please check whether the same issue can also occur with TrueNAS as the source machine.
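For illustration, hypothetical crontab entries that reproduce the collision. The snapshot names above suggest zfs-auto-snapshot is in use on the source; plain zfs snapshot calls are shown here as an assumed equivalent:
# Both jobs fire at 00:00; on Sundays the daily and weekly snapshots are
# created at the same instant and both end up registered as 0B.
# (% must be escaped as \% inside a crontab.)
0 0 * * * /sbin/zfs snapshot data/jfrog-filestore@daily-$(date +\%Y-\%m-\%d)-0000
0 0 * * 0 /sbin/zfs snapshot data/jfrog-filestore@weekly-$(date +\%Y-\%m-\%d)-0000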