The problem

ZFS datasets are a powerful way to organize your filesystems. At first glance, datasets look a lot like filesystems, so you may default to just one or at most a handful per pool. But unlike traditional filesystems, where you have to decide how much disk space each one gets when it's created, ZFS datasets share the space available to the entire pool. Since datasets are the granularity at which ZFS operations like snapshots and `zfs send`/`recv` work, having more datasets gives you finer-grained control, such as applying different backup policies to different subsets of your data. And ZFS scales just fine to hundreds or thousands of datasets, so you don't really have to worry about creating too many.

But if you're me (well, not just me) and you realize this after you already have months of snapshots of a few terabytes of data, how do you reorganize your ZFS pool into more datasets without either losing the snapshot history or wasting a lot of disk space on redundant copies of data?
The solution

Before doing anything with real data, make backups and confirm you can restore from them.

I do not have a one-size-fits-all solution here; instead, I'll outline the general process. Continually review your work at each step to make sure things look correct, and be ready to `zfs rollback` and retry if you make a mistake or notice a more space-efficient way to do something.
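For example, if a copy step went wrong, you can throw away everything since the last good snapshot (a sketch, using the dataset and snapshot names from the steps below):

```sh
# Discard all changes to tank/new since @first; -r also destroys any
# snapshots taken after @first, so double-check what it will remove.
zfs rollback -r tank/new@first
```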
1. Create the new dataset hierarchy. I'll refer to the old dataset as `tank/old` and the new dataset root as `tank/new`.
2. Do an initial copy of the earliest snapshot you want to keep from the `.zfs` directory. If it's `@first`, then the copy command will be `rsync -avhxPHS /tank/old/.zfs/snapshot/first/ /tank/new/`.
3. Check your work and possibly delete or dedup files.
4. `zfs snapshot -r tank/new@first`
5. Do an incremental copy of the next snapshot. If it's `@second`, this may be as simple as `rsync -avhxPHS@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/`, but that will waste space if you have moved files or modified small sections of large files.
6. Check your work, and make any necessary changes.
7. `zfs snapshot -r tank/new@second`
8. Repeat steps 5-7 for each snapshot you want to keep.
9. `zfs rename tank/old tank/legacy && zfs rename tank/new tank/old`
The details

Space requirements

As this process involves making duplicates of every file in your dataset and its snapshots, you will need enough free space on your zpool to make a full copy of the dataset you want to split (or at least the parts you want to split out). If you have hardlinks between directories that you are splitting into multiple datasets, you may in fact need more space than that. (There is a feature request for allowing datasets to share data, which ZFS supports but which unfortunately does not work on Linux at the moment.)

You can use `zfs list -o space` to get the space used by a dataset and its snapshots and descendants.
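For example (a sketch; the names and numbers are made up, not from a real pool):

```sh
$ zfs list -o space tank/old
NAME      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/old  1.21T  3.47T     1.12T   2.35T             0B         0B
```

Here `USEDSNAP` is the space that would be freed by destroying all of the snapshots, and `USEDDS` is the space used by the live dataset itself.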
Creating datasets

When creating datasets, you generally want to specify options only at the root of your hierarchy. Options are inherited, so setting them once simplifies updating them later if you ever decide to change them, especially if you are creating a lot of datasets.

In particular, if you're using encryption, you probably want the root of your hierarchy to be the only `encryptionroot`. If you instead specify the encryption settings for every dataset, each one will be its own `encryptionroot`, meaning you'll have to load keys for each one individually. (You can also fix that after creating a dataset with `zfs change-key -i`, where `-i` stands for "inherit".) Additionally, consider not storing any data in the `encryptionroot` dataset itself, as that will complicate things if you decide to reorganize the datasets in the future.
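As a sketch, assuming a child dataset named `tank/new/photos` that accidentally ended up as its own `encryptionroot`:

```sh
# Load the child's key, then switch it to inheriting its parent's key
# so the whole hierarchy shares a single encryptionroot.
zfs load-key tank/new/photos
zfs change-key -i tank/new/photos
```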
For other options, you probably want `-o compression=on`, `-o xattr=sa`, and `-o atime=off`. Opinions are mixed, but `-o recordsize=1M` is probably the right choice for most workloads.
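Putting that together, a sketch of creating such a hierarchy (the dataset names are examples, and the encryption options assume a passphrase-protected root):

```sh
# Specify options once, on the root of the new hierarchy...
zfs create -o compression=on -o xattr=sa -o atime=off -o recordsize=1M \
    -o encryption=on -o keyformat=passphrase tank/new
# ...and let every child inherit them, including the encryptionroot.
zfs create tank/new/photos
zfs create tank/new/music
```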
Mount points

By default, ZFS datasets inherit their mount point from their parent based on their name, so generally you can just name datasets to mirror your directory hierarchy and everything just works. One catch is if you want to make a separate dataset deep in a directory hierarchy, like `tank/foo/a/b/c/bar`, but want `/tank/foo/a/b/` to be in the same dataset as `/tank/foo/`. You can:
- Create the dataset as `tank/foo/bar` but with `/tank/foo/a/b/c/bar` as its mountpoint. This is awkward if the parent dataset's mount point changes, although there's a feature request for relative mount points that would fix that.
- Create the dataset as `tank/foo/bar` and reorganize your files so they live in `/tank/foo/bar`, maybe symlinking the old location to the new location.
- Create the dataset as `tank/foo/a/b/c/bar` and, on the intermediate datasets, set the `canmount` option to `off`, as sketched below.
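A sketch of that last approach, using the example names above:

```sh
# The intermediate datasets exist only to form the name hierarchy;
# canmount=off keeps ZFS from mounting them over the real directories.
zfs create -o canmount=off tank/foo/a
zfs create -o canmount=off tank/foo/a/b
zfs create -o canmount=off tank/foo/a/b/c
zfs create tank/foo/a/b/c/bar   # mounts at /tank/foo/a/b/c/bar
```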
`rsync` flags

I recommended a lot of flags to `rsync`, so I'm going to go over all of them:

- `-v` (`--verbose`), `-h` (`--human-readable`), and `--progress` (implied by `-P`) are all about making it output more status in a more readable manner. You may also like `--info=progress2` if you prefer its progress output format.
- `-a` (`--archive`): implies a bunch of flags that are generally what you want for making the destination more or less identical to the source. Note the documentation points out that it omits a few things you may want, like creation times and ACLs, so you may want to add `-AXUN` to be more complete.
- `-x` (`--one-file-system`): don't recurse past mount points.
- `-P` (`--progress` and `--partial`): I mainly just wanted the short version of `--progress`, but `--partial` also allows resuming interrupted transfers.
- `-H` (`--hard-links`): preserve hard links.
- `-S` (`--sparse`): turn sequences of nulls into sparse blocks. This doesn't actually have much effect with ZFS compression enabled.
Additionally for transfers after the first

- `--delete`: delete files that aren't in the source, so any files that were deleted or moved will actually be removed.
- `-@-1` (`--modify-window=-1`): when comparing timestamps, treat any difference as different, even if it's under a second. As you're comparing a filesystem to a snapshot of itself, if the timestamps are identical, the files really will be exactly identical; this option is not the default because it causes trouble when transferring between filesystems with different timestamp granularities.
- `--inplace --no-whole-file`: in combination, these write only the changed parts of a modified file, which may save storage space on large files that are modified only in small parts (e.g. VM images). Note that this means `rsync` won't break hardlinks on those files, so it may result in the destination not matching the source. As a workaround, I used these options only for subdirectories where I expected them to be relevant and then redid the copy for the entire filesystem.
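An incremental pass might then look like this (the `vms/` subdirectory is hypothetical, standing in for wherever your large, partially-modified files live):

```sh
# First, copy only the changed blocks of the big in-place-modified files...
rsync -avhxPHS -@-1 --delete --inplace --no-whole-file \
    /tank/old/.zfs/snapshot/second/vms/ /tank/new/vms/
# ...then do the full pass over everything to true up the rest.
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
```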
Handling moves

Moving a file takes up nearly no space in a snapshot. But `rsync` doesn't know to associate the two files, so it will recopy the file from scratch, which could be significant for large files. If you know you moved an entire directory, consider just manually moving it in the destination. Alternatively, my zfs-diff-move.sh script will enumerate all of the moves between two snapshots and recreate them.

Although it seems like it should be possible to get enough information out of `zfs diff` to determine which files to change instead of having `rsync` inspect every file, I didn't work out how to do that in a way that I trust is actually doing the right thing. Instead, I handled the moves first and then ran `rsync`: I trust that after running `rsync` the destination will match the source, but I don't trust that any script I write would properly handle all of the edge cases.
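A sketch of that ordering (the zfs-diff-move.sh arguments are elided here; see the script itself for its actual interface):

```sh
# R lines in the diff output are renames between the two snapshots.
zfs diff tank/old@first tank/old@second
# Replay the moves in the new dataset (hypothetical invocation)...
./zfs-diff-move.sh ...
# ...then let rsync make the destination match the source exactly.
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
```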
Detailed steps

Putting it all together, the full process is:

1. Create the new dataset hierarchy. Note you will need enough free space on your zpool to make a full copy of the dataset you want to split (or at least the parts you want to split out), at least until this GitHub issue is resolved. `zfs list -o space` will tell you the space used by a dataset and its snapshots and descendants.
2. Do an initial copy of the earliest snapshot you want to keep. Your copy will be from the subdirectory of the `.zfs/snapshot` directory of your dataset corresponding to that snapshot. To copy all of the files and maintain hardlinks and sparse files where possible, use `rsync -avhxPHS`. You may also want to use `--exclude` if you don't want to keep all of the files for some reason.
3. Now is a good time to inspect your files to see if there's anything you want to change. In particular, make sure you aren't surprised by which files ended up in which dataset, and consider whether there may be duplicate files to deduplicate using a tool like `rdfind`, `fdupes`, or `jdupes`, or my hardlink-dups-by-name.sh script.
4. Run `zfs snapshot -r tank/new@first`. The `-r` option of `zfs snapshot` recursively creates snapshots on all of the descendant datasets.
5. Do an incremental copy of the next snapshot. If you believe there are moved files, my zfs-diff-move.sh script may be useful. If you believe there are large files with only a small section changed, the `rsync` options `--inplace --no-whole-file` will make it write only the changed blocks, but beware that they may not do what you want in the presence of hardlinks. Also make sure you add the `--delete` option to your `rsync` command.
6. Check your work, and make any necessary changes.
7. Create the new snapshot: `zfs snapshot -r tank/new@second`.
8. Repeat steps 5-7 for each snapshot you want to keep. Run `zfs list -t snapshot tank/old` to list all of the snapshots.
9. Once you're satisfied, swap around the dataset names and start using the new datasets: `zfs rename tank/old tank/legacy && zfs rename tank/new tank/old`.
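Condensed into a single transcript with the example names used throughout (two snapshots shown; repeat the middle pair of commands for each additional snapshot):

```sh
rsync -avhxPHS /tank/old/.zfs/snapshot/first/ /tank/new/
zfs snapshot -r tank/new@first
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
zfs snapshot -r tank/new@second
# ...one rsync + snapshot round per remaining snapshot, then:
zfs rename tank/old tank/legacy && zfs rename tank/new tank/old
```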