The problem

ZFS datasets are a powerful way to organize your filesystems. At first glance, datasets look a lot like filesystems, so you may default to just one or at most a handful per pool. But unlike traditional filesystems, where you have to decide how much disk space each one gets when it's created, ZFS datasets share the space available to the entire pool. Since datasets are the granularity at which ZFS operations like snapshots and `zfs send`/`recv` work, having more datasets gives you finer-grained control, such as applying different backup policies to different subsets of your data. And ZFS scales just fine to hundreds or thousands of datasets, so you don't really have to worry about creating too many.

But if you're me (well, not just me) and you realize this after you already have months of snapshots of a few terabytes of data, how do you reorganize your ZFS pool into more datasets without either losing the snapshot history or wasting a lot of disk space on redundant copies of data?
The solution

Before doing anything with real data, make backups and confirm you can restore from them.

I do not have a one-size-fits-all solution here; instead, I'll outline the general process. Continually review your work at each step to make sure things look correct, and be ready to `zfs rollback` and retry if you make a mistake or notice a more space-efficient way to do something.
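For example, if a copy step went wrong, you can throw away everything since the last good snapshot (a sketch, using the dataset and snapshot names from the steps below):

```sh
# Discard all changes to tank/new since @first; -r also destroys any
# snapshots taken after @first, so double-check what it will remove.
zfs rollback -r tank/new@first
```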
1. Create the new dataset hierarchy. I'll refer to the old dataset as `tank/old` and the new dataset root as `tank/new`.
2. Do an initial copy of the earliest snapshot you want to keep from the `.zfs` directory. If it's `@first`, then the copy command will be `rsync -avhxPHS /tank/old/.zfs/snapshot/first/ /tank/new/`.
3. Check your work and possibly delete or dedup files.
4. `zfs snapshot -r tank/new@first`
5. Do an incremental copy of the next snapshot. If it's `@second`, this may be as simple as `rsync -avhxPHS@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/`, but that will waste space if you have moved files or modified small sections of large files.
6. Check your work, and make any necessary changes.
7. `zfs snapshot -r tank/new@second`
8. Repeat steps 5-7 for each snapshot you want to keep.
9. `zfs rename tank/old tank/legacy && zfs rename tank/new tank/old`
The details

Space requirements

As this process involves making duplicates of every file in your dataset and its snapshots, you will need enough free space on your zpool to make a full copy of the dataset you want to split (or at least the parts you want to split out). If you have hardlinks between directories that you are splitting into multiple datasets, you may in fact need more space than that. (There is a feature request for allowing datasets to share data, which ZFS supports but which unfortunately does not work on Linux at the moment.)

You can use `zfs list -o space` to get the space used by a dataset and its snapshots and descendants.
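For example (a sketch; the names and numbers are made up, not from a real pool):

```sh
$ zfs list -o space tank/old
NAME      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/old  1.21T  3.47T     1.12T   2.35T             0B         0B
```

Here `USEDSNAP` is the space that would be freed by destroying all of the snapshots, and `USEDDS` is the space used by the live dataset itself.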
Creating datasets

When creating datasets, you generally want to specify options only at the root of your hierarchy. Options are inherited, so setting them once simplifies updating them later if you ever decide to change them, especially if you are creating a lot of datasets.

In particular, if you're using encryption, you probably want the root of your hierarchy to be the only `encryptionroot`. If you instead specify the encryption settings for every dataset, each one will be its own `encryptionroot`, meaning you'll have to load keys for each one individually. (You can also fix that after creating a dataset with `zfs change-key -i`, where `-i` stands for "inherit".) Additionally, consider not storing any data in the `encryptionroot` dataset itself, as that will complicate things if you decide to reorganize the datasets in the future.
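As a sketch, assuming a child dataset named `tank/new/photos` that accidentally ended up as its own `encryptionroot`:

```sh
# Load the child's key, then switch it to inheriting its parent's key
# so the whole hierarchy shares a single encryptionroot.
zfs load-key tank/new/photos
zfs change-key -i tank/new/photos
```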
For other options, you probably want `-o compression=on`, `-o xattr=sa`, and `-o atime=off`. Opinions are mixed, but `-o recordsize=1M` is probably the right choice for most workloads.
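Putting that together, a sketch of creating such a hierarchy (the dataset names are examples, and the encryption options assume a passphrase-protected root):

```sh
# Specify options once, on the root of the new hierarchy...
zfs create -o compression=on -o xattr=sa -o atime=off -o recordsize=1M \
    -o encryption=on -o keyformat=passphrase tank/new
# ...and let every child inherit them, including the encryptionroot.
zfs create tank/new/photos
zfs create tank/new/music
```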
Mount points

By default, ZFS datasets inherit their mount point from their parent based on their name, so generally you can just name datasets to mirror your directory hierarchy and everything just works. One catch is if you want to make a separate dataset deep in a directory hierarchy, like `tank/foo/a/b/c/bar`, but want `/tank/foo/a/b/` to be in the same dataset as `/tank/foo/`. You can:
- Create the dataset as `tank/foo/bar` but with `/tank/foo/a/b/c/bar` as its mountpoint. This is awkward if the parent dataset's mount point changes, although there's a feature request for relative mount points that would fix that.
- Create the dataset as `tank/foo/bar` and reorganize your files so they live in `/tank/foo/bar`, maybe symlinking the old location to the new location.
- Create the dataset as `tank/foo/a/b/c/bar` and, on the intermediate datasets, set the `canmount` option to `off`, as sketched below.
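A sketch of that last approach, using the example names above:

```sh
# The intermediate datasets exist only to form the name hierarchy;
# canmount=off keeps ZFS from mounting them over the real directories.
zfs create -o canmount=off tank/foo/a
zfs create -o canmount=off tank/foo/a/b
zfs create -o canmount=off tank/foo/a/b/c
zfs create tank/foo/a/b/c/bar   # mounts at /tank/foo/a/b/c/bar
```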
`rsync` flags

I recommended a lot of flags to `rsync`, so I'm going to go over all of them:

- `-v` (`--verbose`), `-h` (`--human-readable`), and `--progress` (implied by `-P`) are all about making it output more status in a more readable manner. You may also like `--info=progress2` if you prefer its progress output format.
- `-a` (`--archive`): implies a bunch of flags that are generally what you want for making the destination more or less identical to the source. Note the documentation points out that it omits a few things you may want, like creation times and ACLs, so you may want to add `-AXUN` to be more complete.
- `-x` (`--one-file-system`): don't recurse past mount points.
- `-P` (`--progress` and `--partial`): I mainly just wanted the short version of `--progress`, but `--partial` also allows resuming interrupted transfers.
- `-H` (`--hard-links`): preserve hard links.
- `-S` (`--sparse`): turn sequences of nulls into sparse blocks. This doesn't actually have much effect with ZFS compression enabled.
Additionally for transfers after the first

- `--delete`: delete files that aren't in the source, so any files that were deleted or moved will actually be removed.
- `-@-1` (`--modify-window=-1`): when comparing timestamps, treat any difference as different, even if it's under a second. As you're comparing a filesystem to a snapshot of itself, if the timestamps are identical, the files really will be exactly identical; this option is not the default because it causes trouble when transferring between filesystems with different timestamp granularities.
- `--inplace --no-whole-file`: in combination, these write only the changed parts of a modified file, which may save storage space on large files that are modified only in small parts (e.g. VM images). Note that this means `rsync` won't break hardlinks on those files, so it may result in the destination not matching the source. As a workaround, I used these options only for subdirectories where I expected them to be relevant and then redid the copy for the entire filesystem.
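An incremental pass might then look like this (the `vms/` subdirectory is hypothetical, standing in for wherever your large, partially-modified files live):

```sh
# First, copy only the changed blocks of the big in-place-modified files...
rsync -avhxPHS -@-1 --delete --inplace --no-whole-file \
    /tank/old/.zfs/snapshot/second/vms/ /tank/new/vms/
# ...then do the full pass over everything to true up the rest.
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
```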
Handling moves

Moving a file takes up nearly no space in a snapshot. But `rsync` doesn't know to associate the two files, so it will recopy the file from scratch, which could be significant for large files. If you know you moved an entire directory, consider just manually moving it in the destination. Alternatively, my zfs-diff-move.sh script will enumerate all of the moves between two snapshots and recreate them.

Although it seems like it should be possible to get enough information out of `zfs diff` to determine which files to change instead of having `rsync` inspect every file, I didn't work out how to do that in a way that I trust is actually doing the right thing. Instead, I handled the moves first and then ran `rsync`: I trust that after running `rsync` the destination will match the source, but I don't trust that any script I write would properly handle all of the edge cases.
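A sketch of that ordering (the zfs-diff-move.sh arguments are elided here; see the script itself for its actual interface):

```sh
# R lines in the diff output are renames between the two snapshots.
zfs diff tank/old@first tank/old@second
# Replay the moves in the new dataset (hypothetical invocation)...
./zfs-diff-move.sh ...
# ...then let rsync make the destination match the source exactly.
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
```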
Detailed steps

Putting it all together, the full process is:

1. Create the new dataset hierarchy. Note you will need enough free space on your zpool to make a full copy of the dataset you want to split (or at least the parts you want to split out), at least until this GitHub issue is resolved. `zfs list -o space` will tell you the space used by a dataset and its snapshots and descendants.
2. Do an initial copy of the earliest snapshot you want to keep. Your copy will be from the subdirectory of the `.zfs/snapshot` directory of your dataset corresponding to that snapshot. To copy all of the files and maintain hardlinks and sparse files where possible, use `rsync -avhxPHS`. You may also want to use `--exclude` if you don't want to keep all of the files for some reason.
3. Now is a good time to inspect your files to see if there's anything you want to change. In particular, make sure you aren't surprised by which files ended up in which dataset, and consider whether there may be duplicate files to deduplicate using a tool like `rdfind`, `fdupes`, or `jdupes`, or my hardlink-dups-by-name.sh script.
4. Run `zfs snapshot -r tank/new@first`. The `-r` option of `zfs snapshot` recursively creates snapshots on all of the descendant datasets.
5. Do an incremental copy of the next snapshot. If you believe there are moved files, my zfs-diff-move.sh script may be useful. If you believe there are large files with only a small section changed, the `rsync` options `--inplace --no-whole-file` will make it write only the changed blocks, but beware that they may not do what you want in the presence of hardlinks. Also make sure you add the `--delete` option to your `rsync` command.
6. Check your work, and make any necessary changes.
7. Create the new snapshot: `zfs snapshot -r tank/new@second`.
8. Repeat steps 5-7 for each snapshot you want to keep. Run `zfs list -t snapshot tank/old` to list all of the snapshots.
9. Once you're satisfied, swap around the dataset names and start using the new datasets: `zfs rename tank/old tank/legacy && zfs rename tank/new tank/old`.
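Condensed into a single transcript with the example names used throughout (two snapshots shown; repeat the middle pair of commands for each additional snapshot):

```sh
rsync -avhxPHS /tank/old/.zfs/snapshot/first/ /tank/new/
zfs snapshot -r tank/new@first
rsync -avhxPHS -@-1 --delete /tank/old/.zfs/snapshot/second/ /tank/new/
zfs snapshot -r tank/new@second
# ...one rsync + snapshot round per remaining snapshot, then:
zfs rename tank/old tank/legacy && zfs rename tank/new tank/old
```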