Snapshots are one of the most powerful features of ZFS. A snapshot provides a read-only, point-in-time copy of the dataset. With Copy-On-Write (COW), ZFS creates snapshots fast by preserving older versions of the data on disk… Snapshots preserve disk space by recording just the differences between the current dataset and a previous version… [and] use no extra space when first created, but consume space as the blocks they reference change.
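As a hedged illustration (the pool and dataset names here are hypothetical), creating a snapshot is a single, near-instant command, and listing snapshots shows their space usage growing only as the live dataset diverges from them:
zfs snapshot tank/data@before-cleanup       # instant; consumes no space at creation
zfs list -t snapshot -o name,used,refer     # USED grows as tank/data's blocks change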
But ZFS also comes with an uncomfortable truth that doesn't get talked about enough: the filesystem is only as good as the operating system wrapping it. And if you're running ZFS on a generic Linux distribution, you're often signing up for more risk, maintenance, and subtle breakage than you expect. ZFS works on Linux, and many use it daily, but it's not a seamless, built-in part of the kernel. Instead, it's an add-on with caveats, and setting it up can feel frustratingly difficult. //
The problem with ZFS is Oracle
Licensing is a major issue
The Linux kernel's GPLv2 license is legally incompatible with ZFS's CDDL license, so ZFS cannot be shipped as a built-in part of the kernel. Oracle's licensing is the major bottleneck.
It would be very nice not to have to do two consecutive resilvers - one for each failing drive. Luckily, ZFS allows you to amortize a single resilver operation over multiple drives.
The workflow is almost identical - you begin by doing a hot-spare resilver of the first drive:
zpool replace POOL da13 da99
... but then, after that command completes and you have verified that the resilver has properly begun (by running 'zpool status'), you simply run a second 'zpool replace' command with the other failing/spare drive pair:
zpool replace POOL da15 da100
Your 'zpool status' output will then show two drives resilvering with two different hot-spares and your time to completion will not increase much as compared to when you were only resilvering one drive.
Append-Only backups with rclone serve restic --stdio
ZFS vdev rebalancing
borg mount example
It's all very well to say 'bookmarks mark the point in time when [a] snapshot was created', but how does that actually work, and how does it allow you to use them for incremental ZFS send streams?
The succinct version is that a bookmark is basically a transaction group (txg) number. In ZFS, everything is created as part of a transaction group and gets tagged with the TXG of when it was created. Since things in ZFS are also immutable once written, we know that an object created in a given TXG can't have anything under it that was created in a more recent TXG (although it may well point to things created in older transaction groups). If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, and all of those new versions will get a new birth TXG.
This means that given a TXG, it's reasonably efficient to walk down an entire ZFS filesystem (or snapshot) to find everything that was changed since that TXG. When you hit an object with a birth TXG before (or at) your target TXG, you know that you don't have to visit the object's children because they can't have been changed more recently than the object itself. If you bundle up all of the changed objects that you find in a suitable order, you have an incremental send stream. Many of the changed objects you're sending will contain references to older unchanged objects that you're not sending, but if your target has your starting TXG, you know it has all of those unchanged objects already. //
Bookmarks specifically don't preserve the original versions of things; that's why they take no space. Snapshots do preserve the original versions, but they take up space to do that. We can't get something for nothing here.
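A hedged sketch of how that plays out in practice (pool and dataset names are hypothetical): a bookmark records the snapshot's birth TXG, so the snapshot itself can be destroyed and later used as the 'from' side of an incremental send, provided the target already received it:
zfs snapshot tank/data@snap1
zfs bookmark tank/data@snap1 tank/data#mark1   # keeps only the creation TXG, no data
zfs destroy tank/data@snap1                    # reclaim the snapshot's space
# later, after more changes:
zfs snapshot tank/data@snap2
zfs send -i tank/data#mark1 tank/data@snap2 | zfs receive backup/data
# (the receiving side must already have @snap1 for this incremental to apply)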
RAID type - Supported RAID levels are:
Mirror (two-way mirror - RAID1 / RAID10 equivalent);
RAID-Z1 (single parity with variable stripe width);
RAID-Z2 (double parity with variable stripe width);
RAID-Z3 (triple parity with variable stripe width).
Drive capacity - we expect this number to be in gigabytes (powers of 10), in line with the way disk capacity is marked by the manufacturers. This number will be converted to tebibytes (powers of 2); see the worked example after this list. The results will be presented in both tebibytes (TiB) and terabytes (TB). Note: 1 TB = 1000 GB = 1000000000000 B and 1 TiB = 1024 GiB = 1099511627776 B
Single drive cost - monetary cost/price of a single drive; used to calculate the Total cost and the Cost per TiB. The parameter is optional and has no impact on capacity calculations.
Number of RAID groups - the number of top-level vdevs in the pool.
Number of drives per RAID group - the number of drives per vdev.
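A worked example of the capacity conversion described above (the 4000 GB figure is just an illustrative drive size):
# 4000 GB = 4 x 10^12 bytes; divide by 2^40 bytes per TiB
awk 'BEGIN { gb = 4000; printf "%.2f TiB\n", gb * 1e9 / (1024 ^ 4) }'   # prints 3.64 TiB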
The problem was that the new motherboard's BIOS created a host protected area (HPA) on some of the drives, a small section used by OEMs for system recovery purposes, usually located at the end of the hard drive.
ZFS maintains 4 labels with partition metadata, and the HPA prevents ZFS from seeing the upper two.
Solution: Boot Linux, use hdparm to inspect and remove the HPA. Be very careful, this can easily destroy your data for good. Consult the article and the hdparm man page (parameter -N) for details.
The problem did not only occur with the new motherboard, I had a similar issue when connecting the drives to an SAS controller card. The solution is the same.
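A hedged sketch of what that inspection looks like (the device name is a placeholder, and resizing the visible area is destructive if you pick the wrong value, so treat the second command as illustration only):
hdparm -N /dev/sdX   # reports the current and native max sector counts and whether an HPA is set
# removing the HPA means setting the visible maximum back to the native maximum, e.g.:
# hdparm -N p<native-max-sectors> /dev/sdX   # the leading "p" makes the change permanent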
I was able to access the pool a couple of weeks ago. Since then, I had to replace pretty much all of the hardware of the host machine and install several host operating systems.
My suspicion is that one of these OS installations wrote a bootloader (or whatever) to one (the first ?) of the 500GB drives and destroyed some zpool metadata (or whatever) - 'or whatever' meaning that this is just a very vague idea, and that this subject is not exactly my strong suit... //
I think I have found the root cause: Max Bruning was kind enough to respond to an email of mine very quickly, asking for the output of zdb -lll. On any of the 4 hard drives in the 'good' raidz1 half of the pool, the output is similar to what I posted above. However, on the first 3 of the 4 drives in the 'broken' half, zdb reports 'failed to unpack label' for labels 2 and 3. The fourth drive in that half seems OK; zdb shows all labels. //
This did take a while indeed. I've spent months with several open computer cases on my desk, with varying numbers of hard drive stacks hanging out, and I also slept a few nights with earplugs in, because I could not shut down the machine before going to bed while it was running some lengthy critical operation. However, I prevailed at last! :-) I've also learned a lot in the process and would like to share that knowledge here for anyone in a similar situation.
This article is already much longer than anyone with a ZFS file server out of action has the time to read, so I will go into details here and create an answer with the essential findings further below. //
Finally, I mirrored the problematic drives to backup drives, used those for the zpool, and left the original ones disconnected. The backup drives have newer firmware; at least, SeaTools does not report any required firmware updates. I did the mirroring with a simple dd from one device to the other, e.g.
sudo dd if=/dev/sda of=/dev/sde
I believe ZFS does notice the hardware change (by some hard drive UUID or whatever), but doesn't seem to care. //
As a last word, it seems to me ZFS pools are very, very hard to kill. The folks at Sun who created this system have every reason to call it the last word in filesystems. Respect!
MMM4
New vs old generations
This is the nicest and shortest description of how "new generation" filesystems differ from older ones. It's focused on ZFS but not really specific to it. Bookmark that page because I found it un-googlable for some unknown reason.
https://illumos.org/books/zfs-admin/zfsover-1.html#zfsover-2
ZFS eliminates the volume management altogether. Instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool....
...
ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the machine loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state....
...
With a transactional file system, data is managed using copy on write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. This mechanism means that the file system can never be corrupted through accidental loss of power or a system crash. So, no need for a fsck equivalent exists.
Liam Proven (Written by Reg staff)
Re: Justice for bcachefs!
Anyone want to educate me on what bcachefs brings to the party that, say, ext4 doesn't?
I have gone into this at some length before. For instance, here:
https://www.theregister.com/2022/03/18/bcachefs/
... which is linked from the article you are commenting upon.
ext2/3/4 only handle one partition on one disk at a time.
On top of that, for partitioning you need another tool, one that writes an MBR or GPT partition table. But you can do without, in some situations.
For RAID, you need another tool, e.g. kernel mdraid.
(Example of the intersection of partitioning and RAID: it is normal to make a new device with mdraid and then format that new device directly with ext4, not partitioning it first.)
Want resizable volumes, which might span multiple disks? You need another tool, LVM2.
But don't try to manage mdraid volumes with LVM2, or LVM2 with mdraid. Doesn't work.
Want encryption? You need another tool, such as LUKS. There are several.
Watch out if you use hardware RAID or hardware encryption. The existing tools won't see it or handle it.
It is complicated. There is lots of room for error.
So, ZFS fixed that. It does the partitioning part, and the RAID part, and the encryption part, and the resizing part, and also the mounting part, all in one.
It's great, it's easier and it's faster and you can nominate a fast disk to act as a cache for a bigger array of slower disks...
And it can take snapshots. While it is running. Take an image of your whole OS in a millisecond and then keep running and all the changes go somewhere new. So you can do an entire distribution upgrade, realise one critical tool doesn't work on the new version, and undo the entire thing, and go back to where you were...
While keeping all your data and all your files intact.
All while the OS is running.
And it does it all in one tool.
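For the snapshot-and-undo workflow described above, a minimal hedged sketch (the dataset name is hypothetical; on a real root-on-ZFS system you would more likely use boot environments, and each dataset rolls back individually):
zfs snapshot rpool/ROOT/default@pre-upgrade   # instant snapshot of the running OS dataset
# ...run the distribution upgrade, discover the broken tool...
zfs rollback rpool/ROOT/default@pre-upgrade   # return the dataset to the pre-upgrade state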
But it's not GPL so it can't be built into the Linux kernel.
You can load it as a module and that's fine but its cache remains separate from the Linux cache, so it uses twice the memory, maybe more.
So, there are other GPL tools that replicate some of this.
Btrfs does some of it. But Btrfs overlaps with, and does not interoperate with, LVM and with mdraid and with LUKS... and it collapses if the disk fills up... and it's easy to fill up because its "how much free space do I have?" command is broken and lies... and when it corrupts, you can't fix it.
It is, in short, crap, but you can't say that because it is rude, and, this being the way of Linux, it has passionate defenders who complain they are being attacked if you mention problems.
Bcachefs is an attempt to fix this with an all-GPL tool, designed for Linux, which does all the nice stuff ZFS does but integrates better with the Linux kernel. It does not just replace ext4, it will let you replace ext4 and LVM2 and LUKS and mdraid, all in one tool.
It will do everything Btrfs does but not collapse in a heap if the volume fills up. And if it does have problems, you can fix it.
All this is good. All this is needed. We know it's doable because it already exists in a tool from Solaris in a form that FreeBSD can use but Linux can't.
But in a mean-spirited and unfair summary, Kent Overstreet is young and smart and cocky and wants to deliver something better for Linux and Linux users, and the old guard hate that and they hate him. They hate that this smart punk kid has shown up the problems with their tools they've been working on for 20-30 years.
eldakka
Re: Justice for bcachefs!
Not properly: it doesn't re-stripe the existing data the way mdadm or btrfs do, it just evens out the disk usage.
A 3-disk raid5 expanded to 5 disks will keep the same 50% parity overhead for its existing data.
And that can be solved by a simple mv and copy-back of the file, e.g.
mv "$i" "$i.tmp" && cp -p "$i.tmp" "$i" && rm "$i.tmp"
Stick that (or your own preference, using rsync for example) in a simple script/find command to recurse it (with appropriate checks/tests etc.), and that'll make the 'old' data stripe 'properly' across the full RAID width.
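A hedged sketch of such a wrapper (the path is hypothetical, and there are deliberately no checks here, so snapshot first and don't point it at data you can't afford to lose):
# rewrite every file in place so its blocks are reallocated across the full vdev width
find /tank/data -type f -print0 | while IFS= read -r -d '' i; do
    mv "$i" "$i.tmp" && cp -p "$i.tmp" "$i" && rm "$i.tmp"
done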
eldakka
Re: Justice for bcachefs!
is as much a "simple solution" and so divorced from the behaviour we'd get if ZFS did the re-striping itself* that you may as well say we don't need ZFS to do snapshots for us, we could write our own simple script to, ooh, create a new overlay/passthrough file system, change all the mount points, halt all processes with writable file handles open... (yes, yes, I'm being hyperbolic).
I never said it shouldn't be something ZFS does transparently. I never said it would be a bad idea or unnecessary thing for ZFS to support.
I was merely pointing out that it is a fairly simple thing to work around such that maybe the unpaid ZFS devs feel they have more important things to work on for now. I mean, it's taken the best part of 20 years to even get the ability to expand a RAIDZ vdev at all.
I'll also say that if anyone actually cares about the filesystem they are using, making conscious decisions to choose a filesystem like ZFS or whatever, then they are not a typical average user. Typical average users don't create ZFS arrays of multiple disks in various raidz/mirror volumes and then grow them. That is not the use-case of an average user.
Later (below) you say "production-ready": why are you messing around with growing raidz vdevs and wanting to re-stripe them to distribute across the array? That is a hobbyist/homelab-type situation. If you are using ZFS in a production environment - that is, revenue/income is tied to it - then the answer is to create a new raidz and migrate (zfs send/receive) the data to it. No messing about with growing raidz vdevs and re-striping the data; that's just totally unnecessary.
e.g. 'beneath' the user file access level with no possibility of access control issues,
If you run the mv and cp as root, then there will be no access control issues; cp -p (as root) will preserve file permissions and FACLs.
not risking problems when changing your simplistic commands into production-ready "appropriate check/tests etc" like status reports, running automatically, maybe even backing off when there is a momentary load increase so the whole server isn't bogged down as the recursive cp
If your system gets bogged down from doing a single file copy, then I think you have a system problem.
chews the terabytes,
Why would it chew terabytes? Unless you have TB-sized files, it won't. Recursive doesn't mean what I think you think it means. It does not mean "in parallel". The example I gave works on a single file at a time, in a serial process, and will not move on to the next file until the current file is complete (technically it won't move on at all by itself; it's the inner part of a loop you'd need to feed a file list to). Therefore no extra space is needed beyond the size of the file currently being worked on.
not risking losing track when your telnet into the server shell dies
Why would that do anything? At worst you'll have a single $i.tmp file that you might have to manually do the cp back to the original ($i) name. There will be no data loss (and especially not if you snapshot it first). And even if you 'lose track', just start again, no biggie, will just take longer as you're redoing some of the already done work.
And as I said, you can use things like rsync instead, which would give you the ability to 'keep track'. The command I pasted was just the simplest one to give an idea of what is needed; just making a new copy of the file will re-stripe it across the full raidz. Or, if you have your pool split up into many smaller filesystems rather than a single one for the entire pool, you can zfs send/receive a filesystem to a new filesystem in the same pool, then use "zfs set mountpoint=<oldmountpoint>" to give the new filesystem the same mountpoint as the old one, then delete the old one.
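A hedged sketch of that intra-pool variant (pool, dataset, and mountpoint names are hypothetical):
zfs snapshot tank/data@rebalance
zfs send tank/data@rebalance | zfs receive -u tank/data-new   # -u: don't mount it yet
zfs set mountpoint=none tank/data        # retire the old filesystem's mountpoint
zfs set mountpoint=/data tank/data-new   # the new copy takes over the old mountpoint
# once you're satisfied everything is in place:
zfs destroy -r tank/data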
(not risking a brainfart and doing all that copying over the LAN and back again!) - and simply being accessible to Joe Bloggs ZFS user who just would like it all to work, please.
I agree, it would be. But it doesn't. I'm pointing out that there is a solution to the issue the poster I am replying to mentioned. It is annoying to have to do (I've done it when I changed the recordsize of my filesystems), but it can be done, and it's not particularly difficult.
If someone is going to choose something like ZFS, I'd expect them to be able to do internet searches on topics like this and get help from technical forums or various guides that people have written to cover this sort of use-case. There are guides and instructions on how to do this sort of thing.
One of the questions that comes up time and time again about ZFS is “how can I migrate my data to a pool on a few of my disks, then add the rest of the disks afterward?”
If you just want to get the data moved and don’t care about balance, you can just copy the data over, then add the new disks and be done with it. But, it won’t be distributed evenly over the vdevs in your pool.
Don’t fret, though, it’s actually pretty easy to rebalance mirrors. In the following example, we’ll assume you’ve got four disks in a RAID array on an old machine, and two disks available to copy the data to in the short term.
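As a hedged sketch of that short-term shuffle (device names are hypothetical, and this is the general idea rather than the article's exact procedure): build a small mirror pool from the two spare disks, copy the data over, then add the freed disks as a second mirror vdev:
zpool create tank mirror /dev/sdc /dev/sdd   # start with a single two-way mirror vdev
# ...copy the data from the old array onto tank, then free up the old disks...
zpool add tank mirror /dev/sde /dev/sdf      # grow the pool with a second mirror vdev
Existing data stays where it was written, on the first vdev, which is exactly the imbalance the rest of the walkthrough deals with.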
These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage.
A pool is a collection of vdevs. Vdevs can be any of the following (and more, but we’re keeping this relatively simple):
single disks (think RAID0)
redundant vdevs (aka mirrors – think RAID1)
parity vdevs (aka stripes – think RAID5/RAID6/RAID7, aka single, dual, and triple parity stripes)
The pool itself will distribute writes among the vdevs inside it on a relatively even basis. //
striped (RAIDZ) vdevs aren’t supposed to be “as big as you can possibly make them.” Experts are cagey about actually giving concrete recommendations about stripe width (the number of devices in a striped vdev), but they invariably recommend making them “not too wide.” If you consider yourself an expert, make your own expert decision about this. If you don’t consider yourself an expert, and you want more concrete general rule-of-thumb advice: no more than eight disks per vdev. //
According to Dell, “Raid 5 for all business critical data on any drive type [is] no longer best practice.”
RAIDZ2 and RAIDZ3 try to address this nightmare scenario by expanding to dual and triple parity, respectively. This means that a RAIDZ2 vdev can survive two drive failures, and a RAIDZ3 vdev can survive three. Problem solved, right? Well, problem mitigated – but the degraded performance and resilver time is even worse than a RAIDZ1, because the parity calculations are considerably gnarlier. And it gets worse the wider your stripe (number of disks in the vdev). //
When a disk fails in a mirror vdev, your pool is minimally impacted – nothing needs to be rebuilt from parity, you just have one less device to distribute reads from. When you replace and resilver a disk in a mirror vdev, your pool is again minimally impacted – you’re doing simple reads from the remaining member of the vdev, and simple writes to the new member of the vdev. In no case are you re-writing entire stripes, all other vdevs in the pool are completely unaffected, etc. Mirror vdev resilvering goes really quickly, with very little impact on the performance of the pool. Resilience to multiple failure is very strong, though requires some calculation – your chance of surviving a disk failure is 1-(f/(n-f)), where f is the number of disks already failed, and n is the number of disks in the full pool. In an eight disk pool, this means 100% survival of the first disk failure, 85.7% survival of a second disk failure, 66.7% survival of a third disk failure. This assumes two disk vdevs, of course – three disk mirrors are even more resilient.
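A quick hedged check of that 1-(f/(n-f)) arithmetic for an eight-disk pool of two-way mirrors:
awk 'BEGIN { n = 8; for (f = 0; f <= 2; f++) printf "failure #%d: %.1f%% chance of surviving\n", f + 1, 100 * (1 - f / (n - f)) }'
# prints 100.0%, 85.7%, and 66.7%, matching the figures above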
But wait, why would I want to trade guaranteed two disk failure in RAIDZ2 with only 85.7% survival of two disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it’s no more heavily loaded than it would’ve been as a RAIDZ member. //
Too many words, mister sysadmin. What’s all this boil down to?
- don’t be greedy. 50% storage efficiency is plenty.
- for a given number of disks, a pool of mirrors will significantly outperform a RAIDZ stripe.
- a degraded pool of mirrors will severely outperform a degraded RAIDZ stripe.
- a degraded pool of mirrors will rebuild tremendously faster than a degraded RAIDZ stripe.
- a pool of mirrors is easier to manage, maintain, live with, and upgrade than a RAIDZ stripe.
- BACK. UP. YOUR POOL. REGULARLY. TAKE THIS SERIOUSLY.
I'm using ZREP to replicate two servers with each other. Each server contains one ZFS pool with two datasets acting as replication masters and two datasets acting as replication targets. The master datasets contain the system and the VirtualBox VMs of the local server; the replication targets contain the same from the other server. //
The problem.
But from time to time ZREP got into a state where it could not sync anymore. To resolve that issue, a coworker told me he had to delete snapshots and go through the process of initialising ZREP all over again. In the end, the problem was fixed by no longer letting ZREP run in parallel with rsync and our own snapshots.
ANSWER
Yes, you still get all the data in between; you just can't rewind to an in-between point.
If you have snapshots 1, 2 and 3 and the remote pool only has snapshot 1, you can send it snapshot 3 and skip 2. It just won't be able to roll back to the '2' state, but the data will still be there.
The snapshots describe what was there at the time. So missing snapshot '2' on the remote pool, it's like you never took one at that point in time. It literally doesn't know about the '2' snapshot and what stuff looked like back then.
If you change your mind, you'll need to delete snapshot '3' on the remote pool and only then can you send '2', then '3' again.
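A hedged sketch of both directions (pool, dataset, and host names are hypothetical): sending 1 straight to 3, versus backing out 3 to insert 2:
# skip snapshot 2 entirely: incremental from 1 straight to 3
zfs send -i tank/data@1 tank/data@3 | ssh backuphost zfs receive backup/data
# change of heart: drop 3 on the remote side, then send 1->2 and 2->3
ssh backuphost zfs destroy backup/data@3
zfs send -i tank/data@1 tank/data@2 | ssh backuphost zfs receive backup/data
zfs send -i tank/data@2 tank/data@3 | ssh backuphost zfs receive backup/data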
How to Create ZFS File Systems
Something broke. The VPS would not boot:
ZFS: out of temporary buffer space
So this sounds like a missing step in the automated upgrade flow. Normally, using new zpool features is deferred until you choose to upgrade the pool after the reboot, so you get to see the warnings. At a guess (because I'm still on FreeBSD 11.4), the OpenZFS migration forces the zpool upgrade early, and they missed the gpart requirement. //
boot from the current rescue disk
bring ifaces up
scp a current/13 zfsbootcode file
install that
//
gpart bootcode -p /root/Downloads/gptzfsboot -i <gpart index of freebsd-boot> <block device>
With that, just use the correct path to the gptzfsboot file you downloaded
or just dd if=/root/Downloads/gptzfsboot of=/dev/vtbd0p1 if you are brave
zfs: out of temporary buffer space
system: FreeBSD v13.0-p7
reason: the bootloader is broken (e.g. after update)
solution: reinstall the bootloader(s) to your boot disk(s)
- Boot from a recent FreeBSD image
- Find out the device names and boot partition number of your boot disks:
gpart show
(the partition named “freebsd-boot” is the boot partition on every disk)
- Reinstall the pMBR and GPT ZFS bootloader (for every booting disk):
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i <boot-partition-number> <devicename>
(e.g. gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0)
(e.g. gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1)
- Reboot
Welcome to the ZFS Handbook, your definitive guide to mastering the ZFS file system on FreeBSD and Linux. Discover how ZFS can revolutionize your data storage with unmatched reliability, scalability, and advanced features.
It’s not exactly difficult to figure out how much space you’ve got left when you’re using OpenZFS–but it is different from doing so on traditional filesystems, as OpenZFS brings considerably more complexity to the table. Space accounting in OpenZFS requires a different approach due to factors like snapshots, compression, and deduplication.
By the time we’re done today, we’ll understand:
- how to use filesystem-agnostic tools like du and df
- how to use OpenZFS-native tools like zfs list and zpool list.
OpenZFS brings new concepts to filesystem management that muddy this simple picture a bit: snapshots, inline compression, and block-level deduplication. To effectively manage our OpenZFS filesystem, we’ll need to begin by understanding three properties: USED, REFER, and AVAIL.
All three properties revolve around the status of logical sectors, not physical sectors.
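As a hedged preview of the native view (the pool name is hypothetical), those three properties are exactly what zfs list reports for the pool and every dataset under it:
zfs list -r -o name,used,avail,refer tank   # USED, AVAIL and REFER for each dataset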