31 Comments
- LastWord, on 10/12/2007, -5/+23Simple: Don't add removable drives to the pool.
- TheSolomon, on 10/12/2007, -1/+8You were marked down, but I think it's a valid comment. Pooled storage only works when you can depend on the disks in the pool. Adding disks is fine, but once they have been added, the pool assumes they aren't going anywhere. This is a different concept than, say, flexible/dynamically-allocated storage, which is wholly incompatible with storage pools. If you want to join multiple drives together in a pool, you can't remove members of the pool at will. And if you want to use a USB or Firewire drive, and then take it with you, you can't include it within a storage pool. This sounds perfectly reasonable to me.
- epileet, on 10/12/2007, -0/+5@kunalthakar
I agree, but hopefully with Java being GPL'd we will see ZFS follow in the future.
- qmeister, on 10/12/2007, -0/+5I'm confused as to why people are knocking your comment. Are they offended by something? Seems to me to be a simple response...
- kunalthakar, on 10/12/2007, -3/+5Hmm... ZFS is seriously good. If only the GPL didn't have problems with the CDDL license, ZFS would already be in the Linux kernel. I somehow feel that Sun doesn't get the recognition it deserves for releasing standard-setting products like ZFS, NFS, etc.
- volsung, on 10/12/2007, -0/+2There are rumors that Sun will GPL OpenSolaris, much as they have GPLed Java. That will also include the ZFS code, which will solve that problem as well. For now, the guy porting ZFS for Linux has to make it a userspace driver (using FUSE) to avoid license clashes with the kernel.
- CheeseheadDave, on 10/12/2007, -0/+2Will ZFS end up being a replacement for HFS+, or will it be an available option like UFS currently is, but with warnings saying "Don't select this unless you really need it and know what you're doing!"?
- mohaine, on 10/12/2007, -0/+2Oops, I spoke too soon. It looks like ZFS only allows for removal of mirrored drives using the remove, detach and replace zpool options. If there isn't enough space available to remove the disk, it will refuse to remove it. See here:
http://docs.sun.com/app/docs/doc/819-2240/6n4htdnq0?a=view
There is a really nice demo video here:
http://www.opensolaris.org/os/community/zfs/demos/basics/
I hope the Linux/FUSE version of ZFS gets up to production quality soon. LVM is nice, but ZFS is damn slick.
- eridius, on 10/12/2007, -2/+4The one question I have yet to see an answer to is what happens when I remove a hard drive? One of the features of ZFS is that the filesystems grow to fill the storage pool, but when I add removable drives to the mix, what happens?
- CompIsMyRx, on 10/12/2007, -0/+1ZFS sounds like the perfect replacement to Ext3 and XFS. As soon as this is released to Linux users, I might think about implementing it.
- volsung, on 10/12/2007, -0/+1@gavintlgold
No, checksumming will require some extra CPU time (it has to), but the question is how much more and whether the extra verification it provides is worth the time required.
For the curious, ZFS supports two versions of Fletcher's checksum, as well as SHA256. The default is Fletcher's checksum, which is vastly faster than MD5 since Fletcher is not intended to be cryptographically secure:
http://en.wikipedia.org/wiki/Fletcher's_checksum
- DnasTheGreat, on 10/12/2007, -0/+1"Filesystems are nested and making them is as easy as making a directory."
Hmm, am I misunderstanding this one? I was under the impression that most UNIXes supported mounting anyway, so most UNIX users have had this for quite a while, including OS X... and isn't ease of creation more a matter of the frontend, not the filesystem... since making any filesystem is a mkfs.[blah] away anyway... the matter of having to actually find hard disk space and stuff could be solved by LVM.
Or do they mean that the filesystem itself allows you to make sub-filesystems for setting different attributes per folder and stuff? In which case, I would expect something akin to UNIX-style permissions would be easier and more effective.
- sstidman, on 10/12/2007, -0/+1Yes, you can remove a disk from a pool by using the "zpool detach" command. If a disk was used in a RAID-Z file system, you will not be able to remove it currently. This is something Sun is working to fix. From the ZFS FAQ (http://www.sun.com/software/solaris/faqs/zfs.xml):
================================================================================
Q: Can I remove a disk from my storage pool?
A:
In this first release of ZFS, you are limited to removing disks only from mirrors where there is redundancy. For example, you can remove a disk from a 5-way mirror to make it a 4-way mirror, but you cannot remove the last remaining disk in a mirror.
You cannot remove a disk from a RAID-Z stripe, nor remove a disk that is by itself. (That is, not part of a mirror or RAID-Z stripe.) The zpool detach command is used to remove disks. In effect, you cannot reduce the capacity of a storage pool, only its redundancy.
This restriction will be removed in a future release of ZFS, which will support evacuating data from a disk to enable it to be removed.
Note that to replace a disk with another disk is fully supported. For more information, see the zpool replace description in the /zpool.1m/ man page.
================================================================================
For more info, go to the Solaris ZFS Administration Guide:
http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qrt?a=view
From the Admin Guide:
================================================================================
You can use the zpool detach command to detach a device from a pool. For example:
+-------------------------------------------------------------------------------
| # zpool detach zeepool c2t1d0
+-------------------------------------------------------------------------------
However, this operation is refused if there are no other valid replicas of the data. For example:
+-------------------------------------------------------------------------------
| # zpool detach newpool c1t2d0
| cannot detach c1t2d0: only applicable to mirror and replacing vdevs
+-------------------------------------------------------------------------------
================================================================================
- mohaine, on 10/12/2007, -0/+1It looks like removing a disk is as simple as the following command:
zpool remove pool vdev
Now, what happens if there isn't enough space to hold the data without the removed device, I don't know.
- volsung, on 10/12/2007, -0/+1Nope, you understood that correctly. ZFS views filesystems in an unusual way. A filesystem is very lightweight in ZFS and variable in size. Perhaps the best way to think about a filesystem in ZFS is that it is a "directory with special attributes," such as compression, quotas, reserved sizes, snapshots, etc.
- volsung, on 10/12/2007, -0/+1Yes, but that is because there is no one-to-one mapping from partitions to filesystems, like you are used to. In ZFS, there are storage pools and filesystems. A storage pool is composed of one or more devices. These are usually whole disks, but can also be partitions or plain files. You can add new devices to a storage pool whenever you want. Then there are filesystems, which are much like fancy directories, that take blocks from the storage pool when they need them and return them to the pool when they are no longer used. By default, filesystems can grow to the limits of the available space in the storage pool.
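As a rough illustration of the model described above, here is a toy Python sketch. It is not ZFS code: the names `Pool`, `Filesystem`, `add_device`, `write`, and `delete` are made up for this example, and "blocks" are just counters. It only shows two filesystems drawing blocks from one shared pool and handing them back.

```python
class Pool:
    """Toy storage pool: a count of free blocks shared by all filesystems."""

    def __init__(self):
        self.free_blocks = 0

    def add_device(self, blocks):
        # Stands in for growing the pool with a new disk (hypothetical helper).
        self.free_blocks += blocks

    def allocate(self, blocks):
        if blocks > self.free_blocks:
            raise RuntimeError("pool is out of space")
        self.free_blocks -= blocks

    def release(self, blocks):
        self.free_blocks += blocks


class Filesystem:
    """Toy filesystem: a named consumer of pool blocks with no fixed size."""

    def __init__(self, pool, name):
        self.pool = pool
        self.name = name
        self.files = {}  # file name -> number of blocks used

    def write(self, filename, blocks):
        # Blocks come out of the shared pool only when they are needed.
        self.pool.allocate(blocks)
        self.files[filename] = self.files.get(filename, 0) + blocks

    def delete(self, filename):
        # Freed blocks go straight back to the shared pool.
        self.pool.release(self.files.pop(filename))


pool = Pool()
pool.add_device(100)
home = Filesystem(pool, "home")
mail = Filesystem(pool, "mail")

home.write("movie", 60)
mail.write("inbox", 30)
print(pool.free_blocks)   # 10 blocks left for either filesystem

home.delete("movie")      # space freed in "home"...
mail.write("archive", 50) # ...is immediately usable by "mail"
print(pool.free_blocks)   # 20

pool.add_device(100)      # a new disk benefits every filesystem at once
print(pool.free_blocks)   # 120
```

Neither toy filesystem has a size of its own; the only limit is what is left in the pool, which is the point being made in the comment.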
So yes, everything is handled automatically. All you need to do is run "zpool add" to add a new disk to the pool, and now that space is available to all of your filesystems.
- gavintlgold, on 10/12/2007, -0/+1@volsung
About checksumming: So you're saying it won't take longer to checksum everything? When I heard that I thought, "Wait, will it take as long as it does on a normal disk image? Because that is quite a while (30MB takes about 7 seconds or so on my Mac)." It actually won't affect it?
Anyone else know?
- volsung, on 10/12/2007, -0/+1This guy has been working on the userspace port (due to CDDL license issues) of ZFS to Linux:
http://zfs-on-fuse.blogspot.com/
So far he's only made it to the read-only stage, though I hope in the coming months we'll see this finally complete. Since it uses FUSE, it won't be as fast as a kernel driver, but that will have to wait until someone reimplements ZFS from the spec document, or until Sun decides to GPL OpenSolaris (a possibility).
- inactive, on 10/12/2007, -0/+1Ok, I'm convinced. I will get Leopard like I had planned. But won't I have to reformat my disk!!? What do I do?
- geronimo, on 10/12/2007, -0/+1I know that LVM has a feature whereby you can tell it to migrate all the data off of a physical drive onto another drive, and then you can remove it. If ZFS doesn't have this feature then it should in the future; it sounds pretty trivial to implement after they've got all the pooling code written.
- geronimo, on 10/12/2007, -0/+1One current bug is that if a fatal disk error happens it doesn't yet recover gracefully, but this may have been fixed recently.
ZFS is much better than LVM feature-wise, but ZFS is relatively new. In a year or two I would love to use it after all the details are ironed out. LVM was once a newcomer and people didn't use it, but it is now a very mature system. ZFS will get that way and we will all use ZFS some day.
Right now I can resize my LVM array manually but ZFS, which combines a filesystem and volume manager, makes this 1000 times easier. I can't wait for the time when I get to use ZFS in production.
- lampshade, on 10/12/2007, -0/+1Does this mean that partition size is dynamic? As in you could create a 200 GB partition on an HD and have it expand to 400 GB without doing any magic partition tricks? Is it all just handled by ZFS?
- volsung, on 10/12/2007, -0/+1I think you dismiss some of these features too quickly.
Compression - Compressing all your files might negatively impact performance, but you don't have to compress them all. I carry around directories of files on my laptop which I seldom access, but would rather not have to manually decompress to read (or stick on an unchangeable compressed disk image). I'd want to see benchmarks on today's CPUs before writing off compression as being a total performance killer.
Nested filesystems - You should read up more on how ZFS uses the term "filesystem" before casting the whole thing off. (I apologize for not taking the time in my article to explain in detail the differences. It was long enough as it was.) A "filesystem" in ZFS is more like a fancy directory. When you add files to the filesystem, it pulls blocks from the storage pool. When you remove files, the blocks go back into the storage pool. The distinction between filesystems and regular directories is that filesystems have special attributes (like compression on/off, encryption on/off) and you only snapshot filesystems rather than directories.
You describe accurately the evils of standard filesystems, like spare blocks on one partition that you can't use on other partitions and the need to manually grow/shrink partitions and filesystems. These problems do not apply to ZFS due to its architecture of a central storage pool with lightweight filesystems. Empty blocks in a ZFS filesystem ARE returned to the storage pool.
Checksumming - Again, checksumming is not as bad in practice as you might expect, at least it doesn't seem to hugely affect the ZFS benchmarks that I have seen. I do concede that the value of checksumming depends on your paranoia level. I have yet to see any statistics estimating how often the failure modes that ZFS protects against happen in practice.
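To give a feel for why a Fletcher-style checksum is cheap, here is a minimal Python sketch of the classic textbook Fletcher-16 algorithm. This is not the ZFS implementation (its Fletcher variants work on wider words), and the function name `fletcher16` is only illustrative. The per-byte work is two additions and two modulo operations, with no cryptographic mixing at all.

```python
def fletcher16(data: bytes) -> int:
    """Classic Fletcher-16: two running sums, each taken modulo 255."""
    sum1 = 0  # simple sum of the bytes
    sum2 = 0  # sum of the running sums, which makes the result order-sensitive
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


# Store a checksum when the block is written...
block = b"abcde"
stored = fletcher16(block)
print(hex(stored))  # 0xc8f0

# ...and recompute it on read to notice a silently changed byte.
corrupted = b"abcdf"
print(fletcher16(corrupted) == stored)  # False
```

Whether that extra pass over every block is worth it is exactly the trade-off being argued about here; the sketch only shows that the arithmetic itself is simple.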
Snapshots - Why should every app implement its own revision control, when snapshots cover 80% of the cases? Moreover, ZFS's copy-on-write architecture means that the space wastage is limited only to changed blocks. Entire files are not duplicated on disk unless you change every single block. Closely spaced snapshots where nothing happens take up nearly no space, due to the block sharing.
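The block sharing described here can be sketched with a toy model. This is not how ZFS is implemented; `Volume`, `snapshot`, `write`, `read`, and `blocks_in_use` are hypothetical names, and a "block" is just a bytes value stored under an integer id. A snapshot copies only the table of block ids, so unchanged blocks are shared and only rewritten blocks cost new space.

```python
import itertools


class Volume:
    """Toy copy-on-write volume: tables of block ids pointing into a store."""

    def __init__(self, contents):
        self._ids = itertools.count()
        self.store = {}      # block id -> data (never overwritten in place)
        self.table = []      # live view: list of block ids
        self.snapshots = {}  # snapshot name -> frozen tuple of block ids
        for data in contents:
            self.table.append(self._new_block(data))

    def _new_block(self, data):
        block_id = next(self._ids)
        self.store[block_id] = data
        return block_id

    def snapshot(self, name):
        # Copy the references only; no block data is duplicated.
        self.snapshots[name] = tuple(self.table)

    def write(self, index, data):
        # Copy-on-write: the new data goes into a fresh block, and the old
        # block stays put for any snapshot that still points at it.
        self.table[index] = self._new_block(data)

    def read(self, index, snapshot=None):
        table = self.table if snapshot is None else self.snapshots[snapshot]
        return self.store[table[index]]

    def blocks_in_use(self):
        # Distinct blocks referenced by the live view or by any snapshot.
        referenced = set(self.table)
        for table in self.snapshots.values():
            referenced.update(table)
        return len(referenced)


vol = Volume([b"one", b"two", b"three", b"four"])
vol.snapshot("monday")
vol.snapshot("tuesday")      # nothing changed, so nothing new is stored
print(vol.blocks_in_use())   # 4

vol.write(1, b"TWO")         # only the changed block costs extra space
print(vol.blocks_in_use())   # 5
print(vol.read(1))           # b'TWO'
print(vol.read(1, "monday")) # b'two'
```

Two snapshots with no writes in between reference the same four blocks, which is the "closely spaced snapshots take up nearly no space" case.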
SW RAID - This is a common argument I've seen. For expensive HW RAID controllers, yes, performance is much better, and I can see lots of reasons why servers would want to still use them. But the average RAID controller in consumer grade hardware is junk, and the main CPU can do it much faster. Moving those tasks into software lets you have some of the benefits of RAID in a moderately priced system. The decision between HW and SW RAID depends on what you plan to do, and your budget.
ZFS is not the ultimate filesystem by any means (despite what the promotional materials say), but I really think you should at least read through this presentation to understand the architecture a bit better before passing judgment:
http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf
ZFS is designed radically differently than previous systems, so many obvious truisms do not hold.
- HappyScrappy, on 10/12/2007, -1/+1"Filesystems can be compressed." - You'll get lousy performance. It'll choke your CPU just to read data. Of limited usefulness. Anyone try it in NTFS? Yeah, you know it isn't worth it. Reading data without compression only requires a little bit of CPU; the rest is done with DMA. Not with compression.
"Filesystems are nested and making them is as easy as making a directory." - Yuck. Why would I want lots of file systems? UNIX does this already and it's annoying, because you can fill a filesystem and still have space in other filesystems (think of partitioning). You can have a window open in the Finder saying 1GB free, but when you try to put files into a folder in that window, there's no space because it's a different file system. Automatically growing filesystems only works as long as there is virgin (or scavenged) space available for growth. Empty blocks in one FS are not automatically returned to this pool, so it isn't true free space sharing.
"Every block of data on the disk is checksummed so errors can be detected during read operations." - Will produce godawful performance. It'll consume a ton of CPU and break up large reads into small ones with lots of "inchworm" memory moves to close up gaps in the actual data where the checksums are stored. Your hard drive already uses sophisticated hardware-based error correction and signals errors to the OS if it fails. Additional checking is of limited usefulness.
"Space-efficient and fast snapshots." - No, thanks, revisions should be managed by apps. Many apps modify files just when launching. These files will be duplicated even though nothing worth saving happened. Skip.
As to all the SW RAID stuff. HW RAID is the way to go. Skip SW RAID.
"Highly SMP-friendly design." - That's actually pretty good.
It could be worse, though: one of the commenters thinks Apple should have used UFS. UFS is a very poor choice nowadays; it's far less suitable than HFS+, and probably less than ZFS too.
- pvcooper, on 01/04/2008, -0/+0To all (especially those who have read the current internals information on ZFS, who will understand):
I have only one thing to say:
"ZFS is to the storage realm what TCP/IP was to the networking realm in the 80's." Quote by me: Peter Cooper
I hope all who are questioning the value of ZFS, like some above in this blog posting area, will read about it and try to understand its current implementation and superior design.
- zonk3r, on 10/12/2007, -1/+1It would seem like a BSD-style license would be a better fit if they want it to be used widely (by many users across many platforms).
- aplardi, on 10/12/2007, -3/+3Good question, perhaps there is a sort of built in solution. Or perhaps HFS+ will be better off for removable storage.
What's worked best for me in terms of compatibility is having my external drive (320 GB) partitioned in HFS+ and with a 50 GB FAT32 partition that has some tools for Windows to read/write HFS+.
I'm excited about the new ZFS system, but I am not scared that Apple will make a mistake big enough to have a negative effect on our storage. But they have been known to make mistakes.
- moofer, on 10/12/2007, -0/+0It's gonna be interesting to add the WWNs for our Xserves into the same pool as our Solaris boxen. Mount one filesystem anywhere? It's a brave new world. We can flip apache docroots, mailspools, openldap databases from host to host, regardless of platform now. I think our world just got a lot cheaper.
- HappyScrappy, on 10/12/2007, -1/+1Returning single blocks to the free pool between systems is a negative thing.
Thus any fragmentation in one filesystem spreads to them all? Yuck. File systems end up interlaced on the disk, so that one has 1000 sectors in a row, then the next 1 belongs to another and then another has 1000 again? Yuck. Maybe I shouldn't worry about fragmentation anyway since copy-on-write ensures that your sectors (or at least files) will be nowhere near each other anyway. Want to grep a directory? Thanks to snapshots, the old files are kept at the start of the disk, the newer ones a little farther down, newer than that a little farther down, and the newest ones near the end of the disk. Rattle-rattle-rattle. No thanks.
How about if I just want to compress my files, I compress my files? Most people would like to compress and serialize files for transmission, so why would I want them compressed separately in a file system I can't serialize automatically without using tar or cpio? Let me put it another way: I can compress subdirectories in NTFS, and I don't use it. Why would ZFS be any different? It's not magic.
The problem isn't that snapshots cover 80% of cases. The problem is that snapshots cover 250% of cases. You'll end up revision controlling files that shouldn't be controlled. Or revisions will be snapshotted that don't even make sense (like in the middle of a larger reorganizational operation). Filesystems don't have traditional calls for "the file is in a consistent state right now, now would be a good time to snapshot" or "now I have made significant changes that would be worth snapshotting" versus "I have made trivial changes that don't need to be saved, they'll be done again automatically if the file is rolled back to the last snapshot made". So you end up capturing data in inconsistent states, or you end up snapshotting stuff that doesn't need to be snapshotted.
It's a mess. This can't be grafted on without modifying many many apps to understand it. And UNIX has a lot of apps to modify.
The average person shouldn't be using RAID at all, SW or HW. It has performance negatives (rotational latency adds up) that don't make sense for most people.
I'd love to hear how you think checksumming is mitigated. Bringing in 5M in a row takes almost the same time as bringing in 1M. Thus, bringing in 5M in five 1M stripes with checksum data in between is 5x slower. And that is just wall clock time; the CPU time goes up exponentially since DMA takes almost no CPU and checksumming 5M takes quite a bit.
On a hard drive, the sectors are ECC coded. On ATA UDMA/33 and later, the data transfer over the interface is CRC checked. PCI Express has CRC checking on the transfer from the controller to the memory. Where are these disk errors you're going to detect coming from?
If you have an app that is so sensitive to corruption that these 3 layers aren't enough protection, you probably shouldn't trust anyone else to handle your data safely, and should implement your own error detection at the file format level.
- AtHomeBoy2000, on 10/12/2007, -2/+1And it's going to be adopted by Microsoft as a formatting option.... oh, wait. Why would M$ do something silly like that?
- rspeed, on 10/12/2007, -3/+2Pretty much the same thing that happens now, I would assume, but without worrying about corruption and data loss.