We’ve been doing a lot of storage research lately, and there’s been a lot of talk about ZFS. I’m going to spare you the magazine article (if you want to read more on what it is, and where it comes from, look elsewhere) and give you some guts.
ZFS is a 128-bit file system, and unfortunately it isn't likely to be built into the Linux kernel anytime soon (its CDDL license is incompatible with the GPL). You can, however, use it in userspace via zfs-fuse, similarly to how you might use NTFS on Linux (for those of us still dual booting). The machine I'm running on runs solely Fedora 11, and has a handsome amount of beef behind it. It's also got 500GB of local storage, so I can play around with huge files, no sweat. You can do the same things I'm doing with smaller files, if you'd like.
First of all, you'll need to install zfs-fuse. This was simple on Fedora.
$ sudo yum install zfs-fuse
Next, some blank disk images to toy with.
$ mkdir zfs
$ cd zfs
$ for i in $(seq 8); do dd if=/dev/zero of=$i bs=1024 count=2097152; done
This gives me eight 2GB blobs. Make these smaller if you'd like; I wanted enough space to throw some large files at ZFS. You'll see in a bit.
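Side note: if you don't want to actually write 16GB of zeros, dd can lay out the same files as sparse files almost instantly by seeking past the end instead of writing. This is an alternative sketch (using a hypothetical /tmp/zfs-sparse directory, not what I ran above):

```shell
# Create eight 2GiB backing files as sparse files: seek past 2097152
# 1KiB blocks and write nothing. 1024 * 2097152 = 2147483648 bytes,
# i.e. exactly 2GiB apparent size, but almost no disk is consumed.
mkdir -p /tmp/zfs-sparse && cd /tmp/zfs-sparse
for i in $(seq 8); do
  dd if=/dev/zero of=$i bs=1024 count=0 seek=2097152 2>/dev/null
done
stat -c '%s' 1   # apparent size in bytes: 2147483648
du -k 1          # blocks actually allocated: close to zero
```

ZFS shouldn't care that the files are sparse; blocks get allocated as the pool writes to them.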
Now let’s make our first zfs pool.
$ sudo zpool create jose ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4
I named my pool jose. I like it when my blog entries have personality. 😛
zfs list will give you a list of your zfs pools.
$ sudo zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
jose    72K  7.81G    18K  /jose
Creating the pool also mounts it.
$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  454G  210G  221G  49% /
/dev/sda1                        190M   30M  151M  17% /boot
tmpfs                            2.0G   25M  2.0G   2% /dev/shm
jose                             7.9G   18K  7.9G   1% /jose
An interesting note: I never created a file system on this pool; I just told ZFS to have at it. ZFS combines the volume manager and the file system, so creating a pool also creates and mounts a root file system on top of it.
Now, let’s poke jose with a stick, and see what he does.
$ sudo dd if=/dev/zero of=/jose/testfile bs=1024 count=2097512
2097512+0 records in
2097512+0 records out
2147852288 bytes (2.1 GB) copied, 118.966 s, 18.1 MB/s
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
jose   2.00G  5.81G  2.00G  /jose
It's worth noting that with zpool add, you can grow a plain pool like this one by tacking on more disks later.
That's all fun, but this is essentially just a large file system. No really cool features yet. Let's see what we can really do with this thing.
Let’s make a raid group, instead of just a standard pool.
Goodbye Jose
$ sudo zpool destroy jose
From jose's ashes, let's make a new pool.
$ sudo zpool create susan raidz ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  92.0K  5.84G  26.9K  /susan
Notice that susan is smaller than jose, using the same disks. This isn't because susan has made more trips to the gym than jose; rather, it's because of the raid set. raidz is similar to raid 5, where one disk's worth of space is used for parity, so you lose one disk's worth of capacity.
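The arithmetic behind that, as a quick sketch (using the nominal 2GB figures from this post; zfs list reports slightly less in practice, due to overhead):

```shell
# raidz uses one disk's worth of space for parity, so usable capacity
# is roughly (number_of_disks - 1) * disk_size.
disks=4
disk_gb=2
usable=$(( (disks - 1) * disk_gb ))
echo "raidz usable: ${usable}GB"   # 6GB nominal; zfs list shows 5.84G
```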
Let’s remedy that, by throwing more (virtual) hardware at it.
You can't expand a raidz group by adding a disk, so we'll do it by recreating the group.
$ sudo zpool destroy susan
$ sudo zpool create susan raidz ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4 ~/zfs/5
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  98.3K  7.81G  28.8K  /susan
And there you go, about 8GB again.
Now let’s poke susan with a stick.
First, here’s her status:
$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct  6 15:22:24 2009
config:

        NAME                    STATE     READ WRITE CKSUM
        susan                   ONLINE       0     0     0
          raidz1                ONLINE       0     0     0
            /home/lagern/zfs/1  ONLINE       0     0     0
            /home/lagern/zfs/2  ONLINE       0     0     0
            /home/lagern/zfs/3  ONLINE       0     0     0
            /home/lagern/zfs/4  ONLINE       0     0     0
            /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
Now we’ll dd another file to susan, and we’ll see if we can damage the array.
$ sudo dd if=/dev/zero of=/susan/testfile bs=1024 count=2097512
Then, in another terminal…
$ sudo zpool offline susan ~/zfs/4
$ sudo zpool status
  pool: susan
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct  6 15:22:24 2009
config:

        NAME                    STATE     READ WRITE CKSUM
        susan                   DEGRADED     0     0     0
          raidz1                DEGRADED     0     0     0
            /home/lagern/zfs/1  ONLINE       0     0     0
            /home/lagern/zfs/2  ONLINE       0     0     0
            /home/lagern/zfs/3  ONLINE       0     0     0
            /home/lagern/zfs/4  OFFLINE      0     0     0
            /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
The dd is still running.
$ sudo zpool online susan ~/zfs/4
dd's still going…

dd finally finished. It took a little longer than the first copy, but it completed, and the file appears correct.
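A more rigorous check than eyeballing the file would be to checksum known data before degrading the pool and verify it afterwards. A minimal sketch of the idea, using a small file in /tmp as a stand-in for /susan/testfile:

```shell
# Write known data and record its checksum before touching the pool.
dd if=/dev/zero of=/tmp/known bs=1024 count=1024 2>/dev/null
sha256sum /tmp/known | awk '{print $1}' > /tmp/known.sha256
# ... offline the disk, let the workload finish, online it again ...
# Then verify the data still matches the recorded checksum:
sha256sum /tmp/known | awk '{print $1}' | diff - /tmp/known.sha256 \
  && echo "file intact"
```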
Now, let's try something else. With raid, you generally won't just take a drive offline and then bring it right back, so let's see what happens if you replace the drive.
Another dd session, and then the drive swap commands.
$ sudo dd if=/dev/zero of=/susan/testfile2 bs=1024 count=2097512
In another terminal…
$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Tue Oct  6 15:26:06 2009
config:

        NAME                    STATE     READ WRITE CKSUM
        susan                   ONLINE       0     0     0
          raidz1                ONLINE       0     0     0
            /home/lagern/zfs/1  ONLINE       0     0     0
            /home/lagern/zfs/2  ONLINE       0     0     0
            /home/lagern/zfs/3  ONLINE       0     0     0
            /home/lagern/zfs/4  ONLINE       0     0     0
            /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
$ sudo zpool offline susan ~/zfs/4
$ sudo zpool replace susan ~/zfs/4 ~/zfs/6
$ sudo zpool status
  pool: susan
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 25.87% done, 0h3m to go
config:

        NAME                      STATE     READ WRITE CKSUM
        susan                     DEGRADED     0     0     0
          raidz1                  DEGRADED     0     0     0
            /home/lagern/zfs/1    ONLINE       0     0     0
            /home/lagern/zfs/2    ONLINE       0     0     0
            /home/lagern/zfs/3    ONLINE       0     0     0
            replacing             DEGRADED     0     0     0
              /home/lagern/zfs/4  OFFLINE      0     0     0
              /home/lagern/zfs/6  ONLINE       0     0     0
            /home/lagern/zfs/5    ONLINE       0     0     0

errors: No known data errors
This procedure seriously degraded the speed of the dd. It also made my music chop, once.
After the dd finished, the status was happy again:
$ sudo dd if=/dev/zero of=/susan/testfile2 bs=1024 count=2097512
2097512+0 records in
2097512+0 records out
2147852288 bytes (2.1 GB) copied, 356.92 s, 6.0 MB/s
$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h4m with 0 errors on Tue Oct  6 15:35:52 2009
config:

        NAME                    STATE     READ WRITE CKSUM
        susan                   ONLINE       0     0     0
          raidz1                ONLINE       0     0     0
            /home/lagern/zfs/1  ONLINE       0     0     0
            /home/lagern/zfs/2  ONLINE       0     0     0
            /home/lagern/zfs/3  ONLINE       0     0     0
            /home/lagern/zfs/6  ONLINE       0     0     0
            /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
Note that 4 is now replaced with 6.
Time for some coffee………..
Now let's look at some really neat things.
I mentioned that you can't expand a raidz volume. What you can do is replace the disks with larger ones. It's unclear how this affects your data, though (at least, it's unclear to me!), so I'm going to try it.
First let’s make some larger “disks”.
$ for i in $(seq 9 13); do dd if=/dev/zero of=$i bs=1024 count=4195024; done
Here we are at the beginning:
$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h4m with 0 errors on Tue Oct  6 15:35:52 2009
config:

        NAME                    STATE     READ WRITE CKSUM
        susan                   ONLINE       0     0     0
          raidz1                ONLINE       0     0     0
            /home/lagern/zfs/1  ONLINE       0     0     0
            /home/lagern/zfs/2  ONLINE       0     0     0
            /home/lagern/zfs/3  ONLINE       0     0     0
            /home/lagern/zfs/6  ONLINE       0     0     0
            /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  4.00G  3.82G  4.00G  /susan
The new disks I created are 4GB, so we should be able to double the capacity in this pool using them.
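Back-of-the-envelope for what to expect, using the same (n-1) rule as before with nominal disk sizes (a sketch, not output from the pool):

```shell
# Five-disk raidz: usable space is roughly (5 - 1) * disk size.
before=$(( (5 - 1) * 2 ))   # with the original 2GB disks
after=$(( (5 - 1) * 4 ))    # after replacing them all with 4GB disks
echo "usable: ${before}GB -> ${after}GB"
```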
$ sudo zpool replace susan ~/zfs/1 ~/zfs/9
$ sudo zpool replace susan ~/zfs/2 ~/zfs/10
$ sudo zpool status
  pool: susan
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 12.94% done, 0h6m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        susan                      ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/1   ONLINE       0     0     0
              /home/lagern/zfs/9   ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/2   ONLINE       0     0     0
              /home/lagern/zfs/10  ONLINE       0     0     0
            /home/lagern/zfs/3     ONLINE       0     0     0
            /home/lagern/zfs/6     ONLINE       0     0     0
            /home/lagern/zfs/5     ONLINE       0     0     0

errors: No known data errors
$ sudo zpool replace susan ~/zfs/3 ~/zfs/11
$ sudo zpool replace susan ~/zfs/6 ~/zfs/12
$ sudo zpool replace susan ~/zfs/5 ~/zfs/13
$ sudo zpool status
  pool: susan
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 8.21% done, 0h5m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        susan                      ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/1   ONLINE       0     0     0
              /home/lagern/zfs/9   ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/2   ONLINE       0     0     0
              /home/lagern/zfs/10  ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/3   ONLINE       0     0     0
              /home/lagern/zfs/11  ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/6   ONLINE       0     0     0
              /home/lagern/zfs/12  ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              /home/lagern/zfs/5   ONLINE       0     0     0
              /home/lagern/zfs/13  ONLINE       0     0     0

errors: No known data errors
This took a while, and really hit my system hard. I’d recommend doing this one drive at a time.
$ top
top - 16:12:10 up 25 days,  5:27, 25 users,  load average: 11.36, 9.27, 6.20
Tasks: 280 total,   2 running, 278 sleeping,   0 stopped,   0 zombie
Cpu0  : 10.2%us,  1.3%sy,  0.0%ni, 61.0%id, 27.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.6%us,  2.9%sy,  0.0%ni,  5.5%id, 89.6%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :  0.7%us,  0.7%sy,  0.0%ni, 92.7%id,  5.9%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  3.9%us,  2.0%sy,  0.0%ni, 94.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  1.0%us,  0.3%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  1.3%us,  2.0%sy,  0.0%ni,  9.8%id, 86.9%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  5.4%us,  6.8%sy,  0.0%ni, 87.3%id,  0.0%wa,  0.0%hi,  0.6%si,  0.0%st
Cpu7  :  1.6%us,  1.3%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4121040k total,  4004956k used,   116084k free,    13756k buffers
Swap:  5406712k total,   322328k used,  5084384k free,  1441452k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11021 lagern    20   0 1417m 1.1g  35m S 14.2 26.8  2393:07  VirtualBox
  313 lagern    20   0 1077m 555m  13m R 12.6 13.8  1089:52  firefox
22170 root      20   0  565m 221m 1428 S  6.6  5.5    5:57.71 zfs-fuse
I think I'll go read some things on my laptop while this finishes.
Done! It took about 15 minutes to complete. My test files are still present in the pool:
$ ls -lh /susan
total 4.0G
-rw-r--r-- 1 root root 2.1G 2009-10-06 15:27 testfile
-rw-r--r-- 1 root root 2.1G 2009-10-06 15:35 testfile2
My pool does not yet show the new size….
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  4.00G  3.82G  4.00G  /susan
I remounted…
$ sudo zfs umount /susan
$ sudo zfs mount susan
No change….
According to harryd, a reboot is necessary. I'm not in the rebooting mood at the moment. I'll try this, and report back if it doesn't work.
So, there you have it: ZFS! Oh, another note: raidz is not the only raid option. raidz2 supports two parity drives, like raid6. You can specify this via the zpool create command, using raidz2 where raidz was.
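For completeness, here's a sketch of the raidz2 variant and its capacity math (the create command is the same shape as before; I haven't run this one):

```shell
# raidz2 keeps two disks' worth of parity, so it survives two
# simultaneous disk failures; usable space is roughly
# (number_of_disks - 2) * disk_size.
# The create command (not run here) would look like:
#   sudo zpool create susan raidz2 ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4 ~/zfs/5
disks=5
disk_gb=2
usable=$(( (disks - 2) * disk_gb ))
echo "raidz2 usable: ${usable}GB"   # vs 8GB for single-parity raidz
```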
Enjoy!
-War