We’ve been doing a lot of storage research lately, and there’s been a lot of talk about ZFS. I’m going to spare you the magazine article (if you want to read more on what it is, and where it comes from, look elsewhere) and give you some guts.

ZFS is a 128-bit file system, and unfortunately isn't likely to be built into the Linux kernel anytime soon. You can, however, use it in userspace via zfs-fuse, similarly to how you might use NTFS on Linux (for those of us still dual-booting). The machine I'm running on runs solely Fedora 11 and has a handsome amount of beef behind it. It's also got 500 GB of local storage, so I can play around with huge files, no sweat. You can do the same things I'm doing with smaller files, if you'd like.

First of all, you'll need to install zfs-fuse. This was simple on Fedora:

$ sudo yum install zfs-fuse

Next, some blank disk images to toy with:

$ mkdir zfs
$ cd zfs
$ for i in $(seq 8); do dd if=/dev/zero of=$i bs=1024 count=2097152; done

This gives me eight 2 GB blobs. Make these smaller if you'd like; I wanted enough space to throw some large files at ZFS. You'll see in a bit.
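Side note: if you don't feel like waiting for dd to write all those zeros, sparse files should work just as well for a toy pool like this, assuming your coreutils is new enough to have truncate:

$ for i in $(seq 8); do truncate -s 2G $i; done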

Now let’s make our first zfs pool.

$ sudo zpool create jose ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4 

I named my pool jose. I like it when my blog entries have personality. 😛

zfs list will give you a list of your ZFS datasets; each pool gets a root dataset, so the new pool shows up here.

$ sudo zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
jose    72K  7.81G    18K  /jose
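If you want the pool-level view rather than the dataset view, zpool list gives you that too (output omitted here):

$ sudo zpool list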

Creating the pool also mounts it.

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      454G  210G  221G  49% /
/dev/sda1             190M   30M  151M  17% /boot
tmpfs                 2.0G   25M  2.0G   2% /dev/shm
jose                  7.9G   18K  7.9G   1% /jose

An interesting note: I never created a file system on this pool, I just told ZFS to have at it. That's because ZFS rolls the volume manager and the file system into one; creating a pool automatically creates a root file system on it and mounts it.
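Since the pool is really a container of file systems, you can also carve more of them out of it. Something like this (not something I ran above) should create and mount a second file system at /jose/stuff:

$ sudo zfs create jose/stuff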

Now, let’s poke jose with a stick, and see what he does.

$ sudo dd if=/dev/zero of=/jose/testfile bs=1024 count=2097512
2097512+0 records in
2097512+0 records out
2147852288 bytes (2.1 GB) copied, 118.966 s, 18.1 MB/s

$ sudo zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
jose  2.00G  5.81G  2.00G  /jose

It's worth noting that with zpool add you can add space to a pool of this sort.
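For example, something like this (untested here) would grow jose by one more 2 GB blob:

$ sudo zpool add jose ~/zfs/5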

That's all fun, but this is essentially just a large file system. No really cool features yet. Let's see what we can really do with this thing.

Let’s make a raid group, instead of just a standard pool.

Goodbye, jose.

$ sudo zpool destroy jose

From jose's ashes, let's make a new pool.

$ sudo zpool create susan raidz ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  92.0K  5.84G  26.9K  /susan

Notice that susan is smaller than jose, using the same disks. This isn't because susan has made more trips to the gym than jose; it's because of the raid set. raidz is similar to RAID 5: one disk's worth of capacity goes to parity (striped across the drives), so with four 2 GB disks you're left with roughly 3 × 2 GB = 6 GB, which lines up with the 5.84G AVAIL above.

Let's remedy that by throwing more (virtual) hardware at it.

You can't expand a raidz group by adding a disk, so we'll do it by recreating the group.

$ sudo zpool destroy susan
$ sudo zpool create susan raidz ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4 ~/zfs/5
$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  98.3K  7.81G  28.8K  /susan

And there you go, about 8 GB again.
Now let's poke susan with a stick.

First, here’s her status:

$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct  6 15:22:24 2009
config:

	NAME                    STATE     READ WRITE CKSUM
	susan                   ONLINE       0     0     0
	  raidz1                ONLINE       0     0     0
	    /home/lagern/zfs/1  ONLINE       0     0     0
	    /home/lagern/zfs/2  ONLINE       0     0     0
	    /home/lagern/zfs/3  ONLINE       0     0     0
	    /home/lagern/zfs/4  ONLINE       0     0     0
	    /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors

Now we’ll dd another file to susan, and we’ll see if we can damage the array.

$ sudo dd if=/dev/zero of=/susan/testfile bs=1024 count=2097512

Then, in another terminal…

$ sudo zpool offline susan ~/zfs/4
$ sudo zpool status
  pool: susan
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct  6 15:22:24 2009
config:

	NAME                    STATE     READ WRITE CKSUM
	susan                   DEGRADED     0     0     0
	  raidz1                DEGRADED     0     0     0
	    /home/lagern/zfs/1  ONLINE       0     0     0
	    /home/lagern/zfs/2  ONLINE       0     0     0
	    /home/lagern/zfs/3  ONLINE       0     0     0
	    /home/lagern/zfs/4  OFFLINE      0     0     0
	    /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors

The dd is still running.

$ sudo zpool online susan ~/zfs/4

dd's still going…

The dd finally finished. It took a little longer than the first copy, but it finished, and the file appears correct.
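If you want more than eyeballing, the file came from /dev/zero, so you can compare it byte-for-byte against a stream of zeros. A quick check, though not one I actually ran above (the byte count comes from the dd output):

$ cmp /susan/testfile <(head -c 2147852288 /dev/zero) && echo "testfile is intact"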

Now, let's try something else. With raid, you generally won't just take a drive offline and then bring it right back, so let's see what happens if you replace the drive.

Another dd session, and then the drive swap commands.

$ sudo dd if=/dev/zero of=/susan/testfile2 bs=1024 count=2097512

In another terminal…

$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Tue Oct  6 15:26:06 2009
config:

	NAME                    STATE     READ WRITE CKSUM
	susan                   ONLINE       0     0     0
	  raidz1                ONLINE       0     0     0
	    /home/lagern/zfs/1  ONLINE       0     0     0
	    /home/lagern/zfs/2  ONLINE       0     0     0
	    /home/lagern/zfs/3  ONLINE       0     0     0
	    /home/lagern/zfs/4  ONLINE       0     0     0
	    /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors
$ sudo zpool offline susan ~/zfs/4
$ sudo zpool replace susan ~/zfs/4 ~/zfs/6
$ sudo zpool status
  pool: susan
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 25.87% done, 0h3m to go
config:

	NAME                      STATE     READ WRITE CKSUM
	susan                     DEGRADED     0     0     0
	  raidz1                  DEGRADED     0     0     0
	    /home/lagern/zfs/1    ONLINE       0     0     0
	    /home/lagern/zfs/2    ONLINE       0     0     0
	    /home/lagern/zfs/3    ONLINE       0     0     0
	    replacing             DEGRADED     0     0     0
	      /home/lagern/zfs/4  OFFLINE      0     0     0
	      /home/lagern/zfs/6  ONLINE       0     0     0
	    /home/lagern/zfs/5    ONLINE       0     0     0

errors: No known data errors

This procedure seriously degraded the speed of the dd. It also made my music skip, once.
After the dd finished, the status was happy again:

$ sudo dd if=/dev/zero of=/susan/testfile2 bs=1024 count=2097512
2097512+0 records in
2097512+0 records out
2147852288 bytes (2.1 GB) copied, 356.92 s, 6.0 MB/s

$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h4m with 0 errors on Tue Oct  6 15:35:52 2009
config:

	NAME                    STATE     READ WRITE CKSUM
	susan                   ONLINE       0     0     0
	  raidz1                ONLINE       0     0     0
	    /home/lagern/zfs/1  ONLINE       0     0     0
	    /home/lagern/zfs/2  ONLINE       0     0     0
	    /home/lagern/zfs/3  ONLINE       0     0     0
	    /home/lagern/zfs/6  ONLINE       0     0     0
	    /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors

Note that 4 is now replaced with 6.
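If you're feeling paranoid after a replace, you can also have ZFS re-read and verify everything in the pool with a scrub, then check zpool status for the result:

$ sudo zpool scrub susan
$ sudo zpool status susan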

Time for some coffee…

Now let's look at some really neat things.

I mentioned that you can't expand a raidz group. What you can do is replace the disks with larger ones. It's unclear how this affects your data, though (at least, it's unclear to me!), so I'm going to try it.

First let’s make some larger “disks”.

$ for i in $(seq 9 13); do dd if=/dev/zero of=$i bs=1024 count=4195024; done

Here we are at the beginning:

$ sudo zpool status
  pool: susan
 state: ONLINE
 scrub: resilver completed after 0h4m with 0 errors on Tue Oct  6 15:35:52 2009
config:

	NAME                    STATE     READ WRITE CKSUM
	susan                   ONLINE       0     0     0
	  raidz1                ONLINE       0     0     0
	    /home/lagern/zfs/1  ONLINE       0     0     0
	    /home/lagern/zfs/2  ONLINE       0     0     0
	    /home/lagern/zfs/3  ONLINE       0     0     0
	    /home/lagern/zfs/6  ONLINE       0     0     0
	    /home/lagern/zfs/5  ONLINE       0     0     0

errors: No known data errors

$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  4.00G  3.82G  4.00G  /susan

The new disks I created are 4 GB, so we should be able to double the capacity in this pool using these disks.

$ sudo zpool replace susan ~/zfs/1 ~/zfs/9
$ sudo zpool replace susan ~/zfs/2 ~/zfs/10
$ sudo zpool status
  pool: susan
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 12.94% done, 0h6m to go
config:

	NAME                       STATE     READ WRITE CKSUM
	susan                      ONLINE       0     0     0
	  raidz1                   ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/1   ONLINE       0     0     0
	      /home/lagern/zfs/9   ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/2   ONLINE       0     0     0
	      /home/lagern/zfs/10  ONLINE       0     0     0
	    /home/lagern/zfs/3     ONLINE       0     0     0
	    /home/lagern/zfs/6     ONLINE       0     0     0
	    /home/lagern/zfs/5     ONLINE       0     0     0

errors: No known data errors
$ sudo zpool replace susan ~/zfs/3 ~/zfs/11
$ sudo zpool replace susan ~/zfs/6 ~/zfs/12
$ sudo zpool replace susan ~/zfs/5 ~/zfs/13
$ sudo zpool status
  pool: susan
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 8.21% done, 0h5m to go
config:

	NAME                       STATE     READ WRITE CKSUM
	susan                      ONLINE       0     0     0
	  raidz1                   ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/1   ONLINE       0     0     0
	      /home/lagern/zfs/9   ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/2   ONLINE       0     0     0
	      /home/lagern/zfs/10  ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/3   ONLINE       0     0     0
	      /home/lagern/zfs/11  ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/6   ONLINE       0     0     0
	      /home/lagern/zfs/12  ONLINE       0     0     0
	    replacing              ONLINE       0     0     0
	      /home/lagern/zfs/5   ONLINE       0     0     0
	      /home/lagern/zfs/13  ONLINE       0     0     0

errors: No known data errors

This took a while, and really hit my system hard. I’d recommend doing this one drive at a time.

$ top

top - 16:12:10 up 25 days,  5:27, 25 users,  load average: 11.36, 9.27, 6.20
Tasks: 280 total,   2 running, 278 sleeping,   0 stopped,   0 zombie
Cpu0  : 10.2%us,  1.3%sy,  0.0%ni, 61.0%id, 27.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.6%us,  2.9%sy,  0.0%ni,  5.5%id, 89.6%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :  0.7%us,  0.7%sy,  0.0%ni, 92.7%id,  5.9%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  3.9%us,  2.0%sy,  0.0%ni, 94.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  1.0%us,  0.3%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  1.3%us,  2.0%sy,  0.0%ni,  9.8%id, 86.9%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  5.4%us,  6.8%sy,  0.0%ni, 87.3%id,  0.0%wa,  0.0%hi,  0.6%si,  0.0%st
Cpu7  :  1.6%us,  1.3%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4121040k total,  4004956k used,   116084k free,    13756k buffers
Swap:  5406712k total,   322328k used,  5084384k free,  1441452k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                       
11021 lagern    20   0 1417m 1.1g  35m S 14.2 26.8   2393:07 VirtualBox                                                    
  313 lagern    20   0 1077m 555m  13m R 12.6 13.8   1089:52 firefox                                                       
22170 root      20   0  565m 221m 1428 S  6.6  5.5   5:57.71 zfs-fuse     
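Next time, I'd script the one-drive-at-a-time approach: kick off a replace, poll zpool status until the resilver finishes, then move on to the next disk. A rough sketch (the old:new pairs and the poll interval are just placeholders, and I haven't run this):

$ for p in 1:9 2:10 3:11 6:12 5:13; do
>     sudo zpool replace susan ~/zfs/${p%:*} ~/zfs/${p#*:}
>     while sudo zpool status susan | grep -q 'resilver in progress'; do
>         sleep 30    # wait, then check the scrub line again
>     done
> done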

I think I'll go read some things on my laptop while this finishes.

Done! It took about 15 minutes to complete. My test files are still present in the pool:

$ ls -lh /susan
total 4.0G
-rw-r--r-- 1 root root 2.1G 2009-10-06 15:27 testfile
-rw-r--r-- 1 root root 2.1G 2009-10-06 15:35 testfile2

My pool does not yet show the new size….

$ sudo zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
susan  4.00G  3.82G  4.00G  /susan

I remounted…

$ sudo zfs umount /susan
$ sudo zfs mount susan

No change….

According to harryd, a reboot is necessary. I'm not in the rebooting mood at the moment. I'll try this, and report back if it doesn't work.
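One thing that might be worth trying before a reboot: exporting and re-importing the pool. On ZFS versions from this era, the extra capacity is supposed to get picked up when the devices are reopened, and an export/import cycle does exactly that. Note the -d flag, which tells import where to look for file-based vdevs (it only scans /dev by default). I haven't verified that this fixes it here:

$ sudo zpool export susan
$ sudo zpool import -d ~/zfs susan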

So, there you have it: ZFS! Oh, one more note: raidz is not the only raid option. raidz2 keeps two parity drives, like RAID 6, so you can survive two simultaneous disk failures at the cost of two disks' worth of capacity. You can specify it via the zpool create command, using raidz2 where raidz was.
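For example, a raidz2 version of susan would look like this:

$ sudo zpool create susan raidz2 ~/zfs/1 ~/zfs/2 ~/zfs/3 ~/zfs/4 ~/zfs/5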

Enjoy!

-War