Nov 30 2008

gmirror and gvinum on the same drives

In 2006, when I was installing a FreeBSD server for a client, one of the requirements was a RAID-5 array of some kind. I looked around and discovered GEOM vinum (gvinum), which provided what I needed at the time. The machine is a file server, but throughput is not critical, so I tried it (graid5 was not yet available back then, AFAIK). I am writing this because this weekend I had to rebuild the array (and copy the data) on new, larger drives. It took me many hours, because there is not much documentation on how to make different GEOM RAID subsystems share the same three drives.

This is what I wanted to achieve: three drives holding two gmirror (RAID-1) arrays (one for the root partition, the other for swap) and three gvinum (RAID-5) volumes for /var, /tmp and /usr. For the latter, it is best to use the volume management capabilities of gvinum: you hand it only three physical devices (or slices or partitions), and the logical volumes are created “inside” the vinum manager.

The main problem was that I had forgotten how to do this “properly”. Two years had passed since I last did it (and of course I hadn’t written down how, although it took me a few hours back then), and since the machine is far away, I don’t have physical access. Access would have helped, because I could just have put the old drives back and looked at how they were configured, but the remote system administrator had already swapped the drives and I didn’t want to bother him.

In FreeBSD terms, a partition is a logical unit that resides on a slice (which is what other operating systems call a partition). Let’s say we have four drives in the system: /dev/ad0, /dev/ad1, /dev/ad2 and /dev/ad3. We’ll assume that /dev/ad0 holds the FreeBSD system we are currently booted into, and that we want to create the arrays on the other three drives, which will eventually run by themselves (we’ll pull /dev/ad0 out when we finish). When you create a slice on /dev/ad1, for example, you’ll be able to access it via /dev/ad1s1. When you create a partition on this slice, you’ll see it as /dev/ad1s1a, where the last letter “a” can also be “b”, “d”, “e”, and so on; you know the alphabet. This naming system is somewhat peculiar and I don’t like it, but I can live with it. The letter “a” usually hosts the root partition, and the letter “b” provides swap space. As you can see, there is no letter “c” in that list. This is because “c” denotes the whole slice and therefore should not be used for anything else.

Usually, when you set up gmirror RAID-1 on FreeBSD, you put it on the physical drive directly, i.e. you make /dev/ad0 visible as /dev/mirror/gm0 (after you put the metadata on the drive by running ‘gmirror label’), which also means that all existing slices and partitions become visible at the new location. If you had /dev/ad0s1a, you’ll now have /dev/mirror/gm0s1a. This is very nice and makes gmirror very easy to set up after the system has been installed. In the end, you just add the other drives (/dev/ad1, …) to the mirror and that’s it.

However, if you want to use gvinum on the same drives (to make RAID-5 arrays, for example), you can’t do that. You’ll need to gmirror something else: the slices or the partitions, but not the whole drives. FreeBSD also allows you to have no slices at all and to create the FreeBSD partitions (the letters) directly on the drive (so you’ll have /dev/ad1a). So when I started to think about how to partition the drives and which units to join with gvinum and gmirror, I became a bit confused. I tried a few ideas and none of them really worked, because I didn’t know what commands such as “bsdlabel -w”, “boot0cfg”, “gmirror label” and “gvinum create”, or creating slices via “sysinstall”, actually do. Where do they write their data? At what offsets, and how large is the metadata? I found it quite annoying that there isn’t much documentation about this (at least not well organized), so I had to do some homework. Here is what I discovered:

gvinum: When you run “gvinum create”, it overwrites the bytes from 0x1000 to 0x21200, that is, from block 8 (the first 8 blocks are left untouched) to block 265, with its own configuration data, so you have to be careful not to mess with these blocks.

gmirror: Running “gmirror label” puts gmirror’s metadata in the last block of the device. The size of the mirror in blocks is then the number of blocks on the device minus 1.

bsdlabel: When labelling a slice (or the drive directly), bsdlabel writes the label information to the second block (from address 0x200 on; in my tests it never went past 0x2c0, which still fits into the second block).

boot0cfg: Since it rewrites the MBR with BootMgr, it rewrites the first block (block 0) of the drive.

fdisk: fdisk writes the slice information into the first 16 blocks of the slice. This means that you shouldn’t cover them with bsdlabel (don’t assign them to any of the partitions), or you may run into problems.
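
To double-check the block numbers quoted above, here is a small sketch in shell arithmetic. It assumes 512-byte blocks, as used throughout this post; the helper name to_block and the example device size are my own and purely illustrative.

```sh
#!/bin/sh
# Assumption: 512-byte blocks, as everywhere else in this post.
block_size=512

# Convert a byte offset (decimal or 0x-prefixed hex) to a block number.
to_block() {
    echo $(( $1 / $block_size ))
}

# gvinum configuration area: bytes 0x1000 to 0x21200.
gvinum_first=$(to_block 0x1000)
gvinum_last=$(to_block 0x21200)

# bsdlabel data: seen between 0x200 and 0x2c0 in the tests described above.
label_first=$(to_block 0x200)
label_last=$(to_block 0x2c0)

# gmirror keeps its metadata in the last block, so the mirror is one
# block smaller than its provider (hypothetical provider size).
provider_blocks=1048576
mirror_blocks=$(( $provider_blocks - 1 ))

echo "gvinum config:  blocks $gvinum_first to $gvinum_last"
echo "bsdlabel:       blocks $label_first to $label_last"
echo "gmirror size:   $mirror_blocks of $provider_blocks blocks"
```

Both bsdlabel offsets land in block 1, which matches the observation that the label never leaves the second block.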

To sum up, the only configuration that worked for me on FreeBSD 6.1 (yes, quite an old one) was the following. First I created slices on all of the drives (one on each drive) and wrote the BootMgr onto them. You can do this easily by running sysinstall and then going to Custom and then Partition. Select the first drive (of the three) and then, in the fdisk editor, press a and then w to write the slice. When asked about the MBR, just choose BootMgr and that’s it. This ensures that there is a boot manager on the drive (which means you can boot from it). You have to repeat this procedure for the other two drives as well.

Then, you have to edit the label of all three slices, running “bsdlabel -e /dev/ad1s1” (for the slice on the first drive). You have to provide the following partition set:

#        size    offset    fstype   [fsize bsize bps/cpg]
  a:   1048576        16    4.2BSD        0     0     0
  b:   4194304   1048592      swap
  c: 976768002         0    unused        0     0         # "raw" part, don't edit
  d: 971525106   5242896     vinum

In this configuration you can see that the size of the “a” (root) partition is 1048576 512-byte blocks, which means 512 MB. The offset of 16 blocks for the “a” partition is very important, since the slice needs the first 16 blocks for itself. The “b” (swap) partition is 4 times the size of “a” (2 GB), and “d” takes all the space left on the slice.
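
Since a wrong offset here can silently overlap two partitions, it is worth checking the arithmetic before writing the label. The following is only a sketch using the numbers from the label above; the variable names are mine.

```sh
#!/bin/sh
# Sizes and offsets in 512-byte blocks, copied from the label above.
a_size=1048576;   a_off=16
b_size=4194304;   b_off=1048592
d_size=971525106; d_off=5242896
c_size=976768002  # the whole slice

# Each partition should start exactly where the previous one ends,
# and the last one should end at the end of the slice.
a_end=$(( $a_off + $a_size ))
b_end=$(( $b_off + $b_size ))
d_end=$(( $d_off + $d_size ))

[ "$a_end" -eq "$b_off" ]  && echo "a -> b: contiguous"
[ "$b_end" -eq "$d_off" ]  && echo "b -> d: contiguous"
[ "$d_end" -eq "$c_size" ] && echo "d ends at the end of the slice"

# Convert block counts to megabytes (512 bytes per block).
a_mb=$(( $a_size * 512 / 1024 / 1024 ))
b_mb=$(( $b_size * 512 / 1024 / 1024 ))
echo "a: ${a_mb} MB, b: ${b_mb} MB"
```

If you change the size of “a” or “b” for your own drives, the offsets of the following partitions have to be recalculated the same way.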

So the idea is to make two gmirror arrays. The first one will consist of the three “a” partitions (remember, we have three hard drives) and will be used as the root partition. The second one will consist of the three “b” partitions and will be used as swap space. All the “d” partitions will be used to build the gvinum array.

First, you need to load the geom_mirror module, which enables the kernel to handle gmirror arrays. You do this by running “kldload geom_mirror”. To make this change permanent (so that the module is loaded at boot), you also need to add these two lines to /boot/loader.conf:

geom_mirror_load="YES"
geom_vinum_load="YES"
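
If you script this step, it helps to append each line only when it is not there yet, so that running the script twice does not duplicate entries. A minimal sketch follows; the ensure_line helper and the LOADER_CONF variable are my own names, and the sketch writes to a temporary file unless you point LOADER_CONF at the real /boot/loader.conf.

```sh
#!/bin/sh
# Hypothetical target: defaults to a scratch file for safe experimenting.
# On the real system you would set LOADER_CONF=/boot/loader.conf.
LOADER_CONF=${LOADER_CONF:-$(mktemp)}

# Append key="value" only if the key is not already set in the file.
ensure_line() {
    key=$1
    value=$2
    if ! grep -q "^${key}=" "$LOADER_CONF"; then
        printf '%s="%s"\n' "$key" "$value" >> "$LOADER_CONF"
    fi
}

ensure_line geom_mirror_load YES
ensure_line geom_vinum_load YES
ensure_line geom_mirror_load YES   # repeated call changes nothing

cat "$LOADER_CONF"
```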

The second line also enables loading gvinum at boot, which we will need later (when the system boots from the new arrays). Now it’s time to create the arrays. You’ll run something like:

# gmirror label -v -b round-robin root /dev/ad1s1a
# gmirror label -v -b round-robin swap /dev/ad1s1b
# gmirror insert root /dev/ad2s1a
# gmirror insert root /dev/ad3s1a
# gmirror insert swap /dev/ad2s1b
# gmirror insert swap /dev/ad3s1b
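
The commands above follow a regular pattern: label the mirror on the first drive, then insert the same partition from every other drive. As a sketch, they could be generated like this. The script only prints the commands (a dry run) and does not execute anything; the function name print_mirror_cmds and the drives variable are my own.

```sh
#!/bin/sh
# Drives that will carry the arrays (as assumed in this post).
drives="ad1 ad2 ad3"

# Print the gmirror commands for one mirror.
#   $1 = mirror name, $2 = partition letter on slice 1 of each drive
print_mirror_cmds() {
    name=$1
    part=$2
    first=1
    for d in $drives; do
        if [ "$first" -eq 1 ]; then
            # The first drive creates the mirror.
            echo "gmirror label -v -b round-robin $name /dev/${d}s1${part}"
            first=0
        else
            # Every further drive is added to the existing mirror.
            echo "gmirror insert $name /dev/${d}s1${part}"
        fi
    done
}

print_mirror_cmds root a
print_mirror_cmds swap b
```

The order differs slightly from the listing above (all root commands first, then all swap commands), which should not matter since each mirror is labelled before anything is inserted into it.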

This was for the gmirror arrays. Now make a file named gvinum.conf and put this in it:

drive disk1 device /dev/ad1s1d
drive disk2 device /dev/ad2s1d
drive disk3 device /dev/ad3s1d
 volume var
  plex org raid5 491k
   sd length 1024m drive disk1
   sd length 1024m drive disk2
   sd length 1024m drive disk3
 volume tmp
  plex org raid5 491k
   sd length 512m drive disk1
   sd length 512m drive disk2
   sd length 512m drive disk3
 volume usr
  plex org raid5 491k
   sd length 0 drive disk1
   sd length 0 drive disk2
   sd length 0 drive disk3
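
The subdisk lengths in this file determine how large each volume ends up: a RAID-5 plex gives up the equivalent of one subdisk to parity. Here is a rough sketch of that arithmetic for the two fixed-size volumes (the function name raid5_usable_mb is mine, and gvinum's own bookkeeping overhead is ignored):

```sh
#!/bin/sh
# Usable size of a RAID-5 plex in MB.
#   $1 = size of one subdisk in MB, $2 = number of subdisks
raid5_usable_mb() {
    sd_mb=$1
    n=$2
    # One subdisk's worth of space holds parity.
    echo $(( $sd_mb * ($n - 1) ))
}

var_mb=$(raid5_usable_mb 1024 3)
tmp_mb=$(raid5_usable_mb 512 3)

echo "var: ${var_mb} MB usable"
echo "tmp: ${tmp_mb} MB usable"
```

The usr volume uses “length 0”, which takes whatever is left on each drive, so its size depends on the size of the “d” partitions.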

And then you run:

# gvinum create gvinum.conf

This will create three gvinum RAID-5 volumes for /var, /usr and /tmp. They will be accessible via /dev/gvinum/var, /dev/gvinum/usr and /dev/gvinum/tmp respectively. Note that the usable size of a RAID-5 array is the sum of the sizes of all its members minus the size of one member. This makes our /var 2 GB, /tmp 1 GB and /usr the rest. If you execute “gvinum list” now, you’ll see that all the arrays are marked as up. However, this will not be the case after you reboot. I don’t know exactly why; perhaps this is a bug, and I am not sure whether it is still present in the newest FreeBSD releases. So it is very important that you reboot the system at this point. After it comes back online, you have to run:

# gvinum start var
# gvinum start usr
# gvinum start tmp

Then you have to wait for the arrays to become synchronized. It may take a while. You can always check the status with “gvinum list”. When the arrays are synchronized, you need to create the filesystems on all of them:

# newfs /dev/mirror/root
# newfs -U /dev/gvinum/var
# newfs -U /dev/gvinum/usr
# newfs -U /dev/gvinum/tmp

After that, mount the new arrays under /mnt and copy the system you are currently running onto them:

# mount /dev/mirror/root /mnt
# cd /mnt
# mkdir var tmp usr
# chmod 1777 tmp
# mount /dev/gvinum/var /mnt/var
# mount /dev/gvinum/usr /mnt/usr
# mount /dev/gvinum/tmp /mnt/tmp
# cd / && find . -xdev | cpio -pm /mnt
# cd /var && find . -xdev | cpio -pm /mnt/var
# cd /usr && find . -xdev | cpio -pm /mnt/usr
# cd /tmp && find . -xdev | cpio -pm /mnt/tmp

Finally, you have to modify your fstab file on the root gmirror array. Edit /mnt/etc/fstab as follows:

# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/mirror/swap        none            swap    sw              0       0
/dev/mirror/root        /               ufs     rw              1       1
/dev/gvinum/tmp         /tmp            ufs     rw              2       2
/dev/gvinum/usr         /usr            ufs     rw              2       2
/dev/gvinum/var         /var            ufs     rw              2       2
/dev/acd0               /cdrom          cd9660  ro,noauto       0       0

Now you can try to boot the system from one of the three drives that hold the RAID arrays and, with some luck, it should work. If it doesn’t, you are welcome to post a comment here and we’ll try to sort it out together.
