Date: October 17, 2007
Author: Chris Peiffer
This document describes the steps taken to put a full booting FreeBSD system on a USB flash disk, with the goal of setting up a group of identically-configured machines that can be easily upgraded and maintained. The machines have an internal magnetic disk (SATA RAID in my case, but it could be anything) that contains only application data and some web app code, with all system and third-party application software on the flash.
This document assumes a basic familiarity with the process of setting up a single FreeBSD system and keeping it up to date.
Installing and configuring FreeBSD is laudably simple, thanks to sysinstall, cvsup, the FreeBSD handbook and the many ways to obtain the source. Upgrading a running system's crucial software requires a few more steps: staying on top of portupgrade, cvsuping ports and src and following the steps in /usr/src/Makefile to build world and kernel. But it's all fairly manageable.
The problem is that I've never found a simple way to keep multiple machines upgraded at once.
The official source upgrade process in /usr/src/Makefile mandates running 'make installworld' after rebooting single user. Even if you run 'make world' once and mount a common /usr/obj/, this entails taking every machine down separately and keeping it offline while the installworld completes.
Installing an upgraded port multiple times from the same mounted /usr/ports/ doesn't always work, because so many ports depend on touching parts of the /var/db and /etc/ trees. Even if it did work, the labor of running portupgrade or "make deinstall; make install" on multiple machines is frustrating and ultimately unsustainable.
Keeping src and third-party ports up to date on a set of machines is hard. Common mounted sources are a pain and they usually break.
What we need is a solution where we can do all the portupgrade and make world interaction in one place, and then blast the resulting installed system onto multiple machines safely and cleanly, with minimal downtimes and no risk of losing application data.
The solution described here entails a separate booting storage volume with all system software enclosed within defined identical slices that can be duplicated entirely and then replicated machine-to-machine. Although this is possible using only internal magnetic disks in a conventional system architecture, the rise of cheap, 2-gb-plus, BIOS-bootable, USB-connected flash storage devices makes everything a lot easier, especially initial system setup.
Booting off a USB slice makes the booting system discrete and separate from the permanent data on the magnetic disk. You can have a "classic" partition setup with no intermixing of application data. With two slices set up, one inactive and one active, you can install an upgrade with only the reboot period as downtime, and roll back to the previous version of the entire system seamlessly if there is a problem. All the while, your application data on the magnetic drive is never touched.
Although all this can be done remotely, should something go wrong the in-person administration is extremely simple. Because the USB flash is connected externally, you can literally pull the current system and plug in a new one, which is far easier than anything involving internal magnetic disks.
Additionally, it is much easier to standardize the USB disk than the magnetic disk. It is easier to have a uniform slice image that the flash can boot from, while potentially having slightly different RAID controllers and partition sizes from machine to machine.
Since we're going to be working with systems that have magnetic disks, this document assumes you've got a system with some sort of disk and a working FreeBSD install up and running. You should have a built world and kernel in the /usr/src/ tree on that system. We also assume you've got at least one USB port, a BIOS that recognizes and can boot from USB drives (most modern BIOSes as of 8/2007 can), and several USB flash disks of at least 2 gb. The Cruzer from SanDisk is a popular model, but anything will probably work.
We want a disk with two identical slices, each about 1 GB in size, that we can boot from. Later we'll set these up from a script, but the first one we'll walk through by hand.
First plug in your USB drive. You should see some console message about it being recognized.
# dmesg | tail -5
umass1: LEXAR JD Secure II +, rev 2.00/11.00, addr 3
da2 at umass-sim1 bus 1 target 0 lun 0
da2: Removable Direct Access SCSI-0 device
da2: 40.000MB/s transfers
da2: 3840MB (7864320 512 byte sectors: 255H 63S/T 489C)
Depending on what your disk is recognized as, replace da2 below accordingly. The first thing we need to do is blow away the factory layout of the flash disk and install two identical FreeBSD slices. The easiest way to do this is by using a prototype file for fdisk. Just to review: In FreeBSD, slices are handled with fdisk. A slice usually includes multiple partitions, which are handled with bsdlabel (disklabel). Most of the time you only have one slice. Most of the time you have many partitions, which map to mountpoints (e.g. / or /usr) or are used for swap.
Here is our fdisk.proto file:
# slice type start   length
p 1     165  63      1991997
p 2     165  1992123 1991997
# set slice one active
a 1
The 63-sector offset at the top of the disk reserves space for boot code, and the 63 missing sectors between the slices preserve track alignment so that the two slices can be identically sized. The legacy boot code and the fake geometry (a flash drive has no heads or cylinders in any real physical sense) are a big pain, but in general FreeBSD will complain a little and then work with reasonable settings.
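You can sanity-check those numbers with shell arithmetic (the sector counts come straight from fdisk.proto above; 63 sectors is one fake "track"):

```shell
#!/bin/sh
# where slice 1 ends, where slice 2 begins, and the gap between them
START1=63
LEN=1991997
START2=1992123
END1=$((START1 + LEN))
echo "slice 1 ends at sector ${END1}"          # 1992060
echo "gap before slice 2: $((START2 - END1))"  # 63 sectors, one track
echo "tracks per slice: $((LEN / 63))"         # 31619, with no remainder
```

Both slices start on a track boundary and span a whole number of tracks, which is what lets fdisk lay them out identically.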
# fdisk -f fdisk.proto -i da2
This sets up our slices with one active, but to fully initialize the disk we need to use boot0cfg.
# boot0cfg -o noupdate da2
The noupdate option keeps the MBR from writing itself back after every boot, a minor precaution against flash abuse.
# boot0cfg -B da2
This installs the MBR code on the flash so we can have the standard F1/F2 FreeBSD slice boot menu.
# boot0cfg -s 1 da2
This makes sure we're booting off slice 1 the next time we boot this disk.
Now we need to set up our initial flash with some partitions. We could in theory just have one big partition, but since we're using 1-GB slices that's not a good idea. FreeBSD doesn't like to boot off such a giant (by historical standards) partition, so we're going to divide each slice up into three partitions along traditional lines: /, /var and /usr. /var/run, /var/log, and /tmp will be handled by memory disks on the running system. Swap will be on the magnetic disk.
The file flash_labels.proto:
#  size   offset fstype [fsize bsize bps/cpg]
a: 262144 0      4.2BSD  2048  16384 16392
c: *      0      unused  0     0
d: 163840 *      4.2BSD  2048  16384 10248
e: *      *      4.2BSD  2048  16384 28528
The '*' parameters are convenient wildcards that basically mean "do the right thing" in the 'offset' column and "as much as possible" in the 'size' column. Set up the partitions, newfs them and mount them somewhere. (Here we use /mnt.)
# bsdlabel -R da2s1 flash_labels.proto
# bsdlabel -B da2s1
# newfs -U da2s1a
# newfs -U da2s1d
# newfs -U da2s1e
# mount /dev/da2s1a /mnt
# mkdir /mnt/var
# mount /dev/da2s1d /mnt/var
# mkdir /mnt/usr
# mount /dev/da2s1e /mnt/usr
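For reference, those partition sizes translate to megabytes like this (sector counts from flash_labels.proto and fdisk.proto, 512-byte sectors):

```shell
#!/bin/sh
# sector counts from the prototype files; sectors are 512 bytes
SLICE=1991997
A=262144                 # a: root
D=163840                 # d: /var
E=$((SLICE - A - D))     # e: what the '*' size expands to, for /usr
echo "/    : $((A * 512 / 1048576)) MB"   # 128 MB
echo "/var : $((D * 512 / 1048576)) MB"   # 80 MB
echo "/usr : $((E * 512 / 1048576)) MB"   # 764 MB, before newfs overhead
```

After newfs reserves its space, /usr ends up with roughly the 730 MB of usable room mentioned below.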
Don't worry about slice 2-- we'll deal with that later by duping s1's image. All this laborious stuff is just to get the first valid slice up and running.
Now we need to install system software on this newly partitioned and formatted slice.
# cd /usr/src
# make DESTDIR=/mnt installkernel
# make DESTDIR=/mnt installworld
This doesn't install a full /etc hierarchy, so you'll have to run this too:
# cd /usr/src/etc
# make DESTDIR=/mnt distrib-dirs
# make DESTDIR=/mnt distribution
We need to get a few basic config files onto the slice before we can boot off the USB. First, a skeletal /etc/fstab:
/dev/da0s1a  /         ufs  rw,noatime        1  1
/dev/da0s1d  /var      ufs  rw,noatime        0  0
/dev/da0s1e  /usr      ufs  rw,noatime        0  0
md           /tmp      mfs  rw,-s24M,noatime  0  0
md           /var/run  mfs  rw,-s4M,noatime   0  0
md           /var/log  mfs  rw,-s32M,noatime  0  0
Note that this sets up our memory-disk volumes on three volatile parts of the disk. Although modern flash drives can handle a lot of writes, it's not a good idea to have continual writes hitting them over the life of the system. If you need persistent /var/run or /var/log info, map these somewhere on the magnetic disk. We'll see later on how we devote a partition to app log info.
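As a sketch of what that mapping could look like: once the magnetic disk's data partition is mounted on /usr/local/var as described below, the md line for /var/log could be replaced with a nullfs mount from it (the log subdirectory is illustrative and must exist; the line has to come after the /usr/local/var mount so fstab ordering brings it up first):

```
/usr/local/var/log  /var/log  nullfs  rw  0  0
```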
We also need a very skeletal /etc/rc.conf so the network comes up:
hostname="host.domain.com"
ifconfig_nfe0="DHCP"
Substitute a hostname and the proper interface/network scheme that fits your world.
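If the machine needs a static address instead of DHCP, the equivalent would look something like this (the interface name and addresses are examples only):

```
hostname="host.domain.com"
ifconfig_em0="inet 192.168.1.10 netmask 255.255.255.0"
defaultrouter="192.168.1.1"
```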
At this point we could continue setting up the disk with chroot, but I think it's worthwhile to boot off the flash and make sure things are working. Reboot, and enter your BIOS setup to make sure it is set to boot off the USB HD first.
If the system boots up all right, mount your magnetic disk's /usr partition somewhere, link up to your ports collection and install whatever ports you need.
# mount /dev/da1s1e /mnt
# cd /usr
# ln -s /mnt/ports
# cd /usr/ports/shells/bash; make; make install
# [...]
With 730 MB on the flash's /usr volume, you should be able to fit anything you need. At this point, the flash-based system is running and ready to go.
There's a little bit of a chicken-and-egg problem for the next part-- we either have to walk through setting up a magnetic disk before we finalize the flash image and discuss upgrades, or assume the existence of a few subservient partitions that we conjure up out of thin air.
This document is going to take the former path and deal with initializing and hooking in the magnetic disk, but if you want to do that all yourself you can skip to Duplicating the Flash below.
For running systems within our new flash-booting setup, the magnetic disk is going to be completely reserved for application code, data and logs. We're going to blow away virtually all FreeBSD (or other OS) vestiges. That's the point-- the system boots off flash, so the internal disk is data only.
For that reason, it's advisable to perform these next steps on a different machine than the one you've started working on. We'll call our original machine M1 and the new one M2. Presumably you have at least two identical machines that you want to keep upgraded.
Now that you've booted on M2, whose internal disk contents are worthless, we can begin.
For my application, I'm setting up a big swap partition, one fs partition that uses 10% of remaining space (for web-app code, a few homedirs, etc.) and a giant partition that uses the remainder.
The file internal_labels.proto looks like this:
# a is small and unused
# b is swap
# d is where we put our app code, homedirs, ports,
#   system src, anything we need. 10% is a little generous.
# e is the rest of the disk. It mounts on /usr/local/var and holds
#   all the growable data-- app data and logs.
#
# 8 partitions:
#  size    offset fstype [fsize bsize bps/cpg]
a: 1048576 0      4.2BSD  0     0     0
b: 30G     *      swap
c: *       0      unused  0     0     # "raw" part, don't edit
d: 10%     *      4.2BSD  2048  16384 28528
e: *       *      4.2BSD  2048  16384 28528
The initialization script init_disk.sh looks like this:
#!/bin/sh -x
#
DISK=${DISK:-da1}
#
# Initialize the disk with one all-encompassing FreeBSD slice
#
fdisk -I ${DISK}
#
# Initialize partitions from prototype file
#
bsdlabel -R ${DISK}s1 internal_labels.proto
newfs -U ${DISK}s1d
newfs -U ${DISK}s1e
#
# mount the new vols and do app-specific setup
It respects the environment variable DISK in case your disk is not da1.
After the newfs commands are complete, you can mount the new filesystems where you want and do whatever app-specific setup is needed. Extend the script as necessary.
In my case, the new volumes are mounted on /usr/local/pd2 and /usr/local/var, necessitating the following additions to /etc/fstab:
/dev/da1s1b  none            swap  sw          0  0
/dev/da1s1d  /usr/local/pd2  ufs   rw          0  0
/dev/da1s1e  /usr/local/var  ufs   rw,noatime  0  0
Note that because I'm using SATA RAID for my internal disk, I've assumed da1 as the disk's device name. If you use a plain SATA disk/controller you might use ad1 or ad4.
Now we have a bootable slice on a USB flash with a second slice available, a script to initialize the internal disk on new machines, at least one machine so initialized (M2), and our first machine, which still has a bootable internal disk (M1).
For this first time, we need to boot off that original internal disk on M1.
The reason is that we need to boot off something other than our flash if we're going to duplicate that flash, since we need the entire image to be completely quiescent. If the filesystems are open we'll be duplicating an open filesystem, which will look like a crashed filesystem whenever we reboot.
Once we have flash disks with two working slices on them, we can finalize the new image on s1, then boot off the other slice, then duplicate s1's image.
We switch booting slices with the script swap_boot.sh:
#!/bin/sh -x
#
# Shell script that swaps which slice will boot the flash.
# Takes an argument: 1 or 2. Relies on the DISK variable,
# which can be set but defaults to da0.
DISK=${DISK:-da0}
fdisk -f ./active_slice.$1 ${DISK}
boot0cfg -s $1 ${DISK}
This runs fdisk and boot0cfg to switch the default booting slice. It relies on two dummy fdisk prototype files, active_slice.1 and active_slice.2:
# dummy prototype to set slice 1 to active
a 1
# dummy prototype to set slice 2 to active
a 2
Since all that is required in these files is a line with 'a' and a number, the script should probably be improved to create them on the fly automatically.
After we boot, copy off the image of the non-booting flash slice using dd on the /dev/ special file:
# dd if=/dev/da2s1 of=FLASH_SLICE_1 bs=32768
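dd will happily copy garbage, so it's worth a quick check that the saved image is byte-identical to the slice before relying on it (cmp exits nonzero at the first difference):

```shell
#!/bin/sh
# verify the image against the still-quiescent slice
cmp /dev/da2s1 FLASH_SLICE_1 && echo "image verified"
```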
This flash image can now serve as our source image for all the other machines we need to create, or, in the future, update. The basic command to write this image to a flash is the reverse of the above:
# dd if=FLASH_SLICE_1 of=/dev/da2s2 bs=32768
It is intended to work as part of these two scripts. First, init_flash.sh, for when you've got a completely new flash:
#!/bin/sh -x
#
DISK=${DISK:-da2}
/usr/sbin/boot0cfg -B ${DISK}
/usr/sbin/boot0cfg -o noupdate ${DISK}
/sbin/fdisk -f fdisk.proto -i ${DISK}
/usr/sbin/boot0cfg -B ${DISK}
./update_flash_slice.sh
That performs the basic steps that are explained above and then calls update_flash_slice.sh:
#!/bin/sh
#
if [ -z "${SLICE_SRC}" ] ; then
    echo "must set SLICE_SRC for slice source file"
    exit 1
fi
if [ -z "${SLICE_DST}" ] ; then
    echo "must set SLICE_DST with number of slice to write"
    exit 1
fi
DISK=${DISK:-da2}
dd if=${SLICE_SRC} of=/dev/${DISK}s${SLICE_DST} bs=32768
That enforces the exact definition of the source file and the target slice and performs the dd. Example usage (in bash):
# SLICE_SRC=./FLASH_SLICE_1 SLICE_DST=2 DISK=da2 \
    ./update_flash_slice.sh
Once the duplication is complete, you'll want to mount the new slice briefly and at least change the hostname in /etc/rc.conf and fix /etc/fstab if the target slice was s2, since our initial /etc/fstab assumes s1.
# mount /dev/da2s2a /mnt
Here is an example top of an /etc/fstab file for a slice 2 install:
/dev/da0s2a / ufs rw,noatime 1 1 /dev/da0s2d /var ufs rw,noatime 0 0 /dev/da0s2e /usr ufs rw,noatime 0 0
If this was all that was needed, you could extend the update_flash_slice.sh script to take care of this.
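A sketch of that extension, appended to update_flash_slice.sh after the dd (the sed pattern is illustrative; note that FreeBSD's sed wants -i '' for in-place edits):

```shell
#!/bin/sh
# after the dd, mount the freshly written slice and bump the slice
# number in its fstab (only matters when SLICE_DST is not 1)
DISK=${DISK:-da2}
SLICE_DST=${SLICE_DST:-2}
mount /dev/${DISK}s${SLICE_DST}a /mnt
sed -i '' "s/s1\([ade]\)/s${SLICE_DST}\1/g" /mnt/etc/fstab
umount /mnt
```

The bracket in the pattern keeps the substitution limited to the s1a/s1d/s1e device names, leaving the rest of the file alone.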
Once we make a new flash image on M1, for instance after an upgrade, we can put the tools together to upgrade running systems with the only downtime being the reboot. Assume that M1 is our seed machine, and we've created a flash image using dd as above and it resides in /usr/local/var.
Assume we have another machine, M3, that we want to upgrade. First we scp the image over onto M3's internal disk:
# scp M1:/usr/local/var/FLASH_SRC.NEW /usr/local/var/
Then we copy it onto whatever slice is inactive, make whatever edits are needed, swap the booting information and reboot:
# SLICE_SRC=/usr/local/var/FLASH_SRC.NEW SLICE_DST=2 DISK=da0 \
    ./update_flash_slice.sh
# mount /dev/da0s2a /mnt
# [edit /mnt/etc/fstab and /mnt/etc/rc.conf if needed]
# ./swap_boot.sh 2
# reboot
Although this process may seem complicated, it eliminates repetition of complex jobs over the life of the running systems.
Initializing all the machines happens one time, when they are first brought up. It consists only of booting off the flash and initializing the internal disk-- no software installs.
All "system building" software installs and subsequent upgrades are done once, on one machine. When you have a satisfactory build you dupe that flash slice and copy it all around. Recurring jobs are cut to the bone and keeping many machines up to date and identical is easy.
Contact chris@cabstand.com with questions, suggestions and corrections.
This piece is licensed under a Creative Commons Attribution 3.0 United States License.