equivariant

Arch Linux with mirrored, encrypted ZFS root

Goals:

ZFS on Linux 0.8 supports native encryption. This page was written before 0.8 was released, so native encryption was not evaluated as an option. The ArchWiki now has instructions for using native ZFS encryption on the root filesystem. Consider following those instructions instead.

On the other hand, it may still be worth avoiding ZFS native encryption due to issues with zfs send/recv on natively encrypted datasets.

Installation

Prepare Arch Linux ISO with ZFS

For installation and recovery, it's useful to have a USB flash drive with ZFS already installed. From an Arch Linux system:

sudo -i
pacman -S archiso
cp -r /usr/share/archiso/configs/releng archlive
cd archlive

Add the archzfs repository. The key is the one listed here and here.

echo $'[archzfs]\nServer = http://archzfs.com/$repo/x86_64' >> pacman.conf
cp pacman.conf airootfs/etc/
echo 'systemctl enable pacman-archzfs-init.service' >> airootfs/root/customize_airootfs.sh
mkdir airootfs/etc/systemd/scripts
curl -L -o airootfs/etc/systemd/scripts/archzfs.asc https://archzfs.com/archzfs.gpg

Create airootfs/etc/systemd/system/pacman-archzfs-init.service with the following contents:

[Unit]
Description=Adds archzfs to pacman keyring
Requires=pacman-init.service
After=pacman-init.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/pacman-key --add /etc/systemd/scripts/archzfs.asc
ExecStart=/usr/bin/pacman-key --lsign-key DDF7DB817396A49B2A2723F7403BD972F75D9D76

[Install]
WantedBy=multi-user.target

Add packages:

echo $'linux-headers\narchzfs-linux' >> packages.x86_64

Build and copy to a USB flash drive:

mkarchiso -v -C pacman.conf .
dd bs=4M if=out/archlinux-*.iso of=/dev/sdX status=progress oflag=sync

Replace /dev/sdX with the path to your USB flash drive.

Begin installation

Follow the Installation Guide up to (but not including) the partitioning step.

Partitioning & filesystems

In the following, I will assume you are installing on /dev/sda and /dev/sdb.

Partition drives

I use BIOS boot because my motherboard's UEFI firmware is finicky. If you boot with UEFI instead, adapt the partition scheme as necessary. GPT supports drives larger than 2 TiB, while MBR does not. Even though it was not strictly necessary for these two drives, I chose GPT over MBR for uniformity, as I have another drive >2 TiB.

Partition /dev/sda. Repeat for /dev/sdb. The sizes of partitions 2 and 3 must match between the two drives.

gdisk /dev/sda
o
n <enter> <enter> +1M ef02
n <enter> <enter> +1G fd00
n <enter> <enter> +464G 8309
w
sfdisk -A /dev/sda 1

The sfdisk command sets the bootable flag on the protective MBR (PMBR). My motherboard will not boot to a drive that doesn't have this bit set.

This results in the following partition scheme:

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  BIOS boot partition
   2            4096         2101247   1024.0 MiB  FD00  Linux RAID
   3         2101248       975179775   464.0 GiB   8309  Linux LUKS

Run lsblk --discard. If the DISC-GRAN and DISC-MAX columns are nonzero, then TRIM (a.k.a. discard) is supported. Take note of whether or not TRIM is supported on /dev/sda and /dev/sdb.
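The same check can be scripted. The following is a minimal sketch, not part of the original procedure: trim_report is a hypothetical helper name, and it assumes lsblk is asked for byte values so that the columns are plain numbers.

```shell
# trim_report: reads "NAME DISC-GRAN DISC-MAX" rows (sizes in bytes) on stdin
# and prints one verdict per device. Hypothetical helper, for illustration only.
trim_report() {
    awk '{
        # TRIM is considered supported when both columns are nonzero.
        if ($2 + 0 > 0 && $3 + 0 > 0)
            print $1 ": TRIM supported"
        else
            print $1 ": TRIM not supported"
    }'
}

# Usage (whole disks only, no header, sizes in bytes):
#   lsblk --nodeps --noheadings --bytes --output NAME,DISC-GRAN,DISC-MAX \
#       /dev/sda /dev/sdb | trim_report
```

Reading the rows from stdin keeps the helper independent of which drives you pass to lsblk.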

Set up dm-crypt partition

cryptsetup luksFormat --type luks2 /dev/sda3
cryptsetup luksFormat --type luks2 /dev/sdb3
cryptsetup open /dev/sda3 cryptroot1 --allow-discards
cryptsetup open /dev/sdb3 cryptroot2 --allow-discards

If your drives do not support TRIM or you are not comfortable with the tradeoffs of enabling TRIM on an encrypted volume, omit the --allow-discards option.

Create the ZFS pool

Create a pool and a dataset for the root filesystem. The dataset can be named anything, but you cannot use the pool's root dataset (zroot) as the root filesystem.1 I chose zroot/ROOT/default to match the naming scheme commonly used for boot environments.

zpool create zroot mirror /dev/mapper/cryptroot[12]
zfs set mountpoint=none acltype=posixacl xattr=sa zroot
zfs create -o mountpoint=none zroot/ROOT
zfs create -o mountpoint=/ zroot/ROOT/default
zpool export zroot
zpool import -d /dev/disk/by-id -R /mnt zroot

When creating the datasets, it will complain that it cannot mount /. This is okay.

OpenZFS uses a "cachefile" to record the devices in the pool. We must copy this file into our new Arch Linux system so that it knows where to look for the pool's devices:2

zpool set cachefile=/etc/zfs/zpool.cache zroot
mkdir -p /mnt/etc/zfs
cp /etc/zfs/zpool.cache /mnt/etc/zfs/

Set up RAID1 boot partition

We will be using GRUB 2 as our bootloader. While GRUB 2 does support booting from LUKS1-encrypted ZFS partitions, the ZFS support is not well maintained.3 Instead, we will use an unencrypted ext4 boot partition mirrored with MD/RAID.

mdadm --create --level=1 --metadata=1.0 --raid-devices=2 /dev/md/boot /dev/sd[ab]2
mkfs.ext4 /dev/md/boot
mkdir /mnt/boot
mount /dev/md/boot /mnt/boot

The 1.0 metadata format places the metadata at the end of the partition to avoid interfering with the bootloader.

Install base packages and chroot

Select mirrors as described in the Installation Guide, then install the base packages, bootloader, and the zfs-linux-lts package:

pacstrap -K /mnt base linux-lts linux-firmware vim grub mdadm zfs-linux-lts

I recommend running an LTS kernel with ZFS. Because ZFS depends on internal kernel functions and structures, changes in the kernel can break ZFS. Using an LTS release gives the OpenZFS developers some time to update ZFS after a new kernel release.

Generate a fstab and chroot into the system:

genfstab -U /mnt >> /mnt/etc/fstab
cp /etc/systemd/scripts/archzfs.asc /mnt/root/
arch-chroot /mnt

Add the archzfs mirror in the new system:

echo $'[archzfs]\nServer = http://archzfs.com/$repo/x86_64' >> /etc/pacman.conf
pacman-key --add /root/archzfs.asc
pacman-key --lsign-key DDF7DB817396A49B2A2723F7403BD972F75D9D76
pacman -Sy

Continue with the Installation Guide until you reach the initramfs (mkinitcpio) step.

Initramfs

Set the following hooks in /etc/mkinitcpio.conf:

HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block mdadm_udev encrypt encrypt2 zfs filesystems fsck)

Compared to the default, we have added mdadm_udev encrypt encrypt2 zfs between block and filesystems. Importantly, encrypt must come after keyboard keymap consolefont so that we can input the password.
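It is easy to get the hook order wrong when editing by hand. Below is a rough sketch of a sanity check; check_hooks_order is a hypothetical helper, and the ordering pairs it tests are my reading of the HOOKS line above (input hooks before encrypt, both encrypt hooks before zfs, zfs before filesystems), not an official mkinitcpio rule set.

```shell
# check_hooks_order FILE: verify the relative order of hooks in a
# mkinitcpio.conf-style file. Hypothetical helper; the pairs below are
# assumptions derived from the HOOKS line in this guide.
check_hooks_order() (
    # Use the last uncommented HOOKS=(...) line and strip the wrapper.
    hooks=$(sed -n 's/^HOOKS=(\(.*\))[[:space:]]*$/\1/p' "$1" | tail -n 1)

    # pos NAME: print the 1-based position of NAME in $hooks, or fail.
    pos() {
        i=0
        for h in $hooks; do
            i=$((i + 1))
            if [ "$h" = "$1" ]; then echo "$i"; return 0; fi
        done
        return 1
    }

    # before A B: succeed only if both hooks exist and A precedes B.
    before() {
        a=$(pos "$1") && b=$(pos "$2") && [ "$a" -lt "$b" ]
    }

    status=0
    for pair in keyboard:encrypt keymap:encrypt consolefont:encrypt \
                encrypt:zfs encrypt2:zfs zfs:filesystems; do
        if ! before "${pair%%:*}" "${pair#*:}"; then
            echo "problem: '${pair%%:*}' must be present and precede '${pair#*:}'"
            status=1
        fi
    done
    exit "$status"
)

# Usage:
#   check_hooks_order /etc/mkinitcpio.conf && echo "hook order looks fine"
```

The function body runs in a subshell, so its variables and helper functions do not leak into your session.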

The udev-based encrypt hook only supports unlocking a single encrypted volume. The systemd-based sd-encrypt hook supports unlocking multiple volumes, but unfortunately it is not compatible with the zfs hook.4 We will instead use a workaround and make a copy of the encrypt hook to unlock the second drive:

cp /usr/lib/initcpio/install/encrypt /etc/initcpio/install/encrypt2
cp /usr/lib/initcpio/hooks/encrypt /etc/initcpio/hooks/encrypt2
sed -i s/cryptdevice/cryptdevice2/ /etc/initcpio/hooks/encrypt2
sed -i s/cryptkey/cryptkey2/ /etc/initcpio/hooks/encrypt2

Now generate a new initramfs:

mkinitcpio -P

Bootloader

Install GRUB to both drives:

grub-install --target=i386-pc /dev/sda
grub-install --target=i386-pc /dev/sdb

Find the UUID of the encrypted partitions:

blkid /dev/sd[ab]3

Configure GRUB to pass those in as kernel parameters in /etc/default/grub:

GRUB_CMDLINE_LINUX="cryptdevice=UUID=11111111-1111-1111-1111-111111111111:cryptroot1:allow-discards cryptdevice2=UUID=22222222-2222-2222-2222-222222222222:cryptroot2:allow-discards"
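Copying two UUIDs by hand is error-prone, so here is a small sketch that assembles the parameter string. make_cmdline is a hypothetical helper; it assumes the mapping names cryptroot1 and cryptroot2 used throughout this page and still leaves pasting the result into /etc/default/grub to you.

```shell
# make_cmdline UUID1 UUID2 [yes|no]: print the kernel parameters for both
# encrypted partitions. The third argument controls :allow-discards and
# defaults to "yes". Hypothetical helper, for illustration only.
make_cmdline() (
    suffix=""
    if [ "${3:-yes}" = "yes" ]; then
        suffix=":allow-discards"
    fi
    printf 'cryptdevice=UUID=%s:cryptroot1%s cryptdevice2=UUID=%s:cryptroot2%s\n' \
        "$1" "$suffix" "$2" "$suffix"
)

# Usage:
#   make_cmdline "$(blkid -s UUID -o value /dev/sda3)" \
#                "$(blkid -s UUID -o value /dev/sdb3)"
#   make_cmdline "$uuid1" "$uuid2" no    # without TRIM
```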

As before, omit the :allow-discards if your drives do not support TRIM. Now generate your GRUB config. The ZPOOL_VDEV_NAME_PATH variable tells ZFS to use full paths (/dev/disk/by-id/dm-uuid-...) rather than short names (dm-uuid-...).

ZPOOL_VDEV_NAME_PATH=1 grub-mkconfig -o /boot/grub/grub.cfg

Save that variable to /etc/environment so that it doesn't need to be provided when running grub-mkconfig in the future:

echo ZPOOL_VDEV_NAME_PATH=1 >> /etc/environment

Confirm root zpool detected

Open /boot/grub/grub.cfg in a text editor and locate the menuentry 'Arch Linux' section. A few lines down there will be a line like linux ... root=ZFS=zroot/ROOT/default. If yours is root=ZFS=/ROOT/default (no zroot), then GRUB failed to detect your root zpool. A workaround is given on the Arch Wiki (bug report):

Open /etc/grub.d/10_linux in a text editor and replace

rpool=`${grub_probe} --device ${GRUB_DEVICE} --target=fs_label 2>/dev/null || true`

with

rpool=`zdb -l ${GRUB_DEVICE} | grep " name:" | cut -d\' -f2`

Run ZPOOL_VDEV_NAME_PATH=1 grub-mkconfig -o /boot/grub/grub.cfg again and confirm that the root was properly detected as root=ZFS=zroot/ROOT/default.
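Since this check has to be repeated after every grub-mkconfig run, it may be worth scripting. The sketch below is an assumption-laden convenience, not part of GRUB: check_grub_root is a hypothetical helper that only looks for root=ZFS= parameters in the generated file.

```shell
# check_grub_root CFG DATASET: succeed only if every root=ZFS= parameter in
# CFG names DATASET exactly. Hypothetical helper, for illustration only.
check_grub_root() (
    cfg=$1
    want="root=ZFS=$2"
    # Collect the distinct root=ZFS=... values found in the config.
    found=$(grep -o 'root=ZFS=[^[:space:]]*' "$cfg" | sort -u)
    if [ -z "$found" ]; then
        echo "no root=ZFS= parameter found in $cfg"
        exit 1
    fi
    # Anything that is not an exact match is suspicious.
    bad=$(printf '%s\n' "$found" | grep -Fvx "$want")
    if [ -n "$bad" ]; then
        echo "unexpected root parameter(s): $bad"
        exit 1
    fi
    echo "ok: $want"
)

# Usage:
#   check_grub_root /boot/grub/grub.cfg zroot/ROOT/default
```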

If you run grub-mkconfig again in the future (e.g., after a grub package update), be sure to look over /boot/grub/grub.cfg afterward. You may need to repeat the steps above.

Set root password, unmount, and reboot

passwd
exit  # leave chroot
umount /mnt/boot
zpool export zroot
systemctl reboot

First boot

We're not done! When the new system is booted up, regenerate the ZFS cachefile and enable ZFS-related services:

zpool set cachefile=/etc/zfs/zpool.cache zroot
systemctl enable zfs.target zfs-import-cache.service zfs-mount.service zfs-import.target zfs-zed.service

ZFS uses the system's hostid to determine if a pool is already open by a different OS.5 The hostid isn't available in the initramfs, so we'll need to save a copy of it to a file:

zgenhostid "$(hostid)"

Now regenerate your initramfs and reboot:

mkinitcpio -P
systemctl reboot

If the initramfs is unable to open the pool, it may kernel panic. You can always rescan devices and force the pool open by temporarily adding these kernel parameters:

zfs_import_dir=/dev/disk/by-id zfs_force=1

If this happens, you may need to save the ZFS cachefile and regenerate the initramfs:

zpool set cachefile=/etc/zfs/zpool.cache zroot
mkinitcpio -P

Quota (optional)

ZFS performance degrades when the pool is too full. It's helpful to set the quota to about 80% of the pool size to prevent this from happening:

zpool get size zroot
# NAME   PROPERTY  VALUE  SOURCE
# zroot  size      460G   -

zfs set quota=368G zroot
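The arithmetic above (460G × 0.8 = 368G) can also be done in the shell. This is just a sketch: quota_for_pool is a hypothetical helper, and it assumes you feed it the exact byte count that zpool get prints with the -p flag.

```shell
# quota_for_pool SIZE_BYTES [PERCENT]: print PERCENT (default 80) of
# SIZE_BYTES, rounded down. Hypothetical helper, for illustration only.
quota_for_pool() (
    size=$1
    pct=${2:-80}
    echo $(( size * pct / 100 ))
)

# Usage (-H drops the header, -p prints exact bytes):
#   zfs set quota="$(quota_for_pool "$(zpool get -Hp -o value size zroot)")" zroot
```

For a 460 GiB pool (493921239040 bytes) this yields 395136991232 bytes, which is the same 368 GiB used above.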

Monitoring

Set up local mail server

pacman -S opensmtpd s-nail

Configure OpenSMTPD for local delivery. Replace the default configuration at /etc/smtpd/smtpd.conf with the following:

table aliases file:/etc/smtpd/aliases
listen on localhost
action "local" mbox alias <aliases>
match for local action "local"

Start mail server:

systemctl start smtpd.service
systemctl enable smtpd.service

Add alias to forward mail from root to your user, then regenerate aliases:

echo 'root: youruser' >> /etc/smtpd/aliases
smtpctl update table aliases

Test mail:

mailx -s 'hello to root' root <<<'example message'
mailx -s 'hello to youruser' youruser <<<'example message'
cat /var/mail/youruser

You should see both of the test messages in /var/mail/youruser. Now install a local mail client (or configure your existing client to receive local mail). Mutt is a reasonable option; it can be installed with pacman -S mutt.

Configure mdadm to mail on events

In /etc/mdadm.conf, add the following option:

MAILADDR youruser

Start monitoring service:

systemctl start mdmonitor.service
systemctl enable mdmonitor.service  # May already be enabled.

Test it:

mdadm --monitor --scan --oneshot --test

Configure smartd to mail on errors

Install smartmontools:

pacman -S smartmontools

Replace the existing DEVICESCAN line in /etc/smartd.conf (replace youruser with your username):

DEVICESCAN -a -n standby,6,q -m youruser -M test

This line tells smartd to email youruser if any errors are encountered. It scans once every 30 minutes (the default for smartd, see smartd(8)), skipping drives in standby mode. If a device is skipped 6 times in a row, it will be scanned regardless of whether or not the drive is in standby.

Start the service:

systemctl start smartd.service

You should receive a test email for each drive. You can now remove the -M test from /etc/smartd.conf, then reload and enable the service.

systemctl reload smartd.service
systemctl enable smartd.service

Configure ZFS to mail on events

In /etc/zfs/zed.d/zed.rc, add an email address:

ZED_EMAIL_ADDR="youruser"

Temporarily enable notification for all events, then restart ZED:

cp /usr/lib/zfs/zed.d/generic-notify.sh /etc/zfs/zed.d/all-notify.sh
echo 'ZED_NOTIFY_INTERVAL_SECS=1' >> /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed.service

Generate an event by taking a snapshot:

zfs snap zroot/ROOT/default@meow
zfs destroy zroot/ROOT/default@meow

Confirm you received an email about the events. Remove the notifications for all events, then restart ZED again. You will still receive notifications for important events (see /etc/zfs/zed.d/*-notify.sh).

rm /etc/zfs/zed.d/all-notify.sh
sed -i '/^ZED_NOTIFY_INTERVAL_SECS=1$/d' /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed.service

Set up periodic ZFS scrub

Create /etc/systemd/system/zfs-scrub@.timer:

[Unit]
Description=Monthly zpool scrub on %i

[Timer]
OnCalendar=*-*-01 03:00:00
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

Create /etc/systemd/system/zfs-scrub@.service:

[Unit]
Description=zpool scrub on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i

Enable the timer for your root pool:

systemctl enable zfs-scrub@zroot.timer
systemctl start zfs-scrub@zroot.timer

Set up periodic ZFS trim

Skip this section if you chose not to use TRIM.

Create /etc/systemd/system/zfs-trim@.timer:

[Unit]
Description=Weekly zpool trim on %i

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

Create /etc/systemd/system/zfs-trim@.service:

[Unit]
Description=zpool trim on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool trim %i

Enable the timer for your root pool:

systemctl enable zfs-trim@zroot.timer
systemctl start zfs-trim@zroot.timer

Maintenance

Upgrading your system

The archzfs repo lags behind the Arch core repository by a few days when there are kernel updates. You will sometimes see an error like this:

# pacman -Syu
:: Starting full system upgrade...
resolving dependencies...
looking for conflicting packages...
error: failed to prepare transaction (could not satisfy dependencies)
:: installing linux (5.9.2.arch1-1) breaks dependency 'linux=5.9.1.arch1-1' required by zfs-linux

You have a couple of options:

  1. Don't upgrade right now. Wait for archzfs to catch up with the current Linux version.
  2. Press on and do a partial upgrade. This is not supported by the Arch Linux team!

If you choose to press on with a partial upgrade, you'll need to skip packages containing kernel modules, as Arch Linux does not guarantee backward compatibility of these modules with older kernels.6

mods=$(pacman -Qo /usr/lib/modules | awk '{print $(NF-1)}')
hold=$({ echo "$mods"; expac -l'\n' '%D' $mods | grep = | cut -d= -f1; } | sort | uniq | paste -sd,)
pacman -Su --ignore $hold

Before confirming the upgrade, carefully review the list of packages to be upgraded.

Viewing the status

Viewing mdadm status:

mdadm --detail /dev/md/boot

# /dev/md/boot:
#         Version : 1.0
#   Creation Time : x
#      Raid Level : raid1
#      Array Size : x
#   Used Dev Size : x
#    Raid Devices : 2
#   Total Devices : 2
#     Persistence : Superblock is persistent
# 
#     Update Time : Sun Apr 14 17:06:03 2019
#           State : clean 
#  Active Devices : 2
# Working Devices : 2
#  Failed Devices : 0
#   Spare Devices : 0
# 
#            Name : archiso:boot
#            UUID : x
#          Events : 122
# 
#     Number   Major   Minor   RaidDevice State
#        3       8        x        0      active sync   /dev/sda2
#        2       8        x        1      active sync   /dev/sdb2

Viewing ZFS status:

zpool status

#  pool: zroot
# state: ONLINE
#  scan: scrub repaired 0B in 0h10m with 0 errors on Sun Apr 14 12:42:19 2019
#config:
#
#   NAME                        STATE     READ WRITE CKSUM
#   zroot                       ONLINE       0     0     0
#     mirror-0                  ONLINE       0     0     0
#       /dev/mapper/cryptroot1  ONLINE       0     0     0
#       /dev/mapper/cryptroot2  ONLINE       0     0     0
#
#errors: No known data errors
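If you would rather not eyeball the counters, the device table can be filtered with awk. A minimal sketch, assuming the usual zpool status layout shown above (without the leading # characters); zpool_error_devices is a hypothetical helper name.

```shell
# zpool_error_devices: read `zpool status` output on stdin and print every
# row of the device table whose READ, WRITE, or CKSUM counter is not "0".
# Hypothetical helper, for illustration only.
zpool_error_devices() {
    awk '
        # The column header marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line ends the table.
        NF == 0 { in_table = 0 }
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage:
#   zpool status zroot | zpool_error_devices
```

The counters are compared as strings because zpool status abbreviates large values (for example 1.2K).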

Silent errors

If a drive has silent data corruption, ZFS will detect it and increment its checksum error counter in zpool status. We should kick the drive out of the MD/RAID configuration before it causes problems. RAID1 does not perform checksums, so if a drive has corruption, MD/RAID will blindly copy it to the other drive.

mdadm --fail /dev/md/boot $DRIVE

Run fsck on the boot partition:

umount /boot
fsck.ext4 /dev/md/boot
mount -a

If you're concerned that there was silent corruption in the boot partition, reinstall GRUB and regenerate the initramfs in case they got corrupted by MD/RAID copying bad data:

grub-install --target=i386-pc $GOOD_DRIVE
grub-mkconfig -o /boot/grub/grub.cfg
mkinitcpio -P

Now get the list of packages that have files in /boot and reinstall them:

find /boot -type f | xargs pacman -Qo 2>/dev/null
pacman -S $PACKAGES_FROM_PREVIOUS_COMMAND

Hard failures

If a drive becomes unavailable, mdadm and zpool status will both show a "degraded" state.
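For the MD/RAID side, a degraded array also shows up in /proc/mdstat as an underscore in the member map (for example [U_] instead of [UU]). The sketch below relies on that format; md_degraded_arrays is a hypothetical helper name.

```shell
# md_degraded_arrays: read /proc/mdstat-style text on stdin and print the
# name of every array whose member map contains a missing device ("_").
# Hypothetical helper, for illustration only.
md_degraded_arrays() {
    awk '
        # Array header line, e.g. "md127 : active raid1 sdb2[1] sda2[0]".
        $2 == ":" && $1 ~ /^md/ { name = $1 }
        # Member map such as [U_] or [_U]; a healthy mirror shows [UU].
        /\[[U_]*_[U_]*\]/      { print name }
    '
}

# Usage:
#   md_degraded_arrays < /proc/mdstat
```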

Replacing a drive

If you haven't already done so, remove the drive from MD/RAID and ZFS:

zpool offline zroot $DRIVE
zpool detach zroot $DRIVE
mdadm --fail /dev/md/boot $DRIVE  # if drive still active
mdadm --remove /dev/md/boot $DRIVE

If the drive is not listed in mdadm --detail /dev/md/boot, you may need to replace $DRIVE with failed or detached in the mdadm commands.

Power down your computer, remove the old drive, and add a new one. Power the computer back on. When the system comes back up, mdadm and ZFS will show a degraded state.

Follow the partitioning and dm-crypt steps above on the new drive. I recommend reusing the same name for the dm-crypt partition as the failed drive (cryptroot1 or cryptroot2). Install GRUB:

grub-install --target=i386-pc /dev/sdx

Add the drive to mdadm and ZFS. In this example, I will assume /dev/sdx is the new drive and its dm-crypt partition is mapped to /dev/mapper/cryptrootY.

mdadm --add /dev/md/boot /dev/sdx2
zpool attach zroot $GOOD_DRIVE /dev/mapper/cryptrootY

Finally, update the UUID in /etc/default/grub to the one shown by blkid /dev/sdx3 (the new LUKS partition) and generate a new GRUB config:

grub-mkconfig -o /boot/grub/grub.cfg

Interrupted kernel upgrade

If a kernel upgrade is interrupted, you may end up with an unbootable system with the following error:

Loading Linux linux-lts ...
error: file `/vmlinuz-linux-lts' not found.
Loading initial ramdisk ...
error: you need to load the kernel first.

This isn't unique to this setup, but the recovery is a little more complicated. Boot into the USB flash drive created earlier. Chroot into your system and reinstall the relevant packages:

cryptsetup open /dev/sda3 cryptroot1 --allow-discards
cryptsetup open /dev/sdb3 cryptroot2 --allow-discards
zpool import -d /dev/disk/by-id -R /mnt zroot
mount /dev/md/boot /mnt/boot
arch-chroot /mnt
pacman -S mkinitcpio systemd linux-lts
exit
umount /mnt/boot
zpool export zroot
systemctl reboot

After rebooting, reinstall all packages for good measure, then reboot again:

pacman -Qqn | sudo pacman -S -
systemctl reboot

Pacman preserves the installation reason when reinstalling packages, so doing this won't mess up pacman -Qe/pacman -Qd.


  1. GRUB will try to set the root disk as root=ZFS=zroot/, which is not a valid dataset due to the trailing slash. 

  2. This step isn't required if you choose to let ZFS scan all devices in the initramfs, but scanning is potentially slow. If you choose to do this, you'll need to add the zfs_import_dir=/dev/disk/by-id kernel parameter.

  3. One OpenZFS developer commented that "nobody at upstream GRUB cares much about ZFS support" (dead link). However, the issue was later deleted and OpenZFS 2.1.0 ships with a grub2 compatibility feature set, so perhaps the situation has changed. 

  4. There is an alternative systemd-based sd-zfs hook that is compatible with the sd-encrypt hook, but I have not tested it. 

  5. For example, if two systems were accessing the pool over iSCSI. 

  6. Most packages that install to /usr/lib/modules/*/extramodules do not specify a version constraint in their dependencies. This is because Arch Linux does not support partial upgrades, which is exactly what we're doing here.