Arch Linux with mirrored, encrypted ZFS root
Goals:
- The root filesystem is ZFS.
- The root filesystem is encrypted. The boot partition does not need to be.
- The root filesystem supports data checksums.
- If a drive fails, the system should boot with no manual intervention.
ZFS on Linux 0.8 supports native encryption. This page was written before 0.8 was released, so native encryption was not evaluated as an option. The ArchWiki now has instructions for using native ZFS encryption on the root filesystem. Consider following those instructions instead.
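For reference, creating a natively encrypted pool looks roughly like this (a sketch only, assuming OpenZFS ≥ 0.8; the device paths are placeholders and the rest of this guide does not use this approach):
# Sketch only -- not the method used in the rest of this guide.
zpool create -O encryption=aes-256-gcm -O keyformat=passphrase -O keylocation=prompt -O mountpoint=none zroot mirror /dev/sdX3 /dev/sdY3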
Installation
Prepare Arch Linux ISO with ZFS
For installation and recovery, it's useful to have a USB flash drive with ZFS already installed. From an Arch Linux system:
sudo -i
pacman -S archiso
cp -r /usr/share/archiso/configs/releng archlive
cd archlive
Add the archzfs repository. The signing key used below is the one published by the archzfs project.
echo $'[archzfs]\nServer = http://archzfs.com/$repo/x86_64' >> pacman.conf
cp pacman.conf airootfs/etc/
echo 'systemctl enable pacman-archzfs-init.service' >> airootfs/root/customize_airootfs.sh
mkdir airootfs/etc/systemd/scripts
curl -L -o airootfs/etc/systemd/scripts/archzfs.asc https://archzfs.com/archzfs.gpg
Create airootfs/etc/systemd/system/pacman-archzfs-init.service with the following contents:
[Unit]
Description=Adds archzfs to pacman keyring
Requires=pacman-init.service
After=pacman-init.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/pacman-key --add /etc/systemd/scripts/archzfs.asc
ExecStart=/usr/bin/pacman-key --lsign-key DDF7DB817396A49B2A2723F7403BD972F75D9D76
[Install]
WantedBy=multi-user.target
Add packages:
echo $'linux-headers\narchzfs-linux' >> packages.x86_64
Build and copy to a USB flash drive:
mkarchiso -v -C pacman.conf .
dd bs=4M if=out/archlinux-*.iso of=/dev/sdX status=progress oflag=sync
Replace /dev/sdX with the path to your USB flash drive.
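If you're unsure which device is the flash drive, lsblk can identify it by size, model, and transport before you overwrite anything:
lsblk -o NAME,SIZE,MODEL,TRAN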
Begin installation
Follow the Installation Guide up to (but not including) the partitioning step.
Partitioning & filesystems
In the following, I will assume you are installing on /dev/sda and /dev/sdb.
Partition drives
I use BIOS booting because my motherboard's UEFI firmware is finicky. If you use UEFI, adapt the partition scheme as necessary (e.g., an EFI system partition in place of the BIOS boot partition). GPT supports drives larger than 2 TiB, while MBR does not. Even though it was not strictly necessary for these two drives, I chose GPT over MBR for uniformity, as I have another drive larger than 2 TiB.
Partition /dev/sda, then repeat for /dev/sdb. The sizes of partitions 2 and 3 must match between the two drives.
gdisk /dev/sda
o
n <enter> <enter> +1M ef02
n <enter> <enter> +1G fd00
n <enter> <enter> +464G 8309
w
sfdisk -A /dev/sda 1
The sfdisk command sets the bootable flag on the protective MBR (PMBR). My motherboard will not boot from a drive that doesn't have this bit set.
This results in the following partition scheme:
Number Start (sector) End (sector) Size Code Name
1 2048 4095 1024.0 KiB EF02 BIOS boot partition
2 4096 2101247 1024.0 MiB FD00 Linux RAID
3 2101248 975179775 464.0 GiB 8309 Linux LUKS
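As an optional sanity check after running sfdisk, the boot-indicator byte of the first PMBR partition entry (at offset 446) should now read 0x80:
dd if=/dev/sda bs=1 count=1 skip=446 2>/dev/null | xxd
# 00000000: 80                                       .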
Run lsblk --discard. If the DISC-GRAN and DISC-MAX columns are nonzero, then TRIM (a.k.a. discard) is supported. Take note of whether or not TRIM is supported on /dev/sda and /dev/sdb.
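On drives that support TRIM, the output will look something like this (the granularity and maximum vary by drive; what matters is that they are nonzero):
lsblk --discard /dev/sd[ab]
# NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
# sda         0      512B       2G         0
# sdb         0      512B       2G         0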
Set up dm-crypt partition
cryptsetup luksFormat --type luks2 /dev/sda3
cryptsetup luksFormat --type luks2 /dev/sdb3
cryptsetup open /dev/sda3 cryptroot1 --allow-discards
cryptsetup open /dev/sdb3 cryptroot2 --allow-discards
If your drives do not support TRIM or you are not comfortable with the tradeoffs of enabling TRIM on an encrypted volume, omit the --allow-discards option.
Create the ZFS pool
Create a pool and a dataset for the root filesystem. The dataset can be named anything, but you cannot use the pool's root dataset (zroot) as the root filesystem.1 I chose zroot/ROOT/default to match the naming scheme commonly used for boot environments.
zpool create zroot mirror /dev/mapper/cryptroot[12]
zfs set mountpoint=none acltype=posixacl xattr=sa zroot
zfs create -o mountpoint=none zroot/ROOT
zfs create -o mountpoint=/ zroot/ROOT/default
zpool export zroot
zpool import -d /dev/disk/by-id -R /mnt zroot
When creating the datasets, it will complain that it cannot mount /. This is okay.
OpenZFS uses a "cachefile" to record the devices in the pool. We must copy this file into our new Arch Linux system so that it knows where to look for the pool's devices:2
zpool set cachefile=/etc/zfs/zpool.cache zroot
mkdir -p /mnt/etc/zfs
cp /etc/zfs/zpool.cache /mnt/etc/zfs/
Set up RAID1 boot partition
We will be using GRUB 2 as our bootloader. While GRUB 2 does support booting from LUKS1-encrypted ZFS partitions, the ZFS support is not well maintained.3 Instead, we will use an unencrypted ext4 boot partition mirrored with MD/RAID.
mdadm --create --level=1 --metadata=1.0 --raid-devices=2 /dev/md/boot /dev/sd[ab]2
mkfs.ext4 /dev/md/boot
mkdir /mnt/boot
mount /dev/md/boot /mnt/boot
The 1.0 metadata format places the metadata at the end of the partition to avoid interfering with the bootloader.
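If you'd like to double-check the metadata version, mdadm can examine one of the member partitions directly:
mdadm --examine /dev/sda2 | grep Version
# Version : 1.0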
Install base packages and chroot
Select mirrors as described in the Installation Guide, then install the base packages, bootloader, and the zfs-linux-lts package:
pacstrap -K /mnt base linux-lts linux-firmware vim grub mdadm zfs-linux-lts
I recommend running an LTS kernel with ZFS. Because ZFS depends on internal kernel functions and structures, changes in the kernel can break ZFS. Using an LTS release gives the OpenZFS developers some time to update ZFS after a new kernel release.
Generate an fstab and chroot into the system:
genfstab -U /mnt >> /mnt/etc/fstab
cp /etc/systemd/scripts/archzfs.asc /mnt/root/
arch-chroot /mnt
Add the archzfs mirror in the new system:
echo $'[archzfs]\nServer = http://archzfs.com/$repo/x86_64' >> /etc/pacman.conf
pacman-key --add /root/archzfs.asc
pacman-key --lsign-key DDF7DB817396A49B2A2723F7403BD972F75D9D76
pacman -Sy
Continue with the Installation Guide until you reach the initramfs (mkinitcpio) step.
Initramfs
Set the following hooks in /etc/mkinitcpio.conf:
HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block mdadm_udev encrypt encrypt2 zfs filesystems fsck)
Compared to the default, we have added mdadm_udev encrypt encrypt2 zfs between block and filesystems. Importantly, encrypt must come after keyboard keymap consolefont so that we can input the password.
The udev-based encrypt hook only supports unlocking a single encrypted volume. The systemd-based sd-encrypt hook supports unlocking multiple volumes, but unfortunately it is not compatible with the zfs hook.4 We will instead use a workaround and make a copy of the encrypt hook to unlock the second drive:
cp /usr/lib/initcpio/install/encrypt /etc/initcpio/install/encrypt2
cp /usr/lib/initcpio/hooks/encrypt /etc/initcpio/hooks/encrypt2
sed -i s/cryptdevice/cryptdevice2/ /etc/initcpio/hooks/encrypt2
sed -i s/cryptkey/cryptkey2/ /etc/initcpio/hooks/encrypt2
Now generate a new initramfs:
mkinitcpio -P
Bootloader
Install GRUB to both drives:
grub-install --target=i386-pc /dev/sda
grub-install --target=i386-pc /dev/sdb
Find the UUID of the encrypted partitions:
blkid /dev/sd[ab]3
Configure GRUB to pass those in as kernel parameters in /etc/default/grub:
GRUB_CMDLINE_LINUX="cryptdevice=UUID=11111111-1111-1111-1111-111111111111:cryptroot1:allow-discards cryptdevice2=UUID=22222222-2222-2222-2222-222222222222:cryptroot2:allow-discards"
As before, omit the :allow-discards if your drives do not support TRIM. Now generate your GRUB config. The ZPOOL_VDEV_NAME_PATH environment variable tells ZFS to use full paths (/dev/disk/by-id/dm-uuid-...) rather than short names (dm-uuid-...).
ZPOOL_VDEV_NAME_PATH=1 grub-mkconfig -o /boot/grub/grub.cfg
Save that variable to /etc/environment so that it doesn't need to be provided when running grub-mkconfig in the future:
echo ZPOOL_VDEV_NAME_PATH=1 >> /etc/environment
Confirm root zpool detected
Open /boot/grub/grub.cfg in a text editor and locate the menuentry 'Arch Linux' section. A few lines down there will be a line like linux ... root=ZFS=zroot/ROOT/default. If yours is root=ZFS=/ROOT/default (no zroot), then GRUB failed to detect your root zpool. A workaround is given on the Arch Wiki (bug report):
Open /etc/grub.d/10_linux in a text editor and replace
rpool=`${grub_probe} --device ${GRUB_DEVICE} --target=fs_label 2>/dev/null || true`
with
rpool=`zdb -l ${GRUB_DEVICE} | grep " name:" | cut -d\' -f2`
Run ZPOOL_VDEV_NAME_PATH=1 grub-mkconfig -o /boot/grub/grub.cfg again and confirm that the root was properly detected as root=ZFS=zroot/ROOT/default.
If you run grub-mkconfig again in the future (e.g., after a grub package update), be sure to look over /boot/grub/grub.cfg afterward. You may need to repeat the steps above.
Set root password, unmount, and reboot
passwd
exit # leave chroot
umount /mnt/boot
zpool export zroot
systemctl reboot
First boot
We're not done! When the new system is booted up, regenerate the ZFS cachefile and enable ZFS-related services:
zpool set cachefile=/etc/zfs/zpool.cache zroot
systemctl enable zfs.target zfs-import-cache.service zfs-mount.service zfs-import.target zfs-zed.service
ZFS uses the system's hostid to determine if a pool is already open by a different OS.5 The hostid isn't available in the initramfs, so we'll need to save a copy of it to a file:
zgenhostid "$(hostid)"
Now regenerate your initramfs and reboot:
mkinitcpio -P
systemctl reboot
If the initramfs is unable to open the pool, it may kernel panic. You can always rescan devices and force the pool open by temporarily adding these kernel parameters:
zfs_import_dir=/dev/disk/by-id zfs_force=1
If this happens, you may need to save the ZFS cachefile and regenerate the initramfs:
zpool set cachefile=/etc/zfs/zpool.cache zroot
mkinitcpio -P
Quota (optional)
ZFS performance degrades when the pool is too full. It's helpful to set the quota to about 80% of the pool size to prevent this from happening:
zpool get size zroot
# NAME PROPERTY VALUE SOURCE
# zroot size 460G -
zfs set quota=368G zroot
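You can confirm the new limit afterward:
zfs get quota zroot
# NAME   PROPERTY  VALUE  SOURCE
# zroot  quota     368G   local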
Monitoring
Set up local mail server
pacman -S opensmtpd s-nail
Configure OpenSMTPD for local delivery. Replace the default configuration at /etc/smtpd/smtpd.conf with the following:
table aliases file:/etc/smtpd/aliases
listen on localhost
action "local" mbox alias <aliases>
match for local action "local"
Start mail server:
systemctl start smtpd.service
systemctl enable smtpd.service
Add an alias to forward mail from root to your user, then update the aliases table:
echo 'root: youruser' >> /etc/smtpd/aliases
smtpctl update table aliases
Test mail:
mailx -s 'hello to root' root <<<'example message'
mailx -s 'hello to youruser' youruser <<<'example message'
cat /var/mail/youruser
You should see both of the test messages in /var/mail/youruser. Now install a local mail client (or configure your existing client to receive local mail). Mutt is a reasonable option; it can be installed with pacman -S mutt.
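If you go with Mutt, a minimal ~/.muttrc for reading the local mbox might look like this (the spool path matches the one used above; adjust for your username):
set mbox_type=mbox
set spoolfile=/var/mail/youruser
set folder=/var/mail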
Configure mdadm to mail on events
In /etc/mdadm.conf, add the following option:
MAILADDR youruser
Start monitoring service:
systemctl start mdmonitor.service
systemctl enable mdmonitor.service # May already be enabled.
Test it:
mdadm --monitor --scan --oneshot --test
Configure smartd to mail on errors
Install smartmontools:
pacman -S smartmontools
Replace the existing DEVICESCAN line in /etc/smartd.conf (replace youruser with your username):
DEVICESCAN -a -n standby,6,q -m youruser -M test
This line tells smartd to email youruser if any errors are encountered. It scans once every 30 minutes (the default for smartd; see smartd(8)), skipping drives in standby mode. If a device is skipped 6 times, it will be scanned regardless of whether or not the drive is in standby.
Start the service:
systemctl start smartd.service
You should receive a test email for each drive. You can now remove the -M test from /etc/smartd.conf, then reload and enable the service.
systemctl reload smartd.service
systemctl enable smartd.service
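Independent of smartd's scheduled checks, you can also query a drive's overall SMART health on demand:
smartctl -H /dev/sda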
Configure ZFS to mail on events
In /etc/zfs/zed.d/zed.rc, add an email address:
ZED_EMAIL_ADDR="youruser"
Temporarily enable notification for all events, then restart ZED:
cp /usr/lib/zfs/zed.d/generic-notify.sh /etc/zfs/zed.d/all-notify.sh
echo 'ZED_NOTIFY_INTERVAL_SECS=1' >> /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed.service
Generate an event by taking a snapshot:
zfs snap zroot/ROOT/default@meow
zfs destroy zroot/ROOT/default@meow
Confirm you received an email about the events. Remove the notifications for all events, then restart ZED again. You will still receive notifications for important events (see /etc/zfs/zed.d/*-notify.sh).
rm /etc/zfs/zed.d/all-notify.sh
sed -i '/^ZED_NOTIFY_INTERVAL_SECS=1$/d' /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed.service
Set up periodic ZFS scrub
Create /etc/systemd/system/zfs-scrub@.timer:
[Unit]
Description=Monthly zpool scrub on %i
[Timer]
OnCalendar=*-*-01 03:00:00
AccuracySec=1h
Persistent=true
[Install]
WantedBy=multi-user.target
Create /etc/systemd/system/zfs-scrub@.service:
[Unit]
Description=zpool scrub on %i
[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i
Enable the timer for your root pool:
systemctl enable zfs-scrub@zroot.timer
systemctl start zfs-scrub@zroot.timer
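You can verify that the timer is scheduled (and see when it will next fire) with:
systemctl list-timers 'zfs-scrub@*'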
Set up periodic ZFS trim
Skip this section if you chose not to use TRIM.
Create /etc/systemd/system/zfs-trim@.timer:
[Unit]
Description=Weekly zpool trim on %i
[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true
[Install]
WantedBy=multi-user.target
Create /etc/systemd/system/zfs-trim@.service:
[Unit]
Description=zpool trim on %i
[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool trim %i
Enable the timer for your root pool:
systemctl enable zfs-trim@zroot.timer
systemctl start zfs-trim@zroot.timer
Maintenance
Upgrading your system
The archzfs repo lags behind the Arch core repository by a few days when there are kernel updates. You will sometimes see an error like this:
# pacman -Syu
:: Starting full system upgrade...
resolving dependencies...
looking for conflicting packages...
error: failed to prepare transaction (could not satisfy dependencies)
:: installing linux (5.9.2.arch1-1) breaks dependency 'linux=5.9.1.arch1-1' required by zfs-linux
You have a couple options:
- Don't upgrade right now. Wait for archzfs to catch up with the current Linux version.
- Press on and do a partial upgrade. This is not supported by the Arch Linux team!
If you choose to press on with a partial upgrade, you'll need to skip packages containing kernel modules, as Arch Linux does not guarantee backward compatibility of these modules with older kernels.6
# Packages that own files under /usr/lib/modules (kernel packages and out-of-tree modules such as zfs-linux-lts):
mods=$(pacman -Qo /usr/lib/modules | awk '{print $(NF-1)}')
# Those packages plus any of their dependencies pinned to an exact version (e.g., zfs-utils), joined into a comma-separated ignore list:
hold=$({ echo "$mods"; expac -l'\n' '%D' $mods | grep = | cut -d= -f1; } | sort | uniq | paste -sd,)
pacman -Su --ignore $hold
Before confirming the upgrade, carefully review the list of packages to be upgraded.
Viewing the status
Viewing mdadm status:
mdadm --detail /dev/md/boot
# /dev/md/boot:
# Version : 1.0
# Creation Time : x
# Raid Level : raid1
# Array Size : x
# Used Dev Size : x
# Raid Devices : 2
# Total Devices : 2
# Persistence : Superblock is persistent
#
# Update Time : Sun Apr 14 17:06:03 2019
# State : clean
# Active Devices : 2
# Working Devices : 2
# Failed Devices : 0
# Spare Devices : 0
#
# Name : archiso:boot
# UUID : x
# Events : 122
#
# Number Major Minor RaidDevice State
# 3 8 x 0 active sync /dev/sda2
# 2 8 x 1 active sync /dev/sdb2
Viewing ZFS status:
zpool status
# pool: zroot
# state: ONLINE
# scan: scrub repaired 0B in 0h10m with 0 errors on Sun Apr 14 12:42:19 2019
# config:
#
# NAME STATE READ WRITE CKSUM
# zroot ONLINE 0 0 0
# mirror-0 ONLINE 0 0 0
# /dev/mapper/cryptroot1 ONLINE 0 0 0
# /dev/mapper/cryptroot2 ONLINE 0 0 0
#
# errors: No known data errors
Silent errors
If a drive has silent data corruption, ZFS will detect it and increment its checksum error counter in zpool status. MD/RAID1, however, does not checksum data, so if a drive has corruption, it will blindly copy the bad data to the other drive. Kick the failing drive out of the MD/RAID array before that happens:
mdadm --fail /dev/md/boot $DRIVE # $DRIVE is the failing drive's boot partition, e.g. /dev/sda2
Run fsck on the boot partition:
umount /boot
fsck.ext4 /dev/md/boot
mount -a
If you're concerned that there was silent corruption in the boot partition, reinstall GRUB and regenerate the initramfs in case they got corrupted by MD/RAID copying bad data:
grub-install --target=i386-pc $GOOD_DRIVE
grub-mkconfig -o /boot/grub/grub.cfg
mkinitcpio -P
Now get the list of packages that have files in /boot and reinstall them:
find /boot -type f | xargs pacman -Qo 2>/dev/null
pacman -S $PACKAGES_FROM_PREVIOUS_COMMAND
Hard failures
If a drive becomes unavailable, mdadm and zpool status will both show a "degraded" state.
Replacing a drive
If you haven't already done so, remove the drive from MD/RAID and ZFS:
zpool offline zroot $DRIVE
zpool detach zroot $DRIVE
mdadm --fail /dev/md/boot $DRIVE # if drive still active
mdadm --remove /dev/md/boot $DRIVE
If the drive is not listed in mdadm --detail /dev/md/boot, you may need to replace $DRIVE with failed or detached in the mdadm commands.
Power down your computer, remove the old drive, and add a new one. Power the computer back on. When the system comes back up, mdadm and ZFS will show a degraded state.
Follow the partitioning and dm-crypt steps above on the new drive. I recommend reusing the same name for the dm-crypt partition as the failed drive (cryptroot1 or cryptroot2). Install GRUB:
grub-install --target=i386-pc /dev/sdx
Add the drive to mdadm and ZFS. In this example, I will assume /dev/sdx is the new drive and its dm-crypt partition is mapped to /dev/mapper/cryptrootY.
mdadm --add /dev/md/boot /dev/sdx2
zpool attach zroot $GOOD_DRIVE /dev/mapper/cryptrootY
Finally, update the UUID in /etc/default/grub to the one shown by blkid /dev/sdx3 (the new LUKS partition) and generate a new GRUB config:
grub-mkconfig -o /boot/grub/grub.cfg
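The MD array and the pool will rebuild in the background; you can keep an eye on the resync and resilver progress with:
cat /proc/mdstat
zpool status zroot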
Footnotes
1. GRUB will try to set the root disk as root=ZFS=zroot/, which is not a valid dataset due to the trailing slash.
2. This step isn't required if you choose to let ZFS scan all devices in the initramfs, but this is potentially slow. If you choose to do this, you'll need to add the kernel parameter zfs_import_dir=/dev/disk/by-id.
3. One OpenZFS developer commented that "nobody at upstream GRUB cares much about ZFS support" (dead link). However, the issue was later deleted and OpenZFS 2.1.0 ships with a grub2 compatibility feature set, so perhaps the situation has changed.
4. There is an alternative systemd-based sd-zfs hook that is compatible with the sd-encrypt hook, but I have not tested it.
5. For example, if two systems were accessing the pool over iSCSI.
6. Most packages that install to /usr/lib/modules/*/extramodules do not specify a version constraint in their dependencies. This is because Arch Linux does not support partial upgrades, which is exactly what we're doing here.