
Zpool size doesn't match ZFS root dataset size

I noticed that the total size of a zpool doesn't match the used+avail for the root dataset:

$ zpool status zroot
...
NAME                                    STATE     READ WRITE CKSUM
zroot                                   ONLINE       0     0     0
  mirror-0                              ONLINE       0     0     0
    /dev/disk/by-id/dm-name-cryptroot1  ONLINE       0     0     0
    /dev/disk/by-id/dm-name-cryptroot2  ONLINE       0     0     0
...

$ zpool list zroot
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot   460G   192G   268G        -         -    24%    41%  1.00x    ONLINE  -

$ zfs list zroot
NAME    USED  AVAIL     REFER  MOUNTPOINT
zroot   192G   254G       24K  none

Notice that the zpool reports a size of 460 GiB. This size is consistent with the alloc and free properties for the zpool: $\text{alloc} + \text{free} = 192\text{ GiB} + 268\text{ GiB} = 460\text{ GiB} = \text{size}$.

However, the sum of the used and avail values for the root dataset is lower: $\text{used} + \text{avail} = 192\text{ GiB} + 254\text{ GiB} = 446\text{ GiB} < 460\text{ GiB}$.

It's about 3% lower: $\frac{460\text{ GiB} - 446\text{ GiB}}{460\text{ GiB}} = 3.04\%$.
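Rounding in the outputs above is too small to explain that gap; if in doubt, both commands can print exact byte counts (the -p flag requests parseable, unrounded values and should work on any recent OpenZFS):

$ zpool get -p size,allocated,free zroot
$ zfs get -p used,available zroot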

This roughly 3% gap comes from the spa_slop_shift ZFS module parameter (see zfs(4)). By default it is set to 5, meaning $1/2^5 = 3.125\%$ of the space is set aside.
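On Linux, the active value can usually be read from the ZFS module parameters (the usual sysfs path is assumed here):

$ cat /sys/module/zfs/parameters/spa_slop_shift
5

That reservation accounts for the gap above: $460\text{ GiB} / 2^5 = 14.4\text{ GiB}$, which is roughly the 14 GiB difference between the zpool size and the dataset's used+avail.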

Raidz

For raidz, the zpool size is reported differently. Rather than showing the size of a single disk (which is what the zpool size equals for a mirror), zpool list shows the raw sum of all the disks, parity included. For example:

$ zpool status slow
...
NAME                                              STATE     READ WRITE CKSUM
slow                                              ONLINE       0     0     0
  raidz1-0                                        ONLINE       0     0     0
    /dev/disk/by-id/wwn-0x5000000000000001-part1  ONLINE       0     0     0
    /dev/disk/by-id/wwn-0x5000000000000002-part1  ONLINE       0     0     0
    /dev/disk/by-id/wwn-0x5000000000000003-part1  ONLINE       0     0     0
...

$ zpool list  slow
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
slow   10.9T  4.49T  6.42T        -         -     1%    41%  1.00x    ONLINE  -

Each of the three disks is 4 TB (not TiB), so the zpool size is reported as $3 \cdot 4\text{ TB} \cdot 1000^4\,\frac{\text{byte}}{\text{TB}} \,/\, 1024^4\,\frac{\text{byte}}{\text{TiB}} = 10.9\text{ TiB}$.

For raidz, you must also take into account the space reserved for parity data. For a raidzP group (P = 1, 2, 3) of N disks, about $(N - P) \cdot X$ of the space is usable, where X is the size of each disk: $(N - P) \cdot X = (3 - 1) \cdot 4\text{ TB} \cdot 1000^4\,\frac{\text{byte}}{\text{TB}} \,/\, 1024^4\,\frac{\text{byte}}{\text{TiB}} = 7.27\text{ TiB}$.

So with 3.125% reserved for spa_slop_shift, we should expect the root dataset to report $7.27\text{ TiB} \cdot (1 - 0.03125) = 7.04\text{ TiB}$ for used+avail. Indeed, $2.99\text{ TiB} + 4.05\text{ TiB} = 7.04\text{ TiB}$:

$ zfs list slow
NAME    USED  AVAIL     REFER  MOUNTPOINT
slow   2.99T  4.05T      330K  /slow
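
Putting the pieces together, a quick awk sketch reproduces the arithmetic (the disk count, parity level, disk size, and spa_slop_shift are hard-coded for this pool; metadata overhead and any caps ZFS applies to the slop space are ignored):

$ awk 'BEGIN {
    n = 3; p = 1; x = 4        # N disks in the raidz group, parity P, X TB per disk
    slop_shift = 5             # spa_slop_shift
    tib = x * 1000^4 / 1024^4  # one disk, converted from TB to TiB
    printf "raw %.2f TiB, after parity %.2f TiB, after slop %.2f TiB\n",
           n * tib, (n - p) * tib, (n - p) * tib * (1 - 1 / 2^slop_shift)
}'
raw 10.91 TiB, after parity 7.28 TiB, after slop 7.05 TiB

The small differences from the 10.9 TiB, 7.27 TiB, and 7.04 TiB figures above are just rounding in the second decimal place.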

References

  1. https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSReservedSpaceVaries