I have a niche problem: my storage server’s ZFS pool is lumpy!

Wait, that probably doesn’t make a lot of sense. Let me show you what I mean.

Here’s what zpool list -v says:

NAME                        SIZE  ALLOC   FREE  FRAG    CAP
zones                      32.6T  12.2T  20.4T    3%    37%
 mirror                    3.62T  2.21T  1.41T    5%  61.1%
  c0t5000CCA25DE8EBF4d0        -      -      -     -      -
  c0t5000CCA25DEEC08Ad0        -      -      -     -      -
 mirror                    3.62T  2.22T  1.40T    6%  61.3%
  c0t5000CCA25DE6FD92d0        -      -      -     -      -
  c0t5000CCA25DEEC738d0        -      -      -     -      -
 mirror                    3.62T  2.28T  1.34T    6%  63.0%
  c0t5000CCA25DEAA3EEd0        -      -      -     -      -
  c0t5000CCA25DE6F42Ed0        -      -      -     -      -
 mirror                    3.62T  2.29T  1.33T    5%  63.2%
  c0t5000CCA25DE9DB9Dd0        -      -      -     -      -
  c0t5000CCA25DEED5B7d0        -      -      -     -      -
 mirror                    3.62T  2.29T  1.34T    5%  63.1%
  c0t5000CCA25DEB0F42d0        -      -      -     -      -
  c0t5000CCA25DEECB9Dd0        -      -      -     -      -
 mirror                    3.62T   237G  3.39T    1%  6.38%
  c0t5000CCA24CF36876d0        -      -      -     -      -
  c0t5000CCA249D4AA59d0        -      -      -     -      -
 mirror                    3.62T   236G  3.39T    0%  6.36%
  c0t5000CCA24CE9D1CAd0        -      -      -     -      -
  c0t5000CCA24CE954D2d0        -      -      -     -      -
 mirror                    3.62T   228G  3.40T    0%  6.13%
  c0t5000CCA24CE8C60Ed0        -      -      -     -      -
  c0t5000CCA24CE9D249d0        -      -      -     -      -
 mirror                    3.62T   220G  3.41T    0%  5.93%
  c0t5000CCA24CF80849d0        -      -      -     -      -
  c0t5000CCA24CF80838d0        -      -      -     -      -

The first five vdevs have utilizations above 60%, while the final four haven’t even surpassed 7%.

You can probably guess what happened: I had a zpool with five mirrors and then expanded it by adding four more. ZFS doesn’t automatically rebalance existing data, but it does skew new writes so that more of them go to the mirrors with lower used capacity.
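
As a rough mental model, you can picture new writes being spread in proportion to each vdev’s free space. The little Python snippet below just computes those proportions from the zpool list numbers above; the real allocator is far more sophisticated than this, but it shows why the four new mirrors soak up most of the new data.

# Toy model only: the real ZFS metaslab allocator does much more than
# weight by free space, but this illustrates the skew.
free_tib = {
    "mirror-0": 1.41, "mirror-1": 1.40, "mirror-2": 1.34,
    "mirror-3": 1.33, "mirror-4": 1.34,   # original five mirrors
    "mirror-5": 3.39, "mirror-6": 3.39,
    "mirror-7": 3.40, "mirror-8": 3.41,   # newly added four
}
total = sum(free_tib.values())
for name, free in free_tib.items():
    print(f"{name}: about {free / total:.0%} of new writes")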

Now, is this a problem? For my use case, not really. The bandwidth and IOPS from five mirrors of hard drives is more than sufficient. But still, this arrangement bothers me.

Fortunately, the algorithm to rebalance the data is trivial:

  • for file in dataset,
    • copy the file to a temporary directory in another dataset
    • delete the original file
    • copy the temporary file back to the original location
    • delete the temporary directory

As the files get rewritten, not only do the newer mirrors get more full, but the older mirrors also free up space. Eventually, the utilization of all mirrors should converge.

I wrote a program called datashake to automate this process.
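
The listing below isn’t the actual datashake source, just a minimal Python sketch of that per-file shuffle, assuming a scratch directory that lives on a different dataset (both paths are placeholders). It skips all the safety you’d want in practice, like checksum verification, metadata preservation, and handling of files that are in use, so treat it purely as an illustration of the algorithm.

import os
import shutil
import sys

def shake(dataset_root, scratch_dir):
    # scratch_dir must live on a different dataset than dataset_root,
    # matching the algorithm above; both paths are placeholders.
    for dirpath, _dirnames, filenames in os.walk(dataset_root):
        for name in filenames:
            original = os.path.join(dirpath, name)
            temporary = os.path.join(scratch_dir, name)
            shutil.copy2(original, temporary)   # copy out (keeps mode/mtime)
            os.remove(original)                 # free the old blocks
            shutil.copy2(temporary, original)   # copy back: freshly allocated
            os.remove(temporary)                # drop the scratch copy

if __name__ == "__main__":
    shake(sys.argv[1], sys.argv[2])

Each pass forces every block in the dataset to be reallocated under the pool’s current write policy, which is all the “rebalancing” really amounts to.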

So, how did it turn out?

NAME                        SIZE  ALLOC   FREE  FRAG    CAP
zones                      32.6T  12.2T  20.5T    1%    37%
 mirror                    3.62T  1.52T  2.11T    4%  41.9%
  c0t5000CCA25DE8EBF4d0        -      -      -     -      -
  c0t5000CCA25DEEC08Ad0        -      -      -     -      -
 mirror                    3.62T  1.52T  2.11T    3%  41.9%
  c0t5000CCA25DE6FD92d0        -      -      -     -      -
  c0t5000CCA25DEEC738d0        -      -      -     -      -
 mirror                    3.62T  1.52T  2.11T    3%  41.8%
  c0t5000CCA25DEAA3EEd0        -      -      -     -      -
  c0t5000CCA25DE6F42Ed0        -      -      -     -      -
 mirror                    3.62T  1.50T  2.12T    3%  41.5%
  c0t5000CCA25DE9DB9Dd0        -      -      -     -      -
  c0t5000CCA25DEED5B7d0        -      -      -     -      -
 mirror                    3.62T  1.52T  2.10T    3%  42.0%
  c0t5000CCA25DEB0F42d0        -      -      -     -      -
  c0t5000CCA25DEECB9Dd0        -      -      -     -      -
 mirror                    3.62T  1.16T  2.47T    0%  31.9%
  c0t5000CCA24CF36876d0        -      -      -     -      -
  c0t5000CCA249D4AA59d0        -      -      -     -      -
 mirror                    3.62T  1.15T  2.48T    0%  31.7%
  c0t5000CCA24CE9D1CAd0        -      -      -     -      -
  c0t5000CCA24CE954D2d0        -      -      -     -      -
 mirror                    3.62T  1.07T  2.55T    0%  29.6%
  c0t5000CCA24CE8C60Ed0        -      -      -     -      -
  c0t5000CCA24CE9D249d0        -      -      -     -      -
 mirror                    3.62T  1.19T  2.43T    0%  32.8%
  c0t5000CCA24CF80849d0        -      -      -     -      -
  c0t5000CCA24CF80838d0        -      -      -     -      -

I was pretty satisfied with this result. It didn’t converge as well as I had hoped, but it’s still a big improvement over where it started, and fragmentation went down as a bonus.

Part of the reason for incomplete convergence is that I didn’t run the program on every dataset, just the main “datahoarder” one. My SmartOS node has some VMs with large filesystems and other cruft that haven’t been “shaken” yet.

But I did notice one weird behavior. I had zpool iostat -v 15 open while the program ran, and there were indeed plenty of intervals where a lot of data was being read only from the first five mirrors but written across all nine. Fairly often, though, the first five vdevs were still receiving more writes than the newer four. That may also explain why the vdevs didn’t converge as much as expected.

There’s probably some nuance in the ZFS write skew logic that explains this behavior, but I’m not enough of an expert to even attempt a guess. Perhaps I can try “shaking” some of the files again and see if the utilization converges even more.