Performance implications of software RAID
The choice between hardware assisted and software RAID seems to draw strong opinions on all sides. I’ve recently been specifying a moderately-sized storage system for a research setting. In this application, I/O throughput is nice to have but most users will be CPU bound; space requirements are in the dozens of terabytes, with the potential to scale up to 100 TB or so.
While there are many things that can be said about the durability offered by a hardware RAID controller with a battery-backed writeback buffer, I've always been a bit skeptical of black boxes. In the best case they are merely a nuisance to manage; in the worst case they are a binary-blob-laden nightmare. Further, the boards are fairly expensive, there is evidence to suggest that they often don't offer the performance one might expect, and they represent a single point of failure where recovery requires finding hardware identical to the original controller.
So, while the durability of hardware RAID may be necessary for banking, it doesn't seem necessary for the current application, where we will already have redundant, independent power supplies and the data is largely static. With this in mind, the decision largely comes down to performance. When reading about RAID solutions it doesn't take long to encounter some argument about the CPU time required to compute RAID parity. Given how weak the hardware driving most commercial RAID controllers is compared to a typical multi-GHz Haswell with its 256-bit SIMD paths, this argument seems a tad empty. After all, RAID 5 is essentially block-wise XOR (RAID 6 is a bit trickier); how bad could it possibly be?
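For the unconvinced, here is a toy sketch of the arithmetic involved, using single bytes in shell arithmetic where a real implementation would stream whole blocks: the parity is simply the XOR of the data, and any one missing piece can be recovered by XORing the survivors with the parity.
# Toy RAID-5 parity arithmetic: bytes standing in for whole blocks.
d1=0xa5; d2=0x3c; d3=0x0f                      # three "data blocks"
p=$(( d1 ^ d2 ^ d3 ))                          # the parity block is their XOR
printf 'parity       = 0x%02x\n' "$p"
# Pretend d2 is lost; XOR the survivors with the parity to get it back.
printf 'recovered d2 = 0x%02x\n' "$(( d1 ^ d3 ^ p ))"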
Take, for instance, my laptop. It features a Core i5 4300U running at 1.9 GHz; a pretty quick machine but certainly nothing special. Every time I boot, the kernel informs me of these useful numbers while initializing the md RAID subsystem:
raid6: sse2x1 9154 MB/s
raid6: sse2x2 11477 MB/s
raid6: sse2x4 13291 MB/s
raid6: avx2x1 17673 MB/s
raid6: avx2x2 20432 MB/s
raid6: avx2x4 23755 MB/s
raid6: using algorithm avx2x4 (23755 MB/s)
raid6: using avx2x2 recovery algorithm
xor: automatically using best checksumming function:
avx : 28480.000 MB/sec
Or on a slightly older Ivy Bridge machine (which lacks AVX2 support),
raid6: sse2x4 12044 MB/s
raid6: using algorithm sse2x4 (12044 MB/s)
raid6: using ssse3x2 recovery algorithm
xor: automatically using best checksumming function:
avx : 22611.000 MB/sec
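If these messages have long since scrolled away, something like the following should fish most of them back out of the kernel log (assuming it hasn't rotated):
dmesg | grep -E 'raid6|xor'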
Taken alone these numbers are pretty impressive. Of course this isn't surprising; a lot of people have put a great deal of effort into making these routines extremely efficient. For comparison, at this throughput it would require somewhere around 2% of a single core's cycles to rebuild an eight-drive array (assuming each drive pushes somewhere around 60 MByte/s, which is itself a rather generous number).
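For the skeptical, the arithmetic behind that estimate is presumably something along these lines (all eight drives at 60 MByte/s, measured against the Haswell syndrome figure above):
echo "scale=2; 100 * 8 * 60 / 23755" | bc    # => 2.02, i.e. roughly 2% of a core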
Looking at the code responsible for these messages, these figures arise from timing a simple loop that generates syndromes from a dummy buffer. As this buffer fits in cache, these performance numbers clearly aren't what we will see under actual workloads.
Thankfully, it is fairly straightforward (see below) to construct a (very rough) approximation of an actual workload. Here we build an md RAID-6 volume on an SSD. Using fio to produce I/O, we can measure the transfer rate sustained by the volume along with the number of cycles consumed by md's kernel threads.
On my Haswell machine I find that md can push roughly 400 MByte/s of writes with md's kernel worker eating just shy of 10% of a core. As expected, reads from the non-degraded volume see essentially no CPU time used by md. On an Ivy Bridge machine (which lacks AVX2; also tested against a different SSD), on the other hand, I find that I can push roughly 250 MByte/s with closer to 20% of a core eaten by md. Clearly AVX2 pulls its weight very nicely here.
Given that the workload in question is largely read-oriented, this seems like a perfectly reasonable price to pay, assuming AVX2 is available.
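Checking whether a given machine offers AVX2 is a one-liner on Linux:
grep -q avx2 /proc/cpuinfo && echo "AVX2 available" || echo "no AVX2"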
Benchmark
Here is the benchmark, to be sourced and invoked with run_all. As usual, this is provided as-is; assume it will kill your cat and burn your house down.
#!/bin/bash -e
# Where to put the images
root="/tmp"
processes="md md127_raid6"
# damn you bash
time="/usr/bin/time"
# Setup RAID "devices"
function setup() {
    loopbacks=
    for i in 1 2 3 4; do
        touch $root/test$i
        fallocate -l $(echo 8*1024*1024*1024 | bc) -z $root/test$i
        loopbacks="$loopbacks $(sudo losetup --find --show $root/test$i)"
    done
    sudo mdadm --create --level=6 --raid-devices=4 /dev/md/bench $loopbacks
    sudo mkfs.ext4 /dev/md/bench
    sudo mkdir /mnt/bench
    sudo mount /dev/md/bench /mnt/bench
}
# Measure it!
tick="$(getconf CLK_TCK)"
function show_processes() {
    for p in $@; do
        pid="$(pidof $p)"
        # pid, comm, user, system
        cat /proc/$pid/stat | awk "{print \$1,\$2,\$14 / $tick,\$15 / $tick}"
    done
}
function write_test() {
    echo -e "\\n================================" >> times.log
    echo -e "Write test" >> times.log
    (echo "Before"; show_processes $processes) >> times.log
    $time -a -o times.log sudo fio --bs=64k --ioengine=libaio \
        --iodepth=4 --size=7g --direct=1 --runtime=60 \
        --directory=/mnt/bench --filename=test --name=seq-write \
        --rw=write
    (echo -e "\\nAfter"; show_processes $processes) >> times.log
}
function read_test() {
    echo -e "\\n================================" >> times.log
    echo -e "Read test" >> times.log
    (echo "Before"; show_processes $processes) >> times.log
    $time -a -o times.log sudo fio --bs=64k --ioengine=libaio \
        --iodepth=4 --size=7g --direct=1 --runtime=60 \
        --directory=/mnt/bench --filename=test --name=seq-read \
        --rw=read
    (echo -e "\\nAfter"; show_processes $processes) >> times.log
}
function rebuild_test() {
    echo -e "\\n================================" >> times.log
    echo "Rebuild test" >> times.log
    (echo "Before"; show_processes $processes) >> times.log
    echo check | sudo tee /sys/block/md127/md/sync_action
    sudo mdadm --monitor | grep RebuildFinished | head -n1
    (echo -e "\\nAfter"; show_processes $processes) >> times.log
}
# Clean up
function cleanup() {
    sudo umount /mnt/bench || true   # may already be unmounted by run_all
    sudo mdadm -S /dev/md/bench
    echo $loopbacks | sudo xargs -n1 -- losetup --detach
    sudo rm -R /mnt/bench
    rm $root/test[1234]
}
function run_all() {
    echo > times.log
    setup
    write_test
    read_test
    sudo umount /mnt/bench   # unmount before exercising the rebuild path
    rebuild_test
    cleanup
    cat times.log
}
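For what it's worth, a session looks something like this (the file name is just a placeholder for wherever you saved the script); results accumulate in times.log, which run_all dumps at the end:
$ . raid-bench.sh
$ run_all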