[sisyphus] xfs on software raid5.

Vladimir fmfm at symmetron.msk.ru
Fri Mar 14 13:57:14 MSK 2003


Alexey V. Lubimov wrote:

>Looks like I've found an explanation for the constant messages in the log
>
>
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>...
>raid5: switching cache buffer size, 4096 --> 512
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>raid5: switching cache buffer size, 512 --> 4096
>raid5: switching cache buffer size, 4096 --> 512
>
>and the guess that this is inefficient has been confirmed as well.

I created xfs on a soft-raid5 array and started seeing the same thing.
I decided to "outsmart" it. Version 1.2 allows the block size to be
specified when creating the filesystem, so I ran

mkfs.xfs -f -b size=512 /dev/md2
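
(One way to double-check that the smaller block size actually took
effect, assuming an xfs_db from xfsprogs that accepts commands via -c,
is to print the superblock field on the unmounted device:)

xfs_db -r -c 'sb 0' -c 'p blocksize' /dev/md2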

I started a bulk copy onto it. After a while I can see that the cache
size still jumps around (not as often, but it still does). Apparently
moving the journal to a separate partition is the only way to improve
XFS performance on soft-raid5.

Oops, Linux just "hung itself" (kernel panic). So much for the "trick".
 

>==========================================================
>
>This seems to be an area of constant confusion.  Let me describe why
>XFS sucks on Linux software RAID5.  It has nothing to do with
>controllers, physical disk layout, or anything like that.
>
>RAID5 works by saving N-1 chunks of data followed by a chunk of
>parity information (the location of the parity chunk is actually
>interleaved between devices with RAID5, but whatever).  These N-1 data
>chunks + the parity blob make up a stripe.
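
To picture it (chunk labels are illustrative only): with N=4 disks the
parity chunk rotates across the devices from one stripe to the next,
roughly like this:

             disk0   disk1   disk2   disk3
  stripe 0:   D0      D1      D2      P
  stripe 1:   D3      D4      P       D5
  stripe 2:   D6      P       D7      D8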
>
>Every time you update any chunk of data you need to read in the rest
>of the data chunks in that stripe, calculate the parity, and then
>write out the modified data chunk + parity.
>
>This sucks performance-wise because a write could, in the worst case, end up
>causing N-2 reads (at this point you have your updated chunk in
>memory) followed by 2 writes.  The Linux RAID5 personality isn't quite
>that stupid and actually uses a slightly different algorithm involving
>reading old data + parity off disk, masking, and then writing the new
>data + parity back.
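
In other words the shortcut is new_parity = old_parity XOR old_data
XOR new_data. A toy demonstration in bash arithmetic (made-up byte
values standing in for whole chunks, three data chunks plus parity):

  d0=0x11; d1=0x22; d2=0x33
  p=$(( d0 ^ d1 ^ d2 ))           # parity = XOR of all data chunks
  new_d1=0x44                     # we want to overwrite chunk d1
  new_p=$(( p ^ d1 ^ new_d1 ))    # XOR old d1 out, XOR new d1 in
  test $new_p -eq $(( d0 ^ new_d1 ^ d2 )) && echo "parity consistent"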
>
>In any case Linux software RAID keeps a stripe cache around to cut
>down on the disk I/Os caused by parity updates.  And this cache really
>improves performance.
>
>Now.  Unlike the other Linux filesystems, XFS does not stick to one
>I/O size.  The filesystem data blocks are 4K (on PC anyway), but log
>entries will be written in 512-byte chunks.
>
>Unfortunately these 512-byte I/Os will cause the RAID5 code to flush
>its entire stripe cache and reconfigure it for 512-byte I/O sizes.
>Then, a few ms later, we come back and do a 4K data write, causing the
>damn thing to be flushed again.  And so on.
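
(A possible mitigation, untested here and assuming the mkfs.xfs
version supports the -s option: raise the sector size to 4K, so that
log writes get padded out to the same size as the data blocks:)

mkfs.xfs -f -s size=4096 /dev/md2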
>
>IOW, Linux software RAID5 code was written for filesystems like ext2
>that only do fixed-size I/Os.
>
>So the real problem is that because XFS keeps switching the I/O size,
>the RAID5 code effectively runs without a stripe cache.  And that's
>what's making the huge sucking sound.  This will be fixed -
>eventually...
>
>By moving the XFS journal to a different device (like a software RAID1
>as we suggest in the FAQ), you can work around this problem.
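
(For example, with /dev/md1 standing in for a small RAID1 and /dev/md2
for the RAID5, both hypothetical device names: the log can be placed
externally at mkfs time, and the filesystem must then be mounted with
a matching logdev option:)

mkfs.xfs -f -l logdev=/dev/md1,size=32768b /dev/md2
mount -t xfs -o logdev=/dev/md1 /dev/md2 /mnt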
>
>And finally - All the hardware RAID controllers I have worked with
>stick to one I/O size internally and don't have this problem.  They do
>read-modify-write on their own preferred size I/Os anyway.
>
>======================================================================
>
>There is a workaround, though: move the log to a different device:
>
>=====================================================================
>
>Of course the thing I missed is that if you run growfs to grow a log
>it comes back and says:
>
>xfs_growfs: log growth not supported yet
>
>xfs_db also has some endian issues with the write command. I did,
>however, manage to grow a log:
>
>1. select your partition to become the log and dd a bunch of zeros over the
>   complete range you want to be the log, so 32768 blocks would be:
>
>   dd if=/dev/zero of=/dev/XXX bs=32768 count=4096
>
>2. run xfs_db -x on the original unmounted filesystem; use sb 0 to get to
>   the superblock (steps 2-4 are shown as a single session after this list).
>
>3. reset the log offset using 
>
>	write logstart 0
>
>4. set the new log size using
>
>	write logblocks xxxx
>
>   where xxxx is the size in 4K blocks. xfs_db will come out with a
>   new value different from the one you entered; feed this new value
>   back into the same command and it will report the correct value.
>   This is the endian conversion bug in xfs_db:
>
>	xfs_db: write logblocks 32768
>	logblocks = 8388608
>	xfs_db: write logblocks 8388608
>	logblocks = 32768
>
>5. mount the filesystem using the logdev option to point at the new log:
>
>         mount -t xfs -o logbufs=4,osyncisdsync,logdev=/dev/sda6 /dev/sda5 /xfs
>
>You now have a filesystem with a new external log. Going back is harder since
>you need to zero the old log and reset the logstart and logblocks fields.
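
(Put together, the xfs_db part of steps 2-4 looks like this single
session, reusing the /dev/sda5 device from step 5; logblocks is
written twice because of the endian bug described above:)

xfs_db -x /dev/sda5
xfs_db: sb 0
xfs_db: write logstart 0
xfs_db: write logblocks 32768
logblocks = 8388608
xfs_db: write logblocks 8388608
logblocks = 32768
xfs_db: quit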
>
>It does occur to me that by using a logstart other than zero you could
>put two external logs on the same partition; I'm not sure what happens
>with the device open/close logic if you do this, though.
>
>Steve
>
>============================================================================
>
>So there you have it...


-- 
Best regards
Vladimir




