Linux: Aligning partitions to a (hardware) RAID stripe

August 23, 2011

Note: This article isn’t finished. It got really long, so I was going to split it and do a separate part on optimized formatting; that hasn’t happened yet ;). For now it is just a brain-dump on the topic, so if you’re short on time, just do some searching and see if what you need is here!

Simply put, the importance of aligning your partitions to your hardware RAID stripe cannot be over-emphasized. Granted, the idea that this tweak will make your storage twice as fast is ridiculous. However, the idea that failing to implement it will rob you of 10% to 40% of performance in many disk metrics is more or less irrefutable. This is true in most environments, but this article deals specifically with how to achieve alignment using hardware RAID under Linux. Once that is done, we’ll look at optimized formatting with the ext filesystem in a later article.

A quick note to RAID1 users: If you’re only using RAID1 and have no other ‘layers’ (for instance RAID10 is multi-layer), then most of this article does not apply to you. One FYI though: I have seen RAID controllers that have a chunk or stripe size setting under RAID1, but it really doesn’t mean anything in that mode. If you actually have a controller on which the chunk size of a RAID1 matters, then I’d say that controller is a bigger problem for you than whether the resulting chunk is aligned or not. The end.

So back on track. There is a lot of misinformation out there, and in my experience also a lot of instances where two guides will use different names for the same thing. I’ll try to be as broad (yet complete) as possible. If you believe you see examples of these problems in this article and think you have better information, I would urge you to let me know in the comments so that if there is a problem…it gets fixed: This article is intended to clarify, not further confuse!

PRE-REQ: Since this topic touches many other areas, you will see me stop at several points to mention pre-requisite knowledge. You can stop and search online for the pre-req if you need to. So, first off: for the next couple of paragraphs it will be very useful if you are refreshed on the differences between RAID levels 0, 1, 5, and 6 (or at least those that are relevant to your deployment). If you have never learned these levels, you should start there and learn to deploy some test storage on a few of them before attempting (or possibly even reading) the rest of this article.

While the concepts here will mostly apply to RAID0 you should know that I’m using a RAID5 as my example. If I start talking about parity…it is pretty safe to assume it doesn’t apply to RAID0. I’ll note that kind of thing where I remember to. RAID6 is basically interchangeable with 5 for the purposes of this article. Just be sure that you can keep the innards straight in your head if you’re using 6. If you’re told (for example) to calculate how much of the total disk space will be for data and how much for parity, you’ll need to modify the formula to count out the equivalent of two parity disks even though my examples show the equivalent of one. Since again, I’m using RAID5.

TERMS: Similar to pre-req, I’ll be doing my best to clarify terms in places like this. For this one, we’re talking chunks and stripes. I’ve seen two versions of each word (so four meanings) used interchangeably. Here is the key thing you need to figure out when you see someone use one of these two terms: Is the person talking about the amount of data that is contiguously written to each disk before moving on to the next one, or is the person talking about the amount of data that is written across all of the disks before coming back around to the first disk? The first case is also called a minor-stripe or (something like) a per-disk stripe, and the second case is also called a major-stripe or (something like) a whole-array stripe.

We’ll switch to discussing RAID0, just for the rest of this section, so that we don’t have to complicate the examples with parity. In practice, if you format a four disk RAID0 with a 64k stripe (referring to the minor-stripe here), then in a perfect world you would need to write 256k of data to fully write one major-stripe (across all of the disks). If you wrote the same 256k, but the 64k referred to the major-stripe instead, then each disk would only take 16k of the data before the array moved on to the next disk. So in this second example you would write to each disk four times before the same 256k had hit the platters. The figure below shows the former: 256k written to an array with a 64k minor-stripe.

[Figure: Stripe Example. Data on its way to disk: 4-disk RAID0, 64k minor-stripe.]
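To make the two readings of "64k stripe" concrete, here is a minimal Python sketch. It assumes the simplest possible RAID0 layout, where chunks are dealt out to the disks round-robin starting at disk 0; your controller’s real layout may differ. The function names `locate` and `chunks_per_disk` are made up for this illustration.

```python
KIB = 1024


def locate(offset_bytes, minor_stripe_bytes, num_disks):
    """Map a byte offset on the volume to (disk, stripe row, offset inside the chunk).

    Assumption: plain round-robin RAID0 with no parity, first chunk on disk 0.
    """
    chunk_index = offset_bytes // minor_stripe_bytes  # chunk number across the whole volume
    disk = chunk_index % num_disks                    # chunks go to disks in rotation
    row = chunk_index // num_disks                    # number of full major-stripes before this one
    within = offset_bytes % minor_stripe_bytes        # position inside the chunk
    return disk, row, within


def chunks_per_disk(write_bytes, minor_stripe_bytes, num_disks):
    """Count how many separate chunks each disk receives for a write starting at offset 0."""
    counts = [0] * num_disks
    for offset in range(0, write_bytes, minor_stripe_bytes):
        disk, _, _ = locate(offset, minor_stripe_bytes, num_disks)
        counts[disk] += 1
    return counts


# "64k" is the minor-stripe: each disk gets one 64k chunk.
print(chunks_per_disk(256 * KIB, 64 * KIB, 4))  # [1, 1, 1, 1]

# "64k" is the major-stripe: the minor-stripe is 16k, so each disk is visited four times.
print(chunks_per_disk(256 * KIB, 16 * KIB, 4))  # [4, 4, 4, 4]
```

Same 256k of data, same four disks; the only thing that changed is which meaning of "stripe" the setting uses.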

Now that we’ve got some of the beginning stuff out of the way, we’re going to make a RAID5. This is controller specific, so I’m afraid I won’t be of much help in figuring out how to select your group of disks and make the array. I can however say that when you are done, the only things you need to remember from the procedure are how many disks you used, whether any of them were ‘standby’ or ‘hotspare’ replacements, the stripe or chunk size, and whether that size was a ‘major’ or ‘minor’ setting (you may need to look in the controller’s manual). On my 3ware 9650-SE, I’ve found that the stripe setting in the controller’s BIOS is talking about the minor-stripe. I’ve set it to 256k, so my major-stripe will be 768k. (I’ve selected four disks. I have zero hotspare/standby disks. The equivalent of one disk will be parity. 3 x 256k = 768k)
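The arithmetic above can be sketched in a few lines of Python. This is only an illustration: `major_stripe_kib` is a made-up helper, and it assumes `total_disks` counts every disk you selected, including any hot spares.

```python
# Parity overhead, in "disk equivalents", for the RAID levels discussed here.
PARITY_DISKS = {0: 0, 5: 1, 6: 2}


def major_stripe_kib(total_disks, hot_spares, minor_stripe_kib, raid_level):
    """Return the major-stripe size in KiB: data disks times the minor-stripe.

    Assumption: total_disks includes the hot spares, which hold no data.
    """
    data_disks = total_disks - hot_spares - PARITY_DISKS[raid_level]
    if data_disks < 1:
        raise ValueError("not enough disks left over to hold data")
    return data_disks * minor_stripe_kib


# The example array: four disks, no spares, RAID5, 256k minor-stripe.
print(major_stripe_kib(4, 0, 256, 5))  # 768

# The same disks as RAID6 would lose a second disk's worth of space to parity.
print(major_stripe_kib(4, 0, 256, 6))  # 512
```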

PRE-REQ (and some terms): The first sector of a disk or RAID volume is always used to hold the ‘boot sector’; basically, the list of partitions (the ‘disk label’ or ‘partition table’) and an instruction as to which partition to boot (if the disk is bootable). Since that first sector is taken, you need to skip some number of sectors to where your first partition will actually start: a sector offset, if you will. Unfortunately, many current distributions of Linux still start the first partition at a very inopportune sector offset: sector 63 (sectors are numbered from 0, so 63 sectors are skipped). With 512 byte sectors that is 63 x 512 = 32,256 bytes into the volume, which is not a multiple of any common stripe size. That sector is chosen for backwards compatibility with CHS, which is a much older way to track data on a disk. CHS is ancient black magic and should not be used except in very rare cases where there is no other choice. The newer way (that we’re using here) just uses the sector number to locate data, and is called LBA. There is more information below as to why sector 63 is a problematic default start point, and why it necessitates the manual partition alignment that we are working towards.

PRE-REQ: On-disk sectors are still equal to 512 bytes in most cases (when I wrote this). However, there is movement towards a disk layout that uses 4096 byte sectors. Among other things, the larger sectors will reduce per-sector overhead and allow drive makers to offer more disk space and better ECC in the future. The official name for these drives is “Advanced Format” (or AF). There are already native and emulated AF formats, and there can be complex idiosyncrasies in mixing them, especially if there are also drives with 512 byte sectors in the mix. This is along the same lines as not mixing drive rotational speeds, cache sizes, etc. You want drives that are basically even. I would recommend not using any AF drives unless all of the drives are identical. You can look up the spec sheets for all of your drives on the manufacturer’s website and see whether they are AF. Sometimes the sheet will only tell you the ‘actual sector size’ or ‘physical sector size’ as opposed to a ‘logical’ or ‘presented’ sector size. If the actual or physical size is 512 bytes, the drives should be non-AF. In my case, only three of my four disks are the exact same model, but they are all 1TB, 7200rpm, 32MB cache, SATA-II, non-AF disks, so that is good enough for me. You also have to consider whether your RAID controller supports AF disks, what sector size the card will present to Linux, and the performance considerations therein.
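If you want to see what sector sizes Linux itself reports for a block device, the kernel exposes them under sysfs in `queue/logical_block_size` and `queue/physical_block_size`. The sketch below reads them with Python. Keep in mind that for a hardware RAID volume these are whatever the controller presents, not necessarily what the member disks use, so the spec sheets are still worth checking. The helper names `sector_sizes` and `describe` are made up for this example.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes for a device name such as 'sda'."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def describe(logical, physical):
    """Rough classification of the reported sizes (an illustration, not an official taxonomy)."""
    if physical == 512:
        return "512 byte sectors (non-AF)"
    if logical == 512:
        return "AF with emulated 512 byte logical sectors"
    return "AF with native large sectors"


if __name__ == "__main__":
    root = Path("/sys/block")
    # Walk every block device the kernel knows about; skip anything unreadable.
    for dev in (sorted(root.iterdir()) if root.is_dir() else []):
        try:
            logical, physical = sector_sizes(dev.name)
        except (OSError, ValueError):
            continue
        print(dev.name, logical, physical, describe(logical, physical))
```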

Once you have your RAID volume set up, boot into Linux so we can create a partition on it. I am obviously assuming here that you have a separate boot volume. (Usually it is possible to install your distro onto the actual RAID volume while still using this guide, but you’d have to switch to another terminal when you get to partitioning and do it all manually on the console. We’re not going there right now. I assume you have a boot volume and that the RAID volume isn’t touched until you’re booted into the OS.)

So, since we established that we are required to have a sector offset from the start of the volume, we’d like the offset size to be a multiple of the size of the minor-stripe. Here is why: The only analogy I can think of is if you worked in a cake factory, and it was your job to put the cakes into their boxes on an assembly line. Everything would be great as long as every box started out empty and you could put one whole cake in it. Then, imagine some jerk starts putting half eaten cakes in the boxes before they get to your station. Now suddenly you cannot put a whole cake into each box, because they aren’t fully empty. So now you’re dividing cakes into guesstimated fractions and trying to stuff the pieces evenly into the boxes, and things are getting slowed down in a hurry.

Well, your RAID card works the same way. It is only going to write one cake’s (minor-stripe’s) worth of data to each disk before it has to go to the next disk in the array. If the partition doesn’t start on a minor-stripe boundary, then a write that should have fit inside one minor-stripe straddles two of them, so disk operations touch more disks (cake boxes) than they need to. So we need there to be a whole number of empty boxes between the start of the volume and the first sector of your partition. Since the minor-stripe is effectively the smallest unit of data that the RAID controller deals with, once we have aligned the start of the partition to the start of a minor-stripe it will not get out of sync as data is added later. If we don’t do it though, then every time you go to write something within that partition your RAID card will know you only as “the jerk who keeps giving me half full cake boxes”, and things will be slower.
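Here is the cake-box problem as a small Python sketch. It counts how many minor-stripes (boxes) a single I/O lands in, given where the partition starts. The function `chunks_touched` is hypothetical, and the example assumes the filesystem issues a minor-stripe sized write at the very start of the partition, which is a simplification.

```python
KIB = 1024


def chunks_touched(partition_start_bytes, io_offset_bytes, io_length_bytes, minor_stripe_bytes):
    """Count the minor-stripes spanned by one I/O.

    io_offset_bytes is relative to the start of the partition; the partition
    start is relative to the start of the RAID volume.
    """
    first_byte = partition_start_bytes + io_offset_bytes
    last_byte = first_byte + io_length_bytes - 1
    first_chunk = first_byte // minor_stripe_bytes
    last_chunk = last_byte // minor_stripe_bytes
    return last_chunk - first_chunk + 1


minor_stripe = 64 * KIB

# Partition starting at sector 63 (512 byte sectors): one 64k write spills into a second box.
print(chunks_touched(63 * 512, 0, 64 * KIB, minor_stripe))      # 2

# Partition starting at 1 MiB: the same write fits in exactly one box.
print(chunks_touched(1024 * KIB, 0, 64 * KIB, minor_stripe))    # 1
```

On a parity RAID level the cost of the extra box is higher than this count suggests, since each partially written stripe also means parity has to be updated.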

I’d recommend starting your partition at sector 2048 for arrays using standard sized sectors (512 bytes), and sector 256 for most arrays using AF sized sectors (4096 bytes). Note that at this point we’re concerned with the sector size that the RAID card presents to Linux, which may differ from that of the disks you used. (Partitioning tools count in logical sectors, so if the card presents 512 byte logical sectors, sector 2048 is the one you want regardless of the physical size underneath.) We use a different starting sector number depending on the sector size because a 4096 byte sector is (obviously!) 8 times larger than a 512 byte sector, so you reach the same physical spot in the array in 8 times fewer sectors. 2048/8=256. The location that we’re going for here is the very first sector of the second MiB of the RAID volume. (Note that the first sector (boot sector) is inclusive here, and so most of the first whole MiB may be empty besides the boot sector and any boot loader code.) Skipping the first whole MiB is good because it lines up extremely well with almost all minor-stripe sizes: any power-of-two size up to 1 MiB divides it evenly. If your minor-stripe was 64k, then you’ve just skipped 16 minor-stripes. If it was 256k, you’ve skipped 4. The key here is that you’re skipping a whole number of minor-stripes (or cake boxes!). (figures coming soon)
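The numbers in this section are easy to double-check. The following Python sketch computes the start sector for a 1 MiB offset at both sector sizes, then tests a start sector against a handful of minor-stripe sizes. The helper names and the list of stripe sizes are just examples picked for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def start_sector_for_offset(offset_bytes, sector_bytes):
    """Convert a byte offset on the volume into a sector number."""
    if offset_bytes % sector_bytes:
        raise ValueError("offset is not on a sector boundary")
    return offset_bytes // sector_bytes


def is_aligned(start_sector, sector_bytes, minor_stripe_bytes):
    """True if the partition start falls exactly on a minor-stripe boundary."""
    return (start_sector * sector_bytes) % minor_stripe_bytes == 0


print(start_sector_for_offset(MIB, 512))   # 2048
print(start_sector_for_offset(MIB, 4096))  # 256

# Example minor-stripe sizes in KiB; check the old default (63) against the 1 MiB start (2048).
for stripe_kib in (16, 64, 128, 256, 512, 1024):
    stripe = stripe_kib * KIB
    print(stripe_kib, "k:",
          "sector 63 aligned =", is_aligned(63, 512, stripe),
          "| sector 2048 aligned =", is_aligned(2048, 512, stripe))
```

Sector 63 fails for every size in the list, while sector 2048 passes for all of them.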

At the risk of being overly pedantic or annoying, I’d actually recommend that the next step is to stop here and start this whole guide over. Go back and go into the controller’s BIOS and screw around in it for an hour and search online for every term in there. You don’t have to have every answer but you’ll end with a better result if you get an idea whether you want write-through or write-back caching, whether the read cache should be set to ‘on’ or ‘adaptive’, etc. If nothing else, just find out if those options are even called that in your card’s BIOS, or if you even have those options. There is so much stuff like auto-carving, battery policies, verification time, foreground/background initialization options, whether your controller has volume size limits, etc. In reality you could continue without doing this, but I mean if you’re down to tweaking partition/stripe alignment, I’d just assume you’re the type who’d want to know what the other stuff does too.

We will stop here for now; this article is a work in progress. Soon, instructions on how to do all this with fdisk and parted will end up in this space, the article will be split in two, and optimized formatting of a partition will be in a third article.

So for now, if you can get your first partition to start at the first sector of the second MiB on your array, as described above, your partition should be aligned. If you use values of whole MiB in parted or fdisk when specifying partition sizes, then in most cases subsequent partitions would be aligned as well. But again, that is all to come…
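Until the fdisk and parted instructions are written, here is a rough Python sketch of why whole-MiB sizes keep later partitions aligned. It lays partitions out back to back, starting at 1 MiB, and returns the first and last sector of each one. The function `layout` is hypothetical, and it ignores anything a real partition table might need between partitions (extended partition bookkeeping, for example).

```python
MIB = 1024 * 1024


def layout(sizes_mib, sector_bytes=512, first_start_mib=1):
    """Return (first_sector, last_sector) for partitions placed one after another.

    Assumption: every size is a whole number of MiB and there are no gaps.
    """
    sectors_per_mib = MIB // sector_bytes
    start = first_start_mib * sectors_per_mib
    partitions = []
    for size in sizes_mib:
        length = size * sectors_per_mib
        partitions.append((start, start + length - 1))
        start += length  # the next partition begins on the following MiB boundary
    return partitions


# A 512 MiB partition followed by a 10 GiB one, on 512 byte sectors.
for first, last in layout([512, 10240]):
    print(first, last, "start is a whole MiB:", first % 2048 == 0)
```

Because every start lands on a MiB boundary, each partition is aligned to any minor-stripe size that divides 1 MiB evenly.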
