CFQ I/O Scheduler - Performance for all hosts - Please Read
My name is Matt Heaton with Bluehost. As many of you know, I try to work at a very low level in the kernel to address what I see as fundamental I/O concerns in the Linux kernel. Lately, we have started employing very VERY smart kernel developers, both locally and around the world, to investigate performance issues in the kernel, create patches to solve those issues, and do the political work involved to get them into the mainline Linux kernel available at kernel.org.
While about half of these patches are for us only, when we see a problem so widespread that everyone can benefit, we like to share the information. This is one of those times.
In almost every modern Linux distribution, CFQ is set as the default I/O scheduler for all your block devices. Here is a brief description of the different I/O schedulers and what they are primarily used for: http://www.redhat.com/magazine/008ju...es/schedulers/
Anyway, we have found HUGE problems with CFQ across many different scenarios and many different hardware setups. If it were only an issue with our configuration, I would have forgone posting this message and simply informed the kernel developers responsible for the fix.
There are two scenarios where CFQ has a severe problem. The first occurs when you are running a single block device (one drive, or a RAID 1 setup): under certain circumstances where heavy sustained writes are occurring, the CFQ scheduler behaves very strangely. It begins to give all access to reads and limits writes to the point of allowing only 0-2 write I/O operations per second versus 100-180 read operations per second. This condition persists indefinitely until the sustained write process completes. This is VERY bad for a shared environment, where you need both reads and writes to complete regardless of how much read or write load there is. This behavior goes beyond what CFQ says it is supposed to do in this situation - meaning this is a bug, and a serious one at that. We can reproduce this EVERY TIME.
The second scenario occurs when you have two or more block devices, either single drives or any type of RAID array, including RAID 0, 1, 0+1, 1+0, 5, and 6. (We never tested RAID 3 or 4 - who uses those anymore anyway?!) This case is almost exactly the opposite of what happens with only one block device. Here, if one or more of the devices is hammered with heavy writes for a sustained period of time, CFQ blocks reads from the other devices, or severely limits them, until the writes have completed. We can also reproduce this behavior 100% consistently with test software we have written.
This is VERY bad.
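If you want to get a rough feel for the effect yourself before we post our full tests, here is a quick sketch using nothing but standard tools (this is not our test harness, and the mount points and file names are placeholders - substitute paths that live on two different block devices):
dd if=/dev/zero of=/mnt/diskA/bigfile bs=1M count=20000 &
iostat -x 1
dd if=/mnt/diskB/some_large_file of=/dev/null bs=1M
The first command starts a heavy sustained write on one device, iostat (from the sysstat package) shows per-device reads and writes per second while it runs, and the last command, run in another shell, reads from the second device (use a file that is not already cached). Under CFQ you should see the reads stall or crawl while the write is in flight; under deadline they keep moving.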
I have written several times about dirty cache and how write-outs of dirty page cache can and will starve reads on other block devices. This is still the case, so stacking the dirty cache issue on top of CFQ is a nightmare performance-wise.
We have tested this on kernels from 2.6.22 through 2.6.27-rc8.
If you think this doesn't affect you, think again! This is a HUGE problem. So, what can you do about it? Well, we have tried adjusting tunables in CFQ to get the proper behavior, but it's clearly busted deep in the code. My suggestion, and what we have done, is to switch all our block devices to the deadline scheduler. My preference would be to use CFQ "IF" it worked as it is laid out to do, but it doesn't. In all our tests, when everything is blocked, switching to deadline in the middle of the slowdown almost immediately relieves the problem and keeps it from coming back. I HIGHLY suggest web hosts consider this until CFQ can be properly fixed. We will post our tests and how to duplicate the problem to the LKML (Linux kernel mailing list) on Monday to speed along a fix to CFQ in this area.
To check which I/O scheduler you are currently running, type:
cat /sys/block/sdX/queue/scheduler - Replace sdX with your device, such as sda, sdb, sdc, and so forth.
The output should look like this:
noop anticipatory deadline [cfq]
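If you have several drives, a quick loop like this will show all of them at once (just a sketch, assuming standard sdX device naming):
for dev in /sys/block/sd*/queue/scheduler; do echo "$dev: $(cat $dev)"; done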
If you would like to change from CFQ to deadline, issue this command:
echo deadline > /sys/block/sdX/queue/scheduler
After you issue this command, the output should look like this:
noop anticipatory [deadline] cfq
This method of changing it won't survive a reboot, but it lets you see instantly how it will or won't affect your block device performance. To make it permanent, you can add these changes to a boot-up script that sets the scheduler for all your block devices (a sketch of such a script is below), or pass elevator=deadline as a kernel boot parameter.
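Here is a minimal sketch of such a boot-up script, assuming standard sdX device naming - adjust the device glob for your setup. You could put it in /etc/rc.local or your distribution's equivalent:
#!/bin/sh
# Set the deadline I/O scheduler on every sdX block device at boot
for sched in /sys/block/sd*/queue/scheduler; do
    echo deadline > "$sched"
done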
There is one caveat to this information. If you are using SSDs (solid state drives) for any block devices, you should instead use the noop (no operation) I/O scheduler (for reasons I don't want to get into here - you can google it if you want to understand the very important reasons behind this).
The command to do this is:
echo noop > /sys/block/sdX/queue/scheduler
Please remember that noop is ONLY for SSDs and other flash devices. It will cause regular platter-based hard drives to run incredibly slowly!
Hope this helps until the CFQ guys can get this resolved. I am actually quite astounded that such a severe bug has been in CFQ all this time, and that it is the default scheduler for virtually all enterprise Linux distributions. I will post on my blog (Mattheaton.com) when a new CFQ patch that resolves the issue is available. If you have ever wondered why your file systems sometimes "hiccup" for 1-2 seconds and then seem to get back on track, this is usually the problem (at least as far as we have been able to test). Interestingly, we have seen much better throughput for both large and small MySQL databases using deadline. I doubt I will switch back to CFQ for my databases even after a solid patch has been written.
Matt Heaton / President Bluehost.com - Hostmonster.com