IOPS? Really?

Posted by: mk408

Recently, I had a discussion with a colleague about storage performance, and he kept talking about IOPS, whereas I have always measured it in the perhaps more traditional bytes per second. Since IOPS is effectively the reciprocal of latency, I have tended to ignore it for disk storage, as I have yet to see any use case which is synchronous, let alone sensitive to sub-centisecond latencies.

The alleged use case is a random write-heavy Oracle instance. I confirmed with a DBA I know that Oracle's block sizes range from 4 to 32 KiB. That suggests that the worst-case random I/O can't occur, as the payload for each operation will be between 8 and 64 sectors. Still, I no longer have the data from the benchmarks I ran, so I can't quantify how much of a difference this might make.

I can, however, quantify what I've observed in terms of throughput numbers. A commodity 500GB 7200RPM SATA drive can do around 100MiB/s for sustained, sequential I/O. It drops to around 10MiB/s for sustained, contentious (though not rigorously, statistically random) I/O. If it can do around 100 IOPS, the payloads must average roughly 100KiB (about 200 sectors), much larger than even 64 sectors and closer to quadruple that number. Perhaps the Linux scheduler queue combined with NCQ gets enough adjacency for that fourfold increase.
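
As a rough check of that arithmetic, here is a minimal sketch. It assumes 512-byte sectors and takes the round figures quoted above (10MiB/s, 100 IOPS) at face value; the function name is purely illustrative and these are not measured results.

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte disk sectors
MIB = 1024 * 1024

def implied_payload_sectors(throughput_bytes_per_s, iops):
    """Average number of sectors moved per operation for a given throughput and IOPS."""
    return throughput_bytes_per_s / iops / SECTOR_BYTES

# Round figures from the paragraph above: ~10 MiB/s at ~100 IOPS.
random_sectors = implied_payload_sectors(10 * MIB, 100)
print(random_sectors)       # 204.8 sectors, roughly 100 KiB per operation
print(random_sectors / 64)  # 3.2 times the largest (64-sector) Oracle block
```

With these inputs the implied payload is three to four times a 32 KiB block, which is the gap the scheduler queue and NCQ would have to make up.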

Back to Oracle, or, perhaps, any database: does it really perform I/O in a synchronous fashion, not even dispatching an operation until the previous one has succeeded? This strikes me as unlikely, especially in a high-concurrency environment, which is what I would assume anything with many random writes would be. Surely part of the whole point of something like intent logging is the ability to do (otherwise reckless) asynchronous writes, and logging is patently sequential.

Regardless, both theory and empirical observation lead me to the conclusion that real-world loads and capacities are more meaningfully measured in bytes, not operations, per unit time. Am I missing something?

Comments (9)
jpolk
Great point
written by Jan Polking, June 07, 2009
I've never really paid much attention to IOPS as a performance metric for application-specific storage, so I think your point is a good one. Nice post.
wazoox
...
written by Emmanuel Florac, June 10, 2009
Oracle (and other databases) usually perform only their journal operations synchronously (it's necessary for data integrity). IOPS is hardly a problem in a DAS environment (one machine connected to one storage array) but easily becomes critical in a shared environment, either SAN or even plain old stupid NFS.

Another trend: set up 20 or 30 VMs on a RAID array, and you'll be surprised by how heavy and how random the disk activity is... The IOPS-hungry application right now isn't Oracle: it's VMware!
mk408
...
written by Max Kalashnikov, June 10, 2009
I'm curious what you mean about IOPS becoming critical in a SAN versus DAS. Is this really just a concurrency issue, such as with the multiple VMs scenario?

If so, I'm quite skeptical that mere concurrency and randomness of I/O make IOPS an interesting measurement (such as for bottlenecks), absent 512-byte operations and 100% randomness. Simple arithmetic and empirical evidence back my suspicion.

Could you provide some data as to IOPS having been the limiting factor in a shared situation?
storageanarchy
...
written by Barry A. Burke, June 24, 2009
The importance of IOPS is in relation to block size. Even though Oracle's block sizes can be large, for many applications the record size of a DB transaction is a fraction of Oracle's block size (this isn't unique to Oracle; for that matter, Exchange is very similar). So imagine processing transactions that require random user records of 250 bytes per record. The smallest I/O size is (say) 4 kilobytes, and the transaction completion time is directly related to how fast the requisite 4KB block can be brought into memory.
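
To put illustrative numbers on that, here is a minimal sketch. It assumes the 250-byte record and 4KB minimum I/O from the paragraph above, plus the roughly 100 IOPS per drive figure from the original post; the function name is hypothetical and nothing here is measured.

```python
def bytes_moved_and_wanted(iops, record_bytes, io_bytes):
    """Return (bytes/s the disk transfers, bytes/s the application actually asked for)."""
    return iops * io_bytes, iops * record_bytes

# Assumed inputs: 100 IOPS (figure from the post), 250-byte records, 4 KiB minimum I/O.
moved, wanted = bytes_moved_and_wanted(iops=100, record_bytes=250, io_bytes=4096)
print(moved)           # 409600 bytes/s transferred
print(wanted)          # 25000 bytes/s of useful record data
print(moved / wanted)  # 16.384 bytes moved for every byte wanted
```

Under these assumptions the drive completes at most 100 such record fetches per second while moving well under 1MB/s, so the byte rate alone would not show that it is busy.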

This is where IOPS (and response times) are important. And the two aren't necessarily the reciprocal of each other, because IOPS can be limited by the rate at which requests are being made by the application(s), while response time is indeed the time between the origination and completion of requests.
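
One way to sketch that distinction is with Little's law (completion rate = requests in flight / response time). This is a simplified model, not a description of any particular array: it assumes response time stays constant as more requests are kept in flight, and the function and parameter names are illustrative.

```python
def achieved_iops(offered_iops, outstanding, response_time_s):
    """Observed IOPS: the lesser of what the application asks for and what can complete.

    The ceiling comes from Little's law: throughput = concurrency / latency.
    Assumption: response_time_s does not grow as `outstanding` grows.
    """
    ceiling = outstanding / response_time_s
    return min(offered_iops, ceiling)

print(achieved_iops(50, 1, 0.010))   # 50: limited by the application's request rate
print(achieved_iops(500, 1, 0.010))  # 100.0: one request at a time, the pure reciprocal case
print(achieved_iops(500, 4, 0.010))  # 400.0: four requests in flight at the same latency
```

Only the middle case, a strictly synchronous stream that always has a request waiting, makes IOPS the plain reciprocal of response time.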

Said simply, MB/s is a measure of how MUCH data you can move in a period of time (usually with as few as possible very large I/O requests), while IOPS measures HOW MANY I/O requests can be serviced in a period of time (usually a very large number of very small requests).

Think about it - you really don't care how quickly you can back up the entire Exchange server database - what you care about is how fast you can open a specific email or meeting request. The former is MB/s dependent, the latter is IOPS dependent.
mk408
...
written by Max Kalashnikov, June 24, 2009
OK. We're still talking about a minimum 4KB I/O, 8 times the disk's native block size.

I'm also having trouble with the leap of logic that even those 4K operations are anywhere near statistically random. Especially for the case of databases such as calendar and email, I could make a very strong argument for relative adjacency and caching having strong applicability. Database software (including filesystems) works pretty hard to manage I/O efficiently, else why would it use anything but native block sizes?

I'd still be very interested to see some real-world data, even summarized, since the model is not the reality. I'm not about to eat the menu :)
DaveN
...
written by David Noonon, July 18, 2009
It is important to realize that MB/s is a key metric, but usually only in terms of infrastructure (SAN connectivity - ISLs, trunking, etc.). In a shared environment (in particular) it’s all about IOPS and latency. Oftentimes, with the size of disks today, you can exceed I/O capacity long before exceeding your shared disk capacity (in GB) or your throughput capacity. The most important metric is the one that you are likely to bottleneck on first, which is why IOPS is so critical. Thin provisioning and deduplication often exacerbate this issue, since you load more and more data into the same capacity, increasing I/O contention. It is important to note that some performance is gained back if enough cache exists to keep frequently accessed deduped blocks resident.

Obviously there are one-off situations where the above is not the case: purely sequential workloads such as media streaming or disk backups, for instance.

Another thing to note is that in some systems, even data that an operating system recognizes as sequential may be completely random on disk. For instance, the way WAFL and ZFS take in random writes and stripe them sequentially to disk creates this interesting scenario.

Latency is really the measurement of performance that must always be monitored in a shared environment. Once a given disk pool (with whatever databases or VMs exist on it) becomes saturated with I/O contention, response delay increases. Eventually you can’t keep up with the number of transactions per second that you must achieve and are unable to meet defined SLAs.
mk408
@DaveN
written by Max Kalashnikov, July 20, 2009
The first question that popped into my mind was "how are disk spindles not infrastructure?!"

I believe I've already outlined what I believe it would take to exceed interconnect throughput with disk throughput. Are you refuting that?

What does it mean to increase I/O contention? In what unit is that measured?

Are you suggesting that latency, in this context, is something other than simply the reciprocal of IOPS?
DaveN
@mk408
written by David Noonon, July 21, 2009
Your original question was whether IOPS was a relevant metric. What I am saying is that typically it is the quantity of operations, not the size of the operations, that is most relevant. This is not true when it comes to interconnects; however, interconnects (at least with FC) are rarely the bottleneck.
I have found that typically bottlenecks follow this order:
1. Spindle count
2. Cache
3. Storage Processors
4. Interconnects
When it comes to actually assessing whether there is contention (not enough I/O to go around), it is response time that really provides the most value. Let’s say you have a sizable database and you find your response time to be averaging 30-40ms during the middle of the day after user complaints. You take a look at the MB/s throughput and see that the data LUN is only measuring 20MB/s. You know that an individual spindle can exceed this performance by far. You then take a look at IOPS and see that you are pushing 2500 IOPS because your transfer sizes are only averaging 8KB. Let’s also say that the read/write ratio is 80/20 and your RAID set is configured as a RAID-10. If you are on a traditional mid-range array that produces (for instance) an average of 150-160 IOPS per spindle at ~5-8ms, and you only have 14 spindles allocated instead of 20, then that is likely an issue. If you did have 20 allocated but were no longer achieving the IOPS/spindle you once were (prior to loading up the storage array with more enclosures perhaps), then you would start to take a look at cache and storage processor utilization.
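
The arithmetic in that example can be sketched as follows. This is a minimal sketch using the figures quoted above (2500 IOPS, 8KB transfers, 80/20 read/write, 150-160 IOPS per spindle); it assumes the usual RAID-10 write penalty of 2 (each host write lands on both mirrors) and ignores cache effects, and the function name is illustrative.

```python
RAID10_WRITE_PENALTY = 2  # assumption: every host write is written to both mirrors

def backend_iops(frontend_iops, read_pct, write_penalty):
    """Disk-level IOPS needed to serve a host workload, ignoring cache hits."""
    reads = frontend_iops * read_pct / 100
    writes = frontend_iops * (100 - read_pct) / 100
    return reads + writes * write_penalty

frontend = 2500
print(frontend * 8 / 1000)  # 20.0 MB/s at 8 KB per transfer (taking 1 MB = 1000 KB)

needed = backend_iops(frontend, read_pct=80, write_penalty=RAID10_WRITE_PENALTY)
print(needed)               # 3000.0 disk-level IOPS

for spindles in (14, 20):
    # Capacity range at 150 to 160 IOPS per spindle.
    print(spindles, spindles * 150, spindles * 160)
# 14 2100 2240
# 20 3000 3200
```

Under these assumptions, 14 spindles fall short of the 3000 disk-level IOPS required, while 20 spindles just cover it, even though 20MB/s is far below what the same disks can stream sequentially.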
It is possible to go too far the other way and get so hung up on spindle count that you forget about everything else. I’ve seen several environments where they are using well over 50% of the storage processor capacity during peak times. People often forget about the goal of being able to sustain a storage processor failure without significant performance degradation.
mk408
@DaveN
written by Max Kalashnikov, July 21, 2009
My original question is whether IOPS is more relevant than MB/s.

What you describe is a configuration that storage vendors love to sell people on, but, as an end user, it's irrelevant to me. "Storage processor"? That's my RDBMS! ;) It knows nothing of LUNs, only block devices, so that's what I want to measure.

I'm certainly not on any "traditional" array, which provides some number of IOPS, the very measure I'm questioning in the first place. Where is this 150-160 IOPS per spindle coming from? Calculating it backwards from 20MB/s and 8K transfers would mean that 20MB/s is still the more relevant measure. For that matter, where are the 8K transfers, at the spindle itself?

Response time as a measure of contention seems flawed to me, since one can have increasing response times with even a single sequential transfer, so long as it overstuffs some bottleneck.
