Showing posts with label Disk Array.
Friday, 11 May 2007
NEC Storage Devices
It's interesting that NEC's press release on their new storage devices spends so much time talking about non-disruptive upgrades and scaling across the whole product line. We had trouble upgrading and scaling our storage at Vancouver Coastal Health, and the press release validates that we're not the only ones. I don't have experience with NEC storage devices, so I can't comment on them; I'm only commenting on the press release.
Wednesday, 25 April 2007
The Real World is a Surprising Place
Some recent real-world research (and commentary) shows that the quality and price of a disk drive have far less impact on its lifespan than conventional wisdom would have it. SCSI, Fibre Channel and SATA disks all fail at roughly the same rate. And the MTBF that many manufacturers claim for their drives is not supported by the study, either.
The findings of the paper lead to some very significant considerations for the IT manager:
- You need to pay the overhead for a RAID level that can survive a disk failure while the array is rebuilding from an earlier disk failure. RAID 5 isn't good enough. The study found that disk failures are not independent: if a disk fails in one of your disk arrays, there's a good chance that another disk in the same array will fail soon after. NetApp, for example, addresses this with its dual-parity RAID-DP, but it means that 7 TB of raw disk space turns into about 4.9 TB of usable space (at least on a FAS 3020). That's 70 percent usable. (See the capacity sketch after this list.)
- Plan for disk failures during the entire lifetime of your storage devices. Disks fail far more often than the manufacturer's data would suggest, and they also fail much like any other mechanical device: the longer they've been on, the more likely they are to fail. You can't assume that a four-year refresh cycle will keep you free of disk failures. The idea that disks either fail in the first few months or after several years of use is not supported by real world observations.
- Don't believe everything your vendor, or conventional wisdom, tells you. This isn't a recommendation of the paper, by the way, but to me it's such an obvious conclusion that it needs to be said. It's also so obvious that I'm sure you're thinking, "Well, yeah." However, not believing your vendor is a pretty significant thing to actually commit to. Most IT managers don't have the luxury of testing everything their vendors tell them. The topic is big enough to merit a post of its own. (Interestingly, a staggering number of the comments to Robin Harris' commentary on the paper were along the lines of "the results of the paper must be wrong because everyone knows blah, blah, blah." Never underestimate the power of religion, even if that religion is an adherence to a particular technology.)
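To make that overhead concrete, here's a minimal sketch of the capacity arithmetic. The group size, spare count and filesystem reserve below are illustrative assumptions that happen to reproduce the 70 percent figure, not NetApp's published configuration; real arrays also lose space to right-sizing and metadata.

```python
# Rough usable-capacity arithmetic for a dual-parity disk array.
# Group size, spare count and filesystem reserve are illustrative
# assumptions, not vendor specifications.

def usable_tb(raw_tb, disk_tb, parity_per_group, group_size,
              hot_spares, fs_reserve=0.10):
    """Estimate usable space after parity, spares and filesystem reserve."""
    disks = int(raw_tb / disk_tb)
    data_pool = disks - hot_spares
    full_groups, leftover = divmod(data_pool, group_size)
    parity = full_groups * parity_per_group
    if leftover:
        parity += parity_per_group  # a partial group still pays full parity
    return (data_pool - parity) * disk_tb * (1 - fs_reserve)

# 7 TB raw as fourteen 500 GB disks, dual parity (survives a second
# failure during a rebuild), one hot spare, 10% filesystem reserve:
print(usable_tb(7.0, 0.5, parity_per_group=2, group_size=14, hot_spares=1))
# -> 4.95, close to the ~4.9 TB / 70% usable cited above
```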
A reason cited for more failures in the field than the data sheet would suggest is that customers may have more stringent test criteria than the manufacturer. One manufacturer reported that almost half of drives returned to them had no problem. However, the paper reports failure rates at least four times the data sheet rates, so that doesn't explain away the difference between data sheet and observed MTBF.
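As a back-of-the-envelope check on what "four times the data sheet rate" means, here's a minimal sketch converting a quoted MTBF into an annual failure rate. The 1,000,000-hour MTBF and the 500-drive machine room are assumptions for illustration, not figures from the paper.

```python
HOURS_PER_YEAR = 8760

def annual_failure_rate(mtbf_hours):
    """Approximate AFR implied by a data sheet MTBF (valid while AFR is small)."""
    return HOURS_PER_YEAR / mtbf_hours

datasheet = annual_failure_rate(1_000_000)  # ~0.9% per year
observed = 4 * datasheet                    # the paper's "at least four times"
# Expected failures per year in an assumed 500-drive machine room:
print(round(500 * datasheet, 1), "vs", round(500 * observed, 1))  # 4.4 vs 17.5
```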
As an aside, I find it rather interesting that manufacturers of disks would simply accept that half of the returns are of non-defective drives. They're implying that their customers are stupid at least half the time. Perhaps they need to consider how they qualify a disk as being failed. People don't usually take down critical systems and do hardware maintenance on a whim. They had a good reason to suspect a drive failure.
Finally, I think the paper gives some hope that we might see more studies based on real-world observations. The authors were able to collect statistically significant data from a relatively small number of sites, due in part to the rise of large data centres with lots of hardware in them. As Software as a Service, large ISPs and the like make centralized IT infrastructure more common, it may actually become easier to collect, analyze and publish real-world observations about the performance of IT infrastructure. That would help manufacturers and IT managers alike.
Sunday, 22 April 2007
The Case for SAN Part II
When I did a total cost of ownership calculation for SAN-attached disk arrays during a recent assignment, the biggest factor in favour of SAN was the utilization factor. With direct-attached disk on servers, you typically over-allocate the disk by a wide margin, because adding disk later is a pain. With SAN-attached storage you can pop more disks into the disk array and, depending on the operating system, grow the available storage relatively easily.
Therefore if you manage your SAN-attached storage on a just-in-time basis, you can achieve perhaps 80 percent utilization of the disk, whereas a typical group of servers using direct-attached storage might sit at 20 percent utilization. That four-to-one difference is significant, because it feeds straight into the effective price per terabyte you actually use.
Earlier I calculated that there's roughly a ten-to-one difference between consumer disk and the top-of-the-line SAN disk in at least one manufacturer's offerings. So a four-to-one utilization difference goes some way toward closing that gap, but not all the way. And to chip away further at the utilization argument, a lot of the disks that contribute to the 20 percent figure are the system disks on small servers. These days you can't get much less than 72 GB on a system disk, and most servers need far, far less than that. My technical experts recommend against booting from SAN, so you'll still have that low utilization rate even after installing a SAN.
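To put numbers on that, here's a minimal sketch combining the rough per-raw-TB prices quoted in Part I with the utilization estimates above (the C$17,000 mid-range array would come out steeper still):

```python
def cost_per_used_tb(price_per_raw_tb, utilization):
    """Effective cost of a terabyte you actually use."""
    return price_per_raw_tb / utilization

das = cost_per_used_tb(400, 0.20)         # consumer disk at 20% utilization
san = cost_per_used_tb(7_000, 0.80)       # low-end SAN array at 80% utilization
print(f"DAS: C${das:,.0f} per used TB")   # C$2,000
print(f"SAN: C${san:,.0f} per used TB")   # C$8,750
print(f"remaining gap: {san / das:.1f}x") # 4.4x -- better, but not parity
```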
I'm not making much of a case for SAN, am I?
Wednesday, 11 April 2007
The Case for SAN Part I
One pays more than ten times as much for storage on a big SAN-attached disk array as one does for direct-attached storage (DAS). A raw terabyte of disk to put in a desktop computer costs about C$400. Storage on a SAN-attached mid-range disk array from one manufacturer costs about C$17,000 per raw TB. Storage on another manufacturer's low-end disk array costs about C$7,000 per raw TB. (Those are prices for health care, so a business could expect to pay a fair bit more.) And for SAN-attached storage you also have to pay for the SAN network itself, which can cost about C$3,500 per port for quality equipment.
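For concreteness, a quick sketch of the raw ratios those prices imply (the figures are the rough C$ numbers above; fabric port costs sit on top and depend on how many terabytes you put behind each port):

```python
# Raw price ratios from the figures above (C$ per raw TB).
das, san_low, san_mid = 400, 7_000, 17_000
print(f"low-end array:   {san_low / das:.1f}x consumer disk")  # 17.5x
print(f"mid-range array: {san_mid / das:.1f}x consumer disk")  # 42.5x
# A SAN port at ~C$3,500 adds more, spread over the TB behind it.
```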
Of course, you don't buy a big, SAN-attached disk array for reduced cost per TB. You buy it for some combination of:
- lower total cost of ownership
- reliability
- availability
- manageability
- performance
- the brute capacity to store multiple terabytes.
A few examples will show what I mean.
How about availability? I'm aware of a disk array manufacturer that doesn't support a hot firmware upgrade for one series of disk arrays (they say you can do it, but you have to sign a waiver for any data loss during the upgrade). The upgrade itself takes very little time, but you still need to shut down all the servers that connect to that disk array, and then start them up and test them after the upgrade. If you have fifty servers connected to the disk array, you're looking at an hour or more to shut them all down, and typically more than that to start them all up again. Suddenly your uptime numbers don't look so good anymore. And heaven help you if the users of those fifty servers have different preferred downtime windows, as was the case in my experience.
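Here's a minimal sketch of what a single upgrade window like that does to annual uptime. The three-hour total and the four-nines target are illustrative assumptions, not figures from the incident.

```python
HOURS_PER_YEAR = 8760

def uptime_pct(downtime_hours):
    """Availability over one year given total downtime in hours."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

# Assume ~1h to shut 50 servers down, a short upgrade, and rather
# more than an hour to bring everything back and test: call it 3h.
print(f"{uptime_pct(3.0):.3f}%")  # 99.966%
print(f"four-nines budget: {HOURS_PER_YEAR * (1 - 0.9999):.2f} h/yr")  # 0.88
```

One such window by itself consumes several times a 99.99 percent downtime budget for the whole year.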
Reliability? In one year, in a population of four disk arrays of the same model, there were three significant unplanned downtimes, including one with significant data loss. (The data was eventually recovered from backup.) From sources inside the manufacturer of that disk array, I've heard that they've had a data loss event in almost one percent of their installations.
In a population of two disk arrays from another manufacturer, one required periodic reboots from the day it was turned on until the firmware was upgraded. Why did one work fine and the other not, in similar although not identical environments? The reality is, disk arrays are big, complicated pieces of IT and there's ample opportunity for software defects to manifest themselves in production use.
So far I'm not making much of a case for SAN-attached storage. I believe that one challenge with SAN is that it's sold as the solution to all problems, when in reality it has the potential to create two new problems for every one it solves. I think SAN-attached storage has its place, and in many cases it's the only technology that can do what we need. In follow-up posts I hope to give some guidelines that I think will help you realize the benefits of SAN and avoid the pitfalls.
As always, I'd like to hear from you about your experience. Please leave a comment on this blog.