Friday, 27 April 2007

Virtualization: There's Gotta be a Catch

Virtualization solves lots of problems for many if not most organizations that have more than a rack of servers. On an earlier assignment I calculated a worst-case saving of C$380 per month for a virtual server over a physical server (using ESX 2.5 and 3.0 from VMWare). But there's a catch to virtualization, and that catch is backups.

Virtualization introduces wrinkles into your backup approach. Fortunately, when you're starting out you can probably keep doing your backups the same way you always have. The backup wrinkles are not enough to stop you from embarking on virtualization.

Here are some of the things you need to watch for as you add virtual machines (VMs) to your virtualization platform:
  • Do you have highly tuned start and stop times for your backup jobs, for example because of dependencies between your backup jobs and external events?
  • Do the servers you plan to virtualize hold file server-like data, in other words, lots of small files that mostly don't change?
  • If you had a fire in your data centre today, before virtualizing, how soon would you have to have all the servers rebuilt?
  • Is your backup infrastructure really only being used to half capacity or less?
If you have highly tuned backup schedules, you need to consider how virtualization may disrupt them. Your backup performance may actually improve early on. This happens when you virtualize a lot of old servers that have slow disks and/or network cards. Your virtualization platform probably has gigabit networking and may be attached to fast disk (e.g. a fibre channel SAN). The solution is simple: watch your backup jobs as you virtualize physical servers and make adjustments as needed.

As you add VMs to your infrastructure, you may run into decreasing backup performance. The reason: many servers today are at their busiest during their backup. You may be able to run 20 VMs comfortably on one physical server, but if you try to back up all those VMs at once you'll run into bottlenecks, because the physical server has only so many network interfaces and all the data is coming from the same storage device, or at least through the same storage interface. Again, the solution is to watch backup performance as you virtualize and make adjustments.
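
To make the bottleneck concrete, here's a back-of-the-envelope estimate. The VM count, data sizes and throughput figures below are illustrative assumptions, not measurements from any real environment:

    # Rough backup-window estimate for VMs sharing one physical host.
    # All figures are illustrative assumptions.
    vm_count = 20            # VMs on one physical server
    avg_gb_per_vm = 60       # average data backed up per VM, in GB
    nic_gbps = 1.0           # one gigabit NIC shared by all the VMs
    nic_efficiency = 0.6     # realistic sustained fraction of line rate

    total_gb = vm_count * avg_gb_per_vm
    throughput_mb_per_s = nic_gbps * 1000 / 8 * nic_efficiency  # ~75 MB/s
    hours = total_gb * 1024 / throughput_mb_per_s / 3600

    print(f"{total_gb} GB through one shared NIC: about {hours:.1f} hours")

Twenty physical servers, each with its own network card, could have run those same backups in parallel; on one host they serialize behind the shared network and storage interfaces, which is where the backup window starts to stretch.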

Be aware that you might have to make changes to your backup infrastructure to deal with real challenges of backup performance introduced by virtualization. If your backups are already a problem, you might want to look into this in more detail. (The problems and solutions are beyond the scope of this post.)

How long do you have to rebuild servers after a data centre fire? You may not even have thought of the problem (don't be embarrassed; many of us haven't). With virtualization you have to think about it, because an equivalent event is more likely to happen: the storage device that holds all your VMs may lose its data, and you're faced with rebuilding all your VMs. I have second-hand experience (the guys down the street, so to speak) with storage devices eating all the VMs, but I've never directly known anyone who had a serious fire in the data centre.

If the backups on your physical servers can be restored to bare metal, then you don't have to worry about your storage device eating the VMs. You may have to make some changes to your bare-metal backup -- I have no experience with that topic so I don't know for sure -- but once you do you should be able to restore your VMs relatively quickly.

If you can't or don't have backups that can be restored to bare metal, then you have a challenge. I doubt that most general purpose data centres are full of identically configured servers, with detailed rebuild procedures and air-tight configuration management so every server can be rebuilt exactly like the one that was running before. If you had to rebuild 200 VMs from installation disks, you'd probably be working a lot of long nights.

If most of the servers you plan to virtualize have database-like data (large files that change every day), I'd recommend changing your backup approach for those servers to a product like ESX Ranger, or looking at some of the user-built solutions on the Internet. These products back up the entire virtual machine every time they run, and may not allow restores of individual files within the VM. However, for a database server you're probably backing up the whole server every night anyway, so that won't be a significant change to your backup workload.

If you want to virtualize file server-like servers, there isn't really a good solution that I, or anyone I know, has found at this time. If your backup infrastructure has enough room to take the additional load, simply back up with ESX Ranger or one of the other solutions once a week (or however frequently you do a full backup), alongside your current full and incremental backup schedule. If you have to rebuild the VM, you restore the most recent ESX Ranger backup first. If you only have to restore files on the server -- because a user deleted an important document, for example -- use the regular backups.
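
Here's a sketch of what that combined approach might look like for one file server-like VM. The schedule days are placeholders, not a recommendation of specific timing:

    # Illustrative combined backup plan for a file server-like VM.
    # Days and tools are placeholders; use whatever your shop already runs.
    backup_plan = {
        "weekly image":      {"tool": "ESX Ranger (whole-VM backup)",
                              "when": "Saturday night",
                              "use for": "rebuilding the VM after losing the datastore"},
        "weekly full":       {"tool": "existing file-level backup agent",
                              "when": "Sunday night",
                              "use for": "restoring individual files"},
        "daily incremental": {"tool": "existing file-level backup agent",
                              "when": "Monday to Friday nights",
                              "use for": "restoring files changed that day"},
    }

    for job, details in backup_plan.items():
        print(f"{job}: {details['tool']} on {details['when']}")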

If you have the budget to change backup infrastructures, ESX Ranger can probably provide a pretty good overall solution. However, you still have to provide backup and restore for physical servers, so the staff who do restores have to be able to deal with two backup systems.

One final gotcha that I've run across: there are some great devices out there from companies like Data Domain that provide excellent compression of exactly the type of data you're backing up when you back up an entire VM. Unfortunately, ESX Ranger compresses the data too, and data that's already compressed leaves little for the storage device's compression to work with. Whatever solution you put together, make sure your vendor commits to performance targets based on the entire solution, not on individual products.

As with so much of what we do in IT, it's really hard to summarize everything in a way that makes sense in a blog post. Comment on this post if you'd like more details or reasons why I make the recommendations I make.

Wednesday, 25 April 2007

The Real World is a Surprising Place

Some recent real-world research (and commentary) shows that the quality and price of a disk drive have far less impact on its lifespan than conventional wisdom suggests. SCSI, Fibre Channel and SATA disks all fail at roughly the same rate. And the MTBF that many manufacturers claim for their drives is not supported by the study, either.

The findings of the paper lead to some very significant considerations for the IT manager:
  • You need to pay the overhead for a RAID level that can survive a disk failure while the disk array is rebuilding itself after an earlier disk failure. RAID 5 isn't good enough. The study found that disk failures are not random, and that if a disk fails in one of your disk arrays, there's a good chance that another disk will soon fail in the same disk array. NetApp, for example, addresses this problem, but it means that 7 TB of raw disk space turns into about 4.9 TB of usable disk space (at least for a FAS 3020). That's 70 percent usable. (A rough sketch of where the capacity goes follows this list.)
  • Plan for disk failures during the entire lifetime of your storage devices. Disks fail far more often than the manufacturer's data would suggest, and they also fail much like any other mechanical device: the longer they've been on, the more likely they are to fail. You can't assume that a four-year refresh cycle will keep you free of disk failures. The idea that disks either fail in the first few months or after several years of use is not supported by real world observations.
  • Don't believe everything your vendor, or conventional wisdom, tells you. This isn't a recommendation of the paper, by the way, but to me it's such an obvious conclusion that it needs to be said. It's also so obvious that I'm sure you're thinking, "Well, yeah." However, not believing your vendor is a pretty significant thing to actually commit to. Most IT managers don't have the luxury of testing everything their vendors tell them. The topic is big enough to merit a post of its own. (Interestingly, a staggering number of the comments to Robin Harris' commentary on the paper were along the lines of "the results of the paper must be wrong because everyone knows blah, blah, blah." Never underestimate the power of religion, even if that religion is an adherence to a particular technology.)
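
To show how raw capacity turns into usable capacity on a dual-parity array, here's a rough sketch. The disk counts, group size, spares and reserve below are assumptions chosen to roughly reproduce the 7 TB-to-4.9 TB figure mentioned above; they are not NetApp's actual configuration rules:

    # Rough usable-capacity estimate for a dual-parity disk array.
    # All sizing figures are illustrative assumptions.
    disk_count = 28        # e.g. two shelves of 14 disks
    disk_tb = 0.25         # 250 GB disks -> 7 TB raw
    raw_tb = disk_count * disk_tb

    spares = 2             # hot spares kept outside the RAID groups
    group_size = 13        # disks per dual-parity RAID group
    parity_per_group = 2   # dual parity survives a second failure mid-rebuild

    groups = (disk_count - spares) // group_size
    data_disks = groups * (group_size - parity_per_group)
    fs_reserve = 0.10      # filesystem/snapshot reserve

    usable_tb = data_disks * disk_tb * (1 - fs_reserve)
    print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.2f} TB "
          f"({usable_tb / raw_tb:.0%} of raw)")

The point isn't the exact breakdown; it's that surviving a second disk failure during a rebuild costs real capacity, and you should budget for it up front.
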
The authors of the paper cite some possible reasons for these perhaps surprising findings. One is that disk life may depend far more on the conditions the disk operates under than on the quality of the disk itself. Desktop disks may fail more often simply because they tend to be in nastier environments than server disks, which typically sit in a nice, clean, environmentally controlled data centre. You may have multiple disk failures in a disk array in the data centre because the room got a bit warm when you were testing fail-over to the backup air conditioning units, for example.

One reason cited for more failures in the field than the data sheets would suggest is that customers may have more stringent criteria for declaring a drive failed than the manufacturer does. One manufacturer reported that almost half of the drives returned to it had no problem. However, the paper reports failure rates at least four times the data sheet rates, so even if half of the reported failures were not real drive faults, the observed rate would still be at least double the data sheet rate. That alone doesn't explain away the difference between data sheet and observed MTBF.

As an aside, I find it rather interesting that manufacturers of disks would simply accept that half of the returns are of non-defective drives. They're implying that their customers are stupid at least half the time. Perhaps they need to consider how they qualify a disk as being failed. People don't usually take down critical systems and do hardware maintenance on a whim. They had a good reason to suspect a drive failure.

Finally, I think the paper gives some hope that we might see more studies based on real-world observations. The authors were able to collect statistically significant data from a relatively small number of sites, due in part to the rise of large data centres with lots of hardware in them. As things like Software as a Service and large ISPs make centralized IT infrastructure more common, it may actually become easier to collect, analyze and publish real-world observations about the performance of IT infrastructure. That would help manufacturers and IT managers alike.

Sunday, 22 April 2007

The Case for SAN Part II

When I did a total cost of ownership calculation for SAN-attached disk arrays during a recent assignment, the biggest factor in favour of SAN was the utilization factor. With direct-attached disk on servers, you typically over-allocate the disk by a wide margin, because adding disk later is a pain. With SAN-attached storage you can pop some more disks into the disk array and, depending on the operating system, increase the storage available to a server relatively easily.

Therefore, if you manage your SAN-attached storage on a just-in-time basis, you can achieve perhaps 80 percent utilization of the disk, whereas in a typical group of servers using direct-attached storage you might have 20 percent utilization. That four-to-one difference in utilization is significant.

Earlier I calculated that there's roughly a ten-to-one price difference between consumer disk and top-of-the-line SAN disk in at least one manufacturer's offerings. The four-to-one utilization advantage goes some way toward closing that gap, but not all the way. And to chip away further at the utilization argument, a lot of the disks that contribute to the 20 percent figure are the system disks on small servers. These days you can't get much less than 72 GB on a system disk, and most servers need far, far less than that. My technical experts recommend that you don't boot from SAN, so those system disks will stay poorly utilized even after you install a SAN.
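
Putting those two numbers together gives a rough cost per usable terabyte. The raw prices are the round figures from these posts and the utilization rates are the estimates above, so treat this as a sketch rather than a quote:

    # Cost per usable TB: direct-attached vs SAN-attached disk.
    # Prices and utilization rates are the rough figures from these posts.
    das_cost_per_raw_tb = 400      # C$ per raw TB, consumer-class disk
    san_cost_per_raw_tb = 4000     # C$ per raw TB, at the ~10:1 premium
    das_utilization = 0.20
    san_utilization = 0.80

    das_per_used_tb = das_cost_per_raw_tb / das_utilization   # C$2,000
    san_per_used_tb = san_cost_per_raw_tb / san_utilization   # C$5,000

    print(f"DAS: C${das_per_used_tb:,.0f} per used TB")
    print(f"SAN: C${san_per_used_tb:,.0f} per used TB")
    print(f"premium shrinks from 10:1 raw to "
          f"{san_per_used_tb / das_per_used_tb:.1f}:1 per used TB")

Even with generous assumptions about just-in-time provisioning, the SAN still costs about two and a half times as much per terabyte actually used.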

I'm not making much of a case for SAN, am I?

Friday, 13 April 2007

Decision Making Without All the Facts

When I read my previous post, I got the feeling that I was saying you needed an objectively measurable financial benefit in order to justify using SAN-attached storage. All my experience in IT has shown that you make no progress if you wait for every question to be answered. So reading my post got me thinking: how do I decide when I have enough information to move forward?

That reminded me of another recent project I led, where we set up a VMWare farm, then over the course of half a year virtualized over 130 physical servers and created about 70 new virtual servers. With minimal research we had established that VMs were C$380 per month cheaper, based on hard, quantifiable data (floor space lease, power consumption, server lease, and server maintenance). Across those roughly 200 servers, that's a savings of about C$76,000 per month.

On top of that, you have all the harder-to-quantify benefits that the VMWare sales reps will tell you about: faster deployment, higher availability, etc. The nice thing is, you don't need to count those up when you have such an obvious measurable benefit. In fact, even if we had discovered that server management effort had gone up, we could have hired another server admin for almost a year (fully loaded cost) with what we saved in a month by virtualizing.
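
The arithmetic, laid out (the fully loaded cost of a server admin below is a placeholder figure, not the actual number from that project):

    # The savings arithmetic from the VMWare project described above.
    # The admin cost is an illustrative placeholder.
    virtualized = 130            # physical servers converted to VMs
    new_vms = 70                 # new servers built as VMs instead of hardware
    saving_per_vm_month = 380    # C$/month: floor space, power, lease, maintenance

    monthly_saving = (virtualized + new_vms) * saving_per_vm_month
    print(f"C${monthly_saving:,} per month")     # C$76,000

    admin_cost_per_year = 90_000                 # fully loaded, placeholder
    months_of_admin = monthly_saving / admin_cost_per_year * 12
    print(f"one month's saving buys about {months_of_admin:.0f} months of a server admin")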

A lot of what we deal with in IT seems to lean in the other direction: The easy-to-count numbers actually argue against a new approach. The value has to be in the intangible benefits. What I'm exploring with the SAN storage case is how to deal with intangible value.

Wednesday, 11 April 2007

The Case for SAN Part I

One pays more than ten times as much for storage on a big SAN-attached disk array as one does for direct-attached storage (DAS). A raw terabyte of disk to put in a desktop computer costs about C$400. Storage on a SAN-attached mid-range disk array from one manufacturer costs about C$17,000 per raw TB. Storage on another manufacturer's low-end disk array costs about C$7,000 per raw TB. (Those are prices for health care, so a business could expect to pay a fair bit more.) And for SAN-attached storage you also have to pay for the SAN network itself, which can cost about C$3,500 per port for quality equipment.
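
The fabric cost is easy to overlook, so here's a rough sketch of what it can add per terabyte. The server count, port counts and array capacity are illustrative assumptions, not figures from the assignment above:

    # What the SAN fabric itself adds to the per-TB price.
    # Server count, ports and array capacity are illustrative assumptions.
    servers = 30
    ports_per_server = 2       # dual HBAs for redundant fabric paths
    array_ports = 8            # array controller connections to the fabric
    cost_per_port = 3_500      # C$ per port, from the figure above

    fabric_cost = (servers * ports_per_server + array_ports) * cost_per_port
    array_raw_tb = 20          # raw capacity shared by those servers

    print(f"fabric: C${fabric_cost:,} in total, or about "
          f"C${fabric_cost / array_raw_tb:,.0f} per raw TB")

With these assumptions the fabric alone adds roughly C$12,000 per raw TB on top of the array's C$7,000 to C$17,000.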

Of course, you don't buy a big, SAN-attached disk array for reduced cost per TB. You buy it for some combination of:
  • lower total cost of ownership
  • reliability
  • availability
  • manageability
  • performance
  • the brute capacity to store multiple terabytes.
However, ten to one is an awfully steep premium to pay for those advantages. Before you make that significant investment, you need to evaluate whether you really need those advantages, and in some cases whether you're really going to realize the benefits.

A few examples will show what I mean.

How about availability? I'm aware of a disk array manufacturer that doesn't support a hot firmware upgrade for one series of disk arrays (they say you can do it, but you have to sign a waiver for any data loss during the upgrade). The upgrade itself takes very little time, but you still need to shut down all the servers that connect to that disk array, and then start them up and test them after the upgrade. If you have fifty servers connected to the disk array, you're looking at an hour or more to shut them all down, and typically more than that to start them all up again. Suddenly your uptime numbers don't look so good anymore. And heaven help you if the users of those fifty servers have different preferred downtime windows, as was the case in my experience.
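
To put that in availability terms, here's a quick calculation. The three-hour outage is an illustrative assumption based on the shutdown and startup times above:

    # How one firmware-upgrade outage shows up in the annual uptime numbers.
    # The outage duration is an illustrative assumption.
    hours_per_year = 365 * 24
    outage_hours = 3          # shut down ~50 servers, upgrade, restart, test

    uptime = 1 - outage_hours / hours_per_year
    print(f"{uptime:.3%} uptime")   # about 99.97%

    # A 99.99% target allows less than an hour of downtime per year,
    # so this single planned outage blows the whole budget on its own.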

Reliability? In one year, in a population of four disk arrays of the same model, there were three major unplanned outages, including one with significant data loss. (The data was eventually recovered from backup.) From sources inside the manufacturer of that disk array, I've heard that they've had a data loss event in almost one percent of their installations.

In a population of two disk arrays from another manufacturer, one required periodic reboots from the day it was turned on until the firmware was upgraded. Why did one work fine and the other not, in similar although not identical environments? The reality is, disk arrays are big, complicated pieces of IT and there's ample opportunity for software defects to manifest themselves in production use.

So far I'm not making much of a case for SAN-attached storage. I believe that one challenge with SAN is that it's sold as the solution to all problems, when in reality it has the potential to create two new problems for every one it solves. I think SAN-attached storage has its place, and in many cases it's the only technology that can do what we need. In follow-up posts I hope to give some guidelines that will help you realize the benefits of SAN and avoid the pitfalls.

As always, I'd like to hear from you about your experience. Please leave a comment on this blog.

Saturday, 7 April 2007

Introduction to Pragmatic IT

In this blog I'm going to write about issues that I run into when trying to implement information technology in the real world. Most of my life has been spent as a software developer, but over the last few years (and at certain times in the past) I've worked on the infrastructure side. I started in IT when an IBM S/370 (predecessor of the computer that my son Marc is standing beside) filled a room.

Throughout my career I've been cursed by a need to provide real value to my clients, whether it be in the software I was building or the infrastructure I was implementing. (I'll explain why I say "cursed" in a later posting, since this posting is mainly to test out the blog.) I've seen a lot of value left on the table or even destroyed by qualified, competent IT people who had the best intentions, and I hope to explore why in this blog.

Therefore, I hope this blog takes on the flavour of a consumer advocate for the person purchasing IT. It's aimed squarely at IT managers and anyone who makes IT buying decisions. I hope it's also interesting to the technologists who make and sell information technology. And I hope you find it interesting enough to comment on.