Saturday, 10 November 2007

A Tangled Web We Weave

I'm currently dealing with people who want to buy a technology that's flexible and can do all sorts of things they don't currently need it to do. This is slowing down the project and causing them a lot of stress.

The technology we buy today will be obsolete in five years (if it lasts that long). So by the time you want to do something more with the existing technology, you'll be able to do it better and cheaper with whatever is on the market at that time, not with what you bought three years ago. Why should you pay extra for features that you're not going to use until the equipment is halfway or more through its useful life?

If your organization is of any reasonable size, the real cost of doing something new with technology is the training and organizational change, not the capital cost of the equipment. You may want to take advantage of those features you paid for two years ago, but you still can't afford to because of the training costs for staff.

The idea that we have to buy for future unknown needs, not just what we need now, is pervasive in IT. Perhaps it's because we're always overpromising what we can deliver. When we deliver and the customer is unhappy, we beat ourselves up for not buying (that is, not overspending on) all the extra features so that we'd have the one the user really wants. It's time to stop overpromising.

Saturday, 27 October 2007

An Observation on Blogs and Podcasts

I've come quite late to the Web 2.0 world. About a year and a half ago I started to listen to podcasts while walking my dog. One of the things I discovered is how differently I react to people when I hear their voices in podcasts.

When I read what Steve McConnell and Joel Spolsky write, I have trouble getting to the end of their articles because they seem to be so wrong about so much. (I'll explain why below.) However, when I heard podcasts by them, they made a lot more sense. I don't know why, but there's something about verbal communication, even when the person isn't present, that somehow helps me hear the whole message in the right context.

For example, when I read Joel's stuff about how to manage software teams, I think he's out to lunch because what he recommends would be impossible to implement for 99.9 percent of working software development managers out there. I'm sure he's said as much in his writing, but it wasn't until I heard a podcast by him that I really heard how much he admits that his is a special case. As soon as I heard that, my opinion of him as a person changed and I was able to read and listen to him in a whole different way.

With McConnell, I've always felt that his experience, research and analysis of software development was staggeringly good, which made the fact that he draws absolutely the wrong conclusions from his knowledge all the more maddening. I forget what it was about the podcast that softened my opinion of him, but I do remember quite well finishing the podcast and thinking that, while his conclusions are still wrong, I have much more respect for him than I did from his writings.

Sunday, 21 October 2007

Mac Joke

I'm part-way through a podcast by Guy Kawasaki where he recounts a joke the Apple II team had in the early days of the Macintosh: How many Mac team members does it take to screw in a light bulb? One: He just holds the light bulb and waits for the universe to revolve around him.

The podcast is good, too. At least so far.

Saturday, 20 October 2007

How Long Will People Put Up With Us?

Vancouver Coastal Health is gearing up to make sure that no one has problems with meetings scheduled in Outlook when we switch back to standard time from daylight saving time. In the leadership group for Pharmacy, about eight senior managers, the executive assistants have spent at least a full person-day, if not more, changing the subject lines of meetings to include the intended meeting time, as recommended by Microsoft.

This is an office productivity tool?

I know DST changes and calendaring applications aren't easy. You can find lots of discussion on the web about the challenges. In this case, we seem to have put the responsibility for dealing with the complexity on the users, rather than figuring it out and giving the users a solution. But do we really think we can expect our users to put up with this twice a year forever?

I believe if you handled the DST rule change in March 2007, you shouldn't have to do anything else. However, IT organizations seem to think otherwise. Are they just covering their butts?

In one sense I don't blame the IT staff at an organization for being a bit reluctant to try to optimize the process. Take a look at the Microsoft knowledge base topic on the DST change and Outlook. The table of contents fills my entire screen top to bottom, and I use a small font.

So what should IT departments do? One thing you can do is be brave and not tell the users to do anything special. Then, when the complaints come in about meetings being wrong, go out and fix the computers that didn't get the timezone update, or that didn't run the Outlook timezone fix tool. Sure, the affected users will think you're a jerk because their calendars were wrong once. But you know what? All your users already think you're a jerk twice a year because you expect them to do all sorts of manual work-arounds.

Friday, 31 August 2007

This is What I Was Afraid Of

Part of the reason I started blogging was that I was "between contracts", as we say. I never seemed to have time to write when I was working full time and trying to have a life. Sure enough, I've posted three times, including this one, since I got my current contract.

Who cares? Well, one of the reasons we get things so wrong in IT is that technology doesn't do what we were told it does. One of the great things about the Internet is that it's given us access to people who are actually using technology, so we can solve problems faster. Blogging, however, demands a certain amount of time, and that time comes out of the time you spend actually doing.

The bottom line: There's useful stuff in blogs, but you have to filter out the rantings from the useful information yourself.

iSCSI vs. Fibre Channel - You're Both Right

Reading another article from an expert who provides less-than-useful information has finally prompted me to try to provide useful guidance for IT managers responsible for 50 to 1,000 diverse servers running a variety of applications.

iSCSI vs. fibre channel (FC) is a classic technology debate, with two camps bombarding each other mercilessly with claims that one or the other is right. The reason the debate is so heated and long-lived is that there isn't a right answer: there are different situations in which each one is better than the other. Here's how to figure out what's best for you:

Start with the assumption that you'll use iSCSI. It's less expensive, so if it does what you need, it should be your choice. It's less expensive at all levels: The switches and cables enjoy the economy of scale of the massive market for IP networking. You already have staff who know how to manage IP networks. You already have a stock of Cat 6 cables hanging in your server rooms or network closets.

If you have mostly commodity servers, they transfer data to and from direct-attached storage at less than gigabit speeds. Gigabit iSCSI is fine. If you have a lot of servers, you have to size the switches correctly, but you have to do that with FC as well, and the FC switch will be more expensive. Implement jumbo frames so backups go quickly.
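
As a rough sanity check, here's a back-of-envelope sketch of whether gigabit iSCSI covers a commodity server and its nightly backup. The throughput and data-volume numbers are assumptions; substitute your own measurements.

    # Rough sizing check for gigabit iSCSI on one commodity server.
    # All figures below are assumptions for illustration; measure your own.
    link_gbps = 1.0                   # gigabit Ethernet
    usable_fraction = 0.8             # protocol overhead; jumbo frames help you stay near this
    effective_mb_per_s = link_gbps * 1000 / 8 * usable_fraction   # about 100 MB/s

    measured_peak_mb_per_s = 45       # what the server actually sustains to DAS today
    nightly_backup_gb = 300           # data moved during the backup window
    backup_hours = nightly_backup_gb * 1000 / effective_mb_per_s / 3600

    print("Headroom over measured peak: %.1fx" % (effective_mb_per_s / measured_peak_mb_per_s))
    print("Backup window needed: %.1f hours" % backup_hours)

If the headroom is comfortably above one and the backup fits in your window, gigabit iSCSI is probably fine for that server.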

Just because you're using iSCSI doesn't mean you're running your storage network over the same cables and switches as your data centre LAN. In fact, you probably aren't. The cost saving doesn't come from sharing the existing LAN, it comes from the lower cost per port and the reduced people cost (skill sets, training, availability of administrators in the labour market) of using the same technology. As long as your storage and general-purpose networks are not sharing the same physical network, a lot of the criticisms of iSCSI evaporate.

If you have large, specialized servers that can sustain, and actually need, high data transfer rates, then definitely look at FC. Be sure you've measured (not just guessed) that you need those data transfer rates.

If you have a large farm of physical servers running a huge number of virtual machines (VMs), look at FC. My experience is that virtual machine infrastructures tend to be limited by RAM on the physical servers, but your environment may be different. You may especially want to think about how you back up your VMs. You may not need the FC performance during the day, but when backups start, watch out. It's often the only time of day when your IT infrastructure actually breaks a sweat.
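
To see why backups are where a VM farm hurts first, here's a minimal sketch of the aggregate demand on one physical host when all of its guests back up at once. The VM count, per-VM backup rate and NIC count are all assumptions.

    # Aggregate backup demand on one physical host running many VMs.
    # All figures are illustrative assumptions.
    vms_per_host = 20
    per_vm_backup_mb_per_s = 30       # what each backup agent could stream on its own
    gig_nics = 2
    host_capacity_mb_per_s = gig_nics * 100.0   # roughly 100 MB/s usable per gigabit NIC

    demand_mb_per_s = vms_per_host * per_vm_backup_mb_per_s
    print("Demand: %d MB/s, host capacity: %d MB/s" % (demand_mb_per_s, host_capacity_mb_per_s))
    if demand_mb_per_s > host_capacity_mb_per_s:
        # Stagger the backup jobs, add interfaces, or consider FC for the storage path.
        print("Backups will queue behind the host's NICs and shared storage path.")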

You might look at an FC network between your backup media servers and backup devices, especially if you already have an FC network for one of the reasons above.

Yes, FC will give you higher data transfer rates, but only if your servers and storage devices can handle it, and few today go much beyond one gigabit. FC will guarantee low latency, so your servers won't do their equivalent of "Device not ready. Abort, Retry, Ignore?"

The challenge for an IT manager, even (or especially) those like me who have a strong technical background, is that it's easy to get talked into spending too much money because you might need the performance or low latency. The problem with that thinking is that you spend too much money on your storage network, and you don't have the money left over to, for example, mirror your storage, which may be far more valuable to your business.

A final warning: neither technology is as easy to deal with as the vendor would have you believe (no really?). Both will give you headaches for some reason along the way. If it wasn't hard, we wouldn't get the big bucks, would we?

Wednesday, 18 July 2007

That's Protection

David de Leeuw reports in the Risks Digest that he got this (you'll have to click on the picture. I couldn't get blogger to show it readably):

Thursday, 14 June 2007

Advances in User Experience

Here's another interesting article along the lines of, "How much have computers advanced the user experience?" A 1986 Mac Plus beats an AMD dual-core PC 9 tests to 8 over 17 different tests of Word and Excel performance. The article is rather tongue in cheek, but the perspective -- what does the user experience -- is one that we forget far too often.

One thing the article ignores is that the Mac Plus cost way more money in absolute dollars. Never mind that a 1986 dollar bought way more than a dollar today. Those who have written low-level code can also appreciate how much more work it is to maintain 32 bits of colour for each pixel, instead of one bit for the Mac's black and white display. (Personally I'm still waiting for a Vista theme that looks like a 1981 green screen monitor.)

Does this post contradict what I said here? You be the judge.

Friday, 8 June 2007

Don't Master Your Tools

Ever notice how using the full power of Microsoft Office can make the team less productive? Perhaps I exaggerate, but think about these examples:
  • I use style sheets in Word to get a consistent appearance from my documents, and to allow me to easily adapt documents to my clients' preferred formats. However, when someone else uses my documents they don't use the style sheet. The formatting becomes inconsistent in mysterious ways, and you very rapidly get to a state where you can't rely on the format being right unless you review the whole document. But using a word processor is supposed to be about easy changes, isn't it?
  • I can make a nice, easy to maintain financial model in Excel, but someone who doesn't know how I do things will probably break it within the first dozen edits or so. Often a broken spreadsheet isn't obviously broken, which can lead to disastrous consequences. I worked on a project once where the program director had to sneak into the office of a VP at Canada Post and retrieve a document because the financial numbers were wrong by a factor of two thanks to a spreadsheet error.
I'm talking here about working on teams of IT professionals. If we don't know and use our tools productively and consistently, who will? To be clear, I don't think it's a training or education issue. If there were a clear benefit to everyone in learning how to use Word or Excel at the power level in a team context (not individually, for things we only use ourselves), we'd already be doing it.

For word processing, I think the Web has shown something very interesting: You can communicate a lot with a very small set of styles. Look at the tags that you can usually use within a blog post and you'll see that there aren't many. I wonder if there's a market for a super-small word processor that has 10 buttons to do all formatting? (Sticking with the HTML-based idea, I'd say 10 buttons for formatting, but you're also allowed to have a front-end that gives the user access to different classes in a cascading style sheet.)
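
As a thought experiment, here's a tiny sketch of what that could look like: ten structural choices, each mapped to an HTML tag or to a class in a style sheet the organization controls. The button names and the CSS class are invented for illustration.

    # Hypothetical "ten button" formatting set; names and classes are made up.
    BUTTONS = {
        "Heading":    ("<h2>", "</h2>"),
        "Subheading": ("<h3>", "</h3>"),
        "Paragraph":  ("<p>", "</p>"),
        "Bold":       ("<strong>", "</strong>"),
        "Emphasis":   ("<em>", "</em>"),
        "Bullet":     ("<ul><li>", "</li></ul>"),
        "Numbered":   ("<ol><li>", "</li></ol>"),
        "Quote":      ("<blockquote>", "</blockquote>"),
        "Code":       ("<pre>", "</pre>"),
        "Note":       ('<p class="client-note">', "</p>"),  # the style sheet decides how this looks
    }

    def render(button, text):
        # Every formatting decision the user can make is one of ten structural choices.
        open_tag, close_tag = BUTTONS[button]
        return open_tag + text + close_tag

    print(render("Heading", "Don't Master Your Tools"))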

I think the financial modelling problem is different. At one level, it might involve some simple changes like making it harder to accidentally change a formula back to a fixed value. At another level, it's about letting people express what they want to do, not how to do it. That's starting to sound like the old idea that someday we'd just write requirements and the CASE tool would generate the code automatically, so I'd better shut up. :-)

Tuesday, 29 May 2007

Google and Healthcare IT

Google is showing interest in becoming the provider of people's electronic health record. It's an interesting idea, but the ramifications of the U.S. Patriot Act are probably unacceptable to the vast majority of people who think about the privacy of their health record.

Basically, the Patriot Act gives the U.S. Government the right to look at any data in any computer in the United States, without telling anyone they're looking at it. In fact, it's a crime to tell anyone their data was looked at. That basically violates the privacy legislation of any country that has such a thing: you have to give permission before anyone looks at your data, and you have to be informed if someone looks at your data for any reason.

In healthcare IT here in Canada we now live with the fact that healthcare data about Canadians can't be stored in or even pass through the United States. At least we can build an electronic health record. We just have to keep the data in Canada. Thanks to the Patriot Act, Americans may miss out on a great chance to improve their health.

(There might be an opportunity for enterprising Canadians to host the U.S. electronic health record, but if I recall correctly HIPAA says you can't store U.S. healthcare data outside the U.S.)

Saturday, 26 May 2007

Anti-Economies of Scale

You must have noticed that IT is full of situations where the economy of scale is actually negative. Unlike retail, the bigger you get, the more everything costs per unit. For example: An off-site backup costs nothing when a trusted employee can just take a tape (or jump drive) home. No need for special backup networks or LTO-3 tape drives either. You store your data on the cheap hard drive in your PC, and disaster planning consists of having another commodity PC available to which to restore your backup. This can all pretty much be done by any MCSE-type person that you can hire on a per-job basis.

Contrast the preceding with a larger enterprise: You have to pay Iron Mountain to pick up and store your tapes, tapes that you create from some complicated backup implementation. You need experts for the network, experts for Exchange (you need your own e-mail server, don't you?), experts to run the SAN. You need to put all this in a special room, something that has its own design challenges. You may believe you need experts on staff who know your environment to handle all this. You can't just hire commodity skills on an as-needed basis.

The advantage goes to the manager who identifies what can be handled as a commodity, and either purchases the service outright, or at least organizes the internal staff so that the "commodity" tasks and responsibilities are done by "commodity-skilled" people. This has some interesting ramifications that I'd like to address in another post.

Tuesday, 22 May 2007

Why Agile Software Development isn't More Prevalent

I believe that agile software development leads to better software faster, and therefore cheaper, and I think there's a lot of evidence to support that. So why, some six years after the Agile Manifesto and many years after people started to do agile development, aren't the practices more common?

One factor that I think shouldn't be underestimated is that agile pushes software developers and their managers out of their comfort zones way too much. Most development managers are more comfortable providing estimates up front and then reporting for most of the project that "everything seems to be on track." At the end of the release cycle you have a period of discomfort, but that discomfort is mitigated by all the effort being put into getting the team to "pull it off."

Compare that to having to say early in the development cycle, "We can already see that we're not going to get all the features in on time, so do you want to drop features (and which ones), or do we move out the date?" That's way more uncomfortable because it's likely to be met by, "No to both." Then what do you do?

For developers, it's much more comfortable to just code what the business analyst told you to code, without having to confront whether the user can actually work with the result. Moreover, why would a programmer want to produce code faster and cheaper? They get paid no matter how much work they do. Why would they want to even measure speed, when that could lead to being compared with their colleagues? It's much more comfortable to not even look at how fast the programmers are going.

Of course, successful agile organizations address these points. If you're looking at trying to make a non-agile organization more agile, you need to consider how really big a task you're setting yourself. It's hard to get people to change if they're not motivated to change, and the people who have to change to become more agile are actually motivated to do quite the opposite.

Saturday, 19 May 2007

Talk About Missing the Mark

I just read an article in the Risks Digest that contains a quote that totally misses the mark:
The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry... -- Henry Petroski
I'd say that software is barely starting to approximate the needs of the user, and if hardware can't keep up, well then those hardware engineers had better get their act together and start producing something useful. We treat Moore's Law as some kind of miracle, when in reality all we're seeing is what can happen if you let smart people solve the same problem over and over again for thirty years. They get better at it. We should be more surprised if they couldn't double processor speed every 18 months.

The real challenge of course is, as my brother quotes someone, "not what the program does, but what the user does." Or in this case, what the user can do. If the hardware doesn't support a productive user interface, then it's no more than a mildly entertaining magic trick.

Wednesday, 16 May 2007

Eggs and Baskets

One of the patterns I've seen over the years is how IT and technology in general magnifies the consequences of poor performance. We put everything on one storage device so it's easy to manage, but it's also easy to type "drop table..." in the production database. It's also very hard to get the backup just right, and harder still to be able to test restores of a production backup.

As an IT manager, you need to mitigate this. Do reviews, walkthroughs, or whatever you want to call them, to ensure that at least two pairs of eyes have evaluated your designs. Do as much testing as possible. Fight for more budget for testing. Build the simplest possible solution that meets your needs.

Another thing you can do is subscribe to the comp.risks RSS feed and read all the different ways that the best laid plans can be derailed by one little mistake.

Friday, 11 May 2007

NEC Storage Devices

It's interesting how NEC's press release on their new storage devices spends so much time talking about how they do non-disruptive upgrades and scale across the whole product line. We had trouble with upgrades and scaling our storage at Vancouver Coastal Health. The NEC press release validates that we're not the only ones. I don't have experience with NEC storage devices, so I can't comment on them. I'm only commenting on the press release.

Wednesday, 9 May 2007

Test Everything - Yeah, Right

I just read a short article that said the first thing to do to solve a lot of our IT infrastructure problems is to, "Create and maintain a test environment that mimics the production environment. Also commit to testing every change." (Emphasis from the original article.)

No problem. Since my 2,000 square foot data centre is already full of the 450+ physical and virtual servers I have in it, I'll just build another data centre and buy another 450 servers with all the associated software support and licensing costs. The lease costs alone on 2,000 square feet around here are about C$40,000 per month. I'm sure most IT managers have enough spare change in their budget to pull off that one -- not.
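
To put a number on it, here's a rough sketch of what a full test mirror would cost. The lease figure is the one above; the per-server figures are placeholder assumptions you'd replace with numbers from your own budget.

    # Back-of-envelope cost of mirroring a 450-server environment for testing.
    # The lease figure comes from this post; the per-server figures are assumed.
    servers = 450
    lease_per_month = 40000           # C$ for roughly 2,000 sq ft
    server_lease_per_month = 250      # C$ per server (assumed)
    support_per_month = 100           # C$ per server for software support/licensing (assumed)

    monthly = lease_per_month + servers * (server_lease_per_month + support_per_month)
    print("Test environment: about C$%d per month, C$%d per year" % (monthly, monthly * 12))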

On top of that, you can never test everything. When I led the virtualization of 135 physical servers, the only problems we had after going live with any of the virtual machines were caused by changing IPs (which we had to do because of the enterprise network design). We wouldn't have caught those with a test data centre. There are lots of problems similar to that one lurking in a typical IT infrastructure.

You've got to build flexible infrastructures. Make sure nothing uses hard-coded IPs. Make sure your DNS doesn't have legacy pointers hiding in corners that you forget to update. Limit the partitioning of networks by physical location as much as possible in the first place. Build commodity server infrastructures.

(P.S. We did have other problems virtualizing the 135, but we found them before we turned the VMs over to production. The only problems that were seen by users were IP-change-related.)

Tuesday, 8 May 2007

Business Impact of IT

A friend of mine, Don Gardner, sent me a link about the importance of communicating the business impact of IT projects. The article says we IT people too often report the facts (e.g. we installed an e-mail server), rather than what that does for the business.

I liked the article, but I found it a challenge to apply to the projects I was working on at the time. Part of the challenge was that my projects hadn't been designed with business impact in mind. I was simply supposed to implement some bits of IT infrastructure.

Because of the complexity of IT we have to break our work into pieces. The pieces of work are very large in terms of technical complexity and cost, but vanishingly tiny in terms of their visible connection to business needs. We IT people get the connection, but it's not obvious to anyone else.

Monday, 7 May 2007

Change Management and Project Management

I heard a good talk on Friday that identified for me a real tension between change management and project management. That wasn't the point of the talk. The presenters were just presenting change management in the context of a major health care IT initiative that's underway in BC. However, their words of guidance to project managers triggered some thoughts.

To avoid confusion: We're talking here about "change management" in the sense of how to help people change the way they work, typically with the introduction of new technology. We're not talking about change management as in the management of changes to IT infrastructure itself.

The presentation showed the "Six Human Needs" that you have to address when undertaking major workplace changes. First on the list was the need to be heard and have control. Giving people control means they might change what the project has to do. The tension is that our standard project management approach in IT hates scope change. We have mechanisms to handle scope change, but we certainly don't encourage it.

I think there are a lot of situations where this sort of creative tension actually leads to a good result: The change management people will bring good ideas back to the project team. The project manager will push back due to budget constraints, and the customer will get the "best" result.

This puts a lot of pressure on the scope change decision-making process. You have to get those trade-offs between scope and budget mostly right. In my experience, it's been really hard to get the right people to take the time to truly work through an issue and come to a reasoned conclusion on that type of issue.

I think this is an area where the IT industry's "best practices" need to be improved. As a project manager, you're pretty much between a rock and a hard place when you get the change management input. If more projects formally recognized the exploratory stage that most require before confirming budgets and scope, we'd be a lot better off. Not being a PMP, I don't know whether the Project Management Institute (PMI) has done much work on the quality of a change management process, as opposed to the existence of one. I'd be interested in hearing what they've done.

By the way, the talk was part of the BC Health Information Management Professionals Society (BCHIMPS) spring education session on Friday. The BCHIMPS spring and fall education sessions are excellent meetings for anyone involved in health care IT in BC.

Friday, 27 April 2007

Virtualization: There's Gotta be a Catch

Virtualization solves lots of problems for many if not most organizations that have more than a rack of servers. On an earlier assignment I calculated a worst-case saving of C$380 per month for a virtual server over a physical server (using ESX 2.5 and 3.0 from VMWare). But there's a catch to virtualization, and that catch is backups.

Virtualization introduces wrinkles into your backup approach. Fortunately, you're probably okay starting off by doing your backups the same way you always have. The backup wrinkles are not enough to stop you from embarking on virtualization.

Here are some of the things you need to watch for as you add virtual machines (VMs) to your virtualization platform:
  • Do you have highly-tuned start and stop times for your backup jobs, for example when you have inter-dependencies between external events and your backup jobs?
  • Do the servers you plan to virtualize hold file server-like data, in other words, a lot of small files that mostly don't change?
  • If you had a fire in your data centre today, before virtualizing, how soon would you have to have all the servers rebuilt?
  • Is your backup infrastructure really only being used to half capacity or less?
If you have highly tuned backup schedules, you need to consider how virtualization may mess up those schedules. Your backup performance may actually improve early on. This happens when you virtualize a lot of old servers that have slow disks and/or network cards. Your virtualization platform probably has gigabit network and may be attached to fast disk (e.g. a fibre channel SAN). The solution is simple: watch your backup jobs as you virtualize physical servers and make adjustments as needed.

As you add VMs to your infrastructure, you may run into decreasing backup performance. The reason: many servers today are at their busiest during their backup. You may be able to run 20 VMs comfortably on one physical server, but if you try to back up all those VMs at once you'll run into bottlenecks because the physical server has a certain number of network interfaces, and all the data is coming from the same storage device, or at least through the same storage interface. Again, the solution is to watch backup performance as you virtualize and make adjustments.

Be aware that you might have to make changes to your backup infrastructure to deal with real challenges of backup performance introduced by virtualization. If your backups are already a problem, you might want to look into this in more detail. (The problems and solutions are beyond the scope of this post.)

How long do you have to rebuild servers after a data centre fire? You may not have even thought about the problem (don't be embarrassed; many of us haven't). With virtualization you have to think about it, because an equivalent event is more likely to happen: the storage device that holds all your VMs may lose its data, and you're faced with rebuilding all your VMs. I have second-hand experience (the guys down the street, for example) with storage devices eating all the VMs, but I've never directly known anyone who had a serious fire in the data centre.

If the backups on your physical servers can be restored to bare metal, then you don't have to worry about your storage device eating the VMs. You may have to make some changes to your bare-metal backup -- I have no experience with that topic so I don't know for sure -- but once you do you should be able to restore your VMs relatively quickly.

If you can't or don't have backups that can be restored to bare metal, then you have a challenge. I doubt that most general purpose data centres are full of identically configured servers, with detailed rebuild procedures and air-tight configuration management so every server can be rebuilt exactly like the one that was running before. If you had to rebuild 200 VMs from installation disks, you'd probably be working a lot of long nights.

If most of the servers you plan to virtualize have database-like data (large files that change every day), I'd recommend changing your backup approach for those servers to a product like ESX Ranger, or looking at some of the user-built solutions on the Internet. These products back up the entire virtual machine every time they run, and may not allow individual file (within the VM) restores. However, for a database server you're probably backing up the whole server every night anyway, so that won't be a significant change to your backup workload.

If you want to virtualize file server-like servers, there isn't really a good solution that I or anyone I know has found at this time. If your backup infrastructure has enough room to take the additional load, simply back up with ESX Ranger or one of the other solutions once a week (or however frequently you do a full backup), along with your current full and incremental backup schedule. If you have to rebuild the VM, you restore the most recent ESX Ranger backup first. If you just have to restore files on the server, because a user deleted an important document, for example, just use the regular backups.

If you have the budget to change backup infrastructures, ESX Ranger can probably provide a pretty good overall solution. However, you have to provide backup and restore for physical servers as well, so the staff who do restores have to be able to deal with two backup systems.

One final gotcha that I've run across: There are some great devices out there from companies like Data Domain that provide excellent compression of exactly the type of data you're backing up when you back up an entire VM. Unfortunately, ESX Ranger compresses the data too, which messes up the storage device's compression. Whatever solution you put together, make sure your vendor commits to performance targets based on the entire solution, not on individual products.

As with so much of what we do in IT, it's really hard to summarize everything in a way that makes sense in a blog post. Comment on this post if you'd like more details or reasons why I make the recommendations I make.

Wednesday, 25 April 2007

The Real World is a Surprising Place

Some recent real-world research (and commentary) shows that the quality and price of a disk drive have far less impact on its lifespan than conventional wisdom would have it. SCSI, Fibre Channel and SATA disks all fail at roughly the same rate. And the MTBF that many manufacturers claim for their drives is not supported by the study, either.

The findings of the paper lead to some very significant considerations for the IT manager:
  • You need to pay the overhead for a RAID level that can survive a disk failure while the disk array is rebuilding itself after an earlier disk failure. RAID 5 isn't good enough. The study found that disk failures are not random, and that if a disk fails in one of your disk arrays, there's a good chance that another disk will soon fail in the same array. NetApp, for example, addresses this problem, but it means that 7 TB of raw disk space turns into about 4.9 TB of usable disk space (at least for a FAS 3020). That's 70 percent usable (see the sketch after this list).
  • Plan for disk failures during the entire lifetime of your storage devices. Disks fail far more often than the manufacturer's data would suggest, and they also fail much like any other mechanical device: the longer they've been on, the more likely they are to fail. You can't assume that a four-year refresh cycle will keep you free of disk failures. The idea that disks either fail in the first few months or after several years of use is not supported by real world observations.
  • Don't believe everything your vendor, or conventional wisdom, tells you. This isn't a recommendation of the paper, by the way, but to me it's such an obvious conclusion that it needs to be said. It's also so obvious that I'm sure you're thinking, "Well, yeah." However, not believing your vendor is a pretty significant thing to actually commit to. Most IT managers don't have the luxury of testing everything their vendors tell them. The topic is big enough to merit a post of its own. (Interestingly, a staggering number of the comments to Robin Harris' commentary on the paper were along the lines of "the results of the paper must be wrong because everyone knows blah, blah, blah." Never underestimate the power of religion, even if that religion is an adherence to a particular technology.)
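
To see where a number like 70 percent usable can come from, here's a rough sketch. The disk size, group layout, spare count and filesystem reserve are illustrative assumptions, not the actual FAS 3020 configuration.

    # Roughly how 7 TB raw can become about 4.9 TB usable with double parity.
    # Disk size, spares, parity layout and reserve are assumptions for illustration.
    raw_tb = 7.0
    disk_tb = 0.5                     # assume 500 GB disks
    disks = int(raw_tb / disk_tb)     # 14 disks
    spares = 1
    parity_disks = 2                  # double parity (RAID 6 / RAID-DP style)
    filesystem_reserve = 0.11         # space the array/filesystem keeps for itself (assumed)

    data_disks = disks - spares - parity_disks
    usable_tb = data_disks * disk_tb * (1 - filesystem_reserve)
    print("Usable: %.1f TB of %.1f TB raw (%.0f%%)" % (usable_tb, raw_tb, 100 * usable_tb / raw_tb))
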
The authors of the paper cite some possible reasons for these perhaps surprising findings. One of them is that disk life may depend far more on the conditions the disk operates under rather than the quality of the disk itself. Desktop disks may fail more often simply because they tend to be in nastier environments than server disks, which typically sit in a nice, clean environmentally-controlled data centre. You may have multiple disk failures in a disk array in the data centre because the room got a bit warm when you were testing fail-over to the backup air conditioning units, for example.

A reason cited for more failures in the field than the data sheet would suggest is that customers may have more stringent test criteria than the manufacturer. One manufacturer reported that almost half of drives returned to them had no problem. However, the paper reports failure rates at least four times the data sheet rates, so that doesn't explain away the difference between data sheet and observed MTBF.

As an aside, I find it rather interesting that manufacturers of disks would simply accept that half of the returns are of non-defective drives. They're implying that their customers are stupid at least half the time. Perhaps they need to consider how they qualify a disk as being failed. People don't usually take down critical systems and do hardware maintenance on a whim. They had a good reason to suspect a drive failure.

Finally, I think the paper gives some hope that we might see more studies based on real-world observations. The authors of the paper were able to collect statistically significant data from a relatively small number of sites, due in part to the rise of large data centres with lots of hardware in them. As things like Software as a Service, large ISPs, etc. make centralized IT infrastructure more common, it may actually become easier to collect, analyze and publish real-world observations about the performance of IT infrastructure. This would help manufacturers and IT managers alike.

Sunday, 22 April 2007

The Case for SAN Part II

When I did a total cost of ownership calculation for SAN-attached disk arrays during a recent assignment, the biggest factor in favour of SAN was the usage factor. With direct-attached disk on servers, you typically over-allocate the disk by a wide margin, because adding disk later is a pain. With SAN-attached storage you can pop some more disks into the array and, depending on the operating system, increase the available storage relatively easily.

Therefore, if you manage your SAN-attached storage on a just-in-time basis, you can achieve perhaps 80 percent utilization of the disk, whereas in a typical group of servers using direct-attached storage you might have 20 percent utilization. This four-to-one difference in effective price per used terabyte is significant.

Earlier I calculated that there's roughly a ten-to-one difference between consumer disk and top-of-the-line SAN disk in at least one manufacturer's offerings. So a four-to-one difference goes some way toward fixing that, but not all the way. And to chip away further at the disk usage argument, a lot of the disks that contribute to the 20 percent utilization number are the system disks on small servers. These days you can't get much less than 72 GB for a system disk, and most servers need far, far less than that. My technical experts recommend that you don't boot from SAN, so you'll still have that low utilization rate even after installing a SAN.
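
Here's that arithmetic as a quick sketch, using the rough ten-to-one raw price premium from Part I and the utilization estimates above:

    # Premium per terabyte actually used, once utilization is taken into account.
    raw_premium = 10.0                # SAN disk costs roughly ten times as much per raw TB
    das_utilization = 0.20            # typical direct-attached storage
    san_utilization = 0.80            # just-in-time managed SAN storage

    used_premium = raw_premium * das_utilization / san_utilization
    print("Premium per used TB: %.1fx" % used_premium)   # 2.5x: better, but still not parity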

I'm not making much of a case for SAN, am I?

Friday, 13 April 2007

Decision Making Without All the Facts

When I read my previous post, I got the feeling that I was saying you needed an objectively measurable financial benefit in order to justify using SAN-attached storage. All my experience in IT has shown that you make no progress if you wait for all the questions to be answered. So reading my post got me thinking, how do I decide when I have enough information to move forward?

That reminded me of another recent project I led, where we set up a VMWare farm, then over the course of half a year virtualized over 130 physical servers and created about 70 new virtual servers. With minimal research we had established that VMs were C$380 per month cheaper, based on hard, quantifiable data (floor space lease, power consumption, server lease, and server maintenance). That's a savings of C$76,000 per month.

On top of that, you have all the harder-to-quantify benefits that the VMWare sales reps will tell you about: faster deployment, higher availability, etc. The nice thing is, you don't need to count those up when you have such an obvious measurable benefit. In fact, even if we had discovered that server management effort had gone up, we could have hired another server admin for almost a year (at fully loaded cost) for what we saved in a month by virtualizing.
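
The arithmetic is simple enough to sanity-check. The VM count and per-VM saving are from this post; the admin salary figure is an assumption of mine.

    # Sanity check on the savings figures in this post.
    vms = 130 + 70                        # servers virtualized plus new virtual servers
    saving_per_vm_per_month = 380         # C$, worst case from the earlier calculation
    monthly_saving = vms * saving_per_vm_per_month
    print("Monthly saving: C$%d" % monthly_saving)        # C$76,000

    admin_cost_per_year = 90000           # C$, assumed fully loaded cost of a server admin
    months_of_admin = monthly_saving / (admin_cost_per_year / 12.0)
    print("One month of savings buys about %.0f months of admin time" % months_of_admin)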

A lot of what we deal with in IT seems to lean in the other direction: The easy-to-count numbers actually argue against a new approach. The value has to be in the intangible benefits. What I'm exploring with the SAN storage case is how to deal with intangible value.

Wednesday, 11 April 2007

The Case for SAN Part I

One pays more than ten times as much for storage on a big SAN-attached disk array as one does for direct-attached storage (DAS). A raw terabyte of disk to put in a desktop computer costs about C$400. Storage on a SAN-attached mid-range disk array from one manufacturer costs about C$17,000 per raw TB. Storage on another manufacturer's low-end disk array costs about C$7,000 per raw TB. (Those are prices for health care, so a business could expect to pay a fair bit more.) And for SAN-attached storage you also have to pay for the SAN network itself, which can cost about C$3,500 per port for quality equipment.
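
For a concrete feel, here's a sketch of the all-in numbers for a single hypothetical server. The unit prices are the rough ones above; the 2 TB requirement and the dual SAN ports are my assumptions.

    # All-in raw storage cost for one hypothetical server needing 2 TB.
    # Unit prices are the rough figures above; the 2 TB and dual ports are assumed.
    tb_needed = 2
    das_cost = tb_needed * 400                       # consumer-class direct-attached disk
    san_low_end = tb_needed * 7000 + 2 * 3500        # low-end array plus two SAN ports
    san_mid_range = tb_needed * 17000 + 2 * 3500     # mid-range array plus two SAN ports

    print("DAS: C$%d, low-end SAN: C$%d, mid-range SAN: C$%d"
          % (das_cost, san_low_end, san_mid_range))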

Of course, you don't buy a big, SAN-attached disk array for reduced cost per TB. You buy it for some combination of:
  • lower total cost of ownership
  • reliability
  • availability
  • manageability
  • performance
  • the brute capacity to store multiple terabytes.
However, ten dollars to one is an awfully steep premium to pay for those advantages. Before you make that significant investment, you need to evaluate whether you really need those advantages, and in some cases whether you're really going to realize the benefits.

A few examples will show what I mean.

How about availability? I'm aware of a disk array manufacturer that doesn't support a hot firmware upgrade for one series of disk arrays (they say you can do it, but you have to sign a waiver for any data loss during the upgrade). The upgrade itself takes very little time, but you still need to shut down all the servers that connect to that disk array, and then start them up and test them after the upgrade. If you have fifty servers connected to the disk array, you're looking at an hour or more to shut them all down, and typically more than that to start them all up again. Suddenly your uptime numbers don't look so good anymore. And heaven help you if the users of those fifty servers have different preferred downtime windows, as was the case in my experience.
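
A rough sketch of what that does to the availability numbers; the outage length and upgrade frequency are assumptions for illustration:

    # What planned firmware upgrades alone do to the attached servers' uptime.
    # Outage length and upgrade frequency are assumptions.
    servers = 50
    outage_hours = 3.0                # shut down, upgrade, restart, test
    upgrades_per_year = 2
    hours_per_year = 365 * 24

    availability = 1 - (outage_hours * upgrades_per_year) / hours_per_year
    lost_server_hours = servers * outage_hours * upgrades_per_year
    print("Availability from planned upgrades alone: %.3f%%" % (availability * 100))
    print("Server-hours of planned downtime per year: %d" % lost_server_hours)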

Reliability? In one year, in a population of four disk arrays of the same model, there were three significant unplanned downtimes, including one with significant data loss. (The data was eventually recovered from backup.) From sources inside the manufacturer of that disk array, I've heard that they've had a data loss event in almost one percent of their installations.

In a population of two disk arrays from another manufacturer, one required periodic reboots from the day it was turned on until the firmware was upgraded. Why did one work fine and the other not, in similar although not identical environments? The reality is, disk arrays are big, complicated pieces of IT and there's ample opportunity for software defects to manifest themselves in production use.

So far I'm not making much of a case for SAN-attached storage. I believe that one challenge with SAN is that it's sold as the solution to all problems, when in reality it has the potential to create two new problems for every one it solves. I think SAN-attached storage has its place, and in many cases it's the only technology that can do what we need. In follow-up posts I hope to give some guidelines that I think would help you realize the benefits of SAN and to avoid the pitfalls.

As always, I'd like to hear from you about your experience. Please leave a comment on this blog.

Saturday, 7 April 2007

Introduction to Pragmatic IT

In this blog I'm going to write about issues that I run into when trying to implement information technology in the real world. Most of my life has been spent as a software developer, but over the last few years (and at certain times in the past) I've worked on the infrastructure side. I started in IT when an IBM S/370 (predecessor of the computer that my son Marc is standing beside) filled a room.

Throughout my career I've been cursed by a need to provide real value to my clients, whether it be in the software I was building or the infrastructure I was implementing. (I'll explain why I say "cursed" in a later posting, since this posting is mainly to test out the blog.) I've seen a lot of value left on the table or even destroyed by qualified, competent IT people who had the best intentions, and I hope to explore why in this blog.

Therefore, I hope the flavour of this blog is that of a sort of consumer advocate for the person purchasing IT. It's aimed squarely at IT managers and anyone who makes IT buying decisions. I hope it's also interesting to the technologists who make and sell information technology. I hope you find it interesting enough to comment on.