
Monday, 3 November 2014

There's No Such Thing as a Dry Run When You're Moving a Data Centre

There's no such thing as a dry run when you're moving a data centre. That may not seem sensible, but here's why. I think it's easiest to explain in one sentence:

If you do a dry run, moving a computer to a different data centre, and it works, why would you move it back?

If that still doesn't make sense, think back to the days when moving a computer involved a physical activity: unplugging the computer, putting it on a truck, and shipping it to your new data centre. Would you really propose doing a dry run of that and then, if the dry run succeeds, putting the computer back on a truck, moving it back to the old data centre, and getting it running again, only to do it "for real" some time later?

Granted, in the world of virtual computers, you don't have to physically move the computer back. However, there is still a list of activities you have to do to move a virtual computer, and each of them has to be undone. There's just as much chance you'll screw up undoing those steps as there is that you'll screw up doing them in the first place. A dry run actually increases the overall risk of the relocation.

Thursday, 27 February 2014

Relocating Another Data Centre

I recently took part in another data centre relocation project. I was one of a number of project managers moving some of the servers in a 1,300 server data centre. I moved about 200, and decommissioned another 50. I was directly planning and executing moves, so my role was different from the one I had on my previous project. It was good to experience a move from another position.

The project was successful in the end. I have to say that there were a number of lessons learned, which goes to prove that no matter how many times you do something, there's always something more to learn.

Unlike my previous experiences, there were three major organizations working together on this relocation: the customer and two IT service providers to the customer. All organizations had good, dedicated, capable people, but we all had, at a minimum, a couple of reporting paths. That in itself was enough to add complication and effort to the project.

The senior project manager identified this right from the start, and he made several good attempts to compensate for and mitigate it. We did a number of sessions to get everyone on the same page with respect to methodology. Our core team acted cohesively and we all adhered to the methodology. In fact, across the project I think it's safe to say that the front-line people did as much as they could to push toward the project goals.

Despite our best efforts, we all, across the three organizations, had to devote significant effort to satisfying our own organization's needs. It's worth noting that much of this is simply necessary -- organizational governance is a big issue in the modern economy, and appearing to have management control is a business reality.

So if you're planning a relocation, take a look at the organizational structures that will be involved, and take them into account when planning your data centre relocation project.

Sunday, 3 February 2013

Work-flow Diagram for Data Centre Relocation

I wrote here about the work-flow for planning and executing the move of a group of one or more servers from one data centre to another. Here's the picture:

Work-flow for a Data Centre Relocation

I've relocated a couple of data centres, and I've just started working on another. The last one moved over 600 servers, about half physical and half virtual. We moved over five months, counting from when the first production workload went live in the new data centre. Our team consisted of five PMs working directly with the server, network and storage admins, and application support teams.

[Update: Check out the visual representation of this post here.]

We knew we had a lot of work to do in a short time, and we were working in a diverse and dynamic environment that was changing as we tried to move it. We needed a flexible and efficient way to move the data centre. One thing that really helped was a work-flow through which the PMs engaged the various technical and user teams, letting each team focus on doing what it needed to do.

Early in the project we collected all the inventory information we could to build up a list of all the servers -- whether they were physical or virtual, make and model, O/S, etc. -- and put it in the Master Device List (MDL). We then did a high-level breakdown into work packets, or affinity groups, in consultation with the application support folks. These work packets were then doled out to the individual PMs.

Each PM then began the detailed planning process for the work packet. Starting from a template, the PM began building the relocation plan, which was simply a spreadsheet with a few tabs:
  • One tab was the plan itself: a minute-by-minute description of the tasks that had to be done, and who was responsible for doing them, over the hours immediately surrounding the relocation. Many plans also included the prerequisite tasks in the days preceding the relocation
  • Another tab was the list of servers, and the method by which they would be moved. We had a number of possible move methods, but basically they boiled down to: virtual-to-virtual (copying a virtual machine across the network), lift and shift (physically moving a server), and leap frog (copying the image from a physical server across the network to another, identical physical server)
  • The third tab was a list of contact information for everyone mentioned in the plan, along with the approvers for the hand-over to production, escalation points, and any other key stakeholders
At this point many PMs also nailed down a tentative relocation date and time for the work packet and put it in the relocation calendar, a shared calendar in Exchange. The relocation calendar was the official source of truth for the timing of relocations. Some PMs preferred to wait until they had more information. My personal preference is to nail down the date early, as you have more choice about when to move.

The PM then got the various admins to gather or confirm the key information for the server build sheet and the server IP list.

The server build sheet contained all the information needed to build the new server in the new data centre. For a virtual machine, this was basically the number and size of mounted storage volumes including the server image itself. This information was key for planning the timing of the relocation, and in the case of VMs with extra attached storage volumes, made sure that everything got moved.

For physical servers the build sheet had everything needed for a VM, plus all the typical physical server information needed by the Facilities team to assign an available rack location and to rack and connect the server in the new data centre.

The server IP list simply listed all the current IPs used by the server, and their purpose. Most of our servers had one connection each to two separate redundant networks for normal data traffic, along with another connection to the backup network, and finally a fourth connection to the out-of-band management network ("lights-out operation" card on the server). Some servers had more, e.g. for connections to a DMZ or ganging two connections to provide more throughput.
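
To make the shape of that document concrete, here's a minimal sketch (in Java, purely for illustration) of the kind of record each row held for one server. The server name, network labels and addresses below are invented; the real list was simply a shared spreadsheet.

import java.util.List;

// Hypothetical sketch of the server IP list described above.
// Every name and address here is made up for illustration.
record ServerIpEntry(String server, String network, String purpose,
                     String oldIp, String newIp) {}

public class ServerIpListSketch {
    public static void main(String[] args) {
        List<ServerIpEntry> app01 = List.of(
            new ServerIpEntry("app01", "data-a",   "redundant data network A",   "10.1.20.31", ""),
            new ServerIpEntry("app01", "data-b",   "redundant data network B",   "10.2.20.31", ""),
            new ServerIpEntry("app01", "backup",   "backup network",             "10.8.20.31", ""),
            new ServerIpEntry("app01", "oob-mgmt", "lights-out management card", "10.9.20.31", "")
        );
        // The Network team fills in newIp later; server admins apply it at relocation time.
        app01.forEach(System.out::println);
    }
}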

The PM iterated through these documents with the admins and support staff until they were ready. One thing that often changed over the course of planning was the list of servers included in the work packet. Detailed analysis often discovered dependencies that brought more servers into the work packet. Or the volume of work proved to be too much to do in the available maintenance window and the work packet had to be split into two. Or the move method turned out to be inappropriate. We encouraged this, as our goal of minimizing or eliminating downtime and risk was paramount.

When the plan was done the Facilities team took the server build sheet and arranged for the physical move and connection of servers. The Network team took the server IP list and used it to assign the new IPs, and prepare the required network configuration and firewall rules.

The network admins put the new IPs into the same server IP list sheet, which was available to everyone, so for example the server admins could assign the new IPs at the time of the relocation.

At the time of the relocation, everyone did their tasks according to the relocation plan, and the PM coordinated everything. For simple single server, single application relocations, the team typically moved and tested the server without intervention from the PM.

Finally, the Backup and Monitoring teams used the server list in the relocation plan to turn backups and monitoring off for the relocated servers at the old data centre, and to turn backups and monitoring on for the relocated servers at the new data centre.

It wasn't all roses. We had a few challenges.

We set a deadline for the PMs to have the server build sheets and server IP lists completed two weeks before the relocation, to give time for the Facilities team to plan transport and workloads for the server room staff, and for the Network team to check all the firewall rules and ensure that the new configuration files were right. We often missed that deadline, and were saved by great people in the Facilities and Network teams, but not without a lot of stress to them.

There was some duplication of information across the documents, and it could be tedious to update. As an old programmer, I had to stop myself several times from running off and building a little application in Ruby on Rails to manage the process. But we were a relocation project, not a software development project, so we sucked it up and just worked with the tools we had.

In summary, we had a repeatable, efficient work-flow that still allowed us to accommodate the unique aspects of each system we were moving. We needed five key documents:
  • Master device list (MDL), a single spreadsheet for the whole project
  • Relocation calendar, a single shared calendar in Exchange
  • Relocation plan, per work packet
  • Server build sheet, per server, or per work packet with a tab per server
  • Server IP list, a single document for the whole project (which grew as we went)
The PMs were working with various teams that knew how to do, and were very efficient at, certain repeatable tasks:
  • Communicating outages to the user base (Communication Lead)
  • Moving a physical server and connecting it in the new data centre, or installing a new server as a target for an electronic relocation of a physical server (Facilities team)
  • Moving a virtual machine or a physical machine image, and its associated storage (Server and Storage team)
  • Reconfiguring the network and firewall for the relocated servers, including DNS changes (Network team, although for simple moves the server admin often did the DNS changes)
  • Acceptance testing (Test Lead who organized testing)
  • Changing backups and monitoring (Backup team and Monitoring team)

Sunday, 30 September 2012

Long Fat Networks

Long fat networks (LFNs) are high-bandwidth, high-latency networks. "High latency" is relative here, meaning high compared to a LAN.

I ran into the LFN phenomenon on my last data centre relocation. We moved the data centre from head office to a site 400 km away, for a round-trip latency of 6 ms. We had a 1 Gbps link. We struggled to get a few hundred Mbps out of large file transfers, and one application had to be kept back at head office because it transferred large files back and forth between the client machines at head office and its servers in the data centre.

I learned that you can calculate the maximum throughput you can expect to get over such a network. The key number is the "bandwidth delay product" (BDP), calculated as the bandwidth times the round-trip latency. One way to interpret the BDP is as the maximum window size for sending data, beyond which you'll see no performance improvement.

For our 1 Gbps network with 6 ms latency, the BDP was 750 KB. Most TCP stacks in the Linux world implement TCP window scaling (RFC1323) and would quickly auto tune to send and receive 750 KB at a time (if there was enough memory available on both sides for such a send and receive buffer).
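
As a sanity check on that 750 KB figure, here's the arithmetic as a tiny Java sketch, using the numbers from this post:

public class BdpSketch {
    public static void main(String[] args) {
        double bandwidthBitsPerSec = 1_000_000_000.0;               // 1 Gbps link
        double rttSeconds = 0.006;                                  // 6 ms round trip
        double bdpBytes = bandwidthBitsPerSec * rttSeconds / 8.0;   // convert bits to bytes
        System.out.printf("BDP = %.0f KB%n", bdpBytes / 1000.0);    // prints: BDP = 750 KB
    }
}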

The SMB 1.0 protocol, used by just about anything you would be doing on pre-Vista Windows, is limited to 64 KB blocks. That is far below optimal for an LFN. Windows Vista and later use SMB 2.0, which can use larger block sizes when both ends speak it. Samba 3.6 is the first version of Samba to support SMB 2.0.
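
The same arithmetic shows why a 64 KB limit hurts on this link. Assuming, simplistically, that the sender pushes one 64 KB block per round trip and then waits for the acknowledgement, the best you can hope for is roughly:

public class SmbWindowSketch {
    public static void main(String[] args) {
        double windowBytes = 64 * 1024;                      // SMB 1.0 block size
        double rttSeconds = 0.006;                           // 6 ms round trip
        double bytesPerSecond = windowBytes / rttSeconds;    // one block in flight at a time
        System.out.printf("~%.0f Mbps%n", bytesPerSecond * 8 / 1_000_000.0);  // roughly 87 Mbps on a 1 Gbps link
    }
}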

We were a typical corporate network in late 2011 (read: one with lots of Windows machines), so we were likely to suffer the effects of an LFN.

Note that there's not much you can do about it if your source and destination machines can't both handle large window sizes. The key factor is the latency, and the latency depends on the speed of light. You can't speed that up.

We had all sorts of fancy WAN acceleration technology, and we couldn't get it to help. In fact, it made things worse in some situations. We never could explain why it was actually worse. Compression might help in some cases, if it gets you more bytes passing through the window size you have, but it depends on how compressible your data is.

(Sidebar: If you're calculating latency because you can't yet measure it, remember that the speed of light in fibre is only about 60 percent of the speed of light in a vacuum, 3 X 10^8 m/s.)
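
For example, a rough estimate for this project's 400 km move, using the 60 percent figure above and ignoring the routers and switches along the path:

public class FibreLatencySketch {
    public static void main(String[] args) {
        double distanceMetres = 400_000.0;       // 400 km one way
        double speedInFibre = 0.6 * 3.0e8;       // about 60% of c, per the sidebar above
        double oneWayMs = distanceMetres / speedInFibre * 1000.0;
        // Prints roughly 2.2 ms one way, 4.4 ms round trip; real equipment adds the rest.
        System.out.printf("one way %.1f ms, round trip %.1f ms%n", oneWayMs, 2 * oneWayMs);
    }
}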

There are a couple of good posts that give more detail here and here.

Sunday, 22 January 2012

Know What You're Building

"Know what you're building" seems like an obvious thing to say, but I don't think we do it that well in IT. For my recent data centre relocation project, we applied that principle successfully to a couple of areas. The network lead wrote up exactly what he was building, and the storage lead listed out every device he needed. But we never did a complete "final state" description of the new data centre.

It all worked pretty well, although we needed a number of meetings during the design phase of our new data centre -- laying out the racks, non-rack equipment, power, cabling for the networks. I think we needed to have a lot of meetings because there isn't a commonly accepted way to draw a plan of a data centre that covers the requirements of all the people in the room.

I'm running into the issue again in a smaller way now that we're designing the new central communication room for the equipment that used to be in the old data centre, but needs to remain behind for local operations (mostly the network gear to service a large office building).

Just as a refresher, here are all the people you need to involve:

  • The server team(s) know the physical dimensions of the servers, their weight, how many network ports they have and how they need to be configured, whether they need SAN-attached storage, backup requirements, how much power and cooling the server needs
  • The network team(s) know the network devices, which have most of the same requirements as servers, the approach for connecting, which defines the need for cables and patch panels, and the cabling, which may affect weight of cable trays or floor loading
  • The storage team(s) know the switching devices, which have most of the same requirements as the network devices
  • The electrical engineer or consultant needs to know all the power requirements and placement of all the equipment
  • The mechanical engineer or consultant needs to know the cooling requirements and placement of all the equipment
  • The structural engineer or consultant needs to know the weight and placement of all the equipment
  • The trades who actually build it all need to know exactly what they're building
  • There's likely some other poor person, maybe a building architect, who has to pull this all together

Add to all that the fact that the technology in a data centre is constantly changing, at least in terms of the number and type of servers in the room. Also, the requirements and constraints tend to be circular: For example, the number of network ports on a server affects the amount of network gear you need, which affects how many servers you can have (either through port capacity or rack space), which affects how much power and cooling you need but also how many network ports you need.

You also have to worry about other details that can seriously derail an otherwise great plan. For example, when running fibre, you need to make sure it's the right kind of fibre and that it has the right connectors. Power cables in a data centre can be varied, so again you need to make sure that the power distribution units (PDUs) in the racks can be connected to your servers.


With all this, it can be hard for people to come to an agreement on what to build. We don't have well-established ways of describing what's going to be built in a way that everyone understands. There's software to help do this, but it tends to be unreasonably expensive for a medium-sized enterprise.

Regardless of how hard or expensive it is, there's a lot of value in figuring out what you're going to build before you build it. We were successful using Excel and Word to describe what to build, along with drawings of floor plans. We had to be extremely careful about versions and about keeping the different documents in sync. Happily, in the end it all worked out.

Saturday, 10 December 2011

Relocating Data Centres in Waves

I've never had to relocate a data centre in one big bang. You hear stories about organizations that shut down all the computers at 5:00 PM, unplug them, move them, and have them up by 8:00 AM the next morning, but I've never done that.

The big bang approach may still be necessary sometimes, but you can mitigate a lot of risk by taking a staged approach, moving a few systems at a time.

Conventional wisdom on the staged data centre relocation is to move simpler systems, and test and development systems, first. This lets you tune your relocation processes and, particularly if you're moving into a brand new data centre, work the kinks out of the new data centre.

It sounds great in theory. In practice, we ran into a few wrinkles.

I'd say the root of the wrinkles was our environment: we had a lot of applications integrated through various tools, and a large J2EE platform running a lot of custom applications. Also, even though we had some months to do the relocation in waves, we didn't have unlimited time. On top of that, business cycles meant that some systems had to be moved at certain times within the overall relocation period.

The net result is that we ended up moving some of the most complicated systems first. At least we were only moving the development and test environments. Even so, it turned out to be quite a challenge. We were slammed with a large workload when people were just learning the processes for shipping and installing equipment in the new data centre. The team pulled it off quite well, but it certainly increased the stress level.

I don't think there's much you can do about this. If your time lines force you to move complicated systems first, so be it. The lesson I take away is to identify early in planning if I have to move any complicated environments. On this project, I heard people right from the start talking about certain environments, and they turned out to be the challenging ones. We focused on them early, and everything worked out well.

Karma and Data Centre Relocations

We're pretty much done with the current project: relocating 600 servers to a new data centre 400 km from the old one. By accident more than by design, we left the move of most of the significant Windows file shares to the last month of the relocation period.

Windows file shares are known to be a potential performance issue when you move your data centre away from a group of users who are used to having the file shares close to them. We're no exception: A few applications have been pulled back to the old data centre temporarily while we try to find a solution to the performance issues, and we have complaints from people using some desktop tools that don't work nicely with latency.

The lucky part is that we'd developed lots of good karma by making the rest of the relocation go well. Now that we have issues, people are quite tolerant of the situation and are at least willing to let us try to fix the problems. I won't say they're all happy that we've slowed their work, but at least we don't have anyone screaming at us.

I'd go so far as to say this should be a rule: All other things equal, move file shares near the end of a relocation project.

Sunday, 27 November 2011

The Java Gotcha for Data Centre Relocations

Way back in time, someone thought it would be a good idea for the Java run-time to cache DNS look-ups itself. Once it has an IP address for a name, it doesn't look up the name again for the duration of the Java run-time process.

Fast forward a decade, and the Java run-time is the foundation of many web sites. It sits there running, and caches DNS lookups as long as the web site is up.

On my current project, we're changing the IP address of every device we move, which is typical for a data centre relocation. We have a number of Java-based platforms that are well integrated (read: interconnected) with the rest of our environment, and we're finding we have to take an outage to restart those platforms far too often.

In hindsight, it would have been far simpler to change the Java properties that control DNS caching before we started moving anything: run that way for a while in the old environment to be sure there are no issues (issues are highly unlikely, but better safe than sorry), then start moving and changing IPs of other devices, knowing your Java-based applications will automatically pick up the changes you make in DNS.

In case the link above goes stale, the four properties you want to look at are:

networkaddress.cache.ttl
networkaddress.cache.negative.ttl
sun.net.inetaddr.ttl
sun.net.inetaddr.negative.ttl

Look them up in your Java documentation and decide which caching option works best for you. (Normally I'd say how to set the parameters, but I've never done Java and I fear I'd say something wrong.)
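
That said, here's a hedged sketch of one common way to do it: set the JDK security properties early in application start-up, before the first lookup happens. The TTL values below are placeholders, not recommendations; check your own Java documentation before relying on them.

import java.security.Security;

public class DnsCacheSketch {
    public static void main(String[] args) {
        // Cache successful lookups for 60 seconds instead of indefinitely
        // (0 disables caching entirely; -1 caches forever).
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Don't hold on to failed lookups for long either.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        // ... start the rest of the application only after these are set
    }
}

The sun.net.inetaddr.* pair can usually be set as JVM system properties instead (for example, -Dsun.net.inetaddr.ttl=60 on the command line), which may be easier when you don't control the application's code.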

Sunday, 20 November 2011

Data Centre Relocation Gotchas

Here are a couple of gotchas we ran into while relocating a medium-size data centre:

  • When restarting a server in its new location, it decided to do a chkdsk. Unfortunately, the volume was a 10 TB SAN LUN. Fortunately, we had a long weekend to move that particular server, so we could wait the almost two days it took for the chkdsk to run. (I don't know why the server decided to do chkdsk. Rumour has it we didn't shut down the server cleanly because a service wouldn't stop.) 
  • A website tells me to run "fsutil dirty query c:" to see if chkdsk is going to run on the C: drive the next time the system boots
  • On Linux, here are a couple of ways to make sure you won't have an fsck when you restart the server
  • We were frequently burned by the Windows "feature" to automatically add a server to DNS when the server starts up. Either we'd get DNS changes when we weren't ready for them, or we'd get the wrong changes put into DNS. For example, servers that have multiple IPs on one NIC, where only one of the IPs should have been in DNS
Here's a short checklist for turning off and moving a server:

  • Check to see if the server is going to check file system consistency on the next startup (chkdsk or fsck)
  • Shut the server down cleanly
  • If it's a physical server, shut it down and then restart it. Rumour has it that the hard drive can freeze up if the server hasn't been stopped in a long while. Better to find that out before you move it than after. This has never happened to me
  • Do a host or nslookup after starting the server to make sure your DNS entries are correct. Make sure the entry is correct and that you have the right number of entries (usually one)

Friday, 11 November 2011

Running Over the WAN After Relocating a Data Centre

My current data centre relocation has us moving the data centre about 400 km away from its current location. This has resulted in a total round-trip change in latency of 6 ms. We implemented WAN acceleration in certain locations to address the issue, and this post is about the lessons we've learned in the process.

We have offices all over the province, so not everyone sees the 6 ms change in latency as a negative. Many users are now closer to the data centre than they were before, and we always had users who had worse than 6 ms latency to our data centre. That gave us a lot of confidence that everything would be fine after the relocation.

However, the old data centre location was the head office, so a large number of users are now experiencing latency where they never did before, including senior management. Most of the remote sites were much smaller than head office.

The one or two issues we'd had until recently were due to our phased approach to moving. In one case we had to move a shared database server without moving all the application servers that used it. After the move, we had to do a quick move of one application server, because we discovered it just couldn't live far from its database server.

That changed recently. Like many organizations, we have shared folders on Windows file shares. Windows file shares are generally considered a performance risk for data centre relocations when latency changes. In preparation, we implemented WAN acceleration technology.

We moved the main file share, and by about 10 AM we were getting lots of calls to the help desk about slow performance. After an hour or two of measuring and testing, we decided to turn off WAN acceleration to improve the performance. Indeed, the calls to the help desk stopped after we turned off the WAN acceleration.

Analysis showed that the Windows file share was using SMB signing. SMB signing not only prevents the WAN accelerator from doing its job, but the number of log messages being written by the WAN accelerator may have actually been degrading performance to worse than an un-accelerated state.

So we turned off SMB signing, and tried again a few days later. No luck. Around 9:30 AM we started to get lots of calls, and again we turned off the WAN acceleration. We're lucky that performance is acceptable even without WAN acceleration (for the time being -- we'll need it soon).

We're still working this issue, so I don't know what the final solution is. I'll update this post when I know.

A non-technical lesson learned: If I were to implement WAN acceleration again, I'd get all the silos in a room in the planning stages, before I even bought anything. I'd make the network people, Windows administrators, and storage administrators understand each other's issues. I would have the WAN accelerator vendor and the storage device vendor at the table as well. And I'd make everyone research the topic using Google so we could find out what issues other people ran into.

Oh, and one final lesson learned: Bandwidth hasn't been an issue at all. In this day and age, 1 Gbps WAN connections are within the reach of a medium-sized organization's budget. We're finding 1 Gbps is more than enough bandwidth, even with the large data replication demands of our project. And those demands will go away once the data centre is fully relocated.

Friday, 30 September 2011

Network Team for Data Centre Relocations

I had a real "Doh" moment this week. We're about 80 percent of the way through relocating a 500 server data centre. Things have been going pretty well, but right from the start we've found we were under-staffed on the network side. We have a pretty good process in place, with what I think is just the right amount of documentation. The individuals working on the network team are excellent. We even brought in one more person than we had planned for, but we're still struggling with burnout.

The light bulb went on for me a few days ago: Here, like other IT organizations of similar size and purpose, we have about ten times as many server admins as we do network admins. We have a bigger pool of people on the server side to draw from when we organize the relocations, which typically happen overnight or on the weekend. While we rotate the network guys for the nights and weekends, it's a smaller pool. They do shifts more often, and they more often have to come in the next day, because they have to plan for the next relocation on their list and it's coming soon.

Another factor is that, in our case, the network team started intense work earlier than everyone else. We're occupying a brand new cage in a data centre, so the network team had to build the whole data centre network first. They did three weeks of intense work before we moved any servers at all.

As we hit six months of intense work for the network team, the strain is showing. We're going to try to rearrange some work and delay what we can. Other than that, we'll probably have to suck up the remaining 20 percent of the project somehow.

In the future, I'm not sure what I'd do. One approach, if I had enough budget, would be to hire a couple of contract network admins well in advance. Tell them they're going to work Wednesday to Sunday, and often at night. Train them up ahead of time so they're as effective as your in-house people. Then give most of the nasty shifts to the contractors.

What would you do?

(If you're looking for numbers, we have three full-time network engineers, and we're drawing on the operational pool from time to time.)

Saturday, 24 September 2011

The Data Centre Relocation Calendar

I'm past the half-way point in relocating a 500-server data centre. The servers are a real variety -- a typical medium-scale business data centre. We're using mostly internal resources, supplemented by some contractors like myself with data centre relocation experience.

I chose not to come in and impose a relocation methodology on everyone. There are a lot of reasons for that, some of which were out of my control. Rather than using a methodology out of the can, I tried to foster an environment where the team members would build the methodology that worked for them.

This turned out to be quite successful. One of the items that emerged fairly late for us, but was very successful, was a shared calendar in Microsoft Outlook/Exchange. (The tool wasn't important. It could have been done in Google Calendar. Use whatever your organization is already using.)

The shared calendar contained an event for every relocation of some unit of work -- typically some service or application visible to some part of the business or the public. Within the calendar event we put a high-level description of what was moving, the name of the coordinator for that unit of work, and hyperlinks to the detailed planning documents in our project shared folders. The calendar was readable by anyone in the corporation, but only my team members could modify it.

What struck me about the calendar was how it organically became a focal point for all sorts of meetings, including our weekly status meeting. Without having to make any pronouncements, people just began to put the calendar up on the big screen at the start of most of our meetings. We could see at a glance when we might be stretching our resources too thin. The chatter within the corporation that we weren't communicating enough diminished noticeably.

Based on my experience, I'd push for a calendar like this much earlier in the project. We built it late in our project because I had a lot of people on my team who were reluctant to talk about dates until they had all the information possible. We got so much value in having the calendar that I think it's worth it to make a calendar early in the planning stage, even if the dates are going to change.