Sunday, 27 November 2011

The Java Gotcha for Data Centre Relocations

Way back in time, someone thought it would be a good idea for the Java run-time to cache DNS look-ups itself. Once it has an IP address for a name, it doesn't look up the name again for the duration of the Java run-time process.

Fast forward a decade, and the Java run-time is the foundation of many web sites. It sits there running, and caches DNS lookups as long as the web site is up.

On my current project we're changing the IP address of every device we move, which is typical for a data centre relocation. We have a number of Java-based platforms, and they're well integrated (read: interconnected) with the rest of our environment, so we're finding we have to take an outage to restart the Java-based platforms far too often.

In hindsight, it would have been far simpler to change the Java property that disables DNS caching. Run that way for a while in the old environment to be sure there are no issues (highly unlikely, but better safe than sorry). Then you can start moving and changing the IPs of other devices, knowing your Java-based applications will automatically pick up the changes you make in DNS.

In case the link above goes stale, the four properties you want to look at are:

networkaddress.cache.ttl
networkaddress.cache.negative.ttl
sun.net.inetaddr.ttl
sun.net.inetaddr.negative.ttl

Look them up in your Java documentation and decide which caching option works best for you. (Normally I'd say how to set the parameters, but I've never done Java and I fear I'd say something wrong.)
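That said, from what I've read (verify this against your own Java version's documentation before relying on it), the networkaddress.* names are security properties that normally live in the JRE's java.security file, while the sun.net.* names can be passed as system properties on the command line. The TTL values are in seconds, and 0 means don't cache at all. A rough sketch:

# in the JRE's java.security file (e.g. jre/lib/security/java.security):
networkaddress.cache.ttl=0
networkaddress.cache.negative.ttl=10

# or as system properties on the java command line (yourapp.jar is just a placeholder):
java -Dsun.net.inetaddr.ttl=0 -Dsun.net.inetaddr.negative.ttl=10 -jar yourapp.jar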

Sunday, 20 November 2011

Data Centre Relocation Gotchas

Here are a couple of gotchas we ran into while relocating a medium-sized data centre:

  • When restarting a server in its new location, it decided to do a chkdsk. Unfortunately, the volume was a 10 TB SAN LUN. Fortunately, we had a long weekend to move that particular server, so we could wait the almost two days it took for the chkdsk to run. (I don't know why the server decided to do chkdsk. Rumour has it we didn't shut down the server cleanly because a service wouldn't stop.) 
  • A website tells me to run "fsutil dirty query c:" to see if chkdsk is going to run on the C: drive the next time the system boots.
  • On Linux, there are a couple of ways to make sure you won't have an fsck when you restart the server (see the sketch after this list).
  • We were frequently burned by the Windows "feature" that automatically adds a server to DNS when the server starts up. Either we'd get DNS changes when we weren't ready for them, or we'd get the wrong changes put into DNS -- for example, servers that have multiple IPs on one NIC, where only one of the IPs should have been in DNS.
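In case that page goes stale too, here's a rough sketch of the Linux side, assuming an ext3/ext4 filesystem on /dev/sda1 (substitute your own device): tune2fs shows how close you are to a forced check, and can disable the mount-count and interval triggers so the reboot at the new site doesn't surprise you.

# see the current mount count, maximum mount count, and check interval
sudo tune2fs -l /dev/sda1 | grep -i -e 'mount count' -e 'check'

# disable the automatic checks (or run fsck yourself during the outage window)
sudo tune2fs -c 0 -i 0 /dev/sda1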
Here's a short checklist for turning off and moving a server:

  • Check to see if the server is going to check file system consistency on the next startup (chkdsk or fsck)
  • Shut the server down cleanly
  • If it's a physical server, shut it down and then restart it before the move. Rumour has it that the hard drive can freeze up if the server hasn't been stopped in a long while; better to find that out before you move it than after. (This has never happened to me.)
  • Do a host or nslookup after starting the server, as shown below, to make sure your DNS entries are correct: the right address, and the right number of entries (usually one).
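For example (the server name is made up), run from a machine at another site:

# both should return the new address, and only the records you expect
host moved-server.example.com
nslookup moved-server.example.com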

Friday, 11 November 2011

Running Over the WAN After Relocating a Data Centre

My current data centre relocation has us moving the data centre about 400 km from its current location, which adds about 6 ms to the round-trip latency. We implemented WAN acceleration in certain locations to address the issue, and we've learned some lessons in the process. Those lessons are what this post is about.

We have offices all over the province, so not everyone sees the 6 ms change in latency as a negative. Many users are now closer to the data centre than they were before, and we always had users who had worse than 6 ms latency to our data centre. That gave us a lot of confidence that everything would be fine after the relocation.

However, the old data centre location was the head office, so a large number of users are now experiencing latency where they never did before, including senior management. Most of the remote sites were much smaller than head office.

The one or two issues we'd had until recently were due to our phased approach to moving. In one case we had to move a shared database server without moving all the application servers that used it. Afterwards, we had to do a quick move of one application server, because we discovered it just couldn't live far from its database server.

That changed recently. Like many organizations, we have shared folders on Windows file shares. Windows file shares are generally considered a performance risk for data centre relocations when latency changes. In preparation, we implemented WAN acceleration technology.

We moved the main file share, and by about 10 AM we were getting lots of calls to the help desk about slow performance. After an hour or two of measuring and testing, we decided to turn off WAN acceleration to improve performance. Indeed, the calls to the help desk stopped after we turned off the WAN acceleration.

Analysis showed that the Windows file share was using SMB signing. SMB signing not only prevents the WAN accelerator from doing its job, but the number of log messages being written by the WAN accelerator may have actually been degrading performance to worse than an un-accelerated state.
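If you're wondering where that setting lives: SMB signing is controlled by Group Policy ("Microsoft network server: Digitally sign communications (always)"), which maps to registry values under the LanmanServer Parameters key (and LanmanWorkstation on the client side). A quick, read-only way to see what a server is enforcing:

reg query HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters /v RequireSecuritySignature
reg query HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters /v EnableSecuritySignature

A value of 1 for RequireSecuritySignature means signing is enforced. Changing it is a security policy decision, so get your Windows administrators involved before touching it.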

So we turned off SMB signing, and tried again a few days later. No luck. Around 9:30 AM we started to get lots of calls, and again we turned off the WAN acceleration. We're lucky that performance is acceptable even without WAN acceleration (for the time being -- we'll need it soon).

We're still working this issue, so I don't know what the final solution is. I'll update this post when I know.

A non-technical lesson learned: If I were to implement WAN acceleration again, I'd get all the silos in a room in the planning stages, before I even bought anything. I'd make the network people, Windows administrators, and storage administrators understand each other's issues. I would have the WAN accelerator vendor and the storage device vendor at the table as well. And I'd make everyone research the topic using Google so we could find out what issues other people ran into.

Oh, and one final lesson learned: Bandwidth hasn't been an issue at all. In this day and age, 1 Gbps WAN connections are within the reach of a medium-sized organization's budget. We're finding 1 Gbps is more than enough bandwidth, even with the large data replication demands of our project. And those demands will go away once the data centre is fully relocated.

Living with Virtualization

In 2006, I was project manager on a VMware implementation for a health care organization. We virtualized 200 servers in six weeks, after a planning phase of about 2 months. Out of that experience I wondered, "Did virtualization have anything to offer a smaller business?" So I set up a box at home and converted my home "data centre" into a virtualized data centre using VMware's Server product, which was the free product at the time.

After five years it's been an interesting experience and I've learned a lot. At the end of the day, I'm pretty convinced that a small business with a few servers running in a closet in their office doesn't have a lot to gain from virtualizing within the "closet". (I'm still a big fan of virtualization in a medium or large organization.) I'm going to switch back to running all the basic services I need (backup, file share, DNS, DHCP, NTP) on a single, non-virtualized server.

I had one experience where the VM approach benefited me: As newer desktops and laptops came into the house, the version of the backup client installed on them by default was newer than the backup master on my backup server (I use Bacula). Rather than play around with installing and updating different versions of the backup client or master, I simply upgraded the backup master VM to a new version of Ubuntu and got the newer version of Bacula. I didn't have to worry about what other parts of my infrastructure I was going to affect by doing the upgrade.

The downside was that I spent a lot of time fooling around with VMware to make it work. Most kernel upgrades required a recompile of the VMware tools on each VM, which was a pain. I also spent a fair bit of time working through an issue with timekeeping on the guests versus the VMware host that periodically caused my VMs to slow to a crawl.

Connecting to the web management interface and console plug-in always seemed to be a bit of a black art, and it got worse over time. At the moment, I still don't think modern versions of Firefox can connect to a running VM's console, so I have to keep an old version around for when I need to do something with a VM's console (before ssh comes up).

My set-up wasn't very robust in the face of power failures. When the power went off, the VMs would leave their lock files behind. Then, when the power came back, the physical machine would restart but the VMs wouldn't. I would have to go in by hand and clean up the lock files. And often I wouldn't even know there'd been a power failure, so I'd waste a bit of time trying to figure out what was wrong. I should have had a UPS, but that wouldn't solve all the instances where something would crash leaving a lock file behind.
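The cleanup itself was simple enough: VMware leaves the stale locks as .lck files and directories beside the .vmx and .vmdk files, so something like the following would find and remove them. (The datastore path here is just the VMware Server default; adjust it for wherever your VMs actually live, and only do this while the VMs are stopped.)

# list stale lock files/directories left after an unclean shutdown
find "/var/lib/vmware/Virtual Machines" -name '*.lck'

# remove them, but only when you're sure none of the VMs are running
find "/var/lib/vmware/Virtual Machines" -name '*.lck' -exec rm -rf {} +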

All in all, and even if I had automated some of that, the extra level of complexity didn't buy me anything. In fact, it cost me a lot of time.

Some of these problems would have been solved by using the ESX family of VMware products, but the license fees guarantee that the economics don't work for a small business.

I originally started out planning to give Xen a try, but it turned out not to work with the current (at the time) version of Ubuntu. Today I would try KVM. I played around with it a bit last year and it looked fine for a server VM platform. I needed better USB support, so I switched to VirtualBox. VirtualBox worked fine for me to run the Windows XP VM I used to need to run my accounting program, but it has the free version/enterprise version split that makes me uncomfortable for business use.

So my next home IT project will be to move everything back to a simpler, non-virtualized platform. I'll still keep virtualization around for my sandbox. It's been great to be able to spin up a VM to run, say, an instance of Drupal to test upgrades before rolling them out to my web site, or to try out Wordpress, or anything else I need to try.

My blog posts about interesting steps along the virtualization road are here.


Wednesday, 9 November 2011

A New Computer -- Wireshark

I'm not a network expert by any stretch of the imagination, but I've occasionally solved problems by poking around a bit with Wireshark.

Of course, if my network is down I'm not going to be able to download Wireshark. Fortunately, I remembered to re-install Wireshark on my new computer before I needed it. I installed it using the Ubuntu Software Centre.

A new feature of Wireshark that I didn't know about: if you set up a "wireshark" group and add yourself to it, you can do live captures without running Wireshark as root. The setup looks like this:

sudo addgroup wireshark
sudo chgrp wireshark /usr/bin/dumpcap
sudo chmod 754 /usr/bin/dumpcap
sudo setcap 'CAP_NET_RAW+eip CAP_NET_ADMIN+eip' /usr/bin/dumpcap


Now add yourself to the wireshark group and log out. When you log back in you should be able to do live captures without root privileges. To add yourself to the wireshark group in a terminal, type:

sudo adduser your-user-name wireshark

The Wireshark documentation for this is here (scroll down a bit).
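A quick way to check that the change took, after logging back in (this just lists the capture interfaces using dumpcap, the helper Wireshark uses under the hood):

# your groups should now include wireshark
groups

# should list your network interfaces without needing sudo
dumpcap -D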

Tuesday, 1 November 2011

A New Computer -- Video

My son is fascinated with videos. I dream that one day he'll get fascinated by making them, not just watching them. So I've been trying to learn about making videos. Here's what I had to reinstall on my new computer.

First, playing video (and audio, for that matter) has worked out of the box much better with 11.04 than with previous versions of Ubuntu. I play my Guatemalan radio station and CBC audio and video without having to fool around with any setup.

To make videos, I loaded up OpenShot. 

sudo apt-get install openshot

That didn't install ffmpeg, which has been my main fallback tool. It seems to be the tool that does everything, although since it's a command-line tool, I usually just cut and paste an example command line from the Internet. It's not that I'm afraid of the Linux command line; it's that I don't know anything about video. So:

sudo apt-get install ffmpeg

That seems to be all that was needed.
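As an example of the kind of command line I end up pasting in (the file names are made up), here's pulling a 30-second clip out of a longer video without re-encoding it:

# copy 30 seconds starting at the 1-minute mark, leaving audio and video untouched
ffmpeg -i family-video.avi -ss 00:01:00 -t 00:00:30 -acodec copy -vcodec copy clip.avi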