Sunday, 4 April 2010

Looking for IP Addresses in Files

I've moved a couple of data centres. And I've virtualized a lot of servers. In all cases, the subnets in which the servers were installed changed. If anything depends on hard-coded IP addresses, it's going to break when the server moves.

The next time I move a data centre, I'm going to search all the servers for files that contain hard-coded IP addresses. The simplest way to do that on Linux and Unix is this:
egrep -R "\b([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}\b" root_of_code
The regular expression matches a group of one to three digits followed by a ".", repeated exactly three times, then one to three more digits, with word boundaries at either end.

That's not the most exact match of an IP address, because valid IP addresses won't have anything higher than 255 in each component. This is more correct:
egrep -R "\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b" /!(tmp|proc|dev|lib|sys) >/tmp/ips.out
It yields about two percent fewer lines when scanning a Linux server (no GUI installed). (Thanks to this awesome site for the regular expression.)
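To see the difference between the two patterns, here is a minimal sketch that runs both against a few made-up lines. The sample file and its contents are invented for illustration, and it assumes a grep that understands \b, as GNU grep does.

```shell
# Compare the loose and strict patterns on a few made-up sample lines.
loose='\b([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}\b'
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
strict="\\b(${octet}\\.){3}${octet}\\b"

# Hypothetical sample data: two real-looking addresses, one impostor.
sample=$(mktemp)
cat >"$sample" <<'EOF'
db_host = 192.168.10.25
gateway = 10.0.0.1
version = 999.300.1.2
no address on this line
EOF

# -c counts matching lines instead of printing them.
loose_count=$(grep -Ec "$loose" "$sample")
strict_count=$(grep -Ec "$strict" "$sample")
echo "loose pattern:  $loose_count matching lines"
echo "strict pattern: $strict_count matching lines"
rm -f "$sample"
```

The loose pattern counts the "version" line as an address because it only checks the shape. The strict pattern skips it because 999 and 300 are out of range.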

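When I ran the above egrep command from "/", I had problems. There were a few directories I had to exclude: /tmp, /proc, /dev, /lib and /sys. I used the file pattern /!(tmp|proc|dev|lib|sys), as in the command above, to get everything in root except those directories. In bash, that pattern only works when the extglob option is turned on (shopt -s extglob).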
The reason I wanted to exclude /tmp is that I wanted to put the output somewhere. /tmp is a good place, and by excluding it I didn't end up writing to the file while reading it. /sys on a Linux server has recursive directories in it. /proc and /dev have special files in them that cause egrep to simply wait forever. /lib also caused egrep to stop, but I'm not sure why (apparently certain combinations of regular expressions and files cause egrep to take a very long time -- perhaps that's what happened in /lib.)
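Here is a small sketch of how the exclusion pattern behaves, run against a throwaway directory tree instead of the real "/". The directory and file names are made up for illustration. It assumes bash is installed, because the !(...) pattern is a bash extended glob.

```shell
# Build a small stand-in for "/" (all names here are illustrative).
root=$(mktemp -d)
for d in etc opt tmp proc dev lib sys; do mkdir "$root/$d"; done
echo 'server = 10.1.2.3' >"$root/etc/app.conf"
echo 'peer = 172.16.0.9' >"$root/opt/peer.conf"
echo 'scratch 10.9.9.9' >"$root/tmp/ips.out"

# -O extglob turns the option on before bash parses the !(...) pattern.
# In an interactive bash session, "shopt -s extglob" on its own line does the same.
kept=$(cd "$root" && bash -O extglob -c 'echo !(tmp|proc|dev|lib|sys)')
echo "searched: $kept"

# Same search as in the post, limited to the directories that survived the pattern.
hits=$(cd "$root" && bash -O extglob -c \
  'grep -ER "\b([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}\b" !(tmp|proc|dev|lib|sys)')
echo "$hits"
rm -rf "$root"
```

Only etc and opt get searched, so the address sitting in tmp/ips.out never shows up in the results.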

I'll write about how to do this for Windows in another post. I'll also write about how to do it across a large number of servers.

Friday, 2 April 2010

The Cost of Storage

Over the years I've seen SAN storage cost between C$10 and C$20 per GB (C$ is approximately equal to US$ right now). This is the cost of the frame, a mix of disks, redundant director-class fibre channel switches with a number of 48 port cards in each switch, management computer, and a variety of management and replication software. The cost doesn't include the HBAs in the servers, or cabling.

The price above is for a raw GB, before you subtract the capacity lost to whatever RAID levels you use.

The management and replication software in all the above cases was the basic stuff you need to manage a SAN and replicate it. There was no fancy de-duplication or information lifecycle management going on.

The costs above also didn't include the cost of training dedicated storage staff to set up and manage a SAN, or the higher salary you'll have to pay to keep them after you train them.

Compare that to direct-attached storage: Right now I can get a 1TB drive for less than C$300, or about 30 cents per GB. If I put two of them in a RAID 1 configuration with a RAID controller (less than $400 easily), I would still be paying less than $1 per usable GB.
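The arithmetic behind that figure can be sketched like this. It assumes two drives in the mirror, takes the prices above as upper bounds, and treats 1TB as 1000 GB the way drive vendors do.

```shell
# Cost per usable GB for a two-drive RAID 1 mirror, using the post's upper bounds.
drive_cost=300        # C$ per 1TB drive (upper bound from the post)
controller_cost=400   # RAID controller (upper bound from the post)
usable_gb=1000        # RAID 1 mirrors the two drives, so only 1TB is usable

# awk does the floating-point division that plain shell arithmetic cannot.
raid1_per_gb=$(awk -v d="$drive_cost" -v c="$controller_cost" -v g="$usable_gb" \
  'BEGIN { printf "%.2f", (2 * d + c) / g }')
echo "RAID 1 direct-attached: at most \$${raid1_per_gb} per usable GB"
echo "SAN (raw, from the post): C\$10 to C\$20 per GB"
```

Since both prices are ceilings, the real number comes in under the computed one.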

I get RAID 1 storage for an order of magnitude less than raw storage on a SAN. No need for special management software. You've got rsync for replication. You can use standard tools that everyone knows how to use.

No wonder Google uses direct-attached storage in commodity servers for their index. It's just way more cost-effective. What's your business case for SAN-attached storage?