Sunday, 30 September 2012

Long Fat Networks

Long fat networks are high bandwidth, high latency networks. "High latency" is relative, meaning high latency compared to a LAN.

I ran into the LFN phenomena on my last data centre relocation. We moved the data centre from head office to 400 kms from head office, for a round trip latency of 6 ms. We had a 1 Gbps link. We struggled to get a few hundred Mbps out of large file transfers, and one application had to be kept back at head office because it transferred large files back and forth between the client machines at head office and its servers in the data centre.

I learned that one can calculate the maximum throughput you can expect to get over such a network. The calculation is called the "bandwidth delay product", and it's calculated as the bandwidth times the latency. One way to interpret the BDP is the maximum window size for sending data, beyond which you'll see no performance improvement.

For our 1 Gbps network with 6 ms latency, the BDP was 750 KB. Most TCP stacks in the Linux world implement TCP window scaling (RFC1323) and would quickly auto tune to send and receive 750 KB at a time (if there was enough memory available on both sides for such a send and receive buffer).

SMB 1.0 protocols used by most anything you would be doing on pre-Windows Vista are limited to 64 KB blocks. This is way sub-optimal for a LFN. Vista and later Windows use SMB 2.0, which can use larger block sizes when talking to each other. Samba 3.6 is the first version of Samba to support SMB 2.0.

We were a typical corporate network in late 2011 (read, one with lots of Windows machines), and they were likely to suffer the effects of a LFN.

Note that there's not much you can do about it if both your source and destination machines can't do large window sizes. The key factor is the latency, and the latency depends on the speed of light. You can't speed that up.

We had all sorts of fancy WAN acceleration technology, and we couldn't get it to help. In fact, it made things worse in some situations. We never could explain why it was actually worse. Compression might help in some cases, if it gets you more bytes passing through the window size you have, but it depends on how compressible your data is.

(Sidebar: If you're calculating latency because you can't yet measure it, remember that the speed of light in fibre is only about 60 percent of the speed of light in a vacuum, 3 X 10^8 m/s.)

There are a couple of good posts that give more detail here and here.

No comments: