Load-balanced and redundant computing

When writing communication systems, one very important requirement is usually "high availability" or "fault tolerance". The client is usually not a problem, but the server needs to be very robust. How can we achieve that? There are a few ways:

  1. Spend as much time and effort as needed to make your server error free and robust.
  2. Make the server reasonably error free and provide for redundant servers.
  3. Make the server kernel very solid and provide for redundant server processes.

1. Make single server error free

Using this model, you develop a server that works alone and that has to have an uptime as close to 100% as possible. For every man-month you put into the development, you'll get a gain in product quality (I'm simplifying a bit here...).

Usually, you'll see diminishing returns as you put in more development time. The first year you'll get 95% reliability, the second year 98%, the third year 98.5% and so on. You'll never get to 100%, of course, but you won't get really close either. That's because those last pesky bugs can be extremely hard to find and are often due to causes beyond your control anyway.

2. Redundant servers

Creating a system using redundant servers is in itself much more complex than creating a system with a single server. Not only does it take more time, it also introduces bugs you wouldn't have had without this hairy code. On the other hand, bugs in this code are often within your reach to fix at a later stage.

The story for this kind of development may go like this:

As in the single server case, you'll get to maybe 95% reliability in the first year. The second year, you'll be adding the code for load balancing and redundancy and that will lower the reliability to maybe 90%. Not only did you introduce more code, you also did not spend time to make the basic server code better, so it's a double loss in this sense.

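On the other hand, if you now install two servers, the 90% reliability turns into 99% reliability, because the system as a whole is only down when both servers are down at the same time (10% of 10% is 1%). That assumes the servers fail independently and that there are no bugs in the basic algorithms (which there are, of course).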

Bugs in the balancing algorithms affect all servers alike, so the 99% number would go down. On the other hand, such bugs are generally easy to detect and correct during pilot phases, so after a short while we're back up close to the 99% figure.

Now, if you're not happy with the 99% number, simply add more servers. One more and you'll get 99.9%. This number is simply unachievable by any other method. I may exaggerate here, but not by much, actually.
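The arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not a measurement: it assumes that servers fail independently of each other, and the `shared_failure` parameter is a hypothetical knob I've added to stand in for bugs in the balancing code that take all servers down at once.

```python
def system_availability(single: float, servers: int, shared_failure: float = 0.0) -> float:
    """Estimate the availability of a group of redundant servers.

    single         -- availability of one server on its own (0.90 means 90%)
    servers        -- number of redundant servers installed
    shared_failure -- assumed fraction of time a bug in the shared balancing
                      code takes every server down together (hypothetical)
    """
    if servers < 1:
        raise ValueError("at least one server is required")
    for value in (single, shared_failure):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")

    # The group is only down when every server is down at the same time.
    all_down = (1.0 - single) ** servers
    # A shared bug hits all servers alike, so it scales the whole result.
    return (1.0 - shared_failure) * (1.0 - all_down)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} server(s): {system_availability(0.90, n):.1%}")
```

With 90% per server this gives 90%, 99% and 99.9% for one, two and three servers. Set `shared_failure` to something like 0.02 and you'll see why balancer bugs matter: they cap the result no matter how many servers you add.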

3. Solid kernel with redundant processes

This approach was taken by the Apache and IIS teams. A very reliable server core was developed, but all other processes, such as ASP filters, script processors and user processes, are started in protected mode and allowed to crash if something goes wrong. Crashed processes are simply restarted by the server core. Naturally, this model of development is very suitable for servers that need to be modular and accept user programs of dubious quality. The development of the robust core is also very expensive and probably out of reach of most organizations.
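To make the "restart whatever crashes" idea concrete, here is a minimal sketch in Python. It is not how Apache or IIS actually do it; it only shows the principle of a small, simple core that starts a worker process and restarts it when it exits with an error. The names `supervise`, `FLAKY_WORKER` and `demo` are made up for this example, and the worker is a toy that deliberately crashes on its first two starts.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def supervise(command: Sequence[str], max_restarts: int) -> int:
    """Run `command` and restart it after every crash.

    Returns the number of restarts that were needed before the worker
    exited cleanly. Gives up with RuntimeError after `max_restarts`.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(list(command)).returncode
        if exit_code == 0:
            return restarts  # the worker finished without crashing
        if restarts >= max_restarts:
            raise RuntimeError(f"worker still failing after {restarts} restarts")
        restarts += 1  # the worker crashed; start it again


# Toy worker (hypothetical): counts its own starts in a file and
# crashes until it has been started three times.
FLAKY_WORKER = """
import sys
from pathlib import Path
counter = Path(sys.argv[1])
starts = int(counter.read_text() or "0") + 1
counter.write_text(str(starts))
sys.exit(0 if starts >= 3 else 1)
"""


def demo() -> int:
    """Supervise the toy worker and return how many restarts it took."""
    with tempfile.TemporaryDirectory() as tmp:
        counter = Path(tmp) / "starts.txt"
        counter.write_text("0")
        command = [sys.executable, "-c", FLAKY_WORKER, str(counter)]
        return supervise(command, max_restarts=5)


if __name__ == "__main__":
    print(f"worker recovered after {demo()} restart(s)")
```

Note how little the supervising loop does. That is the point of this model: the less the core does, the easier it is to make it solid, and everything of dubious quality lives in processes that are allowed to die.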

The dumbest thing you can do

Now we're getting to the reason I'm writing this story. The absolutely dumbest thing you can do is to choose the "redundant servers" model and then install it on a single server. You've then developed a 95% reliable server, reduced its reliability to 90% at great development cost, and rolled out a complex and unreliable system. Congratulations, you've not only shot yourself in the foot; you've machine-gunned your entire lower body.

When you roll out a redundant server system, you have to have multiple servers right from the start, even though the traffic may be light. Why? Because right at the start is when you'll have the most problems keeping each individual server up and running, and that's when you need the switch-over capability. Once those bugs are ironed out, the traffic naturally goes up and you'll need the extra servers to take the load. In other words, you cannot roll out a redundant server scheme using Scrooge tactics.

A second side effect of the redundant server design is that each individual server has lower performance than a stand-alone server would. This means that you get not only lower reliability but also lower throughput if you roll out with a single server.

If you (or your customer/boss) insist on running this system on a single server, you'll have to tell them it'll take development time to remove the load-balancing code and even more time to bring the code up to the standard it needs in a single-server scenario.

For example: you spent two years building a redundant server system only to discover it will only be used in single-server scenarios. Now you may need three to six months to rip out code and another year to upgrade the reliability. The total cost of the system will be up to three and a half years of development for a result that will be worse than had you developed a single-server solution in two years.

Admittedly, it will save one or two machines, though.
