ACM Queue - A Conversation with Phil Smoot - An engineer at Hotmail discusses the challenges of keeping one of the Web’s largest and oldest Internet services running 24/7 (via Matt). Informative, and has some good quotes. On simplicity:
Most of this experience comes with time. We try to build our bench with folks who understand what mistakes not to make. New hires tend to want to do complex things, but we know complex things break in complex ways. The veterans want simple designs, with simple interfaces and simple constructs that are easy to understand and debug and easy to put back together after they break.
The best advice is just basically to keep everything as simple as possible—simple processes, simple SKUs, simple engineering. These systems get to be very big very fast. I don’t think there’s really any one particularly hard, gnarly problem, but when you add them all up, there are lots and lots of little problems. As long as you can keep each of those pieces simple, that seems to be the key. It’s more of a philosophy, I think, than anything else.
On manageability:
BF Are there scaling reasons to think about the benefits of a command line for managing over a GUI, or are there other things to think about?
PS Our operations group never wants to rely on any sort of user interface. Everything has to be scriptable and run from some sort of command line. That’s the only way you’re going to be able to execute scripts and gather the results over thousands of machines.
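To illustrate the "execute scripts and gather the results over thousands of machines" idea, here is a minimal sketch of a parallel fan-out in Python. It assumes a hypothetical `run_remote(host, command)` callable that runs one command on one machine and returns its output; in practice that might wrap ssh or an agent. The `fan_out` helper and the host names are illustrative only, not Hotmail's actual tooling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def fan_out(
    hosts: Iterable[str],
    command: str,
    run_remote: Callable[[str, str], str],
    max_workers: int = 64,
) -> Dict[str, str]:
    """Run `command` on every host in parallel and collect output by host.

    `run_remote` is a hypothetical caller-supplied executor. Failures are
    recorded as error strings so one bad machine doesn't abort the sweep.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_remote, h, command): h for h in hosts}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except Exception as exc:  # record the failure and keep going
                results[host] = f"ERROR: {exc}"
    return results


if __name__ == "__main__":
    # Stand-in for a real remote executor, for illustration only.
    def fake_run(host: str, command: str) -> str:
        if host.endswith("13"):
            raise RuntimeError("unreachable")
        return f"{host}: ok ({command})"

    hosts = [f"mail{i:02d}" for i in range(1, 17)]  # hypothetical host names
    report = fan_out(hosts, "uptime", fake_run)
    failed = sorted(h for h, out in report.items() if out.startswith("ERROR"))
    print(f"{len(report)} hosts checked, {len(failed)} failed: {failed}")
```

Because the result is plain data keyed by host, it can be piped into the next script (retry the failures, diff against yesterday's run), which is exactly what a GUI makes hard.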
On scalability:
BF Is storage going to change enough so that maybe just the next round of disks will be fast enough that you don’t need to worry about that?
PS If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
In other words, every scaling problem is ultimately a distributed computing problem.
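One common way to scale out storage is to partition data by key across many commodity machines instead of buying a bigger one. A minimal sketch, assuming mailboxes are assigned to a fixed list of hypothetical storage nodes by hashing the user ID (simple modulo hash partitioning, not necessarily what Hotmail does):

```python
import zlib
from collections import Counter
from typing import List


def node_for(user_id: str, nodes: List[str]) -> str:
    """Pick the storage node for a user by hashing the ID (modulo placement).

    zlib.crc32 is used instead of the built-in hash() because it is stable
    across processes, so every front end computes the same placement.
    """
    bucket = zlib.crc32(user_id.encode("utf-8")) % len(nodes)
    return nodes[bucket]


if __name__ == "__main__":
    nodes = [f"store{i}" for i in range(4)]  # hypothetical node names
    users = [f"user{i}@example.com" for i in range(10_000)]
    load = Counter(node_for(u, nodes) for u in users)
    # Each node should end up with roughly a quarter of the users.
    for node in nodes:
        print(node, load[node])
```

The catch with plain modulo placement is growth: going from four nodes to five remaps about four fifths of all users, forcing a large migration. Consistent hashing is the usual alternative, moving only roughly the new node's share (about one fifth) when a machine is added.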