Reflections on complexity on the occasion of diagnosing computer problems

At least twice in the last two days, I’ve had friends or neighbors pose computer problems to me. At least one was my own fault, having given a friend a (fully licensed, purchased from the MS Store) copy of Windows XP to “fix” their laptop via reinstallation. The other is a gentleman here on the island, from whom I’m in the process of buying a small boat to putt around between the islands. In the latter case I answered the usual questions about occupation and history, and since I nearly always answer that question with something about software or computers, I guess being involved in the second incident was my fault, too.

In the first case, the problem was that my friend tried to reinstall XP Pro SP2 over a Dell OEM installation of Windows Media Center, and got an error saying that the product key was invalid. It only took one email for my friend, who’s probably accustomed to folks like me asking seemingly simplistic questions like “did you mistype the code?” to convince me that, indeed, this was a real error. In the second case, a neighbor here on the island knew I was in the software business and asked me why his HP inkjet printer didn’t seem to install and work correctly on his Mac running OS X.

In both cases I was initially stymied. In the second case, I’m still stymied, but I’m buying the guy’s boat so I might help him figure it out tomorrow.

In the first case, a quick Google on the problem revealed that other people have exactly the same problem: reinstalling XP Pro over a Media Center installation causes Setup to reject perfectly valid personal (i.e., non-Enterprise) license keys. Of course, even though I worked at Microsoft and have worked with Windows since the 3.1 days, I have absolutely no clue why it does this. I just know enough about the complexity of the Windows code base and have enough anecdotal experience not to be shocked in the slightest. Similarly, I’m not shocked that I could have a serious amount of experience with computers and code and still not have a clue.

I suspect the reason for this is that software engineers actually have two core skills, not one. Sure, software engineers are extremely good at abstraction: the skill of looking at a set of particulars and creating a model of generalizations to represent any other set of particulars that share all or some of the relationships we imagine to exist within the original case. That task of abstraction is the same one shared by mathematicians, physicists, population geneticists, and other creators of mathematical models. But software engineers and systems administrators, as opposed to pure computer scientists, have a second, equally crucial skill: the ability to catalog a large number of actual cases, their causes, and their solutions. In other words, the skill to capture, contextualize, and apply the lore of computing.

The first ability, I think, is what people expect when they ask me what might be causing their technology to have a problem: the ability to see a rational abstraction behind the seemingly random behavior that’s occurring, and thus to diagnose what’s wrong. But in reality, the extent of one’s command of lore — of detail, contextualized by situation and software version and architecture — governs one’s ability to solve such problems, particularly remotely, without the computer in question in your hands. The reason is the fundamental complexity of the situation. On top of the hardware runs an operating system, with a specific set of rules. That operating system can be tiny, like MS-DOS 3.3, or utterly massive, like the 60+ million lines of C code that purportedly make up Windows XP. On top of this midget (or giant) rests a layer of drivers: bits of the operating system, typically contributed by hardware vendors, that allow the whole thing to work on their hardware. And on top of this three-layer cake run your applications, today often themselves multi-million-line pieces of software. Code that might also depend critically on being able to communicate with other computers, across a network, to gather data via HTTP or other “protocols,” which are essentially small languages that all computers must speak fluently in order not to misunderstand one another.
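
The “small language” point is fairly literal: a protocol defines sentences with a fixed grammar that both machines must parse identically. As a deliberately minimal sketch (using HTTP’s request line, with none of a real parser’s error handling):

```python
# An HTTP request line such as "GET /index.html HTTP/1.1" is a
# three-word "sentence"; a server that parses it differently than the
# client intended simply misunderstands the conversation.
def parse_request_line(line: str) -> dict:
    method, path, version = line.strip().split(" ")
    return {"method": method, "path": path, "version": version}

req = parse_request_line("GET /index.html HTTP/1.1")
print(req["method"], req["path"], req["version"])
```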

Complexity is the enemy of things “just working.” And it’s the enemy of even computer professionals being able to understand the systems they build. We can visualize a few interactions; we can even visualize a few histories of interactions. But nobody can visualize all of the interactions and possible states that even a moderately large piece of software (forget Microsoft Office, Windows, the Linux kernel, or Mac OS X) can exhibit. Heck, human beings can’t visualize the geometry of a vector with more than three dimensions! How are we possibly going to understand the state space (i.e., possible behavior) of a piece of software with 60+ million lines of code and megabytes of internal state variables?
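
To put rough numbers on that, here’s a toy Python calculation (the figures are illustrative back-of-the-envelope counts, not measurements of any real program): even n independent boolean flags yield 2^n configurations, and a single megabyte of internal state is combinatorially beyond any exhaustive test.

```python
import math

# With n independent boolean flags, a program has 2**n possible
# configurations -- the state space doubles with every flag added.
def boolean_states(n_flags: int) -> int:
    return 2 ** n_flags

for n in (3, 10, 64):
    print(f"{n} flags -> {boolean_states(n)} states")

# One megabyte of internal state allows 256**(2**20) configurations.
# Computing that number directly is hopeless, but its digit count isn't:
digits = int((1 << 20) * 8 * math.log10(2)) + 1
print(f"a 1 MB state space is a number with {digits:,} digits")
```

That last count comes out to roughly two and a half million digits, which is why “test every state” is not on the menu.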

We can’t, in detail. We do it statistically. We test things over partial ranges of their possible behaviors, hopefully the important ranges: the states users are most likely to get their systems into. Even understanding the scope of the range of possible behaviors is a massive challenge, as witnessed by the continued research into code coverage, automated testing, and the like. The current popularity of unit testing probably represents a programmer-driven effort to simply reduce the dimensionality of the state space: by automatically verifying the lowest-level “contracts” within the software itself, unit testing shrinks the reachable state space by large factors.
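
That state-space-shrinking role can be made concrete with a small sketch; the function and its contract here are hypothetical, invented purely for illustration:

```python
# Hypothetical low-level "contract": product keys are always handed to
# higher layers stripped of hyphens and spaces, and uppercased. The
# unit test pins this down so callers never see any other form.
def normalize_key(raw: str) -> str:
    return raw.replace("-", "").replace(" ", "").upper()

def test_normalize_key():
    assert normalize_key("abcde-12345") == "ABCDE12345"
    assert normalize_key("  abcde 12345 ") == "ABCDE12345"

test_normalize_key()
print("contract holds")
```

Each such verified contract removes whole families of states (lowercase keys, hyphenated keys) from the space the rest of the system must be tested against.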

But what’s left after good, serious modern testing and QA is still a lot of possible behavior, and only some key pathways, the deepest, most intentional valleys through the overall “landscape” of behaviors, are documented or recorded. Much of the state space of a modern commercial software program is still deeply terra incognita, as a simple consequence of the overall complexity and coupling present in our systems.

Thus, I was encouraged by this post about Erlang on Lambda the Ultimate, a prominent blog about programming languages and the associated computer science. The designer of Erlang, Joe Armstrong, has this to say:

The Erlang flagship project (built by Ericsson, the Swedish telecom company) is the AXD301. This has over 2 million lines of Erlang.

The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable … but we did 9.

Why is this? No shared state, plus a sophisticated error recovery model. You can read all the details in my PhD thesis.

Interesting. And impressive. It’s possible that there’s an approach here for reducing complexity to manageable, understandable, plannable levels. Objects, aspects, and other recent software innovations aim to reduce dimensionality, allowing more of the total state of a program to be explicitly designed, rather than showing up as emergent run-time behavior.

It seems clear, though, that getting a handle on complexity in software is critical — if we’re going to be able to diagnose what goes on inside our software, and thus if we’re going to be able to trust it. For commerce. For security. For privacy. And for exercising our rights in a democracy, since more and more, software is involved when we vote and make decisions.