AT&T and the iPhone 4 Pre-Order Debacle

Yesterday, we’re told, the crush of fanatical fanboys pre-ordering iPhones brought AT&T’s servers to their knees.  Apple and AT&T pre-sold 600K iPhones, and we’re told they processed 13 million eligibility requests during the day, as people tried over and over to get through.  Random reports surfaced about how the crushing load “crippled” AT&T’s internal network, and caused security glitches and the exposure of private customer data (again).

We’re supposed to believe that this overwhelming traffic load was unprecedented and brought their systems to a screeching halt.  Well, at least AT&T’s systems — Apple’s systems seemed fine if you weren’t going through the eligibility portion of the check.

Here’s the problem, though — if you run the numbers, and know something about web/database applications, it just doesn’t add up.

13 million database queries sounds like a lot.  But let’s say that all of these queries largely happened in the first 12 hours of the day yesterday, instead of being spread out over the full 24-hour cycle.  That’s about 1.08MM queries per hour, or roughly 300 queries per second, on average.
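For anyone who wants to check the arithmetic, here is a quick sketch in Python.  The 13 million figure and the 12-hour window come from the paragraph above; the function and variable names are just mine.

```python
# Figures from the post: 13 million eligibility checks, assumed to have
# landed within a 12-hour window rather than the full 24 hours.
TOTAL_QUERIES = 13_000_000
WINDOW_HOURS = 12


def average_rates(total_queries, window_hours):
    """Return (queries per hour, queries per second) averaged over the window."""
    per_hour = total_queries / window_hours
    per_second = per_hour / 3600  # 3600 seconds in an hour
    return per_hour, per_second


per_hour, per_second = average_rates(TOTAL_QUERIES, WINDOW_HOURS)
print(f"{per_hour / 1e6:.2f}MM queries/hour, {per_second:.0f} queries/second")
```

Stretching the window back out to 24 hours just halves both numbers.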

I don’t know if that sounds like a lot to you, but it’s really not.  A Google search for “mysql queries per second” gives a general idea of what people are doing out there.  The results range from 2003 through the present, and folks are doing a LOT more than this.  With clustering and various attempts to scale out, folks are doing 10-20K per second.  Oracle, properly tuned, can do thousands to tens of thousands of transactions (operations that change data, not just read it) per second.

I’m not a database expert, but I’ve worked around and with them for years, and I’ll say that 300 queries per second on average is not something that should cause one of the largest (and oldest, if one considers them the heir of the Bell System) telecom companies in the world to crumple under the load.

But traffic is bursty, not uniformly distributed.  Even if they saw periods with 10-50x greater load than average (roughly 3,000-15,000 queries per second), we’re still in the ballpark for reasonable performance on a pure database query.  Note that I’m assuming that eligibility is a somewhat simple database query: we gave three items of data, which obviously form a compound primary key, and AT&T is supposed to return some information about eligibility for upgrade, perhaps a date, perhaps a few other bits of info.
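To make the “simple query” assumption concrete, here is a minimal sketch of what such a lookup might look like, using Python’s built-in sqlite3 module.  The table name, column names, and sample row are all hypothetical; I have no idea what AT&T’s actual schema looks like.

```python
import sqlite3

# Hypothetical schema: the three items the customer enters form a
# compound primary key, and the row carries an upgrade-eligibility date.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE eligibility (
        phone        TEXT NOT NULL,
        ssn_last4    TEXT NOT NULL,
        zip          TEXT NOT NULL,
        upgrade_date TEXT NOT NULL,
        PRIMARY KEY (phone, ssn_last4, zip)
    )
    """
)
# Made-up sample row for illustration only.
conn.execute(
    "INSERT INTO eligibility VALUES (?, ?, ?, ?)",
    ("5555550100", "1234", "94103", "2010-06-24"),
)


def check_eligibility(conn, phone, ssn_last4, zip_code):
    """Return the upgrade date for a customer, or None if no row matches."""
    row = conn.execute(
        "SELECT upgrade_date FROM eligibility "
        "WHERE phone = ? AND ssn_last4 = ? AND zip = ?",
        (phone, ssn_last4, zip_code),
    ).fetchone()
    return row[0] if row else None


print(check_eligibility(conn, "5555550100", "1234", "94103"))
```

A lookup like this is a single probe of the primary key index, which is about the cheapest thing you can ask a database to do.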

Let’s be generous and assume that 1K of data per eligibility request is returned (i.e., there’s little concern for efficiency).  That’s still only about 300K bytes per second of query results flowing back to Apple from AT&T, or about 2.4Mbps.  Again, perhaps bursting to 20-100Mbps for very brief periods of time.  In other words, a couple of DS3s or a fast ethernet cross-connect are sufficient to carry the data back and forth.  One imagines this shouldn’t strain AT&T’s internal network too much, despite random claims yesterday.
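The bandwidth estimate can be sketched the same way.  The 300 queries per second, the 1K response size (taken here as 1000 bytes), and the 10-50x burst multipliers come from the paragraphs above; the names are mine.

```python
# Assumptions from the post: ~300 queries/second on average, a generous
# 1K (taken as 1000 bytes) per response, and bursts of 10-50x the average.
AVG_QPS = 300
RESPONSE_BYTES = 1000
BURST_FACTORS = (10, 50)


def mbps(qps, response_bytes):
    """Convert a query rate and response size into megabits per second."""
    return qps * response_bytes * 8 / 1_000_000


avg_mbps = mbps(AVG_QPS, RESPONSE_BYTES)
burst_mbps = [mbps(AVG_QPS * factor, RESPONSE_BYTES) for factor in BURST_FACTORS]

print(f"average: {avg_mbps:.1f} Mbps")
print(f"bursts:  {burst_mbps[0]:.0f}-{burst_mbps[1]:.0f} Mbps")
```

Applying the multipliers literally gives roughly 24-120 Mbps, which is the same ballpark as the rounder 20-100 Mbps figure above.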

Of course, maybe the problem here isn’t database performance or bandwidth, but that AT&T did the eligibility checks as API calls through a large enterprise system where a single check builds and then tears down many EJBs or other enterprise objects.  That might be closer to the real performance bottleneck.  Maybe the system was built to handle tens, but not hundreds or thousands, of requests per second.  That’s plausible, but kind of stupid for a large engineering company used to having millions of subscribers and doing business globally.  But I could buy it.

But you’d imagine that they’d have learned something from three previous “major” iPhone releases, and the iPad 3G release, and figured out an easier way to quickly respond to eligibility requests.  After all, my eligibility isn’t a rapidly changing variable: I’m eligible on a certain day, and they know what that day is.  Which means that the eligibility of every iPhone owner on the planet could have been precalculated easily just before the iPhone 4 launch, and cached.  It’s not that much data, frankly.  You could have cached a table keyed on a hash of the user’s phone number, last 4 of the SSN, and zip (the keys they ask you to enter), with an eligibility “price code” as the value, in a few gigs of memory on all the app servers, and just responded statically to queries for the first 24 hours, if you were worried that your enterprise systems wouldn’t handle “first day” load.
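Here is a rough sketch of that precomputed cache, again in Python.  Everything in it is an assumption for illustration: the choice of SHA-256, the “|” separator, the record layout, and the price codes are mine, not anything AT&T or Apple is known to use.

```python
import hashlib


def cache_key(phone, ssn_last4, zip_code):
    """Hash the three customer-entered fields into one fixed-size key."""
    raw = f"{phone}|{ssn_last4}|{zip_code}".encode("utf-8")
    return hashlib.sha256(raw).digest()  # 32 bytes per key


def build_cache(records):
    """Precompute the lookup table once, before launch day.

    `records` is an iterable of (phone, ssn_last4, zip, price_code) tuples,
    a hypothetical export from the real eligibility system.
    """
    return {cache_key(p, s, z): code for p, s, z, code in records}


def lookup(cache, phone, ssn_last4, zip_code):
    """Answer an eligibility check from memory; None means no match."""
    return cache.get(cache_key(phone, ssn_last4, zip_code))


# Made-up sample data for illustration only.
cache = build_cache(
    [
        ("5555550100", "1234", "94103", "FULL_UPGRADE"),
        ("5555550101", "5678", "10001", "EARLY_UPGRADE"),
    ]
)
print(lookup(cache, "5555550100", "1234", "94103"))
```

The raw payload is a 32-byte digest plus a short code per subscriber, so every million subscribers costs a few tens of megabytes before container overhead.  A real implementation would want something more compact than a Python dict, but the idea is the same: no enterprise objects get built or torn down on the hot path.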

Anyhow, these are just ballpark figures, and they could be wildly wrong about the instantaneous loads experienced, etc.  But the general point is, 13MM eligibility checks and 600K preorders isn’t really a lot of load and traffic.  Ask Amazon or eBay what “a lot” of transactions looks like.

Or better yet, AT&T, before the next launch, hire some of your ex-employees to take a look at your databases and systems.  Please.