Google’s Gmail application was knocked out for the majority of users for 100 minutes on Sept. 1 due to human error, the company said last night after the Gmail engineering team fixed the issue.
Google took a small fraction of Gmail’s servers offline to perform routine upgrades. Google does this regularly, sending traffic to other locations when one is offline. That’s when things got hairy, as Ben Treynor, vice president of engineering and site reliability czar for Google, explained:
“We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers, servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system ‘stop sending us traffic, we’re too slow!’ This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the Web interface because their requests couldn’t be routed to a Gmail server.”
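The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic; it assumes traffic is split evenly across healthy routers and that a router drops out entirely once its share exceeds its capacity. The function name `simulate_cascade` and the capacity figures are hypothetical, chosen only for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of cascading overload (an illustration, not Google's system).

    Assumptions: load is split evenly across healthy routers, and any
    router whose share exceeds its capacity stops accepting traffic.
    Returns (routers_still_healthy, rounds_of_failure).
    """
    healthy = list(capacities)
    rounds = 0
    while healthy:
        share = total_load / len(healthy)
        # Routers that can handle their share stay in the pool.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # Stable: nobody dropped out this round.
        healthy = survivors
        rounds += 1
    return len(healthy), rounds


if __name__ == "__main__":
    # Ten hypothetical routers with capacities from 100 to 145 (total 1,225).
    routers = list(range(100, 150, 5))
    print(simulate_cascade(routers, 1000))  # load fits every router
    print(simulate_cascade(routers, 1050))  # weakest router tips the rest
```

In this model a load of 1,050 is well under the pool's combined capacity of 1,225, yet the weakest router dropping out raises everyone else's share until none are left, which mirrors the "a few, then a few more, then nearly all" pattern in the quote.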
Treynor said internal monitors alerted the Gmail engineering team to the failures within seconds. The team brought several additional request routers online to make up for the shortfall in capacity and distributed the traffic across them. Gmail came back online around 2:30 p.m. PDT.
To ensure this lack of server capacity doesn’t happen again, Google boosted request router capacity well beyond peak demand, giving the application headroom when it needs it. The shortfall is ironic considering that Google reportedly powers the world’s most popular search engine with more than 1 million servers.
Treynor also said Google is improving the failure isolation in the routers, so a problem in one data center won’t affect servers in another facility. Moreover, he said Google is taking steps to make sure that when many request routers are overloaded simultaneously, they all just get slower instead of refusing to accept traffic and shifting their load to another data center.
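The "get slower instead of refusing traffic" behavior can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not Google's implementation: it assumes load is split evenly, that an overloaded router's response time grows in proportion to how far its share exceeds its capacity, and that no router ever leaves the pool. The function name `degraded_latency` and the 100 ms baseline are hypothetical.

```python
def degraded_latency(capacities, total_load, base_latency_ms=100.0):
    """Toy model of graceful degradation (an illustration only).

    Assumptions: every router keeps accepting its even share of traffic,
    and latency scales with share / capacity once that ratio exceeds 1.
    Returns the per-router latency in milliseconds.
    """
    share = total_load / len(capacities)
    return [base_latency_ms * max(1.0, share / c) for c in capacities]


if __name__ == "__main__":
    # Two hypothetical routers: one undersized, one with spare capacity.
    print(degraded_latency([100, 200], 300))
```

Because every router stays in the pool, the overloaded one answers more slowly but never pushes its share onto its neighbors, so the overload stays local rather than spreading.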
It’s also worth noting that when Gmail did go down, Google urged users to access it via the IMAP and POP mail protocols; mail processing continued to work normally because these requests don’t use the same routers at Google.
“We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service,” Treynor added. “Thus, right up front, I’d like to apologize to all of you; today’s outage was a Big Deal, and we’re treating it as such.”
So are the Gmail users who use Gmail for their businesses. Donald told Google Watch: “I use G-Mail to run my CPA practice. This is a serious (huge) problem.”
Sergei added: “This is a huge problem and an outrage. I demand immediate Gmail access. What is with those people?”
Indeed, more than 1.75 million businesses use Google Apps, and some of them pay Google $50 per user, per year for the Google Apps collaboration suite, which boasts Gmail as its backbone. Users have little patience for a service that conks out on them, particularly when they are paying for the extra reliability and security.
The latest issue follows a big outage in February, when Gmail went down for two and a half hours due to “unexpected side effects of some new code.” But these last two issues were nothing compared with the August 2008 outage that took Gmail down for nearly 15 hours.