Tuesday, May 31, 2011

Blogger Incident Report
By Eddie Kessler, Tech Lead/Manager, Blogger

(This is a follow-up to the original post we made on Blogger Buzz about this incident.)

On Wednesday, May 11 at 10:00pm PDT, Blogger entered a one-hour scheduled maintenance period to make improvements to increase service reliability. Unfortunately, errors made during this period had the opposite effect. This report describes what we were doing, what went wrong, how we fixed it, and what we’re doing to help prevent these types of problems from happening again.

We sincerely apologize for the impact of this incident on Blogger’s authors and readers.

What we were doing

Blogger maintains replicas (copies) of blogs in multiple locations so that if one copy becomes unavailable, for example because of a cut network cable or a loss of power, blogs remain accessible. During this maintenance period we attempted to add several more replicas to increase both redundancy (which helps make the service more robust) and capacity (so we can serve more blogs to more people).

The procedure required a read-only period (no new posts, comments, or blogs) while we added the new replicas. The procedure appeared to go smoothly, and just after 10:30pm PDT we again allowed new blogs, posts, and comments. We then noticed a higher-than-usual rate of errors being reported and quickly realized that the new replicas were missing data.

What went wrong

Next, we reverted the service upgrades and removed the newly introduced replicas. During this process we discovered that running Blogger with the bad replicas had caused some user data to become inaccessible. This manifested itself in a number of ways: some blogs appeared to have been replaced, others wouldn’t display, and some users couldn’t access their dashboards.

Fixing the problem

After several hours of examining different strategies to address the data problems, we decided to restore Blogger data from our backup systems. First we had to restore the data from backups to our serving infrastructure, and then we had to recover all the posts, pages, and comments that had been made since the backup was taken.
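The two-step recovery described above (restore from a backup, then reapply what was written afterwards) can be sketched in a few lines. This is only an illustration of the general idea, not Blogger's actual tooling: the `Write` record, the integer timestamps, the key format, and the `restore_and_replay` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One change (post, page, or comment) recorded after the backup.

    All fields are hypothetical; the report does not describe the real format.
    """
    timestamp: int  # logical time at which the change was made
    key: str        # identifier of the item, e.g. "blog1/post/2"
    value: str      # content of the item


def restore_and_replay(backup: dict[str, str],
                       backup_time: int,
                       later_writes: list[Write]) -> dict[str, str]:
    """Rebuild state from a backup, then reapply newer writes in time order."""
    # Step 1: start from a copy of the backup snapshot.
    state = dict(backup)
    # Step 2: reapply only writes made after the backup, oldest first,
    # so that the most recent version of each item wins.
    for write in sorted(later_writes, key=lambda w: w.timestamp):
        if write.timestamp > backup_time:
            state[write.key] = write.value
    return state
```

For example, given a backup taken at time 100 that contains one post, a write at time 105 is reapplied and a write at time 90 (already covered by the backup) is ignored:

```python
backup = {"blog1/post/1": "Hello"}
writes = [Write(105, "blog1/post/2", "New post"),
          Write(90, "blog1/post/1", "stale")]
restored = restore_and_replay(backup, backup_time=100, later_writes=writes)
# restored == {"blog1/post/1": "Hello", "blog1/post/2": "New post"}
```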

While we restored the service from backup, Blogger remained in read-only mode for just over 10 hours, after which the majority of blogs returned to normal. The backup, however, had some inconsistencies that affected a very small percentage of blogs. Also, the process used to migrate posts immediately after restoration had some unforeseen side effects that we had to address.

Once we had all the blogs restored, we put our energy toward removing inconsistencies and gradually restoring posts, pages, and comments. This required detailed work, and we focused on being meticulous to make sure that we didn’t create any additional issues.

Lessons learned and preventative actions

We’ve learned a lot in a short period of time about how to manage failures in our maintenance and recovery processes. We’ve identified multiple areas to fix and improve — including better tools for repairing inconsistencies in our data store, defensive elements in our software to guard against corruption, better backup and restore procedures, and some procedural changes for maintenance that would have prevented the initial issue. We also outlined how we can improve our communication with Blogger users should something like this occur again, which would include more consistent updates on the user forum.
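The report does not say which checks or procedural changes were adopted. As one illustration of the kind of safeguard described above, the sketch below compares each new replica against the primary copy before writes are re-enabled. It is an assumption-laden example, not Blogger's implementation: the in-memory dictionaries standing in for data stores and the function names `digest`, `replica_problems`, and `safe_to_enable_writes` are all hypothetical.

```python
import hashlib


def digest(records: dict[str, str]) -> dict[str, str]:
    """Map each record key to a SHA-256 digest of its content."""
    return {key: hashlib.sha256(value.encode("utf-8")).hexdigest()
            for key, value in records.items()}


def replica_problems(primary: dict[str, str],
                     replica: dict[str, str]) -> dict[str, list[str]]:
    """Report keys that are missing from the replica or differ from the primary."""
    p, r = digest(primary), digest(replica)
    missing = sorted(k for k in p if k not in r)
    mismatched = sorted(k for k in p if k in r and p[k] != r[k])
    return {"missing": missing, "mismatched": mismatched}


def safe_to_enable_writes(primary: dict[str, str],
                          replicas: list[dict[str, str]]) -> bool:
    """Return True only if every new replica holds all of the primary's data."""
    for replica in replicas:
        problems = replica_problems(primary, replica)
        if problems["missing"] or problems["mismatched"]:
            return False
    return True
```

In this sketch, a replica that lacks a record present in the primary causes `safe_to_enable_writes` to return `False`, which would keep the service in read-only mode until the gap is investigated.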

During this time, we received messages of support from some users, for which we are very grateful. Blogger users have spent countless hours creating blogs that are amazing, creative, important, and personal. Those hours have weighed heavily on us, and for all the anxiety and frustration that this incident caused some Blogger users, we sincerely apologize.

We are committed to swiftly fixing the problems and using our lessons learned to try to make sure that this type of issue doesn't happen again.