October 22nd – Degraded Server Performance
Adzerk experienced approximately two hours of degraded performance yesterday, between 4:00 and 6:00pm EST. Ad serving was intermittent: at times it worked consistently, and at other times it was delayed or down entirely. This is unacceptable to us. We know you expect 100% uptime, and we are taking steps to ensure that what caused this incident doesn't happen again.
Continue reading for a full explanation of what happened.
Earlier in the day, an outage in Amazon's Elastic Block Store (EBS) caused widespread problems for Amazon Web Services customers and took many sites across the internet down. You can read articles here and here for more details. EBS has caused many problems in the past, and we consciously moved all of our ad delivery servers off of it. When the EBS issue began we didn't encounter any problems, and ads continued to serve.
At around 4:00pm EST, Amazon decided to cut off all traffic to the affected availability zone. (We are distributed across multiple zones precisely so that issues in a single zone don't take us down.) When Amazon did this, all of our traffic failed over to our servers in the other zones, and they didn't have enough capacity to serve it. We saw the issue and immediately tried to bring up new instances to meet demand, but because so many other customers were doing the same thing, we were unable to launch additional instances for over an hour: we hit RequestLimitExceeded errors, zones that refused to allow more instances, and other EC2-related issues. Once we found an availability zone we could launch instances in, and worked around another of our service providers (Loggly) being down, we quickly launched the needed instances and service returned to normal.
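For readers curious what "trying zone after zone while getting throttled" looks like in practice, here is a minimal sketch of that kind of launch loop. It assumes boto3 and uses made-up AMI, instance type, and zone names; it is an illustration of the pattern, not our actual provisioning code.

```python
import time
import boto3
from botocore.exceptions import ClientError

# Hypothetical values for illustration only.
AMI_ID = "ami-12345678"
CANDIDATE_ZONES = ["us-east-1a", "us-east-1c", "us-east-1d"]

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_fallback(count):
    """Try each availability zone in turn, backing off on API throttling."""
    for zone in CANDIDATE_ZONES:
        delay = 2
        for attempt in range(5):
            try:
                resp = ec2.run_instances(
                    ImageId=AMI_ID,
                    InstanceType="m1.large",
                    MinCount=count,
                    MaxCount=count,
                    Placement={"AvailabilityZone": zone},
                )
                return [i["InstanceId"] for i in resp["Instances"]]
            except ClientError as e:
                code = e.response["Error"]["Code"]
                if code == "RequestLimitExceeded":
                    # API throttling: wait and retry in the same zone.
                    time.sleep(delay)
                    delay *= 2
                elif code == "InsufficientInstanceCapacity":
                    # This zone can't give us capacity: try the next one.
                    break
                else:
                    raise
    raise RuntimeError("No candidate zone could satisfy the request")
```

During a region-wide event, every zone can fail you at once, which is why we spent over an hour in exactly this kind of loop.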
We are still unclear on why Amazon would shut off all traffic to the affected zone when the non-EBS instances we had there continued to operate throughout the entire incident. This outage was caused almost entirely by Amazon trying to "help" its customers and inadvertently creating new problems.
Going forward, we will expand the number of instances we run at any given time so that if we lose an entire availability zone, the remaining zones can absorb the traffic without launching anything new. We will also expand into US-WEST and EU-1 in the next six months to provide true data center redundancy.
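To make the overprovisioning concrete, here is the rough capacity math behind "survive a zone loss without launching." The traffic and per-instance numbers below are invented for illustration and are not our actual fleet figures.

```python
import math

def instances_per_zone(peak_rps, rps_per_instance, zones):
    """Size each zone so the fleet still covers peak traffic
    after losing any single availability zone."""
    surviving_zones = zones - 1
    needed_total = math.ceil(peak_rps / rps_per_instance)
    # Spread the full requirement across the zones that remain
    # after a failure, so no emergency launches are needed.
    return math.ceil(needed_total / surviving_zones)

# Example: 30,000 req/s peak, 2,500 req/s per instance, 3 zones.
# Each zone runs 6 instances (18 total) instead of the bare
# minimum of 12, buying headroom for a full-zone failure.
print(instances_per_zone(30_000, 2_500, 3))
```

The extra instances cost money every hour, but yesterday showed that counting on launching capacity during a region-wide incident is not a plan.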