Yesterday from approximately 2:15 am until 7:15 pm PDT (2020-06-16 09:15 until 2020-06-17 02:15 UTC) Asana’s webhooks and event streams were partially or fully down and failed to deliver the majority of events during that time. We have identified and disabled the root cause of the downtime and our systems have been operating normally since then. Below is the detailed summary of the event and what we’ll be doing to prevent similar incidents in the future.
Anytime an object in Asana is changed, a detailed record of the change including who caused the change and how the object changed, is put in a queue specific to that object. This record is what eventually becomes the event we deliver to your apps. This event is always enqueued, even if there is no subscriber to eventually receive the event. The act of taking an event, figuring out what its “parents” are (e.g., a story’s events propagate to the task, which propagates to its projects, etc.) looking up all the different webhooks, and distributing the event to those webhooks is enough work that we want it to happen asynchronously. This asynchronous job that processes these queues is called, unsurprisingly, the “distributor.”
The distributor is handed a list of objects that were recently modified on a particular database, and for each of the objects it will check the queue for that object for the more detailed events. To ensure that the distribution is robust, and to ensure that we don’t get duplicate events being sent to webhooks, the distributor holds a lock for a short while as it’s reading from a particular queue. The distributor will drain the queue one event at a time until either the queue is empty or until the lock is about to expire.
If a different instance of the distributor encounters a locked queue, it will work on other objects in its list and retry later. Queues typically have no more than 10 events in them at a time and are drained quickly, so retries are rarely needed. If the maximum number of retries is ever exhausted, the distributor will quit and get rescheduled with the same list of objects. (There is no risk of duplicate events—when the rescheduled distributor checks the queue for an object it’s already processed it will find no events waiting in it.)
The incident began when a badly-behaved app began making requests to update the same task over and over again at a high, sustained rate. The queue for this task started growing faster than the distributors could drain it, and distributors held onto the lock for longer and longer. While one distributor held the lock, another distributor would eventually be assigned another list of modified objects, and because this task was being updated so frequently, this other list inevitably contained the same task. The first distributor, trying to process the growing queue, held onto the lock for long enough that the second distributor would exhaust its retries and quit.
As distributors rotated through all our databases, more and more of them ended up trying to acquire a lock on this one queue, ultimately causing them all to fail and quit. With exponential backoff, the rescheduling of these failed jobs were spaced farther and farther apart, leading to long stretches of time when no events were processed at all. Events continued to accumulate, this one particular database kept falling farther and farther behind, and the distributors became more entrenched in this looping and backoff. Attempts to restart the system worked for a short period of time, until all the distributors accumulated on the problematic database again.
The incident was resolved by banning the badly-behaving app, and manually deleting the excessively large queue, which at its peak contained over 130,000 events.
For most of the incident, our systems were failing to deliver any webhook events, and requests for event streams would return no events as well. Intermittently, events would be delivered but the system would break down again within about 15 minutes.
Our systems account for some amount of lag between the creation of events and their delivery, so events in the near-past (no more than a few hours old) are delivered when the system recovers. However, we do not buffer events indefinitely and so events created towards the start of the incident were lost and could not be delivered when the system finally fully recovered.
This incident was a failure in a large number of ways, and we’re taking away a variety of action items. We haven’t run a full retrospective yet, so this isn’t a complete list of what we’re going to do, but it covers most of the issues we’ve identified already.
- We are reviewing all our Datadog monitors and metrics for correctness. With the recent webhooks improvements we made, we also improved some of our metrics and reporting. However, we failed to update one of our alarms to use the new metrics and as a result we weren’t paged for this incident automatically. Our on-call engineer was manually paged by our customer support team due to the high volume of webhooks-related tickets (thank you all for writing in!) but by this time the incident was already four hours underway.
- We are educating more engineers on how our events system operates. Despite the recent work on webhooks, many parts of the system were left untouched and our engineers were unfamiliar with these inner workings.
- We are adding new metrics to help us identify issues more quickly. We spent a lot of time digging through logs and metrics to discover something that our systems could have told us automatically. We’re going to make it easier to narrow down which parts of our system are not functioning properly, such as what objects the distributors are choking on and what resource queues are backed up.
- We are adding a maximum size to the queues. Most queues contain fewer than 10 events by the time a distributor processes it, and even when our asynchronous job system as a whole falls behind for an hour or so, no queue should grow to contain thousands of events. We will also alert when queues are full and events are being dropped.
- We are going to add finer-grained controls for disabling the event system. After we identified which database and workspace contained the frequently-modified task, we would have liked to block the processing of events for that narrow scope and allow distributors to continue working normally on other databases. Without those controls, the entire set of distributors stayed stuck until we deleted the problematic queue, and it took much longer for us to return to normal operation for most other customers.
We apologize for the duration and magnitude of this outage, and we’re going to continue working hard to make Asana and its API a robust and reliable platform for you to develop on. Let us know if you have any questions about the incident or further concerns about webhooks and event streams. And thank you again for your patience as we worked to investigate and resolve this issue.