Webhooks Incident 11/14

Steven here from the API team, and I wanted to share some transparency on the webhooks issues we've been encountering recently. Due to an infrastructure issue at 5am UTC on 11/14, you might be missing 8 hours of events from your webhooks / event endpoints, as well as any unfetched events from before the incident. We recommend directly fetching any resources you subscribe to on a regular cadence to ensure your app has complete information moving forward.
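As a rough sketch of that recommendation: alongside consuming webhook deliveries, periodically fetch the resources you subscribe to and overwrite your local copies. The `reconcile` helper below is hypothetical (how you fetch from the Asana API is up to your own client); it only shows the merge step.

```python
# Hypothetical reconciliation step: merge freshly fetched resource state
# into a local store, so gaps in webhook delivery are eventually repaired.
def reconcile(local_store, fetched):
    """Overwrite local copies with fetched state; return the gids that changed."""
    updated = []
    for resource in fetched:
        gid = resource["gid"]
        if local_store.get(gid) != resource:
            local_store[gid] = resource
            updated.append(gid)
    return updated
```

Run this on a cadence (e.g. from a cron job or scheduler) for each resource type you subscribe to; it is idempotent, so running it more often than strictly needed is harmless.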

At 5am UTC on 11/14, Asana had a failure in the system we use to manage our webhooks and events data. This meant that, for some domains, all unfetched events and webhook events that had yet to be delivered were lost. This is also the cause of the large number of sync_errors some of you experienced. All data outside of webhooks and events is safe; the failure was localized to this one system, which has since recovered.
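On the sync_errors: when the events endpoint rejects a sync token, it also hands back a fresh one, and the events in the gap are unrecoverable, so the safe response is to re-fetch the resource directly before resuming polling. A minimal sketch, assuming a `get_events(sync)` callable that wraps your own HTTP call and returns a status code plus parsed body:

```python
# Hypothetical sync_error handling: `get_events` is a placeholder for your
# own call to the events endpoint; it returns (http_status, parsed_body).
def poll_events(get_events, sync_token):
    """Returns (events, next_sync_token, needs_full_refetch)."""
    status, body = get_events(sync_token)
    if status == 412:
        # Sync token invalid or too old: events in the gap are lost, so
        # take the fresh token and signal a direct re-fetch of the resource.
        return [], body["sync"], True
    return body["data"], body["sync"], False
```

When `needs_full_refetch` is true, fetch the subscribed resource itself to rebuild state, then resume polling with the new token.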

As of 1pm UTC on 11/14, we are back to normal. We have identified the root cause of the problem and are working to ensure it does not happen again. We are prioritizing better webhooks and events infrastructure to reduce the number of undelivered events, and we are also aiming to surface webhooks status on our status page to give more visibility into these types of problems.

See the first comment for more information about why this happened, and about our future plans.


What happened?

Currently, our webhooks and events APIs both run on the same infrastructure, Redis, since they have similar needs: keeping track of changes (which we call events) and distributing those events to apps.

As part of keeping our infrastructure healthy, every service has a health check. The most basic of these is pinging the service: if the service is up and healthy, it should be able to respond to the ping; if it's not, we restart it in the hope of making it healthy again. Since most of our infrastructure doesn't hold active state, this automatically helps keep our services responsive.
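As an illustration of the pattern described above (not Asana's actual tooling), a ping-based health check with a restart fallback might look like the following, with `ping` and `restart` as injected placeholder callables:

```python
# Hypothetical ping-based health check: restart only after several
# consecutive failed pings, since transient load can fail a single ping.
def check_health(ping, restart, retries=3):
    """ping() -> bool; restart() restarts the service."""
    for _ in range(retries):
        if ping():
            return "healthy"
    restart()
    return "restarted"
```

The danger the incident illustrates: this logic is safe only for stateless services. For a node holding in-memory state, `restart` destroys data, which is exactly what happened here.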

On 11/14, one of our Redis nodes failed its health check. We believe this was caused by a script we were running to remove stale entries, which increased CPU load, but we are not yet 100% sure and are still investigating why the health check failed. The failed check caused our health-check system to kill and restart the Redis node. While this is safe for most systems, we store events in our Redis nodes, so killing the node loses that event data. We're still actively investigating this issue and will update this thread when we have more clarity.

Recovery takes up to 8 hours, rather than just the time needed to bring the Redis node back up, because of how we store webhooks. Each webhook is stored both in our database, which is the source of truth, and as a copy in Redis, which we use to quickly determine where new events should be sent. When we lose the Redis node, we lose its copy of the webhook, so we stop sending events to that webhook. To account for this, we re-copy each webhook to Redis every 8 hours when we run its health check. So if your webhook's health check ran just before the outage, it could take up to 8 hours for the subscription to be rebuilt. This means there's no action needed on your part (we automatically resubscribed your webhook), but that process took some time.
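A rough sketch of that rebuild step, assuming a hypothetical `db_webhooks` list standing in for the database source of truth and a plain dict standing in for the Redis routing copy:

```python
# Hypothetical rebuild: repopulate the Redis routing copy from the
# database source of truth, as runs on the (roughly 8-hour) webhook
# health-check cycle. Missing entries mean events silently go undelivered
# until this runs, which is where the recovery delay came from.
def rebuild_redis_routing(db_webhooks, redis_cache):
    """Copy any webhook missing from the cache; return restored gids."""
    restored = []
    for webhook in db_webhooks:
        if webhook["gid"] not in redis_cache:
            redis_cache[webhook["gid"]] = webhook["target_url"]
            restored.append(webhook["gid"])
    return restored
```

Because the database copy survives a Redis node loss, no subscriber action is needed; the trade-off is that delivery stays broken until the next rebuild pass reaches the affected webhook.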

What are we doing to fix this?

We're stopping usage of the potentially problematic script and keeping a closer eye on our Redis nodes to make sure they don't show up as unhealthy again. We're also investigating how to ensure our nodes don't fail health checks, and, if they do, how to respond in ways that don't lose event data.

Longer term, we've been working on webhooks and events stability for a couple of months (the script mentioned above was part of this work), and that work will continue. For example, we're planning to move our webhooks infrastructure off Redis and onto systems with better data-storage guarantees. This will make the system more robust to these failures and easier to recover.

Additionally, we want to be more transparent about the stability of webhooks and events. As part of our investment in webhooks stability, we plan to expose the status of these services on our status page.