Upcoming improvements to our webhooks system

Hey developers,

My name is Andrew. I am a Developer Advocate here at Asana. Over the past few months, we’ve seen frequent incidents in which webhook events were delayed or even dropped. Unfortunately, these performance issues have impacted your applications, resulting in a suboptimal customer and developer experience.

As we shared last month, we have committed our resources to making webhooks more stable and reliable. Today we are happy to post an update on this commitment. We have begun migrating webhooks onto more reliable infrastructure while incrementally improving the currently deployed system. This complete redesign aims to improve our webhooks’ durability, performance, and scalability.

Background

The issues with our current system stem from two architectural decisions: how we store events and how we distribute them.

Several years ago, we chose Redis as the data store for our events infrastructure. At the time, it made a lot of sense. Our Luna framework used Redis extensively, which meant that our engineers already had experience with it. In addition, a high-throughput concurrent queue is a classic use case for Redis. As a result, we were able to rapidly implement a performant system with acceptable reliability limits.

Out of the box, Redis does not persist data to disk; data is stored in RAM by default. This works fine for an ephemeral cache, but if Redis is restarted in our system, it loses the data it holds. Redis can be configured for persistence, but our current infrastructure complicates any potential deployment. As such, while Redis is capable as a data store and fulfills our original requirements, we need a more scalable solution moving forward.

For event distribution, we use remote jobs, our internal system for running long-lasting tasks, to send webhook events to customer endpoints. The issue here is that, as our usage grows, the SLA of our remote jobs system no longer meets our needs for events. Since we rely on remote jobs succeeding, and succeeding within a given time period, to transfer events within our system, this distribution framework has also become a point of failure.

Redesigning webhooks from the ground up

We believe that a webhooks system without certain durability and latency guarantees would not meet developers’ needs, especially since our community continues to build more applications relying on webhooks as workflow triggers. To scale with our developers’ needs, we’ve launched a phased rollout of long-term fixes to improve the durability and latency of webhooks and eliminate future incidents.

Phase 1

In Phase 1, we will address durability concerns in our event delivery architecture by incrementally migrating from Redis to AWS DynamoDB.

The aim of Phase 1 is to mitigate catastrophic incidents (e.g., losing an entire day of webhooks) and to reduce overall downtime. This phase is already underway, and we estimate it will begin rollout soon. Note that this fix is not intended to solve intermittent or ephemeral latency issues.

Phase 2

Phase 2 focuses on improving the durability of the events queue, where Redis stores all events generated per resource (e.g., a task) before they are moved to consumer-specific queues. In this phase, we will replace any lingering usages of Redis, as well as critical remote jobs, with AWS Kinesis. Once Redis is removed entirely, no single point of failure remains. Migrating off our internal remote jobs framework should also reduce latency.
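To make the data flow in this phase concrete, here is a minimal in-memory sketch of a per-resource events queue whose events are moved into consumer-specific queues. It is a toy model for illustration only, not Asana's implementation, and it uses neither Redis nor Kinesis; the class name `EventFanOut` and its methods are hypothetical.

```python
from collections import defaultdict, deque


class EventFanOut:
    """Toy model (hypothetical): per-resource queues fanned out to per-consumer queues."""

    def __init__(self):
        # resource_id -> events generated for that resource, in arrival order
        self.resource_queues = defaultdict(deque)
        # resource_id -> ids of consumers subscribed to that resource
        self.subscriptions = defaultdict(set)
        # consumer_id -> events waiting to be delivered to that consumer
        self.consumer_queues = defaultdict(deque)

    def subscribe(self, consumer_id, resource_id):
        self.subscriptions[resource_id].add(consumer_id)

    def publish(self, resource_id, event):
        # Every event for a resource first lands in that resource's queue.
        self.resource_queues[resource_id].append(event)

    def distribute(self, resource_id):
        """Move queued events for one resource to each subscriber's queue.

        Returns the number of (event, consumer) copies that were enqueued.
        """
        queue = self.resource_queues[resource_id]
        moved = 0
        while queue:
            event = queue.popleft()
            for consumer_id in sorted(self.subscriptions[resource_id]):
                self.consumer_queues[consumer_id].append(event)
                moved += 1
        return moved
```

In this sketch, everything lives in process memory, so a restart between `publish` and `distribute` would lose the queued events. That is the same class of durability gap this phase is meant to close by moving the queue onto a durable service.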

We estimate that this phase will be complete in Q4 2021. Completing this phase will greatly increase the durability of our webhooks and events infrastructure, bolstering event delivery and drastically reducing the risk of outages.

Phase 3

Phase 3 of the redesign involves replacing the webhook remote job, including the retrying of webhook events, with AWS Simple Queue Service (SQS). Moving webhooks distribution outside the remote jobs framework is designed to significantly improve scalability, performance, and availability. Completion of this phase also unblocks further development of webhooks, allowing our team to keep iterating and innovating on functionality. We expect Phase 3 to be carried out in 2022.
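As a rough illustration of what retrying a webhook delivery can look like, here is a minimal sketch that uses exponential backoff between attempts. The retry policy, the function names (`backoff_delays`, `deliver_with_retries`), and all default values are assumptions made for this example; this post does not describe the actual policy, and the sketch does not use SQS.

```python
import time
from typing import Any, Callable, List


def backoff_delays(max_attempts: int, base_delay: float = 1.0,
                   max_delay: float = 60.0) -> List[float]:
    """Delays (in seconds) to wait before each retry; no delay precedes the first attempt."""
    return [min(base_delay * 2 ** i, max_delay) for i in range(max_attempts - 1)]


def deliver_with_retries(send: Callable[[Any], None], event: Any,
                         max_attempts: int = 5, base_delay: float = 1.0,
                         max_delay: float = 60.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(event) until it succeeds or max_attempts is reached.

    Returns the number of attempts used. If every attempt fails, the
    exception from the last attempt is re-raised to the caller.
    """
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait longer after each failure before trying again.
            sleep(delays[attempt - 1])
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. A queue-based system would typically persist the event between attempts rather than hold it in a running process, which is one reason to move this work onto a durable queue.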

Moving forward

In summary, our webhooks system redesign is already underway. Most changes in this redesign are under the hood, and won’t require you to change the way your integration is built. Along with the long-term fixes we describe above, we’re also actively working with other teams at Asana to reduce latency, decrease lock contention, and recover more quickly from incidents.

To be fully transparent: while we are actively working to improve webhooks performance, you may still see delays and intermittent outages. We recommend these best practices as a fallback for when outages occur. Thank you for your patience as we work to eliminate these roadblocks and improve your integration experience. We’ll continue to post updates in this thread, so please feel free to share your comments, feedback, and questions here as well!

Thanks very much, @AndrewWong (and @Kem_Ozbek) for this detailed information and for your transparency. It’s really appreciated.

As you know, I’m anxiously awaiting these improvements. Two questions at the moment…

  1. Can you say what the status is (pun intended) of plans to improve the reporting of webhook issues on status.asana.com? As you know, the lack of any significant feedback on webhook issues on the status page is a pain point for us.

  2. Do you know if these infrastructure changes will include a reduction of the “noisiness” of webhooks; that is, the current behavior of sometimes receiving multiple webhooks about the same event? It would be great if that could be among the design goals for this re-work.

Hey @Phil_Seeman, thanks for checking in! Regarding the status page, marking downtime is currently a manual process, and only reflects downtime for lost events. We’d ideally like to get to a place where we can report lost events automatically. At one point, we were over-reporting the downtime, and of course we’ve also swung in the other direction. With better observability in the new system, I imagine some of the related issues should be resolved. I’ll keep you posted as I get more updates on this.

On your second point, I’ve heard of issues where some events get generated and sent that shouldn’t have been. Our engineers have been looking into it this week. Can you share some examples of what you’ve been seeing? Feel free to send them to me directly too if that’s easier!
