What happened?
Currently, our webhooks and events APIs both run on the same infrastructure, Redis, since they have similar needs: keeping track of changes (which we call events) and distributing those events to apps.
As part of ensuring our infrastructure stays healthy, every service has a health check. The most basic of these is pinging the service: if the service is up and healthy, it should be able to respond to the ping; if it isn't, we restart it in the hope of returning it to a healthy state. Since most of our infrastructure doesn't hold active state, restarting automatically like this helps ensure that our services remain responsive.
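To illustrate the general shape of such a check, here's a minimal sketch in Python using redis-py. The `restart_node` hook is hypothetical and stands in for whatever orchestration actually performs the restart; it is not our real tooling.

```python
import redis


def is_healthy(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if the Redis node answers a PING within the timeout."""
    try:
        client = redis.Redis(host=host, port=port, socket_timeout=timeout)
        return bool(client.ping())
    except redis.RedisError:
        return False


def restart_node(host: str) -> None:
    """Placeholder: in practice this would call the orchestrator's restart API."""
    raise NotImplementedError


def run_health_check(host: str) -> None:
    # If the node can't answer a ping, assume it's unhealthy and restart it.
    if not is_healthy(host):
        restart_node(host)
```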
On 11/14, one of our Redis nodes failed its health check. We believe this was caused by a script we were running to remove stale entries, which increased the CPU load on the node, but we are not yet certain and are still investigating why the check failed. The failed health check caused our health check system to kill and restart the Redis node. While this is safe for most of our systems, we store events in these nodes, so killing the node loses the event data it holds. We're still actively investigating this issue and will update this thread as we learn more.
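For context on how a cleanup job like this can add load, here's a hedged sketch of the general pattern only; it is not our actual script, and the key pattern and the definition of "stale" are made up for the example. Even though SCAN avoids blocking the server the way KEYS would, walking a large keyspace still costs CPU on the node, which can delay other commands, including the health check's PING.

```python
import redis


def remove_stale_entries(client: redis.Redis, pattern: str = "events:*") -> int:
    """Illustrative cleanup: delete matching keys that have no expiry set."""
    removed = 0
    for key in client.scan_iter(match=pattern, count=500):
        # TTL of -1 means the key exists but has no expiry; treated as "stale"
        # purely for this example.
        if client.ttl(key) == -1:
            client.unlink(key)  # UNLINK frees the memory asynchronously
            removed += 1
    return removed
```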
Recovery took up to 8 hours, rather than just the time needed to bring the Redis node back up, because of how we store webhooks. Each webhook lives both in our database, which is the source of truth, and as a copy in Redis, which we use to quickly figure out where new events should be sent. When we lose a Redis node, we lose its copy of the webhook, so events stop being delivered to that webhook. To account for this, we copy each webhook to Redis every 8 hours as part of its health check. So if your webhook's health check ran just before the outage, it could take up to 8 hours for the subscription to be rebuilt. This means no action was needed on your part (we automatically resubscribed your webhook), but that process took some time.
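To make that resync concrete, here's a minimal sketch of the general pattern, assuming a hypothetical `load_webhooks_from_db` helper and a simple Redis hash layout; neither reflects our actual schema. The point is only that the database copy lets the Redis routing copy be rebuilt automatically on the next periodic run.

```python
import json

import redis


def load_webhooks_from_db() -> list[dict]:
    """Hypothetical: fetch all webhook subscriptions from the source-of-truth database."""
    return [
        {"id": "wh_123", "url": "https://example.com/hook", "events": ["item.updated"]},
    ]


def resync_webhooks(client: redis.Redis) -> None:
    """Rebuild the Redis routing copy from the database.

    If this runs periodically (every 8 hours in the scenario above), a lost
    Redis copy is eventually repopulated without any action from the webhook
    owner.
    """
    for webhook in load_webhooks_from_db():
        client.hset("webhooks", webhook["id"], json.dumps(webhook))
```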
What are we doing to fix this?
We’re stopping usage of the potentially problematic script and keeping a closer eye on our Redis nodes to ensure they don’t show up as unhealthy again. We’re also investigating how to prevent our nodes from failing health checks in the first place and, when one does fail, how to respond without losing event data.
Longer term, we’ve been working on webhooks and events stability for a couple of months (the script mentioned above was part of this work), and we will continue to do so. For example, we’re planning to move our webhooks infrastructure off Redis and onto systems with stronger data storage guarantees. This will make our system more robust to failures like this one and make recovery easier.
Additionally, we want to be more transparent about the stability of webhooks and events. As part of our investment in webhooks stability, we plan to expose the status of these services on our status page.