Yesterday I noticed at least two missing events on a webhook, and when checking the logs I had no sync_errors on that endpoint. Is there a risk that we don’t always receive a “sync_error” for a project when something happens, or what else could be causing this?
Hey @Matt_Bramlage @Jeff_Schneider @Amit_Kumar & @Marie.
This has major implications for our organizations. We are “losing” business-critical information daily now.
The number of issues and outages affecting Webhooks, Event Streams, Automations, and Background Actions must be a critical issue for Asana.
We are expecting some response from Asana. Something more than silence. Where are you communicating?
During the outages over the last couple of weeks, events happening in Asana never reach our webhook endpoint. The information is lost. The activities in Asana are correctly recorded in Asana but they are never sent by webhook.
For us to build a backup solution that retroactively fetches these actions for the affected period would be a large undertaking, and it would need to be replicated by every customer who depends on these webhooks.
I would prefer that Asana build a single solution on your side and make sure that all actions made in Asana during an outage are sent as soon as the systems are up again. If the queues don’t work, it’s better that you build a single solution on your side to identify what has happened in Asana during that time and which of these events have not been sent by webhook yet.
It’s better for all if you build the fallback solution on your side than for all the customers to build one for themselves.
As of now, these webhooks are just missing, and we don’t get any public statement from you. What’s the plan, what’s the timeline?
I’m sure this is a priority for you, and you are aiming for a long-term solution rather than short-term fixes but it’s not sustainable to keep us waiting for that any longer. We need the quick fix now, to go on with our operations.
I totally agree!
I am surprised more people aren’t commenting and creating posts about this.
Asana is for business and when this is not working, our business is not working!
Please prioritise this asap
Hey, @Bastien_Siebman & @Phil_Seeman, how are you handling this loss of events, which has recurred daily at 13:00 UTC for the past two weeks?
I don’t have any “real time” code that polls Asana or reacts to events, so my tools probably don’t work at that time, but apart from frustration for the user, that thankfully does not impact me.
It’s a HUGE problem for me. I agree 1000% with everything you said above.
I’ve raised it internally at Asana with some of my contacts, and have made it quite clear how serious an issue it is. I believe the message has been received and it’s now getting a top priority, but at the same time I will continue pushing on it until it gets better!
cc: @Ross_Grambo @Jeff_Schneider @Matt_Bramlage
Thanks @Phil_Seeman for confirming this as well. I felt like I was going crazy with my backend missing things, but I haven’t been able to find any issue other than events actually not being sent.
Hi, we recently experienced this kind of problem too, and it has a major impact on our operations and the workflow of our services.
Please look into it ASAP.
Thanks for bringing this to our attention and raising its visibility! We’re aware that there are some issues with our webhooks right now and our API infrastructure team is looking into it.
You might have been able to intuit what has happened by looking at our status chart, but the short version is that some cases involving lots of changes to a single object (like a task) within a short period of time cause issues for us. You can see in the chart when load has gotten high enough for this scenario to have been a problem over the last few days.
This is a classic scaling problem where we’re fine as long as things don’t get too busy, and in production when things do get too busy we lose some events which we consider to be a pretty serious condition. For some reason this has become more pronounced over the past few days so we’re looking into it. Our current hypothesis is that a performance regression is causing our systems to process webhooks more slowly, meaning they’re getting overwhelmed, but we’re still investigating whether this seems true so we can identify and ship a fix.
We won’t have a timeline for a fix until we fully understand the root cause, but we do see it happening on our side. We’ll keep you updated as we make progress! Sorry again for the disruption! I’ll share more information when we have it. Thanks again for your patience, and be assured that we’re aware and working on it!
Thank you @Matt_Bramlage!
This is how we lowered the load on our systems for the past 14 days:
- The left table in the image shows the volumes Asana still sends to us, represented by the unfiltered volumes for 7 days last month.
- The right table shows the volumes for the last 7 days, after our own grouping/filtering, implemented 14 days ago on our side, reduced these character-input events to manageable event volumes.
- This lowered our Cloudmate processing volumes to 20% of the load sent by Asana.
We identified that Asana basically sends an event for every character typed into a text custom field, so we added a grouping filter for those events before we do any other processing of the data. Our clients have some custom fields that are used for longer text, written by the user directly into the custom field in the web app.
I believe you could make a substantial difference to your system load simply by not sending an event for every character typed into text custom_fields, as you currently do. You currently send us webhooks full of 100 events with exactly the same body; all the events are identical.
Assuming Asana has clients entering large amounts of text in the web app during ~13 - 16 UTC daily, this is likely to overwhelm your systems.
My insight into the concrete issue here is obviously limited, but I hope it might be a shortcut to a quick fix.
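For anyone who wants to try the same approach, here is a minimal sketch of this kind of grouping filter. It is not Cloudmate's actual code: the function name `collapse_identical_events` and the sample event fields are assumptions made for illustration, and it only removes events whose bodies are exactly identical within one delivery.

```python
import json
from typing import Any, Iterable


def collapse_identical_events(events: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return events with exact duplicates removed, keeping first-seen order."""
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for event in events:
        # Canonical JSON, so key order does not affect the comparison.
        key = json.dumps(event, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Illustrative payload only: 100 identical events plus one different event.
typing_event = {"action": "changed", "resource": {"gid": "123", "resource_type": "task"}}
other_event = {"action": "added", "resource": {"gid": "456", "resource_type": "task"}}
batch = [typing_event] * 100 + [other_event]
print(len(collapse_identical_events(batch)))  # 2
```

Running the filter before any other processing means the expensive work happens once per distinct change rather than once per keystroke.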
We saw an 18h delay on events from yesterday. We are unsure how many were lost.
All events generated in Asana today seem to have been sent to us, looks very promising.
Any progress update from @Matt_Bramlage or @Jeff_Schneider ?
Another good day here.
We would love confirmation from Asana that it’s likely to work as intended from here on, so that we can inform our clients that they no longer need to restrict their office hours.
Yes, we think we’re in the clear, but we want to be sure on our side that everything seems great before communicating that we’re confident that we’ve gotten the root cause!
We think that our system for processing and distributing webhooks has been falling behind. This means that we might be slow in sending webhooks, but in particular we think that we may have been losing events if a lot of changes happen on a single object at once. Our event queues (which are allocated per object) have a fixed length and if too many changes happen before we can ship them out we might overrun these queues. From our side this looks like it only happened a very few times!
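To illustrate the failure mode described above, here is a toy model of a fixed-length per-object queue. It is only a sketch: the class name `BoundedEventQueue` is hypothetical, and whether the real system drops the newest or the oldest events on overrun is not stated in this thread, so dropping the newest is an assumption.

```python
from collections import deque


class BoundedEventQueue:
    """Toy fixed-capacity queue that counts events it had to drop."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.dropped = 0
        self._items: deque = deque()

    def push(self, event) -> bool:
        # Assumption: when the queue is full, the incoming event is lost.
        if len(self._items) >= self.capacity:
            self.dropped += 1
            return False
        self._items.append(event)
        return True

    def drain(self) -> list:
        """Ship out everything currently queued and empty the queue."""
        items = list(self._items)
        self._items.clear()
        return items


queue = BoundedEventQueue(capacity=3)
for change in range(5):  # five rapid changes to one object before any drain
    queue.push(change)
print(queue.dropped)  # 2
print(queue.drain())  # [0, 1, 2]
```

If the consumer drains more slowly than changes arrive, `dropped` keeps growing, which matches the description of events being lost only when many changes hit a single object at once.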
We think we’ve identified the cause and have shipped a fix! I wouldn’t advise telling your clients that everything is fixed, because I wouldn’t want you to over-promise, but we think we’re at least trending back toward a more reliable uptime for webhooks. In light of this, we’re also examining how we can make webhooks have stronger reliability guarantees, so keep an eye out for future developments!
Thank you for the update @Matt_Bramlage, it’s really valuable to us.
We just experienced a 33 min absence of incoming webhooks.
First time in 5 days.
Events happening in Asana during this time have not been sent to Cloudmate webhook endpoint yet, but newer events have. This feels like a lost event to us. From experience, these can take up to 48h to be delivered.
I’m sharing our perspective and insight in hope of resolving this issue faster. Let us know if there is anything else we can do to be of service here.
While I was writing this post, the event stream stopped again and is currently down.
Update: Now we seem to be back on track. The longest recorded delay so far today is 82 min.
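A delay figure like this can be measured by comparing each event's creation time with the time it arrived at the endpoint. The sketch below assumes you log both timestamps per event; the function name `longest_delay` and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def longest_delay(deliveries: Iterable[tuple[datetime, datetime]]) -> timedelta:
    """Return the largest (received_at - created_at) gap, or zero if empty."""
    return max(
        (received_at - created_at for created_at, received_at in deliveries),
        default=timedelta(0),
    )


# Made-up sample: one event delivered after 2 seconds, one after 82 minutes.
base = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
sample = [
    (base, base + timedelta(seconds=2)),
    (base + timedelta(minutes=1), base + timedelta(minutes=83)),
]
print(longest_delay(sample))  # 1:22:00
```

Store both timestamps in the same time zone (UTC is the easy choice) so the numbers are comparable between different people's logs.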
So that I might see if/how that corresponds to my incoming webhooks today, can you tell me what time zone is on the X axis?
Also, don’t know if it’s related but I’ve gotten 9 sync_errors in the past 24 hours or so, after not getting any in the few days prior to that.
Hi @Phil_Seeman, we are GMT+2, Sweden.
We don’t see any correlation between sync_errors and the total absence/delay of webhooks. Today’s sync_error count is one, and it occurred unusually close to the time of today’s delay.
We only receive sync_error on one subscription type (many projects). We are evaluating whether it’s related to us subscribing to projects while they are being generated (filled with tasks and so on). Do you see any correlation with that assumption?
I’m not seeing any gaps today - this is last 4 hours, in UTC time:
I don’t think so, since that would be happening daily for me, so I would then expect to see sync errors daily.
Oh well, so much for that burst of optimism on our part…
Hey @Phil_Seeman and team!
We’ve recently experienced performance issues with webhooks and want to share a quick update.
Recent issues with the infrastructure that supports our webhooks and events systems have resulted in delays or losses of events over the last 2 days. We believe we’ve identified the root cause of the most recent incidents, we have recovered, and we are working on ensuring they don’t happen again. We plan to share more information about the recent outage, and our plans to improve webhooks and events, in the near future.
In the meantime, we recommend that you subscribe to our status page to be alerted to incidents as they happen, which now includes a separate line for webhooks and events. We also recommend that you implement a fallback of scanning resources periodically to catch up with any changes missed from lost events.
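As a rough illustration of the periodic-scan fallback, the sketch below compares a list of recently modified resources against what the webhook handler has already processed. Everything here is an assumption for illustration: `fetch_modified_since` is a hypothetical stand-in for whatever API call lists resources changed since a given time, and resources are simplified to `(gid, modified_at)` pairs.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable

# (gid, modified_at) pairs, a simplified stand-in for real API resources.
Resource = tuple[str, datetime]


def find_missed_changes(
    fetch_modified_since: Callable[[datetime], Iterable[Resource]],
    last_processed: dict[str, datetime],
    last_scan: datetime,
    overlap: timedelta = timedelta(minutes=5),
) -> list[str]:
    """Return gids changed since the last scan that webhooks never covered."""
    missed = []
    # Look back slightly before the last scan so boundary changes are not skipped.
    for gid, modified_at in fetch_modified_since(last_scan - overlap):
        processed_at = last_processed.get(gid)
        if processed_at is None or processed_at < modified_at:
            missed.append(gid)
    return missed


# Made-up data: task "1" was handled via webhook, task "2" was not.
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)


def fake_fetch(since: datetime) -> list[Resource]:
    return [("1", now - timedelta(minutes=30)), ("2", now - timedelta(minutes=10))]


processed = {"1": now - timedelta(minutes=30)}
print(find_missed_changes(fake_fetch, processed, last_scan=now - timedelta(hours=1)))  # ['2']
```

Each returned gid would then be re-fetched and pushed through the same code path the webhook handler uses, so a lost event only means a delayed update rather than a missing one.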
Thanks for your patience as we work through this. We’ll get back to you shortly with more information!
Sorry to just write when it is bad. Most of the time Asana is awesome!
We have not received any webhooks for the past 3 hours.
Asana support has been responsive, but the escalation of priority is slow. There is no info at status.asana.com, but support has confirmed that some issue is being resolved. The problem has persisted for 3h+ now.