About once a day (though sometimes I can go a few days), my Flowsana integration receives a webhook whose type is “sync_error”, with the message, “There was an error with the event queue, which may have resulted in missed events. If you are keeping resources in sync, you may need to manually re-fetch them.”
Granted, Flowsana receives many thousands of webhooks a day, so this is a pretty small error rate. Still, if it means someone’s workflow or rule does not work properly as a result, it’s a potential issue.
I can tell from the webhook which Flowsana user account had the error, but nothing further that I can see, and it’s not practical at that point for me to query every project and task for that account to try and figure out what might have been missed.
I’m wondering if you have any insights into what causes these sync errors and/or how we might mitigate the situation when it occurs. Thanks!
After adding logging and digging through deeper parts of the code, I can report that the source of sync errors is, unfortunately, an unavoidable occurrence: a Redis node dying. Webhooks (and event streams) are built on a multi-step, asynchronous pipeline that pushes data from where a user makes a change to where we send the webhook to your server. This pipeline involves piping temporary data about the events into and out of a Redis store. It’s unavoidable that these Redis nodes sometimes die, and when that happens they take the enqueued events with them.
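To make the "at most once" behavior concrete, here is a toy model of a single volatile queue node. It is only a sketch: `VolatileQueue`, `crash`, and `drain` are hypothetical names, and the real pipeline has more stages than this. The one thing it illustrates is that events enqueued before the node dies are gone, and the consumer sees a `sync_error` marker in their place.

```python
from collections import deque

SYNC_ERROR = "sync_error"  # marker name taken from the thread; its shape here is an assumption


class VolatileQueue:
    """Toy stand-in for one Redis node: everything lives in memory only."""

    def __init__(self):
        self._events = deque()
        self._lost_data = False

    def enqueue(self, event):
        self._events.append(event)

    def crash(self):
        # Node death: whatever was enqueued disappears with the node.
        self._events.clear()
        self._lost_data = True

    def drain(self):
        """Deliver what is left; prepend a sync_error if events may have been lost."""
        delivered = []
        if self._lost_data:
            delivered.append(SYNC_ERROR)
            self._lost_data = False
        while self._events:
            delivered.append(self._events.popleft())
        return delivered


queue = VolatileQueue()
queue.enqueue("task 1 changed")
queue.crash()                    # "task 1 changed" is never delivered
queue.enqueue("task 2 changed")
print(queue.drain())             # ['sync_error', 'task 2 changed']
```

Each event is delivered zero or one times, never twice, which is what "at most once" means here.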
It’s possible for us to change the behavior of webhook event delivery from “at most once” to “at least once” but it would involve a massive rearchitecting and reimplementation of our pipeline, which we’re unable to do at this time. However, I’m following up with other teams at Asana to see if there’s anything we can do to make these issues occur less frequently.
A recommendation I can make to help make it easier to recover: give each subscription its own URL/query parameters so that you can distinguish between different projects in the same account, e.g., receive webhooks at /webhooks/<account-id>/<resource-id> or /webhooks/<account-id>?resource=<resource-id>.
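A rough sketch of how a receiver might use those URLs to narrow down what to re-fetch. The function names are hypothetical, and checking for `sync_error` in either an `action` or a `type` field is an assumption about the payload shape, not something confirmed in this thread.

```python
from urllib.parse import urlsplit, parse_qs

SYNC_ERROR = "sync_error"


def subscription_from_url(url):
    """Return (account_id, resource_id) from either URL layout suggested above."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "webhooks":
        raise ValueError(f"not a webhook URL: {url!r}")
    account_id = segments[1]
    if len(segments) >= 3:
        # Layout 1: /webhooks/<account-id>/<resource-id>
        return account_id, segments[2]
    # Layout 2: /webhooks/<account-id>?resource=<resource-id>
    resource = parse_qs(parts.query).get("resource")
    if not resource:
        raise ValueError(f"no resource id in {url!r}")
    return account_id, resource[0]


def resources_to_refetch(url, events):
    """Return the (account, resource) pairs that need a manual re-fetch."""
    account_id, resource_id = subscription_from_url(url)
    # Assumption: a sync error is flagged in the event's "action" or "type" field.
    if any(SYNC_ERROR in (e.get("action"), e.get("type")) for e in events):
        return [(account_id, resource_id)]
    return []
```

With one URL per subscription, a sync error points at a single project rather than a whole account, so the re-fetch is limited to that project's tasks.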
Thanks for the info, @Joe_Trollo - not the best of news but good to know the scoop.
Follow-up question:
Does this also apply to the Events API - i.e. if one were to use that instead of (or in addition to) webhooks, would that also potentially be missing the same events; or is this a webhooks-only phenomenon?
This applies to event streams too: you’ll also encounter sync_error events there. The difference is that events wait in Redis until they’re fetched rather than being pulled from Redis and delivered promptly. When Redis dies, the event stream is entirely lost and we stop accumulating events for it. This means that you’re likely to lose more information from an event stream when this happens. (More events are likely to be waiting in the queue, and we won’t collect events between when Redis goes down and when you make your next request.)
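For a consumer that polls an event stream to keep a local copy in sync, one way to react is to discard the local copy and rebuild it whenever a batch contains a sync error. This is a minimal sketch under stated assumptions: `apply_batch`, `refetch_all`, and the `resource_id`/`state` event fields are all hypothetical and do not reflect the actual Asana payload.

```python
SYNC_ERROR = "sync_error"


def apply_batch(cache, events, refetch_all):
    """Apply one polled batch to a local cache.

    Returns True if the batch contained a sync error and the cache was rebuilt.
    `refetch_all` is a caller-supplied function returning the current state of
    every tracked resource as a dict (hypothetical; it would wrap real API calls).
    """
    if any(e.get("action") == SYNC_ERROR for e in events):
        # Events may be missing, so incremental updates can no longer be trusted.
        cache.clear()
        cache.update(refetch_all())
        return True
    for e in events:
        cache[e["resource_id"]] = e["state"]
    return False
```

The rebuild replaces the whole cache because, per the explanation above, there is no way to tell which events were lost.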
I’m going to do this. Not sure at the moment what I’ll do with that additional info, but we’ll see!
Much appreciated, that would be awesome!
I understand this is not something you can do right away, but I would think it’d be something for the API team’s pipeline? Those of us who rely on webhooks in our apps would really like to be able to count on them to be complete and accurate…
That sort of redesign isn’t currently on our roadmap. We believe that we can get dramatically better reliability with significantly smaller investments in our infrastructure. For example, writing the events to two Redis nodes instead of one would square the probability of failure/event loss, assuming the two nodes fail independently. If the failure rate of a single node is 1 in 10,000 now, the rate at which both nodes fail together would be 1 in 100,000,000.
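The arithmetic behind that example, as a short check. It assumes node failures are independent, so the probability that every copy is lost is the single-node probability raised to the number of copies; `loss_probability` is just an illustrative helper name.

```python
from fractions import Fraction


def loss_probability(single_node_failure, replicas):
    """Probability that all replicas fail, assuming independent failures."""
    return single_node_failure ** replicas


p = Fraction(1, 10_000)
print(loss_probability(p, 1))  # 1/10000
print(loss_probability(p, 2))  # 1/100000000
```

If failures are correlated (for example, both nodes sharing a host or a network segment), the real improvement would be smaller than this.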
Wondering if there’s been any progress on implementing these changes re. sync errors, or what the status is? I continue to get sync errors in my Flowsana webhooks daily (sometimes once a day, sometimes more than once, but pretty much every day).
I’ve gotten a relatively higher number of sync errors of late: 14 over the past 24 hours as of this writing. Just wanted to mention it in case it helps in your sync_error investigation.
The user that caused it cycled their PAT and caused it again. We did a user ban this time. The API team is planning a “fix” for this, where they cap out an event stream and either trash the stream if it gets too big or stop appending events to the queue. Neither sounds like a great option, but both are better than more outages!
@Ross_Grambo
Wow! Guess you weren’t expecting THAT to happen.
But unless I’m missing something, I don’t think my sync errors are related to yesterday’s outage. I say that because (1) during the outage I was getting NO webhooks, valid or sync_error; and (2) I’ve gotten 17 sync_errors today alone, well after service was restored.
FYI my Flowsana app has gotten 21 sync_errors between 7:59 PM EST last night (8/9/20) and 6:34 AM today so far (no particular reason to think there won’t be more coming shortly).
Wondering if there has been any progress on the sync_error issue? It’s pretty critical for those of us who have apps built on top of the Asana webhook platform. Thanks!
FYI Flowsana is getting another raft of sync_errors today. It started at 6:54 am Pacific time. I’ve gotten 16 so far, and they still seem to be coming in. Let me know if I can provide any other info that might be helpful.