"Sync_error" occurrences

Hi guys,

About once a day (though sometimes I can go a few days), my Flowsana integration receives a webhook whose type is “sync_error”, with the message, “There was an error with the event queue, which may have resulted in missed events. If you are keeping resources in sync, you may need to manually re-fetch them.”
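For context, here is roughly how an integration might flag these on arrival. This is a minimal sketch, assuming the delivery body is JSON with an "events" array and that the sync error shows up as an event whose "type" is "sync_error" as described above; the route and logging are illustrative, not Flowsana's actual code:

```python
# Minimal sketch of flagging sync_error deliveries in a webhook receiver.
# Assumes the POST body is JSON with an "events" array and that a sync
# error arrives as an event whose "type" is "sync_error"; the route and
# logging are illustrative only.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks", methods=["POST"])
def receive_webhook():
    payload = request.get_json(silent=True) or {}
    for event in payload.get("events", []):
        if event.get("type") == "sync_error":
            # Upstream events may have been dropped; mark this account's
            # data as suspect instead of trusting the stream from here on.
            app.logger.warning("sync_error: %s", event.get("message"))
        else:
            pass  # normal event handling would go here
    return "", 200
```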

Granted, Flowsana receives many thousands of webhooks a day, so this is a pretty small error rate; but still, if it means someone’s workflow or rule doesn’t work properly as a result, it’s a potential issue.

I can tell from the webhook which Flowsana user account had the error, but nothing further that I can see, and it’s not practical at that point for me to query every project and task for that account to try and figure out what might have been missed.

I’m wondering if you have any insights into what causes these sync errors and/or how we might mitigate the situation when it occurs. Thanks!

@Joe_Trollo @Matt_Bramlage @Jeff_Schneider @Ross_Grambo

Hi Phil,

After adding logging and digging through deeper parts of the code, I can report that the source of sync errors is, unfortunately, an unavoidable occurrence: a Redis node dying. Webhooks (and event streams) are built on a multi-step, asynchronous pipeline that pushes data from where a user makes a change to where we send the webhook to your server. Along the way, temporary data about the events is staged in and read back out of a Redis store. It’s unavoidable that these Redis nodes sometimes die, and when that happens they take the enqueued events with them.
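To make the failure mode concrete, here is the at-most-once pattern in miniature. This is a toy illustration, not our actual pipeline code; the queue name and delivery step are placeholders:

```python
# Toy illustration of at-most-once delivery via Redis: events exist only
# in the node's memory while queued, so a dying node takes everything
# still enqueued with it. Queue name and delivery step are placeholders.
import json
import redis

r = redis.Redis()

def enqueue(event: dict) -> None:
    # The event now lives only in this Redis node's memory.
    r.lpush("event_queue", json.dumps(event))

def send_to_subscriber(event: dict) -> None:
    print("POSTing to subscriber:", event)  # stand-in for the real delivery

def deliver_next() -> None:
    # BRPOP removes the event from Redis the moment it hands it over.
    # If delivery then fails, or the node dies with events still queued,
    # there is no durable copy left to retry from: at most once.
    item = r.brpop("event_queue", timeout=5)
    if item is not None:
        _key, raw = item
        send_to_subscriber(json.loads(raw))
```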

It’s possible for us to change the behavior of webhook event delivery from “at most once” to “at least once” but it would involve a massive rearchitecting and reimplementation of our pipeline, which we’re unable to do at this time. However, I’m following up with other teams at Asana to see if there’s anything we can do to make these issues occur less frequently.

A recommendation I can make to make it easier to recover: give each subscription its own URL or query parameters so that you can distinguish between different projects in the same account, e.g., receive webhooks at /webhooks/&lt;account-id&gt;/&lt;resource-id&gt; or /webhooks/&lt;account-id&gt;?resource=&lt;resource-id&gt;.
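For example, something along these lines when establishing the webhooks. The creation call is the standard POST /webhooks with "resource" and "target"; the token, receiver base URL, and IDs below are placeholders:

```python
# Sketch of giving each subscription its own target URL so that a later
# sync_error can be traced back to one specific resource.
import requests

ASANA_TOKEN = "..."                             # personal access token (placeholder)
RECEIVER_BASE = "https://example.com/webhooks"  # hypothetical receiver

def create_webhook(account_id: str, resource_id: str) -> dict:
    # Embed both IDs in the target path (/webhooks/<account>/<resource>)
    # so a sync_error delivered there identifies the affected resource.
    target = f"{RECEIVER_BASE}/{account_id}/{resource_id}"
    resp = requests.post(
        "https://app.asana.com/api/1.0/webhooks",
        headers={"Authorization": f"Bearer {ASANA_TOKEN}"},
        json={"data": {"resource": resource_id, "target": target}},
    )
    resp.raise_for_status()
    return resp.json()
```

Note that each distinct target URL goes through its own X-Hook-Secret handshake when the webhook is created, so your receiver needs to echo the secret back per subscription.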

Thanks for the info, @Joe_Trollo - not the best of news but good to know the scoop.

Follow-up question:
Does this also apply to the Events API? That is, if one were to use that instead of (or in addition to) webhooks, would it also potentially miss the same events, or is this a webhooks-only phenomenon?

This applies to event streams too; you’ll also encounter sync_error events there. The difference is that events wait in Redis until they’re fetched, rather than being pulled from Redis and delivered promptly. When Redis dies, the event stream is entirely lost and we stop accumulating events for it. This means you’re likely to lose more information from an event stream when this happens: there are likely to be more events waiting in the queue, and we won’t collect events between when Redis goes down and when you make your next request.
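In practice that means a polling loop should treat a 412 from GET /events as "the stream is gone": re-fetch the resource's state, then resume with the fresh sync token from the response body. A rough sketch, with the re-fetch and event-handling steps as placeholders:

```python
# Sketch of an events-API polling step that handles a lost stream.
# Token and helper functions are placeholders.
import requests

ASANA_TOKEN = "..."  # personal access token (placeholder)
API = "https://app.asana.com/api/1.0"

def refetch_resource(resource_gid: str) -> None:
    print("re-fetching full state of", resource_gid)  # hypothetical re-sync

def handle_event(event: dict) -> None:
    print("event:", event)  # hypothetical handler

def poll_events(resource_gid: str, sync_token: str | None) -> str:
    params = {"resource": resource_gid}
    if sync_token:
        params["sync"] = sync_token
    resp = requests.get(
        f"{API}/events",
        headers={"Authorization": f"Bearer {ASANA_TOKEN}"},
        params=params,
    )
    if resp.status_code == 412:
        # Sync token invalid or the stream was lost (e.g. Redis died).
        # Whatever happened in the gap is unrecoverable from the stream,
        # so re-fetch state before resuming with the fresh token.
        refetch_resource(resource_gid)
        return resp.json()["sync"]
    resp.raise_for_status()
    body = resp.json()
    for event in body.get("data", []):
        handle_event(event)
    return body["sync"]
```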

I’m going to do this. Not sure at the moment what I’ll do with that additional info, but we’ll see!

Much appreciated, that would be awesome!

I understand this is not something you can do right away, but I would think it’d be something for the API team’s pipeline? Those of us who rely on webhooks in our apps would really like to be able to count on them to be complete and accurate…

That sort of redesign isn’t currently on our roadmap. We believe that we can get dramatically better reliability with significantly smaller investments in our infrastructure. For example, writing the events to two Redis nodes instead of one would square the probability of failure/event loss, assuming the nodes fail independently. If the failure rate of a single node is 1 in 10,000 now, the failure rate of two nodes would be 1 in 100,000,000.
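To spell out the arithmetic (again assuming the two nodes fail independently):

```python
p_single = 1 / 10_000      # assumed per-event loss rate with one node
p_double = p_single ** 2   # both independent copies must be lost
print(p_double)            # 1e-08, i.e. 1 in 100,000,000
```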

That sounds great. I’m not partial to any particular fixes, just good to know you’ll be taking some action(s) to improve it!

Hi @Joe_Trollo @Ross_Grambo,

Not sure if it means anything in particular, but I’ve been getting a lot of sync errors (compared to usual) over the past 48 hours or so.

Have you had an opportunity to make any of those infrastructure changes yet? :pray: