"Sync_error" occurrences

Hi guys,

About once a day (though sometimes I can go a few days), my Flowsana integration receives a webhook whose type is “sync_error”, with the message, “There was an error with the event queue, which may have resulted in missed events. If you are keeping resources in sync, you may need to manually re-fetch them.”

Granted, Flowsana receives many thousands of webhooks a day, so this is a pretty small error rate. Still, if it means someone’s workflow or rule does not work properly as a result, it’s a potential issue.

I can tell from the webhook which Flowsana user account had the error, but nothing further that I can see, and it’s not practical at that point for me to query every project and task for that account to try and figure out what might have been missed.

I’m wondering if you have any insights into what causes these sync errors and/or how we might mitigate the situation when it occurs. Thanks!

@Joe_Trollo @Matt_Bramlage @Jeff_Schneider @Ross_Grambo

Hi Phil,

After adding logging and digging through deeper parts of the code, I can report that the source of sync errors is, unfortunately, an unavoidable occurrence: a Redis node dying. Webhooks (and event streams) are built on a multi-step, asynchronous pipeline that pushes data from where a user makes a change to where we send the webhook to your server. This pipeline involves piping temporary data about the events into and out of a Redis store. It’s unavoidable that these Redis nodes sometimes die, and when that happens they take the enqueued events with them.

It’s possible for us to change the behavior of webhook event delivery from “at most once” to “at least once”, but it would involve a massive re-architecting and reimplementation of our pipeline, which we’re unable to do at this time. However, I’m following up with other teams at Asana to see if there’s anything we can do to make these issues occur less frequently.

A recommendation I can make to help make it easier to recover: give each subscription its own URL/query parameters so that you can distinguish between different projects in the same account, e.g., receive webhooks at /webhooks/<account-id>/<resource-id> or /webhooks/<account-id>?resource=<resource-id>.
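To make the recommendation above concrete, here is a minimal sketch of the first URL scheme (/webhooks/<account-id>/<resource-id>). The helper names are hypothetical, and the check for a sync error is an assumption: it treats an event as a sync error if either its `action` or its `type` field equals "sync_error", and it assumes events arrive in an `events` list, so verify both against the payloads you actually receive.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical path layout, taken from the suggestion above:
# /webhooks/<account-id>/<resource-id>
WEBHOOK_PATH = re.compile(r"^/webhooks/(?P<account>[^/]+)/(?P<resource>[^/]+)$")


def target_path(account_id: str, resource_id: str) -> str:
    """Build the per-subscription path to register as the webhook target."""
    return f"/webhooks/{account_id}/{resource_id}"


def parse_target_path(path: str) -> Optional[Tuple[str, str]]:
    """Recover (account_id, resource_id) from an incoming request path."""
    match = WEBHOOK_PATH.match(path)
    if match is None:
        return None
    return match.group("account"), match.group("resource")


def has_sync_error(payload: dict) -> bool:
    """Assumption: a sync error is marked by action or type == 'sync_error'."""
    return any(
        event.get("action") == "sync_error" or event.get("type") == "sync_error"
        for event in payload.get("events", [])
    )


def resources_to_refetch(path: str, payload: dict) -> List[Tuple[str, str]]:
    """Return the (account, resource) pairs that need a manual re-fetch."""
    parsed = parse_target_path(path)
    if parsed is None or not has_sync_error(payload):
        return []
    return [parsed]
```

With this in place, a sync error narrows the re-fetch down to the one resource named in the URL instead of every project and task in the account.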

Thanks for the info, @Joe_Trollo - not the best of news but good to know the scoop.

Follow-up question:
Does this also apply to the Events API - i.e. if one were to use that instead of (or in addition to) webhooks, would that also potentially be missing the same events; or is this a webhooks-only phenomenon?

This applies to event streams too: you’ll also encounter sync_error events there. The difference is that events wait in Redis until they’re fetched rather than being pulled from Redis and delivered promptly. When Redis dies, the event stream is entirely lost and we stop accumulating events for it. This means that you’re likely to lose more information from an event stream when this happens. (There are likely to be more events waiting in the queue, and we won’t collect events between when Redis goes down and when you make the next request.)

I’m going to do this. Not sure at the moment what I’ll do with that additional info, but we’ll see!

Much appreciated, that would be awesome!

I understand this is not something you can do right away, but I would think it’d be something for the API team’s pipeline? Those of us who rely on webhooks in our apps would really like to be able to count on them to be complete and accurate…

That sort of redesign isn’t currently on our roadmap. We believe that we can get dramatically better reliability with significantly smaller investments in our infrastructure. For example, writing the events to two Redis nodes instead of one would square the probability of failure/event loss. If the failure rate of a single node is 1 in 10,000 now, the failure rate of two nodes would be 1 in 100,000,000.
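The arithmetic above can be checked with a short sketch. It assumes that node failures are independent, so an event is lost only if every node holding a copy fails; the function name is hypothetical and the 1-in-10,000 figure is the illustrative rate from the post, not a measured one.

```python
from fractions import Fraction


def loss_probability(single_node_failure: Fraction, replicas: int) -> Fraction:
    """Probability that all replicas fail, assuming independent failures."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return single_node_failure ** replicas


one_node = loss_probability(Fraction(1, 10_000), 1)   # 1/10000
two_nodes = loss_probability(Fraction(1, 10_000), 2)  # 1/100000000
```

If failures are correlated (for example, both nodes sharing a host or a deploy), the real improvement would be smaller than this estimate.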

That sounds great. I’m not partial to any particular fixes, just good to know you’ll be taking some action(s) to improve it!

Hi @Joe_Trollo @Ross_Grambo,

Not sure if it means anything in particular but I’ve been getting a lot (compared to usual) of sync errors over the past 48 hours or so.

Have you had an opportunity to make any of those infrastructure changes yet? :pray:

Hi @Joe_Trollo, @Ross_Grambo,

Wondering if there’s been any progress on implementing these changes re. sync errors, or what the status is? I continue to get sync errors in my Flowsana webhooks daily (sometimes once a day, sometimes more than once, but pretty much every day).

Thanks for any updates you might have to provide!

I don’t have anything to report yet. We’re looking into the cause of daily failures.

Hi @Ross_Grambo and @Joe_Trollo,

I’ve gotten a relatively high number of sync errors of late: 14 over the past 24 hours as of this writing. Just wanted to mention it in case it helps in your sync_error investigation.

Hey Phil,

The user that caused it cycled their PAT and caused it again. We did a user ban this time. The API team is planning a “fix” for this, where they cap out an event stream and either trash the stream if it gets too big or stop appending events to the queue. Neither sounds like a great option, but they’re better than more outages!

@Ross_Grambo
Wow! Guess you weren’t expecting THAT to happen.

But unless I’m missing something, I don’t think my sync errors are related to yesterday’s outage. I say that because (1) during the outage I was getting NO webhooks, valid or sync_error; and (2) I’ve gotten 17 sync_errors today alone, well after service was restored.

Oof! Sorry, webhooks were top of mind so I assumed this was related.

I’m checking with the API team now.

Looks like this was not on their roadmap. They just added an investigation into it, with double encoding events as the front-running solution.

Hi @Ross_Grambo,

FYI my Flowsana app has gotten 21 sync_errors between 7:59 PM EST last night (8/9/20) and 6:34 AM today so far (no particular reason to think there won’t be more coming shortly).

Wondering if there has been any progress on the sync_error issue? It’s pretty critical for those of us who have apps built on top of the Asana webhook platform. Thanks!

Hi @Ross_Grambo,

FYI still continuing to see lots more sync_errors than usual - 12 more today since my posting above, and counting.

I pinged API oncall to take a look. Sorry for the delay!

Could I get Flowsana’s app ID, and any other app IDs having this issue?

Thanks, @Ross_Grambo - I’ll DM you the App Id.

Hi @Ross_Grambo,

FYI Flowsana is getting another raft of sync_errors today. It started at 6:54 am Pacific time. I’ve gotten 16 so far, and they still seem to be coming in. Let me know if I can provide any other info that might be helpful.
