Hi everyone, sorry for the issues!
Yes, this was caused by problems on our side. Here's the underlying issue:
When objects change state in Asana, the new data is written to our main databases, which hold the canonical object state. As part of this, a separate table keeps a log of all objects that have had any changes since [whenever], and a request is made to a Redis cluster to record, for the API's use, what changed on that object. Afterwards, an asynchronous job plays forward all the events since the last time it ran: it looks up all the changed objects in the object change table, cross-references them with the information stored in Redis, and pushes the resulting events to all event/webhook subscribers of those objects whose events haven't been sent out yet. This somewhat complex flow is pretty typical for long-running jobs that we don't want to handle in real time in our servers, to keep them responsive. Other such jobs include sending out emails, sending the "consider updating your project progress" tasks, updating Inboxes, and so on.
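To make the flow above concrete, here's a minimal sketch of the replay step. This is not our actual code; the stores are stand-in dicts (`change_log`, `pending_changes`, `subscribers` are all hypothetical names) standing in for the change-log table, the Redis cluster, and the subscription registry.

```python
from typing import Callable

# Hypothetical stand-ins for the real stores: a change-log table of
# (object_id, changed_at) rows, a Redis-like map of object_id -> pending
# change details, and a map of per-object event/webhook subscribers.
change_log: list[tuple[str, float]] = []
pending_changes: dict[str, list[str]] = {}
subscribers: dict[str, list[Callable[[str, str], None]]] = {}

def run_distributor(last_run_ts: float, now_ts: float) -> int:
    """Play forward events since the last run: find objects that changed
    in the window, cross-reference their pending change details, and
    push each event to that object's subscribers. Returns the number of
    events delivered."""
    delivered = 0
    changed = {oid for oid, ts in change_log if last_run_ts < ts <= now_ts}
    for oid in changed:
        # pop() so a change is only delivered once across runs
        for event in pending_changes.pop(oid, []):
            for push in subscribers.get(oid, []):
                push(oid, event)
                delivered += 1
    return delivered
```

The key property is that the job is a catch-up loop driven by the change log, so if one run of it never completes, everything scheduled onto it simply waits.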
What happened was that one of our background job processors got stuck due to local state: a local database on that machine hit an integer overflow, which put it into a strange place, so jobs on it got "stuck". When this happens, a job gets rescheduled, but it gets rescheduled on the same machine, which was still in a bad state, so the job stayed stuck. Unfortunately, one of these jobs was the distributor job that fans out some (but not all) of our events/webhooks. The "some" part affects only some users or apps, but it is consistent, so if you were seeing problems you would consistently see them even with new webhooks, since your subscription would keep getting scheduled onto the same stuck job.
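Why reschedules kept landing on the same bad machine can be illustrated with a sketch of deterministic job placement. This is an assumption about the general pattern, not our scheduler's real code; `workers`, `overrides`, and `assign_worker` are hypothetical names.

```python
import hashlib

workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical worker pool
overrides: dict[str, str] = {}  # manual "kick it to a new machine" map

def assign_worker(job_key: str) -> str:
    """Deterministic placement: the same job key always hashes to the
    same worker. A reschedule therefore lands on the same (possibly
    broken) machine unless an operator override forcibly moves it."""
    if job_key in overrides:
        return overrides[job_key]
    digest = hashlib.sha256(job_key.encode()).digest()
    return workers[digest[0] % len(workers)]
```

Deterministic placement is usually a feature (it keeps related work and its local state together), but here it meant the failure was sticky: the affected subscriptions hashed to the broken worker every time, which is why the same users and apps were consistently affected.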
We’ve since forcibly kicked these jobs to a new machine, which means we should be sending out events and webhooks again. Sorry for the issues here; this is one of those “we didn’t plan for this failure mode” things that we’ll run our postmortem process on and try to fix for the future.