Hey folks,
This is Amit from the API team. Our webhooks and event stream infrastructure suffered a failure around 6:00 am UTC on 2021-02-01. One of the Redis nodes we use to store events for webhooks was marked as dead/unreachable by our monitoring infrastructure and replaced. As a result, you might be missing about 9 hours of events from the events and webhooks endpoints, including any events you had not fetched previously. We recommend fetching the latest version of the resources you subscribe to at a regular interval to minimize the impact.
Clients see this failure as “sync_errors”. Although there is a gap in events, the webhook subscriptions themselves were not affected. By 3:00 pm UTC, the webhooks and events infrastructure had recovered. The trigger was the same as in our previous webhooks incident (a Redis node failing its health check), but we’re still investigating the root cause.
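For anyone wondering what the periodic re-fetch recommended above might look like, here is a minimal sketch. It is not our SDK: `reconcile` and `fetch_latest` are hypothetical names, and the `id` and `updated_at` fields are assumptions about how your resources are shaped. The idea is to compare a full fetch against your local copy, so changes whose events were lost still get picked up.

```python
from typing import Callable, Dict, Iterable, List

Resource = Dict[str, object]


def reconcile(
    local: Dict[str, Resource],
    fetch_latest: Callable[[], Iterable[Resource]],
) -> List[str]:
    """Refresh `local` in place from a full fetch; return the ids that changed.

    `fetch_latest` is a hypothetical callable standing in for whatever API
    call returns the current state of the resources you subscribe to.
    Each resource is assumed to carry an "id" and an "updated_at" field.
    """
    changed: List[str] = []
    seen = set()

    for resource in fetch_latest():
        rid = str(resource["id"])
        seen.add(rid)
        cached = local.get(rid)
        # New resource, or one whose update event we may have missed.
        if cached is None or cached.get("updated_at") != resource.get("updated_at"):
            local[rid] = resource
            changed.append(rid)

    # Resources that no longer exist upstream (their delete event may be lost).
    for rid in list(local):
        if rid not in seen:
            del local[rid]
            changed.append(rid)

    return changed


if __name__ == "__main__":
    cache = {
        "a": {"id": "a", "updated_at": 1},
        "b": {"id": "b", "updated_at": 1},
    }
    upstream = [{"id": "a", "updated_at": 2}, {"id": "c", "updated_at": 1}]
    print(reconcile(cache, lambda: upstream))  # ['a', 'c', 'b']
```

Running something like this on a timer alongside your webhook handler means a gap in events delays an update instead of losing it. How often to run it depends on your rate limits and how stale your data can safely be.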
To address these issues, we are taking a few steps:
A new Webhooks and Events infrastructure
In November 2020, following our previous incident, we started work on a new Webhooks and Events Infrastructure that would address the fundamental concerns about event durability. A dedicated team is working on this project, and the stability of this infrastructure is of utmost importance to us. We know how important it is for developers and applications to have a stable events and webhooks infrastructure.
Increase the Visibility and Observability
We have added Webhooks and Event Streams to our service status page. Internally, we have added alerts to catch performance and stability issues. While these alerts and monitors do not prevent incidents like this one, they help us pinpoint the key areas to focus on in the new infrastructure.
While it gives us no joy to report an outage, we’ll strive to maintain transparency, and we’ll continue to invest in bringing you a new and better infrastructure for webhooks.