What happened?
Currently, our webhooks and events APIs both run on the same infrastructure, Redis, since they have similar needs: keeping track of changes (which we call events) and distributing those events to apps.
As part of ensuring our infrastructure stays healthy, every service has a health check. The most basic of these is pinging the service: if the service is up and healthy, it should be able to respond to the ping; if it isn't, we restart it in the hope of returning it to a healthy state. Since most of our infrastructure doesn't hold active state, restarting automatically like this helps ensure that our services remain responsive.
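To illustrate the general shape of such a check, here's a minimal sketch in Python using redis-py. The `restart_node` hook is hypothetical and stands in for whatever orchestration actually performs the restart; it is not our real tooling.

```python
import redis


def is_healthy(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if the Redis node answers a PING within the timeout."""
    try:
        client = redis.Redis(host=host, port=port, socket_timeout=timeout)
        return bool(client.ping())
    except redis.RedisError:
        return False


def restart_node(host: str) -> None:
    """Placeholder: in practice this would call the orchestrator's restart API."""
    raise NotImplementedError


def run_health_check(host: str) -> None:
    # If the node can't answer a ping, assume it's unhealthy and restart it.
    if not is_healthy(host):
        restart_node(host)
```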
On 11/14, one of our Redis nodes failed its health check. We believe this was caused by a script we were running to remove stale entries, which increased the CPU load on the node, but we are not yet certain and are still investigating why the check failed. The failed health check caused our health check system to kill and restart the Redis node. While this is safe for most of our systems, we store events in these nodes, so killing the node loses the event data it holds. We're still actively investigating this issue and will update this thread as we learn more.
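For context on how a cleanup job like this can add load, here's a hedged sketch of the general pattern only; it is not our actual script, and the key pattern and the definition of "stale" are made up for the example. Even though SCAN avoids blocking the server the way KEYS would, walking a large keyspace still costs CPU on the node, which can delay other commands, including the health check's PING.

```python
import redis


def remove_stale_entries(client: redis.Redis, pattern: str = "events:*") -> int:
    """Illustrative cleanup: delete matching keys that have no expiry set."""
    removed = 0
    for key in client.scan_iter(match=pattern, count=500):
        # TTL of -1 means the key exists but has no expiry; treated as "stale"
        # purely for this example.
        if client.ttl(key) == -1:
            client.unlink(key)  # UNLINK frees the memory asynchronously
            removed += 1
    return removed
```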
Recovery took up to 8 hours, rather than just the time needed to bring the Redis node back up, because of how we store webhooks. Each webhook lives both in our database, which is the source of truth, and as a copy in Redis, which we use to quickly figure out where new events should be sent. When we lose a Redis node, we lose its copy of the webhook, so events stop being delivered to that webhook. To account for this, we copy each webhook to Redis every 8 hours as part of its health check. So if your webhook's health check ran just before the outage, it could take up to 8 hours for the subscription to be rebuilt. This means no action was needed on your part (we automatically resubscribed your webhook), but that process took some time.
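To make that resync concrete, here's a minimal sketch of the general pattern, assuming a hypothetical `load_webhooks_from_db` helper and a simple Redis hash layout; neither reflects our actual schema. The point is only that the database copy lets the Redis routing copy be rebuilt automatically on the next periodic run.

```python
import json

import redis


def load_webhooks_from_db() -> list[dict]:
    """Hypothetical: fetch all webhook subscriptions from the source-of-truth database."""
    return [
        {"id": "wh_123", "url": "https://example.com/hook", "events": ["item.updated"]},
    ]


def resync_webhooks(client: redis.Redis) -> None:
    """Rebuild the Redis routing copy from the database.

    If this runs periodically (every 8 hours in the scenario above), a lost
    Redis copy is eventually repopulated without any action from the webhook
    owner.
    """
    for webhook in load_webhooks_from_db():
        client.hset("webhooks", webhook["id"], json.dumps(webhook))
```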
What are we doing to fix this?
We’re stopping usage of the potentially problematic script and keeping a closer eye on our Redis nodes to ensure they don’t show up as unhealthy again. We’re also investigating how to prevent our nodes from failing health checks in the first place and, when one does fail, how to respond without losing event data.
Longer term, we’ve been working on webhooks and events stability for a couple of months (the script mentioned above was part of this work), and we will continue to do so. For example, we’re planning to move our webhooks infrastructure off Redis and onto systems with stronger data storage guarantees. This will make our system more robust to failures like this one and make recovery easier.
Additionally, we want to be more transparent about the stability of webhooks and events. As part of our investment in webhooks stability, we plan to expose the status of these services on our status page.