Recent increase in network errors

Hi everyone, we have an unfortunate update to share about a recent increase in API errors. You may have noticed a very small fraction of your API requests (around 0.001%) failing because the response body is not valid JSON. The cause is an issue with our load balancers that drops the final packet of the response. The dropped packet can go unnoticed during the request itself and is only detected afterwards, when the response is parsed. By that point, any retry logic around the request has likely already finished, because the request appears to have succeeded. (Our official client libraries, in particular, retry based solely on response codes, not on the validity of the body.)
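To illustrate the failure mode, here is a minimal Python sketch. The payload and the `parse_body` helper are hypothetical and are not part of our client libraries. It shows a response whose status code looks fine but whose truncated body only fails once it is parsed.

```python
import json

def parse_body(status_code: int, body: bytes):
    """Return (ok, data); ok is False if the status or the JSON body is bad.

    Hypothetical helper for illustration only.
    """
    if status_code != 200:
        return False, None
    try:
        return True, json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so a body
        # that was cut off before its final bytes arrived ends up here.
        return False, None

# Made-up payload; the truncation simulates a dropped final packet.
full = b'{"data": [{"gid": "1", "name": "Example task"}]}'
truncated = full[:-10]

print(parse_body(200, full)[0])       # status and body are both fine
print(parse_body(200, truncated)[0])  # status is fine, body is not
```

A check based only on the status code would treat both responses as successful.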

Previously, this issue was confined to a single load balancer at a time, which we would replace when necessary. However, it is now occurring at very low rates across all of our load balancers, even new ones, so we’re currently unable to mitigate it. Our metrics show that roughly 0.001% of all API requests are affected, though the proportion of operations affected is potentially much higher because a single operation can involve many requests; for example, an app may make several requests to fetch all tasks in a project. Because this manifests in an unorthodox way that can be confusing to encounter, we wanted to post about it publicly. Our current recommendation is to make your apps robust to errors other than non-200 status codes, such as a response body that fails to parse, and to retry where appropriate.
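As one possible way to follow this recommendation, the sketch below retries when the body fails to parse as well as when the status code is not 200. It is only an illustration under stated assumptions: `get_json_with_retry`, the `fetch` callable, and the backoff values are hypothetical names and choices, not part of our client libraries. `fetch` stands in for whatever HTTP call your app makes and is assumed to return a `(status_code, body_bytes)` pair.

```python
import json
import time
from typing import Any, Callable, Tuple

def get_json_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> Any:
    """Call fetch() until it yields a 200 response with a valid JSON body.

    Hypothetical helper: `fetch` is assumed to perform one request and
    return (status_code, body_bytes).
    """
    last_error: Exception = RuntimeError("no attempts made")
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            try:
                return json.loads(body)
            except ValueError as exc:
                # Truncated or otherwise invalid JSON: treat as a failure
                # even though the status code looked successful.
                last_error = exc
        else:
            # Simplification: every non-200 status is treated as retryable.
            last_error = RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"request failed after {max_attempts} attempts") from last_error
```

This pattern is safest for reads. If the body of a write request's response is truncated, the write itself may already have been applied, so blindly retrying it could duplicate the operation.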

Our load balancers are hosted by AWS and we’ve been working with them to resolve this dropped packet issue for a while now (starting back when we noticed the problem on individual load balancers). We currently believe that the problem originates from AWS-managed network components. Because we can no longer actively remove malfunctioning load balancers to protect developers from this issue, the problem may be more prevalent than it was before.

We’ll update this post with any news we receive, and will continue working to resolve the issue and minimize any impact.

A brief update here: our infrastructure team has been able to find an alternate configuration for our load balancers that avoids the root cause of the packet loss. We’re continuing to work with AWS to resolve the original issue, but API users should no longer be affected. Thank you for your patience!
