Limit concurrent HTTP connections to avoid crippling overload
Even the fastest web servers become bottlenecks when handling CPU-intensive API work. Slow response times can cripple your service, especially when paired with an API gateway that times out after 29 seconds.
I solved this using a simple middleware that eliminated 504 - Gateway Timeout responses and significantly reduced unnecessary load on my service API.
I assumed that if a single request takes 5 seconds on average, at least five requests could complete before hitting Amazon API Gateway’s 29-second timeout.
In practice, the behavior was completely different (though in retrospect, it makes perfect sense). CPU resources are divided equally among all concurrent requests, causing all responses to slow down proportionally.
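As a rough back-of-the-envelope model (assuming perfectly fair CPU sharing and a fixed 5 seconds of CPU work per request, the numbers from the scenario above), each of n concurrent requests takes roughly n × 5 seconds of wall-clock time:

(defn approximate-response-time
  "Wall-clock seconds for each of `n` concurrent requests that each need
  5 seconds of CPU, when the CPU is shared equally among them."
  [n]
  (* n 5))

(approximate-response-time 1) ;=> 5,  fine
(approximate-response-time 5) ;=> 25, still under the 29-second timeout
(approximate-response-time 6) ;=> 30, every single request now times out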
I found myself in a situation where even a moderate load would make the system incapable of responding within the time limit. The API gateway would return 504 - Gateway Timeout on behalf of my service, while my unaware service kept burning CPU on responses that would never be used for anything, slowing everything down even further.
A sure way to contribute to climate change and get a high cloud bill, while delivering zero value.
Oh wait…
A caller is very likely to retry a request when the response indicates a temporary problem, which is exactly what 504 - Gateway Timeout signals. Now multiply your already high cloud bill by the retry count.
In other words: A disaster. ☠️
An entirely different architecture, maybe involving a queue or some async response mechanism, would probably have been a better solution. But sometimes, we need to work with what we’ve got.
Since my CPU load was fairly consistent across requests, I could predict how many concurrent connections could complete within the timeout limit.
With the following middleware, I limit concurrent active connections to ensure high CPU utilization while still responding within the timeout:
(defn wrap-limit-concurrent-connections
  "Middleware that limits the number of concurrent connections to `max-connections`,
  via the atom `current-connections-atom`.
  This means that the middleware can be applied in several different places
  while still sharing an atom if necessary."
  [handler current-connections-atom max-connections]
  (fn [request]
    (let [connection-no (swap! current-connections-atom inc)]
      (try
        (if (>= max-connections connection-no)
          (handler request)
          ;; Over the limit: shed load immediately instead of piling up work.
          {:status 503 :body "Service Unavailable"})
        (finally
          ;; Always release the slot, even if the handler throws.
          (swap! current-connections-atom dec))))))
The middleware implementation is very naive and assumes that the service only exposes work with a similar load profile, so that the same middleware (and coordination atom) can be reused across the service.
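As a sketch of how that wiring might look (the handler names and the limit of 5 are made up for illustration), two CPU-heavy handlers can share the same atom so that their combined concurrency stays bounded:

(def heavy-connections (atom 0))

;; `report-handler` and `export-handler` are hypothetical Ring handlers.
;; Both count against the same limit because they share the atom.
(def wrapped-report-handler
  (wrap-limit-concurrent-connections report-handler heavy-connections 5))

(def wrapped-export-handler
  (wrap-limit-concurrent-connections export-handler heavy-connections 5))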
Though the middleware does make the 504 - Gateway Timeout responses go away, they are replaced with slightly fewer 503 - Service Unavailable responses. The important part is that the maximum possible number of 200 - OK responses is allowed through, keeping the system partially responsive while it scales up (deploying more instances).
I ran load tests to find the value of `max-connections` that matched the given work and the hardware the service was running on.
Endpoints with low CPU intensity, such as health checks, should not be wrapped in the middleware. You don’t want a service instance terminated and restarted just because the health check can’t get through to say: I’m still doing important stuff.
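One way to keep the health check out of the limiter, sketched here with hypothetical `health-handler` and `heavy-handler` Ring handlers, is to dispatch on the URI before the wrapped handler is ever reached:

(def wrapped-heavy-handler
  (wrap-limit-concurrent-connections heavy-handler heavy-connections 5))

(defn app [request]
  (if (= "/health" (:uri request))
    ;; Health checks bypass the limiter, so the instance is never reported
    ;; as dead just because it is busy doing real work.
    (health-handler request)
    (wrapped-heavy-handler request)))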
A more sophisticated rate-limiting middleware is possible using the same scaffolding as above. Maybe something that times requests and reduces concurrency as response time goes up, or something with different weights instead of just incrementing and decrementing by one. But if this starts getting hairy, you might be better off with an entirely different architecture.
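For instance, a weighted variant of the same scaffolding might look like the sketch below, where `estimate-cost` is a hypothetical function that guesses how expensive a request will be:

(defn wrap-limit-concurrent-cost
  "Like wrap-limit-concurrent-connections, but each request adds its estimated
  cost to `current-cost-atom` instead of counting as exactly one."
  [handler current-cost-atom max-cost]
  (fn [request]
    (let [cost  (estimate-cost request)
          total (swap! current-cost-atom + cost)]
      (try
        (if (>= max-cost total)
          (handler request)
          {:status 503 :body "Service Unavailable"})
        (finally
          ;; Release exactly the cost this request added.
          (swap! current-cost-atom - cost))))))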
Use with caution. 💚