Limit concurrent HTTP connections to avoid crippling overload
Even the fastest web servers become bottlenecks when handling CPU-intensive API work. Slow response times can cripple your service, especially when paired with an API gateway that times out after 29 seconds.
I solved this using a simple middleware that eliminated 504 - Gateway Timeout responses and significantly reduced unnecessary load on my service API.
I assumed that if a single request takes 5 seconds on average, at least five requests could complete before hitting Amazon API Gateway’s 29-second timeout.
In practice, the behavior was completely different (though in retrospect, it makes perfect sense). CPU resources are divided equally among all concurrent requests, causing all responses to slow down proportionally.
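As a rough back-of-the-envelope model (assuming perfectly fair CPU sharing and a fixed 5 seconds of CPU work per request, the numbers from the scenario above), each of n concurrent requests takes roughly n × 5 seconds of wall-clock time:

(defn approximate-response-time
  "Wall-clock seconds for each of `n` concurrent requests that each need
  5 seconds of CPU, when the CPU is shared equally among them."
  [n]
  (* n 5))

(approximate-response-time 1) ;=> 5,  fine
(approximate-response-time 5) ;=> 25, still under the 29-second timeout
(approximate-response-time 6) ;=> 30, every single request now times out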
I found myself in a situation where even a moderate load would make the system incapable of responding within the time limit. The API gateway would return 504 - Gateway Timeout on behalf of my service, while my unaware service kept burning CPU on responses that would never be used for anything, slowing everything down even further.
A sure way to contribute to climate change and get a high cloud bill, while delivering zero value.
Oh wait…
A caller is very likely to retry a request when the response indicates a temporary problem, which is exactly what 504 - Gateway Timeout signals. Now multiply your already high cloud bill by the retry count.
In other words: A disaster. ☠️
An entirely different architecture, maybe involving a queue or some async response mechanism, would probably have been a better solution. But sometimes, we need to work with what we’ve got.
Since my CPU load was fairly consistent across requests, I could predict how many concurrent connections could complete within the timeout limit.
With the following middleware, I limit concurrent active connections to ensure high CPU utilization while still responding within the timeout:
(defn wrap-limit-concurrent-connections
  "Middleware that limits the number of concurrent connections to `max-connections`,
  via the atom `current-connections-atom`.
  This means that the middleware can be applied in several different places
  while still sharing an atom if necessary."
  [handler current-connections-atom max-connections]
  (fn [request]
    (let [connection-no (swap! current-connections-atom inc)]
      (try
        (if (>= max-connections connection-no)
          (handler request)
          ;; Over the limit: shed load immediately instead of piling up work.
          {:status 503 :body "Service Unavailable"})
        (finally
          ;; Always release the slot, even if the handler throws.
          (swap! current-connections-atom dec))))))
The middleware implementation is very naive and assumes that the service only exposes work with a similar load profile, so that the same middleware (and coordination atom) can be reused across the service.
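As a sketch of how that wiring might look (the handler names and the limit of 5 are made up for illustration), two CPU-heavy handlers can share the same atom so that their combined concurrency stays bounded:

(def heavy-connections (atom 0))

;; `report-handler` and `export-handler` are hypothetical Ring handlers.
;; Both count against the same limit because they share the atom.
(def wrapped-report-handler
  (wrap-limit-concurrent-connections report-handler heavy-connections 5))

(def wrapped-export-handler
  (wrap-limit-concurrent-connections export-handler heavy-connections 5))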
Though the middleware does make the 504 - Gateway Timeout responses go away, they are replaced with slightly fewer 503 - Service Unavailable responses. The important part is that the maximum possible number of 200 - OK responses is allowed through, keeping the system partially responsive while it scales up (deploying more instances).
I ran load tests to find the value of `max-connections` that matched the given work and the hardware the service was running on.
Endpoints with low CPU intensity, such as health checks, should not be wrapped in the middleware. You don’t want a service instance terminated and restarted just because the health check can’t get through to say: I’m still doing important stuff.
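One way to keep the health check out of the limiter, sketched here with hypothetical `health-handler` and `heavy-handler` Ring handlers, is to dispatch on the URI before the wrapped handler is ever reached:

(def wrapped-heavy-handler
  (wrap-limit-concurrent-connections heavy-handler heavy-connections 5))

(defn app [request]
  (if (= "/health" (:uri request))
    ;; Health checks bypass the limiter, so the instance is never reported
    ;; as dead just because it is busy doing real work.
    (health-handler request)
    (wrapped-heavy-handler request)))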
A more sophisticated rate-limiting middleware is possible using the same scaffolding as above. Maybe something that times requests and reduces concurrency as response time goes up, or something with different weights instead of just incrementing and decrementing by one. But if this starts getting hairy, you might be better off with an entirely different architecture.
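For instance, a weighted variant of the same scaffolding might look like the sketch below, where `estimate-cost` is a hypothetical function that guesses how expensive a request will be:

(defn wrap-limit-concurrent-cost
  "Like wrap-limit-concurrent-connections, but each request adds its estimated
  cost to `current-cost-atom` instead of counting as exactly one."
  [handler current-cost-atom max-cost]
  (fn [request]
    (let [cost  (estimate-cost request)
          total (swap! current-cost-atom + cost)]
      (try
        (if (>= max-cost total)
          (handler request)
          {:status 503 :body "Service Unavailable"})
        (finally
          ;; Release exactly the cost this request added.
          (swap! current-cost-atom - cost))))))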
Use with caution. 💚