Description
Background
Google’s Site Reliability Engineering, Handling Overload (§ Adaptive Throttling, Eq 21‑01) states that a client should:
- Increment `requests` as soon as the call is attempted (before it goes on the wire).
- Increment `accepts` only after the backend returns success.
- Use a sliding window long enough (default ≈ 2 minutes) to ensure every successful response lands in the same window that already holds its originating request.
This tight feedback loop lets the drop‑probability rise the moment successes start to lag behind attempts.
Problem 1 – `requests` counted only after the call finishes
In `googleBreaker.doReq`, both the total and the success/failure counters are updated inside the `defer` callback, i.e. after the RPC returns.
As a result, in‑flight calls are invisible to the window: during an outage the client keeps sending traffic until the first wave of responses arrives. This postpones throttling exactly by the round‑trip time of the service.
```go
err := req() // no accounting yet: the attempt is not yet visible to the window
defer func() {
	markSuccess() // total++ happens only here, after the response
	// ... success/failure counters are also updated here
}()
```
SRE model: `requests++` before `req()` is invoked.
Current behavior: `requests++` after the response.
Problem 2 – Window shorter than a possible response timeout
`googleBreaker` uses a 10 s window (40 × 250 ms buckets), while a single call can legally run longer than 10 s.
When that happens:
- The request falls out of the sliding window before its response returns.
- The subsequent acceptance is recorded in a newer window that no longer contains the matching request.
- Depending on timing, the `requests / accepts` ratio either inflates, causing excessive local drops long after the backend has recovered, or never rises at all, delaying protection.
SRE model: window ≥ service P99 latency (≈ 2 min in the book), so each success always pairs with its request.
Current behavior: window may be shorter than the call, breaking the pairing.
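The pairing breakage is easy to demonstrate with a toy bucketed rolling window (hypothetical structure, not go-zero's implementation): even when the request is counted at attempt time, a 12 s call in a 10 s window leaves the success stranded in a window that no longer contains its request.

```go
package main

import "fmt"

// bucketWindow keeps per-second counters and only sums the last `span`
// seconds, mimicking a sliding window of fixed length.
type bucketWindow struct {
	span     int64           // window length in seconds
	requests map[int64]int64 // bucket timestamp (s) -> count
	accepts  map[int64]int64
}

func newWindow(span int64) *bucketWindow {
	return &bucketWindow{
		span:     span,
		requests: map[int64]int64{},
		accepts:  map[int64]int64{},
	}
}

// sum totals the buckets still inside the window at time `now`.
func (w *bucketWindow) sum(m map[int64]int64, now int64) (n int64) {
	for t, c := range m {
		if now-t < w.span {
			n += c
		}
	}
	return n
}

func main() {
	w := newWindow(10) // 10 s window, as in googleBreaker

	w.requests[0]++ // request attempted at t=0
	w.accepts[12]++ // its success arrives at t=12, after a slow call

	now := int64(12)
	fmt.Println(w.sum(w.requests, now), w.sum(w.accepts, now)) // 0 1
	// The request has aged out: the window sees 0 requests but 1 accept,
	// so the requests/accepts ratio no longer reflects the paired call.
}
```

With a window at least as long as the worst-case call latency (the book's ≈ 2 min default), both buckets would still be inside the window and the ratio would stay meaningful.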
Impact
These two mismatches can lead to either:
- Slow throttling: the backend is already failing while the breaker still passes traffic, or
- Over‑throttling: the client drops requests even though the backend is healthy again.
Both scenarios diverge from the intent of the adaptive throttling algorithm outlined in the SRE book.