Request/Accept accounting in googleBreaker deviates from Google SRE adaptive throttling model #5011

@ktalg

Description

Background
Google’s Site Reliability Engineering, Handling Overload (§ Adaptive Throttling, Eq 21‑01) states that a client should:

  • Increment requests as soon as the call is attempted (before it goes on the wire).
  • Increment accepts only after the backend returns success.
  • Use a sliding window long enough (≈ 2 minutes by default) that every successful response lands in the same window that already holds its originating request.

This tight feedback loop lets the drop probability rise the moment successes start to lag behind attempts.
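
For reference, Eq 21‑1 computes the client-side rejection probability as max(0, (requests − K × accepts) / (requests + 1)), with K = 2 as the book's suggested multiplier. Below is a minimal sketch of that accounting order; the adaptiveThrottler type and its method names are illustrative only, and a real implementation would keep the counters in a roughly two-minute rolling window rather than plain fields:

package throttle

import (
    "math"
    "math/rand"
    "sync"
)

// adaptiveThrottler is an illustrative sketch of the SRE model, not the
// googleBreaker implementation. requests/accepts stand in for the
// ~2-minute rolling-window counters.
type adaptiveThrottler struct {
    mu       sync.Mutex
    requests float64 // attempts, counted before the call is issued
    accepts  float64 // backend successes, counted after the response
    k        float64 // multiplier; the book suggests K = 2
}

// Allow applies Eq 21-1 and, if the request is not dropped locally,
// counts the attempt before it goes on the wire.
func (t *adaptiveThrottler) Allow() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    p := math.Max(0, (t.requests-t.k*t.accepts)/(t.requests+1))
    if rand.Float64() < p {
        return false // reject locally
    }
    t.requests++ // requests++ happens before req() is invoked
    return true
}

// MarkSuccess counts an accept only after the backend returned success.
func (t *adaptiveThrottler) MarkSuccess() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.accepts++
}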


Problem 1 – requests counted only after the call finishes

In googleBreaker.doReq, both the total and the success/failure counters are updated inside the defer callback, i.e. after the RPC returns.
As a result, in‑flight calls are invisible to the window: during an outage the client keeps sending traffic until the first wave of responses arrives, which postpones throttling by roughly one round‑trip time of the service.

err := req()           // no accounting yet
defer func() {
    markSuccess()      // total++ only here
    ...                // success / failure++
}()

SRE model: requests++ before req() is invoked.
Current behavior: requests++ after the response.
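
By contrast, here is a sketch of the ordering the SRE model calls for; the doReqSRE function and the package-level counters are hypothetical stand-ins, not the existing go-zero API:

package breaker

import "sync/atomic"

// Stand-ins for the rolling-window counters; illustrative only.
var requests, accepts int64

// doReqSRE sketches the SRE ordering: count the attempt before the call
// is issued, count the accept only after it succeeds.
func doReqSRE(req func() error) error {
    atomic.AddInt64(&requests, 1) // visible to the window while the call is in flight
    err := req()
    if err == nil {
        atomic.AddInt64(&accepts, 1)
    }
    return err
}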


Problem 2 – window shorter than the possible response timeout

googleBreaker uses a 10 s window (40 × 250 ms buckets) while a single call can legally run longer than 10 s.
When that happens:

  1. The request falls out of the sliding window before its response returns.
  2. The subsequent acceptance is recorded in a newer window that no longer contains the matching request.
  3. The requests / accepts ratio is skewed: depending on timing it either inflates, causing excessive local drops long after the backend has recovered, or never rises, delaying protection.

SRE model: window ≥ service P99 latency (≈ 2 min in the book) so each success always pairs with its request.
Current behavior: window may be shorter than the call, breaking the pairing.
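
To make the constraint concrete, here is a sketch that ties the current bucket figures to the required relationship; windowCoversCall is a hypothetical helper, not existing go-zero configuration:

package breaker

import "time"

const (
    bucketDuration = 250 * time.Millisecond
    numBuckets     = 40
    window         = numBuckets * bucketDuration // 10 s with today's defaults
)

// windowCoversCall reports whether the rolling window is at least as long as
// the slowest call the client will wait for. The SRE model effectively
// requires this to hold (window ≥ service P99 / call deadline, ≈ 2 min in
// the book); otherwise a request can fall out of the window before its
// accept is recorded.
func windowCoversCall(callDeadline time.Duration) bool {
    return window >= callDeadline
}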


Impact
These two mismatches can lead to either:

  • Slow throttling—backend already failing while the breaker still passes traffic, or
  • Over‑throttling—client drops requests even though the backend is healthy again.

Both scenarios diverge from the intent of the adaptive throttling algorithm outlined in the SRE book.
