Request/Accept accounting in googleBreaker deviates from Google SRE adaptive throttling model #5011

@ktalg

Description

Background
Google’s Site Reliability Engineering, Handling Overload (§ Adaptive Throttling, Eq 21‑01) states that a client should:

  • Increment requests as soon as the call is attempted (before it goes on the wire).
  • Increment accepts only after the backend returns success.
  • Use a sliding window long enough (≈ 2 minutes by default) that every successful response lands in the same window that already holds its originating request.

This tight feedback loop lets the drop probability rise the moment successes start to lag behind attempts.
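
For reference, Eq 21‑1 computes the client-side rejection probability as max(0, (requests − K × accepts) / (requests + 1)), with K = 2 as the book's suggested multiplier. Below is a minimal sketch of that accounting order; the adaptiveThrottler type and its method names are illustrative only, and a real implementation would keep the counters in a roughly two-minute rolling window rather than plain fields:

package throttle

import (
    "math"
    "math/rand"
    "sync"
)

// adaptiveThrottler is an illustrative sketch of the SRE model, not the
// googleBreaker implementation. requests/accepts stand in for the
// ~2-minute rolling-window counters.
type adaptiveThrottler struct {
    mu       sync.Mutex
    requests float64 // attempts, counted before the call is issued
    accepts  float64 // backend successes, counted after the response
    k        float64 // multiplier; the book suggests K = 2
}

// Allow applies Eq 21-1 and, if the request is not dropped locally,
// counts the attempt before it goes on the wire.
func (t *adaptiveThrottler) Allow() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    p := math.Max(0, (t.requests-t.k*t.accepts)/(t.requests+1))
    if rand.Float64() < p {
        return false // reject locally
    }
    t.requests++ // requests++ happens before req() is invoked
    return true
}

// MarkSuccess counts an accept only after the backend returned success.
func (t *adaptiveThrottler) MarkSuccess() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.accepts++
}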


Problem 1 – requests counted only after the call finishes

In googleBreaker.doReq, both the total and the success/failure counters are updated inside the defer callback, i.e. after the RPC returns.
As a result, in‑flight calls are invisible to the window: during an outage the client keeps sending traffic until the first wave of responses arrives, which postpones throttling by roughly one round‑trip time of the service.

err := req()           // no accounting yet
defer func() {
    markSuccess()      // total++ only here
    ...                // success / failure++
}()

SRE model: requests++ before req() is invoked.
Current behavior: requests++ after the response.
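
By contrast, here is a sketch of the ordering the SRE model calls for; the doReqSRE function and the package-level counters are hypothetical stand-ins, not the existing go-zero API:

package breaker

import "sync/atomic"

// Stand-ins for the rolling-window counters; illustrative only.
var requests, accepts int64

// doReqSRE sketches the SRE ordering: count the attempt before the call
// is issued, count the accept only after it succeeds.
func doReqSRE(req func() error) error {
    atomic.AddInt64(&requests, 1) // visible to the window while the call is in flight
    err := req()
    if err == nil {
        atomic.AddInt64(&accepts, 1)
    }
    return err
}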


Problem 2 – window shorter than the possible response timeout

googleBreaker uses a 10 s window (40 × 250 ms buckets) while a single call can legally run longer than 10 s.
When that happens:

  1. The request falls out of the sliding window before its response returns.
  2. The subsequent acceptance is recorded in a newer window that no longer contains the matching request.
  3. The requests / accepts ratio is skewed: depending on timing it either inflates, causing excessive local drops long after the backend has recovered, or never rises, delaying protection.

SRE model: window ≥ service P99 latency (≈ 2 min in the book) so each success always pairs with its request.
Current behavior: window may be shorter than the call, breaking the pairing.
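
To make the constraint concrete, here is a sketch that ties the current bucket figures to the required relationship; windowCoversCall is a hypothetical helper, not existing go-zero configuration:

package breaker

import "time"

const (
    bucketDuration = 250 * time.Millisecond
    numBuckets     = 40
    window         = numBuckets * bucketDuration // 10 s with today's defaults
)

// windowCoversCall reports whether the rolling window is at least as long as
// the slowest call the client will wait for. The SRE model effectively
// requires this to hold (window ≥ service P99 / call deadline, ≈ 2 min in
// the book); otherwise a request can fall out of the window before its
// accept is recorded.
func windowCoversCall(callDeadline time.Duration) bool {
    return window >= callDeadline
}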


Impact
These two mismatches can lead to either:

  • Slow throttling—backend already failing while the breaker still passes traffic, or
  • Over‑throttling—client drops requests even though the backend is healthy again.

Both scenarios diverge from the intent of the adaptive throttling algorithm outlined in the SRE book.
