Multi-agent systems fail in correlated ways. One bad tool call cascades into twenty retries, and the bill arrives Monday. A pattern from electrical engineering, ported badly into software.
The first time I shipped a multi-agent harness to anything resembling production, it took out our OpenAI quota in ninety seconds. Not because the prompts were expensive — they weren't — but because one agent's tool call returned malformed JSON, the planner agent retried, and the supervisor cheerfully scheduled the same broken plan eight more times.
The fix is older than I am. It's borrowed wholesale from electrical engineering: a circuit breaker. When the same failure pattern repeats, stop. Open the circuit. Let the system cool down before you try again.[1]
A naive first attempt
My first cut was an `int _failures` field on the harness root. After three failures, throw. It worked for about a day.
```csharp
// don't do this
public class NaiveBreaker
{
    private int _failures = 0;

    public async Task<T> InvokeAsync<T>(Func<Task<T>> op)
    {
        if (_failures >= 3)
            throw new CircuitOpenException();
        try { return await op(); }
        catch { _failures++; throw; }
    }
}
```
Note: A counter without a half-open state isn't a circuit breaker; it's a kill switch. The system never recovers on its own.
The real pattern has three states: closed (normal), open (failing fast), and half-open (cautiously probing). The transitions between them are the entire trick.
The three states
| State | What's happening | Transition out |
|---|---|---|
| Closed | Calls flow through. Failures are counted. | Failure threshold → Open |
| Open | Calls fail fast without invoking the dependency. | Cooldown elapses → Half-Open |
| Half-Open | One probe call is allowed. | Probe succeeds → Closed, fails → Open |
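The table can be read as a small state machine. Here is a minimal sketch of it as a pure function, assuming four events that map onto the "Transition out" column. The names `BreakerState`, `BreakerEvent`, and `BreakerTransitions` are hypothetical and exist only for this illustration.

```csharp
using System;

// Hypothetical names for illustration only.
public enum BreakerState { Closed, Open, HalfOpen }
public enum BreakerEvent { ThresholdReached, CooldownElapsed, ProbeSucceeded, ProbeFailed }

public static class BreakerTransitions
{
    // Pure function: returns the next state, or the same state when the
    // event does not apply (for example, a cooldown tick while Closed).
    public static BreakerState Next(BreakerState state, BreakerEvent evt) => (state, evt) switch
    {
        (BreakerState.Closed, BreakerEvent.ThresholdReached) => BreakerState.Open,
        (BreakerState.Open, BreakerEvent.CooldownElapsed) => BreakerState.HalfOpen,
        (BreakerState.HalfOpen, BreakerEvent.ProbeSucceeded) => BreakerState.Closed,
        (BreakerState.HalfOpen, BreakerEvent.ProbeFailed) => BreakerState.Open,
        _ => state,
    };
}
```

Keeping the transitions in one place like this makes them easy to unit test separately from locking and timing.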
The math for the cooldown matters more than I expected. Linear backoff is a poor fit for AI harnesses because model providers tend to recover in bursts. I've been running exponential backoff with jitter:
\[
t_{\text{cooldown}} = \min\left(t_{\max},\; t_{\text{base}} \cdot 2^{n}\right) \cdot \bigl(1 + U(-0.2,\, 0.2)\bigr)
\]

where $n$ is the number of consecutive open transitions and $U(-0.2, 0.2)$ is a uniform random draw. The jitter prevents a thundering herd when many breakers finish cooling down and start probing at the same moment, which is a real failure mode if you're running parallel agents that share a downstream tool.
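A minimal sketch of that formula, assuming $n \ge 0$ and that the caller supplies the random sample so the function stays deterministic under test. The `Cooldown` class and its parameter names are hypothetical, not part of the breaker shown later.

```csharp
using System;

public static class Cooldown
{
    // t_cooldown = min(t_max, t_base * 2^n) * (1 + U(-0.2, 0.2))
    // n: number of consecutive open transitions (assumed >= 0).
    // unitSample: a value in [0, 1), e.g. Random.Shared.NextDouble();
    // passed in so tests can pin the jitter.
    public static TimeSpan Compute(TimeSpan tBase, TimeSpan tMax, int n, double unitSample)
    {
        // Cap first, then jitter, exactly as the formula is written.
        var cappedSeconds = Math.Min(tMax.TotalSeconds, tBase.TotalSeconds * Math.Pow(2, n));
        var jitter = 1.0 + (unitSample * 0.4 - 0.2); // maps [0, 1) onto [-0.2, 0.2)
        return TimeSpan.FromSeconds(cappedSeconds * jitter);
    }
}
```

Because the jitter is applied after the cap, a cooldown can land up to 20% above `tMax`. If the cap has to be hard, apply the `Math.Min` last instead.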
Implementation in C# 13
A working version, with the half-open state and an injectable clock for testing:
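```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CircuitBreaker(
    int failureThreshold = 5,
    TimeSpan? cooldown = null,
    TimeProvider? time = null)
{
    private readonly TimeSpan _cooldown = cooldown ?? TimeSpan.FromSeconds(30);
    private readonly TimeProvider _time = time ?? TimeProvider.System;
    private readonly Lock _gate = new();
    private int _failures;
    private CircuitState _state = CircuitState.Closed;
    private DateTimeOffset _openedAt;

    public async Task<T> InvokeAsync<T>(
        Func<CancellationToken, Task<T>> op,
        CancellationToken ct = default)
    {
        var isProbe = EnsureCanProceed();
        try
        {
            var result = await op(ct).ConfigureAwait(false);
            OnSuccess();
            return result;
        }
        catch when (ct.IsCancellationRequested)
        {
            // Cancellation says nothing about the dependency's health, so it is
            // not counted. A cancelled probe hands its slot to the next caller.
            if (isProbe) AbandonProbe();
            throw;
        }
        catch
        {
            OnFailure();
            throw;
        }
    }

    // Returns true when this caller is the single half-open probe.
    private bool EnsureCanProceed()
    {
        lock (_gate)
        {
            switch (_state)
            {
                case CircuitState.Closed:
                    return false;
                case CircuitState.Open when _time.GetUtcNow() - _openedAt >= _cooldown:
                    _state = CircuitState.HalfOpen;
                    return true;
                default:
                    // Still cooling down, or a probe is already in flight.
                    throw new CircuitOpenException();
            }
        }
    }

    private void OnSuccess()
    {
        lock (_gate)
        {
            _failures = 0;
            _state = CircuitState.Closed;
        }
    }

    private void OnFailure()
    {
        lock (_gate)
        {
            _failures++;
            // A failed probe reopens immediately, regardless of the count.
            if (_failures >= failureThreshold || _state is CircuitState.HalfOpen)
            {
                _state = CircuitState.Open;
                _openedAt = _time.GetUtcNow();
            }
        }
    }

    private void AbandonProbe()
    {
        lock (_gate)
        {
            // Back to Open with the cooldown already elapsed, so the next call probes.
            if (_state is CircuitState.HalfOpen) _state = CircuitState.Open;
        }
    }
}

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitOpenException : Exception;
```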
A few notes on this version:
- The `Lock` type is the dedicated lock type (`System.Threading.Lock`, .NET 9) that the C# 13 `lock` statement recognizes. Reads and writes to the state fields are short, so a single lock is fine; if you're seeing contention, you've got bigger problems than the breaker.
- `TimeProvider` is injectable so the test suite can advance time deterministically. Don't use `DateTime.UtcNow` directly; you'll regret it.
- `ConfigureAwait(false)` because this is library-ish code.
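To make the `TimeProvider` point concrete, here is a minimal sketch of a hand-rolled test clock, assuming you don't want to pull in Microsoft's `FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing` package. `ManualTimeProvider` and `CooldownCheck` are hypothetical names for this illustration.

```csharp
using System;

// Hypothetical test double: a TimeProvider whose clock only moves when told to.
public sealed class ManualTimeProvider(DateTimeOffset start) : TimeProvider
{
    private DateTimeOffset _now = start;

    public override DateTimeOffset GetUtcNow() => _now;

    public void Advance(TimeSpan delta) => _now += delta;
}

public static class CooldownCheck
{
    // The same comparison the breaker makes before allowing a probe.
    public static bool HasElapsed(TimeProvider time, DateTimeOffset openedAt, TimeSpan cooldown)
        => time.GetUtcNow() - openedAt >= cooldown;
}
```

In a test, pass the manual clock as the breaker's `time` argument, trip the breaker, then call `Advance` past the cooldown instead of sleeping.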
Tip: In production, prefer Polly's `ResiliencePipelineBuilder` with `AddCircuitBreaker`. The above is for teaching; Polly handles the edges (timeouts inside the breaker, isolation between breakers, telemetry) that a hand-rolled version misses.
Tuning the thresholds
Three knobs, in order of how often I touch them:
- Failure threshold. Start at 5 for chatty providers, 3 for ones you pay per call. Lower for cold paths.
- Cooldown base. 10s is fine for most providers; 30s if you're seeing rate-limit-and-recover patterns.
- Sliding window vs. consecutive count. Consecutive is simpler and surprisingly good. Switch to a sliding window only if you're seeing intermittent failures that should trip the breaker but don't.
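For the third knob, here is a minimal sketch of what a sliding-window counter could look like, assuming failures are timestamped by the caller. `SlidingFailureWindow` is a hypothetical helper, not part of the breaker above, and it is not thread-safe on its own; call it under the breaker's lock.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts failures inside a trailing time window.
public sealed class SlidingFailureWindow(TimeSpan window, int threshold)
{
    private readonly Queue<DateTimeOffset> _failures = new();

    // Records a failure at 'now' and reports whether the breaker should trip.
    public bool RecordFailure(DateTimeOffset now)
    {
        _failures.Enqueue(now);

        // Drop failures that have aged out of the window.
        while (_failures.Count > 0 && now - _failures.Peek() > window)
            _failures.Dequeue();

        return _failures.Count >= threshold;
    }
}
```

Unlike the consecutive count, a success in between does not reset anything here, so intermittent failures still add up as long as they fall inside the window.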
Warning: Don't share a single breaker across logically distinct dependencies. One bad tool shouldn't blackhole the entire agent. Scope breakers to the narrowest unit that makes sense, usually `(tool_id, provider)`.
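One way to get that scoping is a small registry keyed by the pair. This is a sketch under the assumption that both identifiers are plain strings; `BreakerRegistry` is a hypothetical name, and it is generic so it works with any breaker type.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical registry: one breaker instance per (toolId, provider) pair.
public sealed class BreakerRegistry<TBreaker>(Func<TBreaker> factory) where TBreaker : class
{
    private readonly ConcurrentDictionary<(string ToolId, string Provider), Lazy<TBreaker>> _breakers = new();

    // Lazy<T> ensures the factory runs once per key even if two threads race on GetOrAdd.
    public TBreaker For(string toolId, string provider) =>
        _breakers.GetOrAdd((toolId, provider), _ => new Lazy<TBreaker>(factory)).Value;
}
```

With the breaker from this post, that would be something like `new BreakerRegistry<CircuitBreaker>(() => new CircuitBreaker(failureThreshold: 3))`, followed by `registry.For(toolId, provider).InvokeAsync(...)` at each call site.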
What it doesn't fix
Circuit breakers stop cascades; they don't stop bad plans. If your agent is asking the wrong question, the breaker will dutifully stop you from asking it twenty times — and then the agent will pick the next-most-confident question and keep going. That's a separate problem, and a more interesting one. I'll write it up next.
[1] Michael Nygard, Release It! (2007). The book that put this pattern in front of a generation of services engineers. The original Hystrix docs at Netflix are also worth reading; the project itself is retired but the concepts hold.