
A 12-key Gemini rotation system, in 90 minutes

During an Applied AI coding challenge I needed graceful free-tier rate-limit handling. Here is the round-robin rotation engine I shipped — and why it is now slated to become a small open-source library.

AI · LLM ops · Engineering decisions

The setup

90-minute live coding challenge. Build an end-to-end pipeline: mock server emits a Microsoft Teams message → an LLM reads the message and decides whether to create a Jira ticket or update an existing one → if a ticket is needed, call Jira's API; either way, dispatch a Teams reply.
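The decision flow can be sketched roughly like this. All names here are illustrative stand-ins, not the submission's actual code, and the `classify` body is a placeholder for the real Gemini function-calling request:

```typescript
// Hypothetical sketch of the pipeline's decision flow.
type TeamsMessage = { author: string; text: string };

type Decision =
  | { action: "create"; summary: string }            // open a new Jira ticket
  | { action: "update"; ticketId: string; comment: string } // comment on an existing one
  | { action: "none" };                              // just reply in Teams

// Stand-in for the LLM call that decides what to do with an incoming message.
async function classify(msg: TeamsMessage): Promise<Decision> {
  // In the real pipeline this is a Gemini function-calling request.
  if (/bug|error/i.test(msg.text)) return { action: "create", summary: msg.text };
  return { action: "none" };
}

// End-to-end handler: classify, act, and produce the Teams reply text.
async function handleMessage(msg: TeamsMessage): Promise<string> {
  const decision = await classify(msg);
  switch (decision.action) {
    case "create": return `Created ticket for: ${decision.summary}`;
    case "update": return `Updated ${decision.ticketId}`;
    case "none":   return "No ticket needed";
  }
}
```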

The catch: Gemini's free tier limits you to 15 requests per minute per key. The grader was going to fire enough traffic at the pipeline that a single key would get rate-limited inside the first 30 seconds. They wanted to see how I'd handle it.

The naive answer that wouldn't have worked

The obvious move is "use a paid key." That wasn't the answer the grader was looking for, and at the bottom of the spec they made the constraint explicit: stay on the free tier and design around the rate limit.

The next-most-obvious move is exponential backoff with jitter. That works for occasional 429s but doesn't help when you're hitting the limit on every request — you'd just queue up retries until the test ran out.
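For reference, backoff-with-jitter is a one-liner; a minimal sketch (the "full jitter" variant, with assumed base and cap values):

```typescript
// Exponential backoff with full jitter: delay grows as baseMs * 2^attempt,
// capped at capMs, then scaled by a random factor to spread retries out.
// This is the approach the post rules out as insufficient on its own.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```

The problem isn't the formula; it's that backoff only spaces out requests against the *same* exhausted quota, so sustained over-limit traffic just accumulates delay.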

The 12-key rotation

I shipped a small rotation engine: 12 Gemini API keys (the project owner had pre-generated them), a deterministic round-robin dispatcher, and per-key quota tracking with automatic failover.

Properties:

  • Deterministic round-robin — each request increments an index, so call N goes to key N mod 12. Predictable, easy to reason about under tests.
  • Per-key quota tracking — every key has an in-memory counter that's reset on a 60-second sliding window. When a key is at 14/15 the dispatcher skips it and uses the next available key.
  • 429 failover with exponential backoff — if a key returns 429 anyway (Google's edge doesn't honour the per-minute limit perfectly), the dispatcher zeroes out that key's remaining count, marks it "cooldown until next window", and retries on the next key. Backoff (200ms × 2^attempt) only kicks in when every key is in cooldown, which under the test load essentially never happened.
  • Schema-validated function-calling responses — every Gemini response is parsed against a Zod schema and rejected if it doesn't match. The dispatcher swallows malformed responses, logs them, and re-issues to the next key. (LLMs are not deterministic; the rotation gives me a cheap retry budget.)

What that bought me

Effective rate limit went from 15 req/min (single key) to 180 req/min (12 keys × 15). The grader's load was around 80 req/min sustained. The pipeline never ran out of headroom.

More importantly: when the grader watched the logs during evaluation, they saw a system that degraded gracefully under pressure instead of crashing. That's the engineering decision the challenge was actually testing.

The bug I caught at minute 78

About 12 minutes from the deadline I noticed the dispatcher was holding key cooldowns across the entire process lifetime — meaning if a key was rate-limited at second 5, it would still be marked cooldown at second 95, when the actual rate-limit window had long since reset.

The fix was three lines: store the cooldown timestamp instead of a boolean flag, and check Date.now() - cooldown > 60_000 before treating the key as available.
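Reconstructed, the before/after looks like this (field names assumed):

```typescript
const WINDOW_MS = 60_000;

// Buggy version: a boolean that nothing ever clears.
//   cooldown: boolean  →  once true, the key is dead for the process lifetime.

// Fixed version: store *when* the 429 happened and let the check expire it.
// cooldownAt === 0 means the key has never been rate-limited.
function isAvailable(cooldownAt: number, now = Date.now()): boolean {
  return cooldownAt === 0 || now - cooldownAt > WINDOW_MS;
}
```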

The lesson: time-window logic is always wrong the first time you write it under pressure. Now I write the timestamp version first by default.

Where this code is going next

I'm extracting the rotation engine into a small open-source library — same algorithm, but generalised to any provider with a quota-per-key model. Tracked on the PROJECTS.md backlog as one of the "extract and ship" items.

If you're interviewing me for an LLM-ops role, this is the engineering story I'd lead with. Not because the algorithm is novel — round-robin with quota tracking is textbook — but because the engineering taste showed up in the right places: deterministic dispatch, schema validation as a retry trigger, time-window logic written defensively the second time.


For the full WorkFlex challenge writeup including the Jira / Teams integration code, see the GitHub repo. The challenge submission passed engineering review and advanced to the next round.