A 12-key Gemini rotation system, in 90 minutes
During an Applied AI coding challenge I needed graceful free-tier rate-limit handling. Here is the round-robin rotation engine I shipped — and why it is now slated to become a small open-source library.
The setup
90-minute live coding challenge. Build an end-to-end pipeline: mock server emits a Microsoft Teams message → an LLM reads the message and decides whether to create a Jira ticket or update an existing one → if a ticket is needed, call Jira's API; either way, dispatch a Teams reply.
The catch: Gemini's free tier limits you to 15 requests per minute per key. The grader was going to fire enough traffic at the pipeline that a single key would get rate-limited inside the first 30 seconds. They wanted to see how I'd handle it.
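To make the LLM step concrete, here is a sketch of the decision contract that step has to satisfy, with a minimal runtime guard. The type names and fields are illustrative, not the actual challenge schema: the model either creates a ticket, updates one, or does nothing, and a Teams reply goes out in every branch.

```typescript
// Illustrative decision contract for the LLM step (hypothetical names).
type TicketDecision =
  | { action: "create"; summary: string; description: string }
  | { action: "update"; ticketKey: string; comment: string }
  | { action: "none" };

interface PipelineResult {
  decision: TicketDecision;
  teamsReply: string; // dispatched in every branch
}

// Minimal hand-rolled runtime guard, standing in for a schema validator.
function isTicketDecision(x: any): x is TicketDecision {
  if (x === null || typeof x !== "object") return false;
  switch (x.action) {
    case "create":
      return typeof x.summary === "string" && typeof x.description === "string";
    case "update":
      return typeof x.ticketKey === "string" && typeof x.comment === "string";
    case "none":
      return true;
    default:
      return false;
  }
}
```

Anything the model emits that fails the guard gets rejected rather than forwarded to Jira.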
The naive answer that wouldn't have worked
The obvious move is "use a paid key." That wasn't the answer the grader was looking for, and the spec made the constraint explicit at the bottom: stay on the free tier and design around the rate limit.
The next-most-obvious move is exponential backoff with jitter. That works for occasional 429s but doesn't help when you're hitting the limit on every request — you'd just queue up retries until the test ran out.
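For reference, this is roughly what backoff with full jitter looks like (the wrapped `fn` stands in for the real Gemini call; constants are illustrative). Under sustained over-limit traffic, every call just sleeps longer instead of completing.

```typescript
// Exponential backoff with full jitter: retry up to maxAttempts times,
// sleeping a random duration up to baseMs * 2^attempt between tries.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 200,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // retry budget exhausted
      const delay = Math.random() * baseMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

This is the right tool for transient 429s; it is the wrong tool when the steady-state request rate exceeds the quota.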
The 12-key rotation
I shipped a small rotation engine: 12 Gemini API keys (the project owner had pre-generated them), a deterministic round-robin dispatcher, and per-key quota tracking with automatic failover.
Properties:
- Deterministic round-robin — each request increments an index, so call N goes to key N mod 12. Predictable, easy to reason about under tests.
- Per-key quota tracking — every key has an in-memory counter, reset on a 60-second window. When a key is at 14/15 the dispatcher skips it and uses the next available key.
- 429 failover with exponential backoff — if a key returns 429 anyway (the per-minute limit isn't perfectly honoured by Google's edge), the dispatcher zeroes its remaining quota, marks it "cooldown until next window", and retries on the next key. Backoff (200 ms × 2^attempt) only kicks in if every key is in cooldown, which under the test load happened roughly never.
- Schema-validated function-calling responses — every Gemini response is parsed against a Zod schema and rejected if it doesn't match. The dispatcher swallows malformed responses, logs them, and re-issues the request to the next key. (LLMs are not deterministic; the rotation gives me a cheap retry budget.)
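The first three properties can be condensed into a sketch like this. The `KeyState` shape, class name, and constants are illustrative, not the shipped code, and the real dispatcher also does Zod validation and logging around each call.

```typescript
interface KeyState {
  key: string;
  used: number;          // requests counted in the current window
  windowStart: number;   // ms timestamp when the window opened
  cooldownUntil: number; // ms timestamp; 0 means not cooling down
}

const RPM_LIMIT = 15;    // Gemini free-tier requests/minute per key
const WINDOW_MS = 60_000;

class KeyRotator {
  private index = 0;
  private keys: KeyState[];

  constructor(apiKeys: string[]) {
    this.keys = apiKeys.map((key) => ({
      key, used: 0, windowStart: Date.now(), cooldownUntil: 0,
    }));
  }

  // Deterministic round-robin: call N starts probing at slot N mod keys.length,
  // then walks forward past any key that is at quota or in cooldown.
  next(now = Date.now()): string | null {
    for (let i = 0; i < this.keys.length; i++) {
      const slot = (this.index + i) % this.keys.length;
      const k = this.keys[slot];
      if (now - k.windowStart >= WINDOW_MS) { // window expired: reset counter
        k.used = 0;
        k.windowStart = now;
      }
      if (k.cooldownUntil > now) continue;    // 429'd recently: skip
      if (k.used >= RPM_LIMIT - 1) continue;  // at 14/15: skip, leave headroom
      this.index = slot + 1;
      k.used++;
      return k.key;
    }
    return null; // every key saturated: caller falls back to backoff
  }

  // On a 429, zero the key's remaining quota and cool it down for a window.
  report429(key: string, now = Date.now()): void {
    const k = this.keys.find((s) => s.key === key);
    if (k) {
      k.used = RPM_LIMIT;
      k.cooldownUntil = now + WINDOW_MS;
    }
  }
}
```

Returning `null` when every key is saturated is what keeps the backoff path on the cold branch: the caller only sleeps when the whole pool is exhausted, not on every 429.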
What that bought me
Effective rate limit went from 15 req/min (single key) to 180 req/min (12 keys × 15). The grader's load was around 80 req/min sustained. The pipeline never ran out of headroom.
More importantly: when the grader watched the logs during evaluation, they saw a system that degraded gracefully under pressure instead of crashing. That's the engineering decision the challenge was actually testing.
The bug I caught at minute 78
About 12 minutes from the deadline I noticed the dispatcher was holding key cooldowns across the entire process lifetime — meaning if a key was rate-limited at second 5, it would still be marked cooldown at second 95, when the actual rate-limit window had long since reset.
Fix was three lines: store the cooldown timestamp instead of a boolean flag, and check Date.now() - cooldown > 60_000 before treating the key as available.
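The before/after of that fix, sketched with a hypothetical key-state shape: a boolean flag never un-sets itself, while a timestamp makes "is this key in cooldown?" a question you re-ask on every dispatch.

```typescript
type KeyStateBefore = { coolingDown: boolean };            // buggy: sticky forever
type KeyStateAfter = { cooldownStartedAt: number | null }; // ms epoch, or null

const COOLDOWN_MS = 60_000;

function isAvailable(state: KeyStateAfter, now = Date.now()): boolean {
  // Available if the key never cooled down, or the window has since elapsed.
  return state.cooldownStartedAt === null ||
    now - state.cooldownStartedAt > COOLDOWN_MS;
}
```

With the timestamp version, a key rate-limited at second 5 is automatically available again by second 95, matching the scenario above.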
The lesson: time-window logic is always wrong the first time you write it under pressure. Now I write the timestamp version first by default.
Where this code is going next
I'm extracting the rotation engine into a small open-source library — same algorithm, but generalised to any provider with a quota-per-key model. Tracked on the PROJECTS.md backlog as one of the "extract and ship" items.
If you're interviewing me for an LLM-ops role, this is the engineering story I'd lead with. Not because the algorithm is novel — round-robin with quota tracking is textbook — but because the engineering taste showed up in the right places: deterministic dispatch, schema validation as a retry trigger, time-window logic written defensively the second time.
For the full WorkFlex challenge writeup including the Jira / Teams integration code, see the GitHub repo. The challenge submission passed engineering review and advanced to the next round.