OutboundLab
Multi-agent B2B research demo. Paste a company URL → three AI agents (Reconnaissance / People & ICP / Personalisation) research the company, identify decision-makers from public sources, and draft personalised outreach. Built on a free-tier LLM provider chain (Groq → Gemini → OpenRouter) with structured validation gates between agents to keep hallucinations out of the final email.
What it is
A working B2B outbound research demo. Paste a URL like `linear.app`, pick a channel (Email / LinkedIn DM / X DM) and tone (cold / warm), and watch three agents work in sequence — Reconnaissance, People & ICP, Personalisation & Outreach — then get a copy-pasteable message addressed to a real, verified decision-maker. Built on a free-tier LLM chain (Groq → Gemini → OpenRouter) with $0/month inference cost. Live at outbound-lab-acquisity.vercel.app.
The problem
B2B outreach research is the part of sales that swallows hours and produces template emails anyway — and the AI tools that automate it sit behind paid LLM APIs and paid data providers (Clearbit, Apollo, Hunter). The interesting question isn't whether you can wire GPT-4 + a CRM. It's whether you can ship something that works for $0/month, names real decision-makers without inventing them, and refuses to hallucinate when the public data isn't there.
What I built
The 3-agent pipeline
URL paste → cache lookup (30-day window, schema-versioned) → Agent 1 Reconnaissance (Llama on Groq, 6-tool-call cap, `web_search` + `web_fetch`) → Agent 2 People & ICP (Gemini-first, 5-tool-call cap, `web_search` only) → Agent 3 Personalisation (no tools, temp 0.7). Validation gates run server-side between every stage.
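A minimal sketch of that flow, with illustrative function and type names (the real module and field names may differ):

```ts
// Sketch only: stage runners and cache are declared, not implemented; names are illustrative.
type Brief = { companyName: string; domain: string; valueProp: string };
type DecisionMaker = { name: string; role: string; sourceUrl: string };

declare const cache: { get(url: string): Promise<unknown | null> };                       // 30-day, schema-versioned
declare function runRecon(url: string): Promise<Brief>;                                   // Agent 1: Llama on Groq, ≤6 tool calls
declare function runPeopleAndIcp(brief: Brief): Promise<DecisionMaker[]>;                 // Agent 2: Gemini-first, ≤5 tool calls
declare function runPersonalisation(brief: Brief, dms: DecisionMaker[]): Promise<string>; // Agent 3: no tools, temp 0.7
declare function isVerified(dm: DecisionMaker, brief: Brief): boolean;                    // server-side validation gate

export async function runPipeline(url: string) {
  const cached = await cache.get(url);
  if (cached) return cached;

  const brief = await runRecon(url);
  const candidates = await runPeopleAndIcp(brief);
  const verified = candidates.filter((dm) => isVerified(dm, brief));

  // Empty-DM resilience: never hand Agent 3 an empty list it could "fill in".
  if (verified.length === 0) return { brief, verified, outreach: null };

  const outreach = await runPersonalisation(brief, verified);
  return { brief, verified, outreach };
}
```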
Streaming layer
The orchestrator is a single `AsyncGenerator<StreamEvent>` that yields events live as agents work. tRPC v11 subscriptions wrap it, with SSE on the wire. Visitors who navigate away mid-run and come back land on a usable timeline: the client replays `agent_done` events from the per-agent message log and polls for anything still in flight.
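A rough shape of the generator and its subscription wrapper, assuming tRPC v11's async-generator subscription resolvers; event and procedure names here are illustrative:

```ts
import { initTRPC } from "@trpc/server";
import { z } from "zod";

type StreamEvent =
  | { type: "agent_start"; agent: 1 | 2 | 3 }
  | { type: "tool_call"; agent: 1 | 2 | 3; tool: string }
  | { type: "agent_done"; agent: 1 | 2 | 3; output: unknown };

// The orchestrator yields events as the agents work; the real version also persists
// agent_done events to the per-agent message log for replay.
declare function runPipelineStream(url: string): AsyncGenerator<StreamEvent>;

const t = initTRPC.create();

export const appRouter = t.router({
  research: t.procedure
    .input(z.object({ url: z.string().url() }))
    .subscription(async function* ({ input }) {
      // tRPC v11 streams each yielded value to the client (SSE on the wire).
      yield* runPipelineStream(input.url);
    }),
});
```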
Free-tier provider chain as a first-class constraint
Groq for primary inference (LPU speed feels live in the streaming UI), Gemini as fallback (different infrastructure, different rate-limit pool), OpenRouter as last-resort safety net. The chain detects 429s and the AI SDK's `AI_RetryError` wrapper, falls through gracefully, and surfaces a per-provider breakdown when all three are exhausted at once. Per-agent provider preference lets Agent 2 (the heaviest tool loop) start on Gemini to spread load away from Groq's TPM ceiling.
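A sketch of the fallback logic, assuming the Vercel AI SDK's `generateText`; the rate-limit detection heuristic and the shape of the breakdown are illustrative:

```ts
import { generateText, type LanguageModel } from "ai";

// Covers plain 429 responses and the SDK's retry-exhausted wrapper (error name "AI_RetryError").
function isRateLimited(err: unknown): boolean {
  const e = err as { statusCode?: number; name?: string };
  return e?.statusCode === 429 || e?.name === "AI_RetryError";
}

export async function generateWithFallback(
  chain: { provider: string; model: LanguageModel }[], // e.g. Groq → Gemini → OpenRouter
  prompt: string,
) {
  const failures: Record<string, string> = {};
  for (const { provider, model } of chain) {
    try {
      return await generateText({ model, prompt });
    } catch (err) {
      if (!isRateLimited(err)) throw err; // only rate limits fall through; real errors propagate
      failures[provider] = "rate limited";
    }
  }
  // All three exhausted: surface the per-provider breakdown rather than a generic failure.
  throw new Error(`All providers exhausted: ${JSON.stringify(failures)}`);
}
```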
SPA-aware web fetching
Many B2B targets a recruiter might paste — Linear, Vercel, Supabase — ship as JS-rendered SPAs. Direct HTTP GET returns 2-4 KB of script tags and an empty root div. The fetcher detects this (heavy raw HTML + thin stripped text) and falls back to a Tavily search of the same URL where the rendered body is actually available. Combined with seed-path probing of `/about`, `/team`, `/leadership`, `/people`, `/story`, this widens the demoable target pool sharply.
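A sketch of the detection heuristic; the thresholds and the Tavily fallback helper are illustrative assumptions:

```ts
// Seed paths probed alongside the pasted URL.
const SEED_PATHS = ["/about", "/team", "/leadership", "/people", "/story"];

function stripToText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper that asks Tavily for the rendered content of the same URL.
declare function tavilySearchForUrl(url: string): Promise<string>;

export async function fetchPage(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const text = stripToText(html);

  // Heavy raw HTML but almost no visible text → a JS-rendered SPA shell.
  const looksLikeSpaShell = html.length > 2_000 && text.length < 500;
  return looksLikeSpaShell ? tavilySearchForUrl(url) : text;
}

export function seedUrls(base: string): string[] {
  return [base, ...SEED_PATHS.map((p) => new URL(p, base).toString())];
}
```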
Engineering decisions
Why validation gates between agents, not creative trust
Agent 2's output is post-validated server-side: a decision-maker is accepted only if the source URL is on the target's own domain, OR the name appears on any of the target's own pages, OR a cross-domain source body contains both the name AND the target company name. This is the single most important correctness fix shipped — it kills the 'right name, wrong company' failure mode (e.g. surfacing Leila Hormozi of Acquisition.com when researching Acquisity, because the names are similar). Confidence scoring is additive metadata on top, not a hard filter.
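In code terms, the gate is roughly the predicate below (field names are illustrative, not the actual schema):

```ts
type Candidate = {
  name: string;
  sourceUrl: string;   // where Agent 2 says it found this person
  sourceBody: string;  // fetched text of that source
};

type Target = { domain: string; companyName: string; ownPagesText: string };

function onDomain(url: string, domain: string): boolean {
  try {
    const host = new URL(url).hostname.replace(/^www\./, "");
    return host === domain || host.endsWith(`.${domain}`);
  } catch {
    return false;
  }
}

export function isVerified(c: Candidate, t: Target): boolean {
  const name = c.name.toLowerCase();
  const firstParty = onDomain(c.sourceUrl, t.domain);
  const namedOnOwnPages = t.ownPagesText.toLowerCase().includes(name);
  const crossDomainMentionsBoth =
    c.sourceBody.toLowerCase().includes(name) &&
    c.sourceBody.toLowerCase().includes(t.companyName.toLowerCase());

  // Any one of the three conditions accepts the candidate; confidence tiers sit on top.
  return firstParty || namedOnOwnPages || crossDomainMentionsBoth;
}
```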
Why static curated source tiers, not learned scoring
Each verified DM carries a `confidence` field — HIGH / MEDIUM / LOW — surfaced as a small pill next to their name. HIGH = first-party (target domain) + LinkedIn / Crunchbase / mainstream business press. MEDIUM = curated developer / industry platforms (Medium, dev.to, GitHub, Substack). LOW = anything else, including AI-generated wikis and SEO content farms. Tiers are versioned with the code: transparent to read, and a plain code change to update. The alternative — learned tiers, logging-based ranking — drifts silently as the corpus changes; explicit lists do not.
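A sketch of the tier lookup; the lists shown are abbreviated examples, not the full curated sets:

```ts
const HIGH_TIER = ["linkedin.com", "crunchbase.com", "techcrunch.com", "forbes.com"];
const MEDIUM_TIER = ["medium.com", "dev.to", "github.com", "substack.com"];

export function confidenceTier(sourceUrl: string, targetDomain: string): "HIGH" | "MEDIUM" | "LOW" {
  const host = new URL(sourceUrl).hostname.replace(/^www\./, "");
  if (host.endsWith(targetDomain) || HIGH_TIER.some((d) => host.endsWith(d))) return "HIGH";
  if (MEDIUM_TIER.some((d) => host.endsWith(d))) return "MEDIUM";
  return "LOW"; // everything else, including AI-generated wikis and SEO content farms
}
```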
Why an anti-fabrication regex gate on Agent 3 specifically
Even with explicit prompt instructions, Llama on Groq sometimes leaks a confident-sounding number into the email body. A regex gate runs after Agent 3 and triggers a corrective retry if the body contains any concrete, checkable-sounding claim (multi-digit percentage, dollar figure, Series A-K, funding verb, multiplier). The retry instruction tells the model to fall back to the public value proposition — which *is* in the brief — instead of citing any specific stat at all. Better a clean factual opener than a confident fabrication.
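A sketch of the gate; the exact patterns and the retry instruction wording are illustrative:

```ts
const FABRICATION_PATTERNS: RegExp[] = [
  /\b\d{2,}\s?%/,                                 // multi-digit percentages ("47%")
  /\$\s?\d[\d,.]*\s?(k|m|bn?|million|billion)?/i, // dollar figures
  /\bSeries\s+[A-K]\b/i,                          // funding rounds
  /\b(raised|funding|valuation)\b/i,              // funding verbs
  /\b\d+(\.\d+)?x\b/i,                            // multipliers ("3x")
];

export function containsUnverifiedSpecifics(emailBody: string): boolean {
  return FABRICATION_PATTERNS.some((re) => re.test(emailBody));
}

// On a hit, Agent 3 is re-run with a corrective instruction along the lines of:
// "Do not cite any statistic, funding amount, or growth figure. Open with the
//  company's public value proposition from the brief instead."
```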
Why empty-DM resilience over a hallucinated recipient
When the validation gate drops every candidate (small startup, only the founder is publicly named anywhere), the orchestrator skips Agent 3 entirely instead of letting it invent a recipient out of the buyer-persona placeholder. The Outreach tab surfaces a 'no verifiable decision makers found' panel with a deep-linked Open LinkedIn People Search button pre-filled with the company name. The tool refuses to invent and points the user at the right manual path.
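The pre-filled deep link is simple to construct; the LinkedIn URL format below is an assumption based on its public people-search page:

```ts
export function linkedInPeopleSearchUrl(companyName: string): string {
  const params = new URLSearchParams({ keywords: companyName });
  return `https://www.linkedin.com/search/results/people/?${params.toString()}`;
}
```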
What I'd do differently
Wire `AbortSignal` through the agent runners — right now if a visitor navigates away mid-run the orchestrator keeps churning to completion, burning quota on a result no one will see. Tighten triangulation requirements per tier (≥2 sources for HIGH, ≥1 for MEDIUM) — a single curated source currently sets the tier, which is fine for the demo but won't hold under real volume. Logging-based source tier learning would auto-demote consistently-failing sources, but it needs the curated baseline to work first. None of these are blocking the demo's intended value; they're production hardening.
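A sketch of what wiring `AbortSignal` through the runners could look like (hypothetical stage runner, not current code):

```ts
// Hypothetical: runAgent forwards the signal into its fetch and AI SDK calls.
declare function runAgent(stage: 1 | 2 | 3, signal: AbortSignal): Promise<unknown>;

export async function runPipelineAbortable(signal: AbortSignal): Promise<void> {
  for (const stage of [1, 2, 3] as const) {
    signal.throwIfAborted(); // stop between stages once the visitor has disconnected
    await runAgent(stage, signal);
  }
}
```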