Architecture at a glance
Spider queries
→ Harness (agent run → TelemetryRecord)
→ Detector (windowed drift detection → DriftEvent) ← my stage
→ Correction (teacher generates + verifies SQL → few-shot examples)
→ Viewer (events.jsonl → recovery curve)
Contracts: frozen Pydantic types between stages · append-only event log
· feedback = edits to AgentConfig.few_shot_examples
The problem
Production AI agents drift — the same agent that worked last week quietly degrades — and most “self-improving agent” demos can’t tell a genuinely bad run from noise. Built at an AI Engineer World’s Fair hackathon (24 hours, four people), this is a self-correcting text-to-SQL agent that detects its own degradation and learns from its failures, with the improvement held to a real statistical bar. I owned the Detector stage — the drift detection.
The architecture
Four typed stages with frozen Pydantic contracts between them, over an append-only event log. Harness runs the agent on Spider questions and emits TelemetryRecords. Detector (mine) consumes those and emits DriftEvents. Correction has a stronger teacher model generate and verify SQL to mint few-shot examples. Viewer turns the event log into a recovery curve. “Feedback” is concretely a modification to AgentConfig.few_shot_examples — the loop is data, not weights.
What was hard, and what I decided
Distinguishing drift from noise (the Detector). A single wrong query is noise; sustained degradation is drift. I built the detector as windowed rather than point-wise, running as a small state machine — WARMUP → NORMAL → DRIFTING — so it only fires on a persistent shift in error rate, not one-off failures. That’s the difference between an alert you trust and one you mute.
Proving the agent actually improved. It’s easy to show a number going up and call it learning. We validated recovery with a paired McNemar exact test on 30 held-out hard questions — 0.567 vs. 0.300 accuracy, a gain of 0.267, p = 0.016 — so the improvement is statistically significant, not cherry-picked. To make the test meaningful, we used an intentionally weak agent against a stronger teacher, so any improvement reflects genuine learning from failures rather than the base model already knowing the answer.
Corrections you can trust. The teacher’s SQL is accepted only if it executes to the correct result set; otherwise it’s discarded and the gold query is used instead. Examples are stratified out-of-sample by database, with anti-forgetting anchors, so the agent doesn’t overfit to the drift window it just recovered from.