From Vibes to Production: Evaluating and Shipping AI Agents That Work 101
Schedule: 9:00am-11:00am
Laurie Voss · Room 2010
Still worth it despite sponsor context: evaluation, tracing, failure analysis, online monitoring, and coding-agent repair loops are directly reusable.
Official description
Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions. This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems. You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.
Cooking with Codex
Schedule: 9:00am-11:00am
Charlie Guo, Gabriel Chua · Room 2002
Good alternative if you want Codex workflow, skills, plugins, MCPs, and multi-agent operating patterns more than eval depth.
Official description
Codex is changing how technical teams ship across the software development lifecycle, from feature implementation to code review and automation. But the real unlock comes when these practices move beyond a single workflow and become shared systems a team can trust. In this hands-on session, you'll use Codex across real development and knowledge-work scenarios: structuring tasks, supervising agentic work, coordinating subagents, using plugins and MCPs, and combining Codex with OpenAI's frontier reasoning, coding, and multimodal models. Bring your laptops and leave with reusable demos and a set of Codex recipes your team can adapt.