AI Engineer World's Fair 2026

Day 1 - Workshop Day

Hands-on workshops. Some are vendor-hosted, but these are selected for transferable agent-building practice.

9:00-11:00

From Vibes to Production: Evaluating and Shipping AI Agents That Work 101

Schedule: 9:00am-11:00am

Laurie Voss · Room 2010

Still worth it despite sponsor context: evaluation, tracing, failure analysis, online monitoring, and coding-agent repair loops are directly reusable.

Official description

Building an AI demo is easy. Knowing whether it actually works — and keeping it working in production — is the hard part. Most teams ship agents on vibes: they try a few prompts, the output looks good, and they push to production with no real way to measure quality or catch regressions. This hands-on workshop walks through the full lifecycle of shipping a real AI agent, using a working financial-analyst agent built on the Claude Agent SDK as the running example. You'll instrument it with tracing, do structured error analysis on its actual outputs, and build a layered evaluation suite — from cheap deterministic code checks to LLM-as-a-judge evaluators with custom rubrics. We'll cover the parts most tutorials skip: why agents fail in ways single LLM calls don't, the eval anti-patterns that quietly mislead you, and how to know whether you can even trust your judge (meta-evaluation). Finally, we'll close the loop: turning eval results into datasets and experiments, running evals online against production traffic, wiring them to monitors and alerts, and feeding failure explanations back to a coding agent to actually fix the underlying problems. You'll leave with a runnable notebook and a repeatable, evaluation-driven workflow you can apply to your own agents the next day.

Open official schedule

mustevalsopsgtaaf

Cooking with Codex

Schedule: 9:00am-11:00am

Charlie Guo, Gabriel Chua · Room 2002

Good alternative if you want Codex workflow, skills, plugins, MCPs, and multi-agent operating patterns more than eval depth.

Official description

Codex is changing how technical teams ship across the software development lifecycle, from feature implementation to code review and automation. But the real unlock comes when these practices move beyond a single workflow and become shared systems a team can trust. In this hands-on session, you'll use Codex across real development and knowledge-work scenarios: structuring tasks, supervising agentic work, coordinating subagents, using plugins and MCPs, and combining Codex with OpenAI's frontier reasoning, coding, and multimodal models. Bring your laptops and leave with reusable demos and a set of Codex recipes your team can adapt.

Open official schedule

automationgtops

11:05-12:05

How to Build Quality Gates into Agentic Coding Workflows

Schedule: 11:05am-12:05pm

Nnenna Ndukwe · Room 2024

Best agent-builder pick in this slot: turn coding agents from demos into gated workflows that can safely touch real repos.

Official description

AI coding agents can now generate code at unprecedented speed. But faster code generation creates a new engineering problem: how do we know when agent-written code is actually safe, maintainable, and ready to merge? In this hands-on workshop, attendees will build an agentic coding workflow with enforceable code quality gates across planning, implementation, testing, and code review. By the end of the session, participants will have a working reference pattern for agentic software delivery: an AI-assisted workflow that can inspect a repo, implement a change, run tests, evaluate risk, respond to feedback, and surface what still requires human judgment. This is a technical enablement session for engineers building with AI coding agents, platform teams designing agentic SDLC workflows, and AI engineering leaders thinking about how to scale software quality with AI.

Open official schedule

mustevalsopsgt

Building self-learning loops for your agent

Schedule: 11:05am-12:05pm

Fuad Ali · Room 2010

Good if you are focused on feedback loops and continuous improvement for AAF review quality.

Official description

Open official schedule

evalsopsaaf

12:10-1:10

From Zero to Leaderboard: Building an End-to-End AI Agent Evaluation Pipeline

Schedule: 12:10pm-1:10pm

Wolfram Ravenwolf · Room 2005

A practical eval-pipeline session. Use it to think about AAF candidate quality and Birdie/GolfTroop regression tests.

Official description

Running one agent eval is easy. Running hundreds — with controlled timeouts, replicated configs, and automated collection across distributed VMs — requires infrastructure that most teams end up building from scratch. In this workshop, we shortcut that process and build a rigorous evaluation pipeline end-to-end. Participants will set up and connect the full evaluation stack: **Layer 1 — The Benchmark Runner.** Configure Harbor to orchestrate parallel agent evaluations on Terminal-Bench 2.0, with W&B Sandboxes providing isolated environments for each task. **Layer 2 — The Collection Pipeline.** Use WolfBench to scan distributed VMs for results, deduplicate across runs, download trajectories, and build a local results archive that survives VM teardown. **Layer 3 — The Analysis Framework.** Compute the five-metric framework (Ceiling / Best / Average / Worst / Solid) across replicated runs. Learn to read the spread: when is a model "better"? When is a score difference just noise? **Layer 4 — The Observability Layer.** Upload full agent conversation traces to W&B Weave for per-turn inspection. See exactly where an agent goes wrong — the command it ran, the output it misread, the moment it started looping. **Layer 5 — The Leaderboard.** Generate interactive HTML charts that show the full performance distribution, not a single bar. We'll work with real data from hundreds of production runs, and participants will leave with a working pipeline they can adapt to their own agents and benchmarks. Laptops required; all tools are open-source.

Open official schedule

mustevalsopsgtaaf

2:20-4:20

From Vibes to Production: Evaluating and Shipping AI Agents That Work 201

Schedule: 2:20pm-4:20pm

Laurie Voss · Room 2010

Best continuation if you did 101. Prioritize if you want a real workflow for trace-to-dataset-to-monitor-to-fix loops.

Official description

Open official schedule

mustevalsopsgtaaf

Vector Isn't Enough: Hybrid Search & Retrieval for AI Engineers

Schedule: 2:20pm-4:20pm

Jeff Vestal · Room 2024

Only switch here if retrieval quality is the bottleneck this week. It is less agent-build focused than the eval workshop but strong for AAF search.

Official description

If you build RAG, you reached for vector search first. This lab is about everything that happens after you realize embeddings alone don't cut it in production. You'll write real queries — semantic, lexical, and hybrid — feel exactly where each one fails, and walk out with a production-grade retrieval pipeline and the judgment to know which technique to reach for when. What you'll actually do: 1. Dense vector search, and the mechanism behind it. Run semantic queries over a semantic_text field backed by Jina v5 embeddings — generated server-side, at query time, by the Elastic Inference Service (EIS). No embedding service to stand up, no client-side inference code. We open the hood on how query-time embedding actually works. 2. Break it. Throw adversarial queries at pure vector — exact error codes, version numbers (8.18 vs 9.0), precise config keys — and watch semantic similarity blur the exact match you needed. Then bring in BM25 lexical search to rescue it… and find the queries where keyword search whiffs. Each method is strongest exactly where the other is weakest. 3. Hybrid, properly. Fuse lexical + semantic with Elasticsearch retrievers. Learn the two fusion strategies that matter — Reciprocal Rank Fusion (RRF) and linear combination with score normalization — when to use each, and how to tune them. Optional: cross-encoder reranking with Jina Reranker v2. 4. Why this is the whole game for agents. Wire the hybrid retriever into a RAG flow and prove that retrieval quality, not the model, determines answer quality. Only synthesis truly needs the LLM - retrieve, rank, filter, and document-level security are database work done in milliseconds for a fraction of the cost. The contrarian takeaway: most of your RAG pipeline shouldn't be LLM calls at all.

Open official schedule

retrievalaafgt

4:30-5:30

The Art and Science of Loopcraft with Pi (and friends)

Schedule: 4:30pm-5:30pm

Joel Hooks · Room 2020

Useful for recurring loops, scheduled tasks, and long-lived assistant workflows.

Official description

This workshop helps agentic coding practitioners stop treating agents like pretend coworkers and start designing reliable, compounding loops. Using Pi as the concrete demo surface, Joel Hooks will show how loop state, handoffs, review, memory, and operator control become visible, while keeping the ideas portable to Claude, Codex, Cursor, and similar coding agents. Practitioners should leave able to identify loops inside their agent workflows, diagnose when failures need gates/evidence versus orchestration/memory/leverage, and understand how model-shaped lifecycles differ from traditional human SDLC rituals.

Open official schedule

automationgtops

The Autonomous Computer: Full-stack Infrastructure for Computer Use Agents

Schedule: 4:30pm-5:30pm

Ang Li · Room 2010

Better AAF pick if you are focused on browser/computer-use infrastructure for hard portal workflows.

Official description

Even the world's best computer-use agents cannot repeat their successes at the moment. Agents that write code — emitting structured selector-based actions instead of clicking pixels — break through that ceiling. We'll share two years of experience from Simular's production agent platform, the architectural decisions that mattered (refs over pixels, code as substrate, Simulang DSL), and a live demo: a 30-step unattended Windows workflow, side-by-side with a vision-only baseline. If you're shipping agents to real users, this is the playbook.

Open official schedule

automationaafops

Day 2 - Session Day 1

Today: bias toward agent architecture, evals, security, memory, and production workflows. Vendor/product talks are mostly demoted.

9:00-10:30

Main Stage block: opening keynotes

Schedule: 9:00am-10:30am

swyx, Pablo Castro, Alexander Embiricos, Romain Huet, Zixuan Li, Thom Wolf, Olive Song · Main Stage · Level 3

Start here if you are onsite. Not all are directly actionable, but they frame the week and avoid missing late schedule changes.

Official description

Opening main-stage block for Day 2.

Open official schedule

ops

10:45-11:05

The New Primitives: Building AI-Native Software

Schedule: 10:45am-11:05am

Kwindla Kramer · Room 2014

Better fit than Pinecone 2.0 if you care about building agents rather than buying a search product. Listen for primitives and architecture, not product positioning.

Official description

In the future, every piece of software with a human-facing surface will be built from new, LLM-centric primitives. (Just like every piece of software today has networking, threads/async routines, UI on top of some flavor of Model/View/Controller abstractions, etc.) We're just starting to invent these new primitives. The list, though, will definitely include: 1. Subagents - multiple inference loops, multiple models, async tool calls 2. Very long context - memory + episodic human interactions over a long period of time, structured data input (not just output), progressive skills/context loading, graceful compaction & summarization 3. dynamic user interface generation / user interfaces driven by LLM inference 4. conversational voice input

Open official schedule

mustautomationopsgt

Governance Is the Real Bottleneck to AI ROI

Schedule: 10:45am-11:05am

David Hsu · Leadership 2 · Room 3020

Alternative if you are thinking about agent access, approval, and accountability layers.

Official description

As AI systems move from generating content to taking Claw-based agents action inside production systems, governance (not model quality) becomes the limiting factor. David will break down why visibility, guardrails, approvals, and rollback matter more than raw intelligence, and how companies can enable AI adoption without creating security and compliance disasters.

Open official schedule

opsgtaaf

11:10-11:30

The unreasonable effectiveness of BM25 for agentic search

Schedule: 11:10am-11:30am

Jo Kristian Bergum · Room 2002

Strong, durable retrieval topic for AAF and agent search. Not just a vendor pitch; BM25 remains a practical baseline you can actually use.

Official description

GPT-5 is shockingly good at search, and that changes the "BM25 as a baseline" story. Using GPT-5 search trajectories from BrowseComp-Plus, I'll show how default BM25 parameters and evaluation harnesses can make lexical retrieval look weak, while real agent queries often play directly to BM25's strengths. Much like grep became a core retrieval primitive for coding agents, BM25 is re-emerging as a powerful primitive for agentic search.

Open official schedule

mustretrievalaafgt

Your Agent Evolved. Your Evals Didn't.

Schedule: 11:10am-11:30am

Ameya Bhatawdekar · Leadership 2 · Room 3020

Pick this if eval design is more urgent than retrieval today.

Official description

Knowing which generation your agent is in, which failure modes your current evals are blind to, and what to build next is the difference between shipping with confidence and flying blind. Agent architectures have evolved through six generations; prompt, chain, ReAct loop, workflow graph, modern agent loop, AI harness. And each one quietly breaks the eval strategy of the generation before it. A prompt-quality rubric won't catch a bad tool call; a trace scorer won't catch memory poisoning. Using a single SRE incident response agent threaded through every generation, this talk shows exactly where each architecture outgrows its evals and what you need to close the gap.

Open official schedule

evalsopsgtaaf

11:40-12:00

Agentic SDLC at Uber: Building Blocks for Uber's Software Factory

Schedule: 11:40am-12:00pm

Uday Kiran Medisetty, Adam Huda · Leadership 1 · Room 3016

Architecture patterns for production coding agents at large scale. More useful than another generic agent-performance framework.

Official description

99% of Uber engineers are using AI every month, 70% of PRs are attributed to AI, and 15% of PRs are now done entirely by autonomous agents. In this session, we go behind the scenes to show you exactly what it takes to get there — starting with the foundational building blocks: the model gateway, MCP infrastructure, agent skills, knowledge systems, and cloud developer environments that make agentic engineering possible at scale. Then, once those foundations are in place, we show you how to assemble them into a fully agentic SDLC. We'll walk through every stage — from research and spec writing, to autonomous code generation, to verifying and validating that code before it ships, to monitoring what happens after it lands, and continuously improving it over time. With tooling example demos throughout. Whether you're just starting your agentic journey or already running agents in production, you'll leave with a concrete blueprint for what this looks like end to end.

Open official schedule

mustautomationopsgt

Agentic vs. Vector Search: An Eval-Driven Approach to Coding Agent Performance

Schedule: 11:40am-12:00pm

Jess Wang · Expo Stage 2 NW · Level 1

Good alternate if you want search/eval lessons for coding agents.

Official description

Evals let you replace gut feelings with quantifiable decisions. This talk breaks the basic concepts of evals, including the four core components: datasets, tasks, scoring, and experiments. Then, to solidify the concept, we’ll walk through a real eval comparing agentic search versus vector search for coding agents. We'll also cover practical challenges like tracing Claude Code subprocess calls and why a single eval run is never enough. You'll leave with a concrete framework for building evals that actually inform your ship decisions.

Open official schedule

retrievalevalsgt

12:05-12:25

Scaling Code Quality: Building uReview, Uber’s Multi-Agent Code Review Engine

Schedule: 12:05pm-12:25pm

Will Bond, Ameya Ketkar · Leadership 1 · Room 3016

High-confidence useful: specialized agents, deterministic tools, generator-verifier checks, confidence scoring, dedupe, and developer feedback loops.

Official description

At Uber scale, human-only code reviews create massive bottlenecks, while generic AI tools overwhelm developers with noisy, hallucinated spam. This session explores the architecture behind uReview, Uber’s multi-agent AI code review engine designed strictly for high-precision feedback. Attendees will learn how we moved beyond monolithic prompts to build a modular pipeline featuring deep contextual ingestion, specialized domain agents, and a Generator-Verifier grader system. By enforcing strict confidence scoring and semantic deduplication, uReview filters out AI noise, shifting the focus from comment quantity to high-signal actionability and significantly reducing Pull Request cycle times. Talk Outline I. The Code Review Crisis at Uber Scale (0–3 mins) Establish the critical tension between engineering velocity and code quality, highlighting why standard AI implementations fail in massive monorepo environments. 1. The Monorepo Bottleneck: At Uber, thousands of engineers commit code daily. Relying solely on human reviewers creates a massive operational bottleneck, leading to reviewer fatigue, extended Pull Request cycle times, and inevitable missed vulnerabilities. 2. The Developer Spam Problem: Generic LLM integrations fail because they prioritize comment quantity over actionable quality. If an AI posts ten hallucinated suggestions on a diff, developers will simply mute the tool. AI must reduce cognitive load, not add to it. 3. The Signal-to-Noise Mandate: Defining the North Star for uReview. The goal is not to replace human reviewers, but to build an AI system that respects developer time by delivering high-precision, strictly verified code feedback. II. The uReview Architecture: A Modular Agentic Pipeline (3–10 mins) Detail the transition from a monolithic prompt approach to uReview’s sophisticated, multi-stage agentic workflow designed for enterprise codebases. 1. Deep Contextual Ingestion: A standard git diff is not enough. We discuss how uReview fetches extended context, integrating with our build systems to analyze surrounding functions, upstream dependencies, and class hierarchies before generating a single token. 2. Specialized Domain Assistants: Instead of a generalist model, uReview deploys independent AI agents. We route code to narrow, specialized agents—such as a Go Concurrency Analyzer, a Java Memory Leak Detector, or a Security Vulnerability Scanner—to ensure precise, domain-specific insights. 3. Hybrid Intelligence: Probabilistic LLMs cannot operate in a vacuum. We detail how uReview integrates deterministic tools, like Bazel dependency graphs and static linters, to ground AI suggestions in objective codebase realities. III. Engineering the Trust Layer (10–17 mins) Dive into the verification phase. This is the core engineering that filters out AI noise and ensures uReview maintains developer trust. 1. The Generator-Verifier Pattern: Implementing a Grader Model architecture. A primary agent generates code suggestions, but a secondary, high-reasoning model audits those suggestions against strict coding guidelines to catch hallucinations before they reach the PR. 2. Confidence Scoring and Suppression: We assign a numerical confidence score to every generated comment. If a comment falls below our calibrated threshold, uReview silently drops it. We explore the engineering behind suppressing low-confidence outputs to prevent tooling spam. 3. Semantic Deduplication: Technical strategies for merging overlapping warnings. If a deterministic static analysis tool and an LLM agent flag the same null pointer exception, uReview merges them into a single, concise developer instruction. IV. Operationalizing uReview at Scale (17–20 mins) Conclude by discussing the long-term governance, feedback loops, and measurable impact of running an AI review engine in production. 1. The Telemetry Feedback Loop: We embedded Useful and Not Useful rating buttons directly into the developer UI on every uReview comment. We discuss how this telemetry flows back into a curated data lake, driving continuous Reinforcement Learning from Human Feedback and prompt refinement. 2. Shifting Success Metrics: Why organizations must abandon vanity metrics like total comments posted. We measure uReview’s success through Actionability Rate (the percentage of AI comments accepted as commits) and the reduction in Mean Time To Merge.

Open official schedule

mustevalsautomationopsgt

Rebuilding the web for agents

Schedule: 12:05pm-12:25pm

Liad Yosef · Room 2002

AAF alternative: agent-readable web surfaces, extraction, and indexing implications.

Official description

AI apps are the new browsers. And the web is not ready. For thirty years we built the web for human eyes, benchmarked by tools like Lighthouse: humans measuring human behavior. That era is ending. Bot traffic has overtaken human traffic, and we can't hand-write a benchmark for what comes next - every best practice goes stale the moment models improve. Your next customer isn't a human with a credit card - it's an agent with a protocol, and it would rather not see your interface at all. That shift moves the UX question from how a human experiences your product to how an agent does, and how a human experiences that agent. Already, some services report their MCP traffic outpacing their web UI. The agent is rapidly becoming the main surface, and it always takes the path of least friction. Claude Code might consistently prefer PostHog over Mixpanel simply because PostHog *has the better agentic surface* - and Mixpanel loses customers without a human ever weighing in. Meanwhile the agentic web protocol stack keeps multiplying, a new one seemingly every week. The harder problem isn't discovery - it's operability: whether the web can actually be run once an agent arrives, and what is the ideal stack for that. Should we lean into headless protocols, or ones like WebMCP that treat the UI as the source of truth? Does a site need to implement every new spec just to support every kind of agent? So we stopped guessing and watched real agents work the whole journey: finding, understanding, authenticating, acting, handing back to a human. The findings go against the last year of agent-readiness advice. Agents ignore the files we built for them, reaching for docs and homepages instead - and whatever they reach, they trust and act on. But when those files are linked properly, their usage jumps 4x. The format isn't the key for the agentic web. Reachability is. The web will never be completely headless. Some moments still demand a human: choosing a seat, comparing options, casually exploring. And agents aren't uniform - some want full headless access, others spin up a browser to fill the gaps, but that's a friction point, not a free fallback. So the web is going nearly headless, always with a human eye at the end. This talk maps the entire agent web landscape based on findings from real agent journeys research: * Which protocols earn their place and which are noise. * Why "agent-ready" and "accessible" are the same engineering problem. * How MCP Apps close the last mile - and when headful protocols like WebMCP step in. * How to build for agent-readiness that survives the next model - not a checklist that's stale in a month. The gap between ready and not is about to separate the relevant from the invisible.

Open official schedule

retrievalautomationaaf

1:30-1:50

If we want them to do Knowledge Work, we need to design Knowledge Agents

Schedule: 1:30pm-1:50pm

Benjamin Clavié · Room 2002

Best fit for building useful agents over project knowledge rather than only chat wrappers.

Official description

It's tempting to assume that just like agents revolutionised coding, they will revolutionize other areas: legal, finance, advertising, and even medicine. All of those have in common that they are fundamentally knowledge work. And thankfully, humans have spent thousands of years searching for the best possible workflows for knowledge work. And yet, we seem to be disregarding all of these learnings, forcing every knowledge task into the shape that worked for coding. Today, we're going to talk about the history of knowledge work and how tools were co-designed to support it to understand how we should be building Knowledge Agents, themselves co-designed with their Knowledge Tools. This is key to avoiding falling into a "good enough" local optimum: think about legal clerking, a core part of the legal industry where information gathering and reasoning is performed to support the work of senior lawyers. The practice of clerking follows its own code, rules and best practices, which could not have feasibly emerged from studying software engineering: and similarly, there is no reason to believe knowledge agents could emerge from coding agents.

Open official schedule

mustretrievalgtaaf

Designing Evals That Earn User Trust

Schedule: 1:30pm-1:50pm

Felipe Blanes · Expo Stage 3 SW · Level 1

Alternative if user-facing agent trust is the current product concern.

Official description

Most teams measure their agent against a benchmark, ship it, and hope. But when your agent serves real users, a benchmark won't tell you if it's actually working. This session is about building an eval suite that captures what success looks like in production, runs against real user workflows, and feeds back into product decisions. Here's the flywheel we use in practice: start with what success looks like from the user's perspective, instrument production workflows to capture those signals, diagnose where the agent falls short, and feed those insights into the next thing you build. You'll see how it shaped concrete product bets, turning eval results from a report card into a discovery tool.

Open official schedule

evalsgtops

1:55-2:15

IT Admin for the AI Workforce: Why Your AI Agents Will Need Their Own IT Department

Schedule: 1:55pm-2:15pm

Sarthak Aggarwal · Leadership 2 · Room 3020

Strong agent-builder operations topic: identities, permissions, lifecycle, and support for agents as a managed workforce.

Official description

Every enterprise will soon run two workforces - human and AI. Humans already have IT departments managing their identities, access, incidents, and compliance. Who manages all that for your fleet of 10,000 AI agents? Nobody. Yet. At Decawork AI, we started by building autonomous IT resolution for human employees - a dual-agent system where the agent that thinks can't act and the agent that acts can't improvise. We're live in production across multiple enterprises - autonomously resolving incidents across identity systems, security platforms, endpoint infrastructure, and collaboration stacks. But here's what we discovered: the patterns for managing human IT - identity lifecycle, access governance, incident resolution, audit logging - are the exact same patterns you'll need to manage AI agent fleets at scale. The next massive infrastructure layer isn't AI agents doing work. It's AI agents managing other AI agents. This talk covers the architecture, the production war stories, and the thesis: IT Admin for the AI workforce is an inevitability, and we're building it now.

Open official schedule

mustopsautomationgtaaf

Deploying browser agents at scale

Schedule: 1:55pm-2:15pm

Derek Meegan · Expo Stage 4 SE · Level 1

AAF alternative if browser automation is the immediate need.

Official description

Not every browser agent trajectory is the same, and treating them like they are is how teams quietly burn budget on agents that never ship. This talk walks through the two trajectory types behind every browser agent, the cost/performance/maintainability tradeoffs that decide whether they hold up, and the concrete patterns for evaluating, hardening, and iterating on them.

Open official schedule

automationaaf

2:25-2:45

Productionizing LLM Gateways: Architecture, Tradeoffs, and Hard Lessons from the Trenches

Schedule: 2:25pm-2:45pm

Kanish Manuja · Leadership 1 · Room 3016

Practical for agent infrastructure: routing, cost, reliability, provider failures, and observability.

Official description

As organizations scale their use of large language models, the biggest challenge is no longer prompting, it’s productionizing. This session dives deep into building and operating an LLM gateway that sits between applications and model providers, handling routing, observability, cost control, reliability, and safety at scale. Drawing from real world experience, this talk breaks down the architecture of a production LLM gateway, including model abstraction layers, request orchestration, fallback strategies, caching, rate limiting, and evaluation pipelines. We’ll explore hard tradeoffs such as latency vs. cost, quality vs. determinism, and vendor lock-in vs. flexibility. Attendees will leave with concrete design patterns, failure modes to avoid, and a mental model for turning LLM experiments into resilient, scalable systems.

Open official schedule

mustopsgtaaf

How to Connect AI to Billions of Legal Documents

Schedule: 2:25pm-2:45pm

Simon Eskildsen, Jacob Lauritzen · Room 2002

AAF alternative for legal/public-record retrieval patterns.

Official description

Legora’s foundational engineering challenge is connecting frontier LLMs to billions of legal documents so the models can efficiently solve end-to-end legal workflows without burning extra tokens. We’ll share the retrieval architecture we built with turbopuffer that achieves: 1. Strict data isolation across millions of legal cases in a very security-conscious domain 2. Predictable search performance (<100ms p90 latency) on large contexts 3. High retrieval quality (95%+ recall@10) with fewer agent loops We’ll retrospect on two architectures that failed to achieve all 3 (and why), and the key design factors that make the current solution work at our scale. Practical takeaways include: - How to evaluate per-tenant vs shared-index retrieval under strict data isolation - How to efficiently index and retrieve context to maximize relevance per input token - How to build a highly intelligent AI application when your inference budget is constrained

Open official schedule

retrievalaaf

2:50-3:10

Your company brain will leak secrets. Here's how we stopped it for big banks and ourselves.

Schedule: 2:50pm-3:10pm

Tanmai Gopal · Room 2010

Important for agents over company data: permissions, leakage, and retrieval boundaries.

Official description

Everyone wants a shared "company brain", one single AI that knows everything the org knows. But it's nearly impossible to build one, because the moment AI scrapes everyone's data into one place, a single wrong answer to the wrong person is a breach. The downside of modifying a above-my-pay-grade shared skill, or leaking confidential information to the wrong colleague is catastrophic. Ergo, company brain projects can only ever ship to the few people who already had access to everything, or stay hobbled with strictly public information (eg: River at Shopify). We've been building one for the last year and have successfully deployed for Fortune 100 banks, for distributed-operations orgs with global scale, and for ourselves as a 70-person AI-native startup. I'll leave you with a blueprint covering how we solved the following problems: 1. Permissions for shared data and tools 2. A shared context layer (skills, knowledge, semantic layer) with its own access control 3. Scoping the blast radius of wrong context 4. Auto-learning without auto-leaking If your company brain effort has been blocked by security, compliance, or just a healthy fear of the intern asking the AI a question and getting back the exec comp table, this is the talk.

Open official schedule

mustopsgtaaf

6 Pillars of an Agentic Harness That Fixes Production Incidents

Schedule: 2:50pm-3:10pm

Varun Krovvidi · Expo Stage 1 NE · Level 1

Alternative if you want incident-response framing for agents.

Official description

A model delights us when any plausible answer works, but a production incident has one right answer, and the model alone can't reliably reach it. Getting there depends less on the model and more on the orchestration, context, and judgment built around it. That work is harness engineering, and it is the new frontier. This session breaks down the six pillars of an agentic harness required to fix production incidents: model orchestration, context, reasoning, actions, learning, and evals. Join Resolve AI to walk through what each one does, why a better model doesn't make any of them go away, and how they compose to find the root cause of a live incident across massive context, under a clock, with real revenue on the line.

Open official schedule

opsevalsgtaaf

3:20-3:40

From Context to Memory: Your Agents Need a Real Memory Layer

Schedule: 3:20pm-3:40pm

Anders Swanson · Expo Stage 2 NW · Level 1

Directly useful for durable project agents and AAF learning loops.

Official description

Most agents don't really have memory. They have a context window, a pile of temporary files, maybe an AGENTS.md, and a retrieval step that attempts to build state from whatever the model can still see. You've seen the flashy demos, but these systems fall apart when an agent needs to recover from failure, revisit prior work, and observe if failures are less frequent over time. This talk explores agent memory as a systems problem. Effective memory isn't just storing data: it's an evolving knowledge layer with write filtering, consolidation, reflection, and forgetting. Agents need persistence, and they also need structure. Raw logs and Markdown scratchpads aren't enough. A real memory layer weights recency, combines retrieval techniques, and correlates episodic memories. Serious agent memory is inherently multi-model. The best systems use full-text search, semantic retrieval, graph relationships, and structured state to reconstruct context with far more precision than filesystem grep alone. This is where databases become essential as the foundation for real memory. Memory shapes how agents behave, adapt, and improve over time.

Open official schedule

mustretrievalautomationgtaaf

How to Get Your Org to Adopt Coding Agents (Without Shipping Garbage)

Schedule: 3:20pm-3:40pm

Eyal Blum · Leadership 1 · Room 3016

Alternative if adoption/process is more important than memory architecture.

Official description

AI coding agents promise 10x. On complex, production work inside a real org, the honest number is 2-5x — and getting there requires a journey most teams aren't prepared for. At Figma, we ship AI products to millions of users, but internally our engineering org is spread across three stages of adoption. The honeymoon, where AI is magic. The crash, where AI writes bad code and your best engineers are stuck protecting the quality bar. And the real skill — 2-5x with disciplined development practices and proper investment. This talk covers why adoption is uneven, what the trust curve looks like from the inside, and what leaders can do about it: guide teams to align on plans before generating code, set honest expectations, invest in the fundamentals that make codebases agent-friendly, and create space for skeptics without judgment. You'll leave with a framework for driving adoption more organically without mandating it — and without shipping garbage.

Open official schedule

automationopsgt

3:45-4:05

Unlock Agent Autonomy: The Runtime for AI-Native Systems

Schedule: 3:45pm-4:05pm

Tushar Jain · Leadership 2 · Room 3020

Good runtime architecture close: autonomy, orchestration, data, and security boundaries.

Official description

The way software gets built in 2026 doesn't look like it did in 2024. The actors changed. Agents read and write entire codebases. Subagents spawn to chase down a flaky test, refactor a module, or triage an incident. But this shift doesn't stop at the SDLC. Agents increasingly invoke tools, interact with enterprise systems, install dependencies, call APIs, and orchestrate workflows across local machines, CI systems, cloud infrastructure, and organizational boundaries. The teams leaning into this shift are moving faster, and the gap is widening by the quarter. But few have the confidence to let agents operate autonomously across those environments. Not because the model capability isn't there. Trust isn't. Agents can pull a poisoned dependency, invoke an untrusted tool, wipe a database, leak sensitive data, or access systems they shouldn’t. Prompt-level instructions won't close that gap, the unlock has to happen one layer down, at the runtime layer itself. Docker spent the last decade making it safe to ship software by getting the runtime right: isolation, network policy, trusted base images, and credentials. Agents are the next workload, and the same principles apply. Tushar Jain, EVP of Engineering at Docker, walks through what the runtime layer for AI-native systems looks like in practice: hardened runtime foundations, sandboxes that constrain what agents can touch, and governance controls that limit what agents can introduce, access, and execute across local, CI, cloud, and enterprise environments. The pattern is the same on every vector: reduce the surface area of what the agent gets to decide, so the parts that matter aren't left to a prompt. Attendees leave with a clearer framework for giving agents more autonomy safely. Engineers see how agentic applications can operate across tools and infrastructure. Security leaders get a runtime model that maps to controls they already understand. Platform teams get a way to scale agent execution without standing up a new runtime for every team.

Open official schedule

mustautomationopsgtaaf

Loop Engineering from first principles

Schedule: 3:45pm-4:05pm

Kyle Mistele · Main Stage · Level 3

Alternative if you want a less vendor-shaped main-stage systems framing.

Official description

Code is free, software is infinite, and agents can do it all - that's the promise of the lights-off software factory, where humans interact only with tickets & specifications, and nobody reads the code, let alone writes it. We ran our own for six months, and we have the scars to prove it - bad code compounded, and agents created problems that agents couldn't solve - until we had to throw it all away. But this is a survivor's guide, not an obituary. In this talk, we'll share the challenges we encountered, what we liked, what we hated, what we're still doing, what we stopped doing, and what we started doing afterwards.

Open official schedule

automationops

4:30-5:30

Main Stage closing block: Harness Engineering, proof, and Cursor

Schedule: 4:30pm-5:30pm

Dex Horthy, Erik Meijer, Lee Robinson · Main Stage · Level 3

Stay if your energy is good. Erik Meijer’s proof/harness framing is the likely high-signal portion.

Official description

Closing main-stage block for Day 2.

Open official schedule

opsevalsgt

Day 3 - Session Day 2

Tomorrow: strongest day for computer use, skills, context, agent reliability, and production failure modes.

9:00-10:30

Main Stage morning block: survey, verifiers, perception, evals intro

Schedule: 9:00am-10:30am

Barr Yaron, Thariq Shihipar, Tariq Shaukat, Antje Barth, Benoit Schillings, Laurie Voss, Aparna Dhinakaran · Main Stage · Level 3

The verifiers and evals intro are the likely useful pieces; use this as context before the technical tracks.

Official description

Morning main-stage block for Day 3. Prioritize the verifier and eval framing.

Open official schedule

evalsops

10:45-11:05

Build-Time vs. Run-Time: Why Your Dev Tools Will Fail in Production

Schedule: 10:45am-11:05am

Averi Kitsch, Prerna Kakkar · Room 2020

Must-attend for agent builders. It targets the gap between local demos and production reality.

Official description

A dangerous pattern is evolving in the ecosystem: developers are deploying "Build-Time" tools into "Run-Time" environments. In this session, we will introduce a critical distinction for the MCP ecosystem: the difference between Build-Time Agents (Developer Assistants like Gemini Code Assist) and Run-Time Agents (End-user applications like a Customer Support bot). Drawing from our experience building the MCP Toolbox, we will demonstrate why the "Atomic" tools that make Build-Time agents powerful become catastrophic liabilities for Run-Time agents. We will provide a framework for transitioning your architecture across three key axes: Design: Moving from flexible, atomic primitives to "Composite Workflows" that encapsulate business logic. Security: Shifting from "Developer Identity" (trusted) to "Workload Identity" (zero-trust), where the agent is treated as an untrusted user. Reliability: Why production agents need "Agent-Readable" errors (natural language guidance) rather than the stack traces that developers rely on. Attendees will leave with a clear rubric for evaluating whether their tools are truly "Production Ready" or just "Prototype Ready."

Open official schedule

mustopsautomationgtaaf

Computer-use models will agentify the web, not APIs

Schedule: 10:45am-11:05am

Dhruv Batra · Room 2024

AAF alternative if browser/computer-use portal work is the priority.

Official description

We are rushing towards a world where every single digital surface (email, calendar, messaging, …, every desktop app, every phone app, every web app) that was previously meant for humans is now managed by AI agents. Of course, there are technical challenges to be solved: - Model context windows haven’t increased in 2 years. And the digital world is OOMs bigger (the ultimate “big world hypothesis”) anyway, so how does one architect this? - A large part of the digital world (most of the web) does not have APIs available and requires agents to act like humans (consume pixels, output keyboard/mouse actions). - Human preferences and the digital world change, and require agents to maintain a dynamic memory and continually learn. But even if we could solve these problems, what does this world look like? - The digital world, particularly the web, was built for human consumption (and is often hostile to bots). - For a while to come, we will be sharing the digital roadways with these digital robots. - What does end-to-end encryption and privacy mean when the other “end” of the communication is an AI agent? The Yutori team has spent the last year building the world’s best computer use model (slightly better than Opus 4.6 and GPT 5.4 while being 2x faster and 4-5x cheaper on browser use tasks), converted the web into a webhook with Scouts (agents that monitor the web 24/7 for anything you care about), and are now releasing Yutori agent that expands from the open web to your most common digital surfaces. This talk will be grounded in Yutori’s learning from what it takes to build agents that are always on, taking us one step closer to the world where every digital surface is their playground.

Open official schedule

automationaaf

11:10-11:30

Computer Use at the Edge of the Statistical Precipice

Schedule: 11:10am-11:30am

Pierluca D'Oro · Room 2024

Good agent-building session for understanding brittleness and limits in computer use.

Official description

Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent's pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework that correctly accounts for the nested structure of CUA benchmarks. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.

Open official schedule

mustautomationaaf

The Death of Keyword Search and the Rise of Agent-Readable Catalogs

Schedule: 11:10am-11:30am

Nixon Dinh · Expo Stage 3 SW · Level 1

Alternative for making records/docs navigable by agents.

Official description

As search shifts from classic keyword matching to more conversational experiences, product data quality becomes critical to LLM-powered retrieval. At PayPal, we tested how enriching traditional catalog data could help AI systems better find, understand, and rank products across large-scale commerce catalogs. We built a RAG-based AI judge to compare enrichment approaches and identify five patterns that consistently improved AI discovery results.In this talk, we'll share the evaluation framework, key lessons, and a practical approach for preparing enterprise data for conversational and agentic search.

Open official schedule

retrievalgtaaf

11:40-12:00

500 Skills, Zero Fine-Tuning: LinkedIn's Playbook for AI Agents That Actually Know Your Codebase

Schedule: 11:40am-12:00pm

Ajay Prakash · Room 2020

Very relevant to your Codex skill library: routing, maintainability, codebase-specific agent knowledge.

Official description

Everyone's building custom AI agents. We didn't. Instead, we built CAPTAIN — an MCP server that makes any off-the-shelf coding agent understand LinkedIn's entire engineering stack. The secret: a meta-tool architecture (discover → inspect → execute) and composable skills that encode tribal knowledge as executable workflows. 500+ skills later, it's used across all of LinkedIn engineering. I'll show you the architecture in 10 minutes and why context engineering beats model engineering every time.

Open official schedule

mustautomationgtops

Memory Harnesses for Long-Running Research Agents

Schedule: 11:40am-12:00pm

Stefania Druga · Main Stage · Level 3

Alternative if AAF research-agent memory is top of mind.

Official description

At Sakana AI we build agents that run for hundreds of turns to read literature, run experiments, and draft papers. The model rarely breaks. The harness around it is the weak point: the agent contradicts a decision it made 80 turns ago, redoes finished work, or drifts from the question it started on. This is the binding-constraint thesis. For long-horizon tasks, reliability is set as much by the harness as by the model as clearly instantiated in autoresearch recent efforts. This is a field guide to the harness's memory layer. I'll trace a real research agent through its lifecycle, show exactly where context rot and drift set in, and cover the patterns that hold over 100+ turns: three-tier memory, progressive disclosure, recall-first compaction, sub-agent isolation, and architectural memory beyond the vector database. I will show how to measure whether your memory harness actually helps, at the trajectory level, so you stop tuning prompts to fix what's really a state-management bug.

Open official schedule

retrievalaafops

12:05-12:25

The Dark Arts of Web Automation: Teaching Agents to Use Websites Like Humans

Schedule: 12:05pm-12:25pm

Corey Gallon · Room 2024

Best AAF-specific talk. Attend for brittle sites, portal workflows, and browser-agent behavior.

Official description

Anything you can do in a browser, your agent can do too. Not by tiptoeing through an MCP server one polite, token-burning call at a time -- properly, programmatically, the way you'd drive any other tool. I'll show you how with chrome-agent, an open source wrapper over the Chrome DevTools Protocol that has become irreplaceable in my everyday work. If you'll ever do a browser task more than once, step-by-step MCP browsing is slow, brittle, and bills you tokens for every single click. A CLI straight onto CDP makes the whole browser programmable: loop it, pipe it, script it, walk away. Write it Tuesday, run it a thousand times Wednesday, all without a second of AI agent babysitting. We'll dispel the MCP hype and myths, with successful demonstrations of cheeky things like: the power of CLI-based browsing and how its so much more capable than mere MCP; reaching through those oh-so-clever cross-origin iframes to clear the verify you're human checkboxes; showing that a JavaScript .click() is not a click, rather, just a function call in a costume that is banhammerable; ultimately, proving that a CDP browser operates just like a meatbag with a mouse and keyboard. You'll learn how to point your AI agents at real, messy, uncooperative websites and web applications and have them get things done exactly the way that you would.

Open official schedule

mustautomationaaf

Prompt, Memory, Weights: The Architecture Decisions Most AI Teams Make by Accident

Schedule: 12:05pm-12:25pm

Anant Srivastava · Expo Stage 4 SE · Level 1

Alternative for agent architecture tradeoffs.

Official description

The interesting engineering in production AI isn't in the model. Your knowledge lives in files, databases, and APIs: docs, runbooks, conversations, code. The model just reads tokens. So the real architectural question is which path that knowledge takes to inference: into the prompt directly, into memory for retrieval on demand, or into the weights through fine-tuning. Most teams treat these as a ladder. Start with prompts, escalate to RAG, eventually fine-tune, as if each step is a more advanced version of the last. The field is converging on a different answer: they solve different problems. The prompt shapes behavior and constraints. Memory grounds the model in current, citable knowledge. Weights harden specialized reasoning and format. They're not substitutes you graduate between; they're complementary, and the failures come from using one to do another's job. Fine-tuning to teach the model facts it should have retrieved is the classic trap: you bake in knowledge that's stale the day it ships, and you still can't cite it. This is an opinionated take on all three: when each is the right call, when each is a trap, and the part most teams never build, the circulation between them. Memory that captures what the agent does becomes the dataset you fine-tune on; fine-tuning changes what's worth retrieving; the loop compounds. Get the three paths right and they stop being a pipeline you climb and start being an architecture that learns.

Open official schedule

retrievalopsgtaaf

1:30-1:50

Closing the Loop: An Autonomous AI Research Agent

Schedule: 1:30pm-1:50pm

Tim Sweeney · Main Stage · Level 3

Good for AAF-style research loops: hypothesis, retrieval, evaluation, and iteration.

Official description

The holy grail of agentic AI tooling is the autoresearch loop: an agent that can sift through your experiments, create visualizations, propose a hypothesis, launch a training job, read the results, and try again autonomously. In this session, we'll show new autoresearch capabilities built directly into the W&B Models web and iOS apps. We will demo these live using a real-world fine-tuning project, covering everything from launching jobs and reading loss curves to surfacing outlier runs that consume researcher hours and recommending the next steps. Then you'll learn how the eval-driven development loop in W&B Weave makes agents like this trustworthy. You'll see how production traces become benchmarks, and how only the agents that beat the bar make it to production. Join us to learn the same loop we use to improve our own agentic features.

Open official schedule

mustautomationaaf

The Agentic Power User's Playbook: Tips and Tricks for Swarm-Style Agentic Development

Schedule: 1:30pm-1:50pm

John Lindquist · Room 2009

Alternative if you want hands-on operator practices.

Official description

You opened a fifth agent tab this morning and immediately lost track of which one was doing what. This workshop is the playbook I use daily to run swarms of agents in parallel: the keyboard shortcuts, layout patterns, supervision habits, and fast-model tricks that turn chaos into a control surface. We'll go hands-on: spawning a wall of agents across tiled panes, routing prompts to the right swarm with fast models, switching contexts in milliseconds, recovering when an agent goes off the rails, and building the muscle memory that separates a one-agent-at-a-time user from a true power user. By the end you'll leave with a stocked toolbelt of concrete shortcuts, repeatable patterns, and workspace habits you can drop into your own setup the same day. No cloud, no platform lock-in: every trick runs on the machine in front of you.

Open official schedule

automationgtops

1:55-2:15

Improving Agents is a Data Mining Problem

Schedule: 1:55pm-2:15pm

Vivek Trivedy · Room 2003

High-confidence useful: focus on mining failures and behavior data, not buying another framework.

Official description

Harness Engineering, Post-Training, Continual Learning...these all boil down to the same underlying substrate - Mining Agent Traces 1. I need to run my agents to collect Traces 2. Understand behaviors from Traces at scale 3. Filter data for "improvement" 4. Do an improvement step There's a reason why every continual learning platform ends up looking like an observability platform. It's because Traces are the lifeblood of agent improvement. The mechanism that we use to attempt improvement can vary - Harness Eng, SFT, etc. But without understanding the data agents produce, no algorithm will truly build better agents. The holy grail of Agent Improvement is Continual Learning. Consistently mining data and integrating it into the agent definition over infinitely long time horizons. Today, the easiest way to do that is to build an observability platform and constantly point agentic compute to understand the data that agents produce. We'll walk through the current methods of understanding traces at massive scale and choosing how to integrate them to improve agents across your personal agents, team agents, and entire company.

Open official schedule

mustevalsretrievalgtaaf

WTF Is the Context Layer? The Missing Infrastructure for Production Agents

Schedule: 1:55pm-2:15pm

Prukalpa Sankar · Room 2020

Alternative if context infrastructure is the priority.

Official description

In the last two years, models have gotten exponentially smarter. Two years ago they couldn't pass the bar. Today, top 1% of test scorers. And yet most agents still can't answer a simple business question correctly. You ship a demo that works. You deploy it. The business abandons it in a month. The missing variable is context: the business definitions, procedural knowledge, and operational norms that make a human expert valuable. Drawing on hundreds of production deployments, Prukalpa Sankar will break down what it actually takes to give agents contextual intelligence — and get them past the demo stage. She'll walk through the architecture of a context layer: how context repos work (versioned, testable, portable), how simulation environments catch failures before deployment, how agent traces compound back into shared context, and why context engineering scales where fine-tuning and prompting don't. She'll also cover why your context needs to be open (MCP, Iceberg, deploy to any framework) — and what happens when it isn't.

Open official schedule

retrievalopsgt

2:25-2:45

I Let Agents Refactor My Codebase for 3 Weeks. Then I Read the Code.

Schedule: 2:25pm-2:45pm

Keiji Kanazawa · Leadership 2 · Room 3020

Practical reality check for coding agents and code quality.

Official description

Lopopolo says code is a liability. Zechner got a standing ovation for "read every fucking line." I was firmly at L — letting coding agents drive a refactoring for weeks, rubber-stamping PRs, trusting the vibes. Then I actually read what they'd built and couldn't explain my own system's contracts. The interfaces weren't wrong. They were plausible. Which is worse. So I took the wheel back. But this isn't a Zechner victory lap — I'm now building better specs and evals specifically so I can move back toward L with confidence. This talk is the honest, in-progress round trip, and a framework for finding where you should sit on the spectrum today.

Open official schedule

mustautomationgtops

MCP Apps - Extending the frontier

Schedule: 2:25pm-2:45pm

Liad Yosef, Ido Salomon · Room 2020

Alternative if MCP app surfaces are your active build area.

Official description

AI agents are quickly becoming the new browsers, changing how users consume content and get work done. That shift is increasingly powered by a new generation of agentic apps that don’t just present text but deliver interactive experiences within any MCP host. By standardizing interactive UI on MCP, the MCP Apps official extension (SEP-1865) is poised to become the new agentic app runtime, serving as the backbone of the future and removing adoption obstacles that previously hindered the protocol. Join us to learn more about: The new web - How MCP Apps reshapes the traditional app landscape and transforms the way users interact with the web Deep dive into MCP Apps - - Architecture - Real-world use cases - What's ahead? - Getting started (+community and #mcp-apps-wg) - Future Vision

Open official schedule

automationgtaaf

2:50-3:10

Agents Are Where Microservices Were in 2015. We're Making All the Same Mistakes.

Schedule: 2:50pm-3:10pm

Roberto Milev, Uday Kanagala · Leadership 1 · Room 3016

Must-attend. Treat agents as distributed systems: ownership, observability, failure isolation, contracts.

Official description

Remember when everyone was shipping microservices without service discovery, circuit breakers, or distributed tracing? Agents are in that exact phase right now. Everyone's building them. Almost nobody is thinking about the infrastructure underneath. We've been deploying production agents across 120+ microservices. Here's the stack that's emerging: Runtime — containerized execution, session persistence, workspace snapshots. Solved-ish, mostly duct tape. Memory — RAG had a good run. It's not enough. Tiered memory — short-term, long-term with semantic/episodic strategies, agents deciding what to remember and forget. Observability — you can't tail -f an agent. Execution traces, reasoning chains, confidence signals — agents need their own observability stack. Testing — the biggest gap. Unit testing non-deterministic behavior, regression testing prompt changes, knowing your agent got worse before users do. Skills and tools — MCP and skill definitions as the standard interface layer — the REST APIs of the agent era. Context engineering — what the agent knows at decision time. The new performance tuning. Guardrails and auth — scoped credentials, budget limits, knowing when to stop. Least-privilege for agents. Orchestration — single vs. multi-agent, choreography vs. orchestration. Same tradeoffs as microservices, new failure modes. This talk maps the stack, draws the parallels to how we eventually got microservices right, and calls out what's still painfully missing.

Open official schedule

mustopsgtaaf

Designing Agents (The Floor Is the Frontier)

Schedule: 2:50pm-3:10pm

Ben Hylak · Room 2003

Alternative if you want more product/UX framing for agent interaction.

Official description

You know how smart your agent can be. You have no idea how dumb it gets until it does the dumbest possible thing in front of your most important user, with full access to act on their behalf. Capability isn't the bottleneck anymore, the floor is. The hard part is there's usually no objective right answer. You raise the floor by observing what your agent actually does in production, catching the dumb thing the moment it happens, and closing the loop so it never happens twice.

Open official schedule

automationgt

3:20-3:40

Inference is the New Training Loop: Architecting High-Reliability Agents and Continuous AI Systems

Schedule: 3:20pm-3:40pm

David Corbitt · Leadership 2 · Room 3020

Strong reliability framing for agents that improve while they run.

Official description

For agentic AI and complex, multi-step workloads, the inference environment is the engine for continuous improvement, not a final deployment step. This talk focuses on engineering the full AI loop: tightly integrating inference with reinforcement learning (RL) and evaluation. Learn how to leverage native observability, serverless RL, and optimized inference stacks to continuously refine model behavior based on production traces, delivering agents that are reliable, auditable, and constantly evolving.

Open official schedule

mustevalsopsgtaaf

Sandboxes Aren't Optional: Runtime Isolation Patterns for Coding Agents at Scale

Schedule: 3:20pm-3:40pm

Robert Brennan · Room 2010

Alternative if safe execution and isolation are active problems.

Official description

Last year, an AI coding agent wiped a production database during a code freeze, ignored explicit instructions to stop, then told the developer recovery was impossible. (It wasn't.) That's what happens when your security model is "we told the agent to be careful." When agents can write code, run tests, make API calls, and push commits, security is no longer a prompt engineering problem. It's a runtime isolation problem. This talk covers the patterns we follow at OpenHands and that you can steal wholesale: Docker and Kubernetes isolation, per-agent file system scoping, network egress controls, RBAC for multi-tenant deployments, and the full audit trail every enterprise security team demands. We'll walk through the three most common failure modes we see when teams skip proper isolation, including one case where an agent helpfully committed secrets to a public repo. You'll see a live demo of 50 parallel sandboxed agents running against a real codebase, with resource limits, timeout enforcement, and graceful degradation when agents hit unexpected states. You'll leave with a sandbox checklist and reference Kubernetes config. Bounded autonomy isn't a limitation on agent capability. It's what makes production trust possible.

Open official schedule

opsautomationgt

3:45-4:05

LLM Knowledge Bases: a practical guide

Schedule: 3:45pm-4:05pm

Ben Holmes · Room 2003

Useful for making project knowledge durable and queryable across docs, logs, records, and memory.

Official description

Putting thoughts to paper (or keyboard, or transcription model) refines your thinking, connects ideas, and pulls context out of your brain for others to learn from. But while taking notes can be fun, organizing those notes is not. Flat lists turn to folders turn to tags and taxonomies that grow unwieldy beyond the first hundred entries. If you can’t find what you wrote down yesterday, or you miss connections to related ideas, you’re missing the value of notetaking: learning from what you notate. Agents dramatically expanded what’s possible here. Combined with Markdown-backed apps like Obsidian to make notes agent-accessible, you can build a second brain that works for you, not the other way around. Andre Karpathy has popularized LLM knowledge bases, and I want to take it further with concrete workflows you can use to organize your thoughts with agents. We’ll explore a number of Obsidian workflows to make this possible: - Automations to organize notes with tags, folders, backlinks, and deduplication to level-up search and discovery - More automations to have agents expand your thinking by auto-recording ideas while you sleep - Building an agentic writing partner to surface related ideas in real time and answer questions as you type (or as you speak) - Voice monologuing and summarization tools to lower the friction of transcibing thoughts into well-formatted notes You’ll walk away with a new appreciation for notetaking, and a second brain that leaves you 10x smarter than your brain alone. Talk format: Code and live tech demos. I will set up all of these automations and tools from scratch, and show agents executing each of them live. I will share the source for all automations as well.

Open official schedule

mustretrievalgtaaf

Day 4 - Session Day 3

Final day: agentic engineering, harnesses, skills, verifiers, memory, and distributed systems.

9:00-10:30

Main Stage block: skills, inference, harness engineering, Anthropic, and graphs

Schedule: 9:00am-10:30am

Matt Pocock, John Ousterhout, Maxime Rivest, Isaac Miller, Mike Krieger, Emil Eifrem · Main Stage · Level 3

Start here. The skills and task/model separation talks are most relevant to building agents.

Official description

Day 4 main-stage block: skills, inference systems, task/model separation, Anthropic build lessons, and graphs.

Open official schedule

mustopsautomationgtaaf

10:45-11:05

Your Agent Can't Tell If It's Right

Schedule: 10:45am-11:05am

Willem Pienaar · Expo Stage 2 NW · Level 1

Best fit for building reliable agents: correctness, self-checking, and verification.

Official description

Coding agents feel reliable because of one signal you never think about: the tests. They catch confident mistakes in seconds, so you never see most of them. The real world has no test suite. Put an agent in production and that signal is gone, and a wrong answer looks the same as a right one. So how do you know it's right? We watched our agent look at an 80% drop in throughput and report zero user impact, because a similar alert the month before had been noise. The data to catch it was already in front of it. There is no single verifier, but there are several weaker signals. While the agent reasons: grounding each claim against live data, and looking for evidence that distinguishes competing hypotheses. Before it acts: calibrated confidence, and a separate critic. After it acts: whether the fix held, whether the alert returned, whether an engineer redid the work. None is conclusive on its own. Combined, they estimate whether the agent was right. The talk covers where these signals come from, how we combine them, and how often they still disagree.

Open official schedule

mustevalsopsgtaaf

Operating Distributed Inference Systems at Scale

Schedule: 10:45am-11:05am

Nishant Gupta, Naman Ahuja · Room 2016

Alternative if model-serving operations are the bottleneck.

Official description

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability. In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems. Topics include distributed inference architectures for large-scale AI systems, GPU scheduling and elastic compute for inference workloads, multi-tenant inference infrastructure, caching, batching, latency optimization strategies, reliability and fault isolation for inference systems, observability and control loops for AI serving platforms, balancing cost, throughput, and user experience, and why inference is becoming an infrastructure orchestration problem. Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

Open official schedule

opsgt

11:10-11:30

MCPs, CLIs, and Skills: Choosing the Right Tooling Layer for Agentic Development

Schedule: 11:10am-11:30am

Nikita Kothari · Main Stage · Level 3

Almost custom-fit to your workflow. Attend for when to use skills, CLI tools, MCP, or APIs.

Official description

Agentic development needs more than one interface: MCPs provide clean, portable connectors to services, with built-in patterns for security and auth. CLIs offer composability, debuggability, and workflows developers already trust. Skills teach agents how to use a wide variety of tools and MCPs effectively without overloading context.

Open official schedule

mustautomationgtaafops

Your AI Agent Has No Nervous System

Schedule: 11:10am-11:30am

Matt Gibiec · Expo Stage 4 SE · Level 1

Alternative if observability/control loops for agents feel more urgent.

Official description

Most agents ship with solid evals and zero runtime observability. When something breaks in production — wrong answer, runaway retry loop, or silent tool failure — you're debugging blind. You can see the output, but you can't see what the agent believed when it made the decision. This talk walks through how to instrument agentic pipelines with OpenTelemetry: capturing system context at every step, making prompt state and tool call outcomes visible as structured data, and governing token consumption as SLOs instead of discovering overruns on an invoice. Attendees will leave with three takeaways: an understanding of telemetry for multi-step agentic workflows, a pattern for capturing system context at the span level so teams know exactly what the agent saw before it acted, and a framework for visibility into token budget and behavioral drift before something goes sideways in production. Telemetry is the nervous system. System context is the memory. Token budgets are the vital signs. None of it is optional.

Open official schedule

opsevalsgtaaf

11:40-12:00

Guide, Verify, Solve: The Engineering Discipline Agentic Development Demands

Schedule: 11:40am-12:00pm

Anirban Chatterjee · Room 2020

Best non-vendor pick for building agents with verification discipline.

Official description

Agentic development is not a productivity story: it's a reliability engineering problem at a scale most teams have never faced. Long-running agent tasks fail at alarming rates, pull requests have grown from 50 lines to 5,000, and cognitive surrender is real—the more capable AI output appears, the less humans interrogate it, right at the moment the stakes are highest. Independent, peer-reviewed research from Carnegie Mellon studying 807 open source projects found that AI agent adoption caused a persistent 30% increase in code analysis warnings and a 41% increase in complexity — with long-term development velocity declining as a result. Agents don't just write code faster, they accumulate debt faster, too. The answer is not to slow agents down, it's to govern and refine the loop they operate inside. Sonar's Agent Centric Development Cycle (AC/DC), defines that loop across three continuous stages: guide agents with project-specific context and constraints before a single line is written; verify rigorously and continuously inside the loop, not downstream in CI; and solve issues automatically before they ever reach a manual review. The deeper insight is that this is not primarily a security story. It's an efficiency story. Codebases riddled with complexity make agents slower, less reliable, and significantly more expensive to run. Every token spent navigating legacy debt is a tax on every future agent run. Well-maintained, low-complexity codebases mean fewer failures, fewer tokens, and faster iteration. The teams that instrument this loop now will do more than ship safely: they'll compound their advantage every time an agent touches their codebase. Verification isn't a cost center. In an agentic world, it's a competitive moat.

Open official schedule

mustevalsautomationgt

The Art of Building Verifiers for Computer Use Agents

Schedule: 11:40am-12:00pm

Miguel González Fernández, Corby Rosset · Expo Stage 1 NE · Level 1

AAF alternative for verifier design around browser/computer-use agents.

Official description

Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and the judges are wrong constantly. WebVoyager misses 45% of failures. WebJudge misses 22%. Used as RL reward, you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research: false positive rate near zero, Cohen's κ matching human-human agreement. Four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued.

Open official schedule

evalsautomationaaf

12:05-12:25

Benchmarking Coding Agents on New vs Legacy Code bases

Schedule: 12:05pm-12:25pm

Denys Linkov · Room 2020

Useful if you care about Codex/BirdieBuilder performance on real legacy Laravel code.

Official description

You have an old code base with 100,000s of lines of code, should you let an AI Agent refactor or do you wait until you have a cleaner setup? Last year we refactored a number of code bases and ran evaluations on how well different models, harnesses and rule sets affected multiple versions of the code base. This talk will feature specific code examples as well as a broader set of evals.

Open official schedule

mustevalsgt

Harness Engineering: Building the Production Cage for Powerful Domain Agents

Schedule: 12:05pm-12:25pm

Mike Chambers · Main Stage · Level 3

Alternative if you want domain-agent runtime boundaries.

Official description

Every agent is a while loop. The model takes strings in and produces strings out. We've all written it, debugged it, shipped it. And yet every team building agents is still re-inventing the same session management, truncation logic, tool wiring, and memory plumbing from scratch. The hard part is the harness: session isolation, context management, memory persistence, sandboxed execution, observability. The machinery that makes a model dependable in production. Most of the failures we see in deployed agents (context rot, premature completion, tool bloat) trace back to harness problems, not model problems. This talk covers what a harness actually does, why "harness engineering" suddenly showed up in engineering posts from everyone, and what changes when you stop building harnesses by hand. In live demos, we'll build the same agent three ways: hand-rolled Python, framework-generated, and fully managed through a single API call. Each level shifts the failure modes from infrastructure plumbing to engineering judgment, where the real questions are what context to preserve, when to verify, and how to keep an agent from finishing half the job and calling it done. The harness handles the machinery. You still have to engineer the behavior.

Open official schedule

opsevalsgtaaf

1:30-1:50

Coding Agents Don't Scale Themselves. Neither Do Your Teams.The Rise of Agent Enablement.

Schedule: 1:30pm-1:50pm

Patrick Debois · Leadership 2 · Room 3020

Strong agent enablement/org-design pick; more relevant than another product demo.

Official description

Every company wants to know how others are actually scaling AI coding. But it's hard to get past the generic transformation stories. What are the new practices showing up in real engineering orgs? What does maturity actually look like, and what separates teams that are moving from teams that are stuck? What are the patterns for enabling humans and agents, together? Patrick Debois has been collecting the practices and patterns, talking to the early Agent Enablement teams already on the job, team leads, and VPs of Engineering. What's showing up is a new function: a team that enables other teams to get real leverage out of their agents. This talk takes the [Context Development Lifecycle](https://tessl.io/blog/context-development-lifecycle-better-context-for-ai-coding-agents/) off the individual laptop and onto the org chart, grouped across three pillars: - **Enablement.** From individual experimentation to team and org-level fluency with agents. - **Platform.** Agent tooling that runs like a real delivery pipeline: fast, observable, cost-aware. - **Governance.** Ad-hoc guardrails growing into real evaluation, telemetry, and accountable agent work. For Agent Enablement leaders scaling it out across the org. For team leads looking to help their teams get better at this. For VPs ready to unblock the friction and unlock what agents can actually do. *Coding agents don't scale themselves. This is the talk about who does*

Open official schedule

mustautomationopsgt

Codex, Behind the Harness

Schedule: 1:30pm-1:50pm

Dominik Kundel · Room 2020

Alternative if you want Codex-specific internals and harness design.

Official description

Agents have evolved a lot in the last year both in capabilities and in the overall structure. Increasingly sandbox-powered coding agents are breaking out to do general purpose work. In this talk we’ll be taking apart the open-source Codex agent harness. Understand how it works, what makes it so suitable to do work beyond coding tasks, how it handles key aspects like context management, tools and file system access. We’ll also tie these back to concrete actions you can take to bring these patterns into your own agents, whether you are building on top of the Codex agent or building your own.

Open official schedule

automationgtops

1:55-2:15

Multiplayer agentic engineering: enabling your whole team and your best agents to work together

Schedule: 1:55pm-2:15pm

Arjun Singh · Room 2020

Best for your team+agent workflow interests.

Official description

For a solo developer, coding agents are a superpower. For a team, they surface new kinds of bottlenecks: coordination, visibility, review, and shared context. We wanted our whole team and our best agents to work together, with no work or context trapped on any one developer's machine. So we pressed pause on the product we were building to create a multiplayer cloud workspace for agentic engineering. This talk shares five key practices we've learned from building and using our platform: Turn every surface the team uses into an agent interface. Kick off sessions from Slack, review via iOS app, iterate in GitHub comments, ship from web. Agents run in the cloud, so work keeps moving even when your laptop is closed. Make agent work visible and collaborative across the whole team. Every agent session is shared, has a live app preview, and an agent-guided code review. This allows engineers, PMs, and designers to steer and evaluate agent work collaboratively. Turn every external signal into shipped code your team can quickly evaluate. Automatically turn customer emails, meeting action items, and bug reports into agent implementations that the whole team can review. Set up shared cloud dev environments so agents aren't siloed to individual machines. Secrets, role-based access, and network controls shared across the whole team. Fast environment startup, so you're not giving up speed by moving off local. Benchmark agents on your own codebase. Claude Code, Codex, Gemini, Amp, OpenCode — how do you know which is actually better on your stack? We'll cover using your merged PRs as ground truth to build a "Personal SWE-Bench" for your codebase. Agentic engineering is going multiplayer. This is how your team gets there.

Open official schedule

mustautomationgtops

MCP doesn’t suck — your agent does

Schedule: 1:55pm-2:15pm

Jan Curn · Expo Stage 2 NW · Level 1

Alternative if you want tooling-layer lessons and failure modes.

Official description

Most AI agents misuse MCP and treat tools as prompt-time function calls: tool definitions and results are repeatedly injected into the context, tokens are wasted, and context rots. The result? Slower, less reliable agents, and the misleading conclusion that “MCP sucks, CLIs are better.” To challenge this narrative and show how agents can get the best of both MCP and CLI, at https://apify.com/ we’ve built mcpc (https://github.com/apify/mcpc), an open-source universal CLI client for MCP. It maps MCP operations to intuitive CLI commands, which agents quickly pick up through --help without external skills. It turns out, CLI is the perfect local interface for agents to interact with MCP, giving them access to full protocol capabilities including modern features like code mode or progressive tool discovery through a single Bash() tool call, while leveraging MCP’s standard remote interface for server discovery, authentication, payments, and access control. To once and for all kill the MCP vs. CLI debate and show those two technologies are not exclusive but complementary, we’ll present evals comparing performance of agents using naive MCP, modern MCP, native CLIs, other MCP CLIs, and mcpc, in various real-world scenarios.

Open official schedule

automationgtaaf

2:25-2:45

Always-on agents run production without the on-call tax

Schedule: 2:25pm-2:45pm

Justin Smith · Room 2020

Useful for supervised production agents, alerts, and operational boundaries.

Official description

Most production teams have the same problem. The work that keeps systems healthy- deployment checks, on-call handoffs, anomaly reviews- never makes it into a sprint. It falls to whoever has bandwidth, gets done inconsistently, and disappears when people are stretched thin. Background agents fix this by running that work on a schedule, using the same production context a senior engineer would, without waiting for someone to initiate it. Justin Smith, Founding Engineer at Resolve AI, walks through the architecture behind always-on agents, the use cases teams are starting with today, and what we have learned from running them in our production environment.

Open official schedule

mustopsautomationgtaaf

Lessons From Building The World's Largest Knowledge Graph

Schedule: 2:25pm-2:45pm

Jeffrey Wang · Room 2014

Alternative if graph/data modeling for AAF is more important.

Official description

_Exa set out to index and embed the entire web as a queryable knowledge graph — the substrate behind neural search and the enrichment layer powering modern GTM data. Co-founder Jeffrey Wang shares the hard engineering lessons: crawling and embedding at web scale, keeping a graph fresh and trustworthy, and the retrieval architecture that lets agents pull grounded facts instead of hallucinations. Why the knowledge graph — not the model — is becoming the moat for AI-native GTM._

Open official schedule

retrievalaafgt

2:50-3:10

AI Agents Are Just Distributed Systems Now

Schedule: 2:50pm-3:10pm

Salman Munaf · Leadership 1 · Room 3016

Must-attend systems framing: queues, retries, contracts, failure containment, observability.

Official description

AI agents are often described as a new kind of software, but once they move beyond chat and start calling tools, reading data, making decisions, retrying tasks, and coordinating workflows, they begin to look a lot like distributed systems. They have state. They call external services. They depend on APIs. They fail partially. They retry. They time out. They can loop. They can act on stale context. They can produce inconsistent results. And when something goes wrong, teams need logs, traces, permissions, ownership, and rollback paths just like they do with any other production system. This session will give engineers a practical way to reason about AI agents using familiar distributed systems concepts. We will break down the agent loop: planning, tool use, observation, memory, and retries. Then we will map common agent failure modes to engineering patterns teams already know, including timeouts, circuit breakers, idempotency, rate limits, least privilege, observability, and human approval. The goal is to move past the hype and treat agents like real production systems. Attendees will leave with a clear mental model for designing, debugging, and operating agents safely, especially as they become part of customer-facing products, internal developer tools, and business workflows.

Open official schedule

mustopsgtaaf

No Memory, No Harness: Why the Database Is the Last Line of Defense

Schedule: 2:50pm-3:10pm

Kay Malcolm · Main Stage · Level 3

Alternative if database-backed harness design is more urgent.

Official description

The model is the easy part. Everything that makes an agent survive contact with production lives in the harness around it: orchestration, tooling, governance, and the memory core that keeps the system grounded when the model itself is probabilistic, forgetful, and non-deterministic. This talk walks the surface areas of an agent harness and consolidates the lessons we're learning as we ship them, from agentic applications in their current form (autonomous systems that now build their own automations) to the continual-learning loops that let agents improve from their own experience. We'll look at how the discipline is segmenting. AI application development is no longer one role but several: agent engineers, memory engineers, and platform engineers. We'll map Oracle's primitives onto each as the current state of harness engineering takes shape. We'll also examine the two populations betting on this stack at once, enterprise customers who need governance, reliability, and scale, alongside the cracked developers who need fast, composable primitives, and why a well-engineered harness serves both. And we'll make the case that has held through every shift in the stack: memory isn't a feature you bolt on, it's the foundation the rest of the harness stands on. The database remains the memory core, and when everything above it is probabilistic, it's the last line of defense.

Open official schedule

retrievalopsgtaaf

3:20-3:40

Give the Agent a Budget, Not a Token

Schedule: 3:20pm-3:40pm

Sachin Malhotra · Leadership 2 · Room 3020

Useful for cost and resource governance around long-running agents.

Official description

Every agent demo runs with a god-token. Then it ships, and someone has to explain why the helpful AI just rm -rf'd the staging database "to clean up." I run platform infrastructure at a frontier lab, and for the last year my job has partly been: let coding agents do real work against real systems, without ever having to write the postmortem. This talk is the permission model that fell out of that - not RBAC-with-extra-steps, but primitives designed for an actor that's smart, fast, tireless, and occasionally *confidently wrong*. **The four primitives:** - **Asymmetric verbs** - the agent can `quarantine` but not `delete`, `retry` but not `approve`, `propose` but not `merge`. The verb list *is* the security boundary. Stop thinking in resources, start thinking in reversible vs. irreversible actions. - **Regenerating budgets** - every agent identity gets N disruptive actions per window. Burn the budget, you're benched until it refills. No human-in-the-loop until the budget's gone — which means 95% autonomy with a hard ceiling on blast radius. - **The undo test** - if the agent can't undo it, the agent can't do it without a second key. One line, surprisingly load-bearing. - **Tripwires over allow-lists** - let the agent roam, but instrument the three actions that would actually hurt. Cheaper than enumerating everything safe. I'll show the ~200-line policy layer that implements all four, the failure modes each one exists to catch, and the one design I shipped that turned out to be security theater. Tool-agnostic - works whether your agent is touching CI, a database, a cloud account, or your users' files. If you're shipping an agent that does anything more than read, you'll leave with a threat model and a starting policy you can paste into your repo on the flight home.

Open official schedule

opsgtaaf

Agent Memory Is a Solved Problem. Agent Learning Is Not.

Schedule: 3:20pm-3:40pm

Karthik Ranganathan, Heather Downing · Expo Stage 1 NE · Level 1

Alternative if learning loops are the priority.

Official description

The failures that break multi-agent systems are not reasoning failures, they are handoff failures. One agent works something out and the knowledge dies in its private context, because the only thing that crosses the boundary is output. Memory made each agent better in isolation and changed nothing about what the group knows. The missing primitive is supervised promotion: a deliberate decision about which private learning is worth sharing, moved into common knowledge with the reasoning attached, so trust survives the handoff. Today a human makes that call, and promoted knowledge resolves on read, in any tool, with no retrain or reindex. Those calls are also the training signal for what comes next: orchestrator agents, trained on what matters to the people they serve, that promote on their own. This talk covers how our collective knowledge grew as we approached memory promotion, including what the first build got wrong, and a live look at it working between humans and agents.

Open official schedule

evalsretrievalgtaaf

3:45-4:05

Agents Without Code: How Skills, YAML, and Filesystems Replaced Python

Schedule: 3:45pm-4:05pm

Philipp Schmid · Main Stage · Level 3

Good closing agent-build talk for file/skill-based systems and low-code orchestration.

Official description

Six months ago, building an agent meant writing a Python class with a `while` loop, tool definitions in dicts, manual state management or writing custom python functions. Today, you define an agent in a YAML file, drop a `SKILL.md` into a folder, and deploy. This talk traces the arc from "Agent in Python" to "Agent as filesystem". You'll learn the same agent built three ways: the hard way (Jan 2025), the simple way (Oct 2025), and the zero-code way (today).

Open official schedule

mustautomationgtops

Why We Killed Our Multi-Agent Pipeline: Lessons From Pharma Commercial Intelligence

Schedule: 3:45pm-4:05pm

Subbiah Sethuraman, Abhilash Asokan · Room 2005

Alternative if you want anti-patterns and simplification lessons.

Official description

Key takeaways: A practical design principle for agentic systems in regulated, high-stakes domains: derive the architecture from agent behavior, don't impose it. Concrete patterns the audience can apply this week — domain knowledge graphs as agent context, deterministic preprocessing as a complement to agentic reasoning, reference-based context management. An honest case study from production: what worked, what didn't, and the open architectural questions we're still working on. Abstract : We lead the architecture and AI engineering org behind ZS Associates' commercial intelligence platform for pharmaceutical brand teams. The product has two surfaces: a proactive alert system that delivers signal-driven intelligence packets when a brand's KPIs move, and a conversational analytics chat where business users ask ad-hoc questions. A year ago we built both surfaces as separate V1 stacks. They broke in different ways. The diagnosis was the same: we had decided on the structure before we knew what the agent actually needed. This talk is about the design principle that came out of rebuilding both — and what it produced. The architecture is derived, not designed. We stopped trying to predict what scaffolding the agent would need and started designing the system around what the agent's behavior, on real production tasks, actually demanded. Tools, context, structure, and guardrails get introduced at the points where the agent's reasoning needs them — and nowhere else. What that produced is an architecture that's smaller than V1, not bigger. A single agent owns each investigation end-to-end across both surfaces, launching parallel sub-agents when the work needs them — not according to a pre-defined topology. A pharmaceutical commercial knowledge graph — HCPs, accounts, payers, territories, brands, KPIs and the relationships between them — gives the agent the domain context it needs without prompt-engineering heroics. Statistical signal detection runs deterministically before the agent wakes up, so the agent's job is to explain signals, not find them. Raw query results stay out of the context window through a reference-pattern that lets the agent reason over data without drowning in it. Each of those decisions came from watching an agent struggle on a real task and asking what does it need here? — not from sketching the architecture in a doc and forcing the agent into it. The patterns generalize. If you're shipping agents over messy enterprise data — finance, supply chain, claims, operations — the failure modes and the fixes will look familiar. We'll close with the open questions and the pieces we haven't solved yet.

Open official schedule

opsgtaaf

Concept Watchlist

Second-pass research notes: concepts to actively listen for, even when the exact session is not on the main path.

Verifiers beat vibes

Must internalize

Across evals, code review, computer use, and public-record review, the recurring pattern is generator plus verifier plus calibrated suppression. This is the main design idea to bring home.

Related sessions: In the Land of AI Agents, the Verifiers Are King; The Art of Building Verifiers for Computer Use Agents; Guide, Verify, Solve.

Context is infrastructure, not prompt stuffing

High

Several talks converge on the same point: agents need explicit context layers, compaction, memory boundaries, and retrieval contracts. This matters for BirdieMemory and AAF record intelligence.

Related sessions: The Infinite Context Window Is a Myth; WTF Is the Context Layer?; Context Engineering in 2026; From Systems of Record to Systems of Context.

Agent memory is operational data

High

Memory should be mined from traces, failures, user corrections, and durable records. Treat it like a data product with provenance, not just a vector store.

Related sessions: Improving Agents is a Data Mining Problem; From Context to Memory; Memory Harnesses for Long-Running Research Agents; Agent Memory Is a Solved Problem. Agent Learning Is Not.

Harnesses are the production boundary

High

The useful abstraction is not an agent framework. It is a harness: task definition, tools, permissions, evals, state, budget, logs, and rollback.

Related sessions: No Memory, No Harness; Harness Engineering: Building the Production Cage; The Unreasonable Effectiveness of Separating the Task from the Model; Everyone's Building a Harness. Nobody Calls It That.

Agents are distributed systems

Must internalize

The best talks keep circling back to ownership, queues, retries, idempotency, observability, contracts, and failure containment. This maps directly to AAF queues and Birdie jobs.

Related sessions: Agents Are Where Microservices Were in 2015; AI Agents Are Just Distributed Systems Now; Productionizing LLM Gateways; Operating Distributed Inference Systems at Scale.

Skills need budgets and decay tests

Medium-high

Your Codex skill library is an asset, but long instructions can rot, conflict, or overflow useful context. Look for patterns that keep skills small, testable, and routed.

Related sessions: 500 Skills, Zero Fine-Tuning; How long can your skills be before your agent forgets what you told it?; We Vetted 2,000 AI Skills Before They Reached Developers.

Sandboxing is not optional for coding agents

High

Any agent that writes code, calls tools, or browses sites needs execution isolation, filesystem/network rules, approval gates, and replayable traces.

Related sessions: Sandboxes Aren't Optional; Your agent needs a sandbox, not a desert; How I learned to stop worrying and love the sandbox; Docker approval-loop sessions.

Cost is a control plane signal

Medium-high

Token spend, latency, retries, model routing, and tool budgets should be visible and enforceable. Budget is part of autonomy, not just accounting.

Related sessions: FinOps for AI Agents; Give the Agent a Budget, Not a Token; Latency Is a Budget; Stop Model Shopping.

Ontology and graph thinking may be worth the vendor tax

High for AAF

This is one vendor-shaped area that passes the confidence check for AAF: cases, people, charges, sources, evidence, aliases, and provenance are naturally graph-shaped.

Related sessions: Why Agentic Systems Need Ontologies; Your Moat Is Your Data Model; From Systems of Record to Systems of Context; CrabRAG.

Computer-use agents need verifiable replays

High for AAF

For portal work, the core need is not just browser automation. It is observable sessions, respectful rate limits, replay/debug artifacts, and deterministic extraction checks.

Related sessions: The Dark Arts of Web Automation; Computer Use at the Edge of the Statistical Precipice; Bringing agents onto the world wide web; Browserbase expo conversation.

Adoption fails when agents ship garbage

Medium

For GolfTroop/Birdie, the human workflow around agent output matters as much as model quality: review load, trust, suppression, and rollback.

Related sessions: How to Get Your Org to Adopt Coding Agents; Coding Agents Don't Scale Themselves; I Let Agents Refactor My Codebase for 3 Weeks.

Benchmarks are weak unless they match your work

Medium

Use conference benchmarks as ideas for eval design, not as buying criteria. Your real benchmark is Laravel code, AAF crawl/review quality, and operator time saved.

Related sessions: Are LLM Performance Benchmarks Reliable?; Benchmarking Coding Agents on New vs Legacy Code bases; Terminal-Bench/eval workshops.

Expo Vendors To Talk To

Use the expo for vendor discovery. Keep the questions concrete and force answers back to your agent-building needs.

Braintrust

High

Evals, datasets, experiments, trace review, and failure analysis for production agents.

Ask: Ask how to ingest OpenTelemetry traces from Laravel/AAF, build judge rubrics, promote failures into datasets, and alert on eval drift.

Arize

High

Agent observability and evaluation. Useful if you want traces, spans, and quality monitoring that operators can understand.

Ask: Ask about OpenTelemetry span schemas, online evals, hallucination/cost monitoring, and comparing agent versions.

Browserbase

High for AAF

Browser automation infrastructure is directly relevant to portal indexing and computer-use agents.

Ask: Ask about session persistence, Playwright traces, respectful rate control, login isolation, captcha boundaries, and replay/debug artifacts.

Cloudflare

High

Workers, Durable Objects, Queues, Access, Browser Rendering, and edge deployment fit small agent services and public tools.

Ask: Ask about durable agent state, queue retries, Access-protected operator tools, Pages/Workers split, and browser rendering limits.

Neo4j

High for AAF/data modeling

Graph and ontology work maps to people/cases/charges/sources and Birdie operational knowledge.

Ask: Ask for Postgres-to-graph patterns, graph RAG without over-selling, and how to model evidence provenance.

Docker

High

Sandboxing, approval loops, and repeatable execution environments matter for coding agents.

Ask: Ask about secure agent execution, filesystem/network isolation, local-to-prod parity, and human approval gates.

Temporal / Inngest / Orkes

High

Durable workflow engines are relevant for long-running crawls, retries, review queues, and human-in-the-loop agents.

Ask: Compare idempotency, visibility, retry semantics, schedules, event history, and operator intervention.

Sentry / Datadog

Medium-high

Operational visibility for agent failures, Laravel errors, queues, and user-facing regressions.

Ask: Ask how agent traces connect to normal app errors, deploys, queues, and root-cause workflows.

LangChain / LlamaIndex

Medium

Useful ecosystem conversations, but avoid framework lock-in. Mine them for patterns, not defaults.

Ask: Ask about eval-first agent design, tool calling boundaries, retrieval debugging, and migration exits.

Qodo / Sourcegraph / Greptile

Medium for GolfTroop

Code review and codebase-agent vendors may have useful verifier and context-ingestion patterns.

Ask: Ask how they suppress low-confidence comments, dedupe static-analysis findings, and measure accepted suggestions.

Firecrawl / Apify / Bright Data

Medium for AAF, caution

Web extraction and data access may help, but AAF needs safe, respectful indexing over growth-hack scraping.

Ask: Ask specifically about robots/rate-limit controls, provenance, replayable extraction, and avoiding brittle/captcha-heavy flows.