
Agents Leave the Lab — What 2026 Shows

agentic-ai, enterprise-ai, ai-strategy, governance, regulated-industries, ai-architecture


From Demo to Production: a Quiet but Decisive Shift

At PyCon DE & PyData 2026, something shifted that's hard to capture in a headline: the tone of the agent talks turned. Where 2024 and 2025 were dominated by "look at what's possible" demos, this year teams talked about "running since February", "had to be re-tuned three times", "fails in exactly these categories". That sounds unspectacular. It is the most important shift of the year.

What surfaced there was not a new agent euphoria, but its opposite: sobriety as a sign of maturity. The teams that ship don't talk about model magic any more. They talk about context budgets, fallbacks, tool boundaries, and evaluation. In short: about the harness.

By harness I mean the whole of context control, tool boundaries, eval pipeline, fallback logic, and approval model — the operational layer that makes a model production-ready. It is, at the same time, technical architecture, governance frame, and investment decision. Three patterns from the conference recurred in the systems that hold — and three in the ones that dazzle in the demo and reliably break in production.

Sebastian Raschka and Alexander C. S. Hendorf in the fireside chat "Stop Waiting, Start Shipping" at PyCon DE & PyData 2026

Three Patterns That Hold in 2026

Strict Context Management, Not Open Memory

The systems that work treat context not as unlimited memory but as a scarce resource with explicit economics. They know what gets in, what stays out, and when context is pruned. Sebastian Raschka put it dryly in his fireside chat: the secret behind working coding agents isn't the model — it's prompt and cache management. Repo history, conversation, plan — all of it has to flow in, but not all at once. Without active curation, the system's behaviour drifts from session to session. Context management, then, is not a prompting technique. It is state management.
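What that looks like in code is unglamorous. Below is a minimal sketch using only the standard library; the names (ContextItem, ContextBudget) and the crude four-characters-per-token estimate are illustrative, not taken from any framework. The point is that admission, eviction, and ordering are explicit state transitions rather than prompt fragments.

```python
from dataclasses import dataclass, field

# Illustrative sketch: context as a budgeted resource. ContextItem and
# ContextBudget are hypothetical names, not from any framework.

@dataclass
class ContextItem:
    source: str    # e.g. "repo_history", "conversation", "plan"
    text: str
    priority: int  # lower number = evicted first

@dataclass
class ContextBudget:
    max_tokens: int
    items: list[ContextItem] = field(default_factory=list)

    def add(self, item: ContextItem) -> None:
        # Admission immediately triggers eviction if the budget is exceeded.
        self.items.append(item)
        while self._estimated_tokens() > self.max_tokens and self.items:
            self.items.remove(min(self.items, key=lambda i: i.priority))

    def _estimated_tokens(self) -> int:
        # Deliberately crude estimate: ~4 characters per token.
        return sum(len(i.text) // 4 for i in self.items)

    def render(self) -> str:
        # Deterministic ordering keeps behaviour stable across sessions.
        ordered = sorted(self.items, key=lambda i: -i.priority)
        return "\n\n".join(i.text for i in ordered)
```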

Deterministic Fallback Paths

Every robust system has a path that works without an LLM. This is not nostalgia for pre-AI engineering. It is honest acknowledgement that a language model does not get more available, cheaper, or more explainable the deeper it sits in a stack. In regulated contexts the point is non-negotiable: without a deterministic fallback there is no audit trail, no traceability, no sign-off. The fallback is not the system's emergency exit. It is the proof that the system has been understood.
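A minimal sketch of the pattern, with hypothetical names and a stubbed model call standing in for the real one: the LLM path is tried first, and any failure routes to a rule-based classifier that is deterministic, explainable, and logged for the audit trail.

```python
import logging

logger = logging.getLogger("triage")

# Hypothetical rule set and ticket categories, for illustration only.
KEYWORD_RULES = {
    "refund": "billing",
    "invoice": "billing",
    "password": "account",
    "crash": "technical",
}

def classify_with_llm(ticket: str) -> str:
    # Stand-in for a real model call; raising here simulates an outage.
    raise TimeoutError("model unavailable")

def classify_with_rules(ticket: str) -> str:
    # Deterministic path: same input, same output, fully explainable.
    for keyword, category in KEYWORD_RULES.items():
        if keyword in ticket.lower():
            return category
    return "general"

def classify(ticket: str) -> str:
    try:
        category = classify_with_llm(ticket)
        logger.info("llm path, category=%s", category)
    except Exception:
        category = classify_with_rules(ticket)
        logger.warning("fallback path, category=%s", category)
    return category
```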

Evaluation-Coupled Releases

Production-stable teams do not ship "when it looks good" — they ship when the evaluation pipeline is green. That presupposes the pipeline exists, which in many programmes happens late, often too late. Where eval discipline was built in from week one, the conference showed systems with clear version trajectories and traceable improvement curves. Where it was missing, teams could not say with any confidence when their system had actually got better. Eval is not quality assurance at the end. It is the only layer in which "better" has any meaning.
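Coupling release to evaluation can be as plain as a gate script in CI. The sketch below assumes a hypothetical eval_suite.py that writes a JSON report, and the thresholds are illustrative; what matters is that "green" is a machine-checked condition, not a judgment call.

```python
import json
import subprocess
import sys

# Illustrative thresholds; a real gate would version these with the system.
THRESHOLDS = {"task_success": 0.90, "tool_error_rate": 0.05}

def gate() -> None:
    # Run the (hypothetical) eval suite; a non-zero exit already blocks release.
    subprocess.run(
        [sys.executable, "eval_suite.py", "--out", "report.json"], check=True
    )
    with open("report.json") as f:
        report = json.load(f)

    failures = []
    if report["task_success"] < THRESHOLDS["task_success"]:
        failures.append("task_success below threshold")
    if report["tool_error_rate"] > THRESHOLDS["tool_error_rate"]:
        failures.append("tool_error_rate above threshold")

    if failures:
        sys.exit("release blocked: " + "; ".join(failures))
    print("eval gate green: release may proceed")

if __name__ == "__main__":
    gate()
```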

Three Patterns That Break in Production

Open-Ended Tasks

"Write tests for this module" produces tests. They are rarely good. Raschka was blunt in the chat: agents have "no agency of their own". They respond precisely to precise instructions. Open-ended prompts produce shallow output — which looks impressive in demos, because it does something, and fails in production, because "something" is not enough.

Alina Dallmann dissected this precisely in her talk Beyond Vibe-Coding — A Practitioner's Guide to Spec-Driven Development: three recurring failure modes appear when you hand the AI open-ended tasks — fragmented design decisions scattered across multiple chat sessions; prompt drift, where the conversation takes on a dynamic of its own; and hidden assumptions the model makes because no one stated them. Her conclusion is the same as the architectural claim made here: the specification belongs before the code, not inside it. Task specification is itself architecture. Miss that, and you will keep treating a specification problem as a model problem.
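One way to make the specification operational is to treat it as a typed artifact rather than a chat message. A sketch under that assumption, with an illustrative TaskSpec that forces scope, assumptions, and acceptance criteria to be written down before any prompt is rendered:

```python
from dataclasses import dataclass

# Illustrative sketch: the task an agent receives is a structured artifact,
# not a free-form chat message. TaskSpec is a hypothetical name.

@dataclass(frozen=True)
class TaskSpec:
    goal: str
    in_scope: tuple[str, ...]
    out_of_scope: tuple[str, ...]
    assumptions: tuple[str, ...]          # stated, so the model cannot invent them
    acceptance_criteria: tuple[str, ...]  # what "done" means, decided up front

    def to_prompt(self) -> str:
        sections = [
            ("Goal", (self.goal,)),
            ("In scope", self.in_scope),
            ("Out of scope", self.out_of_scope),
            ("Assumptions", self.assumptions),
            ("Acceptance criteria", self.acceptance_criteria),
        ]
        return "\n\n".join(
            f"{title}:\n" + "\n".join(f"- {line}" for line in lines)
            for title, lines in sections
        )
```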

Free Tool Choice

When an agent picks freely from an open toolbox, behaviour becomes unpredictable in practice — mostly elegant, occasionally catastrophic. Harald Nezbeda's talk Building Secure Environments for CLI Code Agents documented concrete incidents (more in the "Trust Boundary" section below). For non-critical applications this spread is fine. For any regulated context, any critical pipeline, any automated operation against real data, it is an architecture that gets expensive sooner or later. The systems that hold restrict tool choice drastically — and check every tool against a clear use-case contract.
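In code, such a contract can start as an allowlist with explicit side-effect flags. A minimal sketch with hypothetical names: tools outside the registry are unreachable, and tools with side effects are refused without sign-off instead of being trusted to the model's judgment.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch: no open toolbox. Every tool is registered against an
# explicit contract; everything else is unreachable by construction.

@dataclass(frozen=True)
class ToolContract:
    name: str
    purpose: str        # the use case this tool is allowed to serve
    side_effects: bool  # True => requires human sign-off downstream
    fn: Callable[..., object]

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolContract] = {}

    def register(self, contract: ToolContract) -> None:
        self._tools[contract.name] = contract

    def call(self, name: str, *args: object) -> object:
        contract = self._tools.get(name)
        if contract is None:
            raise PermissionError(f"tool '{name}' is not on the allowlist")
        if contract.side_effects:
            raise PermissionError(f"tool '{name}' requires human sign-off")
        return contract.fn(*args)

registry = ToolRegistry()
registry.register(ToolContract("read_file", "inspect source files", False, open))
registry.register(ToolContract("run_migration", "apply schema changes", True, print))
```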

Test Generation Without a Domain Anchor

The problem is not that agents write too little code. The problem is that they produce code faster than organisations can absorb it — in domain expertise, in security, in architecture. Code volume is not the same as delivery capability.

Adrin Jalali, scikit-learn maintainer, made this point in the fireside chat (paraphrased, not a verbatim quote): well-meaning vibe-coded pull requests in open-source projects don't help. They are noise, and they burn maintainer time that is already scarce. What appears as a burden in open-source projects is a structural issue in enterprise software.

The New York Times documented the case in April 2026: a financial services firm started using Cursor and jumped from 25,000 to 250,000 lines of code per month. Within a short period, a review backlog of one million lines built up. Joni Klippert, co-founder and CEO of StackHawk (a security start-up that worked with the firm): "The sheer amount of code being delivered, and the increase in vulnerabilities, is something they can't keep up with." The consequence: senior software engineers are in urgent demand to review it all, and the pressure cascades into sales, marketing, and support, who have to keep pace. Tests written by agents are often shallow. Reviews done by humans don't scale by a factor of ten. That gap will not close on its own in 2026.

What This Means for Architecture Decisions

If the harness is the decisive layer, three concrete consequences follow for programmes starting in 2026:

Model choice becomes secondary

Cursor's Composer-3 — running productively in their coding tool — is one example of why: the production gain came from post-training on an open base model, not from picking a vendor. Accept that logic, and your investment shifts from vendor comparison to harness engineering. The full case-study treatment of Composer-3 sits in the open-stack piece in this series.

Trust boundaries become explicit

Where can an agent act autonomously, where only suggest, where only inform? This is not a detail — it is architecture. In regulated industries it defines the compliance frame. Everywhere else it determines whether the system can scale.

Review capacity becomes the bottleneck

The factor-of-ten jump in code volume happens automatically once agents start writing productively. The reviewers to check it do not appear automatically. Programmes that don't address this in the setup phase build a form of technical debt that costs more in twelve months than any savings ever return.

Trust Boundary as Contract, Not Slogan

Saying "trust boundary" is easy. Writing it as a contract is the actual work. This is exactly where most programmes stumble in 2025/26 — not because the idea is wrong, but because it never becomes operational.

Concretely: for every agent step, three modes are worth distinguishing. Read and suggest (human decides), execute with downstream sign-off (four-eyes principle), or complete autonomously (no human in the loop). Which mode applies to which action is not a technical decision but an architecture and governance one. It has to be encoded in the pipeline, not in the prompt template.
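A sketch of what "encoded in the pipeline" can mean, with illustrative action names: the mode assignment is an explicit, reviewable table in code, and unknown actions deliberately never default to autonomous.

```python
from enum import Enum

# Illustrative action names; the point is that the mode assignment is an
# explicit, reviewable table in the pipeline, not text in a prompt template.

class Mode(Enum):
    SUGGEST = "read_and_suggest"        # human decides
    SIGN_OFF = "execute_with_sign_off"  # four-eyes principle
    AUTONOMOUS = "autonomous"           # no human in the loop

POLICY: dict[str, Mode] = {
    "summarise_ticket": Mode.AUTONOMOUS,
    "draft_reply": Mode.SUGGEST,
    "update_crm_record": Mode.SIGN_OFF,
    "issue_refund": Mode.SIGN_OFF,
}

def mode_for(action: str) -> Mode:
    # Unknown actions never run implicitly as autonomous.
    return POLICY.get(action, Mode.SUGGEST)
```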

The programmes delivering productively in 2026 have explicitly assigned these three modes for every agent step. In most cases the autonomous mode is excluded for more than half of the possible actions — and precisely that is what makes the rest defensible. Where the assignment is missing, every action runs implicitly as "autonomous" until an incident forces the discussion. From running architecture reviews I can add: where this mode assignment sits in the programme setup as a contractual element, it becomes operational. Where it travels along as an annex or as an implicit architectural decision, it collapses at the first stress moment.

Gabriela Bogk, CISO at Mobile.de, in her keynote "Honey, I vibe coded some crypto" at PyCon DE & PyData 2026, Darmstadt

Gabriela Bogk, CISO at Mobile.de and a long-time member of the Chaos Computer Club, captured this in her keynote "Honey, I vibe coded some crypto" with a concept that belongs in every architecture contract: blast radius. The question before any autonomous agent action is not "can the agent do this?" but "what is the worst that can happen if it gets it wrong — and can we absorb that?". Her own Claude Code setup runs in a VM with hand-curated API keys, no access to production data, and the code itself backed up in a Git repo. That is not paranoia. It is the operational translation of trust boundary into architecture.

Bogk's second point matters even more in regulated contexts: prompt-based guardrails are soft. "Everything that's prompt-based in terms of your guardrails is soft and can be worked around and is prone to injection attacks." Anyone implementing security through system prompts is building on sand. Hard-coded limits on tool permissions, filesystem access, and API keys are the only load-bearing layer — the LLM sits on top, not underneath.
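As a minimal sketch of a hard guardrail, consider a filesystem boundary enforced below the model. The workspace path is illustrative; the property that matters is that no prompt, injected or otherwise, changes what this function permits.

```python
from pathlib import Path

# Illustrative workspace path. The boundary is enforced in code the model
# cannot talk its way around: a prompt-injected "../../etc/passwd" fails
# here regardless of what the LLM asked for.

WORKSPACE = Path("/srv/agent/workspace").resolve()

def safe_read(requested: str) -> str:
    path = (WORKSPACE / requested).resolve()
    if not path.is_relative_to(WORKSPACE):  # Python 3.9+
        raise PermissionError(f"path escapes workspace: {requested}")
    return path.read_text()
```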

Harald Nezbeda in his talk "Building Secure Environments for CLI Code Agents" at PyCon DE & PyData 2026

Harald Nezbeda made the consequences very concrete in his talk Building Secure Environments for CLI Code Agents. The risk profile of running a coding agent unsandboxed on a developer's machine falls into what Simon Willison calls the lethal trifecta: private data access plus external connectivity plus exposure to untrusted content. Documented incidents from real Claude Code use: wiped home directories, a crypto miner installed via a compromised NPM package. His pattern for it: container isolation plus a man-in-the-middle proxy with its own SQLite-based observability. That is not paranoia. It is the minimum 2026 configuration whenever a coding agent enters production or a regulated context. Deferring that conversation until an incident forces it is the most expensive choice in agent engineering today.
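As a rough sketch of that pattern, a launch wrapper might look like the following. The image name, proxy container, and mount layout are placeholders, and the observing proxy itself is assumed to already exist; this is an illustration of the isolation posture, not Nezbeda's actual setup.

```python
import subprocess

# Placeholders throughout: agent-image, agent-proxy, and the mount layout
# are assumptions, and the observing proxy container is assumed to exist.

def run_sandboxed_agent(workspace: str) -> None:
    subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "container:agent-proxy",  # egress only via the proxy
            "--cap-drop", "ALL",                    # no extra capabilities
            "--read-only",                          # immutable container filesystem
            "-v", f"{workspace}:/workspace",        # the only writable mount
            "agent-image:latest",
        ],
        check=True,
    )
```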

So What

Reading the conference as "confirmation of the agent wave" misses the picture. In 2026, agent programmes split into two camps: those with harness discipline and those without. The positioning statement "we're now embracing agentic AI as well" — whether on board slides, in strategy papers, or in an investor deck — is too cheap in 2026. It says something about external presentation, nothing about architecture.

The question that decides programmes over the next twelve months is more concrete: what is our harness, who builds it, how do we know it holds. Not the model choice, not the vendor, not the pilot budget. Take the harness seriously, and you get agents that hold. Skip it, and you get demos at production cost.

Your agent programme needs harness discipline, not the next model. Let's talk about the architecture that holds.


Related links

  1. PyCon DE & PyData 2026 — Conference Programme, 2026-04
  2. Stop Waiting, Start Shipping: Real-World Strategy for Open-Source LLMs — Fireside Chat with Sebastian Raschka, 2026-04
  3. Beyond Vibe-Coding: A Practitioner's Guide to Spec-Driven Development in AI Engineering — Alina Dallmann, 2026-04
  4. Building Secure Environments for CLI Code Agents — Harald Nezbeda, 2026-04
  5. AI Code Overload (New York Times), 2026-04-06
  6. The lethal trifecta for AI agents — Simon Willison, 2025-06-16
  7. Honey, I vibe coded some crypto — Security in the age of LLMs (Keynote) — Gabriela Bogk (CISO Mobile.de), 2026-04

Observations from PyCon DE & PyData 2026 and from architecture reviews of agent programmes 2025/26. The fireside chat with Sebastian Raschka — "Stop Waiting, Start Shipping" — supplies model- and training-side substance for this analysis. Note: This piece is published before the public release of the cited conference recordings, which are expected in summer 2026.