The New AI Architecture in 2026: Harness, Evaluation, Open Source

2026-05-26

ai-strategy, enterprise-ai, ai-architecture, mlops, llmops, regulated-industries, takeaways

Three shifts that change AI architecture in 2026: from model to harness, from test to evaluation architecture, from budget option to sovereignty lever. A synthesis from PyCon DE & PyData 2026.

Build with the three shifts in 2026 and you'll have several use cases in production in 2027. Build without them and you'll still be stuck on your first pilot.

The most important movement at PyCon DE & PyData 2026 wasn't in any single talk. It was a shift in tone: away from "look at what's possible", toward "running since February". The industry hasn't arrived. But it has digested its first AI wave and pulled three architectural shifts out of it: from the model to the harness, from test to architecture (evaluation), from budget option to sovereignty lever (open source). 2026 is the dividing line between programmes that build with these shifts and programmes that keep building on 2024 logic; the delivery gap will be plain to see in 2027.

At a Glance

Three shifts that surfaced across this series come together into one picture: AI architecture in 2026 calls for different thinking than 2024.
From model to harness; from test to architecture (evaluation); from budget option to sovereignty lever (open source).
What the conference did not solve: LLMOps stack standardisation, review capacity, clear skill paths. Saucedo's survey data point: roughly half of organisations still have no ML monitoring in production.
2026 is the dividing line between programmes that build with these shifts and programmes that build on 2024 logic. The delivery gap will be visible in 2027.

Shift 1: From Model to Harness

In 2024, the central question in many programmes was: which model is best? In 2026 it isn't. Sebastian Raschka put it in a single line in the fireside chat "Stop Waiting, Start Shipping", which I moderated, and the line cuts sharpest for Europe: post-training and harness, not a base-model race.

The case in point was Cursor's Composer 2: running in production, materially better than most coding LLMs on the market, yet not the product of proprietary pre-training. At its core: an open base model, Kimi 2.5, with additional reinforcement learning on top. Cursor co-founder Aman Sanger confirmed this publicly in March 2026; Moonshot AI described the procedure as "continued pretraining & high-compute RL training" on Kimi 2.5. The production gain didn't come from picking the right model. It came from the layer above it.

The same pattern shows in Claude Code: strong as a coding agent because Claude Code is a very good harness for programming work. Under the hood, less model magic than fine-grained division of labour between specialised small programs: code search, patch generation, test execution, diff generation, sandbox invocation. The model orchestrates; the tools do the work.
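The division of labour described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Claude Code or Cursor are built: the harness owns a registry of small tools and executes a plan that, in a real system, the model would propose. All names (`Harness`, `code_search`, `run_tests`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A tool is a plain function from a string argument to a string result.
Tool = Callable[[str], str]


@dataclass
class Harness:
    tools: Dict[str, Tool]

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        """Execute a plan of (tool_name, argument) steps in order."""
        results = []
        for name, arg in plan:
            if name not in self.tools:
                # Unknown tools are rejected rather than improvised.
                results.append(f"error: unknown tool {name!r}")
                continue
            results.append(self.tools[name](arg))
        return results


# Hypothetical stand-ins for code search and test execution.
def code_search(query: str) -> str:
    return f"found 1 match for {query!r}"


def run_tests(target: str) -> str:
    return f"tests passed for {target}"


harness = Harness(tools={"code_search": code_search, "run_tests": run_tests})

# In a real system the model proposes this plan; here it is hard-coded.
plan = [("code_search", "parse_config"), ("run_tests", "config"), ("deploy", "prod")]
results = harness.run(plan)
```

The point of the sketch is where the control sits: the model only names a tool and an argument, and the harness decides whether that tool exists and what it is allowed to do.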

For agent architectures, the shift cuts twice. The programmes running production agents in 2026 have strict context management, deterministic fallback paths, and evaluation-coupled releases. They treat trust boundaries as contracts, not slogans. They have (as Gabriela Bogk crystallised in her keynote) asked "what is the blast radius?" of every autonomous step before signing it off.
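One way to turn the blast-radius question and the deterministic fallback into a contract is sketched below. It is an assumption-laden illustration, not a pattern shown in the keynote: the three radius levels, the `Step` type and the `execute` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class BlastRadius(IntEnum):
    READ_ONLY = 0     # no side effects
    REVERSIBLE = 1    # side effects that can be undone
    IRREVERSIBLE = 2  # for example deleting data


@dataclass
class Step:
    name: str
    radius: BlastRadius
    action: Callable[[], str]


def execute(step: Step, max_autonomous: BlastRadius, fallback: str) -> str:
    """Run a step autonomously only if its blast radius is within the contract."""
    if step.radius > max_autonomous:
        # Outside the contract: never run, always escalate.
        return f"escalated: {step.name} needs human sign-off"
    try:
        return step.action()
    except Exception:
        # Deterministic fallback: the same fixed answer every time, no retry loop.
        return fallback


def flaky_lookup() -> str:
    raise TimeoutError("model call timed out")


lookup = Step("lookup_ticket", BlastRadius.READ_ONLY, flaky_lookup)
delete = Step("delete_account", BlastRadius.IRREVERSIBLE, lambda: "deleted")

r1 = execute(lookup, BlastRadius.REVERSIBLE, fallback="handing over to a human agent")
r2 = execute(delete, BlastRadius.REVERSIBLE, fallback="handing over to a human agent")
```

The failing read-only step ends in the fixed fallback text, and the irreversible step is never executed at all because it exceeds the agreed radius.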

Anyone still benchmarking models in 2026 is optimising at the wrong layer. The lever sits two layers higher.

Shift 2: From Test to Architecture (Evaluation)

The second shift is unshowy but central: teams no longer build first and then check whether it works. They test first and then build with intent against what they want to measure. In 2024, evaluation was a downstream check in many programmes. In 2026, it sets the tempo of AI development for production.

Frank Rust and Thomas Prexl described in their talk "It Works on My Machine" what that looks like in practice: before the first line of code is written, they collect 100 or more real user questions with correct answers and source references. The set is reviewed against the baseline at every sprint review, alongside the power users. Andrei Beliankou and Evgeniya Ovchinnikova (E.ON) showed the operational layer above: three observability stacks running in parallel, span-level tracing, cost breakdown, and a pointed warning about the failure mode of the evaluation loop that any programme without this discipline will run into.
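A minimal sketch of such an eval set and a baseline check follows. The talk's actual scoring method is not described here, so exact-match scoring is my simplifying assumption, and the names `EvalCase`, `accuracy` and `passes_gate`, as well as the sample questions, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_answer: str
    source: str  # reference that justifies the expected answer


def accuracy(cases: List[EvalCase], answer_fn: Callable[[str], str]) -> float:
    """Share of cases where the system's answer matches the expected one."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(
        answer_fn(c.question).strip().lower() == c.expected_answer.strip().lower()
        for c in cases
    )
    return hits / len(cases)


def passes_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Release only if the candidate does not fall below the baseline."""
    return candidate >= baseline - tolerance


# Three illustrative cases; a real set would hold 100 or more.
cases = [
    EvalCase("What is the notice period?", "three months", "contract.pdf, section 4"),
    EvalCase("Who approves travel?", "the line manager", "travel-policy.md"),
    EvalCase("Which currency applies?", "euro", "contract.pdf, section 9"),
]


def candidate_system(question: str) -> str:
    # Stand-in for the system under test; it gets the third case wrong.
    canned = {
        "What is the notice period?": "Three months",
        "Who approves travel?": "the line manager",
    }
    return canned.get(question, "unknown")


score = accuracy(cases, candidate_system)
release_ok = passes_gate(score, baseline=0.5)
```

Because each case carries its source reference, a failing case can be reviewed with the power users against the document that justifies the expected answer, rather than argued from memory.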

The consequence is different from what most teams assume: the strategic core of a programme isn't the model; it is the eval set.

Models change every quarter. Eval sets stay.

Eval sets are the most valuable asset of a 2026 AI programme, and they belong in week one, not in phase four.

That moves the hiring category as well. The bottleneck hire in 2026 isn't the ML PhD; it's the domain expert with an appetite for experimentation. I anchor that thesis in evaluation discipline, which lives on domain depth, not ML depth.

Shift 3: From Budget Option to Sovereignty Lever

The third shift redefines the relationship between open-source stacks and enterprise strategy. In 2024, open source was the budget option in many DACH boardrooms: free, risky, somehow not "serious". In 2026, that label no longer sticks. Open source is the architectural form in which data control, auditability and strategic post-training converge. Three axes carry the move.

Data Sovereignty Through Local Models

In 2026, locally run models are the more obvious choice for sensitive workloads, as Bogk confirmed from a CISO's perspective. Confidential tickets, code repositories and contract drafts no longer leave the organisation's own control plane. Sovereignty moves from the strategy paper into the stack.
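What "into the stack" can mean at its simplest is a routing rule that sends everything to the local model unless the data is explicitly classified as public. The sketch below is illustrative only: the sensitivity levels, the `Route` type and both endpoint URLs are hypothetical placeholders, not a recommendation from the keynote.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


@dataclass(frozen=True)
class Route:
    endpoint: str
    leaves_control_plane: bool


# Placeholder endpoints for a locally run model and an external vendor API.
LOCAL = Route("http://llm.internal.example/v1", leaves_control_plane=False)
EXTERNAL = Route("https://api.vendor.example/v1", leaves_control_plane=True)


def route_for(sensitivity: Sensitivity) -> Route:
    """Default to the local route; only explicitly public data may leave."""
    if sensitivity is Sensitivity.PUBLIC:
        return EXTERNAL
    return LOCAL


ticket_route = route_for(Sensitivity.CONFIDENTIAL)
docs_route = route_for(Sensitivity.PUBLIC)
```

The design choice worth noting is the default: anything not explicitly marked public stays local, so a missing or wrong classification fails towards the organisation's own control plane.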

Audit Sovereignty Through Open Models

Sylvain Corlay (QuantStack/Jupyter), on the "Open Source as a Business" panel, which I moderated, articulated what hits regulated industry as squarely as it hits science: black-box tools are structurally unfit wherever traceability is mandatory. Model weights in your own hands, audit logs at the inference layer, inspection of model behaviour: none of that works without open models.

Strategic Sovereignty Through Post-Training

Yann Lechelle (probabl/scikit-learn) on the same panel delivered the economic clarification: open source is not a business model; it is a distribution, community, governance and marketing asset. That raises a different question from "open or closed?": which layer do we control, and which do we delegate? Differentiation is built in post-training on proprietary data, on a base model you don't have to fund yourself.

The European open-source business ecosystem (probabl, QuantStack, spaCy and others) is available as a partner market for exactly these sovereignty programmes. The detailed view sits in the sovereignty piece of this series. Anyone still framing the 2026 choice as "make-vs-buy between US hyperscaler and own model" is overlooking a substantial market.

What the Conference Did Not Solve

LLMOps Stack Standardisation

Alejandro Saucedo (Zalando) brought an uncomfortable number from his State of Production Machine Learning Operations survey: roughly half of organisations still have no ML monitoring in production. What's missing in the LLMOps stack in 2026 is exactly what slowly took shape in the MLOps stack between 2018 and 2022: shared standards, shared patterns, shared tooling expectations. OpenTelemetry standards for GenAI are a start, but the field is heterogeneous and will stay that way for a while.
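To make "monitoring" concrete at the smallest possible scale, here is a hand-rolled sketch of span-level recording with a cost breakdown. It deliberately does not use the OpenTelemetry API; the `Tracer` and `Span` classes, the field names and the per-token prices are hypothetical and exist only for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Span:
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    duration_s: float = 0.0


class Tracer:
    def __init__(self) -> None:
        self.spans: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        """Record one unit of work, including how long it took."""
        record = Span(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # The span is stored even if the wrapped work raises.
            record.duration_s = time.perf_counter() - start
            self.spans.append(record)


def cost_by_span(spans: List[Span], price_in: float, price_out: float) -> Dict[str, float]:
    """Sum token costs per span name, given prices per single token."""
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s.name] += s.input_tokens * price_in + s.output_tokens * price_out
    return dict(totals)


tracer = Tracer()
with tracer.span("retrieve") as s:
    s.input_tokens = 200
with tracer.span("generate") as s:
    s.input_tokens = 1000
    s.output_tokens = 500

# Illustrative prices per token, not real vendor pricing.
costs = cost_by_span(tracer.spans, price_in=0.001, price_out=0.002)
```

Even this toy version answers the question a missing monitoring layer cannot: which step of the pipeline is consuming the budget.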

Review Capacity at Ten-Fold Code Growth

The New York Times documented in April 2026 the case of a financial services firm that jumped with Cursor from 25,000 to 250,000 lines of code per month and built up a review backlog of one million lines. The gap between code generation and review capacity is unresolved in 2026. The conference named it; it did not close it.
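A back-of-the-envelope calculation shows how quickly such a backlog can form. The article's figures are the two generation rates and the backlog size; that review capacity stayed at the old rate of 25,000 lines per month is my assumption, made only to illustrate the mechanism.

```python
def months_until_backlog(
    generated_per_month: int,
    reviewed_per_month: int,
    target_backlog: int,
) -> float:
    """Months until the unreviewed backlog reaches the target size."""
    growth = generated_per_month - reviewed_per_month
    if growth <= 0:
        raise ValueError("backlog does not grow when review keeps pace")
    return target_backlog / growth


# Assumption: review capacity stays at the pre-Cursor output of 25,000 lines.
months = months_until_backlog(
    generated_per_month=250_000,
    reviewed_per_month=25_000,
    target_backlog=1_000_000,
)
```

Under that assumption the backlog grows by 225,000 lines a month and passes one million lines in well under half a year.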

Clear Skill Paths for Domain Experts

If the bottleneck hire in 2026 is the domain expert with an appetite for experimentation, those people need a learning path. There is no structured one today. Bootcamps don't hit the profile, and neither do classical ML degrees. What's emerging in 2026, at best, is a mentoring model: external advisory meets internal domain. That only scales so far.

2026 Marks a Dividing Line in Corporate AI Strategy. Two Very Different Kinds of Programme Are Emerging.

The first kind has understood that the contest is no longer settled by larger models and more compute alone. These programmes invest deliberately in the layers that actually matter for their business: tuning and adapting models to their own data and processes, quality assurance for AI output, integration with existing systems, and an architecture that builds compliance, data security and traceability in from day one.

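These firms make AI sovereignty practical. They don't just talk about it; they run early applications on controllable in-house or open stacks. They document how their systems are tested, which data may be used, where the risks sit, and how decisions remain auditable. And they don't only look for classical AI researchers; they look above all for people who combine domain knowledge, technical understanding and an appetite for experimentation.

The second kind keeps building on 2024 logic: the model as the central question, evaluation as a downstream check, open source as the budget option. These are the programmes that will still be stuck on their first pilot in 2027.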

For Decision-Makers, This Means

First: Shift Investments from Base-Model Compute to the Harness

Programme budget does not belong in the next vendor comparison; it belongs in post-training, harness and the evaluation pipeline. Anyone still optimising on model choice in 2026 is optimising two layers below the lever. Cursor's Composer 2 is the case in point: the production gain came from the layer above the base model, not from picking the right base model.

Second: Shift Hiring Profiles from the ML Market to Domain Expertise

Domain experts with an appetite for experimentation are the bottleneck hire in 2026, not ML PhDs. Eval discipline lives on domain depth, not ML depth, and the talent market for that is available, often less expensive, and a better fit for most programmes. Anyone who accepts that opens up a different talent pool than the competition.

Third: Make Sovereignty Operational, Not Rhetorical

Sovereignty does not belong on a board slide; it belongs in the stack: local models for sensitive workloads, open models where audit is mandatory, post-training on proprietary data for strategic differentiation. Several use cases running on your own stack, with a documented evaluation trail and a defensible compliance frame: that is the operational mark of 2026.

2026 is no longer about who has the biggest model. What decides the outcome is who treats harness, evaluation and open-source sovereignty as architecture.

Which of the three shifts has already arrived in your programme, and which hasn't yet? What comes next?
