The Infrastructure Layer Nobody Saw Coming: Agent Harnesses Are Eating Software

The smartest bet in AI right now isn’t the model. It’s the plumbing around it.

We spent 2025 arguing about which LLM was better. GPT-5 vs Claude vs Gemini. Benchmarks, leaderboards, Reddit fights. Underneath all that noise, engineers quietly stopped asking “which model?” and started asking “how do I run 20 of them without losing my mind?” That question is now an infrastructure category. Agent harnesses. And nobody has clean answers yet.

What actually is an agent harness?

Philipp Schmid from Google DeepMind said it better than I can:

If 2025 was beginning of agents, 2026 will be around Agent Harnesses. An Agent Harness is the infrastructure that wraps around an AI model to manage long-running tasks. It is not the agent itself.

It operates at a higher level than agent frameworks. The harness provides prompt…
— Philipp Schmid (@_philschmid) January 5, 2026

Your agent framework (LangChain, LlamaIndex, whatever you’re using) handles how the agent thinks. The harness handles everything else: scheduling, memory, tool access, error recovery, cost, and what happens when five agents are all trying to modify the same file at 3am.

The Kubernetes analogy is obvious but it fits. We didn’t start cloud infrastructure with container orchestration. We started with one server, then needed ten, then needed something that could coordinate them without requiring a human to babysit every process. Same story here, except the workers think.

Three projects that show where this is going

Dorothy calls itself “the wife your AI agents need.” It’s a desktop app — not a framework — that lets you run Claude Code, Codex, and Gemini agents across different codebases at the same time. Kanban boards, cron scheduling, Telegram and Slack control from your phone. If you’ve ever had 10 coding agents open and genuinely lost track of what each one was doing, Dorothy is solving your problem.

Hermes from NousResearch is the self-improving angle. It creates skills from experience, refines them during use, and builds a running model of who you are across sessions. Runs on a $5 VPS. Talks to you on Telegram, Discord, WhatsApp, wherever. The pitch is persistence — not “do the task” but “get better at doing the task every time.”

Paperclip has the biggest swing: “If an AI agent is an employee, Paperclip is the company.” Org charts, per-agent budgets, goal alignment. A CEO agent that delegates to CTO and marketing agents. Their tagline: “Open-source orchestration for zero-human companies.” Reading about it feels like science fiction. The GitHub activity is real.

Three tools, three different levels of ambition, same underlying mess they’re cleaning up.

The five things that break when you go from one agent to many

Coordination. Agents don’t know what other agents are doing. Two agents will duplicate work, contradict each other, or lock each other out. Without a layer that manages this, you’re hoping for the best.
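To make the coordination problem concrete, here is a minimal sketch of the kind of claim registry a harness could keep so that two agents never edit the same file at once. Everything in it (the `ClaimRegistry` name, the agent IDs, the file path) is hypothetical and in-memory only; it is not how Dorothy, Hermes, or Paperclip do it, and a real harness would need something that survives restarts and spans processes.

```python
import threading
from typing import Dict, Optional


class ClaimRegistry:
    """Hypothetical in-memory record of which agent owns which resource."""

    def __init__(self) -> None:
        self._lock = threading.Lock()       # guards the owner table
        self._owners: Dict[str, str] = {}   # resource -> agent_id

    def claim(self, resource: str, agent_id: str) -> bool:
        """Return True if agent_id now owns the resource, False if taken."""
        with self._lock:
            owner = self._owners.get(resource)
            if owner is None or owner == agent_id:
                self._owners[resource] = agent_id
                return True
            return False

    def release(self, resource: str, agent_id: str) -> None:
        """Release the resource, but only if agent_id is the current owner."""
        with self._lock:
            if self._owners.get(resource) == agent_id:
                del self._owners[resource]

    def owner(self, resource: str) -> Optional[str]:
        """Return the current owner, or None if the resource is free."""
        with self._lock:
            return self._owners.get(resource)


registry = ClaimRegistry()
registry.claim("src/app.py", "agent-a")            # agent-a gets the file
blocked = not registry.claim("src/app.py", "agent-b")  # agent-b is refused
```

The point of the design is that an agent asks before it touches anything, and a refusal is a normal return value rather than a crash, so the harness can reschedule the second agent instead of letting it overwrite the first.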

Memory. Every session starts fresh. The agent that spent six hours understanding your Postgres schema yesterday has no idea about it today. Persistent cross-session memory is the gap between a useful agent and a frustrating one.
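As a rough illustration of what cross-session memory means at its simplest, the sketch below writes notes to a SQLite file and reads them back in a later session. The `SessionMemory` class, the `agent_memory.db` filename, and the table layout are all assumptions made for this example. Real systems such as Hermes presumably do far more (retrieval, summarization, pruning); this only shows the persistence part.

```python
import sqlite3
from typing import List


class SessionMemory:
    """Hypothetical note store that outlives a single agent session."""

    def __init__(self, path: str = "agent_memory.db") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS notes ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "topic TEXT NOT NULL, "
            "note TEXT NOT NULL)"
        )
        self._db.commit()

    def remember(self, topic: str, note: str) -> None:
        """Persist one note under a topic."""
        self._db.execute(
            "INSERT INTO notes (topic, note) VALUES (?, ?)", (topic, note)
        )
        self._db.commit()

    def recall(self, topic: str) -> List[str]:
        """Return every note stored for a topic, oldest first."""
        rows = self._db.execute(
            "SELECT note FROM notes WHERE topic = ? ORDER BY id", (topic,)
        ).fetchall()
        return [row[0] for row in rows]

    def close(self) -> None:
        self._db.close()
```

Yesterday’s session would call `remember("postgres_schema", ...)` as it learns; today’s session opens the same file and calls `recall("postgres_schema")` before doing anything else.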

Cost. Running multiple agents on frontier models adds up fast. Paperclip sets a monthly budget per agent, and when an agent hits its limit, it stops. That’s table stakes for anyone running this in production.
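A hard per-agent budget is simple to sketch. The version below is an illustration of the idea only, not Paperclip’s implementation: the `AgentBudget` and `BudgetExceeded` names and the dollar amounts are made up, and it assumes the harness already knows the cost of each call before or after making it.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push an agent past its monthly limit."""


@dataclass
class AgentBudget:
    """Hypothetical per-agent spending cap, tracked in US dollars."""

    agent_id: str
    monthly_limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a cost, refusing it if it would exceed the limit."""
        if cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        if self.spent_usd + cost_usd > self.monthly_limit_usd:
            raise BudgetExceeded(
                f"{self.agent_id} would exceed its "
                f"${self.monthly_limit_usd:.2f} monthly limit"
            )
        self.spent_usd += cost_usd

    @property
    def remaining_usd(self) -> float:
        return self.monthly_limit_usd - self.spent_usd


budget = AgentBudget(agent_id="research-agent", monthly_limit_usd=10.0)
budget.charge(6.0)          # accepted, 4.0 remaining
try:
    budget.charge(5.0)      # refused: 6.0 + 5.0 > 10.0
except BudgetExceeded:
    stopped = True          # the harness would pause the agent here
```

Refusing the charge before recording it means a rejected call never counts against the agent, which keeps the ledger honest when the harness retries later.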

Access in locked environments. This one doesn’t get talked about enough. Agents need the web, APIs, files, databases. Enterprise environments have VPNs, 2FA, bot detection, zero-trust networks, guarded service accounts. An agent that can’t get through those layers is a toy. We’re genuinely years away from solving this cleanly for most corporate environments.

Observability. What is the agent actually doing? Why did it fail at step four? Which API call cost $6? You can’t operate what you can’t see. Most current tools are still thin here.
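Those three questions map onto a very small data structure. The sketch below is one hypothetical way to record each step an agent takes so you can ask which step failed and which call cost the most. `RunTrace`, `StepRecord`, and the `cost_usd` argument are invented for illustration, and the cost is assumed to be supplied by the caller rather than measured.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class StepRecord:
    step: int
    name: str
    cost_usd: float
    duration_s: float
    error: Optional[str] = None


@dataclass
class RunTrace:
    """Hypothetical per-run log of every step an agent executed."""

    agent_id: str
    steps: List[StepRecord] = field(default_factory=list)

    def record(self, name: str, fn: Callable[[], Any],
               cost_usd: float = 0.0) -> Any:
        """Run fn, log its outcome, and return its result (None on error).

        Exceptions are captured in the trace instead of propagating,
        so the caller decides what to do after inspecting the record.
        """
        start = time.perf_counter()
        result, error = None, None
        try:
            result = fn()
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
        self.steps.append(StepRecord(
            step=len(self.steps) + 1,
            name=name,
            cost_usd=cost_usd,
            duration_s=time.perf_counter() - start,
            error=error,
        ))
        return result

    def first_failure(self) -> Optional[StepRecord]:
        return next((s for s in self.steps if s.error is not None), None)

    def most_expensive(self) -> Optional[StepRecord]:
        return max(self.steps, key=lambda s: s.cost_usd, default=None)

    def total_cost_usd(self) -> float:
        return sum(s.cost_usd for s in self.steps)
```

With every step wrapped in `record`, "why did it fail at step four?" becomes `trace.first_failure()` and "which API call cost $6?" becomes `trace.most_expensive()`.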

The enterprise gap nobody is being honest about

Agents work well on the personal side and in startups. They don’t work in most enterprises yet, and the gap isn’t about what the agents can do. It’s security architecture.

I run an AI agent for my own workflows. Scheduling, research, automations, memory, cron jobs. It’s the best productivity improvement I’ve had in years. I haven’t touched it for work. The reason is straightforward: corporate environments have security layers that exist for real reasons — SSO, VPN, 2FA, DLP, zero-trust. An agent that can’t authenticate cleanly through those layers is not useful in that context. It’s a liability.

The force multiplier of the agent harness right now is crazy. The industry has landed on some architectural consistency, but there are still so many different variants of how to attack this. Maybe this gets bitter lessened out of existence, but for now it’s a huge lever. https://t.co/J05na51GA9
— Aaron Levie (@levie) March 3, 2026

Aaron Levie, CEO of Box

The teams building compliant, authenticated agent access for enterprise tooling will win large contracts. That’s a harder problem than orchestration itself.

The contrarian take worth taking seriously

Most teams building multi-agent systems don’t need them.

That’s not contrarianism for its own sake — it’s what practitioners who’ve shipped both say. A single well-configured agent with good tools handles most real tasks more reliably than five specialized agents trying to coordinate. Multi-agent adds failure modes. The debugging is brutal. The coordination overhead can exceed the value of the work done.

you will know multi-agent orchestration works when haiku subagents outperform a single opus agent and retain the token cost discount

until then we are hallucinating https://t.co/n0V8anTIj4
— darren (@darrenangle) February 8, 2026

The benchmark that actually matters for multi-agent

The 70% figure keeps coming up in honest discussions. Most use cases don’t need an orchestra. They need one musician who knows the song.

Introducing agents has to start with a business question, not a technology question. What are you measuring? Growth? Cost reduction? Hours saved on something nobody wants to do? If you can’t define success before you start, you’ll spend months building infrastructure for outcomes you never actually wanted. Bottom-up adoption — one workflow, one team, one clear win — beats top-down mandates every time. “We’re doing this because competitors are” is anxiety dressed up as strategy.

Agent harnesses matter. The teams building them are solving real problems. But the best of those teams will be the ones that help businesses answer the boring question first: what specific outcome are you actually trying to reach?

Everything else is scaffolding.