BLOG

Writing from the Bento team on building agents that compound.

Notes, essays, and technical deep-dives on AI observability, evaluation, and improving your agents.

ENGINEERING · Jun 13, 2026
How to catch and fix AI agent regressions (the part nobody talks about)
Karpathy's recursive self-improvement loop is a beautiful mental model. It also has almost nothing to do with the silent regressions eating your week. Here's what actually works.
10 min read
ENGINEERING · Jun 6, 2026
Harness Engineering: the underrated discipline of production AI
A million lines of code, zero written by a human hand. The discipline that made it possible has a name now — harness engineering. Most enterprise teams still don't have anyone whose job it is.
5 min read
RESEARCH · May 28, 2026
TB2: The benchmark where a score improvement actually means something in production
89 tasks, from compiling SQLite with gcov to fixing the OCaml garbage collector. The score your agent gets here is the one that actually tells you something.
6 min read
RESEARCH · May 27, 2026
+10.2 pp pass@1 on Terminal-Bench 2.0 with a recursive learning layer: same agent, same model, same budget
We wrapped Claude Sonnet 4.5 with a recursive learning layer on Terminal-Bench 2.0. Pass@1 went from 42.2% to 52.4%. Same agent, same model, same budget.
15 min read
PERSPECTIVE · Apr 27, 2026
Why you can't vibe-code your way to a better production agent
Vibe-coding finds a fix for the trajectories you pasted in — not the 999 you didn't. Improving a production agent takes infrastructure, not a chat session.
6 min read
PERSPECTIVE · Apr 26, 2026
Production AI agents need research infrastructure, not iteration
Production agents are time-varying distributions, not snapshots. Improving one takes research infrastructure — not iteration with a frontier model on a loop.
7 min read
ENGINEERING · Apr 25, 2026
Why your AI agent stops following instructions mid-run
In long agent runs, parts of the system prompt drift into the model's attention dead zone. Reorder by persistence, not importance — the drift goes away.
5 min read
PERSPECTIVE · Apr 20, 2026
Nature vs. Nurture in AI agents: diagnose the layer that's actually breaking
AI agents that work in pilots often degrade in production. It's usually a diagnostic failure, not a capability one — here's how to spot the layer that's actually breaking.
7 min read
RESEARCH · Apr 18, 2026
2.6× higher scores on ARC-AGI-3 with a self-learning layer: same agent, same budget
We validated a self-learning engine on ARC-AGI-3. Same agent, same tools, same budget — 2.6× the score, 34% cheaper per successful outcome, three first-ever solves.
8 min read
