Writing from the Bento team on building agents that compound.
Notes, essays, and technical deep-dives on AI observability, evaluation, and improving your agents.
RESEARCH
TB2: The benchmark where a score improvement actually means something in production
89 tasks, from compiling SQLite with gcov to fixing the OCaml garbage collector. The score your agent gets here is the one that actually tells you something.
6 min read
RESEARCH
+10.2 pp pass@1 on Terminal-Bench 2.0 with a recursive learning layer: same agent, same model, same budget
We wrapped Claude Sonnet 4.5 with a recursive learning layer on Terminal-Bench 2.0. Pass@1 went from 42.2% to 52.4%. Same agent, same model, same budget.
15 min read
PERSPECTIVE
Why you can't vibe-code your way to a better production agent
Vibe-coding finds a fix for the trajectories you pasted in — not the 999 you didn't. Improving a production agent takes infrastructure, not a chat session.
6 min read
PERSPECTIVE
Production AI agents need research infrastructure, not iteration
Production agents are time-varying distributions, not snapshots. Improving one takes research infrastructure — not iteration with a frontier model on a loop.
7 min read
ENGINEERING
Why your AI agent stops following instructions mid-run
In long agent runs, parts of the system prompt drift into the model's attention dead zone. Reorder by persistence, not importance — the drift goes away.
5 min read
PERSPECTIVE
Nature vs. Nurture in AI agents: diagnose the layer that's actually breaking
AI agents that work in pilots often degrade in production. It's usually a diagnostic failure, not a capability one — here's how to spot the layer that's actually breaking.
7 min read
RESEARCH
2.6× higher scores on ARC-AGI-3 with a self-learning layer: same agent, same budget
We validated a self-learning engine on ARC-AGI-3. Same agent, same tools, same budget — 2.6× the score, 34% cheaper per successful outcome, three first-ever solves.
8 min read