11 Apr 2026

How to Get Agents to Run for Hours (and Still Ship Useful Code)

Here’s a snapshot of my AI coding workflow as of April 2026. There’s a good chance this will be obsolete by June, because two months is an eternity in AI time. Still, now seems like a good time to write it down, because for once it hasn’t changed significantly in months.

The headline result: agent sessions that can routinely run 20 minutes to several hours and produce shippable code at the end. Not “the agent ran for a long time and made a mess” — sessions where you review the diff and it’s clean, tested, and ready to merge. The kind where you kick off an agent before dinner and come back to a completed feature: twelve files changed across three packages, tests passing, types clean.

The TL;DR if you’re impatient:

Learn to keep the agent in situations where it can verify its own work.
Give your agent a memory system that provides clear next steps.

The stack is simple: any capable coding agent (Codex, Claude Code, OpenCode, whatever you prefer) plus linear-beads (lb) or Steve Yegge’s beads (from which linear-beads takes heavy inspiration). Linear-beads is an open-source tool I built that gives agents persistent cloud-based memory via Linear. If you don’t care about cloud or don’t like Linear then skip it and just go with vanilla beads. The important bit is that you’ve got a graph-based tool that your agent enjoys working with. Using these tools the agent plans its own work as a graph of tickets, then drains the queue. That’s the whole thing.

What makes it work isn’t the tooling though — it’s two underlying principles that make long autonomous runs possible at all.

The Two Things That Actually Matter

After months of iteration, the sauce has been boiled down to two ideas. Everything else is just details.

1. The agent needs a way to verify its own work

This is the single most important concept for getting agents to run autonomously for extended periods. If an agent can’t check whether it’s done something correctly, it’ll drift — no matter how smart the model is, no matter how good the prompt is.

Eric Zakariasson from Cursor captured this on X perfectly: when you need an agent to debug something, have it add a feature flag first. The test should fail with the flag off and pass with it on. Now the agent has red/green testing — binary, self-checkable, no human judgment needed. He saw this in action when someone at Cursor used the trick to fix an inline diff bug in a session that ran for 10+ hours.
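Here’s a minimal sketch of the flag trick in Python. Everything in it is invented for illustration: the off-by-one bug, the `visible_lines` helper, and the `fix_off_by_one` flag are hypothetical and have nothing to do with Cursor’s actual code. The point is only the shape: one check, red with the flag off, green with it on.

```python
# Hypothetical example: gating a fix behind a flag gives the agent a red/green signal.

def visible_lines(lines, start, count, fix_off_by_one=False):
    """Return `count` lines beginning at index `start`."""
    if fix_off_by_one:
        return lines[start:start + count]       # candidate fix
    return lines[start:start + count - 1]       # current (buggy) behavior


def check(fix_off_by_one):
    """The self-check: True means green, False means red."""
    lines = ["a", "b", "c", "d"]
    return visible_lines(lines, 1, 2, fix_off_by_one) == ["b", "c"]


red = check(fix_off_by_one=False)    # flag off: the bug reproduces, the check fails
green = check(fix_off_by_one=True)   # flag on: the fix is active, the check passes
print(f"flag off -> {'pass' if red else 'fail'}, flag on -> {'pass' if green else 'fail'}")
# prints: flag off -> fail, flag on -> pass
```

If the check passes with the flag off, the agent hasn’t reproduced the bug yet. If it fails with the flag on, the fix isn’t done. Either way the agent can tell without asking you.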

This generalizes to everything. Tests passing. Build succeeding. Type checker clean. Linter output clean. Script output matching an expected format. The art of prompting for long-running agent work is really the art of turning tasks into things with verifiable outcomes. If you can’t answer “how will the agent know it’s done?”, you don’t have a real task yet. Tasks with pass/fail signals let the agent iterate toward correctness, catch its own mistakes, and know when to stop.
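One way to make “how will the agent know it’s done?” concrete is a single gate that runs every check and reduces the result to pass/fail. The sketch below is an assumption about how you might wire that up, not something lb provides. The two commands in `CHECKS` are trivial placeholders that stand in for your real test runner, build, type checker, and linter.

```python
import subprocess
import sys

# Placeholder commands. Swap in your real test, build, type-check, and lint commands.
CHECKS = {
    "tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
    "lint": [sys.executable, "-c", "import sys; sys.exit(0)"],
}


def run_checks(checks):
    """Run each command; a check passes when its exit code is 0."""
    results = {}
    for name, cmd in checks.items():
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def is_done(results):
    """The task is done only when every check is green."""
    return all(results.values())


results = run_checks(CHECKS)
print(results, "done" if is_done(results) else "keep iterating")
```

Exit codes are a deliberately blunt signal: the agent doesn’t have to interpret anything, it just has to get every entry to `True`.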

2. The agent needs persistent memory with clear next steps

The other half of the equation: between compaction cycles (or across sessions, or across machines), the agent needs to be able to instantly figure out where things stand and what to do next.

A markdown TODO file is way better than no persistent plan at all, but is still limiting. After compaction, the agent has to re-parse the whole plan and infer which items are done, which are in progress, and which are blocked. This is where drift starts — the agent loses track and either re-does work or skips ahead past something that wasn’t finished.

What actually works is a dependency graph: tickets with parent/child relationships, blockers, and status tracking. The agent runs one command and gets back only the tickets whose dependencies have been satisfied. No re-reading the whole plan. No figuring out where it left off. Just: here’s the next thing, here’s all the context you need, go.
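The “one command” idea is easy to picture as a filter over the graph. This is a toy in-memory model, not how beads or lb store tickets; the `Ticket` fields and status names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: str
    title: str
    status: str = "open"                          # "open", "in_progress", or "done"
    blocked_by: list = field(default_factory=list)


def ready(tickets):
    """Open tickets whose blockers are all done."""
    by_id = {t.id: t for t in tickets}
    return [
        t for t in tickets
        if t.status == "open"
        and all(by_id[b].status == "done" for b in t.blocked_by)
    ]


tickets = [
    Ticket("T1", "Add schema migration", status="done"),
    Ticket("T2", "Update API handler", blocked_by=["T1"]),
    Ticket("T3", "Write integration test", blocked_by=["T2"]),
]
print([t.id for t in ready(tickets)])   # prints ['T2']
```

T3 stays hidden until T2 is done, so the agent never has to reason about ordering. It only ever sees work it can actually start.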

This is the idea behind Steve Yegge’s beads project — a persistent, dependency-aware issue tracker designed for coding agents. His writeup on why agents need structured memory is worth reading. The core insight: agents that track work in markdown end up with what he calls “rotten half-implemented plans” that lose their structure after a few compaction cycles.

linear-beads (lb) builds on the same ideas with Linear as the backend, which adds cloud-native state, a real UI, and multi-device/multi-agent support. More on that below.

These two principles reinforce each other. Self-verification keeps the agent producing correct output on each task. The persistent graph keeps it pointed at the right task across context boundaries. Together, they’re what turn a 5-minute productive window into a 2-hour one.

What This Gets You

Concretely, working this way unlocks:

Long productive sessions. With good task decomposition and self-verification, agent sessions can routinely run 20 minutes to several hours and produce mergeable output. The agent doesn’t “lose the plot” through compaction because the graph tells it what’s next and each ticket carries enough context to resume cold.

And why would you want your agents running autonomously for long sessions, you ask? Because then you can parallelize! Increasingly, we are the bottleneck. The most efficient way of working in 2026 seems to be to act like a manager and spend most of your time keeping your agents unblocked, just like good human managers spend the majority of their time making sure their engineers are unblocked.

Work that follows you across machines. Because lb uses Linear as its backend, ticket state lives in the cloud. A Codex cloud instance does work, pushes git, updates tickets — and the work can be picked up from a laptop, a different cloud session, wherever. The agent is stateless; the durable state is in Linear and git.

Agent portability. Start a feature with Claude Code, switch to Codex for a tricky part, and finish up with Cursor. Each agent runs lb ready, reads the ticket context, traverses the ticket graph and the git history to get up to speed on what it needs to know for the next step, and gets to it. No context transfer needed, because the context was never in the agent. It was always in the tickets.

Parallel execution / swarm mode. Point multiple agents at the same lb graph. Each one claims work, executes, closes. The dependency graph prevents collisions. Not a full swarm system, but for parallel independent work under the same parent, it’s effective, especially when combined with git worktrees.
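To show why a claim step matters when several agents share one graph, here is a toy model where claiming is an atomic check-and-set. It illustrates the idea only; it is not a description of how lb or Linear implement claiming. The `Board` class and agent names are hypothetical.

```python
import threading


class Board:
    """Toy shared ticket board with an atomic claim operation."""

    def __init__(self, ticket_ids):
        self._owner = {tid: None for tid in ticket_ids}
        self._lock = threading.Lock()

    def claim(self, ticket_id, agent):
        """Claim a ticket; returns False if another agent already holds it."""
        with self._lock:
            if self._owner[ticket_id] is not None:
                return False
            self._owner[ticket_id] = agent
            return True


board = Board(["T1"])
wins = []


def worker(name):
    if board.claim("T1", name):
        wins.append(name)


# Four agents race for the same ticket; only one claim can succeed.
threads = [threading.Thread(target=worker, args=(f"agent-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(wins))   # prints 1
```

The losers simply ask for the next ready ticket. Combined with blockers, that is enough to keep independent siblings from stepping on each other.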

Human oversight without hand-holding. Open Linear on your phone. See every ticket the agent created. Reprioritize, add context, kill a ticket before it gets picked up. The agent just works on whatever lb ready returns.

The Actual Workflow

Setup is one command — lb onboard — and the agent walks itself through auth and configuration. lb auto-detects the repo via git remote and scopes itself to the right Linear project or label. For best results, also ask your agent to install the built-in agent skills that ship with lb, and then tell your agent to use lb to do some work:

1. Tell the agent what you want (“use lb to plan and execute [goal]”)
2. The agent reads relevant code, creates a parent ticket, decomposes into child tickets with blocker links
3. The agent drains the queue: lb ready → claim → implement → commit → close → repeat

Sometimes you pause after step 2 to review the plan in Linear. Sometimes you let it run straight through. The tickets the agent creates should each carry enough context that a cold agent with no memory of any previous work can start executing immediately. Specific files, specific functions, what to change, what to leave alone, how to validate. The agent does this research during planning, which is the investment that pays dividends through every subsequent compaction and session restart.
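A quick way to sanity-check “enough context for a cold start” is to treat it as a checklist. The field names below (`files`, `change`, `leave_alone`, `validate`) are hypothetical labels for the items listed above, not a schema that lb enforces, and the ticket contents are made up.

```python
# Hypothetical field names mirroring the checklist in the text.
REQUIRED_FIELDS = ("files", "change", "leave_alone", "validate")


def missing_context(ticket):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not ticket.get(name)]


ticket = {
    "title": "Return 404 for unknown user ids",
    "files": ["api/users.py"],
    "change": "get_user() should raise NotFound instead of returning None",
    "leave_alone": "",                        # not filled in yet
    "validate": "pytest tests/test_users.py",
}
print(missing_context(ticket))   # prints ['leave_alone']
```

A ticket that fails this kind of check is one a fresh agent would have to guess about, which is exactly where drift starts.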

Hierarchy and blockers

Flat task lists break down fast. The built-in skills that ship with lb push the agents toward hierarchy: one parent issue for the goal, child issues for shippable units, deeper children when needed. Blockers encode execution order in the graph so lb ready always surfaces the right next step without any agent needing to “remember” the plan.
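To see blockers doing the ordering, here is a toy drain loop over an in-memory graph. The dict layout, the status names, and the `implement` callback are assumptions for the sketch. A real run goes through lb and git rather than a Python dict.

```python
def next_ready(graph):
    """Return the id of one open ticket with no unfinished blockers, or None."""
    for tid, t in graph.items():
        if t["status"] == "open" and all(
            graph[b]["status"] == "done" for b in t["blocked_by"]
        ):
            return tid
    return None


def drain(graph, implement):
    """ready -> claim -> implement -> close, until nothing is ready."""
    order = []
    while (tid := next_ready(graph)) is not None:
        graph[tid]["status"] = "in_progress"      # claim
        if not implement(tid, graph[tid]):        # implement and verify
            graph[tid]["status"] = "blocked"      # park it for a human, don't spin
            continue
        graph[tid]["status"] = "done"             # close (the commit would go here)
        order.append(tid)
    return order


# Tickets are deliberately listed out of order; the blockers sort it out.
graph = {
    "write-tests": {"status": "open", "blocked_by": ["add-endpoint"]},
    "add-endpoint": {"status": "open", "blocked_by": ["add-model"]},
    "add-model": {"status": "open", "blocked_by": []},
}
order = drain(graph, implement=lambda tid, ticket: True)
print(order)   # prints ['add-model', 'add-endpoint', 'write-tests']
```

Nothing in the loop knows the plan. The order falls out of the blocker links, which is why it survives compaction and restarts.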

Recursive planning

The most powerful pattern: tickets that tell the agent to create more tickets. A research ticket might say “investigate how module X handles edge case Y, then create implementation subtickets based on findings.” The plan grows from what the agent actually discovers rather than from upfront speculation.
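In graph terms, a research ticket is just a node whose output is more nodes. The sketch below fakes the research step with a stub. `run_research`, the ticket ids, and the findings are all invented for illustration.

```python
def run_research(ticket):
    """Stand-in for the agent's investigation; returns findings as plain strings."""
    return ["empty input is not handled", "timeout is swallowed silently"]


def expand(graph, research_id):
    """Close a research ticket and add one implementation ticket per finding."""
    research = graph[research_id]
    findings = run_research(research)
    research["status"] = "done"
    new_ids = []
    for i, finding in enumerate(findings, start=1):
        new_id = f"{research_id}.{i}"
        graph[new_id] = {
            "title": f"Fix: {finding}",
            "status": "open",
            "parent": research["parent"],     # siblings under the same goal
            "blocked_by": [research_id],      # only valid once research is done
        }
        new_ids.append(new_id)
    return new_ids


graph = {
    "P1": {"title": "Harden module X", "status": "open", "parent": None, "blocked_by": []},
    "R1": {
        "title": "Investigate how module X handles edge case Y",
        "status": "open",
        "parent": "P1",
        "blocked_by": [],
    },
}
new_ids = expand(graph, "R1")
print(new_ids)   # prints ['R1.1', 'R1.2']
```

The plan now contains two tickets that did not exist when the session started, each grounded in something the agent actually found.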

Self-review

A cheap trick that catches real bugs: create a review ticket at the end of a feature — “review all changes under parent X, look for issues, create fix tickets for anything you find.” Agents nearly always find real problems. Missed edge cases, inconsistent patterns, shallow tests. One extra ticket, meaningful quality improvement.

Compaction vs. Fresh Context

There’s an ongoing debate about whether long sessions with compaction or hard resets with fresh context per task produce better results. The Ralph Wiggum pattern — a bash loop spawning a fresh agent per iteration — has gotten popular, and its core insight is sound: fresh context beats polluted context.

The answer depends on the model.

Recent Codex models (GPT-5.1-Codex-Max onward) are trained natively for cross-compaction coherence. Compaction carries forward an encrypted state blob with latent reasoning, not just a summary. With GPT-5.4, compacted context is (usually) a net positive — the agent retains patterns it noticed, approaches it rejected, the “feel” of the codebase. Wiping that out means re-deriving it.

For models that handle compaction less gracefully, lean Ralph-style: hard reset between tasks, let the lb graph be the only carryover. This works because lb tickets are self-contained by design. lb ready → read ticket → start working. No conversational memory required.

The lb approach is complementary to Ralph, not competing. You could run lb inside a Ralph loop: each iteration starts clean, queries the graph, picks up a ticket with full context in the body, executes, closes, exits. lb replaces Ralph’s progress.txt with something richer — hierarchy, dependencies, concrete context in every ticket, and a real UI in Linear.
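Here is what “the graph is the only carryover” looks like in miniature. Each iteration is a function call that starts with nothing but a path to external state. The JSON file stands in for Linear, and `one_iteration` stands in for spawning a fresh agent. Both are assumptions for the sketch, not part of lb or the Ralph pattern as published.

```python
import json
import tempfile
from pathlib import Path


def one_iteration(state_path):
    """A fresh 'agent': it knows nothing except what it reads from the state file."""
    graph = json.loads(state_path.read_text())
    for tid, t in graph.items():
        if t["status"] == "open" and all(
            graph[b]["status"] == "done" for b in t["blocked_by"]
        ):
            # The agent would do the work described in t["body"] here.
            t["status"] = "done"
            state_path.write_text(json.dumps(graph))
            return tid
    return None   # nothing ready: the outer loop can stop


state_path = Path(tempfile.mkdtemp()) / "graph.json"
state_path.write_text(json.dumps({
    "B": {"status": "open", "blocked_by": ["A"], "body": "Wire the parser into the CLI"},
    "A": {"status": "open", "blocked_by": [], "body": "Add the parser module"},
}))

completed = []
while (tid := one_iteration(state_path)) is not None:   # the Ralph-style outer loop
    completed.append(tid)
print(completed)   # prints ['A', 'B']
```

No variable survives between iterations except the file on disk, so swapping the agent, the machine, or the session changes nothing about what happens next.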

The Principles That’ll Outlast the Tools

This snapshot will age. The specific tools and models won’t stay the same. But these principles should hold:

Self-verification is the key to long runs. Find or construct a pass/fail signal for every task. Agents that can check their own work stay on track for hours. Agents that can’t will drift in minutes, no matter how good the model is.

Persistent structured memory makes everything else possible. The agent needs a graph, not a list. Hierarchy and dependencies let any fresh context window orient itself instantly and pick up the right next task. This is what turns “the agent ran for a while” into “the agent shipped a feature.”

Task definition is the real skill. Not prompting, not model selection, not orchestration. The quality of task decomposition — specific enough to be verifiable, scoped enough to be completable, contextualized enough to survive a cold start — that’s the ceiling.

Keep state external and portable. Tickets in a cloud API, code in git. Not coupled to any specific agent, machine, or session. The agents are disposable. The work graph persists.

If you want to try this, clone lb, then run lb onboard and watch your agent get smarter. Your agent walks itself through setup and starts tracking work immediately.

linear-beads is open source. Feedback, issues, and PRs welcome.

Nik's Notes