agent.completed chains, time.scheduled triggers, file.changed watches, and an agent that can edit markdown files.
The result: an agent guided by a program.md file that states its objective and evaluation criteria, evaluated after every run, with periodic trend analysis that drafts proposed program edits for you to approve.
The Idea
- Score every run - an agent.completed chain fires an evaluator after each run.
- Detect drift - a weekly time.scheduled chain reads the score history and drafts proposals.
- Approval gate - a file.changed watch on the proposals folder fires when you check the approval box.
- Self-modification - the applier is just another agent; agents can edit files.
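As an illustration of the approval gate, here is a minimal Python sketch of how an applier-side check might confirm that the box is really ticked before acting. The checkbox wording ("Approve ...") and the markdown task-list format are assumptions for this example, not something the pipeline prescribes.

```python
import re

# Hypothetical convention: a proposal counts as approved when it contains a
# checked markdown task-list item whose label starts with "Approve".
APPROVED = re.compile(r"^\s*[-*]\s*\[[xX]\]\s*Approve", re.MULTILINE)


def is_approved(proposal_markdown: str) -> bool:
    """Return True only if the approval checkbox is ticked."""
    return bool(APPROVED.search(proposal_markdown))


pending = "# Proposal\n\n- [ ] Approve this change\n"
approved = "# Proposal\n\n- [x] Approve this change\n"
print(is_approved(pending), is_approved(approved))  # False True
```

A check like this keeps an unrelated save of the proposal file from being treated as approval.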
Note: {{CUE_SOURCE_OUTPUT}} is subject to a 5000-character cap. Workarounds are below.
The Program File
Every agent in this pattern owns a program.md at its project root. The evaluator reads it; the applier edits it. Keep it in git.
The Pipeline
Three chains in one pipeline. Replace <remy-agent-id>, <evaluator-agent-id>, <analyst-agent-id>, and <applier-agent-id> with real agent UUIDs (maestro-cli list agents). The four agents can be the same model - they just need separate sessions so their contexts stay clean.
Capturing Full Run Output
Cue’s {{CUE_SOURCE_OUTPUT}} is sliced to 5000 chars before it reaches the next agent’s prompt. That’s fine for short transcripts but lossy for the kind of evaluator we want.
The workaround: have the upstream agent write its full transcript to a known path at the end of every run. Then the evaluator reads that file directly instead of relying on the template variable.
Add this to Remy’s system prompt or its standing instructions:
Alternatively, use a Command node (action: command, mode: shell) to extract the run transcript from Maestro’s session history file. The session history JSON lives at the path documented in the history format reference.
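To make the trade-off concrete, the following Python sketch contrasts a transcript that passes through a 5000-character slice with the same transcript read back from a file. The file name latest-transcript.md and the helper names are hypothetical; only the 5000-character limit comes from the text above.

```python
from pathlib import Path
import tempfile

CUE_OUTPUT_LIMIT = 5000  # slice length stated in this section


def save_transcript(path: Path, transcript: str) -> None:
    """What the upstream agent would do at the end of a run."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(transcript, encoding="utf-8")


def load_transcript(path: Path) -> str:
    """What the evaluator would do instead of using the template variable."""
    return path.read_text(encoding="utf-8")


transcript = "step\n" * 2000  # 10,000 characters
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "runs" / "latest-transcript.md"  # hypothetical path
    save_transcript(path, transcript)
    via_template = transcript[:CUE_OUTPUT_LIMIT]  # what the slice would keep
    via_file = load_transcript(path)              # the full transcript
    lost = len(via_file) - len(via_template)

print(lost)  # 5000 characters never reach the evaluator via the template
```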
The Guardrails Are Load-Bearing
An evaluator that scores its own agent’s output, paired with an analyst that proposes program edits, paired with an applier that writes those edits, is a closed loop that can drift into goal-corruption by degree. The evaluator can quietly redefine “good” to mean “what the agent is already doing well.” The applier can erode constraints over months in ways no single proposal would obviously warrant.
The four guardrails baked into the pipeline above:
- Human-in-the-loop approval. The applier never runs without a checkbox flip. This is the most important guardrail. Do not automate the checkbox.
- Section-level edit restrictions. The applier hard-refuses edits to Objective and Constraints. Strategies and thresholds can drift; mission and red lines cannot.
- Git as audit log. Every applied proposal is a commit. Drift is reviewable in git log program.md.
- Evaluator can’t propose. Chain 2 only scores. Chain 3 only proposes. Chain 4 only applies. No agent does more than one of those three things.
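The section-level restriction is enforced through the applier’s instructions, but the same rule could also be checked mechanically before a commit. The sketch below is one possible way to do that in Python; it assumes program.md uses markdown headings named Objective and Constraints, and the function names are made up for this example.

```python
import re

PROTECTED = {"Objective", "Constraints"}  # sections the applier must not touch


def split_sections(markdown: str) -> dict[str, str]:
    """Map each heading title to the text beneath it."""
    sections: dict[str, str] = {}
    title = None
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            title = match.group(1)
            sections[title] = ""
        elif title is not None:
            sections[title] += line + "\n"
    return sections


def violated_sections(before: str, after: str) -> list[str]:
    """Return protected sections whose text changed or that were removed."""
    old, new = split_sections(before), split_sections(after)
    return sorted(name for name in PROTECTED if old.get(name) != new.get(name))


before = "## Objective\nFind papers.\n## Strategies\nSearch arXiv.\n## Constraints\nNo paywalls.\n"
allowed = before.replace("Search arXiv.", "Search arXiv and bioRxiv.")
refused = before.replace("No paywalls.", "Paywalls are fine.")
print(violated_sections(before, allowed))  # []
print(violated_sections(before, refused))  # ['Constraints']
```

Run as a pre-commit check, a non-empty result would be a reason to reject the edit regardless of what the applier decided.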
Cost Control
Chain 2 fires after every run. If Remy runs hourly (about 720 runs a month) and your evaluator costs roughly $0.10 per run, that is ~$72/month per agent just to score. Two ways to throttle:
Sample by gating with a coin-flip command node. Insert a Command node between agent.completed and the evaluator that exits non-zero ~70% of the time:
source_sub: sample-gate-remy).
Or schedule evaluations. Drop chain 2 entirely. Replace it with a time.scheduled chain that runs daily, reads the last N entries from your run log, and scores them in a batch.
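As a rough sketch of the sampling option, the Python below models a coin-flip gate that exits non-zero about 70% of the time, plus the cost arithmetic behind it. The per-evaluation cost of 0.10 is an assumed figure (with hourly runs it reproduces the ~72/month mentioned above), and the function names are hypothetical.

```python
import random

SKIP_PROBABILITY = 0.7  # fraction of runs the gate should drop


def gate_exit_code(rng: random.Random) -> int:
    """Return 0 (let the evaluator run) ~30% of the time, else 1 (skip)."""
    return 0 if rng.random() >= SKIP_PROBABILITY else 1


def monthly_cost(runs_per_day: float, cost_per_eval: float,
                 keep_fraction: float = 1.0, days: int = 30) -> float:
    """Estimated monthly evaluator spend for one agent."""
    return runs_per_day * days * cost_per_eval * keep_fraction


# In a real gate script the last line would be:
#   sys.exit(gate_exit_code(random.Random()))
full = monthly_cost(24, 0.10)                        # every run scored
sampled = monthly_cost(24, 0.10, keep_fraction=0.3)  # ~30% of runs scored
print(round(full, 2), round(sampled, 2))
```

Under these assumptions sampling cuts the bill from about 72 to about 21.60 a month, at the price of scoring only a random subset of runs.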
What You Give Up vs. a Purpose-Built System
This pattern is a faithful implementation of the loop concept, but a few things a dedicated platform would give you are missing:
- No SQLite-backed query layer. evaluations.jsonl is fine for one agent; at fleet scale you’d want indexed queries. The analyst handles this by reading the whole file each run - workable up to a few thousand entries.
- No dashboard. Trends live in the analyst’s weekly report. Maestro’s Document Graph can help you navigate proposals + evaluations via [[wiki]] links if you author the markdown that way.
- No program inheritance. Sentinel models org → team → agent program inheritance. This pattern is single-agent.
- No automated alignment checks. The “is the agent actually following program.md” question is implicitly handled by the evaluator scoring against criteria. A dedicated alignment-checker pass (its own chain) is straightforward to add if you want one.
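For a sense of what "reading the whole file each run" amounts to, here is a small Python sketch that groups evaluation records by ISO week and averages their scores. The record fields (timestamp, score) are assumptions made for this example; the actual shape of evaluations.jsonl depends on what your evaluator writes.

```python
import json
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_means(jsonl_text: str) -> dict[str, float]:
    """Average score per ISO week, reading every line of the log."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        stamp = datetime.fromisoformat(record["timestamp"])  # assumed field
        year, week, _ = stamp.isocalendar()
        buckets[f"{year}-W{week:02d}"].append(record["score"])  # assumed field
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


log = "\n".join([
    '{"timestamp": "2025-01-06T09:00:00", "score": 8.0}',
    '{"timestamp": "2025-01-07T09:00:00", "score": 6.0}',
    '{"timestamp": "2025-01-13T09:00:00", "score": 5.0}',
])
print(weekly_means(log))  # {'2025-W02': 7.0, '2025-W03': 5.0}
```

A full scan like this is linear in the number of entries, which is why it stays workable for one agent but would not scale to a fleet.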
Adapting the Pattern
The case study uses a research agent (“Remy”), but the loop is the same shape for any agent whose quality is measurable:
- Code review agent - criteria: false-positive rate, severity calibration, fix actionability.
- Daily briefing agent - criteria: relevance, signal-to-noise, brevity.
- Triage agent - criteria: label accuracy, reviewer-suggestion fit, response time.
Change the criteria in program.md and you’re done. The analyst and applier are agent-agnostic.
See Also
- Case Study: Maestro Marketing Pipeline - what this pattern looks like in production, with eight chains driving the @RunMaestroAI X account.
- Cue Configuration - full subscription schema.
- Cue Advanced Patterns - fan-in, fan-out, command nodes, template variables.
- Cue Examples - copy-paste-ready pipelines for common workflows.
- Karpathy’s AutoResearch - the original inspiration.
- Sentinel - a standalone implementation of the same idea with a built-in dashboard and SQLite-backed eval store.