Should an agent’s output be treated as compiler output? Link to heading

There is an increasingly common scenario in software development: you deliver a task to a coding agent, it reads files, executes commands, changes code, runs tests, and ends with a calm, organized, almost executive response.

“I implemented the change. I ran the tests. Everything is fine.”

The temptation is to accept it as a finished delivery. After all, the agent seems confident. It wrote well. It seems to have understood the context. And if you’re tired, behind schedule, or just glad you didn’t have to wade through six configuration files, that answer feels like relief.

But this is exactly the dangerous point.

The output of a coding agent should not be read as a correctness certificate. It should be read more like the output of a compiler, a linter, a test suite, or a static analysis tool: a useful, contextual, sometimes very valuable technical signal, but still a signal that needs to be interpreted.

A compiler doesn’t ask for trust. It points to a file, a line, a broken rule, a severity level and, sometimes, a suggested fix. An agent should learn from this operational humility.

The good answer is not necessarily the one that sounds most human. It is the one that leaves a reviewable, traceable and reversible trail.

This thesis does not come literally from the documentation of GCC, Clang, Claude Code or Codex. It’s an editorial conclusion based on a well-known technical culture: reliable tools don’t try to appear infallible. They leave evidence.

Conceptual infographic with the phrase “Diagnostics, not oracle”, showing an agent response being turned into reviewable technical evidence.

The confident answer that seems too ready Link to heading

The problem for agents is not simply making mistakes. All software makes mistakes. Compilers have bugs, linters have false positives, tests can be poorly written, humans review poorly after lunch. Error is nothing new.

The specific problem with agents is that they can make mistakes fluently.

A compilation error usually looks like an error. It is cold, dry, almost unfriendly. It says something does not close. It shows the line. It complains about a type. It stops the build. It does not try to sound convincing.

An agent can produce an elegant explanation for a change that did not solve the problem. It can say it ran tests without making clear which tests it ran. It can claim it preserved compatibility without showing where that was checked. It can state that “the implementation now covers case X” when in practice it only covered a narrow interpretation of the request.

This fluency is useful for collaboration, but bad for validation. It mixes technical diagnosis with narrative. And narrative, in engineering, is too beautiful a knife to be left lying around on the table.

The problem is not that the agent makes mistakes; it is making mistakes in the form of a final delivery Link to heading

When a compiler fails, no one says, “it seemed so safe.” The message may be bad, confusing, or verbose, but it clearly falls into the diagnostic category. It’s a tool saying: “look here”.

With agents, the line between diagnosis and conclusion becomes blurred. The agent does not just point out problems; it proposes solutions, changes files, interprets intent, summarizes decisions, and tries to close the conversation with a final answer.

This creates a subtle risk: the agent’s output is treated as the end of the work, when it should often be just another stage of the work.

The difference seems small, but it changes everything.

If the final answer says “done”, the human mind tends to reduce attention. If the final answer says “hypothesis implemented, verified by commands A and B, with remaining risk C,” it invites review. One asks for trust. The other leaves a trail.

And trail is what enables engineering.

Why “seems plausible” is not enough as an acceptance criterion Link to heading

In software, plausibility is cheap.

A diff might seem reasonable and break an edge case. A refactoring may look clean and remove old compatibility. A fix may pass existing tests because the tests don’t cover the actual bug. An explanation can make sense and still be based on the wrong file.

With agents, this plausibility becomes even more seductive because it is packaged in natural language. The agent does not just deliver code; it delivers a story about the code. And humans like coherent stories.

But acceptance criteria require more than narrative coherence. They ask for evidence, technical judgment and product context.

What was changed? Where? Why? What expected behavior was verified? Which commands were run? Which test failed before and passed after? What hasn’t been tested? What part depends on human review? Which decision is reversible? Which decision is risky?

Without this, the agent’s response is a well-written gamble.

What compilers taught us about technical confidence Link to heading

Compilers are one of the great schools of humility in programming. Not because they are perfect, but because their interface with error is more mature than it seems.

They don’t just say “I didn’t like it.” They classify. An error blocks the compilation. A warning indicates a suspicious condition. A note adds context. A suggested correction, when it exists, tries to be small, localized and safe.

The GCC documentation, for example, separates errors and warnings directly: errors prevent compilation; warnings flag unusual conditions that may signal a problem, even though the program can still compile. This distinction is simple but powerful. Not every problem has the same severity. Not every suspicion should block the flow. Not every suggestion deserves automatic application.

Agents also benefit from this culture of severity.

Today, many agent responses arrive as a single narrative block: “I did X, Y, and Z.” But in a more mature engineering flow, the output would tend to better separate what is fact, hypothesis, change, verification, limitation and recommendation.

Error, warning, note and correction suggestion are different things Link to heading

Imagine an agent finishes a task and reports four things:

  • found a logic flaw;
  • noticed a possible uncovered edge case;
  • changed two files;
  • suggests adding tests in a related area.

These four things do not have the same weight.

The logic flaw is perhaps an error. The edge case is perhaps a warning. The changed files are part of the patch. The test suggestion can be a note. Mixing everything into one elegant paragraph reduces the usefulness of the answer.
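As a minimal sketch of that severity mapping, assuming an internal representation an agent could use (the type names, file path and messages below are illustrative, not any tool’s real API):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"            # blocks acceptance, like a compile error
    WARNING = "warning"        # suspicious, worth review, does not block by itself
    NOTE = "note"              # extra context for the reviewer
    SUGGESTION = "suggestion"  # optional follow-up, never applied silently

@dataclass
class Finding:
    severity: Severity
    summary: str
    location: str | None = None  # file/line when known, like a compiler diagnostic

# The four findings from the example above, with their weights made explicit.
findings = [
    Finding(Severity.ERROR, "logic flaw in the changed function", "src/example.py:42"),  # hypothetical path
    Finding(Severity.WARNING, "possible uncovered edge case"),
    Finding(Severity.NOTE, "two files changed as part of the patch"),
    Finding(Severity.SUGGESTION, "add tests in the related area"),
]
```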

Compilers learned this a long time ago. A good diagnosis is not just a message; it is a decision structure.

For agents, this structure could be something like:

  • understanding of the task;
  • analyzed files;
  • changes made;
  • checks performed;
  • limitations encountered;
  • remaining risks;
  • recommended next steps.

It’s not bureaucracy. It’s review ergonomics.
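A minimal sketch of what that could look like as a data structure rather than a closing narrative, with field names that are assumptions for illustration, not a standard from any agent’s documentation:

```python
from dataclasses import dataclass, field

@dataclass
class AgentReport:
    """Structured final output: a diagnosis to review, not a narrative to trust."""
    task_understanding: str                                     # what problem the agent believes it is solving
    files_analyzed: list[str] = field(default_factory=list)
    changes_made: list[str] = field(default_factory=list)       # file plus the nature of each change
    checks_performed: list[str] = field(default_factory=list)   # exact commands, tests, builds
    limitations: list[str] = field(default_factory=list)        # what could not be verified, and why
    remaining_risks: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)
```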

When an agent writes code, it enters the most sensitive flow of a project: the repository. It can change behavior, accidentally erase complexity, simplify something that was ugly for good reasons, or introduce an unnecessary dependency. If the final output does not help you review these decisions, it is incomplete.

Location, rule, severity and context matter Link to heading

A good compiler tries to tell you where the problem appeared. A good linter points out the violated rule. A good test shows expectation and result. A good agent should do something equivalent.

It’s not enough to say “I fixed the bug in the authentication flow”.

It would be better to say that the change occurred in a certain module, that the likely cause was a validation carried out before normalizing the data, that the behavior was verified with a specific suite and that there was no manual testing in the browser yet. Even without turning into a giant report, the answer now has anchor points.

Location lets you review. Rule allows understanding. Severity allows prioritization. Context allows you to decide.

Without these elements, the agent becomes a kind of very productive colleague who delivers things without explaining exactly what it did. That might work on a small task. In a real codebase, it comes back with interest.

A good diagnosis reduces ambiguity; it does not demand faith Link to heading

The best tool output is the one that reduces the question “is this right?” to smaller, verifiable questions.

Compilers do this when pointing to an incompatible type. Tests do this when they isolate a broken expectation. Linters do this when showing a specific rule. Agents should do this when they finish a task.

An agent’s ideal output is not “trust me.” It’s “here’s what I did, here’s how I checked, here’s what could still be wrong.”

This change appears to be one of tone, but it is architectural.

Because a coding agent is not just a chatbot with editor access. It is an operational layer on top of the repository. It reads state, writes state, executes commands, interprets errors, changes plans, and produces diffs. The more autonomy it has, the more important it is that its output is auditable.

Where the analogy works for AI agents Link to heading

The analogy between agents and compilers works well when we talk about trust interfaces. Not because agents are compilers, but because the diagnostic culture of compilers provides a useful yardstick.

A compiler analyzes and processes code according to formal rules, and when something goes wrong, it produces diagnostics. An agent transforms intention into actions on an environment and, when finished, should produce a diagnosis of what happened.

This is the important twist: the agent’s final response is not just a message to the user. It is part of the work artifact itself.

Agent output as operational diagnostics Link to heading

An operational diagnosis answers questions that matter for continuity.

How did the agent understand the task? What files did it read? What commands did it execute? What changes did it apply? What tests did it run? What error did it find? What decision did it make to get past the error? What part was it unable to check?

This information is not embellishment. It lets another human, another agent, or your future self rebuild the path.

This is traceability.

The answer also needs to be reviewable. In other words: it needs to allow someone to compare statement and evidence. If the agent says that “the build passes”, it must mention the command executed. If it says “there is no visual impact”, it must say whether that was checked by screenshot, browser, test or just code inspection. If it says “the change is small”, the diff needs to confirm it.
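As a small sketch of that pairing, each statement could travel with its evidence, or with an explicit admission that there is none; the claims and commands below are illustrative assumptions:

```python
# Each statement travels with its evidence, or with an explicit "not verified".
claims = [
    {"claim": "the build passes",
     "evidence": "ran `npm run build`; exit code 0"},            # hypothetical command
    {"claim": "no visual impact",
     "evidence": "code inspection only; no screenshot or browser check"},
    {"claim": "the change is small",
     "evidence": "diff touches 2 files; 14 lines added, 3 removed"},
]

for item in claims:
    print(f"- {item['claim']}: {item['evidence']}")
```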

The output also needs to be reversible. A good output helps you undo or adjust the change. It tells you which files were touched, what intention guided each change, and which parts are independent. If everything is mixed together in a grand narrative, reversal becomes more difficult.

Diffs, commands, tests and evidence as part of the answer Link to heading

A code-changing agent should think of evidence as part of the delivery.

It is not enough to produce a diff. The diff needs to be connected to the objective.

It’s not enough to run tests. The tests need to be connected to the risk.

It’s not enough to say “I didn’t run tests”. The impact of this needs to be explained.

This is especially true for agents that work in a more autonomous mode. The Claude Code documentation recommends giving the agent ways to verify its own work, such as tests, screenshots, and expected outputs. It also recommends separating exploration, planning and implementation. This sequence is not ritual; it is a way to prevent the agent from solving the wrong problem too efficiently.

The Codex CLI, on the other hand, explicitly documents approval modes. In Read-only mode, it is more consultative. In Auto mode, it can read, edit and execute commands within the working directory, with limits. In Full Access mode, it gains much wider access and should be used with caution. The documentation also states that Codex shows a transcript of its actions, allowing review and rollback through the normal Git flow.

These two approaches point in the same direction through different paths: autonomy needs to be accompanied by verification, clear permissions and a path of actions.
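A minimal sketch of that direction, not the real Codex or Claude Code implementation: the approval mode is checked before each action, so autonomy is granted per operation instead of assumed globally. The mode names echo the Codex description above; the gating logic itself is an assumption.

```python
from enum import Enum

class ApprovalMode(Enum):
    READ_ONLY = "read-only"      # consult and propose, never touch the workspace
    AUTO = "auto"                # read, edit and run commands inside the working dir
    FULL_ACCESS = "full-access"  # much wider access; use with caution

def allowed(mode: ApprovalMode, action: str) -> bool:
    """Rough permission gate: which actions each mode may take without asking first."""
    if mode is ApprovalMode.READ_ONLY:
        return action == "read"
    if mode is ApprovalMode.AUTO:
        return action in {"read", "edit", "run_command_in_workdir"}
    return True  # full access still deserves human checkpoints before destructive steps

# An edit is refused in read-only mode and must become a proposal instead.
assert not allowed(ApprovalMode.READ_ONLY, "edit")
assert allowed(ApprovalMode.AUTO, "edit")
```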

If the agent only “talks nicely”, it is of little help. If it talks, changes, checks, and leaves evidence, it starts to behave like an engineering tool.

When a suggestion can become an automatic patch Link to heading

Modern compilers sometimes offer fix suggestions. Clang calls this fix-it hints: small transformations that can guide the fix of a localized problem. But there’s an important idea in Clang’s internal documentation: an autocorrect needs to have a high probability of matching user intent. When this is not clear, the suggestion should be more cautious.

This principle suits coding agents very well.

Not every suggestion should become a patch. Not every patch should be applied without confirmation. Not every confirmation should authorize broad changes.

A localized, mechanical, test-covered, and easy-to-reverse fix can be applied more autonomously. An architectural change, public behavior change, data migration, or security modification requires another level of approval.

Autonomy is not a binary button. It is a scale based on trust, scope and reversibility.

Agents in real code should better differentiate:

  • “I can fix this directly”;
  • “I can propose a small patch”;
  • “I need to confirm the intention”;
  • “this requires human decision”;
  • “there is not enough evidence to move”.

That’s a lot less cinematic than a demo of AI building an entire app in five minutes. It’s also much more like real engineering.
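As a minimal sketch, those five postures could be decided from scope, reversibility and available verification; the thresholds below are purely illustrative assumptions:

```python
def decide_posture(files_touched: int, reversible: bool,
                   covered_by_tests: bool, intent_clear: bool,
                   evidence_available: bool) -> str:
    """Map a candidate change to one of the five postures listed above.
    The order and thresholds are illustrative, not a specification."""
    if not evidence_available:
        return "there is not enough evidence to move"
    if not intent_clear:
        return "I need to confirm the intention"
    if not reversible or files_touched > 20:
        return "this requires human decision"
    if files_touched <= 3 and covered_by_tests:
        return "I can fix this directly"
    return "I can propose a small patch"
```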

Where the analogy breaks Link to heading

The comparison with compilers is useful, but it has a limit. And it’s important to mark this limit so as not to build a new fantasy on top of the old one.

Compilers work on formal rules. They have relatively clear grammars, type systems, parsing phases, and criteria for accepting or rejecting a program. Agents work with natural language, partial context, probabilistic inference, conversation history, files read, commands executed and system instructions that the user often does not see in full.

This changes the type of error.

A compiler can point out a type mismatch. An agent may misinterpret the user’s goal. A compiler may fail to optimize. An agent can optimize the wrong thing. A compiler typically doesn’t decide that it would be better to refactor your entire module because “it’s cleaner”. An agent can suggest this, with the best of intentions and the worst of impacts.

Compilers operate on formal rules; agents operate on incomplete context Link to heading

An agent typically does not operate on the entire project in context in the same way that a compiler operates on a formal unit of analysis. It can read many files, but it still works with context selection. It may not see a historic decision. It may not be aware of a product restriction. It may not understand that an ugly snippet exists to maintain compatibility with an old client, a bad API, or a browser bug that no one wants to encounter again.

This is an essential point: the agent does not make mistakes just because of a lack of intelligence. It also makes mistakes because it lacks context.

And context, in agents, is an operational resource. It fills, ages, is pruned, summarized, reorganized. The recent post about Claude Code’s problems showed this well: the perceived quality of an agent does not only depend on the base model, but also on reasoning effort, context, cache, system prompt, tools and surrounding harness.

When one of these layers changes, behavior can change.

To the user, everything appears as “the AI has gotten worse”. Underneath it could be a default change, a cache policy, a system instruction, a loss of history, or a specific execution path. This is yet another reason to demand a trail. Without traceability, the diagnosis becomes a sensation.

An agent can be wrong in intention, not just in syntax Link to heading

Compilers are great at telling you that something doesn’t compile. They are less useful in telling you whether you are solving the right problem. Agents enter precisely this ambiguous territory.

When you ask “improve the performance of this screen”, the agent needs to infer what improvement means. Latency? Bundle size? Rendering time? Database query? Visual perception? It can choose a plausible direction and still choose the wrong one.

When you ask “simplify this code”, it can remove complexity that seemed accidental but was deliberate.

When you ask “fix this bug”, it may treat the symptom instead of the cause.

This type of error does not appear as a syntax error. Sometimes it passes the build. Sometimes it passes the tests. Sometimes it even seems like an improvement.

Therefore, the agent’s output needs to record understanding. Before listing what it did, the agent should make clear what problem it believes it is solving. This part is reviewable. If the understanding is wrong, the work stops there, before it becomes a diff.

It looks like a detail, but it avoids a lot of rework.

Natural language increases the risk of false authority Link to heading

Natural language is the great interface for agents. It is also the source of some of the danger.

A compiler warning doesn’t try to convince you with a calm tone. An agent’s response can sound like a mature decision even when it is just a hypothesis. It can use words like “guarantees”, “resolves”, “preserves”, “no impact” and “correct” without sufficient evidence.

These words have weight. In hurried teams, weight becomes acceptance.

Therefore, a healthy agent culture should prefer language proportionate to the evidence. If the agent tested, it should say so. If it inferred, it should say so. If it didn’t check, it should say so. If there is a risk, it should name it.

The mature output is not the most confident one. It is the one most honest about its own degree of certainty.

What a good agent output should look like Link to heading

A good agent output should be structured for review. It doesn’t have to be long by default. It doesn’t need to turn every task into meeting minutes. But it needs to contain the right elements so that the human is not forced to trust the charisma of the response.

Think of it as a mixture of diagnostics, changelog and verification report.

What was understood Link to heading

Before talking about the solution, the agent should explain the interpretation of the task.

This is especially important in ambiguous requests. If the user asks “adjust the mobile layout”, the agent needs to say what it considered to be a problem: horizontal overflow, spacing, line wrapping, button size, contrast, visual hierarchy. If the user asks “fix the build”, it needs to say which flaw it found and what probable cause it addressed.

This step is not bureaucratic. It’s a hedge against efficiency in the wrong direction.

The agent can be quick to edit. That is precisely why it needs to be clear before editing.

What was changed Link to heading

The answer needs to point out the files and the nature of the changes. Not at the “I tweaked some things” level, but at the level needed to guide review.

What changed in behavior? What changed in the structure? What was removed? What was kept on purpose? Was there a change to the API, layout, dependency, configuration, data or public contract?

In a human review, these categories matter. A change to local CSS does not have the same risk as a change to authentication. A copy change does not have the same risk as a migration. An adjustment in a test does not have the same risk as a change in the logic the test is supposed to validate.

Agents need to stop treating all changes as “changes”.

What was checked Link to heading

This is the part where the agent’s output is closest to traditional tools.

If it ran a test, which one? If it ran a build, which command? If it checked something visually, in which environment? If a reference was validated, how? If it compared output, against which expectations?

An unverified assertion can be useful, but needs to be marked as such.

There is a huge difference between:

“It’s working.”

And:

“I ran Suite X; it passed. I didn’t run manual testing in the browser.”

The second answer is less glamorous and much more useful. It allows decision. Maybe the risk is acceptable. Maybe not. But now the risk is visible.

What was not checked Link to heading

This part should appear frequently in non-trivial tasks.

Agents tend to want to close with a sense of completion. But good engineering is also knowing how to tell what was left out.

Couldn’t run tests because a dependency was missing? Say so. Does the task involve an external integration that was never called? Say so. Was the visual change never opened on mobile? Say so. Does the fix depend on production behavior? Say so.

This does not diminish the agent’s value. On the contrary: it increases confidence in the output.

Tools that hide uncertainty look better in the short term and cost more in the long term.

Which risks remain open Link to heading

Every change has residual risk. Sometimes small, sometimes huge. The agent should help name it.

A large diff can hide a regression. A fix in a shared flow may affect other modules. A prompt change can alter behavior in untested cases. A cache tweak can create a bug that only appears in long sessions. A CSS change can fix mobile and break desktop.

When the agent lists risks, it is not being insecure. It is being useful.

This is the type of behavior that separates a tool from a demo.

The parallel with vibe coding Link to heading

The discussion about vibe coding matters here.

In the previous post on the topic, the central idea was simple: asking for code in natural language is not the problem. The problem is using AI-generated code without understanding, reviewing, or validating what was done.

This distinction continues to apply.

There is a healthy version of agent development: you use natural language to speed exploration, automate tasks, generate workarounds, do mechanical refactorings, write tests, investigate errors, and reduce cognitive load. In this scenario, the dev stops being just a syntax typist and starts acting as the technical director of the process.

But there is also the dangerous version: you describe a vague intention, accept the diff because it “looks good” and only discover the technical debt when something breaks.

This is the bad side of vibe coding: not the vibe, but the lack of acceptance criteria.

The problem is not asking in natural language Link to heading

Natural language is a great interface for intent.

Many programming tasks begin as an intention before becoming implementation: “I want this flow to be clearer”, “this screen is slow”, “this test is fragile”, “this post needs to validate references”, “this module is difficult to maintain”.

Humans work like this too. No one starts every task with a perfect formal specification. The difference is that, in good teams, the intention is refined until it becomes criteria, plan, implementation and verification.

Agents can help a lot on this path. They can ask questions, explore files, propose alternatives, identify risks, and write the first patch. Natural language is not the enemy.

The enemy is automatic acceptance.

The problem is accepting without review Link to heading

Vibe coding without review turns the agent into a polite technical-debt machine.

It delivers quickly. You accept quickly. The project accumulates decisions that no one understood. Then, when something breaks, the team has no clear way back. The diff exists, but the reason does not. The code is there, but the intent is lost. The conversation explains something, but not necessarily enough.

This is why agent output needs to be reviewable, traceable, and reversible.

Reviewable so that someone can evaluate the change.

Traceable so the team knows how the decision was made.

Reversible so that a bad choice does not turn into an archaeological excavation.

Without these three properties, the agent becomes apparent productivity. And apparent productivity is one of the most efficient ways to produce rework.

The dev stops being a typist and becomes responsible for the acceptance criteria Link to heading

The agents’ good promise is not “no one needs to know how to program anymore”. This is the pamphlet version, good for short videos and bad for production.

The good promise is another: devs can spend less energy on mechanical work and more energy on intention, review, architecture, validation and decision.

But this requires a change of attitude. If the agent writes more code, the human needs to get better at evaluating code. If the agent performs more tasks, the human needs to better define what counts as a completed task. If the agent suggests solutions, the human needs to distinguish useful suggestions from dangerous shortcuts.

The center of work shifts.

You might type less. But you need to get better at acceptance.

The Claude Code case and the importance of the operational trail Link to heading

The recent case of Claude Code is a good reminder that coding agents are not just models with a nice interface.

In the previous post about the recent Claude Code problems, the thesis was precisely this: the perceived quality of an agent depends on the surrounding system. The model matters, of course. But harness, system prompt, context, cache, reasoning effort, tools, permissions, evaluations, and communication with users also matter.

According to the Anthropic postmortem cited in that text, the recent problems involved three different changes: a change in the default reasoning effort, a bug related to cache/context in resumed sessions and a system prompt change aimed at reducing verbosity that ended up affecting quality in programming tasks.

The point here is not to retell the episode. It’s observing the pattern.

To the user, all of this appears as agent behavior. The agent forgets, responds shallowly, chooses a strange tool, loses continuity or seems less trustworthy. Underneath, the cause may lie in several layers.

Without a trail, everything becomes a sensation.

Model, prompt, context, cache and tools affect perceived quality Link to heading

When an agent makes a mistake, it is comfortable to say “the model failed”. Sometimes it’s true. It is often incomplete.

The model may be the same, but the system prompt has changed. The model may be good, but the context was pruned at the wrong time. The reasoning may be limited by a more economical default. The tool may have returned truncated output. The environment may be in a different state than expected. Testing may not cover actual behavior.

This makes agents more like operating systems than magic boxes.

They have layers. And layers fail in different ways.

Therefore, the final output needs to leave signals about these layers. What commands were executed? What context was used? What files were read? What validations were run? Where were assumptions made? Did the agent change something or just propose a change? Were there environment limitations?

These questions do not eliminate errors. They make the error diagnosable.

Without traceability, every failure becomes “AI has gotten worse” Link to heading

When there is no traceability, the conversation is poor.

The user thinks the AI has become stupid. The vendor thinks the user is comparing different cases. The team thinks the tool is unreliable. No one knows whether the problem was prompt, context, model, test, environment, instruction, cache or expectation.

This confusion is not just annoying. It prevents improvement.

In engineering, improving requires comparing. Comparing requires recording. Recording requires a trail.

The same goes inside a repository. If an agent makes a change and doesn’t make the reason clear, the next person who touches that code will have less context than they should. If an agent changes something and doesn’t say what it checked, the review starts in the dark. If an agent fails and doesn’t show where it went wrong, the correction becomes trial and error.

The agent output is part of the work observability system.

Logs and checkpoints are part of trust Link to heading

Trust in agents should not come from personality. It should come from checkpoints.

A checkpoint can be a test passing. It can be a build. It can be a visual review. It can be a list of changed files. It can be a decision summary. It can be a small diff. It can be a human confirmation before a destructive action. It can be an approval mode that separates reading, writing and execution.

None of this sounds as impressive as “the agent did it all by itself”.

But that’s how trustworthy software is built: with limits, logging, and validation.

Good autonomy is not the absence of brakes. It is brakes placed in the right spots.

A practical rule for using agents in real code Link to heading

If we accept that agent output should be treated as technical diagnosis, some practical rules appear.

The first: output without testing is a draft.

This doesn’t mean that every task requires a gigantic suite. A text tweak may not need one. A blog post may need a Hugo build and a visual review. A change in business logic needs more serious testing. The point is that the answer should say which validation corresponds to the risk of the change.

Without validation, the correct delivery state tends to be “proposed”, not “ready”.
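A small sketch of that proportionality, mapping the kind of change to the minimum validation its risk calls for; the categories and commands below are illustrative assumptions, not a universal policy:

```python
# Minimum validation per kind of change; the final answer should say which tier applied.
VALIDATION_BY_RISK = {
    "text or copy tweak": ["human read-through"],
    "blog post": ["hugo build", "visual review of the rendered page"],
    "local style change": ["build", "screenshot of the affected breakpoints"],
    "business logic change": ["unit tests for the changed behavior", "regression suite"],
    "data migration": ["dry run on a copy", "rollback plan", "human approval"],
}

def delivery_state(validations_run: list[str], required: list[str]) -> str:
    """Without the validation that matches the risk, the state is 'proposed', not 'ready'."""
    return "ready" if all(v in validations_run for v in required) else "proposed"
```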

Output without testing is a draft Link to heading

An agent that didn’t check its own work may still have done something useful. But the usefulness belongs to another stage.

It generated a hypothesis. It produced a candidate patch. It did an exploration. It wrote a first version. That is great.

The mistake is calling this done.

This distinction reduces anxiety and improves flow. The agent does not need to get everything right the first time if the process expects iteration. But to iterate well, the work state needs to be named correctly.

A draft is a draft. A verified patch is a verified patch. A finished delivery is another matter.

Large diff requires proportional distrust Link to heading

The greater the diff, the greater the need for explanation.

Not because big diff is always bad. Sometimes broad change is necessary. But a large diff increases the error surface, makes review difficult, and creates more chances for collateral behavior.

Agents can solve small problems with large reorganizations when there are no clear instructions. They may create abstractions too early, tinker with related files out of enthusiasm, or “improve” parts that no one asked for.

In real code this needs to be contained.

A good agent output should justify large diffs. Why so many files? Which changes are essential? Which are mechanical? Which could be separated? What was left out?

If the answer cannot explain the size of the diff, perhaps the diff is too large.

Statement without source or command needs to be marked as hypothesis Link to heading

Agents are good at explaining. Sometimes too good.

When an agent asserts something about a library, an API, a business rule or a tool behavior, the natural question should be: where did this information come from?

If it came from documentation the agent read, cite it. If it came from a test run, show the command. If it came from code inspection, point to the file. If it came from inference, say it is inference.

This goes for research and it goes for code.

The difference between fact and hypothesis needs to appear in the output. Otherwise, everything gets the same tone of certainty. And when everything sounds equally certain, nothing is really verifiable.

Autonomy must grow along with verifiability Link to heading

Perhaps the most important rule is this: the more autonomy you give the agent, the stronger the verification needs to be.

If the agent only answers a question, the risk is low. If it proposes a diff, the risk goes up. If it edits files, it goes up further. If it executes commands, further still. If it publishes, removes, migrates, changes data or tampers with Git history, the risk changes category.

It makes no sense to increase autonomy without increasing traceability.

Here it is worth distinguishing the tools. The Codex CLI documentation describes approval modes and a transcript of actions that can be reviewed alongside the normal Git flow. The Claude Code documentation describes usage practices such as providing verification criteria, using tests or screenshots when it makes sense, and separating exploration, planning, and implementation. They are different capabilities and recommendations, but both reinforce the same operational thesis: agents should not be evaluated just by how much they can do alone, but by how well they make the work auditable.

An agent that asks for confirmation at the right time is no less capable. It is more reliable.

Three-step infographic showing agent output as technical diagnosis: claim, evidence, and acceptance.

Conclusion: do not treat agents as oracles; treat them as advanced diagnostic tools Link to heading

Agent output should occupy a more technical and less mystical place in our workflow.

It is not prophecy. It’s not a sentence. It’s not a guarantee. It’s also not rubbish just because it can make mistakes.

It is an advanced diagnosis produced by a probabilistic system with access to tools, context and action capacity. This is powerful. Precisely because it is powerful, it needs to be treated with discipline.

The good question is not just “did the agent get it right?” This question comes too late.

Good questions are:

Did the agent understand the problem correctly? Did it make clear what changed? Did it show evidence? Did it separate hypothesis from fact? Did it run validations proportional to the risk? Did it expose limitations? Did it make the change easy to review? Did it make the decision traceable? Did it make the path reversible?

This is the standard we should demand.

The good future for coding agents is not the one where they speak with the most confidence. It is the one where they leave the least invisible work.

Because, in the end, software doesn’t need more convincing voices. It needs systems that allow it to review, track and reverse decisions without turning every bug into a paranormal investigation.

And if a compiler, for all its coolness, already understood this decades ago, perhaps our speaking agents can learn it too.

Practical checklist: how to review an agent’s output Link to heading

  • Does the answer make clear which problem the agent understood it should solve?
  • Does the agent separate facts, hypotheses, changes made and future suggestions?
  • Are the changed files identified sufficiently for review?
  • Is the size of the diff proportional to the original request?
  • Are major changes justified by real necessity, not by generic “cleaning”?
  • Does the output tell you which commands, tests, builds or checks were run?
  • When there has been no testing or validation, is this explicitly stated?
  • Do statements about behavior, API, library or business rule point to source, file, command or evidence?
  • Does the answer show what was not checked?
  • Are the remaining risks specifically named?
  • Is the change reversible without requiring rebuilding the intention from scratch?
  • Is there a clear path between request, interpretation, action, evidence and result?
  • Did the agent ask for confirmation before destructive actions, publications, migrations, broad changes, or hard-to-undo changes?
  • Does the output help a human review better, or does it just try to convince them that everything is ready?
  • If this answer came from a compiler, linter, or test, would you accept the severity and evidence as sufficient?

References Link to heading