
Reusable "skills" for coding agents: how to design them so they do not drift
December 28, 2025
Using a coding agent every day teaches the same lesson fast: the agent is not the problem, the workflow is. The first run looks great because the request is fresh and the context is obvious. Later runs get messy because the request becomes shorthand and your rules start living in ten different places.
That is why people are building coding agent skills. A skill is not a longer prompt. A skill is a reusable workflow with a contract: clear inputs, a strict scope boundary, and a proof step.
If you want AI coding automation that stays useful after the novelty wears off, you need two things that do not drift with mood or wording. You need scope the agent cannot silently expand, and you need verification the agent cannot pretend it ran.
Why "skills" are trending in coding agents
Skills are trending for a boring reason: teams want agent output to behave like engineering work, not like a lucky chat session. When an agent is allowed to edit code, the cost of a bad run is not annoyance, it is broken builds, wasted review time, and slow drift in how the repo is maintained.
A skill is the packaging that makes AI agent workflows stable. It combines a fixed interface (what you pass in and what you expect out), a repeatable tool sequence (search, edit, test, report), and verification that is part of the deal. The implementation can be a prompt template, a small wrapper script, or a more structured graph. The important part is that the behavior is reviewable.
The fastest way to tell whether you have a skill or a prompt is to ask: can a reviewer see the scope boundary and the proof step without reading your mind? If those two things are missing, you have a chat snippet, not automation.
The core failure mode: prompt drift
Prompt drift is what happens when the thing you call a skill changes a little every time you use it. You add one extra rule while you are under pressure. You remove a constraint because it feels annoying. You keep the same name, but the behavior is not the same anymore.
The dangerous part is that drift is quiet. You do not notice it on day one because the output still looks plausible. You notice it later when a run touches unrelated files, when tests are skipped, or when the agent confidently reports an outcome it never verified.
Drift gets worse when scope is implied instead of enforced, when verification is optional, and when success is defined as a vibe like "make it better". The fix is not writing a longer prompt. The fix is designing the skill like an interface: explicit inputs, hard boundaries, and proof.
Design a skill like a function
If you want a skill to stay stable over time, design it like a function you would ship in a real codebase. The name should imply scope, the inputs should be explicit, and the output should be something a reviewer can check. Most importantly, the skill should have hard boundaries that prevent freelancing.
The point of the function metaphor is not to make things rigid for no reason. It is to make the behavior repeatable. If a teammate runs the same skill tomorrow, the skill should follow the same path: gather only the allowed context, produce a constrained diff, prove it with verification, and then report what happened.
If you want a simple template, use a contract like this and keep it next to the repo so it can be reviewed.
| Field | What it means | Example |
|---|---|---|
| Skill name | Short, scoped, unambiguous | generate_tests_for_changed_files |
| Inputs | Only what the agent needs | repo root, changed files, test command |
| Outputs | What must exist after it runs | new tests, green test run, short report |
| Hard constraints | Rules that must not be violated | do not edit unrelated files |
| Verification | How we know it worked | run tests, show failures if any |
You can store that contract as a prompt file, YAML, JSON, or code. The format is not the hard part. The hard part is making sure the contract is strong enough that you can reject a run when it violates scope or skips proof.
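To make that concrete, here is one possible shape for the contract as plain Python data. The field names mirror the table above; `SkillContract` is a hypothetical type for illustration, not part of any real agent framework.

```python
# A sketch of a skill contract as plain data; field names mirror the table above.
from dataclasses import dataclass

@dataclass
class SkillContract:
    name: str                    # short, scoped, unambiguous
    version: str                 # bump whenever behavior changes
    inputs: list[str]            # only what the agent needs
    outputs: list[str]           # what must exist after the run
    hard_constraints: list[str]  # rules that get a run rejected if violated
    verification: list[str]      # how we know it worked

GENERATE_TESTS = SkillContract(
    name="generate_tests_for_changed_files",
    version="1.2",
    inputs=["repo root", "changed files", "test command"],
    outputs=["new tests", "green test run", "short report"],
    hard_constraints=["do not edit unrelated files"],
    verification=["run tests", "show failures if any"],
)
```

The point of writing it down this way is that a reviewer can diff the file in a PR and see exactly when the contract changed.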
Skill 1: generate tests for changed files
Test generation is a great first skill because it is useful immediately, but it is also where teams get fooled. A test can pass and still fail to protect the behavior you changed. The failure mode is not red tests. The failure mode is green tests that assert something irrelevant.
So the contract has to force linkage between the change and the test. Feed the agent the list of changed files (not the whole repo), the existing test conventions, and a real test command that will be executed. The skill should treat the test command as part of the input, not as a suggestion.
Then require a repeatable loop. The agent should locate the nearest existing tests for the changed code, add a minimal new test that would fail without the change, run the test command, and attach the output. If tests fail, the skill should debug only inside the changed files and the relevant tests. That boundary prevents a "write tests" request from turning into a broad refactor.
There is one rule that eliminates most fake confidence: if the agent did not run tests, it must say that plainly. No wording that implies green. No "should pass" language. If the environment cannot run tests, the skill can still write tests, but the report must downgrade the confidence level.
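A minimal sketch of that proof step, assuming the skill receives a runnable test command: execute it, attach the output, and downgrade the confidence explicitly when tests could not be run. The report fields are illustrative.

```python
# Run the test command that was passed in, attach the output as proof, and
# never imply "green" when tests were not actually executed.
import subprocess

def run_and_report(test_command: list[str], can_run_tests: bool) -> dict:
    if not can_run_tests:
        # Environment cannot execute tests: still report, but downgrade.
        return {"tests_were_run": False,
                "confidence": "low",
                "note": "tests written but NOT executed"}
    result = subprocess.run(test_command, capture_output=True, text=True)
    return {"tests_were_run": True,
            "passed": result.returncode == 0,
            "output": result.stdout + result.stderr,  # attached as evidence
            "confidence": "high" if result.returncode == 0 else "medium"}

# Example: run_and_report(["pytest", "tests/payments/"], can_run_tests=True)
```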
If you want an extra check that catches the most expensive mistake, occasionally do a revert check. Revert the production change and confirm the new test fails. You do not need to do this every time. Doing it sometimes is enough to detect a pattern where the agent writes decorative tests that always pass.
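A sketch of that revert check, assuming the production change is still uncommitted and git is available: stash only the changed source files (not the new tests), confirm the new tests now fail, then restore the change.

```python
# Temporarily remove the production change and check that the new tests
# actually notice it is gone. A test that still passes is decorative.
import subprocess

def revert_check(changed_src_files: list[str], test_command: list[str]) -> bool:
    subprocess.run(["git", "stash", "push", "--"] + changed_src_files, check=True)
    try:
        result = subprocess.run(test_command, capture_output=True)
        # A useful test should fail once the production change is reverted.
        return result.returncode != 0
    finally:
        subprocess.run(["git", "stash", "pop"], check=True)
```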
This is also where a code review agent skill becomes valuable. A generator skill is biased toward producing output. A reviewer skill is biased toward skepticism. Keeping those roles separate makes it easier to spot brittle assertions and missing edge cases.
Skill 2: refactor one module without breaking behavior
Refactoring is where agents can save a lot of time and also cause a lot of damage. The damage is rarely a single catastrophic bug. It is scope creep that turns a focused change into a repo-wide rewrite that nobody asked for and nobody can review.
A refactor skill needs one hard boundary and one clear goal. Define the module boundary in a measurable way, like a single folder, a package, or a single file plus direct imports. Once you pick the boundary, enforce it. If the agent needs to cross it, it should stop and ask for explicit approval.
The safest refactor loop is deliberately repetitive. Start by stating the goal in one sentence and naming what must not change. Then identify the public surface area of that module, like exports, entry points, and key functions. If behavior is not protected by tests, add characterization tests before touching structure. Only then start the mechanical refactor in small steps, running tests after each step.
Notice what this workflow removes: there is no invitation for the agent to modernize unrelated code. There is no hidden "cleanup" phase. The refactor is constrained, and every step is proven.
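As control flow, that loop might look like the sketch below. The helpers `apply_next_small_step`, `run_tests`, and `changed_paths` are hypothetical stand-ins for agent actions, and `src/payments` is just an example boundary; the point is the ordering and the hard stops.

```python
# Characterization tests first, then small mechanical steps, with tests run
# and the boundary re-checked after every step.
from pathlib import Path

MODULE_BOUNDARY = Path("src/payments")   # the single folder this run may touch

def within_boundary(paths: list[Path]) -> bool:
    return all(MODULE_BOUNDARY in p.parents or p == MODULE_BOUNDARY for p in paths)

def refactor_loop(apply_next_small_step, run_tests, changed_paths, max_steps=20):
    if not run_tests():                 # behavior must be protected before any edit
        raise RuntimeError("add characterization tests before refactoring")
    for _ in range(max_steps):
        done = apply_next_small_step()  # one mechanical transformation
        if not within_boundary(changed_paths()):
            raise RuntimeError("boundary crossed: stop and ask for approval")
        if not run_tests():             # prove behavior after every step
            raise RuntimeError("tests failed: stop and report, do not widen scope")
        if done:
            return "refactor complete inside boundary"
    raise RuntimeError("step budget exhausted: stop and report")
```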
The output of a good refactor skill is easy to review: a diff that stays inside the module boundary, tests that pass (or a clear failure report), and a short explanation of the mechanical transformations so a reviewer understands intent.
This is what makes agent reliability real. You can review and you can revert.
Skill 3: update docs without hallucinating
Doc updates are a classic agent win, but only if you treat docs as a derived artifact of code, not as a creative writing exercise.
A doc skill should be strict about sources. The workflow, guardrails, and output below spell that out, and a small enforcement sketch follows the lists.
Workflow
- Identify the doc files to change.
- Identify the code sources that justify the doc change.
- Make the doc edit.
- Add links or pointers to the code paths.
- If something is unknown, write it as unknown and suggest where to verify.
Guardrails
- do not invent flags, env vars, endpoints, or configuration
- do not change product claims or security language without explicit approval
- keep doc tone consistent with the existing repo
Output
- doc changes with exact sections updated
- pointers to the code paths that justify each change
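One way to enforce the guardrail against invented symbols, sketched under the assumption that the doc diff can be reduced to a list of identifiers it mentions: confirm each one actually appears somewhere under the code root before the doc change states it as fact.

```python
# Every flag, env var, or function name the doc edit mentions must be found
# somewhere in the code it claims to describe; anything missing is reported
# as unknown rather than documented as fact.
from pathlib import Path

def unverified_symbols(identifiers: list[str], code_root: str = "src") -> list[str]:
    missing = []
    for ident in identifiers:
        found = any(ident in path.read_text(errors="ignore")
                    for path in Path(code_root).rglob("*")
                    if path.is_file())
        if not found:
            missing.append(ident)
    return missing

# Example: if unverified_symbols(["--dry-run", "PAYMENTS_API_URL"]) is non-empty,
# the doc change must mark those as unknown instead of describing them.
```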
The skill design rules that prevent drift
1) Scope is a first-class parameter
If scope is not explicit, the agent will expand it. Scope should be something you can pass in, not something implied.
Examples:
- good: "only modify files under
src/payments/" - good: "only create tests under
tests/payments/" - bad: "refactor payment code"
2) Define success in terms of verification
When someone asks for AI agent workflows, they often describe outcomes in words. Skills should describe outcomes in checks.
- tests ran and passed
- build ran and passed
- diff is limited to N files
If you cannot run the checks, the skill should downgrade gracefully and be explicit.
3) Split creation and review
Do not let one skill both make the change and judge it. Use two skills:
- produce a change
- review the change
That is the core pattern behind a code review agent: it is better at spotting risk and missing cases than at inventing new structure.
4) Version your skills
Even a simple version string helps:
- generate_tests_for_changed_files@1.2
- refactor_one_module@0.9
When output quality changes, you can correlate it to a skill version, not a mystery.
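A tiny sketch of how that correlation might work, assuming run logs are plain strings: parse the name and version out of the skill id and stamp them on every run record. The log format here is made up.

```python
# Stamp every run with the skill name and version so a quality regression
# can be traced to a version bump instead of a vague "it got worse".
def parse_skill_id(skill_id: str) -> tuple[str, str]:
    name, _, version = skill_id.partition("@")
    return name, version or "unversioned"

def log_run(skill_id: str, outcome: str) -> str:
    name, version = parse_skill_id(skill_id)
    return f"skill={name} version={version} outcome={outcome}"

# log_run("generate_tests_for_changed_files@1.2", "pass")
# -> "skill=generate_tests_for_changed_files version=1.2 outcome=pass"
```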
5) Measure failure modes, not vibes
- tests generated but not executed
- tests that pass but do not fail when you revert the change
- refactors that touch unrelated files
- doc updates that reference non-existent symbols
That is what makes agent reliability measurable.
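A sketch of tracking those failure modes, assuming each run produces a record with simple boolean flags; the field names are invented for the example.

```python
# Count the failure modes listed above across run records.
from collections import Counter

FAILURE_MODES = [
    "tests_not_executed",
    "tests_pass_after_revert",
    "files_outside_scope",
    "docs_reference_missing_symbols",
]

def tally(run_records: list[dict]) -> Counter:
    counts = Counter()
    for record in run_records:
        for mode in FAILURE_MODES:
            if record.get(mode):
                counts[mode] += 1
    return counts

# tally([{"tests_not_executed": True}, {"files_outside_scope": True}])
# -> Counter({"tests_not_executed": 1, "files_outside_scope": 1})
```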
Put skills in the repo.
Create a folder like agent/skills/ and store each skill as a small artifact with a name, a version, and a contract. It can be a prompt file, a JSON schema, or a little bit of code, but it must be something you can review in a PR. That single decision is the difference between "reusable prompts" and an agent playbook.
Then make drift hard by default.
The easiest practical rule is: a skill is not allowed to claim success without a check. If the skill is "generate tests with AI", it must either run tests or explicitly say tests were not run. If the skill is a refactor, it must either prove behavior is preserved (tests) or it must stop and ask for a safety net. If the skill is docs, it must cite the code source it used.
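That rule can be enforced mechanically. A sketch, assuming run results are plain dictionaries with illustrative field names: a run that claims success without any attached evidence gets downgraded instead of trusted.

```python
# Block "success" claims that arrive without a check attached.
def finalize(run: dict) -> dict:
    has_proof = (run.get("tests_were_run")
                 or run.get("behavior_preserved_evidence")
                 or run.get("cited_code_sources"))
    if run.get("claimed_status") == "success" and not has_proof:
        run["claimed_status"] = "unverified"   # downgrade rather than trust the claim
    return run
```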
Now make the workflows concrete.
For coding agent skills that touch code, for example, define the same three guardrails every time:
- scope boundary (which paths are allowed)
- verification command (what is run)
- stop conditions (max files changed, max tool calls, time box)
These are not abstract ideas. They are the levers that prevent an agent from turning a small request into a repo-wide rewrite.
If you want agent reliability, add two lightweight checks that catch the most expensive failure modes. First, a revert check: occasionally revert the production change and confirm the new tests fail, which catches tests that would pass no matter what. Second, a scope diff check: count changed files and changed directories, and if a skill meant to touch one module modified five unrelated areas, fail the run. This is the fastest way to stop slow drift where refactors become opportunistic cleanups.
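The revert check was sketched earlier; here is the scope diff check combined with the stop conditions from the guardrail list. The thresholds and field names are examples, not recommendations.

```python
# Fail the run if the diff leaves the allowed scope, fans out across too many
# directories, or blows past the stop conditions.
from pathlib import Path

def scope_diff_check(changed_files: list[str],
                     allowed_scope: str,
                     tool_calls: int,
                     max_files: int = 10,
                     max_dirs: int = 3,
                     max_tool_calls: int = 50) -> list[str]:
    failures = []
    scope = Path(allowed_scope)
    outside = [f for f in changed_files if scope not in Path(f).parents]
    touched_dirs = {str(Path(f).parent) for f in changed_files}
    if outside:
        failures.append(f"files outside {allowed_scope}: {outside}")
    if len(changed_files) > max_files:
        failures.append(f"too many files changed: {len(changed_files)}")
    if len(touched_dirs) > max_dirs:
        failures.append(f"too many directories touched: {sorted(touched_dirs)}")
    if tool_calls > max_tool_calls:
        failures.append(f"too many tool calls: {tool_calls}")
    return failures   # any entry here fails the run
```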
Finally, split creation and critique.
Use one skill to produce changes and a separate code review agent skill to evaluate them. The review skill should look for a short list of problems: missing assertions, brittle tests, ignored edge cases, changes outside scope, docs that describe features that do not exist. Keeping this separate reduces the chance the agent convinces itself it did great work.
If you do all of this, skills stop feeling like magic. They start feeling like automation you can trust: same inputs, same boundaries, same checks, and a clear reason when something fails.