Your Skill Is a Behavioral Contract. Are You Sure It Holds?

Let me tell you about the most dangerous assumption in agentic software development today. It isn’t prompt injection. It isn’t hallucination. It’s quieter than both, and it’s already in your repository right now.

The assumption is this: that the quality of a Skill is determined by the care with which it was written.

I’ve watched people spend hours crafting Skills for Claude Code — meticulously structured, lovingly worded, organized with clear headers. Then they ship them, watch the agent invoke them, and declare success. Two weeks later, they can’t explain why the Skill fires on 60% of the inputs it should handle, why it silently fails in Claude.ai but works fine in Claude Code, and why an agent running the Skill last Tuesday produced a subtly different result than the same agent running it today.

This is the Skills trap. And almost everyone falls into it.

The root cause isn’t bad writing. It’s a fundamental misunderstanding of what a Skill actually is.


A Skill Is Not Documentation. It’s a Behavioral Contract.

Here’s the first aha moment, and it reframes everything that follows: a Skill is a contract between you and a reasoning system that will be invoked thousands of times across inputs you’ve never seen, in environments you may not have tested, by users who phrase their needs in ways you couldn’t anticipate.

That word — contract — matters enormously. A contract has two failure modes. It can be broken by the other party, or it can be drafted poorly enough that “following it” produces outcomes you never intended. In agentic systems, the second failure mode is far more common, and far more insidious, because the agent will follow your instructions faithfully while producing exactly the wrong result.

This means your mindset when writing a Skill needs to shift from “how do I describe what this does?” to “what behavioral guarantees am I encoding, and how do I verify they hold?”

Everything in this guide flows from that shift.


The Architecture You’re Writing Into (Most People Skip This)

Before touching a single line of SKILL.md, you need to understand the three-tier loading system you’re working within, because it shapes every structural decision you’ll make.

Tier 1 is your YAML frontmatter — the name and description fields. This is always in the agent’s context window, even when the Skill is not being used. It’s approximately 100 words, and it acts as the sole decision gate: the agent reads it and decides whether to consult your Skill at all. This is the most consequential real estate in your entire codebase.

Tier 2 is your SKILL.md body — the actual instructions. This only loads when the Skill has been triggered. Here’s what most developers miss: it competes directly with the rest of the context window — user messages, file contents, tool outputs, conversation history. A bloated Skill body doesn’t just waste space. It crowds out the very information the agent needs to apply your Skill correctly. Keep this under 500 lines as a strong default.

Tier 3 is your bundled resources — scripts, references, assets. These load only on demand. Scripts can even be executed without being read into context at all, meaning you can have a 2,000-line Python library powering your Skill without consuming a single token of context unless absolutely necessary.

The practical discipline this creates: don’t put Tier 3 content in Tier 2. If you have AWS-specific deployment instructions, they don’t belong in SKILL.md. They belong in references/aws.md, with a pointer that says “if deploying to AWS, read references/aws.md before proceeding.” This pattern — progressive disclosure — is one of the highest-leverage structural choices you can make, and most developers never make it.

skill-name/
├── SKILL.md                    ← Frontmatter + core instructions only
├── scripts/                    ← Deterministic, executable operations
├── references/                 ← Domain docs, loaded only when needed
└── assets/                     ← Templates and output artifacts

For Skills spanning multiple cloud providers or platforms, decompose references by variant so the agent reads only the relevant file for each invocation:

cloud-deploy/
├── SKILL.md                    ← Workflow + provider selection logic
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md

This keeps every invocation lean. An agent deploying to GCP never loads the AWS reference. It sounds obvious written this way, but the monolithic alternative is the default pattern you’ll find in the wild.


The Triggering Problem: Your Most Important Line of Code

Here’s the second aha moment, and it stings a little: the description field in your YAML frontmatter is not metadata. It is your entire triggering mechanism. And you are almost certainly writing it wrong.

The agent sees every available Skill’s name and description simultaneously, then decides whether yours is relevant to the current task. If your description is vague, passive, or misses trigger context, your Skill won’t fire — even on inputs it was built to handle perfectly.

There’s a documented undertriggering bias at play here. If an agent believes it can handle something directly, it will — even when consulting your Skill would produce substantially better results. The agent is not lazy; it’s confident. And your description needs to counteract that confidence by being explicit about situations where your Skill adds value that direct handling won’t.

This means descriptions need to be slightly more “pushy” than feels natural. Instead of describing what the Skill can do, describe the situations in which consulting it adds irreplaceable value.

Compare these two descriptions for a commit message Skill:

Weak: “Creates commit messages in conventional format.”

Strong: “Creates structured commit messages following the Conventional Commits specification. Use this whenever the user is making a git commit, asking for a commit message, wants to summarize changes, or mentions staging, committing, or pushing — even if they just say ‘help me commit this’. Handles multi-file changes, breaking changes, scopes, and footers. Use even for simple single-file changes — consistency matters more than simplicity here.”

The second version converts edge cases into explicit trigger conditions. That phrase “even if they just say ‘help me commit this’” is doing real work — it’s directly counteracting the undertriggering bias by preemptively telling the agent not to handle that case on its own.

Your description should contain four distinct elements working together: what the Skill does (capability), when to use it phrased as user behaviors rather than abstract task types, where not to use it to prevent false triggers in adjacent domains, and any environment assumptions the Skill makes. A Skill that assumes a monorepo layout or a Python backend should say so — so it doesn’t trigger inappropriately in a Go microservices project.
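
Assembled in frontmatter, the four elements might read like this. The Skill name, tools, and repo layout here are hypothetical:

```yaml
name: terraform-deploy
description: >
  Plans and applies Terraform infrastructure changes with preflight state
  checks (capability). Use whenever the user asks to deploy, apply, or roll
  out infrastructure, or mentions terraform plan/apply, even casual phrasings
  like "push this config out" (trigger behaviors). Do not use for authoring
  new Terraform modules from scratch or for Kubernetes manifests (exclusions).
  Assumes a repo with a terraform/ directory and the terraform CLI installed
  (environment assumptions).
```

The parenthetical labels are for illustration only; a real description would drop them and keep the sentences.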


The Environment Trap That Breaks Everything Silently

This is the gap I see most consistently in production Skills, and it causes a specific and maddening failure pattern: the Skill works perfectly in your testing environment, then silently fails in production for a subset of users, with no error, no signal, and no obvious debugging path.

The problem is environment blindness. The same SKILL.md may be invoked across fundamentally different execution contexts.

In Claude Code, you have a full terminal, subagents for parallel execution, a persistent filesystem, and browser access. You can run parallel test cases, use CLI tools like claude -p, and open local review servers. This is the richest environment.

In Claude.ai, there are no subagents. Test cases must run sequentially. You can’t do meaningful baseline comparisons. Any script that tries to open an HTML viewer will silently fail, consume compute, and give no signal about what went wrong.

In Cowork, subagents are available and parallel execution works well. Browser or display access, however, may be unavailable depending on the task context and configuration. In those workflows, tools that open HTML viewers need a --static flag to write a standalone file instead, and feedback delivery changes: instead of a "Submit Reviews" button posting to a local server, the user downloads a feedback.json file that the agent then reads from the filesystem. The point isn't a definitive capability list for Cowork as a product; it's that the same workflow may require different paths depending on what the environment exposes at runtime.

Capability               Claude Code   Claude.ai   Cowork
Subagents                yes           no          yes
Browser / display        yes           no          varies by context
claude -p CLI            yes           no          —
Parallel test runs       yes           no          yes
Filesystem persistence   yes           limited     yes

The solution is conditional branching in your Skill body. Not a separate Skill per environment — that creates maintenance overhead — but explicit conditional blocks:

## Running the review viewer

If you're in Claude Code:
  Run `generate_review.py` and open the local server URL in the browser.

If you're in Claude.ai:
  Skip the browser viewer entirely. Present results directly in conversation,
  walk through each test case output, and ask for inline feedback.

If browser access is unavailable (e.g. certain Cowork configurations):
  Use `generate_review.py --static /tmp/review.html` and give the user
  a file path they can open in their own browser.

A Skill that silently fails in Claude.ai is worse than a Skill that doesn’t exist. The former consumes compute and produces no output; the latter at least makes the gap obvious.


The Writing Practice That Separates Good Skills from Great Ones

Here’s the third aha moment, and it’s the one that has the most immediate practical effect: stop writing rules. Start writing reasoning.

When you write “ALWAYS use this exact template” or “NEVER skip this step”, you’re treating a sophisticated reasoning system like a state machine. But modern coding agents have deep reasoning capabilities — they generalize, adapt to edge cases, and make judgment calls. When you give them the why behind a rule, they apply it better across novel inputs than they ever would from the rule alone.

Compare these two instructions for a timing data capture step:

Rigid: “ALWAYS save timing data immediately when each subagent task completes.”

Explanatory: “Save timing data as soon as each subagent task completes — this data arrives through the task notification and isn’t persisted anywhere else in the system. If you don’t capture it at the moment the notification arrives, it’s gone permanently. There’s no way to reconstruct it from logs or state.”

The second version encodes the urgency and the mechanism. The agent now understands this isn’t an arbitrary rule — it’s about a data loss race condition with no recovery path. That understanding carries into edge cases the instruction never anticipated: what happens when two tasks complete almost simultaneously, what to do if the notification arrives during another operation, how to prioritize when system resources are constrained.

Heavy use of MUST, NEVER, and ALWAYS in all caps is a yellow flag in your own writing. Every time you reach for one, pause and ask: what is the underlying reason this matters? Write that instead. The result is more generalizable, more humane, and — empirically — more effective.


Design for Three Modes, Not One Linear Script

Coding agents switch between reasoning and action in ways that most Skills don’t account for. Every production-grade Skill should explicitly address three modes of operation — not as three labeled sections, but as three concerns that your instructions cover somewhere.

Plan mode is about decomposition and preflight verification before anything irreversible happens. What preconditions must be true? Is the agent on the right branch? Are there uncommitted changes that could be clobbered? Are required tools installed at the right version? These checks prevent the entire class of failures where an agent confidently executes the right steps in the wrong context — which is one of the most expensive failure modes in coding agent workflows.
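
A minimal sketch of what those preflight checks can look like as a helper script. The function and check names are illustrative, and the injectable run/which parameters exist only to keep the sketch testable:

```python
import shutil
import subprocess

def preflight(expected_branch="main", required_tools=("git",),
              run=subprocess.run, which=shutil.which):
    """Return a list of human-readable problems; an empty list means
    it is safe to proceed to execute mode."""
    problems = []
    for tool in required_tools:                        # required tools installed?
        if which(tool) is None:
            problems.append(f"missing tool: {tool}")
    branch = run(["git", "rev-parse", "--abbrev-ref", "HEAD"],
                 capture_output=True, text=True).stdout.strip()
    if branch != expected_branch:                      # on the right branch?
        problems.append(f"on branch {branch!r}, expected {expected_branch!r}")
    if run(["git", "status", "--porcelain"],
           capture_output=True, text=True).stdout.strip():
        problems.append("uncommitted changes could be clobbered")
    return problems
```

The value of returning problems rather than raising is that the agent can report all failed preconditions at once instead of discovering them one at a time.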

Execute mode is the command and file edit sequence. Instructions here should be short, exact, and specify the expected side effect of each step. Prefer minimal diffs and localized edits over broad rewrites. Define ordering constraints when multiple files must change atomically. Explicitly mark “do not touch” zones: generated files, vendored code, lock files. These aren’t nice-to-haves — they’re the difference between a Skill that’s safe to run autonomously and one that occasionally corrupts state in ways that take hours to debug.

Validate mode is where your Skill proves its own correctness. A Skill is incomplete unless it specifies how to verify that the work is done. A good validation stack moves from fast to slow: format/lint/typecheck first, then targeted tests on changed modules, then integration or smoke tests on critical paths, then artifact verification. Crucially, define explicit “done” gates — minimum acceptance criteria the agent uses to decide when to stop iterating rather than running tests indefinitely or declaring success prematurely.
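
One way to make the fast-to-slow ordering and the explicit “done” gate concrete is a small gate runner. The specific commands (ruff, mypy, pytest paths) are assumptions standing in for whatever your project uses, and the injectable runner exists only to keep the sketch testable:

```python
import subprocess

# Fast-to-slow gate order; commands are illustrative, not a fixed toolchain.
GATES = [
    ("format/lint", ["ruff", "check", "."]),           # seconds: style + static checks
    ("typecheck",   ["mypy", "src/"]),                 # fast, catches interface errors
    ("unit tests",  ["pytest", "tests/unit", "-q"]),   # targeted tests on changed modules
    ("smoke tests", ["pytest", "tests/smoke", "-q"]),  # slowest: critical-path integration
]

def run_gates(runner=subprocess.run):
    """Run gates in order, stopping at the first failure so slow gates
    never run while a fast gate is still failing. Returns an explicit
    verdict the agent can use to decide whether to stop iterating."""
    for name, cmd in GATES:
        if runner(cmd).returncode != 0:
            return (False, name)   # not done; report the failing gate
    return (True, None)            # all gates passed: acceptance criteria met
```

The short-circuit ordering matters: a type error should never cost you a full smoke-test run to discover.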


Script-First for Anything Deterministic

Any shell snippet or operation that appears repeatedly across different test case runs should become a script in scripts/. This is one of the most reliable ways to improve Skill consistency, and it’s almost universally underused.

Here’s why it matters: when agents regenerate shell commands from scratch each invocation, they introduce variation — slightly different flags, different error handling, different output formats. That variation accumulates across runs and creates debugging nightmares where “the same Skill” produces subtly different results depending on when you run it.

A practical diagnostic: after running three or more test cases, read the transcripts and look for repeated patterns. If two out of three agents independently wrote nearly identical helper code — a script to aggregate benchmark results, a script to build a Word document, a script to open a review viewer — that code belongs in scripts/. The Skill then simply says “run scripts/aggregate_benchmark.py with the workspace path as an argument,” and the agent executes it without regenerating or re-reasoning about the implementation. The same proven code runs every time. You can test it independently. You can version it.

Scripts should be parameterized (avoid hardcoded repo-specific paths unless truly required), idempotent where possible (safe to run twice without corrupting state), and accompanied by usage examples and expected output in comments. This last point matters more than it looks: an agent reading a script it didn’t write needs to understand what “success” looks like to know whether to continue or escalate.


The Eval Loop Is Not Optional. It’s the Entire Job.

Here’s the fourth and most important aha moment, the one that fundamentally distinguishes engineers who ship reliable Skills from those who ship Skills that almost work: Skill quality is an empirical question, not a design question. You cannot reason your way to a great Skill. You have to measure it.

The development loop looks like this:

Draft Skill → Write realistic test cases → Run with-skill AND baseline →
Read transcripts + outputs → Draft assertions → Grade results →
Improve Skill → Repeat
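
Assuming your harness drives the Claude Code CLI’s non-interactive mode (claude -p) from two checkouts, one with the Skill installed and one without, the with-skill-plus-baseline half of the loop can be sketched like this. The directory names and the checkout-based installation scheme are hypothetical:

```python
import subprocess

def run_pair(prompt, workdir_with_skill, workdir_baseline, run=subprocess.run):
    """Run the same test-case prompt in two working directories: one where
    the Skill is installed, one where it is not. Returns both outputs for
    side-by-side transcript review."""
    results = {}
    for label, cwd in (("with_skill", workdir_with_skill),
                       ("baseline", workdir_baseline)):
        proc = run(["claude", "-p", prompt], cwd=cwd,
                   capture_output=True, text=True)
        results[label] = proc.stdout
    return results
```

Whatever the mechanics, the shape is the point: every test case produces a pair, never a lone with-skill run.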

Every step in this loop matters, and the order matters. Let me walk through the parts that most developers skip.

Always run a baseline. For every test case, run two versions simultaneously: one using your Skill and one without (or, for Skill improvements, one using the previous version). Running only the with-skill case makes every iteration feel like progress because you have no reference point. It is the measurement equivalent of testing software without knowing what it’s supposed to do.

Read the transcripts, not just the outputs. This is the single most underemphasized practice in Skill development. The final output of a Skill invocation is what the agent produced. The transcript is how it got there — every tool call, every intermediate step, every backtrack and retry. These are completely different sources of information, and you need both.

Transcripts reveal things that outputs hide. If an agent ran the same verification command three times before accepting the result, your success criteria are ambiguous. If every test case agent independently wrote the same 40-line helper script, that script belongs in scripts/. If agents are setting up scaffolding and then tearing it down, some instruction is creating unnecessary work. If an agent hesitated before a destructive operation but proceeded anyway, your preflight checks are insufficiently explicit. None of this is visible in the final output.

Write realistic test cases. Real prompts are lowercase, contain typos, use abbreviations, include irrelevant context, and describe the goal rather than the method. “My boss just sent me this file and she wants a margin column added” is a real prompt. “Create a spreadsheet with a calculated profit margin column” is a textbook prompt. Test with the former.


The Overfitting Trap: Fixing the Test Instead of the Category

When you receive feedback from test case review, your instinct will be to make a specific fix for the specific failing case. This instinct is almost always wrong.

Remember: your Skill will be invoked across thousands of inputs you’ve never seen. Fixing it narrowly for your test case without understanding the general class of failure is a form of overfitting that shows up as brittleness in production — the Skill handles your test cases perfectly and fails mysteriously on everything else.

Before touching the Skill, ask: what category of input does this failure represent? Not “the agent missed the margin column” but “the agent failed to infer output column placement from positional context in the user’s description.” Then write the improvement to address that category, not the instance.

If you find yourself writing a very specific instruction like “if the user mentions column C as revenue and column D as costs, put the margin column in column E” — stop. That’s a dead giveaway you’re overfitting. The better instruction explains the reasoning: “Infer output placement from the user’s column references and natural reading order of the data, placing calculated columns adjacent to the inputs they’re derived from.” The second version handles every variant of this failure mode, including ones you haven’t seen yet.


Description Optimization: The Last Step, Not the First

Almost everyone gets this backwards: they obsess over whether the Skill triggers correctly before verifying that it produces good outputs when it does trigger. If your Skill produces mediocre outputs, getting it to trigger more reliably just means more mediocre outputs.

Description optimization is a distinct, late-stage activity that happens after you’ve confirmed the Skill produces excellent results. The process is empirical: generate approximately 20 trigger eval queries — a balanced mix of should-trigger and should-not-trigger cases — and run each one multiple times, because variance matters enormously. A trigger that fires one out of three times is a fundamentally different problem than one that fires three out of three.
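
The bookkeeping for repeated runs is simple enough to sketch. The (query, triggered) data shape is an assumption about how you record each run:

```python
from collections import defaultdict

def fire_rates(observations):
    """observations: iterable of (query, triggered) pairs, with several
    repeated runs per query. Returns each query's trigger rate."""
    fired = defaultdict(int)
    total = defaultdict(int)
    for query, triggered in observations:
        total[query] += 1
        fired[query] += int(triggered)
    return {q: fired[q] / total[q] for q in total}

def flaky(rates):
    """Queries that sometimes fire and sometimes don't: a different problem
    from never firing (rewrite the description) or always firing (done)."""
    return sorted(q for q, r in rates.items() if 0 < r < 1)
```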

The most valuable test cases are the near-misses: queries that share keywords with your Skill but actually need something different. “Write a fibonacci function” as a negative test for a PDF Skill is too easy. You want genuinely adversarial cases where a naive keyword match would trigger but shouldn’t. Real queries look like real user messages:

Should trigger: “ok so my boss just sent me this xlsx file (its in my downloads, called something like ‘Q4 sales final FINAL v2.xlsx’) and she wants me to add a column that shows the profit margin as a percentage. Revenue is col C, costs D i think”

Should not trigger: “can you help me write a python script that reads a csv and generates some charts for a presentation”

Evaluate against a held-out test set rather than just the training queries. You can easily overfit a description to the examples you’re thinking about while writing it, and that description will fail on phrasing you didn’t anticipate. The held-out set is your protection against that.


The Anti-Patterns That Are Silently Killing Your Skills

Let me be direct about the failure modes I see most often, because they’re all fixable once you can name them.

Overly broad scope is the most common. A Skill that “handles everything related to Python projects” is a Skill that handles nothing reliably. Narrow scope with precise triggers almost always outperforms broad scope with vague ones — because the agent can reason clearly about when to use it.

Passive description language causes chronic undertriggering. Descriptions that describe what a Skill can do rather than what it should trigger on will miss a significant fraction of legitimate invocations. Rewrite descriptions in terms of user behaviors and situations, not abstract capabilities.

Ignoring transcripts is the debugging equivalent of reading compiled binaries instead of source code. The transcript is the source of truth for what the agent actually did. If you’re not reading them, you’re flying blind.

Untested scripts in scripts/ will fail in production in ways that are hard to diagnose, because the agent will execute them with confidence. Every script should be tested against realistic fixtures before shipping — not “looks right,” but actually run.

Massive reference dumps consistently produce worse results than well-organized focused files. A 1,500-line monolithic reference with no table of contents forces the agent to scan the whole thing on every invocation. Decompose by domain, add a table of contents for any file over 300 lines, and make the Skill point explicitly to the right file based on context.

Hidden environment assumptions are the silent failures I described earlier. Skills that silently assume a specific OS, specific tool versions, or specific network access will fail in a fraction of invocations with no clear signal about why. Make assumptions explicit in your frontmatter and add preflight checks that verify them.


The Definition of Done

A Skill is production-ready when all six of these properties hold simultaneously — not five out of six, but all six. A change that improves output quality but breaks triggering reliability is a net regression. A change that improves triggering but introduces a new failure mode in execution has the same problem.

The six properties: triggering is precise and reliable (verified empirically, not assumed); instructions are concise and mode-aware (covering plan, execute, and validate without being a rigid flowchart); repetitive logic is script-backed and the scripts are tested; validation gates are explicit and achievable (the agent can determine “done” without human confirmation); failure paths and reporting standards are defined for the most likely failure modes; and an iteration process exists with measurable feedback so the Skill can evolve.

That last point — the iteration process — is the one that distinguishes a Skill as a software artifact from a Skill as a document you wrote once and hoped for the best. The best Skills I’ve seen are treated like production services: they have owners, they have runbooks, they have metrics, and they get updated when the world changes around them.


The Core Insight

There is a version of Skill development where you write careful instructions and ship them, check that the output looks reasonable on a few examples, and declare the job done. Many teams are doing this right now.

There is another version where you treat a Skill as a behavioral contract, verify it against baselines, read the execution transcripts, run empirical trigger evals, and iterate on the failure categories rather than the individual instances. Very few teams are doing this.

The distance between those two approaches is the distance between Skills that almost work and Skills that work. The infrastructure for the second approach exists. The methodology is documented. The barrier is recognizing that the work doesn’t end when the writing ends — it ends when the measurement confirms it.

A Skill is not a document. It’s a behavioral contract. And like all contracts, its quality is determined not by how carefully it was drafted, but by how thoroughly it was tested.