Aurora AI

How to write your own AI-assistant skills and sharpen them with tests

A skill is a recipe your AI assistant loads when it's needed. I'll show you how to write one and how to use tests to make it work better over time.

An abstract dark banner: a single glowing module circling a closed loop between an attempt and a fix, a little brighter on each pass — a picture of a reusable skill that gets better through testing.
Working with Claude · #claude-code #skills #evals #ai-automation #ai-skills

Most people type the same kind of prompt into an AI assistant ten times a week and get a slightly different result each time. You can settle it once: write your procedure down as a skill, and the assistant reaches for it whenever it's needed and follows the same steps each time, so the results become far more consistent. I'll explain in plain terms what a skill is, what its two kinds are, how to write one so it actually fires, and, most important, how to use tests to make it work better over time.

A skill is a recipe for your assistant

Let me start with the term, because it runs through the whole piece. A skill is simply a recipe: a set of instructions written in plain text that tells the assistant how to carry out one specific task. There is nothing technical about it; it's a file an intern could read and understand. When you ask the assistant for, say, a LinkedIn post, it loads the matching recipe and follows it instead of improvising from scratch.

The key trait of a skill is this: the assistant doesn't load it all the time, only when it's needed. You can keep a dozen different skills in a project, and the assistant reaches for the right one only the moment your request matches it. That way it isn't weighed down by every instruction at once, and you don't have to remember which file to point it at — you just ask for the result in plain language.

Writing skills is far easier now, because Anthropic, the maker of Claude, has released a separate, official skill whose only job is to create, test, measure and improve other skills. In practice it's packaged know-how on how to build them well: instead of poring over a multi-page guide, you hand that know-how to the assistant and it walks you through the process. From here on I'll just call it the skill builder.

Two kinds of skill — and why the distinction matters

Before you start writing, you need to know that skills come in two kinds. They sound similar, but they do something entirely different — and they have different shelf lives.

The first is a capability-uplift skill. It teaches the assistant to do better at something it's only average at on its own — designing tasteful pages, laying out documents, building spreadsheet formulas. At heart it's a well-written prompt. The classic example: ask the assistant for a web page without a skill like this and you'll get something correct but generic — that "made by AI" look. Add a skill that suggests good typefaces, good color palettes and sensible layouts, and the same design suddenly looks far better.

The second is an encoded-preference skill. Here the assistant already knows every individual step — the catch is that it has to do them in your order and your way. This is more of a step-by-step recipe, a genuine workflow. Example: a skill that first reviews the comments under your content, then checks online trends, then hands two helper agents the data to analyze in parallel, and finally pulls it all together and serves you finished conclusions. Without the skill you'd describe this from scratch each time and get a different result. With it, you say one word and get exactly what you like.

Why does the distinction matter? Because it decides durability. Capability-uplift skills can go stale over time: if the next, stronger version of the model designs pages better on its own than an older model did with the skill, you simply retire that skill. Encoded-preference skills, by contrast, tend to last, because they describe a process specific to you, and no model can guess that in advance. That's the first thing worth knowing before you write anything: are you building something with an expiry date, or something meant to stay?

How to write a skill that actually fires

Now for the practical part. The easiest way to start is in plan mode: you describe in plain language what you want, and the assistant asks follow-up questions and lays out a plan before it builds anything. Tell it, for example: "I want a skill that, at the end of each week, gathers the data from my channel, analyzes it and prepares a PDF report with strengths, weaknesses, opportunities and threats." It will ask about the time window, the report's sections, the style — and then spell out exactly what it's going to do.

A note here that explains why this is still your job today, not the assistant's. Most people who use skills are managers and operators, not engineers. They're great at saying what they want and which result should come out — less so with the technical details. That's fine, because you write a skill in plain language. But the more clearly you describe what you expect, the better the skill you'll get. Leave something to assumption, and the assistant will guess at that something its own way.

At its core, every finished skill is a single text file with a clear structure (in Claude's case a file named SKILL.md, which can sit in a folder next to any scripts or data it uses). At the top, in a short header, sit two crucial things: the name and the description, that is, what the skill is called and, more importantly, when to fire it. Below the header you describe the rest: what the skill does and the steps in order, sometimes referencing a ready-made script or some data.
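
To make that structure concrete, here is a minimal sketch. It assumes the header is a block fenced by `---` lines with `name` and `description` fields, which is the convention Claude's skill files follow. The skill name, its wording and the `split_skill` helper are made up for illustration; the parsing is deliberately naive and is not how the assistant itself reads the file.

```python
# Hypothetical skill file, kept in a string so the example is self-contained.
SKILL_TEXT = """---
name: weekly-channel-report
description: Use when the user asks for a weekly channel report or an end-of-week summary of channel data as a PDF.
---
# Weekly channel report

1. Gather the channel data for the last 7 days.
2. Analyze strengths, weaknesses, opportunities and threats.
3. Write the findings into a PDF report.
"""


def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into its header fields and its procedure body."""
    # The header sits between the first two '---' lines.
    _, header, body = text.split("---\n", 2)
    fields = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, body.strip()


header, body = split_skill(SKILL_TEXT)
print(header["name"])         # what the skill is called
print(header["description"])  # when it should fire
print(body.splitlines()[0])   # first line of the procedure
```

Notice that the description reads as a trigger condition ("Use when ..."), not as a summary of the steps. The steps live in the body.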

A diagram of a skill's structure: at the top a short header with the name and a highlighted 'when to fire' description, below it the procedure broken into steps — shown as two clearly separated layers of one file.

The heart of the whole thing is that description in the header. When the assistant gets a request, it doesn't read every skill in full — it skims only their short headers and picks the matching one by the description. That's why the most common reason a skill fails isn't the procedure itself, but a vague description of its trigger condition. If the description is unclear, the assistant either reaches for the wrong skill or reaches for none. So write the description so it states unambiguously in what situation this skill should come into play.
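
A rough way to picture that skimming step, as a sketch only: the `Skill` class, `header_index` and `load_body` are hypothetical names, and the actual choice between skills is made by the model, not by code like this. The point is just how little text is in play at selection time compared with the full procedures.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # the "when to fire" sentence from the header
    body: str         # the full procedure, read only after selection


def header_index(skills: list[Skill]) -> str:
    """Build the short listing that is skimmed when a request arrives."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def load_body(skills: list[Skill], chosen_name: str) -> str:
    """Return the full procedure of the one skill that was picked."""
    by_name = {s.name: s for s in skills}
    return by_name[chosen_name].body


# Placeholder bodies stand in for long procedures.
skills = [
    Skill("linkedin-post", "Use when the user asks for a LinkedIn post.", "Step ...\n" * 40),
    Skill("job-description", "Use when the user asks for a job description.", "Step ...\n" * 40),
]

index = header_index(skills)
print(index)
print(len(index), "characters skimmed,", sum(len(s.body) for s in skills), "in the full bodies")
```

Everything the selection can rely on is in `index`. If a description there is vague, nothing in the body can rescue it, because the body has not been read yet.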

The eval loop: how to make a skill get better

Here begins the thing that separates a skill written once from a skill that matures over time. An eval (short for evaluation) is simply a test: you give the assistant a set of real examples of a good result and check whether the skill actually fires at the right moment and whether it delivers what you want. With the new skill builder, the assistant can run that assessment itself and improve the skill based on it.

Here's how it works. Say you have a skill for writing job descriptions. You give the assistant plenty of examples of great descriptions, the kind you want to receive. It takes your skill, tries a few prompts on it, compares the results against your examples and tunes the instructions accordingly. This shortcuts a process that used to be slow: a skill normally improves only as you use it and keep giving feedback on what you like and what you don't. The eval does part of that work for you, right away.
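
The loop can be sketched in a few lines. This is an illustration under loud assumptions: `fake_assistant` is a hypothetical stand-in for a real assistant call, and checking for required phrases is a crude substitute for comparing against your example documents. The skill builder's actual procedure is not shown in this article.

```python
from typing import Callable

# One eval case: a prompt plus simple checks the output must satisfy.
CASES = [
    {"prompt": "Write a job description for a data analyst",
     "must_include": ["Responsibilities", "Requirements"]},
    {"prompt": "Write a job description for a sales lead",
     "must_include": ["Responsibilities", "Requirements"]},
]


def fake_assistant(prompt: str, skill: str) -> str:
    """Hypothetical stand-in for a real assistant call."""
    output = f"{prompt}\n\nResponsibilities: ..."
    if "Requirements" in skill:  # the skill text steers the output
        output += "\nRequirements: ..."
    return output


def run_eval(skill: str, assistant: Callable[[str, str], str]) -> float:
    """Return the share of cases whose output passes every check."""
    passed = 0
    for case in CASES:
        output = assistant(case["prompt"], skill)
        if all(phrase in output for phrase in case["must_include"]):
            passed += 1
    return passed / len(CASES)


draft = "Always include a Responsibilities section."
revised = draft + " Always include a Requirements section."

print(run_eval(draft, fake_assistant))    # 0.0
print(run_eval(revised, fake_assistant))  # 1.0
```

The shape is what matters: fixed cases, a score, a revision of the instructions, and the same cases run again.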

There are two reasons to reach for evals — they sound alike, yet they're opposites:

  • Catching regressions. When the model changes, it may start using your skill worse than before, because it "thinks" a little differently. A regular test is then an early signal that the skill needs tuning to the new version.
  • Catching uplift. That same model change may mean the assistant does the task better with no skill at all. The eval will show this — and then you can calmly delete or archive that skill, because it's no longer needed.
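
The two cases can be told apart with three numbers. The sketch below assumes you already have pass rates from your evals; the `classify` function and the 0.05 tolerance are my own illustrative choices, not part of any tool.

```python
def classify(before_with: float, after_with: float, after_without: float,
             tolerance: float = 0.05) -> str:
    """Label an eval result after a model change.

    before_with   -- pass rate with the skill on the old model
    after_with    -- pass rate with the skill on the new model
    after_without -- pass rate with no skill on the new model
    """
    # Uplift: the new model alone matches what the skill used to deliver
    # and what the skill delivers now.
    if (after_without >= before_with - tolerance
            and after_without >= after_with - tolerance):
        return "uplift"
    # Regression: the skill performs clearly worse than it did before.
    if after_with < before_with - tolerance:
        return "regression"
    return "stable"


print(classify(0.90, 0.70, 0.40))  # regression: tune the skill
print(classify(0.90, 0.92, 0.91))  # uplift: consider archiving the skill
print(classify(0.90, 0.90, 0.50))  # stable: keep the skill as it is
```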

On top of this come benchmarks, that is, comparative measurement. When you change a skill or the model updates, you run all the evals and get hard numbers: the share of successful runs, how long it takes and how many tokens it burns (a token is a fragment of text the model operates on — the more of them you use, the higher the cost). You can ask outright: "compare the task with the skill on and off, show the results side by side." Then you'll see in black and white whether the skill really raises quality or whether it just feels that way.
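
If you log your runs, the side-by-side view is a small aggregation. In this sketch the run records are invented purely for illustration (they are not measurements), and the field names are assumptions about how you might store them.

```python
from statistics import mean

# Invented run records for illustration only, not real measurements.
RUNS = [
    {"skill": True,  "success": True,  "seconds": 42.0, "tokens": 5200},
    {"skill": True,  "success": True,  "seconds": 40.0, "tokens": 5000},
    {"skill": True,  "success": False, "seconds": 47.0, "tokens": 5600},
    {"skill": False, "success": True,  "seconds": 30.0, "tokens": 3100},
    {"skill": False, "success": False, "seconds": 28.0, "tokens": 2900},
    {"skill": False, "success": False, "seconds": 32.0, "tokens": 3300},
]


def summarize(runs: list[dict], with_skill: bool) -> dict:
    """Aggregate success share, time and token use for one condition."""
    rows = [r for r in runs if r["skill"] is with_skill]
    return {
        "runs": len(rows),
        "success_rate": sum(r["success"] for r in rows) / len(rows),
        "avg_seconds": mean(r["seconds"] for r in rows),
        "avg_tokens": mean(r["tokens"] for r in rows),
    }


on = summarize(RUNS, True)
off = summarize(RUNS, False)

print(f"{'metric':<14}{'with skill':>12}{'no skill':>12}")
for key in ("success_rate", "avg_seconds", "avg_tokens"):
    print(f"{key:<14}{on[key]:>12.2f}{off[key]:>12.2f}")
```

With these made-up numbers the skill doubles the success share but costs more time and tokens per run. Whether that trade is worth it is exactly the question the table lets you answer with data instead of a feeling.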

A side-by-side comparison of two test-result columns: on the left a pale, uneven bar 'no skill', on the right a brighter, even bar 'with skill' — a benchmark visualizing a real lift in quality.

Trigger tuning: when a skill fires at the wrong time

There's one more kind of test you'll only appreciate once you have a larger pile of skills. When a dozen or so accumulate in a project, false triggers start: you wanted one skill, the assistant fired another — or fired none. You can work around it by invoking a skill manually with a command, but it's more convenient to speak plain language and be confident the assistant has understood you right.

This is where trigger tuning helps. The skill builder analyzes your skill, tests on its own the various phrasings you might use to invoke it, and then rewrites the description in the header so the skill fires more accurately. This comes back to what I wrote above: all the accuracy of recognition sits in the description. After such tuning, recognition doesn't become perfect, but it clearly improves — and that's the difference between a tool you have to dictate everything to and one that simply understands what you're asking for.
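
Here is a toy version of such a trigger test. Be aware of the assumptions: `toy_fires` is a crude keyword-overlap stand-in I invented so the example runs; the real assistant decides with a language model, and the skill builder's own test procedure is not reproduced here. The phrasings and both descriptions are made up.

```python
import re

# Phrasings you might type, and whether the job-description skill should fire.
PHRASINGS = [
    ("write a job description for a data analyst", True),
    ("draft a job ad for our sales role", True),
    ("we are hiring, prepare a posting for a designer", True),
    ("we need someone new on the support team", True),
    ("write a linkedin post about our new office", False),
    ("describe the weekly sales numbers", False),
]

STOP = {"a", "an", "the", "for", "or", "when", "use", "user", "our", "we",
        "are", "about", "to", "of", "asks", "write"}


def words(text: str) -> set[str]:
    """Lowercased content words of a text, with filler words removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP


def toy_fires(description: str, request: str, threshold: int = 2) -> bool:
    """Toy rule: fire when request and description share enough keywords."""
    return len(words(description) & words(request)) >= threshold


def trigger_accuracy(description: str) -> float:
    """Share of phrasings where firing (or not firing) matches expectation."""
    hits = sum(toy_fires(description, req) == expected
               for req, expected in PHRASINGS)
    return hits / len(PHRASINGS)


vague = "Helps with writing descriptions."
tuned = ("Use when the user asks to write or draft a job description, "
         "job ad, or hiring posting for a role.")

print(round(trigger_accuracy(vague), 2))  # 0.33
print(round(trigger_accuracy(tuned), 2))  # 0.83
```

The tuned description still misses the indirect phrasing about the support team, which mirrors the point above: tuning improves recognition without making it perfect.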

A skill you write once and keep testing becomes an asset

Where is all this heading? Today, when you build a skill, you give the assistant the steps, the rules and the format. But there are early signs that, over time, a description of what the skill should do may be enough, with the model filling in the rest from a high-level, natural request. In my view that "may" will sooner or later turn into "will": more and more detail will come off your shoulders, and you'll simply say what you want.

But don't wait for that future to begin. The mechanism worth remembering is simple and works already: describe your repeatable work once, then don't leave the skill alone — test it on real examples, check that it fires at the right moment and delivers, and run it through an eval after every model change. A skill written once is a convenience. A skill you keep testing and improving is an asset that works harder for you the longer you have it.

The simplest first step is right at hand: pick one thing you do over and over and slightly differently each time, describe it in plain language as a skill — and, right away, take care to write one clear sentence that says when it should fire.