Microsoft's SkillOpt improves AI agent coding performance by 23.5 points by optimizing Markdown skill documents instead of weights

An optimizer edits the skill documents based on failures

Agent skills are usually hand-written, generated once by an LLM, or revised in loose ways that can easily make them worse. SkillOpt, from Microsoft, argues that agent skills should instead be trained like small external programs: it teaches AI agents better task habits by editing a reusable skill document, not the model itself.

The paper's core idea is to treat the skill document as the thing being trained, while the main AI model stays frozen. SkillOpt watches the agent attempt tasks, studies what worked and what failed, then asks a stronger optimizer model to suggest small edits to the skill. It accepts an edit only when the new skill improves on a held-out check set, so the skill does not drift just because an edit sounds good.

The authors tested this across 6 benchmarks, 7 target models, and 3 agent settings: direct chat, Codex, and Claude Code. SkillOpt was best or tied on all 52 tested cases, and on GPT-5.5 it raised average accuracy by 23.5 points in direct chat.

The final result is a small, readable skill file that can improve agents across tasks and settings without retraining the model. The best part is that the optimizer is used during training, but deployment needs only the final skill file. That makes the artifact inspectable, portable, and cheap to reuse, which is exactly what most prompt-engineering systems lack.

----

Link: arxiv.org/abs/2605.23904
Title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"

1:52 AM · May 29, 2026

🧵 The first real optimizer for AI agent skills just dropped.

+23.5 points on coding tasks. Zero weight updates. One reusable Markdown file.

Here's how SkillOpt turns agent training into a solved problem:

Your agent fails at spreadsheet tasks. You write a better prompt. It fails differently.

You're flying blind—no gradient, no validation loop, no memory of what didn't work.

Human-written skills can't self-correct. One-shot methods can't learn from rollouts.

SkillOpt treats the skill document as external trainable state.

Not weights. Not prompts. A compact procedural artifact that gets optimized like a neural net—but in text space.

Think: SGD for Markdown files.

Step 1: Collect evidence

Run your agent on a batch of tasks using the current skill.

Pass/fail scored by automatic verifier

Separate failures from successes.

Raw training signal → structured minibatches.
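A minimal sketch of this collection step, not the paper's code. The `run_agent` and `verify` callables, the `Rollout` record, and the toy agent are all assumptions made up for illustration. It runs each task under the current skill, scores it pass/fail, and splits the rollouts so they can be chunked into minibatches.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Rollout:
    task: str
    output: str
    passed: bool


def collect_evidence(
    tasks: Sequence[str],
    skill: str,
    run_agent: Callable[[str, str], str],  # hypothetical: (skill, task) -> output
    verify: Callable[[str, str], bool],    # hypothetical: (task, output) -> pass/fail
) -> Tuple[List[Rollout], List[Rollout]]:
    """Run every task with the current skill and split rollouts by outcome."""
    failures, successes = [], []
    for task in tasks:
        output = run_agent(skill, task)
        rollout = Rollout(task, output, verify(task, output))
        (successes if rollout.passed else failures).append(rollout)
    return failures, successes


def minibatches(items: List[Rollout], size: int) -> List[List[Rollout]]:
    """Chunk rollouts into fixed-size minibatches for the reflection step."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy stand-ins so the sketch runs without a real agent or benchmark.
def toy_agent(skill: str, task: str) -> str:
    return task.upper() if "uppercase" in skill else task


def toy_verify(task: str, output: str) -> bool:
    return output == task.upper()


failures, successes = collect_evidence(
    ["abc", "XYZ", "def"], "no rules yet", toy_agent, toy_verify
)
```

With the toy skill above, two of the three tasks fail, and those failures are what the reflection step would read next.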

Step 2: Minibatch reflection

A separate optimizer model (GPT-4, Claude) reviews:

Current skill text

Failed trajectories

Successful trajectories

Output: Structured add/delete/replace edits targeting observed weaknesses.
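One way to picture "structured add/delete/replace edits", assuming the skill is plain text edited line by line. The `Edit` record and `apply_edit` helper are hypothetical names for illustration; the post does not specify the paper's actual edit schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    op: str           # "add", "delete", or "replace"
    target: str = ""  # existing skill line to delete or replace (unused for add)
    text: str = ""    # new line for add or replace


def apply_edit(skill: str, edit: Edit) -> str:
    """Apply one line-level edit and return the new skill text."""
    lines = skill.splitlines()
    if edit.op == "add":
        lines.append(edit.text)
    elif edit.op == "delete":
        lines = [line for line in lines if line != edit.target]
    elif edit.op == "replace":
        lines = [edit.text if line == edit.target else line for line in lines]
    else:
        raise ValueError(f"unknown edit op: {edit.op!r}")
    return "\n".join(lines)


# Made-up skill lines, only to show the three operations.
skill = "- Read the task twice\n- Guess the output format"
skill = apply_edit(
    skill,
    Edit("replace", target="- Guess the output format",
         text="- Match the requested output format exactly"),
)
skill = apply_edit(skill, Edit("add", text="- Check the answer before replying"))
```

Keeping edits as small typed records, rather than free-form rewrites, is what makes them easy to rank, cap, and log later.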

Step 3: Bounded merge

Rank all proposed edits by utility.

Apply only the top L_t edits (e.g., 4).

This is your textual learning rate.

Prevents catastrophic rewrites. Keeps changes incremental.

Same idea as weight-update step size.
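The bounded merge can be sketched as a sort-and-truncate. This assumes each proposed edit already carries a numeric utility score; how that score is produced is not described in the thread, so `utility` here is a placeholder, and `ProposedEdit` and `bounded_merge` are illustrative names.

```python
from typing import List, NamedTuple


class ProposedEdit(NamedTuple):
    text: str
    utility: float  # placeholder score; the real ranking signal is not specified here


def bounded_merge(edits: List[ProposedEdit], max_edits: int = 4) -> List[ProposedEdit]:
    """Keep only the highest-utility edits, capped at max_edits per step."""
    ranked = sorted(edits, key=lambda e: e.utility, reverse=True)
    return ranked[:max_edits]


# Six made-up proposals; only the top four survive the cap.
proposals = [
    ProposedEdit(f"edit {i}", u)
    for i, u in enumerate([0.2, 0.9, 0.5, 0.1, 0.7, 0.4])
]
kept = bounded_merge(proposals, max_edits=4)
```

Lowering `max_edits` makes each step more conservative, which is the sense in which the cap acts like a learning rate.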

Step 4: Validation gate (the secret sauce)

Run the candidate skill on a held-out selection set.

Strict rule: accept only if new_score > old_score.

If rejected → store the edit in a rejected-edit buffer for negative feedback.

No harmful updates survive.
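A sketch of the gate under the strict rule stated above. The `score` callable stands in for "accuracy on the held-out selection set" and the toy scorer is invented; both are assumptions, not the paper's code.

```python
from typing import Callable, List, Tuple


def validation_gate(
    current_skill: str,
    candidate_skill: str,
    candidate_edits: List[str],
    score: Callable[[str], float],  # hypothetical: skill -> selection-set accuracy
    rejected_buffer: List[str],
) -> Tuple[str, bool]:
    """Accept the candidate only on strict improvement; otherwise log its edits."""
    old_score = score(current_skill)
    new_score = score(candidate_skill)
    if new_score > old_score:  # strict inequality: a tie is a rejection
        return candidate_skill, True
    rejected_buffer.extend(candidate_edits)
    return current_skill, False


# Toy scorer: pretends the selection set rewards skills that mention "verify".
def toy_score(skill: str) -> float:
    return 0.8 if "verify" in skill else 0.5


rejected: List[str] = []
skill = "- Read the task"
skill, accepted_first = validation_gate(
    skill, skill + "\n- verify the output", ["add verify"], toy_score, rejected
)
skill, accepted_second = validation_gate(
    skill, skill + "\n- be creative", ["add creative"], toy_score, rejected
)
```

The second candidate ties the current score, so it is rejected and its edit lands in the buffer, where a later reflection step could read it as a negative example.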

Step 5: Epoch-wise slow/meta update

After multiple inner steps, compare performance across epochs.

Durable lessons → protected slow-update section (read-only during fast edits).

The optimizer itself learns which edit patterns work.

Momentum for text.
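A rough sketch of the protected section only, not the meta-learning part. It assumes the skill is stored as two lists of lines and that the caller decides which lines count as durable; the `Skill` class, `fast_delete`, and `promote` are invented for illustration and the paper's actual promotion criterion is not given here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Skill:
    slow: List[str] = field(default_factory=list)  # protected, read-only for fast edits
    fast: List[str] = field(default_factory=list)  # freely edited during inner steps

    def render(self) -> str:
        """Produce the single text document the agent would actually read."""
        return "\n".join(self.slow + self.fast)


def fast_delete(skill: Skill, line: str) -> Skill:
    """Fast edits may only touch the fast section; slow lines are left alone."""
    return Skill(slow=list(skill.slow), fast=[l for l in skill.fast if l != line])


def promote(skill: Skill, durable: Iterable[str]) -> Skill:
    """Epoch-end update: move lines the caller judged durable from fast to slow."""
    durable_set = set(durable)
    moved = [l for l in skill.fast if l in durable_set]
    kept = [l for l in skill.fast if l not in durable_set]
    return Skill(slow=skill.slow + moved, fast=kept)


# Made-up lines: one gets promoted, then survives a fast delete that targets it.
skill = Skill(fast=["- Check units", "- Try a shortcut first"])
skill = promote(skill, ["- Check units"])
skill = fast_delete(skill, "- Check units")           # no effect: line is protected
skill = fast_delete(skill, "- Try a shortcut first")  # removed from the fast section
```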

What you deploy:

✅ Single Markdown file (300–2K tokens)
✅ Tool policies, formatting rules, failure modes
✅ Zero additional inference cost
✅ Transferable across models, harnesses, benchmarks
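At deploy time the optimizer is out of the picture, so using the skill can be as simple as reading the file and placing it in the agent's instructions. The sketch below assumes a plain system-prompt setup; the file name, prompt text, and `build_system_prompt` helper are made up, and real harnesses may load skill files their own way.

```python
import tempfile
from pathlib import Path


def build_system_prompt(skill_path: Path, base_prompt: str) -> str:
    """Append the trained skill document to a base system prompt."""
    skill = skill_path.read_text(encoding="utf-8").strip()
    return f"{base_prompt}\n\n{skill}"


# Write a tiny made-up skill file to a temp dir so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "skill.md"
    path.write_text(
        "# Spreadsheet skill\n- Confirm the sheet name before editing\n",
        encoding="utf-8",
    )
    prompt = build_system_prompt(path, "You are a careful agent.")
```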

One artifact. Seven models. Three execution environments.

Why this works when prompt hacking doesn't:

🔹 Bounded updates (textual LR)
🔹 Validation gate (no regressions)
🔹 Rejected-edit memory (negative signal)
🔹 Protected slow field (durable knowledge)
🔹 Optimizer meta skill (learning to edit)

Discipline beats creativity.

The next scaling law frontier isn't more parameters.

It's disciplined text-space optimizers.

Rejected edits are the real training signal. Accepted edits are just the visible tip.

If you're building agents:

🔹 Pick one benchmark with an auto-verifier
🔹 Run SkillOpt for 1 epoch
🔹 Measure lift on held-out validation
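"Lift" here can be read as the accuracy difference in percentage points between the baseline and the tuned skill on the same held-out tasks. A small sketch, with made-up pass/fail results and illustrative function names:

```python
from typing import Sequence


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of held-out tasks that passed the verifier."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def lift_in_points(baseline: Sequence[bool], tuned: Sequence[bool]) -> float:
    """Accuracy gain in percentage points (tuned minus baseline)."""
    return 100.0 * (accuracy(tuned) - accuracy(baseline))


# Invented verifier outcomes on the same 8 held-out tasks.
baseline = [True, False, False, True, False, False, True, False]  # 3 of 8 pass
tuned = [True, True, False, True, False, True, True, False]       # 5 of 8 pass
lift = lift_in_points(baseline, tuned)  # 62.5% - 37.5% = 25.0 points
```

Measure both runs on tasks that were never used for training or for the validation gate, otherwise the lift will look better than it is.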

The training loop for agent skills is here. Use it.

4:21 PM · May 29, 2026