Microsoft's SkillOpt improves AI agent coding performance by 23.5 points by optimizing Markdown skill documents instead of weights
An optimizer edits the skill documents based on failures
🧵 The first real optimizer for AI agent skills just dropped.
+23.5 points on coding tasks. Zero weight updates. One reusable Markdown file.
Here's how SkillOpt turns skill writing into a training loop:
Your agent fails at spreadsheet tasks. You write a better prompt. It fails differently.
You're flying blind—no gradient, no validation loop, no memory of what didn't work.
Human-written skills can't self-correct. One-shot methods can't learn from rollouts.
SkillOpt treats the skill document as external trainable state.
Not weights. Not prompts. A compact procedural artifact that gets optimized like a neural net—but in text space.
Think: SGD for Markdown files.
Step 1: Collect evidence
Run your agent on a batch of tasks using the current skill.
Each run is scored pass/fail by an automatic verifier.
Separate failures from successes.
Raw training signal → structured minibatches.
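A minimal sketch of the evidence step, assuming the agent and the verifier are plain callables. `Rollout`, `collect_evidence`, `toy_agent`, and `toy_verifier` are hypothetical names for illustration, not part of SkillOpt:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    task: str
    trajectory: str  # whatever the agent did, recorded as text
    passed: bool     # verdict from the automatic verifier


def collect_evidence(
    tasks: List[str],
    skill: str,
    run_agent: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
) -> Tuple[List[Rollout], List[Rollout]]:
    """Run every task with the current skill and split the results by verdict."""
    failures: List[Rollout] = []
    successes: List[Rollout] = []
    for task in tasks:
        trajectory = run_agent(task, skill)
        rollout = Rollout(task, trajectory, verify(task, trajectory))
        (successes if rollout.passed else failures).append(rollout)
    return failures, successes


# Toy stand-ins so the sketch runs without a real agent or benchmark.
def toy_agent(task: str, skill: str) -> str:
    return f"{task}:done" if "sum" in task else f"{task}:error"


def toy_verifier(task: str, trajectory: str) -> bool:
    return trajectory.endswith("done")


failures, successes = collect_evidence(
    ["sum column", "pivot table"], "# skill", toy_agent, toy_verifier
)
```

The two lists are what the optimizer model sees in the next step.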
Step 2: Minibatch reflection
A separate optimizer model (e.g., GPT-4 or Claude) reviews:
🔹 Current skill text
🔹 Failed trajectories
🔹 Successful trajectories
Output: Structured add/delete/replace edits targeting observed weaknesses.
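One way to picture "structured add/delete/replace edits" is a small record applied line by line to the skill. This is a sketch under assumptions: the `Edit` fields and the line-level granularity are illustrative choices, not the format SkillOpt actually uses:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing line to delete or replace
    text: Optional[str] = None    # new line for add or replace


def apply_edit(skill_lines: List[str], edit: Edit) -> List[str]:
    """Return a new list of skill lines with one edit applied."""
    if edit.op not in {"add", "delete", "replace"}:
        raise ValueError(f"unknown op: {edit.op}")
    if edit.op == "add":
        return skill_lines + [edit.text]
    if edit.target not in skill_lines:
        return list(skill_lines)  # stale edit: the target line is gone
    if edit.op == "delete":
        return [line for line in skill_lines if line != edit.target]
    return [edit.text if line == edit.target else line for line in skill_lines]


skill = ["- Read the header row first.", "- Always overwrite the input file."]
replaced = apply_edit(
    skill,
    Edit("replace", "- Always overwrite the input file.", "- Save results to a new file."),
)
added = apply_edit(skill, Edit("add", text="- Re-check totals before finishing."))
deleted = apply_edit(skill, Edit("delete", "- Always overwrite the input file."))
```

Keeping edits as data rather than free-form rewrites is what makes them rankable and rejectable later.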
Step 3: Bounded merge
Rank all proposed edits by utility.
Apply only the top L_t edits (e.g., 4).
This is your textual learning rate.
Prevents catastrophic rewrites. Keeps changes incremental.
Same idea as weight-update step size.
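The bounded merge is essentially a top-k cut. A minimal sketch, assuming each proposed edit already carries a utility score (how that score is produced is not specified here, and `bounded_merge` is a hypothetical name):

```python
from typing import List, Tuple


def bounded_merge(
    scored_edits: List[Tuple[float, str]], max_edits: int = 4
) -> List[str]:
    """Keep only the highest-utility edits, at most max_edits of them."""
    ranked = sorted(scored_edits, key=lambda pair: pair[0], reverse=True)
    return [edit for _, edit in ranked[:max_edits]]


proposals = [
    (0.2, "a"), (0.9, "b"), (0.5, "c"),
    (0.7, "d"), (0.1, "e"), (0.8, "f"),
]
kept = bounded_merge(proposals)  # max_edits plays the role of the textual learning rate
```

A smaller `max_edits` means smaller steps per update, at the cost of slower progress.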
Step 4: Validation gate (the secret sauce)
Run the candidate skill on a held-out selection set. Strict rule: Accept only if new_score > old_score.
If rejected → store edit in a rejected-edit buffer for negative feedback. No harmful updates survive.
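The gate itself is a strict comparison plus a memory of what failed. A sketch assuming `evaluate` returns a score on the held-out selection set. `validation_gate` and `toy_score` are hypothetical names, and the toy scorer only counts a keyword so the example runs:

```python
from typing import Callable, List, Tuple


def validation_gate(
    current_skill: str,
    candidate_skill: str,
    edits: List[str],
    evaluate: Callable[[str], float],
    rejected_buffer: List[str],
) -> Tuple[str, bool]:
    """Accept the candidate only if it strictly beats the current skill."""
    old_score = evaluate(current_skill)
    new_score = evaluate(candidate_skill)
    if new_score > old_score:      # strict: a tie counts as a rejection
        return candidate_skill, True
    rejected_buffer.extend(edits)  # kept as negative feedback for later steps
    return current_skill, False


# Toy scorer standing in for a run over the held-out selection set.
def toy_score(skill: str) -> float:
    return float(skill.count("verify"))


rejected: List[str] = []
skill = "verify totals"
skill, accepted_first = validation_gate(
    skill, "verify totals\nverify headers", ["add: verify headers"], toy_score, rejected
)
skill, accepted_second = validation_gate(
    skill,
    skill.replace("headers", "columns"),
    ["replace: headers -> columns"],
    toy_score,
    rejected,
)
```

The second candidate only ties the current score, so it is rejected and its edit lands in the buffer.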
Step 5: Epoch-wise slow/meta update
After multiple inner steps, compare performance across epochs.
Durable lessons → protected slow-update section (read-only during fast edits). The optimizer itself learns which edit patterns work.
Momentum for text.
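A rough sketch of the slow update, under a loud assumption: a lesson counts as "durable" here if it survived in the fast section at the end of every epoch. That criterion and the names `durable_lessons` and `slow_update` are illustrative, not SkillOpt's actual rule:

```python
from typing import List, Tuple


def durable_lessons(epoch_snapshots: List[List[str]]) -> List[str]:
    """Lines present in the fast section at the end of every epoch."""
    if not epoch_snapshots:
        return []
    survivors = set(epoch_snapshots[0]).intersection(*epoch_snapshots[1:])
    # Preserve the order the lines have in the most recent snapshot.
    return [line for line in epoch_snapshots[-1] if line in survivors]


def slow_update(
    slow: List[str], fast: List[str], epoch_snapshots: List[List[str]]
) -> Tuple[List[str], List[str]]:
    """Move durable lines from the editable fast section into the protected one."""
    promoted = durable_lessons(epoch_snapshots)
    new_slow = slow + [line for line in promoted if line not in slow]
    new_fast = [line for line in fast if line not in promoted]
    return new_slow, new_fast


snapshots = [
    ["check units", "use tool A"],
    ["check units", "retry once"],
    ["check units", "retry once", "log output"],
]
slow, fast = slow_update(["never delete input"], snapshots[-1], snapshots)
```

Fast edits would then only ever touch `fast`, which is what keeps the promoted lines read-only between epochs.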
What you deploy:
✅ Single Markdown file (300–2K tokens)
✅ Tool policies, formatting rules, failure modes
✅ Zero additional inference cost
✅ Transferable across models, harnesses, benchmarks
One artifact. Seven models. Three execution environments.
Why this works when prompt hacking doesn't:
🔹 Bounded updates (textual LR)
🔹 Validation gate (no regressions)
🔹 Rejected-edit memory (negative signal)
🔹 Protected slow field (durable knowledge)
🔹 Optimizer meta skill (learning to edit)
Discipline beats creativity.
The next scaling law frontier isn't more parameters.
It's disciplined text-space optimizers.
Rejected edits are the real training signal. Accepted edits are just the visible tip.
If you're building agents:
🔹 Pick one benchmark with an auto-verifier
🔹 Run SkillOpt for 1 epoch
🔹 Measure lift on held-out validation
The training loop for agent skills is here. Use it.