Microsoft Research releases SkillOpt, an optimization method that treats AI agent skills as trainable external states of a frozen model

Original post

elvis@omarsar0

New research from Microsoft Research

I see a lot of AI engineers handwriting agent skill docs and hoping they generalize.

Probably not optimal. This work shows why.

It treats the skill doc as a trainable external state of a frozen agent instead.

It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes.
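
A minimal sketch of the loop described above, under loud assumptions. The post does not spell out the paper's algorithm, so every name here (`Edit`, `apply_edits`, `optimize_skill`, `toy_propose`, `toy_validate`) is hypothetical, the optimizer model is replaced by a deterministic stub, and the "textual learning rate" is approximated as a cap on how many edits one round may apply (`max_edits`). It only illustrates the shape of the idea: propose add/delete/replace edits, keep them only if a validation score improves, and never touch the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position the edit targets
    text: Optional[str] = None   # new instruction for add/replace


def apply_edits(skill: List[str], edits: List[Edit]) -> List[str]:
    """Return a new skill doc with the edits applied in order."""
    lines = list(skill)
    for e in edits:
        if e.op == "add":
            lines.insert(min(e.index, len(lines)), e.text or "")
        elif e.op == "delete" and 0 <= e.index < len(lines):
            del lines[e.index]
        elif e.op == "replace" and 0 <= e.index < len(lines):
            lines[e.index] = e.text or ""
    return lines


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str]], List[Edit]],
    validate: Callable[[List[str]], float],
    rounds: int = 5,
    max_edits: int = 2,
) -> Tuple[List[str], float]:
    """Validation-gated editing: keep a candidate only if it scores higher."""
    best, best_score = list(skill), validate(skill)
    for _ in range(rounds):
        # Assumption: the cap on edits stands in for the textual learning rate.
        edits = propose(best)[:max_edits]
        candidate = apply_edits(best, edits)
        score = validate(candidate)
        if score > best_score:  # the validation gate
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for the optimizer model and the validation set (hypothetical).
REQUIRED = {"run the tests", "cite sources"}


def toy_validate(skill: List[str]) -> float:
    good = sum(1 for line in skill if line in REQUIRED)
    bad = sum(1 for line in skill if line == "guess freely")
    return good - bad


def toy_propose(skill: List[str]) -> List[Edit]:
    edits = []
    if "guess freely" in skill:
        edits.append(Edit("delete", skill.index("guess freely")))
    for phrase in sorted(REQUIRED):
        if phrase not in skill:
            edits.append(Edit("add", len(skill), phrase))
    return edits
```

Calling `optimize_skill(["be concise", "guess freely"], toy_propose, toy_validate)` drops the harmful line and adds the two missing instructions over two rounds. A lower `max_edits` makes each round more conservative, which is the role the post attributes to the learning rate.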

SkillOpt is best or tied on all 52 (model, benchmark, harness) cells.

On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses.

Paper: https://arxiv.org/abs/2605.23904

Learn to build effective AI agents in our academy: https://academy.dair.ai/

8:40 AM · May 25, 2026 · 154K Views

Garry Tan@garrytan

These concepts coming soon to GBrain this week

Vox@Voxyz_ai

can't wait for this gbrain feature. here's the loop:

agent attempts a task using a skill
↓
gbrain eval or LLM-as-judge scores the result
↓
dream cycle runs the optimizer overnight
↓
proposes small edits to the SKILL.md
↓
if the new version scores higher, accept
↓
commit the improved skill, next run uses it
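
The accept-and-commit step of this loop could look roughly like the sketch below. It is not GBrain's implementation: `accept_if_better` and the `.prev` backup file are hypothetical, a plain file copy stands in for a real commit, and `score` is a placeholder for whatever eval or LLM-as-judge produces the number.

```python
from pathlib import Path
from typing import Callable


def accept_if_better(
    skill_path: Path,
    candidate: str,
    score: Callable[[str], float],
) -> bool:
    """Overwrite the skill file with `candidate` only if it scores strictly higher."""
    current = skill_path.read_text(encoding="utf-8")
    if score(candidate) <= score(current):
        return False  # rejected: the next run keeps using the current skill

    # Assumption: a sibling ".prev" file stands in for version control,
    # so a bad accept can be rolled back by hand.
    backup = skill_path.with_name(skill_path.name + ".prev")
    backup.write_text(current, encoding="utf-8")
    skill_path.write_text(candidate, encoding="utf-8")
    return True
```

Requiring a strictly higher score means ties leave the file untouched, which avoids churn when the judge cannot tell two versions apart.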

Garry Tan@garrytan

@omarsar0 Hell yeah this is awesome

w jakiś sposób@Jaimieborodo

@omarsar0 How can I eval my skills? Should I rerun the same prompts with the rewritten skill loaded from a git checkpoint? How do you guys automate that to get real evals? I'd appreciate any good blog, paper, or repo on the topic.

Adrian Chan@gravity7

I don't have this paper in my arXiv archive yet, but I matched on it, and related topics, notes, questions, and papers are here if you and others are interested. I ported my Obsidian vault of ~1400 paper excerpts online & added cross-cutting connections to help w research/discovery. https://whitepapers.gravity7.com/match/?arxiv=2605.23904

Leo Tavares@LeoTava8

@omarsar0 Skills transferring across models and harnesses is the real punchline here. The compounding mechanism isn't better weights — it's the tool layer learning to describe itself.

Scott Welch@squelch1963

@omarsar0 You might be interested in the stuff I am doing over at http://oagp.org

Basically, designing a structured, inheritable substrate for agentic governance. It's actually pretty miraculous when you use it.

elvis@omarsar0

@Jaimieborodo working on a write-up to answer this exactly. stay tuned!

Alex Hovansky@Alex_TGH

@garrytan this sounds like you're building an actual brain instead of just a bot script

curious how the feedback loop looks in practice

Satyaa Goyal@satyaa_goyal

@Voxyz_ai That's gr8 but bro I think I will go broke with all these agents 😭.... Fking 200+ USD every month on ai is too much now and I noticed only 5-10$ of those are productive rest is bs...

M4rc0z@dreamworks2050

@garrytan Ok I’ll bite, @grok what is gbrain 👀

Grok@grok

GBrain is garrytan's open-source AI agent "brain" (http://github.com/garrytan/gbrain).

It's a Postgres + pgvector memory layer that gives agents like OpenClaw/Hermes perfect recall + synthesis over 10k+ markdown files (notes, meetings, emails, etc.). It auto-builds a typed knowledge graph, does hybrid search, and runs cron jobs autonomously.

Garry just teased SkillOpt-style skill optimization coming to it this week. Super useful for production agents.

Vox@Voxyz_ai

@satyaa_goyal lmao rip. i tell myself it's r&d

Paweł Huryn@PawelHuryn

@garrytan Same finding on my end: the self-improving loop isn't just the skill docs, the whole knowledge layer compounds: data, hypotheses, rules, procedures. An example from March 2026, simplified.

Actually, I often replace skills with custom files to better control what loads when.

build.dev@ivibecode

@omarsar0 lol I created this. But called it skillopts

Garry Tan@garrytan

Right now I just use my personal AI and our company brain; when it screws up, I tell it to fix it and write tests for it.

Also I do cross modal evals on progressive batches (e.g., if there are 10,000 items, do 5 and eval the input, output, and skill, then keep doubling the batch size as you go)
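
The progressive-batch idea can be sketched as follows. This is only an illustration of "start at 5, then keep doubling": `progressive_batches`, `run_with_gate`, `process`, and `looks_ok` are hypothetical names, and `looks_ok` stands in for whatever review of the inputs, outputs, and skill happens between batches.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def progressive_batches(items: Sequence[T], start: int = 5) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` whose sizes double: 5, 10, 20, ..."""
    if start < 1:
        raise ValueError("start must be at least 1")
    pos, size = 0, start
    while pos < len(items):
        yield items[pos:pos + size]  # the last slice may be shorter
        pos += size
        size *= 2


def run_with_gate(
    items: Sequence[T],
    process: Callable[[T], R],
    looks_ok: Callable[[Sequence[T], List[R]], bool],
    start: int = 5,
) -> List[R]:
    """Process batch by batch; stop at the first batch that fails review."""
    done: List[R] = []
    for batch in progressive_batches(items, start):
        outputs = [process(x) for x in batch]
        if not looks_ok(batch, outputs):
            break  # fix the skill before spending more on larger batches
        done.extend(outputs)
    return done
```

With 10,000 items and a starting size of 5, this covers everything in 11 batches, so a broken skill is caught after 5 items rather than after thousands.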

Quang Tri@kihote

@omarsar0 @grok explain this more clearly, in simple terms

dylan static ⚡@dylantechn

@omarsar0 can’t wait for the optimizer to decide that the best "skill" is just to tell the user to do it themselves. peak efficiency

Rahul Agarwal@scholarbaniyaaa

@omarsar0 Interesting direction: treating the skill document as an optimizable external state feels much more scalable than manually crafting prompts. The cross-model transfer results are especially compelling and worth deeper exploration.
