Microsoft SkillOpt Paper Enables Self-Evolving AI Agent Skills

Original post

elvis@omarsar0

This SkillOpt paper from Microsoft is a must-read!

(bookmark it)

I was a bit skeptical of the results reported in the paper when I shared it a few days ago.

However, I managed to integrate it into my agent orchestrator and ran a few experiments.

The results are mindblowing.

Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this.

One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task.

Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve.

In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt.

Stay tuned!

9:07 AM · Jun 3, 2026 · 36.5K Views

elvis@omarsar0

I love this figure from the paper. It does a really good job explaining how it all works.

elvis@omarsar0

In case you were wondering, I have already started to package this into something that's more accessible to others. It still requires thinking of the eval side of things. Learn to do evals, start with agents helping you with this initially, and automate it.

I am actually trying to figure out a way to automate this whole thing so I can run an experiment for how something like this can work on a schedule autonomously.

elvis@omarsar0

Paper info here:

elvis@omarsar0

New research from Microsoft Research

I see a lot of AI engineers handwriting agent skill docs and hoping they generalize.

Probably not optimal. This work shows why.

It treats the skill doc as a trainable external state of a frozen agent instead.

It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes.
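A minimal sketch of the validation-gated loop described above, not SkillOpt's actual code or API. Every name here (`Edit`, `apply_edit`, `optimize_skill`, the toy scorer, and the scripted proposer) is hypothetical. The "textual learning rate" is modeled as a simple cap on edits per round, which is an assumption about its role rather than the paper's mechanism. In a real setup, the proposer would be an optimizer model and the scorer would run the frozen agent on a held-out validation set.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # position in the list of instructions
    text: str = ""  # new instruction text (unused for "delete")


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill doc with one edit applied; the input is not mutated."""
    out = list(skill)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return out


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str]], List[Edit]],
    score: Callable[[List[str]], float],
    rounds: int = 3,
    max_edits_per_round: int = 2,  # assumed stand-in for the textual learning rate
) -> Tuple[List[str], float]:
    """Keep a round's edits only if the validation score strictly improves."""
    best, best_score = list(skill), score(skill)
    for _ in range(rounds):
        candidate = list(best)
        # Edits are applied in order; indices refer to the doc at that moment.
        for edit in propose(best)[:max_edits_per_round]:
            candidate = apply_edit(candidate, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # the validation gate
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins so the sketch runs without any model calls.
REQUIRED = ["extract tables as csv", "cite the page number"]


def toy_score(skill: List[str]) -> float:
    text = "\n".join(skill).lower()
    hits = sum(keyword in text for keyword in REQUIRED)
    penalty = 0.5 if "guess missing values" in text else 0.0
    return hits / len(REQUIRED) - penalty


_script = iter([
    [Edit("replace", 0, "Guess missing values.")],  # lowers the score, rejected
    [Edit("add", 1, "Extract tables as CSV.")],     # raises the score, accepted
    [Edit("add", 2, "Cite the page number.")],      # raises the score, accepted
])


def toy_propose(skill: List[str]) -> List[Edit]:
    return next(_script, [])


final_skill, final_score = optimize_skill(["Read the paper."], toy_propose, toy_score)
print(final_skill, final_score)
```

The agent never appears in the loop except through `score`, which mirrors the idea that only the skill doc changes. The harmful first proposal is discarded because the gate compares against the best score so far.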

SkillOpt is best or tied on all 52 (model, benchmark, harness) cells.

On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses.

Paper: https://arxiv.org/abs/2605.23904

Learn to build effective AI agents in our academy: https://academy.dair.ai/

Yifan Yang@Yif_Yang

This is super exciting to see — thank you for sharing such a thoughtful experiment and analysis.

We are especially happy to see SkillOpt (https://github.com/microsoft/SkillOpt) being tested in real agent orchestrators and multimodal skills. The eval side is indeed critical, and making this kind of skill optimization more accessible and autonomous is exactly the direction we hope the community will explore together.

Really looking forward to seeing what you build next with SkillOpt!

Mattew Phillips@MattewPhillips

@omarsar0 https://proof.skillier.ai/#ScanMe

Just built that, skill scanner for safety + skill optimizer (leveraging SkillOPT and other techniques).

elvis@omarsar0

@Yif_Yang It's really good work and a great idea. Sharing more in the coming days for sure.

Christopher@communicating

@omarsar0 @dair_ai This one is a game changer. I’ve also integrated it. 👍

LANGERIUS@Langerius

@omarsar0 Nothing beats the feeling of being skeptical about a paper, testing it out yourself, and watching the code actually work as planned

LANGERIUS@Langerius

@omarsar0 Completely agree that paper is a game-changer. When you can give agent skills a proper framework to self-evolve, the compounding efficiency is going to be crazy.

Alpha Batcher@alphabatcher

@omarsar0 this paper about SkillOpt will be interesting

let me read it

Lunari@0x_lun

@omarsar0 +23.5 points on gpt-5.5 direct chat with zero extra inference calls at deployment is the part that makes this worth actually reading

the skill transferring across codex and claude code without retraining is the quieter result that matters more long term

Vanar@Vanarchain

@omarsar0 This is essentially structured self improvement. Agents improving skills through evaluation loops rather than static prompting.

Sven Nachtzeit@SvenUrbanSci

@omarsar0 A +20 lift on multimodal extraction is not exactly subtle. In my own agent robustness work, the real question is whether these skill artifacts survive production workflows. Any data on transfer into other execution environments?

Danish Khan@nahkhsinad

@omarsar0 @dair_ai Interesting, ingesting, thx for sharing.

Strata@ChainZenit

@omarsar0 Another day, another "mindblowing" agent paper. Seen it all before.

Alex YGift@Radipdegen

@omarsar0 was skeptical too until i actually ran the numbers myself.

the gap between reading a paper and running it is massive

Alina Fomina@fominaaalina

@omarsar0 this is the harness war garry tan predicted. skills, orchestration, context engineering – all harness layer

Ajantik YZ@AjantikYZ

@omarsar0 The most critical point when adding SkillOpt to an orchestrator is making sure the eval set does not leak. Which task types did the results improve on? Was it tool selection or the rewriting of skill prompts that made the real difference?

Rugbist@rugbist_

@omarsar0 did u implement it fully or just cherry-pick the parts u needed? curious what ur setup looks like.
