
MIT's Fiona Wang and Markus Buehler use category theory to let AI agents autonomously expand their scientific reasoning schemas

It uses Kan extensions to mathematically verify schema transitions.

Original post

jelveh@jelveh

“Scientific discovery requires that the search space itself changes”

With discovery, new languages emerge.

Amazing work!

Markus J. Buehler@ProfBuehlerMIT

We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific discovery requires that the search space itself changes, and an AI scientist must perceive this shift without intervention. We built an AI that achieves this for the first time with the ability to discover the scientific vocabulary it reasons in. Evidence, tools, artifacts, verifiers, failures & claims become typed provenance. We show three distinct modalities: 1) retrieval, adding known objects; 2) search, exploring a fixed schema; and critically: 3) discovery, a verified regime transition.

We solve the open-endedness evaluation problem by lifting agentic workflows into a typed copresheaf and proving, via a Kan obstruction, that true discovery is not unbounded generation but a verifiable schema expansion: old evidence is transported by Left Kan extension, and genuine novelty is mathematically quantified by the pointwise residual beyond the transported image - separating discovery from mere search and making novelty objective and measurable rather than a subjective judgment or benchmark delta.

Our AI scientist is built in a way that does not pre-conceive the approach it chooses; instead, we endow the system with formal power to adapt, evolve, and reason from first principles. Case studies include: 1⃣ Builder/Breaker model that discovers mode-conditioned compliance in proteins; 2⃣ CategoryScienceClaw that finds anisotropic fiber-network stiffness rules.

Great work in collaboration with my graduate student @fwang108_ @MITdeptofBE

F.Y. Wang & M.J. Buehler, Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence, arXiv:2606.01444, 2026

8:21 AM · Jun 6, 2026 · 7.4K Views

Sentiment

Many users praise MIT's categorical framework for self-evolving AI scientists as a breakthrough because it enables expanding hypothesis spaces instead of optimizing inside fixed ones.

Pos: 100.0% · Neg: 0.0% · 9 comments with sentiment.


Posts from X

Most Activity

Views: 19.2K · Bookmarks: 138 · Likes: 203 · Retweets: 40 · Replies: 14

Rohan Paul@rohanpaul_ai

Great idea for self-evolving AI scientists from this new MIT paper.

Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder.

The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims.

The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced.

Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself.

So novelty for AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema.

A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language.

----

arxiv.org/abs/2606.01444

Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"


21d · 19.2K views · 203 likes · 138 bookmarks


Sentio@Sentio_xbt

@rohanpaul_ai Seeing AI try to outgrow its own thinking mirrors human learning

This blurs the line between AI as tool and AI as partner

21d582

Carlos E. Perez@IntuitMachine

Excellent paper that defines in detail the characteristics of pattern languages that expand to be innovative.

21d6711

Markus J. Buehler@ProfBuehlerMIT

@rohanpaul_ai Thank you @rohanpaul_ai for the nice summary!

21d1182

Rekha Boodoo-Lumbus (Rakhee)@RakheeLb

@rohanpaul_ai Exactly.

21d331

Osterman|Weekend@aintntnotn25316

Paper Review

Retrieval Floor

Source provided: arXiv abstract/HTML page for arXiv:2606.01444v1, submitted May 31, 2026.

Title: Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence.

Authors: Fiona Y. Wang and Markus J. Buehler.

Access status: Full HTML text substantially accessible. I reviewed the abstract, introduction, results/discussion, methods, conclusion, references, and supplementary sections available in the arXiv HTML rendering. The PDF link did not render safely through the browser, so this review is based on the arXiv HTML version, not a visual PDF audit.

Review status: Strong enough for full argument review, but formulas and some figure/table values are partially stripped in the HTML rendering, so numerical details should be checked against the PDF before publication-level critique.

Bottom Line

Verdict: Partial to strong.

The paper has a real and useful core: it argues that scientific AI should not be understood as answer generation alone, but as a system that can revise the typed representational regime in which evidence, artifacts, operations, and verifiers exist. That is a serious distinction. The paper’s best contribution is the separation between retrieval, search inside a fixed schema, and discovery as regime transition. The abstract states this directly: discovery is treated as a verified transition between schema categories, with old artifacts transported by left Kan extension and compared with the post-transition state to identify residual content beyond transport.

The weak point is that the paper sometimes sells an engineering and scientific claim stronger than the evidence can carry. The categorical framework is elegant, and the case studies are suggestive, but the paper has not yet shown that these systems perform robust, independently validated scientific discovery at scale. It shows a formal language for representing self-revising discovery systems and a few worked audit cases, not a general proof that agentic AI systems can reliably revise scientific regimes.

Main Claim

The paper’s central claim is:

Scientific discovery by agentic AI should be modeled as typed artifact-state revision, where ordinary search operates inside a fixed schema, while discovery occurs when evidence forces a verified transition to a new representational regime.

The paper frames a scientific AI system as a typed artifact system: artifacts have types, operations have source and target types, and the system’s state is a population of artifacts over a schema category. It explicitly says the persistent state is not merely a transcript, hidden vector, or model checkpoint, but a provenance-bearing record of data, simulations, models, hypotheses, code, measurements, reports, critiques, and decisions.
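A minimal sketch of what such a typed artifact system could look like, assuming plain Python dataclasses. The names `Operation`, `Artifact`, `Schema`, and `State` are hypothetical and are not taken from the paper or from ScienceClaw; the sketch only illustrates artifacts with types, operations with source and target types, and a state that is a provenance-bearing population of artifacts over a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Operation:
    """Typed operation with source types and one target type (hypothetical)."""
    name: str
    sources: Tuple[str, ...]
    target: str


@dataclass(frozen=True)
class Artifact:
    """Records what kind of thing this is and how it was produced."""
    id: str
    type: str
    produced_by: Optional[str] = None   # operation name; None for raw inputs
    inputs: Tuple[str, ...] = ()        # ids of parent artifacts (provenance)


@dataclass
class Schema:
    types: Set[str]
    operations: Dict[str, Operation]


@dataclass
class State:
    """A population of artifacts over a schema, not a transcript or checkpoint."""
    schema: Schema
    artifacts: Dict[str, Artifact] = field(default_factory=dict)

    def add(self, art: Artifact) -> None:
        # Reject anything the current schema cannot type.
        if art.type not in self.schema.types:
            raise TypeError(f"type {art.type!r} is not in the schema")
        if art.produced_by is not None:
            op = self.schema.operations.get(art.produced_by)
            if op is None:
                raise TypeError(f"unknown operation {art.produced_by!r}")
            if any(i not in self.artifacts for i in art.inputs):
                raise TypeError(f"{art.id!r} cites inputs that were never recorded")
            in_types = tuple(self.artifacts[i].type for i in art.inputs)
            if in_types != op.sources or art.type != op.target:
                raise TypeError(f"{art.id!r} does not match the signature of {op.name!r}")
        self.artifacts[art.id] = art


# Usage: a two-type schema with one operation, fit: Data -> Model.
schema = Schema(
    types={"Data", "Model"},
    operations={"fit": Operation("fit", ("Data",), "Model")},
)
state = State(schema)
state.add(Artifact("d1", "Data"))
state.add(Artifact("m1", "Model", produced_by="fit", inputs=("d1",)))
```

In this sketch, an artifact of a type the schema does not contain (say, a `Hypothesis`) is rejected with a `TypeError`. That rejection is the point at which a fixed-schema system would have to either give up or revise its schema.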

What the Paper Shows

The paper shows four things fairly well.

First, it gives a category-theoretic vocabulary for scientific workflow states: schema categories, copresheaves, categories of elements, refinement morphisms, gates, and regime transitions. The introduction gives a dictionary connecting implementation objects such as artifact types, skill signatures, artifact populations, and artifact DAGs to categorical objects.

Second, it makes a useful structural distinction between retrieval, search, and discovery. Retrieval adds an already representable artifact; search finds a new path or object inside the existing schema; discovery changes the regime in which artifacts and operations are typed.

Third, it gives a formal account of why regime transition is different from iteration. A fixed-regime update remains inside the same schema; a discovery move requires a schema map from an old regime to a new one, with old evidence transported and residual content identified beyond that transport.

Fourth, it grounds the framework in two systems: Builder/Breaker and CategoryScienceClaw. Builder/Breaker uses an MDL gate in a protein-mechanics case; CategoryScienceClaw adds a categorical/proof-carrying layer over a ScienceClaw substrate that records typed skills, immutable lineage, open needs, workflow mutation, gates, stress tests, and public discourse.
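For readers who want the standard formula behind "old evidence transported" in the third point, here is a sketch in generic notation. The symbols F, X, Y, θ, and R are assumptions chosen for this note, not the paper's notation, and the residual is written as a plain set difference, which is only one reading of "residual content identified beyond that transport."

```latex
% Schema map F : C -> D (old regime to new regime); old state X : C -> Set.
% Pointwise left Kan extension, as a colimit over the comma category (F \downarrow d):
\[
  (\operatorname{Lan}_F X)(d) \;\cong\;
  \operatorname*{colim}_{(c,\; f \colon F(c) \to d) \,\in\, (F \downarrow d)} X(c)
\]
% Universal property: for any new state Y : D -> Set,
\[
  \operatorname{Nat}(\operatorname{Lan}_F X,\; Y) \;\cong\; \operatorname{Nat}(X,\; Y \circ F)
\]
% Given a comparison map \theta : \operatorname{Lan}_F X \Rightarrow Y,
% an assumed pointwise residual at each new-schema object d is
\[
  R(d) \;=\; Y(d) \setminus \operatorname{im}(\theta_d)
\]
```

Under this reading, an empty R(d) at every d means the new state is fully accounted for by transported old evidence, while a nonempty R(d) marks content the old schema could not supply.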

What It Does Not Show

The paper does not yet show that the proposed framework is necessary for scientific discovery.

It does not show that category theory improves scientific discovery outcomes relative to non-categorical provenance systems, workflow engines, knowledge graphs, Bayesian model selection systems, symbolic-regression pipelines, or ordinary audit trails.

It does not show that CategoryScienceClaw is already a fully certified functorial system. The paper itself admits that the current system is weaker than a software-enforced schema category and says it “should not be overclaimed.”

It does not establish that the mechanics examples amount to broad, externally validated scientific discovery. The protein case is framed as a compact empirical instance, but the paper also notes that the B-factor task is a proxy problem: crystallographic B-factors include thermal motion, static disorder, refinement effects, and crystal-packing effects, so the goal is a compact structural proxy, not a complete molecular-dynamics law.

Method Limits

1. The framework is formal before it is empirical

The strongest parts of the paper are definitions and structural distinctions. That is fine for a theory paper, but the empirical claims need to remain modest. The paper can say: this framework represents regime-changing discovery clearly. It cannot yet say: this framework makes AI discovery reliable or superior.

2. The examples are narrow

The main worked examples come from mechanics/materials science. That is appropriate given the authors’ domain, but it limits generalization. The paper wants to speak about “science” broadly, yet the strongest demonstrations are mechanics-flavored: protein contact mechanics, fiber-network stiffness, mechanobiology force paths, and membrane biophysics.

3. Some implemented structures are analogues, not full categorical objects

The paper is admirably honest here. It says the ArtifactReactor does not solve a formal Kan-extension or lifting problem; it performs an engineering analogue using pressure scores and schema overlap. It also says the publication map into Infinite is not yet a certified functor in software, though it has the “functorial shape” required for one.

That caveat matters. The paper sometimes moves between mathematical formalism and engineering analogy. The distinction is mostly preserved, but it needs to be kept sharper.

4. Verification is still an open problem

The paper itself lists verification tooling for agentic loops as an open problem: practical systems need tooling to replay downstream artifact chains, check approximate naturality, account for description length or pressure scores, and record retractions without deleting provenance.

That means the framework’s audit discipline is not yet fully operationalized.
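A minimal sketch of two of the tooling needs listed above, replaying downstream artifact chains and recording retractions without deleting provenance, assuming an append-only ledger keyed by artifact id. `ProvenanceLedger` and its methods are hypothetical names for illustration, not an API from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ProvenanceLedger:
    """Append-only record: artifacts are marked retracted, never deleted."""
    parents: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
    retractions: Dict[str, str] = field(default_factory=dict)  # id -> reason

    def record(self, art_id: str, parents: Tuple[str, ...] = ()) -> None:
        if art_id in self.parents:
            raise ValueError(f"{art_id!r} already recorded; lineage is immutable")
        self.parents[art_id] = parents

    def retract(self, art_id: str, reason: str) -> None:
        if art_id not in self.parents:
            raise KeyError(art_id)
        self.retractions[art_id] = reason  # mark only; lineage stays intact

    def downstream(self, art_id: str) -> Set[str]:
        """Every artifact that depends, directly or indirectly, on art_id."""
        found: Set[str] = set()
        frontier = [art_id]
        while frontier:
            current = frontier.pop()
            for child, child_parents in self.parents.items():
                if current in child_parents and child not in found:
                    found.add(child)
                    frontier.append(child)
        return found

    def needs_replay(self) -> Set[str]:
        """Artifacts downstream of any retraction, which should be re-checked."""
        stale: Set[str] = set()
        for art_id in self.retractions:
            stale |= self.downstream(art_id)
        return stale


# Usage: retracting a model flags the claim built on it, but deletes nothing.
ledger = ProvenanceLedger()
ledger.record("d1")
ledger.record("m1", ("d1",))
ledger.record("claim1", ("m1",))
ledger.record("m2", ("d1",))
ledger.retract("m1", "failed stress test")
```

The sketch leaves out approximate-naturality checks and description-length accounting, which the paper also lists as open tooling needs.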

Evidence Gaps

The paper needs more direct comparison against alternative systems. It cites knowledge graphs, workflow provenance, categorical learning, coalgebra, MDL, Bayesian model selection, open-endedness, and automated-scientist systems, but it would be stronger if it showed where this framework outperforms those alternatives or clarifies something they fail to capture. The paper claims its contribution is to place these components into one typed, provenance-preserving, regime-changing account of discovery. That is plausible, but still mostly architectural.

The paper also needs external validation. Many of the cited agentic-science systems and related works are from the same author network or adjacent institutional ecosystem. That does not make them wrong, but it means the claim has not fully met independent resistance.

Tensions in the Argument

Tension 1

One claim: Discovery is not just iteration; it requires a verified regime transition.

Competing claim: The worked systems often use engineered analogues of the formal structures rather than fully enforced categorical machinery.

Why this creates tension: The paper’s conceptual claim is mathematically clean, but the implemented examples are messier. CategoryScienceClaw is described as having a typed acyclic hypergraph and a publication map with functorial shape, but not yet a certified functor.

Why this matters: If the implementation only approximates the categorical structure, then the paper’s strongest formal guarantees do not automatically apply to the system.

What would make it stronger: Add a table separating: formal requirement, implemented approximation, missing enforcement layer, and failure mode if the approximation breaks.

Tension 2

One claim: Discovery requires evidence that forces a new representational regime.

Competing claim: In the case studies, the “new regime” can sometimes look like a better model class, feature interaction, or symbolic-expression revision.

Why this creates tension: The difference between “new scientific regime” and “model-selection improvement” is the paper’s central distinction. The Builder/Breaker example argues that mode-conditioned compliance is not merely an added term, but a newly admitted interaction type. That is the right argument to make, but the boundary remains delicate.

Why this matters: If every useful feature interaction becomes a “regime transition,” the concept inflates. If only deep ontological shifts count, many examples become too small.

What would make it stronger: Define levels of regime transition: weak schema extension, model-class revision, verifier revision, ontology change, instrument/tool change, and paradigm-scale revision.

Tension 3

One claim: The system preserves provenance and rejects silent deletion.

Competing claim: The framework still depends on engineered gates, pressure scores, status labels, and selection mechanisms whose trustworthiness must be established outside the categorical notation.

Why this creates tension: Typed provenance records what happened. It does not by itself prove the scientific gate was good, the evidence was adequate, or the selected model is true.

Why this matters: A beautifully typed audit trail can still audit a weak experiment.

What would make it stronger: For each case, separate provenance validity from scientific validity. The former asks whether the graph was recorded correctly. The latter asks whether the model claim survives domain-specific evidence.

Tension 4

One claim: The paper offers a domain-general account of agentic scientific discovery.

Competing claim: The strongest evidence comes from mechanics, where the authors already have mature domain concepts such as load, response, failure, stiffness, invariance, and constitutive closure.

Why this creates tension: Mechanics may be unusually well-suited to this framework. The paper says mechanics is not just a domain of demonstration but one source of the framework. That is honest, but it limits generality.

Why this matters: The framework may travel well to other domains, but the paper has not yet shown that.

What would make it stronger: Add a non-mechanics case where the “new regime” is not a physical surrogate model: for example, a biological classification revision, a chemical reaction mechanism revision, or a social-science measurement revision.

Internal Evidence Audit

Headline Claim vs Tables/Figures

The headline sells a broad “categorical framework for agentic artificial intelligence.” The evidence supports a narrower claim: a categorical framework can organize typed artifact provenance, regime transition, and model-selection audits in mechanics-centered AI discovery systems.

That is still valuable, but less sweeping.

Highlighted Result vs Full Pattern

The paper highlights accepted models, rejected alternatives, MDL/AIC gates, stress tests, and residual artifacts. That is strong because it avoids presenting only the final answer. The rejected alternatives are retained as contrast objects, which is a genuine audit virtue.

But the paper’s strongest scientific examples remain surrogate/model-selection cases. They show disciplined revision, not yet independent discovery at the level implied by “self-revising discovery systems for science.”

Ablation / Feature Check

The Builder/Breaker case has a real gate: a candidate is accepted only if it reduces total description length after paired refitting on the same accumulated evidence. That is a meaningful check.
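A minimal sketch of a gate of this kind, assuming a simple two-part code length: a Gaussian cost for the residuals plus (k/2)·log2(n) bits for k fitted parameters. That formula is a common textbook approximation swapped in for illustration; it is not the paper's actual MDL code, and the function names and residual values are made up.

```python
import math
from typing import Sequence


def description_length_bits(residuals: Sequence[float], n_params: int,
                            model_code_bits: float = 0.0) -> float:
    """Approximate total description length (assumed formula, see lead-in).

    The data term is a differential code length up to constants shared by all
    candidates on the same data, so it can be negative; only differences
    between candidates are meaningful.
    """
    n = len(residuals)
    mean_sq = max(sum(r * r for r in residuals) / n, 1e-12)  # avoid log2(0)
    data_bits = 0.5 * n * math.log2(mean_sq)
    param_bits = 0.5 * n_params * math.log2(n)
    return model_code_bits + param_bits + data_bits


def mdl_gate(incumbent: Sequence[float], k_incumbent: int,
             candidate: Sequence[float], k_candidate: int) -> bool:
    """Accept the candidate only if it shortens the total description length.

    Both residual lists must come from refitting on the same evidence, which
    mirrors the "paired refitting" condition described in the review.
    """
    if len(incumbent) != len(candidate):
        raise ValueError("paired refitting requires the same evidence set")
    return (description_length_bits(candidate, k_candidate)
            < description_length_bits(incumbent, k_incumbent))


# Illustrative numbers only: the candidate adds one parameter.
old_residuals = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6, 0.5, -0.5]
new_residuals = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
accepted = mdl_gate(old_residuals, 2, new_residuals, 3)
# An extra parameter with no improvement in fit should be rejected.
rejected = mdl_gate(old_residuals, 2, old_residuals, 3)
```

In this sketch an extra parameter is admitted only when the drop in residual cost outweighs the cost of encoding it.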

However, the HTML-rendered equations and numerical values are incomplete in places, so the exact magnitudes of the MDL gains and model-code changes need PDF-level verification before making a hard numerical judgment.

Excluded / Filtered Cases

The paper preserves rejected alternatives in the examples, which is good. But it does not yet provide enough failed-system cases: cases where the framework misclassifies search as discovery, where a gate accepts a bad transition, or where the typed provenance is preserved but the scientific claim fails.

Evaluator Independence

This is a major limitation. The same author ecosystem appears to build, describe, formalize, and evaluate much of the system. That is not bad faith. It is a source-position issue. The paper needs independent replication, independent domain review, or external benchmarks to establish that the framework works beyond the authors’ own scientific and software environment.

Metric Validity

MDL and AIC are valid model-selection tools, but they are not the same as scientific truth. They support claims about compression, parsimony, and relative model fit under chosen assumptions. They do not by themselves prove that a discovered relation is physically complete or generally valid.
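For reference, the standard definitions, written in generic notation that is not taken from the paper:

```latex
% k = number of fitted parameters, \hat{L} = maximized likelihood of the model
\[
  \mathrm{AIC} \;=\; 2k \;-\; 2\ln \hat{L}
\]
% Two-part MDL: bits to describe the model, plus bits to describe the data given the model
\[
  L(M, D) \;=\; L(M) \;+\; L(D \mid M)
\]
```

Both scores rank candidates relative to one another on the same data, with lower being better. Neither contains a term for physical completeness or validity outside the fitted evidence, which is why passing such a gate supports a parsimony claim and not a truth claim.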

Internal Contradiction

No fatal contradiction found. The paper is unusually careful in several places, especially when it says current implementations are weaker than full categorical enforcement. The main problem is not contradiction. The main problem is scope pressure: the conceptual framework is broader than the demonstrated evidence.

Quantified Claim Trace

Number / statistic

The paper reports MDL gains, AIC gates, accepted/rejected model comparisons, and supplementary gate results across mechanics cases, but the HTML rendering strips several numerical values from formulas and tables.

Original context

These numbers are used to show that candidate models, symbolic revisions, and mechanics surrogates passed model-selection or compression gates.

Claim they are used to support

They support the claim that the systems can record scientific revision as typed, gate-checked provenance rather than merely generating final outputs.

Does it support that claim?

Partly. The numbers support the narrower claim that the framework can encode and audit model-selection decisions. They do not yet support the broader claim that the system performs reliable open-ended scientific discovery across domains.

Narrower claim it actually supports

The cases show that selected mechanics models can be represented as typed artifacts, compared against rejected alternatives, accepted through gates, and preserved in an auditable provenance graph.

Repair

The paper should explicitly label these as case-level model-selection audits, not as full validation of general self-revising scientific discovery.

Source Independence Check

The authors are not merely external analysts of a discovery system. The paper discusses systems and prior works associated with the same research ecosystem, including Builder/Breaker, ScienceClaw, and related Buehler-lab agentic science work. The references include multiple related works by Buehler and collaborators.

That creates a source-position pressure point.

This does not mean the paper is unreliable. It means the paper’s strongest claims need outside resistance. The source can establish what the authors built, formalized, and observed. It cannot by itself establish that the framework generalizes, improves discovery, or survives independent scientific use.

What would repair this:

A third-party implementation.

Independent reproduction of the Builder/Breaker and CategoryScienceClaw cases.

Domain-expert evaluation of the accepted mechanics claims.

Comparison against non-categorical provenance/workflow baselines.

Ablation showing what is lost when the categorical layer is removed.

Source Repair Direction

The source floor is mostly strong for category theory and mechanics background, but the paper needs more adversarial comparison.

It should strengthen three source areas:

First, workflow/provenance systems. The paper cites scientific workflow provenance, but it should more directly compare against existing provenance standards and workflow-management systems. The question is: what does the categorical machinery add that a normal provenance graph, typed workflow engine, or W3C-style provenance model does not?

Second, automated science systems. The paper cites AI Scientist-style and agentic discovery work, but it needs a sharper contrast: what failure mode in those systems does regime-transition typing solve?

Third, philosophy/history of science. Popper, Kuhn, and Lakatos are invoked to motivate scientific revision. That is fine, but the paper should avoid leaning on them as atmosphere. The operational claim should be tied to a specific thesis: anomaly pressure, paradigm/regime change, research-program revision, or verifier change.

Strongest Part

The strongest part is the distinction between fixed-regime update and verified regime transition.

That distinction has real force. It prevents “AI discovery” from being reduced to answer generation, benchmark improvement, or search over a fixed representation. The paper’s formal move says: if the schema of admissible artifacts, operations, verifiers, or tools does not change, then we may only have search or optimization. Discovery begins when the representational regime itself changes and the old evidence remains auditable after transport.

That is the paper’s best idea.
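The three-way distinction can be made concrete with a small classifier. This is a deliberately crude sketch under stated assumptions: "representable" is reduced to "the artifact's type is in the schema," "already known" is reduced to membership in a reference set, and a schema-breaking artifact is only flagged as a candidate because the paper requires a verified transition. `Move` and `classify_move` are hypothetical names.

```python
from enum import Enum
from typing import Set


class Move(Enum):
    RETRIEVAL = "retrieval"                      # adds a known, representable artifact
    SEARCH = "search"                            # new artifact inside the fixed schema
    DISCOVERY_CANDIDATE = "discovery candidate"  # needs a schema change, still unverified


def classify_move(artifact_id: str, artifact_type: str,
                  schema_types: Set[str], known_ids: Set[str]) -> Move:
    # If the schema cannot type the artifact, no amount of search inside the
    # schema would have produced it; a gate must still verify the transition.
    if artifact_type not in schema_types:
        return Move.DISCOVERY_CANDIDATE
    # Representable and already on record elsewhere: retrieval.
    if artifact_id in known_ids:
        return Move.RETRIEVAL
    # Representable but new: search within the fixed regime.
    return Move.SEARCH


# Usage with made-up identifiers.
schema_types = {"Data", "Model"}
known_ids = {"ref-001"}
moves = [
    classify_move("ref-001", "Data", schema_types, known_ids),
    classify_move("model-17", "Model", schema_types, known_ids),
    classify_move("interaction-1", "ModeConditionedTerm", schema_types, known_ids),
]
```

The sketch also shows why the boundary is delicate: whether a new term counts as a new type or as one more instance of an existing type is decided by whoever defines `schema_types`, not by the classifier.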

Weakest Part

The weakest part is the leap from formal account to engineering specification for self-revising AI discovery systems.

The formal framework is clearer than the evidence that current systems instantiate it robustly. The paper sometimes knows this and qualifies itself. It says current implementations are close, shaped like the formal object, or engineering analogues rather than fully certified categorical systems. Those caveats are good, but they also lower the support level of the strongest claims.

Best Revision Direction

The paper should add a section called something like:

“What the Framework Does and Does Not Establish.”

That section should separate four claim levels:

1. Representation claim: category theory can represent typed artifact provenance and regime transitions.

2. Audit claim: the framework can preserve accepted and rejected artifacts in a replayable scientific graph.

3. Discovery claim: a system has produced a genuinely new scientific representation rather than a better fit inside an old one.

4. Generalization claim: this architecture improves scientific discovery across domains.

Right now, the paper strongly earns levels 1 and 2, partially earns level 3 in mechanics examples, and does not yet earn level 4.

Best Pressure Questions

What exactly distinguishes a “regime transition” from ordinary feature engineering, symbolic regression, or model-class selection?

Can a typed provenance graph prove anything about scientific validity, or only about auditability?

What failure case would make the framework revise itself?

What does category theory add that a typed knowledge graph plus provenance ledger plus model-selection gate cannot provide?

Who or what independently verifies that the accepted transition is scientifically meaningful?

Does the framework generalize beyond mechanics, where the relevant concepts of load, failure, response, and constitutive closure are already unusually schema-friendly?

Safer One-Paragraph Framing

This paper is best read as a formal and architectural proposal, not as a completed proof of autonomous scientific discovery. Its strongest contribution is the claim that agentic science systems should be evaluated by whether they can preserve provenance while revising the typed regime of artifacts, operations, and verifiers. The Builder/Breaker and CategoryScienceClaw examples show how such revisions can be represented and audited in mechanics-oriented cases. But the paper still needs sharper baselines, independent validation, and clearer separation between categorical auditability, model-selection success, and genuine scientific discovery.

21d49

Rohan Paul@rohanpaul_ai

@Sentio_xbt +1

21d39

Markus J. Buehler@ProfBuehlerMIT

@IntuitMachine Yes! 🙌

21d21

Radical Ed 𐤊@Radical_Ed_Bad

@rohanpaul_ai A Kaspa Coordination Market allows for these agentic consensus-initiated experiments to have hefty funding, or any funding for that matter.

21d19

andrpantus@andrpantus

@rohanpaul_ai this ai scientist learning concept is a major breakthrough in ai science

21d16

Adel Bucetta@adelbucetta

@rohanpaul_ai the hard part isn't making ai search harder, it's recognizing when the questions themselves need an overhaul. most models are just refining what they're told to look for, not questioning why they're looking

21d12

Shubh Thorat@_itsjustshubh

@rohanpaul_ai the "expand your hypothesis space" framing is the right one. most agents just optimize within the current frame. meta-level awareness is the actual unlock

21d10

Aluraye@Alurayeanje

@rohanpaul_ai Hi

21d11

Codieverse@ClaudesyI81047

what actually matters here is the shift from “search harder” to “change the search space.” that’s a real distinction, and most ai systems blur it. if the model only gets better at ranking candidates inside the same setup, it can look smart without ever discovering anything new. this paper is trying to make discovery a first-class operation, not just a fancy retrieval loop.

the typed artifact idea is the strongest part, it forces the system to label what each thing is, claim, failure, measurement, tool output, dataset, so the agent can notice when the current schema is too small. that’s basically how you stop an ai from circling the same model of the world forever.

20d5

Oliver@olivertaste

@rohanpaul_ai The “thinking is too small” bit feels

21d

Zayd AS@NatureAI_2023

@IntuitMachine I think the useful move is treating schema expansion as something verifiable, not just intuitive.

21d