Open-Source Qwen3-4B Model Matches O3 Performance On ML Discovery Tasks

Original post

The paper and the model: https://arxiv.org/abs/2606.21891

BOOM! New Paper Proves Open Source Can Outperform Closed Frontier Labs!

A groundbreaking new paper titled “Learning the ARTS of Search for Automated Discovery” from researchers at UC Santa Barbara and Mila shows that sophisticated AI-driven scientific research no longer requires massive closed labs or enormous budgets.

Using a lightweight 4-billion-parameter open source model, the system, called ARTS (Agentic Reasoning for Tree Search), matches or exceeds the performance of advanced closed models like OpenAI’s o3 at conducting end-to-end machine learning research projects.

The ARTS framework turns an AI into a full-fledged scientist.

It proposes hypotheses, implements code, runs experiments, analyzes failures by inspecting logs and code, and decides the next steps intelligently.

Unlike simpler agents that abandon directions based solely on low scores, ARTS distinguishes between bad ideas and flawed executions. This allows it to persist with promising concepts while fixing bugs, leading to more efficient discovery.
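
To make that distinction concrete, here is a minimal sketch of a search loop that retries an idea when the failure looks like an execution bug and abandons it only when a clean run scores poorly. This is not the paper's implementation: the names (Verdict, Attempt, diagnose, search, toy_run), the string-matching diagnosis, the 0.8 threshold, and the flat loop standing in for a real tree search are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUCCESS = "success"
    EXECUTION_BUG = "execution_bug"  # the code failed, the idea may still be good
    WEAK_IDEA = "weak_idea"          # the code ran cleanly but the result was poor


@dataclass
class Attempt:
    idea: str
    score: float
    log: str


def diagnose(attempt, threshold):
    """Hypothetical stand-in for the log and code analysis step."""
    if "Traceback" in attempt.log or "Error" in attempt.log:
        return Verdict.EXECUTION_BUG
    if attempt.score >= threshold:
        return Verdict.SUCCESS
    return Verdict.WEAK_IDEA


def search(ideas, run_experiment, threshold=0.8, max_fixes=2):
    """Try each idea; retry on execution bugs, drop only on weak ideas."""
    history = []
    for idea in ideas:
        for fix_round in range(max_fixes + 1):
            attempt = run_experiment(idea, fix_round)
            verdict = diagnose(attempt, threshold)
            history.append((idea, fix_round, verdict))
            if verdict is Verdict.SUCCESS:
                return idea, history
            if verdict is Verdict.WEAK_IDEA:
                break  # the idea itself is the problem, move on
            # EXECUTION_BUG: loop again with a repaired implementation
    return None, history


def toy_run(idea, fix_round):
    """Toy experiment runner: the good idea crashes on its first try."""
    if idea == "memory-buffer" and fix_round == 0:
        return Attempt(idea, 0.0, "Traceback (most recent call last): IndexError")
    if idea == "memory-buffer":
        return Attempt(idea, 0.9, "ok")
    return Attempt(idea, 0.4, "ok")


best, history = search(["bigger-lr", "memory-buffer"], toy_run)
print(best)
for entry in history:
    print(entry)
```

In this toy run, an agent that looked only at scores would discard "memory-buffer" after its first attempt scored 0.0, while the loop above reads the log, treats the crash as a bug, and tries again.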

On 22 real ML research tasks from benchmarks like MLGym and MLE-bench, ARTS outperformed leading methods on 16 of them, with a 15.3 percent average improvement.

A 4B model enhanced with test-time training on its own search history even surpassed o3-powered versions on certain tasks, such as rediscovering effective memory-based solutions that larger systems prematurely discarded.

This work directly challenges the narrative pushed by figures like Anthropic CEO Dario Amodei, who has repeatedly warned that open source AI is “dangerous” and that open source does not work in AI the way it does in other fields, because models cannot easily be controlled once released.

Amodei argues for heavy regulation and elite control, suggesting only well-resourced labs can safely advance the frontier.

Yet ARTS demonstrates the opposite: open collaboration accelerates progress, spreads capabilities widely, and achieves elite-level results without the secrecy or billions in funding.

Points:

1. It shatters the myth that frontier AI research demands closed labs with vast resources. A small open source 4B model, running at roughly 5 times lower inference cost than o3, delivers comparable or superior research automation, proving individuals and small teams can rival billion-dollar operations.

2. ARTS introduces smarter reasoning for scientific search. By analyzing execution logs and code to separate idea quality from implementation errors, it enables persistent innovation that score-based agents miss, raising the bar for autonomous AI scientists.

3. Test-time training allows models to learn from their own failures in real time. This self-improvement loop on public models turns exploration history into knowledge, making advanced capabilities accessible without retraining from scratch.

4. It accelerates open source momentum against calls for restriction. While some executives advocate locking down powerful models to prevent misuse, ARTS shows openness fosters rapid iteration, broader participation, and safer distributed progress through community scrutiny.

5. The work democratizes high-impact AI development at a fraction of the cost. Frontier labs invest billions in compute clusters and talent; here, anyone with access to modest hardware and public code can conduct cutting-edge ML research, lowering barriers and multiplying global innovation potential.

While some in AI push for centralized control and fear open weights, the ARTS paper stands as powerful evidence for the open source ethos.

Transparency, reusability, and community-driven refinement do not hinder safety or capability; they multiply them.

By releasing everything publicly, these researchers invite the world to build upon their work, proving that true power in AI comes not from hoarding resources but from sharing knowledge freely.

This is how we advance toward understanding the universe together, faster and more equitably than any gated lab ever could.

Empirically proven.

6:56 AM · Jul 2, 2026