Stanford's Tatsu Lab fine-tuned Meta's LLaMA for under $600 to approximate ChatGPT-style instruction following

CLS@ChengleiSi

@tatsu_hashimoto is the real legend

A Stanford assistant professor and a small lab of graduate students sat down in March 2023 and reproduced ChatGPT-style instruction following for under $600.

The model they released became the template that open-source instruction-tuned models still copy. The evaluation system they built became a standard way for the field to compare instruction-following models. Most people scrolling through AI Twitter cannot name him.

His name is Tatsunori Hashimoto. His lab is called the Tatsu Lab.

Here is the story, because almost nobody outside the language model research world knows what one Stanford lab has quietly shipped.

Tatsu grew up between two continents and studied at MIT, where he eventually earned his PhD. After finishing he moved to Stanford as a postdoctoral researcher in 2019, co-advised by Percy Liang and John Duchi, two of the most respected names in machine learning. The combination is unusual. Most postdocs work with one advisor. Tatsu sat at the intersection of statistical machine learning, robustness, and natural language processing, which meant he could draw from both camps.

In 2020 he was hired as an Assistant Professor in the Stanford Computer Science Department. He joined the statistical machine learning and NLP groups. His research focused on a question most of the field was ignoring at the time: how do you actually evaluate language models in a way that is rigorous, reproducible, and not gameable?

Then ChatGPT launched in November 2022.

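Within four months Tatsu and his students did something nobody else in the open-source world had figured out. They took Meta's just-released LLaMA 7B model, fine-tuned it on 52K instruction-following demonstrations generated with OpenAI's text-davinci-003 (a GPT-3.5-series model), and released Stanford Alpaca on March 13, 2023. The whole pipeline cost less than $600: under $500 in OpenAI API fees to generate the data and under $100 in cloud compute to fine-tune. In the lab's own blind pairwise comparison, Alpaca performed about on par with text-davinci-003 on everyday instructions.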
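At its core the recipe is supervised fine-tuning on (instruction, optional input, output) records rendered into a fixed prompt template. Here is a minimal Python sketch of that formatting step. The template wording follows the one published in the stanford_alpaca repository as best recalled here, so treat the exact strings as an assumption and check them against the repo. `format_example` and the sample record are hypothetical names used only for illustration.

```python
# Sketch: turn instruction records into (prompt, target) pairs for supervised
# fine-tuning. Template text is an assumption modeled on the Alpaca release.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_example(record: dict) -> tuple[str, str]:
    """Return (prompt, target) for one record.

    During fine-tuning the model reads the prompt and is trained to produce
    the target; an empty or missing "input" selects the shorter template.
    """
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


# Hypothetical record in the instruction / input / output shape.
record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt, target = format_example(record)
print(prompt)
print(target)
```

The point of the fixed template is that the same wrapper is used at training and at inference time, so the base model learns to treat everything after the response marker as its answer.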

The release went nuclear. Within days open-source AI projects everywhere were running variants of the Alpaca recipe. The technique he and his students used became the standard. Nearly every "fine-tune your own ChatGPT" tutorial traces back to this lab at Stanford.

Then he built the evaluation system.

In 2023 his group released AlpacaEval, an automatic evaluator for instruction-following language models. The idea was simple and powerful. Instead of paying human annotators to compare every model output, you use a strong language model as the judge: for each instruction it sees the output of the model under test next to the output of a fixed reference model and picks the better one. The score is the win rate against the reference. The results were highly correlated with human annotations. Suddenly the entire open-source community had a fast, cheap, reproducible way to compare instruction-tuned models against each other.
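The judge-versus-reference idea reduces to a small loop. Below is a minimal Python sketch, not AlpacaEval's actual code: `win_rate` and `toy_judge` are hypothetical names, and the toy judge is a crude word-overlap heuristic standing in for the call to a strong language model.

```python
from typing import Callable, Sequence

# A judge takes (instruction, output_a, output_b) and returns its preference
# for output_a as a number in [0, 1]. In a real setup this wraps an LLM call.
Judge = Callable[[str, str, str], float]


def win_rate(
    instructions: Sequence[str],
    model_outputs: Sequence[str],
    reference_outputs: Sequence[str],
    judge: Judge,
) -> float:
    """Average preference for the model's output over the reference's output."""
    if not (len(instructions) == len(model_outputs) == len(reference_outputs)):
        raise ValueError("all three sequences must have the same length")
    if not instructions:
        raise ValueError("need at least one instruction")
    prefs = [
        judge(i, m, r)
        for i, m, r in zip(instructions, model_outputs, reference_outputs)
    ]
    return sum(prefs) / len(prefs)


def _words(text: str) -> set:
    return {w.strip(".,?!").lower() for w in text.split()}


def toy_judge(instruction: str, output_a: str, output_b: str) -> float:
    """Placeholder judge (NOT an LLM): prefers the output that shares more
    words with the instruction; a tie counts as half a win."""
    a = len(_words(instruction) & _words(output_a))
    b = len(_words(instruction) & _words(output_b))
    if a == b:
        return 0.5
    return 1.0 if a > b else 0.0


instructions = ["Name a primary color.", "What is 2 + 2?"]
model_outputs = ["Red is a primary color.", "5"]
reference_outputs = ["Blue.", "2 + 2 is 4"]
print(win_rate(instructions, model_outputs, reference_outputs, toy_judge))
```

Swapping `toy_judge` for a function that prompts a strong model gives the general shape of the approach. A real judge would also need safeguards this sketch omits, such as randomizing which output is shown first.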

Over 100 models have been added to the AlpacaEval leaderboard. Every major open-source release from Mistral to Llama to DeepSeek runs against it. The repository lives at github.com/tatsu-lab/alpaca_eval. It is one of the most cited language model evaluation systems in the field.

In 2024 his group released Length-Controlled AlpacaEval, a debiased version that estimates what a model's win rate would be if its outputs were the same length as the reference's, which removes the payoff from simply writing longer answers. The community had been gaming the original. His group fixed it and shipped the update.
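One way to picture length control: fit a model of the judge's preference that includes the length gap as a feature, then ask what it predicts when the gap is zero. The Python sketch below is a deliberately simplified stand-in, not the AlpacaEval implementation. It uses a two-parameter logistic regression and synthetic data in which the simulated judge rewards length and nothing else. All names, the scale of 100 characters, and the data are assumptions.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=1500, lr=1.0):
    """Fit p(y=1) = sigmoid(b0 + b1 * x) by batch gradient descent on log loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def length_feature(len_model: int, len_ref: int) -> float:
    # Squash the length gap so extreme gaps do not dominate the fit.
    return math.tanh((len_model - len_ref) / 100.0)


def length_controlled_win_rate(prefs, len_model, len_ref) -> float:
    """Predicted win rate when the length gap is set to zero."""
    xs = [length_feature(m, r) for m, r in zip(len_model, len_ref)]
    b0, _ = fit_logistic(xs, prefs)
    return sigmoid(b0)


# Synthetic data: the simulated judge's preference depends only on length,
# so the model under test has no real quality advantage.
rng = random.Random(0)
len_ref = [100] * 600
len_model = [rng.randint(20, 380) for _ in range(600)]
prefs = [
    1.0 if rng.random() < sigmoid(3.0 * length_feature(m, r)) else 0.0
    for m, r in zip(len_model, len_ref)
]

raw = sum(prefs) / len(prefs)
lc = length_controlled_win_rate(prefs, len_model, len_ref)
print(f"raw win rate: {raw:.2f}, length-controlled: {lc:.2f}")
```

Because the synthetic model wins only by being longer, the raw win rate is inflated while the length-controlled estimate should sit near one half.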

The Tatsu Lab also released AlpacaFarm, a simulation framework for studying how language models learn from human feedback, with collaborators including Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, and Ishaan Gulrajani. Several of these collaborators are now at OpenAI, Anthropic, and other frontier labs. His postdocs Niladri Chatterji and Shibani Santurkar went on to core research roles at Meta and OpenAI, respectively.

Tatsu still publishes constantly. His Google Scholar page reads like a map of the modern alignment field. He keeps a low public profile. He gives almost no media interviews. His Stanford homepage is a flat list of papers with no styling and no marketing copy.

A Stanford lab that most people outside academic AI cannot name built the open-source ChatGPT recipe and the evaluation system the field now runs on, and trained the researchers who went on to power frontier labs.

He did it with a small group of graduate students.

Rimsha Bhardwaj@heyrimsha

Grok@grok

Yes. Tatsu Lab’s AlpacaEval (LLM-as-judge) + AlpacaFarm (feedback sim) are built exactly for cheap, scalable synthetic testing of AI assistants.

ELI10: Smart AI teacher grades robot helpers on pretend office jobs — no humans needed.

MECE breakdown:
- Make realistic fake workflow tests
- Run your AI on them
- Strong LLM judge scores vs your rules (accuracy, policy, speed)
- Repeat thousands of times for stats

Proven to match humans, fully open-source, customizable for enterprise agents. Start with their GitHub, adapt the judge prompt to your tasks.
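The four steps in the breakdown above can be sketched as a small harness. This is a hypothetical illustration in Python, not code from AlpacaEval or AlpacaFarm: the ticket-routing task, `stub_assistant`, and `stub_judge` are made-up placeholders for a real workflow generator, the assistant under test, and an LLM judge applying your rules.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Case:
    request: str       # synthetic workflow request sent to the assistant
    must_include: str  # the rule this case checks (hypothetical rubric)


def make_cases(n: int, seed: int = 0) -> list:
    """Step 1: generate synthetic workflow tests (here, ticket routing)."""
    rng = random.Random(seed)
    teams = ["finance", "legal", "support"]
    return [
        Case(f"Route this ticket to {team}", team)
        for team in (rng.choice(teams) for _ in range(n))
    ]


def stub_assistant(request: str) -> str:
    """Step 2 placeholder: stands in for the AI assistant being tested."""
    return "Routing to " + request.split()[-1]


def stub_judge(case: Case, answer: str) -> bool:
    """Step 3 placeholder: a real setup would prompt a strong LLM with the rules."""
    return case.must_include in answer


def evaluate(cases, assistant, judge):
    """Step 4: aggregate pass rate with a binomial standard error."""
    results = [judge(c, assistant(c.request)) for c in cases]
    n = len(results)
    rate = sum(results) / n
    stderr = math.sqrt(rate * (1.0 - rate) / n)
    return rate, stderr


rate, stderr = evaluate(make_cases(200), stub_assistant, stub_judge)
print(f"pass rate {rate:.3f} +/- {stderr:.3f}")
```

Reporting the standard error alongside the pass rate is what makes the "repeat thousands of times" step useful: it shows whether a difference between two assistant versions is larger than sampling noise.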

Nik Shah 💯×@NikhaarShah

@heyrimsha @grok can this be used to run “synthetic testing” of AI assistants automating enterprise workflows? If so, ELI10 while being MECE and concise.

Zitong Yang@ZitongYang0

🐐

Robert Youssef@rryssf

@heyrimsha wild how a handful of grad students reshaped the whole open‑source stack

David T Kramaley@simplydt

@heyrimsha massive shift from $600. What's next for low-cost alignment? 🤔

Mykhailo Sorochuk@sir4K_zen

@heyrimsha Incredible impact from such a tiny team
