Stanford’s 2023 Alpaca project demonstrated that ChatGPT-like behavior could be replicated for under $600 using instruction tuning

CLS@ChengleiSi

@tatsu_hashimoto is the real legend

A Stanford assistant professor and a small lab of graduate students sat down in March 2023 and reproduced the behavior of ChatGPT for under $600.

The model they released became the template that every open-source instruction-tuned model on the planet now copies. The evaluation system they built became how the entire field measures AI alignment. Most people scrolling through AI Twitter cannot name him.

His name is Tatsunori Hashimoto. His lab is called the Tatsu Lab.

Here is the story, because almost nobody outside the language model research world knows what one Stanford lab has quietly shipped.

Tatsu grew up between two continents and studied at MIT, where he eventually earned his PhD. After finishing he moved to Stanford as a postdoctoral researcher in 2019, co-advised by Percy Liang and John Duchi, two of the most respected names in machine learning. The combination is unusual. Most postdocs work with one advisor. Tatsu sat at the intersection of statistical machine learning, robustness, and natural language processing, which meant he could draw from both camps.

By 2020 he was hired as an Assistant Professor in the Stanford Computer Science Department, where he joined the statistical machine learning and NLP groups. His research focused on a question most of the field was ignoring at the time: how do you actually evaluate language models in a way that is rigorous, reproducible, and not gameable?

Then ChatGPT launched in November 2022.

Within four months Tatsu and his students did something nobody else in the open-source world had figured out. They took Meta's just-released LLaMA 7B model, fine-tuned it on 52K instruction-following demonstrations generated by OpenAI's text-davinci-003 (a GPT-3.5-series model), and released Stanford Alpaca on March 13, 2023. Data generation and fine-tuning together cost less than $600. The resulting model behaved much like ChatGPT on most everyday tasks.
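For readers who want to see what "fine-tuned on instructions" means in practice, here is a minimal sketch of how one instruction record can be turned into a training prompt and target. The template wording is modeled on the one in the Stanford Alpaca repository, but treat it as an assumption and check the repo before reusing it; the function name `build_training_text` is hypothetical.

```python
# Prompt templates modeled on the Stanford Alpaca repo (wording is an
# assumption here; verify against the repository before relying on it).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_training_text(example: dict) -> tuple:
    """Return (prompt, target) for one instruction-following record.

    `example` is expected to have "instruction" and "output" keys and an
    optional "input" key (hypothetical record layout for this sketch).
    """
    has_input = bool(example.get("input"))
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
    )
    return prompt, example["output"]


record = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
prompt, target = build_training_text(record)
```

During supervised fine-tuning, the model is typically trained to produce `target` given `prompt`, with the loss computed on the target tokens.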

The release went nuclear. Within days open-source AI projects everywhere were running variants of the Alpaca recipe. The technique he and his students used became the standard. Nearly every "fine-tune your own ChatGPT" tutorial that exists traces back to this lab at Stanford.

Then he built the evaluation system.

In 2023 his group released AlpacaEval, an automatic evaluator for instruction-following language models. The idea was simple and powerful. Instead of paying humans hundreds of thousands of dollars to evaluate model outputs, you use a strong language model as the judge: it compares each output of the model under test with a fixed reference model's output for the same instruction, and the score is the fraction of comparisons the tested model wins. The resulting rankings were highly correlated with human annotations. Suddenly the entire open-source community had a fast, cheap, reproducible way to compare instruction-tuned models against each other.
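The judge-against-a-reference idea can be sketched in a few lines. This is an illustration, not the AlpacaEval implementation: `win_rate` and `toy_judge` are hypothetical names, and the toy judge simply prefers the longer output so the example runs offline. A real setup would call a strong language model at that point.

```python
from typing import Callable, Sequence

# A judge returns 1.0 if it prefers the candidate output, 0.0 if it prefers
# the reference output, and 0.5 for a tie.
Judge = Callable[[str, str, str], float]


def win_rate(
    instructions: Sequence[str],
    candidate_outputs: Sequence[str],
    reference_outputs: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of pairwise comparisons won by the candidate model."""
    if not (len(instructions) == len(candidate_outputs) == len(reference_outputs)):
        raise ValueError("all three sequences must have the same length")
    if not instructions:
        raise ValueError("need at least one instruction")
    prefs = [
        judge(instr, cand, ref)
        for instr, cand, ref in zip(instructions, candidate_outputs, reference_outputs)
    ]
    return sum(prefs) / len(prefs)


def toy_judge(instruction: str, candidate: str, reference: str) -> float:
    """Hypothetical stand-in for an LLM call: prefers the longer output."""
    if len(candidate) == len(reference):
        return 0.5
    return 1.0 if len(candidate) > len(reference) else 0.0


instructions = ["Name a primary color.", "What is 2 + 2?", "Give a synonym for happy."]
candidate = ["Red is a primary color.", "4", "Joyful"]
reference = ["Blue", "The answer is 4.", "Glad"]
rate = win_rate(instructions, candidate, reference, toy_judge)
```

Note that a judge that rewards length, as this toy one does, can be gamed by padding outputs, which is exactly the kind of bias an evaluator has to guard against.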

Over 100 models have been added to the AlpacaEval leaderboard. Every major open-source release from Mistral to Llama to DeepSeek runs against it. The repository lives at github.com/tatsu-lab/alpaca_eval. It is one of the most cited language model evaluation systems in the field.

In 2024 his group released Length-Controlled AlpacaEval, a debiased version that estimates what the win rate would be if the model's outputs were the same length as the reference's, which removes the payoff from simply making outputs longer. The community had been gaming the original. The lab fixed it and released the patch.
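One way to picture length control is to model the judge's preference as a function of the length difference and then ask what the model predicts when that difference is zero. The sketch below is a deliberately simplified stand-in for that idea, not the published method (which fits a richer model). It assumes a logistic model with a single length feature; the function name, the `scale` parameter, and the synthetic data are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(len_diffs, prefs, scale=100.0, lr=1.0, steps=2000):
    """Fit P(win) = sigmoid(a + b * tanh(len_diff / scale)) by gradient
    descent on the mean log-loss, then return sigmoid(a): the predicted
    win probability when candidate and reference have equal length.
    """
    feats = [math.tanh(d / scale) for d in len_diffs]
    n = len(prefs)
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(feats, prefs):
            err = sigmoid(a + b * x) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return sigmoid(a)


# Synthetic data: the candidate is 100 characters longer in 8 comparisons
# (wins 6 of them) and 100 characters shorter in 4 comparisons (wins 1).
len_diffs = [100] * 8 + [-100] * 4
prefs = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

raw = sum(prefs) / len(prefs)
controlled = length_controlled_win_rate(len_diffs, prefs)
```

In this synthetic case the raw win rate is 7/12, but the candidate wins 75% of the time when longer and 25% when shorter, so the fitted model attributes the advantage to length and the equal-length estimate comes out at about 0.5.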

The Tatsu Lab also released AlpacaFarm, a simulation framework for studying how language models learn from human feedback, with collaborators including Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, and Ishaan Gulrajani. Several of these collaborators are now at OpenAI, Anthropic, and other frontier labs. His postdocs Niladri Chatterji and Shibani Santurkar went on to core research roles at Meta and OpenAI.

Tatsu still publishes constantly. His Google Scholar reads like a map of the modern alignment field. He keeps a low public profile. He gives almost no media interviews. His Stanford homepage is a flat list of papers with no styling and no marketing copy.

A Stanford lab that most people outside academic AI cannot name built the open-source ChatGPT recipe and the evaluation system the field now runs on, and trained the researchers who went on to power frontier labs.

He did it with a small group of graduate students.

Rimsha Bhardwaj@heyrimsha

Grok@grok

Yes. Tatsu Lab’s AlpacaEval (LLM-as-judge) + AlpacaFarm (feedback sim) are built exactly for cheap, scalable synthetic testing of AI assistants.

ELI10: Smart AI teacher grades robot helpers on pretend office jobs — no humans needed.

MECE breakdown:
- Make realistic fake workflow tests
- Run your AI on them
- Strong LLM judge scores vs your rules (accuracy, policy, speed)
- Repeat thousands of times for stats

Proven to match humans, fully open-source, customizable for enterprise agents. Start with their GitHub, adapt the judge prompt to your tasks.
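The "repeat thousands of times for stats" step is worth making concrete, since a pass rate from a finite number of synthetic tests carries sampling error. A minimal sketch, assuming each test yields a pass/fail verdict from the judge; the function name and the 900-out-of-1000 figures are made up for illustration, and the interval uses the simple normal approximation.

```python
import math


def pass_rate_with_interval(results, z=1.96):
    """Return (pass_rate, lower, upper) using a normal-approximation
    confidence interval; z=1.96 corresponds to roughly 95% coverage.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    p = sum(results) / n  # True counts as 1, False as 0
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical verdicts from 1000 synthetic workflow tests.
verdicts = [True] * 900 + [False] * 100
rate, low, high = pass_rate_with_interval(verdicts)
```

With 1000 tests the interval around a 90% pass rate is a little under two percentage points on each side, which gives a rough sense of how many runs are needed before two assistants can be told apart.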

Jiaxin Wen@jiaxinwen22

@ChengleiSi @tatsu_hashimoto can i join

Nik Shah 💯×@NikhaarShah

@heyrimsha @grok can this be used to run “synthetic testing” of AI assistants automating enterprise workflows? If so, ELI10 while being MECE and concise.

Zitong Yang@ZitongYang0

🐐

Robert Youssef@rryssf

@heyrimsha wild how a handful of grad students reshaped the whole open‑source stack

David T Kramaley@simplydt

@heyrimsha massive shift from $600-what's next for low‑cost alignment? 🤔

Mykhailo Sorochuk@sir4K_zen

@heyrimsha Incredible impact from such a tiny team
