
Lewis Tunstall of Hugging Face introduces the Carbon genomic foundation models: Carbon-3B, trained on 1 trillion tokens, matches top DNA models while running inference over 275 times faster

The demo runs on a Hugging Face Space and was released alongside the new paper.

Original post

Introducing Carbon 🧬, a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot.

We borrowed a lot from how modern LLMs are trained, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:

Tokenizer. Most genomic models tokenize at the nucleotide/character level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.

Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5/6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).

Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation, like mixing a web corpus, but for biology.

We're releasing the models, training data, training code, evaluation suite, and a demo to play with.

More details in the technical report: https://github.com/huggingface/carbon/blob/main/tech-report.pdf

Demo to play with the model, with a biology primer for our ML friends ;) https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

9:20 AM · May 19, 2026
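
The training-loss point in the post above (a 6-mer prediction with 5 of 6 nucleotides right is scored like a completely wrong one) can be illustrated with a toy comparison. This is a sketch under stated assumptions, not the FNS loss from the technical report: `factorized_loss` is a hypothetical per-nucleotide loss obtained by marginalizing the 6-mer distribution at each position, and `peaked_distribution` is an invented helper that puts most of the probability on a single 6-mer.

```python
import itertools
import math

BASES = "ACGT"
# All 4**6 = 4096 possible 6-mers, i.e. the full token vocabulary.
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=6)]


def peaked_distribution(peak, peak_mass=0.9):
    """Hypothetical model output: most mass on one 6-mer, the rest uniform."""
    rest = (1.0 - peak_mass) / (len(KMERS) - 1)
    return {k: (peak_mass if k == peak else rest) for k in KMERS}


def token_cross_entropy(dist, target):
    """Standard cross-entropy over whole 6-mer tokens."""
    return -math.log(dist[target])


def factorized_loss(dist, target):
    """Illustrative per-nucleotide loss (an assumption, not the report's FNS).

    For each of the 6 positions, sum the probability of every 6-mer that has
    the correct base at that position, then add up the negative logs.
    """
    loss = 0.0
    for pos, base in enumerate(target):
        marginal = sum(p for k, p in dist.items() if k[pos] == base)
        loss -= math.log(marginal)
    return loss


target = "ACGTAC"
near_miss = peaked_distribution("ACGTAA")  # 5 of 6 bases correct
far_miss = peaked_distribution("TGCATT")   # 0 of 6 bases correct

print("token CE      near:", token_cross_entropy(near_miss, target))
print("token CE      far: ", token_cross_entropy(far_miss, target))
print("factorized    near:", factorized_loss(near_miss, target))
print("factorized    far: ", factorized_loss(far_miss, target))
```

Token-level cross-entropy cannot tell the two predictions apart, while the per-position variant gives the near miss a much lower loss.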

It turns out DNA modeling is interestingly different from language modeling. Read more in our interactive blogpost/demo and explore our work here

A joint work of the @huggingscience, pre-training and post-training teams here

Leandro von Werra (@lvwerra)

We are releasing Carbon: a crazy fast DNA model

Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days.

Here are the tricks we used:

When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because there is no whitespace, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens.

Carbon is built with a unique tokenizer: we split sequences into chunks of 6 bases, but during both training and inference we can work with single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure.

The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size.

We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life. https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

4:31 PM · May 19, 2026 · 274.1K Views
7:35 PM · May 19, 2026 · 22.1K Views
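
The 6-base chunking described in the post above can be sketched as a fixed-width k-mer tokenizer. This is a minimal illustration, not Carbon's actual tokenizer: the base-4 id layout, the names `encode` and `decode`, and the handling of leftover bases are all assumptions, and ambiguous bases such as `N` are ignored. It shows why every token can still be resolved to single bases: a token id is just six base-4 digits.

```python
BASES = "ACGT"
K = 6  # bases per token, as stated in the post
BASE_TO_DIGIT = {base: i for i, base in enumerate(BASES)}


def encode(seq):
    """Split seq into non-overlapping 6-mers and map each to an integer id.

    Returns (token_ids, remainder). The remainder holds trailing bases that
    do not fill a whole 6-mer; how the real model treats them is not covered.
    """
    usable = len(seq) - len(seq) % K
    token_ids = []
    for start in range(0, usable, K):
        token_id = 0
        for base in seq[start:start + K]:
            # Treat the 6-mer as a 6-digit number in base 4.
            token_id = token_id * 4 + BASE_TO_DIGIT[base]
        token_ids.append(token_id)
    return token_ids, seq[usable:]


def decode(token_ids):
    """Recover the exact base sequence from token ids."""
    chunks = []
    for token_id in token_ids:
        digits = []
        for _ in range(K):
            token_id, digit = divmod(token_id, 4)
            digits.append(BASES[digit])
        chunks.append("".join(reversed(digits)))
    return "".join(chunks)


ids, remainder = encode("ACGTACGGTTAACG")
print(ids, remainder)
print(decode(ids))
```

With this layout the vocabulary has 4**6 = 4096 entries, and a sequence of n bases becomes about n / 6 tokens.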

The future of biology shouldn’t stay behind black-box APIs. Especially when it touches personal health.

Whether you’re @bryan_johnson measuring every biomarker, or @sytses openly sharing and analyzing his own immune-genetics data, you need open, local, transparent AI.

@huggingface wasn’t created to be a biology company. It’s not the most obvious focus for us. But it feels too important not to do something.

That’s why we built and released Carbon 🧬: a frontier DNA base model with open weights, training code and data pipeline, designed to be fine-tuned or continually pretrained for downstream biological tasks.

Carbon is 275x faster than the next best model at its size. Fast enough to run locally on your laptop. Powerful enough to process a whole human genome on a single GPU in less than 2 days.

The technical unlock: a DNA-native tokenizer that splits sequences into 6-base chunks for efficiency, while preserving single-base resolution during training and inference.

More people able to inspect, run, fine-tune, improve and build on top of the models shaping biology.

Open weights: https://huggingface.co/collections/HuggingFaceBio/carbon
Dataset: https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus
Demo: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

Let's go open AI biology!

12:10 PM · May 20, 2026 · 16.9K Views
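
For a sense of scale, the "whole human genome on a single GPU in less than 2 days" claim can be turned into a required throughput. This is back-of-the-envelope arithmetic under assumptions: a genome length of roughly 3.1 billion bases, exactly one pass over non-overlapping 6-base tokens, and the full 2 days used. The constant names are invented for this sketch, and the posts do not say how the authors measured it.

```python
import math

GENOME_BASES = 3_100_000_000          # assumption: ~3.1 billion bases
BASES_PER_TOKEN = 6                   # from the post: 6-base chunks
TIME_BUDGET_SECONDS = 2 * 24 * 60 * 60  # 2 days

# Number of 6-base tokens needed to cover the genome once.
tokens = math.ceil(GENOME_BASES / BASES_PER_TOKEN)

# Minimum sustained rates needed to finish within the budget.
tokens_per_second = tokens / TIME_BUDGET_SECONDS
bases_per_second = GENOME_BASES / TIME_BUDGET_SECONDS

print(f"tokens to process: {tokens:,}")
print(f"required tokens/s: {tokens_per_second:,.0f}")
print(f"required bases/s:  {bases_per_second:,.0f}")
```

Under these assumptions the model needs to sustain roughly 3,000 tokens per second, about 18,000 bases per second. A single-base tokenizer would need six times as many tokens for the same genome.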

Excited to share Carbon, the most efficient foundation models for generative DNA 🧬. Carbon-3B matches the performance of leading DNA models, while being over 275x faster at inference!

We trained Carbon on 1T tokens of high-quality DNA sequences and folded in all the tricks of modern LLMs:

- RMSNorm + SwiGLU + RoPE
- long-context expansion
- GQA

However, training DNA models is cursed compared to LLMs: most of the public data is noisy, BPE doesn't work, cross-entropy loss blows up after a few hundred billion tokens, and there are basically no public evals for such models :(

We solved all these issues over the past few months and you can read all about them in our interactive explainer: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

4:48 PM · May 19, 2026 · 1.5K Views
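
For readers less familiar with the LLM components named in the post above, here is a minimal pure-Python sketch of three of them: RMSNorm, the SwiGLU gate, and the head sharing behind GQA. These are the generic textbook definitions, not Carbon's implementation. The function names, vector sizes and head counts are hypothetical, and the projection matrices are left out.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * scale for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate_proj, up_proj):
    """SwiGLU gating: SiLU of one projection multiplies the other elementwise.

    gate_proj and up_proj stand in for two linear projections of the same input.
    """
    return [silu(g) * u for g, u in zip(gate_proj, up_proj)]


def kv_head_for_query(query_head, n_query_heads, n_kv_heads):
    """GQA: each group of query heads shares one key/value head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
gated = swiglu_gate([0.0, 1.0], [2.0, 3.0])
shared = [kv_head_for_query(q, 8, 2) for q in range(8)]
print(normed, gated, shared)
```

With 8 query heads and 2 key/value heads (hypothetical numbers), the key/value cache is a quarter of the size it would be with one key/value head per query head, which is the usual reason GQA helps inference speed.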

The model is really blazing fast and can even generate the whole human genome (3.1B base pairs) on your laptop

Lewis Tunstall (@_lewtun)

4:48 PM · May 19, 2026 · 1.5K Views

More details can be found in our tech report https://paperswithcode.co/paper/83340

Lewis Tunstall (@_lewtun)

4:48 PM · May 19, 2026 · 320 Views


Very nice. As in many other AI4Science examples, the tokenizer is a great place to improve models.

Loubna Ben Allal (@LoubnaBenAllal1)


4:20 PM · May 19, 2026 · 33.4K Views
7:43 PM · May 19, 2026 · 566 Views