Hugging Face releases Carbon, a family of open autoregressive genomic foundation models in 500M, 3B, and 8B sizes that match Evo2-7B performance at 250 times higher throughput

Original post

We are releasing Carbon: a crazy fast DNA model

Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days.

Here are the tricks we used:

When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because there is no whitespace, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens.

Carbon is built with a unique tokenizer: we split sequences into chunks of 6 bases, but during both training and inference we can work at single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure.
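A minimal sketch of what a deterministic 6-base tokenizer could look like, assuming non-overlapping chunks and a fixed vocabulary of all 4^6 = 4096 possible chunks. The names `encode`, `decode`, and `token_to_bases` are hypothetical and are not taken from the Carbon code; how Carbon handles sequences whose length is not a multiple of 6 is not stated in the post, so this sketch simply rejects them.

```python
from itertools import product

BASES = "ACGT"
K = 6  # chunk size given in the post

# Every possible 6-mer gets a fixed ID: 4**6 = 4096 tokens, no learned merges.
VOCAB = {"".join(kmer): i for i, kmer in enumerate(product(BASES, repeat=K))}
ID_TO_KMER = {i: kmer for kmer, i in VOCAB.items()}


def encode(seq: str) -> list[int]:
    """Split a sequence into non-overlapping 6-mers and map each to its ID."""
    if len(seq) % K != 0:
        # Simplification for this sketch only; not Carbon's documented behavior.
        raise ValueError("this sketch only handles lengths divisible by 6")
    return [VOCAB[seq[i:i + K]] for i in range(0, len(seq), K)]


def decode(token_ids: list[int]) -> str:
    """Map token IDs back to the original base sequence."""
    return "".join(ID_TO_KMER[t] for t in token_ids)


def token_to_bases(token_id: int) -> list[int]:
    """Recover the six per-position base indices (0..3) from one token ID.

    Because the ID is just the 6-mer read as a base-4 number, each token
    still exposes its individual bases.
    """
    digits = []
    for _ in range(K):
        token_id, digit = divmod(token_id, 4)
        digits.append(digit)
    return digits[::-1]


ids = encode("ACGTACGGGGGG")
print(ids, decode(ids), token_to_bases(ids[0]))
```

The base-4 structure of the IDs is one way to read "single-base resolution": a token is a compact word, yet each of its six letters can be recovered exactly.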

The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size.

We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life.

https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

9:31 AM · May 19, 2026 · 336.7K Views


Thomas Wolf@Thom_Wolf

It turns out DNA modeling is interestingly different from language modeling. Read more in our interactive blogpost/demo and explore our work here

A joint work of the @huggingscience, pre-training and post-training teams here

Quoting Leandro von Werra@lvwerra: "We are releasing Carbon: a crazy fast DNA model"


clem 🤗@ClementDelangue

The future of biology shouldn’t stay behind black-box APIs. Especially when it touches personal health.

Whether you’re @bryan_johnson measuring every biomarker, or @sytses openly sharing and analyzing his own immune-genetics data, you need open, local, transparent AI.

@huggingface wasn’t created to be a biology company. It’s not the most obvious focus for us. But it feels too important not to do something.

That’s why we built and released Carbon 🧬: a frontier DNA base model with open weights, training code and data pipeline, designed to be fine-tuned or continually pretrained for downstream biological tasks.

Carbon is 275x faster than the next best model at its size. Fast enough to run locally on your laptop. Powerful enough to process a whole human genome on a single GPU in less than 2 days.
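As a back-of-the-envelope check on the "whole human genome in less than 2 days" claim, the arithmetic below estimates the minimum sustained throughput it implies. The genome length of about 3.1 billion bases and a single pass over non-overlapping 6-base tokens are assumptions made for this estimate, not figures from the post.

```python
GENOME_BASES = 3.1e9             # approximate human genome length (assumption)
BASES_PER_TOKEN = 6              # 6-base tokens, as described in the thread
BUDGET_SECONDS = 2 * 24 * 3600   # the "less than 2 days" budget

tokens = GENOME_BASES / BASES_PER_TOKEN
min_tokens_per_second = tokens / BUDGET_SECONDS
min_bases_per_second = GENOME_BASES / BUDGET_SECONDS

print(f"{tokens:.3g} tokens to process")
print(f"at least {min_tokens_per_second:.0f} tokens/s "
      f"({min_bases_per_second:.0f} bases/s) sustained")
```

Under these assumptions the claim corresponds to roughly 3,000 tokens per second, or about 18,000 bases per second, sustained for two days.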

The technical unlock: a DNA-native tokenizer that splits sequences into 6-base chunks for efficiency, while preserving single-base resolution during training and inference. The result: more people able to inspect, run, fine-tune, improve and build on top of the models shaping biology.

Open weights: https://huggingface.co/collections/HuggingFaceBio/carbon
Dataset: https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus
Demo: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

Let's go open AI biology!


Loubna Ben Allal@LoubnaBenAllal1

Introducing Carbon 🧬 a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot.
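Zero-shot mutation scoring with a generative sequence model is commonly done by comparing the model's log-likelihood of the mutated sequence against the reference sequence. The sketch below shows that general idea only. It is an assumption that Carbon's scoring works this way, and `toy_log_likelihood` is a hypothetical stand-in, not the Carbon model or its API.

```python
import math
from typing import Callable


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_score(ref_seq: str, pos: int, alt: str,
                  log_likelihood: Callable[[str], float]) -> float:
    """Log-likelihood ratio of the mutated vs. reference sequence.

    Negative values mean the model finds the mutated sequence less likely.
    """
    alt_seq = apply_snv(ref_seq, pos, alt)
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)


# Toy stand-in for a real model (NOT Carbon): independent per-base probabilities.
TOY_BASE_PROBS = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}


def toy_log_likelihood(seq: str) -> float:
    return sum(math.log(TOY_BASE_PROBS[base]) for base in seq)


delta = variant_score("ACGT", 1, "T", toy_log_likelihood)
print(f"score for C>T at position 1: {delta:.3f}")
```

With a real model, `log_likelihood` would be replaced by a function that sums the model's token log-probabilities over the sequence.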

We borrowed a lot from how modern LLMs are trained, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:

Tokenizer. Most genomic models tokenize at the nucleotide/character level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.
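To make "6× shorter sequences and cheaper attention" concrete, the small calculation below compares token counts and the number of query-key pairs for one sequence. It assumes standard self-attention whose cost grows with the square of the token count; the 6,000-base example length is arbitrary.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


seq_len_bases = 6_000                 # arbitrary example length
char_tokens = seq_len_bases           # one token per nucleotide
kmer_tokens = seq_len_bases // 6      # one token per 6 nucleotides

ratio = attention_pairs(char_tokens) / attention_pairs(kmer_tokens)
print(f"{char_tokens} vs {kmer_tokens} tokens, "
      f"{ratio:.0f}x fewer attention pairs")
```

Under that quadratic assumption, a 6× reduction in sequence length cuts the attention pair count by 36×, while the per-token layers see a 6× reduction.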

Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5/6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).
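The sketch below illustrates the problem described above and one plausible kind of factorized alternative. The exact definition of FNS is in the technical report and is not reproduced here; the assumption in this sketch is a loss that marginalizes the 4096-way token distribution into six per-position nucleotide distributions and sums their cross-entropies. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
K = 6
KMERS = ["".join(p) for p in product(BASES, repeat=K)]
KMER_ID = {kmer: i for i, kmer in enumerate(KMERS)}


def token_cross_entropy(probs: list[float], target: str) -> float:
    """Standard cross-entropy over the 4096 6-mer tokens."""
    return -math.log(probs[KMER_ID[target]])


def factorized_loss(probs: list[float], target: str) -> float:
    """Sum of per-position nucleotide cross-entropies (illustrative only)."""
    # marginals[i][b] = total probability of 6-mers with base b at position i
    marginals = [[0.0] * 4 for _ in range(K)]
    for kmer, p in zip(KMERS, probs):
        for i, base in enumerate(kmer):
            marginals[i][BASES.index(base)] += p
    return -sum(math.log(marginals[i][BASES.index(base)])
                for i, base in enumerate(target))


def peaked(kmer: str, mass: float = 0.9) -> list[float]:
    """Distribution with `mass` on one 6-mer and the rest spread uniformly."""
    rest = (1.0 - mass) / (len(KMERS) - 1)
    probs = [rest] * len(KMERS)
    probs[KMER_ID[kmer]] = mass
    return probs


target = "ACGTAC"
near_miss = peaked("ACGTAA")  # 5 of 6 bases right
far_miss = peaked("TGACGT")   # every base wrong

for name, probs in [("near miss", near_miss), ("far miss", far_miss)]:
    print(name,
          round(token_cross_entropy(probs, target), 2),
          round(factorized_loss(probs, target), 2))
```

In this toy setup the token-level cross-entropy is identical for the near miss and the far miss, while the per-position loss is much lower for the near miss, which is the kind of partial credit a factorized loss can provide.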

Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation, like mixing a web corpus, but for biology.

We're releasing the models, training data, training code, evaluation suite, and a demo to play with.

More details in the technical report: https://github.com/huggingface/carbon/blob/main/tech-report.pdf
Demo to play with the model, with a biology primer for our ML friends ;) https://huggingface.co/spaces/HuggingFaceBio/carbon-demo


Loubna Ben Allal@LoubnaBenAllal1

What can a DNA foundation model actually do?

We got this question a lot after releasing Carbon, our new DNA model. Here are three things it does.

🧬 All live in our demo: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo


Lewis Tunstall@_lewtun

Excited to share Carbon, the most efficient foundation models for generative DNA 🧬. Carbon-3B matches the performance of leading DNA models, while being over 275x faster at inference!

We trained Carbon on 1T tokens of high-quality DNA sequences and folded in all the tricks of modern LLMs:

- RMSNorm + SwiGLU + RoPE
- long-context expansion
- GQA

However, training DNA models is cursed compared to LLMs: most of the public data is noisy, BPE doesn't work, cross-entropy loss blows up after a few hundred billion tokens, and there are basically no public evals for such models :(

We solved all these issues over the past few months and you can read all about them in our interactive explainer: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo


elie@eliebakouch

very cool work and always nice to see more projects pushing llm training into the ai for science direction

here for instance predicting how likely a DNA change is to be dangerous among other things

Quoting Loubna Ben Allal@LoubnaBenAllal1: "What can a DNA foundation model actually do?"


Lewis Tunstall@_lewtun

The Carbon tech report is now on bioRxiv. It provides a detailed recipe for training fully open and efficient DNA models - enjoy!

Quoting Leandro von Werra@lvwerra: "We are releasing Carbon: a crazy fast DNA model"


Anshul Kundaje@anshulkundaje

@lvwerra @huggingface Can you preview in a tweetorial what the model can actually do biologically? That would be a lot more useful than knowing it is fast. Particularly interested in benchmarks against other SOTA methods (not EVO2 or other DNALMs, which get trashed in all benchmarks).


Lewis Tunstall@_lewtun

The model is really blazing fast and can even generate the whole human genome (3.1B base pairs) on your laptop

Quoting Lewis Tunstall@_lewtun: "Excited to share Carbon, the most efficient foundation models for generative DNA 🧬."


Kyle Boddy@drivelinekyle

@lvwerra Really awesome. Used it today on my data and parsed out some spots to then use with GPT-5.5 Pro and my other biometric data (Oura, bloodwork, etc).

Appreciate the model!


Bojan Jakimovski@Shekswess

@huggingface Bio released Carbon, an open DNA foundation model family.

We tested a simple infra question: "Can Carbon run on @awscloud Trainium2 with NxD Inference on day one?"

The answer is: Hell yes !!!

Carbon-500M, 3B, and 8B all compiled and ran on a single trn2.3xlarge !!!


Kyle Boddy@drivelinekyle

I've long used GPT-5.5-Pro and Opus 4.6/4.7 for my profile of biometrics; I've had an Oura ring for years, track bloodwork (not as regularly as I should...), and have gotten my genome sequenced for various reasons.

Tokenization issues with genomes are a real problem (don't have to tell you that haha), so this efficient model (I loaded the 8B-parameter edition on a Blackwell RTX PRO 6000) is welcome for pre-processing my genome for use in large, long-TTC models, to help me plan out supplementation, sleeping/exercise patterns, etc.

I can go into more detail if you like - shoot me a DM if you want to connect!


Quentin Lhoest 🤗@lhoestq

@LoubnaBenAllal1 @danaaubakir Wait there is also the 500GB dataset here 🤯 https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus


Leandro von Werra@lvwerra

@vincentweisser Thanks @vincentweisser! Genetic-Intellect when?


Loubna Ben Allal@LoubnaBenAllal1

1/ Generate DNA

Carbon can auto-complete DNA sequences. We translated the output into protein and folded it with ESMFold (AlphaFold-style model): the 3D structure closely matches the real protein.
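The "translated the output into protein" step can be sketched with the standard genetic code, as below. This is a generic illustration, not the demo's actual pipeline: it assumes the coding sequence starts at position 0 on the forward strand, and it does not cover the folding step with ESMFold.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
CODON_BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(CODON_BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA to protein, reading codons from position 0.

    Stops at the first stop codon; trailing bases that do not fill a
    complete codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```

The resulting amino-acid string is what a structure predictor would take as input.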


Arie Windmill@ArieWindmill

@lvwerra the only way you could get that kind of performance without a genuine breakthrough is by tamping down the KV, distilling, and lowering precision


Vincent Weisser@vincentweisser

@lvwerra epic work advancing open science!!


Lewis Tunstall@_lewtun

More details can be found in our tech report https://paperswithcode.co/paper/83340

Quoting Lewis Tunstall@_lewtun: "The model is really blazing fast and can even generate the whole human genome (3.1B base pairs) on your laptop"

Anshul Kundaje@anshulkundaje

@ekinda @lvwerra @huggingface Ok that's not a good sign. The big issue with genome scale cross species DNALM models is not their speed or size. It is that they are not learning fundamental properties of DNA especially in vertebrates & substantially underperform alternative modeling strategies.


Loubna Ben Allal@LoubnaBenAllal1

2/ Internal representations

Carbon's embeddings separate genes by species, biotype, strand, and GC content. Biological structure emerges from pretraining alone.

Biologists can train lightweight classifiers on top for tasks like splice site or regulatory element prediction.
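A "lightweight classifier on top" usually means a linear probe trained on frozen embeddings. The sketch below trains a tiny logistic-regression probe in plain Python. The 2-D vectors are synthetic placeholders, not Carbon embeddings, and the function names are hypothetical; in practice the inputs would be embedding vectors extracted from the model and the labels would come from an annotated dataset.

```python
import math


def train_logistic_probe(embeddings, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the probe's score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


# Synthetic stand-ins for frozen embeddings (NOT real Carbon outputs).
toy_embeddings = [[-1.0, -0.5], [-0.8, -1.2], [-1.3, -0.9],
                  [1.0, 0.7], [0.9, 1.1], [1.2, 0.6]]
toy_labels = [0, 0, 0, 1, 1, 1]  # e.g. "not a splice site" vs "splice site"

weights, bias = train_logistic_probe(toy_embeddings, toy_labels)
print([predict(weights, bias, x) for x in toy_embeddings])
```

Only the probe's few parameters are trained, which is why this kind of classifier is cheap compared with fine-tuning the full model.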
