
Elie Bakouch says Mistral AI co-founder Guillaume Lample's FP8 training run matches the learning rate schedule of DeepSeek-V2

Story Overview

Guillaume Lample posted a loss curve from an internal FP8 training run at Mistral AI. Elie Bakouch noted that the points where the learning rate changes align exactly with DeepSeek-V2's warmup and step-decay pattern across roughly 8.1 trillion tokens. The only visible difference is that the later stages show a mild decay where DeepSeek-V2 holds the rate flat.
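For readers unfamiliar with the pattern being compared, here is a minimal Python sketch of a warmup-and-step-decay schedule. The default values (peak rate 2.4e-4, 2,000 warmup steps, a 0.316 multiplier at about 60% and 90% of training) are what the DeepSeek-V2 technical report describes as far as I recall, so treat them as assumptions. The function name and the use of step fraction as a stand-in for token fraction are hypothetical simplifications. Nothing here is taken from Mistral AI's run.

```python
def warmup_step_decay_lr(step, total_steps, max_lr=2.4e-4, warmup_steps=2000,
                         decay_factor=0.316, milestones=(0.6, 0.9)):
    """Return the learning rate at `step` for linear warmup plus step decay.

    All defaults are assumptions for illustration, not confirmed settings.
    """
    # Phase 1: linear warmup from 0 up to max_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Later phases: the rate stays flat between milestones and is multiplied
    # by decay_factor each time training passes one of them.
    progress = step / total_steps
    passed = sum(progress >= m for m in milestones)
    return max_lr * decay_factor ** passed


if __name__ == "__main__":
    total = 100_000  # hypothetical run length in optimizer steps
    for s in (0, 1_000, 50_000, 70_000, 95_000):
        print(s, warmup_step_decay_lr(s, total))
```

In this sketch every phase after warmup is flat. The variation Bakouch describes would keep the same change points but let the rate drift downward within the second phase and during the cooldown.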


Original post

Guillaume Lample @ NeurIPS 2024@GuillaumeLample

5:33 AM · Jun 15, 2026 · 87.8K Views

FYI

The final drop reaches 1.27

The shared chart shows a sharp early drop, a long noisy plateau, and a steeper final descent. The run label nods to the ongoing 'le gros chaton' chatter, but nothing further has been confirmed.

Open Question

Release details stay out of frame

No model size, architecture choices, benchmarks, or timeline have been released alongside the chart, so any link to an upcoming launch stays speculative.

Sentiment

Many users expressed excitement about the FP8 training loss curve, taking it as a sign that Mistral AI's new model is training successfully and could lead to a strong release.

Pos 100.0% · Neg 0.0% · 18 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 4.1K · LIKES 134

Romain Simon@romainsimon

@GuillaumeLample

4h · 4.1K · 134

BOOKMARKS 5 · RETWEETS 2 · REPLIES 4

elie@eliebakouch

this is the exact same lr schedule as deepseek v2 in terms of when the lr changes

the difference is that the second phase does seem to be a slight decay instead of constant, and of course cooldown is also a decay instead of constant

Guillaume Lample @ NeurIPS 2024@GuillaumeLample

1h · 3.5K · 415

Sam Ctrlman@ceo_of_the_moon

@GuillaumeLample Do not expose Le Chaton Fat to ozempic please (no quantization)

5h · 3.5K · 271

Lucas Beyer (bl16)@giffmana

@GuillaumeLample @yacineMTB Now that's some proper cooldown

3h · 2.3K · 25

TimDarcet@TimDarcet

@GuillaumeLample @ESYudkowsky prepare the ICBMs

4h95113

jl²@jlsquare_

@GuillaumeLample

5h · 2.6K · 12

nlev@nlevnaut

@GuillaumeLample

5h · 2K · 11

Thaddée Tyl@espadrine

@GuillaumeLample How far we’ve come

3h · 1.7K · 11

Sachit@cyb3r_17

@GuillaumeLample let’s see paul allens loss curve

4h · 1.6K · 9

Mylène Aurafarmer@hewlettplacard

@GuillaumeLample does anyone have a comparison? I can't really tell, I know nothing about ML

4h54811

Aurea@AureaLibe

@GuillaumeLample Cook cook cook

2h27411

elie@eliebakouch

guillaume tweeting the loss curve also makes me hope that they release a tech report for the next big chaton model 🙏

1h44470