Elie Bakouch says Mistral AI co-founder Guillaume Lample's FP8 training run matches the learning rate schedule of DeepSeek-V2
Story Overview
Guillaume Lample posted a loss curve from an internal FP8 training run at Mistral AI. Elie Bakouch noted that the points where the learning rate changes align exactly with DeepSeek-V2's warmup and step-decay pattern across roughly 8.1 trillion tokens. The only visible differences are that the second phase appears to decay slightly instead of holding a constant rate, and that the cooldown is also a decay rather than a constant.
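For readers unfamiliar with the term, a warmup and step-decay schedule can be sketched as follows. This is a minimal illustration, not Mistral AI's or DeepSeek's actual code: the function name is hypothetical, and the default values (2,000 warmup steps, a peak rate of 2.4e-4, drops by a factor of 0.316 at about 60% and 90% of training) are assumptions based on the figures commonly cited from the DeepSeek-V2 technical report. The sketch holds the rate constant between drops, so it does not model the slight decay observed in the Mistral curve.

```python
def warmup_step_decay_lr(step, total_steps, max_lr=2.4e-4, warmup_steps=2000,
                         milestones=(0.6, 0.9), factor=0.316):
    """Return the learning rate at `step` (hypothetical helper).

    Defaults are assumed values, not confirmed settings of any specific run.
    """
    # Linear warmup from 0 to max_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of training completed so far.
    progress = step / total_steps
    # Count how many decay milestones have been passed.
    drops = sum(progress >= m for m in milestones)
    # Multiply by the decay factor once per milestone passed.
    return max_lr * factor ** drops


if __name__ == "__main__":
    total = 100_000
    for s in (0, 1_000, 50_000, 70_000, 95_000):
        print(s, warmup_step_decay_lr(s, total))
```

On a loss curve, each drop in the learning rate typically shows up as a sudden downward step, which is why the change points can be read off a chart like the one shared.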
The final drop reaches 1.27
The shared chart tracks a sharp early plunge, a long noisy plateau, and a steeper finish, all under the run label that nods to the ongoing 'le gros chaton' chatter without any further confirmation.
Release details stay out of frame
No model size, architecture choices, benchmarks, or timeline have been released alongside the chart, so any link to an upcoming launch stays speculative.
Many users expressed excitement about the FP8 model training loss curves shared at NeurIPS because they indicate Mistral AI's new model is advancing successfully and may lead to strong releases.
Most Activity

@GuillaumeLample
this is the exact same lr schedule as deepseek v2 in terms of when the lr changes
the difference is that the second phase does seem to be a slight decay instead of constant, and of course cooldown is also a decay instead of constant

@GuillaumeLample Do not expose Le Chaton Fat to ozempic please (no quantization)

@GuillaumeLample @yacineMTB Now that's some proper cooldown

@GuillaumeLample @ESYudkowsky prepare the ICBMs

@GuillaumeLample How far we’ve come

@GuillaumeLample let’s see paul allens loss curve

@GuillaumeLample does anyone have a comparison? I can't really tell, I don't know anything about ML

@GuillaumeLample Cook cook cook
guillaume tweeting the loss curve also makes me hope that they release a tech report for the next big chaton model 🙏

@GuillaumeLample Sudden drop!

@GuillaumeLample @grok what does this mean?

@GuillaumeLample Is it finally done training?

@GuillaumeLample 😸

@GuillaumeLample It's happening !!!

@GuillaumeLample `run000` & `percent_done` 🤣

@GuillaumeLample Keep on training for a few 1000 iterations more. We can wait for a quality product.

@GuillaumeLample We will meme this into reality