Systems engineer Yacine and developer xlr8harder debate LLM hyperparameter tuning, arguing MuP is insufficient on its own
Story Overview
Systems engineer Yacine starts a public thread on the everyday choices involved in scaling up LLM training runs. He asks how hyperparameters get selected, whether MuP actually works, and whether learning-rate decay is standard practice. Developer xlr8harder replies that scaling laws and affordable grid searches remain the go-to tools and that MuP alone is not enough.
Why MuP needs backup in production
Replies treat MuP as a partial stabilizer for transferring settings from small proxy models to larger ones. One reply estimates it gets you about 90% of the way, with a small fixed-budget sweep to refine the rest; xlr8harder's verdict is "Sort of, but it's not enough."
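As a rough illustration of what "transferring settings" can mean, here is a minimal sketch assuming the commonly described MuP rule for Adam-style optimizers, in which the learning rate of hidden weight matrices shrinks in proportion to width while the embedding learning rate stays fixed. The function name, the two group names, and the widths are hypothetical, and real MuP setups also change initialization and output scaling, which this sketch ignores.

```python
def scaled_learning_rates(base_lr, base_width, target_width):
    """Rescale a learning rate tuned on a narrow proxy model for a wider model.

    Assumption (not from the thread): hidden-matrix learning rates scale as
    base_width / target_width, and embedding learning rates are left unchanged.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    ratio = base_width / target_width
    return {
        "embedding": base_lr,       # kept at the proxy model's value
        "hidden": base_lr * ratio,  # shrinks as the model gets wider
    }


# Hypothetical numbers: a proxy of width 256 tuned to lr=1e-2, scaled to width 4096.
lrs = scaled_learning_rates(base_lr=1e-2, base_width=256, target_width=4096)
```

Even under these assumptions the result is only a starting point, which matches the thread's view that a follow-up sweep is still needed.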
Fast iteration beats perfect theory
The main takeaway is that teams prioritize cheap, quick iteration on smaller models, using scaling laws and grid or algorithmic searches, rather than relying on any single parameterization trick to handle everything.
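One reply in the thread suggests sweeping how the optimal learning rate changes with token budget and then extrapolating. A minimal sketch of that idea follows, assuming the relationship is roughly a power law, which is an assumption of this example and not something the thread states. The sweep results are made-up placeholder numbers, and `fit_power_law` and `predict` are hypothetical helper names.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b


# Placeholder data: best learning rate found at each small token budget.
token_budgets = [1e9, 2e9, 4e9, 8e9]
best_lrs = [3.0e-3, 2.4e-3, 1.9e-3, 1.5e-3]

a, b = fit_power_law(token_budgets, best_lrs)
lr_at_100b = predict(a, b, 1e11)  # extrapolate well beyond the swept range
```

Extrapolating this far past the measured budgets is exactly the "best guess" step the replies describe, so the output is a candidate to verify, not a guaranteed optimum.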
Supportive replies praise Yacine's questions about hyperparameter methods for LLM training as a great way to learn, while others respond with dismissiveness or hostility.
Most Activity

@yacineMTB its all vibes man

@yacineMTB https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook
Here is all inside

@yacineMTB muP gets u 90%, u do a sweep on a small fixed budget to refine, and then sweep on how optimal lr changes with token budget, then extrapolate
(probably)
@yacineMTB Scaling laws, grid searches or algorithmic searches, etc. main key is to make it fast and affordable to iterate
Sort of, but it's not enough
Yes
I'm going to ask some very stupid questions about large language model training this week. You guys will be very annoyed
How do people figure out the hyperparameters for these big training runs? Does mup actually work? Do they do LR decay?

@yacineMTB See also MAI-1 thinking; a lot of the values are done in ablation studies for smaller models and then upscaled via a “best guess”.

@yacineMTB Hyperparameters you tune on smaller models and work your way up. Lots of experiments upfront that should hold as you ladder up
Mup: yeah it's important in keeping the above stable
LR decay: yup

@yacineMTB And yes, LR decay is goat and always works. I mostly spend my time fixing loss spikes, which is always bad data and never something cool or interesting.

@yacineMTB You just tweak one knob, train, see what happens....then tweak another and repeat until you've run out of Runpod credits.

@yacineMTB lots of checkpointing, a bit of luck, a ton of data cleanup :)

@yacineMTB Asking stupid questions is the best way to learn things. (I also don't know the answers to any of these)

@yacineMTB Alchemy

@yacineMTB Vibes and Inshallah man. I was wondering the same thing while I was running a NN

@yacineMTB It’s like stock trading. They guess… you can decide if it’s luck or skill 😂

@yacineMTB Oh boy here we go

@yacineMTB random selection and survival of the fittest

@yacineMTB Just let the last sota model decide

@yacineMTB i dont know your goal but start with Lora first. Do you really need a new base model or do you just want it ?

@yacineMTB small learning rate and big weight decay
@yacineMTB At Google we ran 1000 training runs with different hparams and then selected from the best ones.
Maybe things have changed I think hparam selection isn't really a science.

@yacineMTB wait until u realize language is a piece of shit n u have to cook your own vocabulary anon
llms r dumb because our way of using words is dumb especially english
