Systems engineer Yacine and developer xlr8harder debate LLM hyperparameter tuning, arguing MuP is insufficient on its own
Story Overview
Systems engineer Yacine starts a public thread on the everyday choices involved in scaling up LLM training runs, zeroing in on hyperparameter selection, the practical limits of MuP, and whether learning-rate decay is standard practice. Developer xlr8harder replies that scaling laws and affordable grid searches remain the go-to tools and that MuP alone does not cut it.
Why MuP needs backup in production
Thread replies treat MuP as a partial stabilizer for transferring settings from small proxies to larger models, yet note it leaves gaps that require extra tuning steps or other adjustments once full runs begin.
Fast iteration beats perfect theory
The main takeaway is that teams prioritize cheap, quick iteration on smaller models, using scaling laws and hyperparameter searches, rather than relying on any single parameterization trick to handle everything.
Supportive replies defend the questions about hyperparameter methods for LLM training as a valuable way to learn rather than stupid, while dismissive replies call language itself flawed and LLMs dumb.

@yacineMTB its all vibes man

@yacineMTB https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook
Here is all inside
@yacineMTB Scaling laws, grid searches or algorithmic searches, etc. main key is to make it fast and affordable to iterate
Sort of, but it's not enough
Yes
I'm going to ask some very stupid questions about large language model training this week. You guys will be very annoyed
How do people figure out the hyperparameters for these big training runs? Does mup actually work? Do they do LR decay?

@rms80 @yacineMTB @usehallway implement the paper

@yacineMTB muP gets u 90%, u do a sweep on a small fixed budget to refine, and then sweep on how optimal lr changes with token budget, then extrapolate
(probably)
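
The "sweep on how optimal lr changes with token budget, then extrapolate" step above could be sketched roughly like this. It assumes the best LR follows a power law in the token budget, which is a modeling assumption and not something the thread confirms. The sweep numbers and the function names fit_power_law and extrapolate_lr are hypothetical, made up purely for illustration.

```python
import math


def fit_power_law(tokens, best_lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(tokens).

    Returns (a, b) such that lr is approximately a * tokens ** b.
    The power-law form is an assumption, not a guarantee.
    """
    xs = [math.log(t) for t in tokens]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_mean - b * x_mean)
    return a, b


def extrapolate_lr(a, b, target_tokens):
    """Evaluate the fitted power law at a larger token budget."""
    return a * target_tokens ** b


# Hypothetical results: best LR found by a small sweep at each token budget.
sweep_tokens = [1e9, 2e9, 4e9, 8e9]
sweep_best_lrs = [3.0e-3, 2.4e-3, 1.9e-3, 1.5e-3]

a, b = fit_power_law(sweep_tokens, sweep_best_lrs)
predicted_lr = extrapolate_lr(a, b, 1e11)
print(f"exponent b = {b:.3f}, predicted LR at 1e11 tokens = {predicted_lr:.2e}")
```

Treat the extrapolated value as a starting point for a narrow sweep at the larger scale, not a final answer; a fit through four points far from the target budget carries a lot of uncertainty.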

@yacineMTB See also MAI-1 thinking; a lot of the values are done in ablation studies for smaller models and then upscaled via a “best guess”.

@yacineMTB Hyperparameters you tune on smaller models and work your way up. Lots of experiments upfront that should hold as you ladder up
Mup: yeah it's important in keeping the above stable
LR decay: yup

@yacineMTB And yes, LR decay is goat and always works. I mostly spend my time fixing loss spikes, which is always bad data and never something cool or interesting.

@yacineMTB You just tweak one knob, train, see what happens....then tweak another and repeat until you've run out of Runpod credits.

@yacineMTB Tune critical HPs (esp LR) on small proxy models + mup. Yes to LR decay. Standard is warmup + cosine, with warmup standard decay.
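
For reference, the warmup + cosine schedule mentioned above can be sketched as a plain function of the step count. This is a minimal version assuming linear warmup followed by a single cosine decay to a floor. The function name lr_at_step and the example values (peak LR, warmup length, total steps) are hypothetical and not taken from the thread.

```python
import math


def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Hypothetical settings for illustration only.
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at_step(s, max_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=1e-4))
```

In practice most frameworks ship an equivalent scheduler, so a hand-rolled version like this is mainly useful for checking what the curve looks like before a run.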

@yacineMTB lots of checkpointing, a bit of luck, a ton of data cleanup :)

@yacineMTB Asking stupid questions is the best way to learn things. (I also don't know the answers to any of these)

@yacineMTB Alchemy

@yacineMTB Vibes and Inshallah man. I was wondering the same thing while I was running a NN

@yacineMTB It’s like stock trading. They guess… you can decide if it’s luck or skill 😂

@yacineMTB why don't you just ask your clanker

@yacineMTB Oh boy here we go

@yacineMTB random selection and survival of the fittest

@yacineMTB Just let the last sota model decide

@yacineMTB i dont know your goal but start with Lora first. Do you really need a new base model or do you just want it ?