Tech · 3h ago

Systems engineer Yacine and developer xlr8harder discuss LLM hyperparameter tuning, with xlr8harder arguing MuP is insufficient on its own

Story Overview

Systems engineer Yacine opens a public thread on the everyday choices behind scaling up LLM training runs. He asks how hyperparameters are chosen, whether MuP works in practice, and whether learning-rate decay is standard. Developer xlr8harder replies that scaling laws and affordable grid or algorithmic searches are the go-to tools, that MuP helps but is not enough on its own, and that learning-rate decay is used.



Original post

kache @yacineMTB · #403 in Tech

I'm going to ask some very stupid questions about large language model training this week. You guys will be very annoyed

How do people figure out the hyperparameters for these big trains? Does mup actually work? Do they do LR decay?

8:15 AM · Jun 15, 2026 · 34K Views

Open Question

Why MuP needs backup in production

Thread replies treat MuP as a partial stabilizer for transferring settings from small proxies to larger models, yet note it leaves gaps that require extra tuning steps or other adjustments once full runs begin.
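To make the "transfer from a small proxy" idea concrete, here is a minimal sketch of one commonly cited simplification of MuP: under Adam, the learning rate for hidden weight matrices shrinks in proportion to 1/width. The thread does not spell out any formula, so the rule, the function name `mup_hidden_lr`, and the example widths and learning rate are all assumptions for illustration. A full MuP setup also changes initialization and output scaling, which this sketch ignores.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Rescale a hidden-layer Adam learning rate tuned at base_width.

    Assumption: the simplified rule that hidden weight-matrix learning
    rates scale as 1/width. This is a rule of thumb, not a complete
    MuP implementation.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical learning rate tuned on a width-256 proxy model.
proxy_lr = 3e-3
for width in (256, 1024, 4096):
    print(width, mup_hidden_lr(proxy_lr, base_width=256, width=width))
```

The gap the replies point to is that a rule like this covers only part of the configuration, so teams still sweep the remaining settings.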

Developer Impact

Fast iteration beats perfect theory

The main takeaway from the replies is that teams prioritize cheap, quick iteration on smaller models, using scaling laws and hyperparameter searches, rather than relying on any single parameterization trick to handle everything.
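A grid search of the kind mentioned in the replies can be sketched in a few lines. Here `proxy_loss` is a hypothetical stand-in for "train a small model and return its validation loss"; it is a synthetic function invented for this example, and the grid values are likewise made up.

```python
import itertools
import math


def proxy_loss(lr: float, weight_decay: float) -> float:
    """Hypothetical stand-in for a short training run on a small model.

    Synthetic bowl whose minimum sits at lr=1e-3, weight_decay=0.1.
    """
    return (math.log10(lr) + 3.0) ** 2 + (weight_decay - 0.1) ** 2


def grid_search(lrs, weight_decays):
    """Evaluate every combination; return (best_loss, best_lr, best_wd)."""
    results = [
        (proxy_loss(lr, wd), lr, wd)
        for lr, wd in itertools.product(lrs, weight_decays)
    ]
    return min(results)


best_loss, best_lr, best_wd = grid_search(
    lrs=[1e-4, 3e-4, 1e-3, 3e-3],
    weight_decays=[0.0, 0.1, 0.3],
)
print(best_lr, best_wd)
```

The cost of a grid grows with the product of the list lengths, which is why the replies stress keeping each proxy run fast and affordable.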

Sentiment

Positive replies defend the hyperparameter questions as valuable for learning rather than stupid, while negative replies dismiss language as flawed and LLMs as dumb.

Positive: 75.0%

Negative: 25.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Elliot Arledge @elliotarledge

@yacineMTB it's all vibes man

3h

Bookmarks: 11

Jannis @basement_agi

@yacineMTB https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook

Everything is in here

3h

Likes: 19

xlr8harder @xlr8harder

@yacineMTB Scaling laws, grid searches or algorithmic searches, etc. The main key is to make it fast and affordable to iterate

Sort of, but it's not enough

Yes

2h

Replies: 2

John Devor @johndevor

@rms80 @yacineMTB @usehallway implement the paper

1h

ueaj @_ueaj

@yacineMTB muP gets u 90%, u do a sweep on a small fixed budget to refine, and then sweep on how optimal lr changes with token budget, then extrapolate

(probably)

3h
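The last step of that recipe, sweeping how the optimal learning rate changes with token budget and then extrapolating, can be sketched as a power-law fit in log-log space. The thread gives no numbers or functional form, so the power-law assumption, the helper `fit_power_law`, and every data point below are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    variance = sum((x - mean_x) ** 2 for x in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Hypothetical sweep results: best learning rate found at each token budget.
token_budgets = [1e9, 4e9, 1.6e10]
best_lrs = [4e-3, 2e-3, 1e-3]

a, b = fit_power_law(token_budgets, best_lrs)

# Extrapolate to a budget far larger than anything that was swept.
target_tokens = 1e12
predicted_lr = a * target_tokens ** b
print(b, predicted_lr)
```

With these made-up points the fitted exponent comes out at about -0.5. How far such a fit can be trusted beyond the swept range is an open question, in keeping with the "(probably)" in the reply.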

Vincenzo @Kenfus2

@yacineMTB See also MAI-1 thinking; a lot of the values are done in ablation studies for smaller models and then upscaled via a “best guess”.

3h

Olek @oleksoleksoleks

@yacineMTB Hyperparameters you tune on smaller models and work your way up. Lots of experiments upfront that should hold as you ladder up

Mup: yeah it's important in keeping the above stable

LR decay: yup

3h

Vincenzo @Kenfus2

@yacineMTB And yes, LR decay is goat and always works. I mostly spend my time fixing loss spikes, which is always bad data and never something cool or interesting.

3h
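Several replies confirm that LR decay is used, but none names a specific schedule. As one common choice, and purely as an assumption here, a linear warmup followed by cosine decay can be sketched as below. The function name `lr_at_step` and all parameter values are illustrative.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    Assumption: cosine decay is just one popular schedule; the thread
    does not say which one any team actually uses.
    """
    if step < warmup_steps:
        # Ramp up linearly so the final warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 1,000 steps, 100 of them warmup.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at_step(step, 1000, peak_lr=1e-3, warmup_steps=100, min_lr=1e-5))
```

The schedule peaks at the end of warmup and falls smoothly to `min_lr` by the final step.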