Systems engineer Yacine asks how LLM hyperparameters are tuned at scale; developer xlr8harder replies that MuP is not enough on its own

Story Overview

Systems engineer Yacine opens a public thread on the everyday choices involved in scaling up LLM training runs. He asks how hyperparameters get chosen, whether MuP actually works, and whether learning-rate decay is standard practice. Developer xlr8harder replies that scaling laws plus grid or algorithmic searches are the main tools, that the key is keeping iteration fast and affordable, that MuP helps but is not enough on its own, and that LR decay is indeed used.

Original post

kache @yacineMTB

I'm going to ask some very stupid questions about large language model training this week. You guys will be very annoyed

How do people figure out the hyperparameters for these big trians? Does mup actually work? Do they do LR decay?

8:15 AM · Jun 15, 2026 · 19K Views

Open Question

Why MuP needs backup in production

Thread replies treat MuP as a partial stabilizer for transferring settings from small proxies to larger models, yet note it leaves gaps that require extra tuning steps or other adjustments once full runs begin.
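To make the "transfer from small proxies" idea concrete, here is a minimal sketch of one commonly cited muP rule of thumb: with an Adam-style optimizer, the learning rate for hidden weight matrices is scaled inversely with width relative to the tuned proxy. The function name and all numbers are hypothetical, and the sketch deliberately omits the other parts of muP (initialization and output scaling), which is one reason the replies say it is not sufficient alone.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Scale a hidden-matrix learning rate tuned at base_width to a new width.

    Assumption: Adam-style optimizer and the 1/width rule for hidden
    (matrix-like) weights. Vector-like parameters such as biases and
    input embeddings would keep base_lr under this rule of thumb.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical numbers: LR tuned on a width-256 proxy, target width 4096.
proxy_lr = 1e-2
target_lr = mup_hidden_lr(proxy_lr, base_width=256, width=4096)
print(f"hidden-layer LR at width 4096: {target_lr:.2e}")
```

In practice the transferred value would still be checked with a small sweep around it, which matches the "gets you most of the way, then refine" framing in the replies.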

Developer Impact

Fast iteration beats perfect theory

The main takeaway shared is that teams prioritize cheap, quick cycles on smaller models using scaling laws and searches rather than relying on any single parameterization trick to handle everything.
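The "cheap sweeps plus extrapolation" workflow described in the replies can be sketched roughly as follows: pick the best learning rate from a small sweep at each token budget, fit a power law to how that optimum moves, and extrapolate to the full budget. All sweep values below are made up for illustration, and the power-law form is an assumption rather than something the thread specifies.

```python
import math


def best_lr_from_sweep(sweep: dict) -> float:
    """Return the learning rate with the lowest final loss in a sweep."""
    return min(sweep, key=sweep.get)


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


# Hypothetical sweeps: token budget -> {learning rate: final loss}.
sweeps = {
    1e9: {1e-3: 3.40, 3e-3: 3.31, 1e-2: 3.52},
    4e9: {1e-3: 3.12, 2e-3: 3.05, 6e-3: 3.20},
    1.6e10: {5e-4: 2.90, 1.3e-3: 2.84, 4e-3: 2.97},
}

budgets = sorted(sweeps)
best_lrs = [best_lr_from_sweep(sweeps[t]) for t in budgets]
a, b = fit_power_law(budgets, best_lrs)

# Extrapolate the fitted trend to a larger, hypothetical token budget.
target_tokens = 1e11
predicted_lr = a * target_tokens ** b
print(f"fitted exponent: {b:.3f}, extrapolated LR: {predicted_lr:.2e}")
```

With only three points the fit is fragile, so the extrapolated value is better read as a starting point for a final narrow sweep than as a definitive setting.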

Sentiment

Positive commenters praise Yacine's questions about hyperparameter methods for LLM training as a great way to learn, while negative commenters respond with hostility and dismissiveness.

Pos

66.7%

Neg

33.3%

3 comments with sentiment.

Cluster Engagement

Posts from X

Elliot Arledge @elliotarledge

@yacineMTB its all vibes man

Jannis @basement_agi

@yacineMTB https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook

Here is all inside

ueaj @_ueaj

@yacineMTB muP gets u 90%, u do a sweep on a small fixed budget to refine, and then sweep on how optimal lr changes with token budget, then extrapolate

(probably)

xlr8harder @xlr8harder

@yacineMTB Scaling laws, grid searches or algorithmic searches, etc. main key is to make it fast and affordable to iterate

Sort of, but it's not enough

Yes

Vincenzo @Kenfus2

@yacineMTB See also MAI-1 thinking; a lot of the values are done in ablation studies for smaller models and then upscaled via a “best guess”.

Olek @oleksoleksoleks

@yacineMTB Hyperparameters you tune on smaller models and work your way up. Lots of experiments upfront that should hold as you ladder up

Mup: yeah it's important in keeping the above stable

LR decay: yup

Vincenzo @Kenfus2

@yacineMTB And yes, LR decay is goat and always works. I mostly spend my time fixing loss spikes, which is always bad data and never something cool or interesting.
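
For readers unfamiliar with the LR decay the replies endorse, one widely used shape is linear warmup followed by cosine decay. The sketch below is illustrative only: the thread does not say which schedule anyone uses, and the function name and parameter values are hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 1000 steps, 100 warmup steps, peak 1e-3, floor 1e-4.
for s in (0, 99, 100, 550, 1000):
    print(s, f"{lr_at_step(s, 1000, 1e-3, warmup_steps=100, min_lr=1e-4):.2e}")
```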
