/goal for AI model training runs is *so* good - it really feels like the future. Very little babysitting now.
Mine: Launch a full training run on 4 nodes.
Continuously record things in an experiment document if it exists. Log hyperparameters, configs, periodic evals, performance insights, training stability analysis, and important changes for future analysis and reproducibility.
Fix any major bugs you encounter while you monitor training, but do not change the fundamental nature of the experiment without asking. If it crashes, resume and keep training.
Resume from the latest reliable checkpoint you have.
Reach <num> steps
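
For anyone wondering what "latest reliable checkpoint" could mean in practice, here is a rough sketch. It is not what the agent actually does. It assumes a made-up layout where each checkpoint is a directory named `step_<n>` and the trainer drops an empty `DONE` file once the write has finished. Both the naming scheme and the `DONE` marker are assumptions for illustration, so adapt them to whatever your training framework writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoints live in directories like "step_1000".
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_reliable_checkpoint(ckpt_root: Path) -> Optional[Path]:
    """Return the highest-step checkpoint directory that has a DONE marker.

    A directory without the (assumed) DONE file is treated as a partial
    write, e.g. from a crash mid-save, and is skipped.
    """
    if not ckpt_root.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for child in ckpt_root.iterdir():
        match = STEP_DIR.match(child.name)
        if match is None or not child.is_dir():
            continue  # not a checkpoint directory
        if not (child / "DONE").is_file():
            continue  # incomplete checkpoint, do not resume from it
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, child
    return best_path
```

If this returns `None`, there is nothing safe to resume from and the run has to start from scratch. Otherwise the returned path is what you would hand to your trainer's resume option.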






