@eliebakouch > each team has their own current data, environments, codebase, infra/kernels
This changes the meaning completely compared to how i first understood OP btw, so i think your result will be a mix of interpretations "with their current stack" vs "from scratch"
exact conditions:
> gpu count is 100k B200 equivalent (like colossus 2 first cluster) but can be the team's favorite accelerators, the cluster is properly setup in terms of inter/intra node bandwidth, nodes get replaced automatically when there is an issue > best model is defined as best on a weighted average of AA + cursorbench + frontiercode + metr ECI with a budget of max($30 per task, 5M output tokens per task), the cost per task is based on anthropic's margin on fable inference (but inference stack can be optimized however they want) > there is 0 bureaucracy, just the 50 "best" people trying to build the best model possible > they all have access to the same previous generation of AI models (let's say gpt 5.5/opus 4.8, otherwise some people will argue that mythos can build itself etc. which is not the point), this includes synthetic data > otherwise each team has their own current data, environments, codebase, infra/kernels etc. (you cannot buy new environments etc.) > the first 3 months are ONLY about scaling the recipe or improving it, you cannot start training the final model or a better model to distill from (this allows chinese labs in a future poll to potentially catch up or not with the US labs). next 3 months are for pre/mid/post training, let's assume no big focus on "safety" etc. (same reason) > the benchmark choice suggests that we're focusing on text capabilities, but if there is some "knowledge transfer" it's within the rules
