How are you setting the score for PROTEIN? It needs a lot of trials because it explicitly isn't trying to find only the single best point. It is finding the Pareto-optimal front of wall-clock compute versus your specified score metric. It has no knowledge that the test is run to a cutoff at a specific score, so it would have to discover that slowly, by pushing up the entire Pareto curve with new experiments.
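To make the frontier idea concrete, here is a minimal sketch of what a Pareto front over (cost, score) pairs looks like, and how a fixed score cutoff picks out just one point on it. This is not PROTEIN's actual code: the function names `pareto_front` and `cheapest_reaching`, and the sample numbers, are made up for illustration. It assumes lower cost is better and higher score is better.

```python
from typing import List, Optional, Tuple

Run = Tuple[float, float]  # (cost, score): hypothetical wall-clock cost and score metric


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run beats on both cost and score."""
    front: List[Run] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, visit the higher score first.
    for cost, score in sorted(runs, key=lambda r: (r[0], -r[1])):
        # A run joins the front only if it beats every cheaper (or equal-cost) run.
        if score > best_score:
            front.append((cost, score))
            best_score = score
    return front


def cheapest_reaching(front: List[Run], target: float) -> Optional[Run]:
    """Return the lowest-cost front point whose score meets the target, if any."""
    for cost, score in front:  # front is already sorted by increasing cost
        if score >= target:
            return (cost, score)
    return None


# Made-up example data, not results from any real tuning run.
runs = [(10.0, 0.50), (20.0, 0.45), (30.0, 0.70),
        (30.0, 0.65), (60.0, 0.72), (90.0, 0.71)]
front = pareto_front(runs)
winner = cheapest_reaching(front, target=0.70)
```

An optimizer that only sees the score metric is rewarded for raising the whole `front`, while a cutoff-style benchmark only cares about the single point returned by `cheapest_reaching`. That mismatch is one plausible reason a frontier-seeking method needs more trials here.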
It appears the HEBO-PROTEIN version you wrote is better at finding high-value points faster, but it loses this frontier-finding property of PROTEIN. PROTEIN is based on CARBS, which in turn is based on HEBO. We intentionally dropped the way CARBS relies on a GP for the frontier, as well as some of the HEBO random sampling behavior, in favor of explicitly modeling the front with sampled runs.
After over 1000 tuning trials using various hyperparameter optimizers, the comparison remains meaningless.
No algorithm gets close to rediscovering the speedrun's hyperparameters.
Tested were:
* HEBO
* PROTEIN
* BOHB
* TPE
* CMA-ES
* HEBO-PROTEIN