Sobering take-away from 1stproof (round 2) https://1stproof.org/. OpenAI's vanilla prompt to 5.5pro https://tinyurl.com/yc8ymuna solves research math 10-40x cheaper than custom prompts from academic teams. We used Gemini Pro. Switching to 5.5pro improves results a lot, but costs rise to the level of other academic pipelines :(
OpenAI Vanilla Prompt Solves Research Math 10-40x Cheaper Than Custom Academic Prompts
During the official evaluation, our pipeline also seems to have hit a timeout error on several questions (a default "Disclaimer" line with a brief report by the orchestrator). This was unfortunate, especially since it happened on several of the easier problems.

@prfsanjeevarora Hard to figure out what to do about the bitter lesson. I think it is going to be hard for researchers to succeed long term with any task that can be framed as a competition, because it will be so natural for the labs to train on it themselves. We need to find a complement to LLMs.