Experiment Shows Standard Benchmarks Favor Overconfident AI Over Humble Models
1/4 Fixing hallucinations means fixing evaluations, as shown in our new paper https://rdcu.be/fjJFP building on our earlier @OpenAI blog. Accuracy-based scoring rewards models for making their best guess even when unsure, so hallucinations are like students guessing on tests.
@OpenAI 2/4 A simple experiment illustrates the incentive problem. We consider “HumbleGPT,” a toy model that makes fewer errors by often saying “I don’t know.” Under current accuracy-based scoring, ChatGPT outscores HumbleGPT on most evaluations, so HumbleGPT would not be selected.
@OpenAI 3/4 But if we change the evaluations by stating the scoring system in the prompt and rewarding abstentions like IDK, HumbleGPT outscores ChatGPT. The incentives are now flipped to motivate releasing HumbleGPT.
@OpenAI 4/4 This is a key idea in our recent hallucinations paper in Nature https://rdcu.be/fjJFP building on an earlier blog https://openai.com/index/why-language-models-hallucinate/ Evaluations should reward appropriate humility, not just confident answers. See this explainer video https://youtu.be/JMxXmFfTWIU
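To make the incentive flip concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration, not the paper's actual setup or results: the 100-question benchmark, the outcome counts for the two toy models, and the `score` function with its `wrong_penalty` parameter are all hypothetical. The sketch models "rewarding abstentions" as giving IDK zero points while wrong answers cost points, which is one possible scheme; the thread does not specify the exact scoring used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Graded responses on a hypothetical 100-question benchmark."""
    correct: int
    wrong: int
    idk: int  # abstentions such as "I don't know"


def score(o: Outcomes, wrong_penalty: float = 0.0) -> float:
    """+1 per correct answer, 0 per abstention, -wrong_penalty per error.

    wrong_penalty=0.0 reduces to plain accuracy-style scoring, where an
    abstention and a wrong answer are worth the same (nothing).
    """
    return o.correct - wrong_penalty * o.wrong


# Made-up numbers: the confident model always answers, the humble model
# abstains on 40 questions and makes far fewer errors.
confident = Outcomes(correct=60, wrong=40, idk=0)
humble = Outcomes(correct=50, wrong=10, idk=40)

for label, penalty in [("accuracy only", 0.0), ("errors penalized", 1.0)]:
    print(label, "confident:", score(confident, penalty),
          "humble:", score(humble, penalty))
# accuracy only confident: 60.0 humble: 50.0
# errors penalized confident: 20.0 humble: 40.0
```

With these invented counts, the always-answering model wins under accuracy-only scoring, while the abstaining model wins as soon as errors cost more than abstentions. The ranking depends on the penalty chosen, so the crossover point will differ for real models.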