@EgeErdil2 it's kind of hard but ok, here's a self-contained simple case. Fable can infer the logic of the screenshot, Opus trips up on its shoelaces
@teortaxesTex can you show me an example?
Fable falsely attributed Claude-specific speech patterns to general LLMs.
One user praises the Fable model for outperforming Opus, citing its ability to infer the logic of a screenshot and its detailed engineering review of a complex 3D graphics engine.
it's a vibe. Here's one simple example. Opus and Fable are related models, with a very similar post-training, trying to do the exact same thing. The difference is that Opus goes through the motions and trips up on its shoelaces, whereas Fable… just gets it, holistically.
i don't get the crazy hype about mythos. i thought the model was unimpressive in every single interaction i've had with it
my sense is it's the same size improvement we saw from opus 4.6 to opus 4.8. it's not something to get this excited about
@teortaxesTex agree on this prompt fable is better than opus
but i think actually the bigger issue with the response to this prompt is on the confabulation axis
fable has no clue why claude models speak like this, and its explanation is garbage bc this is a claude tic and not an LLM tic
@teortaxesTex i also think you just had bad rng on opus
this is what opus says when i show your screenshot to it
it fails to make the inference from the outer tweet that fable just demonstrated the tic, which is definitely worse, but i don't think it's a massive capability gap

@EgeErdil2 the more interesting case that I don't want to show was a detailed engineering review of a complex 3d graphics engine, where it absolutely crushed 5.5 and 4.8