We're releasing the preparedness report for Muse Spark Contemplating, MSL's extreme reasoning model, benchmarking its capabilities and behaviors in biology, cybersecurity, and more!
Meta releases Muse Spark Contemplating safety report, showing the reasoning model trails Claude Opus 4.6 on cybersecurity benchmarks
The model is not yet deployed for general availability.
Users expressed excitement over the solid capability jump for Muse Spark shown in MSL's preparedness report, noting its strong reasoning performance against other major models.
In retrospect it was obvious how they'll fall off after Llama-Guard release. Meta is a very cowardly company after all, the whole doomer gnashing of teeth about their "irresponsibility" in peak Llama era was misguided. Below Opus 4.6, and still no general availability.
We're releasing the preparedness report for Muse Spark Contemplating, MSL's extreme reasoning model, benchmarking its capabilities and behaviors in biology, cybersecurity, and more!

https://ai.meta.com/static-resource/muse-spark-contemplating-safety-and-preparedness-report/

@natliml why not most recent opus or gpt evals

@natliml contemplating 🤣

@natliml Is the model more or less eval aware than Muse Spark normal?

pretty solid capability jump! how do you think about using static evals for a reasoning model? Almost every model from a major lab can crush point-in-time checks (e.g. WMDP). They typically fail later on (maybe turn 12) via continuous reasoning loops. I'd love to see results from dynamic trajectory testing.

@natliml openrouter or it never happened!

@natliml You see this yet, @ml_angelopoulos?