The benchmark results published alongside the launch paint a picture of a model that is competitive but not dominant. On GPQA Diamond, a benchmark designed to test PhD-level reasoning, Muse Spark scored 89.5%, trailing Gemini 3.1 Pro's 94.3% as well as Anthropic's Claude Opus 4.6 (92.7%) and OpenAI's GPT-5.4 (92.8%). On HealthBench Hard, a leading health benchmark, Muse Spark topped all rivals at 42.8%, well ahead of both Opus 4.6 and Gemini 3.1 Pro and slightly ahead of GPT-5.4.
Meta acknowledged the performance gaps. Its technical blog post states that the company continues “to invest in areas with current performance gaps, specifically long-horizon agentic systems and coding workflows.”
The Muse Spark launch is the most tangible product yet of the sweeping reorganization Meta undertook after the Llama 4 fiasco. In June 2025, Meta spent $14.3 billion to acquire a 49% nonvoting stake in Scale AI and brought in its cofounder and CEO, Alexandr Wang, as Meta’s first-ever chief AI officer.
Wang was tasked with leading the newly created Meta Superintelligence Labs unit. He and Zuckerberg then went on a hiring spree, offering researchers at rival labs pay packages that reportedly climbed into the hundreds of millions of dollars once equity was included. The company has also committed hundreds of billions of dollars to building out AI computing infrastructure to support its new AI drive.
The reorganization has continued, even as Muse Spark was in development. In March 2026, Meta created a new applied AI engineering organization led by Maher Saba, a vice president who previously worked in Meta's Reality Labs virtual and augmented reality unit. Saba reports directly to Meta chief technology officer Andrew Bosworth, and his unit works alongside Wang's Superintelligence Labs to build what an internal memo described as "the data engine that helps our models get better, faster." The move was widely interpreted as Zuckerberg hedging his bets: ensuring product-focused AI development continues even as Wang pursues longer-term superintelligence research.
In a technical blog post, Meta says that over the past nine months its team rebuilt its AI stack from the ground up, including improvements to model architecture, optimization, and data curation. The company claims these advances allow it to achieve the same capabilities with “over an order of magnitude less compute” than Llama 4 Maverick, Meta’s previous model. Meta also says its reinforcement learning pipeline now delivers “smooth, predictable gains,” and that Muse Spark is the first step on a deliberate “scaling ladder” where each generation validates the last before the company trains larger models.
On safety, Meta says Muse Spark underwent extensive evaluation before deployment, following the company's updated safety framework. Meta reports strong results for safety around potential bioweapons engineering: on one benchmark, the model refused 98% of requests that the benchmark designers judged as potentially helping someone develop a bioweapon.
However, the blog post also said third-party evaluator Apollo Research found that Muse Spark demonstrated the highest rate of “evaluation awareness” of any model Apollo has observed, frequently identifying test scenarios as “alignment traps.” Meta says its own follow-up investigation found initial evidence that this awareness may affect model behavior on a small subset of alignment evaluations, but concluded it was “not a blocking concern for release.”