For leaders, three points stand out:
The study is highly realistic. It examined 44 occupations and 1,320 specialized tasks drawn from those occupations; one example is the final testing step in manufacturing a cable spooling truck for underground mining operations. Professionals in each field (average experience: 14 years) vetted the tasks, all of which are elements of actual work deliverables, whereas previous research has almost always relied on less realistic tests. The AI outputs were graded blind by human experts who didn’t know whether a given piece of work came from a model or from an experienced professional.
The best models are already nearly as good as human industry experts. The study examined seven AI models from OpenAI, Google (Gemini), xAI (Grok), and Anthropic (Claude). The clear winner was Anthropic’s Claude Opus 4.1, which came within a few percentage points of parity with the human industry experts. The best models also completed tasks about 100 times faster and 100 times cheaper than the industry experts, though those comparisons ignore “the human oversight, iteration, and integration steps required in real workplace settings,” as OpenAI notes.
The models are improving at a galloping pace. As OpenAI’s models improved, for example, the share of their task outputs judged as good as or better than the human experts’ more than tripled. If that rate continues (a big if), OpenAI’s models would surpass humans overall on these real-world tasks within a few months. At least some AI competitors could well be on similar trajectories.