It is true that some benchmarks show the most recent AI models making rapid progress on professional tasks outside of coding. One of the best of these is OpenAI’s GDPval benchmark, which tests frontier models on a range of professional tasks, from complex legal work to manufacturing to health care. The results aren’t in yet for the models OpenAI and Anthropic released last week. But their predecessors, Claude Opus 4.5 and GPT-5.2, achieved parity with human experts across a diverse range of tasks and beat them in many domains.
So wouldn’t this suggest that Shumer is correct? Well, not so fast. It turns out that in many professions, what “good” looks like is highly subjective. Human experts agreed with one another on their assessments of the AI outputs only about 71% of the time. The automated grading system OpenAI used for GDPval showed even more variance, agreeing on assessments only 66% of the time. So those headline numbers about how good AI is at professional tasks could have a wide margin of error.
This variance is one of the things that hold enterprises back from deploying fully automated workflows. It’s not just that the output of the AI model itself might be faulty. It’s that, as the GDPval results suggest, the equivalent of an automated unit test in many professional contexts might produce an erroneous verdict a third of the time. Most companies cannot tolerate poor-quality work being shipped in a third of cases. The risks are simply too great. In some cases, the risk might be merely reputational. In others, it could mean immediate lost revenue. But for many professional tasks, the consequences of a wrong decision can be even more severe: professional sanctions, lawsuits, the loss of licenses, the loss of insurance coverage, and even the risk of physical harm and death, sometimes to large numbers of people.
What’s more, keeping a human in the loop to review automated outputs is itself problematic. Today’s AI models are genuinely getting better, and hallucinations occur less frequently. But that only makes the problem worse: as AI-generated errors become less frequent, human reviewers grow complacent, and the errors that remain become harder to spot. AI is wonderful at being confidently wrong and at presenting results that are impeccable in form but lack substance, which bypasses some of the proxy criteria humans use to calibrate their level of vigilance. And AI models often fail in ways that are alien to how humans fail at the same tasks, which makes guarding against AI-generated errors even more of a challenge.
For all these reasons, until the equivalent of software development’s automated unit tests exists for more professional fields, deploying automated AI workflows in many knowledge work contexts will be too risky for most enterprises. AI will remain an assistant or copilot to human knowledge workers in many cases, rather than fully automating their work.
There are other reasons the kind of automation software developers have observed is unlikely for other categories of knowledge work. In many cases, enterprises cannot give AI agents access to the tools and data systems they would need to perform automated workflows. It is notable that the most enthusiastic boosters of AI automation so far have been developers who work either by themselves or for AI-native startups. These coders are often unencumbered by legacy systems and tech debt, and they typically have few governance and compliance requirements to navigate.
Big organizations, by contrast, often lack ways to link their data sources and software tools together. In other cases, concerns about security risks and governance mean that large enterprises, especially in regulated sectors such as banking, finance, law, and health care, are unwilling to automate without ironclad guarantees that the outcomes will be reliable and that there is a process for monitoring, governing, and auditing them. The systems for doing this are currently primitive. Until they become much more mature and robust, don’t expect enterprises to fully automate the production of business-critical or regulated outputs.
I’m not the only one who found Shumer’s analysis faulty. Gary Marcus, the emeritus professor of cognitive science at New York University who has become one of the leading skeptics of today’s large language models, told me Shumer’s X post was “weaponized hype.” And he pointed to problems even with Shumer’s arguments about automated software development.
“He gives no actual data to support this claim that the latest coding systems can write whole complex apps without making errors,” Marcus said.
He points out that Shumer mischaracterizes a well-known benchmark from the AI evaluation organization METR, which tries to measure AI models’ autonomous coding capabilities and suggests those capabilities are doubling every seven months. Marcus notes that Shumer fails to mention that the benchmark uses two accuracy thresholds, 50% and 80%. But most businesses aren’t interested in a system that fails half the time, or even one that fails one out of every five attempts.
“No AI system can reliably do every five-hour-long task humans can do without error, or even close, but you wouldn’t know that reading Shumer’s blog, which largely ignores all the hallucination and boneheaded errors that are so common in everyday experience,” Marcus says.
He also noted that Shumer didn’t cite recent research from Caltech and Stanford that chronicled a wide range of reasoning errors in advanced AI models. And he pointed out that Shumer has previously been caught making exaggerated claims about the abilities of an AI model he trained. “He likes to sell big. That doesn’t mean we should take him seriously,” Marcus said.
Other critics of Shumer’s blog point out that his economic analysis is ahistorical: every previous technological revolution has, in the long run, created more jobs than it eliminated. Connor Boyack, president of the Libertas Institute, a policy think tank in Utah, wrote an entire counter-post making exactly this argument.
So, yes, AI may be poised to transform work. But the kind of full-task automation that some software developers have started to find possible for some of their tasks? For most knowledge workers, especially those embedded in large organizations, it is going to take much longer to arrive than Shumer implies.