The growing number of American office workers who have experimented with artificial intelligence in their day-to-day work have likely had a few moments of doubt about their long-term job security.
But for all of AI's improvements over the past few years, the technology still clears only low bars on specific workplace tasks, according to recent data published by MIT. Even then, it can still make some big mistakes.
AI is gradually improving at accomplishing a variety of tasks across a number of professions, according to a study of preliminary findings released on Thursday. But in most cases, the performance of currently available models is similar to that of a disenchanted intern—hitting minimum benchmarks but struggling overall to produce quality work without a human hand to refine its output.
MIT researchers tested 41 different LLMs, including versions of Claude, Gemini, and ChatGPT, on more than 11,000 primarily text-based tasks drawn from job roles listed by the Labor Department. The models' outputs were then scored by humans with actual on-the-job experience in those fields. The goal was to see how often an AI replacement could produce output a manager would find acceptable without any human edits, and then to evaluate its quality.
The researchers found AI has grown more reliable over the years across many types of work, but it still falls short whenever the stakes or standards are raised. The study used a 1–9 scale to judge AI performance, on which a 7 was defined as "minimally sufficient," meaning the work is usable as is and requires no edits. As of late 2025, AI models reached that bar in roughly 65% of tasks.
Most important for companies considering replacing portions of their workforce with AI, the MIT data suggests AI struggles with more complicated tasks. Regardless of how much time a model had to complete a task, its probability of earning a 9, or "superior," quality score never exceeded 50%. In other words, when a job requires multiple steps, creativity, or precision, AI replacements are more likely to fail than to succeed.
That pattern shows up in MIT's data, which found average success rates lower for skilled legal and IT roles, while AI models generally had an easier time tackling the text-based tasks associated with construction and maintenance jobs.
The anecdotal evidence and MIT's data suggest AI still requires a human hand to maximize its upside, though the technology is improving rapidly. MIT researchers estimated that the models' success rate on the tasks analyzed increased by up to 11 percentage points per year as more capable models arrived.
By 2029, the authors estimate, most AI models will be able to accomplish between 80% and 95% of text-based tasks at the minimally sufficient benchmark.
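As a rough back-of-the-envelope check on that projection (this is an illustrative sketch, not the researchers' actual forecasting model), a simple linear extrapolation from the ~65% figure in late 2025 lands in the reported 80–95% band only at annual gains well below the 11-point ceiling; the specific rates assumed below are mine, not MIT's:

```python
# Illustrative linear extrapolation of AI task-success rates.
# Assumptions (not from the study): a 65% baseline in 2025 and a
# constant annual gain in percentage points, capped at 100%.

def project_success_rate(base_pct, annual_gain_pp, years, cap_pct=100.0):
    """Linearly extrapolate a success rate, capped at cap_pct."""
    return min(base_pct + annual_gain_pp * years, cap_pct)

base_2025 = 65.0          # ~65% of tasks hit the "minimally sufficient" bar
years_out = 2029 - 2025   # four years of improvement

# A modest ~4-point annual gain reaches the low end of the 80-95% range;
# sustaining the reported ceiling of 11 points per year would overshoot
# 100%, which hints the study's projected gains taper off.
low = project_success_rate(base_2025, 4.0, years_out)    # 81.0
high = project_success_rate(base_2025, 11.0, years_out)  # capped at 100.0
print(f"2029 projection: {low:.0f}% to {high:.0f}%")
```

The overshoot at 11 points per year is the interesting part: the authors' 80–95% range is only consistent with improvement slowing as models approach the benchmark, not with straight-line growth.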
Whether AI will ever be able to scale toward excellent or even perfect performance remains unknown.
“Widespread automation, particularly in domains with low tolerance for errors, may still be some distance away,” the researchers wrote.
AI may be able to do the bare-minimum work of drafting, emailing, and number-crunching, but it has yet to reach the superior-performance territory where humans still stand out.