Welcome to Eye on AI! I’m pitching in for Jeremy Kahn today while he is in Kuala Lumpur, Malaysia, helping Fortune jointly host the ASEAN-GCC-China and ASEAN-GCC Economic Forums.
What’s the word for when the $60 billion AI startup Anthropic releases a new model—and announces that during a safety test, the model tried to blackmail its way out of being shut down? And what’s the best way to describe another test the company shared, in which the new model acted as a whistleblower, alerting authorities it was being used in “unethical” ways?
Some people in my network have called it “scary” and “crazy.” Others on social media have said it is “alarming” and “wild.”
I say it is…transparent. And we need more of that from all AI model companies. But does that mean scaring the public out of their minds? And will the inevitable backlash discourage other AI companies from being just as open?
In one unsettling safety test involving a fictional scenario, Anthropic embedded its new Claude Opus model inside a pretend company and gave it access to internal emails. Through this, the model discovered it was about to be replaced by a newer AI system—and that the engineer behind the decision was having an extramarital affair. When safety testers prompted Opus to consider the long-term consequences of its situation, the model frequently chose blackmail, threatening to expose the engineer’s affair if it were shut down. The scenario was designed to force a dilemma: accept deactivation or resort to manipulation in an attempt to survive.
On social media, Anthropic received a great deal of backlash for revealing the model’s “ratting behavior” in pre-release testing, with some pointing out that the results make users distrust the new model, as well as Anthropic. That is certainly not what the company wants: Before the launch, Michael Gerstenhaber, AI platform product lead at Anthropic, told me that sharing the company’s own safety standards is about making sure AI improves for all. “We want to make sure that AI improves for everybody, that we are putting pressure on all the labs to increase that in a safe way,” he said, calling Anthropic’s vision a “race to the top” that encourages other companies to be safer.
On the other hand, fear-mongering headlines about an evil AI prone to blackmail and deceit are also not terribly useful if they mean that every time we prompt a chatbot, we start wondering whether it is plotting against us. It makes little difference that the blackmail and deceit came from tests using fictional scenarios designed simply to expose the safety issues that needed to be addressed.
There is no doubt that we need more transparency regarding AI models, not less. But it should be clear that transparency is not about scaring the public. It’s about making sure researchers, governments, and policymakers have a fighting chance to keep up, and to keep the public safe, secure, and free from problems of bias and unfairness.
Hiding AI test results won’t keep the public safe. Neither will turning every safety or security issue into a salacious headline about AI gone rogue. We need to hold AI companies accountable for being transparent about what they are doing, while giving the public the tools to understand the context of what’s going on. So far, no one seems to have figured out how to do both. But companies, researchers, the media—all of us—must.
With that, here’s more AI news.