
The startup Anthropic is one of the most talked-about AI companies in the world; its recent valuation was $61.5 billion. In a recent essay, its CEO Dario Amodei wrote, “People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.” He noted that this gap increases the risk of unintended and potentially harmful outcomes, and argued the industry should turn its attention to so-called “interpretability” before AI advances to the point where understanding it becomes an impossible feat.
“These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work,” Amodei wrote in the essay.
Amodei said that traditional software is explicitly programmed to perform specific tasks, whereas no one truly understands why AI systems make the decisions they do when generating an output. Recently, OpenAI admitted that “more research is needed” to understand why its o3 and o4-mini models are hallucinating more than previous iterations.
SEE: Anthropic’s Generative AI Research Reveals More About How LLMs Affect Security and Bias
“It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth,” Amodei wrote. “But the exact structure which emerges is unpredictable and difficult to understand or explain.”
This, Amodei went on, is the root of all concerns about AI safety. If we understood what a model was doing, we could anticipate harmful behaviours and confidently design systems to prevent them, such as systematically blocking jailbreaks that would allow users to access information about biological or cyber weapons. It would also go a long way toward preventing AI from deceiving humans or becoming uncontrollably powerful.
This isn’t the first time the startup’s CEO has been vocal about the general lack of AI understanding. Speaking in November, he said that while “people laugh today when chatbots say something a little unpredictable,” it highlights the importance of controlling AI before it develops more nefarious capabilities.
Anthropic has been working on model interpretability for some time
Amodei said that Anthropic and other industry players have been working on opening AI’s black box for several years. The ultimate goal is to create “the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model, identifying issues like a model’s tendency to lie and flaws in jailbreaks.”
Early on in the research, Amodei and others identified neurons inside the models that could be directly mapped to single, human-understandable concepts. However, the vast majority were “an incoherent pastiche of many different words and concepts,” blocking progress.
“The model uses superposition because this allows it to express more concepts than it has neurons, enabling it to learn more,” Amodei wrote. Eventually, researchers discovered they could use techniques from signal processing to map certain combinations of neurons to human-understandable concepts.
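Anthropic’s published interpretability research describes this decomposition in terms of sparse autoencoders, a form of dictionary learning. The sketch below is a rough, hypothetical illustration of the idea: a model’s internal activations are expanded into a much larger set of mostly-zero “feature” activations, with a sparsity penalty encouraging each feature to align with a single concept. The dimensions and coefficients are illustrative, not Anthropic’s actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder that expands model activations into a larger
    dictionary of sparsely active features (dimensions are illustrative)."""
    def __init__(self, activation_dim: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, n_features)
        self.decoder = nn.Linear(n_features, activation_dim)

    def forward(self, activations: torch.Tensor):
        # Non-negative, mostly-zero feature activations over the dictionary
        features = torch.relu(self.encoder(activations))
        # Reconstruct the original activations from those features
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the model's
    # activations; the L1 penalty pushes most features to zero, so each
    # active feature tends to line up with a single, human-readable concept.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity
```

In practice, the interesting output is the feature activations themselves: inspecting which inputs most strongly activate a given feature is how researchers attach human-readable labels to it.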
SEE: UK’s International AI Safety Report Shows Progress is at Breakneck Speed
These concepts were dubbed “features,” and Amodei said they can have their importance increased or decreased inside a neural network, giving AI researchers a degree of control. About 30 million features have been mapped so far, though Amodei said this likely represents just a fraction of the number found inside even a small model.
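Dialling a feature’s influence up or down amounts to nudging the model’s internal activations along that feature’s direction. The snippet below is a minimal, hypothetical illustration of that idea; `feature_direction` stands in for a learned feature vector (such as a sparse autoencoder decoder column), and the scale values are arbitrary.

```python
import torch

def steer(activations: torch.Tensor,
          feature_direction: torch.Tensor,
          scale: float) -> torch.Tensor:
    """Shift activations along one feature's (unit-normalised) direction.
    A positive scale amplifies the concept the feature represents;
    a negative scale suppresses it."""
    direction = feature_direction / feature_direction.norm()
    return activations + scale * direction

# Hypothetical usage during a forward pass:
# activations = steer(activations, feature_direction, scale=5.0)   # amplify
# activations = steer(activations, feature_direction, scale=-5.0)  # suppress
```

Anthropic’s published steering experiments work along these lines, clamping or scaling specific feature activations to see how the model’s behaviour changes.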
Now, researchers are tracking and manipulating groups of features called “circuits,” which provide deeper insight into how a model turns input words into concepts and how those concepts lead to its output. Amodei predicts the “MRI for AI” will be here in five to 10 years.
“On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time,” he wrote.
Three steps to interpretability
The Anthropic CEO outlined three things that can be done to achieve interpretability sooner:
- Researchers need to work directly on model interpretability. He urged the likes of Google DeepMind and OpenAI to allocate more resources to the effort, and even encouraged neuroscientists to transition into AI.
- Governments should require companies to disclose how they’re using interpretability in AI testing. Amodei is clear that he does not want regulation that stifles progress, but argued that such a requirement would boost shared knowledge and incentivise companies to behave responsibly.
- Governments should use export controls to help democracies lead in AI and “spend” that lead on safeguarding interpretability. Amodei trusts that democratic nations would accept slower progress to ensure safety, while autocracies such as China may not.