
Anthropic Accidentally Gives the World a Peek Into Its Model’s ‘Soul’

Artificial intelligence models don’t have souls, but one of them does apparently have a “soul” document. A person named Richard Weiss was able to get Anthropic’s latest large language model, Claude Opus 4.5, to produce a document referred to as a “Soul overview,” which was seemingly used to shape how the model interacts with users and presents its “personality.” Amanda Askell, a philosopher who works on Anthropic’s technical staff, confirmed that the overview produced by Claude is “based on a real document” used to train the model.

In a post on Less Wrong, Weiss said that he prompted Claude for its system message, the set of instructions given to the model by the people who trained it that tells the LLM how to interact with users. In response, Claude highlighted several supposed documents it had been given, including one called “soul_overview.” Weiss asked the chatbot to produce that document specifically, and Claude spit out an 11,000-word guide to how the LLM should carry itself.

The document includes numerous references to safety, attempting to imbue the chatbot with guardrails to keep it from producing potentially dangerous or harmful outputs. The LLM is told by the document that “being truly helpful to humans is one of the most important things Claude can do for both Anthropic and for the world,” and forbidden from doing anything that would require it to “perform actions that cross Anthropic’s ethical bright lines.”

Weiss has apparently made a habit of searching for these types of insights into how LLMs are trained and operate, and said on Less Wrong that it’s not uncommon for models to hallucinate documents when asked to produce system messages. (Seems not great that the AI can make up what it thinks it was trained on, though who knows if its behavior is in any way affected by a made-up document generated in response to user prompting.) But the “soul overview” seemed legitimate to him: he claims that he prompted the chatbot to reproduce the document 10 times, and it spit out the exact same text in each and every instance.

Users on Reddit were also able to get Claude to produce snippets of the same document with identical text, suggesting that the LLM was pulling from something accessible internally in its training data rather than improvising.

Turns out his instincts may have been right. On X, Askell confirmed that the output from Claude is based on a document that was used during the model’s supervised learning period. “It’s something I’ve been working on for a while, but it’s still being iterated on and we intend to release the full version and more details soon,” she wrote. Askell added, “The model extractions aren’t always completely accurate, but most are pretty faithful to the underlying document. It became endearingly known as the ‘soul doc’ internally, which Claude clearly picked up on, but that’s not a reflection of what we’ll call it.”

Gizmodo reached out to Anthropic for comment on the document and its reproduction via Claude, but did not receive a response at the time of publication.

The so-called soul of Claude may just be some guidance to keep the chatbot from going off the rails, but it’s notable that a user was able to get the model to access and reproduce that document, and that we actually get to see it. So little of the sausage-making of AI models has been made public that any glimpse into the black box comes as a surprise, even if the guidelines themselves seem pretty straightforward.

Original Source: https://gizmodo.com/anthropic-accidentally-gives-the-world-a-peek-into-its-models-soul-2000694624

Disclaimer: This article is a reblogged/syndicated piece from a third-party news source. Content is provided for informational purposes only. For the most up-to-date and complete information, please visit the original source. Digital Ground Media does not claim ownership of third-party content and is not responsible for its accuracy or completeness.
