Anthropic recently unveiled the underlying mechanisms powering its AI model, Claude, offering insights into how the system plans, reasons, and composes its responses.
Through the publication of two research papers, Anthropic introduced techniques known as circuit tracing and attribution graphs, designed to illuminate the internal operations of the model. The company emphasized that Claude does not merely mimic human linguistic patterns but actively engages in a form of "thinking."
For instance, when tasked with composing poetry, Claude plans the rhyme scheme before writing the lines; when answering geography questions, it first identifies the relevant state and only then names its capital. This suggests that Claude builds a coherent plan for its answers and reasons its way toward them, rather than performing the simplistic, step-by-step matching typical of traditional search engines.
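To make the idea concrete, the sketch below shows one way an attribution graph could be represented as a simple data structure. The features, edges, and scores are invented for illustration (loosely following the state-and-capital example above); this is not Anthropic's actual tooling or method.

```python
# Illustrative sketch only: a toy attribution graph, not Anthropic's implementation.
# Nodes are hypothetical interpretable features; directed edges carry attribution
# scores estimating how strongly one feature's activation fed into another's.

from dataclasses import dataclass, field


@dataclass
class AttributionGraph:
    # adjacency map: source feature -> {target feature: attribution score}
    edges: dict[str, dict[str, float]] = field(default_factory=dict)

    def add_edge(self, source: str, target: str, score: float) -> None:
        self.edges.setdefault(source, {})[target] = score

    def top_contributors(self, target: str, k: int = 3) -> list[tuple[str, float]]:
        # Rank upstream features by how strongly they contributed to `target`.
        contributions = [
            (src, targets[target])
            for src, targets in self.edges.items()
            if target in targets
        ]
        return sorted(contributions, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up trace for a question like "What is the capital of the state containing
# Dallas?": an intermediate "state = Texas" feature sits between prompt and answer.
graph = AttributionGraph()
graph.add_edge("mentions Dallas", "state = Texas", 0.82)
graph.add_edge("asks for a capital", "say a capital city", 0.64)
graph.add_edge("state = Texas", "output: Austin", 0.91)
graph.add_edge("say a capital city", "output: Austin", 0.57)

print(graph.top_contributors("output: Austin"))
```

Reading off the strongest upstream edges into the final answer is what surfaces the intermediate "identify the state first" step described above.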
The research also examines how Claude handles multilingual queries. Rather than treating each language separately, the model maps inputs into a shared abstract "language" of concepts. For example, when prompted with words meaning "small" in different languages, Claude first maps them to this universal abstraction and then selects the appropriate word in the target language. This lets the model interpret queries accurately across languages and handle cross-lingual tasks more efficiently.
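As a toy illustration of that shared representation, the snippet below uses invented activation vectors rather than anything extracted from Claude: words meaning "small" in several languages sit close to a single hypothetical "smallness" direction, while an unrelated word does not.

```python
# Toy illustration of a shared cross-lingual concept; all vectors are made up.
# The only point: words meaning "small" in different languages can land near the
# same abstract direction in a model's activation space.

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical feature activations for words in three languages, plus a contrast case.
features = {
    '"small" (English)':  np.array([0.90, 0.10, 0.05]),
    '"petit" (French)':   np.array([0.85, 0.15, 0.10]),
    '"小さい" (Japanese)':  np.array([0.88, 0.12, 0.07]),
    '"large" (English)':  np.array([0.10, 0.90, 0.20]),
}

shared_small_direction = np.array([1.0, 0.0, 0.0])  # hypothetical "smallness" axis

for word, vec in features.items():
    print(f'{word}: similarity to shared "small" concept = '
          f'{cosine(vec, shared_small_direction):.2f}')
```

The first three entries score high against the shared direction while the contrast word does not, which is the shape of the cross-lingual behavior described above.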
Anthropic also addressed the common phenomenon of AI "hallucinations." When the model detects familiar vocabulary in a query, it may proceed to generate a response even if it lacks true understanding. If the system mistakenly assumes it possesses knowledge about a topic, it may fabricate an answer, leading to inaccuracies.
Anthropic therefore concludes that the model's tendency to confidently assert incorrect information stems from this flawed inference process. By uncovering the specific factors that trigger such errors, researchers may eventually curb broader failures and make AI systems more reliable.
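That failure mode can be sketched as a gating problem. The toy below uses invented names, "facts," and a made-up familiarity heuristic, and is not Claude's actual mechanism: if the decision to answer hinges on whether a subject merely looks familiar, rather than on whether facts about it are actually stored, a familiar-sounding but unknown name yields a confident fabrication instead of a refusal.

```python
# Toy sketch of the hallucination failure mode described above; the names,
# heuristic, and stored "facts" are all invented for illustration.

KNOWN_FACTS = {"Michael Jordan": "played basketball for the Chicago Bulls"}


def looks_familiar(name: str) -> bool:
    # Hypothetical heuristic: a name sharing a first name with a known entity
    # "feels" familiar, even when no facts about it are actually stored.
    return any(name.split()[0] == known.split()[0] for known in KNOWN_FACTS)


def answer(subject: str) -> str:
    if not looks_familiar(subject):
        return "I don't know."  # default behavior: decline to answer
    fact = KNOWN_FACTS.get(subject)
    if fact is None:
        # The gate said "answer", but no real knowledge exists: fabrication.
        return f"{subject} is a well-known figure who..."  # hallucinated
    return f"{subject} {fact}."


print(answer("Michael Jordan"))  # grounded answer
print(answer("Michael Batkin"))  # familiar-sounding but unknown: confident fabrication
```

In this caricature, fixing the gate (checking for stored knowledge rather than surface familiarity) removes the confident wrong answer, which mirrors the kind of targeted fix the research suggests may become possible.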