• Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models

  • 2025/04/03
  • Duration: 30 min
  • Podcast


  • Summary

  • Summary of https://transformer-circuits.pub/2025/attribution-graphs/methods.html

    The paper introduces a novel methodology called "circuit tracing" for understanding the inner workings of language models. The authors developed a technique using "replacement models" with interpretable components to map the computational steps of a language model as "attribution graphs." These graphs visually represent how different computational units, or "features," interact to process information and generate output for specific prompts.

    The research details the construction, visualization, and validation of these graphs using an 18-layer model and offers a preview of their application to a more advanced model, Claude 3.5 Haiku. The study explores the interpretability and sufficiency of this method through various evaluations, including case studies on acronym generation and addition.

    While acknowledging limitations like missing attention circuits and reconstruction errors, the authors propose circuit tracing as a significant step towards achieving mechanistic interpretability in large language models.

    • This paper introduces a methodology for revealing computational graphs in language models: Cross-Layer Transcoders (CLTs) extract interpretable features, and attribution graphs depict how those features interact to produce the model's output for a specific prompt (a sketch of the CLT architecture appears after this list). The approach aims to bridge the gap between raw neurons and high-level model behaviors by identifying meaningful building blocks and their interactions.

    • The methodology involves several key steps: training CLTs to reconstruct MLP outputs, then building attribution graphs whose nodes represent active features, tokens, error terms, and logits, and whose edges represent linear effects between those nodes. A crucial aspect is achieving linearity in feature interactions by freezing attention patterns and normalization denominators, which reduces each edge to a product of known quantities (see the second sketch after this list). Attribution graphs then allow the study of how information flows from the input prompt through intermediate features to the final output token.

    • The paper demonstrates the application of this methodology through several case studies, including acronym generation, factual recall, and small number addition. These case studies illustrate how attribution graphs can reveal the specific features and pathways involved in different cognitive tasks performed by language models. For instance, in the addition case study, the method uncovers a hierarchy of heuristic features that collaboratively solve the task.

    • Despite these advances, the methodology has several significant limitations. Chief among them, it does not explain how attention patterns are formed or how they mediate feature interactions (QK circuits), since the analysis is conducted with attention patterns held fixed. Other limitations include reconstruction errors (model computation left unexplained), the unexamined role of inactive features and inhibitory circuits, the complexity of the resulting graphs, and the difficulty of understanding global circuits that generalize across many prompts.

    • The paper also explores global weights between features, which are prompt-independent and aim to capture the general algorithms used by the replacement model (the final sketch after this list illustrates one reading of these virtual weights). Interpreting global weights is challenging, however, due to interference (spurious connections) and the lack of accounting for attention-mediated interactions. While attribution graphs provide insight into specific prompts, future work aims to improve the understanding of global mechanisms and address the current limitations, potentially through advances in dictionary learning and in the handling of attention mechanisms.
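A minimal sketch of the cross-layer architecture described in the first two points, assuming PyTorch; the class name, shapes, and plain-ReLU nonlinearity are stand-ins of mine, not the paper's code (the paper trains a sparse, JumpReLU-style nonlinearity):

```python
import torch
import torch.nn as nn

class CrossLayerTranscoderLayer(nn.Module):
    """Hypothetical sketch of one layer's CLT features.

    The encoder reads the residual stream at `layer`; one decoder per
    downstream layer writes this layer's contribution to the reconstructed
    MLP outputs of layers `layer` .. `n_layers - 1`.
    """

    def __init__(self, d_model: int, n_features: int, layer: int, n_layers: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model, bias=False)
             for _ in range(n_layers - layer)]
        )

    def forward(self, resid: torch.Tensor):
        # Feature activations read from the residual stream at this layer.
        acts = torch.relu(self.encoder(resid))
        # Linear contributions to the MLP outputs of this and all later
        # layers; training minimizes reconstruction error plus a sparsity
        # penalty, and an error node absorbs whatever the features miss.
        contributions = [dec(acts) for dec in self.decoders]
        return acts, contributions
```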
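Because attention patterns and normalization denominators are frozen, the direct effect of one feature on another is linear, so each attribution-graph edge is a product of known quantities. A hedged sketch of the simplest case, two features at the same token position with no attention path in between (the function name and dimensions are assumptions, not the paper's API):

```python
import torch

def direct_edge_weight(act_src: torch.Tensor,
                       w_dec_src: torch.Tensor,
                       w_enc_tgt: torch.Tensor) -> torch.Tensor:
    """Attribution edge from an active source feature to a target feature.

    With attention and normalization frozen, the source contributes
    act_src * w_dec_src to the residual stream, and the target's
    pre-activation responds linearly through its encoder vector.
    """
    return act_src * torch.dot(w_dec_src, w_enc_tgt)

# Toy usage with hypothetical dimensions:
a_s = torch.tensor(1.7)      # source feature's activation on this prompt
w_dec = torch.randn(512)     # source feature's decoder direction
w_enc = torch.randn(512)     # target feature's encoder direction
print(direct_edge_weight(a_s, w_dec, w_enc))
```

The linearity is exactly what the freezing buys: without fixed attention patterns and normalization denominators, the source feature's downstream effect would be nonlinear and no single edge weight could summarize it.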
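For the prompt-independent global weights in the last point, one plausible reading (my assumption, not the paper's exact definition) is a virtual-weight matrix composing source decoder directions with target encoder directions; interference then appears as large entries between features that never co-activate on real prompts:

```python
import torch

def global_weight_matrix(w_dec_src: torch.Tensor,
                         w_enc_tgt: torch.Tensor) -> torch.Tensor:
    """Prompt-independent virtual weights between two feature families.

    w_dec_src: (n_src_features, d_model) decoder directions
    w_enc_tgt: (n_tgt_features, d_model) encoder directions
    Entry [i, j] is the direct linear effect of source feature i on target
    feature j before any prompt-specific activation is applied. Spurious
    large entries between features that never co-occur are the
    "interference" that makes these weights hard to read, and
    attention-mediated interactions are not captured at all.
    """
    return w_dec_src @ w_enc_tgt.T  # (n_src_features, n_tgt_features)
```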
