Mechanistic interpretability research seeks to explain the internal mechanisms that operate within AI models as they perform different tasks. A widely used strategy for uncovering and analyzing these mechanisms is to identify circuits within the model. A circuit is a sub-graph of the model's computational graph that is believed to be critical for executing a specific task.
In recent years, several methods have been proposed for automatically identifying circuits inside language models, but most overlook the token position dimension. As a result, these approaches either produce circuits whose nodes lack position specificity or are limited to datasets with uniform example lengths. In this work, we propose two key improvements. First, we extend edge attribution patching, a popular gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, a structure that defines how to split the examples in a dataset into spans based on shared structural features. By defining this high-level shared structure, we are able to discover circuits that differentiate between token positions even in datasets whose examples are not fully aligned. Additionally, we demonstrate that schema definition and span labeling can be automated using large language models (LLMs). Ultimately, our approach enables the fully automated discovery of position-sensitive circuits, yielding smaller circuits than those found by previous methods.
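To make the two ideas concrete, the following is a minimal sketch, not the paper's implementation: it shows how an edge attribution patching score could be kept per token position rather than summed over the sequence, and how per-position scores could be aggregated into schema-defined spans so that examples of different lengths remain comparable. The names `positional_eap_scores`, `Span`, and `aggregate_by_schema` are illustrative assumptions introduced here, not identifiers from the paper.

```python
from dataclasses import dataclass
import torch


def positional_eap_scores(
    clean_act: torch.Tensor,       # [batch, seq_len, d_model] upstream node output, clean run
    corrupt_act: torch.Tensor,     # [batch, seq_len, d_model] upstream node output, corrupted run
    downstream_grad: torch.Tensor, # [batch, seq_len, d_model] d(metric)/d(downstream input), clean run
) -> torch.Tensor:
    """Return one edge attribution score per token position.

    Standard edge attribution patching approximates the effect of patching an
    edge with a first-order term (corrupt - clean) . grad and sums it over all
    positions; keeping the position axis yields a position-sensitive score.
    """
    per_position = ((corrupt_act - clean_act) * downstream_grad).sum(dim=-1)  # [batch, seq_len]
    return per_position.mean(dim=0)  # average over the batch -> [seq_len]


@dataclass
class Span:
    """A schema-defined span: a labeled, contiguous range of token positions."""
    name: str   # role shared across examples, e.g. "subject" or "relation"
    start: int  # first token index covered by the span (inclusive)
    end: int    # last token index covered by the span (inclusive)


def aggregate_by_schema(position_scores: torch.Tensor, spans: list[Span]) -> dict[str, float]:
    """Collapse per-token scores into one score per schema span."""
    return {s.name: position_scores[s.start : s.end + 1].mean().item() for s in spans}


# Example: a 6-token prompt labeled by a hypothetical three-span schema.
spans = [Span("subject", 0, 1), Span("relation", 2, 4), Span("object", 5, 5)]
scores = torch.randn(6)                    # stand-in for per-position edge scores
print(aggregate_by_schema(scores, spans))  # one score per schema span
```

Under this view, span labeling is what removes the uniform-length requirement: two examples may place the same schema span at different absolute positions, but their scores are compared span by span rather than position by position.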