SAEfarer: exploring text classification models
Leveraging SAEs to analyze the behavior of text classification LMs. (HILDA)
As language models (LMs) rise in prominence, there is growing interest in making them more transparent so that their internal behavior can be better understood and steered. Recent work on interpreting LMs has focused on using sparse autoencoders (SAEs) to decompose the neuron activations at a given layer of the LM into human-understandable features, where each feature represents a single concept the model has learned. In this paper, we present initial work on leveraging SAEs to analyze the behavior of text classification LMs. We introduce techniques for exploring the relationships between the features extracted by the SAE and the model’s predictions and errors. We integrate these techniques into SAEfarer, an open-source prototype visual analytics tool for analyzing text classification models.
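To make the decomposition concrete, the sketch below illustrates the general idea: a standard ReLU sparse autoencoder maps layer activations to sparse features, and feature firing rates are compared across correctly and incorrectly classified examples. This is a minimal sketch under assumed conventions; the `SparseAutoencoder` class, dimensions, and error-mask comparison are illustrative stand-ins, not SAEfarer's actual implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes d_model-dim activations into n_features sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only a sparse set of positively firing features.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy usage: stand-ins for activations captured from one LM layer.
sae = SparseAutoencoder(d_model=768, n_features=16_384)
acts = torch.randn(32, 768)                  # placeholder layer activations
features, recon = sae(acts)

# Relate features to classifier behavior: which features fire more often
# on misclassified examples than on correctly classified ones?
errors = torch.randint(0, 2, (32,)).bool()   # placeholder error mask
err_rate = (features[errors] > 0).float().mean(0)
ok_rate = (features[~errors] > 0).float().mean(0)
top = torch.topk(err_rate - ok_rate, k=5).indices
print("features over-represented in errors:", top.tolist())
```

In practice the SAE would be trained to reconstruct real activations under a sparsity penalty, and the feature-versus-error comparison would run over a full evaluation set rather than a random batch.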