Omer Antverg, M.Sc. Thesis Seminar
Wednesday, 9.3.2022, 11:00
Advisor: Dr. Yonatan Belinkov
Neural language models have advanced considerably in recent years, achieving ever greater success on numerous language tasks. These models encode words as hidden vector representations before using those representations for the task at hand. Their success has sparked interest in their interpretability: understanding how they work and what is encoded within these representations. While many studies have shown that linguistic information is encoded in hidden word representations, few have studied the individual neurons of these representations to show how, and in which neurons, it is encoded. Among these studies, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the resulting ranking using the same probe that produced it. We show two pitfalls in this methodology: 1. It conflates two distinct factors: probe quality and ranking quality. We follow this methodology, show where the conflation occurs, and explain why it may mislead us. 2. It focuses on encoded information rather than information that the model actually uses. We perform neuron-level causal analysis of the model and show that encoded information and used information are not the same. We compare two recent ranking methods, along with a simple one we introduce, and evaluate them with respect to both of these aspects.
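The probe-based ranking methodology described above can be illustrated with a minimal sketch. This is not the thesis's method, only a hypothetical toy setup on synthetic data: a linear probe is trained on "neuron" activations, neurons are ranked by the magnitude of the probe's weights, and the top-ranked neurons are then scored by a probe retrained on them alone (evaluating with the same probe that produced the ranking is exactly the conflation of probe quality and ranking quality that the talk discusses).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_neurons = 500, 64

# Synthetic "hidden representations": the attribute is planted in a few neurons.
X = rng.normal(size=(n_samples, n_neurons))
relevant = [3, 17, 42]  # hypothetical ground-truth relevant neurons
y = (X[:, relevant].sum(axis=1) > 0).astype(int)

# Probe-based ranking: train a linear probe on all neurons,
# then rank neurons by the absolute value of their probe weights.
probe = LogisticRegression(max_iter=1000).fit(X, y)
ranking = np.argsort(-np.abs(probe.coef_[0]))

# Evaluating the top-k neurons with a probe retrained on them alone;
# note this still measures what a probe can *decode*, not what the
# model *uses* -- the second pitfall requires causal interventions.
k = 3
top_k = ranking[:k]
retrained = LogisticRegression(max_iter=1000).fit(X[:, top_k], y)
acc = retrained.score(X[:, top_k], y)
```

Because the attribute here is a linear function of the planted neurons, the probe's weight-based ranking recovers them; in a real model there is no such ground truth, which is why the evaluation protocol matters.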