A 20-hour course held within the national PhD program in AI and Society. The course covers recent topics in the mechanistic interpretability of Large Language Models (LLMs), including:
Probing
Outliers in LLMs
Activation Steering
Sparse Autoencoders
The course includes guest lectures by Fabio Brau, Gabriele Sarti, and William Rudman.