Cross-Layer Transcoders replace ViT MLPs without hurting zero-shot accuracy, and reveal which layers actually build the final representation.
Functional replacement.
CLTs can replace MLP blocks, especially in later layers for patch tokens, or across all layers for the [CLS] token, while preserving and in some cases even improving zero-shot classification performance.
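To make "replacing MLP blocks" concrete, here is a minimal sketch of the general cross-layer transcoder form: sparse features are encoded from the MLP input at each layer, and the MLP output at layer t is reconstructed from the features of all layers up to and including t. The toy sizes, the plain ReLU activation, the random weights, and the names `W_enc`, `W_dec`, and `clt_reconstruct` are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 4, 8, 16  # toy sizes (assumed), far smaller than a real ViT

# One encoder per layer; one decoder per (source, target) pair with target >= source.
W_enc = [rng.normal(scale=0.1, size=(n_feat, d_model)) for _ in range(n_layers)]
W_dec = {
    (s, t): rng.normal(scale=0.1, size=(d_model, n_feat))
    for s in range(n_layers)
    for t in range(s, n_layers)
}

def clt_reconstruct(mlp_inputs):
    """Approximate every layer's MLP output for a single token.

    mlp_inputs[l] is the vector fed to the MLP at layer l, shape (d_model,).
    """
    # Features at layer l are read only from that layer's MLP input.
    feats = [np.maximum(W_enc[l] @ x, 0.0) for l, x in enumerate(mlp_inputs)]
    # The output at layer t sums contributions from features at layers 0..t.
    return [
        sum(W_dec[(s, t)] @ feats[s] for s in range(t + 1))
        for t in range(n_layers)
    ]

xs = [rng.normal(size=d_model) for _ in range(n_layers)]
mlp_hat = clt_reconstruct(xs)  # one reconstructed MLP output per layer
```

In an actual replacement the reconstructed outputs would be written back into the residual stream, so later MLP inputs would depend on earlier reconstructions. The sketch takes the inputs as given to keep the cross-layer structure visible.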
Patch granularity matters.
Reconstruction is more faithful for ViT-B/16 than for ViT-B/32. Smaller patches distribute information across more tokens, yielding simpler per-token activations that are easier to approximate.
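The difference in granularity is easy to quantify. The short calculation below assumes a 224x224 RGB input, which is a common setting but is not stated here, and counts patch tokens only, excluding [CLS]. The helper `patch_stats` is hypothetical.

```python
def patch_stats(image_size, patch_size, channels=3):
    """Count patch tokens and raw pixel values per patch for a square image."""
    per_side = image_size // patch_size
    return {
        "tokens": per_side ** 2,
        "values_per_patch": patch_size * patch_size * channels,
    }

b16 = patch_stats(224, 16)  # 196 tokens, 768 pixel values per patch
b32 = patch_stats(224, 32)  # 49 tokens, 3072 pixel values per patch
```

Under these assumptions ViT-B/16 has four times as many patch tokens as ViT-B/32, and each token summarizes a quarter as many pixels.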
[CLS] integrates across depth; patches remain local.
Patch tokens show diagonal attribution, with each layer's output credited mainly to that same layer, while [CLS] draws credit from many preceding layers.
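One way to read this contrast is as a property of a layer-by-layer attribution matrix, where entry (t, s) is the credit that target layer t assigns to source layer s. The sketch below measures how much of each row's credit sits on the diagonal. Both matrices contain made-up numbers chosen only to illustrate the two patterns, and `diagonal_share` is a hypothetical helper, not a metric from the paper.

```python
def diagonal_share(attr):
    """Fraction of each target layer's credit that comes from the same layer.

    attr[t][s] is the credit target layer t assigns to source layer s (s <= t).
    """
    shares = []
    for t, row in enumerate(attr):
        total = sum(row[: t + 1])
        shares.append(row[t] / total if total else 0.0)
    return shares

# Illustrative values only (assumed): a local, diagonal-heavy pattern.
patch_attr = [
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.2, 0.8],
]
# Illustrative values only (assumed): credit spread over preceding layers.
cls_attr = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]

patch_shares = diagonal_share(patch_attr)
cls_shares = diagonal_share(cls_attr)
```

In the patch-like matrix the diagonal share stays high at every depth, while in the [CLS]-like matrix it falls as depth increases because earlier layers keep contributing.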
Necessary and sufficient attribution layers.
The top-4 attributed layers alone recover accuracy (sufficiency), while removing the single highest-scored layer causes substantial degradation (necessity).
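The two tests amount to selecting layers by attribution score. The sketch below shows only that selection step for a 12-layer model such as ViT-B. The scores are made up, and `top_k_layers` is a hypothetical helper. How the kept or removed layers are then evaluated is not specified here.

```python
def top_k_layers(scores, k):
    """Indices of the k highest-scoring layers, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Made-up per-layer attribution scores for a 12-layer model (assumed values).
scores = [0.02, 0.05, 0.01, 0.08, 0.03, 0.04, 0.10, 0.06, 0.12, 0.22, 0.09, 0.18]

# Sufficiency test: keep only the four highest-scoring layers.
keep = sorted(top_k_layers(scores, 4))        # [6, 8, 9, 11]

# Necessity test: drop the single highest-scoring layer, keep the rest.
best = top_k_layers(scores, 1)[0]             # 9
ablated = [l for l in range(len(scores)) if l != best]
```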
@inproceedings{chatzoudis2026clt,
author = {Chatzoudis, Gerasimos and Polyzos, Konstantinos D. and Li, Zhuowei and Gu, Difei and Moran, Gemma E. and Wang, Hao and Metaxas, Dimitris N.},
title = {Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision},
booktitle = {CVPR Workshop on Explainable AI for Computer Vision (XAI4CV)},
year = {2026},
note = {Spotlight}
}