A nice and recent paper from Lior Wolf's lab at Tel Aviv University: https://arxiv.org/pdf/2103.15679.pdf by Hila Chefer, Shir Gur and Lior Wolf. The problem is very simple: given a transformer encoder/ decoder network, we would like to visualize the affect of attention on the image. While the problem is simple the answer is pretty complicated: we need to take into account attention matrices from mutliple layers at once. The paper suggests an iterative way to add up all those attention layers into one coherent image.
Figure 4 shows that the result is very compelling vs. previous art: