Thank you for your research work and code! I'm a bit confused about Figure 2 in the Pyramid KV paper.
Figure 2 shows the attention patterns of one QA example over six different layers.
What puzzles me is what exactly this example is and how it was chosen. Do these attention patterns also appear in other samples?
Thanks.