Hi! I've been working on KV cache compression with the LLaMA3 model using the library recently. I have two questions at the moment.
- It seems that the implementation of GQA doesn't differ significantly from MHA. In the Transformers library, the KV cache stores the keys and values before they are repeated, whereas in your implementation the cache is stored after the repetition. For LLaMA-3 (8 key/value heads and 32 query heads) this could make the KV cache up to 4 times larger during inference (see the first sketch below). Could you explain the advantages of this approach?
- It seems that your implementation of H2O differs somewhat from the original description. In your implementation, it appears that after computing the scores, the attention matrix is only causally masked within the most recent `window_size` queries, rather than applying a lower-triangular mask to all inputs during the prefill phase. This could lead to an issue where, during the softmax computation, keys outside the `window_size` might still be visible to queries beyond the intended range (see the second sketch below). I'm not sure about the purpose or benefit of this approach, or whether the `attention_mask` should instead be applied directly at this stage.
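To illustrate the size difference I mean in the first point, here is a minimal sketch with hypothetical LLaMA-3-8B-like shapes (32 query heads, 8 KV heads, head dim 128); the tensor names are just for illustration and not taken from your code:

```python
import torch

# Hypothetical LLaMA-3-8B-like shapes: 32 query heads, 8 key/value heads.
batch, num_q_heads, num_kv_heads, seq_len, head_dim = 1, 32, 8, 4096, 128
n_rep = num_q_heads // num_kv_heads  # 4

# KV cache kept *before* repetition (Transformers behaviour): one slot per KV head.
k_cache_pre = torch.zeros(batch, num_kv_heads, seq_len, head_dim)

# KV cache kept *after* repeating the KV heads to match the query heads.
k_cache_post = k_cache_pre.repeat_interleave(n_rep, dim=1)

print(k_cache_pre.shape)   # torch.Size([1, 8, 4096, 128])
print(k_cache_post.shape)  # torch.Size([1, 32, 4096, 128]) -> 4x the memory per layer
```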
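And here is a rough sketch of the masking difference I'm describing in the second point, contrasting a full lower-triangular mask with a causal mask applied only to the last `window_size` queries. The sizes are made up and this is only my reading of the current behaviour, not your actual code:

```python
import torch

torch.manual_seed(0)
seq_len, window_size = 8, 4  # made-up prefill length and scoring window

scores = torch.randn(seq_len, seq_len)  # query x key scores for one head

# (a) Full lower-triangular mask over the whole prefill (original H2O description).
full_causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
probs_full = scores.masked_fill(~full_causal, float("-inf")).softmax(dim=-1)

# (b) Causal mask applied only to the most recent `window_size` queries
#     (my reading of the current code): earlier queries can still attend to
#     keys in their "future".
window_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
window_mask[-window_size:] = full_causal[-window_size:]
probs_window = scores.masked_fill(~window_mask, float("-inf")).softmax(dim=-1)

# Rows outside the window differ, i.e. the softmax there sees extra keys.
print((probs_full - probs_window).abs().max())
```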
Looking forward to your response.