Hi! I've been working on KV cache compression with the LLaMA3 model using the library recently. I have two questions at the moment.
- It seems that the implementation of GQA doesn't differ significantly from MHA. In the Transformers library, the KV cache stores the keys and values before they are repeated, whereas in your implementation the cache is stored after the repetition. For LLaMA-3 (8 key/value heads and 32 query heads) this could make the KV cache up to 4 times larger during inference (see the first sketch below). Could you explain the advantages of this approach?
- It seems that your implementation of H2O differs somewhat from the original description. In your implementation, it appears that after computing the scores, the attention matrix is only causally masked within the most recent `window_size` queries, rather than applying a lower-triangular mask to all inputs during the prefill phase. This could lead to an issue where, during the softmax computation, keys outside the `window_size` might still be visible to queries beyond the intended range (see the second sketch below). I'm not sure about the purpose or benefit of this approach, or whether the `attention_mask` should instead be applied directly at this stage.
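To illustrate the size difference I mean in the first point, here is a minimal sketch with hypothetical LLaMA-3-8B-like shapes (32 query heads, 8 KV heads, head dim 128); the tensor names are just for illustration and not taken from your code:

```python
import torch

# Hypothetical LLaMA-3-8B-like shapes: 32 query heads, 8 key/value heads.
batch, num_q_heads, num_kv_heads, seq_len, head_dim = 1, 32, 8, 4096, 128
n_rep = num_q_heads // num_kv_heads  # 4

# KV cache kept *before* repetition (Transformers behaviour): one slot per KV head.
k_cache_pre = torch.zeros(batch, num_kv_heads, seq_len, head_dim)

# KV cache kept *after* repeating the KV heads to match the query heads.
k_cache_post = k_cache_pre.repeat_interleave(n_rep, dim=1)

print(k_cache_pre.shape)   # torch.Size([1, 8, 4096, 128])
print(k_cache_post.shape)  # torch.Size([1, 32, 4096, 128]) -> 4x the memory per layer
```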
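And here is a rough sketch of the masking difference I'm describing in the second point, contrasting a full lower-triangular mask with a causal mask applied only to the last `window_size` queries. The sizes are made up and this is only my reading of the current behaviour, not your actual code:

```python
import torch

torch.manual_seed(0)
seq_len, window_size = 8, 4  # made-up prefill length and scoring window

scores = torch.randn(seq_len, seq_len)  # query x key scores for one head

# (a) Full lower-triangular mask over the whole prefill (original H2O description).
full_causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
probs_full = scores.masked_fill(~full_causal, float("-inf")).softmax(dim=-1)

# (b) Causal mask applied only to the most recent `window_size` queries
#     (my reading of the current code): earlier queries can still attend to
#     keys in their "future".
window_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
window_mask[-window_size:] = full_causal[-window_size:]
probs_window = scores.masked_fill(~window_mask, float("-inf")).softmax(dim=-1)

# Rows outside the window differ, i.e. the softmax there sees extra keys.
print((probs_full - probs_window).abs().max())
```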
Looking forward to your response.