Regarding the Implementation Details of H2O #49

@masn1310

Description

Hi! I've recently been working on KV cache compression for the LLaMA3 model using this library, and I have two questions at the moment.

  1. It seems that the implementation of GQA doesn't differ significantly from MHA. In the Transformers library, the KV cache stores the keys and values before they are repeated, whereas in your implementation the cache is stored after the repetition. This can make the KV cache up to 4 times larger for LLaMA3 (which has 8 key-value heads and 32 query heads) during inference; a rough size comparison is sketched after this list. Could you explain the advantages of this approach?
  2. It seems that your implementation of H2O differs somewhat from the original description. After the attention scores are computed, the causal mask appears to be applied only within the most recent window_size positions, rather than as a full lower-triangular mask over all positions during the prefill phase. During the softmax computation, keys could therefore remain visible to queries outside the intended causal range; a small illustration of this follows the list. I'm not sure what the purpose or benefit of this is, or whether the attention_mask should instead be applied directly at this stage.
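
To make question 1 concrete, here is a minimal sketch (my own toy code with LLaMA3-8B-like shapes, not taken from this repository) comparing the per-layer key-cache size when the cache is stored before vs. after `repeat_kv`:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand (batch, num_kv_heads, seq_len, head_dim) to
    # (batch, num_kv_heads * n_rep, seq_len, head_dim), as Transformers does before attention.
    b, h_kv, s, d = x.shape
    if n_rep == 1:
        return x
    return x[:, :, None, :, :].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

batch, seq_len, head_dim = 1, 4096, 128
num_q_heads, num_kv_heads = 32, 8        # LLaMA3-8B-style configuration
n_rep = num_q_heads // num_kv_heads      # = 4

k = torch.randn(batch, num_kv_heads, seq_len, head_dim, dtype=torch.float16)

bytes_before = k.numel() * k.element_size()                   # cache stored pre-repetition
bytes_after = repeat_kv(k, n_rep).numel() * k.element_size()  # cache stored post-repetition

print(f"per-layer K cache before repeat_kv: {bytes_before / 2**20:.1f} MiB")
print(f"per-layer K cache after  repeat_kv: {bytes_after / 2**20:.1f} MiB")
print(f"ratio: {bytes_after / bytes_before:.0f}x")            # 4x for these shapes
```

The same factor applies to the value cache, so over long contexts the difference is substantial.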
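
For question 2, the following toy example (hypothetical shapes and mask construction, meant only to illustrate my concern, not your actual code) contrasts a full lower-triangular mask with a causal mask applied only inside the most recent window_size block:

```python
import torch

seq_len, window_size = 8, 4
scores = torch.zeros(seq_len, seq_len)  # stand-in for q @ k.T / sqrt(d) during prefill

# (a) standard prefill: full lower-triangular causal mask
full_causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
masked_full = scores.masked_fill(~full_causal, float("-inf"))

# (b) causal mask applied only to the most recent window_size x window_size block;
# everything outside that block is left unmasked
window_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
window_mask[-window_size:, -window_size:] = torch.tril(
    torch.ones(window_size, window_size, dtype=torch.bool)
)
masked_window = scores.masked_fill(~window_mask, float("-inf"))

# queries whose visible key set differs between (a) and (b):
# under (b), queries before the window can attend to keys that come after them
leak = (masked_full != masked_window).any(dim=-1)
print("queries that see future keys under the windowed mask:",
      leak.nonzero().squeeze(-1).tolist())  # [0, 1, 2, 3] for these shapes
```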

Looking forward to your response.
