
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models #38966

@AshAnand34

Description

Model description

Nemotron-H is a family of hybrid Mamba-Transformer models developed by NVIDIA that combines the efficiency of Mamba layers with the accuracy of Transformer attention layers. The family comes in two sizes:

  • 8B parameter model
  • 56B parameter model, plus a 47B version compressed from it with MiniPuzzle

Key Features:

  • Hybrid Architecture: Replaces the majority of self-attention layers with Mamba layers, keeping computation and memory per generated token constant (see the sketch after this list)
  • Superior Performance: Up to 3x faster inference than similarly sized state-of-the-art Transformer models
  • Competitive Accuracy: Accuracy on par with or better than Qwen-2.5-7B/72B and Llama-3.1-8B/70B
  • FP8 Training: Introduces an FP8-based training recipe that matches BF16 training results
  • Compression Technique: MiniPuzzle compression reduces the 56B model to 47B while maintaining accuracy and improving inference speed by 20%
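
The efficiency argument is essentially about the decode-time cache: each attention layer keeps a KV cache that grows with the number of generated tokens, while each Mamba layer keeps a fixed-size state. Below is a rough back-of-the-envelope sketch of that difference; the layer pattern, layer counts, and dimensions are illustrative placeholders, not the actual Nemotron-H configuration.

```python
# Illustrative only: compare per-sequence cache growth for attention layers
# (KV cache grows with sequence length) vs. Mamba layers (fixed-size state).
# All sizes below are made-up placeholders, not Nemotron-H's real config.

ATTN, MAMBA = "*", "M"
layer_pattern = [MAMBA] * 48 + [ATTN] * 4  # hypothetical hybrid stack

hidden_size = 4096
num_kv_heads = 8
head_dim = 128
mamba_state_size = 128  # fixed-size SSM state, independent of sequence length


def cache_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Rough per-sequence cache footprint for the hypothetical stack above."""
    total = 0
    for layer in layer_pattern:
        if layer == ATTN:
            # KV cache: keys + values for every generated token
            total += 2 * seq_len * num_kv_heads * head_dim * dtype_bytes
        else:
            # Mamba: constant-size recurrent state regardless of seq_len
            total += hidden_size * mamba_state_size * dtype_bytes
    return total


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {cache_bytes(n) / 2**20:8.1f} MiB")
```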

Technical Innovations:

  • Constant computation and memory requirements per generated token
  • Novel compression technique based on pruning and distillation (MiniPuzzle; a generic distillation sketch follows this list)
  • FP8 training recipe for efficient training
  • Hybrid Mamba-Transformer architecture optimization
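
MiniPuzzle is described in the paper as a pruning-plus-distillation procedure that derives the 47B model from the 56B one. Purely as a generic illustration of the distillation half (standard KL-based logit distillation, not NVIDIA's actual MiniPuzzle recipe), it could look something like:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Generic KL-based logit distillation; hyperparameters are placeholders."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten to (tokens, vocab) so batchmean averages over all tokens,
    # and scale by T^2 as is conventional for distillation losses.
    return (
        F.kl_div(
            s.reshape(-1, s.size(-1)),
            t.reshape(-1, t.size(-1)),
            reduction="batchmean",
        )
        * temperature**2
    )


# Toy usage with random logits (batch of 2, sequence of 4, vocab of 8).
student = torch.randn(2, 4, 8)
teacher = torch.randn(2, 4, 8)
print(distillation_loss(student, teacher).item())
```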

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

HF model: https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K/tree/main
arXiv paper: https://arxiv.org/abs/2504.03624
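
Until a native implementation lands, the checkpoint can presumably be tried through the Hub's remote-code path (assuming the repo above ships custom modeling code; I have not verified this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-H-8B-Base-8K"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # whatever dtype the checkpoint recommends
    trust_remote_code=True,  # only needed until native support exists
)

inputs = tokenizer("Hybrid Mamba-Transformer models are", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```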

I would definitely love to integrate this into Transformers.
