Model description
Nemotron-H is a family of hybrid Mamba-Transformer models developed by NVIDIA that combines the efficiency of Mamba layers with the accuracy of the Transformer architecture. The family comes in two sizes:
- 8B parameter model
- 56B parameter model, plus a 47B variant compressed from it with MiniPuzzle
Key Features:
- Hybrid Architecture: Replaces the majority of self-attention layers with Mamba layers, giving constant computation and memory per generated token (see the layer-stack sketch after this list)
- Superior Performance: Up to 3x faster inference than similarly sized state-of-the-art Transformer models
- Competitive Accuracy: On par with or better than Qwen-2.5-7B/72B and Llama-3.1-8B/70B
- FP8 Training: Introduces FP8-based training recipe achieving on-par results with BF16 training
- Compression Technique: MiniPuzzle compression reduces the 56B model to 47B while maintaining accuracy and improving inference speed by 20%
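
To make the hybrid idea concrete, here is a minimal PyTorch sketch of a decoder stack in which only a small fraction of layers are self-attention and the rest are Mamba-style blocks. All class names, layer counts, hidden sizes, and the attention placement below are illustrative placeholders I made up, not the actual Nemotron-H implementation.

```python
import torch
import torch.nn as nn


class MambaBlock(nn.Module):
    """Stand-in for a Mamba (selective SSM) block; the real layer runs a state-space scan."""

    def __init__(self, hidden_size):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.mixer = nn.Linear(hidden_size, hidden_size)  # placeholder for the SSM mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))


class AttentionBlock(nn.Module):
    """Stand-in for a standard self-attention block."""

    def __init__(self, hidden_size, num_heads=32):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out


class HybridDecoder(nn.Module):
    """Keeps only every `attention_every`-th layer as self-attention; the rest are Mamba."""

    def __init__(self, hidden_size=4096, num_layers=52, attention_every=10):
        super().__init__()
        self.layers = nn.ModuleList([
            AttentionBlock(hidden_size) if (i + 1) % attention_every == 0
            else MambaBlock(hidden_size)  # majority of layers are Mamba
            for i in range(num_layers)
        ])

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states


# Tiny smoke test with reduced dimensions
model = HybridDecoder(hidden_size=256, num_layers=12, attention_every=4)
out = model(torch.randn(1, 16, 256))
print(out.shape)  # torch.Size([1, 16, 256])
```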
Technical Innovations:
- Constant computation and memory requirements per generated token (see the back-of-the-envelope comparison after this list)
- Novel pruning-and-distillation compression technique (MiniPuzzle)
- FP8 training recipe for efficient training
- Hybrid Mamba-Transformer architecture optimization
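
A quick back-of-the-envelope comparison of why this matters for generation memory: an attention layer's KV cache grows with sequence length, while a Mamba layer carries a fixed-size recurrent state. The dimensions below are assumed for illustration only, not Nemotron-H's actual config.

```python
# Hypothetical dimensions, purely for illustration
hidden_size = 4096
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # bf16
seq_len = 8192

# KV cache for one attention layer: keys + values for every past token
kv_cache_bytes = 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem

# One Mamba layer: a constant-size recurrent state, independent of seq_len
state_dim = 128             # assumed SSM state size
mamba_state_bytes = hidden_size * state_dim * bytes_per_elem

print(f"attention KV cache @ {seq_len} tokens: {kv_cache_bytes / 2**20:.1f} MiB")  # 32.0 MiB
print(f"mamba recurrent state (any length):   {mamba_state_bytes / 2**20:.1f} MiB")  # 1.0 MiB
```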
Open source status
- The model implementation is available
- The model weights are available
Provide useful links for the implementation
HF model: https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K/tree/main
arXiv paper: https://arxiv.org/abs/2504.03624
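
For reference, a minimal sketch of how the checkpoint could be tried directly from the Hub today, assuming the repo ships custom modeling code usable via trust_remote_code (this is not the native Transformers integration requested here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-H-8B-Base-8K"

# trust_remote_code is assumed to be required until a native implementation lands
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Nemotron-H is a hybrid Mamba-Transformer model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```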
I would definitely love to integrate this into Transformers.