A high-quality neural network weight quantization tool that converts PyTorch model weights to FP16 format using learned rounding optimization and SVD-based error correction.
- Learned Rounding Optimization: Advanced quantization using adaptive rounding inspired by AdaRound
- SVD Error Correction: Projects quantization error onto the top principal components to suppress its dominant directions
- Adaptive Bias Correction: Automatically adjusts biases to compensate for quantization errors
- Smart Scaling: Intelligent scaling strategy optimized for FP16's dynamic range
- Model Compatibility: Special handling for T5XXL and distillation layers
- Memory Efficient: Processes tensors individually to minimize GPU memory usage
- Progress Tracking: Detailed progress bars and logging for long conversions
```bash
git clone https://github.com/marduk191/fp16_learned_rounding.git
cd Learned-Rounding
```

```bash
python convert_fp16_scaled_learned_svd_fast.py --input model.safetensors

# T5XXL model with custom optimization parameters
python convert_fp16_scaled_learned_svd_fast.py \
    --input t5xxl_model.safetensors \
    --output t5xxl_fp16.safetensors \
    --t5xxl \
    --num_iter 512 \
    --top_k 2 \
    --calib_samples 4096
```

| Argument | Type | Default | Description |
|---|---|---|---|
| `--input` | str | Required | Input safetensors file path |
| `--output` | str | Auto-generated | Output file path |
| `--t5xxl` | flag | False | Enable T5XXL compatibility mode |
| `--keep_distillation` | flag | False | Preserve distillation layers from quantization |
| `--num_iter` | int | 256 | Optimization iterations per tensor |
| `--top_k` | int | 1 | Number of principal components for SVD |
| `--calib_samples` | int | 3072 | Calibration samples for bias correction |
The core algorithm implements "TPEC-Quant" (Top-Principal Error Correction Quantization):
```python
# Simplified algorithm flow
W_scaled = W_original * scale_factor
W_quantized = optimize_rounding(W_scaled, calibration_data)
W_final = W_quantized.to(torch.float16)
```
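For a concrete feel of the scale-then-round step, here is a minimal self-contained sketch: plain nearest rounding stands in for the learned `optimize_rounding`, and the absmax-based scale heuristic is an illustrative assumption rather than the tool's exact strategy (the stored scale follows the multiply-on-dequantize convention used in the loading example below).

```python
import torch

def fp16_roundtrip_sketch(W_original: torch.Tensor):
    # Assumed heuristic: fit the largest magnitude into FP16's normal
    # range (max 65504), leaving 2x headroom.
    fp16_max = torch.finfo(torch.float16).max
    scale_weight = W_original.abs().max() / (fp16_max / 2)

    W_scaled = W_original / scale_weight  # bring values into FP16 range
    W_final = W_scaled.to(torch.float16)  # nearest rounding stands in for optimize_rounding
    W_dequantized = W_final.to(torch.float32) * scale_weight
    return W_final, scale_weight, W_dequantized

W = torch.randn(256, 256) * 4.0
W_q, scale, W_deq = fp16_roundtrip_sketch(W)
print("max abs round-trip error:", (W - W_deq).abs().max().item())
```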
Uses singular value decomposition to focus optimization on the most important error directions:

```python
# Top-k principal directions of the weight matrix
# (torch.pca_lowrank returns U with shape (m, k) and V with shape (n, k))
U_k, _, V = torch.pca_lowrank(W_original, q=top_k)
Vh_k = V.T

projected_error = U_k.T @ error @ Vh_k.T  # error expressed in the top-k basis
gradient = U_k @ projected_error @ Vh_k   # mapped back to weight space
```
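A self-contained, runnable version of the same projection; the weight shape, `top_k` value, and the FP16 round trip used to produce `error` are illustrative assumptions:

```python
import torch

top_k = 2
W_original = torch.randn(128, 64)
error = W_original - W_original.half().float()  # FP16 round-trip error

# torch.pca_lowrank returns U (m, top_k) and V (n, top_k);
# note it mean-centers its input by default.
U_k, _, V = torch.pca_lowrank(W_original, q=top_k)
Vh_k = V.T

projected_error = U_k.T @ error @ Vh_k.T  # (top_k, top_k) coefficients
gradient = U_k @ projected_error @ Vh_k   # rank-k update in weight space

print(gradient.shape)  # torch.Size([128, 64])
```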
Automatically corrects biases to compensate for quantization errors:

```python
weight_error = W_original - W_dequantized
output_error = calibration_data @ weight_error.T
bias_correction = output_error.mean(dim=0)
new_bias = original_bias + bias_correction  # add the mean output error back to the bias
```
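The sketch below (synthetic calibration activations, FP16 round trip standing in for the quantizer) verifies the key property: with the corrected bias, the quantized layer reproduces the original layer's mean output over the calibration batch.

```python
import torch

torch.manual_seed(0)
W_original = torch.randn(64, 32)
original_bias = torch.randn(64)
calibration_data = torch.randn(3072, 32)   # stand-in calibration activations

W_dequantized = W_original.half().float()  # FP16 round trip as the quantizer
weight_error = W_original - W_dequantized

output_error = calibration_data @ weight_error.T  # (samples, out_features)
bias_correction = output_error.mean(dim=0)
new_bias = original_bias + bias_correction

orig_mean = (calibration_data @ W_original.T + original_bias).mean(dim=0)
corr_mean = (calibration_data @ W_dequantized.T + new_bias).mean(dim=0)
print("mean-output mismatch:", (orig_mean - corr_mean).abs().max().item())  # ~0
```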
| Aspect | FP16 Version |
|---|---|
| Precision | ~3-4 decimal digits |
| Range | Β±65,504 |
| Memory Usage | 2 bytes per parameter |
| Speed | Fast on modern GPUs |
| Compatibility | Universal hardware support |
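These figures can be confirmed directly with PyTorch's `torch.finfo`:

```python
import torch

info = torch.finfo(torch.float16)
print(info.max)        # 65504.0
print(info.eps)        # 0.0009765625 -> roughly 3-4 decimal digits
print(info.bits // 8)  # 2 bytes per parameter
```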
- ✅ Diffusion models (Stable Diffusion, FLUX, etc.)
- ✅ Language models (T5, BERT, etc.)
- ✅ Vision transformers
- ✅ Any PyTorch model saved as safetensors
- T5XXL Mode (`--t5xxl`): Handles encoder-decoder architectures
- Distillation Preservation (`--keep_distillation`): Maintains teacher-student model compatibility
The converted model includes:
- Quantized weights: FP16 format with learned rounding
- Scale factors: `{layer}.scale_weight` tensors for dequantization
- Corrected biases: Automatically adjusted bias terms
- Metadata: `scaled_fp16` marker tensor
```python
from safetensors import safe_open
import torch

# Load converted model
with safe_open("model_fp16_scaled.safetensors", framework="pt") as f:
    weight = f.get_tensor("layer.weight")       # FP16 quantized
    scale = f.get_tensor("layer.scale_weight")  # FP32 scale factor

# Dequantize for use
dequantized_weight = weight.to(torch.float32) * scale
```

```bash
python convert_fp16_scaled_learned_svd_fast.py --input stable_diffusion.safetensors
# Output: stable_diffusion_float16_scaled_learned_svd.safetensors
```

```bash
python convert_fp16_scaled_learned_svd_fast.py \
--input t5xxl_encoder.safetensors \
--t5xxl \
--num_iter 512 \
--top_k 3 \
--calib_samples 8192
```

```bash
#!/bin/bash
for model in models/*.safetensors; do
echo "Converting $model"
python convert_fp16_scaled_learned_svd_fast.py --input "$model" --num_iter 128
done
```

- GPU Memory: Use `--calib_samples 1024` for very large models on limited GPU memory
- Speed vs Quality: Reduce `--num_iter` to 64-128 for faster conversion with minimal quality loss
- High Quality: Increase `--top_k` to 2-3 and `--num_iter` to 512+ for maximum quality
- Batch Processing: Process multiple models sequentially to avoid memory issues
Out of GPU Memory

```bash
# Solution: Reduce calibration samples
python convert_fp16_scaled_learned_svd_fast.py --input model.safetensors --calib_samples 512
```

Very Slow Conversion

```bash
# Solution: Reduce iterations or use CPU
python convert_fp16_scaled_learned_svd_fast.py --input model.safetensors --num_iter 64
```

File Size Too Large
- FP16 models are 2x smaller than FP32 but 2x larger than FP8
- Consider the original FP8 version for maximum compression
The algorithm optimizes for minimal reconstruction error:
- Objective: Minimize `||W_original - W_dequantized||_F`
- Method: Learned rounding with principal component focus
- Validation: Automatic bias correction ensures output consistency
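One way to measure this objective on a converted file (a sketch, assuming the original FP32 checkpoint is still at hand; the paths and tensor names are placeholders following the loading example above):

```python
import torch
from safetensors import safe_open

# "model.safetensors" / "layer.weight" are placeholders for illustration
with safe_open("model.safetensors", framework="pt") as f:
    W_original = f.get_tensor("layer.weight").to(torch.float32)

with safe_open("model_fp16_scaled.safetensors", framework="pt") as f:
    W_dequantized = (f.get_tensor("layer.weight").to(torch.float32)
                     * f.get_tensor("layer.scale_weight"))

# Frobenius norm of the reconstruction error, absolute and relative
frob = torch.linalg.norm(W_original - W_dequantized)
rel = frob / torch.linalg.norm(W_original)
print(f"||W - W_hat||_F = {frob:.4f}  (relative {rel:.2e})")
```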
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- AdaRound Paper: Adaptive Rounding for Post-Training Quantization
- Original Author: Clybius (FP8 implementation)
- PyTorch Team: For excellent quantization primitives
- Safetensors: For efficient model serialization
⭐ If this project helped you, please give it a star!