Releases: flashinfer-ai/flashinfer

Nightly Release v0.4.0-20251012

12 Oct 03:46
bbb57ad

Pre-release

Automated nightly build for version 0.4.0 (dev20251012)
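
These nightly tags follow PEP 440 dev versioning, so the installed package reports a date-stamped version string. Below is a minimal sanity check; the exact string is inferred from the release title, and flashinfer.__version__ is assumed to follow the usual packaging convention:

    # Confirm which nightly build is active; the dev suffix encodes the date.
    import flashinfer
    print(flashinfer.__version__)  # expected: something like "0.4.0.dev20251012"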

Nightly Release v0.4.0-20251011

11 Oct 03:35
e8addbf

Pre-release

Automated nightly build for version 0.4.0 (dev20251011)

Nightly Release v0.4.0-20251010

10 Oct 03:41
c3ff7e7

Pre-release

Automated nightly build for version 0.4.0 (dev20251010)

Nightly Release v0.4.0-20251009

09 Oct 03:42
c08b529

Pre-release

Automated nightly build for version 0.4.0 (dev20251009)

v0.4.0

09 Oct 01:59
68826ac

What's Changed

  • perf: Enable SplitK and fix autotuner for trtllm fp4 fused moe by @stslxg-nv in #1548
  • bugfix: Fix FLOPS calculation for bench_trtllm_gen_mla.py by @RayWang96 in #1640 (a generic FLOPs model is sketched after this list)
  • feat: add support of fp4_batched_quantize by @yicwang in #1633
  • fix: zero-init workspace buffer for trtllm-gen fmha by @yyihuang in #1643
  • misc: Add the keyword "template" to member template specialization by @tomflinda in #1246
  • chore: Switch pynvml to nvidia-ml-py by @toulzx in #1650
  • [TVM] Rename NDArray -> Tensor by @MasterJH5574 in #1651
  • misc: remove unused load_cuda_ops function by @yzh119 in #1649
  • feat: Add k_scale and v_scale to persistent attention by @Edenzzzz in #1322
  • misc: add script to analyze code owners from git history by @yzh119 in #1653
  • Tiny: allow compiling with line info and release MoE by @fzyzcjy in #1659
  • Speedup MLARopeQuantize by 20-35% by @fzyzcjy in #1660
  • Add benchmark for MLARopeQuantize by @fzyzcjy in #1656
  • Added mx_fp4 support using the cudnn backend by @nvmbreughe in #1644
  • feat: Support s_qo < s_kv for prefill in flashinfer_benchmark.py and minor benchmark updates by @bkryu in #1664
  • test: update fused_moe test to random scale factor by @yyihuang in #1665
  • perf&bugfix: skip kv-tile computation outside the sliding window in FA2; fix __syncthreads in merge_state by @happierpig in #1661
  • [Hotfix] test_fp4_quantize.py failure on sm103 by @sunghyunp-nvdia in #1666
  • benchmark: add cupti support to benchmark by @nv-yunzheq in #1662
  • TGV GEMM as a BF16 backend alternative to cuBLAS by @yangs75 in #1668
  • feat: Add variant.OutputTransform() to decode kernels by @gau-nernst in #1670
  • ci: collect module status and update flashinfer-cli by @yzh119 in #1676
  • feat: Batch-size invariant FA2 Prefill & Decode by @Edenzzzz in #1675
  • test: better fp8 quantization init for fused_moe test by @yyihuang in #1674
  • Support output signals for overlapping for cutedsl gemm by @fzyzcjy in #1677
  • [misc] add a wrapper class for attention sink jit args by @happierpig in #1679
  • [TVM] Default fixed_split_size value in TVM binding by @MasterJH5574 in #1680
  • Update TGV GEMM default kernel and TGV code cleanup. by @yangs75 in #1682
  • perf: improve performance of cutlass fmha by @yzh119 in #1681
  • fix: correct the sm version number in cutlass_fused_moe_module for rtx pro 6000 by @yongwww in #1683
  • Refactor Blackwell unit test scripts by @dierksen in #1667
  • bugfix: increase workspace to make unit test pass by @nv-yunzheq in #1684
  • Update deepgemm backend for 103a by @kahyunnam in #1694
  • gemm: Enabled alpha with the mx_fp4 format by @nvmbreughe in #1688
  • hotfix: Hotfix for test_pod_kernels.py on B300 by @sunghyunp-nvdia in #1698
  • misc: Do not use the limited API with free-threaded Python by @rostan-t in #1687
  • Remove incorrect method call "isdigit" on number type by @HelloCard in #1699
  • ci: fix prefill attention unittests by @yzh119 in #1700
  • misc: unify the macro to determine cuda version at compile time by @yzh119 in #1703
  • Support Kimi-K2 for TRT: templatize number of experts by @GordonGustafson in #1696
  • feat: Benchmark mm_fp4 mxfp4 support and gemm autotune support. Restore mm_fp4 API behavior by @bkryu in #1706
  • bugfix: increase workspace to make trtllm gen attention unit test pass by @nv-yunzheq in #1707
  • CI: Updated test lists and addressed some failing tests by @nvmbreughe in #1708
  • misc: update the pypi release github action by @yzh119 in #1713
  • perf: Add tuning config for cutlass moe for specific hardware by @fzyzcjy in #1716
  • ci: remove deprecated github actions for aot wheel by @yzh119 in #1714
  • test: skip the unsupported test cases for sm120/121 by @yongwww in #1710
  • [cute_dsl] add gemm + all reduce (two_shot) by @Amir-19 in #1695
  • misc: remove unused torch.utils.cpp_extension dependencies by @yzh119 in #1711
  • test: skip unsupported (non-SM90) test cases for xqa by @jimmyzho in #1715
  • Fix DeepSeek quality for TRTLLM fused MoE routing by @GordonGustafson in #1723
  • perf: Port the separate reduce kernel mode from trtllm. by @weireweire in #1685
  • typo: Super tiny fix typo by @fzyzcjy in #1730
  • fix: put sampling kernel launch into macro by @ir1ka in #1727
  • bugfix: Fix flashinfer download-cubin by @tiran in #1729
  • Fix missing namespace qualifier by @joker-eph in #1731
  • ci/cd: bring up flashinfer-cubin package by @yzh119 in #1718
  • disable optimization and add more debug information during verbose mode by @rainj-me in #1719
  • ci/cd: add github workflows to publish flashinfer-cubin wheel to pypi by @yzh119 in #1737
  • Bump base container image from 13.0.0 to 13.0.1 for cu130 container by @bkryu in #1739
  • fix: CI containers install nvidia-cudnn-cu12 vs. nvidia-cudnn-cu13 based on CUDA Version by @bkryu in #1742
  • Test refactoring and fixes by @nvmbreughe in #1736
  • TVM: support TVM binding for GroupedGemm by @neurusL in #1725
  • ci: enable tests for sm75 (G4) by @yongwww in #1705
  • doc: Super tiny fix doc math by @fzyzcjy in #1747
  • hotfix: Fix parsing PyTorch version by @sunghyunp-nvdia in #1749
  • feat: port fast_decode_plan from sgl by @zihaoye in #1745
  • hotfix: slightly bump up atol to 3e-3 to pass test_cudnn_prefill on B40 by @sunghyunp-nvdia in #1750
  • tests: xfail moe quantization classes mxfp8_bf16 UTs on sm103 by @jimmyzho in #1754
  • ci: complete the list of modules in aot.py by @yzh119 in #1746
  • tests: xfail attention sink UT for sliding window + non causal case by @yzh119 in #1752
  • feat: Add compute capability checks to flashinfer_benchmark by @bkryu in #1756
  • test: minor update on trtllm-gen attn speculative-decoding test by @yyihuang in #1760
  • fix: should pass global_override_indptr_cpu in fast_decode_plan param list by @yyihuang in #1757
  • fix(cleanup): ensure repository URL has no trailing slash by @tarukumar in #1759
  • Fix tests/test_trtllm_gen_attention.py::test_trtllm_batch_prefill, ::test_trtllm_batch_decode mismatch error by @kahyunnam in #1755
  • ci: add apache-tvm-ffi to ci docker container by @yzh119 in #1763
  • fix: fix cannot import name 'cuda' from 'cuda' in CUDA13 by @LuYanFCP in #1764
  • bugfix: partially fix tests/test_trtllm_gen_fused_moe.py unit test failure by @nv-yu...
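
As context for the FLOPS fix in #1640 above, a generic attention-FLOPs model is sketched below. This is an illustration only (two matmuls per head, QK^T and PV, at 2*M*N*K FLOPs each), not necessarily the exact expression bench_trtllm_gen_mla.py uses:

    # Illustrative attention-FLOPs estimate; not necessarily the exact
    # formula used by bench_trtllm_gen_mla.py after #1640.
    def attention_flops(batch, heads, seq_q, seq_kv, head_dim_qk, head_dim_vo):
        qk = 2 * batch * heads * seq_q * seq_kv * head_dim_qk  # scores = Q @ K^T
        pv = 2 * batch * heads * seq_q * seq_kv * head_dim_vo  # out = P @ V
        return qk + pv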

Nightly Release v0.4.0-20251008

08 Oct 15:49
ebea4bd

Pre-release

Automated nightly build for version 0.4.0 (dev20251008)

Nightly Release v0.3.1-20251007

07 Oct 18:50
a4ddf26

Pre-release

Automated nightly build for version 0.3.1 (dev20251007)

v0.3.1

05 Sep 06:24
3c1e8d7

What's Changed

  • hotfix: change MAX_JOBS in aot ci by @yzh119 in #1621
  • fix: export MAX_JOBS for AOT build by @yongwww in #1626 (a sketch of the pattern follows this list)
  • feat: initial support for SM103, SM110, SM120, SM121 by @aleozlx in #1608
  • perf: Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #1615
  • Fix cute dsl gemm API wrong arg name and silent error when passing wrong kwargs by @fzyzcjy in #1619
  • bugfix: fix merge_attention_state in BatchAttention w/ gqa-group-size in Qwen family by @happierpig in #1614
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs in test_trtllm_mnnvl_allreduce by @bkryu in #1627
  • ci: add cuda-13 unittests to CI by @yzh119 in #1603
  • Revert "hotfix: change MAX_JOBS in aot ci (#1621)" by @yzh119 in #1629
  • patch mm segfault & patch cubin availability by @aleozlx in #1628
  • bugfix: fix flashinfer_benchmark.py IMA when running a test list by @bkryu in #1625
  • feat: cutlass fp4 gemm bringup for SM120 & SM121 by @yongwww in #1609
  • feat: update flashinfer-cli by @yzh119 in #1613
  • bugfix: trtllm-gen fmha sm101 and sm100 compatibility by @cyx-6 in #1631
  • bugfix: collect all modules to aot by @yzh119 in #1622
  • fix: pass workspace for trtllm-gen attention by @yyihuang in #1635
  • feat: cutlass fp8 gemm bringup for SM120 & SM121 by @yongwww in #1610
  • test: pytest.mark.xfail on deepgemm by @yongwww in #1636
  • release: bump version v0.3.1 by @yongwww in #1637

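Several entries above (#1621, #1626, #1629) adjust how MAX_JOBS is handled for the AOT build. Below is a minimal sketch of the usual pattern, assuming the build honors a MAX_JOBS environment variable the way torch cpp-extension builds do; build_parallelism is a hypothetical helper name, not the repo's actual function:

    # Sketch of MAX_JOBS handling; the AOT build's real logic may differ.
    import os

    def build_parallelism() -> int:
        env = os.environ.get("MAX_JOBS")  # explicit cap, if set
        return int(env) if env else (os.cpu_count() or 1)
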
Full Changelog: v0.3.0...v0.3.1

v0.3.0

01 Sep 06:21
f131f3d

What's Changed

  • Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
  • feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
  • bump cutlass submodule to v4.2 by @ttyio in #1572
  • typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
  • benchmark: Add autotuner to moe benchmark by @nv-yunzheq in #1536
  • bugfix: fix cuda version guard macros by @nvjullin in #1571
  • misc: remove some unused files by @yzh119 in #1574
  • bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
  • feat: Support for inferring out_dtype from out.dtype for TRTLLM attention kernel by @elvischenv in #1578
  • fix: semaphores must be at a fixed range in the workspace buffer for trtllm_gen attention by @yyihuang in #1584
  • bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
  • refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
  • fix unignorable narrowing conversion issue by @luccafong in #1586
  • bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
  • update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
  • fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
  • fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
  • refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
  • misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
  • Mnnvl memory with custom communicator by @wenscarl in #1245
  • Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
  • bugfix: Adding version checks to tests/test_hopper*.py files by @bkryu in #1594
  • Remove cuda-python from dependency and check at runtime by @VALLIS-NERIA in #1534 (see the sketch after this list)
  • bugfix: fix fused-temperature softmax IMA issue by @yzh119 in #1596
  • bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
  • bugfix: fix the register overflow issue for topk renorm kernels on blackwell by @yzh119 in #1597
  • bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs instead of failing by @bkryu in #1600
  • feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
  • ci: add ci container of cuda 13 and add cute-dsl as dependency. by @yzh119 in #1595
  • ci: Fix unittests of logits processor by @yzh119 in #1602
  • feat: integrate xqa attention backend by @qsang-nv in #1503
  • [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
  • bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
  • feat: enable trtllm-gen attn speculative decoding verify by decode by @yyihuang in #1453
  • ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
  • release: bump version v0.3.0 by @yzh119 in #1617

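#1534 above moves the cuda-python requirement from install time to run time. A hedged sketch of that pattern follows; require_cuda_driver is a hypothetical helper, the actual check in the PR may differ, and the module layout changed again for CUDA 13 (see #1764 above):

    # Illustrative lazy import with a clear runtime error; not the exact
    # code from #1534. Newer cuda-python releases expose the driver API
    # under cuda.bindings, older ones under cuda.cuda.
    def require_cuda_driver():
        try:
            from cuda.bindings import driver  # newer cuda-python layout
        except ImportError:
            try:
                from cuda import cuda as driver  # legacy layout
            except ImportError as exc:
                raise ImportError(
                    "cuda-python is required at runtime; install the "
                    "release matching your CUDA toolkit."
                ) from exc
        return driver
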
Full Changelog: v0.2.14.post1...v0.3.0

v0.2.14.post1

25 Aug 03:15
0380322

What's Changed

  • bugfix: Fix Persistent kernel precision for masked output by @Edenzzzz in #1533
  • ci: create docker image for cu126/cu128/cu129 by @yzh119 in #1558
  • Bugfix: some typos in Persistent kernel by @Edenzzzz in #1562
  • fix: separate out fp4 lib into sm90 and sm100 versions, add oob checking in fused moe by @djmmoss in #1565
  • bugfix: fix persistent attention kernel correctness on blackwell by @yzh119 in #1559
  • ci: add unittest for different cuda version by @yzh119 in #1560
  • release: bump version to v0.2.14.post1 by @yzh119 in #1568

Full Changelog: v0.2.14...v0.2.14.post1