Releases: flashinfer-ai/flashinfer
v0.1.3
v0.1.2
v0.1.1
v0.1.0
v0.0.9
0.0.9 (2024-07-12)
Bugfix
- fix decode kernels output for empty kv cache (#363) (ac72b1)
- check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)
Performance Improvements
- accelerate alibi (#365) (4f0a9f9)
- accelerate gqa performance (#356) (e56ddad)
- Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
Acknowledgement
We thank @Yard1, @Ying1123 and @zhyncs for their contributions.
v0.0.8
v0.0.7
0.0.7 (2024-06-28)
Breaking Changes
- `batch_decode_with_padded_kv_cache` was removed; we encourage users to use `BatchDecodeWithPagedKVCacheWrapper` instead. (#343)
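For users migrating off the padded-KV API, the sketch below shows the paged-KV decode path with `BatchDecodeWithPagedKVCacheWrapper`. It is a minimal sketch, not a drop-in replacement: the page-table tensors (`kv_indptr`, `kv_indices`, `kv_last_page_len`), shapes, and dtypes are illustrative, and the exact `begin_forward`/`forward` signatures should be checked against the API docs for your version.

```python
import torch
import flashinfer

# Minimal sketch of paged-KV batch decoding; shapes and page-table contents are illustrative.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, max_num_pages = 2, 8

# Workspace buffer used internally by the wrapper.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Request 0 owns pages [0, 1], request 1 owns page [2]; last pages are partially filled.
kv_indptr = torch.tensor([0, 2, 3], dtype=torch.int32, device="cuda")
kv_indices = torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([5, 7], dtype=torch.int32, device="cuda")

wrapper.begin_forward(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)
q = torch.randn(batch_size, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
# 5-D paged KV cache in NHD layout: [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim]
kv_data = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                      dtype=torch.float16, device="cuda")
o = wrapper.forward(q, kv_data)  # [batch_size, num_qo_heads, head_dim]
wrapper.end_forward()
```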
Bugfix
- fix the `forward_return_lse` function in `BatchPrefillWithRaggedKVCache` class (#337)
- fix the scheduler behavior of large page size (#333)
Features
Performance Improvements
v0.0.6
v0.0.5
0.0.5 (2024-06-20)
Highlights
- Support any GQA group size for tensor-cores kernels.
- Support any page size for tensor-cores kernels.
- Support CUDA-Graph for prefill/decode APIs.
- Add an option to accelerate decode kernels with Tensor Cores.
- Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
- Support logits cap in Grok-1 models.
- Fused GPU-sampling kernels: top-p, top-k, speculative verification (see the sketch after this list). (https://docs.flashinfer.ai/api/python/sampling.html)
- PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
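A rough sketch of the fused sampling kernels highlighted above. Function names follow the linked sampling docs; the exact signatures (in particular the pre-drawn uniform random numbers and the `(samples, success)` return of the rejection-based kernels) have changed across releases, so treat the argument lists as illustrative.

```python
import torch
import flashinfer

# Sketch only: argument lists are illustrative; check the sampling docs for your version.
batch_size, vocab_size, max_rounds = 4, 32000, 32
probs = torch.softmax(torch.randn(batch_size, vocab_size, device="cuda"), dim=-1)

# Pre-drawn uniforms for the rejection-sampling rounds used by the fused kernels.
uniform_samples = torch.rand(max_rounds, batch_size, device="cuda")

# Fused top-p (nucleus) sampling: no sort/renormalize round-trip in PyTorch.
samples, success = flashinfer.sampling.top_p_sampling_from_probs(
    probs, uniform_samples, top_p=0.9
)

# Fused top-k sampling from the same probabilities.
samples_k, success_k = flashinfer.sampling.top_k_sampling_from_probs(
    probs, uniform_samples, top_k=40
)
```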
Acknowledgement
We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions on the 0.0.5 release.
Refactor
- support any GQA group size for tensor-cores kernels (#301) (c111ca)
- support any page size for tensor-cores kernels (#306) (82fd8c)
Features
- add `use_tensor_cores` option to decode kernels to accelerate GQA (see the sketch after this list) (#317) (3b50dd5)
- add group gemm operators (#282) (e08ba42)
- initial support of distributed operators (#289) (03553da)
- initial support of logits hook (#298) (ab1e2ad)
- Separate Q and KV dtypes for decode (#286) (5602659)
- support cuda graph for batched multi-query (prefill/append) attention (#275) (83ceb67)
- support cuda graph for batched multi-query (prefill/append) attention (#277) (24cc583)
- support custom attention mask in prefill/append attention kernels (#266) (7304282)
- fused speculative sampling kernels (#259) (cea2bb)
- expose sampling APIs in pytorch (#238) (092902)
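The `use_tensor_cores` option mentioned in this list is a flag on the decode wrapper; a minimal sketch, assuming the same setup as the decode example further up (shapes and workspace size illustrative):

```python
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Request the tensor-core decode kernels, intended to accelerate GQA decoding per #317.
# The rest of the begin_forward(...) / forward(q, kv_data) flow is unchanged.
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace, "NHD", use_tensor_cores=True
)
```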