v0.1.6
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
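A quick way to tell whether a GPU falls in the newly covered range is to compare its compute capability against (7, 5). The sketch below is a plain-Python illustration, not part of the FlashInfer API: the helper name `meets_minimum` is hypothetical, and treating sm75 as the lowest architecture covered by the wheels is an assumption based on this release note.

```python
from typing import Tuple

# Assumption: sm75 (compute capability 7.5) is the lowest architecture the
# pre-built wheels target as of 0.1.6, and support there is experimental.
MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def meets_minimum(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


turing_ok = meets_minimum((7, 5))  # e.g. Tesla T4 (sm75)
volta_ok = meets_minimum((7, 0))   # e.g. V100 (sm70)
ampere_ok = meets_minimum((8, 0))  # e.g. A100 (sm80)
```

With PyTorch installed, the tuple returned by `torch.cuda.get_device_capability()` can be passed directly to this helper.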
API Changes
plan/run
Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run API.
- forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's python API.
- begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API.
- end_forward is deprecated and has no effect after this PR.
There are some slight differences between the old forward and the new run API:
- All extra arguments such as causal and logits_soft_cap are now provided in the plan (previously begin_forward) API and cached until the next plan call, so the run API only needs the query and KV-Cache tensors.
The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
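The caching behaviour described above can be illustrated with a toy wrapper. This is a minimal sketch of the plan/run pattern only: `ToyAttentionWrapper` is a hypothetical stand-in, its argument lists are simplified, and it does not reproduce the signatures or internals of FlashInfer's wrapper classes.

```python
from typing import Any, Dict, List, Optional


class ToyAttentionWrapper:
    """Hypothetical stand-in that mimics the plan/run calling pattern."""

    def __init__(self) -> None:
        self._plan_args: Optional[Dict[str, Any]] = None

    def plan(self, causal: bool = False, logits_soft_cap: float = 0.0) -> None:
        # Extra arguments are stored once and reused by every run() call
        # until plan() is called again.
        self._plan_args = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q: List[float], kv_cache: List[float]) -> Dict[str, Any]:
        if self._plan_args is None:
            raise RuntimeError("plan() must be called before run()")
        # A real kernel would compute attention here; this toy only reports
        # which inputs and cached settings it would use.
        return {"num_queries": len(q), "num_kv": len(kv_cache), **self._plan_args}

    # Legacy names kept as thin aliases, mirroring the transition period.
    def begin_forward(self, *args: Any, **kwargs: Any) -> None:
        return self.plan(*args, **kwargs)

    def forward(self, q: List[float], kv_cache: List[float]) -> Dict[str, Any]:
        return self.run(q, kv_cache)

    def end_forward(self) -> None:
        return None  # intentionally has no effect


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True, logits_soft_cap=30.0)
first = wrapper.run([0.1, 0.2], [1.0, 2.0, 3.0])
second = wrapper.run([0.3], [1.0, 2.0, 3.0])      # reuses the cached plan arguments
legacy = wrapper.forward([0.3], [1.0, 2.0, 3.0])  # old name, same behaviour
```

The point of the pattern is that per-batch settings are given once in plan, while run is called repeatedly with only the tensors that change.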
MultiLevelCascadeAttentionWrapper
Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference,
which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.
See the documentation and tutorial for API usage and an explanation of the layout.
The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
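Cascade inference computes attention separately over each level of the KV-Cache (for example a shared prefix and a per-request suffix) and then merges the partial results. The following pure-Python sketch shows that merge step for a single query using log-sum-exp weights. It is an illustration of the math, not FlashInfer's kernel: the function names are hypothetical and the usual softmax scaling factor is omitted for brevity.

```python
import math
from typing import List, Sequence, Tuple

# An attention state: (output vector, log-sum-exp of the attention scores).
State = Tuple[List[float], float]


def attention_state(q: Sequence[float],
                    keys: Sequence[Sequence[float]],
                    values: Sequence[Sequence[float]]) -> State:
    """Softmax attention of one query over a KV segment, plus its log-sum-exp."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_states(a: State, b: State) -> State:
    """Combine two states computed over disjoint KV segments."""
    (o_a, lse_a), (o_b, lse_b) = a, b
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * x + w_b * y) / total for x, y in zip(o_a, o_b)]
    return merged, m + math.log(total)


q = [0.5, -1.0]
shared_k, shared_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]  # level 0
unique_k, unique_v = [[1.0, 1.0]], [[5.0, 6.0]]                          # level 1

merged_out, merged_lse = merge_states(
    attention_state(q, shared_k, shared_v),
    attention_state(q, unique_k, unique_v),
)
# Reference: attention over the concatenated KV segments in one pass.
full_out, full_lse = attention_state(q, shared_k + unique_k, shared_v + unique_v)
```

Because the merge is associative, the same step extends to more than two levels by folding the states one after another.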
Features
- sm75 support (#448, #449)
- add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
- add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
- support bmm fp8 (#469) (f1c0b68)
Refactor
- refactor: replace begin_forward/forward/end_forward with plan/run #466
Misc
Performance Improvements
- slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
- slight optimization on fragment layout swizzle (#458) (7c397cb)
- use persistent kernel for merging attention states (#459) (be6bf5b)
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for the API change suggestions, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.