This guide gives you a quick start on optimization in Transformers.

[torch.compile](./perf_torch_compile) reduces Python overhead, fuses operations, and creates kernels tuned for your shapes and hardware. The first run warms it up and subsequent runs use the faster compiled path.

Pass a [fixed size cache](./kv_cache#fixed-size-cache) to [`~GenerationMixin.generate`] to trigger `torch.compile` automatically.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative completion: the checkpoint, prompt, and generation settings are assumptions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
# A fixed-size ("static") cache lets generate compile the decoding forward pass.
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Avoid calling `torch.compile(model)` outside of [`~GenerationMixin.generate`] to prevent the model from recompiling every step.

## Attention backends
Alternative [attention backends](./attention_interface) like FlashAttention lower memory traffic. They tile attention computations and avoid large intermediate tensors to reduce memory footprint.
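
A rough sketch of selecting a backend at load time with the `attn_implementation` argument (the checkpoint, dtype, and backend choice here are illustrative; `flash_attention_2` additionally assumes the `flash-attn` package is installed):

```py
import torch
from transformers import AutoModelForCausalLM

# Select the attention backend at load time; "flash_attention_2" assumes the
# flash-attn package is installed (use "sdpa" to stay with the PyTorch default).
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
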
## Kernels

Kernels fuse operations to boost throughput and reduce memory usage. The [Kernels](https://huggingface.co/docs/kernels/en/index) library loads optimized compute kernels from the [Hub](https://huggingface.co/kernels-community) in a flexible and version-safe way.

The example below loads an optimized FlashAttention-2 kernel without installing the package.
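
A minimal sketch of what that can look like, assuming the `kernels` package is installed and the kernel is pulled from the `kernels-community/flash-attn` repository:

```py
import torch
from transformers import AutoModelForCausalLM

# Point attn_implementation at a kernel repository on the Hub; the compiled
# FlashAttention-2 kernel is downloaded and loaded at runtime, so the
# flash-attn package itself does not need to be installed locally.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
    attn_implementation="kernels-community/flash-attn",
)
```
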
## Caching

[Caching](./kv_cache) speeds up generation by reusing past keys and values instead of recomputing them for every token. To reduce the memory cost of storing past keys and values, Transformers supports offloading the cache to the CPU so that only the current layer's cache remains on the GPU.

Use the `cache_implementation` argument in [`~GenerationMixin.generate`] to set a cache strategy.
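
For instance, the sketch below (checkpoint and prompt are placeholders) requests the offloaded cache described above:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
# "offloaded" keeps past keys and values on the CPU, moving only the current layer's cache to the GPU.
outputs = model.generate(**inputs, cache_implementation="offloaded", max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```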