Commit dd98809

feedback
1 parent dc2cdd0 commit dd98809

2 files changed: 17 additions & 23 deletions


docs/source/en/_toctree.yml

Lines changed: 0 additions & 12 deletions
@@ -70,18 +70,6 @@
 - sections:
   - local: optimization_overview
     title: Overview
-  - local: perf_torch_compile
-    title: torch.compile
-  - local: perf_infer_gpu_one
-    title: GPU
-  - local: perf_infer_gpu_multi
-    title: Distributed inference
-  - local: perf_infer_cpu
-    title: CPU
-  - local: perplexity
-    title: Perplexity of fixed-length models
-  title: Generate API
-- sections:
   - local: attention_interface
     title: Attention backends
   - local: continuous_batching

docs/source/en/optimization_overview.md

Lines changed: 17 additions & 11 deletions
@@ -27,9 +27,9 @@ Use the table below to pick an optimization technique.
 |---|:---:|:---:|
 | [Compilation](#compilation) || |
 | [Attention backends](#attention-backends) |||
-| [Kernels](#kernels) || |
+| [Kernels](#kernels) || |
 | [Quantization](#quantization) |||
-| [Caching](#caching) || |
+| [Caching](#caching) || |
 | [Parallelism](#parallelism) || |
 | [Continuous batching](#continuous-batching) || |
 

@@ -39,16 +39,23 @@ This guide gives you a quick start on optimization in Transformers.
 
 [torch.compile](./perf_torch_compile) reduces Python overhead, fuses operations, and creates kernels tuned for your shapes and hardware. The first run warms it up and subsequent runs use the faster compiled path.
 
-Call `torch.compile()` on a model to enable it.
+Pass a [fixed size cache](./kv_cache#fixed-size-cache) to [`~GenerationMixin.generate`] to trigger `torch.compile` automatically.
 
 ```py
 import torch
-from transformers import AutoModelForCausalLM
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16, device_map="auto")
+input = tokenizer("The French Bread Law states", return_tensors="pt").to(model.device)
 
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
-compiled_model = torch.compile(model)
+output = model.generate(**input, do_sample=False, max_new_tokens=20, cache_implementation="static")
+tokenizer.batch_decode(output, skip_special_tokens=True)[0]
 ```
 
+> [!WARNING]
+> Avoid calling `torch.compile(model)` outside of [`~GenerationMixin.generate`] to prevent the model from recompiling every step.
+
 ## Attention backends
 
 Alternative [attention backends](./attention_interface) like FlashAttention lower memory traffic. They tile attention computations and avoid large intermediate tensors to reduce memory footprint.
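
The warmup behavior the new paragraph describes is easy to check. Below is a minimal sketch (not part of this commit) that reuses the Qwen/Qwen3-0.6B setup from the example above and times a few repeated `generate` calls; the first run pays the compilation cost, later runs take the compiled path.

```py
# Sketch only -- not part of this diff. Times repeated generate() calls to show
# the torch.compile warmup on the first run with a static cache.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16, device_map="auto")
inputs = tokenizer("The French Bread Law states", return_tensors="pt").to(model.device)

for step in range(3):
    start = time.perf_counter()
    model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
    print(f"run {step}: {time.perf_counter() - start:.2f}s")  # run 0 includes compilation
```
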
@@ -63,7 +70,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", attn_implementat
 
 ## Kernels
 
-Kernels fuse operations to boost throughput. The [Kernels](https://huggingface.co/docs/kernels/en/index) library loads optimized compute kernels from the [Hub](https://huggingface.co/kernels-community) in a flexible and version-safe way.
+Kernels fuse operations to boost throughput and reduce memory usage. The [Kernels](https://huggingface.co/docs/kernels/en/index) library loads optimized compute kernels from the [Hub](https://huggingface.co/kernels-community) in a flexible and version-safe way.
 
 The example below loads an optimized FlashAttention-2 kernel without installing the package.
 
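
The FlashAttention-2 loading example this paragraph points to sits outside the changed lines, so it does not appear in the hunk. For reference, a minimal sketch of that pattern, inferred from the `attn_implementation="kernels-community/flash-attn2"` value visible in the context lines of the later hunks:

```py
# Sketch only -- inferred from context lines elsewhere in this diff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    dtype=torch.float16,
    device_map="auto",
    # Loads the FlashAttention-2 kernel from the kernels-community Hub repo
    # instead of requiring a local flash-attn installation.
    attn_implementation="kernels-community/flash-attn2",
)
```
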

@@ -95,9 +102,8 @@ model = AutoModelForCausalLM.from_pretrained(
 
 ## Caching
 
-[Caching](./kv_cache) increases speed by reusing past keys and values instead of recomputing them for every token. All Transformers models use a [`DynamicCache`] by default to allow the cache to grow proportionally with decoding.
-
-Pick a caching strategy that fits your use case. If you want maximum speed, consider a [`StaticCache`]. A [`StaticCache`] is a fixed-size cache compatible with [torch.compile](#compilation).
+[Caching](./kv_cache) speeds up generation by reusing past keys and values instead of recomputing them for every token. To offset and reduce the memory cost of storing past keys and values, Transformers
+supports offloading the cache to the CPU. Only the current layer remains on the GPU.
 
 Use the `cache_implementation` argument in [`~GenerationMixin.generate`] to set a cache strategy.
 

@@ -110,7 +116,7 @@ model = AutoModelForCausalLM.from_pretrained(
     "Qwen/Qwen3-0.6B", attn_implementation="kernels-community/flash-attn2"
 )
 inputs = tokenizer("The Le Décret Pain states that a baguette must,", return_tensors="pt")
-outputs = model.generate(**inputs, do_sample=False, max_new_tokens=50, cache_implementation="static")
+outputs = model.generate(**inputs, do_sample=False, max_new_tokens=50, cache_implementation="offloaded")
 ```
 
 ## Parallelism
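
Because the `from_pretrained` call, the prompt, and the `generate` call are split across the last two hunks, here is the complete caching example as it reads after this change, reconstructed for readability (the final `print` is added here only for illustration):

```py
# Reconstructed from the fragments in the last two hunks. With
# cache_implementation="offloaded", past keys/values are kept on the CPU and
# only the current layer's cache sits on the GPU during decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", attn_implementation="kernels-community/flash-attn2"
)
inputs = tokenizer("The Le Décret Pain states that a baguette must,", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=50, cache_implementation="offloaded")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```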
