Feature request
Across the codebase there are places where tensor ops like .item() and .nonzero() are used. The PyTorch docs state that these operations cause a host↔device synchronization when the tensor is on the GPU. This can significantly hurt performance by blocking the CPU while long GPU kernels run.
For reference, one such instance, which has been fixed in #42433, is given below:

```python
cache_position: torch.Tensor = torch.arange(
    past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
)
```
past_key_values is an instance of CacheLayerMixin. If someone is using StaticLayer, the get_seq_len() method computes its return value with tensor operations, so it returns a 0-d tensor, and hence past_seen_tokens is a 0-d tensor. But torch.arange() expects a Number for its start and end arguments, so PyTorch copies the value to the CPU to build the arange tensor. This amounts to an implicit call to .item().
If past_seen_tokens is on the GPU, this triggers a cudaStreamSynchronize, which can block the CPU while a large kernel is running on the GPU. A profiler trace of this call demonstrates the sync. Additionally, this also appeared to cause graph breaks when used with torch.compile().
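As a minimal, hedged sketch of how this implicit sync can be surfaced (the 0-d tensor value and sequence length here are made-up stand-ins, and a CUDA device is assumed):

```python
import torch

# Made-up stand-in for what StaticLayer's get_seq_len() would return:
# a 0-d tensor living on the GPU.
past_seen_tokens = torch.tensor(8, device="cuda")
seq_len = 4

# Warn whenever an op implicitly synchronizes with the device.
torch.cuda.set_sync_debug_mode("warn")

# torch.arange expects Numbers for start/end, so the 0-d tensor is implicitly
# .item()-ed, which issues a cudaStreamSynchronize under the hood.
cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_len, device="cuda")

torch.cuda.set_sync_debug_mode("default")
```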
The fix is the change given below:
```python
cache_position: torch.Tensor = torch.arange(
    inputs_embeds.shape[1], device=inputs_embeds.device
) + past_seen_tokens
```
Now the arange call gets a proper int to work with, and past_seen_tokens is added to the resulting tensor; assuming both live on the GPU, no sync is needed. This is also torch.compile() friendly, as required for StaticLayer.
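To sanity-check the fixed pattern, a sketch like the following can be used (again a CUDA device is assumed, and the shapes are made up); set_sync_debug_mode("error") makes PyTorch raise if any implicit sync sneaks in:

```python
import torch

# Made-up stand-ins mirroring the snippet above.
past_seen_tokens = torch.tensor(8, device="cuda")  # 0-d tensor on the GPU
inputs_embeds = torch.randn(1, 4, 16, device="cuda")

# Raise an error on any implicit host<->device synchronization.
torch.cuda.set_sync_debug_mode("error")

# arange gets a plain Python int; the tensor-tensor add happens on the GPU,
# so no device->host copy (and no cudaStreamSynchronize) is needed.
cache_position = torch.arange(
    inputs_embeds.shape[1], device=inputs_embeds.device
) + past_seen_tokens

torch.cuda.set_sync_debug_mode("default")
```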
Another reference: PR #40060.
What to look for to identify similar instances:
- For torch ops that require Number/float/int inputs, check whether the input can be a tensor.
- Explicit calls to `.item()`, `.nonzero()`, or similar operations that can cause a sync. Check the PyTorch docs for more such ops.
- Looking at profiles can help identify such instances (see the profiler sketch after this list).
- `torch.cuda.set_sync_debug_mode("warn")` might also help find more such instances.
- If the issue is present in modeling files, be sure to run `make fix-copies` to apply the changes to other similar files after confirming the changes with the maintainers.
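As a hedged illustration of the profiling approach (the ops below are arbitrary examples, and a CUDA device is assumed), sync-causing calls such as aten::nonzero and aten::item show up in the recorded trace alongside cudaStreamSynchronize entries:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    idx = x.nonzero()     # output size is data-dependent, so the CPU must wait for the GPU
    val = x.sum().item()  # explicit device->host copy, another sync point

# Sync-heavy ops stand out in the summary table (or in an exported chrome trace).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```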
NOTE: While the above reference instances could be fixed with minor refactoring, others might need larger-scale changes, so it is better to consult with the maintainers before putting effort into such changes.