Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing #30667

storyicon · 2025-12-15T04:38:03Z

Purpose

Add support for GPU tensors in the tensor_data function, enabling proper functionality of GPU-accelerated multimodal preprocessing.

Background

In our practical deployment, we enabled GPU-accelerated multimodal preprocessing by utilizing the following configurations, thereby moving tasks such as image and video preprocessing to the GPU. This significantly reduces CPU overhead in high-concurrency scenarios:

CLI argument: --mm-processor-kwargs '{"device": "cuda"}'
Model config: Setting "device": "cuda" in preprocessor_config.json

However, the current implementation of the tensor_data() function in vllm/v1/utils.py fails to handle tensors residing on the GPU, causing errors when GPU preprocessing is enabled.

Problem

The tensor_data() function is used for tensor serialization and hashing, particularly in multimodal input processing. The current implementation directly calls .numpy() on tensors:

return tensor.flatten().contiguous().view(torch.uint8).numpy().data

This fails for GPU tensors because PyTorch's .numpy() method only supports CPU tensors, raising:

 TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Solution

Add .cpu() call before .numpy() to handle both CPU and GPU tensors:

return tensor.flatten().contiguous().view(torch.uint8).cpu().numpy().data

Performance Impact:

CPU tensors: .cpu() is a no-op, no performance impact
GPU tensors: Necessary device-to-memory transfer (same as what would be needed anyway for serialization)

This change is critical for multimodal models when GPU preprocessing is enabled, as tensors may reside on GPU devices.

Test Plan

Unit Tests

import torch
from vllm.v1.utils import tensor_data

def test_tensor_data_cpu():
    tensor = torch.randn(10, 20)
    result = tensor_data(tensor)
    assert isinstance(result, memoryview)
    assert len(result) == tensor.numel() * tensor.element_size()
    print("CPU tensor test passed")

def test_tensor_data_gpu():
    if not torch.cuda.is_available():
        print("GPU not available, skipping GPU test")
        return
    tensor = torch.randn(10, 20).cuda()
    result = tensor_data(tensor)
    assert isinstance(result, memoryview)
    assert len(result) == tensor.numel() * tensor.element_size()
    print("GPU tensor test passed")

def test_tensor_data_various_dtypes():
    dtypes = [torch.float32, torch.float16, torch.int32, torch.int64]
    for dtype in dtypes:
        tensor = torch.randn(5, 5).to(dtype)
        result = tensor_data(tensor)
        assert isinstance(result, memoryview)
    print("Various dtypes test passed")

# Run tests
test_tensor_data_cpu()
test_tensor_data_gpu()
test_tensor_data_various_dtypes()

Integration Test

Test with actual multimodal preprocessing using GPU device:

# Start vLLM with GPU preprocessing
vllm serve <multimodal-model> \
  --mm-processor-kwargs '{"device": "cuda"}'
  
# Send multimodal inference request
# Should not raise TypeError during tensor serialization

Test Result

Before Fix

CPU tensors: Works correctly
GPU tensors: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Impact: GPU-accelerated preprocessing cannot be used

After Fix

CPU tensors: Works correctly (no performance regression)
GPU tensors: Works correctly with automatic device-to-memory transfer
Impact: GPU-accelerated preprocessing fully functional

Test Output:

CPU tensor test passed
GPU tensor test passed
Various dtypes test passed

Affected Code Paths

The tensor_data() function is called in:

vllm/v1/serial_utils.py - Tensor encoding for serialization
vllm/v1/core/kv_cache_utils.py - Block prompt embeddings hashing
Various test files - Unit testing

All these code paths now work correctly with GPU tensors when GPU preprocessing is enabled.

Documentation Updates

Updated docstring in vllm/v1/utils.py:tensor_data() to clarify:

Function now supports both CPU and GPU tensors
Device-to-memory transfer behavior for GPU tensors
No-op behavior for CPU tensors

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…nversion Signed-off-by: storyicon <[email protected]>

chatgpt-codex-connector · 2025-12-15T04:38:11Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

github-actions · 2025-12-15T04:38:11Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request effectively addresses a critical issue where tensor_data() failed to process GPU tensors due to PyTorch's .numpy() method requiring CPU tensors. The solution, adding a .cpu() call before .numpy(), is correct and ensures compatibility with both CPU and GPU tensors, enabling GPU-accelerated multimodal preprocessing. The updated docstring clearly explains the behavior for both CPU and GPU tensors, including the necessary device-to-memory transfer for GPU tensors and the no-op for CPU tensors. The change is well-justified and includes a comprehensive test plan and results.

DarkLight1337 · 2025-12-15T05:46:22Z

I actually have a PR for that: #22070

The reason why this hasn't been merged into main branch is that it goes against the design of GPU memory management being done inside the engine core. Let me rebase the PR to keep it up-to-date.

storyicon · 2025-12-16T03:38:37Z

This offers significant benefits for high-concurrency multimodal scenarios. The implementation in #22070 appears to be more complete.

fix(tensor_data): handle GPU tensors by adding .cpu() before numpy co…

6201754

…nversion Signed-off-by: storyicon <[email protected]>

mergify bot added the v1 label Dec 15, 2025

gemini-code-assist bot reviewed Dec 15, 2025

View reviewed changes

storyicon closed this Dec 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing #30667

Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing #30667

storyicon commented Dec 15, 2025 •

edited by github-actions bot

Loading

Uh oh!

chatgpt-codex-connector bot commented Dec 15, 2025

Uh oh!

github-actions bot commented Dec 15, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

DarkLight1337 commented Dec 15, 2025 •

edited

Loading

Uh oh!

storyicon commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing #30667

Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing #30667

Conversation

storyicon commented Dec 15, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Background

Problem

Solution

Test Plan

Unit Tests

Integration Test

Test Result

Before Fix

After Fix

Affected Code Paths

Documentation Updates

Uh oh!

chatgpt-codex-connector bot commented Dec 15, 2025

Uh oh!

github-actions bot commented Dec 15, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

DarkLight1337 commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

storyicon commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

storyicon commented Dec 15, 2025 •

edited by github-actions bot

Loading

DarkLight1337 commented Dec 15, 2025 •

edited

Loading