Conversation

@eicherseiji eicherseiji commented Sep 4, 2025

Purpose

Test Plan

Run Ray Serve LLM app:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 2,
            "max_replicas": 2,
        },
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Test Result

Curl the Prometheus endpoint, check that ReplicaId tag is included:

(base) ray@ip-10-1-113-171:~/default/work/ray$ curl localhost:8085 | grep vllm_time_to
...
ray_vllm_time_to_first_token_seconds_count{Component="core_worker",NodeAddress="10.1.113.171",ReplicaId="3r4e76ag",SessionName="session_2025-12-01_11-05-34_460418_3579",Version="2.52.0",WorkerId="6f8b9bc4ba0fda1ef6769ce262129be2b522f90a31a593ec37379267",engine="0",model_name="qwen-0.5b"} 266.0
ray_vllm_time_to_first_token_seconds_sum{Component="core_worker",NodeAddress="10.1.113.171",ReplicaId="3r4e76ag",SessionName="session_2025-12-01_11-05-34_460418_3579",Version="2.52.0",WorkerId="6f8b9bc4ba0fda1ef6769ce262129be2b522f90a31a593ec37379267",engine="0",model_name="qwen-0.5b"} 2.077499999999997
ray_vllm_time_to_first_token_seconds_bucket{Component="core_worker",NodeAddress="10.1.113.171",ReplicaId="r7owse5i",SessionName="session_2025-12-01_11-05-34_460418_3579",Version="2.52.0",WorkerId="2a2948dc4ab2c93c9731b7fb1fc0d2ec4c0dd33794454bacf527fa53",engine="0",le="0.001",model_name="qwen-0.5b"} 0.0
...
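The manual grep check above can also be scripted: parse the scraped Prometheus text and assert every TTFT sample carries a `ReplicaId` label. A minimal sketch; the embedded sample lines are abridged copies of the output above, and with `min_replicas: 2` a full scrape would show two distinct ids.

```python
import re

# Abridged sample lines from the scrape above (most labels omitted).
SCRAPE = """\
ray_vllm_time_to_first_token_seconds_count{Component="core_worker",ReplicaId="3r4e76ag",model_name="qwen-0.5b"} 266.0
ray_vllm_time_to_first_token_seconds_sum{Component="core_worker",ReplicaId="3r4e76ag",model_name="qwen-0.5b"} 2.0774
"""

def replica_ids(scrape: str, metric_prefix: str) -> set[str]:
    """Collect the ReplicaId label value from every matching sample line."""
    ids = set()
    for line in scrape.splitlines():
        if not line.startswith(metric_prefix):
            continue
        m = re.search(r'ReplicaId="([^"]+)"', line)
        # Fail loudly if any sample is missing the tag.
        assert m is not None, f"missing ReplicaId: {line}"
        ids.add(m.group(1))
    return ids

print(replica_ids(SCRAPE, "ray_vllm_time_to_first_token_seconds"))
```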
(Screenshot: 2025-12-01 at 2:30:57 PM)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting a before/after results comparison, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Sep 4, 2025
Signed-off-by: Seiji Eicher <[email protected]>
@eicherseiji eicherseiji marked this pull request as ready for review December 1, 2025 22:34
@eicherseiji eicherseiji requested a review from markmc as a code owner December 1, 2025 22:34
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Signed-off-by: Seiji Eicher <[email protected]>
@ruisearch42 ruisearch42 left a comment (Collaborator)

Otherwise LGTM


self.metric.set_default_tags(labelskwargs)
@staticmethod
def _get_tag_keys(labelnames: list[str] | None) -> tuple[str, ...]:
ruisearch42 (Collaborator):
can we call it label_names?

eicherseiji (Contributor, Author):

The variable is called labelnames in the Prometheus class we are trying to emulate, unfortunately:

labelnames=["position", "model", "engine_index"],

So it's not safe to rename, since existing call sites pass it as a keyword argument.
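The constraint can be seen in a toy example: when callers pass `labelnames` by keyword, the parameter name is part of the API, and renaming it breaks them. This is a sketch; `make_metric` is a hypothetical stand-in for the real metric constructor, not code from the PR.

```python
def make_metric(labelnames=None):
    # Emulates the prometheus_client-style signature. Renaming this
    # parameter to label_names would break every keyword caller below
    # with a TypeError, even though positional callers would still work.
    return tuple(labelnames or ())

# Existing call sites pass the argument by keyword, so the name is load-bearing.
keys = make_metric(labelnames=["position", "model", "engine_index"])
print(keys)
```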

labels.append("ReplicaId")
return tuple(labels)

def labels(self, *labels, **labelskwargs):
ruisearch42 (Collaborator):

not sure why it's named labelskwargs in the first place, but a bit less readable. Can we rename to labels_kwargs?

eicherseiji (Contributor, Author):

Also named labelkwargs in the prometheus_client package.

    def labels(self: T, *labelvalues: Any, **labelkwargs: Any) -> T:

f"Expected {len(self.metric._tag_keys)}, got {len(labels)}"
f"Expected {expected}, got {len(labels)}"
)
labelskwargs.update(zip(self.metric._tag_keys, labels))
ruisearch42 (Collaborator):

this modifies the original dict? should we make a copy?

eicherseiji (Contributor, Author):

In my understanding, **labelskwargs is bound to a fresh dict on every call, so we are not modifying any caller data.
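The point about `**labelskwargs` can be verified directly: Python binds a `**` parameter to a new dict on each call, so mutating it inside the function leaves the caller's dict untouched. A minimal demo (the `coerce` function is illustrative, not the PR's code):

```python
def coerce(**labelskwargs):
    # Mutates only the per-call dict that ** created, never caller state.
    for k, v in labelskwargs.items():
        if not isinstance(v, str):
            labelskwargs[k] = str(v)
    return labelskwargs

caller_tags = {"engine": 0}
result = coerce(**caller_tags)
print(result)       # values coerced to str in the returned dict
print(caller_tags)  # caller's dict is unchanged
```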

if labelskwargs:
for k, v in labelskwargs.items():
if not isinstance(v, str):
labelskwargs[k] = str(v)
ruisearch42 (Collaborator):

ditto, should we modify the copy rather than the original?

eicherseiji (Contributor, Author):

(as above) I think **labelskwargs will be a fresh dict on each function call, so we are not modifying any caller data.
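Putting the two threads together, the `labels()` wrapper under review roughly does the following. This is a reconstruction from the diff fragments quoted above, with the metric plumbing stubbed out; `FakeMetric` and `LabelsWrapper` are stand-ins, not the actual vLLM classes.

```python
class FakeMetric:
    """Stand-in exposing only the piece the wrapper touches."""
    def __init__(self, tag_keys):
        self._tag_keys = tag_keys

class LabelsWrapper:
    def __init__(self, metric):
        self.metric = metric

    def labels(self, *labels, **labelskwargs):
        if labels:
            # ReplicaId is appended by _get_tag_keys, so positional
            # callers supply one fewer value than there are tag keys.
            expected = len(self.metric._tag_keys) - 1
            if len(labels) != expected:
                raise ValueError(f"Expected {expected}, got {len(labels)}")
            labelskwargs.update(zip(self.metric._tag_keys, labels))
        # Ray tags must be strings; coerce values in the per-call dict.
        for k, v in labelskwargs.items():
            if not isinstance(v, str):
                labelskwargs[k] = str(v)
        return labelskwargs

w = LabelsWrapper(FakeMetric(("model", "engine", "ReplicaId")))
print(w.labels("qwen-0.5b", 0))
```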

Comment on lines +44 to +45
expected = len(self.metric._tag_keys) - 1
if len(labels) != expected:
ruisearch42 (Collaborator):

could there be a backwards compatibility issue? say an old version of ray/vllm is used?

eicherseiji (Contributor, Author):

Hm there is no corresponding change in Ray for this to be (in)compatible with. This change is self-contained within vLLM.

@staticmethod
def _get_tag_keys(labelnames: list[str] | None) -> tuple[str, ...]:
labels = list(labelnames) if labelnames else []
labels.append("ReplicaId")
ruisearch42 (Collaborator):

Can you add a comment here to make the intention explicit?

eicherseiji (Contributor, Author):

Done

@eicherseiji eicherseiji left a comment (Contributor, Author)

Thanks for the review @ruisearch42!



@ruisearch42 ruisearch42 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 2, 2025
@ruisearch42 ruisearch42 enabled auto-merge (squash) December 2, 2025 01:11
@ruisearch42 ruisearch42 merged commit 22274b2 into vllm-project:main Dec 2, 2025
47 checks passed
xbfs pushed a commit to xbfs/vllm that referenced this pull request Dec 5, 2025
Signed-off-by: Seiji Eicher <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Signed-off-by: Bofeng BF1 Xue <[email protected]>
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
Signed-off-by: Seiji Eicher <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Signed-off-by: Xingyu Liu <[email protected]>