@harshit-anyscale
Contributor

  • The test_target_capacity test is failing on Windows, possibly because the 10-second timeout is too short. This PR increases the timeout to 30 seconds to check whether the timeout is the cause.

Signed-off-by: harshit <[email protected]>
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner November 6, 2025 05:11
@harshit-anyscale harshit-anyscale self-assigned this Nov 6, 2025
@harshit-anyscale harshit-anyscale added the go (add ONLY when ready to merge, run all tests) label Nov 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request increases the timeout for a test in test_initial_replica_tests to address a potential flakiness issue on Windows. The change is straightforward and reasonable. I've added a suggestion to use a named constant for the timeout value to improve code maintainability.

deployment_name: int(initial_replicas * config_target_capacity / 100)
},
app_name="app1",
timeout=30,

medium

To improve readability and maintainability, it's better to define this timeout value as a named constant at the top of the file or test class, for example INITIAL_REPLICA_TEST_TIMEOUT_S = 30. This makes it easier to understand the purpose of the timeout and to adjust it in the future if needed, especially since other timeouts are used in this file.

Suggested change
timeout=30,
timeout=30, # Consider defining this as a constant, e.g., INITIAL_REPLICA_TEST_TIMEOUT_S
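
A minimal sketch of the suggestion above. The `wait_for_condition` below is a stand-in poller written for this example, not Ray's implementation, and `INITIAL_REPLICA_TEST_TIMEOUT_S` is the hypothetical name proposed in the review comment. Passing the same constant to every wait keeps the timeouts in one test consistent.

```python
import time

# Hypothetical constant name taken from the review suggestion.
INITIAL_REPLICA_TEST_TIMEOUT_S = 30


def wait_for_condition(predicate, timeout=10, retry_interval_ms=100, **kwargs):
    """Stand-in poller: call predicate(**kwargs) until it is truthy or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate(**kwargs):
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(f"Condition not met within {timeout} seconds")
        time.sleep(retry_interval_ms / 1000)


def make_flaky_check(ready_after_calls):
    """Return a predicate that only succeeds after a given number of calls."""
    calls = {"n": 0}

    def check(app_name):
        calls["n"] += 1
        return calls["n"] >= ready_after_calls

    return check


# Every wait in the test shares the same named timeout.
wait_for_condition(
    make_flaky_check(3), app_name="app1", timeout=INITIAL_REPLICA_TEST_TIMEOUT_S
)
wait_for_condition(
    make_flaky_check(2), app_name="app2", timeout=INITIAL_REPLICA_TEST_TIMEOUT_S
)
```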


@cursor cursor bot left a comment

Bug: Inconsistent Timeout Updates in Replicas Test Suite

Inconsistent timeout handling in test_initial_replicas_new_configs. The PR aims to increase timeout to 30 seconds to address Windows test failures, but only the first wait_for_condition call (line 1078) was updated. Two subsequent wait_for_condition calls with check_expected_num_replicas (lines 1103-1109 for "app1" and lines 1110-1116 for "app2") still use the default 10-second timeout. These calls are checking similar replica scaling conditions and are likely to experience the same timeout issues on Windows, making the fix incomplete.

python/ray/serve/tests/test_target_capacity.py#L1102-L1116

client.deploy_apps(new_config)
wait_for_condition(
    lambda: serve.status().target_capacity == new_config_target_capacity
)
wait_for_condition(
    check_expected_num_replicas,
    deployment_to_num_replicas={
        deployment_name: int(
            initial_replicas * new_config_target_capacity / 100
        )
    },
    app_name="app1",
)
wait_for_condition(
    check_expected_num_replicas,


@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and core (Issues that should be addressed in Ray Core) labels Nov 6, 2025
@zcin
Contributor

zcin commented Nov 6, 2025

@harshit-anyscale did you run the Windows test to verify the fix?

@harshit-anyscale
Contributor Author

> @harshit-anyscale did you run the windows test to verify the fix

Not yet. This seems like a brute-force solution to me: the status we were getting was Deploying, but what we want is Running, so I thought of increasing the timeout first to make the test less flaky and less problematic for others. If this works, I will do the RCA and take further steps.

This is meant as a short-term solution. Let me know if that's okay; otherwise I will run the Windows test locally first.

deployment_name: int(initial_replicas * config_target_capacity / 100)
},
app_name="app1",
timeout=30,

Bug: Incomplete timeout propagation in test retries

The timeout increase to 30 seconds is only applied to the first wait_for_condition call in test_initial_replicas_new_configs, but two similar calls later in the same test (around lines 1103 and 1111) still use the default 10-second timeout. This incomplete fix means the test can still fail on Windows due to timeouts in those later assertions, defeating the purpose of this PR.

@zcin zcin merged commit 3f7a7b4 into master Nov 10, 2025
6 checks passed
@zcin zcin deleted the increase-timeout-for-wait-condition-v2 branch November 10, 2025 18:15
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 13, 2025
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 14:31:31 2025 -0800

    [doc] remove python 3.12 in doc building (#58572)

    unifying to python 3.10

    Signed-off-by: Lonnie Liu <[email protected]>

commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <[email protected]>
Date:   Thu Nov 13 17:27:23 2025 -0500

    Add support for RAY_AUTH_MODE=k8s  (#58497)

    This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
    will delegate authentication and authorization of Ray access to
    Kubernetes TokenReview and SubjectAccessReview APIs.

    ---------

    Signed-off-by: Andrew Sy Kim <[email protected]>

commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 14:08:29 2025 -0800

    [serve] fix serve dashboard metric name (#58573)

    Prometheus automatically appends the `_total` suffix to all Counter
    metrics. Ray has historically supported counter metrics both with and
    without the `_total` suffix for backward compatibility, but it is now
    time to drop that support (2 years since the warning was added).

    There is one place in the Ray Serve dashboard that still doesn't use the
    `_total` suffix, so this PR fixes it.

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>
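
As a rough illustration of the naming rule described in the commit above (#58573), the sketch below normalizes a Counter name so that it ends in exactly one `_total` suffix. `counter_series_name` is a hypothetical helper written for this example, not a Ray or Prometheus API, and the metric name in the usage note is made up.

```python
def counter_series_name(metric_name: str) -> str:
    """Return the metric name with exactly one `_total` suffix."""
    suffix = "_total"
    if metric_name.endswith(suffix):
        # Already suffixed; appending again would produce `_total_total`.
        return metric_name
    return metric_name + suffix
```

A dashboard query written against `counter_series_name("example_requests")` would then reference `example_requests_total`, whether or not the metric was declared with the suffix.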

commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <[email protected]>
Date:   Thu Nov 13 13:33:33 2025 -0800

    [Serve.LLM] Add avg prompt length metric (#58599)
    Add avg prompt length metric

    When using uniform prompt length (especially in testing), the P50 and
    P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
    Average prompt length provides another useful dimension to look at and
    validate.

    For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
    9400, and avg accurately shows 5000.

    <img width="1186" height="466" alt="image"
    src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
    />

    ---------

    Signed-off-by: Rui Qiao <[email protected]>
    Signed-off-by: Rui Qiao <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
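
The skew described in the commit above (#58599) can be sketched with a bucketed quantile estimate that interpolates linearly inside a bucket, in the style of Prometheus' `histogram_quantile`. Everything here is an assumption made for illustration: the bucket bounds, the sample data (100 prompts of 5001 tokens each), and the helper names. The estimates come out near, but not equal to, the figures quoted in the commit message.

```python
from bisect import bisect_left


def one_two_five_bounds(max_bound: int) -> list:
    """Bucket upper bounds 1, 2, 5, 10, 20, 50, ... up to max_bound."""
    bounds = []
    magnitude = 1
    while magnitude <= max_bound:
        for mantissa in (1, 2, 5):
            bound = mantissa * magnitude
            if bound <= max_bound:
                bounds.append(bound)
        magnitude *= 10
    return bounds


def histogram_quantile(q: float, bounds: list, values: list) -> float:
    """Estimate a quantile by linear interpolation inside the bucket that holds it.

    Assumes every value is <= bounds[-1].
    """
    counts = [0] * len(bounds)
    for value in values:
        # A value belongs to the first bucket whose upper bound is >= value.
        counts[bisect_left(bounds, value)] += 1
    rank = q * len(values)
    cumulative = 0
    for index, count in enumerate(counts):
        if count and cumulative + count >= rank:
            lower = bounds[index - 1] if index > 0 else 0
            fraction = (rank - cumulative) / count
            return lower + (bounds[index] - lower) * fraction
        cumulative += count
    return float(bounds[-1])


bounds = one_two_five_bounds(100_000)
prompt_lengths = [5001] * 100  # hypothetical: every prompt lands in the (5000, 10000] bucket

p50 = histogram_quantile(0.5, bounds, prompt_lengths)
p90 = histogram_quantile(0.9, bounds, prompt_lengths)
average = sum(prompt_lengths) / len(prompt_lengths)
print(p50, p90, average)
```

Because all observations fall into a single wide bucket, the interpolated quantiles spread across that bucket's range, while the average stays at the true value.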

commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <[email protected]>
Date:   Thu Nov 13 12:42:49 2025 -0800

    [release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)

    allowing for py3.13 images (cpu & cu123) in release tests

    Signed-off-by: elliot-barn <[email protected]>

commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <[email protected]>
Date:   Thu Nov 13 12:26:10 2025 -0800

    [Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
    As title suggests

    Signed-off-by: Goutam <[email protected]>

commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:18:38 2025 -0800

    [core] fix get_metric_check_condition tests (#58598)

    Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
    which is a non-flaky version of `fetch_prometheus`. Update all test
    usages accordingly.

    Test:
    - CI

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:14:04 2025 -0800

    [core] add an option to disable otel sdk error logs (#58257)

    Currently, Ray metrics and events are exported through a centralized
    process called the Dashboard Agent. This process functions as a gRPC
    server, receiving data from all other components (GCS, Raylet, workers,
    etc.). However, during a node shutdown, the Dashboard Agent may
    terminate before the other components, resulting in gRPC errors and
    potential loss of metrics and events.

    When this happens, the otel sdk logs become very noisy. Add a default
    option to disable otel sdk logs to avoid confusion.

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <[email protected]>
Date:   Thu Nov 13 11:41:29 2025 -0800

    [1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)

    2. **Extracted generic `RankManager` class** - Created reusable rank
    management logic separated from deployment-specific concerns

    3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
    replacing raw integers

    4. **Simplified error handling** - not supporting self healing

    5. **Updated tests** - Refactored unit tests to use new API and removed
    flag-dependent test cases

    **Impact:**
    - Cleaner separation of concerns in rank management
    - Foundation for future multi-level rank support

    Next PR https://github.com/ray-project/ray/pull/58473

    ---------

    Signed-off-by: abrar <[email protected]>

commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <[email protected]>
Date:   Thu Nov 13 11:21:50 2025 -0800

    [Core] Refactor reference_counter out of memory store and plasma store (#57590)

    As discovered in the [PR to better define the interface for reference
    counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
    plasma store provider and memory store both share thin dependencies on
    reference counter that can be refactored out. This will reduce
    entanglement in our code base and improve maintainability.

    The main logic changes are located in
    * src/ray/core_worker/store_provider/plasma_store_provider.cc, where
    reference counter related logic is refactored into core worker
    * src/ray/core_worker/core_worker.cc, where factored out reference
    counter logic is resolved
    * src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
    logic related to reference counter has either been removed due to the
    fact that it is tech debt or refactored into caller functions.

    Microbenchmark:
    ```
    single client get calls (Plasma Store) per second 10592.56 +- 535.86
    single client put calls (Plasma Store) per second 4908.72 +- 41.55
    multi client put calls (Plasma Store) per second 14260.79 +- 265.48
    single client put gigabytes per second 11.92 +- 10.21
    single client tasks and get batch per second 8.33 +- 0.19
    multi client put gigabytes per second 32.09 +- 1.63
    single client get object containing 10k refs per second 13.38 +- 0.13
    single client wait 1k refs per second 5.04 +- 0.05
    single client tasks sync per second 960.45 +- 15.76
    single client tasks async per second 7955.16 +- 195.97
    multi client tasks async per second 17724.1 +- 856.8
    1:1 actor calls sync per second 2251.22 +- 63.93
    1:1 actor calls async per second 9342.91 +- 614.74
    1:1 actor calls concurrent per second 6427.29 +- 50.3
    1:n actor calls async per second 8221.63 +- 167.83
    n:n actor calls async per second 22876.04 +- 436.98
    n:n actor calls with arg async per second 3531.21 +- 39.38
    1:1 async-actor calls sync per second 1581.31 +- 34.01
    1:1 async-actor calls async per second 5651.2 +- 222.21
    1:1 async-actor calls with args async per second 3618.34 +- 76.02
    1:n async-actor calls async per second 7379.2 +- 144.83
    n:n async-actor calls async per second 19768.79 +- 211.95
    ```
    This PR mainly makes logic changes to the `ray.get` call chain. As we
    can see from the benchmark above, the single client get calls performance
    matches pre-regression levels.

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:49:39 2025 +0530

    [Core] Support get-auth-token cli command  (#58566)

    add support for `ray get-auth-token` cli command + test

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:37:23 2025 +0530

    [Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)

    Migrates Ray dashboard authentication from JavaScript-managed cookies to
    server-side HttpOnly cookies to enhance security against XSS attacks.
    This addresses code review feedback to improve the authentication
    implementation (https://github.com/ray-project/ray/pull/58368)

    main changes:
    - authentication middleware first looks for `Authorization` header, if
    not found it then looks at cookies to look for the auth token
    - new `api/authenticate` endpoint for verifying token and setting the
    auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
    `secure=true` (when using https))
    - removed javascript based cookie manipulation utils and axios
    interceptors (were previously responsible for setting cookies)
    - cookies are deleted when connecting to a cluster with
    `AUTH_MODE=disabled`. connecting to a different ray cluster (with
    different auth token) using the same endpoint (eg due to port-forwarding
    or local testing) will reshow the popup and ask users to input the right
    token.

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>
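
The lookup order and cookie attributes listed in the commit above (#58591) could be sketched as follows, using only the standard library. The cookie name, the `Bearer` scheme, and both function names are assumptions made for this example; the actual middleware may differ.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional

AUTH_COOKIE_NAME = "ray-auth-token"  # hypothetical cookie name


def extract_token(
    headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Prefer the Authorization header; fall back to the auth cookie."""
    auth_header = headers.get("Authorization", "")
    if auth_header.startswith("Bearer "):  # assumed scheme
        return auth_header[len("Bearer "):]
    return cookies.get(AUTH_COOKIE_NAME)


def build_set_cookie(token: str, is_https: bool) -> str:
    """Build a Set-Cookie value with HttpOnly, SameSite=Strict and, on HTTPS, Secure."""
    cookie = SimpleCookie()
    cookie[AUTH_COOKIE_NAME] = token
    morsel = cookie[AUTH_COOKIE_NAME]
    morsel["httponly"] = True
    morsel["samesite"] = "Strict"
    morsel["path"] = "/"
    if is_https:
        morsel["secure"] = True
    return morsel.OutputString()
```

Because the cookie is marked `HttpOnly`, page scripts cannot read it, which is the XSS hardening the commit message refers to.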

commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:41:57 2025 -0800

    [ci] doc check: remove dependency on `ray_ci` (#58516)

    this makes it possible to run on a different python version than the CI
    wrapper code.

    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: Lonnie Liu <[email protected]>

commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:38 2025 -0800

    [ci] mark github.Repository as typechecking (#58582)

    so that importing test.py does not always import github

    github repo imports jwt, which then imports cryptography and can lead to
    issues on windows.

    Signed-off-by: Lonnie Liu <[email protected]>
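
The pattern named in the commit above (#58582) is the standard `typing.TYPE_CHECKING` guard. The sketch below uses the standard-library `fractions` module as a stand-in for the heavy third-party import, and `describe` is a hypothetical function; neither comes from the Ray code base.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, never at runtime,
    # so importing this module does not import the dependency.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    """The string annotation is not evaluated at runtime."""
    return f"{value.numerator}/{value.denominator}"
```

Type checkers still see the real type, while a plain import of the module skips the guarded import entirely.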

commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:23 2025 -0800

    [wheel] stop building python 3.9 wheels on the pipelines (#58587)

    also stops building python 3.9 aarch64 images

    Signed-off-by: Lonnie Liu <[email protected]>

commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:32:21 2025 -0800

    [serve] run tests in python 3.10 (#58586)

    all tests are passing

    Signed-off-by: Lonnie Liu <[email protected]>

commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <[email protected]>
Date:   Thu Nov 13 07:37:52 2025 -0800

    [core] Add monitoring in raylet for resource view (#58382)

    We today have very little observability into pubsub. On a raylet one of
    the most important states that need to be propagated through the cluster
    via pubsub is cluster membership. All raylets should in an eventual BUT
    timely fashion agree on the list of available nodes. This metric just
    emits a simple counter to keep track of the node count.

    More pubsub observability to come.

    ---------

    Signed-off-by: zac <[email protected]>
    Signed-off-by: Zac Policzer <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <[email protected]>
Date:   Wed Nov 12 23:04:37 2025 -0800

    [Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)

    Fixes HTTP streaming file downloads in Ray Data's download operation.
    Some URIs (especially HTTP streams) require `open_input_stream` instead
    of `open_input_file`.

    - Modified `download_bytes_threaded` in `plan_download_op.py` to try
    both `open_input_file` and `open_input_stream` for each URI
    - Improved error handling to distinguish between different error types
       - Failed downloads now return `None` gracefully instead of crashing
    ```
    import pyarrow as pa
    from ray.data.context import DataContext
    from ray.data._internal.planner.plan_download_op import download_bytes_threaded
    urls = [
        "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
    ]
    table = pa.table({"url": urls})
    ctx = DataContext.get_current()
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
    result_table = results[0]
    for i in range(result_table.num_rows):
        url = result_table['url'][i].as_py()
        bytes_data = result_table['bytes'][i].as_py()

        if bytes_data is None:
            print(f"Row {i}: FAILED (None) - try-catch worked ✓")
        else:
            print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
        print(f"  URL: {url[:60]}...")

    print("\n✅ Test passed: Failed downloads return None instead of crashing.")
    ```

    Before the fix:
    ```
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
        test_download_expression_with_streaming_fallback()
      File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
        with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
        if not self.__exit__(*sys.exc_info()):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
        setattr(self.target, self.attribute, self.temp_original)
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
    (base) ray@ip-10-0-39-21:~/default$ python test.py
    2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    Traceback (most recent call last):
      File "/home/ray/default/test.py", line 16, in <module>
        results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
        uri_bytes = list(
                    ^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
        raise item
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    ```
    After the fix:
    ```
    Row 0: SUCCESS (189370 bytes)
      URL: https://static-assets.tesla.com/configurator/compositor?cont...
    ```

    Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
    previously failed:
       - ✅ Successfully downloads HTTP stream files
       - ✅ Gracefully handles failed downloads (returns None)
       - ✅ Maintains backward compatibility with existing file downloads

    ---------

    Signed-off-by: xyuzh <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>
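
The fallback described in the commit above (#58542) could be sketched roughly as below. `read_uri_bytes` is a hypothetical helper, not the code in `plan_download_op.py`; it assumes a filesystem object exposing `open_input_file` and `open_input_stream` (as `pyarrow.fs.FileSystem` does), and the set of exceptions it catches is a guess.

```python
from typing import Any, Optional


def read_uri_bytes(fs: Any, path: str) -> Optional[bytes]:
    """Read all bytes at path, trying a seekable file first, then a stream.

    Returns None when both attempts fail instead of raising.
    """
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except (OSError, ValueError):
            # For example "Cannot seek streaming HTTP file"; try the next opener.
            continue
    return None
```

Wrapping both the open call and the read in the same `try` matters here, because the traceback above shows the failure surfacing in `f.read()` rather than when the file is opened.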

commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 22:19:50 2025 -0800

    [ci] pin docker client version (#58579)

    otherwise, the newer docker client will refuse to communicate with the
    docker daemon that is on an older version.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 22:08:45 2025 -0800

    [deps] adding include_setuptools flag for depset config (#58580)

    Adding optional `include_setuptools` flag for depset configuration

    If the flag is set on a depset config, `--unsafe-package setuptools`
    will not be included for depset compilation.

    If the flag does not exist on a depset config (default false),
    `--unsafe-package setuptools` will be appended to the default arguments.

    ---------

    Signed-off-by: elliot-barn <[email protected]>
    Co-authored-by: Lonnie Liu <[email protected]>
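
The flag behaviour described in the commit above (#58580) amounts to a small conditional. The sketch below is an illustration only: `build_compile_args` and the contents of `DEFAULT_COMPILE_ARGS` are hypothetical and do not come from the depset tooling.

```python
DEFAULT_COMPILE_ARGS = ["--generate-hashes"]  # hypothetical defaults for this sketch


def build_compile_args(include_setuptools: bool = False) -> list:
    """Return compile arguments, marking setuptools unsafe unless it is included."""
    args = list(DEFAULT_COMPILE_ARGS)
    if not include_setuptools:
        args += ["--unsafe-package", "setuptools"]
    return args
```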

commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 20:36:43 2025 -0800

    [serve] remove minbuild-serve-py3.9 (#58585)

    nothing is using it anymore

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <[email protected]>
Date:   Wed Nov 12 18:48:27 2025 -0800

    [core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)

    Signed-off-by: irabbani <[email protected]>

commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 18:26:25 2025 -0800

    [deps] update `requirements_buildkite.txt` (#58574)

    as the pydantic version is pinned in `requirements-doc.txt` now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 16:38:54 2025 -0800

    Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)

    Reverts ray-project/ray#58535

    failing on windows.. :(

commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <[email protected]>
Date:   Wed Nov 12 16:37:24 2025 -0800

    [Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
    - Support upserting iceberg tables for IcebergDatasink
    - Update schema on APPEND and UPSERT
    - Enable overwriting the entire table

    Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite
    functionality. Also for append, the library now handles the transaction
    logic implicitly so that burden can be lifted from Ray Data.

    ---------

    Signed-off-by: Goutam <[email protected]>

commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <[email protected]>
Date:   Wed Nov 12 16:31:26 2025 -0800

    [core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)

    Using GetNodeAddressAndLiveness in raylet client pool instead of the
    bulkier Get, same for AsyncGetAll. Seems like it was already done in
    core worker client pool, so just making the same change for raylet
    client pool.

    Signed-off-by: joshlee <[email protected]>

commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:35 2025 -0800

    [doc] set default python env to 3.10 (#58570)

    we stop supporting building with python 3.9 now

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:20 2025 -0800

    [bazel] rename contraint from hermatic to python_version (#58499)

    which is more accurate

    also moves python constraint definitions into `bazel/` directory and
    registering python 3.10 platform with hermetic toolchain

    this allows performing migration from python 3.9 to python 3.10
    incrementally

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 14:23:17 2025 -0800

    [images][deps] raydepsets base extra depset (#58461)

    generating depsets for base extra python requirements
    Installing requirements in base extra image

    ---------

    Signed-off-by: elliot-barn <[email protected]>

commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <[email protected]>
Date:   Thu Nov 13 07:08:15 2025 +0900

    [Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
    Currently Ray Data has a preprocessor called `RobustScaler`. This scales
    the data based on given quantiles. Calculating the quantiles involves
    sorting the entire dataset by column for each column (C sorts for C
    number of columns), which, for a large dataset, will require a lot of
    calculations.

    ** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
    as I couldn't actually find well-maintained tdigest libraries for
    python. ddsketch is better maintained.

    ** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
    aggregator

    ---------

    Signed-off-by: kyuds <[email protected]>
    Signed-off-by: Daniel Shin <[email protected]>
    Co-authored-by: You-Cheng Lin <[email protected]>

commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <[email protected]>
Date:   Thu Nov 13 03:33:18 2025 +0530

    [Core] support token auth in ray client server  (#58557)
    support token auth in ray client server by using the existing grpc
    interceptors. This pr refactors the code to:
    - add/rename sync and async client and server interceptors
    - create grpc utils to house grpc channel and server creation logic,
    python codebase is updated to use these methods
    - separate tests for sync and async interceptors
    - make existing authentication integration tests to run with RAY_CLIENT
    mode

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <[email protected]>
Date:   Wed Nov 12 13:45:02 2025 -0800

    [Core] Move request id creation to worker to address plasma get perf regression (#58390)
    This PR addresses the performance regression introduced in the [PR to make
    ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
    Specifically, the previous PR requires the worker to block and wait for
    AsyncGet to return with a reply of the request id needed for correctly
    cleaning up get requests. This additional synchronous step causes the
    plasma store Get to regress in performance.

    This PR moves the request id generation step to the plasma store,
    removing the blocking step to fix the perf regression.
    - [PR which introduced perf
    regression](https://github.com/ray-project/ray/pull/57911)
    - [PR which observed the
    regression](https://github.com/ray-project/ray/pull/58175)
    New performance of the change measured by `ray microbenchmark`.
    <img width="485" height="17" alt="image"
    src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
    />

    Original performance prior to the change. Here we focus on the
    regressing `single client get calls (Plasma Store)` metric, where our
    new performance returns us back to the original 10k per second range
    compared to the existing sub 5k per second.
    <img width="811" height="355" alt="image"
    src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
    />

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>

commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 12:36:21 2025 -0800

    [doc] symlink the doc dependency lock file (#58520)

    and ask people to use that lock file for building docs.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <[email protected]>
Date:   Wed Nov 12 12:08:28 2025 -0800

    [train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)

    1. `JaxTrainer` relies on the runtime env var `JAX_PLATFORMS` being set
    to initialize jax.distributed:
    https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
    2. Before this change, users had to both configure `use_tpu=True` in
    `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able to
    start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
    3. If the user uses other jax.distributed libraries like Orbax, this
    sometimes leads to a misleading error about distributed initialization.
    4. After this change, if the user sets `use_tpu=True`, we automatically
    add this to the env var.
    5. A TPU unit test is not available at this time; we will explore how to
    cover it later.
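
    As a rough illustration of point 4, the sketch below merges `tpu` into
    an existing comma-separated `JAX_PLATFORMS` value. The helper name
    `with_tpu_platform` and the append-at-the-end merge policy are
    assumptions made for this example; the commit message does not say how
    the real change combines values.

    ```python
    def with_tpu_platform(env, use_tpu):
        """Return a copy of env with "tpu" listed in JAX_PLATFORMS if use_tpu."""
        env = dict(env)
        if not use_tpu:
            return env
        # JAX_PLATFORMS may be a comma-separated list; keep existing entries.
        raw = env.get("JAX_PLATFORMS", "")
        platforms = [p.strip() for p in raw.split(",") if p.strip()]
        if "tpu" not in platforms:
            platforms.append("tpu")  # assumed policy: append, do not reorder
        env["JAX_PLATFORMS"] = ",".join(platforms)
        return env
    ```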

    ---------

    Signed-off-by: Lehui Liu <[email protected]>

commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <[email protected]>
Date:   Wed Nov 12 12:04:16 2025 -0800

    [Data] Add `Ranker` Interface (#58513)
    Creates a ranker interface that ranks the best operator to run next
    in `select_operator_to_run`. This code only refactors the existing
    code. The ranking value must be something that is comparable.
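
    A minimal sketch of what such an interface could look like. Only the
    names `Ranker` and `select_operator_to_run` come from the commit
    message; the method name `rank_operator`, the dict-based operators, and
    the "lowest rank runs first" convention are assumptions for
    illustration.

    ```python
    from abc import ABC, abstractmethod


    class Ranker(ABC):
        @abstractmethod
        def rank_operator(self, op):
            """Return a comparable value; in this sketch, lower runs first."""


    class QueuedBytesRanker(Ranker):
        # Hypothetical policy: prefer the operator with the least queued output.
        def rank_operator(self, op):
            return (op["queued_bytes"], op["name"])


    def select_operator_to_run(ops, ranker):
        """Pick the best-ranked operator, or None if nothing is runnable."""
        if not ops:
            return None
        return min(ops, key=ranker.rank_operator)
    ```

    Returning a tuple keeps the rank comparable while giving a
    deterministic tie-break.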

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 11:50:42 2025 -0800

    [bazel] upgrade bazel python rules to 0.25.0 (#58535)

    previously it was actually using 0.4.0, which is set up by the grpc
    repo. the declaration in the workspace file was being shadowed.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:34:59 2025 -0800

    [Data] Refactor concurrency validation tests in `test_map.py` (#58549)

    The original `test_concurrency` function combined multiple test
    scenarios into a single test with complex control flow and expensive Ray
    cluster initialization. This refactoring extracts the parameter
    validation tests into focused, independent tests that are faster,
    clearer, and easier to maintain.

    Additionally, the original test included "validation" cases that tested
    valid concurrency parameters but didn't actually verify that concurrency
    was being limited correctly—they only checked that the output was
    correct, which isn't useful for validating the concurrency feature
    itself.

    **Key improvements:**
    - Split validation tests into `test_invalid_func_concurrency_raises` and
    `test_invalid_class_concurrency_raises`
    - Use parametrized tests for different invalid concurrency values
    - Switch from `shutdown_only` with explicit `ray.init()` to
    `ray_start_regular_shared` to eliminate cluster initialization overhead
    - Minimize test data from 10 blocks to 1 element since we're only
    validating parameter errors
    - Remove non-validation tests that didn't verify concurrency behavior

    The validation tests now execute significantly faster and provide
    clearer failure messages. Each test has a single, well-defined purpose
    making maintenance and debugging easier.

    ---------

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:32:48 2025 -0800

    [Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)

    This PR improves documentation consistency in the `python/ray/data`
    module by converting all remaining rST-style docstrings (`:param:`,
    `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).

    **Files modified:**
    - `python/ray/data/preprocessors/utils.py` - Converted
    `StatComputationPlan.add_callable_stat()`
    - `python/ray/data/preprocessors/encoder.py` - Converted
    `unique_post_fn()`
    - `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
    and `BlockColumnAccessor.is_composed_of_lists()`
    - `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
    Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <[email protected]>
Date:   Wed Nov 12 09:07:32 2025 -0800

    [serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)

    Signed-off-by: Nikhil Ghosh <[email protected]>

commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <[email protected]>
Date:   Wed Nov 12 11:50:05 2025 -0500

    [Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)

    This value was set to align with the previous default of 4.

    However, after some consideration I've realized that 4 is too high, so
    this PR lowers it to 2.
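
    For intuition, assuming (from the constant's name only) that the factor
    multiplies an actor's max concurrency to cap its in-flight tasks, the
    effect of the change can be sketched as follows. The helper
    `max_tasks_in_flight` is hypothetical.

    ```python
    # New default from this commit; the previous default was 4.
    DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR = 2


    def max_tasks_in_flight(
        max_concurrency,
        factor=DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR,
    ):
        """Assumed relationship: in-flight cap = factor * max concurrency."""
        return factor * max_concurrency
    ```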

    Signed-off-by: Alexey Kudinkin <[email protected]>

commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <[email protected]>
Date:   Wed Nov 12 07:44:53 2025 -0800

    [core] fix idle node termination on object pulling (#57928)

    Currently, a node is considered idle while pulling objects from the
    remote object store. This can lead to situations where a node is
    terminated as idle, causing the cluster to enter an infinite loop when
    pulling large objects that exceed the node idle termination timeout.

    This PR fixes the issue by treating object pulling as a busy activity.
    Note that nodes can still accept additional tasks while pulling objects
    (since pulling consumes no resources), but the auto-scaler will no
    longer terminate the node prematurely.
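
    The rule can be sketched as below. The function names and parameters
    are hypothetical and only illustrate the idea of counting active object
    pulls as busy activity; they are not the raylet's actual interface.

    ```python
    def is_node_idle(num_running_tasks, num_active_pulls):
        """A node is idle only if it runs no tasks and pulls no objects."""
        return num_running_tasks == 0 and num_active_pulls == 0


    def should_terminate(idle_seconds, idle_timeout_seconds,
                         num_running_tasks, num_active_pulls):
        """Terminate only after staying idle past the timeout."""
        if not is_node_idle(num_running_tasks, num_active_pulls):
            return False
        return idle_seconds >= idle_timeout_seconds
    ```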

    Closes #54372

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <[email protected]>
Date:   Wed Nov 12 05:40:47 2025 -0800

    [core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)

    When actors terminate gracefully, Ray calls the actor's
    `__ray_shutdown__()` method if defined, allowing for cleanup of
    resources. However, it is not invoked when the actor goes out of scope
    due to `del actor`.

    Traced through the entire code path, and here's what happens:

    Flow when `del actor` is called:

    1. **Python side**: `ActorHandle.__del__()` ->
    `worker.core_worker.remove_actor_handle_reference(actor_id)`

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

    2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
    `reference_counter_->RemoveLocalReference()`
    - When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
    callback

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

    3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
    `AsyncReportActorOutOfScope()` to GCS

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

    4. **GCS receives notification**: `HandleReportActorOutOfScope()`
    - **THE PROBLEM IS HERE** ([line 279 in
    `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
       ```cpp
       DestroyActor(actor_id,
                    GenActorOutOfScopeCause(actor),
                    /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                    [reply, send_reply_callback]() {
       ```

    5. **Actor worker receives kill signal**: `HandleKillActor()` in
    [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
       ```cpp
       if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
           ForceExit(...)  // Skips __ray_shutdown__
       } else {
           Exit(...)  // Would call __ray_shutdown__
       }
       ```

    6. **ForceExit path**: Bypasses graceful shutdown -> No
    `__ray_shutdown__` callback invoked.

    This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
    actors. Also, updated the docs.
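
    The branch in step 5 can be modeled in plain Python to show why the
    hardcoded `force_kill=true` skipped the hook. This is a simplified
    stand-in, not Ray code: `handle_kill` and `Worker` are hypothetical,
    and only the `__ray_shutdown__` hook name comes from the description
    above.

    ```python
    class Worker:
        def __init__(self):
            self.cleaned_up = False

        def __ray_shutdown__(self):
            # User-defined cleanup, e.g. closing files or connections.
            self.cleaned_up = True


    def handle_kill(actor, force_kill):
        """Mimic the force/graceful split from the C++ snippet above."""
        if force_kill:
            return "force_exit"  # cleanup hook is never reached
        hook = getattr(actor, "__ray_shutdown__", None)
        if callable(hook):
            hook()
        return "graceful_exit"
    ```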

    ---------

    Signed-off-by: Sagar Sumit <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 19:00:10 2025 +0530

    [Core] use client interceptor for adding auth token in c++ client calls (#58424)
    - Use client interceptor for adding auth tokens in grpc calls when
    `AUTH_MODE=token`
    - BuildChannel() will automatically include the interceptor
    - Removed `auth_token` parameter from `ClientCallImpl`
    - Removed manual auth from `python_gcs_subscriber.cc`
    - Added tests to verify auth works for autoscaler APIs
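
    The interceptor idea, reduced to plain Python: every outgoing call
    passes through one wrapper that attaches the token, so individual call
    sites no longer take an `auth_token` argument. The metadata key
    `authorization`, the `Bearer` prefix, and all function names here are
    assumptions for illustration, not the actual C++ implementation.

    ```python
    def make_auth_interceptor(token):
        """Build a wrapper that adds an auth entry to outgoing metadata."""
        def intercept(metadata, send):
            # Copy so the caller's metadata is left untouched.
            enriched = list(metadata) + [("authorization", f"Bearer {token}")]
            return send(enriched)
        return intercept


    def build_channel(send, token=None):
        """Return a send function, with the interceptor applied if needed."""
        if token is None:
            return lambda metadata: send(metadata)
        interceptor = make_auth_interceptor(token)
        return lambda metadata: interceptor(metadata, send)
    ```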

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 11:25:11 2025 +0530

    [core] Token auth usability improvements (#58408)
    - rename RAY_auth_mode → RAY_AUTH_MODE environment variable across
    codebase
    - Excluded healthcheck endpoints from authentication for Kubernetes
    compatibility
    - Fixed dashboard cookie handling to respect auth mode and clear stale
    tokens when switching clusters

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <[email protected]>
Date:   Tue Nov 11 19:50:52 2025 -0800

    [doc][serve][llm] Attached the correct figure to the pd docs (#58543)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <[email protected]>
Date:   Tue Nov 11 18:00:43 2025 -0800

    [doc] downgrade readthedocs to use python 3.10 (#58536)

    be consistent with the default build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 17:26:15 2025 -0800

    [core] Fix auth test import (#58554)

    The python test step is failing on master now because of this. Probably
    a logical merge conflict.
    ```
    FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
    ...

    [2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
    --
      | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
    ```

    Signed-off-by: dayshah <[email protected]>

commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:44:05 2025 -0800

    [core][rdt] Abort NIXL and allow actor reuse on failed transfers  (#56783)

    Signed-off-by: dayshah <[email protected]>

commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:26:17 2025 -0800

    [core] Improve kill actor logs (#58544)

    Signed-off-by: dayshah <[email protected]>

commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <[email protected]>
Date:   Tue Nov 11 12:16:41 2025 -0800

    [docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)

    Signed-off-by: Nikhil Ghosh <[email protected]>
    Signed-off-by: Nikhil G <[email protected]>

commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:13:13 2025 -0800

    [core] Make GlobalState lazy initialization thread-safe (#58182)

    Signed-off-by: dayshah <[email protected]>

commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <[email protected]>
Date:   Tue Nov 11 09:43:05 2025 -0800

    [core] Scheduling a detached actor with a placement group is not recommended (#57726)

    If users schedule a detached actor into a placement group, Raylet will
    kill the actor when the placement group is removed. The actor will be
    stuck in the `RESTARTING` state forever if it's restartable until users
    explicitly kill it.

    In that case, if users try to `get_actor` with the actor's name, it can
    still return the restarting actor, but no process exists. It will no
    longer be restarted because the PG is gone, and no PG with the same ID
    will be created during the cluster's lifetime.

    The better behavior would be for Ray to transition a task/actor's state
    to dead when it is impossible to restart. However, this would add too
    much complexity to the core, so I think it's not worth it. Therefore,
    this PR adds a warning log, and users should use detached actors or PGs
    correctly.
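
    A minimal sketch of the kind of check this implies, assuming the
    warning is emitted when an actor is both detached and bound to a
    placement group. The function name, logger name, and message text are
    hypothetical.

    ```python
    import logging

    logger = logging.getLogger("pg_detached_example")


    def warn_if_detached_in_pg(lifetime, placement_group_id):
        """Warn (and return True) for a detached actor scheduled into a PG."""
        if lifetime == "detached" and placement_group_id is not None:
            logger.warning(
                "Scheduling a detached actor with a placement group is not "
                "recommended: the actor is killed when the placement group "
                "is removed."
            )
            return True
        return False
    ```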

    Example: Run the following script and run `ray list actors`.

    ```python
    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    from ray.util.placement_group import placement_group, remove_placement_group

    @ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
    class Actor:
      pass

    ray.init()

    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())

    actor = Actor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()

    ray.get(actor.__ray_ready__.remote())
    ```

    - [ ] Bug fix 🐛
    - [ ] New feature ✨
    - [x] Enhancement 🚀
    - [ ] Code refactoring 🔧
    - [ ] Documentation update 📖
    - [ ] Chore 🧹
    - [ ] Style 🎨

    **Does this PR introduce breaking changes?**
    - [ ] Yes ⚠️
    - [x] No

    **Testing:**
    - [ ] Added/updated tests for my changes
    - [x] Tested the changes manually
    - [ ] This PR is not tested ❌ _(please explain why)_

    **Code Quality:**
    - [x] Signed off every commit (`git commit -s`)
    - [x] Ran pre-commit hooks ([setup
    guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

    **Documentation:**
    - [ ] Updated documentation (if applicable) ([contribution
    guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
    - [ ] Added new APIs to `doc/source/` (if applicable)

    ---------

    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <[email protected]>
Date:   Tue Nov 11 07:19:19 2025 -0800

    [core] enable open telemetry by default (#56432)

    This PR enables open telemetry as the default backend for ray metric
    stack. The bulk of this PR is actually to fix tests that were written
    with some assumptions that no longer hold true. For ease of reviewing, I
    inline the reasons for the change together with the change for each
    tests in the comments.

    This PR also depends on a release of vllm (so that we can update the
    minimal supported version of vllm in ray).

    Test:
    - CI

    <!-- CURSOR_SUMMARY -->
    ---

    > [!NOTE]
    > Enable OpenTelemetry metrics backend by default and refactor
    metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
    metric names.
    >
    > - **Core/Config**:
    > - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
    `true` in `ray_constants.py` and `ray_config_def.h`.
    > - Metrics `Counter`: use `CythonCount` by default; keep legacy
    `CythonSum` only when OTEL is explicitly disabled.
    > - **Serve/Metrics Tests**:
    > - Replace text scraping with `PrometheusTimeseries` and
    `fetch_prometheus_metric_timeseries` throughout.
    > - Update metric names/tags to `ray_serve_*` and counter suffixes
    `*_total`; adjust latency metric names and processing/queued gauges.
    > - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
    pass through helpers.
    > - **General Test Fixes**:
    > - Remove OTEL parametrization/fixtures; simplify expectations where
    counters-as-gauges no longer apply; drop related tests.
    > - Cardinality tests: include `"low"` level and remove OTEL gating;
    stop injecting `enable_open_telemetry` in system config.
    > - Actor/state/thread tests: migrate to cluster fixtures, wait for
    dashboard agent, and adjust expected worker thread counts.
    > - **Build**:
    > - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
    from C++ stats test.
    >
    > <sup>Written by [Cursor
    Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
    1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically
    on new commits. Configure
    [here](https://cursor.com/dashboard?tab=bugbot).</sup>
    <!-- /CURSOR_SUMMARY -->

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>

commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:15:41 2025 +0530

    [Core] move authentication_test_utils into ray._private to fix macos tests (#58528)

    The auth token test setup in `conftest.py` is breaking macOS tests. There
    are two test scripts (`test_microbenchmarks.py` and `test_basic.py`)
    that run after the wheel is installed but without editable mode. For
    these tests to pass, `conftest.py` cannot import anything under
    `ray.tests`.

    This PR moves `authentication_test_utils` into `ray._private` to fix
    this issue.

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:10:56 2025 +0530

    [Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)

    Add a grpc service interceptor to intercept all dashboard agent rpc
    calls and validate the presence of auth token (when auth mode is token)
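
    The server-side check can be sketched as a small function. The metadata
    key, the `Bearer` prefix, and the function name are assumptions for
    illustration; the commit message only states that the token's presence
    is validated when the auth mode is token.

    ```python
    import hmac


    def is_authorized(metadata, expected_token, auth_mode="token"):
        """Return True if the call may proceed under the given auth mode."""
        if auth_mode != "token":
            return True  # auth disabled: nothing to validate
        presented = dict(metadata).get("authorization", "")
        # Constant-time comparison avoids leaking token contents via timing.
        return hmac.compare_digest(presented, f"Bearer {expected_token}")
    ```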

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <[email protected]>
Date:   Tue Nov 11 21:35:42 2025 +0800

    [Docs] fix pattern_async_actor demo typo (#58486)

    fix pattern_async_actor demo typo. Add `self.`.

    ---------

    Signed-off-by: curiosity-hyf <[email protected]>

commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <[email protected]>
Date:   Mon Nov 10 18:28:46 2025 -0800

    Update pydoclint to version 0.8.1 (#58490)
    * Does the work to bump pydoclint up to the latest version
    * And allowlist any new violations it finds

    ---------

    Signed-off-by: Thomas Desrosiers <[email protected]>

commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <[email protected]>
Date:   Mon Nov 10 17:34:13 2025 -0800

    [Data] - Iceberg support predicate & projection pushdown (#58286)
    Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
    conjunction with this PR should speed up reads from Iceberg.

    Once the above change lands, we can add the pushdown interface support
    for IcebergDatasource

    ---------

    Signed-off-by: Goutam <[email protected]>

commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <[email protected]>
Date:   Mon Nov 10 17:28:23 2025 -0800

    [serve][llm] Fix import path in multi-node release test (#58498)

    Signed-off-by: Seiji Eicher <[email protected]>

commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 16:04:48 2025 -0800

    [ci] upgrade rayci version (#58514)

    to 0.21.0; supports wanda priority now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <[email protected]>
Date:   Mon Nov 10 14:05:15 2025 -0800

    [core] Add owned object spill metrics (#57870)

    This PR adds 2 new metrics to core_worker by way of the reference
    counter. The two new metrics keep track of the count and size of objects
    owned by the worker as well as keeping track of their states. States are
    defined as:

    - **PendingCreation**: An object that is pending creation and hasn't
    finished its initialization (and is sizeless)
    - **InPlasma**: An object which has an assigned node address and isn't
    spilled
    - **Spilled**: An object which has an assigned node address and is
    spilled
    - **InMemory**: An object which has no assigned address but isn't
    pending creation (and therefore, must be local)

    The approach used by these new metrics is to examine the state 'before
    and after' any mutation of the reference in the reference_counter. This
    is required in order to do the appropriate bookkeeping (decrementing
    some values and incrementing others). Admittedly, the RecordMetrics loop
    can observe counts in between a decrement and its matching increment,
    depending on when it runs. This unfortunate side effect, however, seems
    preferable to enforcing mutual exclusion with metric collection, as this
    is potentially a high-throughput code path.
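
    The 'before and after' bookkeeping can be sketched as below. The state
    names follow the list above; the dict-based reference record, its field
    names, and the `OwnedObjectMetrics` class are hypothetical stand-ins
    for the C++ reference counter.

    ```python
    from collections import Counter


    def classify(ref):
        """Map a reference record to one of the four states listed above."""
        if ref["pending_creation"]:
            return "PendingCreation"
        if ref["node_address"] is not None:
            return "Spilled" if ref["spilled"] else "InPlasma"
        return "InMemory"


    class OwnedObjectMetrics:
        def __init__(self):
            self.count = Counter()  # objects per state
            self.size = Counter()   # bytes per state

        def add(self, ref):
            state = classify(ref)
            self.count[state] += 1
            self.size[state] += ref["size"]

        def mutate(self, ref, change):
            # Snapshot the state and size before the mutation...
            old_state, old_size = classify(ref), ref["size"]
            change(ref)
            # ...then move the object's contribution to its new bucket.
            self.count[old_state] -= 1
            self.size[old_state] -= old_size
            self.add(ref)
    ```

    Because the size is re-read after the mutation, this also covers the
    case where an object's size changes between states.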

    In addition, performing live counts seemed preferable to doing a full
    accounting of the object store and across all references at the time of
    metric collection. The reason is that the reference counter is
    potentially tracking millions of objects, and each metric scan could
    be very expensive. So keeping a running count (despite being potentially
    inaccurate for short periods) seemed the right call.

    This PR also allows for object size to change due to potential
    non-deterministic instantiation (say an object is initially created,
    but its primary copy dies, and then the recreation fails).
    This is an edge case, but seems important for completeness' sake.

    ---------

    Signed-off-by: zac <[email protected]>

commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:02:11 2025 -0800

    [java] remove local genrule `//java:ray_java_pkg` (#58503)

    using `bazelisk run //java:gen_ray_java_pkg` everywhere

    Signed-off-by: Lonnie Liu <[email protected]>

commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:27 2025 -0800

    [ci] apply isort to release test directory, part 1 (#58505)

    excluding `*_tests` directories for now to reduce the impact

    Signed-off-by: Lonnie Liu <[email protected]>

commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:06 2025 -0800

    [doc] change link check to run on python 3.12 (#58506)

    migrating all doc related things to run on python 3.12

    Signed-off-by: Lonnie Liu <[email protected]>

commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:00:01 2025 -0800

    [doc] ci: move doc annotation check to python 3.12 (#58507)

    be consistent with doc build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <[email protected]>
Date:   Mon Nov 10 12:52:43 2025 -0800

    [data] Clear queue for manually mark_execution_finished operators (#58441)
    Currently, we clear _external_ queues when an operator is manually
    marked as finished. But we don't clear their _internal_ queues. This PR
    fixes that.
    Fixes this test:
    https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <[email protected]>
Date:   Mon Nov 10 10:54:34 2025 -0800

    [data] Update depsets for multimodal inference release tests (#57233)

    Update remaining multimodal release tests to use new depsets.

commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 10:30:00 2025 -0800

    [ci] separate doc related jobs into its own group (#58454)

    so that they are not called lints any more

    Signed-off-by: Lonnie Liu <[email protected]>

commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <[email protected]>
Date:   Mon Nov 10 23:45:38 2025 +0530

    increase timeout for test_initial_replica tests (#58423)

    - `test_target_capacity` windows test is failing, possibly because we
    have put up a short timeout of 10 seconds, increasing it to verify
    whether timeout is an issue or not.

    Signed-off-by: harshit <[email protected]>

commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <[email protected]>
Date:   Mon Nov 10 09:39:43 2025 -0800

    Define an env for controlling UVloop (#58442)

    Our Serve ingress keeps running into the following `uvloop`-related
    error under heavy load:
    ```
    File descriptor 97 is used by transport
    ```
    The uvloop team has a
    [PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it
    seems like no one is working on it.

    One workaround mentioned in the
    [PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982)
    is to just turn off uvloop.
    We tried it in our environment and didn't see any major performance
    difference. Hence, as part of this PR, we are defining a new env var for
    controlling uvloop.
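
    A sketch of how such a switch might be consumed. The variable name
    `EXAMPLE_ENABLE_UVLOOP` and both helper functions are placeholders made
    up for this example, since the commit message does not name the actual
    env var.

    ```python
    import asyncio
    import os


    def uvloop_enabled(environ=None):
        """Read the (hypothetical) switch; uvloop stays on unless set to "0"."""
        environ = os.environ if environ is None else environ
        return environ.get("EXAMPLE_ENABLE_UVLOOP", "1") != "0"


    def new_event_loop(environ=None):
        """Create a uvloop loop when enabled and installed, else asyncio's."""
        if uvloop_enabled(environ):
            try:
                import uvloop
                return uvloop.new_event_loop()
            except ImportError:
                pass  # fall back to the standard loop
        return asyncio.new_event_loop()
    ```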

    Signed-off-by: jugalshah291 <[email protected]>

commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <[email protected]>
Date:   Mon Nov 10 23:29:03 2025 +0800

    [Doc][KubeRay] eliminate vale errors (#58429)

    Fix some Vale errors and suggestions in the kai-scheduler document.

    See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719

    Signed-off-by: fscnick <[email protected]>

commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <[email protected]>
Date:   Mon Nov 10 16:24:34 2025 +0100

    [air] Add stacklevel option to deprecation_warning (#58357)

    Currently, deprecation warnings are sometimes not informative enough.
    When the warning is triggered, it does not tell us *where* the deprecated
    feature is used. For example, Ray internally raises a deprecation
    warning when an `RLModuleConfig` is initialized.

    ```python
    >>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
    >>> RLModuleConfig()
    2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    This is confusing: where did *I* use a config, and what am I doing wrong?
    This raises issues like:
    https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

    Tracing where the warning actually originates is tedious: is it my code
    or internal? The output just shows `deprecation.py:50`. Not helpful.

    This PR adds a `stacklevel` option, with `stacklevel=2` as the default,
    to all `deprecation_warning`s, so devs and users can better see where
    the deprecated option is actually used.
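
    The effect of `stacklevel` can be reproduced with the standard
    `logging` module, whose logging calls accept a `stacklevel` argument
    (Python 3.8+). This is a simplified stand-in, not Ray's
    `deprecation_warning` implementation; the signature and message format
    are assumptions.

    ```python
    import logging

    logger = logging.getLogger("deprecation_example")


    def deprecation_warning(old, new, stacklevel=2):
        # stacklevel=1 would attribute the record to this helper;
        # stacklevel=2 attributes it to whoever called the helper.
        logger.warning(
            "DeprecationWarning: `%s` has been deprecated. Use `%s` instead.",
            old, new, stacklevel=stacklevel,
        )


    def user_code():
        deprecation_warning("old_api", "new_api")
    ```

    With the default, the log record's file name, line number, and function
    name point at `user_code` rather than at the helper.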

    ---

    EDIT:

    **Before**

    ```python
    WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
    ```

    **After** the `module.py:line` where the deprecated artifact is used is
    shown in the log output:

    When building an Algorithm:
    ```python
    WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    ```python
    .../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    ```

    Signed-off-by: Daraan <[email protected]>

commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <[email protected]>
Date:   Mon Nov 10 20:50:21 2025 +0530

    [core] Configure an interceptor to pass auth token in python direct g… (#58395)

    there are places in the python code where we use the raw grpc library to
    make grpc calls (eg: pub-sub, some calls to gcs etc). In the long term
    we want to fully deprecate grpc library usage in our python code base
    but as that can take more effort and testing, in this pr I am
    introducing an interceptor to add auth headers (this will take effect
    for all grpc calls made from python).
    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    ray start --head
    ray job submit -- echo "hi"
    ```

    output
    ```
    ray job submit -- echo "hi"
    2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
    Job submission server address: http://127.0.0.1:8265

    -------------------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
    -------------------------------------------------------

    Next steps
      Query the logs of the job:
        ray job logs raysubmit_1EV8q86uKM24nHmH
      Query the status of the job:
        ray job status raysubmit_1EV8q86uKM24nHmH
      Request the job to be stopped:
        ray job stop raysubmit_1EV8q86uKM24nHmH

    Tailing logs until the job exits (disable with --no-wait):
    2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
    hi
    Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

    ------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
    ------------------------------------------
    ```
    dashboard
    test.py
    ```python
    import time
    import ray
    from ray._raylet import Config

    ray.init()

    @ray.remote
    def print_hi():
        print("Hi")
        time.sleep(2)

    @ray.remote
    class SimpleActor:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    actor = SimpleActor.remote()
    result = ray.get(actor.increment.remote())

    for i in range(100):
        ray.get(print_hi.remote())
        time.sleep(20)

    ray.shutdown()
    ```

    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    python test.py
    ```
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
    />
    overview page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
    />

    job page: tasks are listed
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
    />

    task page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
    />

    actors page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
    />

    specific actor page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
    />

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <[email protected]>
Date:   Sun Nov 9 08:34:46 2025 -0800

    [Data] Add exception handling for invalid URIs in download operation (#58464)

commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <[email protected]>
Date:   Sat Nov 8 15:35:11 2025 -0800

    [dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)
    Pass in `status_code` directly into `do_reply`. This is a follow up to
    https://github.com/ray-project/ray/pull/58255

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <[email protected]>
Date:   Sat Nov 8 15:32:10 2025 -0800

    [core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)

    This ensures node type names are correctly reported even when the
    autoscaler is disabled (read-only mode).

    Autoscaler v2 fails to report prometheus metrics when operating in
    read-only mode on KubeRay with the following KeyError error:

    ```
    2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
        return Reconciler.reconcile(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
        Reconciler._step_next(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
        Reconciler._scale_cluster(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
        reply = scheduler.schedule(sched_request)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
        ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
        node_config = ctx.get_node_type_configs()[node_type]
    KeyError: 'small-group'
    ```

    This happens because the `ReadOnlyProviderConfigReader` populates
    `ctx.get_node_type_configs()` using node IDs as node types, which is
    correct for local Ray (where local ray does not have
    `RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where
    `ray_node_type_name` is present and expected wi…
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 13, 2025
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 14:31:31 2025 -0800

    [doc] remove python 3.12 in doc building (#58572)

    unifying to python 3.10

    Signed-off-by: Lonnie Liu <[email protected]>

commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <[email protected]>
Date:   Thu Nov 13 17:27:23 2025 -0500

    Add support for RAY_AUTH_MODE=k8s  (#58497)

    This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
    will delegate authentication and authorization of Ray access to
    Kubernetes TokenReview and SubjectAccessReview APIs.

    ---------

    Signed-off-by: Andrew Sy Kim <[email protected]>

commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 14:08:29 2025 -0800

    [serve] fix serve dashboard metric name (#58573)

    Prometheus automatically appends the `_total` suffix to all Counter
    metrics. Ray has historically supported counter metrics both with and
    without the `_total` suffix for backward compatibility, but it is now
    time to drop that support (2 years since the warning was added).

    One place in the Ray Serve dashboard still doesn't use the `_total`
    suffix; this PR fixes it.

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>
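
To illustrate the naming rule the commit above relies on, here is a minimal sketch of normalizing a counter name to its `_total` form before using it in a dashboard query. The helper name `with_total_suffix` and the example metric name are hypothetical and are not taken from the Ray codebase.

```python
def with_total_suffix(counter_name: str) -> str:
    """Return the counter name in its `_total` form."""
    suffix = "_total"
    # Avoid producing `..._total_total` when the suffix is already present.
    if counter_name.endswith(suffix):
        return counter_name
    return counter_name + suffix


# Hypothetical metric name, used only for illustration.
print(with_total_suffix("example_requests"))
print(with_total_suffix("example_requests_total"))
```

Both calls print `example_requests_total`, so a query built this way matches the suffixed series regardless of how the name was originally written.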

commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <[email protected]>
Date:   Thu Nov 13 13:33:33 2025 -0800

    [Serve.LLM] Add avg prompt length metric (#58599)
    Add avg prompt length metric

    When using uniform prompt length (especially in testing), the P50 and
    P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
    Average prompt length provides another useful dimension to look at and
    validate.

    For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
    9400, and avg accurately shows 5000.

    <img width="1186" height="466" alt="image"
    src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
    />

    ---------

    Signed-off-by: Rui Qiao <[email protected]>
    Signed-off-by: Rui Qiao <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
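
The skew described in the commit above can be sketched with a small, self-contained histogram model. This assumes Prometheus-style linear interpolation inside a bucket; the bucket bounds and the prompt length of 5001 tokens are hypothetical values chosen so that every observation lands just above a bucket boundary. It does not reproduce the 7200/9400 figures quoted in the commit message, which come from a real run.

```python
from bisect import bisect_left


def bucket_counts(values, upper_bounds):
    """Count observations per bucket (assumes every value <= the largest bound)."""
    counts = [0] * len(upper_bounds)
    for v in values:
        # A value lands in the first bucket whose upper bound is >= the value.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


def histogram_quantile(q, upper_bounds, counts):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    rank = q * sum(counts)
    cumulative = 0
    for i, (bound, count) in enumerate(zip(upper_bounds, counts)):
        if count and cumulative + count >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            return lower + (bound - lower) * (rank - cumulative) / count
        cumulative += count
    return upper_bounds[-1]


bounds = [1000, 2000, 5000, 10000, 20000]  # hypothetical 1-2-5 style bounds
prompt_lengths = [5001] * 100              # hypothetical uniform workload
counts = bucket_counts(prompt_lengths, bounds)

print(histogram_quantile(0.5, bounds, counts))     # 7500.0
print(histogram_quantile(0.9, bounds, counts))     # 9500.0
print(sum(prompt_lengths) / len(prompt_lengths))   # 5001.0
```

Because all observations share one bucket, the interpolated percentiles spread across the whole bucket width, while the average (sum divided by count) stays at the true value.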

commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <[email protected]>
Date:   Thu Nov 13 12:42:49 2025 -0800

    [release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)

    allowing for py3.13 images (cpu & cu123) in release tests

    Signed-off-by: elliot-barn <[email protected]>

commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <[email protected]>
Date:   Thu Nov 13 12:26:10 2025 -0800

    [Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
    As title suggests

    Signed-off-by: Goutam <[email protected]>

commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:18:38 2025 -0800

    [core] fix get_metric_check_condition tests (#58598)

    Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
    which is a non-flaky version of `fetch_prometheus`. Update all of test
    usage accordingly.

    Test:
    - CI

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:14:04 2025 -0800

    [core] add an option to disable otel sdk error logs (#58257)

    Currently, Ray metrics and events are exported through a centralized
    process called the Dashboard Agent. This process functions as a gRPC
    server, receiving data from all other components (GCS, Raylet, workers,
    etc.). However, during a node shutdown, the Dashboard Agent may
    terminate before the other components, resulting in gRPC errors and
    potential loss of metrics and events.

    When this happens, the otel sdk logs become very noisy. Add a default
    option to disable otel sdk logs to avoid confusion.

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <[email protected]>
Date:   Thu Nov 13 11:41:29 2025 -0800

    [1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)

    2. **Extracted generic `RankManager` class** - Created reusable rank
    management logic separated from deployment-specific concerns

    3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
    replacing raw integers

    4. **Simplified error handling** - not supporting self healing

    5. **Updated tests** - Refactored unit tests to use new API and removed
    flag-dependent test cases

    **Impact:**
    - Cleaner separation of concerns in rank management
    - Foundation for future multi-level rank support

    Next PR https://github.com/ray-project/ray/pull/58473

    ---------

    Signed-off-by: abrar <[email protected]>

commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <[email protected]>
Date:   Thu Nov 13 11:21:50 2025 -0800

    [Core] Refactor reference_counter out of memory store and plasma store (#57590)

    As discovered in the [PR to better define the interface for reference
    counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
    plasma store provider and memory store both share thin dependencies on
    reference counter that can be refactored out. This will reduce
    entanglement in our code base and improve maintainability.

    The main logic changes are located in
    * src/ray/core_worker/store_provider/plasma_store_provider.cc, where
    reference counter related logic is refactored into the core worker
    * src/ray/core_worker/core_worker.cc, where factored out reference
    counter logic is resolved
    * src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
    logic related to reference counter has either been removed due to the
    fact that it is tech debt or refactored into caller functions.

    Microbenchmark:
    ```
    single client get calls (Plasma Store) per second 10592.56 +- 535.86
    single client put calls (Plasma Store) per second 4908.72 +- 41.55
    multi client put calls (Plasma Store) per second 14260.79 +- 265.48
    single client put gigabytes per second 11.92 +- 10.21
    single client tasks and get batch per second 8.33 +- 0.19
    multi client put gigabytes per second 32.09 +- 1.63
    single client get object containing 10k refs per second 13.38 +- 0.13
    single client wait 1k refs per second 5.04 +- 0.05
    single client tasks sync per second 960.45 +- 15.76
    single client tasks async per second 7955.16 +- 195.97
    multi client tasks async per second 17724.1 +- 856.8
    1:1 actor calls sync per second 2251.22 +- 63.93
    1:1 actor calls async per second 9342.91 +- 614.74
    1:1 actor calls concurrent per second 6427.29 +- 50.3
    1:n actor calls async per second 8221.63 +- 167.83
    n:n actor calls async per second 22876.04 +- 436.98
    n:n actor calls with arg async per second 3531.21 +- 39.38
    1:1 async-actor calls sync per second 1581.31 +- 34.01
    1:1 async-actor calls async per second 5651.2 +- 222.21
    1:1 async-actor calls with args async per second 3618.34 +- 76.02
    1:n async-actor calls async per second 7379.2 +- 144.83
    n:n async-actor calls async per second 19768.79 +- 211.95
    ```
    This PR mainly makes logic changes to the `ray.get` call chain. As we
    can see from the benchmark above, the single client get calls performance
    matches pre-regression levels.

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:49:39 2025 +0530

    [Core] Support get-auth-token cli command  (#58566)

    add support for `ray get-auth-token` cli command + test

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:37:23 2025 +0530

    [Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)

    Migrates Ray dashboard authentication from JavaScript-managed cookies to
    server-side HttpOnly cookies to enhance security against XSS attacks.
    This addresses code review feedback to improve the authentication
    implementation (https://github.com/ray-project/ray/pull/58368)

    main changes:
    - authentication middleware first looks for `Authorization` header, if
    not found it then looks at cookies to look for the auth token
    - new `api/authenticate` endpoint for verifying token and setting the
    auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
    `secure=true` (when using https))
    - removed javascript based cookie manipulation utils and axios
    interceptors (were previously responsible for setting cookies)
    - cookies are deleted when connecting to a cluster with
    `AUTH_MODE=disabled`. connecting to a different ray cluster (with
    different auth token) using the same endpoint (eg due to port-forwarding
    or local testing) will reshow the popup and ask users to input the right
    token.

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>
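
The lookup order and cookie attributes listed in the commit above can be sketched with the standard library alone. The cookie name `ray-auth-token`, the `Bearer ` prefix, and both function names are assumptions made for illustration; the commit message does not specify them.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "ray-auth-token"  # hypothetical cookie name


def extract_token(headers, cookie_header):
    """Prefer the Authorization header; fall back to the auth cookie."""
    auth = headers.get("Authorization")
    if auth and auth.startswith("Bearer "):  # "Bearer " scheme is an assumption
        return auth[len("Bearer "):]
    if cookie_header:
        jar = SimpleCookie()
        jar.load(cookie_header)
        morsel = jar.get(COOKIE_NAME)
        if morsel is not None:
            return morsel.value
    return None


def build_set_cookie(token, https):
    """Build a Set-Cookie value that page JavaScript cannot read."""
    jar = SimpleCookie()
    jar[COOKIE_NAME] = token
    morsel = jar[COOKIE_NAME]
    morsel["httponly"] = True
    morsel["samesite"] = "Strict"
    morsel["path"] = "/"
    if https:
        # Only mark the cookie Secure when the dashboard is served over HTTPS.
        morsel["secure"] = True
    return morsel.OutputString()


print(extract_token({}, f"{COOKIE_NAME}=abc123"))
print(build_set_cookie("abc123", https=True))
```

Setting the cookie server-side with `HttpOnly` means a script injected into the page cannot read the token, which is the XSS concern the commit message names.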

commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:41:57 2025 -0800

    [ci] doc check: remove dependency on `ray_ci` (#58516)

    this makes it possible to run on a different python version than the CI
    wrapper code.

    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: Lonnie Liu <[email protected]>

commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:38 2025 -0800

    [ci] mark github.Repository as typechecking (#58582)

    so that importing test.py does not always import github

    github repo imports jwt, which then imports cryptography and can lead to
    issues on windows.

    Signed-off-by: Lonnie Liu <[email protected]>
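
For readers unfamiliar with the pattern named in the commit above, here is a minimal sketch of a type-checking-only import. The function `describe_repo` is hypothetical; the point is only that the `github` import is skipped at runtime, so the jwt and cryptography imports it pulls in are never triggered.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers; skipped when the module runs.
    from github.Repository import Repository


def describe_repo(repo: "Repository") -> str:
    """The string annotation keeps the type from being evaluated at runtime."""
    return f"repo: {repo.full_name}"


# Runs without PyGithub installed; any object with `full_name` works here.
print(describe_repo(SimpleNamespace(full_name="ray-project/ray")))
```

The trade-off is that the name `Repository` does not exist at runtime, so it can only appear in annotations, not in `isinstance` checks.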

commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:23 2025 -0800

    [wheel] stop building python 3.9 wheels on the pipelines (#58587)

    also stops building python 3.9 aarch64 images

    Signed-off-by: Lonnie Liu <[email protected]>

commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:32:21 2025 -0800

    [serve] run tests in python 3.10 (#58586)

    all tests are passing

    Signed-off-by: Lonnie Liu <[email protected]>

commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <[email protected]>
Date:   Thu Nov 13 07:37:52 2025 -0800

    [core] Add monitoring in raylet for resource view (#58382)

    We today have very little observability into pubsub. On a raylet one of
    the most important states that need to be propagated through the cluster
    via pubsub is cluster membership. All raylets should in an eventual BUT
    timely fashion agree on the list of available nodes. This metric just
    emits a simple counter to keep track of the node count.

    More pubsub observability to come.

    ---------

    Signed-off-by: zac <[email protected]>
    Signed-off-by: Zac Policzer <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <[email protected]>
Date:   Wed Nov 12 23:04:37 2025 -0800

    [Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)

    Fixes HTTP streaming file downloads in Ray Data's download operation.
    Some URIs (especially HTTP streams) require `open_input_stream` instead
    of `open_input_file`.

    - Modified `download_bytes_threaded` in `plan_download_op.py` to try
    both `open_input_file` and `open_input_stream` for each URI
    - Improved error handling to distinguish between different error types
       - Failed downloads now return `None` gracefully instead of crashing
    ```
    import pyarrow as pa
    from ray.data.context import DataContext
    from ray.data._internal.planner.plan_download_op import download_bytes_threaded
    urls = [
        "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
    ]
    table = pa.table({"url": urls})
    ctx = DataContext.get_current()
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
    result_table = results[0]
    for i in range(result_table.num_rows):
        url = result_table['url'][i].as_py()
        bytes_data = result_table['bytes'][i].as_py()

        if bytes_data is None:
            print(f"Row {i}: FAILED (None) - try-catch worked ✓")
        else:
            print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
        print(f"  URL: {url[:60]}...")

    print("\n✅ Test passed: Failed downloads return None instead of crashing.")
    ```

    Before the fix:
    ```
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
        test_download_expression_with_streaming_fallback()
      File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
        with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
        if not self.__exit__(*sys.exc_info()):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
        setattr(self.target, self.attribute, self.temp_original)
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
    (base) ray@ip-10-0-39-21:~/default$ python test.py
    2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    Traceback (most recent call last):
      File "/home/ray/default/test.py", line 16, in <module>
        results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
        uri_bytes = list(
                    ^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
        raise item
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    ```
    After the fix:
    ```
    Row 0: SUCCESS (189370 bytes)
      URL: https://static-assets.tesla.com/configurator/compositor?cont...
    ```

    Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
    previously failed:
       - ✅ Successfully downloads HTTP stream files
       - ✅ Gracefully handles failed downloads (returns None)
       - ✅ Maintains backward compatibility with existing file downloads

    ---------

    Signed-off-by: xyuzh <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>
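
The fallback described in the commit above (try `open_input_file`, then `open_input_stream`, and return `None` if both fail) can be sketched against any object exposing those two methods. This is an illustration under stated assumptions, not the code in `plan_download_op.py`: the function name `read_uri_bytes`, the broad `except`, and the fake filesystem classes are all hypothetical stand-ins.

```python
import io


def read_uri_bytes(fs, path):
    """Try a random-access open first, then a streaming open; None on failure."""
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except Exception:
            # Broad catch is a simplification for this sketch.
            continue
    return None


class _UnseekableFile:
    """Mimics a file handle whose read fails because it cannot seek."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def read(self):
        raise ValueError("Cannot seek streaming HTTP file")


class FakeStreamingFS:
    """Stand-in filesystem: file-style reads fail, stream-style reads work."""

    def open_input_file(self, path):
        return _UnseekableFile()

    def open_input_stream(self, path):
        return io.BytesIO(b"payload")


print(read_uri_bytes(FakeStreamingFS(), "https://example.com/image"))
```

Note that the `read()` call sits inside the `try`: in the traceback above the failure surfaces during `f.read()`, not when the file is opened, so wrapping only the open call would not trigger the fallback.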

commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 22:19:50 2025 -0800

    [ci] pin docker client version (#58579)

    otherwise, the newer docker client will refuse to communicate with the
    docker daemon that is on an older version.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 22:08:45 2025 -0800

    [deps] adding include_setuptools flag for depset config (#58580)

    Adding optional `include_setuptools` flag for depset configuration

    If the flag is set on a depset config, `--unsafe-package setuptools` is
    not included for depset compilation.

    If the flag is absent from a depset config (default false),
    `--unsafe-package setuptools` is appended to the default arguments.

    ---------

    Signed-off-by: elliot-barn <[email protected]>
    Co-authored-by: Lonnie Liu <[email protected]>
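
The flag behavior described in the commit above amounts to one conditional. The sketch below assumes the depset config is available as a plain dict; the function name `build_compile_args` and the `default_args` parameter are hypothetical and do not reflect the actual raydepsets code.

```python
def build_compile_args(depset_config, default_args=()):
    """Return compile arguments, adding the setuptools exclusion unless opted in."""
    args = list(default_args)
    # The flag defaults to false, so setuptools is treated as unsafe by default.
    if not depset_config.get("include_setuptools", False):
        args += ["--unsafe-package", "setuptools"]
    return args


print(build_compile_args({}))
print(build_compile_args({"include_setuptools": True}))
```

The first call prints `['--unsafe-package', 'setuptools']` and the second prints `[]`.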

commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 20:36:43 2025 -0800

    [serve] remove minbuild-serve-py3.9 (#58585)

    nothing is using it anymore

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <[email protected]>
Date:   Wed Nov 12 18:48:27 2025 -0800

    [core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)

    Signed-off-by: irabbani <[email protected]>

commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 18:26:25 2025 -0800

    [deps] update `requirements_buildkite.txt` (#58574)

    as the pydantic version is pinned in `requirements-doc.txt` now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 16:38:54 2025 -0800

    Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)

    Reverts ray-project/ray#58535

    failing on windows.. :(

commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <[email protected]>
Date:   Wed Nov 12 16:37:24 2025 -0800

    [Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
    - Support upserting iceberg tables for IcebergDatasink
    - Update schema on APPEND and UPSERT
    - Enable overwriting the entire table

    Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite
    functionality. Also for append, the library now handles the transaction
    logic implicitly so that burden can be lifted from Ray Data.

    ---------

    Signed-off-by: Goutam <[email protected]>

commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <[email protected]>
Date:   Wed Nov 12 16:31:26 2025 -0800

    [core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)

    Using GetNodeAddressAndLiveness in raylet client pool instead of the
    bulkier Get, same for AsyncGetAll. Seems like it was already done in
    core worker client pool, so just making the same change for raylet
    client pool.

    Signed-off-by: joshlee <[email protected]>

commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:35 2025 -0800

    [doc] set default python env to 3.10 (#58570)

    we stop supporting building with python 3.9 now

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:20 2025 -0800

    [bazel] rename constraint from hermetic to python_version (#58499)

    which is more accurate

    also moves python constraint definitions into `bazel/` directory and
    registering python 3.10 platform with hermetic toolchain

    this allows performing migration from python 3.9 to python 3.10
    incrementally

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 14:23:17 2025 -0800

    [images][deps] raydepsets base extra depset (#58461)

    generating depsets for base extra python requirements
    Installing requirements in base extra image

    ---------

    Signed-off-by: elliot-barn <[email protected]>

commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <[email protected]>
Date:   Thu Nov 13 07:08:15 2025 +0900

    [Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
    Currently Ray Data has a preprocessor called `RobustScaler`. This scales
    the data based on given quantiles. Calculating the quantiles involves
    sorting the entire dataset by column for each column (C sorts for C
    number of columns), which, for a large dataset, will require a lot of
    calculations.

    ** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
    as I couldn't actually find well-maintained tdigest libraries for
    python. ddsketch is better maintained.

    ** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
    aggregator

    ---------

    Signed-off-by: kyuds <[email protected]>
    Signed-off-by: Daniel Shin <[email protected]>
    Co-authored-by: You-Cheng Lin <[email protected]>
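
To make the cost described in the commit above concrete, here is a minimal sketch of exact quantile-based scaling for a single column, which needs one full sort per column. It assumes the usual robust-scaling formula, (x - median) / (high quantile - low quantile), with a (0.25, 0.75) quantile range; the function names are hypothetical and the approximate aggregator adopted by the commit is not shown.

```python
def quantile(sorted_values, q):
    """Exact quantile with linear interpolation between neighboring ranks."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def robust_scale(column, quantile_range=(0.25, 0.75)):
    """Scale one column; assumes the two quantiles differ (non-zero spread)."""
    ordered = sorted(column)  # the per-column sort that becomes costly at scale
    median = quantile(ordered, 0.5)
    low = quantile(ordered, quantile_range[0])
    high = quantile(ordered, quantile_range[1])
    return [(x - median) / (high - low) for x in column]


print(robust_scale([1, 2, 3, 4, 5]))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

An approximate quantile sketch trades a small error in `median`, `low`, and `high` for avoiding the sort, which is the motivation stated in the commit message.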

commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <[email protected]>
Date:   Thu Nov 13 03:33:18 2025 +0530

    [Core] support token auth in ray client server  (#58557)
    support token auth in ray client server by using the existing grpc
    interceptors. This pr refactors the code to:
    - add/rename sync and async client and server interceptors
    - create grpc utils to house grpc channel and server creation logic,
    python codebase is updated to use these methods
    - separate tests for sync and async interceptors
    - make existing authentication integration tests run in RAY_CLIENT
    mode

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <[email protected]>
Date:   Wed Nov 12 13:45:02 2025 -0800

    [Core] Move request id creation to worker to address plasma get perf regression (#58390)
    This PR addresses the performance regression introduced in the [PR to make
    ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
    Specifically, the previous PR requires the worker to block and wait for
    AsyncGet to return with a reply of the request id needed for correctly
    cleaning up get requests. This additional synchronous step causes the
    plasma store Get to regress in performance.

    This PR moves the request id generation step to the worker,
    removing the blocking step to fix the perf regression.
    - [PR which introduced perf
    regression](https://github.com/ray-project/ray/pull/57911)
    - [PR which observed the
    regression](https://github.com/ray-project/ray/pull/58175)
    New performance of the change measured by `ray microbenchmark`.
    <img width="485" height="17" alt="image"
    src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
    />

    Original performance prior to the change. Here we focus on the
    regressing `single client get calls (Plasma Store)` metric, where our
    new performance returns us back to the original 10k per second range
    compared to the existing sub 5k per second.
    <img width="811" height="355" alt="image"
    src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
    />

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>

commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 12:36:21 2025 -0800

    [doc] symlink the doc dependency lock file (#58520)

    and ask people to use that lock file for building docs.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <[email protected]>
Date:   Wed Nov 12 12:08:28 2025 -0800

    [train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)

    1. JaxTrainer relies on the runtime env var `JAX_PLATFORMS` being set
    in order to initialize jax.distributed:
    https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
    2. Before this change, users had to both configure `use_tpu=True` in
    `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able to
    start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
    3. If users use other jax.distributed-based libraries like Orbax, this
    sometimes leads to a misleading error about distributed initialization.
    4. After this change, if the user sets `use_tpu=True`, we automatically
    add `tpu` to the `JAX_PLATFORMS` env var.
    5. A TPU unit test is not available at this time; we will explore how
    to cover it later.
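
    As a rough illustration of item 4, the env var value could be derived
    along the following lines. This is only a sketch: the helper name
    `resolve_jax_platforms` and the merge behavior are assumptions for
    illustration, not the actual Ray Train implementation.

    ```python
    from typing import Optional


    def resolve_jax_platforms(use_tpu: bool, existing: Optional[str] = None) -> Optional[str]:
        """Hypothetical helper: compute a JAX_PLATFORMS value from a use_tpu flag."""
        # JAX_PLATFORMS may be a comma-separated string, e.g. "tpu,cpu".
        platforms = [p.strip() for p in (existing or "").split(",") if p.strip()]
        if use_tpu and "tpu" not in platforms:
            # Assumption: list "tpu" first so it is the preferred platform.
            platforms.insert(0, "tpu")
        # Return None when there is nothing to set.
        return ",".join(platforms) or None
    ```

    With this sketch, a value the user already set is preserved and `tpu`
    is only added when it is missing.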

    ---------

    Signed-off-by: Lehui Liu <[email protected]>

commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <[email protected]>
Date:   Wed Nov 12 12:04:16 2025 -0800

    [Data] Add `Ranker` Interface (#58513)
    Creates a ranker interface that ranks the best operator to run next
    in `select_operator_to_run`. This change only refactors the existing
    code. The ranking value must be comparable.
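
    A minimal sketch of what such an interface might look like, assuming
    operators are plain dicts and that the lowest ranking value wins. The
    class names, the `rank_operator` method, and the simplified
    `select_operator_to_run` below are illustrative stand-ins, not the
    actual Ray Data code.

    ```python
    from abc import ABC, abstractmethod
    from typing import Any, Dict, Generic, List, TypeVar

    K = TypeVar("K")  # ranking value; must support comparison


    class Ranker(ABC, Generic[K]):
        """Hypothetical interface: maps an operator to a comparable value."""

        @abstractmethod
        def rank_operator(self, op: Dict[str, Any]) -> K:
            ...


    class QueuedBytesRanker(Ranker[int]):
        """Illustrative ranker: prefer the operator with the fewest queued bytes."""

        def rank_operator(self, op: Dict[str, Any]) -> int:
            return op["queued_bytes"]


    def select_operator_to_run(ops: List[Dict[str, Any]], ranker: Ranker) -> Dict[str, Any]:
        # In this sketch the operator with the lowest ranking value runs next.
        return min(ops, key=ranker.rank_operator)
    ```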

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 11:50:42 2025 -0800

    [bazel] upgrade bazel python rules to 0.25.0 (#58535)

    previously it was actually using 0.4.0, which is set up by the grpc
    repo. the declaration in the workspace file was being shadowed.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:34:59 2025 -0800

    [Data] Refactor concurrency validation tests in `test_map.py` (#58549)

    The original `test_concurrency` function combined multiple test
    scenarios into a single test with complex control flow and expensive Ray
    cluster initialization. This refactoring extracts the parameter
    validation tests into focused, independent tests that are faster,
    clearer, and easier to maintain.

    Additionally, the original test included "validation" cases that tested
    valid concurrency parameters but didn't actually verify that concurrency
    was being limited correctly—they only checked that the output was
    correct, which isn't useful for validating the concurrency feature
    itself.

    **Key improvements:**
    - Split validation tests into `test_invalid_func_concurrency_raises` and
    `test_invalid_class_concurrency_raises`
    - Use parametrized tests for different invalid concurrency values
    - Switch from `shutdown_only` with explicit `ray.init()` to
    `ray_start_regular_shared` to eliminate cluster initialization overhead
    - Minimize test data from 10 blocks to 1 element since we're only
    validating parameter errors
    - Remove non-validation tests that didn't verify concurrency behavior

    The validation tests now execute significantly faster and provide
    clearer failure messages. Each test has a single, well-defined purpose
    making maintenance and debugging easier.

    ---------

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:32:48 2025 -0800

    [Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)

    This PR improves documentation consistency in the `python/ray/data`
    module by converting all remaining rST-style docstrings (`:param:`,
    `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
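
    For illustration, the same docstring in both styles. The `scale`
    functions are made-up examples, not code from `ray.data`.

    ```python
    def scale_rst(value: float, factor: float = 2.0) -> float:
        """Multiply a value by a factor (rST-style docstring).

        :param value: The number to scale.
        :param factor: The multiplier.
        :return: The scaled number.
        """
        return value * factor


    def scale_google(value: float, factor: float = 2.0) -> float:
        """Multiply a value by a factor (Google-style docstring).

        Args:
            value: The number to scale.
            factor: The multiplier.

        Returns:
            The scaled number.
        """
        return value * factor
    ```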

    **Files modified:**
    - `python/ray/data/preprocessors/utils.py` - Converted
    `StatComputationPlan.add_callable_stat()`
    - `python/ray/data/preprocessors/encoder.py` - Converted
    `unique_post_fn()`
    - `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
    and `BlockColumnAccessor.is_composed_of_lists()`
    - `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
    Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <[email protected]>
Date:   Wed Nov 12 09:07:32 2025 -0800

    [serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)

    Signed-off-by: Nikhil Ghosh <[email protected]>

commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <[email protected]>
Date:   Wed Nov 12 11:50:05 2025 -0500

    [Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)

    This value was previously set to align with the earlier default of 4.

    However, after some consideration, I've realized that 4 is too high,
    so this PR lowers it to 2.
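
    As a back-of-the-envelope illustration, assuming the factor simply
    multiplies an actor's max concurrency to cap its tasks in flight (the
    helper `max_tasks_in_flight` is hypothetical; only the constant name
    comes from this commit):

    ```python
    # Constant name taken from the commit title; 2 is the new default.
    DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR = 2


    def max_tasks_in_flight(
        max_concurrency: int,
        factor: int = DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR,
    ) -> int:
        """Assumed relationship: tasks in flight per actor = max_concurrency * factor."""
        return max_concurrency * factor
    ```

    Under that assumption, an actor with `max_concurrency=4` would go from
    16 tasks in flight (factor 4) to 8 (factor 2).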

    Signed-off-by: Alexey Kudinkin <[email protected]>

commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <[email protected]>
Date:   Wed Nov 12 07:44:53 2025 -0800

    [core] fix idle node termination on object pulling (#57928)

    Currently, a node is considered idle while pulling objects from the
    remote object store. This can lead to situations where a node is
    terminated as idle, causing the cluster to enter an infinite loop when
    pulling large objects that exceed the node idle termination timeout.

    This PR fixes the issue by treating object pulling as a busy activity.
    Note that nodes can still accept additional tasks while pulling objects
    (since pulling consumes no resources), but the auto-scaler will no
    longer terminate the node prematurely.
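
    A simplified sketch of the idea, assuming idleness is decided from a
    snapshot of node activity. The `NodeActivity` type and `is_idle`
    function are hypothetical and only illustrate the rule; they are not
    the actual raylet or autoscaler code.

    ```python
    from dataclasses import dataclass


    @dataclass
    class NodeActivity:
        """Hypothetical snapshot of what a node is currently doing."""

        running_tasks: int = 0
        active_object_pulls: int = 0


    def is_idle(activity: NodeActivity) -> bool:
        # With the fix, in-progress object pulls count as activity, so a node
        # that is only pulling objects is no longer reported as idle.
        return activity.running_tasks == 0 and activity.active_object_pulls == 0
    ```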

    Closes #54372

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <[email protected]>
Date:   Wed Nov 12 05:40:47 2025 -0800

    [core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)

    When actors terminate gracefully, Ray calls the actor's
    `__ray_shutdown__()` method if defined, allowing for cleanup of
    resources. However, this method is not invoked when the actor goes out
    of scope due to `del actor`.

    Traced through the entire code path, and here's what happens:

    Flow when `del actor` is called:

    1. **Python side**: `ActorHandle.__del__()` ->
    `worker.core_worker.remove_actor_handle_reference(actor_id)`

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

    2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
    `reference_counter_->RemoveLocalReference()`
    - When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
    callback

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

    3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
    `AsyncReportActorOutOfScope()` to GCS

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

    4. **GCS receives notification**: `HandleReportActorOutOfScope()`
    - **THE PROBLEM IS HERE** ([line 279 in
    `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
       ```cpp
       DestroyActor(actor_id,
                    GenActorOutOfScopeCause(actor),
                    /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                    [reply, send_reply_callback]() {
       ```

    5. **Actor worker receives kill signal**: `HandleKillActor()` in
    [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
       ```cpp
       if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
           ForceExit(...)  // Skips __ray_shutdown__
       } else {
           Exit(...)  // Would call __ray_shutdown__
       }
       ```

    6. **ForceExit path**: Bypasses graceful shutdown -> No
    `__ray_shutdown__` callback invoked.

    This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
    actors. Also, updated the docs.
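
    The difference between the two paths in step 5 can be mimicked with a
    toy model in plain Python. `handle_kill_actor` below is a made-up
    stand-in for the worker-side branch, not Ray's implementation; only
    the `__ray_shutdown__` hook name comes from the description above.

    ```python
    class MyActor:
        """Toy actor that records whether its cleanup hook ran."""

        def __init__(self):
            self.cleaned_up = False

        def __ray_shutdown__(self):
            # Cleanup hook; Ray calls this only on graceful termination.
            self.cleaned_up = True


    def handle_kill_actor(actor, force_kill: bool) -> str:
        """Hypothetical stand-in for the force-kill vs. graceful branch."""
        if force_kill:
            # Force exit: the cleanup hook is skipped.
            return "force_exit"
        hook = getattr(actor, "__ray_shutdown__", None)
        if callable(hook):
            hook()
        return "exit"
    ```

    In this model, passing `force_kill=False` for out-of-scope actors is
    what lets the hook run.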

    ---------

    Signed-off-by: Sagar Sumit <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 19:00:10 2025 +0530

    [Core] use client interceptor for adding auth token in c++ client calls (#58424)
    - Use client interceptor for adding auth tokens in grpc calls when
    `AUTH_MODE=token`
    - BuildChannel() will automatically include the interceptor
    - Removed `auth_token` parameter from `ClientCallImpl`
    - Removed manual auth from `python_gcs_subscriber.cc`
    - Added tests to verify auth works for autoscaler APIs
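
    The interceptor pattern itself can be sketched without gRPC: wrap the
    call so that every invocation carries the token, instead of passing
    the token through each call site. The sketch below is plain Python,
    not the gRPC API or the C++ code in this PR, and the `authorization`
    header with a `Bearer` scheme is an assumption for illustration.

    ```python
    from typing import Callable, List, Tuple

    Metadata = List[Tuple[str, str]]
    RpcCall = Callable[[str, Metadata], Metadata]


    def with_auth_token(call: RpcCall, token: str) -> RpcCall:
        """Wrap an RPC callable so every invocation carries the auth token."""

        def intercepted(method: str, metadata: Metadata) -> Metadata:
            # Header name and scheme are assumptions, not confirmed by the PR.
            return call(method, list(metadata) + [("authorization", f"Bearer {token}")])

        return intercepted


    def echo_metadata(method: str, metadata: Metadata) -> Metadata:
        """Stand-in for a real RPC: returns the metadata it was sent."""
        return metadata
    ```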

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 11:25:11 2025 +0530

    [core] Token auth usability improvements (#58408)
    - rename RAY_auth_mode → RAY_AUTH_MODE environment variable across
    codebase
    - Excluded healthcheck endpoints from authentication for Kubernetes
    compatibility
    - Fixed dashboard cookie handling to respect auth mode and clear stale
    tokens when switching clusters
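
    The healthcheck exemption can be pictured roughly as follows. The
    path set and the `requires_token` helper are hypothetical; the commit
    does not list the actual endpoints.

    ```python
    # Hypothetical healthcheck paths, for illustration only.
    HEALTHCHECK_PATHS = frozenset({"/healthz"})


    def requires_token(path: str, auth_mode: str) -> bool:
        """Return True if a request to `path` must present an auth token."""
        if auth_mode != "token":
            # Authentication is only enforced in token mode.
            return False
        # Probes (e.g. from Kubernetes) cannot attach a token, so skip them.
        return path not in HEALTHCHECK_PATHS
    ```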

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <[email protected]>
Date:   Tue Nov 11 19:50:52 2025 -0800

    [doc][serve][llm] Attached the correct figure to the pd docs (#58543)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <[email protected]>
Date:   Tue Nov 11 18:00:43 2025 -0800

    [doc] downgrade readthedocs to use python 3.10 (#58536)

    be consistent with the default build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 17:26:15 2025 -0800

    [core] Fix auth test import (#58554)

    The python test step is failing on master now because of this. Probably
    a logical merge conflict.
    ```
    FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
    ...

    [2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
    --
      | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
    ```

    Signed-off-by: dayshah <[email protected]>

commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:44:05 2025 -0800

    [core][rdt] Abort NIXL and allow actor reuse on failed transfers  (#56783)

    Signed-off-by: dayshah <[email protected]>

commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:26:17 2025 -0800

    [core] Improve kill actor logs (#58544)

    Signed-off-by: dayshah <[email protected]>

commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <[email protected]>
Date:   Tue Nov 11 12:16:41 2025 -0800

    [docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)

    Signed-off-by: Nikhil Ghosh <[email protected]>
    Signed-off-by: Nikhil G <[email protected]>

commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:13:13 2025 -0800

    [core] Make GlobalState lazy initialization thread-safe (#58182)

    Signed-off-by: dayshah <[email protected]>

commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <[email protected]>
Date:   Tue Nov 11 09:43:05 2025 -0800

    [core] Scheduling a detached actor with a placement group is not recommended (#57726)

    If users schedule a detached actor into a placement group, Raylet will
    kill the actor when the placement group is removed. The actor will be
    stuck in the `RESTARTING` state forever if it's restartable until users
    explicitly kill it.

    In that case, if users try to `get_actor` with the actor's name, it can
    still return the restarting actor, but no process exists. It will no
    longer be restarted because the PG is gone, and no PG with the same ID
    will be created during the cluster's lifetime.

    The better behavior would be for Ray to transition a task/actor's state
    to dead when it is impossible to restart. However, this would add too
    much complexity to the core, so I think it's not worth it. Therefore,
    this PR adds a warning log, and users should use detached actors or PGs
    correctly.

    Example: Run the following script and run `ray list actors`.

    ```python
    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    from ray.util.placement_group import placement_group, remove_placement_group

    @ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
    class Actor:
      pass

    ray.init()

    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())

    actor = Actor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()

    ray.get(actor.__ray_ready__.remote())
    ```

    - [ ] Bug fix 🐛
    - [ ] New feature ✨
    - [x] Enhancement 🚀
    - [ ] Code refactoring 🔧
    - [ ] Documentation update 📖
    - [ ] Chore 🧹
    - [ ] Style 🎨

    **Does this PR introduce breaking changes?**
    - [ ] Yes ⚠️
    - [x] No

    **Testing:**
    - [ ] Added/updated tests for my changes
    - [x] Tested the changes manually
    - [ ] This PR is not tested ❌ _(please explain why)_

    **Code Quality:**
    - [x] Signed off every commit (`git commit -s`)
    - [x] Ran pre-commit hooks ([setup
    guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

    **Documentation:**
    - [ ] Updated documentation (if applicable) ([contribution
    guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
    - [ ] Added new APIs to `doc/source/` (if applicable)

    ---------

    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <[email protected]>
Date:   Tue Nov 11 07:19:19 2025 -0800

    [core] enable open telemetry by default (#56432)

    This PR enables OpenTelemetry as the default backend for the Ray
    metrics stack. The bulk of this PR actually fixes tests that were
    written with assumptions that no longer hold. For ease of review, I
    explain the reason for each test change inline in the comments, next
    to the change itself.

    This PR also depends on a release of vllm (so that we can update the
    minimal supported version of vllm in ray).

    Test:
    - CI

    <!-- CURSOR_SUMMARY -->
    ---

    > [!NOTE]
    > Enable OpenTelemetry metrics backend by default and refactor
    metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
    metric names.
    >
    > - **Core/Config**:
    > - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
    `true` in `ray_constants.py` and `ray_config_def.h`.
    > - Metrics `Counter`: use `CythonCount` by default; keep legacy
    `CythonSum` only when OTEL is explicitly disabled.
    > - **Serve/Metrics Tests**:
    > - Replace text scraping with `PrometheusTimeseries` and
    `fetch_prometheus_metric_timeseries` throughout.
    > - Update metric names/tags to `ray_serve_*` and counter suffixes
    `*_total`; adjust latency metric names and processing/queued gauges.
    > - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
    pass through helpers.
    > - **General Test Fixes**:
    > - Remove OTEL parametrization/fixtures; simplify expectations where
    counters-as-gauges no longer apply; drop related tests.
    > - Cardinality tests: include `"low"` level and remove OTEL gating;
    stop injecting `enable_open_telemetry` in system config.
    > - Actor/state/thread tests: migrate to cluster fixtures, wait for
    dashboard agent, and adjust expected worker thread counts.
    > - **Build**:
    > - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
    from C++ stats test.
    >
    > <sup>Written by [Cursor
    Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
    1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically
    on new commits. Configure
    [here](https://cursor.com/dashboard?tab=bugbot).</sup>
    <!-- /CURSOR_SUMMARY -->

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>

commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:15:41 2025 +0530

    [Core] move authentication_test_utils into ray._private to fix macos tests (#58528)

    The auth token test setup in `conftest.py` is breaking the macOS
    tests. There are two test scripts (`test_microbenchmarks.py` and
    `test_basic.py`) that run after the wheel is installed, but without
    editable mode. For these tests to pass, `conftest.py` cannot import
    anything under `ray.tests`.

    This PR moves `authentication_test_utils` into `ray._private` to fix
    this issue.

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:10:56 2025 +0530

    [Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)

    Add a grpc service interceptor to intercept all dashboard agent rpc
    calls and validate the presence of auth token (when auth mode is token)

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <[email protected]>
Date:   Tue Nov 11 21:35:42 2025 +0800

    [Docs] fix pattern_async_actor demo typo (#58486)

    fix pattern_async_actor demo typo. Add `self.`.

    ---------

    Signed-off-by: curiosity-hyf <[email protected]>

commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <[email protected]>
Date:   Mon Nov 10 18:28:46 2025 -0800

    Update pydoclint to version 0.8.1 (#58490)
    * Does the work to bump pydoclint up to the latest version
    * And allowlist any new violations it finds

    ---------

    Signed-off-by: Thomas Desrosiers <[email protected]>

commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <[email protected]>
Date:   Mon Nov 10 17:34:13 2025 -0800

    [Data] - Iceberg support predicate & projection pushdown (#58286)
    Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
    conjunction with this PR should speed up reads from Iceberg.

    Once the above change lands, we can add the pushdown interface support
    for IcebergDatasource

    ---------

    Signed-off-by: Goutam <[email protected]>

commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <[email protected]>
Date:   Mon Nov 10 17:28:23 2025 -0800

    [serve][llm] Fix import path in muli-node release test (#58498)

    Signed-off-by: Seiji Eicher <[email protected]>

commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 16:04:48 2025 -0800

    [ci] upgrade rayci version (#58514)

    to 0.21.0; supports wanda priority now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <[email protected]>
Date:   Mon Nov 10 14:05:15 2025 -0800

    [core] Add owned object spill metrics (#57870)

    This PR adds 2 new metrics to core_worker by way of the reference
    counter. The two new metrics keep track of the count and size of objects
    owned by the worker as well as keeping track of their states. States are
    defined as:

    - **PendingCreation**: An object that is pending creation and hasn't
    finished its initialization (and is sizeless)
    - **InPlasma**: An object which has an assigned node address and isn't
    spilled
    - **Spilled**: An object which has an assigned node address and is
    spilled
    - **InMemory**: An object which has no assigned address but isn't
    pending creation (and therefore, must be local)

    The approach used by these new metrics is to examine the state 'before
    and after' any mutation of the reference in the reference_counter. This
    is required in order to do the appropriate bookkeeping (decrementing
    some values and incrementing others). Admittedly, the metrics may be
    read in between a decrement and its matching increment, depending on
    when the RecordMetrics loop runs. This unfortunate side effect,
    however, seems preferable to adding mutual exclusion with metric
    collection, as this is potentially a high-throughput code path.
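
    A simplified model of this 'before and after' bookkeeping, using the
    four states listed above. The `Ref` record, `state_of`, `add_ref`,
    and `mutate` are hypothetical names for illustration; the real logic
    lives in the C++ reference counter and also tracks sizes.

    ```python
    from collections import Counter
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Ref:
        """Hypothetical, simplified reference record."""

        pending_creation: bool = True
        node_address: Optional[str] = None
        spilled: bool = False


    def state_of(ref: Ref) -> str:
        if ref.pending_creation:
            return "PendingCreation"
        if ref.node_address is not None:
            return "Spilled" if ref.spilled else "InPlasma"
        return "InMemory"


    counts: Counter = Counter()


    def add_ref(ref: Ref) -> None:
        counts[state_of(ref)] += 1


    def mutate(ref: Ref, **changes) -> None:
        """Apply a mutation and move the ref between per-state counters."""
        before = state_of(ref)
        for name, value in changes.items():
            setattr(ref, name, value)
        after = state_of(ref)
        if before != after:
            # A metrics reader running between these two lines would see a
            # briefly inconsistent total; the PR accepts this trade-off.
            counts[before] -= 1
            counts[after] += 1
    ```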

    In addition, performing live counts seemed preferable to doing a full
    accounting of the object store and across all references at the time of
    metric collection. The reason is that the reference counter is
    potentially tracking millions of objects, and each metric scan could
    be very expensive. So keeping a running count (despite being
    potentially inaccurate for short periods) seemed the right call.

    This PR also allows for object size to change due to potentially
    non-deterministic instantiation (say an object is initially created,
    but its primary copy dies, and then the recreation fails).
    This is an edge case, but seems important for completeness' sake.

    ---------

    Signed-off-by: zac <[email protected]>

commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:02:11 2025 -0800

    [java] remove local genrule `//java:ray_java_pkg` (#58503)

    using `bazelisk run //java:gen_ray_java_pkg` everywhere

    Signed-off-by: Lonnie Liu <[email protected]>

commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:27 2025 -0800

    [ci] apply isort to release test directory, part 1 (#58505)

    excluding `*_tests` directories for now to reduce the impact

    Signed-off-by: Lonnie Liu <[email protected]>

commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:06 2025 -0800

    [doc] change link check to run on python 3.12 (#58506)

    migrating all doc related things to run on python 3.12

    Signed-off-by: Lonnie Liu <[email protected]>

commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:00:01 2025 -0800

    [doc] ci: move doc annotation check to python 3.12 (#58507)

    be consistent with doc build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <[email protected]>
Date:   Mon Nov 10 12:52:43 2025 -0800

    [data] Clear queue for manually mark_execution_finished operators (#58441)
    Currently, we clear _external_ queues when an operator is manually
    marked as finished, but we don't clear its _internal_ queues. This PR
    fixes that.
    Fixes this test:
    https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <[email protected]>
Date:   Mon Nov 10 10:54:34 2025 -0800

    [data] Update depsets for multimodal inference release tests (#57233)

    Update remaining multimodal release tests to use new depsets.

commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 10:30:00 2025 -0800

    [ci] seperate doc related jobs into its own group (#58454)

    so that they are not called lints any more

    Signed-off-by: Lonnie Liu <[email protected]>

commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <[email protected]>
Date:   Mon Nov 10 23:45:38 2025 +0530

    increase timeout for test_initial_replica tests (#58423)

    - `test_target_capacity` windows test is failing, possibly because we
    have put up a short timeout of 10 seconds, increasing it to verify
    whether timeout is an issue or not.

    Signed-off-by: harshit <[email protected]>

commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <[email protected]>
Date:   Mon Nov 10 09:39:43 2025 -0800

    Define an env for controlling UVloop (#58442)

    Our Serve ingress keeps running into the error below, related to
    `uvloop`, under heavy load:
    ```
    File descriptor 97 is used by transport
    ```
    The uvloop team has a
    [PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it
    seems that no one is working on it.

    One workaround mentioned in the
    ([PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982))
    is to simply turn off uvloop.
    We tried it in our environment and didn't see any major performance
    difference. Hence, this PR defines a new env var for controlling
    uvloop.
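
    A sketch of how such a switch could work. The env var name
    `HYPOTHETICAL_ENABLE_UVLOOP` and the default of "enabled" are
    placeholders, since the description above does not name the actual
    variable; the fallback to asyncio when uvloop is not installed is
    also an assumption.

    ```python
    import asyncio
    import os
    from typing import Mapping, Optional


    def use_uvloop(env: Optional[Mapping[str, str]] = None) -> bool:
        """Return True unless the (placeholder) env var disables uvloop."""
        env = os.environ if env is None else env
        return env.get("HYPOTHETICAL_ENABLE_UVLOOP", "1") == "1"


    def new_event_loop(env: Optional[Mapping[str, str]] = None) -> asyncio.AbstractEventLoop:
        if use_uvloop(env):
            try:
                import uvloop  # optional dependency

                return uvloop.new_event_loop()
            except ImportError:
                pass  # fall back to the standard asyncio loop
        return asyncio.new_event_loop()
    ```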

    Signed-off-by: jugalshah291 <[email protected]>

commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <[email protected]>
Date:   Mon Nov 10 23:29:03 2025 +0800

    [Doc][KubeRay] eliminate vale errors (#58429)

    Fix some Vale errors and suggestions in the kai-scheduler document.

    See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719

    Signed-off-by: fscnick <[email protected]>

commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <[email protected]>
Date:   Mon Nov 10 16:24:34 2025 +0100

    [air] Add stacklevel option to deprecation_warning (#58357)

    Currently, deprecation warnings are sometimes not informative enough.
    When the warning is triggered, it does not tell us *where* the
    deprecated feature is used. For example, Ray internally raises a
    deprecation warning when an `RLModuleConfig` is initialized.

    ```python
    >>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
    >>> RLModuleConfig()
    2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    This is confusing: where did *I* use a config, and what am I doing
    wrong? This raises issues like:
    https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

    Tracing where the warning actually originates is tedious: is it my
    code or internal? The output just shows `deprecation.py:50`, which is
    not helpful.

    This PR adds a `stacklevel` option, with `stacklevel=2` as the default,
    to all `deprecation_warning`s, so devs and users can better see where
    the deprecated option is actually used.
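
    The effect of `stacklevel` can be reproduced with the standard
    `logging` module. The `deprecation_warning` below is a simplified
    stand-in with an assumed signature, not Ray's actual helper.

    ```python
    import logging

    logger = logging.getLogger("deprecation_demo")


    def deprecation_warning(old: str, new: str, stacklevel: int = 2) -> None:
        """Simplified stand-in for a deprecation helper."""
        # stacklevel=1 attributes the record to this helper; stacklevel=2
        # attributes it to whoever called the helper.
        logger.warning(
            "`%s` has been deprecated. Use `%s` instead.",
            old,
            new,
            stacklevel=stacklevel,
        )


    def user_code() -> None:
        # With the default stacklevel=2, the log record points at this line.
        deprecation_warning("old_api", "new_api")
    ```

    With `stacklevel=2`, the file name and line number in the log prefix
    refer to `user_code` rather than to the helper itself.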

    ---

    EDIT:

    **Before**

    ```python
    WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
    ```

    **After**: the `module.py:line` where the deprecated artifact is used
    is shown in the log output:

    When building an Algorithm:
    ```python
    WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    ```python
    .../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    ```

    Signed-off-by: Daraan <[email protected]>

commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <[email protected]>
Date:   Mon Nov 10 20:50:21 2025 +0530

    [core] Configure an interceptor to pass auth token in python direct g… (#58395)

    There are places in the Python code where we use the raw gRPC library
    to make gRPC calls (e.g., pub-sub, some calls to GCS). In the long term
    we want to fully deprecate gRPC library usage in our Python code base,
    but since that will take more effort and testing, this PR introduces
    an interceptor that adds auth headers (this takes effect for all gRPC
    calls made from Python).
    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    ray start --head
    ray job submit -- echo "hi"
    ```

    output
    ```
    ray job submit -- echo "hi"
    2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
    Job submission server address: http://127.0.0.1:8265

    -------------------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
    -------------------------------------------------------

    Next steps
      Query the logs of the job:
        ray job logs raysubmit_1EV8q86uKM24nHmH
      Query the status of the job:
        ray job status raysubmit_1EV8q86uKM24nHmH
      Request the job to be stopped:
        ray job stop raysubmit_1EV8q86uKM24nHmH

    Tailing logs until the job exits (disable with --no-wait):
    2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
    hi
    Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

    ------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
    ------------------------------------------
    ```
    dashboard
    test.py
    ```python
    import time
    import ray
    from ray._raylet import Config

    ray.init()

    @ray.remote
    def print_hi():
        print("Hi")
        time.sleep(2)

    @ray.remote
    class SimpleActor:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    actor = SimpleActor.remote()
    result = ray.get(actor.increment.remote())

    for i in range(100):
        ray.get(print_hi.remote())
        time.sleep(20)

    ray.shutdown()
    ```

    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    python test.py
    ```
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
    />
    overview page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
    />

    job page: tasks are listed
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
    />

    task page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
    />

    actors page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
    />

    specific actor page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
    />

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <[email protected]>
Date:   Sun Nov 9 08:34:46 2025 -0800

    [Data] Add exception handling for invalid URIs in download operation (#58464)

commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <[email protected]>
Date:   Sat Nov 8 15:35:11 2025 -0800

    [dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)
    Pass in `status_code` directly into `do_reply`. This is a follow up to
    https://github.com/ray-project/ray/pull/58255

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <[email protected]>
Date:   Sat Nov 8 15:32:10 2025 -0800

    [core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)

    This ensures node type names are correctly reported even when the
    autoscaler is disabled (read-only mode).

    Autoscaler v2 fails to report prometheus metrics when operating in
    read-only mode on KubeRay with the following KeyError error:

    ```
    2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
        return Reconciler.reconcile(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
        Reconciler._step_next(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
        Reconciler._scale_cluster(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
        reply = scheduler.schedule(sched_request)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
        ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
        node_config = ctx.get_node_type_configs()[node_type]
    KeyError: 'small-group'
    ```

    This happens because the `ReadOnlyProviderConfigReader` populates
    `ctx.get_node_type_configs()` using node IDs as node types, which is
    correct for local Ray (where local ray does not have
    `RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where
    `ray_node_type_name` is present and expected wi…
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 13, 2025
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 14:31:31 2025 -0800

    [doc] remove python 3.12 in doc building (#58572)

    unifying to python 3.10

    Signed-off-by: Lonnie Liu <[email protected]>

commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <[email protected]>
Date:   Thu Nov 13 17:27:23 2025 -0500

    Add support for RAY_AUTH_MODE=k8s  (#58497)

    This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
    will delegate authentication and authorization of Ray access to
    Kubernetes TokenReview and SubjectAccessReview APIs.

    ---------

    Signed-off-by: Andrew Sy Kim <[email protected]>

commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 14:08:29 2025 -0800

    [serve] fix serve dashboard metric name (#58573)

    Prometheus automatically appends the `_total` suffix to all Counter metrics.
    Ray has historically supported counter metrics with and without the `_total`
    suffix for backward compatibility, but it is now time to drop that
    support (2 years since the warning was added).

    There is one place in the Ray Serve dashboard that still doesn't use the
    `_total` suffix, so fix it in this PR.
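
The naming rule above can be sketched as follows. This is only an illustration, not Ray's or Prometheus's code: `counter_series_name` and the metric names are hypothetical.

```python
def counter_series_name(metric_name: str) -> str:
    """Return the `_total`-suffixed series name for a counter metric."""
    suffix = "_total"
    # Avoid doubling the suffix when the metric already carries it.
    if metric_name.endswith(suffix):
        return metric_name
    return metric_name + suffix


# Hypothetical metric names, for illustration only.
for name in ("example_requests", "example_requests_total"):
    print(name, "->", counter_series_name(name))
```

A dashboard query written against the suffixed name keeps working once the unsuffixed alias is dropped.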

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <[email protected]>
Date:   Thu Nov 13 13:33:33 2025 -0800

    [Serve.LLM] Add avg prompt length metric (#58599)
    Add avg prompt length metric

    When using uniform prompt length (especially in testing), the P50 and
    P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
    Average prompt length provides another useful dimension to look at and
    validate.

    For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
    9400, and avg accurately shows 5000.
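
The skew can be reproduced with a small sketch of bucket-based quantile estimation, using linear interpolation inside the bucket that contains the target rank (the approach Prometheus `histogram_quantile` takes). The bucket bounds below, and the assumption that every prompt lands just above the 5000 boundary, are illustrative and not taken from vLLM. Under those assumptions the estimate is 7500 for P50 and 9500 for P90, the same kind of overshoot as reported above, while the mean stays at the true length.

```python
from bisect import bisect_left

# Illustrative 1-2-5 style upper bounds (assumed; not vLLM's exact list).
BOUNDS = [1000, 2000, 5000, 10000, 20000]


def bucket_counts(values):
    """Count observations per bucket; bucket i holds values <= BOUNDS[i].

    Assumes every value is <= BOUNDS[-1].
    """
    counts = [0] * len(BOUNDS)
    for v in values:
        counts[bisect_left(BOUNDS, v)] += 1
    return counts


def estimate_quantile(counts, q):
    """Estimate quantile q by interpolating linearly within one bucket."""
    rank = q * sum(counts)
    seen = 0
    for i, count in enumerate(counts):
        if count and seen + count >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0
            fraction = (rank - seen) / count
            return lower + (BOUNDS[i] - lower) * fraction
        seen += count
    return float(BOUNDS[-1])


prompts = [5001] * 100  # uniform prompts just above the 5000 boundary
counts = bucket_counts(prompts)
p50 = estimate_quantile(counts, 0.5)
p90 = estimate_quantile(counts, 0.9)
avg = sum(prompts) / len(prompts)
print(p50, p90, avg)
```

Because all observations share one bucket, the estimator can only spread them evenly across that bucket's range, which is why an average is a useful cross-check.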

    <img width="1186" height="466" alt="image"
    src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
    />

    ---------

    Signed-off-by: Rui Qiao <[email protected]>
    Signed-off-by: Rui Qiao <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <[email protected]>
Date:   Thu Nov 13 12:42:49 2025 -0800

    [release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)

    allowing for py3.13 images (cpu & cu123) in release tests

    Signed-off-by: elliot-barn <[email protected]>

commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <[email protected]>
Date:   Thu Nov 13 12:26:10 2025 -0800

    [Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
    As title suggests

    Signed-off-by: Goutam <[email protected]>

commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:18:38 2025 -0800

    [core] fix get_metric_check_condition tests (#58598)

    Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
    which is a non-flaky version of `fetch_prometheus`. Update all test
    usages accordingly.

    Test:
    - CI

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>
    Signed-off-by: Cuong Nguyen <[email protected]>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <[email protected]>
Date:   Thu Nov 13 12:14:04 2025 -0800

    [core] add an option to disable otel sdk error logs (#58257)

    Currently, Ray metrics and events are exported through a centralized
    process called the Dashboard Agent. This process functions as a gRPC
    server, receiving data from all other components (GCS, Raylet, workers,
    etc.). However, during a node shutdown, the Dashboard Agent may
    terminate before the other components, resulting in gRPC errors and
    potential loss of metrics and events.

    When this happens, the otel sdk logs become very noisy. Add a default
    option to disable otel sdk logs to avoid confusion.

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <[email protected]>
Date:   Thu Nov 13 11:41:29 2025 -0800

    [1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)

    2. **Extracted generic `RankManager` class** - Created reusable rank
    management logic separated from deployment-specific concerns

    3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
    replacing raw integers

    4. **Simplified error handling** - not supporting self healing

    5. **Updated tests** - Refactored unit tests to use new API and removed
    flag-dependent test cases

    **Impact:**
    - Cleaner separation of concerns in rank management
    - Foundation for future multi-level rank support

    Next PR https://github.com/ray-project/ray/pull/58473

    ---------

    Signed-off-by: abrar <[email protected]>

commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <[email protected]>
Date:   Thu Nov 13 11:21:50 2025 -0800

    [Core] Refactor reference_counter out of memory store and plasma store (#57590)

    As discovered in the [PR to better define the interface for reference
    counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
    plasma store provider and memory store both share thin dependencies on
    reference counter that can be refactored out. This will reduce
    entanglement in our code base and improve maintainability.

    The main logic changes are located in
    * src/ray/core_worker/store_provider/plasma_store_provider.cc, where
    reference counter related logic is refactored into core worker
    * src/ray/core_worker/core_worker.cc, where factored out reference
    counter logic is resolved
    * src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
    logic related to reference counter has either been removed due to the
    fact that it is tech debt or refactored into caller functions.

    Microbenchmark:
    ```
    single client get calls (Plasma Store) per second 10592.56 +- 535.86
    single client put calls (Plasma Store) per second 4908.72 +- 41.55
    multi client put calls (Plasma Store) per second 14260.79 +- 265.48
    single client put gigabytes per second 11.92 +- 10.21
    single client tasks and get batch per second 8.33 +- 0.19
    multi client put gigabytes per second 32.09 +- 1.63
    single client get object containing 10k refs per second 13.38 +- 0.13
    single client wait 1k refs per second 5.04 +- 0.05
    single client tasks sync per second 960.45 +- 15.76
    single client tasks async per second 7955.16 +- 195.97
    multi client tasks async per second 17724.1 +- 856.8
    1:1 actor calls sync per second 2251.22 +- 63.93
    1:1 actor calls async per second 9342.91 +- 614.74
    1:1 actor calls concurrent per second 6427.29 +- 50.3
    1:n actor calls async per second 8221.63 +- 167.83
    n:n actor calls async per second 22876.04 +- 436.98
    n:n actor calls with arg async per second 3531.21 +- 39.38
    1:1 async-actor calls sync per second 1581.31 +- 34.01
    1:1 async-actor calls async per second 5651.2 +- 222.21
    1:1 async-actor calls with args async per second 3618.34 +- 76.02
    1:n async-actor calls async per second 7379.2 +- 144.83
    n:n async-actor calls async per second 19768.79 +- 211.95
    ```
    This PR mainly makes logic changes to the `ray.get` call chain. As we
    can see from the benchmark above, the single client get calls performance
    matches pre-regression levels.

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:49:39 2025 +0530

    [Core] Support get-auth-token cli command  (#58566)

    add support for `ray get-auth-token` cli command + test

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <[email protected]>
Date:   Fri Nov 14 00:37:23 2025 +0530

    [Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)

    Migrates Ray dashboard authentication from JavaScript-managed cookies to
    server-side HttpOnly cookies to enhance security against XSS attacks.
    This addresses code review feedback to improve the authentication
    implementation (https://github.com/ray-project/ray/pull/58368)

    main changes:
    - authentication middleware first looks for `Authorization` header, if
    not found it then looks at cookies to look for the auth token
    - new `api/authenticate` endpoint for verifying token and setting the
    auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
    `secure=true` (when using https))
    - removed javascript based cookie manipulation utils and axios
    interceptors (were previously responsible for setting cookies)
    - cookies are deleted when connecting to a cluster with
    `AUTH_MODE=disabled`. connecting to a different ray cluster (with
    different auth token) using the same endpoint (eg due to port-forwarding
    or local testing) will reshow the popup and ask users to input the right
    token.
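
The lookup order and cookie attributes listed above can be sketched with the standard library alone. This is a minimal illustration, not the dashboard's actual middleware: the cookie name `ray-auth-token`, the `Bearer` scheme, and both helper names are assumptions.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional

# Hypothetical cookie name; the real name is not given in the description.
AUTH_COOKIE = "ray-auth-token"


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the Authorization header, then fall back to the auth cookie."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):  # assumed scheme
        return auth[len("Bearer "):]
    cookie = SimpleCookie(headers.get("Cookie", ""))
    morsel = cookie.get(AUTH_COOKIE)
    return morsel.value if morsel is not None else None


def build_set_cookie(token: str, https: bool) -> str:
    """Build a Set-Cookie value with HttpOnly, SameSite=Strict, and Secure on HTTPS."""
    cookie = SimpleCookie()
    cookie[AUTH_COOKIE] = token
    cookie[AUTH_COOKIE]["httponly"] = True
    cookie[AUTH_COOKIE]["samesite"] = "Strict"
    if https:
        cookie[AUTH_COOKIE]["secure"] = True
    return cookie[AUTH_COOKIE].OutputString()


print(extract_token({"Cookie": f"{AUTH_COOKIE}=abc123"}))
print(build_set_cookie("abc123", https=True))
```

Since the cookie is `HttpOnly`, page scripts cannot read it, which is the XSS hardening the change is after.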

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:41:57 2025 -0800

    [ci] doc check: remove dependency on `ray_ci` (#58516)

    this makes it possible to run on a different python version than the CI
    wrapper code.

    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: Lonnie Liu <[email protected]>

commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:38 2025 -0800

    [ci] mark github.Repository as typechecking (#58582)

    so that importing test.py does not always import github

    github repo imports jwt, which then imports cryptography and can lead to
    issues on windows.
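
A minimal sketch of the pattern, assuming the usual `typing.TYPE_CHECKING` guard: the import is seen only by type checkers, so `github` (and its jwt/cryptography imports) is not loaded at runtime. `describe_repo` is a hypothetical helper, not code from the PR.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime.
    from github.Repository import Repository


def describe_repo(repo: "Repository") -> str:
    """Hypothetical helper; the string annotation is not evaluated at runtime."""
    return f"repo={repo.full_name}"


# Any object with a `full_name` attribute works at runtime.
print(describe_repo(SimpleNamespace(full_name="ray-project/ray")))
```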

    Signed-off-by: Lonnie Liu <[email protected]>

commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:35:23 2025 -0800

    [wheel] stop building python 3.9 wheels on the pipelines (#58587)

    also stops building python 3.9 aarch64 images

    Signed-off-by: Lonnie Liu <[email protected]>

commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <[email protected]>
Date:   Thu Nov 13 10:32:21 2025 -0800

    [serve] run tests in python 3.10 (#58586)

    all tests are passing

    Signed-off-by: Lonnie Liu <[email protected]>

commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <[email protected]>
Date:   Thu Nov 13 07:37:52 2025 -0800

    [core] Add monitoring in raylet for resouce view (#58382)

    We today have very little observability into pubsub. On a raylet one of
    the most important states that need to be propagated through the cluster
    via pubsub is cluster membership. All raylets should in an eventual BUT
    timely fashion agree on the list of available nodes. This metric just
    emits a simple counter to keep track of the node count.

    More pubsub observability to come.

    ---------

    Signed-off-by: zac <[email protected]>
    Signed-off-by: Zac Policzer <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <[email protected]>
Date:   Wed Nov 12 23:04:37 2025 -0800

    [Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)

    Fixes HTTP streaming file downloads in Ray Data's download operation.
    Some URIs (especially HTTP streams) require `open_input_stream` instead
    of `open_input_file`.

    - Modified `download_bytes_threaded` in `plan_download_op.py` to try
    both `open_input_file` and `open_input_stream` for each URI
    - Improved error handling to distinguish between different error types
       - Failed downloads now return `None` gracefully instead of crashing
    ```
    import pyarrow as pa
    from ray.data.context import DataContext
    from ray.data._internal.planner.plan_download_op import download_bytes_threaded
    urls = [
        "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
    ]
    table = pa.table({"url": urls})
    ctx = DataContext.get_current()
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
    result_table = results[0]
    for i in range(result_table.num_rows):
        url = result_table['url'][i].as_py()
        bytes_data = result_table['bytes'][i].as_py()

        if bytes_data is None:
            print(f"Row {i}: FAILED (None) - try-catch worked ✓")
        else:
            print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
        print(f"  URL: {url[:60]}...")

    print("\n✅ Test passed: Failed downloads return None instead of crashing.")
    ```

    Before the fix:
    ```
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
        test_download_expression_with_streaming_fallback()
      File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
        with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
        if not self.__exit__(*sys.exc_info()):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
        setattr(self.target, self.attribute, self.temp_original)
    TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
    (base) ray@ip-10-0-39-21:~/default$ python test.py
    2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    Traceback (most recent call last):
      File "/home/ray/default/test.py", line 16, in <module>
        results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
        uri_bytes = list(
                    ^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
        raise item
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
        for result in fn(input_queue_iter):
                      ^^^^^^^^^^^^^^^^^^^^
      File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
        yield f.read()
              ^^^^^^^^
      File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
      File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
        raise ValueError("Cannot seek streaming HTTP file")
    ValueError: Cannot seek streaming HTTP file
    ```
    After the fix:
    ```
    Row 0: SUCCESS (189370 bytes)
      URL: https://static-assets.tesla.com/configurator/compositor?cont...
    ```

    Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
    previously failed:
       - ✅ Successfully downloads HTTP stream files
       - ✅ Gracefully handles failed downloads (returns None)
       - ✅ Maintains backward compatibility with existing file downloads
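
The fallback described above can be sketched roughly as follows. This is not the code in `plan_download_op.py`: `read_uri_bytes`, the order in which the two openers are tried, and the catch-all error policy are assumptions. It relies only on `fs` exposing `open_input_file` and `open_input_stream`, as `pyarrow.fs.FileSystem` does.

```python
from typing import Optional


def read_uri_bytes(fs, path: str) -> Optional[bytes]:
    """Read all bytes at `path`, falling back to a stream for non-seekable sources."""
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except Exception:
            # Assumed policy: try the next opener, then give up.
            continue
    # Both attempts failed; report the failure as None instead of raising.
    return None
```

Returning `None` keeps one bad URI from failing the whole batch, at the cost of callers needing to check for it.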

    ---------

    Signed-off-by: xyuzh <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>

commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 22:19:50 2025 -0800

    [ci] pin docker client version (#58579)

    otherwise, the newer docker client will refuse to communicate with the
    docker daemon that is on an older version.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 22:08:45 2025 -0800

    [deps] adding include_setuptools flag for depset config (#58580)

    Adding optional `include_setuptools` flag for depset configuration

    If the flag is set on a depset config, `--unsafe-package setuptools` will
    not be included for depset compilation.

    If the flag does not exist on a depset config (default false),
    `--unsafe-package setuptools` will be appended to the default arguments.
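
The flag's effect on the compile arguments can be sketched as below. `compile_args` and the contents of `DEFAULT_ARGS` are hypothetical; only the `--unsafe-package setuptools` handling comes from the description above.

```python
from typing import List

# Hypothetical default arguments; the real defaults are not listed above.
DEFAULT_ARGS = ["--generate-hashes"]


def compile_args(include_setuptools: bool = False) -> List[str]:
    """Build compile arguments, marking setuptools unsafe unless it is included."""
    args = list(DEFAULT_ARGS)
    if not include_setuptools:
        args += ["--unsafe-package", "setuptools"]
    return args


print(compile_args())
print(compile_args(include_setuptools=True))
```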

    ---------

    Signed-off-by: elliot-barn <[email protected]>
    Co-authored-by: Lonnie Liu <[email protected]>

commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 20:36:43 2025 -0800

    [serve] remove minbuild-serve-py3.9 (#58585)

    nothing is using it anymore

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <[email protected]>
Date:   Wed Nov 12 18:48:27 2025 -0800

    [core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)

    Signed-off-by: irabbani <[email protected]>

commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 18:26:25 2025 -0800

    [deps] update `requirements_buildkite.txt` (#58574)

    as the pydantic version is pinned in `requirements-doc.txt` now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 16:38:54 2025 -0800

    Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)

    Reverts ray-project/ray#58535

    failing on windows.. :(

commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <[email protected]>
Date:   Wed Nov 12 16:37:24 2025 -0800

    [Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
    - Support upserting iceberg tables for IcebergDatasink
    - Update schema on APPEND and UPSERT
    - Enable overwriting the entire table

    Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite
    functionality. Also for append, the library now handles the transaction
    logic implicitly so that burden can be lifted from Ray Data.

    ---------

    Signed-off-by: Goutam <[email protected]>

commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <[email protected]>
Date:   Wed Nov 12 16:31:26 2025 -0800

    [core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)

    Using GetNodeAddressAndLiveness in raylet client pool instead of the
    bulkier Get, same for AsyncGetAll. Seems like it was already done in
    core worker client pool, so just making the same change for raylet
    client pool.

    Signed-off-by: joshlee <[email protected]>

commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:35 2025 -0800

    [doc] set default python env to 3.10 (#58570)

    we stop supporting building with python 3.9 now

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 15:01:20 2025 -0800

    [bazel] rename contraint from hermatic to python_version (#58499)

    which is more accurate

    also moves python constraint definitions into `bazel/` directory and
    registering python 3.10 platform with hermetic toolchain

    this allows performing migration from python 3.9 to python 3.10
    incrementally

    Signed-off-by: Lonnie Liu <[email protected]>

commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <[email protected]>
Date:   Wed Nov 12 14:23:17 2025 -0800

    [images][deps] raydepsets base extra depset (#58461)

    generating depsets for base extra python requirements
    Installing requirements in base extra image

    ---------

    Signed-off-by: elliot-barn <[email protected]>

commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <[email protected]>
Date:   Thu Nov 13 07:08:15 2025 +0900

    [Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
    Currently Ray Data has a preprocessor called `RobustScaler`. This scales
    the data based on given quantiles. Calculating the quantiles involves
    sorting the entire dataset by column for each column (C sorts for C
    number of columns), which, for a large dataset, will require a lot of
    calculations.
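
For reference, the exact computation that an approximate quantile replaces looks roughly like this. The sketch assumes the usual robust-scaling formula `(x - median) / (q_high - q_low)` with a default range of (0.25, 0.75); the function names are illustrative and are not Ray Data APIs.

```python
def exact_quantile(sorted_values, q):
    """Exact quantile via linear interpolation on already-sorted data."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def robust_scale(column, low=0.25, high=0.75):
    """Scale one column by its median and inter-quantile range."""
    ordered = sorted(column)  # the full per-column sort that approximation avoids
    q_low = exact_quantile(ordered, low)
    q_mid = exact_quantile(ordered, 0.5)
    q_high = exact_quantile(ordered, high)
    return [(x - q_mid) / (q_high - q_low) for x in column]


print(robust_scale([1, 2, 3, 4, 5]))
```

An approximate quantile aggregator trades a small error in `q_low`, `q_mid`, and `q_high` for not having to sort each column in full.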

    ** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
    as I couldn't actually find well-maintained tdigest libraries for
    python. ddsketch is better maintained.

    ** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
    aggregator

    ---------

    Signed-off-by: kyuds <[email protected]>
    Signed-off-by: Daniel Shin <[email protected]>
    Co-authored-by: You-Cheng Lin <[email protected]>

commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <[email protected]>
Date:   Thu Nov 13 03:33:18 2025 +0530

    [Core] support token auth in ray client server  (#58557)
    support token auth in ray client server by using the existing grpc
    interceptors. This pr refactors the code to:
    - add/rename sync and async client and server interceptors
    - create grpc utils to house grpc channel and server creation logic,
    python codebase is updated to use these methods
    - separate tests for sync and async interceptors
    - make existing authentication integration tests to run with RAY_CLIENT
    mode

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <[email protected]>
Date:   Wed Nov 12 13:45:02 2025 -0800

    [Core] Move request id creation to worker to address plasma get perf regression (#58390)
    This PR addresses the performance regression introduced in the [PR to make
    ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
    Specifically, the previous PR requires the worker to block and wait for
    AsyncGet to return with a reply of the request id needed for correctly
    cleaning up get requests. This additional synchronous step causes the
    plasma store Get to regress in performance.

    This PR moves the request id generation step to the worker,
    removing the blocking step to fix the perf regression.
    - [PR which introduced perf
    regression](https://github.com/ray-project/ray/pull/57911)
    - [PR which observed the
    regression](https://github.com/ray-project/ray/pull/58175)
    New performance of the change measured by `ray microbenchmark`.
    <img width="485" height="17" alt="image"
    src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
    />

    Original performance prior to the change. Here we focus on the
    regressing `single client get calls (Plasma Store)` metric, where our
    new performance returns us back to the original 10k per second range
    compared to the existing sub 5k per second.
    <img width="811" height="355" alt="image"
    src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
    />

    ---------

    Signed-off-by: davik <[email protected]>
    Co-authored-by: davik <[email protected]>

commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 12:36:21 2025 -0800

    [doc] symlink the doc dependency lock file (#58520)

    and ask people to use that lock file for building docs.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <[email protected]>
Date:   Wed Nov 12 12:08:28 2025 -0800

    [train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)

    1. JaxTrainer relies on the runtime env var `JAX_PLATFORMS` being set
    to initialize jax.distributed:
    https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
    2. Before this change, users had to both configure `use_tpu=True` in
    `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able to
    start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
    3. If the user uses other jax.distributed libraries such as Orbax, this
    can sometimes lead to a misleading error about distributed
    initialization.
    4. After this change, if the user sets `use_tpu=True`, we automatically
    add `tpu` to this env var.
    5. A TPU unit test is not available at this time; we will explore how
    to cover it later.
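
The merge described in item 4 can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `merge_jax_platforms` is hypothetical, and placing `tpu` first in the list is an assumption of the sketch.

```python
from typing import Optional


def merge_jax_platforms(existing: Optional[str], use_tpu: bool) -> Optional[str]:
    """Return the JAX_PLATFORMS value to export (hypothetical helper)."""
    # JAX_PLATFORMS is a comma-separated string, for example "tpu,cpu".
    platforms = [p.strip() for p in (existing or "").split(",") if p.strip()]
    if use_tpu and "tpu" not in platforms:
        # Assumption: put "tpu" first and keep whatever the user already listed.
        platforms.insert(0, "tpu")
    # None means "leave the variable unset".
    return ",".join(platforms) or None


print(merge_jax_platforms("cpu", use_tpu=True))
```

Keeping any platforms the user already listed means an explicit `JAX_PLATFORMS` setting is extended rather than overwritten.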

    ---------

    Signed-off-by: Lehui Liu <[email protected]>

commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <[email protected]>
Date:   Wed Nov 12 12:04:16 2025 -0800

    [Data] Add `Ranker` Interface (#58513)
    Creates a `Ranker` interface that ranks operators to pick the best one
    to run next in `select_operator_to_run`. This change only refactors the
    existing code. The ranking value must be comparable.
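
A ranker of this kind might look like the sketch below. Only the names `Ranker` and `select_operator_to_run` come from the commit message. `OpState`, `FewestQueuedBlocksRanker`, the ranking key, and the simplified signature of `select_operator_to_run` are assumptions made for illustration and do not mirror Ray Data's real classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class OpState:
    """Hypothetical stand-in for an operator's scheduling state."""
    name: str
    queued_blocks: int
    throttled: bool = False


class Ranker(ABC):
    @abstractmethod
    def rank(self, op: OpState) -> Any:
        """Return a comparable key; a smaller key means run sooner."""


class FewestQueuedBlocksRanker(Ranker):
    def rank(self, op: OpState) -> Any:
        # Tuples compare element by element, so unthrottled operators win
        # first and ties are broken by the number of queued blocks.
        return (op.throttled, op.queued_blocks)


def select_operator_to_run(ops: Sequence[OpState], ranker: Ranker) -> Optional[OpState]:
    """Pick the operator with the smallest ranking key, or None if empty."""
    if not ops:
        return None
    return min(ops, key=ranker.rank)
```

Because the selection only relies on the key being comparable, a different policy can be swapped in by writing another `Ranker` subclass.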

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <[email protected]>
Date:   Wed Nov 12 11:50:42 2025 -0800

    [bazel] upgrade bazel python rules to 0.25.0 (#58535)

    Previously it was actually using 0.4.0, which is set up by the grpc
    repo; the declaration in the workspace file was being shadowed.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:34:59 2025 -0800

    [Data] Refactor concurrency validation tests in `test_map.py` (#58549)

    The original `test_concurrency` function combined multiple test
    scenarios into a single test with complex control flow and expensive Ray
    cluster initialization. This refactoring extracts the parameter
    validation tests into focused, independent tests that are faster,
    clearer, and easier to maintain.

    Additionally, the original test included "validation" cases that tested
    valid concurrency parameters but didn't actually verify that concurrency
    was being limited correctly—they only checked that the output was
    correct, which isn't useful for validating the concurrency feature
    itself.

    **Key improvements:**
    - Split validation tests into `test_invalid_func_concurrency_raises` and
    `test_invalid_class_concurrency_raises`
    - Use parametrized tests for different invalid concurrency values
    - Switch from `shutdown_only` with explicit `ray.init()` to
    `ray_start_regular_shared` to eliminate cluster initialization overhead
    - Minimize test data from 10 blocks to 1 element since we're only
    validating parameter errors
    - Remove non-validation tests that didn't verify concurrency behavior

    The validation tests now execute significantly faster and provide
    clearer failure messages. Each test has a single, well-defined purpose
    making maintenance and debugging easier.

    ---------

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <[email protected]>
Date:   Wed Nov 12 11:32:48 2025 -0800

    [Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)

    This PR improves documentation consistency in the `python/ray/data`
    module by converting all remaining rST-style docstrings (`:param:`,
    `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).

    **Files modified:**
    - `python/ray/data/preprocessors/utils.py` - Converted
    `StatComputationPlan.add_callable_stat()`
    - `python/ray/data/preprocessors/encoder.py` - Converted
    `unique_post_fn()`
    - `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
    and `BlockColumnAccessor.is_composed_of_lists()`
    - `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
    Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`
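
For readers unfamiliar with the two conventions, here is the same docstring written both ways. The function `scale` is a made-up example and is not one of the functions touched by this PR.

```python
def scale_rst(values, factor):
    """Multiply every value by a factor (rST-style docstring).

    :param values: The numbers to scale.
    :param factor: The multiplier applied to each number.
    :return: A new list containing the scaled numbers.
    """
    return [v * factor for v in values]


def scale_google(values, factor):
    """Multiply every value by a factor (Google-style docstring).

    Args:
        values: The numbers to scale.
        factor: The multiplier applied to each number.

    Returns:
        A new list containing the scaled numbers.
    """
    return [v * factor for v in values]
```

Only the docstring layout differs; the behavior of the two functions is identical.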

    Signed-off-by: Balaji Veeramani <[email protected]>

commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <[email protected]>
Date:   Wed Nov 12 09:07:32 2025 -0800

    [serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)

    Signed-off-by: Nikhil Ghosh <[email protected]>

commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <[email protected]>
Date:   Wed Nov 12 11:50:05 2025 -0500

    [Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)

    This was set to stay aligned with the previous default of 4.

    However, after some consideration I've realized that 4 is too high, so
    this PR lowers it to 2.

    Signed-off-by: Alexey Kudinkin <[email protected]>

commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <[email protected]>
Date:   Wed Nov 12 07:44:53 2025 -0800

    [core] fix idle node termination on object pulling (#57928)

    Currently, a node is considered idle while pulling objects from the
    remote object store. This can lead to situations where a node is
    terminated as idle, causing the cluster to enter an infinite loop when
    pulling large objects that exceed the node idle termination timeout.

    This PR fixes the issue by treating object pulling as a busy activity.
    Note that nodes can still accept additional tasks while pulling objects
    (since pulling consumes no resources), but the auto-scaler will no
    longer terminate the node prematurely.
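
The idea of the fix can be sketched as below. The names `NodeActivity`, `is_idle`, and `should_terminate` are hypothetical and do not correspond to the raylet or autoscaler code; the sketch only shows object pulls being counted as busy activity.

```python
from dataclasses import dataclass


@dataclass
class NodeActivity:
    """Hypothetical snapshot of what a node is doing."""
    running_tasks: int = 0
    active_object_pulls: int = 0
    seconds_since_last_activity: float = 0.0


def is_idle(activity: NodeActivity) -> bool:
    # An in-progress object pull now counts as activity, even though it
    # consumes no schedulable resources.
    return activity.running_tasks == 0 and activity.active_object_pulls == 0


def should_terminate(activity: NodeActivity, idle_timeout_s: float) -> bool:
    return is_idle(activity) and activity.seconds_since_last_activity >= idle_timeout_s
```

With this rule, a node that spends longer than the idle timeout pulling one large object is no longer a termination candidate.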

    Closes #54372

    Test:
    - CI

    Signed-off-by: Cuong Nguyen <[email protected]>

commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <[email protected]>
Date:   Wed Nov 12 05:40:47 2025 -0800

    [core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)

    When actors terminate gracefully, Ray calls the actor's
    `__ray_shutdown__()` method if defined, allowing for cleanup of
    resources. However, this method is not invoked when the actor goes out
    of scope due to `del actor`.

    I traced through the entire code path, and here's what happens:

    Flow when `del actor` is called:

    1. **Python side**: `ActorHandle.__del__()` ->
    `worker.core_worker.remove_actor_handle_reference(actor_id)`

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

    2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
    `reference_counter_->RemoveLocalReference()`
    - When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
    callback

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

    3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
    `AsyncReportActorOutOfScope()` to GCS

    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
    https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

    4. **GCS receives notification**: `HandleReportActorOutOfScope()`
    - **THE PROBLEM IS HERE** ([line 279 in
    `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
       ```cpp
       DestroyActor(actor_id,
                    GenActorOutOfScopeCause(actor),
                    /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                    [reply, send_reply_callback]() {
       ```

    5. **Actor worker receives kill signal**: `HandleKillActor()` in
    [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
       ```cpp
       if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
           ForceExit(...)  // Skips __ray_shutdown__
       } else {
           Exit(...)  // Would call __ray_shutdown__
       }
       ```

    6. **ForceExit path**: Bypasses graceful shutdown -> No
    `__ray_shutdown__` callback invoked.

    This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
    actors and updates the docs.
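
The difference between the two exit paths can be modeled in a few lines of plain Python. This is a simulation that assumes nothing beyond what the flow above describes; `handle_kill_actor` and `Resource` are hypothetical stand-ins and do not call into Ray.

```python
def handle_kill_actor(actor, force_kill: bool) -> str:
    """Simplified model of the worker-side kill dispatch described above."""
    if force_kill:
        # Force-exit path: the process goes away without running any hook.
        return "force_exit"
    # Graceful path: run the actor's cleanup hook if it defines one.
    hook = getattr(actor, "__ray_shutdown__", None)
    if callable(hook):
        hook()
    return "exit"


class Resource:
    def __init__(self):
        self.cleaned_up = False

    def __ray_shutdown__(self):
        self.cleaned_up = True


before_fix = Resource()
handle_kill_actor(before_fix, force_kill=True)   # old OUT_OF_SCOPE behavior
after_fix = Resource()
handle_kill_actor(after_fix, force_kill=False)   # behavior after this PR
print(before_fix.cleaned_up, after_fix.cleaned_up)
```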

    ---------

    Signed-off-by: Sagar Sumit <[email protected]>
    Co-authored-by: Ibrahim Rabbani <[email protected]>

commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 19:00:10 2025 +0530

    [Core] use client interceptor for adding auth token in c++ client calls (#58424)
    - Use a client interceptor to add auth tokens to gRPC calls when
    `AUTH_MODE=token`
    - `BuildChannel()` automatically includes the interceptor
    - Removed the `auth_token` parameter from `ClientCallImpl`
    - Removed manual auth from `python_gcs_subscriber.cc`
    - Added tests to verify that auth works for the autoscaler APIs
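
The interceptor pattern behind these changes can be sketched without gRPC. The code below is a language-neutral illustration written in Python, not the C++ implementation: `make_auth_interceptor`, the `authorization` metadata key, and the `Bearer` prefix are all assumptions of this sketch.

```python
from typing import Callable, Dict, Optional

# A "call" takes a method name and request metadata and returns a reply.
Call = Callable[[str, Dict[str, str]], str]


def make_auth_interceptor(auth_mode: str, token: Optional[str]) -> Callable[[Call], Call]:
    """Build a wrapper that adds the token to every outgoing call."""

    def intercept(next_call: Call) -> Call:
        def call(method: str, metadata: Dict[str, str]) -> str:
            if auth_mode == "token" and token:
                # Copy the metadata so the caller's dict is not mutated.
                metadata = {**metadata, "authorization": f"Bearer {token}"}
            return next_call(method, metadata)

        return call

    return intercept
```

Installing the wrapper once, where the channel is built, is what removes the need to pass a token parameter through every individual call site.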

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <[email protected]>
Date:   Wed Nov 12 11:25:11 2025 +0530

    [core] Token auth usability improvements (#58408)
    - Renamed the `RAY_auth_mode` environment variable to `RAY_AUTH_MODE`
    across the codebase
    - Excluded healthcheck endpoints from authentication for Kubernetes
    compatibility
    - Fixed dashboard cookie handling to respect the auth mode and clear
    stale tokens when switching clusters

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Signed-off-by: Sampan S Nayak <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <[email protected]>
Date:   Tue Nov 11 19:50:52 2025 -0800

    [doc][serve][llm] Attached the correct figure to the pd docs (#58543)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <[email protected]>
Date:   Tue Nov 11 18:00:43 2025 -0800

    [doc] downgrade readthedocs to use python 3.10 (#58536)

    be consistent with the default build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 17:26:15 2025 -0800

    [core] Fix auth test import (#58554)

    The python test step is failing on master now because of this. Probably
    a logical merge conflict.
    ```
    FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
    ...

    [2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
    --
      | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
    ```

    Signed-off-by: dayshah <[email protected]>

commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:44:05 2025 -0800

    [core][rdt] Abort NIXL and allow actor reuse on failed transfers  (#56783)

    Signed-off-by: dayshah <[email protected]>

commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:26:17 2025 -0800

    [core] Improve kill actor logs (#58544)

    Signed-off-by: dayshah <[email protected]>

commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <[email protected]>
Date:   Tue Nov 11 12:16:41 2025 -0800

    [docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)

    Signed-off-by: Nikhil Ghosh <[email protected]>
    Signed-off-by: Nikhil G <[email protected]>

commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <[email protected]>
Date:   Tue Nov 11 12:13:13 2025 -0800

    [core] Make GlobalState lazy initialization thread-safe (#58182)

    Signed-off-by: dayshah <[email protected]>

commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <[email protected]>
Date:   Tue Nov 11 09:43:05 2025 -0800

    [core] Scheduling a detached actor with a placement group is not recommended (#57726)

    If users schedule a detached actor into a placement group, Raylet will
    kill the actor when the placement group is removed. The actor will be
    stuck in the `RESTARTING` state forever if it's restartable until users
    explicitly kill it.

    In that case, if users try to `get_actor` with the actor's name, it can
    still return the restarting actor, but no process exists. It will no
    longer be restarted because the PG is gone, and no PG with the same ID
    will be created during the cluster's lifetime.

    The better behavior would be for Ray to transition a task/actor's state
    to dead when it is impossible to restart. However, this would add too
    much complexity to the core, so I think it's not worth it. Therefore,
    this PR adds a warning log, and users should use detached actors or PGs
    correctly.

    Example: Run the following script and run `ray list actors`.

    ```python
    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    from ray.util.placement_group import placement_group, remove_placement_group

    @ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
    class Actor:
      pass

    ray.init()

    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())

    actor = Actor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()

    ray.get(actor.__ray_ready__.remote())
    ```

    - [ ] Bug fix 🐛
    - [ ] New feature ✨
    - [x] Enhancement 🚀
    - [ ] Code refactoring 🔧
    - [ ] Documentation update 📖
    - [ ] Chore 🧹
    - [ ] Style 🎨

    **Does this PR introduce breaking changes?**
    - [ ] Yes ⚠️
    - [x] No

    **Testing:**
    - [ ] Added/updated tests for my changes
    - [x] Tested the changes manually
    - [ ] This PR is not tested ❌ _(please explain why)_

    **Code Quality:**
    - [x] Signed off every commit (`git commit -s`)
    - [x] Ran pre-commit hooks ([setup
    guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

    **Documentation:**
    - [ ] Updated documentation (if applicable) ([contribution
    guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
    - [ ] Added new APIs to `doc/source/` (if applicable)

    ---------

    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Signed-off-by: Robert Nishihara <[email protected]>
    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Co-authored-by: Robert Nishihara <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <[email protected]>
Date:   Tue Nov 11 07:19:19 2025 -0800

    [core] enable open telemetry by default (#56432)

    This PR enables OpenTelemetry as the default backend for the Ray metric
    stack. The bulk of this PR actually fixes tests that were written with
    assumptions that no longer hold true. For ease of reviewing, I explain
    the reason for each test change inline, in comments next to the change.

    This PR also depends on a release of vllm (so that we can update the
    minimal supported version of vllm in ray).

    Test:
    - CI

    <!-- CURSOR_SUMMARY -->
    ---

    > [!NOTE]
    > Enable OpenTelemetry metrics backend by default and refactor
    metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
    metric names.
    >
    > - **Core/Config**:
    > - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
    `true` in `ray_constants.py` and `ray_config_def.h`.
    > - Metrics `Counter`: use `CythonCount` by default; keep legacy
    `CythonSum` only when OTEL is explicitly disabled.
    > - **Serve/Metrics Tests**:
    > - Replace text scraping with `PrometheusTimeseries` and
    `fetch_prometheus_metric_timeseries` throughout.
    > - Update metric names/tags to `ray_serve_*` and counter suffixes
    `*_total`; adjust latency metric names and processing/queued gauges.
    > - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
    pass through helpers.
    > - **General Test Fixes**:
    > - Remove OTEL parametrization/fixtures; simplify expectations where
    counters-as-gauges no longer apply; drop related tests.
    > - Cardinality tests: include `"low"` level and remove OTEL gating;
    stop injecting `enable_open_telemetry` in system config.
    > - Actor/state/thread tests: migrate to cluster fixtures, wait for
    dashboard agent, and adjust expected worker thread counts.
    > - **Build**:
    > - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
    from C++ stats test.
    >
    > <sup>Written by [Cursor
    Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
    1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically
    on new commits. Configure
    [here](https://cursor.com/dashboard?tab=bugbot).</sup>
    <!-- /CURSOR_SUMMARY -->

    ---------

    Signed-off-by: Cuong Nguyen <[email protected]>

commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:15:41 2025 +0530

    [Core] move authentication_test_utils into ray._private to fix macos tests (#58528)

    The auth token test setup in `conftest.py` is breaking the macOS tests.
    Two test scripts (`test_microbenchmarks.py` and `test_basic.py`) run
    after the wheel is installed, but without editable mode. For these
    tests to pass, `conftest.py` cannot import anything under `ray.tests`.

    This PR moves `authentication_test_utils` into `ray._private` to fix
    this issue.

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <[email protected]>
Date:   Tue Nov 11 19:10:56 2025 +0530

    [Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)

    Add a gRPC service interceptor that intercepts all dashboard agent RPC
    calls and validates the presence of the auth token (when the auth mode
    is token).
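
The server-side check such an interceptor performs might resemble the sketch below. It is an illustration only: `is_authorized`, the `authorization` metadata key, and the `Bearer` scheme are assumptions of this sketch and do not describe the interceptor added by this PR.

```python
import hmac
from typing import Mapping


def is_authorized(metadata: Mapping[str, str], expected_token: str, auth_mode: str) -> bool:
    """Decide whether an incoming call may proceed (hypothetical helper)."""
    if auth_mode != "token":
        # Authentication is only enforced in token mode.
        return True
    scheme, _, presented = metadata.get("authorization", "").partition(" ")
    if scheme != "Bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```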

    ---------

    Signed-off-by: sampan <[email protected]>
    Signed-off-by: Edward Oakes <[email protected]>
    Co-authored-by: sampan <[email protected]>
    Co-authored-by: Edward Oakes <[email protected]>

commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <[email protected]>
Date:   Tue Nov 11 21:35:42 2025 +0800

    [Docs] fix pattern_async_actor demo typo (#58486)

    fix pattern_async_actor demo typo. Add `self.`.

    ---------

    Signed-off-by: curiosity-hyf <[email protected]>

commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <[email protected]>
Date:   Mon Nov 10 18:28:46 2025 -0800

    Update pydoclint to version 0.8.1 (#58490)
    * Does the work to bump pydoclint up to the latest version
    * And allowlist any new violations it finds

    ---------

    Signed-off-by: Thomas Desrosiers <[email protected]>

commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <[email protected]>
Date:   Mon Nov 10 17:34:13 2025 -0800

    [Data] - Iceberg support predicate & projection pushdown (#58286)
    Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
    conjunction with this PR should speed up reads from Iceberg.

    Once the above change lands, we can add the pushdown interface support
    for IcebergDatasource

    ---------

    Signed-off-by: Goutam <[email protected]>

commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <[email protected]>
Date:   Mon Nov 10 17:28:23 2025 -0800

    [serve][llm] Fix import path in multi-node release test (#58498)

    Signed-off-by: Seiji Eicher <[email protected]>

commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 16:04:48 2025 -0800

    [ci] upgrade rayci version (#58514)

    to 0.21.0; supports wanda priority now.

    Signed-off-by: Lonnie Liu <[email protected]>

commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <[email protected]>
Date:   Mon Nov 10 14:05:15 2025 -0800

    [core] Add owned object spill metrics (#57870)

    This PR adds 2 new metrics to core_worker by way of the reference
    counter. The two new metrics keep track of the count and size of objects
    owned by the worker as well as keeping track of their states. States are
    defined as:

    - **PendingCreation**: An object that is pending creation and hasn't
    finished its initialization (and is sizeless)
    - **InPlasma**: An object which has an assigned node address and isn't
    spilled
    - **Spilled**: An object which has an assigned node address and is
    spilled
    - **InMemory**: An object which has no assigned address but isn't
    pending creation (and therefore, must be local)

    These metrics work by examining the state before and after any mutation
    of a reference in the reference_counter. This is required to do the
    appropriate bookkeeping (decrementing some values and incrementing
    others). Admittedly, the RecordMetrics loop can run between a decrement
    and its matching increment and report a momentarily inconsistent count.
    This side effect seems preferable to adding mutual exclusion with
    metric collection, as this is potentially a high-throughput code path.
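
A minimal sketch of the before/after bookkeeping is shown below. The state names come from the list above. `Ref`, `state_of`, and `OwnedObjectStats` are hypothetical, and checking the pending-creation flag first is an assumption of this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ref:
    """Hypothetical, simplified view of a reference-counter entry."""
    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0


def state_of(ref: Ref) -> str:
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is not None:
        return "Spilled" if ref.spilled else "InPlasma"
    return "InMemory"


class OwnedObjectStats:
    def __init__(self) -> None:
        self.count: Counter = Counter()
        self.bytes: Counter = Counter()

    def add(self, ref: Ref) -> None:
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size

    def mutate(self, ref: Ref, **changes) -> None:
        # Snapshot the state before the mutation...
        before_state, before_size = state_of(ref), ref.size
        for field, value in changes.items():
            setattr(ref, field, value)
        # ...then move the object from its old bucket to its new one.
        self.count[before_state] -= 1
        self.bytes[before_state] -= before_size
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size
```

Each mutation costs constant time, which is the trade-off the message describes: no full scan at collection time, at the price of brief inconsistency between the decrement and the increment.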

    In addition, performing live counts seemed preferable to doing a full
    accounting of the object store and all references at metric collection
    time. The reference counter may be tracking millions of objects, so
    each metric scan could be very expensive. Keeping a running account
    (despite being potentially inaccurate for short periods) therefore
    seemed the right call.

    This PR also allows an object's size to change because of
    non-deterministic instantiation (say an object is initially created,
    but its primary copy dies, and then the recreation fails). This is an
    edge case, but seems important for completeness' sake.

    ---------

    Signed-off-by: zac <[email protected]>

commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:02:11 2025 -0800

    [java] remove local genrule `//java:ray_java_pkg` (#58503)

    using `bazelisk run //java:gen_ray_java_pkg` everywhere

    Signed-off-by: Lonnie Liu <[email protected]>

commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:27 2025 -0800

    [ci] apply isort to release test directory, part 1 (#58505)

    excluding `*_tests` directories for now to reduce the impact

    Signed-off-by: Lonnie Liu <[email protected]>

commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:01:06 2025 -0800

    [doc] change link check to run on python 3.12 (#58506)

    migrating all doc related things to run on python 3.12

    Signed-off-by: Lonnie Liu <[email protected]>

commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 14:00:01 2025 -0800

    [doc] ci: move doc annotation check to python 3.12 (#58507)

    be consistent with doc build environment

    Signed-off-by: Lonnie Liu <[email protected]>

commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <[email protected]>
Date:   Mon Nov 10 12:52:43 2025 -0800

    [data] Clear queue for manually mark_execution_finished operators (#58441)
    Currently, we clear _external_ queues when an operator is manually
    marked as finished, but we don't clear their _internal_ queues. This PR
    fixes that.
    Fixes this test:
    https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
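
The behavior being fixed can be pictured with a toy operator. Everything in this sketch is hypothetical, including the class name `SketchOperator` and its queue attributes; only the method name `mark_execution_finished` appears in the commit subject.

```python
from collections import deque


class SketchOperator:
    """Toy operator with an external input queue and an internal buffer."""

    def __init__(self) -> None:
        self.external_input_queue = deque()
        self.internal_queue = deque()
        self.finished = False

    def add_input(self, block) -> None:
        self.external_input_queue.append(block)

    def stage_next(self) -> None:
        # Move one block from the external queue into the internal buffer.
        if self.external_input_queue:
            self.internal_queue.append(self.external_input_queue.popleft())

    def mark_execution_finished(self) -> None:
        self.finished = True
        self.external_input_queue.clear()
        # The fix: also drop blocks that were already staged internally.
        self.internal_queue.clear()

    def has_pending_work(self) -> bool:
        return bool(self.external_input_queue or self.internal_queue)
```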

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <[email protected]>
Date:   Mon Nov 10 10:54:34 2025 -0800

    [data] Update depsets for multimodal inference release tests (#57233)

    Update remaining multimodal release tests to use new depsets.

commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <[email protected]>
Date:   Mon Nov 10 10:30:00 2025 -0800

    [ci] separate doc related jobs into its own group (#58454)

    so that they are not called lints any more

    Signed-off-by: Lonnie Liu <[email protected]>

commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <[email protected]>
Date:   Mon Nov 10 23:45:38 2025 +0530

    increase timeout for test_initial_replica tests (#58423)

    - The `test_target_capacity` Windows test is failing, possibly because
    of the short 10-second timeout. This PR increases it to verify whether
    the timeout is the issue.

    Signed-off-by: harshit <[email protected]>

commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <[email protected]>
Date:   Mon Nov 10 09:39:43 2025 -0800

    Define an env for controlling UVloop (#58442)

    Our Serve ingress keeps running into the following `uvloop`-related
    error under heavy load:
    ```
    File descriptor 97 is used by transport
    ```
    The uvloop team has a
    [PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it
    seems that no one is working on it.

    One workaround mentioned in the
    [PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982)
    is to just turn off uvloop.
    We tried it in our environment and didn't see any major performance
    difference. Hence, this PR defines a new env var for controlling
    uvloop.
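
An environment switch of this kind could be sketched as follows. The commit message does not name the variable, so `EXAMPLE_SERVE_USE_UVLOOP` is a placeholder, and treating uvloop as enabled by default is an assumption of this sketch.

```python
import asyncio
import os
from typing import Mapping

# Placeholder name: the real variable is not given in the commit message.
ENV_NAME = "EXAMPLE_SERVE_USE_UVLOOP"


def uvloop_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return False only when the variable is set to a falsy value."""
    return environ.get(ENV_NAME, "1").strip().lower() not in ("0", "false", "no", "off")


def new_event_loop(environ: Mapping[str, str] = os.environ) -> asyncio.AbstractEventLoop:
    if uvloop_enabled(environ):
        try:
            import uvloop  # optional dependency

            return uvloop.new_event_loop()
        except ImportError:
            pass  # fall back to the standard asyncio loop
    return asyncio.new_event_loop()
```

Setting the variable to `0` makes `new_event_loop` return the standard asyncio loop, which is the workaround the linked uvloop discussion suggests.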

    Signed-off-by: jugalshah291 <[email protected]>

commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <[email protected]>
Date:   Mon Nov 10 23:29:03 2025 +0800

    [Doc][KubeRay] eliminate vale errors (#58429)

    Fix some Vale errors and suggestions in the kai-scheduler document.

    See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719

    Signed-off-by: fscnick <[email protected]>

commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <[email protected]>
Date:   Mon Nov 10 16:24:34 2025 +0100

    [air] Add stacklevel option to deprecation_warning (#58357)

    Currently, deprecation warnings are sometimes not informative enough.
    When a warning is triggered, it does not tell us *where* the deprecated
    feature is used. For example, Ray internally raises a deprecation
    warning when an `RLModuleConfig` is initialized.

    ```python
    >>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
    >>> RLModuleConfig()
    2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    This is confusing: where did *I* use a config, and what am I doing
    wrong? This raises issues like:
    https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

    Tracing where the warning actually originates is tedious: is it my code
    or internal? The output just shows `deprecation.py:50`. Not helpful.

    This PR adds a `stacklevel` option, with `stacklevel=2` as the default,
    to all `deprecation_warning` calls, so devs and users can better see
    where the deprecated option is actually used.
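
The effect of `stacklevel` can be reproduced with the standard library. The sketch assumes the helper logs through Python's `logging` module, whose logging calls accept a `stacklevel` argument (Python 3.8+). The `deprecation_warning` below is a simplified stand-in, not Ray's implementation.

```python
import logging

logger = logging.getLogger("deprecation_sketch")
logger.setLevel(logging.WARNING)
logger.propagate = False


class CollectRecords(logging.Handler):
    """Keep emitted records so the reported call site can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def deprecation_warning(old: str, new: str, stacklevel: int = 2) -> None:
    # stacklevel=1 blames this helper; stacklevel=2 blames its caller.
    logger.warning(
        "DeprecationWarning: `%s` has been deprecated. Use `%s` instead.",
        old,
        new,
        stacklevel=stacklevel,
    )


def user_code(stacklevel: int) -> None:
    deprecation_warning("old_api", "new_api", stacklevel=stacklevel)


handler = CollectRecords()
logger.addHandler(handler)
user_code(stacklevel=1)
user_code(stacklevel=2)
logger.removeHandler(handler)

reported = [record.funcName for record in handler.records]
print(reported)  # ['deprecation_warning', 'user_code']
```

With the default of 2, the log line points at the code that used the deprecated feature instead of at the helper itself.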

    ---

    EDIT:

    **Before**

    ```python
    WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
    ```

    **After** the `module.py:line` where the deprecated artifact is used is
    shown in the log output:

    When building an Algorithm:
    ```python
    WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
    ```

    ```python
    .../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    ```

    Signed-off-by: Daraan <[email protected]>

commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <[email protected]>
Date:   Mon Nov 10 20:50:21 2025 +0530

    [core] Configure an interceptor to pass auth token in python direct g… (#58395)

    There are places in the Python code where we use the raw grpc library
    to make gRPC calls (e.g. pub-sub, some calls to GCS). In the long term
    we want to fully deprecate grpc library usage in our Python code base,
    but as that will take more effort and testing, this PR introduces an
    interceptor that adds auth headers (it takes effect for all gRPC calls
    made from Python).
    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    ray start --head
    ray job submit -- echo "hi"
    ```

    output
    ```
    ray job submit -- echo "hi"
    2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
    Job submission server address: http://127.0.0.1:8265

    -------------------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
    -------------------------------------------------------

    Next steps
      Query the logs of the job:
        ray job logs raysubmit_1EV8q86uKM24nHmH
      Query the status of the job:
        ray job status raysubmit_1EV8q86uKM24nHmH
      Request the job to be stopped:
        ray job stop raysubmit_1EV8q86uKM24nHmH

    Tailing logs until the job exits (disable with --no-wait):
    2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
    hi
    Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

    ------------------------------------------
    Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
    ------------------------------------------
    ```
    dashboard
    test.py
    ```python
    import time

    import ray

    # Connect to an existing cluster if one is found, otherwise start a local one.
    ray.init()

    @ray.remote
    def print_hi():
        print("Hi")
        time.sleep(2)

    @ray.remote
    class SimpleActor:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    # Create one actor and call it once.
    actor = SimpleActor.remote()
    result = ray.get(actor.increment.remote())

    # Run 100 tasks one at a time, pausing 20 s between them,
    # so the job stays alive for a long while.
    for i in range(100):
        ray.get(print_hi.remote())
        time.sleep(20)

    ray.shutdown()
    ```

    ```
    export RAY_auth_mode="token"
    export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
    python test.py
    ```
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
    />
    overview page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
    />

    job page: tasks are listed
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
    />

    task page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
    />

    actors page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
    />

    specific actor page
    <img width="1720" height="1073" alt="image"
    src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
    />

    ---------

    Signed-off-by: sampan <[email protected]>
    Co-authored-by: sampan <[email protected]>

commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <[email protected]>
Date:   Sun Nov 9 08:34:46 2025 -0800

    [Data] Add exception handling for invalid URIs in download operation (#58464)

commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <[email protected]>
Date:   Sat Nov 8 15:35:11 2025 -0800

    [dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)

    Pass `status_code` directly into `do_reply`. This is a follow-up to
    https://github.com/ray-project/ray/pull/58255

    ---------

    Signed-off-by: iamjustinhsu <[email protected]>

commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <[email protected]>
Date:   Sat Nov 8 15:32:10 2025 -0800

    [core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)

    This ensures node type names are correctly reported even when the
    autoscaler is disabled (read-only mode).

    Autoscaler v2 fails to report Prometheus metrics when operating in
    read-only mode on KubeRay, raising the following KeyError:

    ```
    2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
        return Reconciler.reconcile(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
        Reconciler._step_next(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
        Reconciler._scale_cluster(
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
        reply = scheduler.schedule(sched_request)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
        ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
        node_config = ctx.get_node_type_configs()[node_type]
    KeyError: 'small-group'
    ```

    This happens because the `ReadOnlyProviderConfigReader` populates
    `ctx.get_node_type_configs()` using node IDs as node types. This is
    correct for local Ray (where `RAY_NODE_TYPE_NAME` is not set),
    but incorrect for KubeRay, where
    `ray_node_type_name` is present and expected wi…
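
The mismatch behind the traceback above can be reproduced in isolation. This is only an illustrative sketch: the function names, the node IDs, and the `max_worker_nodes` field are hypothetical stand-ins, not Ray's actual data structures. It shows how a config map keyed by node ID fails as soon as it is looked up by a node type name such as `small-group`.

```python
from typing import Dict, Iterable


def build_configs_keyed_by_node_id(node_ids: Iterable[str]) -> Dict[str, dict]:
    """Mimic a reader that registers each node ID as its own node type."""
    return {node_id: {"max_worker_nodes": 1} for node_id in node_ids}


def max_workers_for(configs: Dict[str, dict], node_type: str) -> int:
    # Same lookup shape as the failing line in the traceback:
    # a plain dict index by node type, with no fallback.
    return configs[node_type]["max_worker_nodes"]


configs = build_configs_keyed_by_node_id(["node-a", "node-b"])

# Looking up by node ID works, because that is how the map was keyed.
found = max_workers_for(configs, "node-a")

# Looking up by a reported node type name does not.
missing = None
try:
    max_workers_for(configs, "small-group")
except KeyError as err:
    missing = err.args[0]
```

Here `missing` ends up holding the node type name that was absent from the map, which corresponds to the `KeyError: 'small-group'` in the log.
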
Labels: core, go, serve
