increase timeout for test_initial_replica tests #58423

harshit-anyscale · 2025-11-06T05:11:05Z

The test_target_capacity test is failing on Windows, possibly because of its short 10-second timeout. This PR increases the timeout to verify whether the timeout is the cause.

Signed-off-by: harshit <[email protected]>

gemini-code-assist

Code Review

This pull request increases the timeout for a test in test_initial_replica_tests to address a potential flakiness issue on Windows. The change is straightforward and reasonable. I've added a suggestion to use a named constant for the timeout value to improve code maintainability.

gemini-code-assist · 2025-11-06T05:11:55Z

python/ray/serve/tests/test_target_capacity.py

                deployment_name: int(initial_replicas * config_target_capacity / 100)
            },
            app_name="app1",
+            timeout=30,


To improve readability and maintainability, it's better to define this timeout value as a named constant at the top of the file or test class, for example INITIAL_REPLICA_TEST_TIMEOUT_S = 30. This makes it easier to understand the purpose of the timeout and to adjust it in the future if needed, especially since other timeouts are used in this file.

Suggested change

timeout=30,

timeout=30, # Consider defining this as a constant, e.g., INITIAL_REPLICA_TEST_TIMEOUT_S
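A minimal sketch of what the named-constant suggestion could look like. The `wait_for_condition` and `check_expected_num_replicas` functions below are simplified stand-ins written for this example (the real helpers live in Ray's test utilities and in the test file), and `INITIAL_REPLICA_TEST_TIMEOUT_S` is just the name proposed in the review, not something that exists in the codebase.

```python
import time

# Name taken from the review suggestion; hypothetical, not defined in Ray.
INITIAL_REPLICA_TEST_TIMEOUT_S = 30


def wait_for_condition(predicate, timeout=10, retry_interval_ms=100, **kwargs):
    """Simplified stand-in: poll until predicate(**kwargs) is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate(**kwargs):
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(f"Condition not met within {timeout} seconds.")
        time.sleep(retry_interval_ms / 1000)


def check_expected_num_replicas(deployment_to_num_replicas, app_name):
    """Placeholder check; the real test inspects the Serve status instead."""
    return app_name == "app1" and all(
        n >= 0 for n in deployment_to_num_replicas.values()
    )


# The call site now names the timeout instead of repeating a bare 30.
wait_for_condition(
    check_expected_num_replicas,
    deployment_to_num_replicas={"my_deployment": 2},
    app_name="app1",
    timeout=INITIAL_REPLICA_TEST_TIMEOUT_S,
)
```

With a constant, a later adjustment for slow platforms touches one line rather than every call site.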

cursor

Bug: Inconsistent Timeout Updates in Replicas Test Suite

Inconsistent timeout handling in test_initial_replicas_new_configs. The PR aims to increase timeout to 30 seconds to address Windows test failures, but only the first wait_for_condition call (line 1078) was updated. Two subsequent wait_for_condition calls with check_expected_num_replicas (lines 1103-1109 for "app1" and lines 1110-1116 for "app2") still use the default 10-second timeout. These calls are checking similar replica scaling conditions and are likely to experience the same timeout issues on Windows, making the fix incomplete.

python/ray/serve/tests/test_target_capacity.py#L1102-L1116

ray/python/ray/serve/tests/test_target_capacity.py

Lines 1102 to 1116 in 7ecde7f

        client.deploy_apps(new_config)
        wait_for_condition(
            lambda: serve.status().target_capacity == new_config_target_capacity
        )
        wait_for_condition(
            check_expected_num_replicas,
            deployment_to_num_replicas={
                deployment_name: int(
                    initial_replicas * new_config_target_capacity / 100
                )
            },
            app_name="app1",
        )
        wait_for_condition(
            check_expected_num_replicas,

zcin · 2025-11-06T17:14:05Z

@harshit-anyscale did you run the Windows test to verify the fix?

harshit-anyscale · 2025-11-06T17:41:42Z

> @harshit-anyscale did you run the windows test to verify the fix

Not right now. This seems like a brute-force solution to me: the status we were getting is Deploying, but what we want is Running. So I thought of increasing the timeout first to make the test less flaky and less problematic for others. If this works, I will do the RCA and take further steps.

This is a short-term solution. Let me know if that's okay; otherwise I will run the Windows test locally first.
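For the follow-up RCA, it may help to record which statuses were observed before the timeout rather than only learning that the condition failed. The sketch below is a hypothetical helper, not part of Ray: `wait_for_status`, its parameters, and the status strings are assumptions made for illustration, and the status source is simulated.

```python
import time


def wait_for_status(get_status, expected, timeout_s=10, interval_s=0.1):
    """Poll get_status() until it returns `expected`.

    Returns the sequence of distinct consecutive statuses seen, and includes
    that sequence in the error message if the deadline passes first.
    """
    seen = []
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == expected:
            return seen
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Expected {expected!r} within {timeout_s}s; observed {seen}"
            )
        time.sleep(interval_s)


# Simulated status source (assumption): two polls in DEPLOYING, then RUNNING.
statuses = iter(["DEPLOYING", "DEPLOYING", "RUNNING"])
history = wait_for_status(
    lambda: next(statuses), "RUNNING", timeout_s=5, interval_s=0.01
)
print(history)  # ['DEPLOYING', 'RUNNING']
```

A timeout message that lists the observed statuses would show directly whether the application was still stuck in the deploying state when the wait gave up.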

cursor · 2025-11-07T05:24:50Z

python/ray/serve/tests/test_target_capacity.py

                deployment_name: int(initial_replicas * config_target_capacity / 100)
            },
            app_name="app1",
+            timeout=30,


Bug: Incomplete timeout propagation in test retries

The timeout increase to 30 seconds is only applied to the first wait_for_condition call in test_initial_replicas_new_configs, but two similar calls later in the same test (around lines 1103 and 1111) still use the default 10-second timeout. This incomplete fix means the test can still fail on Windows due to timeouts in those later assertions, defeating the purpose of this PR.
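One possible way to avoid missing a call site is to bind the longer timeout once and reuse the bound helper for every replica check. This is only a sketch: `wait_for_condition` here is a simplified stand-in for the Ray test helper, `wait_for_replicas` is a hypothetical name, and the check function is a placeholder that records which apps were checked.

```python
import functools
import time


def wait_for_condition(predicate, timeout=10, retry_interval_ms=100, **kwargs):
    """Simplified stand-in: poll until predicate(**kwargs) is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate(**kwargs):
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(f"Condition not met within {timeout} seconds.")
        time.sleep(retry_interval_ms / 1000)


# Bind the 30-second timeout once so every call below shares it (hypothetical name).
wait_for_replicas = functools.partial(wait_for_condition, timeout=30)

checked_apps = []


def check_expected_num_replicas(deployment_to_num_replicas, app_name):
    """Placeholder check that only records the app it was asked about."""
    checked_apps.append(app_name)
    return True


# Both apps use the same bound timeout, so neither falls back to the 10-second default.
for app_name in ("app1", "app2"):
    wait_for_replicas(
        check_expected_num_replicas,
        deployment_to_num_replicas={"my_deployment": 1},
        app_name=app_name,
    )
```

The same effect can be had by passing `timeout=30` explicitly to each remaining call; the bound helper just makes an omission harder.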

commit b3a8434d35f7af0322e3b766b1a1809bd29c2837 Author: Lonnie Liu <[email protected]> Date: Thu Nov 13 14:31:31 2025 -0800 [doc] remove python 3.12 in doc building (#58572) unifying to python 3.10 Signed-off-by: Lonnie Liu <[email protected]> commit 31f904f630809152ceba67c8bf1684c8c9b685ea Author: Andrew Sy Kim <[email protected]> Date: Thu Nov 13 17:27:23 2025 -0500 Add support for RAY_AUTH_MODE=k8s (#58497) This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray will delegate authentication and authorization of Ray access to Kubernetes TokenReview and SubjectAccessReview APIs. --------- Signed-off-by: Andrew Sy Kim <[email protected]> commit ade535a9519c19c25aa50c562d2c27128b3ca356 Author: Cuong Nguyen <[email protected]> Date: Thu Nov 13 14:08:29 2025 -0800 [serve] fix serve dashboard metric name (#58573) Prometheus auto-append the `_total` suffix to all Counter metrics. Ray historically has been supported counter metric with and without `_total` suffix for backward compatibility, but it is now time to drop the support (2 years since the warning was added). There is one place in ray serve dashboard that still doesn't use the `_total` suffix so fix it in this PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc Author: Rui Qiao <[email protected]> Date: Thu Nov 13 13:33:33 2025 -0800 [Serve.LLM] Add avg prompt length metric (#58599) Add avg prompt length metric When using uniform prompt length (especially in testing), the P50 and P90 computations are skewed due to the 1_2_5 buckets used in vLLM. Average prompt length provides another useful dimension to look at and validate. For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows 9400, and avg accurately shows 5000. 
<img width="1186" height="466" alt="image" src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a" /> --------- Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01 Author: Elliot Barnwell <[email protected]> Date: Thu Nov 13 12:42:49 2025 -0800 [release] allowing for py3.13 images (cpu & cu123) in release tests (#58581) allowing for py3.13 images (cpu & cu123) in release tests Signed-off-by: elliot-barn <[email protected]> commit c3ba35e6cb1ce4030d8d361a921a697af516fbca Author: Goutam <[email protected]> Date: Thu Nov 13 12:26:10 2025 -0800 [Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225) As title suggests > Link related issues: "Fixes #1234", "Closes #1234", or "Related to > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Goutam <[email protected]> commit af20446c362a8f4d17b9226d944a3242b0acafaf Author: Cuong Nguyen <[email protected]> Date: Thu Nov 13 12:18:38 2025 -0800 [core] fix get_metric_check_condition tests (#58598) Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`, which is a non-flaky version of `fetch_prometheus`. Update all of test usage accordingly. Test: - CI --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit f1c613dc386268beec06b6c57c12191218ae7e74 Author: Cuong Nguyen <[email protected]> Date: Thu Nov 13 12:14:04 2025 -0800 [core] add an option to disable otel sdk error logs (#58257) Currently, Ray metrics and events are exported through a centralized process called the Dashboard Agent. 
This process functions as a gRPC server, receiving data from all other components (GCS, Raylet, workers, etc.). However, during a node shutdown, the Dashboard Agent may terminate before the other components, resulting in gRPC errors and potential loss of metrics and events. As this issue occurs, the otel sdk logs become very noisy. Add a default options to disable otel sdk logs to avoid confusion. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> commit 638933ef4aabe24b5def68d72f21e772e354e853 Author: Abrar Sheikh <[email protected]> Date: Thu Nov 13 11:41:29 2025 -0800 [1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471) 2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns 3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers 4. **Simplified error handling** - not supporting self healing 5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases **Impact:** - Cleaner separation of concerns in rank management - Foundation for future multi-level rank support Next PR https://github.com/ray-project/ray/pull/58473 --------- Signed-off-by: abrar <[email protected]> commit 5d5113134bce5929ff7504f733bbee44a7de2987 Author: Kunchen (David) Dai <[email protected]> Date: Thu Nov 13 11:21:50 2025 -0800 [Core] Refactor reference_counter out of memory store and plasma store (#57590) As discovered in the [PR to better define the interface for reference counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933), plasma store provider and memory store both share thin dependencies on reference counter that can be refactored out. This will reduce entanglement in our code base and improve maintainability. 
The main logic changes are located in * src/ray/core_worker/store_provider/plasma_store_provider.cc, where reference counter related logic is refactor into core worker * src/ray/core_worker/core_worker.cc, where factored out reference counter logic is resolved * src/ray/core_worker/store_provider/memory_store/memory_store.cc, where logic related to reference counter has either been removed due to the fact that it is tech debt or refactored into caller functions.   Microbenchmark: ``` single client get calls (Plasma Store) per second 10592.56 +- 535.86 single client put calls (Plasma Store) per second 4908.72 +- 41.55 multi client put calls (Plasma Store) per second 14260.79 +- 265.48 single client put gigabytes per second 11.92 +- 10.21 single client tasks and get batch per second 8.33 +- 0.19 multi client put gigabytes per second 32.09 +- 1.63 single client get object containing 10k refs per second 13.38 +- 0.13 single client wait 1k refs per second 5.04 +- 0.05 single client tasks sync per second 960.45 +- 15.76 single client tasks async per second 7955.16 +- 195.97 multi client tasks async per second 17724.1 +- 856.8 1:1 actor calls sync per second 2251.22 +- 63.93 1:1 actor calls async per second 9342.91 +- 614.74 1:1 actor calls concurrent per second 6427.29 +- 50.3 1:n actor calls async per second 8221.63 +- 167.83 n:n actor calls async per second 22876.04 +- 436.98 n:n actor calls with arg async per second 3531.21 +- 39.38 1:1 async-actor calls sync per second 1581.31 +- 34.01 1:1 async-actor calls async per second 5651.2 +- 222.21 1:1 async-actor calls with args async per second 3618.34 +- 76.02 1:n async-actor calls async per second 7379.2 +- 144.83 n:n async-actor calls async per second 19768.79 +- 211.95 ``` This PR mainly makes logic changes to the `ray.get` call chain. As we can see from the benchmark above, the single clientget calls performance matches pre-regression levels. 
--------- Signed-off-by: davik <[email protected]> Co-authored-by: davik <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]> commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0 Author: Sampan S Nayak <[email protected]> Date: Fri Nov 14 00:49:39 2025 +0530 [Core] Support get-auth-token cli command (#58566) add support for `ray get-auth-token` cli command + test --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc Author: Sampan S Nayak <[email protected]> Date: Fri Nov 14 00:37:23 2025 +0530 [Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591) Migrates Ray dashboard authentication from JavaScript-managed cookies to server-side HttpOnly cookies to enhance security against XSS attacks. This addresses code review feedback to improve the authentication implementation (https://github.com/ray-project/ray/pull/58368) main changes: - authentication middleware first looks for `Authorization` header, if not found it then looks at cookies to look for the auth token - new `api/authenticate` endpoint for verifying token and setting the auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and `secure=true` (when using https)) - removed javascript based cookie manipulation utils and axios interceptors (were previously responsible for setting cookies) - cookies are deleted when connecting to a cluster with `AUTH_MODE=disabled`. connecting to a different ray cluster (with different auth token) using the same endpoint (eg due to port-forwarding or local testing) will reshow the popup and ask users to input the right token. 
--------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]> commit 0905c77db5acd286a6ba84a907c60ad2b15416dd Author: Lonnie Liu <[email protected]> Date: Thu Nov 13 10:41:57 2025 -0800 [ci] doc check: remove dependency on `ray_ci` (#58516) this makes it possible to run on a different python version than the CI wrapper code. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]> commit 0bbd8fd22e0447ec66c12e67afc973e95523451b Author: Lonnie Liu <[email protected]> Date: Thu Nov 13 10:35:38 2025 -0800 [ci] mark github.Repository as typechecking (#58582) so that importing test.py does not always import github github repo imports jwt, which then imports cryptography and can lead to issues on windows. Signed-off-by: Lonnie Liu <[email protected]> commit 208970b5b399133a41557db8b16ad6832180e6b7 Author: Lonnie Liu <[email protected]> Date: Thu Nov 13 10:35:23 2025 -0800 [wheel] stop building python 3.9 wheels on the pipelines (#58587) also stops building python 3.9 aarch64 images Signed-off-by: Lonnie Liu <[email protected]> commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11 Author: Lonnie Liu <[email protected]> Date: Thu Nov 13 10:32:21 2025 -0800 [serve] run tests in python 3.10 (#58586) all tests are passing Signed-off-by: Lonnie Liu <[email protected]> commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01 Author: Zac Policzer <[email protected]> Date: Thu Nov 13 07:37:52 2025 -0800 [core] Add monitoring in raylet for resouce view (#58382) We today have very little observability into pubsub. On a raylet one of the most important states that need to be propagated through the cluster via pubsub is cluster membership. All raylets should in an eventual BUT timely fashion agree on the list of available nodes. This metric just emits a simple counter to keep track of the node count. More pubsub observability to come. 
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: zac <[email protected]> Signed-off-by: Zac Policzer <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit dde70e76e5aa993e9224a2d173a053a35a132ebd Author: Xinyu Zhang <[email protected]> Date: Wed Nov 12 23:04:37 2025 -0800 [Data] Fix HTTP streaming file download by using `open_input_stream` (#58542) Fixes HTTP streaming file downloads in Ray Data's download operation. Some URIs (especially HTTP streams) require `open_input_stream` instead of `open_input_file`. - Modified `download_bytes_threaded` in `plan_download_op.py` to try both `open_input_file` and `open_input_stream` for each URI - Improved error handling to distinguish between different error types - Failed downloads now return `None` gracefully instead of crashing ``` import pyarrow as pa from ray.data.context import DataContext from ray.data._internal.planner.plan_download_op import download_bytes_threaded urls = [ "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&", ] table = pa.table({"url": urls}) ctx = DataContext.get_current() results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx)) result_table = results[0] for i in range(result_table.num_rows): url = result_table['url'][i].as_py() bytes_data = result_table['bytes'][i].as_py() if bytes_data is None: print(f"Row {i}: FAILED (None) - try-catch worked ✓") else: print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)") print(f" URL: {url[:60]}...") print("\n✅ Test passed: Failed downloads return None instead of crashing.") ``` Before the fix: ``` TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem' During handling of the above exception, 
another exception occurred: Traceback (most recent call last): File "/home/ray/default/test_streaming_fallback.py", line 110, in <module> test_download_expression_with_streaming_fallback() File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__ if not self.__exit__(*sys.exc_info()): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__ setattr(self.target, self.attribute, self.temp_original) TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem' (base) ray@ip-10-0-39-21:~/default$ python test.py 2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker! Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker for result in fn(input_queue_iter): ^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes yield f.read() ^^^^^^^^ File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek raise ValueError("Cannot seek streaming HTTP file") ValueError: Cannot seek streaming HTTP file Traceback (most recent call last): File "/home/ray/default/test.py", line 16, in <module> results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx)) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded uri_bytes = list( ^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen raise item File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker for result in fn(input_queue_iter): ^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes yield f.read() ^^^^^^^^ File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek raise ValueError("Cannot seek streaming HTTP file") ValueError: Cannot seek streaming HTTP file ``` After the fix: ``` Row 0: SUCCESS (189370 bytes) URL: https://static-assets.tesla.com/configurator/compositor?cont... ``` Tested with HTTP streaming URLs (e.g., Tesla configurator images) that previously failed: - ✅ Successfully downloads HTTP stream files - ✅ Gracefully handles failed downloads (returns None) - ✅ Maintains backward compatibility with existing file downloads --------- Signed-off-by: xyuzh <[email protected]> Signed-off-by: Robert Nishihara <[email protected]> Co-authored-by: Robert Nishihara <[email protected]> commit 438d6dcf225b7b03ba75ce9593050971458b94ac Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 22:19:50 2025 -0800 [ci] pin docker client version (#58579) otherwise, the newer docker client will refuse to communicate with the docker daemon that is on an older version. 
Signed-off-by: Lonnie Liu <[email protected]> commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750 Author: Elliot Barnwell <[email protected]> Date: Wed Nov 12 22:08:45 2025 -0800 [deps] adding include_setuptools flag for depset config (#58580) Adding optional `include_setuptools` flag for depset configuration If the flag is set on a depset config --unsafe-package setuptools will not be included for depset compilation If the flag does not exist (default false) on a depset config --unsafe-package setuptools will be appended to the default arguments --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]> commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25 Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 20:36:43 2025 -0800 [serve] remove minbuild-serve-py3.9 (#58585) nothing is using it anymore Signed-off-by: Lonnie Liu <[email protected]> commit 0cdbe3f24132c69c4d6ce9322f85de767b660135 Author: Ibrahim Rabbani <[email protected]> Date: Wed Nov 12 18:48:27 2025 -0800 [core] (cgroups) Use /proc/mounts if mount file is missing. (#58577) Signed-off-by: irabbani <[email protected]> commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 18:26:25 2025 -0800 [deps] update `requirements_buildkite.txt` (#58574) as the pydantic version is pinned in `requirements-doc.txt` now. Signed-off-by: Lonnie Liu <[email protected]> commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 16:38:54 2025 -0800 Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578) Reverts ray-project/ray#58535 failing on windows.. 
:( commit 2f55d078bb69f39198eccf6293683e17a2e72dc5 Author: Goutam <[email protected]> Date: Wed Nov 12 16:37:24 2025 -0800 [Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270) - Support upserting iceberg tables for IcebergDatasink - Update schema on APPEND and UPSERT - Enable overwriting the entire table Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite functionality. Also for append, the library now handles the transaction logic implicitly so that burden can be lifted from Ray Data. --------- Signed-off-by: Goutam <[email protected]> commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f Author: Joshua Lee <[email protected]> Date: Wed Nov 12 16:31:26 2025 -0800 [core] Use GetNodeAddressAndLiveness in raylet client pool (#58576) Using GetNodeAddressAndLiveness in raylet client pool instead of the bulkier Get, same for AsyncGetAll. Seems like it was already done in core worker client pool, so just making the same change for raylet client pool.
Signed-off-by: joshlee <[email protected]> commit e713b3de319afd437f2de7435f5a2870167fa99a Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 15:01:35 2025 -0800 [doc] set default python env to 3.10 (#58570) we stop supporting building with python 3.9 now Signed-off-by: Lonnie Liu <[email protected]> commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 15:01:20 2025 -0800 [bazel] rename contraint from hermatic to python_version (#58499) which is more accurate also moves python constraint definitions into `bazel/` directory and registering python 3.10 platform with hermetic toolchain this allows performing migration from python 3.9 to python 3.10 incrementally Signed-off-by: Lonnie Liu <[email protected]> commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623 Author: Elliot Barnwell <[email protected]> Date: Wed Nov 12 14:23:17 2025 -0800 [images][deps] raydepsets base extra depset (#58461) generating depsets for base extra python requirements Installing requirements in base extra image --------- Signed-off-by: elliot-barn <[email protected]> commit df65225e4f98bce2b45405b1cf89fb70556e2871 Author: Daniel Shin <[email protected]> Date: Thu Nov 13 07:08:15 2025 +0900 [Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371) Currently Ray Data has a preprocessor called `RobustScaler`. This scales the data based on given quantiles. Calculating the quantiles involves sorting the entire dataset by column for each column (C sorts for C number of columns), which, for a large dataset, will require a lot of calculations. ** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch` as I couldn't actually find well-maintained tdigest libraries for python. ddsketch is better maintained.
** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile` aggregator N/A N/A --------- Signed-off-by: kyuds <[email protected]> Signed-off-by: Daniel Shin <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]> commit 5e71d58badbfdcfc002826398c3e02469065cc71 Author: Sampan S Nayak <[email protected]> Date: Thu Nov 13 03:33:18 2025 +0530 [Core] support token auth in ray client server (#58557) support token auth in ray client server by using the existing grpc interceptors. This pr refactors the code to: - add/rename sync and async client and server interceptors - create grpc utils to house grpc channel and server creation logic, python codebase is updated to use these methods - separate tests for sync and async interceptors - make existing authentication integration tests to run with RAY_CLIENT mode --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd Author: Kunchen (David) Dai <[email protected]> Date: Wed Nov 12 13:45:02 2025 -0800 [Core] Move request id creation to worker to address plasma get perf regression (#58390) This PR address the performance regression introduced in the [PR to make ray.get thread safe](https://github.com/ray-project/ray/pull/57911). Specifically, the previous PR requires the worker to block and wait for AsyncGet to return with a reply of the request id needed for correctly cleaning up get requests. This additional synchronous step causes the plasma store Get to regress in performance. This PR moves the request id generation step to the plasma store, removing the blocking step to fix the perf regression. 
- [PR which introduced perf regression](https://github.com/ray-project/ray/pull/57911) - [PR which observed the regression](https://github.com/ray-project/ray/pull/58175) New performance of the change measured by `ray microbenchmark`. <img width="485" height="17" alt="image" src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0" /> Original performance prior to the change. Here we focus on the regressing `single client get calls (Plasma Store)` metric, where our new performance returns us back to the original 10k per second range compared to the existing sub 5k per second. <img width="811" height="355" alt="image" src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c" /> --------- Signed-off-by: davik <[email protected]> Co-authored-by: davik <[email protected]> commit 9e450e6805824ac825488e1455ac97f93df0bbc3 Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 12:36:21 2025 -0800 [doc] symlink the doc dependency lock file (#58520) and ask people to use that lock file for building docs. Signed-off-by: Lonnie Liu <[email protected]> commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4 Author: Lehui Liu <[email protected]> Date: Wed Nov 12 12:08:28 2025 -0800 [train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783) 1. JaxTrainer relying on the runtime env var "JAX_PLATFORMS" to be set to initialize jax.distributed: https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38 2. Before this change, user will have to configure both `use_tpu=True` in `ray.train.ScalingConfig` and passing `JAX_PLATFORMS=tpu` to be able to start jax.distributed. `JAX_PLATFORMS` can be comma separated string. 3. If user uses other jax.distributed libraries like Orbax, sometimes, it will leads to misleading error about distributed initialization. 4. After this change, if user sets `use_tpu=True`, we automatically add this to env var. 5. 
tpu unit test is not available this time, will explore for how to cover it later. --------- Signed-off-by: Lehui Liu <[email protected]> commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312 Author: iamjustinhsu <[email protected]> Date: Wed Nov 12 12:04:16 2025 -0800 [Data] Add `Ranker` Interface (#58513) Creates a ranker interface that will rank the best operator to run next in `select_operator_to_run`. This code only refactors the existing code. The ranking value must be something that is comparable. None None --------- Signed-off-by: iamjustinhsu <[email protected]> commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2 Author: Lonnie Liu <[email protected]> Date: Wed Nov 12 11:50:42 2025 -0800 [bazel] upgrade bazel python rules to 0.25.0 (#58535) previously it was actually using 0.4.0, which is set up by the grpc repo. the declaration in the workspace file was being shadowed. Signed-off-by: Lonnie Liu <[email protected]> commit 02afe68937429bfd6501e4d0f46780bca4dea329 Author: Balaji Veeramani <[email protected]> Date: Wed Nov 12 11:34:59 2025 -0800 [Data] Refactor concurrency validation tests in `test_map.py` (#58549) The original `test_concurrency` function combined multiple test scenarios into a single test with complex control flow and expensive Ray cluster initialization. This refactoring extracts the parameter validation tests into focused, independent tests that are faster, clearer, and easier to maintain. Additionally, the original test included "validation" cases that tested valid concurrency parameters but didn't actually verify that concurrency was being limited correctly; they only checked that the output was correct, which isn't useful for validating the concurrency feature itself.
**Key improvements:** - Split validation tests into `test_invalid_func_concurrency_raises` and `test_invalid_class_concurrency_raises` - Use parametrized tests for different invalid concurrency values - Switch from `shutdown_only` with explicit `ray.init()` to `ray_start_regular_shared` to eliminate cluster initialization overhead - Minimize test data from 10 blocks to 1 element since we're only validating parameter errors - Remove non-validation tests that didn't verify concurrency behavior N/A The validation tests now execute significantly faster and provide clearer failure messages. Each test has a single, well-defined purpose making maintenance and debugging easier. --------- Signed-off-by: Balaji Veeramani <[email protected]> commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2 Author: Balaji Veeramani <[email protected]> Date: Wed Nov 12 11:32:48 2025 -0800 [Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523) This PR improves documentation consistency in the `python/ray/data` module by converting all remaining rST-style docstrings (`:param:`, `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.). 
**Files modified:** - `python/ray/data/preprocessors/utils.py` - Converted `StatComputationPlan.add_callable_stat()` - `python/ray/data/preprocessors/encoder.py` - Converted `unique_post_fn()` - `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()` and `BlockColumnAccessor.is_composed_of_lists()` - `python/ray/data/_internal/datasource/delta_sharing_datasource.py` - Converted `DeltaSharingDatasource.setup_delta_sharing_connections()` Signed-off-by: Balaji Veeramani <[email protected]> commit 7e872837e450411e9da45acea0c52f4b67221500 Author: Nikhil G <[email protected]> Date: Wed Nov 12 09:07:32 2025 -0800 [serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504) Signed-off-by: Nikhil Ghosh <[email protected]> commit cd09d104f6d595a805fd8f9979d9f81a828823b5 Author: Alexey Kudinkin <[email protected]> Date: Wed Nov 12 11:50:05 2025 -0500 [Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262) This value was set to align with the previous default of 4. However, after some consideration, I've realized that 4 is too high, so this PR lowers it to 2. Signed-off-by: Alexey Kudinkin <[email protected]> commit 126a40bc711cf06ed44686ee5026624d6b78766e Author: Cuong Nguyen <[email protected]> Date: Wed Nov 12 07:44:53 2025 -0800 [core] fix idle node termination on object pulling (#57928) Currently, a node is considered idle while pulling objects from the remote object store. 
This can lead to situations where a node is terminated as idle, causing the cluster to enter an infinite loop when pulling large objects that exceed the node idle termination timeout. This PR fixes the issue by treating object pulling as a busy activity. Note that nodes can still accept additional tasks while pulling objects (since pulling consumes no resources), but the auto-scaler will no longer terminate the node prematurely. Closes #54372 Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> commit ad8f30291137efce9e463fb23e6821f4c7c74a9c Author: Sagar Sumit <[email protected]> Date: Wed Nov 12 05:40:47 2025 -0800 [core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090) When actors terminate gracefully, Ray calls the actor's `__ray_shutdown__()` method if defined, allowing for cleanup of resources. But, this is not invoked in case actor goes out of scope due to `del actor`. Traced through the entire code path, and here's what happens: Flow when `del actor` is called: 1. **Python side**: `ActorHandle.__del__()` -> `worker.core_worker.remove_actor_handle_reference(actor_id)` https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040 2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` -> `reference_counter_->RemoveLocalReference()` - When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed` callback https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506 3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` -> `AsyncReportActorOutOfScope()` to GCS https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183 https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51 4. 
**GCS receives notification**: `HandleReportActorOutOfScope()` - **THE PROBLEM IS HERE** ([line 279 in `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)): ```cpp DestroyActor(actor_id, GenActorOutOfScopeCause(actor), /*force_kill=*/true, // <-- HARDCODED TO TRUE! [reply, send_reply_callback]() { ``` 5. **Actor worker receives kill signal**: `HandleKillActor()` in [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970) ```cpp if (request.force_kill()) { // This is TRUE for OUT_OF_SCOPE ForceExit(...) // Skips __ray_shutdown__ } else { Exit(...) // Would call __ray_shutdown__ } ``` 6. **ForceExit path**: Bypasses graceful shutdown -> No `__ray_shutdown__` callback invoked. This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE actors. Also, updated the docs. --------- Signed-off-by: Sagar Sumit <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]> commit 15393edbe72f5079279d3a0e46b72adc7496cdfc Author: Sampan S Nayak <[email protected]> Date: Wed Nov 12 19:00:10 2025 +0530 [Core] use client interceptor for adding auth token in c++ client calls (#58424) - Use client interceptor for adding auth tokens in grpc calls when `AUTH_MODE=token` - BuildChannel() will automatically include the interceptor - Removed `auth_token` parameter from `ClientCallImpl` - removed manual auth from `python_gcs_subscriber`.cc - tests to verify auth works for autoscaller apis --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit d496ea87808706333703be6ff25ecc9472330fd5 Author: Sampan S Nayak <[email protected]> Date: Wed Nov 12 11:25:11 2025 +0530 [core] 
Token auth usability improvements (#58408) - rename RAY_auth_mode → RAY_AUTH_MODE environment variable across codebase - Excluded healthcheck endpoints from authentication for Kubernetes compatibility - Fixed dashboard cookie handling to respect auth mode and clear stale tokens when switching clusters --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b Author: kourosh hakhamaneshi <[email protected]> Date: Tue Nov 11 19:50:52 2025 -0800 [doc][serve][llm] Attached the correct figure to the pd docs (#58543) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> commit a15f5be797ced0df321bfd8d42bab7d57defa2de Author: Lonnie Liu <[email protected]> Date: Tue Nov 11 18:00:43 2025 -0800 [doc] downgrade readthedocs to use python 3.10 (#58536) be consistent with the default build environment Signed-off-by: Lonnie Liu <[email protected]> commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed Author: Dhyey Shah <[email protected]> Date: Tue Nov 11 17:26:15 2025 -0800 [core] Fix auth test import (#58554) The python test step is failing on master now because of this. Probably a logical merge conflict. ``` FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary) ... 
[2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import ( -- | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils' ``` Signed-off-by: dayshah <[email protected]> commit 20bf68263beed3609e24aede3d9fc96bc07f0da0 Author: Dhyey Shah <[email protected]> Date: Tue Nov 11 12:44:05 2025 -0800 [core][rdt] Abort NIXL and allow actor reuse on failed transfers (#56783) Signed-off-by: dayshah <[email protected]> commit 89a329cd1e0219629132abc203085117a11949f3 Author: Dhyey Shah <[email protected]> Date: Tue Nov 11 12:26:17 2025 -0800 [core] Improve kill actor logs (#58544) Signed-off-by: dayshah <[email protected]> commit 6c9607ea57b9edde07c856f094835c84f47b79a6 Author: Nikhil G <[email protected]> Date: Tue Nov 11 12:16:41 2025 -0800 [docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715) Signed-off-by: Nikhil Ghosh <[email protected]> Signed-off-by: Nikhil G <[email protected]> commit 711d9453828fecebb91b9642e799b4b0b4a493f7 Author: Dhyey Shah <[email protected]> Date: Tue Nov 11 12:13:13 2025 -0800 [core] Make GlobalState lazy initialization thread-safe (#58182) Signed-off-by: dayshah <[email protected]> commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb Author: Kai-Hsun Chen <[email protected]> Date: Tue Nov 11 09:43:05 2025 -0800 [core] Scheduling a detached actor with a placement group is not recommended (#57726)    If users schedule a detached actor into a placement group, Raylet will kill the actor when the placement group is removed. The actor will be stuck in the `RESTARTING` state forever if it's restartable until users explicitly kill it. In that case, if users try to `get_actor` with the actor's name, it can still return the restarting actor, but no process exists. It will no longer be restarted because the PG is gone, and no PG with the same ID will be created during the cluster's lifetime. 
The better behavior would be for Ray to transition a task/actor's state to dead when it is impossible to restart. However, this would add too much complexity to the core, so I think it's not worth it. Therefore, this PR adds a warning log, and users should use detached actors or PGs correctly. Example: Run the following script and run `ray list actors`. ```python import ray from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy from ray.util.placement_group import placement_group, remove_placement_group @ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1) class Actor: pass ray.init() pg = placement_group([{"CPU": 1}]) ray.get(pg.ready()) actor = Actor.options( scheduling_strategy=PlacementGroupSchedulingStrategy( placement_group=pg, ) ).remote() ray.get(actor.__ray_ready__.remote()) ```  **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [x] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable)  --------- Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Robert Nishihara <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Robert Nishihara <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit 0752886e7d55694b6cf8d780b7470d58266c6a10 Author: Cuong Nguyen <[email protected]> Date: Tue Nov 11 07:19:19 2025 -0800 [core] enable open telemetry by default (#56432) This PR enables open telemetry as the default backend for ray metric stack. The bulk of this PR is actually to fix tests that were written with some assumptions that no longer hold true. 
For ease of reviewing, I inline the reasons for the change together with the change for each tests in the comments. This PR also depends on a release of vllm (so that we can update the minimal supported version of vllm in ray). Test: - CI  --- > [!NOTE] > Enable OpenTelemetry metrics backend by default and refactor metrics/Serve tests to use timeseries APIs and updated `ray_serve_*` metric names. > > - **Core/Config**: > - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to `true` in `ray_constants.py` and `ray_config_def.h`. > - Metrics `Counter`: use `CythonCount` by default; keep legacy `CythonSum` only when OTEL is explicitly disabled. > - **Serve/Metrics Tests**: > - Replace text scraping with `PrometheusTimeseries` and `fetch_prometheus_metric_timeseries` throughout. > - Update metric names/tags to `ray_serve_*` and counter suffixes `*_total`; adjust latency metric names and processing/queued gauges. > - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and pass through helpers. > - **General Test Fixes**: > - Remove OTEL parametrization/fixtures; simplify expectations where counters-as-gauges no longer apply; drop related tests. > - Cardinality tests: include `"low"` level and remove OTEL gating; stop injecting `enable_open_telemetry` in system config. > - Actor/state/thread tests: migrate to cluster fixtures, wait for dashboard agent, and adjust expected worker thread counts. > - **Build**: > - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env from C++ stats test. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically on new commits. 
Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>  --------- Signed-off-by: Cuong Nguyen <[email protected]> commit bf595e32d049503f5c1931c5b477647a06d191c2 Author: Sampan S Nayak <[email protected]> Date: Tue Nov 11 19:15:41 2025 +0530 [Core] move authentication_test_utils into ray._private to fix macos tests (#58528) the auth token test setup in `conftest.py` is breaking macos test. there are two test scripts (`test_microbenchmarks.py` and `test_basic.py`) that run after the wheel is installed but without editable mode. for these test to pass,` conftest.py` cannot import anything under `ray.tests`. this pr moves `authentication_test_utils` into `ray._private` to fix this issue Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]> commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8 Author: Sampan S Nayak <[email protected]> Date: Tue Nov 11 19:10:56 2025 +0530 [Core] Add Service Interceptor to support token authentication in dashboard agent (#58405) Add a grpc service interceptor to intercept all dashboard agent rpc calls and validate the presence of auth token (when auth mode is token) --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> commit 1a48e7318442d038f2c43d22da3b580fa643b8d1 Author: curiosity-hyf <[email protected]> Date: Tue Nov 11 21:35:42 2025 +0800 [Docs] fix pattern_async_actor demo typo (#58486) fix pattern_async_actor demo typo. Add `self.`. 
--------- Signed-off-by: curiosity-hyf <[email protected]> commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a Author: Thomas Desrosiers <[email protected]> Date: Mon Nov 10 18:28:46 2025 -0800 Update pydoclint to version 0.8.1 (#58490) * Does the work to bump pydoclint up to the latest version * And allowlist any new violations it finds n/a n/a --------- Signed-off-by: Thomas Desrosiers <[email protected]> commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1 Author: Goutam <[email protected]> Date: Mon Nov 10 17:34:13 2025 -0800 [Data] - Iceberg support predicate & projection pushdown (#58286) Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in conjunction with this PR should speed up reads from Iceberg. Once the above change lands, we can add the pushdown interface support for IcebergDatasource --------- Signed-off-by: Goutam <[email protected]> commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee Author: Seiji Eicher <[email protected]> Date: Mon Nov 10 17:28:23 2025 -0800 [serve][llm] Fix import path in muli-node release test (#58498) Signed-off-by: Seiji Eicher <[email protected]> commit 405c4648c2fe71afb7daf4ea574605190f129fd7 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 16:04:48 2025 -0800 [ci] upgrade rayci version (#58514) to 0.21.0; supports wanda priority now. Signed-off-by: Lonnie Liu <[email protected]> commit 6de012fd0df23993054653ca5517a66944c58dd2 Author: Zac Policzer <[email protected]> Date: Mon Nov 10 14:05:15 2025 -0800 [core] Add owned object spill metrics (#57870) This PR adds 2 new metrics to core_worker by way of the reference counter. The two new metrics keep track of the count and size of objects owned by the worker as well as keeping track of their states. 
States are defined as:
- **PendingCreation**: An object that is pending creation and hasn't finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't spilled
- **Spilled**: An object which has an assigned node address and is spilled
- **InMemory**: An object which has no assigned address but isn't pending creation (and therefore must be local)

The approach used by these new metrics is to examine the state before and after any mutation on the reference in the reference_counter. This is required in order to do the appropriate bookkeeping (decrementing some values and incrementing others). Admittedly, a metrics pass can observe the counts between a decrement and its matching increment, depending on when the RecordMetrics loop runs. This unfortunate side effect seems preferable to taking a lock around metric collection, as this is potentially a high-throughput code path. In addition, keeping live counts seemed preferable to doing a full accounting of the object store and all references at metric collection time, because the reference counter may be tracking millions of objects and each metric scan could be very expensive. So keeping running totals (despite their being potentially inaccurate for short periods) seemed the right call. This PR also allows for object size to change due to potentially non-deterministic instantiation (say an object is initially created, but its primary copy dies, and then the recreation fails). This is an edge case, but it seems important for completeness' sake. 
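The before-and-after bookkeeping described above can be sketched roughly as follows. This is a minimal illustration in Python, not the C++ reference counter itself: `Ref`, `classify`, and `OwnedObjectStats` are hypothetical names, and the state rules are taken only from the four definitions listed in this commit message.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ref:
    """Hypothetical stand-in for one owned-object reference."""
    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0  # a pending object is "sizeless"


def classify(ref: Ref) -> str:
    """Map a reference to one of the four states defined above."""
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is not None:
        return "Spilled" if ref.spilled else "InPlasma"
    return "InMemory"


class OwnedObjectStats:
    """Running per-state totals, updated on every mutation (no full scan)."""

    def __init__(self) -> None:
        self.count = Counter()
        self.bytes = Counter()

    def add(self, ref: Ref) -> None:
        state = classify(ref)
        self.count[state] += 1
        self.bytes[state] += ref.size

    def mutate(self, ref: Ref, mutation: Callable[[Ref], None]) -> None:
        # Snapshot state and size before the mutation...
        before_state, before_size = classify(ref), ref.size
        mutation(ref)
        # ...and again after it, then move the object between buckets.
        after_state, after_size = classify(ref), ref.size
        self.count[before_state] -= 1
        self.bytes[before_state] -= before_size
        self.count[after_state] += 1
        self.bytes[after_state] += after_size


def finish_creation(r: Ref) -> None:
    r.pending_creation = False
    r.node_address = "node-a"  # illustrative address
    r.size = 100


stats = OwnedObjectStats()
ref = Ref()
stats.add(ref)
stats.mutate(ref, finish_creation)
print(dict(stats.count), dict(stats.bytes))
```

Because the size is re-read after each mutation, the sketch also tolerates an object whose size changes between updates, which is the edge case the message mentions.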
--------- Signed-off-by: zac <[email protected]> commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 14:02:11 2025 -0800 [java] remove local genrule `//java:ray_java_pkg` (#58503) using `bazelisk run //java:gen_ray_java_pkg` everywhere Signed-off-by: Lonnie Liu <[email protected]> commit b23adc777c5b103291cf3a35b51b123a808d36f6 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 14:01:27 2025 -0800 [ci] apply isort to release test directory, part 1 (#58505) excluding `*_tests` directories for now to reduce the impact Signed-off-by: Lonnie Liu <[email protected]> commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 14:01:06 2025 -0800 [doc] change link check to run on python 3.12 (#58506) migrating all doc related things to run on python 3.12 Signed-off-by: Lonnie Liu <[email protected]> commit b09b076e15fefe842a0b7e33accff71ec3c31435 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 14:00:01 2025 -0800 [doc] ci: move doc annotation check to python 3.12 (#58507) be consistent with doc build environment Signed-off-by: Lonnie Liu <[email protected]> commit 8971f83ecb40d54729c2c26d394594c29199e19d Author: iamjustinhsu <[email protected]> Date: Mon Nov 10 12:52:43 2025 -0800 [data] Clear queue for manually mark_execution_finished operators (#58441) Currently, we clear _external_ queues when an operator is manually marked as finished. But we don't clear their _internal_ queues. This PR fixes that Fixes this test https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736 --------- Signed-off-by: iamjustinhsu <[email protected]> commit ffb51f866802ad3858d82a9356855a38503efec9 Author: Matthew Owen <[email protected]> Date: Mon Nov 10 10:54:34 2025 -0800 [data] Update depsets for multimodal inference release tests (#57233) Update remaining mulitmodal release tests to use new depsets. 
commit 62231dd4ba8e784da8800b248ad7616b8db92de7 Author: Lonnie Liu <[email protected]> Date: Mon Nov 10 10:30:00 2025 -0800 [ci] seperate doc related jobs into its own group (#58454) so that they are no longer called lints. Signed-off-by: Lonnie Liu <[email protected]> commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2 Author: harshit-anyscale <[email protected]> Date: Mon Nov 10 23:45:38 2025 +0530 increase timeout for test_initial_replica tests (#58423) - The `test_target_capacity` Windows test is failing, possibly because of the short 10-second timeout; this change increases it to verify whether the timeout is the issue. Signed-off-by: harshit <[email protected]> commit 217031a48f4f83d04950ad39b94846ba362edd37 Author: Jugal Shah <[email protected]> Date: Mon Nov 10 09:39:43 2025 -0800 Define an env for controlling UVloop (#58442) Our Serve ingress keeps running into the error below, related to `uvloop`, under heavy load: ``` File descriptor 97 is used by transport ``` The uvloop team has a [PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it seems no one is working on it. One workaround mentioned in the [PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982) is to turn off uvloop. We tried that in our environment and didn't see any major performance difference. Hence, this PR defines a new environment variable for controlling uvloop. Signed-off-by: jugalshah291 <[email protected]> commit 2486ddd9fec83cc940937e3d91368942588ef177 Author: fscnick <[email protected]> Date: Mon Nov 10 23:29:03 2025 +0800 [Doc][KubeRay] eliminate vale errors (#58429) Fix some Vale errors and suggestions in the kai-scheduler document. 
See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719 Signed-off-by: fscnick <[email protected]> commit cb6a60d0afcfca87734a399291343e297031f1d5 Author: Daniel Sperber <[email protected]> Date: Mon Nov 10 16:24:34 2025 +0100 [air] Add stacklevel option to deprecation_warning (#58357) Currently, deprecation warnings are sometimes not informative enough. When the warning is triggered, it does not tell us *where* the deprecated feature is used. For example, Ray internally raises a deprecation warning when an `RLModuleConfig` is initialized. ```python >>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig >>> RLModuleConfig() 2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future! ``` This is confusing: where did *I* use a config, and what am I doing wrong? This raises issues like: https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064 Tracing where the warning actually originates is tedious: is it my code or internal code? The output just shows `deprecation.py:50`, which is not helpful. This PR adds a stacklevel option, with stacklevel=2 as the default, to all `deprecation_warning`s, so developers and users can better see where the deprecated option is actually used. --- EDIT: **Before** ```python WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` ``` **After** The module and line where the deprecated artifact is used are shown in the log output. When building an Algorithm: ```python WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. 
Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future! ``` ```python .../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning" ``` Signed-off-by: Daraan <[email protected]> commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216 Author: Sampan S Nayak <[email protected]> Date: Mon Nov 10 20:50:21 2025 +0530 [core] Configure an interceptor to pass auth token in python direct g… (#58395) there are places in the python code where we use the raw grpc library to make grpc calls (eg: pub-sub, some calls to gcs etc). In the long term we want to fully deprecate grpc library usage in our python code base but as that can take more effort and testing, in this pr I am introducing an interceptor to add auth headers (this will take effect for all grpc calls made from python). ``` export RAY_auth_mode="token" export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789" ray start --head ray job submit -- echo "hi" ``` output ``` ray job submit -- echo "hi" 2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads. Job submission server address: http://127.0.0.1:8265 ------------------------------------------------------- Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully ------------------------------------------------------- Next steps Query the logs of the job: ray job logs raysubmit_1EV8q86uKM24nHmH Query the status of the job: ray job status raysubmit_1EV8q86uKM24nHmH Request the job to be stopped: ray job stop raysubmit_1EV8q86uKM24nHmH Tailing logs until the job exits (disable with --no-wait): 2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up. 
hi Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi ------------------------------------------ Job 'raysubmit_1EV8q86uKM24nHmH' succeeded ------------------------------------------ ``` dashboard test.py ```python import time import ray from ray._raylet import Config ray.init() @ray.remote def print_hi(): print("Hi") time.sleep(2) @ray.remote class SimpleActor: def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value actor = SimpleActor.remote() result = ray.get(actor.increment.remote()) for i in range(100): ray.get(print_hi.remote()) time.sleep(20) ray.shutdown() ``` ``` export RAY_auth_mode="token" export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789" python test.py ``` <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292" /> overview page <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762" /> job page: tasks are listed <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a" /> task page <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136" /> actors page <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459" /> specific actor page <img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0" /> --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]> commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f Author: Xinyu Zhang <[email protected]> Date: Sun Nov 9 08:34:46 2025 -0800 [Data] Add exception handling for invalid URIs in download operation (#58464) commit d74c1570543045a0f99df4d5690ac44f1fda4a55 Author: iamjustinhsu <[email protected]> 
Date: Sat Nov 8 15:35:11 2025 -0800 [dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384) Pass in `status_code` directly into `do_reply`. This is a follow up to https://github.com/ray-project/ray/pull/58255 --------- Signed-off-by: iamjustinhsu <[email protected]> commit e793631896f65a88513510b4e7bf6f100607cb03 Author: Rueian <[email protected]> Date: Sat Nov 8 15:32:10 2025 -0800 [core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460) This ensures node type names are correctly reported even when the autoscaler is disabled (read-only mode). Autoscaler v2 fails to report prometheus metrics when operating in read-only mode on KubeRay with the following KeyError error: ``` 2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group' Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state return Reconciler.reconcile( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile Reconciler._step_next( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next Reconciler._scale_cluster( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster reply = scheduler.schedule(sched_request) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule ResourceDemandScheduler._enforce_max_workers_per_type(ctx) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type node_config = ctx.get_node_type_configs()[node_type] KeyError: 'small-group' ``` This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types, 
which is correct for local Ray (where local ray does not have `RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where `ray_node_type_name` is present and expected wi…

increase timeout

7ecde7f

Signed-off-by: harshit <[email protected]>

harshit-anyscale requested a review from a team as a code owner November 6, 2025 05:11

harshit-anyscale self-assigned this Nov 6, 2025

harshit-anyscale added the `go` label (add ONLY when ready to merge, run all tests) Nov 6, 2025

gemini-code-assist bot reviewed Nov 6, 2025

cursor bot reviewed Nov 6, 2025

ray-gardener bot added the `serve` (Ray Serve Related Issue) and `core` (Issues that should be addressed in Ray Core) labels Nov 6, 2025

ok-scale approved these changes Nov 6, 2025

Merge branch 'master' into increase-timeout-for-wait-condition-v2

89e5cd8

cursor bot reviewed Nov 7, 2025

zcin approved these changes Nov 10, 2025

zcin merged commit 3f7a7b4 into master Nov 10, 2025
6 checks passed

zcin deleted the increase-timeout-for-wait-condition-v2 branch November 10, 2025 18:15

4 participants

	timeout=30,
	timeout=30, # Consider defining this as a constant, e.g., INITIAL_REPLICA_TEST_TIMEOUT_S
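A minimal sketch of the reviewer's named-constant suggestion, written so it runs without a Ray cluster. The constant name `INITIAL_REPLICA_TEST_TIMEOUT_S` comes from the review comment; the `wait_for_condition` and `check_expected_num_replicas` functions below are simplified local stand-ins (assumptions for illustration, not Ray's test helpers), and `FAKE_REPLICAS` is made-up data.

```python
import time

# Name taken from the review suggestion; the value matches the PR's timeout=30.
INITIAL_REPLICA_TEST_TIMEOUT_S = 30


def wait_for_condition(condition, timeout=10, retry_interval_ms=100, **kwargs):
    """Simplified stand-in: poll `condition(**kwargs)` until it is truthy.

    Raises RuntimeError if the condition is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition(**kwargs):
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(f"condition not met within {timeout} seconds")
        time.sleep(retry_interval_ms / 1000)


# Made-up replica counts standing in for a running Serve cluster.
FAKE_REPLICAS = {"app1": {"deployment": 2}, "app2": {"deployment": 2}}


def check_expected_num_replicas(deployment_to_num_replicas, app_name):
    """Hypothetical stand-in for the test helper of the same name."""
    return FAKE_REPLICAS.get(app_name) == deployment_to_num_replicas


# Every replica check shares one constant, so no call silently keeps the
# shorter default timeout.
for app_name in ("app1", "app2"):
    wait_for_condition(
        check_expected_num_replicas,
        deployment_to_num_replicas={"deployment": 2},
        app_name=app_name,
        timeout=INITIAL_REPLICA_TEST_TIMEOUT_S,
    )
```

Sharing one constant across the calls also addresses the second review comment, which notes that only one of the `wait_for_condition` calls received the longer timeout.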

    client.deploy_apps(new_config)
    wait_for_condition(
        lambda: serve.status().target_capacity == new_config_target_capacity
    )
    wait_for_condition(
        check_expected_num_replicas,
        deployment_to_num_replicas={
            deployment_name: int(
                initial_replicas * new_config_target_capacity / 100
            )
        },
        app_name="app1",
    )
    wait_for_condition(
        check_expected_num_replicas,

Conversation

harshit-anyscale commented Nov 6, 2025

gemini-code-assist bot left a comment

Code Review

gemini-code-assist bot Nov 6, 2025

cursor bot left a comment

Bug: Inconsistent Timeout Updates in Replicas Test Suite

zcin commented Nov 6, 2025

harshit-anyscale commented Nov 6, 2025

cursor bot Nov 7, 2025

Bug: Incomplete timeout propagation in test retries
