From fd0fb4a250b7a20d09af6659d34b70611fcbe6ea Mon Sep 17 00:00:00 2001
From: Bobby Wang
Date: Mon, 10 Nov 2025 14:38:55 +0800
Subject: [PATCH 1/3] Add doc for how to run nds power over Spark Connect

Signed-off-by: Bobby Wang
---
 nds/README.md | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/nds/README.md b/nds/README.md
index b9ffb8e..b6378af 100644
--- a/nds/README.md
+++ b/nds/README.md
@@ -378,6 +378,62 @@ time.csv \
 --output_format parquet
 ```
 
+### Power Run over Spark Connect
+
+Power Run currently supports execution over Spark Connect, starting with Spark 4.0.0. However,
+you cannot run `nds_power.py` via Spark Connect using the associated `spark-submit-template`.
+Instead, execute it directly.
+
+Before proceeding, ensure `pyspark-client` is installed locally. For example:
+
+- Install `pyspark-client`
+
+```bash
+pip install pyspark-client==4.0.0
+```
+
+- Run `nds_power.py`
+
+```shell
+export SPARK_REMOTE=sc://localhost
+python nds_power.py \
+parquet_sf3k \
+./nds_query_streams/query_0.sql \
+time.csv \
+--output_prefix /data/query_output \
+--output_format parquet
+```
+
+Alternatively, you can import the APIs in a notebook and execute them as follows:
+
+``` shell
+
+from nds_power import gen_sql_from_stream, run_query_stream
+
+import os
+os.environ["SPARK_REMOTE"] = "sc://localhost"
+
+query_stream_file = "nds_query_streams/query_0.sql"
+nds_data_path = "parquet_sf3k"
+time_log_file = "time.csv"
+
+query_dict = gen_sql_from_stream(query_stream_file)
+
+run_query_stream(input_prefix=nds_data_path,
+                 property_file=None,
+                 query_dict=query_dict,
+                 time_log_output_path=time_log_file,
+                 extra_time_log_output_path=None,
+                 sub_queries=None,
+                 warmup_iterations=0,
+                 iterations=1,
+                 plan_types="logical",
+                 )
+```
+
+**Note:** The Python listener is disabled when running `nds_power.py` over Spark Connect, as py4j
+is not available in the Spark Connect environment.
+
 ### Throughput Run
 
 Throughput Run simulates the scenario that multiple query sessions are running simultaneously in

From 0142e6a3f595c36c5f28a9e79aa9d715b819c9d3 Mon Sep 17 00:00:00 2001
From: Bobby Wang
Date: Tue, 11 Nov 2025 10:01:41 +0800
Subject: [PATCH 2/3] Update nds/README.md

Co-authored-by: Gera Shegalov
---
 nds/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nds/README.md b/nds/README.md
index b6378af..21af319 100644
--- a/nds/README.md
+++ b/nds/README.md
@@ -406,7 +406,7 @@ python nds_power.py \
 
 Alternatively, you can import the APIs in a notebook and execute them as follows:
 
-``` shell
+```python
 
 from nds_power import gen_sql_from_stream, run_query_stream
 

From 645f8667df1d58f6f35bbfa15fdeb1e1dc45fdd6 Mon Sep 17 00:00:00 2001
From: Bobby Wang
Date: Tue, 11 Nov 2025 10:21:28 +0800
Subject: [PATCH 3/3] comments

---
 nds/README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/nds/README.md b/nds/README.md
index 21af319..2cff3bf 100644
--- a/nds/README.md
+++ b/nds/README.md
@@ -345,7 +345,10 @@ optional arguments:
                         query39_part2
 ```
 
-Example command to submit nds_power.py by spark-submit-template utility:
+#### Power Run with spark-submit
+
+Users can use the `spark-submit-template` script to run the Power Run with spark-submit.
+An example command to submit `nds_power.py` via the `spark-submit-template` utility is:
 
 ```bash
 ./spark-submit-template power_run_gpu.template \
@@ -378,7 +381,7 @@ time.csv \
 --output_format parquet
 ```
 
-### Power Run over Spark Connect
+#### Power Run over Spark Connect
 
 Power Run currently supports execution over Spark Connect, starting with Spark 4.0.0. However,
 you cannot run `nds_power.py` via Spark Connect using the associated `spark-submit-template`.
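
The Spark Connect workflow documented above hinges on exporting a well-formed `SPARK_REMOTE` URL before `nds_power.py` is launched. As a minimal sketch (not part of the patches: the `connect_url` helper is hypothetical, and 15002 is Spark's default Connect server port when none is given), the URL can be built and validated up front rather than hand-typed:

```python
# Illustrative sketch, not part of the patch series: build and validate a
# Spark Connect URL before exporting SPARK_REMOTE. The connect_url helper
# is hypothetical; 15002 is the default Spark Connect server port.
import os


def connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Return a Spark Connect URL of the form sc://host:port."""
    if not host:
        raise ValueError("host must be non-empty")
    return f"sc://{host}:{port}"


if __name__ == "__main__":
    # Equivalent to `export SPARK_REMOTE=sc://localhost` in the README,
    # but with the default port spelled out explicitly.
    os.environ["SPARK_REMOTE"] = connect_url()
    print(os.environ["SPARK_REMOTE"])  # prints sc://localhost:15002
```

A bare `sc://localhost`, as used in the README, relies on the client filling in the default port, so the two forms should be interchangeable.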