Add doc for how to run nds power over Spark Connect #233
Conversation
Signed-off-by: Bobby Wang <[email protected]>
Hi @tgravescs @jihoonson @gerashegalov @eordentlich, please help review this PR, thanks very much.
Greptile Overview
Greptile Summary
Added documentation section for running NDS Power Run over Spark Connect (Spark 4.0.0+), including installation instructions for pyspark-client and two execution methods: direct command-line execution and notebook API usage.
Key additions:
- Installation steps for `pyspark-client==4.0.0`
- Command-line execution example using the `SPARK_REMOTE` environment variable
- Notebook API usage example with the `gen_sql_from_stream` and `run_query_stream` functions
- Important note that the python listener is disabled in the Spark Connect environment (py4j is unavailable)
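As a sketch of what a stream-splitting step like `gen_sql_from_stream` does, here is a hypothetical, self-contained parser that splits a TPC-DS-style query stream file into a query-name-to-SQL dict. The marker format and function name are assumptions for illustration, not the actual NDS implementation.

```python
# Hypothetical sketch, not the actual nds_power.py code: split a dsqgen-style
# query stream file into {template_name: sql_text} using the "-- start query"
# / "-- end query" marker lines.
import re

def parse_query_stream(text):
    """Split a query stream on '-- start query' header lines into a dict."""
    queries = {}
    name = None
    for line in text.splitlines():
        m = re.match(r"^-- start query \d+ in stream \d+ using template (\S+)", line)
        if m:
            name = m.group(1)
            queries[name] = []
        elif name and not line.startswith("-- end query"):
            queries[name].append(line)
    return {k: "\n".join(v).strip() for k, v in queries.items()}

sample = """-- start query 1 in stream 0 using template query96.tpl
select count(*) from store_sales;
-- end query 1 in stream 0 using template query96.tpl"""
print(parse_query_stream(sample))  # → {'query96.tpl': 'select count(*) from store_sales;'}
```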
Confidence Score: 5/5
- This PR is safe to merge with no risk - documentation-only change
- Documentation-only PR that adds clear instructions for running NDS Power Run over Spark Connect. All changes are isolated to README.md, properly formatted, technically accurate, and aligned with the existing codebase implementation (PysparkBenchReport.py). The note about disabled python listener correctly reflects the code behavior.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nds/README.md | 5/5 | Added comprehensive Spark Connect documentation with setup instructions and python listener note |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant LocalEnv as Local Environment
    participant PySparkClient as PySpark Client
    participant SparkConnect as Spark Connect Server
    participant SparkCluster as Spark Cluster
    User->>LocalEnv: Install pyspark-client==4.0.0
    User->>LocalEnv: Set SPARK_REMOTE=sc://localhost
    alt Command-line execution
        User->>LocalEnv: python nds_power.py [args]
        LocalEnv->>PySparkClient: Initialize PySpark session
        PySparkClient->>SparkConnect: Connect via SPARK_REMOTE
        SparkConnect->>SparkCluster: Execute queries
        SparkCluster-->>SparkConnect: Query results
        SparkConnect-->>PySparkClient: Return results
        PySparkClient-->>LocalEnv: Write time.csv
        Note over PySparkClient,SparkConnect: Python listener disabled (no py4j)
    else Notebook API execution
        User->>LocalEnv: Import nds_power APIs
        LocalEnv->>PySparkClient: gen_sql_from_stream()
        LocalEnv->>PySparkClient: run_query_stream()
        PySparkClient->>SparkConnect: Connect and execute
        SparkConnect->>SparkCluster: Run queries
        SparkCluster-->>SparkConnect: Results
        SparkConnect-->>PySparkClient: Return data
        PySparkClient-->>LocalEnv: Save results
    end
```
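The "Write time.csv" step in the diagram can be sketched as timing each query and emitting a CSV report. The column names and helper functions below are illustrative stand-ins, not the actual `nds_power.py` code.

```python
# Hypothetical sketch of per-query timing and the time.csv report step.
import csv
import io
import time

def time_query(run_fn):
    """Run a query callable and return its wall-clock time in milliseconds."""
    start = time.perf_counter()
    run_fn()
    return (time.perf_counter() - start) * 1000.0

def write_time_report(results, fh):
    """Write (query_name, elapsed_ms) pairs as a CSV report."""
    writer = csv.writer(fh)
    writer.writerow(["query", "elapsed_ms"])
    for name, elapsed in results:
        writer.writerow([name, round(elapsed, 3)])

elapsed = time_query(lambda: sum(range(1000)))  # stand-in for a real Spark query
buf = io.StringIO()
write_time_report([("query96", elapsed)], buf)
print(buf.getvalue())
```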
1 file reviewed, no comments
jihoonson left a comment
Looks nice, left a minor suggestion.
nds/README.md (Outdated)

    --output_format parquet
    ```
    ### Power Run over Spark Connect
Can you add a subsection at Line 347 titled "Power Run with spark-submit"? I'd also add some introduction like below:
Users can use the `spark-submit-template` script to run the power run with spark-submit. An example command to submit nds_power.py by spark-submit-template utility is: ...
Sounds good. Done.
Co-authored-by: Gera Shegalov <[email protected]>
Greptile Overview
Greptile Summary
Added comprehensive documentation for running NDS Power benchmarks over Spark Connect. The documentation includes installation prerequisites, CLI usage examples, notebook API usage examples, and an important note about the python listener being disabled in Spark Connect environments.
Confidence Score: 5/5
Important Files Changed
File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant README as README.md Documentation
    participant pyspark_client as pyspark-client (local)
    participant nds_power as nds_power.py
    participant SparkConnect as Spark Connect Server
    participant SparkSession
    Note over User,SparkConnect: Setup Phase
    User->>pyspark_client: pip install pyspark-client==4.0.0
    User->>User: export SPARK_REMOTE=sc://localhost
    Note over User,SparkSession: Execution Phase (CLI)
    User->>nds_power: python nds_power.py parquet_sf3k query_0.sql time.csv
    nds_power->>SparkConnect: Connect via SPARK_REMOTE
    SparkConnect->>SparkSession: Create SparkSession
    nds_power->>nds_power: Check is_remote_only() == True
    nds_power->>nds_power: Skip python listener registration
    nds_power->>SparkSession: Execute queries via Spark Connect
    SparkSession-->>nds_power: Query results
    nds_power-->>User: time.csv with execution times
    Note over User,SparkSession: Execution Phase (Notebook API)
    User->>nds_power: from nds_power import gen_sql_from_stream, run_query_stream
    User->>nds_power: gen_sql_from_stream(query_stream_file)
    nds_power-->>User: query_dict
    User->>nds_power: run_query_stream(input_prefix, query_dict, ...)
    nds_power->>SparkConnect: Connect via SPARK_REMOTE
    SparkConnect->>SparkSession: Create SparkSession
    nds_power->>nds_power: Check is_remote_only() == True
    nds_power->>nds_power: Skip python listener registration
    nds_power->>SparkSession: Execute queries via Spark Connect
    SparkSession-->>nds_power: Query results
    nds_power-->>User: Execution complete with time logs
```
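The "Skip python listener registration" step in the diagram can be sketched as gating the py4j-based listener on whether a Spark Connect remote is configured. The `is_remote_only` helper below is an illustrative stand-in for whatever check `nds_power.py` actually performs.

```python
# Self-contained sketch of listener gating under Spark Connect; the env-var
# check is illustrative, not the actual nds_power.py implementation.
def is_remote_only(env):
    """True when SPARK_REMOTE points at a Spark Connect server (no py4j gateway)."""
    return bool(env.get("SPARK_REMOTE"))

def register_listener(env):
    """Register the python listener only when a py4j JVM gateway is available."""
    if is_remote_only(env):
        return "python listener disabled (py4j unavailable over Spark Connect)"
    return "python listener registered via py4j"

print(register_listener({"SPARK_REMOTE": "sc://localhost"}))
print(register_listener({}))
```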
1 file reviewed, no comments
    ```
    `Note:` the python listener is disabled when running nds_power.py over Spark Connect, as py4j
    is not available in the Spark Connect environment.
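One way a script could feature-detect py4j before registering a JVM-side listener is an importlib check, since pyspark-client ships without a JVM gateway. This is a sketch, not the actual gating used by `nds_power.py`.

```python
# Illustrative feature-detection for py4j availability; result depends on the
# local environment, so no specific output is asserted here.
import importlib.util

def py4j_available():
    """Return True only if the py4j package can be imported."""
    return importlib.util.find_spec("py4j") is not None

mode = "listener enabled" if py4j_available() else "listener disabled"
print(mode)
```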
It would be nice to say what this affects. If it affects the reporting of task failures, I think that would be good to add here.
This PR adds a section to the README on how to run NDS Power over Spark Connect, and adds a note that the python listener is disabled for now.
To fix the review comments