
Conversation

Collaborator

@wbo4958 wbo4958 commented Nov 10, 2025

This PR adds a section to the README on how to run the NDS power run over Spark Connect, and adds a note that the Python listener is disabled for now.

It also addresses the review comments.

Collaborator Author

wbo4958 commented Nov 10, 2025

Hi @tgravescs @jihoonson @gerashegalov @eordentlich, please help review this PR. Thanks very much.


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Added documentation section for running NDS Power Run over Spark Connect (Spark 4.0.0+), including installation instructions for pyspark-client and two execution methods: direct command-line execution and notebook API usage.

Key additions:

  • Installation steps for pyspark-client==4.0.0
  • Command-line execution example using SPARK_REMOTE environment variable
  • Notebook API usage example with gen_sql_from_stream and run_query_stream functions
  • Important note that the Python listener is disabled in the Spark Connect environment (py4j is unavailable there)
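At its core, the command-line path the summary describes times each query and writes the results to a `time.csv` file. A minimal, self-contained sketch of that bookkeeping — all names here are hypothetical illustrations, not nds_power.py's actual internals:

```python
import csv
import io
import time


def run_and_time(queries, execute):
    """Run each (name, sql) pair via `execute`, returning (name, elapsed_ms) tuples."""
    results = []
    for name, sql in queries:
        start = time.perf_counter()
        execute(sql)
        results.append((name, round((time.perf_counter() - start) * 1000, 3)))
    return results


def to_time_csv(results):
    """Render the timings as CSV text, similar in spirit to the time.csv output."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["query", "time_ms"])
    writer.writerows(results)
    return buf.getvalue()
```

In the real script, `execute` would be a Spark action on the session (local JVM-backed or Spark Connect); the timing and CSV-writing logic is the same either way.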

Confidence Score: 5/5

  • This PR is safe to merge with no risk - documentation-only change
  • Documentation-only PR that adds clear instructions for running NDS Power Run over Spark Connect. All changes are isolated to README.md, properly formatted, technically accurate, and aligned with the existing codebase implementation (PysparkBenchReport.py). The note about the disabled Python listener correctly reflects the code behavior.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| nds/README.md | 5/5 | Added comprehensive Spark Connect documentation with setup instructions and a Python listener note |

Sequence Diagram

sequenceDiagram
    participant User
    participant LocalEnv as Local Environment
    participant PySparkClient as PySpark Client
    participant SparkConnect as Spark Connect Server
    participant SparkCluster as Spark Cluster

    User->>LocalEnv: Install pyspark-client==4.0.0
    User->>LocalEnv: Set SPARK_REMOTE=sc://localhost
    
    alt Command-line execution
        User->>LocalEnv: python nds_power.py [args]
        LocalEnv->>PySparkClient: Initialize PySpark session
        PySparkClient->>SparkConnect: Connect via SPARK_REMOTE
        SparkConnect->>SparkCluster: Execute queries
        SparkCluster-->>SparkConnect: Query results
        SparkConnect-->>PySparkClient: Return results
        PySparkClient-->>LocalEnv: Write time.csv
        Note over PySparkClient,SparkConnect: Python listener disabled (no py4j)
    else Notebook API execution
        User->>LocalEnv: Import nds_power APIs
        LocalEnv->>PySparkClient: gen_sql_from_stream()
        LocalEnv->>PySparkClient: run_query_stream()
        PySparkClient->>SparkConnect: Connect and execute
        SparkConnect->>SparkCluster: Run queries
        SparkCluster-->>SparkConnect: Results
        SparkConnect-->>PySparkClient: Return data
        PySparkClient-->>LocalEnv: Save results
    end

1 file reviewed, no comments


Collaborator

@jihoonson jihoonson left a comment


Looks nice, left a minor suggestion.

nds/README.md Outdated
```
--output_format parquet
```
### Power Run over Spark Connect
Collaborator


Can you add a subsection at Line 347 titled "Power Run with spark-submit"? I'd also add some introduction like below:

Users can use the `spark-submit-template` script to run the power run with spark-submit. An example command to submit nds_power.py via the `spark-submit-template` utility is: ...

Collaborator Author


Sounds good. Done.

Co-authored-by: Gera Shegalov <[email protected]>

greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

Added comprehensive documentation for running NDS Power benchmarks over Spark Connect. The documentation includes installation prerequisites, CLI usage examples, notebook API usage examples, and an important note about Python listener being disabled in Spark Connect environments.

  • Restructured Power Run section with clearer subsection headers for spark-submit and Spark Connect execution paths
  • Added pyspark-client installation instructions for local Spark Connect client setup
  • Provided CLI example with SPARK_REMOTE environment variable configuration
  • Included notebook API usage example showing how to import and use gen_sql_from_stream and run_query_stream functions directly
  • Documented limitation that Python listener is disabled when running over Spark Connect due to py4j unavailability
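For orientation, here is a rough sketch of what a `gen_sql_from_stream`-style helper does: split a dsqgen query stream file into an ordered `{template_name: sql}` dict, ready to hand to `run_query_stream`. The `-- start query N in stream M using template X.tpl` marker format is an assumption based on TPC-DS dsqgen output, and this is illustrative code, not the actual nds_power.py implementation:

```python
import re

# Marker emitted by TPC-DS dsqgen at the head of each query (assumed format).
_START = re.compile(r"^-- start query \d+ in stream \d+ using template (\S+?)\.tpl")


def split_query_stream(text):
    """Split a query stream into an ordered {template_name: sql_text} dict."""
    queries, current, buf = {}, None, []
    for line in text.splitlines():
        match = _START.match(line)
        if match:
            if current is not None:
                queries[current] = "\n".join(buf).strip()
            current, buf = match.group(1), []
        elif line.startswith("-- end query"):
            continue  # drop the trailing marker line
        elif current is not None:
            buf.append(line)
    if current is not None:
        queries[current] = "\n".join(buf).strip()
    return queries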

Confidence Score: 5/5

  • This PR is safe to merge with no risk - it only adds documentation with no code changes
  • Score reflects that this is a documentation-only PR that accurately describes the Spark Connect functionality implemented in PR #226 (Support running nds_power over spark connect). The documentation is clear, well-structured, includes proper examples for both CLI and notebook usage, and correctly notes the Python listener limitation. Since there are no code changes, there is zero risk of introducing bugs or breaking existing functionality.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| nds/README.md | 5/5 | Added comprehensive documentation for running NDS Power over Spark Connect with installation instructions, usage examples, and important notes about listener limitations |

Sequence Diagram

sequenceDiagram
    participant User
    participant README as README.md Documentation
    participant pyspark_client as pyspark-client (local)
    participant nds_power as nds_power.py
    participant SparkConnect as Spark Connect Server
    participant SparkSession
    
    Note over User,SparkConnect: Setup Phase
    User->>pyspark_client: pip install pyspark-client==4.0.0
    User->>User: export SPARK_REMOTE=sc://localhost
    
    Note over User,SparkSession: Execution Phase (CLI)
    User->>nds_power: python nds_power.py parquet_sf3k query_0.sql time.csv
    nds_power->>SparkConnect: Connect via SPARK_REMOTE
    SparkConnect->>SparkSession: Create SparkSession
    nds_power->>nds_power: Check is_remote_only() == True
    nds_power->>nds_power: Skip python listener registration
    nds_power->>SparkSession: Execute queries via Spark Connect
    SparkSession-->>nds_power: Query results
    nds_power-->>User: time.csv with execution times
    
    Note over User,SparkSession: Execution Phase (Notebook API)
    User->>nds_power: from nds_power import gen_sql_from_stream, run_query_stream
    User->>nds_power: gen_sql_from_stream(query_stream_file)
    nds_power-->>User: query_dict
    User->>nds_power: run_query_stream(input_prefix, query_dict, ...)
    nds_power->>SparkConnect: Connect via SPARK_REMOTE
    SparkConnect->>SparkSession: Create SparkSession
    nds_power->>nds_power: Check is_remote_only() == True
    nds_power->>nds_power: Skip python listener registration
    nds_power->>SparkSession: Execute queries via Spark Connect
    SparkSession-->>nds_power: Query results
    nds_power-->>User: Execution complete with time logs


@wbo4958 wbo4958 merged commit 632c551 into NVIDIA:dev Nov 13, 2025
2 checks passed
@wbo4958 wbo4958 deleted the connect-doc branch November 13, 2025 23:13
`Note:` the Python listener is disabled when running nds_power.py over Spark Connect, as py4j
is not available in the Spark Connect environment.
Collaborator


It would be nice to say what this affects. If it affects the reporting of task failures, I think that would be good to add here.
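For context, the guard the note describes usually looks something like the sketch below: detect whether the session is a pure Spark Connect client (the review summary references an `is_remote_only()` check) and skip attaching the py4j-based listener when it is, which is what loses the per-task reporting. The import path and fallback behavior here are assumptions, not the actual nds_power.py code:

```python
def can_use_py4j_listener():
    """Return True when a py4j gateway is available for the Python listener.

    Over Spark Connect there is no JVM gateway in the client process, so the
    py4j-based listener cannot be attached. The `is_remote_only` helper is the
    check referenced in this PR's review summary; its exact import path may
    differ across PySpark versions (assumed here to be pyspark.util).
    """
    try:
        from pyspark.util import is_remote_only
    except ImportError:
        return False  # assume no usable JVM-backed PySpark installation
    return not is_remote_only()
```

Callers would register the benchmark listener only when this returns True, and otherwise log that task-level failure reporting is unavailable.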

