[BUG] nds can't run over spark connect using pyspark #234

@wbo4958

Description

Describe the bug

When users install pyspark instead of pyspark-client, they hit the error below:

Traceback (most recent call last):
  File "/home/mahrens/git/spark-rapids-benchmarks/nds/nds_power.py", line 672, in <module>
    run_query_stream(args.input_prefix,
  File "/home/mahrens/git/spark-rapids-benchmarks/nds/nds_power.py", line 475, in run_query_stream
    summary = q_report.report_on(run_one_query,warmup_iterations,
  File "/home/mahrens/git/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 101, in report_on
    spark_conf = dict(self._get_spark_conf())
  File "/home/mahrens/git/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 88, in _get_spark_conf
    return self.spark_session.sparkContext._conf.getAll()
  File "/home/mahrens/.pyenv/versions/3.10.12/lib/python3.10/site-packages/pyspark/sql/connect/session.py", line 941, in __getattr__
    raise PySparkAttributeError(
pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `sparkContext` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.

It turns out we should use is_remote() instead of is_remote_only(): is_remote_only() is True only when the JVM-free pyspark-client package is installed, so with the full pyspark package it stays False even though the session is a Spark Connect session, and the code falls through to the sparkContext path. is_remote() detects Spark Connect sessions for both the pyspark and pyspark-client packages.

Steps/Code to reproduce bug

pip install pyspark==4.0.0

Run the NDS benchmark (e.g. nds_power.py) against a Spark Connect endpoint.
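Concretely, the two steps above look roughly like this (the Connect URL, repository path, and benchmark arguments are placeholders, and a Spark Connect server must already be running):

```shell
# Install the full PySpark distribution (not pyspark-client).
pip install pyspark==4.0.0

# Point the session at a Spark Connect server, then run the NDS power run.
# The sc://localhost URL and the elided arguments are illustrative only.
export SPARK_REMOTE="sc://localhost"
cd spark-rapids-benchmarks/nds
python nds_power.py ...
```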

Expected behavior
NDS should work with both the pyspark and pyspark-client packages.

Metadata
Labels

? - Needs Triage (Need team to review and classify), bug (Something isn't working)
