This repo started as a wrapper around the Spark REPLs for easier use with the Spark RAPIDS plugin. Lately I have been putting more effort into maintaining standalone Jupyter notebooks that can be started without the wrapper script and, in particular, can simply be opened in VSCode with the Jupyter extension.
A utility to start a RAPIDS-enabled Spark Shell with access to unit test resources from https://github.com/NVIDIA/spark-rapids
Before running the examples, make sure to at least execute mvn package in your local spark-rapids repo if you are not using binaries.
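For example, assuming the spark-rapids checkout lives at ~/spark-rapids (adjust the path to your setup), a minimal build skipping the unit tests could look like:

cd ~/spark-rapids && mvn package -DskipTests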
See rapids.sh --help for up-to-date information:
Usage: rapids.sh [OPTION]
Options:
  --debug
    enable bash tracing
  -h, --help
    prints this message
  -l4j=LOG4J_CONF_FILE, --log4j-file=LOG4J_CONF_FILE
    LOG4J_CONF_FILE location of a custom log4j config for local mode
  -nsys, --nsys-profile
    run with Nsight Systems profiling
  -m=MASTER, --master=MASTER
    specify MASTER for the spark command; default is local-cluster or local[*] depending on --num-local-execs
  -n, --dry-run
    generates and prints the spark submit command without executing
  -nle=N, --num-local-execs=N
    specify the number of local executors to use, default is 2. If > 1 use pseudo-distributed
    local-cluster, otherwise local[*]
  -uecp, --use-extra-classpath
    use extraClassPath instead of --jars to add RAPIDS jars to spark-submit (default)
  -uj, --use-jars
    use --jars instead of extraClassPath to add RAPIDS jars to spark-submit
  --ucx-shim=spark<3xy>
    Spark buildver to populate shim-dependent package name of RapidsShuffleManager.
    Will be replaced by a Boolean option
  -cmd=CMD, --spark-command=CMD
    specify one of spark-submit (default), spark-shell, pyspark, jupyter, jupyter-lab
  -dopts=EOPTS, --driver-opts=EOPTS
    pass EOPTS as --driver-java-options
  -eopts=EOPTS, --executor-opts=EOPTS
    pass EOPTS as spark.executor.extraJavaOptions
  --gpu-fraction=GPU_FRACTION
    GPU share per executor JVM unless local or local-cluster mode, see spark.rapids.memory.gpu.allocFraction
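For example, the following dry run only prints the generated spark-submit command, using a two-executor local-cluster, passing the RAPIDS jars via --jars, and starting spark-shell (the SPARK_HOME path is a placeholder for your environment):

SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh -n -nle=2 -uj -cmd=spark-shell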
Environment variables:
- SPARK_RAPIDS_HOME - the path either to the local spark-rapids repo or to the location used for downloading the binaries
- SPARK_HOME - the path either to the local Spark repo or to the root of a binary distro
- SPARK_CMD - one of spark-shell, spark-submit (default), pyspark, jupyter, jupyter-lab
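For instance, to pick up the plugin jars from a local spark-rapids build and a downloaded Spark distribution, and just print the resulting command (both paths are placeholders):

SPARK_RAPIDS_HOME=~/spark-rapids SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh -n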
Use Spark RAPIDS in Jupyter notebook
SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=jupyter[-lab] rapids.sh

Run in pseudo-distributed local-cluster mode
NUM_LOCAL_EXECS=2 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh

Allow attaching a Java debugger to the driver JVM
JDBSTR=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh
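Once the driver JVM is up and listening on port 5005, one way to attach is the JDK's command-line debugger; pointing an IDE remote-debug configuration at localhost:5005 works just as well:

jdb -attach 5005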
Single test suite

scala> run(new com.nvidia.spark.rapids.InsertPartition311Suite)
InsertPartition311Suite:
...

Single test case
scala> run(new com.nvidia.spark.rapids.HashAggregatesSuite, "sum(floats) group by more_floats 2 partitions")
HashAggregatesSuite:
...

In pyspark-based drivers one can use data generators from spark-rapids/integration-tests or run whole pytests.
Add rapids.py as an IPython startup file, e.g. on *NIX:

cp src/python/rapids.py ~/.ipython/profile_default/startup/

key_data_gen = StructGen([
        ('a', IntegerGen(min_val=0, max_val=4)),
        ('b', IntegerGen(min_val=5, max_val=9)),
    ], nullable=False)
val_data_gen = IntegerGen()
df = two_col_df(spark, key_data_gen, val_data_gen)
...

runpytest('test_struct_count_distinct')