Spark Submit

You can use the Spark Submit job entry in PDI to launch Spark jobs on any vendor version that PDI supports.

Using Spark Submit, you can submit Spark applications written in Java, Scala, or Python to run in yarn-cluster or yarn-client mode.

For more information, see the Install Pentaho Data Integration and Analytics documentation.

Before you begin

Before you use the Spark Submit entry, review the following:

Install and configure Spark client for PDI use

You must install and configure the Spark client on every machine where you want to run Spark jobs by using PDI.

Pentaho supports Cloudera Distribution of Spark (CDS) versions 2.3.x and 2.4.x.

These instructions apply to Spark version 2.x.x.

To install and configure the Spark client:

  1. Download the Spark distribution (same or higher version than the cluster).

  2. Set the HADOOP_CONF_DIR environment variable to a folder that contains cluster configuration files.

    Example path for an already-configured driver:

    • <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>

  3. Go to <SPARK_HOME>/conf and create spark-defaults.conf by following the instructions in https://spark.apache.org/docs/latest/configuration.html. A combined example of the finished file appears after these steps.

  4. Create a ZIP archive containing all JAR files in <SPARK_HOME>/jars.

  5. Copy the ZIP file from the local file system to a world-readable location on the cluster.

  6. In spark-defaults.conf, set spark.yarn.archive to the world-readable location of your ZIP file on the cluster.

    Examples:

    • spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/<your ZIP file>

  7. Add the following line to spark-defaults.conf:

    • spark.hadoop.yarn.timeline-service.enabled false

  8. If you are connecting to an HDP cluster, add these lines to spark-defaults.conf:

    • spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557

    • spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557

    The -Dhdp.version value must match the Hadoop version used on the cluster.

  9. If you are connecting to an HDP cluster, create a text file named java-opts in <SPARK_HOME>/conf and add your HDP version.

    Example:

    • -Dhdp.version=2.3.0.0-2557

    To determine your HDP version, run hdp-select status hadoop-client.

  10. If you are connecting to a supported version of an HDP or CDH cluster, open core-site.xml and comment out the net.topology.script.file.name property.
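
For reference, after you complete these steps, the additions to spark-defaults.conf might look like the following example. The NameNode hostname, ZIP file name, and HDP version are placeholders for values from your environment, and the last two lines apply only to HDP clusters:

  • spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/<your ZIP file>

  • spark.hadoop.yarn.timeline-service.enabled false

  • spark.driver.extraJavaOptions -Dhdp.version=<your HDP version>

  • spark.yarn.am.extraJavaOptions -Dhdp.version=<your HDP version>

Likewise, on a Linux client you can typically set the HADOOP_CONF_DIR environment variable from step 2 with a shell command such as:

  • export HADOOP_CONF_DIR=<path to your cluster configuration folder>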

The Spark client is now ready for use with Spark Submit in PDI.

General

Field
Description

Entry Name

Specify the name of the entry. You can customize it or leave it as the default.

Spark Submit Utility

The script that launches the Spark job (the batch/shell file name of the underlying spark-submit tool). For example, spark2-submit.

Master URL

Select a master URL:

  • yarn-cluster: Runs the driver program as a thread of the YARN application master (similar to MapReduce).

  • yarn-client: Runs the driver program on the YARN client while tasks execute in the YARN cluster node managers.

Type

Select the language of the Spark job (Java, Scala, or Python). The fields on the Files tab depend on this selection.

Python support on Windows requires Spark version 2.3.x or higher.

Enable Blocking

If selected (default), Spark Submit waits until the Spark job finishes. If cleared, Spark Submit continues after the job is submitted.

We support the yarn-cluster and yarn-client modes. For details, see the Spark documentation about master URLs: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls.

If your Hadoop cluster and Spark use Kerberos, a valid Kerberos ticket must already exist in the ticket cache on the client machine before you submit the job.
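
For example, on a client machine that uses MIT Kerberos, you can typically check the ticket cache with klist and obtain a ticket with kinit. The principal shown here is a placeholder:

  • kinit <your principal>@<YOUR.REALM>

  • klist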

Options

The Spark Submit entry includes the following tabs: Files, Arguments, and Options.

Files tab

The fields on this tab depend on whether you set Type to Java or Scala, or to Python.

Java or Scala

Option
Description

Class

Optional. The entry point for your application.

Application Jar

The main file of the Spark job you are submitting. The path must be accessible from within the cluster (for example, an hdfs:// path or a file:// path available on all nodes).

Dependencies

The environment and path for other packages, bundles, or libraries used by your Spark job. Environment indicates whether dependencies are Local (on your machine) or Static (on the cluster or web).
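
For illustration only, a quick way to exercise a Java or Scala submission is to point the entry at the SparkPi example class that ships with Spark. The JAR location below is a placeholder for wherever you staged the examples JAR on the cluster:

  • Class: org.apache.spark.examples.SparkPi

  • Application Jar: hdfs://<NameNode hostname>:8020/user/spark/jars/<spark-examples JAR>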

Python

Option
Description

Py File

The main Python file of the Spark job you are submitting.

Dependencies

The environment and path for other packages, bundles, or libraries used by your Spark job. Environment indicates whether dependencies are Local (on your machine) or Static (on the cluster or web).
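
For reference, a minimal PySpark script of the kind you might specify in Py File could look like the following sketch. The application name, default input path, and line-count logic are placeholder assumptions, not part of PDI:

  import sys
  from pyspark.sql import SparkSession

  if __name__ == "__main__":
      # spark-submit supplies the master and deploy mode, so only an application name is set here.
      spark = SparkSession.builder.appName("pdi-spark-submit-example").getOrCreate()

      # Values entered on the Arguments tab arrive as ordinary command-line arguments.
      input_path = sys.argv[1] if len(sys.argv) > 1 else "hdfs:///tmp/input.txt"

      # Placeholder logic: count the lines of a text file on the cluster.
      print("Line count:", spark.read.text(input_path).count())

      spark.stop()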

Arguments tab

Option
Description

Arguments

The arguments passed to your main Java/Scala class or Python file.

Options tab

Option
Description

Executor Memory

The amount of memory to use per executor process. Use JVM format (for example, 512m or 2g).

Driver Memory

The amount of memory to use for the driver. Use JVM format (for example, 512m or 2g).

Utility Parameters

Optional Spark configuration parameters associated with spark-defaults.conf (name/value pairs).
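
For example, you might add name/value pairs such as the following. The property names are standard Spark settings and the values are illustrative only:

  • spark.executor.cores 2

  • spark.executor.instances 4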

Troubleshooting

If Spark Submit fails in PDI:

  • Validate the application by running the spark-submit command-line tool on the same machine that runs PDI (see the example after this list).

  • Use the YARN ResourceManager web UI to review submitted jobs, resource usage, duration, and logs.
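
For example, a manual submission roughly equivalent to a Java or Scala job entry in yarn-cluster mode might look like the following. The class name, JAR path, and arguments are placeholders:

  • spark-submit --class <your main class> --master yarn --deploy-mode cluster --executor-memory 2g --driver-memory 1g hdfs://<NameNode hostname>:8020/user/spark/<your application JAR> <your arguments>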

Running a Spark job from a Windows machine

If you see errors such as:

  • ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver! (JobTracker log)

  • ExitCodeException exitCode=10 (Spoon log)

To resolve these errors, create a Windows Firewall inbound rule that allows connections from the cluster.
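
As an illustration only, such a rule can be created with the built-in netsh utility. The rule name and cluster address range below are placeholders, and you may want to scope the rule more narrowly (for example, by pinning spark.driver.port and opening only that port):

  • netsh advfirewall firewall add rule name="Spark driver from cluster" dir=in action=allow protocol=TCP remoteip=<cluster subnet or IP range>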
