Spark Submit
You can use the Spark Submit job entry in PDI to launch Spark jobs on any vendor version that PDI supports.
Using Spark Submit, you can submit Spark applications written in Java, Scala, or Python to run in yarn-cluster or yarn-client mode.
For more information, see the Install Pentaho Data Integration and Analytics documentation.
Before you begin
Before you install and configure Spark Submit, review:
Apache Spark installation and configuration: http://spark.apache.org/
Spark job submission guidance: https://spark.apache.org/docs/latest/submitting-applications.html
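The Spark Submit entry wraps the standard spark-submit command-line tool. As a point of reference, a typical hand-run submission in yarn-cluster mode looks roughly like the following sketch; the class name, JAR path, and arguments are placeholders:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.MyApp \
  hdfs://<NameNode hostname>:8020/apps/my-app.jar <arg1> <arg2>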
Install and configure Spark client for PDI use
You must install and configure the Spark client on every machine where you want to run Spark jobs by using PDI.
Pentaho supports Cloudera Distribution of Spark (CDS) versions 2.3.x and 2.4.x.
Pentaho does not support Spark version 2.4.2 because it does not support Scala version 2.11.
Spark version 2.x.x
To install and configure the Spark client (a consolidated command-line example follows these steps):
Download the Spark distribution (same or higher version than the cluster).
Set the HADOOP_CONF_DIR environment variable to a folder that contains the cluster configuration files. Example path for an already-configured driver:
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>
Go to <SPARK_HOME>/conf and create spark-defaults.conf by following the instructions in https://spark.apache.org/docs/latest/configuration.html.
Create a ZIP archive containing all JAR files in <SPARK_HOME>/jars.
Copy the ZIP file from the local file system to a world-readable location on the cluster.
In spark-defaults.conf, set spark.yarn.archive to the world-readable location of your ZIP file on the cluster. Example:
spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/<your ZIP file>
Add the following line to spark-defaults.conf:
spark.hadoop.yarn.timeline-service.enabled false
If you are connecting to an HDP cluster, add these lines to spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
The -Dhdp.version value must match the Hadoop version used on the cluster.
If you are connecting to an HDP cluster, create a text file named java-opts in <SPARK_HOME>/conf and add your HDP version. Example:
-Dhdp.version=2.3.0.0-2557
To determine your HDP version, run hdp-select status hadoop-client.
If you are connecting to a supported version of an HDP or CDH cluster, open core-site.xml and comment out the net.topology.script.file.name property.
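The following is a consolidated sketch of the archive and configuration steps above, run on a Linux client. The connection name, NameNode host, and file names are placeholders; adjust them for your environment.
export HADOOP_CONF_DIR=/home/<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>
cd <SPARK_HOME>/jars
zip spark-jars.zip *.jar                              # archive all Spark JAR files
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put spark-jars.zip /user/spark/lib/         # copy the archive to the cluster
hdfs dfs -chmod 644 /user/spark/lib/spark-jars.zip    # make the archive world-readable
Then, in <SPARK_HOME>/conf/spark-defaults.conf:
spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/spark-jars.zip
spark.hadoop.yarn.timeline-service.enabled false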
The Spark client is now ready for use with Spark Submit in PDI.
General
Entry Name
Specify the name of the entry. You can customize it or leave it as the default.
Spark Submit Utility
The script that launches the Spark job (the batch or shell file name of the underlying spark-submit tool), for example, spark2-submit.
Master URL
Select a master URL:
yarn-cluster: Runs the driver program as a thread of the YARN application master (similar to MapReduce).
yarn-client: Runs the driver program on the YARN client while tasks execute in the YARN cluster node managers.
Type
Select the language of the Spark job (Java, Scala, or Python). The fields on the Files tab depend on this selection.
Python support on Windows requires Spark version 2.3.x or higher.
Enable Blocking
If selected (default), Spark Submit waits until the Spark job finishes. If cleared, Spark Submit continues after the job is submitted.
We support the yarn-cluster and yarn-client modes. For details, see the Spark documentation about master URLs: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls.
If your Hadoop cluster and Spark use Kerberos, a valid Kerberos ticket must already exist in the ticket cache on the client machine before you submit the job.
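For example, on a kerberized cluster you would typically obtain and verify a ticket on the client machine before running the job; the principal below is a placeholder:
kinit <your principal>@<YOUR.REALM>
klist    # confirm the ticket is in the cache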
Options
The Spark Submit entry includes the following tabs: Files, Arguments, and Options.
Files tab
The fields on this tab depend on whether you set Type to Java or Scala, or to Python.
Java or Scala

Class
Optional. The entry point for your application.
Application Jar
The main file of the Spark job you are submitting. The path must be accessible from within the cluster (for example, an hdfs:// path or a file:// path available on all nodes).
Dependencies
The environment and path for other packages, bundles, or libraries used by your Spark job. Environment indicates whether dependencies are Local (on your machine) or Static (on the cluster or web).
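These fields correspond roughly to the following spark-submit options; the class, JAR, and dependency names below are illustrative only:
spark-submit --master yarn --deploy-mode cluster \
  --class org.example.MainClass \
  --jars hdfs://<NameNode hostname>:8020/libs/extra-lib.jar \
  hdfs://<NameNode hostname>:8020/apps/my-spark-app.jar
Here --class corresponds to the Class field, --jars to the Dependencies list, and the final path to the Application Jar field.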
Python

Py File
The main Python file of the Spark job you are submitting.
Dependencies
The environment and path for other packages, bundles, or libraries used by your Spark job. Environment indicates whether dependencies are Local (on your machine) or Static (on the cluster or web).
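For a Python job, the equivalent spark-submit invocation uses --py-files for the dependencies; the file names below are illustrative only:
spark-submit --master yarn --deploy-mode cluster \
  --py-files hdfs://<NameNode hostname>:8020/libs/helpers.zip \
  hdfs://<NameNode hostname>:8020/apps/main_job.py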
Arguments tab

Arguments
The arguments passed to your main Java/Scala class or Python file.
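On the underlying spark-submit command line, these arguments follow the application JAR or Python file, for example (placeholder values):
spark-submit --master yarn --deploy-mode cluster --class org.example.MainClass my-spark-app.jar /data/input /data/output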
Options tab

Executor Memory
The amount of memory to use per executor process. Use JVM format (for example, 512m or 2g).
Driver Memory
The amount of memory to use for the driver. Use JVM format (for example, 512m or 2g).
Utility Parameters
Optional Spark configuration parameters associated with spark-defaults.conf (name/value pairs).
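These options map to spark-submit flags roughly as shown in the following sketch; the memory sizes and extra parameter are examples only:
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 2g \
  --driver-memory 1g \
  --conf spark.executor.instances=4 \
  --class org.example.MainClass my-spark-app.jar
--executor-memory and --driver-memory correspond to the Executor Memory and Driver Memory fields, and each Utility Parameter name/value pair would typically be passed as a --conf name=value option.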
Troubleshooting
If Spark Submit fails in PDI:
Validate the application by running the spark-submit command-line tool on the same machine that runs PDI (see the example after this list).
Use the YARN ResourceManager web UI to review submitted jobs, resource usage, duration, and logs.
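For example, assuming a hypothetical application JAR, you can first confirm that the job runs outside of PDI and then pull the aggregated YARN logs for the failed application; the application ID is a placeholder:
spark-submit --master yarn --deploy-mode cluster --class org.example.MainClass my-spark-app.jar
yarn logs -applicationId application_1546300800000_0001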
Running a Spark job from a Windows machine
If you see errors such as:
ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver! (in the JobTracker log)
ExitCodeException exitCode=10 (in the Spoon log)
Create a Windows Firewall inbound rule that allows connections from the cluster (one possible command is sketched below).
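For example, from an elevated command prompt you might add a rule like the following; the rule name and cluster subnet are placeholders, and you can also create the rule through the Windows Firewall UI:
netsh advfirewall firewall add rule name="Allow Spark driver callbacks" dir=in action=allow remoteip=<cluster subnet>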