Pentaho Spark application
The Pentaho Spark application is built upon the PDI execution engine, which allows you to develop Spark applications with familiar Pentaho tools. Some third-party plugins, such as those available in the Pentaho Marketplace, may not be included by default in the Pentaho Spark application. To address this, the Spark application builder tool lets you customize the Pentaho Spark application by adding or removing components to fit your needs.
After running the Spark application builder tool, copy the resulting pdi-spark-driver.zip file to an edge node in your Hadoop cluster and unzip it. The unpacked contents consist of the data-integration folder and the pdi-spark-executor.zip file, which contains only the libraries the Spark nodes need to execute a transformation when the AEL daemon is configured to run in YARN mode. Because the pdi-spark-executor.zip file must be accessible to all nodes in the cluster, copy it into HDFS. Spark distributes this ZIP file to the other nodes and automatically extracts it.
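For example, deploying the two artifacts might look like the following on the edge node; the local and HDFS paths are assumptions for illustration, not locations required by PDI:

    # Unpack the driver package on the edge node (example path)
    unzip pdi-spark-driver.zip -d /opt/pentaho

    # Copy the executor package into HDFS so every Spark node can reach it (example path)
    hdfs dfs -mkdir -p /apps/pentaho
    hdfs dfs -put /opt/pentaho/pdi-spark-executor.zip /apps/pentaho/pdi-spark-executor.zip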
Perform the following steps to run the Spark application builder tool and manage the resulting files:

1. Ensure that you have configured your PDI client with all the plugins that you will use.

2. Navigate to the design-tools/data-integration folder and locate the spark-app-builder.bat (Windows) or spark-app-builder.sh (Linux) script.

3. Execute the Spark application builder tool script. A console window displays, and the pdi-spark-driver.zip file is created in the data-integration folder (unless you specify a different location with the --outputLocation parameter described below). An example invocation follows these steps.

   The following parameters can be used when running the script to build the pdi-spark-driver.zip:

   Parameter                  Action
   -h or --help               Displays the help.
   -e or --exclude-plugins    Specifies plugins from the data-integration/plugins folder to exclude from the assembly.
   -o or --outputLocation     Specifies the output location.

   The pdi-spark-driver.zip file contains a data-integration folder and the pdi-spark-executor.zip file.

4. Copy the data-integration folder to the edge node where you want to run the AEL daemon.

5. Copy the pdi-spark-executor.zip file to the HDFS node where you will run Spark. This HDFS location will be referred to as HDFS_SPARK_EXECUTOR_LOCATION.
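As a sketch of step 3, the builder might be invoked as follows; the optional flags correspond to the parameters in the table above, while the plugin name, the argument format for --exclude-plugins, and the output directory are illustrative assumptions:

    # Run the builder from the data-integration folder (Linux)
    cd design-tools/data-integration
    ./spark-app-builder.sh --exclude-plugins my-custom-plugin --outputLocation /tmp/pdi-spark

    # The resulting /tmp/pdi-spark/pdi-spark-driver.zip contains the
    # data-integration folder and the pdi-spark-executor.zip file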
Note: For the cluster nodes to use the functionality provided by PDI plugins when executing a transformation, the plugins must be installed into the PDI client before you generate the Pentaho Spark application. If you install other plugins later, you must regenerate the Pentaho Spark application.