Pentaho Spark application
The Pentaho Spark application is built upon PDI's Pentaho execution engine, which allows you to develop Spark applications with familiar Pentaho tools. Some third-party plugins, such as those available in the Pentaho Marketplace, may not be included by default in the Pentaho Spark application. To address this, the Spark application builder tool includes functionality that lets you customize the Pentaho Spark application by adding or removing components to fit your needs.
After running the Spark application builder tool, copy the resulting pdi-spark-driver.zip file to an edge node in your Hadoop cluster and unzip it. The unpacked contents consist of the data-integration folder and the pdi-spark-executor.zip file, which contains only the libraries that the Spark nodes need to execute a transformation when the AEL daemon is configured to run in YARN mode. Because the pdi-spark-executor.zip file must be accessible to all nodes in the cluster, copy it into HDFS. Spark distributes this ZIP file to the other nodes and automatically extracts it.
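As a rough sketch of that distribution step, the commands on the edge node might look like the following. All paths shown are placeholders, not locations required by the tool:

```
# Unpack the driver package on the edge node (destination path is illustrative).
unzip pdi-spark-driver.zip -d /opt/pentaho
# Unpacked contents: /opt/pentaho/data-integration and /opt/pentaho/pdi-spark-executor.zip

# Copy the executor package into HDFS so every node in the cluster can reach it.
# The HDFS directory below is a placeholder; note it for later configuration.
hdfs dfs -mkdir -p /user/pentaho/spark
hdfs dfs -put /opt/pentaho/pdi-spark-executor.zip /user/pentaho/spark/
```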
Perform the following steps to run the Spark application builder tool and manage the resulting files:
1. Ensure that you have configured your PDI client with all the plugins that you will use.
2. Navigate to the design-tools/data-integration folder and locate the spark-app-builder.bat (Windows) or the spark-app-builder.sh (Linux) script.
3. Execute the Spark application builder tool script. A console window displays, and the pdi-spark-driver.zip file is created in the data-integration folder (unless another location is specified with the -outputLocation parameter described below).
   The following parameters can be used when running the script to build the pdi-spark-driver.zip (see the example invocation after these steps):
   -h or --help: Displays the help.
   -e or --exclude-plugins: Specifies plugins from the data-integration/plugins folder to exclude from the assembly.
   -o or --outputLocation: Specifies the output location.
   The pdi-spark-driver.zip file contains a data-integration folder and a pdi-spark-executor.zip file.
4. Copy the data-integration folder to the edge node where you want to run the AEL daemon.
5. Copy the pdi-spark-executor.zip file to the location in HDFS where you will run Spark. This location will be referred to as HDFS_SPARK_EXECUTOR_LOCATION.
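The following is a minimal, illustrative invocation of the builder script for step 3. The plugin folder name, the output path, and the assumption that --exclude-plugins accepts plugin folder names from data-integration/plugins are examples, not values taken from this document:

```
cd design-tools/data-integration

# Display the available options.
./spark-app-builder.sh --help

# Build the package, excluding an assumed plugin folder name and writing
# pdi-spark-driver.zip to a custom output location (both placeholders).
./spark-app-builder.sh --exclude-plugins some-plugin-folder --outputLocation /tmp/pdi-spark-build
```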
Note: For the cluster nodes to use the functionality provided by PDI plugins when executing a transformation, those plugins must be installed in the PDI client before you generate the Pentaho Spark application. If you install other plugins later, you must regenerate the Pentaho Spark application.