# Spark Submit

You can use the **Spark Submit** job entry in PDI to launch Spark jobs on any vendor version that PDI supports.

Using Spark Submit, you can submit Spark applications written in Java, Scala, or Python to run in `yarn-cluster` or `yarn-client` mode.

For more information, see the **Install Pentaho Data Integration and Analytics** documentation.

### Before you begin

Before you install and configure Spark Submit, review:

* Spark releases: <https://spark.apache.org/releases/>
* Spark configuration: <https://spark.apache.org/docs/latest/configuration.html>
* Running Spark on YARN: <https://spark.apache.org/docs/latest/running-on-yarn.html>
* Spark installation and configuration: <http://spark.apache.org/>
* Submitting Spark applications: <https://spark.apache.org/docs/latest/submitting-applications.html>

### Install and configure Spark client for PDI use

You must install and configure the Spark client on every machine from which you want to run Spark jobs using PDI.

Pentaho supports Cloudera Distribution of Spark (CDS) versions 2.3.x and 2.4.x.

{% hint style="warning" %}
Pentaho does not support Spark version **2.4.2** because that release does not support Scala version 2.11.
{% endhint %}

#### Spark version 2.x.x

To install and configure the Spark client, complete the following steps. A consolidated shell sketch of the configuration appears after the steps.

1. Download a Spark distribution whose version is the same as, or higher than, the Spark version running on the cluster.
2. Set the `HADOOP_CONF_DIR` environment variable to a folder that contains cluster configuration files.

   Example path for an already-configured driver:

   * `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>`
3. Go to `<SPARK_HOME>/conf` and create `spark-defaults.conf` by following the instructions in <https://spark.apache.org/docs/latest/configuration.html>.
4. Create a ZIP archive containing all JAR files in `<SPARK_HOME>/jars`.
5. Copy the ZIP file from the local file system to a world-readable location on the cluster.
6. In `spark-defaults.conf`, set `spark.yarn.archive` to the world-readable location of your ZIP file on the cluster.

   Example:

   * `spark.yarn.archive hdfs://<NameNode hostname>:8020/user/spark/lib/<your ZIP file>`
7. Add the following line to `spark-defaults.conf`:
   * `spark.hadoop.yarn.timeline-service.enabled false`
8. If you are connecting to an HDP cluster, add these lines to `spark-defaults.conf`:

   * `spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557`
   * `spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557`

   The `-Dhdp.version` value must match the Hadoop version used on the cluster.
9. If you are connecting to an HDP cluster, create a text file named `java-opts` in `<SPARK_HOME>/conf` and add your HDP version.

   Example:

   * `-Dhdp.version=2.3.0.0-2557`

   To determine your HDP version, run `hdp-select status hadoop-client`.
10. If you are connecting to a supported version of an HDP or CDH cluster, open `core-site.xml` and comment out the `net.topology.script.file.name` property.

    ```xml
    <!--
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology_script.py</value>
    </property>
    -->
    ```

The Spark client is now ready for use with Spark Submit in PDI.
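
If you prefer to script the client setup, the following shell sketch consolidates steps 2 through 9 for a Linux machine. The paths, connection name, NameNode hostname, and HDP version string are placeholder values; replace them with values from your own environment.

```bash
# A minimal sketch of steps 2-9. SPARK_HOME, the connection name, the NameNode host,
# and the HDP version string are placeholders; substitute your own values.

# Step 2: point the Spark client at the cluster configuration files.
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR="$HOME/.pentaho/metastore/pentaho/NamedCluster/Configs/my-cluster"

# Steps 4-5: archive the Spark JARs and copy the archive to a world-readable cluster location.
cd "$SPARK_HOME/jars"
zip -q spark-libs.zip ./*.jar
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put -f spark-libs.zip /user/spark/lib/
hdfs dfs -chmod 755 /user/spark/lib/spark-libs.zip

# Steps 3, 6-8: add the required properties to spark-defaults.conf.
# (The two -Dhdp.version lines apply to HDP clusters only.)
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.yarn.archive hdfs://namenode.example.com:8020/user/spark/lib/spark-libs.zip
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
EOF

# Step 9 (HDP clusters only): record the HDP version in java-opts.
echo '-Dhdp.version=2.3.0.0-2557' > "$SPARK_HOME/conf/java-opts"
```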

### General

| Field                    | Description                                                                                                                                                                                                                                                                                                    |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Entry Name**           | Specify the name of the entry. You can customize it or leave it as the default.                                                                                                                                                                                                                                |
| **Spark Submit Utility** | The script that launches the Spark job (the batch or shell file name of the underlying `spark-submit` tool). For example, `spark2-submit`.                                                                                                                                                                        |
| **Master URL**           | <p>Select a master URL:</p><ul><li><strong>yarn-cluster</strong>: Runs the driver program as a thread of the YARN application master (similar to MapReduce).</li><li><strong>yarn-client</strong>: Runs the driver program on the YARN client while tasks execute in the YARN cluster node managers.</li></ul> |
| **Type**                 | <p>Select the language of the Spark job (Java, Scala, or Python). The fields on the <strong>Files</strong> tab depend on this selection.</p><p>Python support on Windows requires Spark version 2.3.x or higher.</p>                                                                                           |
| **Enable Blocking**      | If selected (default), Spark Submit waits until the Spark job finishes. If cleared, Spark Submit continues after the job is submitted.                                                                                                                                                                         |

PDI supports only the `yarn-cluster` and `yarn-client` modes. For details, see the Spark documentation about master URLs: <https://spark.apache.org/docs/latest/submitting-applications.html#master-urls>.

{% hint style="info" %}
If your Hadoop cluster and Spark use Kerberos, a valid Kerberos ticket must already exist in the ticket cache on the client machine before you submit the job.
{% endhint %}
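
A quick way to confirm this from a shell on the client machine is shown below; the principal `user@EXAMPLE.COM` is a placeholder.

```bash
# List the tickets currently in the cache; a valid (unexpired) krbtgt entry must be present.
klist

# If there is no valid ticket, obtain one for the submitting user before running the job.
kinit user@EXAMPLE.COM
```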

### Options

The Spark Submit entry includes the following tabs: **Files**, **Arguments**, and **Options**.

#### Files tab

The fields on this tab depend on the **Type** setting: **Java or Scala**, or **Python**.

**Java or Scala**

![Files tab, Java or Scala, Spark Submit](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-6b361db91004300e35b327c36815b244c56e9606%2FssPDISpark_Submit-FileTab-Java_and_Scala.png?alt=media)

| Option              | Description                                                                                                                                                                                                      |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Class**           | Optional. The entry point for your application, typically the fully qualified name of your application's main class.                                                                                            |
| **Application Jar** | The main file of the Spark job you are submitting. The path must be accessible from within the cluster (for example, an `hdfs://` path or a `file://` path available on all nodes).                              |
| **Dependencies**    | The environment and path for other packages, bundles, or libraries used by your Spark job. **Environment** indicates whether dependencies are **Local** (on your machine) or **Static** (on the cluster or web). |

**Python**

![Files tab, Python, Spark Submit](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-b6d2baa6900a348ae11df284fb5c64e0b6cd0876%2FssPDISpark_Submit-FileTab-Python.png?alt=media)

| Option           | Description                                                                                                                                                                                                      |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Py File**      | The main Python file of the Spark job you are submitting.                                                                                                                                                        |
| **Dependencies** | The environment and path for other packages, bundles, or libraries used by your Spark job. **Environment** indicates whether dependencies are **Local** (on your machine) or **Static** (on the cluster or web). |

#### Arguments tab

![Arguments tab, Spark Submit](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-c65c1063065c1d7a7d01212222baa24b1939f08c%2FssPDISpark_Submit-ArgumentsTab.png?alt=media)

| Option        | Description                                                        |
| ------------- | ------------------------------------------------------------------ |
| **Arguments** | The arguments passed to your main Java/Scala class or Python file. |

#### Options tab

![Options tab, Spark Submit](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-9937371e110f05e4eb3e4a43a97c8efde71e21fe%2FssPDISpark_Submit-OptionsTab.png?alt=media)

| Option                 | Description                                                                                       |
| ---------------------- | ------------------------------------------------------------------------------------------------- |
| **Executor Memory**    | The amount of memory to use per executor process. Use JVM format (for example, `512m` or `2g`).   |
| **Driver Memory**      | The amount of memory to use for the driver. Use JVM format (for example, `512m` or `2g`).         |
| **Utility Parameters** | Optional Spark configuration parameters associated with `spark-defaults.conf` (name/value pairs). |
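
For orientation, the entry fields roughly map to standard `spark-submit` options. The sketch below is illustrative only, not the exact command PDI generates; the class name, JAR path, and values are placeholders.

```bash
# Illustrative mapping of Spark Submit entry fields to spark-submit options (placeholder values):
#   Master URL                  -> --master yarn-cluster
#                                  (newer Spark releases spell this --master yarn --deploy-mode cluster)
#   Class                       -> --class
#   Executor Memory / Driver Memory -> --executor-memory / --driver-memory
#   Utility Parameters          -> --conf name=value pairs
#   Application Jar / Arguments -> trailing JAR path and application arguments
spark-submit \
  --master yarn-cluster \
  --class com.example.MySparkApp \
  --executor-memory 2g \
  --driver-memory 512m \
  --conf spark.executor.cores=2 \
  hdfs://namenode.example.com:8020/user/jobs/my-spark-app.jar arg1 arg2
```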

### Troubleshooting

If Spark Submit fails in PDI:

* Validate the application by running the `spark-submit` command-line tool on the same machine that runs PDI, as shown in the sketch after this list.
* Use the YARN ResourceManager web UI to review submitted jobs, resource usage, duration, and logs.
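
For example, a minimal validation run using the SparkPi example that ships with the Spark distribution (the example JAR file name varies by Spark build) looks like this:

```bash
# Submit the bundled SparkPi example in yarn-client mode to verify that the Spark client,
# the cluster configuration, and any Kerberos ticket are all working.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-client \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 10
```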

#### Running a Spark job from a Windows machine

If you see errors such as:

* `ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!` (JobTracker log)
* `ExitCodeException exitCode=10` (Spoon log)

Create an inbound rule in Windows Firewall to allow connections from the cluster nodes.
