
# Using Spark Submit

Use the Spark Submit job entry with an external Spark script to run Spark jobs on YARN clusters.

This example shows how to submit a Spark job from PDI.

If you use Spark Submit with Kerberos-secured Cloudera CDP, see *Use Kerberos with Spark Submit* in the *Administer Pentaho Data Integration and Analytics* documentation.

### Before you begin

{% hint style="info" %}
Install and configure the Spark client. Follow the Spark Submit job entry instructions in the *Pentaho Data Integration* documentation.
{% endhint %}
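
Before opening the sample, it can help to confirm that the Spark client is reachable from the machine that runs PDI. The following is a minimal sketch, not part of the Pentaho tooling: it assumes the launcher is named `spark-submit` and is on the `PATH`, and that the Hadoop client configuration is exposed through `HADOOP_CONF_DIR` or `YARN_CONF_DIR`, as Spark expects when running on YARN. The function names are hypothetical.

```python
import os
import shutil
from typing import Optional


def find_spark_submit(name: str = "spark-submit") -> Optional[str]:
    """Return the absolute path of the launcher if it is on PATH, else None."""
    return shutil.which(name)


def hadoop_conf_dir() -> Optional[str]:
    """Return the client-side Hadoop configuration directory, if one is set."""
    return os.environ.get("HADOOP_CONF_DIR") or os.environ.get("YARN_CONF_DIR")


if __name__ == "__main__":
    # Report what was found; nothing is launched or modified.
    print("spark-submit:", find_spark_submit() or "not found on PATH")
    print("Hadoop config dir:", hadoop_conf_dir() or "not set")
```

If the launcher is found, the printed path is a candidate value for the **Spark Submit Utility** field in the job entry.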

### Example: run the sample Spark Submit job

{% stepper %}
{% step %}

### Prepare an input file in HDFS

Copy a text file to HDFS. Use either of these tools:

* Hadoop Copy Files job entry
* Hadoop command-line tools
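
For the command-line route, the copy might look like the sketch below. It assumes the `hdfs` client is installed and configured for your cluster. The local and HDFS paths in the usage note are placeholders, and the helper names are hypothetical.

```python
import subprocess
from typing import List


def build_put_command(local_path: str, hdfs_path: str) -> List[str]:
    """Build an `hdfs dfs -put` command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def copy_to_hdfs(local_path: str, hdfs_path: str) -> None:
    """Run the copy; raises CalledProcessError if the command exits non-zero."""
    subprocess.run(build_put_command(local_path, hdfs_path), check=True)
```

For example, `copy_to_hdfs("/tmp/words.txt", "/user/pdi/input/words.txt")` would upload a local file to a hypothetical HDFS location. Note the HDFS path, because the job entry needs it in **Arguments**.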
  {% endstep %}

{% step %}

### Open and save the sample job

1. Start the PDI client.
2. Open `Spark Submit.kjb`.

   Location: `design-tools/data-integration/samples/jobs/Spark Submit`
3. Select **File** > **Save As**.
4. Save the job as `Spark Submit Sample.kjb`.

The file is saved to the `jobs` folder.

![Spark Submit Sample Job](/files/YPx6RXk8xgoEn6rsBGGN)
{% endstep %}

{% step %}

### Configure the Spark Submit job entry

1. Open the **Spark PI** job entry.

   Spark PI is the Spark Submit job entry in the sample.
2. In **Spark Submit Utility**, enter the path to `spark-submit`.

   Use the Spark client install location.
3. In **Application Jar**, enter the path to your Spark examples JAR.

   Use either the local JAR or the cluster JAR in HDFS.
4. In **Class Name**, enter `org.apache.spark.examples.JavaWordCount`.
5. Set **Master URL** to `yarn-client`.

   For other execution modes, see [Submitting Applications](https://spark.apache.org/docs/2.3.2/submitting-applications.html) in the Spark docs.
6. In **Arguments**, enter the path to the input file in HDFS.
7. Select **OK**.
8. Save the job.
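
To see how these fields relate to a `spark-submit` invocation, the sketch below assembles a roughly equivalent command line. This is an illustration only: the exact command that PDI builds is not documented here, the function name is hypothetical, and the paths in the usage note are placeholders. It assumes the standard `--class`, `--master`, and `--deploy-mode` options of `spark-submit`, and it expresses `yarn-client` as `--master yarn --deploy-mode client`, the form that newer Spark releases use.

```python
from typing import List, Sequence


def build_spark_submit_command(
    spark_submit: str,       # Spark Submit Utility field
    app_jar: str,            # Application Jar field
    class_name: str,         # Class Name field
    master_url: str,         # Master URL field
    args: Sequence[str],     # Arguments field
) -> List[str]:
    """Assemble a spark-submit command line from the job entry's field values."""
    if master_url in ("yarn-client", "yarn-cluster"):
        # Split the combined value into a master and a deploy mode.
        deploy_mode = master_url.split("-", 1)[1]
        master_opts = ["--master", "yarn", "--deploy-mode", deploy_mode]
    else:
        master_opts = ["--master", master_url]
    # Options come first, then the application JAR, then its arguments.
    return [spark_submit, "--class", class_name, *master_opts, app_jar, *args]
```

Calling it with `"/opt/spark/bin/spark-submit"`, a placeholder JAR path, `"org.apache.spark.examples.JavaWordCount"`, `"yarn-client"`, and a one-element list holding the HDFS input path returns the argument list you could run by hand to reproduce the job outside PDI.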
   {% endstep %}

{% step %}

### Run the job

Run the saved job from the PDI client.

While the job runs, the word-count output appears in the Execution pane.
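
To judge whether the output looks right, you can compute the same counts locally on a small sample. The sketch below is plain Python, not Spark, and only approximates what a word-count job does: it splits each line on single spaces and tallies the tokens. The `word: count` formatting and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Count how often each space-separated token occurs across all lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return dict(counts)


def format_counts(counts: Dict[str, int]) -> List[str]:
    """Render each entry as a `word: count` line."""
    return [f"{word}: {n}" for word, n in counts.items()]
```

Running `word_count` on the first few lines of your input file gives a reference to compare against the job's output. The order of the results on a cluster may differ from the local order.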
{% endstep %}
{% endstepper %}

