Using Spark Submit

Use the Spark Submit job entry with an external Spark script to run Spark jobs on YARN clusters.

This example shows how to submit a Spark job from PDI.

If you use Spark Submit with Kerberos-secured Cloudera CDP, see Use Kerberos with Spark Submit in the Administer Pentaho Data Integration and Analytics documentation.

Before you begin

Note: Install and configure the Spark client. Follow the Spark Submit job entry instructions in the Pentaho Data Integration documentation.

Example: run the sample Spark Submit job

Step 1: Prepare an input file in HDFS

Copy a text file to HDFS using either of these tools:

  • Hadoop Copy Files job entry

  • Hadoop command-line tools
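For example, you could stage a small text file locally and then copy it into HDFS from the command line. The file name and the HDFS directory below are assumptions; substitute your own values:

```shell
# Create a small sample input file locally (hypothetical name).
cat > wordcount-input.txt <<'EOF'
the quick brown fox
jumps over the lazy dog
EOF

# Copy it into HDFS. Run these on a host with a configured Hadoop
# client; the /user/pdi/input directory is an assumption.
# hdfs dfs -mkdir -p /user/pdi/input
# hdfs dfs -put wordcount-input.txt /user/pdi/input/
```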

Step 2: Open and save the sample job

  1. Start the PDI client.

  2. Open Spark Submit.kjb.

    Location: design-tools/data-integration/samples/jobs/Spark Submit

  3. Select File > Save As.

  4. Save the job as Spark Submit Sample.kjb.

The file is saved to the jobs folder.

(Screenshot: Spark Submit Sample Job)

Step 3: Configure the Spark Submit job entry

  1. Open the Spark PI job entry.

    Spark PI is the Spark Submit job entry in the sample.

  2. In Spark Submit Utility, enter the path to spark-submit.

    Use the Spark client install location.

  3. In Application Jar, enter the path to your Spark examples JAR.

    Use either the local JAR or the cluster JAR in HDFS.

  4. In Class Name, enter org.apache.spark.examples.JavaWordCount.

  5. Set Master URL to yarn-client.

For other execution modes, see Submitting Applications in the Spark documentation.

  6. In Arguments, enter the path to the input file in HDFS.

  7. Select OK.

  8. Save the job.
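Behind the scenes, the entry assembles an ordinary spark-submit command. A sketch of the equivalent invocation is below; the Spark client path, examples JAR path, and HDFS input path are all assumptions, and the snippet only prints the command so you can inspect it before running it on a host with YARN access:

```shell
# All three paths are assumptions; substitute your own values.
SPARK_SUBMIT=/opt/spark/bin/spark-submit
APP_JAR=/opt/spark/examples/jars/spark-examples.jar
INPUT=hdfs:///user/pdi/input/wordcount-input.txt

# Assemble and print the command; run it on a host that can reach YARN.
CMD="$SPARK_SUBMIT --class org.apache.spark.examples.JavaWordCount --master yarn-client $APP_JAR $INPUT"
echo "$CMD"
```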

Step 4: Run the job

Run the job in the PDI client.

As the job runs, watch the word-count output in the Execution pane.
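You can also run the saved job headlessly with Kitchen, PDI's command-line job runner. The snippet below only prints the invocation; the Kitchen path is an assumption and the job path assumes you saved the sample in its default location:

```shell
# Kitchen lives in the PDI install directory (path is an assumption).
KITCHEN=design-tools/data-integration/kitchen.sh
JOB='design-tools/data-integration/samples/jobs/Spark Submit/Spark Submit Sample.kjb'

# Print the command; execute it from a shell on a machine with PDI installed.
echo "$KITCHEN -file=\"$JOB\" -level=Basic"
```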
