Spark Tuning

You can use these PDI step tuning options to customize the transformations that run on the Spark engine. Spark tuning is the customization of PDI transformation and step parameters to improve the performance of your PDI transformations running on Spark. These parameters affect the memory, cores, and instances used by the Spark engine, and include:

  • Application tuning parameters, which are transformation-level parameters for working with PDI transformations on Spark.

  • Spark tuning options, which are parameters set on a specific PDI step.

Use the Spark tuning options to customize Spark parameters within a PDI step to further refine how your transformation runs. For example, if your transformation (KTR) contains many complex computations, you can adjust the Spark tuning options for a PDI step to improve performance and reduce run times when executing your transformation.

Note: Spark tuning options for a step override the application tuning parameters for the transformation. You can set application tuning parameters in AEL in the data-integration/adaptive-execution/config/application.properties file or in PDI in the Transformation Properties window. For more information, see Configuring application tuning parameters for Spark.
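For orientation, the application tuning parameters govern the standard Spark resource properties for memory, cores, and instances. The following is a minimal Scala sketch showing those underlying Spark properties with purely illustrative values; it does not show the exact parameter names used in application.properties or the Transformation Properties window.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Illustrative values only; in PDI these settings are supplied through
    // application.properties or the Transformation Properties window.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")      // memory per executor
      .set("spark.executor.cores", "2")        // cores per executor
      .set("spark.executor.instances", "10")   // number of executors
      .set("spark.driver.memory", "2g")        // driver memory

    val spark = SparkSession.builder()
      .config(conf)
      .appName("ael-tuning-sketch")
      .getOrCreate()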

This article provides a reference for the step-level Spark tuning options available in PDI. The Spark tuning categories, options, and applicable steps are listed below. For more information about the Spark tuning workflow, see About Spark tuning in PDI.

Spark tuning options for PDI steps include the following categories:

  • Dataset

    Set these options to control data persistence, repartitioning, and coalescing. Tuning them can save downstream computation, for example by reducing recalculation after wide Spark transformations. Options include partitioning, persisting, and caching data. A sketch of the underlying Spark Dataset operations follows this list.

  • Join

    Set this broadcast join option to push datasets out to the executors, which can reduce shuffling during join operations. A sketch of a Spark broadcast join follows this list.

  • JDBC

    Set these options to specify the number of JDBC connections and the partitioning attributes of a JDBC read. A sketch of the corresponding Spark JDBC options follows this list.

  • Dataframe Writer

    Set these options to manage partitioning of written output, including bucketing file writes. A sketch of the corresponding Spark writer options follows this list.
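Dataset: the Dataset options tune the same mechanisms exposed by Spark's Dataset API. The following minimal spark-shell style sketch shows repartition, persist, and coalesce with illustrative paths and values; it is not the PDI implementation, only the underlying Spark behavior these options adjust.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("dataset-tuning-sketch").getOrCreate()

    // Placeholder input path.
    val df = spark.read.parquet("/path/to/input")

    // Repartition before a wide transformation so work spreads across more tasks.
    val repartitioned = df.repartition(200)

    // Persist the result so downstream steps reuse it instead of recomputing it.
    repartitioned.persist(StorageLevel.MEMORY_AND_DISK)

    // Coalesce to fewer partitions before writing a small result set.
    repartitioned.coalesce(10).write.parquet("/path/to/output")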
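Join: the broadcast join option corresponds to Spark's broadcast hint. A minimal sketch, assuming illustrative input paths and a hypothetical customer_id join key.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    // Placeholder inputs; "customer_id" is an illustrative join key.
    val largeDf = spark.read.parquet("/path/to/transactions")
    val smallDf = spark.read.parquet("/path/to/customers")

    // Broadcasting the smaller dataset ships a copy to every executor,
    // so the large dataset can be joined locally without a shuffle.
    val joined = largeDf.join(broadcast(smallDf), Seq("customer_id"))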
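JDBC: the JDBC options correspond to the partitioned-read options of Spark's JDBC data source, where numPartitions bounds the number of parallel connections and partitionColumn, lowerBound, and upperBound control how the read is split. A minimal sketch with placeholder connection details, table, and column names.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-partition-sketch").getOrCreate()

    // URL, credentials, table, and column names are illustrative placeholders.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "report_user")
      .option("password", "secret")
      .option("partitionColumn", "order_id")   // numeric column used to split the read
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")            // up to 8 parallel JDBC connections
      .load()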
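Dataframe Writer: the Dataframe Writer options relate to Spark's partitionBy and bucketBy write options. A minimal sketch, assuming hypothetical country and customer_id columns; note that bucketed output in Spark must be written with saveAsTable rather than a plain path.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("writer-bucketing-sketch").getOrCreate()

    // Placeholder input path and column names.
    val sales = spark.read.parquet("/path/to/sales")

    // partitionBy writes one directory per country value;
    // bucketBy hashes rows into 16 buckets per partition on customer_id.
    sales.write
      .partitionBy("country")
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .format("parquet")
      .saveAsTable("sales_bucketed")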
