Dataset tuning options

The Dataset tuning options include partitioning data and persisting or caching data, which can save downstream computation.

Effective memory management is critical for optimizing performance when running PDI transformations on the Spark engine. The Spark tuning options let you adjust Spark default settings and customize them to your environment. Depending on the application and environment, specific tuning parameters may be adjusted to meet your performance goals.

One of the most important capabilities in Spark is persisting (or caching) a Dataset in memory across operations. You can define partitioning, and you can apply `cache`, `persist.storageLevel`, or neither.
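The relationship between caching and persisting can be sketched in plain Python. This is an illustrative stand-in, not the real Spark API: the `Dataset` and `StorageLevel` names here only mimic Spark's for clarity.

```python
# Illustrative stand-ins for Spark's Dataset and StorageLevel (not the real API).
class StorageLevel:
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"


class Dataset:
    def __init__(self):
        self.storage_level = None  # not persisted yet

    def persist(self, level=StorageLevel.MEMORY_AND_DISK):
        # persist() lets you choose any storage level.
        self.storage_level = level
        return self

    def cache(self):
        # cache() is shorthand for persist() at the default level.
        return self.persist(StorageLevel.MEMORY_AND_DISK)


ds = Dataset()
ds.cache()
print(ds.storage_level)  # MEMORY_AND_DISK
```

The point of the sketch: `cache` is simply `persist` with the default storage level, which is why the tuning options expose both a Boolean `cache` and a separate `persist.storageLevel`.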

The following table describes the Dataset tuning options available to PDI transformation steps:

| Option | Description | Value type | Example value |
| --- | --- | --- | --- |
| `cache` | Persist the Dataset with the default storage level (`MEMORY_AND_DISK`). See the Spark API documentation for more information. | Boolean | `true`/`false` |
| `coalesce` | Returns a new Dataset that has exactly `numPartitions` partitions when fewer partitions are requested. If a larger number of partitions is requested, the Dataset stays at its current number of partitions. See the Spark API documentation for more information. Note: In PDI, this option must be used with the `repartition.numPartitions` option. | Boolean | `true`/`false` |
| `repartition.numPartitions` | Returns a new Dataset partitioned by the given partitioning value into `numPartitions` partitions. The resulting Dataset is hash partitioned. See the Spark API documentation for more information. Note: In PDI, this option can be used with `repartition.columns`. Note: When `coalesce` is set to `true`, `coalesce(numPartitions)` is called. When `coalesce` is blank or set to `false`, `repartition(numPartitions)` is called. | Integer | `5` |
| `repartition.columns` | Returns a new Dataset partitioned by the given Dataset columns; only works with the `repartition.numPartitions` option. The resulting Dataset is hash partitioned by the given columns. See the Spark API documentation for more information. Note: In PDI, the step logs an error if an invalid column is entered. | Comma-separated strings | `column1, column2` |
| `persist.storageLevel` | Each persisted Dataset can be stored using a different storage level. These levels are set by passing a `StorageLevel` object to `persist()`. See the Spark API documentation for more information about RDD persistence, the full set of storage levels, and how to choose a storage level. | Spark storage level | `MEMORY_ONLY` |
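The `coalesce` behavior noted in the table (it can only reduce the partition count, never increase it) can be sketched in plain Python. The function name here is hypothetical, not part of the Spark or PDI API:

```python
def coalesced_partition_count(current_partitions: int, num_partitions: int) -> int:
    """Partition count after coalesce(num_partitions), per the table above:
    requesting fewer partitions is honored; requesting more leaves the
    current partition count unchanged."""
    return min(current_partitions, num_partitions)


# Shrinking works as requested; growing is a no-op.
print(coalesced_partition_count(10, 4))   # 4
print(coalesced_partition_count(10, 15))  # 10
```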

The following examples show you how the Dataset tuning options work with a given set of values in a PDI step:

| Example of step tuning options and values | Resulting Spark API call |
| --- | --- |
| `cache = true`<br>`repartition.numPartitions = 5`<br>`repartition.columns = id,name` | `dataset.repartition(5, col("id"), col("name")).cache()` |
| `persist.storageLevel = MEMORY_ONLY`<br>`repartition.numPartitions = 15`<br>`coalesce = true` | `dataset.coalesce(15).persist(StorageLevel.MEMORY_ONLY)` |
