Dataset tuning options
The Dataset tuning options include partitioning data and persisting or caching data, which can save downstream computation.
Effective memory management is critical for optimizing performance when running PDI transformations on the Spark engine. The Spark tuning options let you adjust Spark's default settings and customize them for your environment. Depending on the application and environment, specific tuning parameters can be adjusted to meet your performance goals.
One of the most important capabilities in Spark is persisting (or caching) a Dataset in memory across operations. With the Dataset tuning options, you can define partitioning, and you can choose to apply cache, persist.storageLevel, or neither.
The following table describes the Dataset tuning options available for PDI transformation steps:
| Option | Description | Value type | Example value |
| --- | --- | --- | --- |
| cache | Persist the Dataset with the default storage level (MEMORY_AND_DISK). See the Spark API documentation for more information. | Boolean | true/false |
| coalesce | Returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested. If a larger number of partitions is requested, the Dataset stays at its current number of partitions. See the Spark API documentation for more information. Note: In PDI, this option must be used with the repartition.numPartitions option. | Boolean | true/false |
| repartition.numPartitions | Returns a new Dataset partitioned by the given partitioning value into numPartitions partitions. The resulting Dataset is hash partitioned. See the Spark API documentation for more information. Note: In PDI, this option can be used with repartition.columns. Note: When coalesce is set to true, coalesce( numPartitions ) is called. When coalesce is set to blank or false, repartition( numPartitions ) is called. | Integer | 5 |
| repartition.columns | Returns a new Dataset partitioned by the given Dataset columns, and only works with the repartition.numPartitions option. The resulting Dataset is hash partitioned by the given columns. See the Spark API documentation for more information. Note: In PDI, the step logs an error if an invalid column is entered. | Comma-separated strings | column1, column2 |
| persist.storageLevel | Each persisted Dataset can be stored using a different storage level. These levels are set by passing a StorageLevel object to persist(). See the Spark API documentation for more information about RDD persistence, the full set of storage levels, and how to choose a storage level. | Spark storage level | MEMORY_ONLY |
The following examples show you how the Dataset tuning options work with a given set of values in a PDI step:
| Example of step tuning options and values | Resulting Spark API call |
| --- | --- |
| cache = true<br>repartition.numPartitions = 5<br>repartition.columns = id,name | dataset.repartition( 5, Column[]{ "id", "name" } ).cache() |
| persist.storageLevel = MEMORY_ONLY<br>repartition.numPartitions = 15<br>coalesce = true | dataset.coalesce( 15 ).persist( StorageLevel.MEMORY_ONLY ) |
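The dispatch rules described above (coalesce( numPartitions ) when coalesce is true, repartition( numPartitions ) otherwise, then cache or persist) can be sketched as a small helper that builds the resulting call chain from a set of option values. The function name and the option-dictionary shape are illustrative only, not part of the PDI API:

```python
def resolve_dataset_tuning(options):
    """Build the Spark call chain implied by a set of PDI Dataset tuning
    options, following the rules in the tables above. Illustrative sketch
    only; not part of PDI itself."""
    calls = []
    n = options.get("repartition.numPartitions")
    if n is not None:
        if options.get("coalesce") == "true":
            # coalesce = true -> coalesce( numPartitions ) is called
            calls.append(f"coalesce( {n} )")
        else:
            # coalesce blank/false -> repartition( numPartitions ),
            # optionally hash partitioned by the given columns
            cols = options.get("repartition.columns")
            if cols:
                col_list = ", ".join(f'"{c.strip()}"' for c in cols.split(","))
                calls.append(f"repartition( {n}, Column[]{{ {col_list} }} )")
            else:
                calls.append(f"repartition( {n} )")
    if options.get("cache") == "true":
        # cache uses the default storage level (MEMORY_AND_DISK)
        calls.append("cache()")
    elif "persist.storageLevel" in options:
        calls.append(f"persist( StorageLevel.{options['persist.storageLevel']} )")
    return "dataset." + ".".join(calls)
```

For example, passing the first row's options ({"cache": "true", "repartition.numPartitions": 5, "repartition.columns": "id,name"}) reproduces the repartition-then-cache chain shown in the table.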