# Dataset tuning options

The Dataset tuning options include partitioning data and persisting or caching data, which can save downstream computation.

Effective memory management is critical for optimizing performance when running PDI transformations on the Spark engine. The Spark tuning options let you adjust Spark's default settings and customize them to your environment. Depending on your application and environment, you can adjust specific tuning parameters to meet your performance goals.

One of the most important capabilities in Spark is persisting (or caching) a Dataset in memory across operations. You can define partitioning, and you can apply either **cache** or **persist.storageLevel**, or neither.

The following table describes the Dataset tuning options available in PDI transformation steps:

| Option                        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | Value type              | Example value    |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------- | ---------------- |
| **cache**                     | Persist the Dataset with the default storage level (`MEMORY_AND_DISK`). See the [Spark API documentation](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#cache--) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Boolean                 | true/false       |
| **coalesce**                  | <p>Returns a new Dataset that has exactly <code>numPartitions</code> partitions, when fewer partitions are requested. If a larger number of partitions is requested, the Dataset stays at its current number of partitions. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#coalesce-int-">Spark API documentation</a> for more information.</p><p>Note: In PDI, this option must be used with the <strong>repartition.numPartitions</strong> option.</p>                                                                                                                                                                       | Boolean                 | true/false       |
| **repartition.numPartitions** | <p>Returns a new Dataset partitioned by the given partitioning value into <code>numPartitions</code>. The resulting Dataset is hash partitioned. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#repartition-int-">Spark API documentation</a> for more information.</p><p>Note: In PDI, this option can be used with <strong>repartition.columns</strong>.</p><p>Note: When <strong>coalesce</strong> is set to <code>true</code> then <code>coalesce( numPartitions )</code> is called. When <strong>coalesce</strong> is set to blank or <code>false</code>, then <code>.repartition( numPartitions )</code> is called.</p> | Integer                 | 5                |
| **repartition.columns**       | <p>Returns a new Dataset partitioned by the given Dataset columns, and only works with the <strong>repartition.numPartitions</strong> option. The resulting Dataset is hash partitioned by the given columns. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#repartition-int-org.apache.spark.sql.Column...-">Spark API documentation</a> for more information.</p><p>Note: In PDI, the step logs an error if an invalid column is entered.</p>                                                                                                                                                                                 | Comma separated strings | column1, column2 |
| **persist.storageLevel**      | <p>Each persisted Dataset can be stored using a different storage level. These levels are set by passing a <code>StorageLevel</code> object to <code>persist()</code>.</p><p>See the <a href="https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#rdd-persistence">Spark API documentation</a> to view more information about RDD persistence, the full set of storage levels, and how to choose a storage level.</p>                                                                                                                                                                                                                                               | Spark Storage Level     | MEMORY\_ONLY     |

The following examples show you how the Dataset tuning options work with a given set of values in a PDI step:

| Example of step tuning options and values                                                                                                                                                                    | Resulting Spark API call                                    |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------- |
| <ul><li><strong>cache</strong> = <code>true</code></li><li><strong>repartition.numPartitions</strong> = <code>5</code></li><li><strong>repartition.columns</strong> = <code>id,name</code></li></ul>         | `dataset.repartition( 5, Column[]{ "id", "name" } ).cache()` |
| <ul><li><strong>persist.storageLevel</strong> = <code>MEMORY\_ONLY</code></li><li><strong>repartition.numPartitions</strong> = <code>15</code></li><li><strong>coalesce</strong> = <code>true</code></li></ul> | `dataset.coalesce( 15 ).persist( StorageLevel.MEMORY_ONLY )`  |
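As a rough illustration of the dispatch rules above (not PDI's actual implementation), the mapping from tuning options to the resulting chain of Spark calls can be sketched in plain Python. The helper `spark_calls` and its option-dictionary format are hypothetical:

```python
def spark_calls(options):
    """Return the Spark API call chain implied by a set of Dataset tuning options.

    Hypothetical sketch: when coalesce is true, coalesce(numPartitions) is used;
    otherwise repartition(numPartitions[, columns]). cache takes precedence over
    persist.storageLevel, since only one of the two can be applied.
    """
    calls = []
    num = options.get("repartition.numPartitions")
    cols = options.get("repartition.columns")
    if num is not None:
        if options.get("coalesce"):
            # coalesce is only honored together with repartition.numPartitions
            calls.append(f"coalesce({num})")
        elif cols:
            quoted = ", ".join(f'"{c.strip()}"' for c in cols.split(","))
            calls.append(f"repartition({num}, {quoted})")
        else:
            calls.append(f"repartition({num})")
    if options.get("cache"):
        calls.append("cache()")
    elif options.get("persist.storageLevel"):
        calls.append(f"persist(StorageLevel.{options['persist.storageLevel']})")
    if not calls:
        return "dataset"
    return "dataset." + ".".join(calls)
```

For example, the first row of the table above corresponds to `spark_calls({"cache": True, "repartition.numPartitions": 5, "repartition.columns": "id,name"})`, which yields `dataset.repartition(5, "id", "name").cache()`.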
