# Dataset tuning options

The Dataset tuning options include partitioning data and persisting or caching data, which can save downstream computation.

Effective memory management is critical for optimizing performance when running PDI transformations on the Spark engine. The Spark tuning options let you adjust Spark default settings and customize them for your environment. Depending on the application and environment, you can adjust specific tuning parameters to meet your performance goals.

One of the most important capabilities in Spark is persisting (or caching) a Dataset in memory across operations. You can define partitioning, and you can apply **cache**, **persist.storageLevel**, or neither.

The following table describes the Dataset tuning options:

| Option                        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | Value type              | Example value    |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------- | ---------------- |
| **cache**                     | Persist the Dataset with the default storage level (`MEMORY_AND_DISK`). See the [Spark API documentation](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#cache--) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Boolean                 | true/false       |
| **coalesce**                  | <p>Returns a new Dataset that has exactly <code>numPartitions</code> partitions when fewer partitions are requested. If a larger number of partitions is requested, the Dataset stays at its current number of partitions. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#coalesce-int-">Spark API documentation</a> for more information.</p><p>Note: In PDI, this option must be used with the <strong>repartition.numPartitions</strong> option.</p> | Boolean                 | true/false       |
| **repartition.numPartitions** | <p>Returns a new Dataset that has exactly <code>numPartitions</code> partitions. The resulting Dataset is hash partitioned. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#repartition-int-">Spark API documentation</a> for more information.</p><p>Note: In PDI, this option can be used with <strong>repartition.columns</strong>.</p><p>Note: When <strong>coalesce</strong> is set to <code>true</code>, <code>.coalesce( numPartitions )</code> is called. When <strong>coalesce</strong> is blank or set to <code>false</code>, <code>.repartition( numPartitions )</code> is called.</p> | Integer                 | 5                |
| **repartition.columns**       | <p>Returns a new Dataset partitioned by the given Dataset columns. This option only works with the <strong>repartition.numPartitions</strong> option. The resulting Dataset is hash partitioned by the given columns. See the <a href="https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Dataset.html#repartition-int-org.apache.spark.sql.Column...-">Spark API documentation</a> for more information.</p><p>Note: In PDI, the step logs an error if an invalid column is entered.</p> | Comma-separated strings | column1, column2 |
| **persist.storageLevel**      | <p>Each persisted Dataset can be stored using a different storage level. These levels are set by passing a <code>StorageLevel</code> object to <code>persist()</code>.</p><p>See the <a href="https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#rdd-persistence">Spark RDD programming guide</a> for more information about RDD persistence, the full set of storage levels, and how to choose a storage level.</p> | Spark Storage Level     | MEMORY\_ONLY     |
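
The difference between **coalesce** and hash repartitioning described in the table can be illustrated with a small, Spark-free sketch. The function names `coalesced_partition_count` and `hash_partition` are hypothetical, and `zlib.crc32` is only a stand-in for Spark's internal hash function, so the actual partition assignments in Spark will differ.

```python
import zlib


def coalesced_partition_count(current: int, requested: int) -> int:
    """Partition count after coalesce: it can shrink but never grow."""
    return min(current, requested)


def hash_partition(rows, columns, num_partitions):
    """Place each row (a dict) into a partition chosen by hashing the given columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Build one key from the partitioning columns of this row.
        key = "|".join(str(row[c]) for c in columns)
        # crc32 is an assumption used for determinism; Spark uses its own hash.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


rows = [{"id": i, "name": f"n{i % 2}"} for i in range(6)]
parts = hash_partition(rows, ["name"], 3)
print(coalesced_partition_count(10, 5), [len(p) for p in parts])
```

Rows that share the same values in the partitioning columns always land in the same partition, which is the property that **repartition.columns** relies on.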

The following examples show how the Dataset tuning options translate into Spark API calls for a given set of values in a PDI step:

| Example of step tuning options and values                                                                                                                                                                    | Resulting Spark API call                                    |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------- |
| <ul><li><strong>cache</strong> = <code>true</code></li><li><strong>repartition.numPartitions</strong> = <code>5</code></li><li><strong>repartition.columns</strong> = <code>id,name</code></li></ul> | `dataset.repartition( 5, col( "id" ), col( "name" ) ).cache()` |
| <ul><li><strong>persist.storageLevel</strong> = <code>MEMORY\_ONLY</code></li><li><strong>repartition.numPartitions</strong> = <code>15</code></li><li><strong>coalesce</strong> = <code>true</code></li></ul> | `dataset.coalesce( 15 ).persist( StorageLevel.MEMORY_ONLY )` |
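
The mapping in the table above can be sketched as a small function that turns a set of option values into the text of the resulting call chain. This is an illustration only, not PDI's implementation: `resolve_spark_call` is a hypothetical name, columns are rendered as `col("...")`, and two behaviors are assumptions that the documentation does not state: **repartition.columns** is ignored when **coalesce** is `true`, and **persist.storageLevel** wins if both it and **cache** are set.

```python
def resolve_spark_call(options: dict) -> str:
    """Render the Spark call chain implied by a dict of Dataset tuning options."""
    call = "dataset"

    num = options.get("repartition.numPartitions")
    if num is not None:
        if str(options.get("coalesce", "")).lower() == "true":
            # coalesce = true: call coalesce (columns ignored; an assumption).
            call += f".coalesce( {num} )"
        else:
            # coalesce blank or false: call repartition, optionally by columns.
            raw = options.get("repartition.columns", "")
            columns = [c.strip() for c in raw.split(",") if c.strip()]
            args = [str(num)] + [f'col("{c}")' for c in columns]
            call += f".repartition( {', '.join(args)} )"

    level = options.get("persist.storageLevel")
    if level:
        # Assumption: an explicit storage level takes precedence over cache.
        call += f".persist( StorageLevel.{level} )"
    elif str(options.get("cache", "")).lower() == "true":
        call += ".cache()"
    return call


print(resolve_spark_call({
    "cache": "true",
    "repartition.numPartitions": "5",
    "repartition.columns": "id,name",
}))
```

When no options are set, the function returns `dataset` unchanged, which corresponds to applying neither partitioning nor persistence.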

