# Parquet Output

The Parquet Output step allows you to map PDI fields to fields within Parquet data files and choose where you want to process those files, such as on HDFS. For big data users, the [Parquet Input](https://docs.pentaho.com/pdia-data-integration/pdi-transformation-steps-reference-overview/parquet-input) and Parquet Output steps enable you to gather data from various sources and move that data into the Hadoop ecosystem in the Parquet format.

### Before you begin

Before using the Parquet Output step, you must configure a named connection for your distribution, even if your **Location** is set to `Local`. For more information, see [Connecting to a Hadoop cluster with the PDI client](https://docs.pentaho.com/pdia-data-integration/extracting-data-into-pdi/connecting-to-a-hadoop-cluster-with-the-pdi-client-article).

### General tab

Enter the following information in the transformation step fields:

| Option                             | Description                                                                                                                                                                                                                                                                                                                                                                                                        |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Step name**                      | Specify the unique name of the Parquet Output step on the canvas. You can customize the name or leave it as the default.                                                                                                                                                                                                                                                                                           |
| **Folder/File name**               | Specify the location and name of the file or folder. Click **Browse** to display the **Open** dialog box and navigate to the destination file or folder. For the supported file system types, see [Connecting to Virtual File Systems](https://docs.pentaho.com/pdia-data-integration/extracting-data-into-pdi/virtual-file-system-browser). When running on the Pentaho engine, a single Parquet file is created. |
| **Overwrite existing output file** | Select to overwrite an existing file that has the same file name and extension.                                                                                                                                                                                                                                                                                                                                    |

### Fields tab

![Parquet Output step](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-f802704d3e07a5b45fbc03288d7f71313b1b5e51%2FPDI_ParquetOutput_Fields.png?alt=media)

In the **Fields** tab, you can define properties for the fields being exported.

| Property          | Description                                                                                                                  |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| **Parquet path**  | Specify the name of the column in the Parquet file.                                                                          |
| **Name**          | Specify the name of the PDI field.                                                                                           |
| **Parquet type**  | Specify the data type used to store the data in the Parquet file.                                                            |
| **Precision**     | Specify the total number of significant digits in the number. Applies only to the Decimal Parquet type. The default is `20`. |
| **Scale**         | Specify the number of digits after the decimal point. Applies only to the Decimal Parquet type. The default is `10`.         |
| **Default value** | Specify the default value of the field if it is null or empty.                                                               |
| **Null**          | Specify whether the field can contain null values.                                                                           |

{% hint style="warning" %}
To help prevent a transformation failure, enter a value in **Default value** for every field where **Null** is set to `No`.
{% endhint %}

You can define the fields manually, or you can click **Get Fields** to populate the fields.

When the fields are retrieved, a PDI type is converted into an appropriate Parquet type. You can change the Parquet type by using the **Type** drop-down list or by entering the type manually.

| PDI type    | Parquet type    |
| ----------- | --------------- |
| InetAddress | UTF8            |
| String      | UTF8            |
| TimeStamp   | TimestampMillis |
| Binary      | Binary          |
| BigNumber   | Decimal         |
| Boolean     | Boolean         |
| Date        | Date            |
| Integer     | Int64           |
| Number      | Double          |

### Options tab

![Parquet Output step Options tab](https://773338310-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYwnJ6Fexn4LZwKRHghPK%2Fuploads%2Fgit-blob-9c7b4f48b59cd72e2e35e722f236aae45aebe52a%2FPDI_TransStep_Parquet_Output_Options_Tab.png?alt=media)

In the **Options** tab, you can define properties for the file output.

#### Compression

Specify the codec to use to compress the Parquet output file:

* **None**: No compression is used. (Default)
* **Snappy**: Uses Google’s [Snappy](http://google.github.io/snappy/) compression library.
* **GZIP**: Uses a compression format based on the [Deflate](https://en.wikipedia.org/wiki/DEFLATE) algorithm.

#### Version

Specify the version of Parquet to use:

* **Parquet 1.0**
* **Parquet 2.0**

#### Row group size (MB)

Specify the maximum size, in megabytes, of each row group in the output file. The default value is `0`.

#### Data page size (KB)

Specify the maximum size, in kilobytes, of each data page in the output file. The default value is `0`.

#### Dictionary encoding

Select to use dictionary encoding, which builds a dictionary of the values encountered in a column. The dictionary page is written first, before the data pages of the column.

If the dictionary grows larger than **Page size**, in either byte size or number of distinct values, the encoding method falls back to the plain encoding type.

#### Page size (KB)

Specify the page size when using dictionary encoding. The default value is `1024`.

#### Extension

Select the extension for your output file. The default value is `parquet`.

#### Include date in file name

Select to add the system date to the filename in `yyyyMMdd` format (for example, `20181231`).

#### Include time in file name

Select to add the system time to the filename in `HHmmss` format (for example, `235959`).

#### Specify date time format

Select to specify the date and time format by using the drop-down list.
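The stamps these options append can be reproduced with standard date formatting; the sketch below maps the Java-style patterns `yyyyMMdd` and `HHmmss` to their Python `strftime` equivalents (the base filename is an assumption for illustration):

```python
# Illustrative only: building a timestamped filename like the one the
# date/time options produce. The base name "output" is hypothetical.
from datetime import datetime

now = datetime(2018, 12, 31, 23, 59, 59)  # example timestamp from the doc

date_part = now.strftime("%Y%m%d")  # yyyyMMdd -> 20181231
time_part = now.strftime("%H%M%S")  # HHmmss   -> 235959
filename = f"output_{date_part}_{time_part}.parquet"
```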

### Metadata injection support

All fields of this step support metadata injection. You can use this step with [ETL metadata injection](https://docs.pentaho.com/pdia-data-integration/pdi-transformation-steps-reference-overview/etl-metadata-injection) to pass metadata to your transformation at runtime.
