# Bulk load into Amazon Redshift

The **Bulk load into Amazon Redshift** job entry leverages Amazon Redshift’s [`COPY`](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) command to automate loading your data into an Amazon Redshift cluster. Using `COPY` enables parallel loading from cloud storage for high performance, and reduces the need for repetitive SQL scripting.

### Before you begin

Before using this job entry in PDI, do the following to download the Redshift driver and copy it into the Pentaho distribution:

1. Download the latest JDBC 4.0 driver from [Download the Amazon Redshift JDBC Driver](https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver).
2. Copy the downloaded JAR file into `server/pentaho-server/tomcat/lib` to update the Pentaho Server.
3. Copy the downloaded JAR file into `design-tools/data-integration/lib` to update PDI.
4. Verify that your cluster is configured to support the `COPY` command.

   See [Credentials and Access Permissions](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-access-permissions.html) for details.

### General

The following field is available on the **General** tab:

* **Entry Name**: Specify the unique name of the job entry on the canvas. You can customize the name or leave it as the default.

### Options

The **Bulk load into Amazon Redshift** job entry includes several tabs to define the data input source, the output database and location, and other `COPY` parameters.

#### Input tab

![Input tab](/files/gaBJ8dB2u351r6JeFQUD)

Use the options in this tab to define your input source for the Redshift `COPY` command:

| Option                            | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Source**                        | Specify the file, folder, prefix, or variable of the S3 bucket to use as the input for the Redshift `COPY` command. See [Syntax of the COPY from Amazon S3](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-source-s3.html#copy-parameters-data-source-s3-syntax) for details.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| **What file type is your source** | <p>Select the file type of the input source. You can select one of the following types:- <strong>Delimited text</strong></p><p>The input source is character-delimited UTF-8 text.</p><ul><li><strong>Avro</strong></li></ul><p>The input source is an Avro data serialization protocol.</p><ul><li><strong>JSON</strong></li></ul><p>The input source is a JavaScript Object Notation (JSON) data file containing a set of either objects or arrays.</p><ul><li><strong>ORC</strong></li></ul><p>The input source is an Optimized Row Columnar (ORC) file containing Hive data.</p><ul><li><strong>Parquet</strong></li></ul><p>The input source is a Parquet file of nested data structures in a flat columnar format.</p><ul><li><strong>Fixed width</strong></li></ul><p>The input source is a file where each column width is a fixed length, rather than columns separated by a delimiter.</p> |
| **Compression**                   | <p>Select the type of compression applied to your input source:- <strong>None</strong></p><p>No compression is applied.</p><ul><li><strong>BZIP2</strong></li></ul><p>The Burrows–Wheeler algorithm is used to compress the input source.</p><ul><li><strong>GZIP</strong></li></ul><p>The DEFLATE algorithm is used as a basis for compressing the input source.</p><ul><li><strong>LZOP</strong></li></ul><p>The Lempel–Ziv–Oberhumer lossless algorithm is used as a basis for compressing the input source.</p>                                                                                                                                                                                                                                                                                                                                                                                  |

Depending on the file type selected in **What file type is your source**, additional file settings appear at the bottom of the tab:

<table data-header-hidden><thead><tr><th></th><th></th></tr></thead><tbody><tr><td>File Type</td><td>File Settings</td></tr><tr><td><strong>Delimited text</strong></td><td><p>Specify the following settings for a delimited text file:- <strong>Leading rows to skip</strong></p><p>Specify the number of rows to use as an offset from the beginning of the file.</p><ul><li><strong>Delimiter</strong></li></ul><p>Specify the character used to separate a data field.</p><ul><li><strong>Quote character</strong></li></ul><p>Specify the character used to enclose a data field.</p><ul><li><strong>Remove quotes</strong></li></ul><p>Select one of the following values to indicate whether quotation characters should be removed from a data field during the bulk load:</p><pre><code>-   **Yes**: removes the quotation characters.
-   **No**: retains the quotation characters.
</code></pre><ul><li><strong>Ignore blank lines</strong></li></ul><p>Select one of the following values to indicate whether blank lines should be removed from the data during the bulk load:</p><pre><code>-   **Yes**: removes the blank lines.
-   **No**: retains the blank lines.
</code></pre><ul><li><strong>Empty as null</strong></li></ul><p>Select one of the following values to indicate whether empty data values should be set to null during the bulk load:</p><pre><code>-   **Yes**: sets empty data values to null.
-   **No**: leaves data values as empty.
</code></pre></td></tr><tr><td><strong>Avro</strong></td><td><p>Select one of the following values for the <strong>Empty as null</strong> option to indicate whether empty data values should be set to null during the bulk load:- <strong>Yes</strong>: sets empty data values to null.</p><ul><li><strong>No</strong>: leaves data values as empty.</li></ul></td></tr><tr><td><strong>JSON</strong></td><td><p>Select one of the following values for the <strong>Empty as null</strong> option to indicate whether empty data values should be set to null during the bulk load:- <strong>Yes</strong>: sets empty data values to null.</p><ul><li><strong>No</strong>: leaves data values as empty.</li></ul></td></tr><tr><td><strong>ORC</strong></td><td>Additional file settings handled by the ORC file.</td></tr><tr><td><strong>Parquet</strong></td><td>Additional file settings handled by Parquet file.</td></tr><tr><td><strong>Fixed width</strong></td><td><p>Specify the following settings for a fixed width file:- <strong>Columns:Width</strong></p><p>Specify the columns and their fixed widths.</p><ul><li><strong>Leading rows to skip</strong></li></ul><p>Specify the number of rows to use as an offset from the beginning of the file.</p><ul><li><strong>Ignore blank lines</strong></li></ul><p>Select one of the following values to indicate whether blank lines should be removed from the data during the bulk load:</p><pre><code>-   **Yes**: removes the blank lines.
-   **No**: retains the blank lines.
</code></pre><ul><li><strong>Empty as null</strong></li></ul><p>Select one of the following values to indicate whether empty data values should be set to null during the bulk load:</p><pre><code>-   **Yes**: sets empty data values to null.
-   **No**: leaves data values as empty.
</code></pre></td></tr></tbody></table>

#### Output tab

![Output tab](/files/9yrV0HWywQcwGfejSVmq)

Use the options in this tab to define how the data is loaded using the Redshift `COPY` command:

| Option                  | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Database connection** | <p>Select your database connection from a list of existing Redshift connections.If you do not have an existing connection, click <strong>New</strong>. If you need to modify an existing connection, click <strong>Edit</strong>.</p><p>An exclamation mark appears in the title of the <strong>Output</strong> tab if a connection cannot be established to the selected database. In this case, click <strong>Edit</strong> to modify your connection settings.</p><p>You must specify your IAM-role based access or credentials in your database connection for Redshift. Within the Database Connection dialog box, select either <strong>Role</strong> or <strong>Credentials</strong>. For your S3 role, specify the IAM-role in <strong>Role Name</strong>. For your S3 credentials, specify the <strong>Access Key ID</strong> and <strong>Secret Access Key</strong>.</p> |
| **Schema**              | Select the schema to use for the bulk load.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| **Table name**          | Select the name of the table to bulk load. You can only load one table at a time. You will need to use multiple Bulk load into Amazon Redshift entries or loop over a single entry to load multiple tables.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| **Columns**             | Preview columns within your selected **Table name**.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

#### Options tab

![Options tab](/files/AmAP0nu1Y5cuO2ElBK15)

Use the option in this tab to define how tabular data is loaded using the Redshift `COPY` command:

| Option             | Description                                                                                                                                                                               |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Truncate table** | Select to remove all the data in the table before bulk loading the current data. When **Truncate table** is cleared, the data is appended during a new data load. The default is cleared. |

#### Parameters tab

![Parameters tab](/files/uL3kx9O5SenTEzqRDJ4Y)

Use this tab to configure additional parameters for the Redshift `COPY` command. See Redshift’s [COPY Parameter Reference](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY-parameters.html) for details.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.pentaho.com/pdia-data-integration/pdi-job-entries-reference-overview/bulk-load-into-amazon-redshift.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
