Bulk load into Snowflake
The Bulk load into Snowflake job entry loads large amounts of data into a Snowflake virtual warehouse in a single session.
This entry automates Snowflake’s COPY INTO command, which helps you avoid repetitive SQL scripting. To use it, you typically:
Size your virtual warehouse.
Define the source data and file type.
Specify the target database objects.
Provide any needed COPY INTO parameters.
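For reference, here is a minimal sketch of the kind of statement the entry builds. The stage @my_stage and table MY_SCHEMA.SALES are placeholder names, not defaults, and the statement the entry actually generates may differ:

    COPY INTO MY_SCHEMA.SALES
      FROM @my_stage/daily/
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ';' SKIP_HEADER = 1)
      ON_ERROR = ABORT_STATEMENT;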
For more information about working with Snowflake in PDI, see PDI and Snowflake.
For more information about Snowflake, including best practices for bulk loading data, see the Snowflake documentation.
Before you begin
To use this job entry, you need:
A Snowflake account.
A connection to the database where you want to load data.
An S3 connection or a VFS connection to a Snowflake staging area for your data source.
A table and schema in the target database. The first time you use this entry, you might need to create the table.
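If the target table does not exist yet, you can create it with standard SQL before running the job. A minimal sketch, using the placeholder names MY_SCHEMA and SALES:

    CREATE SCHEMA IF NOT EXISTS MY_SCHEMA;

    CREATE TABLE IF NOT EXISTS MY_SCHEMA.SALES (
      ORDER_ID   NUMBER,
      ORDER_DATE DATE,
      AMOUNT     NUMBER(10,2)
    );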
General
Entry name: Specify the unique name of the job entry on the canvas. You can customize the name or leave it as the default.
Options
This job entry includes four tabs: Input, Output, Options, and Advanced Options.
Input tab

Use the options in this tab to define your input source for the Snowflake COPY INTO command:
Source
Choose one of the following input source types:
S3: The input source is an S3 bucket.
Snowflake Staging Area: The input source is files in a Snowflake staging area.
Click Select to specify the file, folder, prefix, or variable of the S3 bucket or staging location to use as the input for COPY INTO. See “Syntax” in the Snowflake documentation for details.
What file type is your source?
Select the file type of the input source:
Delimited text: Character-delimited UTF-8 text.
Avro: Avro data serialization protocol.
JSON: JavaScript Object Notation (JSON) data file containing a set of either objects or arrays.
ORC: Optimized Row Columnar (ORC) file containing Hive data.
Parquet: Parquet file of nested data structures in a flat columnar format.
XML: XML file.
Compression
Select the type of compression applied to your input source:
None
Auto
BZIP2
GZIP
Deflate
Raw deflate
Brotli
Zstd
For Parquet files, the Compression options are:
None
Auto
Snappy
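Your selection maps to the COMPRESSION option of the Snowflake file format. For example, gzipped delimited text and Snappy-compressed Parquet correspond to clauses like the following (a sketch of the inline form, not the entry's literal output):

    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
    FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY)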
Depending on the file type selected in What file type is your source?, additional file settings appear at the bottom of this tab:
Delimited text
Specify the following settings for a delimited text file (a sketch mapping these settings to COPY INTO options appears after this section):
Leading rows to skip: The number of rows to skip from the beginning of the file. This option is useful for skipping header lines.
Delimiter: The character used to separate a data field. Default is semicolon (`;`).
Quote character: The character used to enclose a data field. Default is the double-quotation mark.
Remove quotes: Whether to remove quotation characters from a data field during the bulk load.
Empty as null: Whether to set empty data values to null during the bulk load.
Trim whitespace: Whether to remove leading and trailing whitespace during the bulk load.
Note: For delimited text files, you must have a table in your database with all required columns defined.
Avro
No additional settings.
JSON
Ignore UTF8 errors: Whether to ignore UTF-8 errors during the bulk load.
Allow duplicate elements: Whether to allow duplicate elements during the bulk load. Snowflake keeps the last duplicate value and discards the others.
Strip null values: Whether to delete null values during the bulk load.
Parse octal numbers: Whether to parse octal numbers during the bulk load.
ORC
Additional file settings for ORC files.
Parquet
Additional file settings for Parquet files.
XML
Ignore UTF8 errors: Replace invalid UTF-8 sequences with the Unicode character U+FFFD. If cleared (default), invalid UTF-8 sequences produce an encoding error.
Preserve space: Whether to preserve leading and trailing spaces in element content.
Strip outer element: Whether to remove the outer XML element and expose the second-level elements as separate documents.
Enable Snowflake data: Whether to enable recognition of Snowflake semi-structured data tags.
Auto convert: Whether to convert numeric and Boolean values from text to native representation.
If you load semi-structured data, your target table must include a VARIANT column for the following file types:
JSON
ORC
Parquet
XML
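As an illustration of how the delimited text settings surface in SQL, the following sketch maps them to their Snowflake file format options; all object names are placeholders, and the exact statement the entry generates may differ:

    COPY INTO MY_SCHEMA.SALES
      FROM @my_stage/sales/
      FILE_FORMAT = (TYPE = CSV
                     SKIP_HEADER = 1                     -- Leading rows to skip
                     FIELD_DELIMITER = ';'               -- Delimiter
                     FIELD_OPTIONALLY_ENCLOSED_BY = '"'  -- Quote character
                     EMPTY_FIELD_AS_NULL = TRUE          -- Empty as null
                     TRIM_SPACE = TRUE);                 -- Trim whitespace

For semi-structured data, a minimal VARIANT table and JSON load might look like this (again with placeholder names):

    CREATE TABLE IF NOT EXISTS MY_SCHEMA.RAW_EVENTS (V VARIANT);

    COPY INTO MY_SCHEMA.RAW_EVENTS
      FROM @my_stage/events/
      FILE_FORMAT = (TYPE = JSON STRIP_NULL_VALUES = TRUE);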
Output tab

Database connection
Select a Snowflake connection. If you do not have an existing connection, click New. To modify a connection, click Edit.
Note: If timeout errors occur, see Snowflake timeout errors.
Schema
Select the schema to use for the bulk load. The job entry reads schemas from the database to populate this list.
Table name
Select the table to bulk load. The job entry reads tables from the selected schema to populate this list. Note: You can only load one table at a time. For multiple tables, use multiple entries or use variables to parameterize the schema and table names.
Columns
Preview the column names and associated data types for the selected table.
You can use variables in the Schema and Table name fields.
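For example, setting Schema to ${TARGET_SCHEMA} and Table name to ${TARGET_TABLE} (placeholder variable names) lets a single job load different targets per run.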
Options tab

Use the options in this tab to define how data is loaded using the Snowflake COPY INTO command:
Truncate table
Select to remove all data in the table before bulk loading the current data. Clear to append data during a new data load (default).
On error
Select what to do if an error occurs during the bulk load:
Abort: Stop loading data.
Skip file: Skip loading the file that caused the error and continue processing.
Skip file number: Skip the file after a specified number of errors.
Skip file percent: Skip the file after a specified percentage of errors (based on record count).
Continue: Ignore errors and continue processing.
Skip file after (x) errors
The number of errors allowed before the file is skipped. This field appears when Skip file number is selected.
Skip file after (%) errors
The percentage of errors allowed before the file is skipped. This field appears when Skip file percent is selected.
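These selections correspond to the ON_ERROR copy option, and Truncate table is comparable to issuing TRUNCATE TABLE on the target before the load. A sketch of the ON_ERROR variants, with placeholder thresholds:

    ON_ERROR = ABORT_STATEMENT  -- Abort
    ON_ERROR = SKIP_FILE        -- Skip file
    ON_ERROR = SKIP_FILE_10     -- Skip file number (after 10 errors)
    ON_ERROR = 'SKIP_FILE_5%'   -- Skip file percent (after 5% of records error)
    ON_ERROR = CONTINUE         -- Continue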
Advanced Options tab

Use this tab to configure additional COPY INTO parameters.
Any Name/Value pair that you add is passed to Snowflake as a parameter. You are responsible for validating parameter names and values.
The Force parameter is provided as an example.
Force
Specify whether to force loading files into a table:
True: Load data even if the file has already been loaded. This option can duplicate data.
False: Ignore files that have already been loaded (default).
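As a sketch, adding a Name/Value pair of FORCE and TRUE corresponds to appending that copy option to the generated statement (placeholder object names):

    COPY INTO MY_SCHEMA.SALES
      FROM @my_stage/sales/
      FILE_FORMAT = (TYPE = CSV)
      FORCE = TRUE;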