Bulk load into Databricks

Use the Bulk load into Databricks job entry to load large amounts of data from files in your cloud accounts into Databricks tables.

This entry uses the Databricks COPY INTO command.
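At its core, such a load issues a statement with the following general shape (a minimal illustrative sketch; the placeholders are filled in from the options described below, and the exact statement the entry builds may differ):

COPY INTO <catalog>.<schema>.<table>
  FROM '<source path>'
  FILEFORMAT = <file type>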

General

  • Entry name: Specifies the unique name of the Bulk load into Databricks job entry on the canvas. You can customize the name or leave it as the default.

Options

The Bulk load into Databricks entry requires you to specify options and parameters on the Input and Output tabs.

Input tab

Note: The input file must exist in either a Databricks external location or a managed volume.
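For example, source paths typically take one of the following forms: a managed volume path, or an external location URI in the scheme of the underlying cloud storage (illustrative placeholders; s3:// and gs:// paths follow the same pattern):

/Volumes/<catalog>/<schema>/<volume>/landing/orders.csv
abfss://<container>@<storage-account>.dfs.core.windows.net/landing/orders.csv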

The Input tab includes the following fields:

Source

Specify the path to the input file. This must be the path to a file in a Databricks external location or managed volume.

What file type is your source?

Specify the format of the source file. Supported formats are:

  • AVRO

  • BINARYFILE

  • CSV

  • JSON

  • ORC

  • PARQUET

  • TEXT

Force

Set to false (the default) to skip files that have already been copied into the target table. Set to true to copy files again, even if they have already been copied into the table.

Merge schema

Set to false (the default) to fail if the schema of the target table does not match the schema of the incoming files. Set to true to add a new column to the target table for each column in the source file that does not exist in the target table.

The target column types must still match the source column types, even when Merge schema is selected.

Format Options

Each file format has a number of options that are specific to that format. Use this table to specify the appropriate options for your file format. See Databricks format options.

Note: This entry does not validate that the options entered are appropriate for the selected file format.
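For example, a CSV load with Force and Merge schema enabled, reading files with a header row, roughly corresponds to the following statement (an illustrative sketch with placeholder names; the exact statement built by the entry may differ):

COPY INTO my_catalog.my_schema.my_table
  FROM '/Volumes/my_catalog/my_schema/my_volume/landing/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'delimiter' = ',')
  COPY_OPTIONS ('force' = 'true', 'mergeSchema' = 'true')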

Output tab

Use this tab to configure the target table in Databricks.

After you select a connection:

  • The Catalog list populates.

  • After you select a catalog, the Schema list populates.

  • After you select a schema, the Table name list populates.

The Output tab includes the following fields:

Database connection

Specify the database connection to your Databricks account. You can authenticate with either an access token or a username and password. The username must be the email address you use to sign in to Databricks.

Click Edit to revise an existing connection. Click New to add a new connection.

Examples:

jdbc:databricks://<server hostname>:443;HttpPath=<HTTP path>;PWD=<Personal Access Token>

jdbc:databricks://<server hostname>:443;HttpPath=<HTTP path>

The Custom driver class name is com.databricks.client.jdbc.Driver.

Catalog

Specify a catalog from the list of available catalogs for your Databricks connection.

Schema

Specify the schema of the target table.

Table name

Specify the name of the target table.
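Together, the Catalog, Schema, and Table name selections identify the fully qualified target of the load. For example (hypothetical names), selecting catalog main, schema sales, and table name orders_raw targets:

main.sales.orders_raw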
