Manage connections for transformations and jobs

This page is archived. Its content has moved to Managing transformations and jobs.

While creating or editing a transformation or job in Pipeline Designer, you can define connections to multiple databases provided by multiple database vendors such as MySQL and Oracle. Pipeline Designer ships with the most suitable JDBC drivers for PostgreSQL, our default database.

Pentaho recommends avoiding ODBC connections. The ODBC to JDBC bridge driver does not always provide an exact match and adds another level of complexity, which affects performance. The only time you may have to use ODBC is if no JDBC driver is available. For details, see the Pentaho Community article on why you should avoid ODBC.

When you define a database connection in Pipeline Designer, the connection information (such as the user name, password, and port number) is stored in the Pentaho Repository and is available to other users when they connect to the repository. If you are not using the Pentaho Repository, the database connection information is stored in the XML file associated with your transformation or job. See the Pentaho Data Integration document for details on the Pentaho Repository.

You must have information about your database (such as the database type, port number, user name, and password) before you define a JDBC connection. In PDI, you can also set connection properties as variables. Through such variables, your transformations and jobs can access data from multiple database types.
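As a sketch of how such variables behave, the following Python snippet mimics PDI-style ${VAR} substitution in a JDBC URL. The variable names and values here are made up for illustration; in PDI they would come from kettle.properties or from job and transformation parameters.

```python
import re

# Hypothetical variable values; in PDI these would come from
# kettle.properties or job/transformation parameters.
variables = {
    "DB_HOSTNAME": "dbserver.example.com",
    "DB_PORT": "5432",
    "DB_NAME": "sales",
}

def substitute(text, variables):
    """Replace ${VAR} placeholders with their values, PDI-style."""
    return re.sub(r"\$\{(\w+)\}", lambda m: variables[m.group(1)], text)

url = substitute("jdbc:postgresql://${DB_HOSTNAME}:${DB_PORT}/${DB_NAME}", variables)
print(url)  # jdbc:postgresql://dbserver.example.com:5432/sales
```

Because the placeholders are resolved at run time, the same transformation can point at a different database per environment simply by changing the variable values.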

Make sure to use clean ANSI SQL that works on all the database types used.

You must have a transformation or job open to manage connections from within the Pipeline Designer. To see steps for opening a transformation or job, see Create a transformation, Create a job, or Edit a transformation or job.

Tasks

If you need to run standard SQL commands against a connection, see Use the SQL Editor.

Define a new database connection

While working on a transformation or job, you can define a new database connection to use.

Before you can create a connection, the appropriate driver must be installed for your particular data connection. Your IT administrator should be able to install the appropriate driver for you. For details, see Specify data connections for the Pentaho Server in the Install Pentaho Data Integration and Analytics guide.

To define a new database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Find Database Connections, click the More Actions icon, and then select New. The Database Connection window opens.

  3. Enter database connection information for your new Database Connection. The type of database connection information entered depends on your access protocol. Refer to the examples in the following sections of this topic for Native (JDBC) and OCI protocols:

Native (JDBC) protocol information

Create a Native (JDBC) connection in the Database Connection dialog box by completing the following steps:

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select the database you want to use (for example, MySQL or Oracle).

  3. In the Access Type list, select Native (JDBC). The access protocols that appear depend on the database type you select.

  4. In the Settings section, enter the following information:

    Host Name: The name of the server that hosts the database to which you are connecting. Alternatively, you can specify the host by IP address.

    Database Name: The name of the database to which you are connecting. If you are using an ODBC connection, enter the Data Source Name (DSN) in this field.

    Port Number: The TCP/IP port number (if it is different from the default).

    Username: Optional user name used to connect to the database.

    Password: Optional password used to connect to the database.

  5. Click Test Connection. A success message appears if the connection is established.

  6. Click OK to close the connection test dialog box.

  7. To save the connection, click Save. The database connection is saved and appears in the Database Connections list.
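Behind these fields, the driver ultimately receives a JDBC URL built from the host, port, and database name. The following Python sketch shows the common URL shapes for a few database types; it is illustrative only (PDI assembles the URL for you, and your driver's documentation is the authoritative reference for its syntax).

```python
# Illustrative only: PDI builds the JDBC URL from the connection fields.
# Check your driver's documentation for the authoritative URL syntax.
def jdbc_url(db_type, host, port, database):
    formats = {
        "postgresql": "jdbc:postgresql://{h}:{p}/{d}",
        "mysql": "jdbc:mysql://{h}:{p}/{d}",
        "oracle": "jdbc:oracle:thin:@{h}:{p}:{d}",  # here "d" is the SID
    }
    return formats[db_type].format(h=host, p=port, d=database)

print(jdbc_url("postgresql", "dbserver.example.com", 5432, "sales"))
# jdbc:postgresql://dbserver.example.com:5432/sales
```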

OCI protocol information

Perform the following steps to create an OCI connection in the PDI Database Connection dialog box:

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select Oracle.

  3. In the Access list, select OCI. The access protocols that appear depend on the database type you select.

  4. In the Settings section, enter the following information as directed by the Oracle OCI documentation.

    SID: The Oracle system ID that uniquely identifies the database on the system.

    Tablespace for Data: The name of the tablespace where the data is stored.

    Tablespace for Indices: The name of the tablespace where the indices are stored.

    User Name: The user name used to connect to the database.

    Password: The password used to connect to the database.

  5. Click Test Connection.

    A success message appears if the connection is established.

  6. Click OK to close the connection test dialog box.

  7. To save the connection, click OK to close the Database Connection dialog box.

If you want to use Advanced, Options, or Pooling for your OCI connection, refer to the Oracle OCI documentation to understand how to specify these settings.
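For reference, an OCI connection differs from a thin JDBC connection in that the URL carries only the SID; the host and port are resolved through the local Oracle client configuration (for example, tnsnames.ora). A minimal sketch of the URL shape, for illustration only:

```python
# Illustrative: an OCI JDBC URL names only the SID; the Oracle client
# installed on the machine resolves it to a host and port.
def oci_url(sid):
    return f"jdbc:oracle:oci:@{sid}"

print(oci_url("ORCL"))  # jdbc:oracle:oci:@ORCL
```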

Connect to Snowflake using strong authentication

If you define a data connection from Pentaho Data Integration and Analytics to a Snowflake data warehouse in the cloud, you can improve connection security by applying strong authentication through a key pair.

Configure key pair strong authentication for your Snowflake data connection by completing the following steps:

  1. After entering the information for your Snowflake data connection in the General tab of the Database Connection dialog box, select the Options tab.

  2. Set the key pair parameters as indicated in the following table:

    authenticator: snowflake_jwt

    private_key_file: The name of the private key file you use in your environment. For example, /rsa_key.p8.

    private_key_file_pwd: The password for accessing the private key file you use in your environment. For example, PentahoSnowFlake123.

    See https://docs.snowflake.com/en/developer-guide/jdbc/jdbc-configure#private-key-file-name-and-password-as-connection-properties for details on the private key file and its password.

  3. Click Test Connection to verify your connection. A success message appears if the connection is established.

  4. Click OK to close the connection test dialog box.

  5. To save the connection, click OK to close the Database Connection dialog box.

You have applied key pair authentication to your defined data connection between Pentaho and Snowflake.
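For reference, the Options tab entries above correspond directly to Snowflake JDBC connection properties. The following sketch shows the resulting property set; the user name, file path, and password are placeholders for your environment.

```python
# Sketch of the JDBC connection properties the steps above configure.
# "authenticator", "private_key_file", and "private_key_file_pwd" are the
# Snowflake JDBC property names; the values are environment placeholders.
properties = {
    "user": "PENTAHO_USER",              # placeholder Snowflake user
    "authenticator": "snowflake_jwt",    # selects key pair (JWT) auth
    "private_key_file": "/rsa_key.p8",
    "private_key_file_pwd": "PentahoSnowFlake123",
}
```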

Connect to an Azure SQL database

You can use an Azure SQL database as a data source with the Pipeline Designer. This connection is required if you want to use the Bulk load into Azure SQL DB job entry to load data into your Azure SQL database from Azure Data Lake Storage. Pentaho supports the Always Encrypted option, dynamic masking, and multiple authentication methods for connecting to an Azure SQL database.

Because one physical server can host databases for multiple customers, Azure SQL differs from MSSQL in several ways. For more information about the differences, see https://docs.microsoft.com/en-us/azure/azure-sql/database/features-comparison.

Before you begin

You must have an Azure account with an active subscription and an instance of an Azure SQL database. You also need to install the Azure SQL database drivers; see the Microsoft documentation for installation details.

Additionally, you need to obtain the following information from your system administrator:

  • Host name

  • Database name

  • Port number

  • Authentication method

  • Username

  • Password

If you use the Always Encryption Enabled option, you also need to obtain the Client id and Client Secret Key.

Authentication method

Pentaho supports four authentication methods for connecting to the Azure SQL DB instance:

  • SQL Authentication

    Connect using the Azure SQL Server username and password.

  • Azure Active Directory

    Connect using Multi Factor Authentication (MFA). The MFA password must be entered on the displayed webpage.

  • Azure Active Directory with password

    Connect using an Azure AD username and password.

  • Azure Active Directory with integrated authentication

    Connect using the federated on-premises Active Directory Federation Services (ADFS) with Azure Active Directory in the cloud.
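As a rough guide, these four methods map onto values of the Microsoft JDBC driver's authentication connection property. The mapping below is a sketch based on the mssql-jdbc documentation; property values can vary by driver version, so confirm against your driver's documentation.

```python
# Sketch: how the four methods correspond to the Microsoft JDBC driver's
# "authentication" property (values per mssql-jdbc docs; verify for your
# driver version). The server and database names are placeholders.
AUTH_PROPERTY = {
    "SQL Authentication": "SqlPassword",
    "Azure Active Directory": "ActiveDirectoryInteractive",  # MFA via webpage
    "Azure Active Directory with password": "ActiveDirectoryPassword",
    "Azure Active Directory with integrated authentication": "ActiveDirectoryIntegrated",
}

url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
       "database=sales;authentication=" + AUTH_PROPERTY["SQL Authentication"])
print(url)
```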

Connect to an Azure database

  1. In the Connection Name field, enter a name that uniquely describes this connection. The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select Azure SQL DB.

  3. In the Access list, select Native (JDBC).

  4. Enter your database connection information.

    Host Name: The name of the Azure SQL server instance.

    Database Name: The name of the Azure SQL database to which you are connecting.

    Port Number: The TCP/IP port number. The Azure SQL Database service is only available through TCP port 1433. You must set your firewall to allow outgoing TCP communication on port 1433.

    Authentication method: The authentication method used to connect to the Azure SQL DB instance. The default is SQL Authentication.

    Username: The username used to connect to the database.

    Password: The password used to connect to the database.

    Always Encryption Enabled: Select to use encryption. See Use the Always Encryption Enabled option for instructions on using this option.

    Client id: The unique client identifier, used to identify and set up a durable connection path to the server.

    Client Secret Key: The unique name of the key value in the Azure Key Vault.

  5. Click Test Connection to verify your connection.

Use the Always Encryption Enabled option

Before you can use the Always Encryption Enabled option, you must perform the following steps. Consult the Microsoft Azure SQL documentation for assistance with your Azure SQL tools.

  1. Generate a column master key in the Azure Key Vault.

  2. Encrypt the column using the column master key.

  3. Register the app under Azure Active Directory and obtain both the Client id and Client Secret Key.

  4. Grant permissions to the Client id for accessing the Azure Key Vault.

  5. Select Always Encryption Enabled and provide the Client id and Client Secret Key.

The Azure Always Encrypted feature is now active.
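For reference, the steps above correspond to a small set of JDBC connection properties. The sketch below uses the property names documented for the Microsoft JDBC driver's Azure Key Vault column encryption support; names can vary by driver version, and the values are placeholders, so treat this as an assumption to verify against your driver's documentation.

```python
# Sketch of the connection properties behind the Always Encryption Enabled
# option. Property names follow the Microsoft JDBC driver documentation
# (they may differ by driver version); the values are placeholders.
properties = {
    "columnEncryptionSetting": "Enabled",
    "keyVaultProviderClientId": "<your Client id>",
    "keyVaultProviderClientKey": "<your Client Secret Key>",
}
```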

Clear cached database metadata

When working with complex transformations or jobs, Pipeline Designer might accumulate outdated or incorrect metadata due to changes in the underlying database. You can use the Clear Complete DB Cache option to clear out the outdated or incorrect metadata the next time you access the transformation or job.

Cached metadata might include information about:

  • Table structures

  • Column types

  • Indexes

  • Primary and foreign keys

  • Other schema-related metadata

Note: Clearing cached database metadata does not delete any data from your database, affect transformation or job files, or clear runtime data caches that are used during execution.

To clear cached database metadata, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Find Database Connections, click the More Actions icon, and then select Clear Complete DB Cache. The cache is cleared, and a Success message is displayed. Fresh metadata is retrieved from the database the next time you access it.

Edit a database connection

You can edit an existing database connection to refine and change aspects of the connection.

To edit a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections.

  2. Expand Database Connections, find the database connection you want to edit, and click the More Actions icon.

  3. Select Edit. The Database Connection window opens.

  4. Configure the options in each tab of the Database Connection window. The tabs (General, Advanced, Options, Pooling, and Clustering) are described in the following sections.

  5. (Optional) To view features of the database connection, click Feature List.

  6. (Optional) To explore configured database connections, click Explore. For details, see Explore configured database connections.

  7. Click Test Connection. If the connection is established, a success message is displayed.

  8. Click OK to close the success message.

  9. Click Save. The connection is saved and the Database Connections window closes.

General

In the General tab, the available options depend on the type of database connection you are editing. Connection information depends on your access protocol. For details about general connection settings, refer to the examples in Define a new database connection.

Advanced

The Advanced tab contains options for configuring properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.

Supports the Boolean data type: Instructs Pipeline Designer to use the native Boolean data type supported by the database.

Supports the timestamp data type: Instructs Pipeline Designer to use the timestamp data type supported by the database.

Quote all in database: Enables case-sensitive table names. For example, MySQL is case-sensitive on Linux, but not case-sensitive on Microsoft Windows. If you quote the identifiers, the database uses case-sensitive table names.

Force all to lower-case: Changes the case of all identifiers to lower-case.

Force all to upper-case: Changes the case of all identifiers to upper-case.

Preserve case of reserved words: Instructs Pipeline Designer to use the list of reserved words supported by the database.

The Preferred Schema name where no schema is used: Enter the preferred schema name (for example, MYSCHEMA).

SQL Code Editor: Enter any SQL statements to execute immediately after connecting.

Options

Use the Options tab to add or delete parameters. Parameters enable you to control database-specific behavior.

  • To add more parameters to the list, click Add Row.

  • To delete rows, click the Delete icon next to the row.

Pooling

Configure options in the Pooling tab to set up a connection pool and define options like the initial pool size, maximum pool size, and connection pool parameters. By default, a connection remains open for each individual report or set of reports in PUC and for each individual step in a transformation in PDI. For example, you might start by specifying a pool of ten or fifteen connections, and as you run reports in PUC or transformations in PDI, the unused connections drop off. Pooling helps control database access, especially if you have dashboards that contain many reports and require a large number of connections. Pooling can also be implemented when your database licensing restricts the number of active concurrent connections.
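The pooling behavior described above can be sketched with a toy pool: connections are created up to an initial size, handed out on demand, grown up to a maximum, and returned to the idle set when released. This is an illustration only; Pentaho's actual pooling is handled by the underlying pool library.

```python
import queue
import sqlite3

class SimplePool:
    """Toy connection pool illustrating initial/maximum pool size.
    SQLite stands in for a real database; not Pentaho's implementation."""

    def __init__(self, initial=2, maximum=5):
        self.maximum = maximum
        self.size = 0
        self.idle = queue.Queue()
        for _ in range(initial):          # pre-create the initial pool
            self.idle.put(self._connect())

    def _connect(self):
        self.size += 1
        return sqlite3.connect(":memory:")

    def acquire(self):
        try:
            return self.idle.get_nowait()  # reuse an idle connection
        except queue.Empty:
            if self.size < self.maximum:   # grow, up to the maximum
                return self._connect()
            return self.idle.get()         # otherwise block for a release

    def release(self, conn):
        self.idle.put(conn)

pool = SimplePool(initial=2, maximum=5)
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```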

You can take the following action in the parameters section:

  • To add a new parameter, click Add Row and then enter the Parameter name and Value.

  • To delete a parameter, click the Delete icon.

  • To change how many parameters are shown at one time, select a new Items per page value.

  • If there are multiple pages of parameters, scroll through the pages using the left and right arrow that appear under the list of parameters.

The following list shows an example of Pooling options that might be available in a typical JDBC driver. Check your driver documentation for driver-specific pooling details.

Enable Connection Pooling: Enables connection pooling.

Pool Size:

  • Initial: Sets the initial size of the connection pool.

  • Maximum: Sets the maximum number of connections in the connection pool.

Parameters: You can define additional custom pool parameters. Click any parameter to view a short description of it. Click Restore Defaults to restore the default values for the selected parameters. The most commonly used parameter is validationQuery, which differs slightly depending on your RDBMS connection. The basic set of Pentaho databases uses the following values for validationQuery:

  • For Oracle, use SELECT 1 FROM DUAL.

  • For PostgreSQL, MS SQL Server, and MySQL, use SELECT 1.

Description: Enter a description for your parameters.
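A validation query is a cheap statement the pool runs to confirm a connection is still usable before handing it out. The idea can be demonstrated with SQLite; this is an illustration of the concept, not Pentaho's implementation.

```python
import sqlite3

def is_alive(conn, validation_query="SELECT 1"):
    """Run the pool's validation query; a healthy connection returns a row."""
    try:
        return conn.execute(validation_query).fetchone() is not None
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
print(is_alive(conn))   # True: the connection answers the validation query
conn.close()
print(is_alive(conn))   # False: the closed connection raises an error
```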

Clustering

Use the Clustering options to cluster the database connection and create connections to data partitions in Pipeline Designer. To create a new connection to a data partition, enter a Partition ID, the Host Name, the Port, the Database Name, User Name, and Password for the connection.

If you have the Pentaho Server configured in a cluster of servers, and use the Data Source Wizard (DSW) in PUC to add a new data source, the new data source will only be seen on the cluster node where the user has a session. For the new data source to be seen by all the cluster nodes, you must disable DSW data source caching. This may cause the loading of the data source list to be slower since the list is not cached.

To disable the cache, navigate to the server/pentaho-server/pentaho-solutions/system folder and set the enableDomainIdCache value in the system.properties file to false.

Delete a database connection

Delete a database connection you no longer need.


To delete a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections.

  2. Expand Database Connections, find the database connection you want to delete, and click the More Actions icon.

  3. Select Delete. The Confirm deletion dialog box opens.

  4. Click Yes to confirm deletion. The database connection is deleted.

Explore configured database connections

The Database Explorer allows you to explore configured database connections. It displays tables, views, and synonyms, along with the catalog, schema, or both to which each table belongs.

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Expand Database Connections, find the database connection you want to explore, and click the More Actions icon.

  3. Select Explore. The Database Explorer window opens.

  4. (Optional) Click the refresh icon to refresh the list.

  5. Expand the folders and find the item you want to review.

  6. Click Actions, and then select one of the following features:

    Preview first 100: Returns the first 100 rows from the selected table.

    Preview x Rows: Prompts you for the number of rows to return from the selected table.

    Row Count: Displays the total number of rows in the selected table.

    Show Layout: Displays a list of column names, data types, and so on from the selected table.

    DDL: Generates the DDL to create the selected table, based on the connection type selected in the drop-down list.

    View SQL: Launches the Simple SQL Editor for the selected table.

    Truncate Table: Generates a TRUNCATE TABLE statement for the selected table. Note: The statement is commented out by default to prevent users from accidentally deleting the table data.

    Data Profile: Provides basic information about the data.

  7. When you finish exploring the database connection, click OK. The Database Explorer window closes.

Show dependencies

Expand the connection to display a list of dependencies across the platform, including transformations and jobs.

To show the dependencies for a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Expand Database Connections, find the database connection you want to explore, and click the More Actions icon.

  3. Select Show dependencies. The database connection is expanded to show the transformations and jobs that depend on that connection.
