Pentaho configuration

You can use basic configuration tasks to set up the Pentaho Server, data connections, the Pentaho design tools, and Hadoop cluster connections so you can get started creating ETL solutions and data analytics. These configuration tasks can be performed through the Pentaho User Console (PUC), the PDI (Pentaho Data Integration) client, or by editing shell scripts and property files. A Pentaho administrator user name and password are required to perform configuration tasks through the user console.

These tasks are for IT and Pentaho administrators, as described in the following definitions:

  • IT administrator

    Installs, configures, and upgrades the Pentaho Server. An IT administrator knows where the data is stored, how to connect to it, details about the computing environment, and how to use the command line on Microsoft Windows or Linux.

  • Pentaho administrator

    Responsible for creation and management of users and roles, along with managing workstations so the ETL specialists and business analysts can create, publish, and share content.

Note: This guide assumes you have installed the Pentaho software.

Tasks to be performed by an IT administrator

As an IT administrator, you configure the Pentaho Server and define security. If your team uses big data, you also configure access to Hadoop clusters.

Configure the Pentaho Server

Basic server tasks include starting and stopping the Pentaho Server, tuning memory, and setting up connections.

The following IT administrator tasks prepare the system for more specific Pentaho administrator configuration tasks, like defining connections and managing users and roles:

Stop and start the Pentaho Server and repository

To perform maintenance work on the Pentaho Server or components that use the Pentaho Server, such as the Pentaho User Console and Pentaho Data Integration, stop the repository and server, perform the work, and then restart the repository and server.

You must be an IT administrator and know where the data that you are managing is stored, how to access that stored data, details about the computing environment, and how to use the command line for Windows or Linux.

  1. Run the following control scripts to stop and start the Pentaho Server:

    Start on Windows

    ...\pentaho\server\pentaho-server\start-pentaho.bat

    Stop on Windows

    ...\pentaho\server\pentaho-server\stop-pentaho.bat

    Start on Linux

    .../pentaho/server/pentaho-server/start-pentaho.sh

    Stop on Linux

    .../pentaho/server/pentaho-server/stop-pentaho.sh

Note: If you installed the Pentaho Repository on PostgreSQL, MySQL, Microsoft SQL Server, or Oracle, consult that database's documentation for start and stop instructions.

Script arguments and services (custom start/stop scripts)

Use these standard arguments:

  • start

  • stop

  • restart

  • status

  • help

Common service names:

  • pentahoserver

  • postgresql
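As a sketch, the arguments and service names above might be combined like this on a SysV-style system (the pentahoserver service name comes from the list above; whether it is registered depends on your installation):

```shell
# Check the Pentaho Server service, then restart it (hypothetical service name)
sudo service pentahoserver status
sudo service pentahoserver restart
```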

Create scripts for automatic stop and start of the Pentaho server and repository on Linux

If you used the manual installation to install Pentaho on Linux, create scripts for automatic start on boot and stop at shutdown for the Pentaho Server and Pentaho Repository.

You must have root permissions.

  1. Navigate to /etc/init.d/ and create a file named pentaho.

  2. Open the pentaho file and enter the following content:

  3. Optional: Update the script for your environment.

    • If you are not using Red Hat Enterprise Linux, modify the details of the script to work with the OS, shells, and init systems that you are using. The script was tested only on Red Hat Enterprise Linux.

    • If you use an account other than the pentaho local user account to start services, replace pentaho with your account name.

    • If you use a MySQL or Oracle repository instead of PostgreSQL, replace postgresql with mysql or oracle.

    • If the solution repository runs on the same machine as the server, change postgresql to the name of the init script for your database.

    • If the solution repository runs on a remote computer, remove postgresql entirely and adjust the paths to the Pentaho Server scripts.

  4. Save and close the pentaho file.

  5. Navigate to /home/pentaho/pentaho/server/pentaho-server and open start-pentaho.sh.

  6. In start-pentaho.sh, change the last if statement to match the following example:

  7. Save and close start-pentaho.sh.

  8. Make the init script executable:

  9. Add the init script to the standard run levels.

    If your distribution supports update-rc.d, run:

Note: Many modern Linux distributions use systemd instead of SysV init scripts. If update-rc.d is not available, consult your distribution's documentation for the equivalent.
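The script content referenced in step 2 is elided above. The following is a minimal sketch of a SysV init script, assuming a Red Hat-style environment, a local pentaho user account, and a PostgreSQL repository (all names taken from the surrounding steps; adapt paths and service names to your installation):

```shell
#!/bin/sh
# /etc/init.d/pentaho -- start/stop the Pentaho Repository and Server (sketch)
# chkconfig: 345 91 10
# description: Pentaho Server and Repository

case "$1" in
  start)
    # Start the repository database first, then the server as the pentaho user
    /etc/init.d/postgresql start
    su - pentaho -c "/home/pentaho/pentaho/server/pentaho-server/start-pentaho.sh"
    ;;
  stop)
    # Stop the server first, then the repository database
    su - pentaho -c "/home/pentaho/pentaho/server/pentaho-server/stop-pentaho.sh"
    /etc/init.d/postgresql stop
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
exit 0
```

For steps 8 and 9, `chmod +x /etc/init.d/pentaho` makes the script executable, and `update-rc.d pentaho defaults` (or `chkconfig --add pentaho` on Red Hat) registers it with the standard run levels.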

Increase Pentaho Server memory limit for custom installations on Windows or Linux

If you used a custom method to install Pentaho on Windows or Linux, consider increasing the memory limit so that the Pentaho Server and the PDI client (also known as Spoon) can perform memory-intensive tasks, such as sorting, grouping large datasets, or running complex transformations and jobs.

Increase the memory limit for both the Pentaho Server and the PDI client. If you do not increase the memory limit, PDI uses the default memory settings in the PDI startup scripts.

For instructions on increasing memory limits by editing the Tomcat startup script, see Configure and start the Pentaho Server after manual installation.

For instructions on increasing the memory limit in the PDI client, see Modify the PDI client startup script.

Specify data connections for BA design tools

The BA design tools need a way to connect to where you store data, and to the Pentaho Server where you publish and share what you create in the design tools.

We support connecting to data stored in these ways:

  • Pentaho data sources, such as relational Metadata data models or multidimensional Mondrian data models.

  • JDBC databases.

JDBC database connections

Your database access protocol affects how you create, test, and publish reports and models. Choose one of these protocols:

  • Native (JDBC): Standard JDBC connection details stored in each report or model.

  • JNDI: A named connection. Properties are stored outside the report or model.

  • ODBC: Not supported. Available only for compatibility.

  • OCI: Oracle-only. Use when you must use Oracle Call Interface.

Choose an access protocol

Native (JDBC)

Use Native (JDBC) when you need a quick setup or you are evaluating.

Keep this limitation in mind:

  • If connection details change, you must update every report and model that embeds them.

Behavior by tool:

  • Report Designer and Metadata Editor

    • Connection details are embedded in each report or Metadata model.

    • Report Designer lists the saved connection name after you define it.

    • Metadata Editor requires you to re-enter the connection name when you use it.

    • Before publishing, ensure the connection name matches a Native (JDBC) connection already defined on the Pentaho Server.

  • Schema Workbench and Aggregation Designer

    • Connection details are not stored. Define the connection each time.

    • Before publishing, you need the Pentaho Server URL and the name of a Native (JDBC) connection already defined on that server.

Recommendation:

  • Use for trials and evaluation.

  • For production, switch to JNDI before publishing.

JNDI

Use JNDI when you need centralized maintenance and a distributed team.

Key characteristics:

  • Reports and models store only the JNDI name.

  • You update connection details in one place.

Behavior by tool:

  • Report Designer and Metadata Editor

    • Connection properties live in a file on each workstation.

    • Before publishing, ensure the JNDI name matches a JNDI connection already defined on the Pentaho Server.

  • Schema Workbench and Aggregation Designer

    • JNDI is not available for connecting from the tool.

    • Before publishing, you still need the Pentaho Server URL and a JNDI connection name already defined on the Pentaho Server.

Recommendation:

  • Use for production deployments.

ODBC (not supported)

ODBC is not supported. It remains available only for environments with a hard dependency on it.

If you must use ODBC, contact Pentaho Support.

OCI (Oracle only)

If you connect to an Oracle database, you can use OCI. Use the steps in Define JDBC or OCI connections for BA design tools.

Define connections for BA design tools

Use these sections to set up connections on workstations running BA design tools.

Define JDBC or OCI connections for BA design tools

  1. Choose an access protocol in Choose an access protocol.

  2. Install the matching driver.

  3. Set the access protocol later in each design tool when you define the data source.

Add drivers

The driver lets the design tools connect to the Pentaho Server. This connection helps you validate the model.

Your database administrator or IT team can provide the driver. If needed, download it from your database vendor.

Copy the driver to these directories on every machine that runs the design tools. Stop the design tools before you copy the file.

  • Aggregation Designer: /pentaho/design-tools/agg-designer/drivers/

  • Metadata Editor: /pentaho/design-tools/metadata-editor/libext/JDBC/

  • Report Designer: /pentaho/design-tools/report-designer/lib/jdbc/

  • Schema Workbench: /pentaho/design-tools/schema-workbench/drivers/

Keep one driver version per database in each directory. Remove older versions to avoid conflicts.

If you are unsure, contact Pentaho Support.

Microsoft SQL Server driver notes

If you use Microsoft SQL Server, you might need the jTDS driver. jTDS is not vendor supported. Confirm the correct driver with Pentaho Support.

On Windows, many JDBC drivers support Type 2 integrated authentication using the integratedSecurity connection string property. To use it, copy sqljdbc_auth.dll to every machine and directory where you copied the JDBC driver.

You can find sqljdbc_auth.dll at:

<installation directory>\sqljdbc_<version>\<language>\auth\

  • Use the x64 DLL with a 64-bit JVM on an x64 processor.

  • Use the IA64 DLL with a 64-bit JVM on an Itanium processor.
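As an illustration, a JDBC URL that uses integrated authentication might look like the following sketch (host, port, and database name are placeholders):

```
jdbc:sqlserver://dbhost:1433;databaseName=SampleDB;integratedSecurity=true;
```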

Define JNDI connections for Report Designer and Metadata Editor

  1. Choose an access protocol in Choose an access protocol.

  2. Complete these tasks on each workstation.

Add drivers

The driver lets design tools connect to the Pentaho Server and validate models.

Your database administrator or IT team can provide the driver. You can also download it from your database vendor.

Check Components Reference to confirm your driver is supported.

Copy the driver to these directories on each machine where you installed design tools. Stop the tools first.

  • Aggregation Designer: /pentaho/design-tools/agg-designer/drivers/

  • Metadata Editor: /pentaho/design-tools/metadata-editor/libext/JDBC/

  • Report Designer: /pentaho/design-tools/report-designer/lib/jdbc/

  • Schema Workbench: /pentaho/design-tools/schema-workbench/drivers/

Keep only one version of the driver in each directory. Remove older versions to avoid conflicts.

If you are unsure, contact Pentaho Support.

Configure Report Designer

  1. Stop Report Designer. Stop the Pentaho Server too, if it runs locally.

    For server steps, see Stop and start the Pentaho Server and repository.

  2. Go to the .pentaho directory in the user profile.

    Examples:

    • Windows: C:\Users\username\.pentaho\

    • Linux or Solaris: /home/username/.pentaho/

  3. Go to ~/.pentaho/simple-jndi/. Create it if needed.

  4. Edit default.properties. Create it if needed.

    Example:

    In this example, SampleData is the JNDI connection name.

  5. Save the file.

  6. Restart Report Designer. Restart the Pentaho Server too, if it runs locally.

Repeat this process on each workstation where you run Report Designer.
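The default.properties example referenced in step 4 is elided above. A minimal sketch, assuming the Pentaho-supplied SampleData HSQLDB database (driver class, URL, and credentials are placeholders to adapt to your database):

```properties
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```

In this sketch, SampleData is the JNDI connection name; each `name/key=value` line is simple-jndi syntax.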

Configure Metadata Editor

  1. Stop Metadata Editor. Stop the Pentaho Server too, if it runs locally.

    For server steps, see Stop and start the Pentaho Server and repository.

  2. Go to the metadata-editor/simple_jndi directory under your Metadata Editor install.

  3. Edit JDBC.properties. Create it if needed.

    Example:

    In this example, SampleData is the JNDI connection name.

  4. Save the file.

  5. Restart Metadata Editor. Restart the Pentaho Server too, if it runs locally.

Repeat this process on each workstation where you run Metadata Editor.
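The JDBC.properties example referenced in step 3 is elided above; it uses the same simple-jndi syntax, with the JNDI connection name as the prefix. A sketch with placeholder values:

```properties
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```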

Specify data connections for the Pentaho Server

For business analytics, you can use the Pentaho User Console to define the connection to where you store data.

We support accessing data stored in the following ways:

  • Comma-separated values (CSV) files, or any file that can be converted to CSV, such as spreadsheets, XML, or other semi-structured or tabular data files. Use CSV files if you are evaluating Pentaho or want to get started quickly.

  • JDBC drivers to help with database connections.

In Pentaho Data Integration (PDI), you can make connections in each job and transformation through an input step. Although users can create connections themselves, it is best to set up shared connections for your users so that they can select the connection they need from a list.

JDBC database connections

Use JDBC database connections when the Pentaho Server connects to an RDBMS.

You choose one of these access types:

  • Native (JDBC): Define the connection in Pentaho User Console (PUC).

  • JNDI: Define a data source in your application server (Tomcat).

  • OCI: Oracle-only option for using the Oracle Call Interface.

  • ODBC: Available but not supported. Contact Pentaho Support if you must use it.

Choose between native (JDBC) and JNDI

Use native (JDBC) when you want the simplest setup. You manage the full connection definition in PUC, and if connection details change, you update each affected connection in PUC.

Use JNDI when you want central management in the application server, which is common in production. It can also help keep credentials out of Pentaho admin workflows. If connection details change, you update only the application server configuration.

Add JDBC drivers

The Pentaho Server needs the correct JDBC driver for your database.

Get the driver from your database vendor or your IT team.

The Components Reference lists supported drivers.

Copy the driver to these directories:

  • Pentaho Server: /pentaho/server/pentaho-server/tomcat/lib/

  • PDI client: data-integration/lib

Restart the Pentaho Server and any PDI clients after you add drivers.
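For example, installing a PostgreSQL driver might look like the following sketch (the JAR file name is a placeholder; use the driver version you downloaded, and adjust paths to your install):

```shell
# Copy the JDBC driver into the server and PDI client library directories
cp postgresql-42.7.3.jar /pentaho/server/pentaho-server/tomcat/lib/
cp postgresql-42.7.3.jar /pentaho/design-tools/data-integration/lib/
```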

Driver conflict guidance

Keep only one driver per database type in each directory.

Remove older versions to avoid conflicts.

This matters when the driver matches your repository database type.

If you are unsure, contact Pentaho Support.

Microsoft SQL Server driver

For Microsoft SQL Server (MSSQL), you might need the non-vendor driver called jTDS.

Contact Pentaho Support to confirm the correct driver.

Most Microsoft JDBC drivers support Type 2 integrated authentication on Windows through the integratedSecurity property in the connection string. To use integrated authentication, copy sqljdbc_auth.dll to every location where you copied the JDBC driver.

You can find sqljdbc_auth.dll here:

<installation directory>\sqljdbc_<version>\<language>\auth\

Choose the correct DLL:

  • 64-bit JVM on an x64 processor: use the x64 folder.

  • 64-bit JVM on an Itanium processor: use the IA64 folder.

Set up JNDI connections for the Pentaho Server

If you installed the Pentaho Server using the manual installation method, you set up JNDI data sources in Tomcat.

Use this section to add JNDI data sources and update existing ones.

Before you begin

  • You need IT administrator access.

  • You must know how to connect to your data sources.

  • You must know your OS and database setup.
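As a sketch, a JNDI data source defined in Tomcat's context.xml might look like the following (the resource name, driver class, URL, credentials, and pool settings are placeholders to adapt):

```xml
<Resource name="jdbc/SampleData" auth="Container" type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/sampledata"
          username="pentaho_user" password="password"
          maxActive="20" maxIdle="5" validationQuery="SELECT 1"/>
```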

Defining JNDI connections for PDI clients

If you publish to the Pentaho Server from a PDI client, you can configure the PDI client with the same JNDI settings as the Pentaho Server.

This lets you develop and test transformations without keeping your application server running.

Edit the jdbc.properties file to mirror your application server data sources:

  • Location: /pentaho/design-tools/data-integration/simple-jndi/jdbc.properties
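Entries in jdbc.properties use simple-jndi syntax and should mirror the data sources defined in your application server; a sketch with placeholder values:

```properties
SampleData/type=javax.sql.DataSource
SampleData/driver=org.postgresql.Driver
SampleData/url=jdbc:postgresql://localhost:5432/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```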

Tomcat JNDI connections

Use JNDI connections when you manage data sources in Tomcat.

Install the correct JDBC driver before you create or use a JNDI data source.

Next step: define the connection

After the driver is installed, create the connection definition in Pentaho.

See Define data connections.

See also

The native database connections in the Pentaho Suite are based on Java Database Connectivity (JDBC).

For Pentaho Server-specific connection setup, see Specify data connections for the Pentaho Server.

Define security for the Pentaho Server

You also need to establish a security plan for your Pentaho system. Pentaho supports two different security options: Pentaho Security, or advanced security providers, such as LDAP, Single Sign-On, or Microsoft Active Directory.

The following considerations can help you choose the option that is best for you.

Summary

  • Pentaho Security (in PUC and the PDI client): Pentaho Security is the easiest way to configure security quickly. Both the Pentaho User Console and the PDI client enable you to define and manage users and roles. The Pentaho Server controls which users and roles can access web resources through the User Console or resources in the Pentaho Repository. Pentaho Security works well if you do not have a security provider or if you have a user community of fewer than 100 users.

  • Advanced security providers: If you are already using a security provider, such as LDAP, Single Sign-On, or Microsoft Active Directory, you can use the users and roles you have already defined with Pentaho. Your security provider controls which users and roles can access Pentaho web resources through the User Console or resources in the Pentaho Repository. Advanced security scales well for production and enterprise user communities.

Expertise

  • Pentaho Security: Requires knowledge of your user community and which users should have which roles in the Pentaho system. General security knowledge is not required.

  • Advanced security providers: Requires knowledge of your user community and which users should have which roles in the Pentaho system, plus knowledge of your particular security provider and its options.

Recommendation

  • Pentaho Security: Recommended for the Pentaho trial download or evaluation only. Caution: Do not use for production.

  • Advanced security providers: Recommended for production.

Depending on the security option you choose, refer to the following details in the Administer Pentaho Data Integration and Analytics document:

  • If you want to use Pentaho Security, see Manage Users and Roles in PUC and Manage Users and Roles in the PDI client.

  • If you want to use an advanced security provider, see Implement Advanced Security.

  • For information on using Kerberos for security, see How to Enable Kerberos Authentication.

Use password encryption with Pentaho

Strengthen security by using encrypted passwords for Pentaho applications.

These tasks are for IT administrators who have permission to modify files on the server and to stop and start the server. Perform them when you want to enhance your company's security by encrypting the passwords that are currently stored as plain text in configuration files, for example, to meet specific server security levels for regulatory compliance.

As a best practice, stop the server before modifying configuration files, then start the server when finished. After you have configured a Pentaho product to use encrypted passwords, all logins with the Pentaho product will use the encrypted passwords. Connect to any databases that were edited to ensure all changes are correct.

Use this topic to encrypt a password and to apply encrypted passwords to individual Pentaho products.

Encrypted passwords are supported for PDI, the Pentaho User Console (including PUC email), Aggregation Designer, Metadata Editor, and Report Designer.

You can also use encrypted passwords with JDBC security. See the Administer Pentaho Data Integration and Analytics document for instructions on switching to JDBC security.

Encrypting a password

Perform the following steps on the machine with the Pentaho Server to create an encrypted password.

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. At the command line, navigate to the server/pentaho-server directory.

  3. Run encr.bat (Windows) or encr.sh (Linux).

    Example:

    The console prints the encrypted password.

    Note: You must have a JRE or JDK installed.

  4. Restart the server.
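The command example in step 3 is elided above. On Linux it might look like the following sketch; the -kettle option and the exact output format are assumptions to verify against your Pentaho version:

```shell
cd /pentaho/server/pentaho-server
./encr.sh -kettle mySecretPassword
# Prints a line such as: Encrypted 2be98afc86aa7f2e4cb79ce10df90acde
```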

Use encrypted passwords with Pentaho products

How you apply an encrypted password varies by product.

Pentaho Data Integration (PDI)

Perform the following steps to use an encrypted password with Pentaho Data Integration (PDI).

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. Navigate to the design-tools/data-integration/simple-jndi directory.

  3. Open the jdbc.properties file in a text editor.

  4. Replace all instances of the password value with the encrypted password.

  5. Save and close the file.

  6. Restart the server and verify the change.
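For example, a password entry in jdbc.properties after step 4 might look like this sketch (the connection name and the encrypted value are placeholders):

```properties
SampleData/password=Encrypted 2be98afc86aa7f2e4cb79ce10df90acde
```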

Pentaho User Console (PUC)

Perform the following steps to use an encrypted password with the Pentaho User Console (PUC).

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. Navigate to the server/pentaho-server/tomcat/webapps/pentaho/META-INF directory.

  3. Open the context.xml file in a text editor.

  4. Replace the password value in every Resource element with the encrypted password.

  5. Save and close the file.

  6. Restart the server and verify the change.
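For example, a Resource element in context.xml after step 4 might look like this sketch (resource name, driver, URL, and the encrypted value are placeholders):

```xml
<Resource name="jdbc/SampleData" auth="Container" type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/sampledata"
          username="pentaho_user"
          password="Encrypted 2be98afc86aa7f2e4cb79ce10df90acde"/>
```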

PUC email

After you configure PUC to use an encrypted password, you can use that password with PUC email.

  1. Log in to PUC as an administrator.

  2. Open the Administration Perspective.

  3. Select the Mail server section.

  4. Enter your encrypted password value in the password field.

    Note: If you use Gmail, enable Allow less secure apps to access your account.

  5. Select Test Email Configuration.

  6. Verify that PUC sends an email to the address you specified.

Pentaho Aggregation Designer

To use encrypted passwords with Pentaho Aggregation Designer, you must first centralize your passwords in a jndi.properties file.

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. Create a jndi.properties file with the default properties:

  3. Save jndi.properties in the design-tools/aggregation-designer/lib directory.

  4. In the user’s home directory, navigate to the .pentaho/simple-jndi directory.

    • Open default.properties in a text editor.

    • If default.properties does not exist, create it.

    • If you created default.properties under design-tools/aggregation-designer/simple-jndi, update org.osjava.sj.root in design-tools/aggregation-designer/lib/jndi.properties to point to it. Example:

  5. Replace the password value in every property in default.properties with the encrypted password.

    Note: If you use a remote repository, replace localhost with the repository IP address.

  6. Save and close the file.

  7. Restart the server and verify the change.
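The jndi.properties content referenced in step 2 is elided above. A sketch of typical simple-jndi defaults (property values are assumptions to adapt; as step 4 notes, org.osjava.sj.root must point at the simple-jndi directory that holds default.properties):

```properties
java.naming.factory.initial=org.osjava.sj.SimpleContextFactory
org.osjava.sj.root=/home/pentaho/.pentaho/simple-jndi
org.osjava.sj.delimiter=/
```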

Pentaho Metadata Editor (PME)

The Pentaho Metadata Editor (PME) stores passwords in the JNDI connection default.properties file. For setup details, see Define JNDI connections for Report Designer and Metadata Editor.

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. In the user’s home directory, navigate to the .pentaho/simple-jndi directory.

  3. Open default.properties in a text editor.

    Note: If default.properties does not exist, create it.

  4. Replace the password value in every property with the encrypted password.

    Note: If you use a remote repository, replace localhost with the repository IP address.

  5. Save and close the file.

  6. Restart the server and verify the change.

Pentaho Report Designer (PRD)

The Pentaho Report Designer (PRD) stores passwords in the JNDI connection default.properties file. For setup details, see Define JNDI connections for Report Designer and Metadata Editor.

  1. Stop the server.

    For instructions, see Stop and start the Pentaho Server and repository.

  2. Navigate to the design-tools/report-designer/configuration-template/simple-jndi directory.

  3. Open default.properties in a text editor.

  4. Replace the password value in every property with the encrypted password.

    Note: If you use a remote repository, replace localhost with the repository IP address.

  5. Save the file.

  6. Copy default.properties to the .pentaho/simple-jndi directory in the user’s home directory. Replace the existing file.

    Note: If the .pentaho/simple-jndi directory does not exist, create it.

  7. Restart the server and verify the change.

After you update a product

After you configure a product to use encrypted passwords, all logins with that product use encrypted passwords.

Connect to any databases you updated to verify the changes.

Set up Pentaho to connect to a Hadoop cluster

Use this topic to configure Pentaho to connect to Hadoop clusters.

Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.

Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.

You can connect to clusters and services from these Pentaho components:

  • PDI client (Spoon), along with Kitchen and Pan command line tools

  • Pentaho Server

  • Analyzer (PAZ)

  • Pentaho Interactive Reports (PIR)

  • Pentaho Report Designer (PRD)

  • Pentaho Metadata Editor (PME)

Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).

To confirm which drivers are supported for your version, see the Components Reference.

Drivers are shipped as vendor-specific builds of the optional pentaho-big-data-ee-plugin.

Download drivers from the Hitachi Vantara Lumada and Pentaho Support Portal.

Note: Pentaho ships with a generic Apache Hadoop driver. For vendor-specific drivers, visit the Hitachi Vantara Lumada and Pentaho Support Portal.

Install a new driver

You need a driver for each cluster vendor and version you connect to from:

  • PDI client (Spoon), plus Kitchen and Pan

  • Pentaho Server

  • Analyzer

  • Interactive Reports

  • Pentaho Report Designer (PRD)

  • Pentaho Metadata Editor (PME)

Step 1: Download the driver plugin

  1. Go to Downloads.

  2. In the 11.0 list, open the full downloads list.

  3. Open Pentaho 11.0 GA Release.

  4. Download the driver plugin from Big Data Shims.

Common driver plugin files:

  • Apache Vanilla: pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip

  • Cloudera Data Platform: pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip

  • Google Dataproc: pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip

  • Amazon EMR: pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip

  • Azure HDInsight: pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip

Step 2: Install the driver on the PDI client

  1. Stop PDI.

  2. Extract the downloaded .zip into:

    • <pdi-install-dir>/data-integration/plugins

  3. If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.

Step 3: Install the driver on the Pentaho Server

  1. Stop the Pentaho Server.

  2. Extract the downloaded .zip into:

    • <pentaho-server>/pentaho-solutions/system/kettle/plugins

  3. If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.

Step 4: Restart and verify

  1. Restart the PDI client and the Pentaho Server.

  2. Create or update your cluster connection and verify it connects.

Update drivers

When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.

Additional configurations for specific distributions

Use these settings when you configure Pentaho to connect to specific Hadoop distributions:

Amazon EMR

The following settings are available while you configure Pentaho to connect to a working Amazon EMR cluster.

Note: EMR clusters (version 7.x and later) built with JDK 17 exclude commons-lang-2.6.jar from standard Hadoop library directories (such as $HADOOP_HOME/lib).

To use the EMR driver with EMR 7.x:

  1. Download commons-lang-2.6.jar from a trusted source (for example, the Maven repository entry for commons-lang » commons-lang » 2.6).

  2. Copy the JAR to $HADOOP_HOME/lib or $HADOOP_MAPRED_HOME/lib on every EMR node.
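For example, from a shell on an EMR node (the Maven Central URL is the standard published location for this artifact):

```shell
# Fetch commons-lang 2.6 and place it where Hadoop loads libraries
wget https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
sudo cp commons-lang-2.6.jar "$HADOOP_HOME/lib/"
```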

Before you begin

Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:

  1. Check the Components Reference to confirm your Pentaho version supports your EMR version.

  2. Prepare your Amazon EMR cluster:

    1. Configure an Amazon EC2 cluster.

    2. Install required services and service client tools.

    3. Test the cluster.

  3. Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.

  4. Get connection details from your Hadoop administrator.

  5. Add the YARN user on the cluster to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.

Note: As a best practice, install PDI on the Amazon EC2 instance. Otherwise, you may not be able to read or write cluster files. For a workaround, see Unable to read or write files to HDFS on the Amazon EMR cluster.

You also need to share connection details with users after setup.

For the full list, see Hadoop connection and access information list.

Edit configuration files for users

Your cluster administrator must download cluster configuration files.

Update the files with Pentaho-specific and user-specific values.

Use these files to create or update a named connection.

Where named connection files live

Named connection files are stored here:

  • Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster

  • Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>

  • Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save edited files in a safe location.

Files to provide to users

Provide these files to each user:

  • core-site.xml

  • mapred-site.xml

  • hdfs-site.xml

  • yarn-site.xml

Verify or edit core-site.xml file

Note: If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in core-site.xml.

Edit core-site.xml to add AWS access keys and (optional) LZO compression settings.

Step 1: Open the file

Open core-site.xml from the folder where you saved the other *-site.xml files.

Step 2: Add AWS credentials

Add your AWS Access Key ID and secret access key:
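The properties are elided above; they might look like the following sketch inside the <configuration> element (the key values are placeholders; property names are the standard Hadoop S3 settings and should be verified against your driver's documentation):

```xml
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```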

Step 3 (optional): Add S3N credentials

If you use S3N, add these properties:
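A sketch of the S3N equivalents (key values are placeholders):

```xml
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```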

Step 4: Add filesystem implementation settings

Add these properties:
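The implementation properties are elided above; a sketch using the standard Hadoop native S3 filesystem classes (verify the class names against your driver's documentation):

```xml
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
```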

Step 5: Configure LZO compression

If you are not using LZO compression, remove any references to com.hadoop.compression.lzo.LzoCodec from core-site.xml.

If you are using LZO compression:

  1. Download the LZO JAR.

  2. Add it to pentaho-big-data-plugin/hadoop-configurations/emr3x/lib.

Download: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/

Step 6: Save and apply the change

Save the file, update the named connection, and upload the updated core-site.xml.

Edit mapred-site.xml file

If you use MapReduce, edit mapred-site.xml to enable cross-platform MapReduce job submission.

1

Open the file

Open mapred-site.xml from the folder where you saved the other *-site.xml files.

2

Add the property

Add this property:

This property is only required when you submit MapReduce jobs from Windows.
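A sketch of the mapred-site.xml fragment; as described in the CDP section of this guide, mapreduce.app-submission.cross-platform allows submissions from Windows clients to Linux servers:

```xml
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```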

3

Save and apply the change

Save the file.

Update the named connection.

Upload the updated mapred-site.xml.

Connect to a Hadoop cluster with the PDI client

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the Pentaho Data Integration documentation for instructions.

Connect other Pentaho components to the Amazon EMR cluster

Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:

  • Pentaho Server (DI and BA)

  • Pentaho Metadata Editor (PME)

  • Pentaho Report Designer (PRD)

Install a driver for the Pentaho Server

Install a driver for the Pentaho Server.

For instructions, see Install a new driver.

Create and test connections

Create and test a connection for each component:

  • Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.

  • PME: Create a connection to the cluster in PME.

  • PRD: Create a connection to the cluster in PRD.

Share connection details with users

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

  • Hadoop distribution and version

  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers

  • Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see Hadoop connection and access information list.

Azure HDInsight

Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).

Before you begin

Before you set up Pentaho to connect to HDI, do the following:

  1. Check Components Reference. Confirm your Pentaho version supports your HDI version.

  2. Prepare your HDI instance:

    1. Configure your Azure HDInsight instance.

    2. Install required services and client tools.

    3. Test the platform.

    4. If HDI uses Kerberos, complete the Kerberos steps on this page.

  3. Get connection details from your platform admin. You will share some of this information with users later. See Hadoop connection and access information list.

  4. Add the YARN user to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.

  5. Set up the Hadoop driver for your HDI version. See Install a new driver.

Kerberos-secured HDInsight instances

If you connect to HDI secured with Kerberos, complete these steps first:

  1. Configure Kerberos security on the platform. Configure the Kerberos realm, KDC, and admin server.

  2. Configure these nodes to accept remote connection requests:

    • NameNode

    • DataNode

    • Secondary NameNode

    • JobTracker

    • TaskTracker

  3. If you deployed HDI using an enterprise program, set up Kerberos for those nodes.

  4. Add user credentials to the Kerberos database for each Pentaho user.

  5. Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.

circle-info

User account UIDs should be greater than min.user.id. The default is usually 1000.

  6. Set up Kerberos on your Pentaho machines. See the Administer Pentaho Data Integration and Analytics guide.

Edit configuration files for users

Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.

Named connection files are stored in these locations:

  • <username>/.pentaho/metastore/pentaho/NamedCluster

  • <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save the updated files in a known location for reuse.

Files to provide

  • core-site.xml (secured HDInsight only)

  • hbase-site.xml

  • hive-site.xml

  • mapred-site.xml

  • yarn-site.xml

circle-info

If you update these files after creating a named connection, edit the named connection and re-upload the updated files.

Edit Core site XML file

If you use a secured instance of Azure HDInsight, update core-site.xml.

  1. Open core-site.xml.

  2. Add or update properties for your storage type.

    WASB storage

    Add these properties:

    • fs.AbstractFileSystem.wasb.impl: org.apache.hadoop.fs.azure.Wasb

    • pentaho.runtime.fs.default.name: wasb://<container-name>@<storage-account-name>.blob.core.windows.net

    Example:

    ADLS (ABFS) storage

    Add this property:

    • pentaho.runtime.fs.default.name: abfs://<container-name>@<storage-account-name>.dfs.core.windows.net

    Example:

  3. Save the file.
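The examples are not shown here. As a sketch of step 2, the WASB properties would appear in core-site.xml as follows; the container and storage account names are placeholders, and for ADLS (ABFS) storage you would use the abfs:// value instead:

```xml
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
  <name>pentaho.runtime.fs.default.name</name>
  <!-- For ADLS (ABFS) storage, use
       abfs://mycontainer@mystorageaccount.dfs.core.windows.net instead. -->
  <value>wasb://mycontainer@mystorageaccount.blob.core.windows.net</value>
</property>
```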

Edit HBase site XML file

If you use HBase, update hbase-site.xml to set the temporary directory.

  1. Open hbase-site.xml.

  2. Add or update this property:

    • hbase.tmp.dir: /tmp/hadoop/hbase

  3. Save the file.
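The property from step 2, as an hbase-site.xml fragment:

```xml
<property>
  <name>hbase.tmp.dir</name>
  <value>/tmp/hadoop/hbase</value>
</property>
```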

Edit Hive site XML file

If you use Hive, update hive-site.xml to set the Hive metastore location.

  1. Open hive-site.xml.

  2. Add or update these properties:

    • hive.metastore.uris: Hive metastore URI, if different from your HDInsight instance.

    • fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net: Azure storage key provider principal, if required.

    Example:

  3. Save the file.
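A sketch of the step 2 properties in hive-site.xml; the hostname is a placeholder and 9083 is the conventional Hive metastore Thrift port:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
<!-- If required, also set the key provider property for your storage account,
     following the pattern fs.azure.account.keyprovider.STORAGEACCOUNT.blob.core.windows.net. -->
```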

Edit Mapred site XML file

If you use MapReduce, update mapred-site.xml for job history logging and cross-platform execution.

  1. Open mapred-site.xml.

  2. Ensure these properties exist:

    • mapreduce.jobhistory.address: where MapReduce job history logs are stored

    • mapreduce.job.hdfs-servers: HDFS servers used by YARN to run MapReduce jobs

    Example:

  3. Optional: If YARN containers run on JDK 11 nodes, add this property:

    • mapreduce.jvm.add-opens-as-default: false

circle-exclamation

Example:

  4. Save the file.
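A sketch of the mapred-site.xml properties from steps 2 and 3; the hostname is a placeholder, 10020 is the default JobHistory port, and 8020 is a typical NameNode RPC port:

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host.example.com:10020</value>
</property>
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://namenode-host.example.com:8020</value>
</property>
<!-- Only if YARN containers run on JDK 11 nodes (step 3). -->
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```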

Edit YARN site XML file

If you use YARN, verify your yarn-site.xml settings.

  1. Open yarn-site.xml.

  2. Add or update these properties:

    • yarn.resourcemanager.hostname: ResourceManager host name

    • yarn.resourcemanager.address: ResourceManager address and port

    • yarn.resourcemanager.admin.address: ResourceManager admin address and port

    Example:

  3. Save the file.
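A sketch of the step 2 properties in yarn-site.xml; the hostname is a placeholder, and 8032 and 8033 are the Hadoop default ResourceManager and ResourceManager admin ports:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager-host.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>resourcemanager-host.example.com:8033</value>
</property>
```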

circle-info

After you change these files, edit the named connection and upload the updated files.

Oozie configuration

If you use Oozie, configure both the cluster and the Pentaho server.

By default, the Oozie user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.

Set up Oozie on a cluster

Add your PDI user to oozie-site.xml.

  1. Open oozie-site.xml on the cluster.

  2. Add these properties. Replace <pdi-username> with the PDI user name.

  3. Save the file.
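The exact properties are not shown here. A sketch based on standard Oozie proxy-user configuration, with pdi-user standing in for your PDI user name; the wildcard values are permissive, so narrow them to specific hosts and groups in production:

```xml
<!-- Property names assumed from standard Oozie ProxyUserService configuration. -->
<property>
  <name>oozie.service.ProxyUserService.proxyuser.pdi-user.hosts</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.pdi-user.groups</name>
  <value>*</value>
</property>
```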

Set up Oozie on the server

Set the proxy user for the named cluster on the Pentaho server.

  1. Open config.properties:

    /<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

    This path is created when you create a named connection.

  2. Set pentaho.oozie.proxy.user to the proxy user name.

  3. Save the file.
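The resulting line in config.properties, with pdi-user standing in for the proxy user name:

```
pentaho.oozie.proxy.user=pdi-user
```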

Windows configuration for a secured cluster

If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your krb5.conf or krb5.ini.

  1. Go to server/pentaho-server.

  2. Open start-pentaho.bat.

  3. Set CATALINA_OPTS to include the Kerberos config path:

  4. Save the file.
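A sketch of the start-pentaho.bat change; the krb5.ini path is a placeholder for your environment, and any options already set in CATALINA_OPTS should be preserved:

```
rem In start-pentaho.bat; replace the path with the location of your krb5.conf or krb5.ini.
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.ini"
```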

Connect to HDI with the PDI client

After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.

See the Pentaho Data Integration documentation for how to connect the PDI client to a cluster.

Connect other Pentaho components to HDI

Create and test an Azure HDInsight (HDI) connection in:

  • Pentaho Server

  • Pentaho Metadata Editor (PME)

  • Pentaho Report Designer (PRD)

Prerequisites

Install a driver for the Pentaho Server. See Install a new driver.

Create and test connections

Create and test the connection in each product:

  • Pentaho Server (DI): Create a transformation in the PDI client. Run it remotely.

  • Pentaho Server (BA): Create a connection to HDI in the Data Source Wizard.

  • PME: Create a connection to HDI.

  • PRD: Create a connection to HDI.

After you connect, share connection details with users.

Users typically need:

  • HDI distribution and version

  • HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames, IP addresses, and ports

  • Oozie URL (if used)

  • Permissions for required HDFS directories, including user home directories

See Hadoop connection and access information list.

Cloudera Data Platform (CDP)

Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).

Before you begin

Before you set up Pentaho to connect to CDP, do these tasks:

  1. Check Components Reference. Verify your Pentaho version supports your CDP version.

  2. Prepare CDP:

    1. Configure Cloudera Data Platform.

      See the CDP documentation.

    2. Install required services and client tools.

    3. Test the platform.

  3. Get connection details from your platform administrator.

    You will share some of this information with users later.

    See Hadoop connection and access information list.

  4. Add the YARN user to the group defined by dfs.permissions.superusergroup.

    Find this property in hdfs-site.xml or in Cloudera Manager.

  5. Set up the Hadoop driver for your CDP version. See Install a new driver.

Set up a secured instance of CDP

If you connect to Kerberos-secured CDP, also do these tasks:

  1. Configure Kerberos on the platform.

    Include the realm, KDC, and administrative server.

  2. Configure these nodes to accept remote connection requests:

    • NameNode

    • DataNode

    • Secondary NameNode

    • JobTracker

    • TaskTracker

  3. If you deployed CDP using an enterprise program, set up Kerberos for these nodes:

    • NameNode

    • DataNode

    • Secondary NameNode

    • JobTracker

    • TaskTracker

  4. Add credentials to the Kerberos database for each Pentaho user.

  5. Verify each user has an operating system account on each CDP node.

    Add operating system users if needed.

circle-info

User account UIDs should be greater than min.user.id.

This value is usually 1000.

  6. Set up Kerberos on your Pentaho machines.

    See Administer Pentaho Data Integration and Analytics.

Edit configuration files for users

Cloudera administrators download site configuration files for the services you use.

They update the files with Pentaho-specific and user-specific settings.

Users then upload the files when they create a named connection.

Named connection files are stored here:

  • <username>/.pentaho/metastore/pentaho/NamedCluster

  • <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save the updated files in a known location for reuse.

Files to provide

  • config.properties

  • core-site.xml (secured CDP only)

  • hive-site.xml

  • mapred-site.xml

  • yarn-site.xml

circle-info

If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.

Edit Core site XML file

If you use a secured instance of CDP, update core-site.xml.

  1. Open core-site.xml.

  2. Add or update these properties:

    Property
    Value

    hadoop.proxyuser.oozie.hosts

    Oozie hosts on your CDP cluster.

    hadoop.proxyuser.oozie.groups

    Oozie groups on your CDP cluster.

    hadoop.proxyuser.<security_service>.hosts

    Proxy user hosts for other services on your CDP cluster.

    hadoop.proxyuser.<security_service>.groups

    Proxy user groups for other services on your CDP cluster.

    fs.s3a.access.key

    Your S3 access key, if you access S3 from CDP.

    fs.s3a.secret.key

    Your S3 secret key, if you access S3 from CDP.

  3. Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:

    Ensure the gateway node has valid AWS credentials (for example, under ~/.aws/).

  4. Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:

    • Remove these properties:

      • fs.azure.enable.delegation.token

      • fs.azure.delegation.token.provider.type

      • fs.azure.account.auth.type

      • fs.azure.account.oauth.provider.type

    • Add these properties:

      • fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net = SharedKey

      • fs.azure.account.key.<storage-account-name>.dfs.core.windows.net = <storage-account-key>

  5. Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:

    Assign the custom role to the Data Lake and Log service accounts for the bucket.

  6. Save the file.
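A sketch of the step 2 properties in core-site.xml; the wildcard values are permissive placeholders (narrow them to specific hosts and groups in production), and the S3 keys apply only if you access S3 from CDP:

```xml
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_S3_SECRET_KEY</value>
</property>
```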

Edit Hive site XML file

If you use Hive, update hive-site.xml to set the Hive metastore location.

  1. Open hive-site.xml.

  2. Add or update these properties:

    Property
    Value

    hive.metastore.uris

    Set this to the Hive metastore URI if it differs from your CDP cluster.

    hive.server2.enable.impersonation

    Set to true if you use impersonation.

    hive.server2.enable.doAs

    Set to true if you use impersonation.

    tez.lib.uris

    Required when you use Hive 3 on Tez.

    Example:

  3. Save the file.
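A sketch of the step 2 properties in hive-site.xml; the hostname is a placeholder, 9083 is the conventional metastore Thrift port, and the impersonation settings apply only if you use impersonation:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
<!-- With Hive 3 on Tez, also set tez.lib.uris to the location of the Tez libraries on your cluster. -->
```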

Edit Mapred site XML file

If you use MapReduce, update mapred-site.xml to set job history logging and allow cross-platform submissions.

  1. Open mapred-site.xml.

  2. Ensure these properties exist:

    Property
    Value

    mapreduce.jobhistory.address

    Where MapReduce job history logs are stored.

    mapreduce.app-submission.cross-platform

    Set to true to allow submissions from Windows clients to Linux servers.

    Example:

  3. Save the file.
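A sketch of the step 2 properties in mapred-site.xml; the hostname is a placeholder and 10020 is the default JobHistory port:

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host.example.com:10020</value>
</property>
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```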

Edit YARN site XML file

If you use YARN, verify your YARN settings in yarn-site.xml.

  1. Open yarn-site.xml.

  2. Add or update these properties:

    Property
    Value

    yarn.application.classpath

    Classpaths needed to run YARN applications. Use commas to separate multiple paths.

    yarn.resourcemanager.hostname

    Resource Manager host name for your environment.

    yarn.resourcemanager.address

    Resource Manager address and port for your environment.

    yarn.resourcemanager.admin.address

    Resource Manager admin address and port for your environment.

    yarn.resourcemanager.proxy-user-privileges.enabled

    Set to true if you use a proxy user.

    Example:

  3. Save the file.

circle-info

After you change these files, edit the named connection and upload the updated files.

Oozie configuration

If you use Oozie on your cluster, configure proxy access on the cluster and the server.

By default, the oozie user runs Oozie jobs.

If you start an Oozie job from PDI, configure a proxy user.

Set up Oozie on a cluster

Add your PDI user to oozie-site.xml.

  1. Open oozie-site.xml on the cluster.

  2. Add these properties.

    Replace <your_pdi_user_name> with your PDI user name.

  3. Save the file.

Set up Oozie on the server

Add the proxy user name to the PDI named connection configuration.

  1. Open this file:

    <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties

circle-info

This path is created when you create a named connection.

  2. Set pentaho.oozie.proxy.user to the proxy user name.

  3. Save the file.

Windows configuration for a secured cluster

If you run Pentaho Server on Windows and use Kerberos, set the path to your krb5.conf or krb5.ini file.

  1. Open server/pentaho-server/start-pentaho.bat.

  2. Add -Djava.security.krb5.conf to CATALINA_OPTS.

    Example:

  3. Save the file.

Connect to CDP with the PDI client

After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.

See Pentaho Data Integration for the client connection steps.

Connect other Pentaho components to CDP

Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

Create and test connections

Create and test a connection in each component.

  • Pentaho Server for Data Integration (DI): Create a transformation in the PDI client, then run it remotely.

  • Pentaho Server for Business Analytics (BA): Create a connection to CDP in the Data Source Wizard.

  • Pentaho Metadata Editor (PME): Create a connection to CDP in PME.

  • Pentaho Report Designer (PRD): Create a connection to CDP in PRD.

Share connection details with users

After you connect to CDP and its services, give connection details to users who need access.

Users typically need:

  • CDP distribution and version

  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, and port numbers

  • Oozie URL (if used)

  • Permission to access required HDFS directories, including home directories

Users might need more information, depending on the steps, entries, and services they use.

See Hadoop connection and access information list.

Google Dataproc

The following settings are available while you configure Pentaho to connect to Google Dataproc.

Before you begin

Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:

  1. Prepare your Google Cloud access:

    • Get credentials for a Google account and access to the Google Cloud Console.

    • Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.

  2. Contact your Hadoop administrator for cluster connection details.

You also need to provide some of this information to users after setup.

Create a Dataproc cluster

You can create a Dataproc cluster using several methods.

For cluster setup options, see the Google Cloud Documentation.

Install the Google Cloud SDK on your local machine

Use Google's instructions to install the Google Cloud SDK for your platform.

Set command variables

Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.

  1. Set the variables:

  2. Set PROJECT to your Google Cloud project ID.

  3. Set HOSTNAME to the name of the master node in your Dataproc cluster.

    Note: The master node name ends with -m.

  4. Set ZONE to the zone of the instances in your Dataproc cluster.
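The steps above can be sketched as a few shell assignments; all three values are placeholders for your own project ID, master node name, and zone:

```shell
# Placeholder values: substitute your own project, master node name, and zone.
PROJECT=my-gcp-project
HOSTNAME=my-dataproc-cluster-m   # Dataproc master node names end with -m
ZONE=us-central1-a
```

Run these in the shell session where you will run the later gcloud commands, so the variables are available for substitution.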

Set up a Google Compute Engine instance for PDI

Run the PDI client inside Google Compute Engine (GCE).

Users must connect remotely through VNC to use the desktop UI.

Because VM instances in GCE do not publicly expose the required remote desktop ports, create an SSH tunnel between the VNC client and the VM instance.

1

Create a VM instance and set network tags

  1. In the Google Cloud Console, open the Compute Engine console.

  2. Go to Compute Engine > VM instances.

  3. Select Create instance.

  4. Open Advanced options and then the Networking tab.

  5. In Network tags, enter vnc-server.

2

Install and configure VNC

  1. Install and update a VNC service for the remote UI.

  2. Install Gnome and VNC.

3

Connect using SSH and create an SSH tunnel

  1. Log in to the instance using SSH.

  2. Use an SSH client and the VM external IP.

    Note: The Google Cloud Console shows the external IP.

  3. Create an SSH tunnel from your VNC client machine.

  4. Connect to the VNC session.
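A sketch of the tunnel command, assuming the Google Cloud SDK is installed, the instance name and zone are placeholders, and the VNC server listens on display :1 (port 5901):

```
# Forward local port 5901 to the VNC server on the VM through SSH.
gcloud compute ssh my-vm-instance --zone us-central1-a -- -L 5901:localhost:5901

# Then point the VNC client at localhost:5901.
```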

4

Optional: Configure Kerberos

If you use Kerberos, configure Kerberos on the GCE VM.

Authenticate the client machine with the Kerberos controller.

This is required for Kerberos-enabled Dataproc clusters.

When you finish, you can run PDI in GCE.

You can design and launch jobs and transformations on Dataproc.

Edit configuration files for users

Your cluster administrator must download cluster configuration files.

Update the files with Pentaho-specific and user-specific values.

Use these files to create a named connection.

Where named connection files live

Named connection files are stored here:

  • Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster

  • Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>

  • Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save edited files in a safe location.

Files to provide to users

Provide these files to each user:

  • core-site.xml

  • hdfs-site.xml

  • mapred-site.xml

  • yarn-site.xml

  • hive-site.xml

circle-info

You can copy these files from a Dataproc cluster using SCP.
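A sketch of the copy, assuming the cluster name and zone are placeholders and the client configurations live under /etc/hadoop/conf and /etc/hive/conf, which is typical for Dataproc nodes:

```
# Copy the *-site.xml files from the Dataproc master node to the current directory.
gcloud compute scp --zone us-central1-a \
  my-dataproc-cluster-m:/etc/hadoop/conf/*-site.xml .
gcloud compute scp --zone us-central1-a \
  my-dataproc-cluster-m:/etc/hive/conf/hive-site.xml .
```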

Edit mapred-site.xml (MapReduce)

If you use MapReduce, update mapred-site.xml to enable cross-platform MapReduce job submission.

1

Open the file

Open mapred-site.xml from the folder where you saved the other *-site.xml files.

2

Add the property

Add this property:

This property is only required when you submit MapReduce jobs from Windows.

3

Save and apply the change

Save the file.

Edit the named connection.

Upload the updated mapred-site.xml.

Connect to a Hadoop cluster with the PDI client

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the Pentaho Data Integration documentation for instructions.

Connect other Pentaho components to Dataproc

Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:

  • Pentaho Server (DI and BA)

  • Pentaho Metadata Editor (PME)

  • Pentaho Report Designer (PRD)

Install a driver for the Pentaho Server

Install a driver for the Pentaho Server.

For instructions, see Install a new driver.

Create and test connections

Create and test a connection for each component:

  • Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.

  • PME: Create a connection to the cluster in PME.

  • PRD: Create a connection to the cluster in PRD.

Share connection details with users

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

  • Hadoop distribution and version

  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers

  • Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see Hadoop connection and access information list.

Tasks to be performed by a Pentaho administrator

As a Pentaho administrator, you configure data connections, manage security, and set up client tools.

Define data connections

PUC (Pentaho User Console) and PDI (Pentaho Data Integration) can access data from various sources.

If your data source is a database, create a database connection before you can use it.

Before you create a connection, install the correct driver for your database. Your IT administrator can help. See Specify data connections for the Pentaho Server.

Open the connection dialog box

You can open the Database Connection dialog box from PUC or PDI.

Open from Pentaho User Console (PUC)

  1. Log on to PUC.

  2. Select File > Manage Data Sources.

    The Manage Data Sources dialog box appears.

  3. In More actions and options, select New Connection.

    The Database Connection dialog box appears.

    Database Connection dialog box
  4. Enter your connection information.

  5. Click Test.

  6. Click OK.

Open from Pentaho Data Integration (PDI)

  1. Start the PDI client (Spoon).

  2. Create a new transformation or job.

  3. In the View tab, double-click the Database connections folder.

    The Database Connection dialog box appears.

    Database Connection dialog box
  4. Enter your connection information.

  5. Click Test.

  6. Click OK.

Notes for PDI connections

  • PDI can connect to multiple database vendors.

  • PDI includes JDBC drivers for PostgreSQL by default.

circle-exclamation

When you define a database connection in PDI, Spoon stores the connection information:

  • If you use the Pentaho Repository, the connection is stored in the repository. Other users can use it after they connect.

  • If you do not use the Pentaho Repository, Spoon stores it in XML with the transformation or job.

Before you define a JDBC connection, gather your database information. For example, gather the database type, port, user name, and password.

In PDI, you can set connection properties as variables. This helps jobs access multiple database types.

Use clean ANSI SQL where possible.

Enter database connection information

The connection information you enter depends on the database and access protocol.

Native (JDBC) protocol information

  1. In Connection Name, enter a descriptive name.

    The name can include spaces. Do not use special characters such as #, $, or %.

  2. In Database Type, select the database you want to use.

  3. In Access, select Native (JDBC).

  4. In Settings, enter the following information:

    Field
    Description

    Host Name

    Server name or IP address.

    Database Name

    Database name. If you use ODBC, enter the DSN.

    Port Number

    TCP/IP port number (if different from the default).

    User Name

    Optional user name for the database.

    Password

    Optional password for the database.

  5. Click Test.

  6. Click OK to close the test dialog.

  7. Click OK to save the connection.

    • In PUC, the connection appears in Manage Data Sources.

    • In PDI, it appears under Database connections in the View tab.

OCI protocol information (PDI only)

  1. In Connection Name, enter a descriptive name.

    The name can include spaces. Do not use special characters such as #, $, or %.

  2. In Database Type, select Oracle.

  3. In Access, select OCI.

  4. In Settings, enter the fields as directed by the Oracle OCI documentation.

    Field
    Description

    SID

    Oracle system ID.

    Tablespace for Data

    Tablespace where data is stored.

    Tablespace for Indices

    Tablespace where indexes are stored.

    User Name

    User name for the database.

    Password

    Password for the database.

  5. Click Test.

  6. Click OK to close the test dialog.

  7. Click OK to save the connection.

If you want to use Advanced, Options, or Pooling for OCI, refer to the Oracle documentation.

Connect to Snowflake using strong authentication

You can improve connection security by using key pair authentication.

  1. After you enter the Snowflake connection details in General, select Options.

  2. Set the following parameters:

    Parameter
    Value

    authenticator

    snowflake_jwt

    private_key_file

    Path to the private key file (for example, /rsa_key.p8).

    private_key_file_pwd

    Password for the private key file.

    For details, see the Snowflake JDBC docs: Private key file name and password as connection properties.

  3. Click Test.

  4. Click OK to close the test dialog.

  5. Click OK to save the connection.

Connect to an Azure SQL database

You can use an Azure SQL database as a data source in the PDI client. This connection is required for the Bulk load into Azure SQL DB job entry.

Pentaho supports:

Because one physical server can host databases for multiple customers, Azure SQL differs from SQL Server. For details, see Feature comparison.

Before you begin

You need:

  • An Azure account with an active subscription

  • An Azure SQL Database instance

  • Azure SQL database drivers installed

Also gather the following information:

  • Host name

  • Database name

  • Port number

  • Authentication method

  • User name

  • Password

If you use Always Encryption Enabled, also gather:

  • Client id

  • Client Secret Key

Authentication method

Pentaho supports four authentication methods:

  • SQL Authentication: Azure SQL Server user name and password.

  • Azure Active Directory: Multi-factor authentication (MFA).

  • Azure Active Directory with password: Azure AD user name and password.

  • Azure Active Directory with integrated authentication: Federated on-premises ADFS with Azure AD.

Connect to an Azure database

  1. Start the PDI client and create a new transformation or job.

  2. In the View tab of the Explorer pane, double-click Database connections.

  3. Enter your connection information:

    Field
    Description

    Host Name

    Azure SQL server instance name.

    Database Name

    Azure SQL database name.

    Port Number

    TCP port. Azure SQL uses TCP port 1433. Allow outbound TCP 1433.

    Authentication method

    Authentication method. Default is SQL Authentication.

    Username

    Database user name.

    Password

    Database password.

    Always Encryption Enabled

    Select this option to use columns protected with Always Encrypted. Requires the Client id and Client Secret Key fields.

    Client id

    Client id obtained when you register the application in Azure Active Directory.

    Client Secret Key

    Key value name in Azure Key Vault.

  4. Click Test to verify the connection.

Use the Always Encryption Enabled option

Before you can use Always Encryption Enabled, complete these steps using your Azure SQL tools:

  1. Generate a column master key in Azure Key Vault.

  2. Encrypt the column using the column master key.

  3. Register the app under Azure Active Directory.

  4. Obtain the Client id and Client Secret Key.

  5. Grant permissions to the Client id for accessing Azure Key Vault.

  6. Select Always Encryption Enabled.

  7. Enter the Client id and Client Secret Key.

Edit existing connections

Once a connection has been established, you can open the Database Connection dialog box to change it.

Edit database connections in PUC

  1. Select File > Manage Data Sources.

  2. Select the data source you want to edit.

  3. In More actions and options, select Edit.

  4. Select items in the left pane to refine the connection.

Edit database connections in PDI

  1. Open a transformation or job in the PDI client.

  2. In Explorer, open the View tab.

  3. Expand Database connections.

  4. Right-click a connection name, then select Edit.

  5. Select items in the left pane to refine the connection.

Specify advanced configuration (Advanced)

Use Advanced to set SQL-generation properties. You can use these settings to standardize behavior across tools.

  1. Open the Database Connection dialog box. See Open the connection dialog box.

  2. Select Advanced in the left pane.

  3. Configure options as needed:

    Identifier
    Description

    Supports the Boolean data type*

    Use native Boolean data types supported by the database.

    Supports the timestamp data type*

    Use the timestamp data type supported by the database.

    Quote all in database

    Enable case-sensitive table names by quoting identifiers.

    Force all to lower-case

    Force identifiers to lower-case.

    Force all to upper-case

    Force identifiers to upper-case.

    Preserve case of reserved words*

    Use the database reserved-words list.

    Preferred schema name*

    Preferred schema name (for example, MYSCHEMA).

    SQL statements

    SQL statement used to initialize the connection.

    * PDI-only options.

  4. Click Test.

  5. Click OK to close the test dialog.

  6. Click OK to save the connection.

Quoting

Pentaho uses database-specific quoting rules. This helps you use names that follow database naming conventions.

PUC and PDI include a reserved-words list for most supported databases.

Pentaho separates schema (table owner) from the table name. This supports table or field names that include periods (common in some ERP systems).

To avoid quoting errors, Pentaho does not apply quoting if the schema or table name already includes a start or end quotation mark. This lets you control quoting explicitly.

Set database-specific options (Options)

Use Options to set JDBC parameters for a specific database.

  1. Open the Database Connection dialog box. See Open the connection dialog box.

  2. Select Options in the left pane.

  3. In the next available row, enter a parameter name and value.

    For JDBC configuration help, select Help. A browser window opens with database-specific guidance for the selected Database Type.

  4. Click Test.

  5. Click OK to close the test dialog.

  6. Click OK to save the connection.

Define connection pooling (Pooling)

Connection pooling helps control database access. It can also help when your database license limits concurrent connections.

  1. Open the Database Connection dialog box. See Open the connection dialog box.

  2. Select Pooling in the left pane.

  3. Configure pooling options for your JDBC driver.

    Typical options include:

    • Enable Connection Pooling

    • Pool Size: initial and maximum

    • Parameters such as validationQuery

    Common validationQuery values:

    • Oracle: SELECT 1 FROM DUAL

    • PostgreSQL, SQL Server, and MySQL: SELECT 1

  4. Click Test.

  5. Click OK to close the test dialog.

  6. Click OK to save the connection.
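Pentaho's pooling options generally follow Apache Commons DBCP parameter naming. As an illustrative sketch (parameter names are DBCP's; the values below are examples entered in the Pooling panel, not recommended defaults):

```properties
# Illustrative DBCP-style pooling parameters as entered in the Pooling panel
initialSize=5            # connections created when the pool starts
maxActive=20             # upper limit on concurrent connections
maxIdle=5                # idle connections kept open for reuse
maxWait=10000            # ms to wait for a free connection before failing
validationQuery=SELECT 1 # cheap statement used to verify a connection
testOnBorrow=true        # validate each connection before handing it out
```

Keep maxActive at or below any concurrent-connection limit imposed by your database license.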

Connect to clusters (PDI only)

Use Clustering to create database connections to data partitions in PDI.

To create a new partition connection, enter:

  • Partition ID

  • Host Name

  • Port

  • Database Name

  • User Name

  • Password

If you run Pentaho Server as a cluster and use the Data Source Wizard (DSW) in PUC, a new data source is visible only on the cluster node where you created it.

To make a new DSW data source visible across all cluster nodes, disable DSW data source caching:

  1. Open server/pentaho-server/pentaho-solutions/system/system.properties.

  2. Set enableDomainIdCache=false.
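After the edit, the property in system.properties should read as follows (the comment lines are illustrative, not part of the shipped file):

```properties
# pentaho-solutions/system/system.properties
# false = do not cache DSW domains per node, so new data sources
# become visible across all cluster nodes
enableDomainIdCache=false
```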


Modify connections

You can perform additional tasks from PUC or the PDI client.

Modify connections from PUC

In PUC, open File > Manage Data Sources, then use More actions and options.

More actions and options menu in the PUC Manage Data Sources dialog box
Task
Description

Export

Exports connection information. A Metadata data source downloads as an XMI file, an Analysis data source downloads as a Mondrian XML file, and a Data Source Wizard data source downloads as an XMI file (reporting only) or a ZIP file (reporting and analysis).

Import Analysis

Imports a Mondrian schema file into a specified data source. You can select an available data source or enter parameters manually.

Import Metadata

Imports a model and creates a data source. Data Source Wizard (includes Analysis model) uses a ZIP created using Export. Metadata model uses an XMI file.

Modify connections from PDI

In PDI, right-click the connection name under Database connections in the View tab.

Task
Description

Duplicate

Duplicates a connection after you enter a new Connection Name.

Copy to clipboard

Copies the XML defining the connection.

SQL Editor

Executes SQL commands against an existing connection.

Clear DB Cache

Clears PDI’s database cache (use after schema changes).

Share

Shares the connection across transformations and jobs.

Explore

Opens Database Explorer to browse schemas and tables.

Show dependencies

Lists transformations and jobs that use the connection.

Delete connections

If you no longer need a connection, delete it in PUC or PDI.


Delete connections in PUC

  1. Select File > Manage Data Sources.

  2. Select the data source you want to delete.

  3. In More actions and options, select Delete.

Delete connections in PDI

  1. In Explorer, select the View tab.

  2. Expand Database connections.

  3. Right-click the connection name, then select Delete.

Assign permissions to use or manage database connections

You may have several connections to your data that you do not want to share with all of your users. When connected to the Pentaho Server, the PDI client (also known as Spoon) gives you the ability to make your data visible to only those users and roles that you specify. You can assign permission to allow users and roles to read, write, or delete the connection. Connection definitions are stored in the Pentaho Repository. The PDI client Repository Explorer enables you to browse the available connections and select the one for which you want to assign permissions.

  1. From within the PDI client, click on Tools > Repository > Explore.

    The Repository Explorer appears.

  2. Select the Connections tab.

  3. Select the connection for which you want to assign permissions.

  4. From the User/Role area, select the user or role for which you want to assign permissions.

  5. Check the permissions you want to assign to the selected user or role.

    Selection
    Selection Result

    Read

    For this user or role, the connection appears in the connection list and can be selected for use. Users or roles with permission to read a transformation or job, but without read permission for a referenced database connection, cannot open the transformation or job and receive an error message.

    Write

    This user or role can edit the connection definition.

    Delete

    This user or role can permanently remove the connection definition from the list.

    Manage Access Control

    This user or role can assign read, write, and delete permissions to other users or roles.

  6. Click Apply.

  7. Click Close to exit the dialog box.

You can also delegate the ability to assign these permissions to another user or role. For more information on managing users and roles, see With the PDI client.

Manage users and roles

If you are using basic Pentaho security, the Pentaho administrator may be tasked with creating and managing users and roles, including assigning permissions to allow users to access the content they need.

With PUC

Switch between user and role settings to add, delete, and edit users and roles in PUC.

See the Pentaho Business Analytics documentation for details.

With the PDI client

Manage users, roles, and permissions in the Pentaho Repository with the PDI client (Spoon).

You must sign in as an administrator, or with a role that has the Administer Security permission.


Before you begin

Back up relevant files before you change security settings.

  • If you installed PDI using the suite installer or custom methods, back up your Data Integration directories.

  • If you installed PDI using the manual method, back up pentaho.war and your solution files.

Sample users, default roles, and permissions

Use the built-in sample users and default roles as a starting point. They can guide your own role design.

  1. Open the PDI client and sign in to the repository.

  2. Select Tools > Repository > Explore.

  3. Select the Security tab.

    Available users are listed.

    [Figure: Repository Explorer Security tab]
  4. Choose what you want to manage:

    • Users: Select a user to see assigned roles and an optional description.

    • Roles: Select a role to see its permissions.

    • System Roles: Select a system role to see its permissions.

Default roles and sample users (typical setup):

  • Administrator (sample user: admin)

    • Administer Security

    • Schedule Content

    • Read Content

    • Publish Content

    • Create Content

    • Execute

    • Manage Data Sources

  • Power User (sample user: suzy)

    • Schedule Content

    • Read Content

    • Publish Content

    • Create Content

    • Execute

  • Report Author (sample user: tiffany)

    • Schedule Content

    • Publish Content

  • Business Analyst (sample user: pat)

    • Publish Content

Permission definitions:

  • Administer Security

    • Lets users access and manage content across perspectives.

    • Lets users view and manage schedules in the Schedules perspective.

    • The Administrator role effectively retains all permissions, even if unchecked.

  • Schedule Content

    • Lets users schedule content.

    • Lets users manage their own schedules.

  • Read Content

    • Lets users view content in perspectives.

  • Publish Content

    • Lets users store reports and data models in the repository.

    • With Create Content, lets users publish to the repository.

  • Create Content

    • Lets users create, import, delete, and save jobs and transformations.

    • Lets users see repository data sources used by jobs and transformations.

    • With Execute, lets users export, copy/paste, and save to VFS.

  • Execute

    • Lets users run, preview, debug, replay, verify, and schedule.

    • With Create Content, lets users export, copy/paste, and save to VFS.

  • Manage Data Sources

    • Lets users create, edit, or delete data sources.

    • Lets users view repository data sources.

Manage users

Use the Security tab to manage users. Select Users to see the user list.

Add users

  1. Select Users, then select the plus sign next to Available.

  2. Enter User Name and Password.

  3. Optional: Enter a Description.

  4. Optional: Under Member, select a role and select OK.

  5. Select OK to save.

Change user passwords

  1. Select Users, then select the user.

  2. Select the Edit icon.

  3. In Password, enter the new password.

  4. Select OK.

When you sign in to the PDI client for the first time, change the default administrator password.

Edit user information

  1. Select Users, then select the user in the Available list.

  2. Select the Edit icon.

  3. Update the user details.

  4. Select OK.

Delete users

  1. Select Users, then select the user in the Available list.

  2. Next to Available, select the X icon.

  3. Select Yes to confirm.

Caution: If you delete a user or role, repository content ownership and ACL entries remain unchanged. This can create a security risk if you later create a new user or role with the same name.

Safer alternatives:

  • If you are disabling a role, unassign all members from the role.

  • If you are disabling a user, reset the password to a cryptic, unknown value.

Hide a user Home folder in PDI

If your organization uses multi-tenancy, you may want to hide individual Home folders. This can help centralize user-created content in a secure folder.

See the Administer Pentaho Data Integration and Analytics documentation for details.

Manage roles and permissions

Select Roles to manage roles and their permissions.

Add roles

  1. Select Roles.

  2. Select the plus sign next to Available.

  3. Enter the Role Name.

  4. Optional: Enter a Description.

  5. Optional: Select users (Shift/Ctrl), then select the right arrow to assign them.

  6. Select OK.

Edit roles

  1. Select Roles.

  2. Select the role, then select the Edit icon.

  3. Make changes.

  4. Select OK.

Delete roles

  1. Select Roles.

  2. Select the role in the Available list.

  3. Select the X icon next to Available.

  4. Select Yes to confirm.

Assign users to roles

  1. Select Roles.

  2. Select the role.

  3. Next to Members, select the plus sign.

  4. Select users, then select the right arrow.

  5. Select OK.

Assign user permissions in the repository using the PDI client

Use role permissions to control what users can do. For example, create an admin role that can administer security and create content.

  1. Select Roles.

  2. In the Available list, select the role.

  3. In Permission, select or clear permissions.

  4. Select Apply.

Changes apply the next time affected users sign in.

Enable system role permissions

Pentaho requires the Authenticated system role for repository sign-in. Users are assigned Authenticated automatically at sign-in.

By default, Authenticated has Read Content permission. Adjust permissions as needed.

Note: The Anonymous system role is not used.

  1. Select System Roles.

  2. Select Authenticated.

  3. Under Permissions, select or clear permissions.

  4. Select Apply.

Make changes to the administrator role

You cannot edit the administrator role’s action-based permissions in the PDI client. The administrator role controls access to the Security tab.


Use a configuration change for these scenarios:

  • Delete the administrator role.

  • Remove administrator permission from the administrator role.

  • Configure LDAP.

  1. Shut down the Pentaho Server.

  2. Open repository.spring.xml in pentaho-server/pentaho-solutions/system.

  3. Find the immutableRoleBindingMap element.

  4. Replace the entire node with this XML.

    Replace yourAdminRole with the role that should have administrator permission.

  5. Restart the Pentaho Server.

Configure the design tools and utilities

Before using design tools and utilities, perform configuration tasks for each workstation running these tools.

Note: Depending on how these tools and utilities were installed, they might be located on different machines than the Pentaho Server.

BA design tools

The following table describes the BA design tools and their uses:

Design Tool
What it does

Pentaho Aggregation Designer

Optimizes the multidimensional Mondrian data model.

Pentaho Metadata Editor

Creates relational data models and refines the models created with the Data Access Wizard.

Pentaho Report Designer

A desktop report design tool used to create both simple and sophisticated, pixel-perfect operational reports that can be published to the Pentaho Server.

Pentaho Schema Workbench

Creates multidimensional Mondrian data models and refines the models created with the Data Access Wizard.

Your IT administrator must specify the data connections for these tools.

Start and stop BA design tools

Each BA design tool has its own startup command. Start and stop each tool separately.

Start BA design tools

The startup method depends on your operating system.

Note: Replace pentaho with your Pentaho installation directory.

On Windows, run the .bat file for the tool you want; on Linux, run the matching .sh script in the same directory.

  • Aggregation Designer: pentaho\design-tools\aggregation-designer\startaggregationdesigner.bat

  • Metadata Editor: pentaho\design-tools\metadata-editor\metadata-editor.bat

  • Report Designer: pentaho\design-tools\report-designer\report-designer.bat

  • Schema Workbench: pentaho\design-tools\schema-workbench\workbench.bat

You can also create shortcuts to these files.

Stop BA design tools

Use the standard exit method for your tool.

  • Aggregation Designer: Close the application window.

  • Metadata Editor, Report Designer, and Schema Workbench: Select File > Exit.

Save your work before you exit.

PDI design tools and utilities

Pentaho Data Integration allows users to design and manage complex ETL workloads with a graphical user interface.

Kitchen, Pan, and Carte are command-line tools for executing jobs and transformations modeled in the PDI client.

  • Use Pan and Kitchen command-line tools to work with transformations and jobs.

  • Use Carte clusters to:

    • Schedule jobs to run on a remote Carte server.

    • Stop Carte from the command line interface or URL.

    • Run transformations and jobs from the repository on the Carte server.

See the Pentaho Data Integration document for details on Kitchen, Pan, and Carte.
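As a sketch of typical command-line usage (the file names, paths, and port below are illustrative; see the Pentaho Data Integration documentation for the full option list):

```shell
# Run a transformation file with Pan (use pan.bat on Windows)
./pan.sh -file=/home/etl/sales_load.ktr -level=Basic

# Run a job file with Kitchen (kitchen.bat on Windows)
./kitchen.sh -file=/home/etl/nightly_build.kjb -level=Basic

# Start a Carte server listening on 127.0.0.1:8081
./carte.sh 127.0.0.1 8081
```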

Start and stop PDI design tools and utilities

Each PDI tool has a specific function, so you start and stop each one individually.

Starting PDI tools and utilities

Run the batch, script, or app files in the data-integration folder.

You can find data-integration in the folder where you installed the tool or utility.

Use the file for your operating system. The Windows batch files are listed below; on Linux, run the matching .sh scripts (for example, spoon.sh).

  • PDI client: spoon.bat

  • Carte: carte.bat

  • Kitchen: kitchen.bat

  • Pan: pan.bat

Tip: Consider making a shortcut for each .bat file. Add it to the desktop or the Start menu.

Stopping PDI design tools

Use these methods to stop the PDI design tools:

  • PDI client: Close the window or select File > Exit.

  • Carte: Stop it from the command line interface or URL.

  • Pan and Kitchen: Stop them from the command line interface.

For details on stopping Kitchen, Pan, or Carte, see the Pentaho Data Integration documentation.

Increase the PDI client memory limit

As a best practice, increase PDI's memory limit so the Pentaho Server and the PDI client (Spoon) can perform memory-intensive tasks, process or sort large datasets, and run complex transformations and jobs.

Increase the memory limit for both the Pentaho Server and the PDI client.

If you do not increase the memory limit, PDI uses the default settings in the startup scripts.

For Pentaho Server memory changes, see Increase Pentaho Server memory limit for custom installations on Windows or Linux.

Increase memory for the PDI client (Spoon)

  1. Exit from the PDI client if it is currently running.

  2. Open the PDI client startup script with a text editor.

    The startup script depends on your operating system:

    • Windows: pentaho/design-tools/data-integration/spoon.bat

    • Linux: pentaho/design-tools/data-integration/spoon.sh

  3. Modify the -Xmx value to a larger upper memory limit.

    This example allocates 2 GB of heap space. In spoon.sh, for example, the option line might look like this (the exact values in your script may differ):

      PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2g -XX:MaxPermSize=256m"

  4. Save and close the startup file.

  5. Start the PDI client.

    If you still see memory errors, increase the value again.

Alternative: Set an environment variable

Note: Instead of modifying the startup script, you can set the PENTAHO_DI_JAVA_OPTIONS environment variable on your client to -Xmx2g -XX:MaxPermSize=256m.
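For example, on Linux or macOS (note that -XX:MaxPermSize applies only to Java 7 and earlier; newer JVMs ignore it with a warning):

```shell
# Linux/macOS: set PDI client JVM options without editing spoon.sh
export PENTAHO_DI_JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=256m"

# Windows (Command Prompt) equivalent:
#   set PENTAHO_DI_JAVA_OPTIONS=-Xmx2g -XX:MaxPermSize=256m

# Confirm the value is visible to child processes such as Spoon
echo "$PENTAHO_DI_JAVA_OPTIONS"
```

Add the export to your shell profile to make the setting persistent across sessions.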
