# Pentaho configuration

You can use basic configuration tasks to set up the Pentaho Server, data connections, the Pentaho design tools, and Hadoop cluster connections so you can get started creating ETL solutions and data analytics. These configuration tasks can be performed through the Pentaho User Console (PUC), the Pentaho Data Integration (PDI) client, or by editing shell scripts and property files. A Pentaho administrator user name and password are required to perform configuration tasks through the user console.

These tasks are for IT and Pentaho administrators, as described in the following definitions:

* **IT administrator**

  Installs, configures, and upgrades the Pentaho Server. An IT administrator knows where the data is stored, how to connect to it, details about the computing environment, and how to use the command line on Microsoft Windows or Linux.
* **Pentaho administrator**

  Responsible for creation and management of users and roles, along with managing workstations so the ETL specialists and business analysts can create, publish, and share content.

**Note:** This guide assumes you have [installed](https://docs.pentaho.com/install/pentaho-installation-overview-cp) the Pentaho software.

## Tasks to be performed by an IT administrator

As an IT administrator, you configure the Pentaho Server and define security. If your team uses big data, you also configure access to Hadoop clusters.

* [Configure the Pentaho Server](#configure-the-pentaho-server)
* [Define security for the Pentaho Server](#define-security-for-the-pentaho-server)
* [Use password encryption with Pentaho](#use-password-encryption-with-pentaho)
* [Set up Pentaho to connect to a Hadoop cluster](#set-up-pentaho-to-connect-to-a-hadoop-cluster)

### Configure the Pentaho Server

Basic server tasks include starting and stopping the Pentaho Server, tuning memory, and setting up connections.

The following IT administrator tasks prepare the system for more specific [Pentaho administrator configuration tasks](https://docs.pentaho.com/install/pentaho-configuration), like defining connections and managing users and roles:

* [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository)
* [Create scripts for automatic stop and start of the Pentaho server and repository on Linux](#create-scripts-for-automatic-stop-and-start-of-the-pentaho-server-and-repository-on-linux)
* [Increase Pentaho Server memory limit for custom installations on Windows or Linux](#increase-pentaho-server-memory-limit-for-custom-installations-on-windows-or-linux)
* [Specify data connections for BA design tools](#specify-data-connections-for-ba-design-tools)
* [Specify data connections for the Pentaho Server](#specify-data-connections-for-the-pentaho-server)

#### Stop and start the Pentaho Server and repository

To perform maintenance work on the Pentaho Server or components that use the Pentaho Server, such as the Pentaho User Console and Pentaho Data Integration, stop the repository and server, perform the work, and then restart the repository and server.

{% hint style="warning" %}
Start the Pentaho Repository before you start the Pentaho Server.
{% endhint %}

You must be an IT administrator and know where the data that you are managing is stored, how to access that stored data, details about the computing environment, and how to use the command line for Windows or Linux.

1. Run the following control scripts to stop and start the Pentaho Server:

   | Task             | Script                                                |
   | ---------------- | ----------------------------------------------------- |
   | Start on Windows | `...\pentaho\server\pentaho-server\start-pentaho.bat` |
   | Stop on Windows  | `...\pentaho\server\pentaho-server\stop-pentaho.bat`  |
   | Start on Linux   | `.../pentaho/server/pentaho-server/start-pentaho.sh`  |
   | Stop on Linux    | `.../pentaho/server/pentaho-server/stop-pentaho.sh`   |

{% hint style="info" %}
If your Pentaho Repository is hosted on PostgreSQL, MySQL, MS SQL Server, or Oracle, consult that database's documentation for start and stop instructions.
{% endhint %}

<details>

<summary>Script arguments and services (custom start/stop scripts)</summary>

Use these standard arguments:

* `start`
* `stop`
* `restart`
* `status`
* `help`

Common service names:

* `pentahoserver`
* `postgresql`

</details>
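
Putting the pieces together, the stop, maintain, start cycle can be sketched as a small wrapper script. This is a sketch under assumptions: `SERVER_DIR` defaults to a typical archive-install path and must point at your `pentaho-server` directory, and it assumes the Pentaho Repository remains running throughout.

```sh
#!/bin/sh
# Sketch: run one maintenance command between a clean stop and start.
# SERVER_DIR is an assumed default; override it for your installation.
SERVER_DIR="${SERVER_DIR:-/pentaho/server/pentaho-server}"

maintain() {
  "$SERVER_DIR/stop-pentaho.sh"     # stop the server first
  "$@"                              # perform the maintenance work
  "$SERVER_DIR/start-pentaho.sh"    # bring the server back up
}
```

For example, `maintain cp new-driver.jar "$SERVER_DIR/tomcat/lib/"` would swap in a driver (a hypothetical file name) while the server is safely stopped.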

#### Create scripts for automatic stop and start of the Pentaho server and repository on Linux

If you used the manual installation to install Pentaho on Linux, create scripts for automatic start on boot and stop at shutdown for the Pentaho Server and Pentaho Repository.

You must have root permissions.

1. Navigate to `/etc/init.d/` and create a file named `pentaho`.
2. Open the `pentaho` file and enter the following content:

   ```sh
   #!/bin/sh
   ### BEGIN INIT INFO
   # Provides: start-pentaho stop-pentaho
   # Required-Start: networking postgresql
   # Required-Stop: postgresql
   # Default-Start: 2 3 4 5
   # Default-Stop: 0 1 6
   # Description: Pentaho Server
   ### END INIT INFO

   case "$1" in
     "start")
       su - pentaho -c "/home/pentaho/pentaho/server/pentaho-server/start-pentaho.sh"
       ;;
     "stop")
       su - pentaho -c "/home/pentaho/pentaho/server/pentaho-server/stop-pentaho.sh"
       ;;
     *)
       echo "Usage: $0 { start | stop }"
       ;;
   esac
   exit 0
   ```
3. Optional: Update the script for your environment.
   * If you are not using Red Hat Enterprise Linux, modify the details of the script to work with the OS, shells, and init systems that you are using. The script was tested only on Red Hat Enterprise Linux.
   * If you use an account other than the `pentaho` local user account to start services, replace `pentaho` with your account name.
   * If you use a MySQL or Oracle repository instead of PostgreSQL, replace `postgresql` with `mysql` or `oracle`.
   * If the solution repository runs on the same machine as the server, change `postgresql` to the name of the `init` script for your database.
   * If the solution repository runs on a remote computer, remove `postgresql` entirely and adjust the paths to the Pentaho Server scripts.
4. Save and close the `pentaho` file.
5. Navigate to `/home/pentaho/pentaho/server/pentaho-server` and open `start-pentaho.sh`.
6. In `start-pentaho.sh`, change the last `if` statement to match the following example:

   ```sh
   if [ "$?" = 0 ]; then
     cd "$DIR/tomcat/bin"
     export CATALINA_OPTS="-Xms4096m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"
     env JAVA_HOME=$_PENTAHO_JAVA_HOME sh ./startup.sh
   fi
   ```
7. Save and close `start-pentaho.sh`.
8. Make the init script executable:

   ```sh
   chmod +x /etc/init.d/pentaho
   ```
9. Add the init script to the standard run levels.

   If your distribution supports `update-rc.d`, run:

   ```sh
   update-rc.d pentaho defaults
   ```

{% hint style="info" %}
Many modern Linux distributions use `systemd` instead of SysV init scripts. If `update-rc.d` is not available, consult your distribution documentation for the equivalent.
{% endhint %}
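
On a `systemd` distribution, the SysV script above can instead be expressed as a unit file. The following sketch is illustrative only; the paths, the `pentaho` user, and the `postgresql.service` dependency are assumptions that must match your installation. For example, save it as `/etc/systemd/system/pentaho.service` and run `systemctl enable pentaho`:

```
[Unit]
Description=Pentaho Server
After=network.target postgresql.service

[Service]
Type=forking
User=pentaho
ExecStart=/home/pentaho/pentaho/server/pentaho-server/start-pentaho.sh
ExecStop=/home/pentaho/pentaho/server/pentaho-server/stop-pentaho.sh

[Install]
WantedBy=multi-user.target
```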

#### Increase Pentaho Server memory limit for custom installations on Windows or Linux

If you used a custom method to install Pentaho on Windows or Linux, consider increasing the memory limits so that the Pentaho Server and the PDI client (also known as Spoon) can perform memory-intensive tasks, such as sorting, grouping large datasets, or running complex transformations and jobs.

Increase the memory limit for both the Pentaho Server and the PDI client. If you do not increase the memory limit, PDI uses the default memory settings in the PDI startup scripts.

For instructions on increasing memory limits by editing the Tomcat startup script, see [Configure and start the Pentaho Server after manual installation](https://docs.pentaho.com/install/legacy-redirects/manual-installation-process-archived-content/configure-and-start-the-pentaho-server-after-manual-installation).

For instructions on increasing the memory limit in the PDI client, see [Modify the PDI client startup script](https://docs.pentaho.com/install/legacy-redirects/modify-the-pdi-client-startup-script).
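
For orientation only, both limits are ordinary JVM heap options set in the startup scripts; follow the linked instructions for the authoritative steps. The variable names below match recent stock scripts, but verify them against your version, and treat the heap sizes as example values:

```sh
# Pentaho Server: raise the Tomcat JVM heap (example sizes).
export CATALINA_OPTS="-Xms4096m -Xmx6144m"

# PDI client: raise the Spoon JVM heap (example sizes).
export PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2048m"
```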

#### Specify data connections for BA design tools

The BA design tools need a way to connect to where you store data, and to the Pentaho Server where you publish and share what you create in the design tools.

We support connecting to data stored in these ways:

* Pentaho data sources, such as relational Pentaho Metadata models or multidimensional Mondrian models.
* JDBC databases.

**JDBC database connections**

Your database access protocol affects how you create, test, and publish reports and models. Choose one of these protocols:

* **Native (JDBC)**: Standard JDBC connection details stored in each report or model.
* **JNDI**: A named connection. Properties are stored outside the report or model.
* **ODBC**: Not supported. Available only for compatibility.
* **OCI**: Oracle-only. Use when you must use Oracle Call Interface.
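
To make the difference concrete, here are hypothetical connection identifiers under three of the protocols (host, port, and database names are placeholders to replace with your own):

```
Native (JDBC): jdbc:postgresql://dbhost:5432/salesdw
JNDI:          SampleData   (a name resolved outside the report or model)
OCI:           jdbc:oracle:oci:@//dbhost:1521/ORCLPDB1
```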

**Choose an access protocol**

**Native (JDBC)**

Use Native (JDBC) when you need a quick setup or you are evaluating.

Keep this limitation in mind:

* If connection details change, you must update every report and model that embeds them.

Behavior by tool:

* **Report Designer and Metadata Editor**
  * Connection details are embedded in each report or Metadata model.
  * Report Designer lists the saved connection name after you define it.
  * Metadata Editor requires you to re-enter the connection name when you use it.
  * Before publishing, ensure the connection name matches a Native (JDBC) connection already defined on the Pentaho Server.
* **Schema Workbench and Aggregation Designer**
  * Connection details are not stored. Define the connection each time.
  * Before publishing, you need the Pentaho Server URL and the name of a Native (JDBC) connection already defined on that server.

Recommendation:

* Use for trials and evaluation.
* For production, switch to JNDI before publishing.

**JNDI**

Use JNDI when you need centralized maintenance and a distributed team.

Key characteristics:

* Reports and models store only the JNDI name.
* You update connection details in one place.

Behavior by tool:

* **Report Designer and Metadata Editor**
  * Connection properties live in a file on each workstation.
  * Before publishing, ensure the JNDI name matches a JNDI connection already defined on the Pentaho Server.
* **Schema Workbench and Aggregation Designer**
  * JNDI is not available for connecting from the tool.
  * Before publishing, you still need the Pentaho Server URL and a JNDI connection name already defined on the Pentaho Server.

Recommendation:

* Use for production deployments.

**ODBC (not supported)**

ODBC is not supported. Use it only if you have a hard dependency on it.

If you must use ODBC, contact [Pentaho Support](https://support.pentaho.com).

**OCI (Oracle only)**

If you connect to an Oracle database, you can use OCI. Use the steps in [Define JDBC or OCI connections for BA design tools](#define-jdbc-or-oci-connections-for-ba-design-tools).

**Define connections for BA design tools**

Use these sections to set up connections on workstations running BA design tools.

**Define JDBC or OCI connections for BA design tools**

1. Choose an access protocol in [Choose an access protocol](#choose-an-access-protocol).
2. Install the matching driver.
3. Set the access protocol later in each design tool when you define the data source.

**Add drivers**

The driver lets the design tools connect to your database. This connection helps you validate the model.

Your database administrator or IT team can provide the driver. If needed, download it from your database vendor.

Copy the driver to these directories on every machine that runs the design tools. Stop the design tools before you copy the file.

* Aggregation Designer: `/pentaho/design-tools/agg-designer/drivers/`
* Metadata Editor: `/pentaho/design-tools/metadata-editor/libext/JDBC/`
* Report Designer: `/pentaho/design-tools/report-designer/lib/jdbc/`
* Schema Workbench: `/pentaho/design-tools/schema-workbench/drivers/`

Keep one driver version per database in each directory. Remove older versions to avoid conflicts.

If you are unsure, contact [Pentaho Support](https://support.pentaho.com).
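
The copy step can be sketched as a small helper. This is a sketch, not an official tool: the function name, the example driver file name, and the install-root layout are assumptions to adapt.

```sh
# Sketch: copy one vendor JDBC driver into each design-tool directory.
# $1 is the driver jar; $2 is the Pentaho install root (defaults to ".").
# Stop the design tools before running this.
install_driver() {
  driver="$1"
  root="${2:-.}"
  for dir in \
    design-tools/agg-designer/drivers \
    design-tools/metadata-editor/libext/JDBC \
    design-tools/report-designer/lib/jdbc \
    design-tools/schema-workbench/drivers
  do
    mkdir -p "$root/$dir"          # ensure the target directory exists
    cp "$driver" "$root/$dir/"
  done
}
```

For example, `install_driver postgresql-42.7.3.jar /pentaho` copies one (hypothetical) driver jar into all four directories.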

**Microsoft SQL Server driver notes**

If you use Microsoft SQL Server, you might need the `jTDS` driver. `jTDS` is not vendor supported. Confirm the correct driver with [Pentaho Support](https://support.pentaho.com).

On Windows, many JDBC drivers support Type 2 integrated authentication using the `integratedSecurity` connection string property. To use it, copy `sqljdbc_auth.dll` to every machine and directory where you copied the JDBC driver.

You can find `sqljdbc_auth.dll` at:

`<installation directory>\sqljdbc_<version>\<language>\auth\`

* Use the `x64` DLL with a 64-bit JVM on an x64 processor.
* Use the `IA64` DLL with a 64-bit JVM on an Itanium processor.

**Define JNDI connections for Report Designer and Metadata Editor**

1. Choose an access protocol in [Choose an access protocol](#choose-an-access-protocol).
2. Complete these tasks on each workstation.

**Add drivers**

The driver lets design tools connect to your database and validate models.

Your database administrator or IT team can provide the driver. You can also download it from your database vendor.

Check [Components Reference](https://docs.pentaho.com/install/components-reference) to confirm your driver is supported.

Copy the driver to these directories on each machine where you installed design tools. Stop the tools first.

* Aggregation Designer: `/pentaho/design-tools/agg-designer/drivers/`
* Metadata Editor: `/pentaho/design-tools/metadata-editor/libext/JDBC/`
* Report Designer: `/pentaho/design-tools/report-designer/lib/jdbc/`
* Schema Workbench: `/pentaho/design-tools/schema-workbench/drivers/`

Keep only one version of the driver in each directory. Remove older versions to avoid conflicts.

If you are unsure, contact [Pentaho Support](https://support.pentaho.com).

**Configure Report Designer**

1. Stop Report Designer. Stop the Pentaho Server too, if it runs locally.

   For server steps, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Go to the `.pentaho` directory in the user profile.

   Examples:

   * Windows: `C:\Users\username\.pentaho\`
   * Linux or Solaris: `/home/username/.pentaho/`
3. Go to `~/.pentaho/simple-jndi/`. Create it if needed.
4. Edit `default.properties`. Create it if needed.

   Example:

   ```properties
   SampleData/type=javax.sql.DataSource
   SampleData/driver=org.hsqldb.jdbcDriver
   SampleData/user=pentaho_user
   SampleData/password=password
   SampleData/url=jdbc:hsqldb:mem:SampleData
   ```

   In this example, `SampleData` is the JNDI connection name.
5. Save the file.
6. Restart Report Designer. Restart the Pentaho Server too, if it runs locally.

Repeat this process on each workstation where you run Report Designer.

**Configure Metadata Editor**

1. Stop Metadata Editor. Stop the Pentaho Server too, if it runs locally.

   For server steps, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Go to the `metadata-editor/simple_jndi` directory under your Metadata Editor install.
3. Edit `JDBC.properties`. Create it if needed.

   Example:

   ```properties
   SampleData/type=javax.sql.DataSource
   SampleData/driver=org.hsqldb.jdbcDriver
   SampleData/user=pentaho_user
   SampleData/password=password
   SampleData/url=jdbc:hsqldb:mem:SampleData
   ```

   In this example, `SampleData` is the JNDI connection name.
4. Save the file.
5. Restart Metadata Editor. Restart the Pentaho Server too, if it runs locally.

Repeat this process on each workstation where you run Metadata Editor.

#### Specify data connections for the Pentaho Server

For business analytics, you can use the Pentaho User Console to define the connection to where you store data.

We support accessing data stored in the following ways:

* Comma-separated values (CSV) files, or any file that can be converted to CSV, such as spreadsheets, XML, or other semi-structured or tabular data files. Use CSV files if you are evaluating Pentaho or you want to get started quickly.
* JDBC drivers to help with database connections.

In Pentaho Data Integration (PDI), you can make connections in each job and transformation through an input step. Although users can create connections themselves, it is best to set up shared connections for your users so that they can select the connection they need from a list.

**JDBC database connections**

Use JDBC database connections when the Pentaho Server connects to an RDBMS.

You choose one of these access types:

* **Native (JDBC)**: Define the connection in Pentaho User Console (PUC).
* **JNDI**: Define a data source in your application server (Tomcat).
* **OCI**: Oracle-only option for using the Oracle Call Interface.
* **ODBC**: Available but not supported. Contact [Pentaho Support](https://support.pentaho.com) if you must use it.

**Choose between native (JDBC) and JNDI**

Use **native (JDBC)** when you want the simplest setup.

You manage the full connection definition in PUC.

If connection details change, update each affected connection in PUC.

Use **JNDI** when you want central management in the application server.

This is common in production.

It can also help keep credentials out of Pentaho admin workflows.

If connection details change, update the application server configuration.

**Add JDBC drivers**

The Pentaho Server needs the correct JDBC driver for your database.

Get the driver from your database vendor or your IT team.

The [Components Reference](https://docs.pentaho.com/install/components-reference) lists supported drivers.

Copy the driver to these directories:

* Pentaho Server: `/pentaho/server/pentaho-server/tomcat/lib/`
* PDI client: `/pentaho/design-tools/data-integration/lib/`

Restart the Pentaho Server and any PDI clients after you add drivers.

<details>

<summary>Driver conflict guidance</summary>

Keep only one driver per database type in each directory.

Remove older versions to avoid conflicts.

This matters when the driver matches your repository database type.

If you are unsure, contact [Pentaho Support](https://support.pentaho.com).

</details>

**Microsoft SQL Server driver**

For Microsoft SQL Server (MSSQL), you might need the third-party `jTDS` driver, which is not vendor supported.

Contact [Pentaho Support](https://support.pentaho.com) to confirm the correct driver.

Most Microsoft JDBC drivers support Type 2 integrated authentication on Windows.

Use the `integratedSecurity` property in the connection string.

To use integrated authentication, copy `sqljdbc_auth.dll` to every location where you copied the JDBC driver.

You can find `sqljdbc_auth.dll` here:

`<installation directory>\sqljdbc_<version>\<language>\auth\`

Choose the correct DLL:

* 64-bit JVM on an x64 processor: use the `x64` folder.
* 64-bit JVM on an Itanium processor: use the `IA64` folder.

**Set up JNDI connections for the Pentaho Server**

If you installed the Pentaho Server using the manual installation method, you set up JNDI data sources in [Tomcat](http://tomcat.apache.org/).

Use this section to add JNDI data sources and update existing ones.

**Before you begin**

* You need IT administrator access.
* You must know how to connect to your data sources.
* You must know your OS and database setup.

**Defining JNDI connections for PDI clients**

If you publish to the Pentaho Server from a PDI client, you can configure the PDI client with the same JNDI settings as the Pentaho Server.

This lets you develop and test transformations without keeping your application server running.

Edit the `jdbc.properties` file to mirror your application server data sources:

* Location: `/pentaho/design-tools/data-integration/simple-jndi/jdbc.properties`
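
For illustration, an entry in that file mirroring a server data source named `SampleData` might look like the following; the driver, credentials, and URL are hypothetical values to replace with your own:

```properties
SampleData/type=javax.sql.DataSource
SampleData/driver=org.postgresql.Driver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:postgresql://dbhost:5432/sampledata
```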

**Tomcat JNDI connections**

Use JNDI connections when you manage data sources in Tomcat.

Install the correct JDBC driver before you create or use a JNDI data source.
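
A Tomcat-managed data source is declared as a `Resource` element in the server's `context.xml`. The following is an illustrative sketch only: the name, driver, URL, credentials, and pool settings are assumptions to replace with your own values.

```xml
<Resource name="jdbc/SampleData" auth="Container" type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://dbhost:5432/sampledata"
          username="pentaho_user" password="password"
          maxActive="20" maxIdle="5" maxWait="10000" />
```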

**Next step: define the connection**

After the driver is installed, create the connection definition in Pentaho.

See [Define data connections](https://docs.pentaho.com/install/legacy-redirects/pentaho-configuration-redirects/define-data-connections).

#### See also

The native database connections in the Pentaho Suite are based on Java Database Connectivity (JDBC).

For Pentaho Server-specific connection setup, see:

* [Set up native (JDBC) or OCI data connections for the Pentaho Server](https://docs.pentaho.com/install/legacy-redirects/pentaho-server-jdbc-connection-redirects/set-up-native-jdbc-or-oci-data-connections-for-the-pentaho-server)
* [Set up JNDI connections for the Pentaho Server](https://docs.pentaho.com/install/legacy-redirects/pentaho-server-jdbc-connection-redirects/set-up-jndi-connections-for-the-pentaho-server)

### Define security for the Pentaho Server

You also need to establish a security plan for your Pentaho system. Pentaho supports two security options: Pentaho Security or an advanced security provider, such as LDAP, Single Sign-On, or Microsoft Active Directory.

This table can help you choose the option that is best for you.

| Explore Considerations | Pentaho Security (PUC and the PDI client) | Advanced security providers |
| ---------------------- | ----------------------------------------- | --------------------------- |
| Summary                | <p>Pentaho Security is the easiest way to configure security quickly. Both the Pentaho User Console and the PDI client enable you to define and manage users and roles. The Pentaho Server controls which users and roles can access web resources through the User Console or resources in the Pentaho Repository.</p><p>Pentaho Security works well if you do not have a security provider or if you have a user community with fewer than 100 users.</p> | <p>If you are already using a security provider, such as LDAP, Single Sign-On, or Microsoft Active Directory, you can use the users and roles you have already defined with Pentaho. Your security provider controls which users and roles can access Pentaho web resources through the User Console or resources in the Pentaho Repository.</p><p>Advanced security scales well for production and enterprise user communities.</p> |
| Expertise              | Knowledge of your user community and which users should have which roles in the Pentaho system. Knowledge about security in general is NOT required. | Knowledge of your user community and which users should have which roles in the Pentaho system. Knowledge about your particular security provider and its options is required. |
| Recommendation         | Recommended for the Pentaho trial download or evaluation only. **Caution:** Do not use for production. | Recommended for production. |

Depending on the security option you choose, refer to the following details in the **Administer Pentaho Data Integration and Analytics** document:

* If you want to use Pentaho Security, see **Manage Users and Roles in PUC** and **Manage Users and Roles in the PDI client**.
* If you want to use an advanced security provider, see **Implement Advanced Security**.
* For information on using Kerberos for security, see **How to Enable Kerberos Authentication**.

### Use password encryption with Pentaho

Strengthen security by using encrypted passwords for Pentaho applications.

These tasks are for IT administrators who have permission to modify files on the server and to stop and start the server. Perform them when you want to enhance your company's security by encrypting the passwords that are stored as plain text in configuration files, for example, to meet specific server security levels for regulatory compliance.

As a best practice, stop the server before modifying configuration files, then start the server when finished. After you configure a Pentaho product to use encrypted passwords, all logins to that product use the encrypted passwords. Connect to any databases whose configuration you edited to verify that the changes are correct.

Use this topic to:

* [Encrypting a password](#encrypting-a-password)
* [Use encrypted passwords with Pentaho products](#use-encrypted-passwords-with-pentaho-products)
* [After you update a product](#after-you-update-a-product)

Encrypted passwords are supported for:

* [Pentaho Data Integration (PDI)](#pentaho-data-integration-pdi)
* [Pentaho User Console (PUC)](#pentaho-user-console-puc)
  * [PUC email](#puc-email)
* [Pentaho Aggregation Designer](#pentaho-aggregation-designer)
* [Pentaho Metadata Editor (PME)](#pentaho-metadata-editor-pme)
* [Pentaho Report Designer (PRD)](#pentaho-report-designer-prd)

You can also use encrypted passwords with JDBC security. See the **Administer Pentaho Data Integration and Analytics** document for instructions on switching to JDBC security.

#### Encrypting a password

Perform the following steps on the machine with the Pentaho Server to create an encrypted password.

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. At the command line, navigate to the `server/pentaho-server` directory.
3. Run `encr.bat` (Windows) or `encr.sh` (Linux).

   Example:

   ```
   encr -kettle <password>
   ```

   The console prints the encrypted password.

   **Note:** You must have a JRE or JDK installed.
4. Restart the server.

#### Use encrypted passwords with Pentaho products

How you apply an encrypted password varies by product.

* [Pentaho Data Integration (PDI)](#pentaho-data-integration-pdi)
* [Pentaho User Console (PUC)](#pentaho-user-console-puc)
  * [PUC email](#puc-email)
* [Pentaho Aggregation Designer](#pentaho-aggregation-designer)
* [Pentaho Metadata Editor (PME)](#pentaho-metadata-editor-pme)
* [Pentaho Report Designer (PRD)](#pentaho-report-designer-prd)

**Pentaho Data Integration (PDI)**

Perform the following steps to use an encrypted password with Pentaho Data Integration (PDI).

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Navigate to the `design-tools/data-integration/simple-jndi` directory.
3. Open the `jdbc.properties` file in a text editor.
4. Replace all instances of the `password` value with the encrypted password.
5. Save and close the file.
6. Restart the server and verify the change.
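
For illustration, a replaced entry might look like the following fragment. Kettle-encrypted values produced by the `encr` tool carry an `Encrypted ` prefix; the connection name and the hex string here are made-up placeholders:

```properties
SampleData/user=pentaho_user
SampleData/password=Encrypted 2be98afc86aa7f2e4cb79ce10df90acde
```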

**Pentaho User Console (PUC)**

Perform the following steps to use an encrypted password with the Pentaho User Console (PUC).

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Navigate to the `server/pentaho-server/tomcat/webapps/pentaho/META-INF` directory.
3. Open the `context.xml` file in a text editor.
4. Replace the `password` value in every `Resource` element with the encrypted password.
5. Save and close the file.
6. Restart the server and verify the change.
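
For illustration, a `Resource` element after the replacement might look like this fragment; the resource name, user, driver, and URL are illustrative and stay as defined in your own `context.xml`, with the encrypted value shown as a made-up placeholder:

```xml
<Resource name="jdbc/Hibernate" auth="Container" type="javax.sql.DataSource"
          username="hibuser"
          password="Encrypted 2be98afc86aa7f2e4cb79ce10df90acde"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/hibernate" />
```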

**PUC email**

After you configure PUC to use an encrypted password, you can use that password with PUC email.

1. Log in to PUC as an administrator.
2. Open the Administration Perspective.
3. Select the Mail server section.
4. Enter your encrypted password value in the `password` field.

   **Note:** If you use Gmail, enable **Allow less secure apps to access your account**.
5. Select **Test Email Configuration**.
6. Verify that PUC sends an email to the address you specified.

**Pentaho Aggregation Designer**

To use encrypted passwords with Pentaho Aggregation Designer, you must first centralize your passwords in a `jndi.properties` file.

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Create a `jndi.properties` file with the default properties:

   ```
   java.naming.factory.initial=org.osjava.sj.SimpleContextFactory
   org.osjava.sj.root=file://C:/Users/<username>/.pentaho/simple-jndi
   org.osjava.sj.delimiter=/
   ```
3. Save `jndi.properties` in the `design-tools/aggregation-designer/lib` directory.
4. In the user’s home directory, navigate to the `.pentaho/simple-jndi` directory.
   * Open `default.properties` in a text editor. If the file does not exist, create it.
   * If you created `default.properties` under `design-tools/aggregation-designer/simple-jndi` instead, update `org.osjava.sj.root` in `design-tools/aggregation-designer/lib/jndi.properties` to point to that directory. Example:

     ```
     org.osjava.sj.root=file://<install directory>/design-tools/aggregation-designer/simple-jndi
     ```
5. Replace the `password` value in every property in `default.properties` with the encrypted password.

   **Note:** If you use a remote repository, replace `localhost` with the repository IP address.
6. Save and close the file.
7. Restart the server and verify the change.
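Entries in `default.properties` follow the same simple-JNDI format as the other Pentaho JNDI files. For example (the connection name, driver, URL, and encrypted value are illustrative):

```
Hibernate/type=javax.sql.DataSource
Hibernate/driver=org.postgresql.Driver
Hibernate/url=jdbc:postgresql://localhost:5432/hibernate
Hibernate/user=hibuser
Hibernate/password=Encrypted 2be98afc86aa7f2e4cb79ce10df90acde
```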

**Pentaho Metadata Editor (PME)**

The Pentaho Metadata Editor (PME) stores passwords in the JNDI connection `default.properties` file. For setup details, see [Define JNDI connections for Report Designer and Metadata Editor](https://docs.pentaho.com/install/legacy-redirects/define-data-connections-redirects/define-jndi-connections-for-report-designer-and-metadata-editor).

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. In the user’s home directory, navigate to the `.pentaho/simple-jndi` directory.
3. Open `default.properties` in a text editor.

   **Note:** If `default.properties` does not exist, create it.
4. Replace the `password` value in every property with the encrypted password.

   **Note:** If you use a remote repository, replace `localhost` with the repository IP address.
5. Save and close the file.
6. Restart the server and verify the change.

**Pentaho Report Designer (PRD)**

The Pentaho Report Designer (PRD) stores passwords in the JNDI connection `default.properties` file. For setup details, see [Define JNDI connections for Report Designer and Metadata Editor](https://docs.pentaho.com/install/legacy-redirects/define-data-connections-redirects/define-jndi-connections-for-report-designer-and-metadata-editor).

1. Stop the server.

   For instructions, see [Stop and start the Pentaho Server and repository](#stop-and-start-the-pentaho-server-and-repository).
2. Navigate to the `design-tools/report-designer/configuration-template/simple-jndi` directory.
3. Open `default.properties` in a text editor.
4. Replace the `password` value in every property with the encrypted password.

   **Note:** If you use a remote repository, replace `localhost` with the repository IP address.
5. Save the file.
6. Copy `default.properties` to the `.pentaho/simple-jndi` directory in the user’s home directory. Replace the existing file.

   **Note:** If the `.pentaho/simple-jndi` directory does not exist, create it.
7. Restart the server and verify the change.

#### After you update a product

After you configure a product to use encrypted passwords, all logins with that product use encrypted passwords.

Connect to any databases you updated to verify the changes.

### Set up Pentaho to connect to a Hadoop cluster

Use this topic to configure Pentaho to connect to Hadoop clusters.

Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.

Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.

You can connect to clusters and services from these Pentaho components:

* PDI client (Spoon), along with Kitchen and Pan command line tools
* Pentaho Server
* Analyzer (PAZ)
* Pentaho Interactive Reports (PIR)
* Pentaho Report Designer (PRD)
* Pentaho Metadata Editor (PME)

Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).

To confirm which drivers are supported for your version, see the [Components Reference](https://docs.pentaho.com/install/components-reference).

Drivers are shipped as vendor-specific builds of the optional `pentaho-big-data-ee-plugin`.

Download drivers from the [Hitachi Vantara Lumada and Pentaho Support Portal](https://support.pentaho.com/hc/en-us).

**Note:** Pentaho ships with a generic Apache Hadoop driver. For specific vendor drivers, visit the [Hitachi Vantara Lumada and Pentaho Support Portal](https://support.pentaho.com/hc/en-us) to download the drivers.

#### Install a new driver

You need a driver for each cluster vendor and version you connect to from:

* PDI client (Spoon), plus Kitchen and Pan
* Pentaho Server
* Analyzer
* Interactive Reports
* Pentaho Report Designer (PRD)
* Pentaho Metadata Editor (PME)

{% hint style="info" %}
Pentaho ships with a generic Apache Hadoop driver. Download vendor-specific drivers from the Support Portal.
{% endhint %}

{% stepper %}
{% step %}
**Download the driver plugin**

1. Sign in to the [Support Portal](https://support.pentaho.com/hc/en-us).
2. Go to **Downloads**.
3. In the **11.0** list, open the full downloads list.
4. Open **Pentaho 11.0 GA Release**.
5. Download the driver plugin from `Big Data Shims`.

Common driver plugin files:

* Apache Vanilla: `pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip`
* Cloudera Data Platform: `pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip`
* Google Dataproc: `pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip`
* Amazon EMR: `pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip`
* Azure HDInsight: `pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip`
{% endstep %}

{% step %}
**Install the driver on the PDI client**

1. Stop PDI.
2. Extract the downloaded `.zip` into:
   * `<pdi-install-dir>/data-integration/plugins`
3. If you are replacing an existing driver plugin, remove the old `pentaho-big-data-ee-plugin` folder first.
{% endstep %}

{% step %}
**Install the driver on the Pentaho Server**

1. Stop the Pentaho Server.
2. Extract the downloaded `.zip` into:
   * `<pentaho-server>/pentaho-solutions/system/kettle/plugins`
3. If you are replacing an existing driver plugin, remove the old `pentaho-big-data-ee-plugin` folder first.
{% endstep %}

{% step %}
**Restart and verify**

1. Restart the PDI client and the Pentaho Server.
2. Create or update your cluster connection and verify it connects.
{% endstep %}
{% endstepper %}

#### Update drivers

When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.

#### Additional configurations for specific distributions

Use these settings when you configure Pentaho to connect to specific Hadoop distributions:

* [Amazon EMR](#amazon-emr)
* [Azure HDInsight](#azure-hdinsight)
* [Cloudera Data Platform (CDP)](#cloudera-data-platform-cdp)
* [Google Dataproc](#google-dataproc)

**Amazon EMR**

The following settings are available while you configure Pentaho to connect to a working Amazon EMR cluster.

{% hint style="info" %}
EMR clusters (version 7.x and later) built with JDK 17 exclude `commons-lang-2.6.jar` from standard Hadoop library directories (such as `$HADOOP_HOME/lib`).

To use the EMR driver with EMR 7.x:

1. Download `commons-lang-2.6.jar` from a trusted source (for example, [Maven Repository: commons-lang » commons-lang » 2.6](https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6)).
2. Copy the JAR to `$HADOOP_HOME/lib` or `$HADOOP_MAPRED_HOME/lib` on every EMR node.
{% endhint %}

**Before you begin**

Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:

1. Check the [Components Reference](https://docs.pentaho.com/install/components-reference) to confirm your Pentaho version supports your EMR version.
2. Prepare your Amazon EMR cluster:
   1. Configure an Amazon EC2 cluster.
   2. Install required services and service client tools.
   3. Test the cluster.
3. Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.
4. Get connection details from your Hadoop administrator.
5. Add the YARN user on the cluster to the group defined by `dfs.permissions.superusergroup` in `hdfs-site.xml`.
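Step 5 refers to the HDFS superuser group property in `hdfs-site.xml`, which looks like the following. `supergroup` is the Hadoop default value; your cluster may use a different group:

```xml
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```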

{% hint style="info" %}
As a best practice, install PDI on the Amazon EC2 instance.

Otherwise, you may not be able to read or write cluster files.

For a workaround, see [Unable to read or write files to HDFS on the Amazon EMR cluster](https://docs.pentaho.com/install/legacy-redirects/use-hadoop-with-pentaho-redirects/unable-to-read-or-write-files-to-hdfs-on-amazon-emr-cluster).
{% endhint %}

You also need to share connection details with users after setup.

For the full list, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

**Edit configuration files for users**

Your cluster administrator must download the cluster configuration files and update them with Pentaho-specific and user-specific values.

Use these updated files to create or update a named connection.

**Where named connection files live**

Named connection files are stored here:

* Named connection XML: `<username>/.pentaho/metastore/pentaho/NamedCluster`
* Named connection config folder: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>`
* Extra settings file: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save edited files in a safe location.

**Files to provide to users**

Provide these files to each user:

* `core-site.xml`
* `mapred-site.xml`
* `hdfs-site.xml`
* `yarn-site.xml`

**Verify or edit core-site.xml file**

{% hint style="info" %}
If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in `core-site.xml`.
{% endhint %}

Edit `core-site.xml` to add AWS access keys and (optional) LZO compression settings.

{% stepper %}
{% step %}
**Open the file**

Open `core-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}
**Add AWS credentials**

Add your AWS Access Key ID and secret access key:

```xml
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>
```

{% endstep %}

{% step %}
**Optional: Add S3N credentials**

If you use S3N, add these properties:

```xml
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>
```

{% endstep %}

{% step %}
**Add filesystem implementation settings**

Add these properties:

```xml
<property>
   <name>fs.s3n.impl</name>
   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

<property>
   <name>fs.s3.impl</name>
   <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
   <name>fs.s3a.impl</name>
   <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
```

{% endstep %}

{% step %}
**Configure LZO compression**

If you are not using LZO compression, remove any references to `com.hadoop.compression.lzo.LzoCodec` from `core-site.xml`.

If you are using LZO compression:

1. Download the LZO JAR from <http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/>.
2. Add it to `pentaho-big-data-plugin/hadoop-configurations/emr3x/lib`.
{% endstep %}

{% step %}
**Save and apply the change**

Save the file.

Update the named connection.

Upload the updated `core-site.xml`.
{% endstep %}
{% endstepper %}

**Edit mapred-site.xml file**

If you use MapReduce, edit `mapred-site.xml` to enable cross-platform MapReduce job submission.

{% stepper %}
{% step %}
**Open the file**

Open `mapred-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}
**Add the property**

Add this property:

```xml
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```

This property is required only when you submit MapReduce jobs from Windows clients.
{% endstep %}

{% step %}
**Save and apply the change**

Save the file.

Update the named connection.

Upload the updated `mapred-site.xml`.
{% endstep %}
{% endstepper %}

**Connect to a Hadoop cluster with the PDI client**

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the **Pentaho Data Integration** documentation for instructions.

**Connect other Pentaho components to the Amazon EMR cluster**

Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:

* Pentaho Server (DI and BA)
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Install a driver for the Pentaho Server**

Install a driver for the Pentaho Server.

For instructions, see [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test a connection for each component:

* **Pentaho Server for DI**: Create a transformation in the PDI client and run it remotely.
* **Pentaho Server for BA**: Create a connection to the cluster in the Data Source Wizard.
* **PME**: Create a connection to the cluster in PME.
* **PRD**: Create a connection to the cluster in PRD.

**Share connection details with users**

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

* Hadoop distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
* Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

**Azure HDInsight**

Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).

**Before you begin**

Before you set up Pentaho to connect to HDI, do the following:

1. Check [Components Reference](https://docs.pentaho.com/install/components-reference). Confirm your Pentaho version supports your HDI version.
2. Prepare your HDI instance:
   1. Configure your Azure HDInsight instance.
   2. Install required services and client tools.
   3. Test the platform.
   4. If HDI uses Kerberos, complete the Kerberos steps on this page.
3. Get connection details from your platform admin. You will share some of this information with users later. See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).
4. Add the YARN user to the group defined by `dfs.permissions.superusergroup` in `hdfs-site.xml`.
5. Set up the Hadoop driver for your HDI version. See [Install a new driver](#install-a-new-driver).

**Kerberos-secured HDInsight instances**

If you connect to HDI secured with Kerberos, complete these steps first:

1. Configure Kerberos security on the platform. Configure the Kerberos realm, KDC, and admin server.
2. Configure these nodes to accept remote connection requests:
   * NameNode
   * DataNode
   * Secondary NameNode
   * JobTracker
   * TaskTracker
3. If you deployed HDI using an enterprise program, set up Kerberos for those nodes.
4. Add user credentials to the Kerberos database for each Pentaho user.
5. Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.

{% hint style="info" %}
User account UIDs should be greater than `min.user.id`. The default is usually `1000`.
{% endhint %}

6. Set up Kerberos on your Pentaho machines. See the *Administer Pentaho Data Integration and Analytics* guide.

**Edit configuration files for users**

Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.

Named connection files are stored in these locations:

* `<username>/.pentaho/metastore/pentaho/NamedCluster`
* `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save the updated files in a known location for reuse.

**Files to provide**

* `core-site.xml` (secured HDInsight only)
* `hbase-site.xml`
* `hive-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`

{% hint style="info" %}
If you update these files after creating a named connection, edit the named connection and re-upload the updated files.
{% endhint %}

**Edit Core site XML file**

If you use a secured instance of Azure HDInsight, update `core-site.xml`.

1. Open `core-site.xml`.
2. Add or update properties for your storage type.

   **WASB storage**

   Add these properties:

   * `fs.AbstractFileSystem.wasb.impl`: `org.apache.hadoop.fs.azure.Wasb`
   * `pentaho.runtime.fs.default.name`: `wasb://<container-name>@<storage-account-name>.blob.core.windows.net`

   Example:

   ```xml
   <property>
     <name>fs.AbstractFileSystem.wasb.impl</name>
     <value>org.apache.hadoop.fs.azure.Wasb</value>
   </property>

   <property>
     <name>pentaho.runtime.fs.default.name</name>
     <value>wasb://&lt;container-name&gt;@&lt;storage-account-name&gt;.blob.core.windows.net</value>
   </property>
   ```

   **ADLS (ABFS) storage**

   Add this property:

   * `pentaho.runtime.fs.default.name`: `abfs://<container-name>@<storage-account-name>.dfs.core.windows.net`

   Example:

   ```xml
   <property>
     <name>pentaho.runtime.fs.default.name</name>
     <value>abfs://&lt;container-name&gt;@&lt;storage-account-name&gt;.dfs.core.windows.net</value>
   </property>
   ```
3. Save the file.

**Edit HBase site XML file**

If you use HBase, update `hbase-site.xml` to set the temporary directory.

1. Open `hbase-site.xml`.
2. Add or update this property:
   * `hbase.tmp.dir`: `/tmp/hadoop/hbase`
3. Save the file.
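Expressed in the same form as the other site files, the property is:

```xml
<property>
  <name>hbase.tmp.dir</name>
  <value>/tmp/hadoop/hbase</value>
</property>
```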

**Edit Hive site XML file**

If you use Hive, update `hive-site.xml` to set the Hive metastore location.

1. Open `hive-site.xml`.
2. Add or update these properties:

   * `hive.metastore.uris`: Hive metastore URI, if different from your HDInsight instance.
   * `fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net`: Azure storage key provider principal, if required.

   Example:

   ```xml
   <property>
     <name>hive.metastore.uris</name>
     <value>thrift://&lt;metastore-hostname&gt;:9083</value>
   </property>
   ```
3. Save the file.
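The key-provider property from step 2 might look like the following. The storage account name is a placeholder, and `SimpleKeyProvider` is one possible provider class from the `hadoop-azure` library; your environment may require a different provider:

```xml
<property>
  <name>fs.azure.account.keyprovider.mystorageaccount.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.SimpleKeyProvider</value>
</property>
```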

**Edit Mapred site XML file**

If you use MapReduce, update `mapred-site.xml` for job history logging and cross-platform execution.

1. Open `mapred-site.xml`.
2. Ensure these properties exist:

   * `mapreduce.jobhistory.address`: where MapReduce job history logs are stored
   * `mapreduce.job.hdfs-servers`: HDFS servers used by YARN to run MapReduce jobs

   Example:

   ```xml
   <property>
     <name>mapreduce.jobhistory.address</name>
     <value>&lt;active-node-hostname&gt;:10020</value>
   </property>

   <property>
     <name>mapreduce.job.hdfs-servers</name>
     <value>hdfs://&lt;active-node-hostname&gt;:8020</value>
   </property>
   ```
3. Optional: If YARN containers run on JDK 11 nodes, add this property:
   * `mapreduce.jvm.add-opens-as-default`: `false`

{% hint style="warning" %}
Do not add `mapreduce.jvm.add-opens-as-default` for containers running on JDK 17 nodes.
{% endhint %}

Example:

```xml
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```

4. Save the file.

**Edit YARN site XML file**

If you use YARN, verify your `yarn-site.xml` settings.

1. Open `yarn-site.xml`.
2. Add or update these properties:

   * `yarn.resourcemanager.hostname`: ResourceManager host name
   * `yarn.resourcemanager.address`: ResourceManager address and port
   * `yarn.resourcemanager.admin.address`: ResourceManager admin address and port

   Example:

   ```xml
   <property>
     <name>yarn.resourcemanager.hostname</name>
     <value>&lt;resource-manager-hostname&gt;</value>
   </property>

   <property>
     <name>yarn.resourcemanager.address</name>
     <value>&lt;resource-manager-hostname&gt;:8050</value>
   </property>

   <property>
     <name>yarn.resourcemanager.admin.address</name>
     <value>&lt;resource-manager-hostname&gt;:8141</value>
   </property>
   ```
3. Save the file.

{% hint style="info" %}
After you change these files, edit the named connection and upload the updated files.
{% endhint %}

**Oozie configuration**

If you use Oozie, configure both the cluster and the Pentaho server.

By default, the Oozie user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.

**Set up Oozie on a cluster**

Add your PDI user to `oozie-site.xml`.

1. Open `oozie-site.xml` on the cluster.
2. Add these properties. Replace `<pdi-username>` with the PDI user name.

   ```xml
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.groups</name>
     <value>*</value>
   </property>
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.hosts</name>
     <value>*</value>
   </property>
   ```
3. Save the file.

**Set up Oozie on the server**

Set the proxy user for the named cluster on the Pentaho server.

1. Open `config.properties`:

   `/<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

   This path is created when you create a named connection.
2. Set `pentaho.oozie.proxy.user` to the proxy user name.
3. Save the file.
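After the edit, `config.properties` contains a line such as the following, where `pdiuser` is a placeholder for your proxy user name:

```
pentaho.oozie.proxy.user=pdiuser
```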

**Windows configuration for a secured cluster**

If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your `krb5.conf` or `krb5.ini`.

1. Go to `server/pentaho-server`.
2. Open `start-pentaho.bat`.
3. Set `CATALINA_OPTS` to include the Kerberos config path:

   ```bat
   set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
   ```
4. Save the file.

**Connect to HDI with the PDI client**

After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.

See the *Pentaho Data Integration* documentation for how to connect the PDI client to a cluster.

**Connect other Pentaho components to HDI**

Create and test an Azure HDInsight (HDI) connection in:

* Pentaho Server
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Prerequisites**

Install a driver for the Pentaho Server. See [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test the connection in each product:

* **Pentaho Server (DI)**: Create a transformation in the PDI client. Run it remotely.
* **Pentaho Server (BA)**: Create a connection to HDI in the Data Source Wizard.
* **PME**: Create a connection to HDI.
* **PRD**: Create a connection to HDI.

After you connect, share connection details with users.

Users typically need:

* HDI distribution and version
* HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames, IP addresses, and ports
* Oozie URL (if used)
* Permissions for required HDFS directories, including user home directories

See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

**Cloudera Data Platform (CDP)**

Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).

**Before you begin**

Before you set up Pentaho to connect to CDP, do these tasks:

1. Check [Components Reference](https://docs.pentaho.com/install/components-reference). Verify your Pentaho version supports your CDP version.
2. Prepare CDP:
   1. Configure Cloudera Data Platform.

      See [CDP documentation](https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/index.html).
   2. Install required services and client tools.
   3. Test the platform.
3. Get connection details from your platform administrator.

   You will share some of this information with users later.

   See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).
4. Add the YARN user to the group defined by `dfs.permissions.superusergroup`.

   Find this property in `hdfs-site.xml` or in Cloudera Manager.
5. Set up the Hadoop driver for your CDP version. See [Install a new driver](#install-a-new-driver).

**Set up a secured instance of CDP**

If you connect to Kerberos-secured CDP, also do these tasks:

1. Configure Kerberos on the platform.

   Include the realm, KDC, and administrative server.
2. Configure these nodes to accept remote connection requests:
   * NameNode
   * DataNode
   * Secondary NameNode
   * JobTracker
   * TaskTracker
3. If you deployed CDP using an enterprise program, set up Kerberos for the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker nodes.
4. Add credentials to the Kerberos database for each Pentaho user.
5. Verify each user has an operating system account on each CDP node.

   Add operating system users if needed.

{% hint style="info" %}
User account UIDs should be greater than `min.user.id`. The default is usually `1000`.
{% endhint %}

6. Set up Kerberos on your Pentaho machines.

   See the *Administer Pentaho Data Integration and Analytics* guide.

**Edit configuration files for users**

Cloudera administrators download the site configuration files for the services you use and update them with Pentaho-specific and user-specific settings. Users then upload the files when they create a named connection.

Named connection files are stored here:

* `<username>/.pentaho/metastore/pentaho/NamedCluster`
* `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save the updated files in a known location for reuse.

**Files to provide**

* `config.properties`
* `core-site.xml` (secured CDP only)
* `hive-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`

{% hint style="info" %}
If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.
{% endhint %}

**Edit Core site XML file**

If you use a secured instance of CDP, update `core-site.xml`.

1. Open `core-site.xml`.
2. Add or update these properties:

   | Property                                     | Value                                                     |
   | -------------------------------------------- | --------------------------------------------------------- |
   | `hadoop.proxyuser.oozie.hosts`               | Oozie hosts on your CDP cluster.                          |
   | `hadoop.proxyuser.oozie.groups`              | Oozie groups on your CDP cluster.                         |
   | `hadoop.proxyuser.<security_service>.hosts`  | Proxy user hosts for other services on your CDP cluster.  |
   | `hadoop.proxyuser.<security_service>.groups` | Proxy user groups for other services on your CDP cluster. |
   | `fs.s3a.access.key`                          | Your S3 access key, if you access S3 from CDP.            |
   | `fs.s3a.secret.key`                          | Your S3 secret key, if you access S3 from CDP.            |
3. Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:

   ```xml
   <property>
     <name>fs.s3a.delegation.token.binding</name>
     <value>org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding</value>
   </property>
   ```

   ```xml
   <property>
     <name>fs.s3a.aws.credentials.provider</name>
     <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
   </property>
   ```

   Ensure the gateway node has valid AWS credentials (for example, under `~/.aws/`).
4. Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:
   * Remove these properties:
     * `fs.azure.enable.delegation.token`
     * `fs.azure.delegation.token.provider.type`
     * `fs.azure.account.auth.type`
     * `fs.azure.account.oauth.provider.type`
   * Add these properties:
     * `fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net` = `SharedKey`
     * `fs.azure.account.key.<storage-account-name>.dfs.core.windows.net` = `<storage-account-key>`
5. Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:

   ```
   storage.bucket.get
   storage.objects.create
   storage.objects.delete
   storage.objects.get
   storage.objects.getIamPolicy
   storage.objects.list
   storage.objects.setIamPolicy
   storage.objects.update
   ```

   Assign the custom role to the Data Lake and Log service accounts for the bucket.
6. Save the file.
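For example, the Azure properties added in step 4 can be written as the following `core-site.xml` entries, where the storage account name and key are placeholders for your own values:

```xml
<property>
  <name>fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net</name>
  <value>SharedKey</value>
</property>

<property>
  <name>fs.azure.account.key.mystorageaccount.dfs.core.windows.net</name>
  <value>&lt;storage-account-key&gt;</value>
</property>
```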

**Edit Hive site XML file**

If you use Hive, update `hive-site.xml` to set the Hive metastore location.

1. Open `hive-site.xml`.
2. Add or update these properties:

   | Property                            | Value                                                                   |
   | ----------------------------------- | ----------------------------------------------------------------------- |
   | `hive.metastore.uris`               | Set this to the Hive metastore URI if it differs from your CDP cluster. |
   | `hive.server2.enable.impersonation` | Set to `true` if you use impersonation.                                 |
   | `hive.server2.enable.doAs`          | Set to `true` if you use impersonation.                                 |
   | `tez.lib.uris`                      | Required when you use Hive 3 on Tez.                                    |

   Example:

   ```xml
   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>
   ```

   ```xml
   <property>
     <name>tez.lib.uris</name>
     <value>/user/tez/0.9.1.7.1.4.0-203/tez.tar.gz</value>
   </property>
   ```
3. Save the file.

**Edit Mapred site XML file**

If you use MapReduce, update `mapred-site.xml` to set job history logging and allow cross-platform submissions.

1. Open `mapred-site.xml`.
2. Ensure these properties exist:

   | Property                                  | Value                                                                     |
   | ----------------------------------------- | ------------------------------------------------------------------------- |
   | `mapreduce.jobhistory.address`            | Where MapReduce job history logs are stored.                              |
   | `mapreduce.app-submission.cross-platform` | Set to `true` to allow submissions from Windows clients to Linux servers. |

   Example:

   ```xml
   <property>
     <name>mapreduce.app-submission.cross-platform</name>
     <value>true</value>
   </property>
   ```
3. Save the file.

**Edit YARN site XML file**

If you use YARN, verify your YARN settings in `yarn-site.xml`.

1. Open `yarn-site.xml`.
2. Add or update these properties:

   | Property                                             | Description                                                                        |
   | ---------------------------------------------------- | ---------------------------------------------------------------------------------- |
   | `yarn.application.classpath`                         | Classpaths needed to run YARN applications. Use commas to separate multiple paths. |
   | `yarn.resourcemanager.hostname`                      | Resource Manager host name for your environment.                                   |
   | `yarn.resourcemanager.address`                       | Resource Manager address and port for your environment.                            |
   | `yarn.resourcemanager.admin.address`                 | Resource Manager admin address and port for your environment.                      |
   | `yarn.resourcemanager.proxy-user-privileges.enabled` | Set to `true` if you use a proxy user.                                             |

   Example:

   ```xml
   <property>
     <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
     <value>true</value>
   </property>
   ```
3. Save the file.
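
As a sketch, the Resource Manager properties might look like the following, with a hypothetical host name and the default YARN ports:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>rm-host.example.com:8033</value>
</property>
```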

{% hint style="info" %}
After you change these files, edit the named connection and upload the updated files.
{% endhint %}

**Oozie configuration**

If you use Oozie on your cluster, configure proxy access on both the cluster and the Pentaho Server. By default, Oozie jobs run as the `oozie` user, so if you start an Oozie job from PDI, you must configure a proxy user.

**Set up Oozie on a cluster**

Add your PDI user to `oozie-site.xml`.

1. Open `oozie-site.xml` on the cluster.
2. Add these properties.

   Replace `<your_pdi_user_name>` with your PDI user name.

   ```xml
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
     <value>*</value>
   </property>
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
     <value>*</value>
   </property>
   ```
3. Save the file.

**Set up Oozie on the server**

Add the proxy user name to the PDI named connection configuration.

1. Open this file:

   `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties`

   **Note:** This path is created when you create a named connection.
2. Set `pentaho.oozie.proxy.user` to the proxy user name.
3. Save the file.
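
For example, assuming a hypothetical PDI proxy user named `pdiuser`, `config.properties` would contain:

```properties
# Named connection config.properties (hypothetical user name)
pentaho.oozie.proxy.user=pdiuser
```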

**Windows configuration for a secured cluster**

If you run Pentaho Server on Windows and use Kerberos, set the path to your `krb5.conf` or `krb5.ini` file.

1. Open `server/pentaho-server/start-pentaho.bat`.
2. Add `-Djava.security.krb5.conf` to `CATALINA_OPTS`.

   Example:

   ```bat
   set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
   ```
3. Save the file.

**Connect to CDP with the PDI client**

After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.

See **Pentaho Data Integration** for the client connection steps.

**Connect other Pentaho components to CDP**

Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

**Create and test connections**

Create and test a connection in each component.

* **Pentaho Server for Data Integration (DI)**: Create a transformation in the PDI client, then run it remotely.
* **Pentaho Server for Business Analytics (BA)**: Create a connection to CDP in the Data Source Wizard.
* **Pentaho Metadata Editor (PME)**: Create a connection to CDP in PME.
* **Pentaho Report Designer (PRD)**: Create a connection to CDP in PRD.

**Share connection details with users**

After you connect to CDP and its services, give connection details to users who need access.

Users typically need:

* CDP distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
* Oozie URL (if used)
* Permission to access required HDFS directories, including home directories

Users might need more information, depending on the steps, entries, and services they use.

See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

**Google Dataproc**

The following sections describe how to configure Pentaho to connect to Google Dataproc.

**Before you begin**

Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:

1. Check the [Components Reference](https://docs.pentaho.com/install/components-reference).
2. Prepare your Google Cloud access:
   * Get credentials for a Google account and access to the Google Cloud Console.
   * Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.
3. Contact your Hadoop administrator for cluster connection details.

You also need to [provide some of this information to users](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list) after setup.

**Create a Dataproc cluster**

You can create a Dataproc cluster using several methods.

For cluster setup options, see the [Google Cloud Documentation](https://cloud.google.com/dataproc/docs/guides/create-cluster).

**Install the Google Cloud SDK on your local machine**

Use Google’s instructions to install the Google Cloud SDK for your platform:

* Linux: [Install the Google Cloud SDK on Linux](https://cloud.google.com/sdk/docs/downloads-interactive#linux)
* Windows: [Install the Google Cloud SDK on Windows](https://cloud.google.com/sdk/docs/downloads-interactive#windows)

**Set command variables**

Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.

1. Set the variables:

   ```bash
   export PROJECT=project
   export HOSTNAME=hostname
   export ZONE=zone
   ```
2. Set `PROJECT` to your Google Cloud project ID.
3. Set `HOSTNAME` to the name of the master node in your Dataproc cluster.

   **Note:** The master node name ends with `-m`.
4. Set `ZONE` to the zone of the instances in your Dataproc cluster.
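
As a sketch with hypothetical values (substitute your own project ID, master node name, and zone), the variables and a resulting `gcloud` command might look like this; the command is assembled as a string here so the expansion is visible:

```shell
# Hypothetical values for illustration only.
export PROJECT=my-dataproc-project
export HOSTNAME=my-cluster-m   # master node names end with -m
export ZONE=us-central1-a

# Assemble an SSH command to the master node from the variables.
# (Run the gcloud command directly once the Google Cloud SDK is installed.)
CMD="gcloud compute ssh $HOSTNAME --project $PROJECT --zone $ZONE"
echo "$CMD"
```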

**Set up a Google Compute Engine instance for PDI**

You can run the PDI client inside a Google Compute Engine (GCE) virtual machine. Because VM instances in GCE do not publicly expose the ports required for remote desktop access, users must connect to the desktop UI through VNC over an SSH tunnel between the VNC client and the VM instance.

{% stepper %}
{% step %}
**Create a VM instance and set network tags**

1. In the Google Cloud Console, open the Compute Engine console.
2. Go to **Compute Engine** > **VM instances**.
3. Select **Create instance**.
4. Open **Advanced options** and then the **Networking** tab.
5. In **Network tags**, enter `vnc-server`.
{% endstep %}

{% step %}
**Install and configure VNC**

1. Install and configure a VNC service to provide the remote desktop UI.
2. Install a desktop environment, such as GNOME, and a VNC server.
{% endstep %}

{% step %}
**Connect using SSH and create an SSH tunnel**

1. Log in to the instance using SSH.
2. Use an SSH client and the VM external IP.

   **Note:** The Google Cloud Console shows the external IP.
3. Create an SSH tunnel from your VNC client machine.
4. Connect to the VNC session.
{% endstep %}

{% step %}
**Optional: Configure Kerberos**

If you use Kerberos, configure Kerberos on the GCE VM.

Authenticate the client machine with the Kerberos controller.

This is required for Kerberos-enabled Dataproc clusters.
{% endstep %}
{% endstepper %}
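
The SSH tunnel described above can be sketched as follows, assuming a hypothetical VM name and a VNC server on display `:1` (port 5901). The command is assembled as a string so the port-forwarding syntax is visible; run the `gcloud` command directly to open the tunnel.

```shell
# Hypothetical VM name and zone.
VM_NAME=pdi-vnc-vm
VM_ZONE=us-central1-a

# Forward local port 5901 to the VNC server on the VM through SSH.
TUNNEL_CMD="gcloud compute ssh $VM_NAME --zone $VM_ZONE -- -L 5901:localhost:5901"
echo "$TUNNEL_CMD"
```

Once the tunnel is up, point your VNC client at `localhost:5901`.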

When you finish, you can run PDI in GCE and design and launch jobs and transformations on Dataproc.

**Edit configuration files for users**

Your cluster administrator must download the cluster configuration files and update them with Pentaho-specific and user-specific values. You then use these files to create a named connection.

**Where named connection files live**

Named connection files are stored here:

* Named connection XML: `<username>/.pentaho/metastore/pentaho/NamedCluster`
* Named connection config folder: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>`
* Extra settings file: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save edited files in a safe location.

**Files to provide to users**

Provide these files to each user:

* `core-site.xml`
* `hdfs-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`
* `hive-site.xml`

{% hint style="info" %}
You can copy these files from a Dataproc cluster using SCP.
{% endhint %}

**Edit `mapred-site.xml` (MapReduce)**

If you use MapReduce, update `mapred-site.xml` to enable cross-platform MapReduce job submission.

{% stepper %}
{% step %}
**Open the file**

Open `mapred-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}
**Add the property**

Add this property:

```xml
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```

This property is required only when you submit MapReduce jobs from Windows clients.
{% endstep %}

{% step %}
**Save and apply the change**

Save the file, then edit the named connection and upload the updated `mapred-site.xml`.
{% endstep %}
{% endstepper %}

**Connect to a Hadoop cluster with the PDI client**

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the **Pentaho Data Integration** documentation for instructions.

**Connect other Pentaho components to Dataproc**

Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:

* Pentaho Server (DI and BA)
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Install a driver for the Pentaho Server**

For instructions on installing a driver for the Pentaho Server, see [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test a connection for each component:

* **Pentaho Server for DI**: Create a transformation in the PDI client and run it remotely.
* **Pentaho Server for BA**: Create a connection to the cluster in the Data Source Wizard.
* **PME**: Create a connection to the cluster in PME.
* **PRD**: Create a connection to the cluster in PRD.

**Share connection details with users**

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

* Hadoop distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
* Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

## Tasks to be performed by a Pentaho administrator

As a Pentaho administrator, you configure data connections, manage security, and set up client tools.

* [Define data connections](#define-data-connections)
* [Assign permissions to use or manage database connections](#assign-permissions-to-use-or-manage-database-connections)
* [Manage users and roles](#manage-users-and-roles)
* [Configure the design tools and utilities](#configure-the-design-tools-and-utilities)

### Define data connections

PUC (Pentaho User Console) and PDI (Pentaho Data Integration) can access data from various sources.

If your data source is a database, you must create a database connection before you can use that data.

Before you create a connection, install the correct driver for your database. Your IT administrator can help. See [Specify data connections for the Pentaho Server](https://docs.pentaho.com/install/legacy-redirects/tasks-to-be-performed-by-an-it-administrator-legacy-redirects/configure-the-pentaho-server-legacy-pages/specify-data-connections-for-the-pentaho-server).

#### Open the connection dialog box

You can open the **Database Connection** dialog box from PUC or PDI.

**Open from Pentaho User Console (PUC)**

1. Log on to PUC.
2. Select **File** > **Manage Data Sources**.

   The **Manage Data Sources** dialog box appears.
3. In **More actions and options**, select **New Connection**.

   The **Database Connection** dialog box appears.

   ![Database Connection dialog box](https://2345962715-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqfaQ2p0JAZrP8b3cpM9a%2Fuploads%2Fgit-blob-936c323042f48a6ef656fe6cfb89a4f18b1c9153%2FssPUCDataConnectionFromManageDataSources.png?alt=media)
4. Enter your connection information.
5. Click **Test**.
6. Click **OK**.

**Open from Pentaho Data Integration (PDI)**

1. Start the PDI client (Spoon).
2. Create a new transformation or job.
3. In the **View** tab, open the **Database connections** folder.

   The **Database Connection** dialog box appears.

   ![Database Connection dialog box](https://2345962715-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqfaQ2p0JAZrP8b3cpM9a%2Fuploads%2Fgit-blob-0b36ee4d927e030f0ff79f281fe32f337792b2a7%2FssPDIDataConnectionFromViewTabInExplorerPane.png?alt=media)
4. Enter your connection information.
5. Click **Test**.
6. Click **OK**.

**Notes for PDI connections**

* PDI can connect to multiple database vendors.
* PDI includes JDBC drivers for PostgreSQL by default.

{% hint style="warning" %}
Avoid ODBC connections.

The ODBC-to-JDBC bridge driver adds complexity and can hurt performance. Use ODBC only if no JDBC driver is available.

For details, see the [Pentaho Community article on why you should avoid ODBC](http://wiki.pentaho.com/pages/viewpage.action?pageId=14850644).
{% endhint %}

When you define a database connection in PDI, Spoon stores the connection information:

* If you use the Pentaho Repository, the connection is stored in the repository. Other users can use it after they connect.
* If you do not use the Pentaho Repository, Spoon stores it in XML with the transformation or job.

Before you define a JDBC connection, gather your database information, such as the database type, host name, port, user name, and password.

In PDI, you can set connection properties as variables. This helps jobs access multiple database types.
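
For example, you might define hypothetical variables in `kettle.properties` (in your `.kettle` directory) and then reference them as `${DB_HOSTNAME}`, `${DB_PORT}`, and `${DB_NAME}` in the connection fields:

```properties
# kettle.properties — hypothetical variable names and values
DB_HOSTNAME=dbserver.example.com
DB_PORT=5432
DB_NAME=sales_dw
```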

Use clean ANSI SQL where possible.

#### Enter database connection information

The connection information you enter depends on the database and access protocol.

**Native (JDBC) protocol information**

1. In **Connection Name**, enter a descriptive name.

   The name can include spaces. Do not use special characters such as `#`, `$`, or `%`.
2. In **Database Type**, select the database you want to use.
3. In **Access**, select **Native (JDBC)**.
4. In **Settings**, enter the following information:

   | Field         | Description                                         |
   | ------------- | --------------------------------------------------- |
   | Host Name     | Server name or IP address.                          |
   | Database Name | Database name. If you use ODBC, enter the DSN.      |
   | Port Number   | TCP/IP port number (if different from the default). |
   | User Name     | Optional user name for the database.                |
   | Password      | Optional password for the database.                 |
5. Click **Test**.
6. Click **OK** to close the test dialog.
7. Click **OK** to save the connection.
   * In PUC, the connection appears in **Manage Data Sources**.
   * In PDI, it appears under **Database connections** in the **View** tab.

**OCI protocol information (PDI only)**

1. In **Connection Name**, enter a descriptive name.

   The name can include spaces. Do not use special characters such as `#`, `$`, or `%`.
2. In **Database Type**, select **Oracle**.
3. In **Access**, select **OCI**.
4. In **Settings**, enter the fields as directed by the [Oracle OCI documentation](http://docs.oracle.com/cd/B28359_01/java.111/b31224/instclnt.htm).

   | Field                  | Description                          |
   | ---------------------- | ------------------------------------ |
   | SID                    | Oracle system ID.                    |
   | Tablespace for Data    | Tablespace where data is stored.     |
   | Tablespace for Indices | Tablespace where indexes are stored. |
   | User Name              | User name for the database.          |
   | Password               | Password for the database.           |
5. Click **Test**.
6. Click **OK** to close the test dialog.
7. Click **OK** to save the connection.

If you want to use **Advanced**, **Options**, or **Pooling** for OCI, refer to the Oracle documentation.

**Connect to Snowflake using strong authentication**

You can improve connection security by using key pair authentication.

1. After you enter the Snowflake connection details in **General**, select **Options**.
2. Set the following parameters:

   | Parameter              | Value                                                      |
   | ---------------------- | ---------------------------------------------------------- |
   | `authenticator`        | `snowflake_jwt`                                            |
   | `private_key_file`     | Path to the private key file (for example, `/rsa_key.p8`). |
   | `private_key_file_pwd` | Password for the private key file.                         |

   For details, see the Snowflake JDBC docs: [Private key file name and password as connection properties](https://docs.snowflake.com/en/developer-guide/jdbc/jdbc-configure#private-key-file-name-and-password-as-connection-properties).
3. Click **Test**.
4. Click **OK** to close the test dialog.
5. Click **OK** to save the connection.

**Connect to an Azure SQL database**

You can use an Azure SQL database as a data source in the PDI client. This connection is required for the **Bulk load into Azure SQL DB** job entry.

Pentaho supports:

* [Always Encrypted](https://learn.microsoft.com/sql/relational-databases/security/encryption/always-encrypted-database-engine?view=sql-server-ver16)
* [Dynamic Data Masking](https://learn.microsoft.com/azure/azure-sql/database/dynamic-data-masking-overview)
* Multiple authentication methods

Because one physical server can host databases for multiple customers, Azure SQL differs from SQL Server. For details, see [Feature comparison](https://learn.microsoft.com/azure/azure-sql/database/features-comparison).

**Before you begin**

You need:

* An Azure account with an active subscription
* An Azure SQL Database instance
* Azure SQL database drivers installed

Also gather the following information:

* Host name
* Database name
* Port number
* Authentication method
* User name
* Password

If you use **Always Encryption Enabled**, also gather:

* **Client id**
* **Client Secret Key**

**Authentication method**

Pentaho supports four authentication methods:

* **SQL Authentication**: Azure SQL Server user name and password.
* **Azure Active Directory**: Multi-factor authentication (MFA).
* **Azure Active Directory with password**: Azure AD user name and password.
* **Azure Active Directory with integrated authentication**: Federated on-premises ADFS with Azure AD.

**Connect to an Azure database**

1. Start the PDI client and create a new transformation or job.
2. In the **View** tab of the **Explorer** pane, double-click **Database connections**.
3. Enter your connection information:

   | Field                         | Description                                                                                                    |
   | ----------------------------- | -------------------------------------------------------------------------------------------------------------- |
   | **Host Name**                 | Azure SQL server instance name.                                                                                |
   | **Database Name**             | Azure SQL database name.                                                                                       |
   | **Port Number**               | TCP port. Azure SQL uses TCP port `1433`. Allow outbound TCP `1433`.                                           |
   | **Authentication method**     | Authentication method. Default is **SQL Authentication**.                                                      |
   | **Username**                  | Database user name.                                                                                            |
   | **Password**                  | Database password.                                                                                             |
   | **Always Encryption Enabled** | Enables encryption. See [Use the Always Encryption Enabled option](#use-the-always-encryption-enabled-option). |
   | **Client id**                 | Client identifier for a durable connection path.                                                               |
   | **Client Secret Key**         | Key value name in Azure Key Vault.                                                                             |
4. Click **Test** to verify the connection.

**Use the Always Encryption Enabled option**

Before you can use **Always Encryption Enabled**, complete these steps using your Azure SQL tools:

1. Generate a column master key in Azure Key Vault.
2. Encrypt the column using the column master key.
3. Register the app under Azure Active Directory.
4. Obtain the **Client id** and **Client Secret Key**.
5. Grant permissions to the **Client id** for accessing Azure Key Vault.
6. Select **Always Encryption Enabled**.
7. Enter the **Client id** and **Client Secret Key**.

#### Edit existing connections

Once a connection has been established, you can open the **Database Connection** dialog box to change it.

**Edit database connections in PUC**

1. Select **File** > **Manage Data Sources**.
2. Select the data source you want to edit.
3. In **More actions and options**, select **Edit**.
4. Select items in the left pane to refine the connection.

**Edit database connections in PDI**

1. Open a transformation or job in the PDI client.
2. In **Explorer**, open the **View** tab.
3. Expand **Database connections**.
4. Right-click a connection name, then select **Edit**.
5. Select items in the left pane to refine the connection.

#### Specify advanced configuration (Advanced)

Use **Advanced** to set SQL-generation properties. You can use these settings to standardize behavior across tools.

1. Open the **Database Connection** dialog box. See [Open the connection dialog box](#open-the-connection-dialog-box).
2. Select **Advanced** in the left pane.
3. Configure options as needed:

   | Identifier                         | Description                                               |
   | ---------------------------------- | --------------------------------------------------------- |
   | Supports the Boolean data type\*   | Use native Boolean data types supported by the database.  |
   | Supports the timestamp data type\* | Use the timestamp data type supported by the database.    |
   | Quote all in database              | Enable case-sensitive table names by quoting identifiers. |
   | Force all to lower-case            | Force identifiers to lower-case.                          |
   | Force all to upper-case            | Force identifiers to upper-case.                          |
   | Preserve case of reserved words\*  | Use the database reserved-words list.                     |
   | Preferred schema name\*            | Preferred schema name (for example, `MYSCHEMA`).          |
   | SQL statements                     | SQL statement used to initialize the connection.          |

   \* PDI-only options.
4. Click **Test**.
5. Click **OK** to close the test dialog.
6. Click **OK** to save the connection.

#### Quoting

Pentaho uses database-specific quoting rules. This helps you use names that follow database naming conventions.

PUC and PDI include a reserved-words list for most supported databases.

Pentaho separates schema (table owner) from the table name. This supports table or field names that include periods (common in some ERP systems).

To avoid quoting errors, Pentaho does not apply quoting if the schema or table name already includes a start or end quotation mark. This lets you control quoting explicitly.

#### Set database-specific options (Options)

Use **Options** to set JDBC parameters for a specific database.

1. Open the **Database Connection** dialog box. See [Open the connection dialog box](#open-the-connection-dialog-box).
2. Select **Options** in the left pane.
3. In the next available row, enter a parameter name and value.

   For JDBC configuration help, select **Help**. A browser window opens with database-specific guidance for the selected **Database Type**.
4. Click **Test**.
5. Click **OK** to close the test dialog.
6. Click **OK** to save the connection.

#### Define connection pooling (Pooling)

Connection pooling helps control database access. It can also help when your database license limits concurrent connections.

1. Open the **Database Connection** dialog box. See [Open the connection dialog box](#open-the-connection-dialog-box).
2. Select **Pooling** in the left pane.
3. Configure pooling options for your JDBC driver.

   Typical options include:

   * **Enable Connection Pooling**
   * **Pool Size**: initial and maximum
   * **Parameters** such as `validationQuery`

   Common `validationQuery` values:

   * Oracle: `SELECT 1 FROM DUAL`
   * PostgreSQL, SQL Server, and MySQL: `SELECT 1`
4. Click **Test**.
5. Click **OK** to close the test dialog.
6. Click **OK** to save the connection.

#### Connect to clusters (PDI only)

Use **Clustering** to create database connections to data partitions in PDI.

To create a new partition connection, enter:

* **Partition ID**
* **Host Name**
* **Port**
* **Database Name**
* **User Name**
* **Password**

If you run Pentaho Server as a cluster and use the Data Source Wizard (DSW) in PUC, a new data source is visible only on the cluster node where you created it.

To make a new DSW data source visible across all cluster nodes, disable DSW data source caching:

1. Open `server/pentaho-server/pentaho-solutions/system/system.properties`.
2. Set `enableDomainIdCache=false`.

{% hint style="warning" %}
Disabling DSW caching can slow down loading the data source list.
{% endhint %}

#### Modify connections

You can perform additional tasks from PUC or the PDI client.

**Modify connections from PUC**

In PUC, open **File** > **Manage Data Sources**, then use **More actions and options**.

![More actions and options menu in the PUC Manage Data Sources dialog box](https://2345962715-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqfaQ2p0JAZrP8b3cpM9a%2Fuploads%2Fgit-blob-cda20df43d7b3d70b8ae03b262e2ce118ed3e139%2FssPUCManageDataSourcesOtherConnectionTasks.png?alt=media)

| Task                | Description                                                                                                                                                                                       |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Export**          | Exports connection information. **Metadata** downloads an XMI file. **Analysis** downloads an XML file. **Data Source Wizard** downloads an XMI (reporting only) or ZIP (reporting and analysis). |
| **Import Analysis** | Imports data values from a Mondrian file into a specified data source. You can select an available data source or enter parameters manually.                                                      |
| **Import Metadata** | Imports a model and creates a data source. **Data Source Wizard (includes Analysis model)** uses a ZIP created using **Export**. **Metadata model** uses an XMI file.                             |

**Modify connections from PDI**

In PDI, right-click the connection name under **Database connections** in the **View** tab.

| Task                  | Description                                                        |
| --------------------- | ------------------------------------------------------------------ |
| **Duplicate**         | Duplicates a connection after you enter a new **Connection Name**. |
| **Copy to clipboard** | Copies the XML defining the connection.                            |
| **SQL Editor**        | Executes SQL commands against an existing connection.              |
| **Clear DB Cache**    | Clears PDI’s database cache (use after schema changes).            |
| **Share**             | Shares the connection across transformations and jobs.             |
| **Explore**           | Opens Database Explorer to browse schemas and tables.              |
| **Show dependences**  | Lists transformations and jobs that use the connection.            |

#### Delete connections

If you no longer need a connection, delete it in PUC or PDI.

{% hint style="warning" %}
Deleting a connection affects all reports, charts, dashboards, and other content associated with that connection.
{% endhint %}

**Delete connections in PUC**

1. Select **File** > **Manage Data Sources**.
2. Select the data source you want to delete.
3. In **More actions and options**, select **Delete**.

**Delete connections in PDI**

1. In **Explorer**, select the **View** tab.
2. Expand **Database connections**.
3. Right-click the connection name, then select **Delete**.

### Assign permissions to use or manage database connections

You may have several connections to your data that you do not want to share with all of your users. When connected to the Pentaho Server, the PDI client (also known as Spoon) gives you the ability to make your data visible to only those users and roles that you specify. You can assign permission to allow users and roles to read, write, or delete the connection. Connection definitions are stored in the Pentaho Repository. The PDI client Repository Explorer enables you to browse the available connections and select the one for which you want to assign permissions.

1. From within the PDI client, click on **Tools** > **Repository** > **Explore**.

   The Repository Explorer appears.
2. Select the **Connections** tab.
3. Select the connection for which you want to assign permissions.
4. From the **User/Role** area, select the user or role for which you want to assign permissions.
5. Check the permissions you want to assign to the selected user or role.

   | Selection                 | Selection Result                                                                                                                                                                                                                                                                          |
   | ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
   | **Read**                  | For this user or role, the connection appears in the connection list and can be selected for use. If users or roles have permission to read a transformation or job but not to a referenced database connection, they cannot open the transformation or job and an error message appears. |
   | **Write**                 | This user or role can edit the connection definition.                                                                                                                                                                                                                                     |
   | **Delete**                | This user or role can permanently remove the connection definition from the list.                                                                                                                                                                                                         |
   | **Manage Access Control** | This user or role can assign read, write, and delete permissions to other users or roles.                                                                                                                                                                                                 |
6. Click **Apply**.
7. Click **Close** to exit the dialog box.

You can also delegate the ability to assign these permissions to another user or role. For more information on managing users and roles, see [With the PDI client](#with-the-pdi-client).

### Manage users and roles

If you are using basic Pentaho security, the Pentaho administrator may be tasked with creating and managing users and roles, including assigning permissions to allow users to access the content they need.

#### With PUC

Switch between user and role settings to add, delete, and edit users and roles in PUC.

See the **Pentaho Business Analytics** documentation for details.

#### With the PDI client

Manage users, roles, and permissions in the Pentaho Repository with the PDI client (Spoon).

You must sign in as an administrator, or with a role that has the **Administer Security** permission.

**Quick tasks**

**Users**

* [Add users](#add-users)
* [Change user passwords](#change-user-passwords)
* [Delete users](#delete-users)
* [Assign users to roles](#assign-users-to-roles)
* [Edit user information](#edit-user-information)
* [Hide a user Home folder in PDI](#hide-a-user-home-folder-in-pdi)

**Roles and permissions**

* [Add roles](#add-roles)
* [Edit roles](#edit-roles)
* [Delete roles](#delete-roles)
* [Assign user permissions in the repository using the PDI client](#assign-user-permissions-in-the-repository-using-the-pdi-client)
* [Enable system role permissions](#enable-system-role-permissions)
* [Make changes to the administrator role](#make-changes-to-the-administrator-role)

**Reference**

* [Sample users, default roles, and permissions](#sample-users-default-roles-and-permissions)

**Before you begin**

Back up relevant files before you change security settings.

* If you installed PDI using the suite installer or custom methods, back up your Data Integration directories.
* If you installed PDI using the manual method, back up `pentaho.war` and your solution files.

**Sample users, default roles, and permissions**

Use the built-in sample users and default roles as a starting point. They can guide your own role design.

1. Open the PDI client and sign in to the repository.
2. Select **Tools** > **Repository** > **Explore**.
3. Select the **Security** tab.

   Available users are listed.

   ![Repository explorer Security tab](https://2345962715-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqfaQ2p0JAZrP8b3cpM9a%2Fuploads%2Fgit-blob-f244bd2e3eb7a06c599f96b03d519f623ef7f36d%2FPDI_Repository_Explorer_Security_tab.png?alt=media)
4. Choose what you want to manage:
   * **Users**: Select a user to see assigned roles and an optional description.
   * **Roles**: Select a role to see its permissions.
   * **System Roles**: Select a system role to see its permissions.

Default roles and sample users (typical setup):

* **Administrator** (sample user: `admin`)
  * Administer Security
  * Schedule Content
  * Read Content
  * Publish Content
  * Create Content
  * Execute
  * Manage Data Sources
* **Power User** (sample user: `suzy`)
  * Schedule Content
  * Read Content
  * Publish Content
  * Create Content
  * Execute
* **Report Author** (sample user: `tiffany`)
  * Schedule Content
  * Publish Content
* **Business Analyst** (sample user: `pat`)
  * Publish Content

Permission definitions:

* **Administer Security**
  * Lets users access and manage content across perspectives.
  * Lets users view and manage schedules in the **Schedules** perspective.
  * The Administrator role effectively retains all permissions, even if unchecked.
* **Schedule Content**
  * Lets users schedule content.
  * Lets users manage their own schedules.
* **Read Content**
  * Lets users view content in perspectives.
* **Publish Content**
  * Lets users store reports and data models in the repository.
  * With **Create Content**, lets users publish to the repository.
* **Create Content**
  * Lets users create, import, delete, and save jobs and transformations.
  * Lets users see repository data sources used by jobs and transformations.
  * With **Execute**, lets users export, copy/paste, and save to VFS.
* **Execute**
  * Lets users run, preview, debug, replay, verify, and schedule.
  * With **Create Content**, lets users export, copy/paste, and save to VFS.
* **Manage Data Sources**
  * Lets users create, edit, or delete data sources.
  * Lets users view repository data sources.

**Manage users**

Use the **Security** tab to manage users. Select **Users** to see the user list.

**Add users**

1. Select **Users**, then select the plus sign next to **Available**.
2. Enter **User Name** and **Password**.
3. Optional: Enter a **Description**.
4. Optional: Under **Member**, select a role and select **OK**.
5. Select **OK** to save.

**Change user passwords**

1. Select **Users**, then select the user.
2. Select the **Edit** icon.
3. In **Password**, enter the new password.
4. Select **OK**.

When you sign in to the PDI client for the first time, change the default administrator password.

**Edit user information**

1. Select **Users**, then select the user in the **Available** list.
2. Select the **Edit** icon.
3. Update the user details.
4. Select **OK**.

**Delete users**

{% hint style="warning" %}
Disable a user or role instead of deleting it.
{% endhint %}

1. Select **Users**, then select the user in the **Available** list.
2. Next to **Available**, select the **X** icon.
3. Select **Yes** to confirm.

If you delete a user or role, repository content ownership and ACL entries remain unchanged. This can create a security risk if you later create a new user or role with the same name.

Safer alternatives:

* If you are disabling a role, unassign all members from the role.
* If you are disabling a user, reset the password to a cryptic, unknown value.

**Hide a user Home folder in PDI**

If your organization uses multi-tenancy, you may want to hide individual Home folders. This can help centralize user-created content in a secure folder.

See the **Administer Pentaho Data Integration and Analytics** documentation for details.

**Manage roles and permissions**

Select **Roles** to manage roles and their permissions.

**Add roles**

1. Select **Roles**.
2. Select the plus sign next to **Available**.
3. Enter the **Role Name**.
4. Optional: Enter a **Description**.
5. Optional: Select users (Shift/Ctrl), then select the right arrow to assign them.
6. Select **OK**.

**Edit roles**

1. Select **Roles**.
2. Select the role, then select the **Edit** icon.
3. Make changes.
4. Select **OK**.

**Delete roles**

1. Select **Roles**.
2. Select the role in the **Available** list.
3. Select the **X** icon next to **Available**.
4. Select **Yes** to confirm.

**Assign users to roles**

1. Select **Roles**.
2. Select the role.
3. Next to **Members**, select the plus sign.
4. Select users, then select the right arrow.
5. Select **OK**.

**Assign user permissions in the repository using the PDI client**

Use role permissions to control what users can do. For example, create an admin role that can administer security and create content.

1. Select **Roles**.
2. In the **Available** list, select the role.
3. In **Permission**, select or clear permissions.
4. Select **Apply**.

Changes apply the next time affected users sign in.

**Enable system role permissions**

Pentaho requires the **Authenticated** system role for repository sign-in. Users are assigned **Authenticated** automatically at sign-in.

By default, **Authenticated** has **Read Content** permission. Adjust permissions as needed.

{% hint style="info" %}
The **Anonymous** system role is not used.
{% endhint %}

1. Select **System Roles**.
2. Select **Authenticated**.
3. Under **Permissions**, select or clear permissions.
4. Select **Apply**.

**Make changes to the administrator role**

You cannot edit the administrator role’s action-based permissions in the PDI client. The administrator role controls access to the **Security** tab.

{% hint style="warning" %}
Deleting the administrator role blocks access to the **Security** tab unless another role gets admin permission.
{% endhint %}

Use a configuration change for these scenarios:

* Delete the administrator role.
* Remove administrator permission from the administrator role.
* Configure LDAP.

1. Shut down the Pentaho Server.
2. Open `repository.spring.xml` in `pentaho-server/pentaho-solutions/system`.
3. Find the `immutableRoleBindingMap` element.
4. Replace the entire node with this XML.

   Replace `yourAdminRole` with the role that should have administrator permission.

   ```xml
   <util:map id="immutableRoleBindingMap">
     <entry key="yourAdminRole">
       <util:list>
         <value>org.pentaho.di.reader</value>
         <value>org.pentaho.di.creator</value>
         <value>org.pentaho.di.securityAdministrator</value>
       </util:list>
     </entry>
   </util:map>
   ```
5. Restart the Pentaho Server.

### Configure the design tools and utilities

Before using design tools and utilities, perform configuration tasks for each workstation running these tools.

{% hint style="info" %}
Depending on how these tools and utilities were [installed](https://docs.pentaho.com/install/pentaho-installation-overview-cp), they might be located on different machines than the Pentaho Server.
{% endhint %}

#### BA design tools

The following table describes the BA design tools and their uses:

| Design Tool                  | What it does                                                                                                                                                           |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Pentaho Aggregation Designer | Optimizes the multidimensional Mondrian data model.                                                                                                                    |
| Pentaho Metadata Editor      | Creates relational data models and refines the models created with the Data Access Wizard.                                                                             |
| Pentaho Report Designer      | Creates sophisticated operational reports, from simple to advanced, drawing from a wide range of data sources.                                                         |
| Pentaho Schema Workbench     | Creates multidimensional Mondrian data models and refines the models created with the Data Access Wizard.                                                              |

Your IT administrator must [specify the data connections for these tools](https://docs.pentaho.com/install/legacy-redirects/tasks-to-be-performed-by-an-it-administrator-legacy-redirects/configure-the-pentaho-server-legacy-pages/specify-data-connections-for-ba-design-tools).

**Start and stop BA design tools**

Each BA design tool has its own startup command. Start and stop each tool separately.

**Start BA design tools**

The startup method depends on your operating system.

{% hint style="info" %}
Replace `pentaho` with your Pentaho installation directory.
{% endhint %}

{% tabs %}
{% tab title="Windows" %}
Run the `.bat` file for the tool you want.

* Aggregation Designer: `pentaho\design-tools\aggregation-designer\startaggregationdesigner.bat`
* Metadata Editor: `pentaho\design-tools\metadata-editor\metadata-editor.bat`
* Report Designer: `pentaho\design-tools\report-designer\report-designer.bat`
* Schema Workbench: `pentaho\design-tools\schema-workbench\workbench.bat`

You can also create shortcuts to these files.
{% endtab %}

{% tab title="Linux" %}
Run the `.sh` file for the tool you want.

* Aggregation Designer: `pentaho/design-tools/aggregation-designer/startaggregationdesigner.sh`
* Metadata Editor: `pentaho/design-tools/metadata-editor/metadata-editor.sh`
* Report Designer: `pentaho/design-tools/report-designer/report-designer.sh`
* Schema Workbench: `pentaho/design-tools/schema-workbench/workbench.sh`
{% endtab %}

{% tab title="macOS" %}
Open the `.app` file for the tool you want.

* Aggregation Designer: `pentaho/design-tools/aggregation-designer/Aggregation Designer.app`
* Metadata Editor: `pentaho/design-tools/metadata-editor/Metadata Editor 64-bit.app`
* Report Designer: `pentaho/design-tools/report-designer/Pentaho Report Designer.app`
* Schema Workbench: `pentaho/design-tools/schema-workbench/Schema_Workbench.app`
{% endtab %}
{% endtabs %}

**Stop BA design tools**

Use the standard exit method for your tool.

* Aggregation Designer: Close the application window.
* Metadata Editor, Report Designer, and Schema Workbench: Select **File** > **Exit**.

Save your work before you exit.

#### PDI design tools and utilities

Pentaho Data Integration allows users to design and manage complex ETL workloads with a graphical user interface.

Kitchen, Pan, and Carte are command-line tools for executing jobs and transformations modeled in the PDI client.

* Use Pan and Kitchen command-line tools to work with transformations and jobs.
* Use Carte clusters to:
  * Schedule jobs to run on a remote Carte server.
  * Stop Carte from the command line interface or URL.
  * Run transformations and jobs from the repository on the Carte server.

See the **Pentaho Data Integration** documentation for details on Kitchen, Pan, and Carte.
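As a rough sketch of typical command-line usage (the file paths, repository name, and credentials below are placeholders — substitute your own, and verify the option names against the Kitchen and Pan documentation for your PDI version):

```shell
# Run a job stored as a .kjb file with Kitchen (path is a placeholder)
./kitchen.sh -file=/home/etl/load_warehouse.kjb -level=Basic

# Run a transformation stored in a Pentaho Repository with Pan
# (repository name, credentials, directory, and transformation name are hypothetical)
./pan.sh -rep=my_repo -user=admin -pass=password \
         -dir=/public -trans=stage_orders -level=Basic
```

On Windows, use `kitchen.bat` and `pan.bat` with the same options.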

**Start and stop PDI design tools and utilities**

Each PDI tool has a specific function, so you start and stop each one individually.

**Starting PDI tools and utilities**

Run the batch, script, or app files in the `data-integration` folder.

You can find `data-integration` in the folder where you installed the tool or utility.

Use the file for your operating system.

{% tabs %}
{% tab title="Windows" %}

* PDI client: `spoon.bat`
* Carte: `carte.bat`
* Kitchen: `kitchen.bat`
* Pan: `pan.bat`

Tip: Consider making a shortcut for each `.bat` file and adding it to the desktop or the **Start** menu.
{% endtab %}

{% tab title="Linux" %}

* PDI client: `spoon.sh`
* Carte: `carte.sh`
* Kitchen: `kitchen.sh`
* Pan: `pan.sh`
{% endtab %}

{% tab title="macOS" %}

* PDI client: `Data Integration.app`
* Carte: `carte.app`
* Kitchen: `kitchen.app`
* Pan: `pan.app`
{% endtab %}
{% endtabs %}

**Stopping PDI design tools**

Use these methods to stop the PDI design tools:

* PDI client: Close the window or select **File > Exit**.
* Carte: Stop it from the command line interface or URL.
* Pan and Kitchen: Stop them from the command line interface.

For details on stopping Kitchen, Pan, or Carte, see the Pentaho Data Integration documentation.
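For example, a running Carte instance can typically be stopped over HTTP by calling its `stopCarte` service. The host, port, and credentials below are placeholders (`cluster`/`cluster` is the conventional Carte default); verify the endpoint and credentials against your Carte configuration before relying on this:

```shell
# Stop a Carte server via its web service endpoint
# (host, port, and credentials are assumptions — adjust for your setup)
curl --user cluster:cluster "http://localhost:8081/kettle/stopCarte/"
```

Pan and Kitchen run in the foreground, so they can be interrupted from the same terminal (for example, with Ctrl+C).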

**Increase the PDI client memory limit**

As a best practice, increase PDI's memory limit so the Pentaho Server and the PDI client (Spoon) can perform memory-intensive tasks, process or sort large datasets, and run complex transformations and jobs.

Increase the memory limit for both the Pentaho Server and the PDI client.

If you do not increase the memory limit, PDI uses the default settings in the startup scripts.

For Pentaho Server memory changes, see [Increase Pentaho Server memory limit for custom installations on Windows or Linux](https://docs.pentaho.com/install/legacy-redirects/tasks-to-be-performed-by-an-it-administrator-legacy-redirects/configure-the-pentaho-server-legacy-pages/increase-pentaho-server-memory-limit-for-custom-installations-on-windows-or-linux).

**Increase memory for the PDI client (Spoon)**

1. Exit from the PDI client if it is currently running.
2. Open the PDI client startup script with a text editor.

   The startup script depends on your operating system:

   * Windows: `pentaho/design-tools/data-integration/spoon.bat`
   * Linux: `pentaho/design-tools/data-integration/spoon.sh`
3. Modify the `-Xmx` value to a larger upper memory limit.

   This example allocates 2 GB of heap space:

   ```shell
   PENTAHO_DI_JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=256m"
   ```
4. Save and close the startup file.
5. Start the PDI client.

   If you still see memory errors, increase the value again.

**Alternative: Set an environment variable**

{% hint style="info" %}
Instead of modifying the startup script, you can set the `PENTAHO_DI_JAVA_OPTIONS` environment variable on your client to `-Xmx2g -XX:MaxPermSize=256m`.
{% endhint %}
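For example, on Linux or macOS the variable can be exported in the shell session (or a profile script) before launching the client. This is a minimal sketch using the same values as the hint above:

```shell
# Set JVM options for the PDI client without editing spoon.sh,
# then start the PDI client as usual from the same session
export PENTAHO_DI_JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=256m"

# On Windows, the equivalent is:
#   set PENTAHO_DI_JAVA_OPTIONS=-Xmx2g -XX:MaxPermSize=256m
```

Setting the variable in a profile script makes the change persistent across sessions, while editing `spoon.bat`/`spoon.sh` ties it to that one installation.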
