# Set up Pentaho to connect to a Hadoop cluster

Use this topic to configure Pentaho to connect to Hadoop clusters.

Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.

Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.

You can connect to clusters and services from these Pentaho components:

* PDI client (Spoon), along with Kitchen and Pan command line tools
* Pentaho Server
* Analyzer (PAZ)
* Pentaho Interactive Reports (PIR)
* Pentaho Report Designer (PRD)
* Pentaho Metadata Editor (PME)

Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).

To confirm which drivers are supported for your version, see the [Components Reference](https://docs.pentaho.com/install/components-reference).

Drivers are shipped as vendor-specific builds of the optional `pentaho-big-data-ee-plugin`.

Download drivers from the [Hitachi Vantara Lumada and Pentaho Support Portal](https://support.pentaho.com/hc/en-us).

**Note:** Pentaho ships with a generic Apache Hadoop driver; vendor-specific drivers are available only from the Support Portal.

Before you add a named connection to a cluster, install a driver for the vendor and version you use.

### Install a new driver

You need a driver for each cluster vendor and version you connect to from:

* PDI client (Spoon), plus Kitchen and Pan
* Pentaho Server
* Analyzer (PAZ)
* Pentaho Interactive Reports (PIR)
* Pentaho Report Designer (PRD)
* Pentaho Metadata Editor (PME)

{% hint style="info" %}
Pentaho ships with a generic Apache Hadoop driver. Download vendor-specific drivers from the Support Portal.
{% endhint %}

{% stepper %}
{% step %}

### Download the driver plugin

1. Sign in to the [Support Portal](https://support.pentaho.com/hc/en-us).
2. Go to **Downloads**.
3. In the **11.0** list, open the full downloads list.
4. Open **Pentaho 11.0 GA Release**.
5. Download the driver plugin from `Big Data Shims`.

Common driver plugin files:

* Apache Vanilla: `pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip`
* Cloudera Data Platform: `pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip`
* Google Dataproc: `pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip`
* Amazon EMR: `pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip`
* Azure HDInsight: `pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip`
{% endstep %}

{% step %}

### Install the driver on the PDI client

1. Stop PDI.
2. If you are replacing an existing driver plugin, remove the old `pentaho-big-data-ee-plugin` folder first.
3. Extract the downloaded `.zip` into:
   * `<pdi-install-dir>/data-integration/plugins`
{% endstep %}

{% step %}

### Install the driver on the Pentaho Server

1. Stop the Pentaho Server.
2. If you are replacing an existing driver plugin, remove the old `pentaho-big-data-ee-plugin` folder first.
3. Extract the downloaded `.zip` into:
   * `<pentaho-server>/pentaho-solutions/system/kettle/plugins`
{% endstep %}

{% step %}

### Restart and verify

1. Restart the PDI client and the Pentaho Server.
2. Create or update your cluster connection and verify it connects.
{% endstep %}
{% endstepper %}
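
For a Linux PDI client, the steps above might look like the following shell sketch. The install path and the driver file name are illustrative assumptions; use the path and `.zip` for your own environment.

```bash
# Hypothetical sketch only: install a driver plugin on a Linux PDI client.
# The install path is an assumption; replace <build-number> with your build.
cd /opt/pentaho/data-integration/plugins

# Remove the old driver plugin folder if you are replacing it.
rm -rf pentaho-big-data-ee-plugin

# Extract the downloaded driver plugin in place.
unzip ~/Downloads/pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip
```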

### Update drivers

When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.

### Additional configurations for specific distributions

Use these settings when you configure Pentaho to connect to specific Hadoop distributions:

* [Amazon EMR](#amazon-emr)
* [Azure HDInsight](#azure-hdinsight)
* [Cloudera Data Platform (CDP)](#cloudera-data-platform-cdp)
* [Google Dataproc](#google-dataproc)

#### Amazon EMR

Use the following settings when you configure Pentaho to connect to a working Amazon EMR cluster.

{% hint style="info" %}
EMR clusters (version 7.x and later) built with JDK 17 exclude `commons-lang-2.6.jar` from standard Hadoop library directories (such as `$HADOOP_HOME/lib`).

To use the EMR driver with EMR 7.x:

1. Download `commons-lang-2.6.jar` from a trusted source (for example, [Maven Repository: commons-lang » commons-lang » 2.6](https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6)).
2. Copy the JAR to `$HADOOP_HOME/lib` or `$HADOOP_MAPRED_HOME/lib` on every EMR node.
{% endhint %}
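
A minimal sketch of those two steps, assuming Maven Central as the trusted source and shell access to each node:

```bash
# Hypothetical sketch: fetch commons-lang 2.6 from Maven Central and
# copy it into the Hadoop library directory on an EMR node.
wget https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
sudo cp commons-lang-2.6.jar "$HADOOP_HOME/lib/"
```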

**Before you begin**

Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:

1. Check the [Components Reference](https://docs.pentaho.com/install/components-reference) to confirm your Pentaho version supports your EMR version.
2. Prepare your Amazon EMR cluster:
   1. Configure an Amazon EC2 cluster.
   2. Install required services and service client tools.
   3. Test the cluster.
3. Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.
4. Get connection details from your Hadoop administrator.
5. Add the YARN user on the cluster to the group defined by `dfs.permissions.superusergroup` in `hdfs-site.xml`.
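
For step 5, you might check the configured supergroup and add the `yarn` user to it on a cluster node, as in this sketch (`supergroup` is a common default, not a guaranteed value):

```bash
# Hypothetical sketch: find the HDFS superuser group, then add the
# yarn user to it on the cluster node.
hdfs getconf -confKey dfs.permissions.superusergroup
sudo usermod -aG supergroup yarn
```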

{% hint style="info" %}
As a best practice, install PDI on the Amazon EC2 instance.

Otherwise, you may not be able to read or write cluster files.

For a workaround, see [Unable to read or write files to HDFS on the Amazon EMR cluster](https://docs.pentaho.com/install/legacy-redirects/use-hadoop-with-pentaho-redirects/unable-to-read-or-write-files-to-hdfs-on-amazon-emr-cluster).
{% endhint %}

You also need to share connection details with users after setup.

For the full list, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

**Edit configuration files for users**

Your cluster administrator must download the cluster configuration files and update them with Pentaho-specific and user-specific values.

Use these files to create or update a named connection.

**Where named connection files live**

Named connection files are stored here:

* Named connection XML: `<username>/.pentaho/metastore/pentaho/NamedCluster`
* Named connection config folder: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>`
* Extra settings file: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save edited files in a safe location.

**Files to provide to users**

Provide these files to each user:

* `core-site.xml`
* `mapred-site.xml`
* `hdfs-site.xml`
* `yarn-site.xml`

**Verify or edit core-site.xml file**

{% hint style="info" %}
If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in `core-site.xml`.
{% endhint %}

Edit `core-site.xml` to add AWS access keys and (optional) LZO compression settings.

{% stepper %}
{% step %}

### Open the file

Open `core-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}

### Add AWS credentials

Add your AWS Access Key ID and secret access key:

```xml
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>
```

{% endstep %}

{% step %}

### Optional: Add S3N credentials

If you use S3N, add these properties:

```xml
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>
```

{% endstep %}

{% step %}

### Add filesystem implementation settings

Add these properties:

```xml
<property>
   <name>fs.s3n.impl</name>
   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

<property>
   <name>fs.s3.impl</name>
   <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
   <name>fs.s3a.impl</name>
   <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
```

{% endstep %}

{% step %}

### Configure LZO compression

If you are not using LZO compression, remove any references to `com.hadoop.compression.lzo.LzoCodec` from `core-site.xml`.

If you are using LZO compression:

1. Download the LZO JAR.
2. Add it to `pentaho-big-data-plugin/hadoop-configurations/emr3x/lib`.

Download: <http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/>
{% endstep %}

{% step %}

### Save and apply the change

Save the file.

Update the named connection.

Upload the updated `core-site.xml`.
{% endstep %}
{% endstepper %}

**Edit mapred-site.xml file**

If you use MapReduce, edit `mapred-site.xml` to enable cross-platform MapReduce job submission.

{% stepper %}
{% step %}

### Open the file

Open `mapred-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}

### Add the property

Add this property:

```xml
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```

This property is required only when you submit MapReduce jobs from a Windows client.
{% endstep %}

{% step %}

### Save and apply the change

Save the file.

Update the named connection.

Upload the updated `mapred-site.xml`.
{% endstep %}
{% endstepper %}

**Connect to a Hadoop cluster with the PDI client**

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the **Pentaho Data Integration** documentation for instructions.

**Connect other Pentaho components to the Amazon EMR cluster**

Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:

* Pentaho Server (DI and BA)
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Install a driver for the Pentaho Server**

Install a driver for the Pentaho Server.

For instructions, see [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test a connection for each component:

* **Pentaho Server for DI**: Create a transformation in the PDI client and run it remotely.
* **Pentaho Server for BA**: Create a connection to the cluster in the Data Source Wizard.
* **PME**: Create a connection to the cluster in PME.
* **PRD**: Create a connection to the cluster in PRD.

**Share connection details with users**

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

* Hadoop distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
* Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

#### Azure HDInsight

Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).

**Before you begin**

Before you set up Pentaho to connect to HDI, do the following:

1. Check [Components Reference](https://docs.pentaho.com/install/components-reference). Confirm your Pentaho version supports your HDI version.
2. Prepare your HDI instance:
   1. Configure your Azure HDInsight instance.
   2. Install required services and client tools.
   3. Test the platform.
   4. If HDI uses Kerberos, complete the Kerberos steps on this page.
3. Get connection details from your platform admin. You will share some of this information with users later. See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).
4. Add the YARN user to the group defined by `dfs.permissions.superusergroup` in `hdfs-site.xml`.
5. Set up the Hadoop driver for your HDI version. See [Install a new driver](#install-a-new-driver).

**Kerberos-secured HDInsight instances**

If you connect to HDI secured with Kerberos, complete these steps first:

1. Configure Kerberos security on the platform, including the Kerberos realm, KDC, and admin server.
2. Configure these nodes to accept remote connection requests:
   * NameNode
   * DataNode
   * Secondary NameNode
   * JobTracker
   * TaskTracker
3. If you deployed HDI using an enterprise program, set up Kerberos for those nodes.
4. Add user credentials to the Kerberos database for each Pentaho user.
5. Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.

{% hint style="info" %}
User account UIDs should be greater than `min.user.id`. The default is usually `1000`.
{% endhint %}

6. Set up Kerberos on your Pentaho machines. See the *Administer Pentaho Data Integration and Analytics* guide.
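
For reference, a minimal `krb5.conf` on a Pentaho machine might look like the following sketch; the realm and host names are placeholders:

```
[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
    admin_server = kdc.example.com
  }
```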

**Edit configuration files for users**

Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.

Named connection files are stored in these locations:

* `<username>/.pentaho/metastore/pentaho/NamedCluster`
* `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save the updated files in a known location for reuse.

**Files to provide**

* `core-site.xml` (secured HDInsight only)
* `hbase-site.xml`
* `hive-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`

{% hint style="info" %}
If you update these files after creating a named connection, edit the named connection and re-upload the updated files.
{% endhint %}

**Edit Core site XML file**

If you use a secured instance of Azure HDInsight, update `core-site.xml`.

1. Open `core-site.xml`.
2. Add or update properties for your storage type.

   **WASB storage**

   Add these properties:

   * `fs.AbstractFileSystem.wasb.impl`: `org.apache.hadoop.fs.azure.Wasb`
   * `pentaho.runtime.fs.default.name`: `wasb://<container-name>@<storage-account-name>.blob.core.windows.net`

   Example:

   ```xml
   <property>
     <name>fs.AbstractFileSystem.wasb.impl</name>
     <value>org.apache.hadoop.fs.azure.Wasb</value>
   </property>

   <property>
     <name>pentaho.runtime.fs.default.name</name>
     <value>wasb://&lt;container-name&gt;@&lt;storage-account-name&gt;.blob.core.windows.net</value>
   </property>
   ```

   **ADLS (ABFS) storage**

   Add this property:

   * `pentaho.runtime.fs.default.name`: `abfs://<container-name>@<storage-account-name>.dfs.core.windows.net`

   Example:

   ```xml
   <property>
     <name>pentaho.runtime.fs.default.name</name>
     <value>abfs://&lt;container-name&gt;@&lt;storage-account-name&gt;.dfs.core.windows.net</value>
   </property>
   ```
3. Save the file.

**Edit HBase site XML file**

If you use HBase, update `hbase-site.xml` to set the temporary directory.

1. Open `hbase-site.xml`.
2. Add or update this property:
   * `hbase.tmp.dir`: `/tmp/hadoop/hbase`
3. Save the file.
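
For example, the entry might look like this:

```xml
<property>
  <name>hbase.tmp.dir</name>
  <value>/tmp/hadoop/hbase</value>
</property>
```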

**Edit Hive site XML file**

If you use Hive, update `hive-site.xml` to set the Hive metastore location.

1. Open `hive-site.xml`.
2. Add or update these properties:

   * `hive.metastore.uris`: Hive metastore URI, if different from your HDInsight instance.
   * `fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net`: Azure storage key provider principal, if required.

   Example:

   ```xml
   <property>
     <name>hive.metastore.uris</name>
     <value>thrift://&lt;metastore-hostname&gt;:9083</value>
   </property>
   ```
3. Save the file.
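
If you set the key provider property, the entry might look like the following sketch; the storage account name is a placeholder and `SimpleKeyProvider` is one possible provider class:

```xml
<property>
  <name>fs.azure.account.keyprovider.&lt;storage-account&gt;.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.SimpleKeyProvider</value>
</property>
```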

**Edit Mapred site XML file**

If you use MapReduce, update `mapred-site.xml` for job history logging and cross-platform execution.

1. Open `mapred-site.xml`.
2. Ensure these properties exist:

   * `mapreduce.jobhistory.address`: where MapReduce job history logs are stored
   * `mapreduce.job.hdfs-servers`: HDFS servers used by YARN to run MapReduce jobs

   Example:

   ```xml
   <property>
     <name>mapreduce.jobhistory.address</name>
     <value>&lt;active-node-hostname&gt;:10020</value>
   </property>

   <property>
     <name>mapreduce.job.hdfs-servers</name>
     <value>hdfs://&lt;active-node-hostname&gt;:8020</value>
   </property>
   ```
3. Optional: If YARN containers run on JDK 11 nodes, add this property:
   * `mapreduce.jvm.add-opens-as-default`: `false`

{% hint style="warning" %}
Do not add `mapreduce.jvm.add-opens-as-default` for containers running on JDK 17 nodes.
{% endhint %}

Example:

```xml
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```

4. Save the file.

**Edit YARN site XML file**

If you use YARN, verify your `yarn-site.xml` settings.

1. Open `yarn-site.xml`.
2. Add or update these properties:

   * `yarn.resourcemanager.hostname`: ResourceManager host name
   * `yarn.resourcemanager.address`: ResourceManager address and port
   * `yarn.resourcemanager.admin.address`: ResourceManager admin address and port

   Example:

   ```xml
   <property>
     <name>yarn.resourcemanager.hostname</name>
     <value>&lt;resource-manager-hostname&gt;</value>
   </property>

   <property>
     <name>yarn.resourcemanager.address</name>
     <value>&lt;resource-manager-hostname&gt;:8050</value>
   </property>

   <property>
     <name>yarn.resourcemanager.admin.address</name>
     <value>&lt;resource-manager-hostname&gt;:8141</value>
   </property>
   ```
3. Save the file.

{% hint style="info" %}
After you change these files, edit the named connection and upload the updated files.
{% endhint %}

**Oozie configuration**

If you use Oozie, configure both the cluster and the Pentaho server.

By default, the `oozie` user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.

**Set up Oozie on a cluster**

Add your PDI user to `oozie-site.xml`.

1. Open `oozie-site.xml` on the cluster.
2. Add these properties. Replace `<pdi-username>` with the PDI user name.

   ```xml
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.groups</name>
     <value>*</value>
   </property>
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.hosts</name>
     <value>*</value>
   </property>
   ```
3. Save the file.

**Set up Oozie on the server**

Set the proxy user for the named cluster on the Pentaho server.

1. Open `config.properties`:

   `/<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

   This path is created when you create a named connection.
2. Set `pentaho.oozie.proxy.user` to the proxy user name.
3. Save the file.
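
For example, if the proxy user name is `pdiuser` (a placeholder), the entry would be:

```
pentaho.oozie.proxy.user=pdiuser
```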

**Windows configuration for a secured cluster**

If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your `krb5.conf` or `krb5.ini`.

1. Go to `server/pentaho-server`.
2. Open `start-pentaho.bat`.
3. Set `CATALINA_OPTS` to include the Kerberos config path:

   ```bat
   set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
   ```
4. Save the file.

**Connect to HDI with the PDI client**

After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.

See the *Pentaho Data Integration* documentation for how to connect the PDI client to a cluster.

**Connect other Pentaho components to HDI**

Create and test an Azure HDInsight (HDI) connection in:

* Pentaho Server
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Prerequisites**

Install a driver for the Pentaho Server. See [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test the connection in each product:

* **Pentaho Server (DI)**: Create a transformation in the PDI client. Run it remotely.
* **Pentaho Server (BA)**: Create a connection to HDI in the Data Source Wizard.
* **PME**: Create a connection to HDI.
* **PRD**: Create a connection to HDI.

After you connect, share connection details with users.

Users typically need:

* HDI distribution and version
* HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames, IP addresses, and ports
* Oozie URL (if used)
* Permissions for required HDFS directories, including user home directories

See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

#### Cloudera Data Platform (CDP)

Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).

**Before you begin**

Before you set up Pentaho to connect to CDP, do these tasks:

1. Check [Components Reference](https://docs.pentaho.com/install/components-reference). Verify your Pentaho version supports your CDP version.
2. Prepare CDP:
   1. Configure Cloudera Data Platform.

      See [CDP documentation](https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/index.html).
   2. Install required services and client tools.
   3. Test the platform.
3. Get connection details from your platform administrator.

   You will share some of this information with users later.

   See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).
4. Add the YARN user to the group defined by `dfs.permissions.superusergroup`.

   Find this property in `hdfs-site.xml` or in Cloudera Manager.
5. Set up the Hadoop driver for your CDP version. See [Install a new driver](#install-a-new-driver).

**Set up a secured instance of CDP**

If you connect to Kerberos-secured CDP, also do these tasks:

1. Configure Kerberos on the platform.

   Include the realm, KDC, and administrative server.
2. Configure these nodes to accept remote connection requests:
   * NameNode
   * DataNode
   * Secondary NameNode
   * JobTracker
   * TaskTracker
3. If you deployed CDP using an enterprise program, set up Kerberos for those nodes.
4. Add credentials to the Kerberos database for each Pentaho user.
5. Verify each user has an operating system account on each CDP node.

   Add operating system users if needed.

{% hint style="info" %}
User account UIDs should be greater than `min.user.id`.

This value is usually `1000`.
{% endhint %}

6. Set up Kerberos on your Pentaho machines.

   See **Administer Pentaho Data Integration and Analytics**.

**Edit configuration files for users**

Cloudera administrators download site configuration files for the services you use.

They update the files with Pentaho-specific and user-specific settings.

Users then upload the files when they create a named connection.

Named connection files are stored here:

* `<username>/.pentaho/metastore/pentaho/NamedCluster`
* `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save the updated files in a known location for reuse.

**Files to provide**

* `config.properties`
* `core-site.xml` (secured CDP only)
* `hive-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`

{% hint style="info" %}
If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.
{% endhint %}

**Edit Core site XML file**

If you use a secured instance of CDP, update `core-site.xml`.

1. Open `core-site.xml`.
2. Add or update these properties:

   | Property                                     | Value                                                     |
   | -------------------------------------------- | --------------------------------------------------------- |
   | `hadoop.proxyuser.oozie.hosts`               | Oozie hosts on your CDP cluster.                          |
   | `hadoop.proxyuser.oozie.groups`              | Oozie groups on your CDP cluster.                         |
   | `hadoop.proxyuser.<security_service>.hosts`  | Proxy user hosts for other services on your CDP cluster.  |
   | `hadoop.proxyuser.<security_service>.groups` | Proxy user groups for other services on your CDP cluster. |
   | `fs.s3a.access.key`                          | Your S3 access key, if you access S3 from CDP.            |
   | `fs.s3a.secret.key`                          | Your S3 secret key, if you access S3 from CDP.            |
3. Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:

   ```xml
   <property>
     <name>fs.s3a.delegation.token.binding</name>
     <value>org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding</value>
   </property>
   ```

   ```xml
   <property>
     <name>fs.s3a.aws.credentials.provider</name>
     <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
   </property>
   ```

   Ensure the gateway node has valid AWS credentials (for example, under `~/.aws/`).
4. Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:
   * Remove these properties:
     * `fs.azure.enable.delegation.token`
     * `fs.azure.delegation.token.provider.type`
     * `fs.azure.account.auth.type`
     * `fs.azure.account.oauth.provider.type`
   * Add these properties:
     * `fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net` = `SharedKey`
     * `fs.azure.account.key.<storage-account-name>.dfs.core.windows.net` = `<storage-account-key>`
5. Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:

   ```
   storage.bucket.get
   storage.objects.create
   storage.objects.delete
   storage.objects.get
   storage.objects.getIamPolicy
   storage.objects.list
   storage.objects.setIamPolicy
   storage.objects.update
   ```

   Assign the custom role to the Data Lake and Log service accounts for the bucket.
6. Save the file.
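
For the proxy user properties in step 2, the Oozie entries might look like the following sketch; the host and group values are placeholders for your cluster:

```xml
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>cdp-node1.example.com,cdp-node2.example.com</value>
</property>

<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>hadoop</value>
</property>
```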

**Edit Hive site XML file**

If you use Hive, update `hive-site.xml` to set the Hive metastore location.

1. Open `hive-site.xml`.
2. Add or update these properties:

   | Property                            | Value                                                                   |
   | ----------------------------------- | ----------------------------------------------------------------------- |
   | `hive.metastore.uris`               | Set this to the Hive metastore URI if it differs from your CDP cluster. |
   | `hive.server2.enable.impersonation` | Set to `true` if you use impersonation.                                 |
   | `hive.server2.enable.doAs`          | Set to `true` if you use impersonation.                                 |
   | `tez.lib.uris`                      | Required when you use Hive 3 on Tez.                                    |

   Example:

   ```xml
   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>
   ```

   ```xml
   <property>
     <name>tez.lib.uris</name>
     <value>/user/tez/0.9.1.7.1.4.0-203/tez.tar.gz</value>
   </property>
   ```
3. Save the file.

**Edit Mapred site XML file**

If you use MapReduce, update `mapred-site.xml` to set job history logging and allow cross-platform submissions.

1. Open `mapred-site.xml`.
2. Ensure these properties exist:

   | Property                                  | Value                                                                     |
   | ----------------------------------------- | ------------------------------------------------------------------------- |
   | `mapreduce.jobhistory.address`            | Where MapReduce job history logs are stored.                              |
   | `mapreduce.app-submission.cross-platform` | Set to `true` to allow submissions from Windows clients to Linux servers. |

   Example:

   ```xml
   <property>
     <name>mapreduce.app-submission.cross-platform</name>
     <value>true</value>
   </property>
   ```
3. Save the file.

**Edit YARN site XML file**

If you use YARN, verify your YARN settings in `yarn-site.xml`.

1. Open `yarn-site.xml`.
2. Add or update these properties:

   | Property                                             | Value                                                                              |
   | ---------------------------------------------------- | ---------------------------------------------------------------------------------- |
   | `yarn.application.classpath`                         | Classpaths needed to run YARN applications. Use commas to separate multiple paths. |
   | `yarn.resourcemanager.hostname`                      | Resource Manager host name for your environment.                                   |
   | `yarn.resourcemanager.address`                       | Resource Manager address and port for your environment.                            |
   | `yarn.resourcemanager.admin.address`                 | Resource Manager admin address and port for your environment.                      |
   | `yarn.resourcemanager.proxy-user-privileges.enabled` | Set to `true` if you use a proxy user.                                             |

   Example:

   ```xml
   <property>
     <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
     <value>true</value>
   </property>
   ```
3. Save the file.

{% hint style="info" %}
After you change these files, edit the named connection and upload the updated files.
{% endhint %}

**Oozie configuration**

If you use Oozie on your cluster, configure proxy access on the cluster and the server.

By default, the `oozie` user runs Oozie jobs.

If you start an Oozie job from PDI, configure a proxy user.

**Set up Oozie on a cluster**

Add your PDI user to `oozie-site.xml`.

1. Open `oozie-site.xml` on the cluster.
2. Add these properties.

   Replace `<your_pdi_user_name>` with your PDI user name.

   ```xml
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
     <value>*</value>
   </property>
   <property>
     <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
     <value>*</value>
   </property>
   ```
3. Save the file.

**Set up Oozie on the server**

Add the proxy user name to the PDI named connection configuration.

1. Open this file:

   `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties`

{% hint style="info" %}
This path is created when you create a named connection.
{% endhint %}

2. Set `pentaho.oozie.proxy.user` to the proxy user name.
3. Save the file.

**Windows configuration for a secured cluster**

If you run Pentaho Server on Windows and use Kerberos, set the path to your `krb5.conf` or `krb5.ini` file.

1. Open `server/pentaho-server/start-pentaho.bat`.
2. Add `-Djava.security.krb5.conf` to `CATALINA_OPTS`.

   Example:

   ```bat
   set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
   ```
3. Save the file.

**Connect to CDP with the PDI client**

After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.

See **Pentaho Data Integration** for the client connection steps.

**Connect other Pentaho components to CDP**

Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

**Create and test connections**

Create and test a connection in each component.

* **Pentaho Server for Data Integration (DI)**: Create a transformation in the PDI client, then run it remotely.
* **Pentaho Server for Business Analytics (BA)**: Create a connection to CDP in the Data Source Wizard.
* **Pentaho Metadata Editor (PME)**: Create a connection to CDP in PME.
* **Pentaho Report Designer (PRD)**: Create a connection to CDP in PRD.

**Share connection details with users**

After you connect to CDP and its services, give connection details to users who need access.

Users typically need:

* CDP distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
* Oozie URL (if used)
* Permission to access required HDFS directories, including home directories

Users might need more information, depending on the steps, entries, and services they use.

See [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).

#### Google Dataproc

Use the following settings when you configure Pentaho to connect to Google Dataproc.

**Before you begin**

Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:

1. Check the [Components Reference](https://docs.pentaho.com/install/components-reference).
2. Prepare your Google Cloud access:
   * Get credentials for a Google account and access to the Google Cloud Console.
   * Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.
3. Contact your Hadoop administrator for cluster connection details.

You also need to [provide some of this information to users](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list) after setup.

**Create a Dataproc cluster**

You can create a Dataproc cluster using several methods.

For cluster setup options, see the [Google Cloud Documentation](https://cloud.google.com/dataproc/docs/guides/create-cluster).

**Install the Google Cloud SDK on your local machine**

Use Google’s instructions to install the Google Cloud SDK for your platform:

* Linux: [Install the Google Cloud SDK on Linux](https://cloud.google.com/sdk/docs/downloads-interactive#linux)
* Windows: [Install the Google Cloud SDK on Windows](https://cloud.google.com/sdk/docs/downloads-interactive#windows)

**Set command variables**

Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.

1. Set the variables:

   ```bash
   export PROJECT=project
   export HOSTNAME=hostname
   export ZONE=zone
   ```
2. Set `PROJECT` to your Google Cloud project ID.
3. Set `HOSTNAME` to the name of the master node in your Dataproc cluster.

   **Note:** The master node name ends with `-m`.
4. Set `ZONE` to the zone of the instances in your Dataproc cluster.
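
For example, with illustrative placeholder values:

```bash
export PROJECT=my-dataproc-project   # your Google Cloud project ID
export HOSTNAME=my-cluster-m         # master node names end with -m
export ZONE=us-central1-a            # zone of the cluster instances
```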

**Set up a Google Compute Engine instance for PDI**

Run the PDI client inside Google Compute Engine (GCE). Users must connect remotely through VNC to use the desktop UI.

Because VM instances in GCE do not publicly expose the required remote desktop ports, create an SSH tunnel between the VNC client and the VM instance.

{% stepper %}
{% step %}

### Create a VM instance and set network tags

1. Open the Google Cloud Console.
2. Go to **Compute Engine** > **VM instances**.
3. Select **Create instance**.
4. Open **Advanced options** and then the **Networking** tab.
5. In **Network tags**, enter `vnc-server`.
{% endstep %}

{% step %}

### Install and configure VNC

1. Install the GNOME desktop environment on the VM instance.
2. Install and configure a VNC server for the remote UI.
{% endstep %}

{% step %}

### Connect using SSH and create an SSH tunnel

1. Log in to the instance using an SSH client and the VM's external IP address.

   **Note:** The Google Cloud Console shows the external IP.
2. Create an SSH tunnel from your VNC client machine.
3. Connect to the VNC session.
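
A typical tunnel command might look like the following sketch; the instance name and the VNC port are illustrative assumptions:

```bash
# Hypothetical sketch: forward local port 5901 to the VNC server on the
# VM instance, using the variables set earlier. VNC display :1 listens
# on port 5901.
gcloud compute ssh pdi-vm-instance \
  --project="$PROJECT" \
  --zone="$ZONE" \
  -- -L 5901:localhost:5901
```

Then point your VNC client at `localhost:5901`.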
{% endstep %}

{% step %}

### Optional: Configure Kerberos

If your Dataproc cluster is Kerberos-enabled, configure Kerberos on the GCE VM and authenticate the client machine with the Kerberos controller.
{% endstep %}
{% endstepper %}

When you finish, you can run PDI in GCE and design and launch jobs and transformations on Dataproc.

**Edit configuration files for users**

Your cluster administrator must download the cluster configuration files and update them with Pentaho-specific and user-specific values.

Use these files to create a named connection.

**Where named connection files live**

Named connection files are stored here:

* Named connection XML: `<username>/.pentaho/metastore/pentaho/NamedCluster`
* Named connection config folder: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>`
* Extra settings file: `<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties`

Save edited files in a safe location.

**Files to provide to users**

Provide these files to each user:

* `core-site.xml`
* `hdfs-site.xml`
* `mapred-site.xml`
* `yarn-site.xml`
* `hive-site.xml`

{% hint style="info" %}
You can copy these files from a Dataproc cluster using SCP.
{% endhint %}
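
For example, using the variables set earlier, a copy from the master node might look like this sketch; `/etc/hadoop/conf` and `/etc/hive/conf` are the usual Dataproc locations:

```bash
# Copy the Hadoop and Hive site files from the master node to the
# current directory.
gcloud compute scp "$HOSTNAME:/etc/hadoop/conf/*-site.xml" . --zone "$ZONE"
gcloud compute scp "$HOSTNAME:/etc/hive/conf/hive-site.xml" . --zone "$ZONE"
```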

**Edit `mapred-site.xml` (MapReduce)**

If you use MapReduce, update `mapred-site.xml` to enable cross-platform MapReduce job submission.

{% stepper %}
{% step %}

### Open the file

Open `mapred-site.xml` from the folder where you saved the other `*-site.xml` files.
{% endstep %}

{% step %}

### Add the property

Add this property:

```xml
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```

This property is required only when you submit MapReduce jobs from a Windows client.
{% endstep %}

{% step %}

### Save and apply the change

Save the file.

Edit the named connection.

Upload the updated `mapred-site.xml`.
{% endstep %}
{% endstepper %}

**Connect to a Hadoop cluster with the PDI client**

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the **Pentaho Data Integration** documentation for instructions.

**Connect other Pentaho components to Dataproc**

Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:

* Pentaho Server (DI and BA)
* Pentaho Metadata Editor (PME)
* Pentaho Report Designer (PRD)

**Install a driver for the Pentaho Server**

Install a driver for the Pentaho Server.

For instructions, see [Install a new driver](#install-a-new-driver).

**Create and test connections**

Create and test a connection for each component:

* **Pentaho Server for DI**: Create a transformation in the PDI client and run it remotely.
* **Pentaho Server for BA**: Create a connection to the cluster in the Data Source Wizard.
* **PME**: Create a connection to the cluster in PME.
* **PRD**: Create a connection to the cluster in PRD.

**Share connection details with users**

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

* Hadoop distribution and version
* HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
* Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see [Hadoop connection and access information list](https://docs.pentaho.com/install/legacy-redirects/hadoop-connection-and-access-information-list).
