Tasks to be performed by an IT administrator

Legacy page kept for existing links. Content moved to the main topic.

This content moved to Install Pentaho Data Integration and Analytics 11.0. 5. Save and close the file. 6. Copy default.properties to the .pentaho/simple-jndi directory in the user’s home directory. Replace the existing file.

**Note:** If the `.pentaho/simple-jndi` directory does not exist, create it.

7. Restart the server and verify the change.

After you update a product

After you configure a product to use encrypted passwords, all logins with that product use encrypted passwords.

Connect to any databases you updated to verify the changes.

Set up Pentaho to connect to a Hadoop cluster

Use this topic to configure Pentaho to connect to Hadoop clusters.

Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.

Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.

You can connect to clusters and services from these Pentaho components:

PDI client (Spoon), along with Kitchen and Pan command line tools
Pentaho Server
Analyzer (PAZ)
Pentaho Interactive Reports (PIR)
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)

Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).

To confirm which drivers are supported for your version, see the Components Reference.

Drivers are shipped as vendor-specific builds of the optional pentaho-big-data-ee-plugin.

Download drivers from the Hitachi Vantara Lumada and Pentaho Support Portal.

Note: Pentaho ships with a generic Apache Hadoop driver. For specific vendor drivers, visit the Hitachi Vantara Lumada and Pentaho Support Portal to download the drivers.

Install a new driver

You need a driver for each cluster vendor and version you connect to from:

PDI client (Spoon), plus Kitchen and Pan
Pentaho Server
Analyzer
Interactive Reports
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)

Pentaho ships with a generic Apache Hadoop driver. Download vendor-specific drivers from the Support Portal.

Download the driver plugin

Sign in to the Support Portal.
Go to Downloads.
In the 11.0 list, open the full downloads list.
Open Pentaho 11.0 GA Release.
Download the driver plugin from Big Data Shims.

Common driver plugin files:

Apache Vanilla: pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip
Cloudera Data Platform: pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip
Google Dataproc: pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip
Amazon EMR: pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip
Azure HDInsight: pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip

Install the driver on the PDI client

Stop PDI.
Extract the downloaded .zip into:
- <pdi-install-dir>/data-integration/plugins
If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.

Install the driver on the Pentaho Server

Stop the Pentaho Server.
Extract the downloaded .zip into:
- <pentaho-server>/pentaho-solutions/system/kettle/plugins
If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.

Restart and verify

Restart the PDI client and the Pentaho Server.
Create or update your cluster connection and verify it connects.

Update drivers

When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.

Additional configurations for specific distributions

Use these settings when you configure Pentaho to connect to specific Hadoop distributions:

Amazon EMR

The following settings are available while you configure Pentaho to connect to a working Amazon EMR cluster.

EMR clusters (version 7.x and later) built with JDK 17 exclude commons-lang-2.6.jar from standard Hadoop library directories (such as $HADOOP_HOME/lib).

To use the EMR driver with EMR 7.x:

Download commons-lang-2.6.jar from a trusted source (for example, Maven Repository: commons-lang » commons-lang » 2.6).
Copy the JAR to $HADOOP_HOME/lib or $HADOOP_MAPRED_HOME/lib on every EMR node.

Before you begin

Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:

Check the Components Reference to confirm your Pentaho version supports your EMR version.
Prepare your Amazon EMR cluster:
1. Configure an Amazon EC2 cluster.
2. Install required services and service client tools.
3. Test the cluster.
Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.
Get connection details from your Hadoop administrator.
Add the YARN user on the cluster to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.

As a best practice, install PDI on the Amazon EC2 instance.

Otherwise, you may not be able to read or write cluster files.

For a workaround, see Unable to read or write files to HDFS on the Amazon EMR cluster.

You also need to share connection details with users after setup.

For the full list, see Hadoop connection and access information list.

Edit configuration files for users

Your cluster administrator must download cluster configuration files.

Update the files with Pentaho-specific and user-specific values.

Use these files to create or update a named connection.

Where named connection files live

Named connection files are stored here:

Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save edited files in a safe location.

Files to provide to users

Provide these files to each user:

core-site.xml
mapred-site.xml
hdfs-site.xml
yarn-site.xml

Verify or edit core-site.xml file

If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in core-site.xml.

Edit core-site.xml to add AWS access keys and (optional) LZO compression settings.

Open the file

Open core-site.xml from the folder where you saved the other *-site.xml files.

Add AWS credentials

Add your AWS Access Key ID and secret access key:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

Optional: Add S3N credentials

If you use S3N, add these properties:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>[INSERT YOUR VALUE HERE]</value>
</property>

Add filesystem implementation settings

Add these properties:

<property>
   <name>fs.s3n.impl</name>
   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

<property>
   <name>fs.s3.impl</name>
   <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
   <name>fs.s3a.impl</name>
   <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

Configure LZO compression

If you are not using LZO compression, remove any references to com.hadoop.compression.lzo.LzoCodec from core-site.xml.

If you are using LZO compression:

Download the LZO JAR.
Add it to pentaho-big-data-plugin/hadoop-configurations/emr3x/lib.

Download: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/

Save and apply the change

Save the file.

Update the named connection.

Upload the updated core-site.xml.

Edit mapred-site.xml file

If you use MapReduce, edit mapred-site.xml.

You also enable cross-platform MapReduce job submission.

Open the file

Open mapred-site.xml from the folder where you saved the other *-site.xml files.

Add the property

Add this property:

<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>

This property is only required for MapReduce jobs on Windows.

Save and apply the change

Save the file.

Update the named connection.

Upload the updated mapred-site.xml.

Connect to a Hadoop cluster with the PDI client

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the Pentaho Data Integration documentation for instructions.

Connect other Pentaho components to the Amazon EMR cluster

Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:

Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)

Install a driver for the Pentaho Server

Install a driver for the Pentaho Server.

For instructions, see Install a new driver.

Create and test connections

Create and test a connection for each component:

Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.

Share connection details with users

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see Hadoop connection and access information list.

Azure HDInsight

Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).

Before you begin

Before you set up Pentaho to connect to HDI, do the following:

Check Components Reference. Confirm your Pentaho version supports your HDI version.
Prepare your HDI instance:
1. Configure your Azure HDInsight instance.
2. Install required services and client tools.
3. Test the platform.
4. If HDI uses Kerberos, complete the Kerberos steps in this page.
Get connection details from your platform admin. You will share some of this information with users later. See Hadoop connection and access information list.
Add the YARN user to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.
Set up the Hadoop driver for your HDI version. See Install a new driver.

Kerberos-secured HDInsight instances

If you connect to HDI secured with Kerberos, complete these steps first:

Configure Kerberos security on the platform. Configure the Kerberos realm, KDC, and admin server.
Configure these nodes to accept remote connection requests:
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
If you deployed HDI using an enterprise program, set up Kerberos for those nodes.
Add user credentials to the Kerberos database for each Pentaho user.
Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.

User account UIDs should be greater than min.user.id. The default is usually 1000.

Set up Kerberos on your Pentaho machines. See the Administer Pentaho Data Integration and Analytics guide.

Edit configuration files for users

Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.

Named connection files are stored in these locations:

<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save the updated files in a known location for reuse.

Files to provide

core-site.xml (secured HDInsight only)
hbase-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml

If you update these files after creating a named connection, edit the named connection and re-upload the updated files.

Edit Core site XML file

If you use a secured instance of Azure HDInsight, update core-site.xml.

Open core-site.xml.

Add or update properties for your storage type.

WASB storage

Add these properties:

fs.AbstractFileSystem.wasb.impl: org.apache.hadoop.fs.azure.Wasb
pentaho.runtime.fs.default.name: wasb://<container-name>@<storage-account-name>.blob.core.windows.net

Example:

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>wasb://&lt;container-name&gt;@&lt;storage-account-name&gt;.blob.core.windows.net</value>
</property>

ADLS (ABFS) storage

Add this property:

pentaho.runtime.fs.default.name: abfs://<container-name>@<storage-account-name>.dfs.core.windows.net

Example:

<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>abfs://&lt;container-name&gt;@&lt;storage-account-name&gt;.dfs.core.windows.net</value>
</property>

Save the file.

Edit HBase site XML file

If you use HBase, update hbase-site.xml to set the temporary directory.

Open hbase-site.xml.
Add or update this property:
- hbase.tmp.dir: /tmp/hadoop/hbase
Save the file.

Edit Hive site XML file

If you use Hive, update hive-site.xml to set the Hive metastore location.

Open hive-site.xml.
Add or update these properties:
- hive.metastore.uris: Hive metastore URI, if different from your HDInsight instance.
- fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net: Azure storage key provider principal, if required.
Example:
```
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://&lt;metastore-hostname&gt;:9083</value>
</property>
```
Save the file.

Edit Mapred site XML file

If you use MapReduce, update mapred-site.xml for job history logging and cross-platform execution.

Open mapred-site.xml.

Ensure these properties exist:

mapreduce.jobhistory.address: where MapReduce job history logs are stored
mapreduce.job.hdfs-servers: HDFS servers used by YARN to run MapReduce jobs

Example:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>&lt;active-node-hostname&gt;:10020</value>
</property>

<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://&lt;active-node-hostname&gt;:8020</value>
</property>

Optional: If YARN containers run on JDK 11 nodes, add this property:
- mapreduce.jvm.add-opens-as-default: false

Do not add mapreduce.jvm.add-opens-as-default for containers running on JDK 17 nodes.

Example:

<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>

Save the file.

Edit YARN site XML file

If you use YARN, verify your yarn-site.xml settings.

Open yarn-site.xml.

Add or update these properties:

yarn.resourcemanager.hostname: ResourceManager host name
yarn.resourcemanager.address: ResourceManager address and port
yarn.resourcemanager.admin.address: ResourceManager admin address and port

Example:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>&lt;resource-manager-hostname&gt;</value>
</property>

<property>
  <name>yarn.resourcemanager.address</name>
  <value>&lt;resource-manager-hostname&gt;:8050</value>
</property>

<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>&lt;resource-manager-hostname&gt;:8141</value>
</property>

Save the file.

After you change these files, edit the named connection and upload the updated files.

Oozie configuration

If you use Oozie, configure both the cluster and the Pentaho server.

By default, the Oozie user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.

Set up Oozie on a cluster

Add your PDI user to oozie-site.xml.

Open oozie-site.xml on the cluster.

Add these properties. Replace <pdi-username> with the PDI user name.

<property>
  <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.hosts</name>
  <value>*</value>
</property>

Save the file.

Set up Oozie on the server

Set the proxy user for the named cluster on the Pentaho server.

Open config.properties:
/<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
This path is created when you create a named connection.
Set pentaho.oozie.proxy.user to the proxy user name.
Save the file.

Windows configuration for a secured cluster

If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your krb5.conf or krb5.ini.

Go to server/pentaho-server.
Open start-pentaho.bat.

Set CATALINA_OPTS to include the Kerberos config path:

set \"CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\\kerberos\\krb5.conf\"

Save the file.

Connect to HDI with the PDI client

After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.

See the Pentaho Data Integration documentation for how to connect the PDI client to a cluster.

Connect other Pentaho components to HDI

Create and test an Azure HDInsight (HDI) connection in:

Pentaho Server
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)

Prerequisites

Install a driver for the Pentaho Server. See Install a new driver.

Create and test connections

Create and test the connection in each product:

Pentaho Server (DI): Create a transformation in the PDI client. Run it remotely.
Pentaho Server (BA): Create a connection to HDI in the Data Source Wizard.
PME: Create a connection to HDI.
PRD: Create a connection to HDI.

After you connect, share connection details with users.

Users typically need:

HDI distribution and version
HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames, IP addresses, and ports
Oozie URL (if used)
Permissions for required HDFS directories, including user home directories

See Hadoop connection and access information list.

Cloudera Data Platform (CDP)

Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).

Before you begin

Before you set up Pentaho to connect to CDP, do these tasks:

Check Components Reference. Verify your Pentaho version supports your CDP version.
Prepare CDP:
1. Configure Cloudera Data Platform.
  See CDP documentation.
2. Install required services and client tools.
3. Test the platform.
Get connection details from your platform administrator.
You will share some of this information with users later.
See Hadoop connection and access information list.
Add the YARN user to the group defined by dfs.permissions.superusergroup.
Find this property in hdfs-site.xml or in Cloudera Manager.
Set up the Hadoop driver for your CDP version. See Install a new driver.

Set up a secured instance of CDP

If you connect to Kerberos-secured CDP, also do these tasks:

Configure Kerberos on the platform.
Include the realm, KDC, and administrative server.
Configure these nodes to accept remote connection requests:
- Name
- Data
- Secondary
- Job tracker
- Task tracker
If you deployed CDP using an enterprise program, set up Kerberos for:
- Name
- Data
- Secondary name
- Job tracker
- Task tracker nodes
Add credentials to the Kerberos database for each Pentaho user.
Verify each user has an operating system account on each CDP node.
Add operating system users if needed.

User account UIDs should be greater than min.user.id.

This value is usually 1000.

Set up Kerberos on your Pentaho machines.
See Administer Pentaho Data Integration and Analytics.

Edit configuration files for users

Cloudera administrators download site configuration files for the services you use.

They update the files with Pentaho-specific and user-specific settings.

Users then upload the files when they create a named connection.

Named connection files are stored here:

<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save the updated files in a known location for reuse.

Files to provide

config.properties
core-site.xml (secured CDP only)
hive-site.xml
mapred-site.xml
yarn-site.xml

If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.

Edit Core site XML file

If you use a secured instance of CDP, update core-site.xml.

Open core-site.xml.
Add or update these properties:
Property
Value
hadoop.proxyuser.oozie.hosts
Oozie hosts on your CDP cluster.
hadoop.proxyuser.oozie.groups
Oozie groups on your CDP cluster.
hadoop.proxyuser.<security_service>.hosts
Proxy user hosts for other services on your CDP cluster.
hadoop.proxyuser.<security_service>.groups
Proxy user groups for other services on your CDP cluster.
fs.s3a.access.key
Your S3 access key, if you access S3 from CDP.
fs.s3a.secret.key
Your S3 secret key, if you access S3 from CDP.

Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:

<property>
  <name>fs.s3a.delegation.token.binding</name>
  <value>org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding</value>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>

Ensure the gateway node has valid AWS credentials (for example, under ~/.aws/).

Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:
- Remove these properties:
  - fs.azure.enable.delegation.token
  - fs.azure.delegation.token.provider.type
  - fs.azure.account.auth.type
  - fs.azure.account.oauth.provider.type
- Add these properties:
  - fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net = SharedKey
  - fs.azure.account.key.<storage-account-name>.dfs.core.windows.net = <storage-account-key>
Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:
```
storage.bucket.get
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.getIamPolicy
storage.objects.list
storage.objects.setIamPolicy
storage.objects.update
```
Assign the custom role to the Data Lake and Log service accounts for the bucket.
Save the file.

Edit Hive site XML file

If you use Hive, update hive-site.xml to set the Hive metastore location.

Open hive-site.xml.
Add or update these properties:
Property
Value
hive.metastore.uris
Set this to the Hive metastore URI if it differs from your CDP cluster.
hive.server2.enable.impersonation
Set to true if you use impersonation.
hive.server2.enable.doAs
Set to true if you use impersonation.
tez.lib.uris
Required when you use Hive 3 on Tez.
Example:
```
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
```
```
<property>
  <name>tez.lib.uris</name>
  <value>/user/tez/0.9.1.7.1.4.0-203/tez.tar.gz</value>
</property>
```
Save the file.

Edit Mapred site XML file

If you use MapReduce, update mapred-site.xml to set job history logging and allow cross-platform submissions.

Open mapred-site.xml.
Ensure these properties exist:
Property
Value
mapreduce.jobhistory.address
Where MapReduce job history logs are stored.
mapreduce.app-submission.cross-platform
Set to true to allow submissions from Windows clients to Linux servers.
Example:
```
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```
Save the file.

Edit YARN site XML file

If you use YARN, verify your YARN settings in yarn-site.xml.

Open yarn-site.xml.
Add or update these properties:
Property
Value
yarn.application.classpath
Classpaths needed to run YARN applications. Use commas to separate multiple paths.
yarn.resourcemanager.hostname
Resource Manager host name for your environment.
yarn.resourcemanager.address
Resource Manager address and port for your environment.
yarn.resourcemanager.admin.address
Resource Manager admin address and port for your environment.
yarn.resourcemanager.proxy-user-privileges.enabled
Set to true if you use a proxy user.
Example:
```
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
```
Save the file.

After you change these files, edit the named connection and upload the updated files.

Oozie configuration

If you use Oozie on your cluster, configure proxy access on the cluster and the server.

By default, the oozie user runs Oozie jobs.

If you start an Oozie job from PDI, configure a proxy user.

Set up Oozie on a cluster

Add your PDI user to oozie-site.xml.

Open oozie-site.xml on the cluster.

Add these properties.

Replace <your_pdi_user_name> with your PDI user name.

<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
  <value>*</value>
</property>

Save the file.

Set up Oozie on the server

Add the proxy user name to the PDI named connection configuration.

Open this file:
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties

This path is created when you create a named connection.

Set pentaho.oozie.proxy.user to the proxy user name.
Save the file.

Windows configuration for a secured cluster

If you run Pentaho Server on Windows and use Kerberos, set the path to your krb5.conf or krb5.ini file.

Open server/pentaho-server/start-pentaho.bat.

Add -Djava.security.krb5.conf to CATALINA_OPTS.

Example:

set \"CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\\kerberos\\krb5.conf\"

Save the file.

Connect to CDP with the PDI client

After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.

See Pentaho Data Integration for the client connection steps.

Connect other Pentaho components to CDP

Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

Create and test connections

Create and test a connection in each component.

Pentaho Server for Data Integration (DI): Create a transformation in the PDI client, then run it remotely.
Pentaho Server for Business Analytics (BA): Create a connection to CDP in the Data Source Wizard.
Pentaho Metadata Editor (PME): Create a connection to CDP in PME.
Pentaho Report Designer (PRD): Create a connection to CDP in PRD.

Share connection details with users

After you connect to CDP and its services, give connection details to users who need access.

Users typically need:

CDP distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
Oozie URL (if used)
Permission to access required HDFS directories, including home directories

Users might need more information, depending on the steps, entries, and services they use.

See Hadoop connection and access information list.

Google Dataproc

The following settings are available while you configure Pentaho to connect to Google Dataproc.

Before you begin

Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:

Check the Components Reference.
Prepare your Google Cloud access:
- Get credentials for a Google account and access to the Google Cloud Console.
- Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.
Contact your Hadoop administrator for cluster connection details.

You also need to provide some of this information to users after setup.

Create a Dataproc cluster

You can create a Dataproc cluster using several methods.

For cluster setup options, see the Google Cloud Documentation.

Install the Google Cloud SDK on your local machine

Use Google’s instructions to install the Google Cloud SDK for your platform:

Linux: Install the Google Cloud SDK on Linux
Windows: Install the Google Cloud SDK on Windows

Set command variables

Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.

Set the variables:

export PROJECT=project
export HOSTNAME=hostname
export ZONE=zone

Set PROJECT to your Google Cloud project ID.
Set HOSTNAME to the name of the master node in your Dataproc cluster.
Note: The master node name ends with -m.
Set ZONE to the zone of the instances in your Dataproc cluster.

Set up a Google Compute Engine instance for PDI

Run the PDI client inside Google Compute Engine (GCE).

Users must connect remotely through VNC to use the desktop UI.

VM instances in GCE do not publicly expose the required remote desktop ports.

Create an SSH tunnel between the VNC client and the VM instance.

Create a VM instance and set network tags

In the Google Cloud Console, open the Compute Engine console.
Go to Compute Engine > VM instances.
Select Create instance.
Open Advanced options and then the Networking tab.
In Network tags, enter vnc-server.

Install and configure VNC

Install and update a VNC service for the remote UI.
Install Gnome and VNC.

Connect using SSH and create an SSH tunnel

Log in to the instance using SSH.
Use an SSH client and the VM external IP.
Note: The Google Cloud Console shows the external IP.
Create an SSH tunnel from your VNC client machine.
Connect to the VNC session.

Optional: Configure Kerberos

If you use Kerberos, configure Kerberos on the GCE VM.

Authenticate the client machine with the Kerberos controller.

This is required for Kerberos-enabled Dataproc clusters.

When you finish, you can run PDI in GCE.

You can design and launch jobs and transformations on Dataproc.

Edit configuration files for users

Your cluster administrator must download cluster configuration files.

Update the files with Pentaho-specific and user-specific values.

Use these files to create a named connection.

Where named connection files live

Named connection files are stored here:

Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties

Save edited files in a safe location.

Files to provide to users

Provide these files to each user:

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hive-site.xml

You can copy these files from a Dataproc cluster using SCP.

Edit mapred-site.xml (MapReduce)

If you use MapReduce, update mapred-site.xml.

You also enable cross-platform MapReduce job submission.

Open the file

Open mapred-site.xml from the folder where you saved the other *-site.xml files.

Add the property

Add this property:

<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>

This property is only required for MapReduce jobs on Windows.

Save and apply the change

Save the file.

Edit the named connection.

Upload the updated mapred-site.xml.

Connect to a Hadoop cluster with the PDI client

After you set up the Pentaho Server to connect to a cluster, configure and test the connection.

See the Pentaho Data Integration documentation for instructions.

Connect other Pentaho components to Dataproc

Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:

Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)

Install a driver for the Pentaho Server

Install a driver for the Pentaho Server.

For instructions, see Install a new driver.

Create and test connections

Create and test a connection for each component:

Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.

Share connection details with users

After you connect to the cluster and services, share the connection details with users.

Users can access the cluster only from machines configured to connect to it.

To connect, users need:

Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)

Users also need permissions for required HDFS directories.

For a detailed list of required information, see Hadoop connection and access information list.

Last updated 9 days ago

Was this helpful?

hashtagAfter you update a product

hashtagSet up Pentaho to connect to a Hadoop cluster

hashtagInstall a new driver

hashtagDownload the driver plugin

hashtagInstall the driver on the PDI client

hashtagInstall the driver on the Pentaho Server

hashtagRestart and verify

hashtagUpdate drivers

hashtagAdditional configurations for specific distributions

hashtagOpen the file

hashtagAdd AWS credentials

hashtagOptional: Add S3N credentials

hashtagAdd filesystem implementation settings

hashtagConfigure LZO compression

hashtagSave and apply the change

hashtagOpen the file

hashtagAdd the property

hashtagSave and apply the change

hashtagCreate a VM instance and set network tags

hashtagInstall and configure VNC

hashtagConnect using SSH and create an SSH tunnel

hashtagOptional: Configure Kerberos

hashtagOpen the file

hashtagAdd the property

hashtagSave and apply the change

After you update a product

Set up Pentaho to connect to a Hadoop cluster

Install a new driver

Download the driver plugin

Install the driver on the PDI client

Install the driver on the Pentaho Server

Restart and verify

Update drivers

Additional configurations for specific distributions

Open the file

Add AWS credentials

Optional: Add S3N credentials

Add filesystem implementation settings

Configure LZO compression

Save and apply the change

Open the file

Add the property

Save and apply the change

Create a VM instance and set network tags

Install and configure VNC

Connect using SSH and create an SSH tunnel

Optional: Configure Kerberos

Open the file

Add the property

Save and apply the change