Tasks to be performed by an IT administrator
Legacy page kept for existing links. Content moved to the main topic.
This content moved to Install Pentaho Data Integration and Analytics 11.0.
5. Save and close the file.
6. Copy default.properties to the .pentaho/simple-jndi directory in the user’s home directory, replacing the existing file.
**Note:** If the `.pentaho/simple-jndi` directory does not exist, create it.
7. Restart the server and verify the change.
After you update a product
After you configure a product to use encrypted passwords, all logins with that product use encrypted passwords.
Connect to any databases you updated to verify the changes.
Set up Pentaho to connect to a Hadoop cluster
Use this topic to configure Pentaho to connect to Hadoop clusters.
Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.
Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.
You can connect to clusters and services from these Pentaho components:
PDI client (Spoon), along with Kitchen and Pan command line tools
Pentaho Server
Analyzer (PAZ)
Pentaho Interactive Reports (PIR)
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)
Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).
To confirm which drivers are supported for your version, see the Components Reference.
Drivers are shipped as vendor-specific builds of the optional pentaho-big-data-ee-plugin.
Download drivers from the Hitachi Vantara Lumada and Pentaho Support Portal.
Note: Pentaho ships with a generic Apache Hadoop driver. For specific vendor drivers, visit the Hitachi Vantara Lumada and Pentaho Support Portal to download the drivers.
Install a new driver
You need a driver for each cluster vendor and version you connect to from:
PDI client (Spoon), plus Kitchen and Pan
Pentaho Server
Analyzer
Interactive Reports
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)
Pentaho ships with a generic Apache Hadoop driver. Download vendor-specific drivers from the Support Portal.
Download the driver plugin
Sign in to the Support Portal.
Go to Downloads.
In the 11.0 list, open the full downloads list.
Open Pentaho 11.0 GA Release.
Download the driver plugin from Big Data Shims.
Common driver plugin files:
Apache Vanilla: pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip
Cloudera Data Platform: pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip
Google Dataproc: pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip
Amazon EMR: pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip
Azure HDInsight: pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip
Update drivers
When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.
Additional configurations for specific distributions
Use these settings when you configure Pentaho to connect to specific Hadoop distributions:
Amazon EMR
Use these settings when you configure Pentaho to connect to a working Amazon EMR cluster.
EMR clusters (version 7.x and later) built with JDK 17 exclude commons-lang-2.6.jar from standard Hadoop library directories (such as $HADOOP_HOME/lib).
To use the EMR driver with EMR 7.x:
Download commons-lang-2.6.jar from a trusted source (for example, Maven Repository: commons-lang » commons-lang » 2.6).
Copy the JAR to $HADOOP_HOME/lib or $HADOOP_MAPRED_HOME/lib on every EMR node.
Before you begin
Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:
Check the Components Reference to confirm your Pentaho version supports your EMR version.
Prepare your Amazon EMR cluster:
Configure an Amazon EC2 cluster.
Install required services and service client tools.
Test the cluster.
Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.
Get connection details from your Hadoop administrator.
Add the YARN user on the cluster to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.
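For reference, a minimal sketch of how this group is defined in hdfs-site.xml (the group name supergroup is only the stock Hadoop default, not a value from this guide; use the group configured on your cluster):

```xml
<!-- Inside the <configuration> element of hdfs-site.xml; "supergroup" is a placeholder group name. -->
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```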
As a best practice, install PDI on the Amazon EC2 instance.
Otherwise, you may not be able to read or write cluster files.
For a workaround, see Unable to read or write files to HDFS on the Amazon EMR cluster.
You also need to share connection details with users after setup.
For the full list, see Hadoop connection and access information list.
Edit configuration files for users
Your cluster administrator must download cluster configuration files.
Update the files with Pentaho-specific and user-specific values.
Use these files to create or update a named connection.
Where named connection files live
Named connection files are stored here:
Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save edited files in a safe location.
Files to provide to users
Provide these files to each user:
core-site.xml
mapred-site.xml
hdfs-site.xml
yarn-site.xml
Verify or edit core-site.xml file
If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in core-site.xml.
Edit core-site.xml to add AWS access keys and (optional) LZO compression settings.
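A sketch of how the access keys could be supplied, assuming your cluster reads the standard fs.s3a.* properties (the same keys listed later in the Cloudera Data Platform section); confirm the property names with your Hadoop administrator:

```xml
<!-- Inside the <configuration> element of core-site.xml; the key values are placeholders. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
```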
Configure LZO compression
If you are not using LZO compression, remove any references to com.hadoop.compression.lzo.LzoCodec from core-site.xml.
If you are using LZO compression:
Download the LZO JAR from http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/
Add it to pentaho-big-data-plugin/hadoop-configurations/emr3x/lib.
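For orientation, LzoCodec references typically appear in the codec list of core-site.xml; a sketch of such an entry when LZO is enabled (io.compression.codecs is a standard Hadoop property, not one named by this guide):

```xml
<!-- Inside the <configuration> element of core-site.xml; remove the LzoCodec entry instead if you do not use LZO. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec</value>
</property>
```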
Edit mapred-site.xml file
If you use MapReduce, edit mapred-site.xml and enable cross-platform MapReduce job submission, as sketched below.
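A minimal sketch of the cross-platform setting, borrowing the mapreduce.app-submission.cross-platform property described in the Cloudera Data Platform section below:

```xml
<!-- Inside the <configuration> element of mapred-site.xml. -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```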
Connect to a Hadoop cluster with the PDI client
After you set up the Pentaho Server to connect to a cluster, configure and test the connection.
See the Pentaho Data Integration documentation for instructions.
Connect other Pentaho components to the Amazon EMR cluster
Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:
Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Install a driver for the Pentaho Server
Install a driver for the Pentaho Server.
For instructions, see Install a new driver.
Create and test connections
Create and test a connection for each component:
Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.
Share connection details with users
After you connect to the cluster and services, share the connection details with users.
Users can access the cluster only from machines configured to connect to it.
To connect, users need:
Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)
Users also need permissions for required HDFS directories.
For a detailed list of required information, see Hadoop connection and access information list.
Azure HDInsight
Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).
Before you begin
Before you set up Pentaho to connect to HDI, do the following:
Check Components Reference. Confirm your Pentaho version supports your HDI version.
Prepare your HDI instance:
Configure your Azure HDInsight instance.
Install required services and client tools.
Test the platform.
If HDI uses Kerberos, complete the Kerberos steps on this page.
Get connection details from your platform admin. You will share some of this information with users later. See Hadoop connection and access information list.
Add the YARN user to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.
Set up the Hadoop driver for your HDI version. See Install a new driver.
Kerberos-secured HDInsight instances
If you connect to HDI secured with Kerberos, complete these steps first:
Configure Kerberos security on the platform. Configure the Kerberos realm, KDC, and admin server.
Configure these nodes to accept remote connection requests:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
If you deployed HDI using an enterprise program, set up Kerberos for those nodes.
Add user credentials to the Kerberos database for each Pentaho user.
Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.
User account UIDs should be greater than min.user.id. The default is usually 1000.
Set up Kerberos on your Pentaho machines. See the Administer Pentaho Data Integration and Analytics guide.
Edit configuration files for users
Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.
Named connection files are stored in these locations:
<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save the updated files in a known location for reuse.
Files to provide
core-site.xml (secured HDInsight only)
hbase-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml
If you update these files after creating a named connection, edit the named connection and re-upload the updated files.
Edit Core site XML file
If you use a secured instance of Azure HDInsight, update core-site.xml.
Open core-site.xml.
Add or update properties for your storage type.
WASB storage
Add these properties:
fs.AbstractFileSystem.wasb.impl: org.apache.hadoop.fs.azure.Wasb
pentaho.runtime.fs.default.name: wasb://<container-name>@<storage-account-name>.blob.core.windows.net
Example:
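A minimal sketch, with placeholder container and storage account names:

```xml
<!-- Inside the <configuration> element of core-site.xml; "mycontainer" and "mystorageaccount" are placeholders. -->
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>wasb://mycontainer@mystorageaccount.blob.core.windows.net</value>
</property>
```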
ADLS (ABFS) storage
Add this property:
pentaho.runtime.fs.default.name: abfs://<container-name>@<storage-account-name>.dfs.core.windows.net
Example:
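A minimal sketch, with placeholder container and storage account names:

```xml
<!-- Inside the <configuration> element of core-site.xml; "mycontainer" and "mystorageaccount" are placeholders. -->
<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>abfs://mycontainer@mystorageaccount.dfs.core.windows.net</value>
</property>
```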
Save the file.
Edit HBase site XML file
If you use HBase, update hbase-site.xml to set the temporary directory.
Open hbase-site.xml.
Add or update this property:
hbase.tmp.dir: /tmp/hadoop/hbase
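A minimal sketch using the value given above:

```xml
<!-- Inside the <configuration> element of hbase-site.xml. -->
<property>
  <name>hbase.tmp.dir</name>
  <value>/tmp/hadoop/hbase</value>
</property>
```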
Save the file.
Edit Hive site XML file
If you use Hive, update hive-site.xml to set the Hive metastore location.
Open hive-site.xml.
Add or update these properties:
hive.metastore.uris: Hive metastore URI, if different from your HDInsight instance.
fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net: Azure storage key provider principal, if required.
Example:
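A sketch with placeholder values; the metastore host and port, the storage account name, and the key provider class shown here (SimpleKeyProvider from the hadoop-azure library) are illustrative only:

```xml
<!-- Inside the <configuration> element of hive-site.xml; host, port, storage account, and provider class are placeholders. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
<property>
  <name>fs.azure.account.keyprovider.mystorageaccount.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.SimpleKeyProvider</value>
</property>
```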
Save the file.
Edit Mapred site XML file
If you use MapReduce, update mapred-site.xml for job history logging and cross-platform execution.
Open mapred-site.xml.
Ensure these properties exist:
mapreduce.jobhistory.address: where MapReduce job history logs are stored
mapreduce.job.hdfs-servers: HDFS servers used by YARN to run MapReduce jobs
Example:
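A sketch with placeholder host names and typical default ports:

```xml
<!-- Inside the <configuration> element of mapred-site.xml; hosts and ports are placeholders. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host:10020</value>
</property>
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```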
Optional: If YARN containers run on JDK 11 nodes, add this property:
mapreduce.jvm.add-opens-as-default: false
Do not add mapreduce.jvm.add-opens-as-default for containers running on JDK 17 nodes.
Example:
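A minimal sketch of the optional JDK 11 setting (omit it entirely for JDK 17 containers, as noted above):

```xml
<!-- Inside the <configuration> element of mapred-site.xml; only for YARN containers running on JDK 11 nodes. -->
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```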
Save the file.
Edit YARN site XML file
If you use YARN, verify your yarn-site.xml settings.
Open yarn-site.xml.
Add or update these properties:
yarn.resourcemanager.hostname: ResourceManager host name
yarn.resourcemanager.address: ResourceManager address and port
yarn.resourcemanager.admin.address: ResourceManager admin address and port
Example:
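A sketch with a placeholder host name and the usual ResourceManager ports:

```xml
<!-- Inside the <configuration> element of yarn-site.xml; host and ports are placeholders. -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager-host</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>resourcemanager-host:8033</value>
</property>
```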
Save the file.
After you change these files, edit the named connection and upload the updated files.
Oozie configuration
If you use Oozie, configure both the cluster and the Pentaho server.
By default, the Oozie user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.
Set up Oozie on a cluster
Add your PDI user to oozie-site.xml.
Open oozie-site.xml on the cluster.
Add these properties, replacing <pdi-username> with the PDI user name (see the sketch below).
Save the file.
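A sketch assuming the standard Oozie ProxyUserService properties; this guide does not list the exact property names, so confirm them against your Oozie version (the wildcard values are placeholders):

```xml
<!-- Inside the <configuration> element of oozie-site.xml; replace "pdi-username" with your PDI user name. -->
<property>
  <name>oozie.service.ProxyUserService.proxyuser.pdi-username.hosts</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.pdi-username.groups</name>
  <value>*</value>
</property>
```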
Set up Oozie on the server
Set the proxy user for the named cluster on the Pentaho server.
Open config.properties: /<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
This path is created when you create a named connection.
Set pentaho.oozie.proxy.user to the proxy user name (see the sketch below).
Save the file.
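A one-line sketch of the setting, with a placeholder user name:

```properties
# In config.properties for the named connection; "pdiproxyuser" is a placeholder.
pentaho.oozie.proxy.user=pdiproxyuser
```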
Windows configuration for a secured cluster
If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your krb5.conf or krb5.ini.
Go to server/pentaho-server.
Open start-pentaho.bat.
Set CATALINA_OPTS to include the path to your krb5.conf or krb5.ini file.
Save the file.
Connect to HDI with the PDI client
After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.
See the Pentaho Data Integration documentation for how to connect the PDI client to a cluster.
Connect other Pentaho components to HDI
Create and test an Azure HDInsight (HDI) connection in:
Pentaho Server
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Prerequisites
Install a driver for the Pentaho Server. See Install a new driver.
Create and test connections
Create and test the connection in each product:
Pentaho Server (DI): Create a transformation in the PDI client. Run it remotely.
Pentaho Server (BA): Create a connection to HDI in the Data Source Wizard.
PME: Create a connection to HDI.
PRD: Create a connection to HDI.
After you connect, share connection details with users.
Users typically need:
HDI distribution and version
HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames, IP addresses, and ports
Oozie URL (if used)
Permissions for required HDFS directories, including user home directories
See Hadoop connection and access information list.
Cloudera Data Platform (CDP)
Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).
Before you begin
Before you set up Pentaho to connect to CDP, do these tasks:
Check Components Reference. Verify your Pentaho version supports your CDP version.
Prepare CDP:
Configure Cloudera Data Platform.
See CDP documentation.
Install required services and client tools.
Test the platform.
Get connection details from your platform administrator.
You will share some of this information with users later.
Add the YARN user to the group defined by dfs.permissions.superusergroup. Find this property in hdfs-site.xml or in Cloudera Manager.
Set up the Hadoop driver for your CDP version. See Install a new driver.
Set up a secured instance of CDP
If you connect to Kerberos-secured CDP, also do these tasks:
Configure Kerberos on the platform.
Include the realm, KDC, and administrative server.
Configure these nodes to accept remote connection requests:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
If you deployed CDP using an enterprise program, set up Kerberos for these nodes:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
Add credentials to the Kerberos database for each Pentaho user.
Verify each user has an operating system account on each CDP node.
Add operating system users if needed.
User account UIDs should be greater than min.user.id.
This value is usually 1000.
Set up Kerberos on your Pentaho machines.
See Administer Pentaho Data Integration and Analytics.
Edit configuration files for users
Cloudera administrators download site configuration files for the services you use.
They update the files with Pentaho-specific and user-specific settings.
Users then upload the files when they create a named connection.
Named connection files are stored here:
<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save the updated files in a known location for reuse.
Files to provide
config.properties
core-site.xml (secured CDP only)
hive-site.xml
mapred-site.xml
yarn-site.xml
If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.
Edit Core site XML file
If you use a secured instance of CDP, update core-site.xml.
Open core-site.xml.
Add or update these properties:
hadoop.proxyuser.oozie.hosts: Oozie hosts on your CDP cluster.
hadoop.proxyuser.oozie.groups: Oozie groups on your CDP cluster.
hadoop.proxyuser.<security_service>.hosts: Proxy user hosts for other services on your CDP cluster.
hadoop.proxyuser.<security_service>.groups: Proxy user groups for other services on your CDP cluster.
fs.s3a.access.key: Your S3 access key, if you access S3 from CDP.
fs.s3a.secret.key: Your S3 secret key, if you access S3 from CDP.
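A sketch of the Oozie proxy-user entries; the wildcard values are placeholders (restrict them to specific hosts and groups as your security policy requires):

```xml
<!-- Inside the <configuration> element of core-site.xml; "*" values are placeholders. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
```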
Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:
Ensure the gateway node has valid AWS credentials (for example, under ~/.aws/).
Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:
Remove these properties:
fs.azure.enable.delegation.token
fs.azure.delegation.token.provider.type
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
Add these properties:
fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net=SharedKey
fs.azure.account.key.<storage-account-name>.dfs.core.windows.net=<storage-account-key>
Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:
Assign the custom role to the Data Lake and Log service accounts for the bucket.
Save the file.
Edit Hive site XML file
If you use Hive, update hive-site.xml to set the Hive metastore location.
Open hive-site.xml.
Add or update these properties:
hive.metastore.uris: Set this to the Hive metastore URI if it differs from your CDP cluster.
hive.server2.enable.impersonation: Set to true if you use impersonation.
hive.server2.enable.doAs: Set to true if you use impersonation.
tez.lib.uris: Required when you use Hive 3 on Tez.
Example:
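A sketch with placeholder values; the metastore URI and the Tez library path are illustrative only and should come from your administrator:

```xml
<!-- Inside the <configuration> element of hive-site.xml; hosts, ports, and paths are placeholders. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
<property>
  <name>tez.lib.uris</name>
  <value>hdfs://namenode-host:8020/user/tez/tez.tar.gz</value>
</property>
```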
Save the file.
Edit Mapred site XML file
If you use MapReduce, update mapred-site.xml to set job history logging and allow cross-platform submissions.
Open mapred-site.xml.
Ensure these properties exist:
mapreduce.jobhistory.address: Where MapReduce job history logs are stored.
mapreduce.app-submission.cross-platform: Set to true to allow submissions from Windows clients to Linux servers.
Example:
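A sketch with a placeholder job history host and port:

```xml
<!-- Inside the <configuration> element of mapred-site.xml; the host and port are placeholders. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host:10020</value>
</property>
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```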
Save the file.
Edit YARN site XML file
If you use YARN, verify your YARN settings in yarn-site.xml.
Open yarn-site.xml.
Add or update these properties:
yarn.application.classpath: Classpaths needed to run YARN applications. Use commas to separate multiple paths.
yarn.resourcemanager.hostname: ResourceManager host name for your environment.
yarn.resourcemanager.address: ResourceManager address and port for your environment.
yarn.resourcemanager.admin.address: ResourceManager admin address and port for your environment.
yarn.resourcemanager.proxy-user-privileges.enabled: Set to true if you use a proxy user.
Example:
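A sketch with placeholder host names; the classpath shown is a generic Hadoop-style value, not a CDP-specific one:

```xml
<!-- Inside the <configuration> element of yarn-site.xml; host and classpath values are placeholders. -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager-host</value>
</property>
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
```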
Save the file.
After you change these files, edit the named connection and upload the updated files.
Oozie configuration
If you use Oozie on your cluster, configure proxy access on the cluster and the server.
By default, the oozie user runs Oozie jobs.
If you start an Oozie job from PDI, configure a proxy user.
Set up Oozie on a cluster
Add your PDI user to oozie-site.xml.
Open oozie-site.xml on the cluster.
Add these properties, replacing <your_pdi_user_name> with your PDI user name.
Save the file.
Set up Oozie on the server
Add the proxy user name to the PDI named connection configuration.
Open this file:
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties
This path is created when you create a named connection.
Set pentaho.oozie.proxy.user to the proxy user name.
Save the file.
Windows configuration for a secured cluster
If you run Pentaho Server on Windows and use Kerberos, set the path to your krb5.conf or krb5.ini file.
Open server/pentaho-server/start-pentaho.bat.
Add -Djava.security.krb5.conf to CATALINA_OPTS.
Example:
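A sketch of the setting in start-pentaho.bat; the krb5.ini path is a placeholder, and you should extend the existing CATALINA_OPTS line rather than add a second one:

```bat
rem Placeholder path; point this at your krb5.ini or krb5.conf.
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.ini"
```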
Save the file.
Connect to CDP with the PDI client
After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.
See Pentaho Data Integration for the client connection steps.
Connect other Pentaho components to CDP
Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).
Create and test connections
Create and test a connection in each component.
Pentaho Server for Data Integration (DI): Create a transformation in the PDI client, then run it remotely.
Pentaho Server for Business Analytics (BA): Create a connection to CDP in the Data Source Wizard.
Pentaho Metadata Editor (PME): Create a connection to CDP in PME.
Pentaho Report Designer (PRD): Create a connection to CDP in PRD.
Share connection details with users
After you connect to CDP and its services, give connection details to users who need access.
Users typically need:
CDP distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
Oozie URL (if used)
Permission to access required HDFS directories, including home directories
Users might need more information, depending on the steps, entries, and services they use.
See Hadoop connection and access information list.
Google Dataproc
Use these settings when you configure Pentaho to connect to Google Dataproc.
Before you begin
Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:
Check the Components Reference.
Prepare your Google Cloud access:
Get credentials for a Google account and access to the Google Cloud Console.
Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.
Contact your Hadoop administrator for cluster connection details.
You also need to provide some of this information to users after setup.
Create a Dataproc cluster
You can create a Dataproc cluster using several methods.
For cluster setup options, see the Google Cloud Documentation.
Install the Google Cloud SDK on your local machine
Use Google’s instructions to install the Google Cloud SDK for your platform.
Set command variables
Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.
Set the variables:
Set PROJECT to your Google Cloud project ID.
Set HOSTNAME to the name of the master node in your Dataproc cluster. Note: The master node name ends with -m.
Set ZONE to the zone of the instances in your Dataproc cluster.
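A sketch of setting the variables in a Bash shell; the values are placeholders for your own project, cluster, and zone:

```bash
export PROJECT=my-gcp-project          # your Google Cloud project ID
export HOSTNAME=my-dataproc-cluster-m  # master node name; ends with -m
export ZONE=us-central1-a              # zone of the cluster instances
```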
Set up a Google Compute Engine instance for PDI
Run the PDI client inside Google Compute Engine (GCE).
Users must connect remotely through VNC to use the desktop UI.
VM instances in GCE do not publicly expose the required remote desktop ports, so create an SSH tunnel between the VNC client and the VM instance.
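One possible way to open such a tunnel is with gcloud compute ssh port forwarding; the instance name and VNC port 5901 (display :1) are assumptions for illustration:

```bash
# Forward local port 5901 to the VNC server on the GCE instance that runs PDI.
gcloud compute ssh pdi-instance --project=$PROJECT --zone=$ZONE -- -L 5901:localhost:5901
```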
When you finish, you can run PDI in GCE.
You can design and launch jobs and transformations on Dataproc.
Edit configuration files for users
Your cluster administrator must download cluster configuration files.
Update the files with Pentaho-specific and user-specific values.
Use these files to create a named connection.
Where named connection files live
Named connection files are stored here:
Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save edited files in a safe location.
Files to provide to users
Provide these files to each user:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hive-site.xml
You can copy these files from a Dataproc cluster using SCP.
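For example, a sketch using gcloud's SCP wrapper; /etc/hadoop/conf and /etc/hive/conf are the usual Dataproc locations, but verify the paths on your cluster:

```bash
# Copy the client configuration files from the Dataproc master node to the current directory.
gcloud compute scp --zone=$ZONE \
  $HOSTNAME:/etc/hadoop/conf/core-site.xml \
  $HOSTNAME:/etc/hadoop/conf/hdfs-site.xml \
  $HOSTNAME:/etc/hadoop/conf/mapred-site.xml \
  $HOSTNAME:/etc/hadoop/conf/yarn-site.xml \
  $HOSTNAME:/etc/hive/conf/hive-site.xml \
  ./
```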
Edit mapred-site.xml (MapReduce)
If you use MapReduce, update mapred-site.xml and enable cross-platform MapReduce job submission.
Connect to a Hadoop cluster with the PDI client
After you set up the Pentaho Server to connect to a cluster, configure and test the connection.
See the Pentaho Data Integration documentation for instructions.
Connect other Pentaho components to Dataproc
Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:
Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Install a driver for the Pentaho Server
Install a driver for the Pentaho Server.
For instructions, see Install a new driver.
Create and test connections
Create and test a connection for each component:
Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.
Share connection details with users
After you connect to the cluster and services, share the connection details with users.
Users can access the cluster only from machines configured to connect to it.
To connect, users need:
Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)
Users also need permissions for required HDFS directories.
For a detailed list of required information, see Hadoop connection and access information list.