Pentaho configuration
You can use basic configuration tasks to set up the Pentaho Server, data connections, the Pentaho design tools, and Hadoop cluster connections so you can get started creating ETL solutions and data analytics. These configuration tasks can be performed through the PUC (Pentaho User Console), the PDI (Pentaho Data Integration) client, or edits to shell scripts and property files. A Pentaho administrator user name and password are required to perform configuration tasks through the user console.
These tasks are for IT and Pentaho administrators, as described in the following definitions:
IT administrator
Installs, configures, and upgrades the Pentaho Server. An IT administrator knows where the data is stored, how to connect to it, details about the computing environment, and how to use the command line on Microsoft Windows or Linux.
Pentaho administrator
Responsible for creation and management of users and roles, along with managing workstations so the ETL specialists and business analysts can create, publish, and share content.
Note: This guide assumes you have installed the Pentaho software.
Tasks to be performed by an IT administrator
As an IT administrator, you configure the Pentaho Server and define security. If your team uses big data, you also configure access to Hadoop clusters.
Configure the Pentaho Server
Basic server tasks include starting and stopping the Pentaho Server, tuning memory, and setting up connections.
The following IT administrator tasks prepare the system for more specific Pentaho administrator configuration tasks, like defining connections and managing users and roles:
Stop and start the Pentaho Server and repository
To perform maintenance work on the Pentaho Server or components that use the Pentaho Server, such as the Pentaho User Console and Pentaho Data Integration, stop the repository and server, perform the work, and then restart the repository and server.
Start the Pentaho Repository before you start the Pentaho Server.
You must be an IT administrator and know where the data that you are managing is stored, how to access that stored data, details about the computing environment, and how to use the command line for Windows or Linux.
Run the following control scripts to stop and start the Pentaho Server:
Start on Windows
Start on Windows: ...\pentaho\server\pentaho-server\start-pentaho.bat
Stop on Windows: ...\pentaho\server\pentaho-server\stop-pentaho.bat
Start on Linux: .../pentaho/server/pentaho-server/start-pentaho.sh
Stop on Linux: .../pentaho/server/pentaho-server/stop-pentaho.sh
If you installed the Pentaho Repository on PostgreSQL, MySQL, Microsoft SQL Server, or Oracle, consult that database's documentation for start and stop instructions.
Script arguments and services (custom start/stop scripts)
Use these standard arguments:
start, stop, restart, status, help
Common service names:
pentaho (the Pentaho Server) and postgresql (the repository database)
Create scripts for automatic stop and start of the Pentaho server and repository on Linux
If you used the manual installation to install Pentaho on Linux, create scripts for automatic start on boot and stop at shutdown for the Pentaho Server and Pentaho Repository.
You must have root permissions.
Navigate to /etc/init.d/ and create a file named pentaho.
Open the pentaho file and enter the init script content.
Optional: Update the script for your environment.
If you are not using Red Hat Enterprise Linux, modify the details of the script to work with the OS, shells, and init systems that you are using. The script was tested only on Red Hat Enterprise Linux.
If you use an account other than the pentaho local user account to start services, replace pentaho with your account name.
If you use a MySQL or Oracle repository instead of PostgreSQL, replace postgresql with mysql or oracle.
If the solution repository runs on the same machine as the server, change postgresql to the name of the init script for your database.
If the solution repository runs on a remote computer, remove postgresql entirely and adjust the paths to the Pentaho Server scripts.
Save and close the pentaho file.
Navigate to /home/pentaho/pentaho/server/pentaho-server and open start-pentaho.sh.
In start-pentaho.sh, update the last if statement so the script runs correctly when started by the init script.
Save and close start-pentaho.sh.
Make the init script executable.
Add the init script to the standard run levels.
If your distribution supports update-rc.d, run update-rc.d to register the script at the standard run levels.
Many modern Linux distributions use systemd instead of SysV init scripts. If update-rc.d is not available, consult your distribution documentation for the equivalent.
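The script the steps above describe can be sketched as a minimal SysV init script. Everything here is an assumption to adapt for your environment: the pentaho service account, the install path under /home/pentaho, and the postgresql repository dependency (remove it for a remote repository).

```shell
#!/bin/sh
# /etc/init.d/pentaho -- minimal sketch of an init script for the
# Pentaho Server and repository (assumes Red Hat-style init, a local
# "pentaho" account, and a local PostgreSQL repository).
#
# chkconfig: 2345 90 10
# description: Pentaho Server and repository

SERVER_DIR=/home/pentaho/pentaho/server/pentaho-server
RUN_AS=pentaho

case "$1" in
  start)
    service postgresql start                        # start the repository first
    su - "$RUN_AS" -c "$SERVER_DIR/start-pentaho.sh"
    ;;
  stop)
    su - "$RUN_AS" -c "$SERVER_DIR/stop-pentaho.sh"
    service postgresql stop                         # stop the repository last
    ;;
  restart)
    "$0" stop
    "$0" start
    ;;
  status)
    service postgresql status
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}"
    exit 1
    ;;
esac
```

You might then make it executable and register it with, for example, chmod +x /etc/init.d/pentaho followed by update-rc.d pentaho defaults (Debian/Ubuntu) or chkconfig --add pentaho (Red Hat).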
Increase Pentaho Server memory limit for custom installations on Windows or Linux
If you used a custom method to install PDI on Windows or Linux, consider increasing the PDI memory limit so that the Pentaho Server and the PDI client (also known as Spoon) can perform memory-intensive tasks, like sorting, grouping large datasets, or running complex transformations and jobs.
Increase the memory limit for both the Pentaho Server and the PDI client. If you do not increase the memory limit, PDI uses the default memory settings in the PDI startup scripts.
For instructions on increasing memory limits by editing the Tomcat startup script, see Configure and start the Pentaho Server after manual installation.
For instructions on increasing the memory limit in the PDI client, see Modify the PDI client startup script.
Specify data connections for BA design tools
The BA design tools need a way to connect to where you store data, and to the Pentaho Server where you publish and share what you create in the design tools.
We support connecting to data stored in these ways:
Pentaho data sources, such as relational Metadata data models or multidimensional Mondrian data models.
JDBC databases.
JDBC database connections
Your database access protocol affects how you create, test, and publish reports and models. Choose one of these protocols:
Native (JDBC): Standard JDBC connection details stored in each report or model.
JNDI: A named connection. Properties are stored outside the report or model.
ODBC: Not supported. Available only for compatibility.
OCI: Oracle-only. Use when you must use Oracle Call Interface.
Choose an access protocol
Native (JDBC)
Use Native (JDBC) when you need a quick setup or you are evaluating.
Keep this limitation in mind:
If connection details change, you must update every report and model that embeds them.
Behavior by tool:
Report Designer and Metadata Editor
Connection details are embedded in each report or Metadata model.
Report Designer lists the saved connection name after you define it.
Metadata Editor requires you to re-enter the connection name when you use it.
Before publishing, ensure the connection name matches a Native (JDBC) connection already defined on the Pentaho Server.
Schema Workbench and Aggregation Designer
Connection details are not stored. Define the connection each time.
Before publishing, you need the Pentaho Server URL and the name of a Native (JDBC) connection already defined on that server.
Recommendation:
Use for trials and evaluation.
For production, switch to JNDI before publishing.
JNDI
Use JNDI when you need centralized maintenance and a distributed team.
Key characteristics:
Reports and models store only the JNDI name.
You update connection details in one place.
Behavior by tool:
Report Designer and Metadata Editor
Connection properties live in a file on each workstation.
Before publishing, ensure the JNDI name matches a JNDI connection already defined on the Pentaho Server.
Schema Workbench and Aggregation Designer
JNDI is not available for connecting from the tool.
Before publishing, you still need the Pentaho Server URL and a JNDI connection name already defined on the Pentaho Server.
Recommendation:
Use for production deployments.
ODBC (not supported)
ODBC is not supported. It remains available only for installations with a hard dependency on it.
If you must use ODBC, contact Pentaho Support.
OCI (Oracle only)
If you connect to an Oracle database, you can use OCI. Use the steps in Define JDBC or OCI connections for BA design tools.
Define connections for BA design tools
Use these sections to set up connections on workstations running BA design tools.
Define JDBC or OCI connections for BA design tools
Choose an access protocol in Choose an access protocol.
Install the matching driver.
Set the access protocol later in each design tool when you define the data source.
Add drivers
The driver lets the design tools connect to your database. This connection helps you validate the model.
Your database administrator or IT team can provide the driver. If needed, download it from your database vendor.
Copy the driver to these directories on every machine that runs the design tools. Stop the design tools before you copy the file.
Aggregation Designer: /pentaho/design-tools/agg-designer/drivers/
Metadata Editor: /pentaho/design-tools/metadata-editor/libext/JDBC/
Report Designer: /pentaho/design-tools/report-designer/lib/jdbc/
Schema Workbench: /pentaho/design-tools/schema-workbench/drivers/
Keep one driver version per database in each directory. Remove older versions to avoid conflicts.
If you are unsure, contact Pentaho Support.
Microsoft SQL Server driver notes
If you use Microsoft SQL Server, you might need the jTDS driver. jTDS is not vendor supported. Confirm the correct driver with Pentaho Support.
On Windows, many JDBC drivers support Type 2 integrated authentication using the integratedSecurity connection string property. To use it, copy sqljdbc_auth.dll to every machine and directory where you copied the JDBC driver.
You can find sqljdbc_auth.dll at:
<installation directory>\sqljdbc_<version>\<language>\auth\
Use the x64 DLL with a 64-bit JVM on an x64 processor.
Use the IA64 DLL with a 64-bit JVM on an Itanium processor.
Define JNDI connections for Report Designer and Metadata Editor
Choose an access protocol in Choose an access protocol.
Complete these tasks on each workstation.
Add drivers
The driver lets the design tools connect to your database so you can validate models.
Your database administrator or IT team can provide the driver. You can also download it from your database vendor.
Check Components Reference to confirm your driver is supported.
Copy the driver to these directories on each machine where you installed design tools. Stop the tools first.
Aggregation Designer: /pentaho/design-tools/agg-designer/drivers/
Metadata Editor: /pentaho/design-tools/metadata-editor/libext/JDBC/
Report Designer: /pentaho/design-tools/report-designer/lib/jdbc/
Schema Workbench: /pentaho/design-tools/schema-workbench/drivers/
Keep only one version of the driver in each directory. Remove older versions to avoid conflicts.
If you are unsure, contact Pentaho Support.
Configure Report Designer
Stop Report Designer. Stop the Pentaho Server too, if it runs locally.
For server steps, see Stop and start the Pentaho Server and repository.
Go to the .pentaho directory in the user profile. Examples:
Windows: C:\Users\username\.pentaho\
Linux or Solaris: /home/username/.pentaho/
Go to ~/.pentaho/simple-jndi/. Create the directory if it does not exist.
Edit default.properties. Create the file if it does not exist. Each JNDI connection name (for example, SampleData) appears as the property prefix.
Save the file.
Restart Report Designer. Restart the Pentaho Server too, if it runs locally.
Repeat this process on each workstation where you run Report Designer.
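A default.properties entry uses the Simple-JNDI format: one property per line, prefixed with the JNDI connection name. The values below (SampleData, the HSQLDB sample database, pentaho_user) are illustrative assumptions; substitute your own driver class, URL, and credentials.

```properties
# ~/.pentaho/simple-jndi/default.properties -- hypothetical entry
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```

Here SampleData is the JNDI connection name that reports reference; it must match a JNDI connection defined on the Pentaho Server before you publish.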
Configure Metadata Editor
Stop Metadata Editor. Stop the Pentaho Server too, if it runs locally.
For server steps, see Stop and start the Pentaho Server and repository.
Go to the metadata-editor/simple_jndi directory under your Metadata Editor install.
Edit JDBC.properties. Create the file if it does not exist. Each JNDI connection name (for example, SampleData) appears as the property prefix.
Save the file.
Restart Metadata Editor. Restart the Pentaho Server too, if it runs locally.
Repeat this process on each workstation where you run Metadata Editor.
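JDBC.properties uses the same Simple-JNDI layout as default.properties. This sketch assumes a hypothetical MySQL source named SampleData; adjust the driver class, URL, and credentials for your database.

```properties
# metadata-editor/simple_jndi/JDBC.properties -- hypothetical entry
SampleData/type=javax.sql.DataSource
SampleData/driver=com.mysql.jdbc.Driver
SampleData/url=jdbc:mysql://localhost:3306/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```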
Specify data connections for the Pentaho Server
For business analytics, you can use the Pentaho User Console to define the connection to where you store data.
We support accessing data stored in the following ways:
Comma separated values (CSV) files, or any file that can be converted to CSV, such as spreadsheets, XML, or other semi-structured or tabular data files. Use CSV files if you are evaluating Pentaho or you want to get started quickly.
JDBC drivers to help with database connections.
In Pentaho Data Integration (PDI), you can make connections in each job and transformation through an input step. Although users can create connections themselves, it is best to set up shared connections for your users so that they can select the connection they need from a list.
JDBC database connections
Use JDBC database connections when the Pentaho Server connects to an RDBMS.
You choose one of these access types:
Native (JDBC): Define the connection in Pentaho User Console (PUC).
JNDI: Define a data source in your application server (Tomcat).
OCI: Oracle-only option for using the Oracle Call Interface.
ODBC: Available but not supported. Contact Pentaho Support if you must use it.
Choose between native (JDBC) and JNDI
Use native (JDBC) when you want the simplest setup.
You manage the full connection definition in PUC.
If connection details change, update each affected connection in PUC.
Use JNDI when you want central management in the application server.
This is common in production.
It can also help keep credentials out of Pentaho admin workflows.
If connection details change, update the application server configuration.
Add JDBC drivers
The Pentaho Server needs the correct JDBC driver for your database.
Get the driver from your database vendor or your IT team.
The Components Reference lists supported drivers.
Copy the driver to these directories:
Pentaho Server: /pentaho/server/pentaho-server/tomcat/lib/
PDI client: data-integration/lib
Restart the Pentaho Server and any PDI clients after you add drivers.
Driver conflict guidance
Keep only one driver per database type in each directory.
Remove older versions to avoid conflicts.
This is especially important when a driver is for the same database type as your Pentaho Repository.
If you are unsure, contact Pentaho Support.
Microsoft SQL Server driver
For Microsoft SQL Server (MSSQL), you might need the jTDS driver, which is not supported by the vendor.
Contact Pentaho Support to confirm the correct driver.
Most Microsoft JDBC drivers support Type 2 integrated authentication on Windows.
Use the integratedSecurity property in the connection string.
To use integrated authentication, copy sqljdbc_auth.dll to every location where you copied the JDBC driver.
You can find sqljdbc_auth.dll here:
<installation directory>\sqljdbc_<version>\<language>\auth\
Choose the correct DLL:
64-bit JVM on an x64 processor: use the x64 folder.
64-bit JVM on an Itanium processor: use the IA64 folder.
Set up JNDI connections for the Pentaho Server
If you installed the Pentaho Server using the manual installation method, you set up JNDI data sources in Tomcat.
Use this section to add JNDI data sources and update existing ones.
Before you begin
You need IT administrator access.
You must know how to connect to your data sources.
You must know your OS and database setup.
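As a sketch of what a Tomcat JNDI data source looks like, a Resource element can be added to the webapp's META-INF/context.xml. The names, URL, credentials, and pool sizes below are assumptions for a hypothetical PostgreSQL source; note that older Tomcat versions use maxActive/maxWait instead of maxTotal/maxWaitMillis.

```xml
<!-- context.xml fragment: hypothetical JNDI data source "jdbc/SampleData" -->
<Resource name="jdbc/SampleData"
          auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/sampledata"
          username="pentaho_user"
          password="password"
          maxTotal="20"
          maxIdle="5"
          maxWaitMillis="10000"/>
```

The JDBC driver named in driverClassName must already be in tomcat/lib for the data source to work.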
Defining JNDI connections for PDI clients
If you publish to the Pentaho Server from a PDI client, you can configure the PDI client with the same JNDI settings as the Pentaho Server.
This lets you develop and test transformations without keeping your application server running.
Edit the jdbc.properties file to mirror your application server data sources:
Location:
/pentaho/design-tools/data-integration/simple-jndi/jdbc.properties
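Each Tomcat data source can be mirrored by a Simple-JNDI entry whose prefix matches the JNDI name defined on the server. A hypothetical mirror of a jdbc/SampleData resource backed by PostgreSQL:

```properties
# data-integration/simple-jndi/jdbc.properties -- hypothetical entry
SampleData/type=javax.sql.DataSource
SampleData/driver=org.postgresql.Driver
SampleData/url=jdbc:postgresql://localhost:5432/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```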
Tomcat JNDI connections
Use JNDI connections when you manage data sources in Tomcat.
Install the correct JDBC driver before you create or use a JNDI data source.
Next step: define the connection
After the driver is installed, create the connection definition in Pentaho.
See also
The native database connections in the Pentaho Suite are based on Java Database Connectivity (JDBC).
For Pentaho Server-specific connection setup, see:
Define security for the Pentaho Server
You also need to establish a security plan for your Pentaho system. Pentaho supports two different security options: Pentaho Security, or advanced security providers, such as LDAP, Single Sign-On, or Microsoft Active Directory.
This table can help you choose the option that is best for you.
Summary
Pentaho Security (in PUC and the PDI client): Pentaho Security is the easiest way to configure security quickly. Both the Pentaho User Console and the PDI client enable you to define and manage users and roles. The Pentaho Server controls which users and roles can access web resources through the User Console or resources in the Pentaho Repository. Pentaho Security works well if you do not have a security provider or if your user community has fewer than 100 users.
Advanced security providers: If you are already using a security provider, such as LDAP, Single Sign-On, or Microsoft Active Directory, you can use the users and roles you have already defined with Pentaho. Your security provider controls which users and roles can access Pentaho web resources through the User Console or resources in the Pentaho Repository. Advanced security scales well for production and enterprise user communities.
Expertise
Pentaho Security: Knowledge of your user community and which users should have which roles in the Pentaho system. Knowledge about security in general is not required.
Advanced security providers: Knowledge of your user community and which users should have which roles in the Pentaho system, plus knowledge of your particular security provider and its options.
Recommendation
Pentaho Security: Recommended for the Pentaho trial download or evaluation only. Caution: Do not use for production.
Advanced security providers: Recommended for production.
Depending on the security option you choose, refer to the following details in the Administer Pentaho Data Integration and Analytics document:
If you want to use Pentaho Security, see Manage Users and Roles in PUC and Manage Users and Roles in the PDI client.
If you want to use an advanced security provider, see Implement Advanced Security.
For information on using Kerberos for security, see How to Enable Kerberos Authentication.
Use password encryption with Pentaho
Strengthen security by using encrypted passwords for Pentaho applications.
These tasks are for IT administrators who have permission to modify files on the server and to stop and start the server. Perform them when you want to enhance your company's security by encrypting passwords that are currently stored as plain text in configuration files, for example, to meet specific server security levels for regulatory compliance.
As a best practice, stop the server before modifying configuration files, then start the server when finished. After you have configured a Pentaho product to use encrypted passwords, all logins with the Pentaho product will use the encrypted passwords. Connect to any databases that were edited to ensure all changes are correct.
Use this topic to create encrypted passwords and apply them to individual Pentaho products. Encrypted passwords are supported for the products covered in the sections that follow.
You can also use encrypted passwords with JDBC security. See the Administer Pentaho Data Integration and Analytics document for instructions on switching to JDBC security.
Encrypting a password
Perform the following steps on the machine with the Pentaho Server to create an encrypted password.
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
At the command line, navigate to the server/pentaho-server directory.
Run encr.bat (Windows) or encr.sh (Linux) with the password that you want to encrypt.
The console prints the encrypted password.
Note: You must have a JRE or JDK installed.
Restart the server.
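As an illustration of the steps above, and assuming the tool accepts the password through the -kettle option as in PDI's Encr utility:

```shell
cd /pentaho/server/pentaho-server
# Encrypt the plain-text password "MyPassword" (a hypothetical value).
./encr.sh -kettle MyPassword
# The console prints the encrypted value, prefixed with "Encrypted ";
# copy the entire printed string into your configuration files.
```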
Use encrypted passwords with Pentaho products
How you apply an encrypted password varies by product.
Pentaho Data Integration (PDI)
Perform the following steps to use an encrypted password with Pentaho Data Integration (PDI).
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
Navigate to the design-tools/data-integration/simple-jndi directory.
Open the jdbc.properties file in a text editor.
Replace all instances of the password value with the encrypted password.
Save and close the file.
Restart the server and verify the change.
Pentaho User Console (PUC)
Perform the following steps to use an encrypted password with the Pentaho User Console (PUC).
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
Navigate to the server/pentaho-server/tomcat/webapps/pentaho/META-INF directory.
Open the context.xml file in a text editor.
Replace the password value in every Resource element with the encrypted password.
Save and close the file.
Restart the server and verify the change.
PUC email
After you configure PUC to use an encrypted password, you can use that password with PUC email.
Log in to PUC as an administrator.
Open the Administration Perspective.
Select the Mail server section.
Enter your encrypted password value in the password field.
Note: If you use Gmail, enable Allow less secure apps to access your account.
Select Test Email Configuration.
Verify that PUC sends an email to the address you specified.
Pentaho Aggregation Designer
To use encrypted passwords with Pentaho Aggregation Designer, you must first centralize your passwords in a jndi.properties file.
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
Create a jndi.properties file with the default Simple-JNDI properties.
Save jndi.properties in the design-tools/aggregation-designer/lib directory.
In the user's home directory, navigate to the .pentaho/simple-jndi directory.
Open default.properties in a text editor. If default.properties does not exist, create it.
If you created default.properties under design-tools/aggregation-designer/simple-jndi, update org.osjava.sj.root in design-tools/aggregation-designer/lib/jndi.properties to point to it.
Replace the password value in every property in default.properties with the encrypted password.
Note: If you use a remote repository, replace localhost with the repository IP address.
Save and close the file.
Restart the server and verify the change.
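A jndi.properties sketch using the standard Simple-JNDI settings; the root path below is an assumption pointing at the user's default.properties directory, so adjust it to your layout.

```properties
# design-tools/aggregation-designer/lib/jndi.properties -- sketch
java.naming.factory.initial=org.osjava.sj.SimpleContextFactory
org.osjava.sj.root=/home/pentaho/.pentaho/simple-jndi
org.osjava.sj.delimiter=/
```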
Pentaho Metadata Editor (PME)
The Pentaho Metadata Editor (PME) stores passwords in the JNDI connection default.properties file. For setup details, see Define JNDI connections for Report Designer and Metadata Editor.
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
In the user's home directory, navigate to the .pentaho/simple-jndi directory.
Open default.properties in a text editor.
Note: If default.properties does not exist, create it.
Replace the password value in every property with the encrypted password.
Note: If you use a remote repository, replace localhost with the repository IP address.
Save and close the file.
Restart the server and verify the change.
Pentaho Report Designer (PRD)
The Pentaho Report Designer (PRD) stores passwords in the JNDI connection default.properties file. For setup details, see Define JNDI connections for Report Designer and Metadata Editor.
Stop the server.
For instructions, see Stop and start the Pentaho Server and repository.
Navigate to the design-tools/report-designer/configuration-template/simple-jndi directory.
Open default.properties in a text editor.
Replace the password value in every property with the encrypted password.
Note: If you use a remote repository, replace localhost with the repository IP address.
Save the file.
Copy default.properties to the .pentaho/simple-jndi directory in the user's home directory, replacing the existing file.
Note: If the .pentaho/simple-jndi directory does not exist, create it.
Restart the server and verify the change.
After you update a product
After you configure a product to use encrypted passwords, all logins with that product use encrypted passwords.
Connect to any databases you updated to verify the changes.
Set up Pentaho to connect to a Hadoop cluster
Use this topic to configure Pentaho to connect to Hadoop clusters.
Supported distributions include Amazon EMR, Azure HDInsight (HDI), Cloudera Data Platform (CDP), and Google Dataproc.
Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Sqoop, YARN/MapReduce, ZooKeeper, and Spark.
You can connect to clusters and services from these Pentaho components:
PDI client (Spoon), along with Kitchen and Pan command line tools
Pentaho Server
Analyzer (PAZ)
Pentaho Interactive Reports (PIR)
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)
Pentaho connects to Hadoop clusters through a compatibility layer called a driver (Big Data shim).
To confirm which drivers are supported for your version, see the Components Reference.
Drivers are shipped as vendor-specific builds of the optional pentaho-big-data-ee-plugin.
Download drivers from the Hitachi Vantara Lumada and Pentaho Support Portal.
Note: Pentaho ships with a generic Apache Hadoop driver. For specific vendor drivers, visit the Hitachi Vantara Lumada and Pentaho Support Portal to download the drivers.
Install a new driver
You need a driver for each cluster vendor and version you connect to from:
PDI client (Spoon), plus Kitchen and Pan
Pentaho Server
Analyzer
Interactive Reports
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)
Pentaho ships with a generic Apache Hadoop driver. Download vendor-specific drivers from the Support Portal.
Download the driver plugin
Sign in to the Support Portal.
Go to Downloads.
In the 11.0 list, open the full downloads list.
Open Pentaho 11.0 GA Release.
Download the driver plugin from Big Data Shims.
Common driver plugin files:
Apache Vanilla: pentaho-big-data-ee-plugin-apachevanilla-11.0.0.0-<build-number>.zip
Cloudera Data Platform: pentaho-big-data-ee-plugin-cdpdc71-11.0.0.0-<build-number>.zip
Google Dataproc: pentaho-big-data-ee-plugin-dataproc1421-11.0.0.0-<build-number>.zip
Amazon EMR: pentaho-big-data-ee-plugin-emr770-11.0.0.0-<build-number>.zip
Azure HDInsight: pentaho-big-data-ee-plugin-hdi40-11.0.0.0-<build-number>.zip
Install the driver on the PDI client
Stop PDI.
Extract the downloaded .zip into <pdi-install-dir>/data-integration/plugins.
If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.
Install the driver on the Pentaho Server
Stop the Pentaho Server.
Extract the downloaded .zip into <pentaho-server>/pentaho-solutions/system/kettle/plugins.
If you are replacing an existing driver plugin, remove the old pentaho-big-data-ee-plugin folder first.
Restart and verify
Restart the PDI client and the Pentaho Server.
Create or update your cluster connection and verify it connects.
Update drivers
When drivers for new Hadoop versions are released, download the new driver plugin and repeat the install steps.
Additional configurations for specific distributions
Use these settings when you configure Pentaho to connect to specific Hadoop distributions:
Amazon EMR
The following settings are available while you configure Pentaho to connect to a working Amazon EMR cluster.
EMR clusters (version 7.x and later) built with JDK 17 exclude commons-lang-2.6.jar from standard Hadoop library directories (such as $HADOOP_HOME/lib).
To use the EMR driver with EMR 7.x:
Download commons-lang-2.6.jar from a trusted source (for example, Maven Repository: commons-lang » commons-lang » 2.6).
Copy the JAR to $HADOOP_HOME/lib or $HADOOP_MAPRED_HOME/lib on every EMR node.
Before you begin
Before you set up Pentaho to connect to an Amazon EMR cluster, do these tasks:
Check the Components Reference to confirm your Pentaho version supports your EMR version.
Prepare your Amazon EMR cluster:
Configure an Amazon EC2 cluster.
Install required services and service client tools.
Test the cluster.
Install PDI on an Amazon EC2 instance in the same Amazon VPC as the EMR cluster.
Get connection details from your Hadoop administrator.
Add the YARN user on the cluster to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.
As a best practice, install PDI on the Amazon EC2 instance.
Otherwise, you may not be able to read or write cluster files.
For a workaround, see Unable to read or write files to HDFS on the Amazon EMR cluster.
You also need to share connection details with users after setup.
For the full list, see Hadoop connection and access information list.
Edit configuration files for users
Your cluster administrator must download cluster configuration files.
Update the files with Pentaho-specific and user-specific values.
Use these files to create or update a named connection.
Where named connection files live
Named connection files are stored here:
Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save edited files in a safe location.
Files to provide to users
Provide these files to each user:
core-site.xml, mapred-site.xml, hdfs-site.xml, yarn-site.xml
Verify or edit core-site.xml file
If you plan to run MapReduce jobs on Amazon EMR, confirm you have read, write, and execute access to the S3 buffer directories specified in core-site.xml.
Edit core-site.xml to add AWS access keys and (optional) LZO compression settings.
Open the file
Open core-site.xml from the folder where you saved the other *-site.xml files.
Add AWS credentials
Add your AWS Access Key ID and secret access key:
Optional: Add S3N credentials
If you use S3N, add these properties:
Add filesystem implementation settings
Add these properties:
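Taken together, the additions from the three steps above might look like the following core-site.xml fragment. The property names match the classic s3/s3n connectors; treat them as assumptions and confirm the names your EMR and driver versions expect (newer Hadoop stacks use fs.s3a.* properties instead).

```xml
<!-- core-site.xml fragment: AWS credentials and S3 filesystem settings -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
<!-- Only if you use S3N -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
<!-- Filesystem implementation settings -->
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
```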
Configure LZO compression
If you are not using LZO compression, remove any references to com.hadoop.compression.lzo.LzoCodec from core-site.xml.
If you are using LZO compression:
Download the LZO JAR from http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/.
Add it to pentaho-big-data-plugin/hadoop-configurations/emr3x/lib.
Save and apply the change
Save the file.
Update the named connection.
Upload the updated core-site.xml.
Edit mapred-site.xml file
If you use MapReduce, edit mapred-site.xml.
You also enable cross-platform MapReduce job submission.
Open the file
Open mapred-site.xml from the folder where you saved the other *-site.xml files.
Add the property
Add this property:
This property is only required for MapReduce jobs on Windows.
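The standard Hadoop property for cross-platform submission is mapreduce.app-submission.cross-platform; a fragment like this is typically what the step above adds:

```xml
<!-- mapred-site.xml fragment: allow job submission from Windows clients -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```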
Save and apply the change
Save the file.
Update the named connection.
Upload the updated mapred-site.xml.
Connect to a Hadoop cluster with the PDI client
After you set up the Pentaho Server to connect to a cluster, configure and test the connection.
See the Pentaho Data Integration documentation for instructions.
Connect other Pentaho components to the Amazon EMR cluster
Use this procedure to create and test a connection to your Amazon EMR cluster from these Pentaho components:
Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Install a driver for the Pentaho Server
Install a driver for the Pentaho Server.
For instructions, see Install a new driver.
Create and test connections
Create and test a connection for each component:
Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.
Share connection details with users
After you connect to the cluster and services, share the connection details with users.
Users can access the cluster only from machines configured to connect to it.
To connect, users need:
Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)
Users also need permissions for required HDFS directories.
For a detailed list of required information, see Hadoop connection and access information list.
Azure HDInsight
Use these settings when you configure Pentaho to connect to Azure HDInsight (HDI).
Before you begin
Before you set up Pentaho to connect to HDI, do the following:
Check Components Reference. Confirm your Pentaho version supports your HDI version.
Prepare your HDI instance:
Configure your Azure HDInsight instance.
Install required services and client tools.
Test the platform.
If HDI uses Kerberos, complete the Kerberos steps on this page.
Get connection details from your platform admin. You will share some of this information with users later. See Hadoop connection and access information list.
Add the YARN user to the group defined by dfs.permissions.superusergroup in hdfs-site.xml.
Set up the Hadoop driver for your HDI version. See Install a new driver.
Kerberos-secured HDInsight instances
If you connect to HDI secured with Kerberos, complete these steps first:
Configure Kerberos security on the platform. Configure the Kerberos realm, KDC, and admin server.
Configure these nodes to accept remote connection requests:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
If you deployed HDI using an enterprise program, set up Kerberos for those nodes.
Add user credentials to the Kerberos database for each Pentaho user.
Verify an OS user exists on each HDI node for each Kerberos user. Create users as needed.
User account UIDs should be greater than min.user.id. The default is usually 1000.
Set up Kerberos on your Pentaho machines. See the Administer Pentaho Data Integration and Analytics guide.
Edit configuration files for users
Your Azure admin downloads the site configuration files for the services you use. They update the files with Pentaho-specific and user-specific settings. Users upload the updated files when they create a named connection.
Named connection files are stored in these locations:
<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save the updated files in a known location for reuse.
Files to provide
core-site.xml (secured HDInsight only)
hbase-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml
If you update these files after creating a named connection, edit the named connection and re-upload the updated files.
Edit Core site XML file
If you use a secured instance of Azure HDInsight, update core-site.xml.
Open core-site.xml.
Add or update properties for your storage type.
WASB storage
Add these properties:
fs.AbstractFileSystem.wasb.impl: org.apache.hadoop.fs.azure.Wasb
pentaho.runtime.fs.default.name: wasb://<container-name>@<storage-account-name>.blob.core.windows.net
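For illustration, the WASB properties above might look like this in core-site.xml (the container and storage account names are placeholders):

```xml
<!-- Placeholder container and account names; substitute your own. -->
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>wasb://sample-container@samplestorage.blob.core.windows.net</value>
</property>
```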
ADLS (ABFS) storage
Add this property:
pentaho.runtime.fs.default.name:abfs://<container-name>@<storage-account-name>.dfs.core.windows.net
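A sketch of the ABFS property in core-site.xml, with placeholder container and account names:

```xml
<!-- Placeholder container and account names; substitute your own. -->
<property>
  <name>pentaho.runtime.fs.default.name</name>
  <value>abfs://sample-container@samplestorage.dfs.core.windows.net</value>
</property>
```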
Save the file.
Edit HBase site XML file
If you use HBase, update hbase-site.xml to set the temporary directory.
Open hbase-site.xml.
Add or update this property:
hbase.tmp.dir:/tmp/hadoop/hbase
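As a sketch, this property in hbase-site.xml might look like:

```xml
<property>
  <name>hbase.tmp.dir</name>
  <value>/tmp/hadoop/hbase</value>
</property>
```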
Save the file.
Edit Hive site XML file
If you use Hive, update hive-site.xml to set the Hive metastore location.
Open hive-site.xml.
Add or update these properties:
hive.metastore.uris: Hive metastore URI, if different from your HDInsight instance.
fs.azure.account.keyprovider.<storage-account>.blob.core.windows.net: Azure storage key provider principal, if required.
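A sketch of how these might appear in hive-site.xml; the metastore host, storage account, and key provider class are placeholders, not values from this guide:

```xml
<!-- Placeholder host and account names; substitute your own values. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
<property>
  <name>fs.azure.account.keyprovider.samplestorage.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.SimpleKeyProvider</value>
</property>
```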
Save the file.
Edit Mapred site XML file
If you use MapReduce, update mapred-site.xml for job history logging and cross-platform execution.
Open mapred-site.xml.
Ensure these properties exist:
mapreduce.jobhistory.address: where MapReduce job history logs are stored
mapreduce.job.hdfs-servers: HDFS servers used by YARN to run MapReduce jobs
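For illustration, these entries in mapred-site.xml might look like the following; the host names are placeholders, and the ports shown are common Hadoop defaults:

```xml
<!-- Placeholder hosts; 10020 and 8020 are typical default ports. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host.example.com:10020</value>
</property>
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://namenode-host.example.com:8020</value>
</property>
```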
Optional: If YARN containers run on JDK 11 nodes, add this property:
mapreduce.jvm.add-opens-as-default:false
Do not add mapreduce.jvm.add-opens-as-default for containers running on JDK 17 nodes.
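In mapred-site.xml, the JDK 11 property above would look like:

```xml
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```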
Save the file.
Edit YARN site XML file
If you use YARN, verify your yarn-site.xml settings.
Open yarn-site.xml.
Add or update these properties:
yarn.resourcemanager.hostname: ResourceManager host name
yarn.resourcemanager.address: ResourceManager address and port
yarn.resourcemanager.admin.address: ResourceManager admin address and port
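A sketch of these properties in yarn-site.xml; the host name is a placeholder, and 8032/8033 are the usual YARN default ports:

```xml
<!-- Placeholder host; 8032/8033 are common ResourceManager defaults. -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>rm-host.example.com:8033</value>
</property>
```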
Save the file.
After you change these files, edit the named connection and upload the updated files.
Oozie configuration
If you use Oozie, configure both the cluster and the Pentaho server.
By default, the Oozie user runs Oozie jobs. If you start an Oozie job from PDI, set up a PDI proxy user.
Set up Oozie on a cluster
Add your PDI user to oozie-site.xml.
Open oozie-site.xml on the cluster.
Add these properties, replacing <pdi-username> with the PDI user name.
Save the file.
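The properties themselves are not listed above; for a standard Oozie proxy-user setup they are likely the ProxyUserService entries, sketched here as an assumption:

```xml
<!-- Assumed standard Oozie proxy-user properties; not shown in this guide. -->
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<pdi-username>.groups</name>
  <value>*</value>
</property>
```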
Set up Oozie on the server
Set the proxy user for the named cluster on the Pentaho server.
Open config.properties:
/<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
This path is created when you create a named connection.
Set pentaho.oozie.proxy.user to the proxy user name.
Save the file.
Windows configuration for a secured cluster
If you run Pentaho Server on Windows and your cluster uses Kerberos, point Tomcat to your krb5.conf or krb5.ini.
Go to server/pentaho-server.
Open start-pentaho.bat.
Set CATALINA_OPTS to include the Kerberos configuration path.
Save the file.
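A sketch of what the line in start-pentaho.bat might look like; the krb5.ini path is a placeholder for your own file:

```bat
rem Hypothetical example; point the path at your krb5.conf or krb5.ini.
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.ini"
```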
Connect to HDI with the PDI client
After you set up the Pentaho Server to connect to HDI, configure and test the connection from PDI.
See the Pentaho Data Integration documentation for how to connect the PDI client to a cluster.
Connect other Pentaho components to HDI
Create and test an Azure HDInsight (HDI) connection in:
Pentaho Server
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Prerequisites
Install a driver for the Pentaho Server. See Install a new driver.
Create and test connections
Create and test the connection in each product:
Pentaho Server (DI): Create a transformation in the PDI client. Run it remotely.
Pentaho Server (BA): Create a connection to HDI in the Data Source Wizard.
PME: Create a connection to HDI.
PRD: Create a connection to HDI.
After you connect, share connection details with users.
Users typically need:
HDI distribution and version
HDFS, ResourceManager (JobTracker), ZooKeeper, and HiveServer2 hostnames (or IP addresses) and port numbers
Oozie URL (if used)
Permissions for required HDFS directories, including user home directories
See Hadoop connection and access information list.
Cloudera Data Platform (CDP)
Use these advanced settings when you configure Pentaho to connect to Cloudera Data Platform (CDP).
Before you begin
Before you set up Pentaho to connect to CDP, do these tasks:
Check Components Reference. Verify your Pentaho version supports your CDP version.
Prepare CDP:
Configure Cloudera Data Platform.
See CDP documentation.
Install required services and client tools.
Test the platform.
Get connection details from your platform administrator.
You will share some of this information with users later.
Add the YARN user to the group defined by dfs.permissions.superusergroup. Find this property in hdfs-site.xml or in Cloudera Manager.
Set up the Hadoop driver for your CDP version. See Install a new driver.
Set up a secured instance of CDP
If you connect to Kerberos-secured CDP, also do these tasks:
Configure Kerberos on the platform.
Include the realm, KDC, and administrative server.
Configure these nodes to accept remote connection requests:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
If you deployed CDP using an enterprise program, set up Kerberos for the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker nodes.
Add credentials to the Kerberos database for each Pentaho user.
Verify each user has an operating system account on each CDP node.
Add operating system users if needed.
User account UIDs should be greater than min.user.id.
This value is usually 1000.
Set up Kerberos on your Pentaho machines.
See Administer Pentaho Data Integration and Analytics.
Edit configuration files for users
Cloudera administrators download site configuration files for the services you use.
They update the files with Pentaho-specific and user-specific settings.
Users then upload the files when they create a named connection.
Named connection files are stored here:
<username>/.pentaho/metastore/pentaho/NamedCluster
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save the updated files in a known location for reuse.
Files to provide
config.properties
core-site.xml (secured CDP only)
hive-site.xml
mapred-site.xml
yarn-site.xml
If you update configuration files after creating a named connection, edit the named connection and re-upload the updated files.
Edit Core site XML file
If you use a secured instance of CDP, update core-site.xml.
Open core-site.xml.
Add or update these properties:
hadoop.proxyuser.oozie.hosts: Oozie hosts on your CDP cluster.
hadoop.proxyuser.oozie.groups: Oozie groups on your CDP cluster.
hadoop.proxyuser.<security_service>.hosts: Proxy user hosts for other services on your CDP cluster.
hadoop.proxyuser.<security_service>.groups: Proxy user groups for other services on your CDP cluster.
fs.s3a.access.key: Your S3 access key, if you access S3 from CDP.
fs.s3a.secret.key: Your S3 secret key, if you access S3 from CDP.
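For illustration, the Oozie proxy-user entries above might appear in core-site.xml like this; the wildcard values are placeholders, not values from this guide:

```xml
<!-- Placeholder values; restrict hosts and groups as your security policy requires. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
```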
Optional (AWS): If you connect to CDP Public Cloud on AWS and use an S3 bucket outside the CDP environment, update or add these properties:
Ensure the gateway node has valid AWS credentials (for example, under ~/.aws/).
Optional (Azure): If you connect to CDP Public Cloud on Azure and use a storage account outside the CDP environment:
Remove these properties:
fs.azure.enable.delegation.token
fs.azure.delegation.token.provider.type
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
Add these properties:
fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net=SharedKey
fs.azure.account.key.<storage-account-name>.dfs.core.windows.net=<storage-account-key>
Optional (GCP): If you connect to CDP Public Cloud on GCP and use a bucket outside the CDP environment, create a custom role with these permissions:
Assign the custom role to the Data Lake and Log service accounts for the bucket.
Save the file.
Edit Hive site XML file
If you use Hive, update hive-site.xml to set the Hive metastore location.
Open hive-site.xml.
Add or update these properties:
hive.metastore.uris: Set this to the Hive metastore URI if it differs from your CDP cluster.
hive.server2.enable.impersonation: Set to true if you use impersonation.
hive.server2.enable.doAs: Set to true if you use impersonation.
tez.lib.uris: Required when you use Hive 3 on Tez.
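A sketch of these properties in hive-site.xml; the metastore host and Tez archive path are placeholders:

```xml
<!-- Placeholder host and path; substitute your own values. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
<property>
  <name>tez.lib.uris</name>
  <value>hdfs://namenode-host.example.com:8020/tez/tez.tar.gz</value>
</property>
```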
Save the file.
Edit Mapred site XML file
If you use MapReduce, update mapred-site.xml to set job history logging and allow cross-platform submissions.
Open mapred-site.xml.
Ensure these properties exist:
mapreduce.jobhistory.address: Where MapReduce job history logs are stored.
mapreduce.app-submission.cross-platform: Set to true to allow submissions from Windows clients to Linux servers.
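For illustration, these entries in mapred-site.xml might look like the following; the host name is a placeholder and 10020 is a typical default port:

```xml
<!-- Placeholder host; 10020 is the usual JobHistory default port. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory-host.example.com:10020</value>
</property>
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```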
Save the file.
Edit YARN site XML file
If you use YARN, verify your YARN settings in yarn-site.xml.
Open yarn-site.xml.
Add or update these properties:
yarn.application.classpath: Classpaths needed to run YARN applications. Use commas to separate multiple paths.
yarn.resourcemanager.hostname: ResourceManager host name for your environment.
yarn.resourcemanager.address: ResourceManager address and port for your environment.
yarn.resourcemanager.admin.address: ResourceManager admin address and port for your environment.
yarn.resourcemanager.proxy-user-privileges.enabled: Set to true if you use a proxy user.
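A sketch of the ResourceManager properties in yarn-site.xml; the host name is a placeholder, and 8032/8033 are common YARN defaults:

```xml
<!-- Placeholder host; 8032/8033 are common ResourceManager defaults. -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>rm-host.example.com:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
```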
Save the file.
After you change these files, edit the named connection and upload the updated files.
Oozie configuration
If you use Oozie on your cluster, configure proxy access on the cluster and the server.
By default, the oozie user runs Oozie jobs.
If you start an Oozie job from PDI, configure a proxy user.
Set up Oozie on a cluster
Add your PDI user to oozie-site.xml.
Open oozie-site.xml on the cluster.
Add these properties, replacing <your_pdi_user_name> with your PDI user name.
Save the file.
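The guide does not list the properties here; for a standard Oozie proxy-user setup they are likely the ProxyUserService entries, sketched below as an assumption:

```xml
<!-- Assumed standard Oozie proxy-user properties; not shown in this guide. -->
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
  <value>*</value>
</property>
```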
Set up Oozie on the server
Add the proxy user name to the PDI named connection configuration.
Open this file:
<username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection_name>/config.properties
This path is created when you create a named connection.
Set pentaho.oozie.proxy.user to the proxy user name.
Save the file.
Windows configuration for a secured cluster
If you run Pentaho Server on Windows and use Kerberos, set the path to your krb5.conf or krb5.ini file.
Open server/pentaho-server/start-pentaho.bat.
Add -Djava.security.krb5.conf to CATALINA_OPTS.
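A sketch of the resulting line in start-pentaho.bat; the krb5.ini path is a placeholder for your own file:

```bat
rem Hypothetical example; point the path at your krb5.conf or krb5.ini.
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.ini"
```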
Save the file.
Connect to CDP with the PDI client
After you set up the Pentaho Server to connect to CDP, configure and test the connection from the PDI client.
See Pentaho Data Integration for the client connection steps.
Connect other Pentaho components to CDP
Create and test a connection to CDP from Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).
Create and test connections
Create and test a connection in each component.
Pentaho Server for Data Integration (DI): Create a transformation in the PDI client, then run it remotely.
Pentaho Server for Business Analytics (BA): Create a connection to CDP in the Data Source Wizard.
Pentaho Metadata Editor (PME): Create a connection to CDP in PME.
Pentaho Report Designer (PRD): Create a connection to CDP in PRD.
Share connection details with users
After you connect to CDP and its services, give connection details to users who need access.
Users typically need:
CDP distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, and port numbers
Oozie URL (if used)
Permission to access required HDFS directories, including home directories
Users might need more information, depending on the steps, entries, and services they use.
See Hadoop connection and access information list.
Google Dataproc
The following settings are available while you configure Pentaho to connect to Google Dataproc.
Before you begin
Before you set up Pentaho to connect to a Google Dataproc cluster, do these tasks:
Check the Components Reference.
Prepare your Google Cloud access:
Get credentials for a Google account and access to the Google Cloud Console.
Get required credentials for Google Cloud Platform, Compute Engine, and Dataproc.
Contact your Hadoop administrator for cluster connection details.
You also need to provide some of this information to users after setup.
Create a Dataproc cluster
You can create a Dataproc cluster using several methods.
For cluster setup options, see the Google Cloud Documentation.
Install the Google Cloud SDK on your local machine
Use Google’s instructions to install the Google Cloud SDK for your platform.
Set command variables
Set these environment variables before you run command-line examples on your local machine or in Cloud Shell.
Set the variables:
Set PROJECT to your Google Cloud project ID.
Set HOSTNAME to the name of the master node in your Dataproc cluster. Note: The master node name ends with -m.
Set ZONE to the zone of the instances in your Dataproc cluster.
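The steps above can be sketched in a POSIX shell; all three values are placeholders for your own project ID, cluster master node, and zone:

```shell
# Placeholder values; substitute your own project ID, master node, and zone.
PROJECT="my-gcp-project"
HOSTNAME="my-dataproc-cluster-m"   # Dataproc master node names end with -m
ZONE="us-central1-a"
export PROJECT HOSTNAME ZONE
```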
Set up a Google Compute Engine instance for PDI
Run the PDI client inside Google Compute Engine (GCE).
Users must connect remotely through VNC to use the desktop UI.
VM instances in GCE do not publicly expose the required remote desktop ports.
Create an SSH tunnel between the VNC client and the VM instance.
Create a VM instance and set network tags
In the Google Cloud Console, open the Compute Engine console.
Go to Compute Engine > VM instances.
Select Create instance.
Open Advanced options and then the Networking tab.
In Network tags, enter vnc-server.
Install and configure VNC
Install and update a VNC service for the remote UI.
Install Gnome and VNC.
Connect using SSH and create an SSH tunnel
Log in to the instance using SSH.
Use an SSH client and the VM external IP.
Note: The Google Cloud Console shows the external IP.
Create an SSH tunnel from your VNC client machine.
Connect to the VNC session.
Optional: Configure Kerberos
If you use Kerberos, configure Kerberos on the GCE VM.
Authenticate the client machine with the Kerberos controller.
This is required for Kerberos-enabled Dataproc clusters.
When you finish, you can run PDI in GCE.
You can design and launch jobs and transformations on Dataproc.
Edit configuration files for users
Your cluster administrator must download cluster configuration files.
Update the files with Pentaho-specific and user-specific values.
Use these files to create a named connection.
Where named connection files live
Named connection files are stored here:
Named connection XML: <username>/.pentaho/metastore/pentaho/NamedCluster
Named connection config folder: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>
Extra settings file: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<connection-name>/config.properties
Save edited files in a safe location.
Files to provide to users
Provide these files to each user:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hive-site.xml
You can copy these files from a Dataproc cluster using SCP.
Edit mapred-site.xml (MapReduce)
If you use MapReduce, update mapred-site.xml.
You also enable cross-platform MapReduce job submission.
Open the file
Open mapred-site.xml from the folder where you saved the other *-site.xml files.
Add the property
Add this property:
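The property is not shown at this point; based on the equivalent CDP mapred-site step earlier in this document, it is presumably the cross-platform submission flag:

```xml
<!-- Assumed property, based on the CDP section of this guide. -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```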
This property is only required for MapReduce jobs on Windows.
Save and apply the change
Save the file.
Edit the named connection.
Upload the updated mapred-site.xml.
Connect to a Hadoop cluster with the PDI client
After you set up the Pentaho Server to connect to a cluster, configure and test the connection.
See the Pentaho Data Integration documentation for instructions.
Connect other Pentaho components to Dataproc
Use this procedure to create and test a connection to your Dataproc cluster from these Pentaho components:
Pentaho Server (DI and BA)
Pentaho Metadata Editor (PME)
Pentaho Report Designer (PRD)
Install a driver for the Pentaho Server
Install a driver for the Pentaho Server.
For instructions, see Install a new driver.
Create and test connections
Create and test a connection for each component:
Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
PME: Create a connection to the cluster in PME.
PRD: Create a connection to the cluster in PRD.
Share connection details with users
After you connect to the cluster and services, share the connection details with users.
Users can access the cluster only from machines configured to connect to it.
To connect, users need:
Hadoop distribution and version
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames (or IP addresses) and port numbers
Oozie URL (if used)
Users also need permissions for required HDFS directories.
For a detailed list of required information, see Hadoop connection and access information list.
Tasks to be performed by a Pentaho administrator
As a Pentaho administrator, you configure data connections, manage security, and set up client tools.
Define data connections
PUC (Pentaho User Console) and PDI (Pentaho Data Integration) can access data from various sources.
If your data source is a database, create a database connection before you can use it.
Before you create a connection, install the correct driver for your database. Your IT administrator can help. See Specify data connections for the Pentaho Server.
Open the connection dialog box
You can open the Database Connection dialog box from PUC or PDI.
Open from Pentaho User Console (PUC)
Log on to PUC.
Select File > Manage Data Sources.
The Manage Data Sources dialog box appears.
In More actions and options, select New Connection.
The Database Connection dialog box appears.

Enter your connection information.
Click Test.
Click OK.
Open from Pentaho Data Integration (PDI)
Start the PDI client (Spoon).
Create a new transformation or job.
In the View tab, open the Database connections folder.
The Database Connection dialog box appears.

Enter your connection information.
Click Test.
Click OK.
Notes for PDI connections
PDI can connect to multiple database vendors.
PDI includes JDBC drivers for PostgreSQL by default.
Avoid ODBC connections.
The ODBC-to-JDBC bridge driver adds complexity and can hurt performance. Use ODBC only if no JDBC driver is available.
For details, see the Pentaho Community article on why you should avoid ODBC.
When you define a database connection in PDI, Spoon stores the connection information:
If you use the Pentaho Repository, the connection is stored in the repository. Other users can use it after they connect.
If you do not use the Pentaho Repository, Spoon stores it in XML with the transformation or job.
Before you define a JDBC connection, gather your database information. For example, gather the database type, port, user name, and password.
In PDI, you can set connection properties as variables. This helps jobs access multiple database types.
Use clean ANSI SQL where possible.
Enter database connection information
The connection information you enter depends on the database and access protocol.
Native (JDBC) protocol information
In Connection Name, enter a descriptive name.
The name can include spaces. Do not use special characters such as #, $, or %.
In Database Type, select the database you want to use.
In Access, select Native (JDBC).
In Settings, enter the following information:
Host Name: Server name or IP address.
Database Name: Database name. If you use ODBC, enter the DSN.
Port Number: TCP/IP port number (if different from the default).
User Name: Optional user name for the database.
Password: Optional password for the database.
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
In PUC, the connection appears in Manage Data Sources.
In PDI, it appears under Database connections in the View tab.
OCI protocol information (PDI only)
In Connection Name, enter a descriptive name.
The name can include spaces. Do not use special characters such as #, $, or %.
In Database Type, select Oracle.
In Access, select OCI.
In Settings, enter the fields as directed by the Oracle OCI documentation.
SID: Oracle system ID.
Tablespace for Data: Tablespace where data is stored.
Tablespace for Indices: Tablespace where indexes are stored.
User Name: User name for the database.
Password: Password for the database.
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
If you want to use Advanced, Options, or Pooling for OCI, refer to the Oracle documentation.
Connect to Snowflake using strong authentication
You can improve connection security by using key pair authentication.
After you enter the Snowflake connection details in General, select Options.
Set the following parameters:
authenticator: snowflake_jwt
private_key_file: Path to the private key file (for example, /rsa_key.p8).
private_key_file_pwd: Password for the private key file.
For details, see the Snowflake JDBC docs: Private key file name and password as connection properties.
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
Connect to an Azure SQL database
You can use an Azure SQL database as a data source in the PDI client. This connection is required for the Bulk load into Azure SQL DB job entry.
Pentaho supports multiple authentication methods.
Because one physical server can host databases for multiple customers, Azure SQL differs from SQL Server. For details, see Feature comparison.
Before you begin
You need:
An Azure account with an active subscription
An Azure SQL Database instance
Azure SQL database drivers installed
Also gather the following information:
Host name
Database name
Port number
Authentication method
User name
Password
If you use Always Encryption Enabled, also gather:
Client id
Client Secret Key
Authentication method
Pentaho supports four authentication methods:
SQL Authentication: Azure SQL Server user name and password.
Azure Active Directory: Multi-factor authentication (MFA).
Azure Active Directory with password: Azure AD user name and password.
Azure Active Directory with integrated authentication: Federated on-premises ADFS with Azure AD.
Connect to an Azure database
Start the PDI client and create a new transformation or job.
In the View tab of the Explorer pane, double-click Database connections.
Enter your connection information:
Host Name: Azure SQL server instance name.
Database Name: Azure SQL database name.
Port Number: TCP port. Azure SQL uses TCP port 1433. Allow outbound TCP 1433.
Authentication method: Authentication method. Default is SQL Authentication.
Username: Database user name.
Password: Database password.
Always Encryption Enabled: Enables encryption. See Use the Always Encryption Enabled option.
Client id: Client identifier for a durable connection path.
Client Secret Key: Key value name in Azure Key Vault.
Click Test to verify the connection.
Use the Always Encryption Enabled option
Before you can use Always Encryption Enabled, complete these steps using your Azure SQL tools:
Generate a column master key in Azure Key Vault.
Encrypt the column using the column master key.
Register the app under Azure Active Directory.
Obtain the Client id and Client Secret Key.
Grant permissions to the Client id for accessing Azure Key Vault.
Select Always Encryption Enabled.
Enter the Client id and Client Secret Key.
Edit existing connections
Once a connection has been established, you can open the Database Connection dialog box to change it.
Edit database connections in PUC
Select File > Manage Data Sources.
Select the data source you want to edit.
In More actions and options, select Edit.
Select items in the left pane to refine the connection.
Edit database connections in PDI
Open a transformation or job in the PDI client.
In Explorer, open the View tab.
Expand Database connections.
Right-click a connection name, then select Edit.
Select items in the left pane to refine the connection.
Specify advanced configuration (Advanced)
Use Advanced to set SQL-generation properties. You can use these settings to standardize behavior across tools.
Open the Database Connection dialog box. See Open the connection dialog box.
Select Advanced in the left pane.
Configure options as needed:
Supports the Boolean data type*: Use native Boolean data types supported by the database.
Supports the timestamp data type*: Use the timestamp data type supported by the database.
Quote all in database: Enable case-sensitive table names by quoting identifiers.
Force all to lower-case: Force identifiers to lower-case.
Force all to upper-case: Force identifiers to upper-case.
Preserve case of reserved words*: Use the database reserved-words list.
Preferred schema name*: Preferred schema name (for example, MYSCHEMA).
SQL statements: SQL statement used to initialize the connection.
* PDI-only options.
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
Quoting
Pentaho uses database-specific quoting rules. This helps you use names that follow database naming conventions.
PUC and PDI include a reserved-words list for most supported databases.
Pentaho separates schema (table owner) from the table name. This supports table or field names that include periods (common in some ERP systems).
To avoid quoting errors, Pentaho does not apply quoting if the schema or table name already includes a start or end quotation mark. This lets you control quoting explicitly.
Set database-specific options (Options)
Use Options to set JDBC parameters for a specific database.
Open the Database Connection dialog box. See Open the connection dialog box.
Select Options in the left pane.
In the next available row, enter a parameter name and value.
For JDBC configuration help, select Help. A browser window opens with database-specific guidance for the selected Database Type.
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
Define connection pooling (Pooling)
Connection pooling helps control database access. It can also help when your database license limits concurrent connections.
Open the Database Connection dialog box. See Open the connection dialog box.
Select Pooling in the left pane.
Configure pooling options for your JDBC driver.
Typical options include:
Enable Connection Pooling
Pool Size: initial and maximum
Parameters such as validationQuery
Common validationQuery values:
Oracle: SELECT 1 FROM DUAL
PostgreSQL, SQL Server, and MySQL: SELECT 1
Click Test.
Click OK to close the test dialog.
Click OK to save the connection.
Connect to clusters (PDI only)
Use Clustering to create database connections to data partitions in PDI.
To create a new partition connection, enter:
Partition ID
Host Name
Port
Database Name
User Name
Password
If you run Pentaho Server as a cluster and use the Data Source Wizard (DSW) in PUC, a new data source is visible only on the cluster node where you created it.
To make a new DSW data source visible across all cluster nodes, disable DSW data source caching:
Open server/pentaho-server/pentaho-solutions/system/system.properties.
Set enableDomainIdCache=false.
Disabling DSW caching can slow down loading the data source list.
Modify connections
You can perform additional tasks from PUC or the PDI client.
Modify connections from PUC
In PUC, open File > Manage Data Sources, then use More actions and options.

Export
Exports connection information. Metadata downloads an XMI file. Analysis downloads an XML file. Data Source Wizard downloads an XMI (reporting only) or ZIP (reporting and analysis).
Import Analysis
Imports data values from a Mondrian file into a specified data source. You can select an available data source or enter parameters manually.
Import Metadata
Imports a model and creates a data source. Data Source Wizard (includes Analysis model) uses a ZIP created using Export. Metadata model uses an XMI file.
Modify connections from PDI
In PDI, right-click the connection name under Database connections in the View tab.
Duplicate
Duplicates a connection after you enter a new Connection Name.
Copy to clipboard
Copies the XML defining the connection.
SQL Editor
Executes SQL commands against an existing connection.
Clear DB Cache
Clears PDI’s database cache (use after schema changes).
Share
Shares the connection across transformations and jobs.
Explore
Opens Database Explorer to browse schemas and tables.
Show dependences
Lists transformations and jobs that use the connection.
Delete connections
If you no longer need a connection, delete it in PUC or PDI.
Deleting a connection affects all reports, charts, dashboards, and other content associated with that connection.
Delete connections in PUC
Select File > Manage Data Sources.
Select the data source you want to delete.
In More actions and options, select Delete.
Delete connections in PDI
In Explorer, select the View tab.
Expand Database connections.
Right-click the connection name, then select Delete.
Assign permissions to use or manage database connections
You may have several connections to your data that you do not want to share with all of your users. When connected to the Pentaho Server, the PDI client (also known as Spoon) gives you the ability to make your data visible to only those users and roles that you specify. You can assign permission to allow users and roles to read, write, or delete the connection. Connection definitions are stored in the Pentaho Repository. The PDI client Repository Explorer enables you to browse the available connections and select the one for which you want to assign permissions.
From within the PDI client, click on Tools > Repository > Explore.
The Repository Explorer appears.
Select the Connections tab.
Select the connection for which you want to assign permissions.
From the User/Role area, select the user or role for which you want to assign permissions.
Check the permissions you want to assign to the selected user or role.
Read
For this user or role, the connection appears in the connection list and can be selected for use. If a user or role has permission to read a transformation or job, but not its referenced database connection, they cannot open the transformation or job and an error message appears.
Write
This user or role can edit the connection definition.
Delete
This user or role can permanently remove the connection definition from the list.
Manage Access Control
This user or role can assign read, write, and delete permissions to other users or roles.
Click Apply.
Click Close to exit the dialog box.
You can also delegate the ability to assign these permissions to another user or role. For more information on managing users and roles, see the "With the PDI client" section below.
Manage users and roles
If you are using basic Pentaho security, the Pentaho administrator may be tasked with creating and managing users and roles, including assigning permissions to allow users to access the content they need.
With PUC
Switch between user and role settings to add, delete, and edit users and roles in PUC.
See the Pentaho Business Analytics documentation for details.
With the PDI client
Manage users, roles, and permissions in the Pentaho Repository with the PDI client (Spoon).
You must sign in as an administrator, or with a role that has the Administer Security permission.
Before you begin
Back up relevant files before you change security settings.
If you installed PDI using the suite installer or custom methods, back up your Data Integration directories.
If you installed PDI using the manual method, back up pentaho.war and your solution files.
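As a rough sketch of that backup (the PENTAHO_HOME layout and backup location are assumptions; adjust them to your install):

```shell
# Hypothetical backup of a manual install; PENTAHO_HOME and paths are examples.
PENTAHO_HOME=${PENTAHO_HOME:-/opt/pentaho/pentaho-server}
BACKUP_DIR=/tmp/pentaho-backup-$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
# pentaho.war and the solution files are the items called out above.
for item in tomcat/webapps/pentaho.war pentaho-solutions; do
  if [ -e "$PENTAHO_HOME/$item" ]; then
    cp -r "$PENTAHO_HOME/$item" "$BACKUP_DIR"/
  fi
done
```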
Sample users, default roles, and permissions
Use the built-in sample users and default roles as a starting point. They can guide your own role design.
Open the PDI client and sign in to the repository.
Select Tools > Repository > Explore.
Select the Security tab.
Available users are listed.
Repository Explorer Security tab
Choose what you want to manage:
Users: Select a user to see assigned roles and an optional description.
Roles: Select a role to see its permissions.
System Roles: Select a system role to see its permissions.
Default roles and sample users (typical setup):
Administrator (sample user: admin)
Administer Security
Schedule Content
Read Content
Publish Content
Create Content
Execute
Manage Data Sources
Power User (sample user: suzy)
Schedule Content
Read Content
Publish Content
Create Content
Execute
Report Author (sample user: tiffany)
Schedule Content
Publish Content
Business Analyst (sample user: pat)
Publish Content
Permission definitions:
Administer Security
Lets users access and manage content across perspectives.
Lets users view and manage schedules in the Schedules perspective.
The Administrator role effectively retains all permissions, even if unchecked.
Schedule Content
Lets users schedule content.
Lets users manage their own schedules.
Read Content
Lets users view content in perspectives.
Publish Content
Lets users store reports and data models in the repository.
With Create Content, lets users publish to the repository.
Create Content
Lets users create, import, delete, and save jobs and transformations.
Lets users see repository data sources used by jobs and transformations.
With Execute, lets users export, copy/paste, and save to VFS.
Execute
Lets users run, preview, debug, replay, verify, and schedule.
With Create Content, lets users export, copy/paste, and save to VFS.
Manage Data Sources
Lets users create, edit, or delete data sources.
Lets users view repository data sources.
Manage users
Use the Security tab to manage users. Select Users to see the user list.
Add users
Select Users, then select the plus sign next to Available.
Enter User Name and Password.
Optional: Enter a Description.
Optional: Under Member, select a role and select OK.
Select OK to save.
Change user passwords
Select Users, then select the user.
Select the Edit icon.
In Password, enter the new password.
Select OK.
When you sign in to the PDI client for the first time, change the default administrator password.
Edit user information
Select Users, then select the user in the Available list.
Select the Edit icon.
Update the user details.
Select OK.
Delete users
Tip: Disable a user or role instead of deleting it.
Select Users, then select the user in the Available list.
Next to Available, select the X icon.
Select Yes to confirm.
If you delete a user or role, repository content ownership and ACL entries remain unchanged. This can create a security risk if you later create a new user or role with the same name.
Safer alternatives:
If you are disabling a role, unassign all members from the role.
If you are disabling a user, reset the password to a cryptic, unknown value.
Hide a user Home folder in PDI
If your organization uses multi-tenancy, you may want to hide individual Home folders. This can help centralize user-created content in a secure folder.
See the Administer Pentaho Data Integration and Analytics documentation for details.
Manage roles and permissions
Select Roles to manage roles and their permissions.
Add roles
Select Roles.
Select the plus sign next to Available.
Enter the Role Name.
Optional: Enter a Description.
Optional: Select users (Shift/Ctrl), then select the right arrow to assign them.
Select OK.
Edit roles
Select Roles.
Select the role, then select the Edit icon.
Make changes.
Select OK.
Delete roles
Select Roles.
Select the role in the Available list.
Select the X icon next to Available.
Select Yes to confirm.
Assign users to roles
Select Roles.
Select the role.
Next to Members, select the plus sign.
Select users, then select the right arrow.
Select OK.
Assign user permissions in the repository using the PDI client
Use role permissions to control what users can do. For example, create an admin role that can administer security and create content.
Select Roles.
In the Available list, select the role.
In Permission, select or clear permissions.
Select Apply.
Changes apply the next time affected users sign in.
Enable system role permissions
Pentaho requires the Authenticated system role for repository sign-in. Users are assigned Authenticated automatically at sign-in.
By default, Authenticated has Read Content permission. Adjust permissions as needed.
The Anonymous system role is not used.
Select System Roles.
Select Authenticated.
Under Permissions, select or clear permissions.
Select Apply.
Make changes to the administrator role
You cannot edit the administrator role’s action-based permissions in the PDI client. The administrator role controls access to the Security tab.
Deleting the administrator role blocks access to the Security tab unless another role gets admin permission.
Use a configuration change for these scenarios:
Delete the administrator role.
Remove administrator permission from the administrator role.
Configure LDAP.
Shut down the Pentaho Server.
Open repository.spring.xml in pentaho-server/pentaho-solutions/system.
Find the immutableRoleBindingMap element.
Replace the entire node with this XML.
Replace yourAdminRole with the role that should have administrator permission.
Restart the Pentaho Server.
Configure the design tools and utilities
Before using design tools and utilities, perform configuration tasks for each workstation running these tools.
Depending on how these tools and utilities were installed, they might be located on different machines than the Pentaho Server.
BA design tools
The following table describes the BA design tools and their uses:
Pentaho Aggregation Designer
Optimizes the multidimensional Mondrian data model.
Pentaho Metadata Editor
Creates relational data models and refines the models created with the Data Access Wizard.
Pentaho Report Designer
A desktop design tool used to create both simple and sophisticated pixel-perfect operational reports.
Pentaho Schema Workbench
Creates multidimensional Mondrian data models and refines the models created with the Data Access Wizard.
Your IT administrator must specify the data connections for these tools.
Start and stop BA design tools
Each BA design tool has its own startup command. Start and stop each tool separately.
Start BA design tools
The startup method depends on your operating system.
Replace pentaho with your Pentaho installation directory.
Run the .bat file for the tool you want.
Aggregation Designer: pentaho\design-tools\aggregation-designer\startaggregationdesigner.bat
Metadata Editor: pentaho\design-tools\metadata-editor\metadata-editor.bat
Report Designer: pentaho\design-tools\report-designer\report-designer.bat
Schema Workbench: pentaho\design-tools\schema-workbench\workbench.bat
You can also create shortcuts to these files.
Run the .sh file for the tool you want.
Aggregation Designer: pentaho/design-tools/aggregation-designer/startaggregationdesigner.sh
Metadata Editor: pentaho/design-tools/metadata-editor/metadata-editor.sh
Report Designer: pentaho/design-tools/report-designer/report-designer.sh
Schema Workbench: pentaho/design-tools/schema-workbench/workbench.sh
Open the .app file for the tool you want.
Aggregation Designer: pentaho/design-tools/aggregation-designer/Aggregation Designer.app
Metadata Editor: pentaho/design-tools/metadata-editor/Metadata Editor 64-bit.app
Report Designer: pentaho/design-tools/report-designer/Pentaho Report Designer.app
Schema Workbench: pentaho/design-tools/schema-workbench/Schema_Workbench.app
Stop BA design tools
Use the standard exit method for your tool.
Aggregation Designer: Close the application window.
Metadata Editor, Report Designer, and Schema Workbench: Select File > Exit.
Save your work before you exit.
PDI design tools and utilities
Pentaho Data Integration allows users to design and manage complex ETL workloads with a graphical user interface.
Kitchen, Pan, and Carte are command-line tools for executing work modeled in the PDI client: Pan runs transformations, Kitchen runs jobs, and Carte provides a lightweight server for remote execution.
Use Carte clusters to:
Schedule jobs to run on a remote Carte server.
Stop Carte from the command line interface or URL.
Run transformations and jobs from the repository on the Carte server.
See the Pentaho Data Integration document for details on Kitchen, Pan, and Carte.
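To make the division of labor concrete, here is a sketch of typical invocations from the data-integration directory; the file paths, parameter name, and port are illustrative examples, not defaults:

```shell
# Illustrative command lines (Linux); all paths and names are examples.

# Pan runs a transformation from a .ktr file:
./pan.sh -file=/home/etl/sales_load.ktr -level=Basic

# Kitchen runs a job from a .kjb file, passing a named parameter:
./kitchen.sh -file=/home/etl/nightly.kjb -param:RUN_DATE=2024-01-31

# Carte starts a lightweight server on a host and port for remote execution:
./carte.sh localhost 8081
```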
Start and stop PDI design tools and utilities
Each PDI tool has a specific function, so you start and stop each one individually.
Starting PDI tools and utilities
Run the batch, script, or app files in the data-integration folder.
You can find data-integration in the folder where you installed the tool or utility.
Use the file for your operating system.
PDI client: spoon.bat
Carte: carte.bat
Kitchen: kitchen.bat
Pan: pan.bat
Tip: Consider making a shortcut for each .bat file. Add it to the desktop or the Start menu.
PDI client: spoon.sh
Carte: carte.sh
Kitchen: kitchen.sh
Pan: pan.sh
PDI client: data integration.app
Carte: carte.app
Kitchen: kitchen.app
Pan: pan.app
Stopping PDI design tools
Use these methods to stop the PDI design tools:
PDI client: Close the window or select File > Exit.
Carte: Stop it from the command line interface or URL.
Pan and Kitchen: Stop them from the command line interface.
For details on stopping Kitchen, Pan, or Carte, see the Pentaho Data Integration documentation.
Increase the PDI client memory limit
As a best practice, increase PDI's memory limit so the Pentaho Server and the PDI client (Spoon) can perform memory-intensive tasks, process or sort large datasets, and run complex transformations and jobs.
Increase the memory limit for both the Pentaho Server and the PDI client.
If you do not increase the memory limit, PDI uses the default settings in the startup scripts.
For Pentaho Server memory changes, see Increase Pentaho Server memory limit for custom installations on Windows or Linux.
Increase memory for the PDI client (Spoon)
Exit from the PDI client if it is currently running.
Open the PDI client startup script with a text editor.
The startup script depends on your operating system:
Windows: pentaho\design-tools\data-integration\spoon.bat
Linux: pentaho/design-tools/data-integration/spoon.sh
Modify the -Xmx value to a larger upper memory limit. This example allocates 2 GB of heap space:
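As a sketch (assuming an older, Java 7-era PDI where MaxPermSize is still a valid option; the exact line and defaults vary by version), the memory options in spoon.sh look something like this after the edit:

```shell
# Illustrative JVM options line from spoon.sh; defaults vary by PDI version.
# -Xmx2048m raises the maximum Java heap to 2 GB.
PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2048m -XX:MaxPermSize=256m"
```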
Save and close the startup file.
Start the PDI client.
If you still see memory errors, increase the value again.
Alternative: Set an environment variable
Instead of modifying the startup script, you can set the PENTAHO_DI_JAVA_OPTIONS environment variable on your client to -Xmx2g -XX:MaxPermSize=256m.
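On Linux, for example, that can be done in the shell before launching the client (the value is the one given above):

```shell
# Export JVM options for the PDI client; picked up at startup instead of
# editing the spoon.sh/spoon.bat script directly.
export PENTAHO_DI_JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=256m"
```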