Connecting to a Hadoop cluster with the PDI client
To connect to a Hadoop cluster, install a driver, then create and test a named connection. A named connection stores cluster connection details under a friendly name.
After you set up a named connection, you can edit, duplicate, test, or delete it. Named connections help when you promote content between environments. You update the connection once, then reuse it across jobs and transformations.
Audience and prerequisites
This topic is intended for ETL developers, data engineers, and data analysts.
Before you begin, verify the Hadoop administrator has:
Set up your user account on the cluster.
Granted permissions to the required HDFS directories.
You typically need access to your home directory and any directories used by your jobs and transformations.
Pentaho ships with a default Apache Hadoop driver already installed. Supported versions of other drivers, including Amazon EMR, Apache Vanilla, Cloudera (CDP), and Google Dataproc, must be downloaded from the Support Portal. You need a driver for each Hadoop vendor and distribution you connect to.
To install a driver for the PDI client, see Install a driver for the PDI client.
When drivers for new Hadoop versions are released, download them from the Support Portal, then add them to Pentaho. Install these drivers using the procedure in Install Pentaho Data Integration and Analytics.
Ask the Hadoop administrator for a copy of the cluster site.xml files and the following information:
Distribution and version of the cluster.
IP addresses and port numbers for HDFS, JobTracker, and ZooKeeper (if used).
Kerberos and cluster credentials (for secured clusters).
Oozie URL (if used).
Hadoop drivers
Pre-installed Apache Hadoop driver
You can access and use the installed Apache Hadoop driver for HDFS copy file operations and for executing input and output transformations and jobs. The driver works with secure and unsecured clusters. Because the driver is pre-installed, you do not have to install a .kar file.
Supported big data steps include:
Both operating system file browsers and the Pentaho virtual file system (VFS) browser are supported. For more information, see Connecting to Virtual File Systems.
Only Hadoop clusters that follow standard Hadoop connection rules work with the Apache Hadoop driver. For example, EMR clusters may work, but MapR does not because its connection rules are not standard.
The Apache Hadoop driver is not intended to support higher-level operations such as Hive, HBase, Sqoop, and Oozie. If you need these services, install the vendor driver for your distribution.
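For illustration, a path on such a cluster might be referenced in a file step as follows. The named connection mycluster, the file path, and the hc:// scheme for named-connection VFS paths are assumptions used only for this sketch; the hdfs:// form is the standard Hadoop URL with the NameNode host and port.

    # Pentaho VFS path through a named connection (assumed hc:// scheme):
    hc://mycluster/user/pentaho/input/sales.csv
    # Standard HDFS URL with NameNode host and port:
    hdfs://namenode.example.com:8020/user/pentaho/input/sales.csv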
Apache Vanilla Hadoop driver
The Apache Vanilla Hadoop driver is not pre-installed with PDI. After you install it, you can connect to Apache Vanilla Hadoop clusters.
For Apache Vanilla Hadoop, this driver supports services including HDFS file copy, big data file formats (ORC, Avro, and Parquet), Hive operations, MapReduce jobs, and Sqoop. This applies to secure and unsecured environments.
Install a driver for the PDI client
Before you can add a named connection to a cluster, install a driver for the Hadoop vendor and version you are connecting to.
How the driver is registered depends on whether you are connected to a repository when you install it:
Connected to Pentaho Repository: The driver is placed into the Pentaho Server directory. It is available to all users.
Not connected to Pentaho Repository: The driver is placed into the local PDI directory. It is available only to you.
This task assumes you are not using the default Hadoop driver. Download the vendor-specific driver from the Support Portal.
Perform the following steps to install a driver for the PDI client:
In the PDI client, select the View tab of the transformation or job.
Right-click the Hadoop clusters folder, then select Add driver.
The Add driver dialog box appears.
Select Browse.
The Choose File to Upload dialog box appears.
Navigate to the location where you downloaded the driver file.
Select the driver (.kar file) you want to add, then select Open and Next.
The selected file name appears in the Browse field. Vendor distribution files include their abbreviations in the .kar file names (see the hypothetical listing after this procedure):
Amazon EMR (emr)
Apache Vanilla Hadoop (apachevanilla)
Azure HDInsight (hdi)
Cloudera Data Platform (cdp)
Google Dataproc (dataproc)
Select Next.
The Congratulations dialog box appears and indicates that you must restart Pentaho Server or the PDI client.
The Driver field in the New cluster and Import cluster dialog boxes now shows the driver you added.
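For illustration only, a download directory might look like the following. The file names below are hypothetical; actual .kar file names differ by release, but each includes its vendor abbreviation.

    $ ls ~/Downloads/*.kar
    apachevanilla-driver.kar    # Apache Vanilla Hadoop (hypothetical name)
    cdp-driver.kar              # Cloudera Data Platform (hypothetical name)
    emr-driver.kar              # Amazon EMR (hypothetical name)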
Configure a CDP Public Cloud cluster (optional)
CDP Public Cloud DataHub clusters use gateway nodes for third-party tools. Install and use a gateway node for Pentaho. Use CentOS for the gateway node.
Perform the following steps on the cluster gateway node:
Switch to root, then add a user to the sudoers list. Package installation requires sudo. SSH into the gateway node as the cloudbreak user, then elevate. (Example commands for these steps appear after this procedure.)
Install tigervnc-server to access the gateway node UI. The vncpasswd command sets the VNC password, and vncserver starts the VNC service.
Install webkitgtk. This package is required because libwebkitgtk-1.0.0 is deprecated in the CentOS repositories.
Verify that Kerberos is installed and configured.
Download a keytab from Cloudera Manager.
Upload the keytab to the gateway node.
Run one of the following commands (shown in the example after this procedure):
Use interactive login.
Use a keytab.
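The command blocks for these steps are not shown above. The following is a minimal sketch of the sequence on a CentOS gateway node, reconstructed from the step descriptions. The user name devuser, the realm EXAMPLE.COM, the keytab file name, and the use of the wheel group for sudo access are assumptions; substitute your own values and confirm package names and realm details with the cluster administrator.

    # Step 1: as root, create a user who can use sudo
    # (SSH in as the cloudbreak user, then elevate).
    sudo su -
    useradd devuser
    passwd devuser
    usermod -aG wheel devuser   # members of wheel can use sudo on CentOS (assumption)

    # Step 2: install and start the VNC server for gateway node UI access.
    yum install -y tigervnc-server
    su - devuser
    vncpasswd                   # sets the VNC password
    vncserver                   # starts the VNC service

    # Step 3: install webkitgtk (libwebkitgtk-1.0.0 is deprecated).
    sudo yum install -y webkitgtk

    # Step 4: verify that the Kerberos client is installed and configured.
    klist -V                    # prints the Kerberos version if installed
    cat /etc/krb5.conf          # confirm the default realm

    # Authenticate interactively with a password:
    kinit devuser@EXAMPLE.COM
    # ...or with the keytab downloaded from Cloudera Manager:
    kinit -kt devuser.keytab devuser@EXAMPLE.COM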
Adding a cluster connection
Add named Hadoop cluster connections by importing them or creating them manually.
If you use high availability (HA) clusters, create the connection manually.
If you are connected to the Pentaho Repository, other users can reuse the connection. If you are not connected, the connection is available only locally.
Security is set up per user. Security settings are not stored in the repository.
Import a cluster connection
You can add a cluster by importing the site.xml files from an existing cluster.
In the PDI client, create a new transformation or job, or open an existing transformation or job.
Select the View tab.
Right-click the Hadoop Clusters folder.
Select Import cluster.
The Hadoop Clusters dialog box appears.
Enter a name in Cluster name.
Valid cluster names may include uppercase and lowercase letters and numbers. The only special character allowed is the dash (-). Do not use other symbols, punctuation, or spaces.
After you create the connection, you can locate it in the View tab.
Note: If the Cluster name is already in use, you will be prompted to overwrite the existing cluster. Overwriting cannot be undone.
Select Cancel, then enter a unique name.
Select Yes, Overwrite to overwrite the existing cluster.
Use Driver and Version to select the Hadoop distribution and version.
The Support Portal provides supported drivers.
Select Browse to add file(s), then browse to the directory containing the site.xml files.
Required files:
hive-site.xml
mapred-site.xml
yarn-site.xml
core-site.xml
hbase-site.xml
hdfs-site.xml
oozie-site.xml (only if you use Oozie)
Select Open.
The Site XML files section shows the files you selected.
If you are connecting to a secure cluster, enter Username and Password in the HDFS section.
Select Next, then choose a security option.
For a non-secure Hadoop cluster, select None. Select Next, then test the connection.
For a secure Hadoop cluster, see Add security to cluster connections.
Manually add a cluster connection
You can create a cluster connection manually by supplying the site.xml files. The cluster administrator typically provides these files.
Note: If you use high availability (HA) clusters, you must create the connection manually.
In the PDI client, create a new job or transformation, or open an existing one.
Select the View tab.
Right-click the Hadoop Clusters folder.
Select New cluster.
The Hadoop Cluster dialog box appears.
Enter the connection information from the cluster administrator.
Note: Use Kettle variables for each value. This reduces risk when you run jobs and transformations without a repository connection.
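For example, a minimal sketch of Kettle variables defined in kettle.properties, which lives in the .kettle directory under your home directory. The variable names and host values are illustrative assumptions; the ports shown are the common defaults for HDFS, the YARN ResourceManager, and ZooKeeper.

    # $HOME/.kettle/kettle.properties (illustrative values)
    HDFS_HOSTNAME=namenode.example.com
    HDFS_PORT=8020
    JOBTRACKER_HOSTNAME=resourcemanager.example.com
    JOBTRACKER_PORT=8032
    ZOOKEEPER_HOSTNAME=zk1.example.com
    ZOOKEEPER_PORT=2181

In the dialog box fields, reference the values as ${HDFS_HOSTNAME}, ${HDFS_PORT}, and so on. PDI resolves the variables at run time.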
Cluster Name
Enter the name you want to assign to the cluster connection. Note: Valid cluster names may include uppercase and lowercase letters and numbers. In addition, the only special character allowed is a dash (-). To ensure a valid cluster name, do not use any other symbols, punctuation characters, or blank spaces.
After you create the connection, you can locate this named connection in the View tab on the PDI client.
Current Configured Driver and Version
Read-only information about the Hadoop distribution on the cluster and its version number.
Site XML files
Enter the location of the site.xml files provided by the cluster administrator. Select Browse to add file(s), then browse to the directory containing the site.xml files. Pentaho creates the applicable directory on the machine where the PDI client is located and copies the site.xml files to that directory.
Alternatively, if you leave this option blank, Pentaho creates the directory for the distribution and version of Hadoop you selected in the Driver and Version options. You must then copy the site.xml files to that directory.
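As a sketch, assuming a local (non-repository) PDI client on Linux, you would copy the files into the configuration directory that Pentaho created. The path below is an assumption based on the default local metastore location; verify the actual directory that Pentaho created for your driver and version.

    # Hypothetical metastore path for a named connection called my-cluster:
    cp /path/from/admin/*-site.xml \
       ~/.pentaho/metastore/pentaho/NamedCluster/Configs/my-cluster/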
HDFS
Provide the following information for the HDFS node:
Enter the Hostname for the HDFS node.
Enter the Port for the HDFS node.
If the cluster is enabled for high availability (HA), you do not need a port. Clear the port number.
Enter the Username and Password, which the cluster administrator typically provides.
JobTracker
If you have a separate JobTracker node, provide the following information:
Enter the Hostname for the JobTracker node.
Enter the Port for the JobTracker node.
ZooKeeper
If you have a ZooKeeper node and want to connect to a ZooKeeper service, provide the following information:
Enter the Hostname for the ZooKeeper node.
Enter the Port for the ZooKeeper node.
Oozie
Enter the Oozie client address in Hostname. Supply this URL only if you want to connect to Oozie.
Kafka
Enter the host:port pair(s) for the initial Kafka connection in Bootstrap servers. Use a comma-separated list for multiple servers, for example, host1:port1,host2:port2. You do not need to include every Kafka broker, but add more than one in case a server is down.
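For example, a sketch with assumed broker host names; 9092 is the default Kafka broker port, and the value can also be supplied through a Kettle variable such as ${KAFKA_BOOTSTRAP_SERVERS}:

    broker1.example.com:9092,broker2.example.com:9092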
Select Next.
Choose a security option.
For a non-secure Hadoop cluster, select None. Select Next, then test the connection.
For a secure Hadoop cluster, see Add security to cluster connections.
Add security to cluster connections
If you have a secure Hadoop cluster, the available security options depend on the driver. All drivers support Kerberos. If you use a Hortonworks driver, you can also select Knox.
If you are connected to a Pentaho Repository, you can specify additional Kerberos options for secure impersonation. See the Administer Pentaho Data Integration and Analytics document for details.
If you are not sure what security type is configured, contact the cluster administrator.
Note: For Kerberos, you need the authentication user name and either a password or a keytab file. For Knox, you need the Gateway URL, user name, and password.
Specify Kerberos security
Note: You can define different principal users for each named connection only if all clusters for these connections are in the same Kerberos realm. See MIT Kerberos Documentation for more information about Kerberos realms.
Select Kerberos as the security type.
Select Next.
Choose a security method, then enter the credentials from the cluster administrator.
Password: Specify Authentication username and Password.
If you are connected to the Pentaho Repository and use secure impersonation, specify Impersonation username and Password.
See Install Pentaho Data Integration and Analytics if the environment requires advanced settings, the server is on Windows, or you use a Cloudera Impala database for secure impersonation.
Keytab: Specify Authentication username and Authentication Keytab.
Select Browse to locate your keytab file.
If you are connected to the Pentaho Repository and use secure impersonation, specify Impersonation username and Impersonation Keytab.
See Install Pentaho Data Integration and Analytics if the environment requires advanced settings, the server is on Windows, or you use a Cloudera Impala database for secure impersonation.
Select Next to test the connection.
The Test results dialog box appears.
For each tested connection, the dialog box shows one of the following icons:
A green checkmark indicates the connection to the cluster service was successful.
A yellow caution symbol indicates the cluster service information was not supplied. The test for that component was skipped.
A red circle-backslash indicates the connection failed. Check the connection information, then test again. If you suspect a different issue, see the troubleshooting section in Install Pentaho Data Integration and Analytics or contact the cluster administrator.
Note: Select the drop-down arrow in the Hadoop file system test for more details.
Select Finish.
If the connection test returns no errors, PDI is connected.
If the test returns errors, see the troubleshooting section in Administer Pentaho Data Integration and Analytics or contact the cluster administrator. Then test again.
Test an existing cluster connection
In the PDI client, select the View tab.
Navigate to the Hadoop Clusters folder.
If needed, expand the Hadoop Clusters folder.
Right-click the cluster you want to test, then select Test cluster.
The Test results dialog box appears.
For each tested connection, the dialog box shows one of the following icons:
A green checkmark indicates the connection to the cluster service was successful.
A yellow caution symbol indicates the cluster service information was not supplied. The test for that component was skipped.
A red circle-backslash indicates the connection failed. Check the connection information, then test again. If you suspect a different issue, see the troubleshooting section in Administer Pentaho Data Integration and Analytics or contact the cluster administrator.
Note: Select the drop-down arrow in the Hadoop file system test for more details.
Select Finish.
If the connection test returns no errors, PDI is connected.
If the test returns errors, see the troubleshooting section in Administer Pentaho Data Integration and Analytics, then test again.
Managing Hadoop cluster connections
After you add a cluster connection, you can edit, duplicate, test, or delete it.
Edit a Hadoop cluster connection
How updates apply depends on whether you are connected to a repository:
Connected to a repository: Changes apply to all repository transformations and jobs. Connection details are loaded at runtime, unless the connection cannot be found.
Not connected to a repository: Changes apply only to local (file system) transformations and jobs. Saved transformations and jobs do not pick up the changes until you save them again.
In the View tab, select the Hadoop clusters folder.
Right-click the connection, then select Edit.
You can also double-click the connection.
Update the values, then select Next.
If the cluster uses high availability (HA), clear the port number.
For Security type, select None, then select Next.
To add or edit security, see Add security to cluster connections.
Select Close to save your changes.
Duplicate a Hadoop cluster connection
Duplicating a connection is useful for testing changes without affecting production settings.
In the View tab, select the Hadoop clusters folder.
Right-click the connection, then select Duplicate cluster.
In Cluster Name, enter a new name.
The system prefixes the name with copy-of-.
Select Browse to add file(s), then select the site.xml files to import.
Duplicating a connection copies the existing site.xml files to a new metastore directory. If you select site.xml files here, they replace the copied files.
Select Next.
For Security type, select None, then select Next.
To add or edit security, see Add security to cluster connections.
Select Edit cluster.
Update the cluster configuration values, then select Next.
Select Close.
Test a cluster connection
See Test an existing cluster connection.
Delete a Hadoop cluster connection
If you delete a named connection, you cannot restore it. To use the connection again, you must recreate it.
In the View tab, select the Hadoop clusters folder.
Right-click the connection, then select Delete cluster.
Select Yes, Delete.
The cluster connection is deleted, including all security credentials.
Connect other Pentaho components to a cluster
See Install Pentaho Data Integration and Analytics for advanced settings for connecting other Pentaho components.
Archived pages
The original subpages for this topic were moved to Connecting to a Hadoop cluster with the PDI client (archive).