Manually add a cluster connection
You can manually create a cluster connection by supplying the `site.xml` files, which are typically provided by the cluster administrator.
Note: If you are using high availability (HA) clusters, you must manually add the connection information to create the cluster connection.
Perform the following steps to manually add a cluster connection.
1. In the PDI client, create a new job or transformation or open an existing one.
2. Click the View tab and then right-click the Hadoop Clusters folder.
3. Click New cluster.

   The Hadoop Cluster dialog box appears.
4. Enter the connection information from the cluster administrator in the Hadoop Cluster dialog box.
Note: As a best practice, use Kettle variables for each connection parameter value to reduce risks associated with running jobs and transformations in environments that are disconnected from the repository.
| Option | Description |
| --- | --- |
| Cluster Name | Enter the name you want to assign to the cluster connection.<br>Note: Valid cluster names may include uppercase and lowercase letters and numbers. In addition, the only special character allowed is a dash (-). To ensure a valid cluster name, do not use any other symbols, punctuation characters, or blank spaces.<br>After you create the connection, you can locate this named connection in the View tab on the PDI client. |
| Driver and Version | Select the distribution of Hadoop on the cluster and its version number. |
| Site XML files | Enter the location of the `site.xml` files provided by the cluster administrator. Click **Browse to add file(s)** and browse to the directory containing the `site.xml` files. Pentaho creates the applicable directory on the machine where the PDI client is located and copies the `site.xml` files to that directory.<br>Alternatively, if you leave this option blank, Pentaho creates the directory for the distribution and version of Hadoop you selected in the Driver and Version options. You must then copy the `site.xml` files to that directory. |
| HDFS | Provide the following information for the HDFS node:<br>- Enter the Hostname for the HDFS node in the Hadoop cluster.<br>- Enter the Port for the HDFS node in the Hadoop cluster. Note that if the cluster is enabled for high availability (HA), then a port number is not needed, and you should clear the port number.<br>- Enter the Username and Password for the HDFS node, which are typically provided by the cluster administrator. |
| JobTracker | If you have a separate JobTracker node, provide the following information:<br>- Enter the Hostname for the JobTracker node in the Hadoop cluster.<br>- Enter the Port for the JobTracker node in the Hadoop cluster. |
| ZooKeeper | If you have a ZooKeeper node and want to connect to a ZooKeeper service, provide the following information:<br>- Enter the Hostname for the ZooKeeper node in the Hadoop cluster.<br>- Enter the Port for the ZooKeeper node in the Hadoop cluster. |
| Oozie | Enter the Oozie client address in the Hostname field. Supply this URL only if you want to connect to the Oozie service. |
| Kafka | Enter the host:port pair(s) for the initial connection to the Kafka cluster in the Bootstrap servers field. Use a comma-separated list for multiple servers, for example, `host1:port1,host2:port2`. Although there is no need to include all servers used for Kafka, you might want to include more than one in case a server is down. |
5. Click **Next** and specify the security option for the cluster.
- If the Hadoop cluster is non-secure, select **None** and click **Next** to [test the connection](Test%20the%20cluster%20connection%20(Add%20Hadoop%20cluster%20connection).md).
- If the Hadoop cluster is secure, you need to add security to the cluster connection. See [Add security to cluster connections](Secure%20cluster%20connections%20(Add%20Hadoop%20cluster%20connection).md) for instructions.
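
If you script your environment setup, it can help to check values before you type them into the Hadoop Cluster dialog box. The following Python sketch illustrates the naming rule given for the Cluster Name option and the comma-separated `host:port` format given for the Kafka Bootstrap servers field. The function names `is_valid_cluster_name` and `parse_bootstrap_servers` are hypothetical helpers written for this example and are not part of PDI. The sketch also assumes that spaces around commas can be ignored and that ports are integers from 1 to 65535, neither of which is stated above.

```python
import re
from typing import List, Tuple

# Letters, digits, and the dash only, as described for the Cluster Name option.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9-]+")
_PORT = re.compile(r"[0-9]+")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name uses only letters, numbers, and dashes."""
    return _CLUSTER_NAME.fullmatch(name) is not None


def parse_bootstrap_servers(value: str) -> List[Tuple[str, int]]:
    """Split a 'host1:port1,host2:port2' string into (host, port) pairs.

    Raises ValueError when an entry is not a host:port pair.
    """
    servers = []
    for entry in value.split(","):
        entry = entry.strip()  # assumption: surrounding spaces are ignored
        host, sep, port = entry.rpartition(":")
        if not sep or not host or _PORT.fullmatch(port) is None:
            raise ValueError(f"expected host:port, got {entry!r}")
        port_number = int(port)
        if not 1 <= port_number <= 65535:  # assumption: standard TCP port range
            raise ValueError(f"port out of range in {entry!r}")
        servers.append((host, port_number))
    return servers


if __name__ == "__main__":
    print(is_valid_cluster_name("prod-cluster-01"))
    print(is_valid_cluster_name("prod cluster"))
    print(parse_bootstrap_servers("host1:9092, host2:9093"))
```

A check like this only catches formatting mistakes early. It does not confirm that the hosts are reachable, so you still need to test the cluster connection in the PDI client.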