Installing Pentaho Data Optimizer on a Cloudera Manager cluster
To install Data Optimizer on a Cloudera Manager cluster, use the following workflow:
Step 1: Download the Pentaho Data Optimizer software
Step 2: Add the Pentaho Data Optimizer parcel to a parcel repository
Step 3: Download the parcel to Cloudera Manager
Step 4: Install the custom service descriptor on the Cloudera Manager server
Step 5: Distribute and activate the parcel
Step 6: Add the Pentaho Data Optimizer service to the cluster
Step 7: Configure Data Optimizer volumes to restart automatically
Step 8: Configure HDFS to use the Data Optimizer volume
Step 9: Refresh HDFS Datanodes after adding Data Optimizer volumes
Step 10: Data Optimizer extension for Cloudera Manager
Step 1: Download the Pentaho Data Optimizer software
Follow the steps below to download the Data Optimizer install files and verify the integrity of the files. This task assumes you know how to calculate an MD5 checksum.
Download the TAR file containing both the custom service descriptor (CSD) file and the parcel for Pentaho Data Optimizer from the Support Portal. The file is named pdso-1.3.x.x-el7.tar.
Verify the downloaded content.
Calculate the MD5 checksum of the downloaded file.
Compare it to the MD5 checksum provided on the software download page.
Ensure the two values match.
Extract the contents of the pdso-1.3.x.x-el7.tar file to a directory.
The following files are extracted:
pdso-1.3.x.x-el7.parcel
The Pentaho Data Optimizer parcel.
pdso-1.3.x.x-el7.parcel.sha
A SHA-1 checksum of the parcel file, used for the local parcel repository.
manifest.json
A manifest for the hosted parcel repository.
pdso-1.3.x.x-el7.jar
The file containing the custom service descriptor.
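The verification and extraction steps above can be sketched in shell as follows. This is a minimal sketch, not an official installer script: the helper names verify_md5 and unpack_tar are hypothetical, and the expected checksum is whatever value the software download page lists for your file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating Step 1; names are not part of the product.

# verify_md5 FILE EXPECTED_MD5
# Prints "OK" when the file's MD5 checksum matches the expected value.
verify_md5() {
  local file="$1"
  local expected
  local actual
  [ -f "$file" ] || { echo "No such file: $file" >&2; return 2; }
  # Normalize the expected value to lowercase, since md5sum prints lowercase hex.
  expected="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  # md5sum prints "<hash>  <file>"; keep only the hash.
  actual="$(md5sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}

# unpack_tar ARCHIVE DEST_DIR
# Extracts the archive into DEST_DIR, creating the directory if needed.
unpack_tar() {
  local archive="$1"
  local dest="$2"
  mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}
```

For example, after sourcing these helpers, you might run verify_md5 on pdso-1.3.x.x-el7.tar with the checksum from the download page, and call unpack_tar only if it prints OK.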
Step 2: Add the Pentaho Data Optimizer parcel to a parcel repository
To install Pentaho Data Optimizer with Cloudera Manager (CM), you must place the parcel in a repository accessible to your CM server.
Internally hosted remote parcel repository
If you have an existing parcel repository hosted on a webserver in your own network that is accessible from the Cloudera Manager server, perform the following steps:
In the /<parcel_dir>/ directory on your parcel repository webserver, create a subdirectory called pdso/1.3.0/.
Copy the pdso-1.3.0.x-el7.parcel and manifest.json files to the /<parcel_dir>/pdso/1.3.0/ directory.
Change the file ownership and permissions as necessary on /<parcel_dir>/pdso/, /<parcel_dir>/pdso/1.3.0/, and their contents, so the webserver can serve these files.
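As an illustration, the three steps above might look like the following shell sketch. The function name stage_parcel is hypothetical, and the 755/644 permissions are an assumption that suits many webservers; use whatever ownership and modes your own webserver requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating Step 2 for an internally hosted repository.

# stage_parcel SRC_DIR PARCEL_DIR VERSION
#   SRC_DIR     directory where the TAR file was extracted
#   PARCEL_DIR  the /<parcel_dir>/ root served by the webserver
#   VERSION     subdirectory name, for example 1.3.0
stage_parcel() {
  local src_dir="$1"
  local parcel_dir="$2"
  local version="$3"
  local dest="$parcel_dir/pdso/$version"

  mkdir -p "$dest" || return 1
  # The glob matches the .parcel file only, not the .parcel.sha file.
  cp "$src_dir"/pdso-*.parcel "$src_dir/manifest.json" "$dest/" || return 1
  # Assumed permissions: traversable directories, world-readable files.
  chmod 755 "$parcel_dir/pdso" "$dest" && chmod 644 "$dest"/*
}
```

A call such as stage_parcel with the extraction directory, your repository root, and 1.3.0 would produce the /<parcel_dir>/pdso/1.3.0/ layout described above.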
No internally hosted parcel repository
If you do not have a private, internally-hosted parcel repository, see Configuring a Local Parcel Repository in the Cloudera documentation for more information on configuring a parcel repository.
Step 3: Download the parcel to Cloudera Manager
If you are using a local parcel repository, skip this section and proceed to Distribute and activate the parcel.
This task assumes you have an internally-hosted remote parcel repository.
Log in to the Cloudera Manager Admin Console and click Hosts > Parcels in the top navigation bar.
The Parcels page opens.
Click Configuration.
The Parcel Configurations dialog box opens.
Locate the Remote Parcel Repository URLs setting and add a new entry.
The value of the new entry is the location of the Data Optimizer parcel on your internally hosted remote repository server. For example, if your remote repository is located at https://myrepo.example.com/parcel-repo/, then the value would be https://myrepo.example.com/parcel-repo/pdso/1.3.0/.
Save your changes to return to the Parcels page.
Click Check for New Parcels.
The pdso parcel entry displays in the parcel list with a status of Available Remotely.
Click Download for the pdso parcel.
When the process is complete, the status of the parcel changes to Downloaded.
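The Remote Parcel Repository URLs entry is simply the repository base URL followed by the pdso/<version>/ subdirectory created in Step 2. A minimal sketch of that composition, using a hypothetical helper named parcel_repo_url and the example host from this section:

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the value for the Remote Parcel Repository URLs entry.

# parcel_repo_url BASE_URL VERSION
parcel_repo_url() {
  local base="$1"
  local version="$2"
  # Strip one trailing slash from the base so the result has no double slash.
  printf '%s/pdso/%s/\n' "${base%/}" "$version"
}

# Example invocation (commented out so that sourcing this file has no side effects):
# parcel_repo_url "https://myrepo.example.com/parcel-repo/" "1.3.0"
```

The version argument should match the subdirectory name you created on the repository webserver.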
Step 4: Install the custom service descriptor on the Cloudera Manager server
The Data Optimizer Custom Service Descriptor (CSD) file describes the service to Cloudera Manager (CM). This file is required so that CM is aware of the service and its roles.
Copy the CSD file to the /opt/cloudera/csd/ directory on the Cloudera Manager server.
Change the file ownership to the cloudera-scm user and group:
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/pdso-1.3.0.<x>-el7.jar
Change the file permissions to 640:
chmod 640 /opt/cloudera/csd/pdso-1.3.0.<x>-el7.jar
Restart the Cloudera Manager server.
If you are running the Cloudera Manager Management Service, you must restart it on the Cloudera Manager dashboard.
Step 5: Distribute and activate the parcel
Before you can add the Data Optimizer service to your cluster, you must first distribute the pdso parcel to all the nodes in the cluster and then activate the parcel.
Log in to the Cloudera Manager Admin Console and navigate to Hosts > Parcels.
Find the pdso parcel.
The status of the parcel should be Downloaded.
If you do not see the pdso parcel, click Check for New Parcels at the top of the Parcels page.
Click Distribute for the pdso parcel.
The distribution process starts immediately. Depending on the size of the cluster, this process may take several minutes or more. Allow it to complete.
The parcel status changes to Distributed.
Click Activate for the pdso parcel.
The activation process starts immediately. Depending on the size of the cluster, this process may take several minutes or more. Allow it to complete.
The parcel status changes to Activated.
The Data Optimizer service is ready to be added to your Cloudera cluster.
Step 6: Add the Pentaho Data Optimizer service to the cluster
To add Data Optimizer to your cluster, perform the following steps:
Log in to the Cloudera Manager dashboard.
Navigate to the cluster, open the action menu for the cluster, and select Add Service.
Select Pentaho Data Optimizer from the list of available services and click Continue.
Assign hosts for the Volume role.
On the Assign Roles page, locate the Volume role for the Data Optimizer service and click the Volume dialog box.
Select the hosts to assign to the Volume role and click OK.
Only hosts that have the HDFS Datanode role are valid candidates to add the Data Optimizer Volume role.
Assign hosts for the Volume Monitor role:
Navigate back to the Assign Roles page and locate the Volume Monitor role for the Data Optimizer service.
Click the Volume Monitor dialog box, then assign hosts in your cluster to the Volume Monitor role.
If prompted, select Custom.
Select each of the hosts to which you added the Volume role and click OK.
Note: Each host with a Volume instance must have a Volume Monitor instance as well. Do not select hosts without a Volume instance.
Click Continue.
Proceed to the Review Changes page, then enter the Data Optimizer volume configuration parameters for your environment.
See the Data Optimizer configuration parameters section for information about how to configure Data Optimizer volumes.
Note: Remember the value of the MOUNT_POINT parameter. You will need this value when configuring HDFS to use the Data Optimizer volume.
After you have entered and confirmed all your Data Optimizer configuration values, click Continue.
The Command Details page opens. From here, you can monitor the First Run Command.
At this point in the process, Cloudera Manager attempts to start the service and launch the Volume instances for the first time.
Monitor the start commands as they run in the background. Verify that all Volume and Volume Monitor instances start without error.
(Optional) If you encounter errors, you may need to troubleshoot.
Look at the stdout, stderr, and role logs in the Cloudera Manager UI.
If necessary, see Troubleshoot Data Optimizer.
After all Data Optimizer volumes have started, you can click through the remaining pages in the Add Service wizard to return to the Cloudera Manager dashboard.
Step 7: Configure Data Optimizer volumes to restart automatically
After you complete the Add Service wizard, you can configure Data Optimizer volumes to automatically start when the data nodes are rebooted.
Note: If Auto Start is enabled for the HDFS Datanode component, then this configuration change is required.
In the Cloudera dashboard, navigate to Cluster Admin > Service > Auto Start > Pentaho Data Optimizer.
Locate the Auto Start status for the Data Optimizer volume component and change it to Enabled.
Save your changes.
The Data Optimizer volumes start automatically when the data nodes are rebooted.
Step 8: Configure HDFS to use the Data Optimizer volume
Before HDFS datanodes can begin tiering blocks to the Data Optimizer volume, you must configure the HDFS datanodes to see the Data Optimizer volume and to recognize Data Optimizer as an ARCHIVE volume type. If you deployed Data Optimizer to some but not all datanodes, then you will need to create a new configuration group for the datanodes running Data Optimizer volumes.
From the Cloudera dashboard, navigate to HDFS > Configuration.
If you configured a subset of data nodes with Data Optimizer, create a new HDFS configuration group with the following steps:
At the top of the HDFS Configs window, open the Config Group drop-down menu and click Manage Config Groups.
The Manage HDFS Configuration Groups dialog box opens.
From the list of existing configuration groups, select the HDFS configuration group you want to copy.
Although this selection is typically the Default group, your selected group may differ if you are already using configuration groups to manage your data nodes.
Open the dropdown menu on the form and select Duplicate to create a copy of the selected configuration group.
Fill in the Create New Configuration Group form as follows:
Name: Datanode PDSO Volume Group
Description: Configuration Group for Datanodes with Data Optimizer volumes
Click OK.
In the Manage HDFS Configuration Groups dialog box, select the Datanode PDSO Volume Group from the list of configuration groups in the left pane.
On the right side of the dialog box, click the + (plus) icon to add hosts to the selected configuration group.
The Select Configuration Group Hosts dialog box opens.
In the Select Configuration Group Hosts dialog box, select the checkbox next to each of the datanodes with a Data Optimizer Volume.
Click OK.
You return to the Manage HDFS Configuration Groups dialog box.
Click Save.
The HDFS Configs page appears.
Locate the Datanode directories property (dfs.datanode.data.dir).
(Optional) If you did not create a configuration group, proceed to the next step. If you created a new HDFS configuration group because Data Optimizer is deployed to a subset of datanodes, perform the following steps:
Place your cursor over the Datanode directories field.
The + (override) icon appears to the right of the field.
Click the + icon to override the current entry.
When prompted, choose the HDFS Configuration Group and select the Datanode PDSO Volume Group you created previously in this task.
The new override value prepopulates with the value from the current configuration.
Place your cursor after the end of the current text in the Datanode directories text box and add the [ARCHIVE]<pdso_mount_point>/data value.
The <pdso_mount_point> value corresponds to the Pentaho Data Optimizer Mount Point property in the Data Optimizer Configuration.
For example, if the PDSO mount point is /mnt/pdso, then the value of the new entry would be [ARCHIVE]/mnt/pdso/data.
The property requires a comma-delimited list, so be sure to separate the new entry from the existing entries with a comma.
Note: As a best practice, create a subdirectory under the Data Optimizer mount point for the HDFS Datanode directory and assign it a name. In this example, the subdirectory name is data, but the name can be whatever you choose.
Save your work.
Refresh or restart your datanodes. See Refresh HDFS Datanodes after adding Data Optimizer volumes.
Verify Data Optimizer is working properly. See Tiering HDFS Blocks to Data Optimizer.
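To make the format of the Datanode directories value concrete, the following sketch shows how the new [ARCHIVE] entry is appended to an existing comma-delimited list. The helper name append_archive_dir is hypothetical, and the existing directory /dfs/dn in the example is an assumed placeholder, not a value taken from your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: computes the new dfs.datanode.data.dir value.

# append_archive_dir CURRENT_VALUE MOUNT_POINT [SUBDIR]
#   CURRENT_VALUE  existing comma-delimited Datanode directories value
#   MOUNT_POINT    the Data Optimizer mount point, for example /mnt/pdso
#   SUBDIR         subdirectory under the mount point (defaults to "data")
append_archive_dir() {
  local current="$1"
  local mount_point="$2"
  local subdir="${3:-data}"
  # Strip one trailing slash from the mount point before adding the subdirectory.
  local entry="[ARCHIVE]${mount_point%/}/$subdir"

  if [ -z "$current" ]; then
    printf '%s\n' "$entry"
  else
    # Entries are separated by a comma with no surrounding spaces.
    printf '%s,%s\n' "$current" "$entry"
  fi
}

# Example invocation (commented out so that sourcing this file has no side effects):
# append_archive_dir "/dfs/dn" "/mnt/pdso"
# prints: /dfs/dn,[ARCHIVE]/mnt/pdso/data
```

The printed value is what you would paste into the Datanode directories field; the helper does not change any Cloudera Manager setting itself.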
Step 9: Refresh HDFS Datanodes after adding Data Optimizer volumes
After you have configured HDFS data nodes to use Data Optimizer volumes, refresh or restart the data nodes so they can register the configuration change and use the volumes. The data node property that was modified is a refreshable configuration, so the data nodes can pick up new data directories without a restart, which can be disruptive.
Typically, Cloudera Manager prompts you to refresh your data nodes because Cloudera Manager detects that the configuration change is refreshable. Click the refresh icon when it appears at the top of the HDFS configuration page to refresh your data nodes. After the refresh, the data nodes can start tiering to Data Optimizer.
In some cases, such as if you created a new configuration role group, Cloudera Manager may prompt for a restart. In this case, contact the Data Optimizer implementation team and your Cluster Administrator to determine how best to proceed and resolve the required restart notifications. You still might be able to perform a refresh by executing the Refresh Cluster action. After the refresh, data nodes are ready to begin tiering as described in the previous paragraph.
After HDFS has been configured to use Data Optimizer Volumes, do not stop the Volume on a node when the Datanode service is still running. Doing so can result in unpredictable behavior, including HDFS marking a volume as failed.
To verify Data Optimizer is tiering properly, see Tiering HDFS Blocks to Data Optimizer.
Step 10: Data Optimizer extension for Cloudera Manager
Cloudera Manager extensions allow you to deploy and manage third-party services like Data Optimizer on a Cloudera cluster. The Data Optimizer extension for Cloudera Manager (CM) defines the Data Optimizer service and its roles for Cloudera Manager. This extension is compatible only with parcel-deployed Cloudera clusters.
The Data Optimizer extension contains a Custom Service Descriptor (CSD) file that defines the Data Optimizer service, the roles it provides, and how the service is managed. For example, the CSD file tells CM which scripts to call to start or stop the roles associated with the service.
You must deploy this CSD file directly to the CM server with root or sudo permissions.
The Data Optimizer extension also includes a parcel file that contains the Data Optimizer code in the form of executable binaries and scripts. Cloudera Manager executes the Data Optimizer code according to the instructions provided in the CSD file whenever the service or roles are started and stopped, or when changing log levels, collecting logs, or enabling/disabling the recovery mode. Deploy the parcel directly to the CM server or download it from a privately-hosted parcel repository.
The Data Optimizer extension for Cloudera Manager contains the following roles:
Volume
Instances of the Volume role are added to HDFS datanodes and enable the Data Optimizer tiering capability on those data nodes.
Volume Monitor
Instances of the Volume Monitor role are deployed alongside Volume instances and provide proactive monitoring capabilities to ensure that the Volume is healthy, and to generate alerts when necessary.

