Configure Data Optimizer
Configure Data Optimizer by using the user interface for either Cloudera Manager or Ambari to set parameters in the Data Optimizer configuration file.
Data Optimizer configuration parameters
The Data Optimizer management interface distributes the configuration information to the Data Optimizer volume nodes for use by the Data Optimizer volume service.
CAUTION: Never modify the BUCKET and MOUNT_POINT parameters in the Data Optimizer configuration file after the initial installation. Changing these values after installation breaks the instance because the Data Optimizer instance ID is calculated based on the values provided in these parameters.
Note: Do not include leading or trailing spaces if you copy and paste parameter values. Ambari and Cloudera Manager do not validate input.
ENDPOINT
Required
Endpoint address for Hitachi Content Platform. If the ENDPOINT_TYPE is HCP, use the form tenant.hcp_dns_name.
ENDPOINT_TYPE
Optional
The endpoint type. The default value is HCP. Acceptable values are case-sensitive:
- If connecting to Hitachi Content Platform, use HCP.
- If connecting to Hitachi Content Platform for cloud scale, use HCPCS.
- If connecting to Amazon S3, use AWS.
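For example, a minimal pairing of ENDPOINT and ENDPOINT_TYPE for an HCP tenant might look like the following sketch, assuming a simple KEY=value layout for the configuration file; the tenant and DNS names are placeholders:
```
# Hypothetical tenant and HCP DNS name; replace with your own values
ENDPOINT=finance.hcp.example.com
ENDPOINT_TYPE=HCP
```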
PDO_URL
Required
The system name or IP address of the Data Catalog with which the Hadoop cluster communicates to handle migration tasks.
DATASOURCE_ID
Required
The unique ID assigned to the data source in the Data Catalog after registering the HDFS as a data source.
PDO_SCHEDULER_INTERVAL
Required
The time interval that specifies how often the Data Optimizer agent script queries the Data Catalog server for migration jobs and executes the migrations.
BUCKET
Required
Content Platform bucket name or a wildcard value of instance_id. You can use the unique ID generated by Content Platform (instance_id) as a wildcard to avoid name conflicts and to simplify configuration of the instances. Multiple instances can share a common configuration if you use the instance_id wildcard and all other values are identical. You cannot append or prepend the instance_id wildcard value to any other value. For example, bucket_instance_id is an invalid value. If Content Platform is properly configured, Data Optimizer creates its own bucket if the bucket does not already exist.
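For example, the following sketch contrasts the wildcard form with an explicit bucket name; the explicit name is hypothetical, and the wildcard is written literally as instance_id per the description above:
```
# Let each instance derive its own bucket name; instances can then share one configuration
BUCKET=instance_id

# Or name the bucket explicitly; never combine it with the wildcard (bucket_instance_id is invalid)
BUCKET=ldo-archive-01
```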
ACCESS_KEY
Required
S3 Access Key ID used to authenticate S3 requests to Content Platform.
SECRET_KEY
Required
S3 Secret Key used to authenticate S3 requests.
PROTOCOL
Optional
Protocol used for communication between Data Optimizer and Content Platform. The default value is https. Acceptable, case-sensitive values are https and http. If set to https, Data Optimizer uses TLS to encrypt all communication with Content Platform.
VERIFY_SSL_CERTIFICATE
Optional
Value used to specify whether to verify certificates within Data Optimizer. Acceptable, case-sensitive values are true and false. The default value is true. If the VERIFY_SSL_CERTIFICATE parameter is set to false, certificate verification is disabled within Data Optimizer. Set this parameter to false when Content Platform presents a self-signed certificate and you still want to use TLS to encrypt transmissions between Data Optimizer and Content Platform.
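For example, to keep TLS encryption while Content Platform presents a self-signed certificate, the relevant lines might look like this sketch (with a CA-signed certificate, leave both parameters at their defaults):
```
# Encrypt traffic with TLS, but skip certificate verification for a self-signed certificate
PROTOCOL=https
VERIFY_SSL_CERTIFICATE=false
```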
MOUNT_POINT
Required
HDFS DataNode local directory where Data Optimizer is mounted. The directory must exist and the HDFS user using Data Optimizer must have write permission for the directory. The directory must allow rwx permissions for the owner and the owner's group. For example:
```
mkdir <mount point>
chown user:group <mount point>
chmod 770 <mount point>
```
BUCKET_STORAGE_LIMIT_GB
Required
Size in GB to report as the total capacity of the volume.
CAUTION: If the usage exceeds the quota, or upper limit, on the volume’s Content Platform bucket, writes to the volume fail. Data Optimizer does not prevent writing to the volume if the usage exceeds the capacity. As a best practice, specify a value that is less than the bucket quota, so that HDFS stops choosing the volume for writes before the volume exceeds its quota on Content Platform.
CACHE_DIR
Required
Directory that Data Optimizer uses to store temporary files associated with open file handles. If MD_STORE_DIR is not specified, Data Optimizer also uses this directory to store files associated with persisting the local metadata store. The directory must exist and the HDFS user using Data Optimizer must have write permission for the directory. The directory must allow rwx permissions for the owner and the owner's group. The CACHE_DIR parameter must be a fully-qualified directory path starting at the system root (/). For example:
```
mkdir <cache dir>
chown user:group <cache dir>
chmod 770 <cache dir>
```
MD_STORE_DIR
Optional
Local directory used to store files associated with persisting the Data Optimizer local metadata store. The MD_STORE_DIR parameter value must be a fully-qualified directory path starting at the system root (/). If an MD_STORE_DIR value is not specified, the CACHE_DIR directory is used. Specify a value for MD_STORE_DIR when the CACHE_DIR directory is located on volatile storage or if there is a more durable location for long-term file persistence. Do not choose a volatile storage medium for this directory, as it is intended to persist for the life of the Data Optimizer volume. For example, if you use transient storage for the CACHE_DIR directory, such as RAM_DISK, you should specify a more durable location for the MD_STORE_DIR directory. In addition, if you have a more durable location, such as a RAID partition, and there is room for the metadata store files (up to 2.5 GB), you should specify an MD_STORE_DIR directory on that partition. If the files associated with metadata store persistence are lost or corrupted, you can recover them as explained in Recovering from local metadata store failure or corruption.
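For example, a node whose cache sits on transient RAM_DISK storage might pair it with a durable metadata location, as in this sketch; both paths are hypothetical:
```
# Open-file cache on transient storage (hypothetical path)
CACHE_DIR=/mnt/ramdisk/ldo-cache
# Metadata store persisted on a durable RAID partition (hypothetical path)
MD_STORE_DIR=/data/raid1/ldo-metadata
```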
RECOVERY_MODE
Optional
Value used to specify whether recovery mode is enabled. Do not set the RECOVERY_MODE parameter unless you have read and understood the section Recovering from local metadata store failure or corruption. The default value is false. Acceptable, case-sensitive values are true and false.
LOG_LEVEL
Optional
Value used to specify how verbose the logging is for Data Optimizer. The default value is INFO. Acceptable, case-sensitive values are ALERT, ERR, WARNING, INFO, and DEBUG. See Data Optimizer logging for more details about logging and log levels.
METRICS_FILE
Optional
Local file that Data Optimizer writes metrics to when prompted by the ldoctl metrics collect command. The METRICS_FILE value must be a fully-qualified file path starting at the system root (/). If a METRICS_FILE value is not defined, Data Optimizer writes metrics to the system journal. The parent directory must exist and the HDFS user using Data Optimizer must have write permission for the directory. See Monitor Data Optimizer for more information.
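For example, a sketch that directs metrics to a local file and then collects them on demand; the path and file name are placeholders:
```
# In the Data Optimizer configuration (hypothetical path and file name)
METRICS_FILE=/var/log/ldo/metrics.out

# On the DataNode, write the current metrics to that file
ldoctl metrics collect
```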
LOG_SDK
Optional
Local directory where detailed AWS S3 logs are saved. If the LOG_SDK parameter is specified and if LOG_LEVEL is set to DEBUG, Data Optimizer volumes log details about the S3 communication between the Data Optimizer instance and Content Platform. The directory must exist, must be a fully-qualified directory path starting at the system root (/), and the HDFS user using Data Optimizer must have write permission for the directory. See AWS S3 SDK logging for more information.
General Data Optimizer Configuration for Ambari
The following table lists the parameters and their respective descriptions:
CAUTION: Never modify the BUCKET and MOUNT_POINT parameters in the Data Optimizer configuration file after the initial installation. Changing these values after installation breaks the instance because the Data Optimizer instance ID is calculated based on the values provided in these parameters.
Note: Do not include leading or trailing spaces if you copy and paste parameter values. Ambari and Cloudera Manager do not validate input.
ENDPOINT_TYPE
The type of S3 endpoint you are using. Acceptable, case-sensitive values are HCP, HCPCS, and AWS. The default value is HCP.
- If connecting to Hitachi Content Platform, use HCP.
- If connecting to Hitachi Content Platform for cloud scale, use HCPCS.
- If connecting to Amazon S3, use AWS.
AWS_REGION
The AWS region that Data Optimizer connects to. The AWS_REGION value is required if the S3 Endpoint Type is AWS.
ENDPOINT
The S3 endpoint URL for the object storage service.
- If the ENDPOINT_TYPE is HCP, use the form tenant.hcp_dns_name.
- If the ENDPOINT_TYPE is HCPCS, use the form hcpcs_dns_name.
- If the ENDPOINT_TYPE is AWS, you can leave the field blank or populate it with a region-specific S3 endpoint.
BUCKET
S3 bucket used on the object store for all the backend storage of the Data Optimizer instances.
ACCESS_KEY
S3 Access Key ID used to authenticate S3 requests to the object store.
SECRET_KEY
S3 Secret Key used to authenticate S3 requests to the object store.
ENDPOINT_SCHEME
S3 Connection Scheme or Endpoint Scheme. Acceptable, case sensitive values are https and http. The default value is https. If set to https, Data Optimizer uses TLS to encrypt all communication with object storage.
VERIFY_SSL_CERTIFICATE
Value used to specify whether to verify certificates within the Data Optimizer volume. Acceptable, case-sensitive values are Enabled and Disabled. The default value is Enabled. If the ENDPOINT_SCHEME parameter is https, set the VERIFY_SSL_CERTIFICATE parameter to Enabled. Similarly, if the ENDPOINT_SCHEME parameter is http, set the VERIFY_SSL_CERTIFICATE parameter to Disabled.
By default, Content Platform uses a self-signed certificate that is not in the trust store on the HDFS DataNode. Disabling verification allows TLS negotiation to occur, despite the untrusted certificate. Disabling verification does not reduce the strength of TLS encryption, but it does disable endpoint authentication. It is a best practice to replace the Content Platform self-signed certificate with one signed by a trusted certificate authority. See the Hitachi Content Platform documentation for details.
MOUNT_POINT
HDFS DataNode local directory where the Data Optimizer instance is mounted. HDFS writes block replicas to the local directory you specify. The MOUNT_POINT parameter value must be a fully-qualified directory path starting at the system root (/).
VOLUME_STORAGE_LIMIT_GB
The storage capacity in GB of each Data Optimizer volume instance. If the combined usage of Data Optimizer volumes exceeds the quota allocated to their shared bucket on Content Platform, writes to those Data Optimizer volumes fail. The VOLUME_STORAGE_LIMIT_GB parameter value, multiplied by the number of Data Optimizer instances, should not exceed the Content Platform quota. In fact, the Content Platform quota should include additional capacity for deleted versions and to account for asynchronous garbage collection services. HDFS writes no more data to each Data Optimizer volume than the amount specified in the HCP Bucket Storage Limit parameter, minus the reserved space (dfs.datanode.du.reserved).
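As a rough sizing sketch with assumed numbers, 10 Data Optimizer instances each reporting 500 GB can collectively accept up to 5,000 GB of writes, so the shared bucket quota should be set comfortably above that total:
```
# 10 instances x 500 GB reported capacity = 5,000 GB of potential writes
# Allocate a bucket quota above this total (for example, 6 TB) to leave headroom
# for deleted versions awaiting asynchronous garbage collection.
VOLUME_STORAGE_LIMIT_GB=500
```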
CACHE_DIR
A local directory on the HDFS DataNode that Data Optimizer uses to store temporary files associated with open file handles. The CACHE_DIR parameter value must be a fully-qualified directory path starting at the system root (/).
MD_STORE_DIR
Local directory on each node used to store files associated with persisting the Data Optimizer local metadata store. The MD_STORE_DIR parameter value must be a fully-qualified directory path starting at the system root (/). Specify a value for MD_STORE_DIR when the CACHE_DIR directory is located on volatile storage or if there is a more durable location for long-term file persistence. Do not choose a volatile storage medium for this directory, as it is intended to persist for the life of the Data Optimizer volume. If the files associated with metadata store persistence are lost or corrupted, you can recover them as explained in Recovering from local metadata store failure or corruption.
LOG_LEVEL
Value used to specify how verbose the logging is for Data Optimizer. The default value is WARNING. Acceptable, case sensitive values are ALERT, ERR, WARNING, INFO, and DEBUG. See Data Optimizer logging for details about logging and log levels.
LOG_SDK
Optional. Local directory where detailed AWS S3 logs are saved. If the LOG_SDK parameter is specified and if LOG_LEVEL is set to DEBUG, Data Optimizer volumes log details about the S3 communication between the Data Optimizer volume instance and Content Platform. The directory must exist, the LOG_SDK parameter value must be a fully-qualified directory path starting at the system root (/), and the HDFS user using Data Optimizer must have write permission for the directory. See AWS S3 SDK logging for further details.
Note: The configuration file is located in the /etc/ldo directory on each HDFS DataNode on which Data Optimizer is installed and ARCHIVE volumes are configured.
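For reference, a minimal sketch of how these parameters might look when assembled in that file, assuming a simple KEY=value layout (check the file on your DataNodes for the exact syntax); every value shown is a placeholder:
```
# Hypothetical values only
ENDPOINT=tenant.hcp.example.com
ENDPOINT_TYPE=HCP
BUCKET=instance_id
ACCESS_KEY=<base64-encoded username>
SECRET_KEY=<md5-encoded password>
ENDPOINT_SCHEME=https
MOUNT_POINT=/hadoop/ldo/archive
VOLUME_STORAGE_LIMIT_GB=500
CACHE_DIR=/hadoop/ldo/cache
MD_STORE_DIR=/hadoop/ldo/metadata
```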
Settings for HTTP or HTTPS Proxy Connections
In some cases, Data Optimizer is installed on a host that does not have direct access to the object storage service and must connect through a proxy. This is more likely to be the case when using a cloud storage provider such as Amazon Web Services. Using the settings in this section, you can configure Data Optimizer to use an HTTP or HTTPS proxy. If a proxy is not required, leave these settings at their defaults.
PROXY
The IP address or domain name of the http or https proxy server, if required.
PROXY_PORT
The port that the proxy server listens on.
PROXY_SCHEME
The scheme is either http or https, depending on what the proxy server supports.
PROXY_USER
The user for the proxy server, if authentication is required.
PROXY_PASSWORD
The password for the proxy server, if authentication is required.
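For example, a sketch of routing Data Optimizer traffic through an authenticating HTTPS proxy; the host, port, and credentials are placeholders, and the user and password lines can be omitted if the proxy does not require authentication:
```
# Hypothetical proxy settings
PROXY=proxy.example.com
PROXY_PORT=3128
PROXY_SCHEME=https
PROXY_USER=ldo-proxy-user
PROXY_PASSWORD=<password>
```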
Recovery Specific Configuration for Ambari
Use the following parameter to configure the recovery mode for Ambari.
CAUTION: Do not enable this parameter unless you have familiarized yourself with the Maintain Data Optimizer metadata section and understand the implications.
RECOVERY_MODE
Value used to determine whether recovery mode is enabled. The RECOVERY_MODE parameter controls the Data Optimizer authoritative versus non-authoritative behavior. Acceptable values are Enabled and Disabled. The default value is Disabled.
Volume Monitor Configuration for Cloudera Manager only
Use the following parameter to configure the Volume Monitor interval for Cloudera Manager.
MONITOR_INTERVAL
Value used to specify how frequently, in minutes, the Volume Monitor checks the health of the Data Optimizer volume. As a best practice, set the interval to five minutes.
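For example, following the five-minute best practice might look like this sketch (MONITOR_INTERVAL is expressed in minutes, per the description above):
```
# Check the health of the Data Optimizer volume every five minutes
MONITOR_INTERVAL=5
```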
Hitachi Content Platform configuration
Data Optimizer requires either Hitachi Content Platform or Hitachi Content Platform for cloud scale.
For both Content Platform and Content Platform for cloud scale, a single user account creates and owns all the Data Optimizer buckets. It is important for the security of the data in these buckets that the user credentials are not shared with any other application. For security, only an HDFS or Data Optimizer administrator should have access to the credentials of the user that creates and owns the Data Optimizer buckets. The credentials are stored in the Data Optimizer configuration files on the HDFS DataNodes.
See the Hitachi Content Platform product documentation for more information.
Note: If you need to work with customer support to troubleshoot or resolve an issue, make sure that you share the Content Platform user credentials with them.
Configure a tenant in Content Platform
To create a Content Platform tenant, you need the administrator role.
You must create a Hitachi Content Platform tenant for Data Optimizer. In most cases, Data Optimizer instances create their own buckets, so configure the namespace defaults so that the resulting buckets are configured correctly.
Use the following steps to configure a tenant in Content Platform.
In the top-level menu of the Hitachi Content Platform System Management Console, click Tenants.
The Tenants page opens.
On the Tenants page, click Create Tenant.
The Create Tenant panel opens.
On the Create Tenant panel, create a tenant, making sure to:
Allocate enough quota for all anticipated Data Optimizer instances.
Enable versioning. See the Hitachi Content Platform product documentation for more information.
Use the following steps to enable the management API (MAPI), so that Data Optimizer instances can create buckets.
Log into the System Management Console or Tenant Management Console using a user account with the security role.
In the top-level menu of either console, select Security > MAPI.
The Management API page opens.
In the Management API Setting section on the Management API page, select Enable the HCP management API.
Click Update Settings.
Enable MAPI at the cluster level.
Use the following steps to configure namespace defaults for the tenant:
From the Content Platform Tenant Management Console, select Configuration > Namespace Defaults.
In the Hard Quota field, type a new number of gigabytes or terabytes of storage to allocate for an individual Data Optimizer instance namespace and select either GB or TB to indicate the measurement unit. The default is 50 GB. The maximum value you can specify is equal to the hard quota for the tenant.
Set Cloud Optimized to On.
Set Versioning to On.
Enable version pruning older than 0 days.
Create a tenant user account
Use this task in Hitachi Content Platform to create a tenant user account to be used exclusively by Data Optimizer, not by an actual user. This user owns and has exclusive data access permissions to Data Optimizer buckets.
Note: The tenant user must not have any administrative role in the tenant beyond administration of the buckets they own. No users should have access to the data in Data Optimizer buckets at any time for any reason except when required by customer support.
Use the following steps in the Content Platform Tenant Management Console to create a tenant user account. See the Hitachi Content Platform product documentation for more information.
Navigate to Security > Users > Create User Account.
The Create User Account panel opens.
In the Create User Account panel, in the Username field, type a login account.
Adhere to the following guidelines:
Choose a name like pdso-svc-usr to indicate that the user is not a person but a software service.
Do not enable any administrative roles.
Select Allow namespace management.
You need to do this so Data Optimizer instances can create buckets.
Click Create User Account.
The text “Successfully created user account. Authorization token:” is shown, followed by a text string with two values separated by a colon. The value on the left side of the text string is the base64-encoded username for the ACCESS_KEY property, and the value on the right is the md5-encoded password to use for the SECRET_KEY property.
Capture the base64-encoded username and md5-encoded password to add to the Data Optimizer configuration file.
Edit the Data Optimizer configuration file in the /etc/ldo directory, add the encoded username to the ACCESS_KEY property, and add the encoded password to the SECRET_KEY property.
Save and close the configuration file.
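If you need to reproduce these encodings outside the console, the values described above (a base64-encoded username and an md5-encoded password) can be generated with standard tools. This sketch uses the example service account name and a placeholder password; the authorization token shown by the console remains the authoritative source:
```
# Base64-encode the username for ACCESS_KEY (prints cGRzby1zdmMtdXNy for pdso-svc-usr)
echo -n "pdso-svc-usr" | base64

# MD5-hash the account password for SECRET_KEY (replace <password> with the real password)
echo -n "<password>" | md5sum | awk '{print $1}'
```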
(Optional) Create a bucket for Data Optimizer
Use this task to manually create a bucket for the Data Optimizer instance.
Note: The best practice is to let Data Optimizer instances create their own buckets.
Perform the following steps in Hitachi Content Platform to create a bucket manually. See the Hitachi Content Platform documentation for more information.
In the Content Platform Tenant Management Console, click Namespaces.
The Namespaces page opens.
On the Namespaces page, click Create Namespace.
The Create Namespace panel opens.
Use the following steps to create a namespace:
In the Namespace Owner field, specify the tenant user created in the Create a tenant user account procedure.
Configure Hard Quota to provide adequate capacity for an individual Data Optimizer instance.
Set Cloud Optimized to On.
Set Versioning to On.
Enable version pruning older than 0 days.
Use the following steps to enable an access control list (ACL):
In the Tenant Management Console, click Namespaces.
The Namespaces page opens.
In the list of namespaces, click the name of the Data Optimizer namespace.
Click the Settings tab.
The Settings panel opens.
On the left side of the Settings panel, click ACLs.
The ACLs panel opens.
In the ACLs panel, select Enable ACLs.
A confirmation prompt displays.
Click Enable ACLs.
Use the following steps to enable the Hitachi API for Amazon S3:
In the Tenant Management Console, click Namespaces.
The Namespaces page opens.
In the list of namespaces, click the name of the Data Optimizer namespace.
Click the Protocols tab.
The Protocols panel opens.
Select Enable Hitachi API for Amazon S3.
Note: Enable HTTP only if you will not be using TLS.
Click Update Settings.
Specify the namespace name in the BUCKET parameter of the Data Optimizer configuration file in the /etc/ldo directory.
HCP for cloud-scale configuration
If you are using HCP for cloud-scale and configuring more than 100 Data Optimizer instances, you need to increase the maximum number of buckets allowed for your user.
See Hitachi Content Platform configuration documentation for more information.