HBase row decoder

The HBase row decoder step decodes an incoming key and HBase result object using a specified mapping.

You can use this step with Pentaho MapReduce to process data read from HBase.

For background on PDI and Hadoop, see Pentaho MapReduce workflow.

Step name

  • Step name: Specify the unique name of the HBase row decoder step on the canvas. You can customize the name or leave the default.

Options

The HBase row decoder step includes the following tabs:

  • Configure fields

  • Create/Edit mappings

Configure fields tab

Input fields:

  • Key field: Incoming PDI field that contains the input key.

  • HBase result field: Incoming PDI field that contains the serialized HBase result.

Create/Edit mappings tab

Most HBase data is stored as raw bytes. PDI uses mappings to decode values and support meaningful comparisons.
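
As a quick illustration of why mappings are needed, the following Java sketch (a minimal example using the standard HBase client's Bytes utility; the stored value is hypothetical) shows that raw bytes carry no type information of their own:

    // HBase stores values as raw bytes; the same bytes decode differently
    // depending on the type you assume. A mapping supplies that type
    // information to PDI so values decode and compare correctly.
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawBytesDemo {
        public static void main(String[] args) {
            byte[] raw = Bytes.toBytes(42L);               // written as a long
            System.out.println(Bytes.toLong(raw));         // 42, decoded as a long
            System.out.println(Bytes.toStringBinary(raw)); // raw byte view, not "42"
        }
    }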

Before decoding, you typically:

  • Configure a connection using Hadoop cluster properties.

  • Define which column family each value belongs to and its type.

  • Specify the key type.

You can select an existing mapping to load its field definitions into the key fields table.

You can also create a new HBase table and mapping simultaneously by configuring the mapping fields and naming a new table in HBase table name.

Top-level options:

  • Hadoop Cluster: Select an existing Hadoop cluster configuration.

  • HBase table name: HBase table to use.

    Connection information must be valid and complete for this list to populate.

  • Get table names: Retrieves existing table names.

    Table names are shown as namespace:tablename. For namespace details, see Namespaces.

  • Mapping name: Existing mapping to use.

    The list is empty when no mappings exist for the selected table. You can define multiple mappings for the same table using different subsets of columns.

Key fields table

Enter information about the HBase columns you want to decode.

Incoming field names must match the mapping field Alias values.

  • There can be fewer incoming fields than fields defined in the mapping.

  • If there are more incoming fields than the mapping defines, the step logs an error.

  • One incoming field must match the key defined in the mapping.

A valid mapping must define metadata for the table key. Because HBase does not provide a key name, you must specify an Alias for the key.

For non-key columns:

  • Column family and Column name are required.

  • Alias is optional. If you omit it, the step uses the column name.

You must provide Type information for all fields.

Columns:

  • #: Order of mapping entries.

  • Alias: Name assigned to the field (required for the key, which HBase does not name; optional for non-key columns).

  • Key: Whether the field is the table key (Y or N).

  • Column family: Column family for non-key fields.

  • Column name: Name of the HBase column.

  • Type: Data type.

    Key column types:

    • String

    • Integer

    • UnsignedInteger

    • Long

    • UnsignedLong

    • Date

    • UnsignedDate

    • Binary

    Non-key column types:

    • String

    • Integer

    • Long

    • Float

    • Double

    • Boolean

    • Date

    • BigNumber

    • Serializable

    • Binary

  • Indexed values: Comma-separated list of legal values for string columns (for example, red,green,blue).

Actions:

  • Save mapping: Saves the mapping.

    • With valid connection details and a mapping name, the mapping is saved in HBase.

    • If you only need the mapping locally, connection details and mapping name are not required. The mapping is serialized into the transformation metadata.

  • Delete mapping: Deletes the named mapping from the mapping table (does not delete the HBase table).

  • Create a tuple template: Creates a tuple mapping template.

    Tuple output mode writes data into wide rows, where the number of columns can vary from row to row. It assumes all column values share the same type.

    Tuple output fields:

    • KEY

    • Family

    • Column

    • Value

    • Timestamp

    Family is preconfigured as String and Timestamp is preconfigured as Long. You must set types for KEY, Column, and Value.
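
    For illustration, a single wide row with hypothetical key row1, family f, and three columns would decode in tuple mode into three output rows, one per column (all values below are made up):

      KEY   Family  Column  Value  Timestamp
      row1  f       c1      10     1700000000000
      row1  f       c2      20     1700000000000
      row1  f       c3      30     1700000000001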

Additional notes on data types

For keys to sort properly in HBase, distinguish between signed and unsigned numbers.

Because of how HBase stores integer and long values internally, the sign bit must be flipped before storing signed values so that positive values sort after negative values. Unsigned values can be stored directly.
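
A minimal Java sketch of this sign-bit flip, assuming the standard HBase Bytes utility (the class and method names are illustrative):

    // XOR-ing a signed long with Long.MIN_VALUE flips the sign bit, so the
    // resulting bytes sort negatives before positives when compared as
    // unsigned byte arrays (HBase's natural key order).
    import org.apache.hadoop.hbase.util.Bytes;

    public class SignedKeyEncoding {
        static byte[] encodeSignedLong(long v) {
            return Bytes.toBytes(v ^ Long.MIN_VALUE);
        }

        static long decodeSignedLong(byte[] b) {
            return Bytes.toLong(b) ^ Long.MIN_VALUE;
        }

        public static void main(String[] args) {
            byte[] neg = encodeSignedLong(-5L);
            byte[] pos = encodeSignedLong(5L);
            System.out.println(Bytes.compareTo(neg, pos) < 0); // true: -5 sorts first
        }
    }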

Additional behavior:

  • String columns can optionally define legal values by entering comma-separated values in Indexed values.

  • Date keys can be stored as signed or unsigned long types with epoch-based timestamps.

    If you map a date key as String, PDI can change the type to Date for manipulation in the transformation.

  • Boolean values can be stored as 0/1 Integer/Long, or as String (Y/N, yes/no, true/false, T/F).

  • BigNumber values can be stored as serialized BigDecimal objects or as strings parseable by BigDecimal.

  • Serializable values are serialized Java objects.

  • Binary values are raw byte arrays.
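
A short Java sketch of two of the encodings above, a date as an epoch-based long and a BigNumber as a BigDecimal-parseable string (hedged; the class name is illustrative):

    // Dates round-trip through epoch milliseconds; BigNumbers can round-trip
    // through their string form.
    import java.math.BigDecimal;
    import java.util.Date;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ValueEncodings {
        public static void main(String[] args) {
            // Date stored as an epoch-based timestamp. Signed storage would
            // additionally flip the sign bit, as shown above.
            Date d = new Date();
            byte[] dateBytes = Bytes.toBytes(d.getTime());
            System.out.println(new Date(Bytes.toLong(dateBytes)));

            // BigNumber stored as a string parseable by BigDecimal.
            byte[] bigBytes = Bytes.toBytes("12345.6789");
            System.out.println(new BigDecimal(Bytes.toString(bigBytes)));
        }
    }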

Use HBase row decoder with Pentaho MapReduce

The HBase row decoder step is designed for MapReduce transformations to decode the key and value data output by TableInputFormat.

  • Key output is the row key from HBase.

  • Value output is an HBase result object containing all column values for the row.
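
For reference, a hedged Java sketch of what such a key/result pair contains, iterated with the standard HBase client API (the printRow helper is hypothetical):

    // The row key arrives as raw bytes, and the Result holds one Cell per
    // column value; this is the data the step decodes using its mapping.
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ResultDump {
        static void printRow(byte[] rowKey, Result result) {
            System.out.println("key = " + Bytes.toStringBinary(rowKey));
            for (Cell cell : result.rawCells()) {
                String family = Bytes.toString(CellUtil.cloneFamily(cell));
                String column = Bytes.toString(CellUtil.cloneQualifier(cell));
                byte[] value = CellUtil.cloneValue(cell);
                System.out.println(family + ":" + column + " -> "
                        + Bytes.toStringBinary(value));
            }
        }
    }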

To process HBase data supplied as incoming key/value pairs, configure Hadoop to access HBase.

Example workflow

  1. Create a Pentaho MapReduce job entry that includes a transformation with a MapReduce Input step and an HBase row decoder step.

  2. In the MapReduce Input step, configure Key field and Value field to produce a serialized result (select Serializable for Type).

  3. In the HBase row decoder step:

    • Set Key field to key.

    • Set HBase result field to value.

  4. Define or load a mapping on the Create/Edit mappings tab.

  5. In the Pentaho MapReduce job entry, configure input splits using TableInputFormat:

    • Set Input Path and Input format on the Job Setup tab.

  6. On the User Defined tab, set the following properties to configure the scan performed by TableInputFormat:

    • hbase.mapred.inputtable: Name of the HBase table to read from (required).

    • hbase.mapred.tablecolumns: Space-delimited list of columns in ColFam:ColName format. To read all columns from a family, omit ColName (required).

    • hbase.mapreduce.scan.cachedrows: Number of rows to cache for scanners (optional).

    • hbase.mapreduce.scan.timestamp: Timestamp used to filter columns with a specific timestamp (optional).

    • hbase.mapreduce.scan.timerange.start: Start timestamp for a time range filter (optional).

    • hbase.mapreduce.scan.timerange.end: End timestamp for a time range filter (optional).
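
    For example, to scan two columns of a hypothetical sales table with scanner caching enabled (all names and values below are illustrative), the User Defined tab entries might be:

        hbase.mapred.inputtable          sales
        hbase.mapred.tablecolumns        info:name info:amount
        hbase.mapreduce.scan.cachedrows  500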

When you execute the job, the output includes the row key from HBase and an HBase result object containing all column values for each row.
