HBase row decoder
The HBase row decoder step decodes an incoming key and HBase result object using a specified mapping.
You can use this step with Pentaho MapReduce to process data read from HBase.
For background on PDI and Hadoop, see Pentaho MapReduce workflow.
Step name
Step name: Specify the unique name of the HBase row decoder step on the canvas. You can customize the name or leave the default.
Options
The HBase row decoder step includes the following tabs:
Configure fields
Create/Edit mappings
Configure fields tab

Input fields:
Key field: Incoming PDI field that contains the input key.
HBase result field: Incoming PDI field that contains the serialized HBase result.
Create/Edit mappings tab
Most HBase data is stored as raw bytes. PDI uses mappings to decode values and support meaningful comparisons.
Before decoding, you typically:
Configure a connection using Hadoop cluster properties.
Define which column family each value belongs to and its type.
Specify the key type.
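To illustrate why type information matters, here is a minimal plain-Java sketch (using java.nio rather than HBase's own Bytes utility, and not actual PDI code) showing that the same raw bytes are meaningless until a type is assigned:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Interpret raw cell bytes under two different mapping types.
    static long asLong(byte[] raw) {
        return ByteBuffer.wrap(raw).getLong();
    }

    static String asString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // HBase stores every cell as raw bytes; without type information from
        // a mapping, the step cannot tell a serialized long from text.
        byte[] fromLong = ByteBuffer.allocate(8).putLong(42L).array();
        byte[] fromText = "42".getBytes(StandardCharsets.UTF_8);

        System.out.println(asLong(fromLong));   // 42
        System.out.println(asString(fromText)); // 42
        // Decoding fromText as a long would fail (only 2 bytes available),
        // which is why the mapping must record each column's type.
    }
}
```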

You can select an existing mapping to load its field definitions into the key fields table.
You can also create a new HBase table and mapping simultaneously by configuring the mapping fields and naming a new table in HBase table name.
Top-level options:
Hadoop Cluster: Select an existing Hadoop cluster configuration.
Select New to create a configuration.
Select Edit to modify an existing configuration.
For details, see Connecting to a Hadoop cluster with the PDI client.
HBase table name: HBase table to use.
Connection information must be valid and complete for this list to populate.
Get table names: Retrieves existing table names.
Table names are shown as namespace:tablename. For namespace details, see Namespaces.
Mapping name: Existing mapping to use.
The list is empty when no mappings exist for the selected table. You can define multiple mappings for the same table using different subsets of columns.
Key fields table
Enter information about the HBase columns you want to decode.
Incoming field names must match the mapping field Alias values.
There can be fewer incoming fields than fields defined in the mapping.
If there are more incoming fields than the mapping defines, the step logs an error.
One incoming field must match the key defined in the mapping.
A valid mapping must define metadata for the table key. Because HBase does not provide a key name, you must specify an Alias for the key.
For non-key columns:
Column family and Column name are required.
Alias is optional. If you omit it, the step uses the column name.
You must provide Type information for all fields.
The step does not support adding new column families to an existing table.
Columns:
#: Order of mapping entries.
Alias: Name assigned to the field (required for the key; optional for non-key columns, which default to the column name).
Key: Whether the field is the table key (Y or N).
Column family: Column family for non-key fields.
Column name: Column name.
Type: Data type.
Key column types:
String
Integer
UnsignedInteger
Long
UnsignedLong
Date
UnsignedDate
Binary
Non-key column types:
String
Integer
Long
Float
Double
Boolean
Date
BigNumber
Serializable
Binary
Indexed values: Comma-separated values for string columns.
Actions:
Save mapping: Saves the mapping.
With valid connection details and a mapping name, the mapping is saved in HBase.
If you only need the mapping locally, connection details and mapping name are not required. The mapping is serialized into the transformation metadata.
Delete mapping: Deletes the named mapping from the mapping table (does not delete the HBase table).
Create a tuple template: Creates a tuple mapping template.
Tuple output mode writes data into wide rows where the number of columns may vary row-to-row. It assumes all column values share the same type.
Tuple output fields:
KEY
Family
Column
Value
Timestamp
Family is preconfigured as String and Timestamp is preconfigured as Long. You must set types for KEY, Column, and Value.
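Conceptually, tuple mode flattens each wide row into one output row per column. The following plain-Java sketch (hypothetical row key, family, and column names; not PDI code) illustrates the flattening:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TupleModeSketch {
    // Flatten one wide row into (KEY, Family, Column, Value, Timestamp)
    // tuples, mirroring the tuple output fields listed above.
    static List<String[]> toTuples(String rowKey, String family,
                                   Map<String, String> columns, long timestamp) {
        List<String[]> tuples = new ArrayList<>();
        for (Map.Entry<String, String> cell : columns.entrySet()) {
            tuples.add(new String[] {
                rowKey, family, cell.getKey(), cell.getValue(),
                Long.toString(timestamp)
            });
        }
        return tuples;
    }

    public static void main(String[] args) {
        // Hypothetical wide row: column names vary per row,
        // but all values share one type (String here).
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("page_home", "3");
        columns.put("page_cart", "1");
        for (String[] t : toTuples("user42", "clicks", columns, 1700000000000L)) {
            System.out.println(String.join(",", t));
        }
    }
}
```

A row with two columns produces two tuples; a row with fifty columns produces fifty, which is why tuple mode suits tables whose column set varies from row to row.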
Additional notes on data types
For keys to sort properly in HBase, distinguish between signed and unsigned numbers.
Because of how HBase stores integer and long values internally, the sign bit must be flipped before storing signed values so that positive values sort after negative values. Unsigned values can be stored directly.
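To see why the flip is needed, here is a small self-contained Java sketch (not PDI's implementation) that encodes signed longs and compares the resulting bytes lexicographically, the way HBase orders keys:

```java
import java.nio.ByteBuffer;

public class SignBitDemo {
    // Flip the sign bit (XOR with Long.MIN_VALUE) so signed longs sort
    // correctly when their bytes are compared as unsigned values.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    // Lexicographic unsigned byte comparison, as HBase uses for keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        // Without the flip, -1L (0xFF...FF) would sort after +1L.
        System.out.println(compareUnsigned(encode(-1L), encode(1L)) < 0); // true
    }
}
```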
Additional behavior:
String columns can optionally define legal values by entering comma-separated values in Indexed values.
Date keys can be stored as signed or unsigned long types with epoch-based timestamps.
If you map a date key as String, PDI can change the type to Date for manipulation in the transformation.
Boolean values can be stored as 0/1 Integer/Long values, or as String values (Y/N, yes/no, true/false, T/F).
BigNumber values can be stored as serialized BigDecimal objects or as strings parseable by BigDecimal.
Serializable values are serialized Java objects.
Binary values are raw byte arrays.
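The accepted Boolean string forms can be sketched in plain Java as follows (an illustration of the forms listed above, not PDI's actual parser):

```java
import java.util.Set;

public class BooleanDecode {
    // String forms listed above, compared case-insensitively.
    private static final Set<String> TRUE_VALUES = Set.of("y", "yes", "true", "t");
    private static final Set<String> FALSE_VALUES = Set.of("n", "no", "false", "f");

    static Boolean decode(String s) {
        String v = s.trim().toLowerCase();
        if (TRUE_VALUES.contains(v)) return Boolean.TRUE;
        if (FALSE_VALUES.contains(v)) return Boolean.FALSE;
        return null; // not a recognizable boolean string form
    }

    public static void main(String[] args) {
        System.out.println(decode("Y"));     // true
        System.out.println(decode("false")); // false
    }
}
```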
Use HBase row decoder with Pentaho MapReduce
The HBase row decoder step is designed for MapReduce transformations to decode the key and value data output by TableInputFormat.
Key output is the row key from HBase.
Value output is an HBase result object containing all column values for the row.
To process HBase data using incoming key/value data, configure Hadoop to access HBase.
Example workflow
Create a Pentaho MapReduce job entry that includes a transformation with a MapReduce Input step and an HBase row decoder step.

MapReduce transformation example
In the MapReduce Input step, configure Key field and Value field to produce a serialized result (select Serializable for Type).

MapReduce input step example
In the HBase row decoder step:
Set Key field to key.
Set HBase result field to value.

HBase row decoder step example
Define or load a mapping on the Create/Edit mappings tab.

HBase row decoder step, example 2
In the Pentaho MapReduce job entry, configure input splits using TableInputFormat:
Set Input Path and Input format on the Job Setup tab.

Pentaho MapReduce entry example
On the User Defined tab, set the following properties to configure the scan performed by TableInputFormat:
hbase.mapred.inputtable: Name of the HBase table to read from (required).
hbase.mapred.tablecolumns: Space-delimited list of columns in ColFam:ColName format. To read all columns from a family, omit ColName (required).
hbase.mapreduce.scan.cachedrows: Number of rows to cache for scanners (optional).
hbase.mapreduce.scan.timestamp: Timestamp used to filter columns with a specific timestamp (optional).
hbase.mapreduce.scan.timerange.start: Start timestamp for a time range filter (optional).
hbase.mapreduce.scan.timerange.end: End timestamp for a time range filter (optional).
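As a hypothetical example (the table name weblogs and column families pageviews and stats are invented for illustration), the User Defined tab might contain:

```
hbase.mapred.inputtable=weblogs
hbase.mapred.tablecolumns=pageviews: stats:count
hbase.mapreduce.scan.cachedrows=500
hbase.mapreduce.scan.timerange.start=1262304000000
hbase.mapreduce.scan.timerange.end=1293840000000
```

Here pageviews: (no column name) reads every column in the pageviews family, while stats:count reads a single column, and the scan is limited to rows whose timestamps fall in the given range.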
When you execute the job, the output includes the row key from HBase and an HBase result object containing all column values for each row.