# HBase row decoder

The **HBase row decoder** step decodes an incoming key and HBase result object using a specified mapping.

You can use this step with [Pentaho MapReduce](/pdia-data-integration/pdi-job-entries-reference-overview/pentaho-mapreduce.md) to process data read from HBase.

For background on PDI and Hadoop, see Pentaho MapReduce workflow.

### Step name

* **Step name**: Specify the unique name of the HBase row decoder step on the canvas. You can customize the name or leave the default.

### Options

The HBase row decoder step includes the following tabs:

* **Configure fields**
* **Create/Edit mappings**

#### Configure fields tab

![Configure fields tab](/files/MZd3Vqnu6SEM5kY6gZ3V)

Input fields:

* **Key field**: Incoming PDI field that contains the input key.
* **HBase result field**: Incoming PDI field that contains the serialized HBase result.

#### Create/Edit mappings tab

Most HBase data is stored as raw bytes. PDI uses mappings to decode values and support meaningful comparisons.

Before decoding, you typically:

* Configure a connection using Hadoop cluster properties.
* Define the column family and data type of each value.
* Specify the key type.

![Create/Edit mappings tab](/files/lrbziRoZmTHi0WsUk7yw)

You can select an existing mapping to load its field definitions into the key fields table.

You can also create a new HBase table and mapping simultaneously by configuring the mapping fields and naming a new table in **HBase table name**.

Top-level options:

* **Hadoop Cluster**: Select an existing Hadoop cluster configuration.
  * Select **New** to create a configuration.
  * Select **Edit** to modify an existing configuration.
  * For details, see [Connecting to a Hadoop cluster with the PDI client](/pdia-data-integration/extracting-data-into-pdi/connecting-to-a-hadoop-cluster-with-the-pdi-client-article.md).
* **HBase table name**: HBase table to use.

  Connection information must be valid and complete for this list to populate.
* **Get table names**: Retrieves existing table names.

  Table names are shown as `namespace:tablename`. For namespace details, see [Namespaces](/pdia-data-integration/pdi-transformation-steps-reference-overview/hbase-input-cp-main-page.md#namespaces).
* **Mapping name**: Existing mapping to use.

  The list is empty when no mappings exist for the selected table. You can define multiple mappings for the same table using different subsets of columns.

**Key fields table**

Enter information about the HBase columns you want to decode.

Incoming field names must match the mapping field **Alias** values.

* There can be fewer incoming fields than fields defined in the mapping.
* If there are more incoming fields than the mapping defines, the step logs an error.
* One incoming field must match the key defined in the mapping.
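
The three matching rules above can be sketched as a small check. This is only an illustration of the rules, not the step's actual implementation: the function name, the field names, and the use of `ValueError` in place of the step's logged error are all assumptions. It treats any incoming field whose name is not an alias in the mapping as the "more incoming fields than the mapping defines" case.

```python
def check_incoming_fields(incoming, mapping_aliases, key_alias):
    """Apply the matching rules to a list of incoming field names.

    Returns the accepted field names, or raises ValueError
    (a stand-in for the error the step would log).
    """
    aliases = set(mapping_aliases)

    # Rule: incoming fields that the mapping does not define are an error.
    unknown = [name for name in incoming if name not in aliases]
    if unknown:
        raise ValueError(f"fields not defined in the mapping: {unknown}")

    # Rule: one incoming field must match the key defined in the mapping.
    if key_alias not in incoming:
        raise ValueError(f"no incoming field matches the key alias {key_alias!r}")

    # Rule: fewer incoming fields than the mapping defines is acceptable.
    return list(incoming)


# Hypothetical mapping with a key alias "row_id" and two columns.
accepted = check_incoming_fields(
    incoming=["row_id", "city"],
    mapping_aliases=["row_id", "city", "zip"],
    key_alias="row_id",
)
```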

A valid mapping must define metadata for the table key. Because HBase does not provide a key name, you must specify an **Alias** for the key.

For non-key columns:

* **Column family** and **Column name** are required.
* **Alias** is optional. If you omit it, the step uses the column name.

You must provide **Type** information for all fields.
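
As a rough sketch, the requirements for key and non-key entries could be modeled as follows. The `MappingField` class and its attribute names are hypothetical and exist only to illustrate the rules. They are not part of PDI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MappingField:
    """One row of the key fields table (hypothetical model)."""
    data_type: str
    is_key: bool = False
    alias: Optional[str] = None
    column_family: Optional[str] = None
    column_name: Optional[str] = None

    def __post_init__(self):
        # Type information is required for every field.
        if not self.data_type:
            raise ValueError("Type is required for all fields")
        if self.is_key:
            # HBase does not name the key, so the alias must be given.
            if not self.alias:
                raise ValueError("the key requires an Alias")
        else:
            if not (self.column_family and self.column_name):
                raise ValueError("non-key columns require Column family and Column name")
            # An omitted alias falls back to the column name.
            if not self.alias:
                self.alias = self.column_name


key = MappingField(data_type="Long", is_key=True, alias="row_id")
city = MappingField(data_type="String", column_family="address", column_name="city")
```

Here `city.alias` ends up as `"city"` because no alias was supplied.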

{% hint style="warning" %}
The step does not support adding new column families to an existing table.
{% endhint %}

Columns:

* **#**: Order of mapping entries.
* **Alias**: Name assigned to the field. Required for the key; optional for non-key columns, which default to the column name.
* **Key**: Whether the field is the table key (`Y` or `N`).
* **Column family**: Column family for non-key fields.
* **Column name**: Column name.
* **Type**: Data type.

  Key column types:

  * String
  * Integer
  * UnsignedInteger
  * Long
  * UnsignedLong
  * Date
  * UnsignedDate
  * Binary

  Non-key column types:

  * String
  * Integer
  * Long
  * Float
  * Double
  * Boolean
  * Date
  * BigNumber
  * Serializable
  * Binary
* **Indexed values**: Comma-separated list of legal values for a String column (optional).

Actions:

* **Save mapping**: Saves the mapping.
  * With valid connection details and a mapping name, the mapping is saved in HBase.
  * If you only need the mapping locally, connection details and mapping name are not required. The mapping is serialized into the transformation metadata.
* **Delete mapping**: Deletes the named mapping from the mapping table (does not delete the HBase table).
* **Create a tuple template**: Creates a tuple mapping template.

  Tuple output mode writes data into wide rows where the number of columns may vary row-to-row. It assumes all column values share the same type.

  Tuple output fields:

  * `KEY`
  * `Family`
  * `Column`
  * `Value`
  * `Timestamp`

  `Family` is preconfigured as **String** and `Timestamp` is preconfigured as **Long**. You must set types for `KEY`, `Column`, and `Value`.
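
To make the tuple layout concrete, the sketch below flattens rows with differing numbers of columns into the five tuple fields. It illustrates only the shape of the data. The `cells` structure, the row keys, and the column names are hypothetical stand-ins and do not reflect how PDI represents an HBase result internally.

```python
def row_to_tuples(key, cells):
    """Flatten one wide row into (KEY, Family, Column, Value, Timestamp) tuples.

    `cells` is a hypothetical list of (family, column, value, timestamp)
    entries for a single row.
    """
    return [(key, family, column, value, ts) for family, column, value, ts in cells]


# Two rows with a different number of columns; all values share one type.
row_a = row_to_tuples("user1", [
    ("visits", "2024-01", 3, 1700000000000),
    ("visits", "2024-02", 5, 1702000000000),
])
row_b = row_to_tuples("user2", [
    ("visits", "2024-02", 1, 1702000000001),
])
```

Each tuple carries the row key, so rows of different widths produce the same five-field output.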

### Additional notes on data types

For keys to sort properly in HBase, you must distinguish between signed and unsigned numbers.

HBase compares row keys byte by byte. In two's-complement encoding, negative integer and long values have the sign bit set, so their raw bytes would sort after those of positive values. The sign bit must therefore be flipped before storing signed values so that positive values sort after negative values. Unsigned values can be stored directly.
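
The effect of flipping the sign bit can be shown with a short sketch. It assumes 4-byte big-endian integers and byte-wise comparison of keys. The function names are illustrative only and are not PDI or HBase APIs.

```python
import struct


def encode_signed_int_key(value):
    """Encode a signed 32-bit integer as big-endian bytes with the sign bit flipped."""
    raw = bytearray(struct.pack(">i", value))
    raw[0] ^= 0x80  # flip the most significant (sign) bit
    return bytes(raw)


def decode_signed_int_key(data):
    """Reverse encode_signed_int_key."""
    raw = bytearray(data)
    raw[0] ^= 0x80
    return struct.unpack(">i", bytes(raw))[0]


values = [-5, 42, -1, 0, 7]

# Sorting by the raw two's-complement bytes puts negatives last.
plain = sorted(values, key=lambda v: struct.pack(">i", v))
# plain == [0, 7, 42, -5, -1]

# Sorting by the flipped encoding matches numeric order.
flipped = sorted(values, key=encode_signed_int_key)
# flipped == [-5, -1, 0, 7, 42]
```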

Additional behavior:

* **String columns** can optionally define legal values by entering comma-separated values in **Indexed values**.
* **Date keys** can be stored as signed or unsigned long types with epoch-based timestamps.

  If you map a date key as **String**, PDI can change the type to **Date** for manipulation in the transformation.
* **Boolean values** can be stored as 0/1 **Integer**/**Long**, or as **String** (`Y/N`, `yes/no`, `true/false`, `T/F`).
* **BigNumber** values can be stored as serialized `BigDecimal` objects or as strings parseable by `BigDecimal`.
* **Serializable** values are serialized Java objects.
* **Binary** values are raw byte arrays.
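
As an illustration of the Boolean representations listed above, the following sketch interprets an already-decoded integer or string as a boolean. The function is hypothetical, and the case-insensitive matching and whitespace trimming are assumptions rather than documented PDI behavior.

```python
TRUE_STRINGS = {"y", "yes", "true", "t"}
FALSE_STRINGS = {"n", "no", "false", "f"}


def decode_boolean(raw):
    """Interpret a 0/1 integer or a Y/N-style string as a boolean."""
    if isinstance(raw, int):
        if raw in (0, 1):
            return bool(raw)
        raise ValueError(f"integer {raw} is not 0 or 1")
    text = raw.strip().lower()  # assumption: matching ignores case and padding
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError(f"unrecognized boolean value: {raw!r}")


decoded = [decode_boolean(v) for v in (1, 0, "Y", "no", "true", "F")]
```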

### Use HBase row decoder with Pentaho MapReduce

The HBase row decoder step is designed for use in MapReduce transformations, where it decodes the key and value data output by `TableInputFormat`.

* Key output is the row key from HBase.
* Value output is an HBase result object containing all column values for the row.

{% hint style="info" %}
To process HBase data using incoming key/value data, configure Hadoop to access HBase.
{% endhint %}

#### Example workflow

1. Create a Pentaho MapReduce job entry that includes a transformation with a MapReduce Input step and an HBase row decoder step.

   ![MapReduce transformation example](/files/NEzJNDXcKYW6OGhrKQdf)
2. In the MapReduce Input step, configure **Key field** and **Value field** to produce a serialized result (select **Serializable** for **Type**).

   ![MapReduce input step example](/files/Y9z5vikAmm4ddAL2aNjK)
3. In the HBase row decoder step:

   * Set **Key field** to `key`.
   * Set **HBase result field** to `value`.

   ![HBase row decoder step example](/files/NHVPuySbdBUzz3oqhciX)
4. Define or load a mapping on the **Create/Edit mappings** tab.

   ![HBase row decoder step, example 2](/files/lrbziRoZmTHi0WsUk7yw)
5. In the Pentaho MapReduce job entry, configure input splits using `TableInputFormat`:

   * Set **Input Path** and **Input format** on the **Job Setup** tab.

   ![Pentaho MapReduce entry example](/files/rk0Cm34KLjqGxxYrBiXM)
6. On the **User Defined** tab, set the following properties to configure the scan performed by `TableInputFormat`:

   | Name                                   | Value                                                                                                                     |
   | -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
   | `hbase.mapred.inputtable`              | Name of the HBase table to read from (required).                                                                          |
   | `hbase.mapred.tablecolumns`            | Space-delimited list of columns in `ColFam:ColName` format. To read all columns from a family, omit `ColName` (required). |
   | `hbase.mapreduce.scan.cachedrows`      | Number of rows to cache for scanners (optional).                                                                          |
   | `hbase.mapreduce.scan.timestamp`       | Timestamp used to filter columns with a specific timestamp (optional).                                                    |
   | `hbase.mapreduce.scan.timerange.start` | Start timestamp for a time range filter (optional).                                                                       |
   | `hbase.mapreduce.scan.timerange.end`   | End timestamp for a time range filter (optional).                                                                         |

When you execute the job, the output includes the row key from HBase and an HBase result object containing all column values for each row.
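
The scan properties from step 6 can be assembled as name/value pairs before you enter them on the **User Defined** tab. The sketch below is only a helper for building and checking those pairs. The function name is hypothetical, as are the table name `weblogs` and the column family `pageviews`.

```python
def build_scan_properties(table, columns, cached_rows=None, time_range=None):
    """Return the TableInputFormat scan properties as a dict of strings."""
    # The table name and the column list are the two required properties.
    if not table:
        raise ValueError("hbase.mapred.inputtable is required")
    if not columns:
        raise ValueError("hbase.mapred.tablecolumns is required")

    props = {
        "hbase.mapred.inputtable": table,
        # Columns are space-delimited, each in ColFam:ColName format.
        "hbase.mapred.tablecolumns": " ".join(columns),
    }
    if cached_rows is not None:
        props["hbase.mapreduce.scan.cachedrows"] = str(cached_rows)
    if time_range is not None:
        start, end = time_range
        props["hbase.mapreduce.scan.timerange.start"] = str(start)
        props["hbase.mapreduce.scan.timerange.end"] = str(end)
    return props


props = build_scan_properties(
    table="weblogs",
    columns=["pageviews:jan", "pageviews:feb"],
    cached_rows=500,
)
for name, value in props.items():
    print(f"{name} = {value}")
```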

