HBase Output

Use the HBase Output step to write data to an HBase table according to user-defined column metadata.

Step name

Step name: Specify the unique name of the step on the canvas. You can customize the name or leave the default.

Options

The HBase Output step includes two tabs:

Configure connection
Create/Edit mappings

Configure connection tab

This tab contains HBase connection information.

You can configure a connection in one of two ways:

Use Hadoop cluster properties.
Use an hbase-site.xml (and optional hbase-default.xml) configuration file.

Below the connection details are fields to specify which target HBase table to write to and which mapping to use to encode incoming field values.

Connection and write options

Hadoop cluster: Select an existing Hadoop cluster configuration.
- Select Edit to edit an existing cluster configuration.
- Select New to create a new cluster configuration.
- For details, see Connecting to a Hadoop cluster with the PDI client.
URL to hbase-site.xml: Address of hbase-site.xml.
URL to hbase-default.xml: Address of hbase-default.xml.
HBase table name: Target HBase table.
Get table names: Populates the table name list.
Only mapped table names are retrieved. If you enter a namespace followed by a colon (namespace:) in HBase table name and then select Get table names, only table names in that namespace are shown.
For namespace details, see Namespaces.
Mapping name: Mapping used to encode and interpret column values.
Select Get mappings for the specified table to populate available mappings.
Store mapping info in step meta: Stores mapping information in step metadata instead of loading it from HBase at runtime.
Delete rows by mapping key: Deletes rows from the table, using the incoming field mapped as the key to identify each row to delete.
Disable write to WAL: Disables writing to the Write Ahead Log (WAL).
The WAL provides a recovery mechanism if a server fails while data is being inserted. Disabling WAL can improve performance.
This option is not available when Delete rows by mapping key is selected.
Size of write buffer (bytes): Size of the buffer used to transfer data to HBase.
A larger buffer uses more memory on the client and server but results in fewer remote procedure calls.
If you leave this field blank, the default in hbase-default.xml is used (2 MB / 2097152 bytes).
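
As an illustration of how a blank buffer size falls back to the configured default, the following sketch reads properties from an hbase-site.xml-style document. It assumes the buffer size is controlled by the HBase property hbase.client.write.buffer; the sample XML, host names, and helper functions are hypothetical and are not part of PDI.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration content; host names are placeholders.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com</value>
  </property>
  <property>
    <name>hbase.client.write.buffer</name>
    <value>4194304</value>
  </property>
</configuration>
"""

# 2 MB, the default cited in this section.
DEFAULT_WRITE_BUFFER = 2097152


def read_properties(xml_text):
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def effective_write_buffer(props):
    """Use the configured buffer size, or the default when none is set."""
    raw = props.get("hbase.client.write.buffer", "")
    return int(raw) if raw else DEFAULT_WRITE_BUFFER


props = read_properties(SAMPLE_SITE_XML)
print(effective_write_buffer(props))
```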

Create/Edit mappings tab

This tab creates or edits a mapping for a given HBase table.

A mapping defines metadata about values stored in the table. Because HBase stores most values as raw bytes, mappings allow PDI to encode values correctly.

Before a value can be written to HBase, you must specify:

The column family the value belongs to
The value type
The key type

The names of fields entering the step must match the Alias values in the mapping.

There can be fewer incoming fields than fields in the mapping.
If there are more incoming fields than the mapping defines, the step logs an error.
One incoming field must match the key defined in the mapping.
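
The rules above can be sketched as a small check. This is an illustration only: the function name and data shapes are hypothetical, and it interprets "more incoming fields than the mapping defines" as incoming fields whose names have no matching Alias.

```python
def check_incoming_fields(incoming, mapping_aliases, key_alias):
    """Return a list of problems; an empty list means the fields fit the mapping."""
    problems = []

    # Incoming fields with no matching alias are not covered by the mapping.
    extra = [name for name in incoming if name not in mapping_aliases]
    if extra:
        problems.append("fields not in mapping: " + ", ".join(extra))

    # Exactly which field is the key comes from the mapping; it must be present.
    if key_alias not in incoming:
        problems.append("no incoming field matches key alias: " + key_alias)

    return problems


aliases = ["id", "name", "city", "age"]

# Fewer incoming fields than the mapping defines is acceptable.
print(check_incoming_fields(["id", "name"], aliases, "id"))

# An unmapped field and a missing key are both reported.
print(check_incoming_fields(["name", "phone"], aliases, "id"))
```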

This tab works similarly to HBase Input, except that HBase Output can create the target table if it does not already exist.

Top-level fields

HBase table name: Select a table name.
Connection details on the Configure connection tab must be complete and valid for this list to populate.
Get table names: Retrieves all table names, including tables without Pentaho mappings.
Mapping name: Existing mappings for the selected table.
You can define multiple mappings on the same table using different subsets of columns.

Mapping fields table

Columns:

#: Order of the mapping operation.
Alias: Name you assign to the key or column. Required for the key; optional for non-key columns.
Key: Whether the field is the table key.
Column family: Column family for non-key columns.
Column name: Column name.
Type: Data type.
Key column types:
- String
- Integer
- UnsignedInteger
- Long
- UnsignedLong
- Date
- UnsignedDate
- Binary
Non-key column types:
- String
- Integer
- Long
- Float
- Double
- Boolean
- Date
- BigNumber
- Serializable
- Binary
Indexed values: Comma-separated list of legal values for String columns.

Buttons:

Get incoming fields: Populates the mapping table from the incoming stream fields.
Create a tuple template: Creates a template to write tuples to HBase.
Save mapping: Saves the mapping.
Delete mapping: Deletes the mapping (does not delete the HBase table).

Mapping notes

A valid mapping must define metadata for the table key. The key must have an Alias because HBase does not provide a key name.

For keys to sort properly in HBase, note the distinction between signed and unsigned numbers.

Because of the way HBase stores integer and long values internally, the sign bit must be flipped before storing signed numbers so that positive numbers sort after negative numbers. Unsigned values can be stored directly.
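
To see why the sign bit matters, consider how byte-wise comparison treats two's-complement values. The sketch below is a general illustration of the technique, not necessarily byte-for-byte what PDI writes; the function names are hypothetical.

```python
import struct

SIGN_BIT = 0x8000000000000000
MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_signed_long_key(value):
    """Encode a signed 64-bit integer so byte order matches numeric order."""
    # Reinterpret as unsigned two's complement, then flip the sign bit.
    flipped = (value & MASK_64) ^ SIGN_BIT
    return struct.pack(">Q", flipped)


def decode_signed_long_key(data):
    """Reverse encode_signed_long_key."""
    (flipped,) = struct.unpack(">Q", data)
    raw = flipped ^ SIGN_BIT
    # Values with the top bit set represent negative numbers.
    return raw - (1 << 64) if raw >= (1 << 63) else raw


values = [5, -3, 0, -100, 42]

# Plain big-endian two's complement sorts negative numbers after positive ones.
naive_order = sorted(values, key=lambda v: struct.pack(">q", v))

# With the sign bit flipped, byte order and numeric order agree.
flipped_order = sorted(values, key=encode_signed_long_key)

print(naive_order)
print(flipped_order)
```

Unsigned values need no such step because their big-endian bytes already sort in numeric order.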

Additional behavior:

String columns can optionally define legal values by entering comma-separated values in Indexed values.
Date keys can be stored as signed or unsigned long types (epoch-based timestamps). If you map a date key as String, PDI can change its type to Date for manipulation in the transformation.
Boolean values can be stored as 0/1 integer/long or as strings (Y/N, yes/no, true/false, T/F).
BigNumber values can be stored as serialized BigDecimal objects or as strings parseable by BigDecimal.
Serializable values are serialized Java objects.
Binary values are raw byte arrays.
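
As a rough illustration of the Boolean representations listed above, the following sketch normalizes the accepted forms to a single true/false value. The function is hypothetical, and treating the strings as case-insensitive is an assumption rather than documented behavior.

```python
TRUE_STRINGS = {"y", "yes", "true", "t"}
FALSE_STRINGS = {"n", "no", "false", "f"}


def decode_boolean(stored):
    """Interpret a stored 0/1 number or a recognized string as a Boolean."""
    if isinstance(stored, int):
        if stored in (0, 1):
            return stored == 1
        raise ValueError("numeric Boolean must be 0 or 1: %r" % (stored,))

    # Assumption: surrounding whitespace and letter case are ignored.
    text = str(stored).strip().lower()
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError("unrecognized Boolean value: %r" % (stored,))


print(decode_boolean(1), decode_boolean("N"), decode_boolean("true"))
```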

To speed up mapping creation, select Get incoming fields.

Alias and Column name are set to each incoming field name.
Type information is set automatically.
Column family is set to either:
- The first column family defined (if the table exists)
- Family1 (if the table does not exist)
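
The defaults applied by Get incoming fields can be pictured with a short sketch. The row structure and function name are hypothetical, and copying the incoming field type unchanged is a simplifying assumption about how type information is set.

```python
def default_mapping_rows(incoming_fields, existing_families=None):
    """Build default mapping rows from (field name, type) pairs."""
    # Use the first family of an existing table, otherwise the Family1 default.
    family = existing_families[0] if existing_families else "Family1"
    return [
        {
            "alias": name,
            "column_name": name,
            "column_family": family,
            "type": field_type,
        }
        for name, field_type in incoming_fields
    ]


fields = [("customer_id", "Long"), ("name", "String")]

# Table does not exist yet: every column lands in Family1.
for row in default_mapping_rows(fields):
    print(row)

# Table exists with families cf1 and cf2: the first one is used.
for row in default_mapping_rows(fields, ["cf1", "cf2"]):
    print(row)
```

In this sketch no row is marked as the key; that choice is still made in the mapping table.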

The step does not support adding new column families to an existing table.

Performance considerations

Write buffering and WAL settings can affect performance:

If you leave Size of write buffer (bytes) blank, the buffer is 2 MB (default), auto flush is enabled, and Put operations are executed immediately. This means each row is transmitted to HBase as soon as it reaches the step.
If you enter a value for Size of write buffer (bytes) (even the default value), auto flush is disabled and rows are transferred only when the buffer is full.
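
The difference between the two modes can be approximated with a small simulation that counts remote calls. This is a simplified model, not the HBase client implementation; the class and its attributes are hypothetical.

```python
class SimulatedWriter:
    """Counts remote calls for auto-flush versus buffered writes."""

    def __init__(self, buffer_size=None):
        # None models a blank field: auto flush, one call per row.
        self.buffer_size = buffer_size
        self.pending_bytes = 0
        self.rpc_calls = 0

    def put(self, row_bytes):
        if self.buffer_size is None:
            self.rpc_calls += 1
            return
        self.pending_bytes += len(row_bytes)
        # Buffered mode: send only once the buffer is full.
        if self.pending_bytes >= self.buffer_size:
            self.flush()

    def flush(self):
        """Send any rows still waiting in the buffer."""
        if self.pending_bytes > 0:
            self.rpc_calls += 1
            self.pending_bytes = 0


rows = [b"x" * 100 for _ in range(10)]

auto = SimulatedWriter()
buffered = SimulatedWriter(buffer_size=400)
for row in rows:
    auto.put(row)
    buffered.put(row)
buffered.flush()  # final flush for the partially filled buffer

print(auto.rpc_calls, buffered.rpc_calls)
```

With ten 100-byte rows and a 400-byte buffer, the buffered writer in this model makes three calls where the auto-flush writer makes ten.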

Disabling the Write Ahead Log (WAL) can improve performance but reduces the ability to recover after server failures.

Creating new tables (compression and Bloom filter options)

On the Create/Edit mappings tab, you can create a new table by entering a table name that does not already exist.

You can suffix a new table name with options for compression and Bloom filters:

Compression options: NONE, GZ, LZO
Bloom filter options: NONE, ROW, ROWCOL

If you do not specify options, the defaults are NONE for both compression and Bloom filters.

Example:

NewTable@GZ@ROWCOL
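
One way to read such a name is sketched below. The parsing rules are assumptions made for illustration (options are recognized by value rather than by position, and are matched case-insensitively); the function is hypothetical and does not reflect how PDI parses the name internally.

```python
COMPRESSION_OPTIONS = {"NONE", "GZ", "LZO"}
BLOOM_OPTIONS = {"NONE", "ROW", "ROWCOL"}


def parse_new_table_name(spec):
    """Split 'Name@OPTION@OPTION' into (name, compression, bloom filter)."""
    parts = spec.split("@")
    name = parts[0]
    compression = "NONE"  # default when no compression option is given
    bloom = "NONE"        # default when no Bloom filter option is given

    for option in parts[1:]:
        value = option.upper()
        if value == "NONE":
            continue  # matches the default for both settings
        if value in COMPRESSION_OPTIONS:
            compression = value
        elif value in BLOOM_OPTIONS:
            bloom = value
        else:
            raise ValueError("unknown table option: " + option)

    return name, compression, bloom


print(parse_new_table_name("NewTable@GZ@ROWCOL"))
print(parse_new_table_name("NewTable"))
```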

Due to licensing constraints, HBase does not ship with LZO compression libraries. Install them on each node if you want to use LZO compression.
