HBase Input

Use the HBase Input step to read data from an HBase table according to user-defined column metadata.

HBase is a distributed, column-oriented database that provides random read and write access to data stored in the Hadoop Distributed File System (HDFS). HBase stores all data as raw bytes without any associated metadata. A mapping provides the metadata that allows the step to decode the binary values properly.
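The role of a mapping can be illustrated with a minimal sketch. It is not the PDI implementation: the column names, the `MAPPING` dictionary, and the `decode` helper are hypothetical, and the sketch assumes numeric values are stored as big-endian bytes.

```python
import struct

# Hypothetical mapping from "family:column" to a type name.
# This is an illustration only, not the format PDI uses to store mappings.
MAPPING = {"cf:age": "Integer", "cf:name": "String", "cf:score": "Double"}

# Assumed big-endian layouts for the fixed-width types used in this sketch.
_FORMATS = {"Integer": ">i", "Long": ">q", "Double": ">d"}


def decode(column: str, raw: bytes):
    """Decode one raw cell value using the type recorded in the mapping."""
    type_name = MAPPING[column]
    if type_name == "String":
        return raw.decode("utf-8")
    return struct.unpack(_FORMATS[type_name], raw)[0]


# A row as it might come back from storage: nothing but bytes.
raw_row = {
    "cf:age": struct.pack(">i", 42),
    "cf:name": "Ada".encode("utf-8"),
    "cf:score": struct.pack(">d", 3.5),
}

decoded = {column: decode(column, raw) for column, raw in raw_row.items()}
print(decoded)
```

Without the type names in `MAPPING`, the four bytes of `cf:age` could equally be read as a string or a float, which is why the step requires a mapping before it can read a table.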

Step name

Step name: Specify the unique name of the HBase Input step on the canvas. You can customize the name or leave the default.

Options

The HBase Input step includes the following tabs:

Configure query
Create/Edit mappings
Filter result set

It also includes the following reference sections:

Namespaces
Performance considerations

Configure query tab

Before a value can be read from HBase, you must specify the value type and column family, and the type of the table key.

You must define a mapping to use a source table. You can output some or all of the fields defined in the mapping.

If you clear all rows from the fields table, the step outputs all fields defined in the mapping.
You can delete rows from the fields table to output a subset of fields.

This tab contains connection details and basic query information. You can configure a connection by using Hadoop cluster properties, or by using an hbase-site.xml (and optional hbase-default.xml) configuration file.

Connection and scan settings

Hadoop Cluster: Select an existing Hadoop cluster configuration.
- Select Edit to edit an existing configuration.
- Select New to create a new configuration.
- For details, see Connecting to a Hadoop cluster with the PDI client.
URL to hbase-site.xml: Path to hbase-site.xml.
URL to hbase-default.xml: Path to hbase-default.xml.
HBase table name: Name of the source table to read.
Get mapped table names: Retrieves mapped table names.
- If you enter namespace:tablename in HBase table name and then select Get mapped table names, only mapped tables in that namespace are shown.
- If you do not enter a namespace, tables across all namespaces are shown.
- See Namespaces.
Mapping name: Mapping used to decode and interpret column values.
- Select Get mappings for the specified table to populate available mappings.
Store mapping info in step meta data: Stores mapping info in step metadata instead of loading it from HBase at runtime.
Start key value (inclusive) for table scan: Starting key value for a partial scan (inclusive).
Stop key value (excluding) for table scan: Stopping key value for a partial scan (exclusive).
You can leave the start and stop key fields blank. If you leave the stop key blank, the scan returns all rows starting with (and including) the start key.
Scanner row cache size: Number of rows to cache per fetch request. See Performance considerations.
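
The start and stop key semantics described above (inclusive start, exclusive stop, blank meaning unbounded) can be sketched over a sorted list of keys. The `scan` function and the sample keys are hypothetical and only model the range behavior, not an actual HBase scan.

```python
from bisect import bisect_left


def scan(sorted_keys, start=None, stop=None):
    """Return keys in [start, stop); None models a blank field."""
    low = 0 if start is None else bisect_left(sorted_keys, start)
    high = len(sorted_keys) if stop is None else bisect_left(sorted_keys, stop)
    return sorted_keys[low:high]


keys = sorted([b"row1", b"row2", b"row3", b"row4"])

print(scan(keys, b"row2", b"row4"))  # start included, stop excluded
print(scan(keys, b"row3"))           # blank stop: everything from the start key
print(scan(keys))                    # both blank: full table scan
```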

Key fields table

This table displays metadata for the selected table.

Columns:

#: Order of fields.
Alias: Name assigned to the field in the output stream.
Key: Whether the field is the table key.
Column family: Column family in the source table.
Column name: Column name in the source table. Column family + column name uniquely identify a column.
Type: PDI data type for the field.
Format: Formatting mask.
Indexed values: Optional set of values for string columns (comma-separated).

Buttons:

Get Key/Fields Info: Populates the field list and displays the key name as defined in the mapping.

Formatting notes for range scans:

For date key values in range scans, you must provide a formatting string.
You can provide formatting in either of these ways:
- Output the key from the mapping and set Format on the key row.
- If you do not output the key (or you output all fields by leaving the table blank), suffix the start/stop key value with a format string using @.
Example: 1969-08-28@yyyy-MM-dd
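
The @ suffix convention can be sketched as follows. This is a hypothetical helper, not PDI code: it splits the key text from the format string and translates a small, assumed subset of Java date pattern letters into Python `strptime` directives.

```python
from datetime import datetime

# Assumed subset of Java date pattern tokens and their strptime equivalents.
_JAVA_TO_STRPTIME = [
    ("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
    ("HH", "%H"), ("mm", "%M"), ("ss", "%S"),
]


def parse_key(value: str):
    """Split 'key@format' and parse the key as a date when a format is given."""
    key_text, separator, java_pattern = value.rpartition("@")
    if not separator:
        return value  # no format suffix: return the key text unchanged
    strptime_format = java_pattern
    for java_token, directive in _JAVA_TO_STRPTIME:
        strptime_format = strptime_format.replace(java_token, directive)
    return datetime.strptime(key_text, strptime_format)


print(parse_key("1969-08-28@yyyy-MM-dd"))
print(parse_key("row-0001"))
```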

Create/Edit mappings tab

Use this tab to create or edit mappings for an HBase table.

The mapping defines metadata about values stored in the table. Because HBase stores data as raw bytes, PDI uses the mapping to decode values and execute comparisons for column-based filtering.

A valid mapping must define metadata for the table key. The key must have a value in Alias because HBase does not provide a key name.

For non-key columns:

Column family and Column name are required.
Alias is optional. If you omit it, PDI uses the column name.

All fields must include type information.

Top-level fields

HBase table name: List of table names.
Connection information from the previous tab must be complete and valid for this list to populate.
Get table names: Retrieves all table names, including tables without Pentaho mappings.
If you enter namespace:tablename and select Get mapped table names, only mapped table names display. If you do not enter a namespace, tables across all namespaces display.
Mapping name: Existing mappings for the selected table.
You can define multiple mappings for the same table using different subsets of columns.

Fields

Columns:

#: Order of mapping entries.
Alias: Name assigned to the field. Required for the HBase table key; optional for non-key columns.
Key: Whether the field is the table key (Y or N).
Column family: Column family in the source table (required for non-key columns).
Column name: Column name in the source table.
Type: Data type.
Key column types:
- String
- Integer
- UnsignedInteger
- Long
- UnsignedLong
- Date
- UnsignedDate
- Binary
Non-key column types:
- String
- Integer
- Long
- Float
- Double
- Boolean
- Date
- BigNumber
- Serializable
- Binary
Indexed values: Comma-separated values for string columns.

Buttons:

Save mapping: Saves the mapping.
Delete mapping: Deletes the mapping from the mapping table (does not delete the HBase table).
Create a tuple template: Creates a mapping template to extract tuples from HBase.

Additional notes on data types

For keys to sort properly in HBase, note the distinction between signed and unsigned numbers.

Because of the way HBase stores integer and long values, PDI flips the sign bit before storing signed numbers so that positive numbers sort after negative numbers. Unsigned values can be stored directly.
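
The effect of flipping the sign bit can be illustrated with a short sketch. It assumes 4-byte big-endian integers and uses hypothetical helper names; it demonstrates the general technique rather than PDI's actual encoder.

```python
import struct

SIGN_BIT = 0x80000000


def encode_signed_int(value: int) -> bytes:
    """Encode a 32-bit signed int so that byte order matches numeric order."""
    flipped = (value & 0xFFFFFFFF) ^ SIGN_BIT
    return struct.pack(">I", flipped)


def decode_signed_int(raw: bytes) -> int:
    """Reverse the sign-bit flip and reinterpret the bits as a signed int."""
    (flipped,) = struct.unpack(">I", raw)
    return struct.unpack(">i", struct.pack(">I", flipped ^ SIGN_BIT))[0]


values = [5, -3, 0, -100, 42]

# Sorting by the encoded bytes mimics how keys are ordered lexicographically.
by_bytes = sorted(values, key=encode_signed_int)
print(by_bytes)
```

Sorting by plain two's complement bytes would instead place the negative values after the positive ones, because their leading bit is 1.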

Additional behavior:

String columns can optionally define legal values by entering comma-separated values in Indexed values.
Date keys can be stored as signed or unsigned long types (epoch-based timestamps). If you map a date key as String, PDI can change its type to Date for manipulation in the transformation.
Boolean values can be stored as 0/1 integers/longs or as strings (Y/N, yes/no, true/false, T/F).
BigNumber values can be stored as serialized BigDecimal objects or as strings parseable by BigDecimal.
Serializable values are stored as serialized Java objects.
Binary values are raw byte arrays.
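
The Boolean encodings listed above can be sketched with a small decoder. How PDI actually distinguishes numeric from string encodings is not stated here, so the length-based check below is an assumption, and `decode_boolean` is a hypothetical name.

```python
from typing import Optional

_TRUE_STRINGS = {"y", "yes", "true", "t"}
_FALSE_STRINGS = {"n", "no", "false", "f"}


def decode_boolean(raw: bytes) -> Optional[bool]:
    """Interpret raw bytes as a Boolean; return None if no encoding matches."""
    # Assumption: 4- or 8-byte values holding 0 or 1 are integers/longs.
    if len(raw) in (4, 8):
        number = int.from_bytes(raw, "big", signed=True)
        if number in (0, 1):
            return bool(number)
    # Otherwise try the string encodings, ignoring case.
    text = raw.decode("utf-8", errors="replace").strip().lower()
    if text in _TRUE_STRINGS:
        return True
    if text in _FALSE_STRINGS:
        return False
    return None


print(decode_boolean((1).to_bytes(4, "big")))
print(decode_boolean(b"no"))
print(decode_boolean(b"true"))
```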

Filter result set tab

Use this tab to refine the set of rows returned by specifying filters on columns other than the key.

Match behavior

Match all / Match any: When you define multiple column filters, choose whether returned rows must match all filters or any single filter.
You can set bounded ranges on a numeric column by defining upper and lower bound filters and selecting Match all.
You can define open-ended ranges by selecting Match any.
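
The difference between Match all and Match any can be sketched with plain Python predicates. The `row_matches` function, the `temp` alias, and the sample rows are hypothetical; the real filters run inside HBase.

```python
import operator

OPERATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "!=": operator.ne,
}


def row_matches(row, filters, match_all=True):
    """Apply (alias, operator, constant) filters with all/any semantics."""
    results = (OPERATORS[op](row[alias], constant) for alias, op, constant in filters)
    return all(results) if match_all else any(results)


rows = [{"temp": 5}, {"temp": 20}, {"temp": 35}]

# Bounded range: lower and upper bound combined with Match all.
bounded = [("temp", ">=", 10), ("temp", "<=", 30)]
inside = [r for r in rows if row_matches(r, bounded, match_all=True)]

# Open-ended ranges: either tail qualifies with Match any.
tails = [("temp", "<", 10), ("temp", ">", 30)]
outside = [r for r in rows if row_matches(r, tails, match_all=False)]

print(inside, outside)
```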

Fields

Columns:

#: Order of filter operations.
Alias: Column alias name (from the mapping).
Type: Data type (populated after you select an alias).
Operator:
- Numeric/date/Boolean: equality and inequality operators
- String: substring and regular expression operators
Comparison value: Comparison constant used with the operator.
Format: Formatting mask applied to the field.
Signed comparison: Whether the comparison involves negative numbers for non-string fields.
Because HBase stores numbers in two’s complement form, Signed comparison indicates whether deserialization is required for correct comparison.
If all values are positive, HBase’s native lexicographical comparisons produce accurate results. If values can be negative, values must be deserialized before comparison.

HBase Input includes a custom comparator that deserializes column values before comparison. You must install it on each HBase node for signed comparisons to work correctly.

A special comparator for Boolean values is also provided to deserialize and interpret Boolean values from numeric and string encodings.
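
Why negative values require deserialization can be shown with a short sketch. It assumes 8-byte big-endian two's complement values and uses hypothetical helper names; it does not reproduce the comparator shipped with HBase Input.

```python
import struct


def to_raw(value: int) -> bytes:
    """Serialize a signed 64-bit value as big-endian two's complement."""
    return struct.pack(">q", value)


def bytewise_less(a_raw: bytes, b_raw: bytes) -> bool:
    """Lexicographic comparison of the raw bytes."""
    return a_raw < b_raw


def signed_less(a_raw: bytes, b_raw: bytes) -> bool:
    """Deserialize both values first, then compare numerically."""
    return struct.unpack(">q", a_raw)[0] < struct.unpack(">q", b_raw)[0]


# Positive values: both approaches agree.
print(bytewise_less(to_raw(3), to_raw(7)), signed_less(to_raw(3), to_raw(7)))

# A negative value: the raw bytes of -1 start with 0xFF, so they sort last.
print(bytewise_less(to_raw(-1), to_raw(1)), signed_less(to_raw(-1), to_raw(1)))
```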

Namespaces

You can use namespaces in HBase table name to create a logical grouping of tables (for example, one namespace for development and another for production).

You must create a namespace before you can write to it. If you do not enter a namespace when creating a mapping, Pentaho uses the default namespace, which is named default.

For details, see HBase namespaces.

You can also use a variable for a namespace. This makes it easier to move a transformation between environments.

Namespace variable format:

${nsvarname}:
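
How a namespace variable resolves to a namespace and table name can be sketched as follows. Python's `string.Template` stands in for PDI variable substitution, and `resolve_table_name` and the `weblogs` table are hypothetical; the fallback to `default` follows the behavior described above.

```python
from string import Template


def resolve_table_name(raw_name: str, variables: dict) -> tuple:
    """Expand ${...} variables, then split 'namespace:table'."""
    resolved = Template(raw_name).substitute(variables)
    namespace, separator, table = resolved.partition(":")
    if not separator:
        # No namespace entered: fall back to the namespace named "default".
        return "default", resolved
    return namespace, table


# The same transformation pointed at two environments.
print(resolve_table_name("${nsvarname}:weblogs", {"nsvarname": "dev"}))
print(resolve_table_name("${nsvarname}:weblogs", {"nsvarname": "prod"}))
print(resolve_table_name("weblogs", {}))
```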

Every namespace has a pentaho_mappings table that stores mapping metadata for columns. This table is created automatically when you create mappings.

Performance considerations

In addition to standard HBase server tuning, the following HBase Input settings can affect performance:

Scanner row cache size (Configure query tab):
- Default (blank): no caching; one row per fetch request.
- Setting a value can make scans faster but uses more memory.
Column selection behavior:
- If you specify fields in the key fields table on the Configure query tab, HBase must check each row for column presence, which can reduce speed.
- Enabling Bloom filters on the table can reduce lookups.
- If you leave the fields table blank, the scan returns all columns in each row. HBase avoids extra lookups, but HBase Input outputs only columns defined in the mapping.
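
The trade-off behind Scanner row cache size can be estimated with simple arithmetic. The sketch below is a simplified model, assuming one round trip per batch of rows and ignoring any extra request that a real scanner might issue; `fetch_requests` is a hypothetical helper.

```python
import math
from typing import Optional


def fetch_requests(total_rows: int, cache_size: Optional[int] = None) -> int:
    """Estimate round trips for a scan; a blank cache size means one row per fetch."""
    rows_per_fetch = cache_size if cache_size else 1
    return math.ceil(total_rows / rows_per_fetch)


print(fetch_requests(10_000))        # blank cache size
print(fetch_requests(10_000, 500))   # 500 rows cached per fetch
```

A larger cache size reduces the number of round trips, but each fetch holds more rows in memory at once.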
