Pentaho MapReduce

This job entry executes transformations as part of a Hadoop MapReduce job, instead of requiring a traditional Hadoop Java class.

A Hadoop MapReduce job can include any combination of the following transformation types:

  • Mapper transformation (required): Converts input data into key/value tuples. It applies a function to each element of the input data and can also filter and sort it.

  • Combiner transformation (optional): Summarizes map output records that share the same key. This can reduce the amount of data written to disk and transmitted over the network.

  • Reducer transformation (optional): Performs summary operations across keys (for example, counting occurrences) and outputs results.
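
If you were writing these pieces by hand instead of as PDI transformations, they would correspond to ordinary Hadoop Mapper and Reducer classes. The following word-count sketch only illustrates the three roles; it is not code that Pentaho MapReduce generates, and the class names are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper role: turn each input record into zero or more key/value pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // emit (word, 1)
    }
  }
}

// Reducer role: summarize all values that share a key.
// The same class can also serve as the combiner, because summing is associative.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum)); // emit (word, total)
  }
}
```

In a Pentaho MapReduce job, the mapper, combiner, and reducer transformations play these same roles without any Java code.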

Note: This entry was formerly known as Hadoop Transformation Job Executor.

General

  • Entry name: Specify the unique name of the job entry on the canvas. The default is Pentaho MapReduce.

Options

The Pentaho MapReduce job entry includes several tabs to define transformations and configure the Hadoop cluster connection.

Mapper tab

  • Transformation: Specify the transformation that performs the mapper functions by entering its path or clicking Browse. If you select a transformation that shares the same root path as the current transformation, PDI inserts ${Internal.Entry.Current.Directory} in place of the common root path. If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name. Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.

  • Input step name: The name of the step that receives mapping data from Hadoop. It must be a MapReduce Input step.

  • Output step name: The name of the step that passes mapping output back to Hadoop. It must be a MapReduce Output step.

Combiner tab

  • Transformation: Specify the transformation that performs the combiner functions by entering its path or clicking Browse. You can use internal variables such as ${Internal.Entry.Current.Directory} in the path. If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name. Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.

  • Input step name: The name of the step that receives combiner data from Hadoop. It must be a MapReduce Input step.

  • Output step name: The name of the step that passes combiner output back to Hadoop. It must be a MapReduce Output step.

  • Use single threaded transformation engine: Use the single-threaded transformation execution engine to run the combiner transformation. This can reduce overhead when processing many small groups of output.

Reducer tab

  • Transformation: Specify the transformation that performs the reducer functions by entering its path or clicking Browse. You can use internal variables such as ${Internal.Entry.Current.Directory} in the path. If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name. Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.

  • Input step name: The name of the step that receives reducer data from Hadoop. It must be a MapReduce Input step.

  • Output step name: The name of the step that passes reducer output back to Hadoop. It must be a MapReduce Output step.

  • Use single threaded transformation engine: Use the single-threaded transformation execution engine to run the reducer transformation. This can reduce overhead when processing many small groups of output.

Job Setup tab

  • Input path: The input directory path on the Hadoop cluster that contains the source data (for example, /wordcount/input). You can provide multiple input directories as a comma-separated list. If you want to read from S3, use the S3A connector (s3a://); the s3 and s3n connectors are not supported. See the Hadoop documentation about Amazon S3 for details, and the sketch after this list for how the path and format fields map to the Hadoop Java API.

  • Output path: The output directory path on the Hadoop cluster (for example, /wordcount/output). The output directory must not exist before you run the MapReduce job. To write to S3, use the S3A connector (s3a://).

  • Remove output path before job: Remove the output path before scheduling the MapReduce job. Note: Do not use this option with S3. To clean an S3 output path, use an alternative entry, such as Delete folders.

  • Input format: The Apache Hadoop class name that describes the input specification. For more information, see InputFormat.

  • Output format: The Apache Hadoop class name that describes the output specification. For more information, see OutputFormat.

  • Ignore output of map key: Ignore the key output from the mapper transformation and replace it with NullWritable.

  • Ignore output of map value: Ignore the value output from the mapper transformation and replace it with NullWritable.

  • Ignore output of reduce key: Ignore the key output from combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation (not the Identity Reducer).

  • Ignore output of reduce value: Ignore the value output from combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation (not the Identity Reducer).
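
For reference, here is a minimal sketch of how the Input path, Output path, Input format, and Output format fields correspond to the standard Hadoop Java API. The paths and the bucket name are hypothetical; this is not code the job entry generates.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobSetupSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");

    // Input path: a comma-separated list in the dialog becomes multiple input paths.
    FileInputFormat.setInputPaths(job,
        new Path("/wordcount/input"),
        new Path("s3a://example-bucket/wordcount/input")); // S3 access must use s3a://

    // Output path: the directory must not exist before the job runs.
    FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

    // Input format and Output format: fully qualified Hadoop class names.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
  }
}
```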

Cluster tab

  • Hadoop job name: The Hadoop job name. This field is required.

  • Hadoop Cluster: Select an existing Hadoop cluster configuration or create a new one. Click Edit to modify an existing configuration, or click New to create one. For more information, see Connecting to a Hadoop cluster with the PDI client.

  • Number of Mapper Tasks: The number of mapper tasks to assign to the job. Input size typically determines this value.

  • Number of Reducer Tasks: The number of reducer tasks to assign to the job. Note: If this is 0, no reduce operation is performed, the mapper output becomes the job output, and combiner operations are not performed (see the sketch after this list).

  • Logging Interval: The number of seconds between log messages.

  • Enable Blocking: Force the job entry to wait until the Hadoop job completes before continuing. This is the only way for PDI to determine the job status. If you clear this option, PDI continues immediately and error handling and routing do not work.
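
As a point of reference only, the Number of Reducer Tasks setting corresponds to the standard MapReduce job property; a minimal sketch of a map-only configuration (the job name here is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlySketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-example");

    // Zero reducer tasks: no reduce (or combine) phase runs, and the mapper
    // output is written directly as the job output.
    job.setNumReduceTasks(0);
  }
}
```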

Hadoop cluster configuration fields (Edit/New)

When you click Edit or New next to Hadoop Cluster, the Hadoop cluster dialog box appears.

  • Cluster Name: The name of the cluster configuration.

  • Hostname (HDFS): The hostname for the HDFS node.

  • Port (HDFS): The port for the HDFS node.

  • Username (HDFS): The username for the HDFS node.

  • Password (HDFS): The password for the HDFS node.

  • Hostname (JobTracker): The hostname for the JobTracker node. If you have a separate JobTracker node, enter that hostname; otherwise, use the HDFS hostname.

  • Port (JobTracker): The port for the JobTracker. This port cannot be the same as the HDFS port.

  • Hostname (ZooKeeper): The hostname for the ZooKeeper node.

  • Port (ZooKeeper): The port for the ZooKeeper node.

  • URL (Oozie): A valid Oozie URL.

After you set these options:

  1. Click Test to verify the configuration.

  2. Click OK to return to the Cluster tab.
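
These connection fields correspond to standard Hadoop client settings. As a rough illustration only (the hostname is hypothetical, and 8020 is used here as a common default NameNode port), the HDFS hostname and port combine into the fs.defaultFS setting of a Hadoop Configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterConnectionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hostname (HDFS) and Port (HDFS) form the default file system URI.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/wordcount/input")));
  }
}
```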

User Defined tab

Use this tab to define user-defined parameters and variables.

  • Name: The name of the parameter or variable to set. To set a Java system property, prefix the name with java.system. (for example, java.system.SAMPLE_VARIABLE). Variables set here override variables set in kettle.properties. For more information, see Kettle Variables.

  • Value: The value to assign to the parameter or variable.
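
Based on the java.system. naming rule above, a value supplied as java.system.SAMPLE_VARIABLE should be visible to Java code running in the job as an ordinary system property. A minimal sketch of reading it, assuming that mapping holds:

```java
public class SystemPropertySketch {
  public static void main(String[] args) {
    // A User Defined entry named java.system.SAMPLE_VARIABLE is expected to
    // surface as the system property SAMPLE_VARIABLE (assumption noted above).
    String value = System.getProperty("SAMPLE_VARIABLE", "<not set>");
    System.out.println("SAMPLE_VARIABLE=" + value);
  }
}
```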

Use PDI outside and inside the Hadoop cluster

PDI can run both outside a Hadoop cluster and within cluster nodes.

  • Outside the cluster, PDI can extract data from or load data into Hadoop HDFS, Hive, and HBase.

  • Inside the cluster, PDI transformations can act as mapper and/or reducer tasks. This enables you to build MapReduce jobs visually.

Pentaho MapReduce workflow

PDI and Pentaho MapReduce enable you to pull data from a Hadoop cluster, transform it, and pass it back to the cluster.

Build the mapper transformation

Start by designing the transformation you want.

  • Create a PDI transformation.

  • Add the Hadoop MapReduce Input and MapReduce Output steps.

  • Configure both steps and connect them with hops.

  • Name this transformation Mapper.

Hadoop communicates in key/value pairs. The MapReduce Input step defines how key/value pairs from Hadoop are interpreted by PDI.

The MapReduce Output step passes output back to Hadoop.
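
As a concrete illustration (assuming Hadoop's default TextInputFormat), the pair arriving at the MapReduce Input step is the byte offset of a line and the line itself, and a word-count mapper hands pairs like (word, 1) back through the MapReduce Output step:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class KeyValuePairSketch {
  public static void main(String[] args) {
    // Incoming pair (TextInputFormat): key = byte offset of the line, value = the line.
    LongWritable inKey = new LongWritable(0);
    Text inValue = new Text("the quick brown fox");

    // Outgoing pair from a word-count mapper: key = word, value = 1.
    Text outKey = new Text("quick");
    IntWritable outValue = new IntWritable(1);

    System.out.println("(" + inKey + ", " + inValue + ") -> (" + outKey + ", " + outValue + ")");
  }
}
```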

Build the job that runs MapReduce

  • Create a PDI job.

  • Add the Pentaho MapReduce job entry.

  • Configure the Mapper tab to reference your mapper transformation.

  • Add supporting entries such as Start and success/failure handling entries.

Run a Hadoop job by using a Java class

PDI can also execute a Java class from a job. Use the Hadoop Job Executor job entry to configure and run a .jar file.
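
For example, the .jar you point the Hadoop Job Executor at typically contains a driver class like the following sketch. The class names are hypothetical; TokenizerMapper and IntSumReducer refer to the word-count classes sketched earlier on this page.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);

    // Mapper, combiner, and reducer classes packaged in the same .jar.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output directories passed as command-line arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```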

If you use Amazon Elastic MapReduce (EMR), you can use the Amazon EMR Job Executor job entry, which includes connection information for Amazon S3 and configuration options for EMR.

Hadoop to PDI data type conversion

The Hadoop Job Executor and Pentaho MapReduce entries include an advanced configuration mode where you must specify Hadoop input/output data types.

PDI (Kettle) Data Type | Apache Hadoop Data Type
java.lang.Integer | org.apache.hadoop.io.IntWritable
java.lang.Long | org.apache.hadoop.io.IntWritable
java.lang.Long | org.apache.hadoop.io.LongWritable
org.apache.hadoop.io.IntWritable | java.lang.Long
java.lang.String | org.apache.hadoop.io.Text
java.lang.String | org.apache.hadoop.io.IntWritable
org.apache.hadoop.io.LongWritable | org.apache.hadoop.io.Text
org.apache.hadoop.io.LongWritable | java.lang.Long
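
For example, the Writable wrappers in the table convert to and from plain Java values like this (a minimal sketch, not specific to PDI):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class TypeConversionSketch {
  public static void main(String[] args) {
    // Wrapping plain Java values in their Hadoop Writable counterparts.
    IntWritable wrappedInt = new IntWritable(42);     // java.lang.Integer -> IntWritable
    LongWritable wrappedLong = new LongWritable(42L); // java.lang.Long    -> LongWritable
    Text wrappedText = new Text("forty-two");         // java.lang.String  -> Text

    // Unwrapping back to the Java types PDI works with.
    int i = wrappedInt.get();
    long l = wrappedLong.get();
    String s = wrappedText.toString();

    System.out.println(i + " " + l + " " + s);
  }
}
```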

Hadoop Hive-specific SQL limitations

Hive has limitations that can affect SQL queries, including:

  • Outer joins are not supported.

  • Each column can be used only once in a SELECT clause.

  • Conditional joins can use only the = condition unless you use a WHERE clause.

  • INSERT statements have specific syntax and limitations.

For details, see the Apache Hive documentation.

Big data tutorials

Pentaho big data tutorials provide scenario-based examples that demonstrate integration between Pentaho and Hadoop by using a sample data set.
