Pentaho MapReduce
This job entry executes transformations as part of a Hadoop MapReduce job, instead of requiring a traditional Hadoop Java class.
A Hadoop MapReduce job can include any combination of the following transformation types:
Mapper transformation (required): Converts input data into key/value tuples. It can filter and sort data, and applies a function to each element of a list.
Combiner transformation (optional): Summarizes map output records that share the same key. This can reduce the amount of data written to disk and transmitted over the network.
Reducer transformation (optional): Performs summary operations across keys (for example, counting occurrences) and outputs results.
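For example, in a word-count job the three transformation types pass key/value tuples along roughly as follows (the input line and counts are illustrative):
  Mapper:   (0, "the quick brown fox")  ->  ("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)
  Combiner: ("the", [1, 1, 1])          ->  ("the", 3)
  Reducer:  ("the", [3, 2])             ->  ("the", 5)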
This entry was formerly known as Hadoop Transformation Job Executor.
The Hadoop job name field on the Cluster tab is required.
General
Entry name: Specify the unique name of the job entry on the canvas. The default is Pentaho MapReduce.
Options
The Pentaho MapReduce job entry includes several tabs to define transformations and configure the Hadoop cluster connection.
Mapper tab

Transformation
Specify the transformation that performs the mapper functions by entering its path or clicking Browse.
If you select a transformation that shares the same root path as the current transformation, PDI inserts ${Internal.Entry.Current.Directory} in place of the common root path.
If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name.
Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.
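For example, either of the following forms works, depending on whether you use a repository (the transformation name and file path are illustrative):
  ${Internal.Entry.Current.Directory}/word_count_mapper
  /home/pentaho/transformations/word_count_mapper.ktr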
Input step name
The name of the step that receives mapping data from Hadoop. It must be a MapReduce Input step.
Output step name
The name of the step that passes mapping output back to Hadoop. It must be a MapReduce Output step.
Combiner tab

Transformation
Specify the transformation that performs the combiner functions by entering its path or clicking Browse.
You can use internal variables such as ${Internal.Entry.Current.Directory} in the path.
If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name.
Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.
Input step name
The name of the step that receives combiner data from Hadoop. It must be a MapReduce Input step.
Output step name
The name of the step that passes combiner output back to Hadoop. It must be a MapReduce Output step.
Use single threaded transformation engine
Use the single-threaded transformation execution engine to run the combiner transformation. This can reduce overhead when processing many small groups of output.
Reducer tab

Transformation
Specify the transformation that performs the reducer functions by entering its path or clicking Browse.
You can use internal variables such as ${Internal.Entry.Current.Directory} in the path.
If you are working with a repository, specify the transformation name. If you are not working with a repository, specify the transformation XML file name.
Note: Transformations previously specified by reference are automatically converted to use the transformation name within the Pentaho Repository.
Input step name
The name of the step that receives reducer data from Hadoop. It must be a MapReduce Input step.
Output step name
The name of the step that passes reducer output back to Hadoop. It must be a MapReduce Output step.
Use single threaded transformation engine
Use the single-threaded transformation execution engine to run the reducer transformation. This can reduce overhead when processing many small groups of output.
Job Setup tab

Input path
The input directory path on the Hadoop cluster that contains the source data (for example, /wordcount/input). You can provide multiple input directories as a comma-separated list. If you want to read from S3, use the S3A connector (s3a://). Connectors s3 and s3n are not supported. See the Hadoop documentation about Amazon S3 for details.
Output path
The output directory path on the Hadoop cluster (for example, /wordcount/output). The output directory cannot exist before you run the MapReduce job. To write to S3, use the S3A connector (s3a://).
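For example (the directory and bucket names are illustrative):
  /wordcount/input,/wordcount/archive    (two HDFS input directories)
  s3a://my-bucket/wordcount/input        (S3 input through the S3A connector)
  /wordcount/output                      (HDFS output directory; must not already exist)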
Remove output path before job
Remove the output path before scheduling the MapReduce job. Note: Do not use this option with S3. To clean an S3 output path, use an alternative entry, such as Delete folders.
Input format
The Apache Hadoop class name that describes the input specification. For more information, see InputFormat.
Output format
The Apache Hadoop class name that describes the output specification. For more information, see OutputFormat.
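For example, jobs that read and write plain text files typically use the standard Hadoop text formats:
  org.apache.hadoop.mapred.TextInputFormat
  org.apache.hadoop.mapred.TextOutputFormat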
Ignore output of map key
Ignore the key output from the mapper transformation and replace it with NullWritable.
Ignore output of map value
Ignore the value output from the mapper transformation and replace it with NullWritable.
Ignore output of reduce key
Ignore the key output from combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation (not the Identity Reducer).
Ignore output of reduce value
Ignore the value output from combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation (not the Identity Reducer).
Cluster tab

Hadoop job name
The Hadoop job name. This field is required.
Hadoop Cluster
Select an existing Hadoop cluster configuration or create a new one.
Click Edit to modify an existing configuration.
Click New to create a new configuration.
For more information, see Connecting to a Hadoop cluster with the PDI client.
Number of Mapper Tasks
The number of mapper tasks to assign to the job. Input size typically determines this value.
Number of Reducer Tasks
The number of reducer tasks to assign to the job. Note: If this is 0, no reduce operation is performed, the mapper output becomes the job output, and combiner operations are not performed.
Logging Interval
The number of seconds between log messages.
Enable Blocking
Force the job to wait until the Hadoop job completes before continuing. This is the only way for PDI to determine job status. If you clear this option, PDI continues immediately and error handling/routing does not work.
Hadoop cluster configuration fields (Edit/New)
When you click Edit or New next to Hadoop Cluster, the Hadoop cluster dialog box appears.
Cluster Name
The name of the cluster configuration.
Hostname (HDFS)
The hostname for the HDFS node.
Port (HDFS)
The port for the HDFS node.
Username (HDFS)
The username for the HDFS node.
Password (HDFS)
The password for the HDFS node.
Hostname (JobTracker)
The hostname for the JobTracker node. If you have a separate JobTracker node, enter that hostname; otherwise, use the HDFS hostname.
Port (JobTracker)
The port for the JobTracker. This port cannot be the same as the HDFS port.
Hostname (ZooKeeper)
The hostname for the ZooKeeper node.
Port (ZooKeeper)
The port for the ZooKeeper node.
URL (Oozie)
A valid Oozie URL.
After you set these options:
Click Test to verify the configuration.
Click OK to return to the Cluster tab.
User Defined tab

Use this tab to define user-defined parameters and variables.
Name
The name of the parameter or variable to set. To set a Java system property, prefix the name with java.system (for example, java.system.SAMPLE_VARIABLE). Variables set here override variables set in kettle.properties. For more information, see Kettle Variables.
Value
The value to assign to the parameter or variable.
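For example (the names and values are illustrative):
  Name: java.system.SAMPLE_VARIABLE    Value: 42            (sets the Java system property SAMPLE_VARIABLE)
  Name: SAMPLE_OUTPUT_DIR              Value: /tmp/staging  (overrides a variable of the same name from kettle.properties)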
Workflows and related information
Use PDI outside and inside the Hadoop cluster
PDI can run both outside a Hadoop cluster and within cluster nodes.
Outside the cluster, PDI can extract data from or load data into Hadoop HDFS, Hive, and HBase.
Inside the cluster, PDI transformations can act as mapper and/or reducer tasks. This enables you to build MapReduce jobs visually.
Pentaho MapReduce workflow
PDI and Pentaho MapReduce enable you to pull data from a Hadoop cluster, transform it, and pass it back to the cluster.
Build the mapper transformation
Start by designing the transformation you want.
Create a PDI transformation.
Add the Hadoop MapReduce Input and MapReduce Output steps.
Configure both steps and connect them with hops.
Name this transformation Mapper.

Hadoop communicates in key/value pairs. The MapReduce Input step defines how key/value pairs from Hadoop are interpreted by PDI.

The MapReduce Output step passes output back to Hadoop.
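For example, with a plain text input format, each record arrives at the MapReduce Input step as a key/value pair similar to the following (the offset and line are illustrative):
  key   (byte offset of the line; surfaces in PDI as a Long):    0
  value (the line of text itself; surfaces in PDI as a String):  "the quick brown fox"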

Build the job that runs MapReduce
Create a PDI job.
Add the Pentaho MapReduce job entry.
Configure the Mapper tab to reference your mapper transformation.
Add supporting entries such as Start and success/failure handling entries.


Run a Hadoop job by using a Java class
PDI can also execute a Java class from a job. Use the Hadoop Job Executor job entry to configure and run a .jar file.


If you use Amazon Elastic MapReduce (EMR), you can use the Amazon EMR Job Executor job entry, which includes connection information for Amazon S3 and configuration options for EMR.
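For reference, both entries run an ordinary Hadoop driver class packaged in a .jar. The following sketch is the standard Apache Hadoop word-count example, shown only to illustrate the kind of class such a .jar contains; the class name and argument handling are illustrative and not specific to PDI.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each line into words and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}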

Hadoop to PDI data type conversion
The Hadoop Job Executor and Pentaho MapReduce entries include an advanced configuration mode in which you must specify the Hadoop input and output data types. The following conversions between PDI (Kettle) data types and Apache Hadoop data types are supported:
PDI (Kettle) data type              Apache Hadoop data type
java.lang.Integer                   org.apache.hadoop.io.IntWritable
java.lang.Long                      org.apache.hadoop.io.IntWritable
java.lang.Long                      org.apache.hadoop.io.LongWritable
org.apache.hadoop.io.IntWritable    java.lang.Long
java.lang.String                    org.apache.hadoop.io.Text
java.lang.String                    org.apache.hadoop.io.IntWritable
org.apache.hadoop.io.LongWritable   org.apache.hadoop.io.Text
org.apache.hadoop.io.LongWritable   java.lang.Long
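As a minimal sketch of what these conversions mean in practice (assuming hadoop-common on the classpath; the values and class name are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class TypeConversionSketch {
    public static void main(String[] args) {
        // A key/value pair as Hadoop represents it.
        Text word = new Text("hadoop");
        IntWritable count = new IntWritable(3);

        // The equivalent values after conversion to PDI (Kettle) types:
        // org.apache.hadoop.io.Text        -> java.lang.String
        // org.apache.hadoop.io.IntWritable -> java.lang.Long
        String pdiKey = word.toString();
        Long pdiValue = (long) count.get();
        System.out.println(pdiKey + " -> " + pdiValue);
    }
}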
Hadoop Hive-specific SQL limitations
Hive has limitations that can affect SQL queries, including:
Outer joins are not supported.
Each column can be used only once in a SELECT clause.
Conditional joins can use only the = condition unless you use a WHERE clause.
INSERT statements have specific syntax and limitations.
For details, see the Apache Hive documentation.
Big data tutorials
Pentaho big data tutorials provide scenario-based examples that demonstrate integration between Pentaho and Hadoop by using a sample data set.
Videos:
Loading data into Hadoop from outside the Hadoop cluster: https://www.youtube.com/watch?v=Ylekzmd6TAc
Pentaho MapReduce overview (interactive design without scripts/code): https://www.youtube.com/watch?v=KZe1UugxXcs