PDI transformation steps

Steps extend and expand the functionality of Pentaho Data Integration (PDI) transformations. You can use the following steps in PDI.

Steps: A - F

Name
Category
Description

Flow

Abort a transformation.

Transform

Add a checksum column for each input row.

Transform

Add one or more constants to the input rows.

Transform

Get the next value from a sequence.

Transform

Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.

Transform

Encode several fields into an XML fragment.

Streaming

Pull streaming data from an AMQP broker or clients through an AMQP transformation.

Streaming

Publish messages in near-real-time to an AMQP broker.

Statistics

Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST).

Flow

Refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on stream fields.

Flow

Append two streams in an ordered way.

Data Mining

Write data in ARFF format to a file.

Output

Generate documentation automatically based on input in the form of a list of transformations and jobs.

Big Data

Decode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields.

Avro input (deprecated)

Deprecated

Replaced by Avro Input.

Big Data

Serialize data into Avro binary or JSON format from the PDI data stream, then write it to a file.

Flow

Block this step until selected steps finish.

Flow

Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.

Transform

Create new fields by performing simple calculations.

Lookup

Get back information by calling a database procedure.

Pentaho Server

Call API endpoints from the Pentaho Server within a PDI transformation.

Cassandra Input

Deprecated

No longer a part of the PDI distribution. Contact Pentaho Support for details.

Cassandra Output

Deprecated

No longer a part of the PDI distribution. Contact Pentaho Support for details.

Catalog Input

Deprecated

Read CSV text file formats of a Pentaho Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or Amazon S3 ecosystem, and then output the data as table rows that can be used by a transformation.

Catalog Output

Deprecated

Encode CSV text file formats by using the schema that is defined in PDI to create or replace a data resource in Pentaho Data Catalog and add metadata to the data resource.

Utility

Change file encoding and create a new file.

Lookup

Check if a column exists in a table on a specified connection.

Lookup

Check if a file is locked by another process.

Lookup

Check if a webservice is available.

Utility

Clone a row as many times as needed.

Transform

Generate a closure table using parent-child relationships.

Data Warehouse

Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all the fields.

Transform

Concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.

Discovery

Reads binary data files that are mapped by a fixed-length COBOL copybook definition file.

Job

Write rows to the executing job. The information will then be passed to the next entry in this job.

Big Data

Retrieve all documents from a given view in a given design document from a given database.

Validation

Determines if a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on). A small sketch of the Luhn check appears after this table.

Input

Read from a simple CSV file input.

Input

Enter rows of static data in a grid, usually for testing, reference, or demo purposes.

Validation

Validates passing data based on a set of rules.

Lookup

Execute a database query using stream values as parameters.

Lookup

Look up values in a database using field values.

Input

Read rows of data from a data cube.

Utility

Output each input row after a delay.

Output

Permanently removes a row from a database.

Flow

Output one empty row if the input stream is empty (i.e., when the input stream does not contain any rows).

Input

Determines the structure of delimited text files.

Data Warehouse

Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.

Flow

Does not do anything. It is useful, however, when testing things or in certain situations where you want to split streams.

Lookup

Execute a dynamic SQL statement built in a previous field.

Utility

Convert an Edifact message to XML to simplify data extraction.

ElasticSearch Bulk Insert (deprecated)

Deprecated

Bulk loading

Perform bulk inserts into Elasticsearch.

Input

Read POP3/IMAP server and retrieve messages.

Input

Read shape file data from an ESRI shape file and linked DBF file.

Flow

Inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.

Example step (deprecated)

Deprecated

An example of a plugin test step.

Utility

Execute a process and return the result.

Scripting

Execute an SQL statement or file for every input row.

Scripting

Execute an SQL script, optionally parameterized using input rows.

Input

Parses hierarchical data type fields coming from a previous step.

Lookup

Check if a file exists.

Flow

Filter rows using simple equations.

Input

Read from a fixed file input.

Scripting

Calculate a formula using Pentaho's libformula.

Lookup

Find the approximate matches to a string using matching algorithms. Read a field from a main stream and output the approximate value from the lookup stream.
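
The credit card validation entry above refers to the LUHN10 (MOD-10) checksum. As an illustration of that algorithm only, and not of the step's own implementation, a minimal Java sketch (with a hypothetical class name) might look like this:

```java
public class LuhnCheck {

    // Returns true if the digit string passes the Luhn (MOD-10) checksum.
    public static boolean isValid(String number) {
        int sum = 0;
        boolean doubleDigit = false;
        // Walk the digits right to left, doubling every second one.
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = Character.getNumericValue(number.charAt(i));
            if (doubleDigit) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // equivalent to adding the two digits of d
                }
            }
            sum += d;
            doubleDigit = !doubleDigit;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        // 4111111111111111 is a commonly used test number that passes the check.
        System.out.println(isValid("4111111111111111")); // true
    }
}
```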

Steps: G - L

Name
Category
Description

Input

Generate random valid (Luhn check) credit card numbers.

Input

Generate random value.

Input

Generate a number of empty or equal rows.

Input

Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field.

Input

Get file names from the operating system and send them to the next step.

Job

Read filenames used or generated in a previous entry in a job.

Input

Get the row count of files.

Transform

Retrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.

Streaming

Return records that were previously generated by another transformation in a job.

Input

List detailed information about transformations and/or jobs in a repository.

Job

Read rows from a previous entry in a job.

Pentaho Server

Retrieve the value of a session variable.

Input

Read a parent folder and return all subfolders.

Input

Get information from the system like system date, arguments, etc.

Input

Get table names from database connection and send them to the next step.

Job

Determine the values of certain (environment or Kettle) variables and put them in field values.

Google Analytics (deprecated)

Deprecated

Fetch data from a Google Analytics account. Replacement step is Google Analytics v4. Note: This step will only work for Universal Analytics 360 customers until July 1, 2024.

Input

Fetch data from Google Analytics account.

Greenplum Bulk Loader (deprecated)

Deprecated

Bulk load Greenplum data. Replacement step is Greenplum Load.

Bulk loading

Bulk load Greenplum data.

Statistics

Build aggregates in a group by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly. A streaming-aggregation sketch illustrating why appears after this table.

Input

Read in parallel from a GZIP CSV file.

Big Data

Read data from a variety of different text-file types stored on a Hadoop cluster.

Big Data

Write data to a variety of different text-file types stored on a Hadoop cluster.

Big Data

Read from an HBase column family.

Big Data

Write to an HBase column family.

Big Data

Decodes an incoming key and HBase result object to a mapping.

Input

Load JSON data into PDI from a previous step or from a file.

Output

Converts hierarchical data from a previous step into JSON format.

Input

Read data from HL7 data streams.

Lookup

Call a web service over HTTP by supplying a base URL, optionally allowing parameters to be set dynamically.

Lookup

Call a web service request over HTTP by supplying a base URL, optionally allowing parameters to be set dynamically.

Deprecated

Receive messages from any IBM WebSphere MQ Server.

Deprecated

Send messages to any IBM WebSphere MQ Server.

Flow

Mark the last row.

Utility

Set a field value to a constant if it is null.

Bulk loading

Load data to an Infobright database table.

Bulk loading

Interface with the Ingres VectorWise Bulk Loader "COPY TABLE" command.

Inline

Inject rows into the transformation through the Java API.

Output

Update or insert rows in a database based upon keys.

Flow

Filter rows using Java code.

Streaming

Receive messages from a JMS server.

Streaming

Send messages to a JMS server.

Flow

Run a PDI job, and pass parameters and rows.

Joins

Output the Cartesian product of the input streams. The number of rows is the product of the number of rows in the input streams.

Input

Extract relevant portions out of JSON structures (file or incoming field) and output rows.

Output

Create JSON block and output it in a field to a file.

Streaming

Run a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time.

Streaming

Publish messages in near-real-time across worker nodes where multiple, subscribed members have access.

Streaming

Extract data from a specific stream located within the Amazon Kinesis Data Streams service.

Streaming

Push data to an existing region and stream located within the Amazon Kinesis Data Streams service.

Data Mining

Executes a Knowledge Flow data mining process.

Input

Read data from LDAP host.

Output

Perform insert, upsert, update, add, or delete operations on records based on their DN (Distinguished Name).

Input

Read data from LDIF files.

Input

Load file content in memory.

LucidDB streaming loader (deprecated)

Deprecated

Load data into LucidDB by using Remote Rows UDX.
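
The entry above that builds aggregates in a group by fashion works only on sorted input because a streaming aggregator can close a group as soon as the key value changes; with unsorted input, only consecutive rows that share a key end up in the same group. The following simplified Java sketch illustrates the idea. It is not PDI code, and the Row type is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class StreamingGroupBy {

    // Hypothetical row type: a grouping key plus a numeric value.
    record Row(String key, double value) {}

    public static void main(String[] args) {
        // The input must already be sorted on the grouping key for this to be correct.
        List<Row> sortedRows = Arrays.asList(
                new Row("A", 1.0), new Row("A", 2.0), new Row("B", 5.0));

        String currentKey = null;
        double sum = 0;
        long count = 0;
        for (Row row : sortedRows) {
            if (currentKey != null && !currentKey.equals(row.key())) {
                // The key changed, so the previous group is complete and can be emitted.
                System.out.printf("%s: sum=%.1f count=%d%n", currentKey, sum, count);
                sum = 0;
                count = 0;
            }
            currentKey = row.key();
            sum += row.value();
            count++;
        }
        if (currentKey != null) {
            // Emit the final group.
            System.out.printf("%s: sum=%.1f count=%d%n", currentKey, sum, count);
        }
    }
}
```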

Steps: M - R

Name
Category
Description

Utility

Send e-mail.

Validation

Check if an email address is valid.

Mapping

Run a mapping (sub-transformation), use MappingInput and MappingOutput to specify the fields interface.

Mapping

Specify the input interface of a mapping.

Mapping

Specify the output interface of a mapping.

Big Data

Enter Key Value pairs from Hadoop MapReduce.

Big Data

Exit Key Value pairs, then push into Hadoop MapReduce.

Lookup

Lookup an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization.

Statistics

Build aggregates in a group by fashion. This step doesn't require sorted input.

Joins

Join two streams on a given key and outputs a joined set. The input streams must be sorted on the join key.

Joins

Merge two streams of rows, sorted on a certain key. The two streams are compared and the equals, changed, deleted and new rows are flagged.

Utility

Read the metadata of the incoming stream.

Input

Read data from a Microsoft Access file.

Output

Store records into an MS-Access database table.

Input

Read data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS).

Microsoft Excel Output (deprecated)

Deprecated

Output

Write or append data to an Excel file.

Scripting

Run JavaScript programs (and much more).

Input

Build complex hierarchical data.

Input

Modifies hierarchical data to form nested JSON key-value pairs.

Input

Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian).

Bulk loading

Load data into MonetDB by using their bulk load command in streaming mode.

Big Data

Connects to a MongoDB cluster and executes Mongo shell-style commands.

Big Data

Read all entries from a MongoDB collection in the specified database.

Big Data

Write to a MongoDB collection.

Streaming

Pull streaming data from an MQTT broker or clients through an MQTT transformation.

Streaming

Publish messages in near-real-time to an MQTT broker.

Joins

Join multiple streams. This step supports INNER and FULL OUTER joins.

Bulk loading

Load data over a named pipe (not available on MS Windows).

Utility

Set a field value to null if it is equal to a constant value.

Transform

Create ranges based on numeric field.

Input

Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j.

OpenERP object delete (deprecated)

Deprecated

Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function.

OpenERP object input (deprecated)

Deprecated

Retrieve data from the OpenERP server using the XMLRPC interface with the 'read' function.

OpenERP object output (deprecated)

Deprecated

Update data on the OpenERP server using the XMLRPC interface and the 'import' function.

Bulk loading

Use Oracle Bulk Loader to load data.

Big Data

Read field data from ORC files into a PDI data stream.

Big Data

Serialize data from the PDI data stream into the ORC file format and write it to a file.

Statistics

Return metrics for one or several steps.

Palo cell input (deprecated)

Deprecated

Retrieve all cell data from a Palo cube.

Palo cell output (deprecated)

Deprecated

Update cell data in a Palo cube.

Palo dim input (deprecated)

Deprecated

Return elements from a dimension in a Palo database.

Palo dim output (deprecated)

Deprecated

Create/update dimension elements and element consolidations in a Palo database.

Big Data

Decode Parquet data formats and extract fields from the structure it defines.

Big Data

Map fields within data files and choose where you want to process those files.

Output

Execute an existing report file (.prpt).

Bulk loading

Bulk load PostgreSQL data.

Flow

Prioritize streams in an ordered way.

Utility

Process one file per row (copy, move, or delete). This step only accepts a filename as input.

Output

Write data to a properties file.

Input

Read data (key, value) from properties files.

Scripting

Map upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that kicks off your Python script.

Metadata Discovery

Retrieves metadata from a database connection.

Input

Uses the Metadata Query Engine (MQE) to query your Hitachi Content Platform (HCP) repository for objects, their URLs, and system metadata properties.

Statistics

Execute an R script within a PDI transformation.

Input

Identifies an HCP object by its URL path then specifies a target annotation name to read.

Metadata Discovery

Reads a binary fixed-length copybook definition file and outputs the file and column descriptor information as fields to PDI rows.

Read metadata

Deprecated

Search and retrieve metadata in the Pentaho Data Catalog that is associated with specific data resources that are registered in Data Catalog.

Scripting

Evaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups. A capturing-group example appears after this table.

Transform

Replace all occurrences of a word in a string with another word.

Statistics

Samples a fixed number of rows from the incoming stream.

Lookup

Consume RESTful services. Representational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs.

Transform

Denormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.

Transform

Flatten consecutive rows based on the order in which they appear in the input stream.

Transform

Normalise de-normalised information.

Input

Read RSS feeds.

Output

Read RSS stream.

Scripting

Execute a rule against each row (using Drools).

Scripting

Execute a rule against a set of rows (using Drools).

Utility

Run SSH commands and return the result.
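
The regular expression evaluation entry above mentions extracting new fields with capturing groups. The short Java example below shows what capturing groups do in general; it is illustrative only and does not reflect the step's configuration dialog:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroups {
    public static void main(String[] args) {
        // Split an email-like value into two new "fields" using capturing groups.
        Pattern pattern = Pattern.compile("(\\w+)@([\\w.]+)");
        Matcher matcher = pattern.matcher("jdoe@example.com");
        if (matcher.matches()) {
            String user = matcher.group(1);   // first capturing group
            String domain = matcher.group(2); // second capturing group
            System.out.println(user + " / " + domain);
        }
    }
}
```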

Steps: S - Z

Name
Category
Description

Input

Read from an S3 CSV file.

Output

Export data to a text file on an Amazon Simple Storage Service (S3).

Bulk loading

Perform bulk operations on Salesforce objects.

Output

Delete records in a Salesforce module.

Input

Read information from Salesforce.

Output

Insert records in a Salesforce module.

Output

Update records in a Salesforce module.

Output

Insert or update records in a Salesforce module.

Statistics

Filter rows based on the line number.

SAP input (deprecated)

Deprecated

Read data from SAP ERP, optionally with parameters.

Input

Reads files in the sas7bdat (SAS) native format.

Experimental

Calculate values by scripting in Ruby, Python, Groovy, JavaScript, and other scripting languages.

Transform

Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision.

Utility

Send messages to a Syslog server.

Output

Write rows of data to a data cube.

Transform

Replace value of a field with another value field.

Transform

Replace the value of a field with a constant.

Job

Set filenames in the result of this transformation. Subsequent job entries can then use this information.

Pentaho Server

Set the value of a session variable.

Job

Set environment variables based on a single input row.

Experimental

Upload a file or a stream file to a remote host via SFTP.

Flow

Refine your data for the Streamlined Data Refinery through the creation of dimensions which can be shared.

Mapping

Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation).

Flow

Execute a sequence of steps in a single thread.

Inline

Read a socket. A socket client that connects to a server (Socket Writer step).

Inline

Write a socket. A socket server that can send rows of data to a socket reader.

Transform

Sort rows based upon field values (ascending or descending).

Joins

Merge rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields.

Transform

Split a single string field by delimiter and create a new row for each split term.

Transform

Split a single field into more than one.

Transform

Read data from Splunk.

Transform

Write data to Splunk.

Output

Output SQL INSERT statements to a file.

Lookup

Look up values coming from another stream in the transformation.

Transform

Apply certain operations like trimming, padding, and others to a string value.

Transform

Cut out a snippet of a string.

Flow

Switch a row to a certain target step based on the case value in a field.

Output

Perform insert/update/delete in one go based on the value of a field.

Utility

Compare the data from two tables (provided they have the same layout). It finds differences between the data in the two tables and logs them.

Lookup

Check if a table exists on a specified connection.

Input

Read information from a database table.

Output

Write information to a database table.

Bulk loading

Bulk load Teradata Fastload data.

Bulk loading

Bulk load via TPT using the tbuild command.

Input

Read data from a text file in several formats. This data can then be passed to your next step(s).

Text file input (deprecated)

Deprecated

Replaced by Text File Input.

Output

Write rows to a text file.

Text file output (deprecated)

Deprecated

Replaced by Text File Output.

Flow

Run a PDI transformation, set parameters, and pass rows.

Transform

Remove double rows and leave only unique occurrences. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.

Transform

Remove double rows and leave only unique occurrences by using a HashSet. A minimal sketch of this idea appears after this table.

Statistics

Compute some simple stats based on a single input field.

Output

Update data in a database table based upon keys.

Scripting

Program a step using Java code.

Scripting

Calculate the result of a Java Expression using Janino.

Transform

Map values of a certain field from one value to another.

Bulk loading

Bulk load data into a Vertica table using their high performance COPY feature.

Lookup

Look up information using web services (WSDL).

Output

Write custom metadata fields to a Hitachi Content Platform object.

Write metadata

Deprecated

Add new metadata to metadata in the Pentaho Data Catalog that is associated with specific data resources.

Utility

Write data to log.

Input

Read records from an XBase type of database file (DBF).

Input

Process very large and complex XML files very fast.

Joins

Join a stream of XML-Tags into a target XML string.

Output

Write data to an XML file.

Validation

Validate XML source (files or streams) against XML Schema Definition.

Transform

Transform XML stream using XSL (eXtensible Stylesheet Language).

Input

Read YAML sources (files or streams), parse them, convert them to rows, and write these to one or more outputs.

Utility

Create a standard ZIP archive from the data stream fields.
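
The entry above that removes duplicate rows by using a HashSet does not require sorted input. Conceptually it relies on plain set membership, as in this minimal Java sketch (illustrative only, not PDI code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueRows {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("red", "green", "red", "blue", "green");

        // Set.add() returns false when the element was already seen,
        // so duplicates can be dropped without sorting the input first.
        Set<String> seen = new HashSet<>();
        for (String row : rows) {
            if (seen.add(row)) {
                System.out.println(row); // prints red, green, blue (first occurrences only)
            }
        }
    }
}
```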
