PDI transformation steps

Steps extend and expand the functionality of Pentaho Data Integration (PDI) transformations. You can use the following steps in PDI.

Steps: A - F

Name
Category
Description

Flow

Abort a transformation.

Transform

Add a checksum column for each input row.

Transform

Add one or more constants to the input rows.

Transform

Get the next value from a sequence.

Transform

Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.

Transform

Encode several fields into an XML fragment.

Streaming

Pull streaming data from an AMQP broker or clients through an AMQP transformation.

Streaming

Publish messages in near-real-time to an AMQP broker.

Statistics

Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST).

Flow

Refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on stream fields.

Flow

Append two streams in an ordered way.

Data Mining

Write data in ARFF format to a file.

Output

Generate documentation automatically based on input in the form of a list of transformations and jobs.

Big Data

Decode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields.

Avro input (deprecated)

Deprecated

Replaced by Avro Input.

Big Data

Serialize data into Avro binary or JSON format from the PDI data stream, then write it to a file.

Flow

Block this step until selected steps finish.

Flow

Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.

Transform

Create new fields by performing simple calculations.

Lookup

Get back information by calling a database procedure.

Pentaho Server

Call API endpoints from the Pentaho Server within a PDI transformation.

Cassandra Input

Deprecated

No longer a part of the PDI distribution. Contact Pentaho Support for details.

Cassandra Output

Deprecated

No longer a part of the PDI distribution. Contact Pentaho Support for details.

Catalog Input

Deprecated

Read CSV text file formats of a Pentaho Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or Amazon S3 ecosystem, and then output the data as table rows that can be used by a transformation.

Catalog Output

Deprecated

Encode CSV text file formats by using the schema that is defined in PDI to create or replace a data resource in Pentaho Data Catalog and add metadata to the data resource.

Utility

Change file encoding and create a new file.

Lookup

Check if a column exists in a table on a specified connection.

Lookup

Check if a file is locked by another process.

Lookup

Check if a webservice is available.

Utility

Clone a row as many times as needed.

Transform

Generate a closure table using parent-child relationships.

Data Warehouse

Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all the fields.

Transform

Concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.

Discovery

Reads binary data files that are mapped by a fixed-length COBOL copybook definition file.

Job

Write rows to the executing job. The information will then be passed to the next entry in this job.

Big Data

Retrieve all documents from a given view in a given design document from a given database.

Validation

Determines if a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on). A small sketch of the Luhn check appears after this table.

Input

Read from a simple CSV file input.

Input

Enter rows of static data in a grid, usually for testing, reference, or demo purposes.

Validation

Validates passing data based on a set of rules.

Lookup

Execute a database query using stream values as parameters.

Lookup

Look up values in a database using field values.

Input

Read rows of data from a data cube.

Utility

Output each input row after a delay.

Output

Permanently removes a row from a database.

Flow

Output one empty row if the input stream is empty (i.e., when the input stream does not contain any rows).

Input

Determines the structure of delimited text files.

Data Warehouse

Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.

Flow

Does not do anything. It is useful, however, when testing things or in certain situations where you want to split streams.

Lookup

Execute a dynamic SQL statement built in a previous field.

Utility

Convert an Edifact message to XML to simplify data extraction.

ElasticSearch Bulk Insert (deprecated)

Deprecated

Bulk loading

Perform bulk inserts into Elasticsearch.

Input

Read POP3/IMAP server and retrieve messages.

Input

Read shape file data from an ESRI shape file and linked DBF file.

Flow

Inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.

Example step (deprecated)

Deprecated

An example of a plugin test step.

Utility

Execute a process and return the result.

Scripting

Execute an SQL statement or file for every input row.

Scripting

Execute an SQL script, optionally parameterized using input rows.

Input

Parses hierarchical data type fields coming from a previous step.

Lookup

Check if a file exists.

Flow

Filter rows using simple equations.

Input

Read from a fixed file input.

Scripting

Calculate a formula using Pentaho's libformula.

Lookup

Find the approximate matches to a string using matching algorithms. Read a field from a main stream and output the approximate value from the lookup stream.
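
The credit card validation entry above refers to the LUHN10 (MOD-10) checksum. As an illustration of that algorithm only, and not of the step's own implementation, a minimal Java sketch (with a hypothetical class name) might look like this:

```java
public class LuhnCheck {

    // Returns true if the digit string passes the Luhn (MOD-10) checksum.
    public static boolean isValid(String number) {
        int sum = 0;
        boolean doubleDigit = false;
        // Walk the digits right to left, doubling every second one.
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = Character.getNumericValue(number.charAt(i));
            if (doubleDigit) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // equivalent to adding the two digits of d
                }
            }
            sum += d;
            doubleDigit = !doubleDigit;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        // 4111111111111111 is a commonly used test number that passes the check.
        System.out.println(isValid("4111111111111111")); // true
    }
}
```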

Steps: G - L

Name
Category
Description

Input

Generate random valid (Luhn check) credit card numbers.

Input

Generate random value.

Input

Generate a number of empty or equal rows.

Input

Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field.

Input

Get file names from the operating system and send them to the next step.

Job

Read filenames used or generated in a previous entry in a job.

Input

Get the row count of files.

Transform

Retrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.

Streaming

Return records that were previously generated by another transformation in a job.

Input

List detailed information about transformations and/or jobs in a repository.

Job

Read rows from a previous entry in a job.

Pentaho Server

Retrieve the value of a session variable.

Input

Read a parent folder and return all subfolders.

Input

Get information from the system like system date, arguments, etc.

Input

Get table names from database connection and send them to the next step.

Job

Determine the values of certain (environment or Kettle) variables and put them in field values.

Google Analytics (deprecated)

Deprecated

Fetch data from a Google Analytics account. Replacement step is Google Analytics v4. Note: This step will only work for Universal Analytics 360 customers until July 1, 2024.

Input

Fetch data from Google Analytics account.

Greenplum Bulk Loader (deprecated)

Deprecated

Bulk load Greenplum data. Replacement step is Greenplum Load.

Bulk loading

Bulk load Greenplum data.

Statistics

Build aggregates in a group by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly. A streaming-aggregation sketch illustrating why appears after this table.

Input

Read in parallel from a GZIP CSV file.

Big Data

Read data from a variety of different text-file types stored on a Hadoop cluster.

Big Data

Write data to a variety of different text-file types stored on a Hadoop cluster.

Big Data

Read from an HBase column family.

Big Data

Write to an HBase column family.

Big Data

Decodes an incoming key and HBase result object to a mapping.

Input

Load JSON data into PDI from a previous step or from a file.

Output

Converts hierarchical data from a previous step into JSON format.

Input

Read data from HL7 data streams.

Lookup

Call a web service over HTTP by supplying a base URL, optionally allowing parameters to be set dynamically.

Lookup

Call a web service request over HTTP by supplying a base URL, optionally allowing parameters to be set dynamically.

Deprecated

Receive messages from any IBM WebSphere MQ Server.

Deprecated

Send messages to any IBM WebSphere MQ Server.

Flow

Mark the last row.

Utility

Set a field value to a constant if it is null.

Bulk loading

Load data to an Infobright database table.

Bulk loading

Interface with the Ingres VectorWise Bulk Loader "COPY TABLE" command.

Inline

Inject rows into the transformation through the Java API.

Output

Update or insert rows in a database based upon keys.

Flow

Filter rows using Java code.

Streaming

Receive messages from a JMS server.

Streaming

Send messages to a JMS server.

Flow

Run a PDI job, and pass parameters and rows.

Joins

Output the Cartesian product of the input streams. The number of rows is the product of the number of rows in the input streams.

Input

Extract relevant portions out of JSON structures (file or incoming field) and output rows.

Output

Create JSON block and output it in a field to a file.

Streaming

Run a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time.

Streaming

Publish messages in near-real-time across worker nodes where multiple, subscribed members have access.

Streaming

Extract data from a specific stream located within the Amazon Kinesis Data Streams service.

Streaming

Push data to an existing region and stream located within the Amazon Kinesis Data Streams service.

Data Mining

Executes a Knowledge Flow data mining process.

Input

Read data from LDAP host.

Output

Perform insert, upsert, update, add, or delete operations on records based on their DN (Distinguished Name).

Input

Read data from LDIF files.

Input

Load file content in memory.

LucidDB streaming loader (deprecated)

Deprecated

Load data into LucidDB by using Remote Rows UDX.
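
The entry above that builds aggregates in a group by fashion works only on sorted input because a streaming aggregator can close a group as soon as the key value changes; with unsorted input, only consecutive rows that share a key end up in the same group. The following simplified Java sketch illustrates the idea. It is not PDI code, and the Row type is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class StreamingGroupBy {

    // Hypothetical row type: a grouping key plus a numeric value.
    record Row(String key, double value) {}

    public static void main(String[] args) {
        // The input must already be sorted on the grouping key for this to be correct.
        List<Row> sortedRows = Arrays.asList(
                new Row("A", 1.0), new Row("A", 2.0), new Row("B", 5.0));

        String currentKey = null;
        double sum = 0;
        long count = 0;
        for (Row row : sortedRows) {
            if (currentKey != null && !currentKey.equals(row.key())) {
                // The key changed, so the previous group is complete and can be emitted.
                System.out.printf("%s: sum=%.1f count=%d%n", currentKey, sum, count);
                sum = 0;
                count = 0;
            }
            currentKey = row.key();
            sum += row.value();
            count++;
        }
        if (currentKey != null) {
            // Emit the final group.
            System.out.printf("%s: sum=%.1f count=%d%n", currentKey, sum, count);
        }
    }
}
```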

Steps: M - R

Name
Category
Description

Utility

Send e-mail.

Validation

Check if an email address is valid.

Mapping

Run a mapping (sub-transformation), use MappingInput and MappingOutput to specify the fields interface.

Mapping

Specify the input interface of a mapping.

Mapping

Specify the output interface of a mapping.

Big Data

Enter Key Value pairs from Hadoop MapReduce.

Big Data

Exit Key Value pairs, then push into Hadoop MapReduce.

Lookup

Lookup an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization.

Statistics

Build aggregates in a group by fashion. This step doesn't require sorted input.

Joins

Join two streams on a given key and outputs a joined set. The input streams must be sorted on the join key.

Joins

Merge two streams of rows, sorted on a certain key. The two streams are compared and the equals, changed, deleted and new rows are flagged.

Utility

Read the metadata of the incoming stream.

Input

Read data from a Microsoft Access file.

Output

Store records into an MS-Access database table.

Input

Read data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS).

Microsoft Excel Output (deprecated)

Deprecated

Output

Write or append data to an Excel file.

Scripting

Run JavaScript programs (and much more).

Input

Build complex hierarchical data.

Input

Modifies hierarchical data to form nested JSON key-value pairs.

Input

Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian).

Bulk loading

Load data into MonetDB by using their bulk load command in streaming mode.

Big Data

Connects to a MongoDB cluster and executes Mongo shell-style commands.

Big Data

Read all entries from a MongoDB collection in the specified database.

Big Data

Write to a MongoDB collection.

Streaming

Pull streaming data from an MQTT broker or clients through an MQTT transformation.

Streaming

Publish messages in near-real-time to an MQTT broker.

Joins

Join multiple streams. This step supports INNER and FULL OUTER joins.

Bulk loading

Load data over a named pipe (not available on MS Windows).

Utility

Set a field value to null if it is equal to a constant value.

Transform

Create ranges based on numeric field.

Input

Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j.

OpenERP object delete (deprecated)

Deprecated

Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function.

OpenERP object input (deprecated)

Deprecated

Retrieve data from the OpenERP server using the XMLRPC interface with the 'read' function.

OpenERP object output (deprecated)

Deprecated

Update data on the OpenERP server using the XMLRPC interface and the 'import' function.

Bulk loading

Use Oracle Bulk Loader to load data.

Big Data

Read field data from ORC files into a PDI data stream.

Big Data

Serialize data from the PDI data stream into the ORC file format and write it to a file.

Statistics

Return metrics for one or several steps.

Palo cell input (deprecated)

Deprecated

Retrieve all cell data from a Palo cube.

Palo cell output (deprecated)

Deprecated

Update cell data in a Palo cube.

Palo dim input (deprecated)

Deprecated

Return elements from a dimension in a Palo database.

Palo dim output (deprecated)

Deprecated

Create/update dimension elements and element consolidations in a Palo database.

Big Data

Decode Parquet data formats and extract fields from the structure it defines.

Big Data

Map fields within data files and choose where you want to process those files.

Output

Execute an existing report file (.prpt).

Bulk loading

Bulk load PostgreSQL data.

Flow

Prioritize streams in an ordered way.

Utility

Process one file per row (copy, move, or delete). This step only accepts a filename as input.

Output

Write data to a properties file.

Input

Read data (key, value) from properties files.

Scripting

Map upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that kicks off your Python script.

Metadata Discovery

Retrieves metadata from a database connection.

Input

Uses the Metadata Query Engine (MQE) to query your Hitachi Content Platform (HCP) repository for objects, their URLs, and system metadata properties.

Statistics

Execute an R script within a PDI transformation.

Input

Identifies an HCP object by its URL path then specifies a target annotation name to read.

Metadata Discovery

Reads a binary fixed-length copybook definition file and outputs the file and column descriptor information as fields to PDI rows.

Read metadata

Deprecated

Search and retrieve metadata in the Pentaho Data Catalog that is associated with specific data resources that are registered in Data Catalog.

Scripting

Evaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups. A capturing-group example appears after this table.

Transform

Replace all occurrences of a word in a string with another word.

Statistics

Samples a fixed number of rows from the incoming stream.

Lookup

Consume RESTful services. Representational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs.

Transform

Denormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.

Transform

Flatten consecutive rows based on the order in which they appear in the input stream.

Transform

Normalise de-normalised information.

Input

Read RSS feeds.

Output

Read RSS stream.

Scripting

Execute a rule against each row (using Drools).

Scripting

Execute a rule against a set of rows (using Drools).

Utility

Run SSH commands and return the result.
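
The regular expression evaluation entry above mentions extracting new fields with capturing groups. The short Java example below shows what capturing groups do in general; it is illustrative only and does not reflect the step's configuration dialog:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroups {
    public static void main(String[] args) {
        // Split an email-like value into two new "fields" using capturing groups.
        Pattern pattern = Pattern.compile("(\\w+)@([\\w.]+)");
        Matcher matcher = pattern.matcher("jdoe@example.com");
        if (matcher.matches()) {
            String user = matcher.group(1);   // first capturing group
            String domain = matcher.group(2); // second capturing group
            System.out.println(user + " / " + domain);
        }
    }
}
```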

Steps: S - Z

Name
Category
Description

Input

Read from an S3 CSV file.

Output

Export data to a text file on an Amazon Simple Storage Service (S3).

Bulk loading

Perform bulk operations on Salesforce objects.

Output

Delete records in a Salesforce module.

Input

Read information from Salesforce.

Output

Insert records in a Salesforce module.

Output

Update records in a Salesforce module.

Output

Insert or update records in a Salesforce module.

Statistics

Filter rows based on the line number.

SAP input (deprecated)

Deprecated

Read data from SAP ERP, optionally with parameters.

Input

Reads files in the sas7bdat (SAS) native format.

Experimental

Calculate values by scripting in Ruby, Python, Groovy, JavaScript, and other scripting languages.

Transform

Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision.

Utility

Send messages to a Syslog server.

Output

Write rows of data to a data cube.

Transform

Replace value of a field with another value field.

Transform

Replace the value of a field with a constant.

Job

Set filenames in the result of this transformation. Subsequent job entries can then use this information.

Pentaho Server

Set the value of a session variable.

Job

Set environment variables based on a single input row.

Experimental

Upload a file or a stream file to a remote host via SFTP.

Flow

Refine your data for the Streamlined Data Refinery through the creation of dimensions which can be shared.

Mapping

Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation).

Flow

Execute a sequence of steps in a single thread.

Inline

Read a socket. A socket client that connects to a server (Socket Writer step).

Inline

Write a socket. A socket server that can send rows of data to a socket reader.

Transform

Sort rows based upon field values (ascending or descending).

Joins

Merge rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields.

Transform

Split a single string field by delimiter and create a new row for each split term.

Transform

Split a single field into more than one.

Transform

Read data from Splunk.

Transform

Write data to Splunk.

Output

Output SQL INSERT statements to a file.

Lookup

Look up values coming from another stream in the transformation.

Transform

Apply certain operations like trimming, padding, and others to a string value.

Transform

Cut out a snippet of a string.

Flow

Switch a row to a certain target step based on the case value in a field.

Output

Perform insert/update/delete in one go based on the value of a field.

Utility

Compare the data from two tables (provided they have the same layout). It finds differences between the data in the two tables and logs them.

Lookup

Check if a table exists on a specified connection.

Input

Read information from a database table.

Output

Write information to a database table.

Bulk loading

Bulk load Teradata Fastload data.

Bulk loading

Bulk load via TPT using the tbuild command.

Input

Read data from a text file in several formats. This data can then be passed to your next step(s).

Text file input (deprecated)

Deprecated

Replaced by Text File Input.

Output

Write rows to a text file.

Text file output (deprecated)

Deprecated

Replaced by Text File Output.

Flow

Run a PDI transformation, set parameters, and pass rows.

Transform

Remove double rows and leave only unique occurrences. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.

Transform

Remove double rows and leave only unique occurrences by using a HashSet. A minimal sketch of this idea appears after this table.

Statistics

Compute some simple stats based on a single input field.

Output

Update data in a database table based upon keys.

Scripting

Program a step using Java code.

Scripting

Calculate the result of a Java Expression using Janino.

Transform

Map values of a certain field from one value to another.

Bulk loading

Bulk load data into a Vertica table using their high performance COPY feature.

Lookup

Look up information using web services (WSDL).

Output

Write custom metadata fields to a Hitachi Content Platform object.

Write metadata

Deprecated

Add new metadata to metadata in the Pentaho Data Catalog that is associated with specific data resources.

Utility

Write data to log.

Input

Read records from an XBase type of database file (DBF).

Input

Process very large and complex XML files very fast.

Joins

Join a stream of XML-Tags into a target XML string.

Output

Write data to an XML file.

Validation

Validate XML source (files or streams) against XML Schema Definition.

Transform

Transform XML stream using XSL (eXtensible Stylesheet Language).

Input

Read YAML sources (files or streams), parse them, convert them to rows, and write these to one or more outputs.

Utility

Create a standard ZIP archive from the data stream fields.
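
The entry above that removes duplicate rows by using a HashSet does not require sorted input. Conceptually it relies on plain set membership, as in this minimal Java sketch (illustrative only, not PDI code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueRows {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("red", "green", "red", "blue", "green");

        // Set.add() returns false when the element was already seen,
        // so duplicates can be dropped without sorting the input first.
        Set<String> seen = new HashSet<>();
        for (String row : rows) {
            if (seen.add(row)) {
                System.out.println(row); // prints red, green, blue (first occurrences only)
            }
        }
    }
}
```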
