PDI transformation steps
Steps extend and expand the functionality of Pentaho Data Integration (PDI) transformations. You can use the following steps in PDI.
Steps: A - F
Transform
Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.
Streaming
Pull streaming data from an AMQP broker or clients through an AMQP transformation.
Flow
Refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on stream fields.
Output
Generate documentation automatically based on input in the form of a list of transformations and jobs.
Big Data
Decode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields.
Big Data
Serialize data from the PDI data stream into Avro binary or JSON format, then write it to a file.
Flow
Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.
Cassandra Input
Deprecated
No longer part of the PDI distribution. Contact Pentaho Support for details.
Cassandra Output
Deprecated
No longer part of the PDI distribution. Contact Pentaho Support for details.
Catalog Input
Deprecated
Read CSV text file formats of a Pentaho Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or Amazon S3 ecosystem, and then output the data as table rows that can be used by a transformation.
Catalog Output
Deprecated
Encode CSV text file formats by using the schema that is defined in PDI to create or replace a data resource in Pentaho Data Catalog and add metadata to the data resource.
Data Warehouse
Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all the fields.
Transform
Concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.
Discovery
Read binary data files that are mapped by a fixed-length COBOL copybook definition file.
Job
Write rows to the executing job. The information will then be passed to the next entry in this job.
Big Data
Retrieve all documents from a given view in a given design document from a given database.
Validation
Determine whether a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on).
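The LUHN10 (MOD-10) check that this step applies can be illustrated with a minimal Python sketch (this is not the step's implementation, and vendor detection by number prefix is omitted):

```python
def luhn_valid(card_number: str) -> bool:
    """Return True if the number passes the LUHN10 (MOD-10) check."""
    digits = [int(c) for c in card_number if c.isdigit()]
    total = 0
    # Starting from the rightmost digit, double every second digit and
    # subtract 9 when doubling produces a two-digit value.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True (a common VISA test number)
```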
Flow
Output one empty row if the input stream is empty (that is, when the input stream does not contain any rows).
Data Warehouse
Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.
Flow
Do nothing. This step is useful, however, when testing or in situations where you want to split streams.
Flow
Inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.
Example step (deprecated)
Deprecated
An example of a plugin test step.
Lookup
Find approximate matches to a string using matching algorithms. Read a field from the main stream and output the approximate matching value from the lookup stream.
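As an illustration of approximate matching (not the step's own algorithms, which include choices such as Levenshtein-style distances), Python's standard difflib can find the closest lookup value for an incoming string:

```python
import difflib

# Values from the lookup stream and one value from the main stream.
lookup_values = ["Pentaho", "Penthouse", "Pentagon", "Pantheon"]
incoming = "Pentah0"

# Return the closest approximate match above a similarity cutoff.
print(difflib.get_close_matches(incoming, lookup_values, n=1, cutoff=0.6))
# ['Pentaho']
```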
Steps: G - L
Input
Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field.
Transform
Retrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.
Streaming
Return records that were previously generated by another transformation in a job.
Input
List detailed information about transformations and/or jobs in a repository.
Job
Determine the values of certain (environment or Kettle) variables and put them in field values.
Google Analytics (deprecated)
Deprecated
Fetch data from a Google Analytics account. The replacement step is Google Analytics v4. Note: This step only works for Universal Analytics 360 customers until July 1, 2024.
Greenplum Bulk Loader (deprecated)
Deprecated
Bulk load Greenplum data. The replacement step is Greenplum Load.
Statistics
Build aggregates in a group-by fashion. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly.
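Why sorted input matters can be sketched with Python's itertools.groupby, which, like this step, only merges consecutive rows that share a key (an illustration, not the step's implementation):

```python
from itertools import groupby

rows = [("A", 1), ("A", 2), ("B", 3), ("A", 4)]  # not sorted on the key

# Unsorted input: the trailing "A" row forms its own group (A=3, B=3, A=4).
for key, group in groupby(rows, key=lambda r: r[0]):
    print(key, sum(value for _, value in group))

# Sorted input gives the expected aggregates (A=7, B=3).
for key, group in groupby(sorted(rows), key=lambda r: r[0]):
    print(key, sum(value for _, value in group))
```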
Big Data
Read data from a variety of different text-file types stored on a Hadoop cluster.
Big Data
Write data to a variety of different text-file types stored on a Hadoop cluster.
Lookup
Call a web service over HTTP by supplying a base URL and allowing parameters to be set dynamically.
Lookup
Call a web service request over HTTP by supplying a base URL and allowing parameters to be set dynamically.
Bulk loading
Interface with the Ingres VectorWise Bulk Loader "COPY TABLE" command.
Joins
Output the cartesian product of the input streams. The number of output rows is the product of the numbers of rows in the input streams.
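A quick sketch of the row-count arithmetic: with 3 rows in one stream and 2 in the other, the cartesian product contains 3 × 2 = 6 rows (illustrative Python, not the step itself):

```python
from itertools import product

stream_a = [{"color": c} for c in ("red", "green", "blue")]  # 3 rows
stream_b = [{"size": s} for s in ("S", "M")]                 # 2 rows

# Every row of stream_a is combined with every row of stream_b.
joined = [{**a, **b} for a, b in product(stream_a, stream_b)]
print(len(joined))  # 6 rows: 3 * 2
```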
Input
Extract relevant portions out of JSON structures (file or incoming field) and output rows.
Streaming
Run a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time.
Streaming
Publish messages in near-real-time across worker nodes where multiple, subscribed members have access.
Streaming
Extract data from a specific stream located within the Amazon Kinesis Data Streams service.
Streaming
Push data to an existing region and stream located within the Amazon Kinesis Data Streams service.
Output
Perform insert, upsert, update, add, or delete operations on records based on their DN (Distinguished Name).
LucidDB streaming loader (deprecated)
Deprecated
Load data into LucidDB by using Remote Rows UDX.
Steps: M - R
Mapping
Run a mapping (sub-transformation). Use MappingInput and MappingOutput to specify the fields interface.
Lookup
Lookup an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization.
Statistics
Build aggregates in a group by fashion. This step doesn't require sorted input.
Joins
Join two streams on a given key and output a joined set. The input streams must be sorted on the join key.
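A minimal sketch of a sorted merge join (inner-join variant, duplicate join keys not handled) shows why both inputs must already be sorted on the key; this is illustrative Python, not the step's implementation:

```python
def merge_join(left, right, key):
    """Inner-join two row lists that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1          # advance the left stream
        elif left[i][key] > right[j][key]:
            j += 1          # advance the right stream
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out

left = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
right = [{"id": 2, "city": "Lyon"}, {"id": 3, "city": "Oslo"}]
print(merge_join(left, right, "id"))
# [{'id': 2, 'name': 'bob', 'city': 'Lyon'}]
```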
Joins
Merge two streams of rows, sorted on a certain key. The two streams are compared, and the identical, changed, deleted, and new rows are flagged.
Input
Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian).
Bulk loading
Load data into MonetDB by using their bulk load command in streaming mode.
Streaming
Pull streaming data from an MQTT broker or clients through an MQTT transformation.
Input
Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j.
OpenERP object delete (deprecated)
Deprecated
Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function.
OpenERP object input (deprecated)
Deprecated
Retrieve data from the OpenERP server using the XMLRPC interface with the 'read' function.
OpenERP object output (deprecated)
Deprecated
Update data on the OpenERP server using the XMLRPC interface and the 'import' function.
Big Data
Serialize data from the PDI data stream into the ORC file format and write it to a file.
Palo cell input (deprecated)
Deprecated
Retrieve all cell data from a Palo cube.
Palo cell output (deprecated)
Deprecated
Update cell data in a Palo cube.
Palo dim input (deprecated)
Deprecated
Return elements from a dimension in a Palo database.
Palo dim output (deprecated)
Deprecated
Create/update dimension elements and element consolidations in a Palo database.
Big Data
Map fields within data files and choose where you want to process those files.
Utility
Process one file per row (copy, move, or delete). This step accepts only a filename as input.
Scripting
Map upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that kicks off your Python script.
Input
Use the Metadata Query Engine (MQE) to query your Hitachi Content Platform (HCP) repository for objects, their URLs, and system metadata properties.
Input
Identify an HCP object by its URL path, then specify a target annotation name to read.
Metadata Discovery
Read a binary fixed-length copybook definition file and output the file and column descriptor information as fields to PDI rows.
Read metadata
Deprecated
Search and retrieve metadata in the Pentaho Data Catalog that is associated with specific data resources that are registered in Data Catalog.
Scripting
Evaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups.
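For example, capturing groups can split one field into several new fields; the sketch below uses Python's re module purely to illustrate the idea (the field layout and group names are hypothetical):

```python
import re

# Split a "lastname, firstname" field into two new fields using named groups.
pattern = re.compile(r"^(?P<last>[^,]+),\s*(?P<first>.+)$")

match = pattern.match("Doe, Jane")
if match:
    print(match.group("last"), match.group("first"))  # Doe Jane
```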
Lookup
Consume RESTful services. REpresentational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs.
Transform
Denormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.
Transform
Flatten consecutive rows based on the order in which they appear in the input stream.
Steps: S - Z
SAP input (deprecated)
Deprecated
Read data from SAP ERP, optionally with parameters.
Experimental
Calculate values by scripting in Ruby, Python, Groovy, JavaScript, and other scripting languages.
Transform
Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision.
Job
Set filenames in the result of this transformation. Subsequent job entries can then use this information.
Flow
Refine your data for the Streamlined Data Refinery through the creation of dimensions which can be shared.
Mapping
Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation).
Joins
Merge rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields.
Transform
Split a single string field by delimiter and create a new row for each split term.
Transform
Apply certain operations, such as trimming and padding, to a string value.
Utility
Compare the data from two tables (provided they have the same layout). This step finds differences between the data in the two tables and logs them.
Input
Read data from a text file in several formats. This data can then be passed to your next step(s).
Transform
Remove duplicate rows and leave only unique occurrences. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly.
Transform
Remove duplicate rows and leave only unique occurrences by using a HashSet.
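The trade-off between the two unique-row approaches can be sketched in Python (illustrative only): comparing each row with the previous one needs sorted input but almost no memory, while a hash set needs no sorting but keeps every seen value in memory.

```python
def unique_sorted(rows):
    """Drop duplicates by comparing each row with the previous one (needs sorted input)."""
    previous = object()
    for row in rows:
        if row != previous:
            yield row
        previous = row

def unique_hashset(rows):
    """Drop duplicates with a set of seen values (no sorting needed, more memory)."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

data = ["b", "a", "a", "b"]
print(list(unique_sorted(data)))   # ['b', 'a', 'b'] (unsorted input leaves a duplicate)
print(list(unique_hashset(data)))  # ['b', 'a']
```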
Bulk loading
Bulk load data into a Vertica table using their high performance COPY feature.
Output
Write custom metadata fields to a Hitachi Content Platform object.
Write metadata
Deprecated
Add new metadata to metadata in the Pentaho Data Catalog that is associated with specific data resources.
Input
Read a YAML source (file or stream), parse it, convert it to rows, and write these to one or more outputs.