Working with transformations

Create, configure, and run transformations to perform ETL tasks as part of a workflow.

After you run a transformation, review results to validate output and troubleshoot issues.

Note: You can schedule a transformation to run once or on a recurring schedule. See Schedule a transformation or job.

Create a transformation

Create a transformation to arrange a network of logical ETL tasks, called steps, into a data workflow.

To create a transformation in Pipeline Designer, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the Transformation card, select Create Transformation.

    A new, blank transformation opens with the Design pane selected.

  4. Add steps to the transformation:

    1. In the Design pane, search for or browse to each step you want to use.

      You may need to expand sections in the Design pane to find steps.

    2. Drag the steps you want to use onto the canvas.

  5. Hover over a step to open the step menu.

    Select one of these options:

    • Delete: Delete the step from the canvas.

    • Edit: Open the step window and configure step properties.

      To learn about the step, in the lower-left corner, select Help.

    • Duplicate: Add a copy of the step to the canvas.

    • More Actions > Change Number of Copies: Set how many parallel copies run.

      In Number of copies (1 or higher), you can enter a number or a variable.

      To insert a variable, select the Select variable to insert icon.

    • More Actions > Data Movement: Choose how rows move to the next steps.

      • Round-Robin: Distribute rows evenly across step copies.

      • Load Balance: Route rows to the step copy with the lightest load.

      • Copy Data to Next Steps: Send a copy of each row to all of the next steps.

  6. Add hops between steps:

    Hover over a step handle until a plus sign (+) appears. Drag the connection to the handle of another step.

  7. Optional: Add a note on the canvas:

    In the canvas toolbar, select Add Note. In Notes, enter your note, then select Save.

    You can format the note in Notes. Select Style, then set font, color, and shadow options.

  8. Save the transformation:

    1. Select Save.

      The Select File or Directory dialog box opens.

    2. Search for or browse to the repository folder where you want to save the transformation.

    3. Optional: Create a folder.

      Select New Folder, enter a folder name, then select Save.

    4. Optional: Delete a folder.

      Select the folder, then select Delete.

    5. In Select File or Directory, select Save.

      The Save Change dialog box opens.

    6. Select Yes to confirm.
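The saved transformation is a KTR file: an XML document that records the steps and the hops that connect them. If you embed the PDI engine, the same structure can be built with the classic Kettle Java API (org.pentaho.di). The following is a minimal sketch only; the step choices (a row generator feeding a dummy step) are illustrative assumptions, not part of the procedure above.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;
import org.pentaho.di.trans.TransHopMeta;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
import org.pentaho.di.trans.steps.rowgenerator.RowGeneratorMeta;

public class BuildTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                       // initialize the engine and plugins
        PluginRegistry registry = PluginRegistry.getInstance();

        TransMeta transMeta = new TransMeta();
        transMeta.setName("example");                   // illustrative name

        // Two steps: a row generator feeding a dummy step
        RowGeneratorMeta genMeta = new RowGeneratorMeta();
        genMeta.setDefault();
        StepMeta generator = new StepMeta(
            registry.getPluginId(StepPluginType.class, genMeta), "Generate rows", genMeta);
        transMeta.addStep(generator);

        DummyTransMeta dummyMeta = new DummyTransMeta();
        StepMeta dummy = new StepMeta(
            registry.getPluginId(StepPluginType.class, dummyMeta), "Dummy", dummyMeta);
        transMeta.addStep(dummy);

        // A hop carries rows from one step to the next
        transMeta.addTransHop(new TransHopMeta(generator, dummy));

        // This XML is what a .ktr file contains
        System.out.println(transMeta.getXML());
    }
}
```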

Edit transformation properties

Transformation properties describe the transformation and control its behavior. Use properties to customize processing and control output.

To edit transformation properties, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    • Search for or browse to the transformation, then select Open.

    • Alternatively, select Open files, and in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Settings.

    The Transformation Properties window opens.

  6. Edit properties in each tab.

    For details, see Transformation properties.

  7. If you updated options in the Logging tab, generate SQL for the logging table:

    1. Select SQL.

      The Simple SQL editor opens, showing the DDL needed to create the logging table or to update an existing one.

    2. Optional: Edit the SQL statements.

      See Use the SQL Editor.

    3. Optional: Select Clear cache to remove cached query results and metadata.

    4. Select Execute.

  8. Select Save.

Transformation properties

The Transformation Properties window has these tabs:

Transformation tab

Use the Transformation tab to specify general properties.

• Transformation name: The name of the transformation. Required to save the settings to the repository.

• Transformation filename: The file name (*.ktr) of the transformation.

• Description: Short description of the transformation, shown in the repository explorer.

• Extended description: Longer description of the transformation.

• Status: Draft or production status.

• Version: Version description.

• Directory: The repository directory where the transformation is stored.

• Created by: Name of the original creator.

• Created at: Date and time when the transformation was created.

• Last modified by: Username of the user who last modified the transformation.

• Last modified at: Date and time when the transformation was last modified.

Parameters tab

Use the Parameters tab to add parameters.

• Parameter: A local variable that can be shared across all steps in the transformation.

• Default Value: Value used if the parameter is not set elsewhere.

• Description: Description of the user-defined parameter.
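For embedders, a parameter corresponds to a named parameter in the classic Kettle Java API: it is declared with a default value and can be overridden at run time. A minimal sketch, assuming org.pentaho.di on the classpath; the parameter name START_DATE and the file path are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ParameterExample {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        // Equivalent of a Parameters tab row: name, default value, description
        transMeta.addParameterDefinition("START_DATE", "2024-01-01", "Lower bound for extraction");

        Trans trans = new Trans(transMeta);
        trans.setParameterValue("START_DATE", "2024-06-01");  // override the default at run time
        trans.activateParameters();                           // push parameter values into variables
        trans.execute(null);
        trans.waitUntilFinished();
    }
}
```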

Logging tab

Use the Logging tab to configure how and where logging information is captured.

In the left navigation pane, select which type of logging you want to use.

• Log Connection: Database connection used for logging. To configure a new connection, select New.

• Log table schema: Schema name, if supported by your database.

• Log table name: Name of the log table. If you also use job logging, use a different table name for transformation logging.

• Logging interval (seconds): How often log records are written to the table. Applies only to the Transformation and Performance logging types.

• Log record timeout (in days): Number of days to keep log entries before they are deleted. Applies only to the Transformation and Performance logging types. If log data is not deleted as expected, see Log table data is not deleted.

• Log size limit (in lines): Limit on the number of lines stored in the LOG_FIELD (CLOB) column. Applies only to the Transformation logging type.

• Fields to log: The fields to include in the log table. LOGDATE and TRANSNAME are enabled by default so that expired log records can be deleted. Monitoring LOG_FIELD can negatively impact Pentaho Server performance. If you do not select all fields, including LOG_FIELD, information about this transformation does not appear in Operations Mart logging.
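For embedders, the Logging tab fields correspond to the transformation log table in the engine. A minimal sketch, assuming the classic Kettle Java API; the connection, schema, and table names are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.TransLogTable;
import org.pentaho.di.trans.TransMeta;

public class ConfigureLogTable {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");  // hypothetical path

        // Rough equivalents of the Logging tab fields for the Transformation logging type
        TransLogTable logTable = transMeta.getTransLogTable();
        logTable.setConnectionName("logging_db");  // hypothetical Log Connection
        logTable.setSchemaName("etl");             // hypothetical Log table schema
        logTable.setTableName("trans_log");        // hypothetical Log table name
        logTable.setLogInterval("5");              // Logging interval (seconds)
        logTable.setTimeoutInDays("30");           // Log record timeout (in days)
    }
}
```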

Dates tab

Use the Dates tab to configure the date range that the transformation obtains from a database connection.

• Maxdate connection: Database connection used to get the upper limit of the date range.

• Maxdate table: Table used to get the upper limit of the date range.

• Maxdate field: Field used to get the upper limit of the date range.

• Maxdate offset (seconds): Offset applied to the upper date limit, in seconds. For example, -60 subtracts 60 seconds.

• Maximum date difference (seconds): Maximum allowed difference in the obtained date range, in seconds. Use it to limit job sizes.

Dependencies tab

Use the Dependencies tab to enter transformation dependencies.

For example, if a dimension depends on three lookup tables, you can track those tables. If values in the lookup tables change, the date range is extended to force a full refresh. Dependencies are also useful for detecting changes when a table has a "date last changed" column.

Select Get dependencies to detect dependencies automatically.

• Connection: Database connection that is already defined for the transformation.

• Table: Table from the selected database connection.

• Field: Field within the selected table.

Miscellaneous tab

Use the Miscellaneous tab to configure buffer and feedback sizes, transactional behavior, and other engine settings.

• Number of rows in rowset: Buffer size between connected steps. Change it only if you are running low on memory.

• Show a feedback row in transformation steps?: Add feedback entries to the log file during execution. Enabled by default, with a default frequency of every 5000 rows.

• The feedback size: Number of rows to process before adding a feedback entry. Increase it for large datasets.

• Make the transformation database transactional: Open one unique connection for each database connection defined and used in the transformation. Enable it so that a failed transformation can be fully rolled back, or when you need to alter connection settings before a query with an Execute SQL script step. When enabled, the transformation-wide commit happens when the last step finishes, a rollback happens if the transformation fails, and step commit sizes are ignored. See Database transactions in jobs and transformations.

• Shared objects file: Location of the XML file that stores shared objects, such as database connections and clustering schemas.

• Manage thread priorities?: Enable or disable dynamic adjustment of Java thread priorities based on buffer levels. Disable it if the overhead outweighs the benefit.
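In the engine, the transactional option is known as unique connections. A minimal sketch of setting it programmatically, assuming the classic Kettle Java API and a hypothetical file path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.TransMeta;

public class TransactionalOption {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr"); // hypothetical path

        // "Unique connections" is the engine-level name for
        // "Make the transformation database transactional":
        // one connection per database, commit at the end, rollback on failure.
        transMeta.setUsingUniqueConnections(true);
    }
}
```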

Monitoring tab

Use the Monitoring tab to enable step performance monitoring.

• Enable step performance monitoring?: Collect performance metrics for each step. Metrics appear in the Performance Graph tab in the Execution Results panel.

• Step performance measurement interval (ms): Interval between performance snapshots, in milliseconds. For example, 10 ms.

• Maximum number of snapshots in memory: Maximum number of measurement snapshots held in memory during runtime.
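These fields also have programmatic equivalents. A sketch, assuming the classic Kettle Java API; the method names reflect the engine's snapshot-based monitoring, and the file path is hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.TransMeta;

public class EnableStepMonitoring {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        // Rough equivalents of the Monitoring tab fields
        transMeta.setCapturingStepPerformanceSnapShots(true); // enable per-step metrics
        transMeta.setStepPerformanceCapturingDelay(10L);      // snapshot interval in ms
    }
}
```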

Run a transformation

Run a transformation to step through ETL tasks in the order specified by the transformation.

Complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    • Search for or browse to the transformation, then select Open.

    • Alternatively, select Open files, and in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Run.

    Select one of these options:

    • Select Run.

    • Or select Run Options. In Run Options, set options, then select Run. For details, see Transformation run options.

The transformation runs. The Preview panel opens with the Logging tab selected.
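Running from the UI executes the saved KTR file on the server. For comparison, here is a minimal sketch of the same run using the classic Kettle Java API, with a hypothetical file path.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        Trans trans = new Trans(transMeta);
        trans.execute(null);          // start all step threads
        trans.waitUntilFinished();    // block until the last step completes

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```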

Note: To stop a transformation while it is running, see Stop transformations and jobs.

Transformation run options

Use these options in Run Options to control execution:

• Select configuration: All transformations run using the Pentaho Server configuration.

• Clear log before running: Clear existing logs before execution. Useful if your log is large.

• Enable safe mode: Validate that every row has an identical layout. Rows with a different layout generate an error.

• Gather performance metrics: Monitor execution performance. You can view the metrics in the Pentaho Data Integration client. See Use performance graphs.

• Log level: Amount of logging detail captured. The options are Nothing (no logging), Error (errors only), Minimal, Basic (the default), Detailed, Debug (very detailed), and Row Level (very detailed, logging per row). Because Debug and Row Level can expose sensitive data, consider data sensitivity when selecting these levels.

• Parameters: Apply parameter values at runtime. For details, see Parameters.

• Variables: Apply temporary values for user-defined and environment variables at runtime. For details, see Variables.

• Arguments: Apply named, user-supplied, single-value inputs. Arguments must be supported by the transformation; each transformation supports up to 10 arguments. For details, see Arguments.
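Most of these run options map directly to calls in the classic Kettle Java API. A hedged sketch, assuming org.pentaho.di on the classpath; the variable name, argument value, and file path are illustrative.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.LogLevel;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunWithOptions {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        Trans trans = new Trans(new TransMeta("example.ktr")); // hypothetical path

        trans.setLogLevel(LogLevel.DEBUG);     // Log level (may expose sensitive data)
        trans.setSafeModeEnabled(true);        // Enable safe mode: enforce identical row layout
        trans.setVariable("TMP_DIR", "/tmp");  // Variables: temporary value, illustrative name

        // Arguments: user-supplied, single-value inputs (up to 10)
        trans.execute(new String[] { "2024-06-01" });
        trans.waitUntilFinished();
    }
}
```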

Analyze transformation results

Analyze results to identify errors, preview data, and assess performance.

After you run a transformation, the Execution Results pane opens at the bottom of the canvas.

You can take these actions:

  • To expand results, select Expand Preview.

  • To delete logs, select Delete all logs.

  • To close the pane, select Close.

If the pane is closed, you can open it from the Canvas Action toolbar. Select Logs.

Logging

The Logging tab shows details for the most recent execution. You can drill in to determine where errors occur. Error lines are highlighted in red.

Preview data

Use the Preview Data tab to examine data per step. Select a step to see its output rows.

Step metrics

The Step Metrics tab shows per-step statistics. It includes rows read, rows written, errors, and rows per second.

If a step causes the transformation to fail, it is highlighted in red.
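The counters shown in the Step Metrics tab are also exposed by the engine after a run. A minimal sketch of reading them, assuming the classic Kettle Java API and a hypothetical file path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepInterface;
import org.pentaho.di.trans.step.StepMetaDataCombi;

public class InspectStepMetrics {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        Trans trans = new Trans(new TransMeta("example.ktr")); // hypothetical path
        trans.execute(null);
        trans.waitUntilFinished();

        // One entry per running step copy; counters mirror the Step Metrics tab
        for (StepMetaDataCombi combi : trans.getSteps()) {
            StepInterface step = combi.step;
            System.out.printf("%s: read=%d written=%d errors=%d%n",
                step.getStepname(), step.getLinesRead(),
                step.getLinesWritten(), step.getErrors());
        }
    }
}
```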

Transformation steps in Pipeline Designer

Steps extend transformation functionality. The following steps are available in Pipeline Designer, listed as name (category): description.

Steps: A - F

• Abort (Flow): Abort a transformation.

• Add a checksum (Transform): Add a checksum column for each input row.

• Add constants (Transform): Add one or more constants to the input rows.

• Add sequence (Transform): Get the next value from a sequence.

• Add value fields changing sequence (Transform): Add a sequence based on field value changes. When at least one field changes, PDI resets the sequence.

• Analytic query (Statistics): Execute analytic queries over a sorted dataset (LEAD, LAG, FIRST, LAST).

• Append streams (Flow): Append two streams in an ordered way.

• Block this step until steps finish (Flow): Block this step until selected steps finish.

• Blocking step (Flow): Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row.

• Calculator (Transform): Create new fields by performing simple calculations.

• Concat fields (Transform): Concatenate multiple fields into one target field.

• Copy rows to result (Job): Write rows to the executing job.

• CSV file input (Input): Read from a simple CSV file input.

• Data grid (Input): Enter rows of static data in a grid.

• Data validator (Validation): Validate passing data based on a set of rules.

• Database lookup (Lookup): Look up values in a database using field values.

• Delete (Output): Permanently remove a row from a database.

• Dummy (do nothing) (Flow): Do nothing. Useful for testing and splitting streams.

• Filter rows (Flow): Filter rows using simple equations.

• Formula (Scripting): Calculate a formula using Pentaho's libformula.

Steps: G - L

• Generate random value (Input): Generate random value.

• Generate rows (Input): Generate a number of empty or equal rows.

• Get repository names (Input): List detailed information about transformations and jobs in a repository.

• Get rows from result (Job): Read rows from a previous entry in a job.

• Get subfolder names (Input): Read a parent folder and return all subfolders.

• Get system info (Input): Get system info, like date, arguments, and more.

• Get table names (Input): Get table names from a database connection.

• Get variables (Job): Determine variable values and put them in field values.

• Google Analytics (Input): Fetch data from Google Analytics.

• Group by (Statistics): Build aggregates. Requires sorted input.

• HTTP client (Lookup): Call a web service over HTTP.

• Insert / update (Output): Update or insert rows in a database based on keys.

• Java filter (Flow): Filter rows using Java code.

• Job executor (Flow): Run a PDI job. Pass parameters and rows.

• Join rows (cartesian product) (Joins): Output the cartesian product of input streams.

• JSON input (Input): Extract portions out of JSON structures and output rows.

• JSON output (Output): Create JSON and output it in a field to a file.

Steps: M - R

• Merge join (Joins): Join two sorted streams on a key and output a joined set.

• Merge rows (diff) (Joins): Compare two sorted streams and flag equals, changes, deletes, and new rows.

• Microsoft Excel input (Input): Read data from Excel and OpenOffice workbooks (XLS, XLSX, ODS).

• Microsoft Excel writer (Output): Write or append data to an Excel file.

• MongoDB input (Big Data): Read all entries from a MongoDB collection.

• MongoDB output (Big Data): Write to a MongoDB collection.

• Null if (Utility): Set a field value to null if it equals a constant value.

• Python executor (Scripting): Execute a Python script to process input rows or generate data.

• REST client (Lookup): Consume REST services.

• Row denormaliser (Transform): Denormalise rows by grouping and assigning values to new fields. Requires sorted input.

• Row normaliser (Transform): Normalise de-normalised information.

Steps: S - Z

• Salesforce delete (Output): Delete records in a Salesforce module.

• Salesforce input (Input): Read information from Salesforce.

• Salesforce insert (Output): Insert records in a Salesforce module.

• Salesforce update (Output): Update records in a Salesforce module.

• Salesforce upsert (Output): Insert or update records in a Salesforce module.

• Select values (Transform): Select or remove fields. Optionally set field metadata.

• Set variables (Job): Set environment variables based on a single input row.

• Sort rows (Transform): Sort rows based on field values.

• Sorted merge (Joins): Merge multiple sorted input streams.

• Split field to rows (Transform): Split a string field by delimiter and create a row per split term.

• Split fields (Transform): Split a single field into multiple fields.

• Stream lookup (Lookup): Look up values coming from another stream.

• String operations (Transform): Apply trimming, padding, and other operations to strings.

• Strings cut (Transform): Cut a snippet from a string.

• Switch / case (Flow): Route rows to a target step based on a case value.

• Table exists (Lookup): Check whether a table exists on a connection.

• Table input (Input): Read information from a database table.

• Table output (Output): Write information to a database table.

• Text file input (Input): Read text data in several formats.

• Text file output (Output): Write rows to a text file.

• Transformation executor (Flow): Run a PDI transformation. Set parameters and pass rows.

• Unique rows (Transform): Remove duplicates and keep unique rows. Requires sorted input.

• Update (Output): Update data in a database table based on keys.

• Write to log (Utility): Write data to log.

• Zip file (Utility): Create a ZIP archive from data stream fields.
