Working with transformations

Create, configure, and run transformations to perform ETL tasks as part of a workflow.

After you run a transformation, review results to validate output and troubleshoot issues.

Note: You can schedule a transformation to run once or on a recurring schedule. See Schedule a transformation or job.

Create a transformation

Create a transformation to arrange a network of logical ETL tasks, called steps, into a data workflow.

To create a transformation in Pipeline Designer, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the Transformation card, select Create Transformation.

    A new, blank transformation opens with the Design pane selected.

  4. Add steps to the transformation:

    1. In the Design pane, search for or browse to each step you want to use.

      You may need to expand sections in the Design pane to find steps.

    2. Drag the steps you want to use onto the canvas.

  5. Hover over a step to open the step menu.

    Select one of these options:

    • Delete: Delete the step from the canvas.

    • Edit: Open the step window and configure step properties.

      To learn about the step, in the lower-left corner, select Help.

    • Duplicate: Add a copy of the step to the canvas.

    • More Actions > Change Number of Copies: Set how many parallel copies run.

      In Number of copies (1 or higher), you can enter a number or a variable.

      To insert a variable, select the Select variable to insert icon.

    • More Actions > Data Movement: Choose how rows move to the next steps.

      • Round-Robin: Distribute rows evenly across step copies.

      • Load Balance: Route rows to the step copy with the lightest load.

      • Copy Data to Next Steps: Send a copy of each row to all of the next steps.

  6. Add hops between steps:

    Hover over a step handle until a plus sign (+) appears. Drag the connection to the handle of another step.

  7. Optional: Add a note on the canvas:

    In the canvas toolbar, select Add Note. In Notes, enter your note, then select Save.

    You can format the note in Notes. Select Style, then set font, color, and shadow options.

  8. Save the transformation:

    1. Select Save.

      The Select File or Directory dialog box opens.

    2. Search for or browse to the repository folder where you want to save the transformation.

    3. Optional: Create a folder.

      Select New Folder, enter a folder name, then select Save.

    4. Optional: Delete a folder.

      Select the folder, then select Delete.

    5. In Select File or Directory, select Save.

      The Save Change dialog box opens.

    6. Select Yes to confirm.
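The saved transformation is a KTR file: an XML document that records the steps and the hops that connect them. If you embed the PDI engine, the same structure can be built with the classic Kettle Java API (org.pentaho.di). The following is a minimal sketch only; the step choices (a row generator feeding a dummy step) are illustrative assumptions, not part of the procedure above.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;
import org.pentaho.di.trans.TransHopMeta;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
import org.pentaho.di.trans.steps.rowgenerator.RowGeneratorMeta;

public class BuildTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                       // initialize the engine and plugins
        PluginRegistry registry = PluginRegistry.getInstance();

        TransMeta transMeta = new TransMeta();
        transMeta.setName("example");                   // illustrative name

        // Two steps: a row generator feeding a dummy step
        RowGeneratorMeta genMeta = new RowGeneratorMeta();
        genMeta.setDefault();
        StepMeta generator = new StepMeta(
            registry.getPluginId(StepPluginType.class, genMeta), "Generate rows", genMeta);
        transMeta.addStep(generator);

        DummyTransMeta dummyMeta = new DummyTransMeta();
        StepMeta dummy = new StepMeta(
            registry.getPluginId(StepPluginType.class, dummyMeta), "Dummy", dummyMeta);
        transMeta.addStep(dummy);

        // A hop carries rows from one step to the next
        transMeta.addTransHop(new TransHopMeta(generator, dummy));

        // This XML is what a .ktr file contains
        System.out.println(transMeta.getXML());
    }
}
```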

Edit transformation properties

Transformation properties describe the transformation and control its behavior. Use properties to customize processing and control output.

To edit transformation properties, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    • Search for or browse to the transformation, then select Open.

    • Alternatively, select Open files, and in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Settings.

    The Transformation Properties window opens.

  6. Edit properties in each tab.

    For details, see Transformation properties.

  7. If you updated options in the Logging tab, generate SQL for the logging table:

    1. Select SQL.

      The Simple SQL editor opens, showing the DDL needed to create the logging table or to update an existing one.

    2. Optional: Edit the SQL statements.

      See Use the SQL Editor.

    3. Optional: Select Clear cache to remove cached query results and metadata.

    4. Select Execute.

  8. Select Save.

Transformation properties

The Transformation Properties window has these tabs:

Transformation tab

Use the Transformation tab to specify general properties.

• Transformation name: The name of the transformation. Required to save the settings to the repository.

• Transformation filename: The file name (*.ktr) of the transformation.

• Description: Short description of the transformation, shown in the repository explorer.

• Extended description: Longer description of the transformation.

• Status: Draft or production status.

• Version: Version description.

• Directory: The repository directory where the transformation is stored.

• Created by: Name of the original creator.

• Created at: Date and time when the transformation was created.

• Last modified by: Username of the user who last modified the transformation.

• Last modified at: Date and time when the transformation was last modified.

Parameters tab

Use the Parameters tab to add parameters.

• Parameter: A local variable that can be shared across all steps in the transformation.

• Default Value: Value used if the parameter is not set elsewhere.

• Description: Description of the user-defined parameter.
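For embedders, a parameter corresponds to a named parameter in the classic Kettle Java API: it is declared with a default value and can be overridden at run time. A minimal sketch, assuming org.pentaho.di on the classpath; the parameter name START_DATE and the file path are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ParameterExample {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        // Equivalent of a Parameters tab row: name, default value, description
        transMeta.addParameterDefinition("START_DATE", "2024-01-01", "Lower bound for extraction");

        Trans trans = new Trans(transMeta);
        trans.setParameterValue("START_DATE", "2024-06-01");  // override the default at run time
        trans.activateParameters();                           // push parameter values into variables
        trans.execute(null);
        trans.waitUntilFinished();
    }
}
```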

Logging tab

Use the Logging tab to configure how and where logging information is captured.

In the left navigation pane, select which type of logging you want to use.

• Log Connection: Database connection used for logging. To configure a new connection, select New.

• Log table schema: Schema name, if supported by your database.

• Log table name: Name of the log table. If you also use job logging, use a different table name for transformation logging.

• Logging interval (seconds): How often log records are written to the table. Applies only to the Transformation and Performance logging types.

• Log record timeout (in days): Number of days to keep log entries before they are deleted. Applies only to the Transformation and Performance logging types. If log data is not deleted as expected, see Log table data is not deleted.

• Log size limit (in lines): Limit on the number of lines stored in the LOG_FIELD (CLOB) column. Applies only to the Transformation logging type.

• Fields to log: The fields to include in the log table. LOGDATE and TRANSNAME are enabled by default so that expired log records can be deleted. Monitoring LOG_FIELD can negatively impact Pentaho Server performance. If you do not select all fields, including LOG_FIELD, information about this transformation does not appear in Operations Mart logging.
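For embedders, the Logging tab fields correspond to the transformation log table in the engine. A minimal sketch, assuming the classic Kettle Java API; the connection, schema, and table names are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.TransLogTable;
import org.pentaho.di.trans.TransMeta;

public class ConfigureLogTable {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");  // hypothetical path

        // Rough equivalents of the Logging tab fields for the Transformation logging type
        TransLogTable logTable = transMeta.getTransLogTable();
        logTable.setConnectionName("logging_db");  // hypothetical Log Connection
        logTable.setSchemaName("etl");             // hypothetical Log table schema
        logTable.setTableName("trans_log");        // hypothetical Log table name
        logTable.setLogInterval("5");              // Logging interval (seconds)
        logTable.setTimeoutInDays("30");           // Log record timeout (in days)
    }
}
```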

Dates tab

Use the Dates tab to configure the date range that the transformation obtains from a database connection.

• Maxdate connection: Database connection used to get the upper limit of the date range.

• Maxdate table: Table used to get the upper limit of the date range.

• Maxdate field: Field used to get the upper limit of the date range.

• Maxdate offset (seconds): Offset applied to the upper date limit, in seconds. For example, -60 subtracts 60 seconds.

• Maximum date difference (seconds): Maximum allowed difference in the obtained date range, in seconds. Use it to limit job sizes.

Dependencies tab

Use the Dependencies tab to enter transformation dependencies.

For example, if a dimension depends on three lookup tables, you can track those tables. If values in the lookup tables change, the date range is extended to force a full refresh. Dependencies are also useful for detecting changes when a table has a "date last changed" column.

Select Get dependencies to detect dependencies automatically.

• Connection: Database connection that is already defined for the transformation.

• Table: Table from the selected database connection.

• Field: Field within the selected table.

Miscellaneous tab

Use the Miscellaneous tab to configure buffer and feedback sizes, transactional behavior, and other engine settings.

• Number of rows in rowset: Buffer size between connected steps. Change it only if you are running low on memory.

• Show a feedback row in transformation steps?: Add feedback entries to the log file during execution. Enabled by default, with a default frequency of every 5000 rows.

• The feedback size: Number of rows to process before adding a feedback entry. Increase it for large datasets.

• Make the transformation database transactional: Open one unique connection for each database connection defined and used in the transformation. Enable it so that a failed transformation can be fully rolled back, or when you need to alter connection settings before a query with an Execute SQL script step. When enabled, the transformation-wide commit happens when the last step finishes, a rollback happens if the transformation fails, and step commit sizes are ignored. See Database transactions in jobs and transformations.

• Shared objects file: Location of the XML file that stores shared objects, such as database connections and clustering schemas.

• Manage thread priorities?: Enable or disable dynamic adjustment of Java thread priorities based on buffer levels. Disable it if the overhead outweighs the benefit.
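In the engine, the transactional option is known as unique connections. A minimal sketch of setting it programmatically, assuming the classic Kettle Java API and a hypothetical file path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.TransMeta;

public class TransactionalOption {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr"); // hypothetical path

        // "Unique connections" is the engine-level name for
        // "Make the transformation database transactional":
        // one connection per database, commit at the end, rollback on failure.
        transMeta.setUsingUniqueConnections(true);
    }
}
```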

Monitoring tab

Use the Monitoring tab to enable step performance monitoring.

• Enable step performance monitoring?: Collect performance metrics for each step. Metrics appear in the Performance Graph tab in the Execution Results panel.

• Step performance measurement interval (ms): Interval between performance snapshots, in milliseconds. For example, 10 ms.

• Maximum number of snapshots in memory: Maximum number of measurement snapshots held in memory during runtime.
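These fields also have programmatic equivalents. A sketch, assuming the classic Kettle Java API; the method names reflect the engine's snapshot-based monitoring, and the file path is hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.TransMeta;

public class EnableStepMonitoring {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        // Rough equivalents of the Monitoring tab fields
        transMeta.setCapturingStepPerformanceSnapShots(true); // enable per-step metrics
        transMeta.setStepPerformanceCapturingDelay(10L);      // snapshot interval in ms
    }
}
```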

Run a transformation

Run a transformation to step through ETL tasks in the order specified by the transformation.

Complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    • Search for or browse to the transformation, then select Open.

    • Alternatively, select Open files, and in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Run.

    Select one of these options:

    • Select Run.

    • Or select Run Options. In Run Options, set options, then select Run. For details, see Transformation run options.

The transformation runs. The Preview panel opens with the Logging tab selected.
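Running from the UI executes the saved KTR file on the server. For comparison, here is a minimal sketch of the same run using the classic Kettle Java API, with a hypothetical file path.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta transMeta = new TransMeta("example.ktr");   // hypothetical path

        Trans trans = new Trans(transMeta);
        trans.execute(null);          // start all step threads
        trans.waitUntilFinished();    // block until the last step completes

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```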

Note: To stop a transformation while it is running, see Stop transformations and jobs.

Transformation run options

Use these options in Run Options to control execution:

• Select configuration: All transformations run using the Pentaho Server configuration.

• Clear log before running: Clear existing logs before execution. Useful if your log is large.

• Enable safe mode: Validate that every row has an identical layout. Rows with a different layout generate an error.

• Gather performance metrics: Monitor execution performance. You can view the metrics in the Pentaho Data Integration client. See Use performance graphs.

• Log level: Amount of logging detail captured. The options are Nothing (no logging), Error (errors only), Minimal, Basic (the default), Detailed, Debug (very detailed), and Row Level (very detailed, logging per row). Because Debug and Row Level can expose sensitive data, consider data sensitivity when selecting these levels.

• Parameters: Apply parameter values at runtime. For details, see Parameters.

• Variables: Apply temporary values for user-defined and environment variables at runtime. For details, see Variables.

• Arguments: Apply named, user-supplied, single-value inputs. Arguments must be supported by the transformation; each transformation supports up to 10 arguments. For details, see Arguments.
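Most of these run options map directly to calls in the classic Kettle Java API. A hedged sketch, assuming org.pentaho.di on the classpath; the variable name, argument value, and file path are illustrative.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.LogLevel;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunWithOptions {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        Trans trans = new Trans(new TransMeta("example.ktr")); // hypothetical path

        trans.setLogLevel(LogLevel.DEBUG);     // Log level (may expose sensitive data)
        trans.setSafeModeEnabled(true);        // Enable safe mode: enforce identical row layout
        trans.setVariable("TMP_DIR", "/tmp");  // Variables: temporary value, illustrative name

        // Arguments: user-supplied, single-value inputs (up to 10)
        trans.execute(new String[] { "2024-06-01" });
        trans.waitUntilFinished();
    }
}
```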

Analyze transformation results

Analyze results to identify errors, preview data, and assess performance.

After you run a transformation, the Execution Results pane opens at the bottom of the canvas.

You can take these actions:

  • To expand results, select Expand Preview.

  • To delete logs, select Delete all logs.

  • To close the pane, select Close.

If the pane is closed, you can open it from the Canvas Action toolbar. Select Logs.

Logging

The Logging tab shows details for the most recent execution. You can drill in to determine where errors occur. Error lines are highlighted in red.

Preview data

Use the Preview Data tab to examine data per step. Select a step to see its output rows.

Step metrics

The Step Metrics tab shows per-step statistics. It includes rows read, rows written, errors, and rows per second.

If a step causes the transformation to fail, it is highlighted in red.
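The counters shown in the Step Metrics tab are also exposed by the engine after a run. A minimal sketch of reading them, assuming the classic Kettle Java API and a hypothetical file path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepInterface;
import org.pentaho.di.trans.step.StepMetaDataCombi;

public class InspectStepMetrics {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        Trans trans = new Trans(new TransMeta("example.ktr")); // hypothetical path
        trans.execute(null);
        trans.waitUntilFinished();

        // One entry per running step copy; counters mirror the Step Metrics tab
        for (StepMetaDataCombi combi : trans.getSteps()) {
            StepInterface step = combi.step;
            System.out.printf("%s: read=%d written=%d errors=%d%n",
                step.getStepname(), step.getLinesRead(),
                step.getLinesWritten(), step.getErrors());
        }
    }
}
```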

Transformation steps in Pipeline Designer

Steps extend transformation functionality. The following steps are available in Pipeline Designer, listed as name (category): description.

Steps: A - F

• Abort (Flow): Abort a transformation.

• Add a checksum (Transform): Add a checksum column for each input row.

• Add constants (Transform): Add one or more constants to the input rows.

• Add sequence (Transform): Get the next value from a sequence.

• Add value fields changing sequence (Transform): Add a sequence based on field value changes. When at least one field changes, PDI resets the sequence.

• Analytic query (Statistics): Execute analytic queries over a sorted dataset (LEAD, LAG, FIRST, LAST).

• Append streams (Flow): Append two streams in an ordered way.

• Block this step until steps finish (Flow): Block this step until selected steps finish.

• Blocking step (Flow): Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row.

• Calculator (Transform): Create new fields by performing simple calculations.

• Concat fields (Transform): Concatenate multiple fields into one target field.

• Copy rows to result (Job): Write rows to the executing job.

• CSV file input (Input): Read from a simple CSV file input.

• Data grid (Input): Enter rows of static data in a grid.

• Data validator (Validation): Validate passing data based on a set of rules.

• Database lookup (Lookup): Look up values in a database using field values.

• Delete (Output): Permanently remove a row from a database.

• Dummy (do nothing) (Flow): Do nothing. Useful for testing and splitting streams.

• Filter rows (Flow): Filter rows using simple equations.

• Formula (Scripting): Calculate a formula using Pentaho's libformula.

Steps: G - L

• Generate random value (Input): Generate random value.

• Generate rows (Input): Generate a number of empty or equal rows.

• Get repository names (Input): List detailed information about transformations and jobs in a repository.

• Get rows from result (Job): Read rows from a previous entry in a job.

• Get subfolder names (Input): Read a parent folder and return all subfolders.

• Get system info (Input): Get system info, like date, arguments, and more.

• Get table names (Input): Get table names from a database connection.

• Get variables (Job): Determine variable values and put them in field values.

• Google Analytics (Input): Fetch data from Google Analytics.

• Group by (Statistics): Build aggregates. Requires sorted input.

• HTTP client (Lookup): Call a web service over HTTP.

• Insert / update (Output): Update or insert rows in a database based on keys.

• Java filter (Flow): Filter rows using Java code.

• Job executor (Flow): Run a PDI job. Pass parameters and rows.

• Join rows (cartesian product) (Joins): Output the cartesian product of input streams.

• JSON input (Input): Extract portions out of JSON structures and output rows.

• JSON output (Output): Create JSON and output it in a field to a file.

Steps: M - R

• Merge join (Joins): Join two sorted streams on a key and output a joined set.

• Merge rows (diff) (Joins): Compare two sorted streams and flag equals, changes, deletes, and new rows.

• Microsoft Excel input (Input): Read data from Excel and OpenOffice workbooks (XLS, XLSX, ODS).

• Microsoft Excel writer (Output): Write or append data to an Excel file.

• MongoDB input (Big Data): Read all entries from a MongoDB collection.

• MongoDB output (Big Data): Write to a MongoDB collection.

• Null if (Utility): Set a field value to null if it equals a constant value.

• Python executor (Scripting): Execute a Python script to process input rows or generate data.

• REST client (Lookup): Consume REST services.

• Row denormaliser (Transform): Denormalise rows by grouping and assigning values to new fields. Requires sorted input.

• Row normaliser (Transform): Normalise de-normalised information.

Steps: S - Z

• Salesforce delete (Output): Delete records in a Salesforce module.

• Salesforce input (Input): Read information from Salesforce.

• Salesforce insert (Output): Insert records in a Salesforce module.

• Salesforce update (Output): Update records in a Salesforce module.

• Salesforce upsert (Output): Insert or update records in a Salesforce module.

• Select values (Transform): Select or remove fields. Optionally set field metadata.

• Set variables (Job): Set environment variables based on a single input row.

• Sort rows (Transform): Sort rows based on field values.

• Sorted merge (Joins): Merge multiple sorted input streams.

• Split field to rows (Transform): Split a string field by delimiter and create a row per split term.

• Split fields (Transform): Split a single field into multiple fields.

• Stream lookup (Lookup): Look up values coming from another stream.

• String operations (Transform): Apply trimming, padding, and other operations to strings.

• Strings cut (Transform): Cut a snippet from a string.

• Switch / case (Flow): Route rows to a target step based on a case value.

• Table exists (Lookup): Check whether a table exists on a connection.

• Table input (Input): Read information from a database table.

• Table output (Output): Write information to a database table.

• Text file input (Input): Read text data in several formats.

• Text file output (Output): Write rows to a text file.

• Transformation executor (Flow): Run a PDI transformation. Set parameters and pass rows.

• Unique rows (Transform): Remove duplicates and keep unique rows. Requires sorted input.

• Update (Output): Update data in a database table based on keys.

• Write to log (Utility): Write data to log.

• Zip file (Utility): Create a ZIP archive from data stream fields.
