Pipeline Designer

Pipeline Designer is a web-based interface that you can use to design, execute, and manage data integration workflows directly in your browser. Pipeline Designer supports a wide range of database connections, advanced transformation steps, and robust execution monitoring.

By default, Pipeline Designer is installed when you install the Pentaho Server. To update the Pipeline Designer plugin, use the Plugin Manager. For details, see Update plugins.

Note: The Project option appears on the main page of the Pipeline Designer but is not currently supported in Pipeline Designer. You can use Projects in the Pentaho Data Integration client. For details, see Organizing data integration with projects.

Views in Pipeline Designer

In Pipeline Designer, you can work with transformations and jobs in the Design View and review the raw XML in the XML View.

Design View

The Design View is the primary view to use for creating and running transformations and jobs in the Pipeline Designer.

Canvas Action toolbar

While working with a transformation or job, use the following options in the Canvas Action toolbar:

Option
Description

Add note

Adds a note to the canvas. For details, see Use notes on the canvas.

Reset

Clears all steps in the transformation or job.

Note: Resetting a transformation or job cannot be undone.

Run

Runs the transformation or job.

Run > Run Options

Opens the Run Options window for the transformation or job. For details, see Transformation run options or Job run options.

Pause

Pauses a running transformation or job.

Stop

Stops processing all data in the transformation or job immediately.

Stop > Stop Processing

Makes the transformation or job finish any records that were initiated or retrieved and then stops processing data.

Logs

Opens the Execution Results pane for a transformation or job. For details, see Analyze transformation results or Analyze job results.

Kettle status

Opens the Kettle status page (http://server:port/pentaho/kettle/status) in a new tab of your browser, showing the execution status and details of your transformation or job.

Export

Exports a transformation as a .ktr file or a job as a .kjb file. For details, see Export a transformation or job.

Settings

Opens the Transformation Properties or Job Properties window, where you can configure properties to describe or control the transformation or job. For details, see Edit transformation properties or Edit job properties.

Canvas toolbar

Move around the canvas while working with a transformation or job by using the following options in the canvas toolbar:

Option
Description

Zoom In

Enlarges the transformation or job on the canvas so that you can see more details of individual steps.

Zoom Out

Shrinks the transformation or job on the canvas so that you can see more steps at the same time, with less detail for individual steps.

Fit View

Resizes the transformation or job and centers it on the canvas so that you can see as many steps at the same time as possible. Whether everything fits in the view depends on the size of the transformation or job.

Interactive

Locks and unlocks the canvas to control whether elements can be moved on the canvas. Turn the Interactive toggle on when you want to lock the canvas so that you can move around it without moving elements on the canvas.

More actions > Auto format

Uniformly arranges all steps in the transformation or job so that the steps all fit on the canvas, but do not overlap.

More actions > Horizontal format

Uniformly arranges steps in each data stream horizontally (left to right) without overlap. Separate streams are distributed vertically to maintain a clear and organized canvas layout.

More actions > Vertical format

Uniformly arranges steps in each data stream vertically (top to bottom) without overlap. Separate streams are distributed horizontally to maintain a clear and organized canvas layout.

</> XML View

Switches to the XML View, which shows the raw XML for the transformation or job.

XML View

In the XML View, you can review the raw XML for a transformation or job.

You can click Design View to continue working on the transformation in the Design View.

Basic concepts of ETL

The Pentaho Data Integration (PDI) platform, which includes the PDI client and Pipeline Designer, uses a workflow metaphor as building blocks for transforming your data and other tasks. Workflows are built using steps as you create transformations and jobs. Each step is joined by a hop which passes the flow of data from one item to the next.

Transformations

A transformation is a network of logical tasks called steps. Transformations are essentially data flows: a directed graph of a logical set of data transformation configurations. Transformation file names have a .ktr extension.

The two main components associated with transformations are steps and hops:

  • Steps are the building blocks of a transformation, for example a text file input or a table output. There are many steps available in the Pipeline Designer and they are grouped according to their function; for example, input, output, transform, and so on. Each step in a transformation is designed to perform a specific task, such as reading data from a flat file, filtering rows, or logging to a database. You can add a step by dragging it from the Design pane onto the canvas. Steps can be configured to perform the tasks you require. See Transformation steps in Pipeline Designer for details about the features and ETL functions of the various transformation steps available in Pipeline Designer.

  • Hops are data pathways that connect steps together and allow schema metadata to pass from one step to another. Hops determine the flow of data through the steps, not necessarily the sequence in which they run. When you run a transformation, each step starts up in its own thread and pushes and passes data.

Note: All steps in a transformation are started and run in parallel, so the initialization sequence is not predictable. That is why you cannot, for example, set a variable in a first step and attempt to use that variable in a subsequent step.

You can connect steps together with hops. For details, see Hops. A step can have many connections: some steps join other steps together, while some serve as an input or output for another step. Hops are represented in Pipeline Designer as arrows. They allow data to be passed from step to step and also determine the direction and flow of data through the steps. If a step sends outputs to more than one step, the data can either be copied to each step or distributed among them.

Jobs

Jobs are workflow-like models for coordinating resources, execution, and dependencies of ETL activities.

Jobs aggregate individual pieces of functionality to implement an entire process. Examples of common tasks performed in a job include getting FTP files, checking conditions such as existence of a necessary target database table, running a transformation that populates that table, and e-mailing an error log if a transformation fails. The final job outcome might be a nightly warehouse update, for example.

Job entries are the individual configured pieces; they are the primary building blocks of a job. In data transformations these individual pieces are called steps. Job entries can provide you with a wide range of functionality ranging from executing transformations to getting files from a Web server. A single job entry can be placed multiple times on the canvas; for example, you can take a single job entry such as a transformation run and place it on the canvas multiple times using different configurations. Job settings are the options that control the behavior of a job and the method of logging a job’s actions. Job file names have a .kjb extension. See Job steps in Pipeline Designer for details about the features and ETL functions of the various job entries available in Pipeline Designer.

Job hops control the execution order and the condition on which the next job entry will be executed. A job hop is just a flow of control. Hops link to job entries and, based on the results of the previous job entry, determine what happens next.

Note: Hops behave differently when used in a job than when used in a transformation.

Job hop conditions are described in the following table:

Condition
Description

Unconditional

The next job entry is executed regardless of the result of the originating job entry

Follow when result is true

The next job entry is executed only when the result of the originating job entry is true, meaning a successful execution: for example, a file was found, a table exists, or the entry completed without errors.

Follow when result is false

The next job entry is executed only when the result of the originating job entry is false, meaning an unsuccessful execution: for example, a file was not found, a table was not found, or errors occurred.

Hops

A hop connects one transformation step or job entry with another. The direction of the data flow is indicated by an arrow. To create the hop, click the handle of one step and drag the connection to the handle of another step.

Loops are not allowed in transformations because Pipeline Designer depends heavily on the previous steps to determine the field values that are passed from one step to another. Allowing loops in transformations may result in endless loops and other problems. Loops are allowed in jobs because Pipeline Designer executes job entries sequentially; however, make sure you do not create endless loops.

Mixing rows that have a different layout is not allowed in a transformation; for example, if you have two table input steps that use a varying number of fields. Mixing row layouts causes steps to fail because fields cannot be found where expected or the data type changes unexpectedly. The trap detector displays warnings at design time if a step is receiving mixed layouts.
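
For example, the following two Table input queries (hypothetical table and column names) return different row layouts, so their output streams cannot be merged directly without triggering this error:

  -- Stream 1: two fields
  SELECT customer_id, customer_name
  FROM customers_2023;

  -- Stream 2: three fields, so the layout does not match Stream 1
  SELECT customer_id, customer_name, email
  FROM customers_2024;

To merge such streams, first make the layouts identical, for example by selecting the same fields in both queries or by adding the missing field with a constant value in the first stream.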

In transformations, you can specify whether data is copied, distributed, or load balanced among the hops leaving a step. Select the step, open the step menu, and choose More Actions > Data Movement.

A hop can be enabled or disabled (for testing purposes for example). Click the hop to enable or disable it.

Working with transformations

Create, configure, and run transformations to perform ETL tasks as part of a workflow.

After you run a transformation, review results to validate output and troubleshoot issues.

Note: You can schedule a transformation to run once or on a recurring schedule. See Schedule a transformation or job.

Create a transformation

Create a transformation to arrange a network of logical ETL tasks, called steps, into a data workflow.

To create a transformation in Pipeline Designer, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to the Modern Design. Then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the Transformation card, select Create Transformation.

    A new, blank transformation opens with the Design pane selected.

  4. Add steps to the transformation:

    1. In the Design pane, search for or browse to each step you want to use.

      You may need to expand sections in the Design pane to find steps.

    2. Drag the steps you want to use onto the canvas.

  5. Hover over a step to open the step menu.

    Select one of these options:

    • Delete: Delete the step from the canvas.

    • Edit: Open the step window and configure step properties.

      To learn about the step, in the lower-left corner, select Help.

    • Duplicate: Add a copy of the step to the canvas.

    • More Actions > Change Number of Copies: Set how many parallel copies run.

      In Number of copies (1 or higher), you can enter a number or a variable.

      To insert a variable, select the Select variable to insert icon.

    • More Actions > Data Movement: Choose how rows move to next steps.

      • Round-Robin: Distribute rows evenly across step copies.

      • Load Balance: Route rows to the step copy with the lightest load.

      • Copy Data to Next Steps: Send each row to all parallel step copies.

  6. Add hops between steps:

    Hover over a step handle until a plus sign (+) appears. Drag the connection to the handle of another step.

  7. Optional: Add a note on the canvas:

    In the canvas toolbar, select Add Note. In Notes, enter your note, then select Save.

    You can format the note in Notes. Select Style, then set font, color, and shadow options.

  8. Save the transformation:

    1. Select Save.

      The Select File or Directory dialog box opens.

    2. Search for or browse to the repository folder where you want to save.

    3. Optional: Create a folder.

      Select New Folder, enter a folder name, then select Save.

    4. Optional: Delete a folder.

      Select the folder, then select Delete.

    5. In Select File or Directory, select Save.

      The Save Change dialog box opens.

    6. Select Yes to confirm.

Edit transformation properties

Transformation properties describe the transformation and control its behavior. Use properties to customize processing and control output.

To edit transformation properties, complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to the Modern Design. Then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    1. Search for or browse to the transformation, then select Open.

    2. Or select Open files. Then in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Settings.

    The Transformation Properties window opens.

  6. Edit properties in each tab.

    For details, see Transformation properties.

  7. If you updated options in the Logging tab, generate SQL for the logging table:

    1. Select SQL.

      The Simple SQL editor opens. It shows DDL for a new table or details for an existing table.

    2. Optional: Edit the SQL statements.

      See Use the SQL Editor.

    3. Optional: Select Clear cache to remove cached query results and metadata.

    4. Select Execute.

  8. Select Save.

Transformation properties

The Transformation Properties window has these sections:

Transformation tab

Use the Transformation tab to specify general properties.

Property
Description

Transformation name

The name of the transformation. This field is required to save settings to the repository.

Transformation filename

The file name (*.ktr) of the transformation.

Description

Short description of the transformation that displays in the repository explorer.

Extended description

Long extended description of the transformation.

Status

Draft or production status.

Version

Version description.

Directory

The repository directory where the transformation is stored.

Created by

Name of the original creator.

Created at

Date and time when the transformation was created.

Last modified by

Username of the last user who modified the transformation.

Last modified at

Date and time when the transformation was last modified.

Parameters tab

Use the Parameters tab to add parameters.

Property
Description

Parameter

A local variable that can be shared across all steps in a transformation.

Default Value

Value that is used if the parameter is not set elsewhere.

Description

Description of the user-defined parameter.

Logging tab

Use the Logging tab to configure how and where logging information is captured.

In the left navigation pane, select which type of logging you want to use.

Property
Description

Log Connection

Database connection used for logging. To configure a new connection, select New.

Log table schema

Schema name, if supported by your database.

Log table name

Name of the log table. If you also use job logging, use a different table name for transformation logging.

Logging interval (seconds)

Interval in which logs are written to the table. Applies only to Transformation and Performance logging types.

Log record timeout (in days)

Number of days to keep log entries before deletion. Applies only to Transformation and Performance logging types. If log data is not deleted as expected, see Log table data is not deleted.

Log size limit (in lines)

Limit for the number of lines stored in LOG_FIELD (CLOB). Applies only to Transformation logging.

Fields to log

Select the fields to log in Fields to log. For effective deletion of expired logs, LOGDATE and TRANSNAME are enabled by default. Monitoring LOG_FIELD can negatively impact Pentaho Server performance. If you do not select all fields, including LOG_FIELD, you will not see information about this transformation in Operations Mart logging.
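
For reference, the DDL generated for a transformation log table (with the SQL option described in Edit transformation properties) resembles the following sketch. This is illustrative only: the actual statement depends on the fields you select in Fields to log and on your database dialect, and the table name and exact column types shown here are assumptions.

  CREATE TABLE TRANS_LOG (
    ID_BATCH INTEGER,
    CHANNEL_ID VARCHAR(255),
    TRANSNAME VARCHAR(255),
    STATUS VARCHAR(15),
    LINES_READ BIGINT,
    LINES_WRITTEN BIGINT,
    ERRORS BIGINT,
    STARTDATE TIMESTAMP,
    ENDDATE TIMESTAMP,
    LOGDATE TIMESTAMP,
    LOG_FIELD CLOB
  );

LOGDATE and TRANSNAME appear because they are enabled by default, and LOG_FIELD holds the log text itself.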

Dates tab

Use the Dates tab to configure date ranges and limits for this connection.

Property
Description

Maxdate connection

Get the upper limit for a date range on this connection.

Maxdate table

Get the upper limit for a date range in this table.

Maxdate field

Get the upper limit for a date range in this field.

Maxdate offset (seconds)

Adjust the upper date limit by this number of seconds. For example, enter -60 to subtract 60 seconds.

Maximum date difference (seconds)

Set the maximum date difference in the obtained date range. Use it to limit job sizes.

Dependencies tab

Use the Dependencies tab to specify transformation dependencies.

Example: If a dimension depends on three lookup tables, you can track those tables. If values in the lookup tables change, extend the date range to force a full refresh.

Dependencies also help you detect change when a table has a “data last changed” column.
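
As a sketch of the idea (hypothetical table and column names), checking such a dependency amounts to a query like the following; if the returned value is newer than the value from the previous run, the date range is extended so the dependent data is fully refreshed:

  SELECT MAX(last_changed)
  FROM country_lookup;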

Select Get dependencies to detect dependencies automatically.

Property
Description

Connection

Select a database connection that is already defined for the transformation.

Table

Table from the selected database connection.

Field

Field within the selected table.

Miscellaneous tab

Use the Miscellaneous tab to configure buffer and feedback size. You can also perform administrative tasks.

Property
Description

Number of rows in rowset

Change buffer size between connected steps. Change it only if you are running low on memory.

Show a feedback row in transformation steps?

Add feedback entries to the log file during execution. Enabled by default. Default frequency is every 5000 rows.

The feedback size

Set the number of rows to process before adding a feedback entry. Increase it for large datasets.

Make the transformation database transactional

Open one unique connection per defined and used database connection in the transformation. Enable it to allow a failed transformation to be fully rolled back, or when you need to alter connection settings before a query using an Execute SQL script step. See Database transactions in jobs and transformations. A transformation-wide commit happens when the last step finishes; when the transformation fails, a rollback happens. Commit sizes are ignored.

Shared objects file

Location of the XML file that stores shared objects, like database connections and clustering schemas.

Manage thread priorities?

Enable or disable changing Java thread priorities based on buffer levels. Disable it if the overhead outweighs the benefit.

Monitoring tab

Use the Monitoring tab to enable step performance monitoring.

Property
Description

Enable step performance monitoring?

Enable metrics for each step. Metrics show in the Performance Graph tab in the Execution Results panel.

Step performance measurement interval (ms)

Snapshot interval in milliseconds. Example: 10 ms.

Maximum number of snapshots in memory

Maximum number of measurement snapshots held in memory during runtime.

Run a transformation

Run a transformation to step through ETL tasks in the order specified by the transformation.

Complete these steps:

  1. Sign in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using Modern Design, in the left menu, select Pipeline Designer.

    • If you are using Classic Design, select Switch to the Modern Design. Then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select Recently opened or Favorites.

  4. Open the transformation:

    1. Search for or browse to the transformation, then select Open.

    2. Or select Open files. Then in Select File or Directory, browse to the transformation and select Open.

  5. In the Canvas Action toolbar, select Run.

    Select one of these options:

    1. Select Run.

    2. Or select Run Options.

      In Run Options, set options, then select Run. For details, see Transformation run options.

The transformation runs. The Preview panel opens with the Logging tab selected.

Note: To stop a transformation while it is running, see Stop transformations and jobs.

Transformation run options

Option
Description

Select configuration

All transformations run using the Pentaho server configuration.

Clear log before running

Clear existing logs before execution. Use it if your log is large.

Enable safe mode

Validate that every row has an identical layout. Rows with a different layout generate an error.

Gather performance metrics

Monitor execution performance. You can view metrics in the Pentaho Data Integration client. See Use performance graphs.

Log level

Amount of logging and detail captured. Options: Nothing (no logging), Error (errors only), Minimal, Basic (default), Detailed, Debug (very detailed), and Row Level (logging per row, very detailed). Debug and Row Level can expose sensitive data; consider the sensitivity of your data when selecting these levels.

Parameters

Apply parameter values at runtime. For details, see Parameters.

Variables

Apply temporary values for user-defined and environment variables at runtime. For details, see Variables.

Arguments

Apply named, user-supplied, single-value inputs. Arguments require transformation support. Each transformation supports up to 10 arguments. For details, see Arguments.

Analyze transformation results

Analyze results to identify errors, preview data, and assess performance.

After you run a transformation, Execution Results open at the bottom of the canvas.

You can take these actions:

  • To expand results, select Expand Preview.

  • To delete logs, select Delete all logs.

  • To close the pane, select Close.

If the pane is closed, you can open it from the Canvas Action toolbar. Select Logs.

Logging

The Logging tab shows details for the most recent execution. You can drill in to determine where errors occur. Error lines are highlighted in red.

Preview data

Use Preview Data to preview data per step. Select a step to see its output rows.

Step metrics

The Step Metrics tab shows per-step statistics. It includes rows read, rows written, errors, and rows per second.

If a step causes the transformation to fail, it is highlighted in red.

Transformation steps in Pipeline Designer

Steps extend and expand transformation functionality. Use the following steps in Pipeline Designer.

Steps: A - F

Name
Category
Description

Flow

Abort a transformation.

Transform

Add a checksum column for each input row.

Transform

Add one or more constants to the input rows.

Transform

Get the next value from a sequence.

Transform

Add a sequence based on field value changes. When at least one field changes, PDI resets the sequence.

Statistics

Execute analytic queries over a sorted dataset (LEAD, LAG, FIRST, LAST).

Flow

Append two streams in an ordered way.

Flow

Block this step until selected steps finish.

Flow

Block flow until all incoming rows have been processed. Subsequent steps only receive the last input row.

Transform

Create new fields by performing simple calculations.

Transform

Concatenate multiple fields into one target field.

Job

Write rows to the executing job.

Input

Read from a simple CSV file input.

Input

Enter rows of static data in a grid.

Validation

Validate passing data based on a set of rules.

Lookup

Look up values in a database using field values.

Output

Permanently remove a row from a database.

Flow

Do nothing. Useful for testing and splitting streams.

Flow

Filter rows using simple equations.

Scripting

Calculate a formula using Pentaho's libformula.

Steps: G - L

Name
Category
Description

Input

Generate random value.

Input

Generate a number of empty or equal rows.

Input

List detailed information about transformations and jobs in a repository.

Job

Read rows from a previous entry in a job.

Input

Read a parent folder and return all subfolders.

Input

Get system info, like date, arguments, and more.

Input

Get table names from a database connection.

Job

Determine variable values and put them in field values.

Input

Fetch data from Google Analytics.

Statistics

Build aggregates. Requires sorted input.

Lookup

Call a web service over HTTP.

Output

Update or insert rows in a database based on keys.

Flow

Filter rows using Java code.

Flow

Run a PDI job. Pass parameters and rows.

Joins

Output the cartesian product of input streams.

Input

Extract portions out of JSON structures and output rows.

Output

Create JSON and output it in a field to a file.

Steps: M - R

Name
Category
Description

Joins

Join two sorted streams on a key and output a joined set.

Joins

Compare two sorted streams and flag equals, changes, deletes, and new rows.

Input

Read data from Excel and OpenOffice workbooks (XLS, XLSX, ODS).

Output

Write or append data to an Excel file.

Big Data

Read all entries from a MongoDB collection.

Big Data

Write to a MongoDB collection.

Utility

Set a field value to null if it equals a constant value.

Scripting

Execute a Python script to process input rows or generate data.

Lookup

Consume REST services.

Transform

Denormalise rows by grouping and assigning values to new fields. Requires sorted input.

Transform

Normalise de-normalised information.

Steps: S - Z

Name
Category
Description

Output

Delete records in a Salesforce module.

Input

Read information from Salesforce.

Output

Insert records in a Salesforce module.

Output

Update records in a Salesforce module.

Output

Insert or update records in a Salesforce module.

Transform

Select or remove fields. Optionally set field metadata.

Job

Set environment variables based on a single input row.

Transform

Sort rows based on field values.

Joins

Merge multiple sorted input streams.

Transform

Split a string field by delimiter and create a row per split term.

Transform

Split a single field into multiple fields.

Lookup

Look up values coming from another stream.

Transform

Apply trimming, padding, and other operations to strings.

Transform

Cut a snippet from a string.

Flow

Route rows to a target step based on a case value.

Lookup

Check whether a table exists on a connection.

Input

Read information from a database table.

Output

Write information to a database table.

Input

Read text data in several formats.

Output

Write rows to a text file.

Flow

Run a PDI transformation. Set parameters and pass rows.

Transform

Remove duplicates and keep unique rows. Requires sorted input.

Output

Update data in a database table based on keys.

Utility

Write data to log.

Utility

Create a ZIP archive from data stream fields.

Working with jobs

Create, configure, and run jobs to orchestrate ETL activities. After running a job, analyze its results to identify improvements and problems.

Create a job

Create a job to coordinate resources, execution, and dependencies of an ETL activity.

To create a job in Pipeline Designer, follow these steps:

  1. Log in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, select Pipeline Designer.

    • If you are using the Classic Design, select Switch to the Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the Job card, select Create Job.

    A new, blank job opens with the Design pane selected.

  4. Add steps to the job:

    1. In the Design pane, search for or browse to each step you want to use in the job.

    2. Drag the steps you want to use onto the canvas.

  5. Work with steps on the canvas.

    Hover over a step to open the step menu, then select an option:

    Menu option
    Description

    Delete

    Deletes the step from the canvas.

    Edit

    Opens the Step Name window where you can configure the properties of the step. Step properties may appear in multiple sections, tabs, or both.

    Note: To learn more about the step you're configuring, in the lower-left corner of the Step Name window, click Help.

    Duplicate

    Adds a copy of the step to the canvas.

  6. Add hops between steps.

    Hover over a step handle until a plus sign (+) appears, then drag the connection to another step handle.

  7. Optional: Add a note on the canvas.

    In the canvas toolbar, select the Add Note icon. In the Notes dialog box, enter your note, then select Save.

    Note: To format the note, select Style and set font, color, and shadow options.

  8. Save the job:

    1. Select Save.

      The Select File or Directory dialog box opens.

    2. Search for or browse to the folder where you want to save the job.

    3. Optional: Create a folder.

      Select the New Folder icon. In the New folder dialog box, enter a folder name, then select Save.

    4. Optional: Delete a folder.

      Select the folder, then select the Delete icon.

    5. In the Select File or Directory dialog box, select Save.

      The Save Change dialog box opens.

    6. Select Yes to confirm.

Edit job properties

Job properties control how a job behaves and how it logs what it is doing.

To configure job properties, follow these steps:

  1. Log in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, select Pipeline Designer.

    • If you are using the Classic Design, select Switch to the Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select the Recently opened tab or the Favorites tab.

  4. Open the job:

    1. Search for or browse to the job, then select Open.

    2. Select Open files, then in the Select File or Directory dialog box, select the job and select Open.

  5. In the Canvas Action toolbar, select the Settings icon.

    The Job Properties window opens.

  6. Configure the properties in each tab.

    For details, see the tab sections in this topic.

  7. Optional: Generate SQL for the logging table.

    1. Select SQL.

      The Simple SQL editor opens with DDL generated from the job properties.

    2. Optional: Edit the SQL statements.

      See Use the SQL Editor.

    3. Optional: Clear cached results.

      Select Clear cache.

    4. Select Execute.

  8. Select Save.

Job tab

General properties for jobs are found on the Job tab.

Option
Description

Job Name

The name of the job.

Note: This information is required if you want to save to a repository.

Job filename

The file name of the job if it is not stored in the repository.

Description

A user-defined short description of the job which is shown in the repository explorer.

Extended description

A user-defined longer description of the job.

Status

The status of the job. The values are draft and production.

Version

A description of the version.

Directory

The directory in the repository where the job is kept.

Created by

The original creator of the job.

Created at

The date and time when the job was created.

Last modified by

The name of the last user who modified the job.

Last modified at

The date and time when the job was last modified.

Parameters tab

Use the Parameters tab to define parameters for your jobs.

Option
Description

Parameter

A user-defined parameter.

Default value

The default value of the user-defined parameter.

Description

A description of the parameter.

Settings tab

Option
Description

Pass batch ID?

Select to pass the identification number of the batch to the transformation.

Shared objects file

PDI uses a single shared objects file for each user. The default filename is shared.xml and is located in the .kettle directory in the user’s home directory. You can define a different shared objects file, location, and name.

Log tab

Use the Log tab to specify logging settings.

Option
Description

Log connection

Specify the database connection you are using for logging. You can configure a new connection by selecting New.

Log Schema

Specify the schema name, if supported by your database.

Log table

Specify the name of the log table. If you also use transformation logging, use a different table name for job logging.

Logging interval (seconds)

Specify the interval in which logs are written to the table. This property only applies to Transformation and Performance logging types.

Log line timeout (days)

Specify the number of days to keep log entries in the table before they are deleted. This property only applies to Transformation and Performance logging types.

Log size limit in lines

Enter the limit for the number of lines that are stored in the LOG_FIELD. PDI stores logging for the job in a long text field (CLOB). This property only applies to Transformation and Performance logging types.

SQL button

Generates the SQL needed to create the logging table and lets you run the SQL statement.
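
The generated job log table is similar to the transformation log table, with job-specific fields such as JOBNAME in place of TRANSNAME (a representative description, not the exact DDL). When Log line timeout (days) is set, expired rows are removed based on the log date, conceptually equivalent to a statement like the following (illustrative table name and cutoff value; Pipeline Designer performs this cleanup automatically):

  DELETE FROM JOB_LOG
  WHERE LOGDATE < TIMESTAMP '2025-01-01 00:00:00';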

Run a job

After you create a job and configure its properties, you can run the job. You can also control how the job runs without modifying the job itself by configuring run options.

To run a job, follow these steps:

  1. Log in to the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, select Pipeline Designer.

    • If you are using the Classic Design, select Switch to the Modern Design, then select Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select the Recently opened tab or the Favorites tab.

  4. Open the job:

    1. Search for or browse to the job, then select Open.

    2. Select Open files, then in the Select File or Directory dialog box, select the job and select Open.

  5. In the Canvas Action toolbar, select the Run icon, then select an option:

    1. To run the job, select Run.

    2. To run the job with options, select Run Options. Configure options, then select Run.

      For details, see Job run options.

The job runs and the Preview panel opens with the Logging tab selected.

Note: To stop a job while it is running, see Stop transformations and jobs.

Job run options

Option
Description

Select configuration

All jobs are run using the Pentaho server configuration.

Clear log before running

Indicates whether to clear all your logs before you run your job. If your log is large, you might need to clear it before the next execution to conserve space.

Enable safe mode

Checks every row passed through your job and ensures all layouts are identical. If a row does not have the same layout as the first row, an error is generated and reported.

Gather performance metrics

Monitors the performance of your job execution. You can view performance metrics in the Pentaho Data Integration client. For details, see Use performance graphs.

Log level

Specifies how much logging is performed and the amount of information captured:

  • Nothing: No logging occurs.

  • Error: Only errors are logged.

  • Minimal: Minimal logging only.

  • Basic: This is the default level.

  • Detailed: Detailed logging output.

  • Debug: For debugging purposes, very detailed output.

  • Row Level (very detailed): Logging at a row level, which generates a lot of log data.

Debug and Row Level logging levels contain information you may consider too sensitive to be shown. Consider the sensitivity of your data when selecting these logging levels. See the Administer Pentaho Data Integration and Analytics guide for instructions on how best to use these logging methods.

Expand Remote Job

Bundles all required files for a job, including its sub-components, so they can be sent to a remote server for execution. The remote server runs the complete job without needing to retrieve additional files from the original environment.

Start job at

Specifies the step where the job begins execution. By default, execution begins at the Start step.

Parameters

Applies parameter values during runtime. A parameter is a local variable. For details, see Parameters.

Variables

Applies temporary values for user-defined and environment variables during runtime. For details, see Variables.

Arguments

Applies a named, user-supplied, single-value input given as a command line argument when running the job manually or with a script. Arguments are handled according to a job's design. If the job is not designed to handle arguments, nothing happens. Typically, argument values are numbers, strings, or system or script variables. Each job can have a maximum of 10 arguments. For details, see Arguments.

Analyze job results

You can see how your job performed and if errors occurred by viewing logs and job metrics. After you run a job, the Logs panel opens with tabs that help you pinpoint errors.

Logging

The Logging tab displays details for the most recent execution of the job. Error lines are highlighted in red.

Job metrics

The Job Metrics tab shows statistics for each step in your job. Statistics include records read and written, processing speed (rows per second), and errors. Steps that caused the job to fail are highlighted in red.

Job steps in Pipeline Designer

Steps extend and expand the functionality of jobs. You can use the following steps in Pipeline Designer.

Name
Category
Description

Utility

Abort the job.

Conditions

Checks if files exist.

File management

Create a folder.

File management

Create an empty file.

File management

Delete a file.

File management

Delete files.

File management

Delete specified folders. If a folder contains files, PDI deletes them all.

File management

Compare two files.

File management

Get or upload a file using HTTP (Hypertext Transfer Protocol).

General

Execute a job.

General

Set one or several variables.

Scripting

Execute a shell script.

General

Defines the starting point for job execution. Every job must have one (and only one) Start.

General

Clear any error state encountered in a job and force it to a success state.

General

Run a transformation.

Conditions

Wait for a delay.

File management

Wait for a file.

Utility

Write message to log.

Managing transformations and jobs

Managing transformations and jobs includes managing files in Pipeline Designer, managing their connections, and performing tasks common to both transformations and jobs.

Manage transformations and jobs

Manage transformations and jobs on the Pipeline Designer main page. Available actions include marking transformations and jobs as favorites, downloading them as files, creating duplicates, moving them to different folders or to the trash, renaming them, and viewing their details.

To manage a transformation or job, complete the following steps:

  1. Log into the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, click Pipeline Designer.

    • If you are using the Classic Design, click Switch to the Modern Design, and then in the menu on the left side of the page, click Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select either the Recently opened or Favorites tab.

  4. Search for or browse to the transformation or job you want to manage, then click the More Actions icon.

  5. In the More Actions menu, select one of the following options:

    • Not favorite (click to add): Adds the transformation or job to the Favorites tab. It changes the heart icon in the Favorite column from empty to red.

    • Favorite (click to remove): Removes the transformation or job from the Favorites tab. It changes the heart icon in the Favorite column from red to empty.

    • Download: Downloads the transformation or job as a file to your default download folder.

      • Transformations download as KTR files.

      • Jobs download as KJB files.

    • Move: Opens the Move to window. Navigate to a folder (or create one), then click Move here.

    • Duplicate: Opens the Duplicate to window. Navigate to a folder (or create one), then click Paste here.

    • Send to trash: Opens the Send to trash? dialog box. Click Yes to move the transformation or job to the trash.

    • Info: Opens the Transformation Name or Job Name window. You can view details and copy the file path by clicking Copy source.

    • Rename: Makes the transformation or job name editable in the table.

Manage connections for transformations and jobs

While creating or editing a transformation or job in Pipeline Designer, you can define connections to multiple databases provided by multiple database vendors such as MySQL and Oracle. Pipeline Designer ships with the most suitable JDBC drivers for PostgreSQL, our default database.

Pentaho recommends avoiding ODBC connections. The ODBC to JDBC bridge driver does not always provide an exact match and adds another level of complexity, which affects performance. The only time you may have to use ODBC is if no JDBC driver is available. For details, see the Pentaho Community article on why you should avoid ODBC.

When you define a database connection in Pipeline Designer, the connection information (such as the user name, password, and port number) is stored in the Pentaho Repository and is available to other users when they connect to the repository. If you are not using the Pentaho Repository, the database connection information is stored in the XML file associated with your transformation or job. See the Pentaho Data Integration document for details on the Pentaho Repository.

You must have information about your database (such as your database type, port number, user name and password) before you define a JDBC connection. In PDI, you can also set connection properties as variables. Through such variables, your transformations and jobs can access data from multiple database types.

Make sure to use clean ANSI SQL that works on all the database types used.
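
For example, the following query (hypothetical table and column names) sticks to portable ANSI constructs so it runs unchanged on the common database types; the comments note vendor-specific habits to avoid:

  -- Portable: ANSI join syntax, standard aggregate, standard quoting.
  -- Avoid vendor-specific constructs such as Oracle's (+) outer-join
  -- notation or MySQL backtick-quoted identifiers.
  SELECT c.customer_id, COUNT(*) AS order_count
  FROM customers c
  INNER JOIN orders o ON o.customer_id = c.customer_id
  GROUP BY c.customer_id;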

You must have a transformation or job open to manage connections from within the Pipeline Designer. To see steps for opening a transformation or job, see Create a transformation, Create a job, or Edit a transformation or job.

Tasks

If you need to run standard SQL commands against a connection, see Use the SQL Editor.

Define a new database connection

While working on a transformation or job, you can define a new database connection to use.

Before you can create a connection, the appropriate driver must be installed for your particular data connection. Your IT administrator should be able to install the appropriate driver for you. For details, see Specify data connections for the Pentaho Server in the Install Pentaho Data Integration and Analytics guide.

To define a new database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Find Database Connections, click the More Actions icon, and then select New. The Database Connection window opens.

  3. Enter database connection information for your new database connection. The type of connection information entered depends on your access protocol. Refer to the examples in the following sections of this topic for the Native (JDBC) and OCI protocols.

Native (JDBC) protocol information

Create a Native (JDBC) connection in the Database Connection dialog box by completing the following steps:

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select the database you want to use (for example, MySQL or Oracle).

  3. In the Access Type list, select Native (JDBC). The access protocol which appears depends on the database type you select.

  4. In the Settings section, enter the following information:

    Field
    Description

    Host Name

    The name of the server that hosts the database to which you are connecting. Alternatively, you can specify the host by IP address.

    Database Name

    The name of the database to which you are connecting. If you are using an ODBC connection, enter the Data Source Name (DSN) in this field.

    Port Number

    The TCP/IP port number (if it is different from the default)

    Username

    Optional user name used to connect to the database

    Password

    Optional password used to connect to the database

  5. Click Test Connection. A success message appears if the connection is established.

  6. Click OK to close the connection test dialog box.

  7. To save the connection, click Save. The database connection is saved and appears in the Database Connections list.

OCI protocol information

Perform the following steps to create an OCI connection in the PDI Database Connection dialog box:

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select Oracle.

  3. In the Access list, select OCI. The access protocol which appears depends on the database type you select.

  4. In the Settings section, enter the following information as directed by the Oracle OCI documentation.

    Field
    Description

    SID

    The Oracle system ID that uniquely identifies the database on the system

    Tablespace for Data

    The name of the tablespace where the data is stored

    Tablespace for Indices

    The name of the tablespace where the indices are stored

    User Name

    The user name used to connect to the database

    Password

    The password used to connect to the database

  5. Click Test Connection.

    A success message appears if the connection is established.

  6. Click OK to close the connection test dialog box.

  7. To save the connection, click OK to close the Database Connection dialog box.

If you want to use Advanced, Options, or Pooling for your OCI connection, refer to the Oracle OCI documentation to understand how to specify these settings.

Connect to Snowflake using strong authentication

If you are defining a data connection to Pentaho Data Integration and Analytics from a Snowflake data warehouse in the cloud, you can improve connection security by applying strong authentication.

You can apply strong authentication to your defined Pentaho data connection from Snowflake through a key pair.

Configure key pair strong authentication for your Snowflake data connection by completing the following steps:

  1. After entering the information for your Snowflake data connection in the General tab of the Database Connection dialog box, select the Options tab.

  2. Set the key pair parameters as indicated in the following table:

    Parameter
    Value

    authenticator

    snowflake_jwt

    private_key_file

    Specify the name of the private key file you use in your environment. For example, /rsa_key.p8

    private_key_file_pwd

    Specify the password for accessing the private key file you use in your environment. For example, PentahoSnowFlake123

    See https://docs.snowflake.com/en/developer-guide/jdbc/jdbc-configure#private-key-file-name-and-password-as-connection-properties for details on the private key file and its password.

  3. Click Test Connection to verify your connection. A success message appears if the connection is established.

  4. Click OK to close the connection test dialog box.

  5. To save the connection, click OK to close the Database Connection dialog box.

You have applied key pair authentication to your defined data connection between Pentaho and Snowflake.
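
Key pair authentication also requires the matching public key to be registered with the Snowflake user that the connection signs in as. This is done on the Snowflake side, outside Pipeline Designer, typically by an administrator with a statement like the following (hypothetical user name and truncated key value):

  ALTER USER PENTAHO_ETL SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';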

Connect to an Azure SQL database

You can use an Azure SQL database as a data source with the Pipeline Designer. This connection is required if you want to use the Bulk load into Azure SQL DB job entry to load data into your Azure SQL database from Azure Data Lake Storage. Pentaho supports the Always Encrypted option, dynamic masking, and multiple authentication methods for connecting to an Azure SQL database.

Because one physical server may host databases for multiple customers, keep in mind that Azure SQL differs from MSSQL. For more information about the differences between Azure SQL and MSSQL, see https://docs.microsoft.com/en-us/azure/azure-sql/database/features-comparison.

Before you begin

You must have an Azure account with an active subscription and an instance of an Azure SQL database. You also need to install the Azure SQL database drivers. For help installing your drivers, see your Microsoft documentation.

Additionally, you need to obtain the following information from your system administrator:

  • Host name

  • Database name

  • Port number

  • Authentication method

  • Username

  • Password

If you use the Always Encryption Enabled option, you also need to obtain the Client id and Client Secret Key.

Authentication method

Pentaho supports four authentication methods for connecting to the Azure SQL DB instance:

  • SQL Authentication

    Connect using the Azure SQL Server username and password.

  • Azure Active Directory

    Connect using Multi Factor Authentication (MFA). The MFA password must be entered on the displayed webpage.

  • Azure Active Directory with password

    Connect using an Azure AD username and password.

  • Azure Active Directory with integrated authentication

    Connect using the federated on-premises Active Directory Federation Services (ADFS) with Azure Active Directory in the cloud.

Connect to an Azure database

  1. In the Connection Name field, enter a name that uniquely describes this connection. The name can have spaces, but it cannot have special characters (such as #, $, and %).

  2. In the Connection Type list, select Azure SQL DB.

  3. In the Access list, select Native (JDBC).

  4. Enter your database connection information.

    Field
    Description

    Host Name

    The name of the Azure SQL server instance.

    Database Name

    The name of the Azure SQL database to which you are connecting.

    Port Number

    The TCP/IP port number. The Azure SQL Database service is only available through TCP port 1433. You must set your firewall to allow outgoing TCP communication on port 1433.

    Authentication method

    The authentication method used to connect to the Azure SQL DB instance. The default is SQL Authentication.

    Username

    The username used to connect to the database.

    Password

    The password used to connect to the database.

    Always Encryption Enabled

    Select to use encryption. See Use the Always Encryption Enabled option for instructions on using this option.

    Client id

    The unique client identifier, used to identify and set up a durable connection path to the server.

    Client Secret Key

    The unique name of the key value in the Azure Key Vault.

  5. Click Test Connection to verify your connection.

Use the Always Encryption Enabled option

Before you can use the Always Encryption Enabled option, you must perform the following steps. Consult the Microsoft Azure SQL documentation for assistance with your Azure SQL tools.

  1. Generate a column master key in the Azure Key Vault.

  2. Encrypt the column using the column master key.

  3. Register the app under Azure Active Directory and obtain both the Client id and Client Secret Key.

  4. Grant permissions to the Client id for accessing the Azure Key Vault.

  5. Select Always Encryption Enabled and provide the Client id and Client Secret Key.

The Azure Always Encrypted feature is now active.
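
As background, step 1 above corresponds to creating a column master key definition that points at your Azure Key Vault key. This is typically done with your Azure SQL tools rather than in Pipeline Designer; a T-SQL sketch follows (hypothetical key name and vault URL, shown only to illustrate the shape of the setup):

  -- Registers a column master key stored in Azure Key Vault
  CREATE COLUMN MASTER KEY PentahoCMK
  WITH (
    KEY_STORE_PROVIDER_NAME = N'AZURE_KEY_VAULT',
    KEY_PATH = N'https://myvault.vault.azure.net/keys/PentahoCMK/0123456789abcdef'
  );

The column encryption key and the column-level encryption in step 2 are usually created with the Always Encrypted wizard or PowerShell, as described in the Microsoft Azure SQL documentation.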

Clear cached database metadata

When working with complex transformations or jobs, Pipeline Designer might accumulate outdated or incorrect metadata due to changes in the underlying database. You can use the Clear Complete DB Cache option to clear out the outdated or incorrect metadata the next time you access the transformation or job.

Cached metadata might include information about:

  • Table structures

  • Column types

  • Indexes

  • Primary and foreign keys

  • Other schema-related metadata

Note: Clearing cached database metadata does not delete any data from your database, affect transformation or job files, or clear runtime data caches that are used during execution.

To clear cached database metadata, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Find Database Connections, click the More Actions icon, and then select Clear Complete DB Cache. The cache is cleared, and a Success message is displayed. Fresh metadata is retrieved from the database the next time you access it.

Edit a database connection

You can edit an existing database connection to refine and change aspects of the connection.

To edit a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections.

  2. Expand Database Connections, find the database connection you want to edit, and click the More Actions icon.

  3. Select Edit. The Database Connection window opens.

  4. Configure the options in each tab of the Database Connections window:

  5. (Optional) To view features of the database connection, click Feature List.

  6. (Optional) To explore configured database connections, click Explore. For details, see Explore configured database connections.

  7. Click Test Connection. If the connection is established, a success message is displayed.

  8. Click OK to close the success message.

  9. Click Save. The connection is saved and the Database Connections window closes.

General

In the General tab, the options available for editing depend on the type of database connection you are editing. Connection information depends on your access protocol. For details about general connection settings, refer to the examples in Define a new database connection.

Advanced

The Advanced tab contains options for configuring properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.

Option
Description

Supports the Boolean data type

Instructs Pipeline Designer to use native Boolean data types supported by the database.

Supports the timestamp data type

Instructs Pipeline Designer to use the timestamp data type supported by the database.

Quote all in database

Enables case-sensitive table names. For example, MySQL is case-sensitive on Linux, but not case-sensitive on Microsoft Windows. If you quote the identifiers, the database uses a case-sensitive table name.

Force all to lower-case

Enables the system to change the case of all identifiers to lower-case.

Force all to upper-case

Enables the system to change the case of all identifiers to upper-case.

Preserve case of reserved words

Instructs Pipeline Designer to use a list of reserved words supported by the database.

The Preferred Schema name where no schema is used

Enter the preferred schema name to use when no schema is specified (for example, MYSCHEMA).

SQL Code Editor

Enter the SQL statements to execute right after connecting.
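
For example, you might use these startup statements to set session-level defaults. The following is a minimal sketch, assuming a PostgreSQL connection (the schema name is hypothetical); use statements that are valid for your database:

  SET search_path TO staging;
  SET timezone = 'UTC';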

Options

Use the Options tab to add or delete parameters. Parameters enable you to control database-specific behavior.

  • To add more parameters to the list, click Add Row.

  • To delete rows, click the Delete icon next to the row.
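
For example, a MySQL connection might use driver parameters such as the following (values are illustrative; check your JDBC driver documentation for the parameters it supports):

  Parameter: characterEncoding    Value: UTF-8
  Parameter: useSSL               Value: true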

Pooling

Configure options in the Pooling tab to set up a connection pool and define options like the initial pool size, maximum pool size, and connection pool parameters. By default, a connection remains open for each individual report or set of reports in PUC and for each individual step in a transformation in PDI. For example, you might start by specifying a pool of ten or fifteen connections, and as you run reports in PUC or transformations in PDI, the unused connections drop off. Pooling helps control database access, especially if you have dashboards that contain many reports and require a large number of connections. Pooling can also be implemented when your database licensing restricts the number of active concurrent connections.

You can take the following action in the parameters section:

  • To add a new parameter, click Add Row and then enter the Parameter name and Value.

  • To delete a parameter, click the Delete icon.

  • To change how many parameters are shown at one time, select a new Items per page value.

  • If there are multiple pages of parameters, scroll through the pages using the left and right arrows that appear under the list of parameters.

The following table shows an example of Pooling options that might be available in a typical JDBC driver. Check your driver documentation for driver-specific pooling details.

Option
Description

Enable Connection Pooling

Enables connection pooling.

Pool Size

  • Initial

Set the initial size of the connection pool.

  • Maximum

Set the maximum number of connections in the connection pool.

Parameters

You can define additional custom pool parameters. Click any parameter to view a short description of that parameter. Click Restore Defaults to restore the default values for selected parameters. The most commonly used parameter is validationQuery, whose value differs slightly depending on your RDBMS connection. The basic set of Pentaho databases use the following values for validationQuery:

  • For Oracle, use SELECT 1 FROM DUAL.

  • For PostgreSQL, MS SQL Server, and MySQL, use SELECT 1.

Description

Enter a description for your parameters.

Clustering

Use the Clustering options to cluster the database connection and create connections to data partitions in Pipeline Designer. To create a new connection to a data partition, enter a Partition ID, the Host Name, the Port, the Database Name, User Name, and Password for the connection.

If you have the Pentaho Server configured in a cluster of servers, and use the Data Source Wizard (DSW) in PUC to add a new data source, the new data source will only be seen on the cluster node where the user has a session. For the new data source to be seen by all the cluster nodes, you must disable DSW data source caching. This may cause the loading of the data source list to be slower since the list is not cached.

To disable the cache, navigate to the server/pentaho-server/pentaho-solutions/system folder and set the enableDomainIdCache value in the system.properties file to false.
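
For example, assuming the standard key=value format of the system.properties file, the line reads as follows after the change:

  enableDomainIdCache=false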

Delete a database connection

Delete a database connection you no longer need.


To delete a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections.

  2. Expand Database Connections, find the database connection you want to delete, and click the More Actions icon.

  3. Select Delete. The Confirm deletion dialog box opens.

  4. Click Yes to confirm deletion. The database connection is deleted.

Explore configured database connections

The Database Explorer allows you to explore configured database connections. It shows tables, views, and synonyms, along with the catalog, schema, or both to which each object belongs.

To explore a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Expand Database Connections, find the database connection you want to explore, and click the More Actions icon.

  3. Select Explore. The Database Explorer window opens.

  4. (Optional) Click the refresh icon to refresh the list.

  5. Expand the folders and find the item you want to review.

  6. Click Actions, and then select one of the following features:

    Feature
    Description

    Preview first 100

    Returns the first 100 rows from the selected table.

    Preview x Rows

    Prompts you for the number of rows to return from the selected table.

    Row Count

    Specifies the total number of rows in the selected table.

    Show Layout

    Displays a list of column names, data types, and so on from the selected table.

    DDL

    Generates the DDL to create the selected table, based on the connection type selected in the drop-down list.

    View SQL

    Launches the Simple SQL Editor for the selected table.

    Truncate Table

    Generates a TRUNCATE TABLE statement for the current table. Note: The statement is commented out by default to prevent users from accidentally deleting the table data.

    Data Profile

    Provides basic information about the data.

  7. When you finish exploring the database connection, click OK. The Database Explorer window closes.

Show dependencies

Expand the connection to display a list of dependencies across the platform, including transformations and jobs.

To show the dependencies for a database connection, complete the following steps:

  1. With a transformation or job open, on the left side of the Pipeline Designer interface, click the View icon. The View pane opens with the Transformations folder expanded, containing the Database Connections list.

  2. Expand Database Connections, find the database connection you want to explore, and click the More Actions icon.

  3. Select Show dependencies. The database connection is expanded to show the transformations and jobs that depend on that connection.

Edit a transformation or job

Edit an existing transformation or job to rename it, add new steps, edit steps, configure hops, edit properties, reset it, or save it in a different location.

To edit a transformation or job, complete the following steps:

  1. Log into the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, click Pipeline Designer.

    • If you are using the Classic Design, click Switch to the Modern Design, and then in the menu on the left side of the page, click Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select either the Recently opened or Favorites tab.

  4. Open a transformation or job in one of the following ways:

    • Search for or browse to the transformation or job, and then click Open.

    • Click Open files, and then in the Select File or Directory dialog box, search for or browse to the transformation or job and click Open.

  5. Edit the transformation or job by taking one or more of the following actions:

    • Rename a transformation or job

      1. In the transformation or job tab, click the name to make it editable.

      2. Type a new name and press Enter.

      Note: You can also change the name of a transformation or job in the properties window. For details, see Configure transformation properties or Configure job properties.

    • Add a new step

      1. In the Design pane, search for or browse to a step you want to use in the transformation. You may need to expand sections in the Design pane to find steps.

      2. Drag the step you want to add onto the canvas.

    • Edit steps

      Hover over a step to open the step menu, and then select one of the following options for either a transformation or job:

      Step option
      Description

      Delete

      Deletes the step from the canvas.

      Edit

      Opens the Step Name window where you can configure the properties of the step. Step properties may appear in multiple sections, tabs, or both.

      Note: To learn more about the step you're configuring, in the lower-left corner of the Step Name window, click Help.

      Duplicate

      Adds a copy of the step to the canvas.

      For transformations only, you can select one of the following additional options:

      Step option
      Description

      More Actions > Change Number of Copies

      Opens the Number of copies dialog box, where you can enter a number or a variable to specify how many copies of the step are processed in parallel when the transformation or job is run. To find a variable, in the Number of copies (1 or higher) box, click the Select variable to insert icon.

      More Actions > Data Movement

      Opens a list of data movement options that you can select to specify how data rows are distributed to the next steps of the transformation or job. Round-Robin is the default setting.

      • Round-Robin: Distributes rows evenly across all parallel step copies using round-robin logic. This setting optimizes load balancing when the transformation includes multiple instances of the next step.

      • Load Balance: Routes rows to the step copy with the lightest processing load. This setting can improve performance when processing times vary across parallel step instances.

      • Copy Data to Next Steps: Sends each row to all parallel step copies. Use this setting when every downstream branch must process the complete dataset independently.

    • Configure hops

      • To add hops between steps, hover over a step’s handle until a plus sign (+) appears, then drag the connection to the handle of another step.

      • To disable a hop, select it and click the Disable icon.

      • To delete a hop, select it and click the Delete icon.

    • Edit transformation or job properties

      • In the Canvas Action toolbar, click the Settings icon. The Transformation Properties or Job Properties window opens.

      • Change properties in one or more tabs. For details, see Transformation properties or Job properties.

    • Reset the transformation or job

      Note: Resetting a transformation or job clears all steps and connections from the transformation or job and cannot be undone.

      1. In the Canvas Action toolbar, click the Reset icon. The Confirm Reset window opens.

      2. Click Reset. All steps and connections are removed from the transformation or job.

    • Save transformation or job in a different location

      1. Click Save as. The Select File or Directory dialog box opens.

      2. Search for or browse to the folder in the repository where you want to save the transformation or job.

      3. (Optional) To create a new folder in the repository, click the New Folder icon, and then in the New folder dialog box, enter a New folder name and click Save.

      4. (Optional) To delete a folder from the repository, select the folder and click the Delete icon.

      5. In the Select File or Directory dialog box, click Save. The Save Change dialog box opens.

      6. Click Yes to confirm that you want to save the transformation or job.

Use the SQL Editor

The SQL Editor is a good tool to use when you must execute standard SQL commands for tasks such as creating tables, dropping indexes, and modifying fields. The SQL Editor is used to preview and execute DDL (Data Definition Language) generated by Pipeline Designer, such as CREATE TABLE, ALTER TABLE, CREATE INDEX, and CREATE SEQUENCE commands. For example, if you add a Table Output step to a transformation and click the SQL button at the bottom of the Table Output dialog box, Pipeline Designer automatically generates the DDL that the output step needs to function properly and presents it to you in the SQL Editor.

Below are some points to consider:

  • Multiple SQL statements must be separated by semicolons (see the example after this list).

  • Before SQL statements are sent to the database to be executed, Pipeline Designer removes returns, line feeds, and separating semicolons.

  • Pentaho Data Integration clears the database cache for the database connection on which you launch DDL statements.
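
For example, you can enter several DDL statements at once, separated by semicolons. The following is an illustrative sketch (the table, column, and index names are hypothetical; adjust the syntax to your database's dialect, as noted below):

  CREATE TABLE sales_staging (id INTEGER, amount DECIMAL(10,2));
  CREATE INDEX idx_sales_staging_id ON sales_staging (id);
  ALTER TABLE sales_staging ADD COLUMN region VARCHAR(50);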

The SQL Editor does not recognize the dialects of all supported databases. That means that creating stored procedures, triggers, and other database-specific objects may pose problems. Consider using the tools that came with the database in these instances.

Use notes on the canvas

While creating or editing a transformation or job, you can add notes to the canvas to help you and others understand the structure, design decisions, business rules, dependencies, and other aspects of your transformations and jobs.

For instructions on creating and editing transformations and jobs, see Create a job, Create a transformation, and Edit a transformation or job.

With a transformation or job open, take one or more of the following actions.

Add a note

  1. In the Canvas Action toolbar, click the Add Note icon. The Notes dialog box opens.

  2. In the Note box, enter the content for the note.

  3. To format the note, click Style, and then enter the font, color, and shadow options you want to use for the note.

  4. Click Save. The note appears on the canvas.

Edit a note

  1. Hover over the note on the canvas. The Delete and Edit icons are displayed above the note.

  2. Click the Edit icon. The Notes dialog box opens.

  3. In the Note box, edit the content of the note.

  4. To edit the format of the note, click Style, and then edit the font, color, and shadow options for the note.

  5. Click Save. The edited note is saved.

Reposition a note

To reposition a note, click and drag the note on the canvas to the location where you want it to appear.

Delete a note

  1. Hover over the note on the canvas. The Delete and Edit icons are displayed above the note.

  2. Click the Delete icon. The note is deleted.

Run modifiers

Run modifiers include parameters, variables, and arguments, which you can use to modify how you run transformations and jobs.

Parameters

Parameters are local variables that apply only to the specific transformation where you define them. When defining a parameter, you can assign it a default value to use when no value is supplied for it. This feature makes parameters unique among the dynamic input types in Pipeline Designer.

Note: If there is a name collision between a parameter and a variable, the parameter will take precedence.

Variables

A variable in Pipeline Designer is a piece of user-supplied information that you can use dynamically and programmatically in a variety of different scopes. A variable can be local to a single step, or be available to the entire JVM (Java Virtual Machine) that Pipeline Designer is running in.

Pipeline Designer variables can be used in both transformation steps and job entries. For details, see Basic concepts of ETL in Pipeline Designer. You can define variables by using the Set Variables and Set Session Variables steps in a transformation or by editing the kettle.properties file.

The Get Variable step can explicitly retrieve the value of a variable. You can also use a variable in any Pipeline Designer text field that has the dollar sign icon next to it by entering a metadata string in either the Unix or Windows format:

  • ${VARIABLE}

  • %%VARIABLE%%

Both formats can be used and even mixed. In fact, you can create variable recursion by alternating between the Unix and Windows syntax. For example, if you want to resolve a variable that depends on another variable, you could use ${%%inner_var%%}.

Note: If there is a name collision with a parameter or argument, the variable defers to it, and the parameter or argument value is used.

You can also use ASCII or hexadecimal character codes in place of variables, using the format $[hex value]. This makes it possible to escape the variable syntax in instances where you need to put variable-like text into a variable. For instance, if you want to use the literal text ${foobar} in your data stream, you can escape it like this: $[24]{foobar}. Pipeline Designer replaces $[24] with a $ without resolving it as a variable.
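
For example, in a step field that accepts variables (indicated by the dollar sign icon), such as the SQL field of a Table Input step, you might enter a value like the following, where TARGET_SCHEMA is a hypothetical variable name:

  SELECT * FROM ${TARGET_SCHEMA}.customers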

Environment variables

This is the traditional variable type in Pipeline Designer. You define an environment variable through the Set Environment Variables dialog box in the Edit menu, or by hand by passing it as an option to the Java Virtual Machine (JVM) with the -D flag.
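
For example, a variable passed to the JVM with the -D flag takes the following form (MY_INPUT_DIR is a hypothetical name; where you add the option depends on how your server or client is started):

  -DMY_INPUT_DIR=/data/input

The value can then be referenced in text fields as ${MY_INPUT_DIR}.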

Environment variables are an easy way to specify the location of temporary files in a platform-independent way; for example, the ${java.io.tmpdir} variable points to the /tmp/ directory on Unix/Linux/OS X and to the C:\Documents and Settings\<username>\Local Settings\Temp\ directory on Windows.

The only problem with using environment variables is that they cannot be used dynamically. For example, if you run two or more transformations or jobs at the same time on the same application server, you may get conflicts. Changes to the environment variables are visible to all software running on the virtual machine.

Kettle Variables

Kettle variables provide a way to store small pieces of information dynamically in a narrower scope than environment variables. A Kettle variable is local to Kettle, and can be scoped down to the job or transformation in which it is set, or up to a related job. The Set Session Variables step in a transformation allows you to specify the related job that you want to limit the scope to (for example, the parent job, grandparent job, or the root job).

Kettle variables configure various Pipeline Designer-specific options such as the location of the shared object file for transformations and jobs or the log size limit.

To edit Kettle variables manually, complete these steps.

  1. Open the kettle.properties file in a text editor. By default, the kettle.properties file is stored in your home directory or the .pentaho directory.

  2. Edit the file.

  3. When you are finished, save and close the file.
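
For example, kettle.properties entries use a simple key=value format, with lines starting with # treated as comments. The following sketch is illustrative only (MY_DB_HOST is a hypothetical custom variable; KETTLE_LOG_SIZE_LIMIT is an example of a built-in Kettle variable related to the log size limit mentioned above):

  # Custom and built-in Kettle variables
  MY_DB_HOST=dbserver01.example.com
  KETTLE_LOG_SIZE_LIMIT=5000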

Set the LAZY_REPOSITORY variable

The LAZY_REPOSITORY variable restores the directory-loading behavior of the repository to be as it was before Pentaho 6.1.

Note: Changing this variable to false will make repository loading more expensive.

To set the LAZY_REPOSITORY variable, complete these steps.

  1. Open the kettle.properties file in a text editor. By default, the kettle.properties file is stored in your home directory or the .pentaho directory.

  2. Look for KETTLE_LAZY_REPOSITORY and, if it is set to false, change the value to true.

  3. When you are finished, save and close the file.
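
For example, after the change the relevant line in kettle.properties reads:

  KETTLE_LAZY_REPOSITORY=true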

Arguments

An argument is a named, user-supplied, single-value input given as a command line argument (when running a transformation or job manually from Pan or Kitchen, or as part of a script). Each transformation or job can have a maximum of 10 arguments. Arguments are given as space-separated values after the rest of the Pan or Kitchen command line, as shown in the following example:
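
The following command line is an illustrative sketch; the script name, file path, and option syntax depend on your installation and platform, so check the Pan and Kitchen documentation for the exact form:

  sh pan.sh -file:/path/to/example.ktr argOne argTwo argThree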

In the above example, the values argOne, argTwo, and argThree are passed into the transformation, where they are handled according to the way the transformation is designed. If the transformation is not designed to handle arguments, nothing happens. Typically, these values would be numbers, words (strings), or variables (system or script variables, not Pipeline Designer variables).

In Pipeline Designer, you can test argument handling by defining a set of arguments when you run a transformation or job. For details, see Run a transformation or Run a job.

Export a transformation or job

Export transformations and jobs to save backup copies, migrate them to other systems, share them with other users, or use them in other tools. You can export a transformation as a .ktr file or a job as a .kjb file.

To export a transformation or job, complete the following steps:

  1. Log into the Pentaho User Console.

  2. Open Pipeline Designer:

    • If you are using the Modern Design, in the menu on the left side of the page, click Pipeline Designer.

    • If you are using the Classic Design, click Switch to the Modern Design, and then in the menu on the left side of the page, click Pipeline Designer.

    Pipeline Designer opens with the Quick Access section expanded.

  3. In the table at the bottom of the screen, select either the Recently opened or Favorites tab.

  4. Open a transformation or job in one of the following ways:

    • Search for or browse to the transformation or job, and then click Open.

    • Click Open files, and then in the Select File or Directory dialog box, search for or browse to the transformation or job and click Open.

  5. In the Canvas Action toolbar, click the Export icon. The transformation or job downloads to your default download folder as either a .ktr file or .kjb file, respectively.

Stop transformations and jobs

There are two different methods you can use to stop a transformation or job running in Pipeline Designer. The method you use depends on the processing requirements of your ETL task. Most transformations and jobs can be stopped immediately without concern. However, some transformations and jobs ingest records from messaging or streaming sources, and such incoming data might need to be stopped safely to avoid potential data loss.

In the Canvas Action toolbar, take one of the following actions:

  • To stop processing all data immediately, click Stop.

  • To make the transformation or job finish any records that were initiated or retrieved before it stops processing data, click the arrow next to Stop, and then select Stop input processing.
