Loading data from PDI
Load data from Pentaho Data Integration (PDI) to external tools.
Use these features to run outside the PDI client, run remotely, expose virtual tables, and capture lineage.
In this article
Run transformations and jobs from the command line
You can use command line tools to execute PDI content outside the PDI client.
Use them in scripts and schedulers, like cron.
Use Pan to run transformations.
Use Kitchen to run jobs.
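For example, a crontab on Linux can schedule Pan and Kitchen runs overnight. The install path, file paths, and schedule below are illustrative:

```shell
# m h dom mon dow  command
0 2 * * *  /opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/nightly.kjb -level=Basic >> /var/log/etl/nightly.log 2>&1
30 2 * * * /opt/pentaho/data-integration/pan.sh -file=/opt/etl/load_sales.ktr >> /var/log/etl/load_sales.log 2>&1
```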
Command-line reference (Pan, Kitchen, ZIP, export, Hadoop)
Startup script options
Pan and Kitchen recognize the startup-script options used by the PDI client.
These options are in Spoon.bat (Windows) and Spoon.sh (Linux).
To use these options with Pan or Kitchen, add them to your startup script.
Note: The default directory for the startup script is design-tools/data-integration.
FILTER_GTK_WARNINGS
Suppresses GTK warnings from spoon.sh and kitchen.sh. Set to true to suppress warnings. Leave empty to show warnings.
SKIP_WEBKITGTK_CHECK
Suppresses warnings about missing libwebkitgtk when launching the PDI client. Set to true to suppress warnings. Leave empty to show warnings.
KETTLE_HOME
Identifies the user's home directory for PDI configuration files. Use it to change the location of files normally in <user home>/.kettle.
KETTLE_LOG_SIZE_LIMIT
Limits the log size for transformations and jobs that do not set a log size limit property.
KETTLE_JNDI_ROOT
Changes the Simple JNDI path, which contains jdbc.properties.
KETTLE_DIR
Directory where the PDI client is installed.
KETTLE_REPOSITORY
Repository that Kettle connects to at startup.
LIBPATH
Value passed as the -Djava.library.path Java parameter.
PENTAHO_DI_JAVA_OPTIONS
Additional Java arguments when running Kettle. Use it for settings like memory limits.
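As a sketch, these options can be set as environment variables before launching spoon.sh, pan.sh, or kitchen.sh. The paths and memory sizes below are illustrative:

```shell
# Illustrative values; set before launching the PDI scripts.
export KETTLE_HOME=/opt/etl/kettle-config            # config files read from /opt/etl/kettle-config/.kettle
export PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2048m" # raise JVM memory limits
export FILTER_GTK_WARNINGS=true                      # suppress GTK warnings
```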
Pan (run transformations)
Pan runs transformations from a PDI repository (database or enterprise) or a local file.
The options are the same for the shell script and batch file.
Note: Windows uses the forward slash (/) and colon (:) syntax. If option values contain spaces, quote the full argument. Example: "-param:MASTER_HOST=192.168.1.3" "-param:MASTER_PORT=8181".
pan.sh -option=value arg1 arg2
pan.bat /option:value arg1 arg2
Example:
sh pan.sh -rep=initech_pdi_repo -user=pgibbons -pass=lumburgh -trans=TPS_reports_2011
pan.bat /rep:initech_pdi_repo /user:pgibbons /pass:lumburgh /trans:TPS_reports_2011
rep
Enterprise repository name.
user
Repository username.
pass
Repository password.
trans
Name of the transformation to run.
dir
Repository directory that contains the transformation, including the leading slash.
file
Local .ktr file path.
level
Logging level: Basic, Detailed, Debug, Rowlevel, Error, Nothing.
logfile
Log file path.
listdir
Lists directories in the specified repository.
listtrans
Lists transformations in the specified repository directory.
listrep
Lists available repositories.
exprep
Exports all repository objects to one XML file.
norep
Prevents Pan from logging into a repository. Useful when environment variables like KETTLE_REPOSITORY are set, but you want to run a local .ktr.
safemode
Runs in safe mode with extra checking.
version
Shows version, revision, and build date.
param
Sets a named parameter in name=value format. Example: -param:Foo=bar.
listparam
Lists information about named parameters in the specified transformation.
metrics
Gathers metrics during execution.
maxloglines
Maximum number of log lines kept internally. 0 keeps all lines (default).
maxlogtimeout
Maximum age (minutes) of a log line kept internally. 0 keeps lines indefinitely (default).
Pan status codes
Pan returns one of these status codes:
0
Transformation ran without a problem.
1
Errors occurred during processing.
2
Unexpected error during loading or running the transformation.
3
Unable to prepare and initialize the transformation.
7
Transformation could not be loaded from XML or the repository.
8
Error loading steps or plugins.
9
Command line usage was printed.
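A calling script can branch on these codes. The helper function below is written for this example (it is not part of PDI), and the Pan invocation path is assumed:

```shell
# Map Pan's documented exit codes to messages.
pan_status_message() {
  case "$1" in
    0) echo "OK: transformation ran without a problem" ;;
    1) echo "Errors occurred during processing" ;;
    2) echo "Unexpected error during loading or running" ;;
    3) echo "Unable to prepare and initialize the transformation" ;;
    7) echo "Could not load transformation from XML or the repository" ;;
    8) echo "Error loading steps or plugins" ;;
    9) echo "Command line usage was printed" ;;
    *) echo "Unknown status code: $1" ;;
  esac
}

# In a real script (path assumed):
#   ./pan.sh -file=/opt/etl/load_sales.ktr
#   pan_status_message "$?"
```

Kitchen can be wrapped the same way; its codes overlap with Pan's except that 3 is not used.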
Kitchen (run jobs)
Kitchen runs jobs from a PDI repository (database or enterprise) or a local file.
The options are the same for the shell script and batch file.
Note: Windows uses the forward slash (/) and colon (:) syntax. If option values contain spaces, quote the full argument. Example: "-param:MASTER_HOST=192.168.1.3" "-param:MASTER_PORT=8181".
kitchen.sh -option=value arg1 arg2
kitchen.bat /option:value arg1 arg2
rep
Enterprise or database repository name.
user
Repository username.
pass
Repository password.
job
Name of the job (as it appears in the repository) to run.
dir
Repository directory that contains the job, including the leading slash.
file
Local .kjb file path.
level
Logging level: Basic, Detailed, Debug, Rowlevel, Error, Nothing.
logfile
Log file path.
listdir
Lists subdirectories within the specified repository directory.
listjob
Lists jobs in the specified repository directory.
listrep
Lists available repositories.
export
Exports all linked resources of the specified job. Argument is a ZIP filename.
norep
Prevents Kitchen from logging into a repository. Useful when environment variables like KETTLE_REPOSITORY are set, but you want to run a local .kjb.
version
Shows version, revision, and build date.
param
Sets a named parameter in name=value format. Example: -param:FOO=bar.
listparam
Lists information about named parameters in the specified job.
maxloglines
Maximum number of log lines kept internally. 0 keeps all lines (default).
maxlogtimeout
Maximum age (minutes) of a log line kept internally. 0 keeps lines indefinitely (default).
Example:
sh kitchen.sh -rep=initech_pdi_repo -user=pgibbons -pass=lumburghsux -job=TPS_reports_2011
kitchen.bat /rep:initech_pdi_repo /user:pgibbons /pass:lumburghsux /job:TPS_reports_2011
Kitchen status codes
Kitchen returns one of these status codes:
0
Job ran without a problem.
1
Errors occurred during processing.
2
Unexpected error during loading or running the job.
7
Job could not be loaded from XML or the repository.
8
Error loading steps or plugins.
9
Command line usage was printed.
Import .kjb or .ktr files from a ZIP archive
Pan and Kitchen can read PDI content from ZIP files.
Use the ! switch.
Windows example:
Kitchen.bat /file:"zip:file:///C:/Pentaho/PDI Examples/Sandbox/linked_executable_job_and_transform.zip!Hourly_Stats_Job_Unix.kjb"
Linux and Solaris example (escape !):
./kitchen.sh -file:"zip:file:////home/user/pentaho/pdi-ee/my_package/linked_executable_job_and_transform.zip\!Hourly_Stats_Job_Unix.kjb"
Export repository content from the command line
To export repository objects into XML format using command-line tools, pass named parameters when calling Kitchen or Pan.
Example (Kitchen):
call kitchen.bat /file:C:\Pentaho_samples\repository\repository_export.kjb ^
"/param:rep_name=PDI2000" "/param:rep_user=admin" "/param:rep_password=password" ^
"/param:rep_folder=/public/dev" ^
"/param:target_filename=C:\Pentaho_samples\repository\export\dev.xml"
rep_folder
Repository folder
rep_name
Repository name
rep_password
Repository password
rep_user
Repository username
target_filename
Target filename
Note: You can use obfuscated passwords with Encr, the command line tool for encrypting strings for storage and use by PDI.
Example batch file that checks for errors:
@echo off
ECHO This is an example of a batch file calling repository_export.kjb
cd C:\Pentaho\pdi-ee-<version>\data-integration
call kitchen.bat /file:C:\Pentaho_samples\repository\repository_export.kjb "/param:rep_name=PDI2000" ^
"/param:rep_user=admin" "/param:rep_password=password" "/param:rep_folder=/public/dev" ^
"/param:target_filename=C:\Pentaho_samples\repository\export\dev.xml"
if errorlevel 1 goto error
echo Export finished successfully.
goto finished
:error
echo ERROR: An error occurred during repository export.
:finished
REM Allow the user to read the message when testing.
pause
Use Pan and Kitchen with a Hadoop cluster
To use Pan or Kitchen on a Hadoop cluster, configure Pentaho to run transformations and jobs with the PDI client or the Pentaho Server.
You do not need these configurations if the PDI client connects to the Pentaho Repository.
To use Pan and Kitchen from a repository directly on the Pentaho Server, create the named cluster definition in the server repository.
See Connecting to a Hadoop cluster with the PDI client.
Note: If the PDI client and Pentaho Server run on the same platform, cluster configuration files in /home/<user>/.pentaho/metastore can be overwritten. Use the same cluster connection names on both hosts.
Configure the PDI client
Create a connection to the Hadoop cluster where you want to run the job or transformation.
Create and test the job or transformation in the PDI client.
Go to design-tools/data-integration/plugins/pentaho-big-data-plugin.
Open plugin.properties in a text editor.
Set hadoop.configurations.path to the directory that contains metastore. Example: hadoop.configurations.path=/home/<user>/.pentaho
The default metastore location is /home/<user>/.pentaho/metastore.
Save and close plugin.properties.
Configure the Pentaho Server
If the server is on a different host, copy the metastore directory and its contents from the PDI client to a location the server can access. The default metastore location for the PDI client is /home/<user>/.pentaho/metastore.
Go to pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin.
Open plugin.properties in a text editor.
Set hadoop.configurations.path to the directory that contains metastore.
Save and close plugin.properties.
Run and monitor remotely with Carte
Carte is a lightweight web server for running PDI transformations and jobs remotely.
It receives the transformation or job (as XML) plus the run configuration.
It also exposes endpoints to monitor, start, and stop executions.
Carte setup and reference (clusters, security, client config)
Carte clusters
Use a Carte cluster to distribute transformation processing across multiple Carte servers.
A cluster includes:
One master node that tracks execution.
Two or more slave nodes that do the work.
You can also run a single Carte instance as a standalone remote execution engine.
Define one or more Carte servers in the PDI client (Spoon), then send jobs and transformations to them.
You can cluster Pentaho Server for failover. If you use Pentaho Server as the cluster master (dynamic cluster), enable the proxy trusting filter. See Schedule jobs to run on a remote Carte server.
Cluster types
Static cluster
Static clusters have a fixed schema.
You define the master and slave nodes at design time.
Static clusters fit smaller, stable environments.
Dynamic cluster
Dynamic clusters discover slave nodes at run time.
Slave nodes are registered with the master.
PDI monitors slaves every 30 seconds to see if they are available.
Dynamic clusters fit cloud-like environments where nodes come and go.
Set up servers
Prerequisites
Copy required JDBC drivers and PDI plugins from your dev system to each Carte instance.
If you will run content from a Pentaho Repository, copy repositories.xml from your workstation’s .kettle directory to the same location on each Carte server.
Set up a static cluster (start slave servers)
Start each slave server with the host and port you want to expose:
Verify each server is reachable from your PDI client.
(Optional) Create an init/startup script to start Carte on boot.
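The start command (step 1 above) takes the hostname and port as arguments. A typical invocation from the data-integration directory, with an illustrative host and port:

```shell
./carte.sh localhost 8081      # Linux
Carte.bat localhost 8081       # Windows, in a separate command prompt
```

You can then check http://localhost:8081/kettle/status from a browser to verify the server is up.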
When Carte runs embedded in Pentaho Server, configuration is controlled by slave-server-config.xml under .../pentaho-solutions/system/kettle/. Stop Pentaho Server before editing that file.
Set up a dynamic cluster
Dynamic clusters use two configuration files:
carte-master-config.xml for the master.
carte-slave-config.xml for each slave.
You can rename the files.
Keep the required XML structure and values.
Configure a Carte master server
Copy required JDBC drivers and plugins to the master host.
Create carte-master-config.xml using this template:
The master <name> must be unique in the cluster.
Start Carte using the master config file:
Verify the master is running.
(Optional) Create an init/startup script for boot-time startup.
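The master template (step 2 above) has the general shape below; the name, host, port, and credentials are placeholders to replace with your own values:

```xml
<slave_config>
  <slaveserver>
    <name>master1</name>
    <hostname>localhost</hostname>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>Y</master>
  </slaveserver>
</slave_config>
```

Carte accepts the config file as its argument, for example ./carte.sh carte-master-config.xml (Linux) or Carte.bat carte-master-config.xml (Windows).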
Configure Carte slave servers
Ensure the master is running.
Copy required JDBC drivers and plugins to each slave host.
Create carte-slave-config.xml using this template:
Each slave <name> must be unique in the cluster.
(Optional) To use the master’s Kettle properties on a slave, add these tags inside the slave’s <slaveserver>:
Start Carte using the slave config file:
If you use Pentaho Repository content, copy repositories.xml to each slave’s .kettle directory.
Restart the master and slave servers. Restart Pentaho Server if it participates.
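The slave template (step 3 above) defines both the master to report to and the slave itself. Names, hosts, ports, and credentials below are placeholders:

```xml
<slave_config>
  <!-- The master this slave reports to; must match the master's own definition -->
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>localhost</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
  <!-- This slave's own definition; <name> must be unique in the cluster -->
  <slaveserver>
    <name>slave1</name>
    <hostname>localhost</hostname>
    <port>8081</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>
```

For the optional Kettle-properties inheritance, the tags are typically <get_properties_from_master>master1</get_properties_from_master> and <override_existing_properties>Y</override_existing_properties> inside the slave's <slaveserver>; verify the exact tag names against your version's documentation.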
Carte and PDI track object age for transformations and jobs. Objects are purged only when servers are idle. Purge verification runs every 20 seconds.
Configure schedule and remote execution log cleanup
These settings live in slave-server-config.xml.
Stop Pentaho Server before editing this file.
max_log_lines: Max log lines per execution. Use 0 for no limit.
max_log_timeout_minutes: Remove log lines older than this value. Use 0 for no timeout.
object_timeout_minutes: Remove execution entries older than this value. Use 0 for no timeout.
Example:
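A slave-server-config.xml fragment with these settings might look like the following; the values are illustrative (1440 minutes is 24 hours):

```xml
<slave_config>
  <max_log_lines>0</max_log_lines>
  <max_log_timeout_minutes>1440</max_log_timeout_minutes>
  <object_timeout_minutes>1440</object_timeout_minutes>
</slave_config>
```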
Security and advanced server settings
Configure Carte servers for SSL
Carte SSL uses the JKS keystore format.
Keep the keystore in a restricted-access directory.
Carte runs on Jetty.
For Jetty SSL details, see: https://wiki.eclipse.org/Jetty/Howto/Configure_SSL.
Stop Carte.
Open carte-master-config.xml.
Add these values inside the master server <slaveserver>:
keyStore (required): Path to the keystore file.
keyStorePassword (required): Keystore password.
keyPassword (optional): Private key password. Omit it if it matches keyStorePassword.
Use the encr tool in the data-integration directory to obfuscate passwords: encr.bat -carte <password> (Windows) or encr.sh -carte <password> (Linux).
Add the same <sslConfig> block to each carte-slave-config.xml.
Start Carte.
Access Carte over HTTPS:
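A sketch of the <sslConfig> block, with an assumed keystore path and placeholder passwords:

```xml
<sslConfig>
  <keyStore>/opt/pentaho/certs/keystore.jks</keyStore>
  <keyStorePassword>changeit</keyStorePassword>
  <keyPassword>changeit</keyPassword>
</sslConfig>
```

With SSL enabled, browse to https://localhost:8443/kettle/status (host and port illustrative) instead of the http URL.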
Configure Carte servers for JAAS
You can use JAAS for user authentication.
Create a JAAS config file and save it as carte-ldap.jaas.conf on the Carte host.
Set debug="false" in production environments.
Add these Java options to Spoon.bat (Windows) or spoon.sh (Linux), updating the path:
Start Carte. Verify the server does not prompt for BASIC authentication.
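At minimum, the standard JAAS system property must point the JVM at the config file. The path below is assumed, and your Carte version may require additional options; check your startup script:

```shell
# Append the JAAS login config to the OPT variable the startup scripts pass to Java.
OPT="$OPT -Djava.security.auth.login.config=/opt/pentaho/carte-ldap.jaas.conf"
```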
Change Jetty server parameters
Carte uses an embedded Jetty server.
Only change these settings if you need to tune connection handling.
acceptors: Threads dedicated to accepting connections. Keep it at or below the CPU count.
acceptQueueSize: Backlog size before the OS starts rejecting connections.
lowResourcesMaxIdleTime: Close idle connections faster under high load.
Set Jetty parameters in a Carte config file
Add this block inside <slave_config> in carte-slave-config.xml:
Adjust values, then save the file.
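A sketch of the block, using element names that mirror the Jetty settings described above; the values are illustrative, not recommendations:

```xml
<jetty_options>
  <acceptors>2</acceptors>
  <acceptQueueSize>50</acceptQueueSize>
  <lowResourcesMaxIdleTime>8000</lowResourcesMaxIdleTime>
</jetty_options>
```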
Set Jetty parameters in kettle.properties
Set these variables to numeric values:
KETTLE_CARTE_JETTY_ACCEPTORS
KETTLE_CARTE_JETTY_ACCEPT_QUEUE_SIZE
KETTLE_CARTE_JETTY_RES_MAX_IDLE_TIME
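For example, in kettle.properties (values illustrative):

```properties
KETTLE_CARTE_JETTY_ACCEPTORS=2
KETTLE_CARTE_JETTY_ACCEPT_QUEUE_SIZE=50
KETTLE_CARTE_JETTY_RES_MAX_IDLE_TIME=8000
```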
Configure the PDI client
Initialize slave servers
Open a transformation.
In Explorer View, select the Slave tab.
Select New.
Enter the slave server connection details:
Server name
Hostname or IP address
Port (leave blank for port 80)
Web App Name (required only for Pentaho Server)
User name and password
Is the master
For clustered executions, define one master and the rest as slaves.
Select OK.
Create a cluster schema
In Explorer View, right-click Kettle cluster schemas, then select New.
Configure:
Schema name
Port: Starting port for slave step numbering.
Sockets buffer size
Sockets flush interval rows
Sockets data compressed?
Dynamic cluster: Enable if a master Carte server performs failover.
Slave Servers: Add one master and any number of slaves.
Run transformations in a cluster
Open the Run Options window (toolbar Run context menu or F8).
Select a run configuration that runs the transformation in clustered mode.
To run a clustered transformation from a job, open the Transformation job entry, then set Run this transformation in a clustered mode? on the Advanced tab.
To assign a cluster to a step, right-click the step, select Clusters, then pick a cluster schema.
When running clustered transformations, enable Show transformations to see the generated transformations that run on the cluster.
Schedule and run remotely
Schedule jobs to run on a remote Carte server
These changes are required to schedule a job to run on a remote Carte server.
They are also required if Pentaho Server acts as the load balancer in a dynamic Carte cluster.
Stop Pentaho Server and the remote Carte server.
Copy repositories.xml from your workstation’s .kettle directory to the same location on the Carte host.
Open .../tomcat/webapps/pentaho/WEB-INF/web.xml.
In the Proxy Trusting Filter section, add the Carte server IP to TrustedIpAddrs.
Uncomment the proxy trusting filter mappings between the <!-- begin trust --> and <!-- end trust --> markers.
Save web.xml.
Add -Dpentaho.repository.client.attemptTrust=true to the Carte startup script:
Windows (Carte.bat): add it to the OPT line.
Linux (Carte.sh): add it to the OPT variable before export OPT.
Start the Carte server and Pentaho Server.
Run transformations and jobs from a repository on the Carte server
Copy repositories.xml from the user’s .kettle directory to the Carte host’s $HOME/.kettle directory.
Carte also looks for repositories.xml in the directory where you started Carte.
Stop Carte
You can stop Carte from the command line or from a URL.
Stop from the CLI
Arguments: the hostname and port of the running Carte instance (or its configuration file).
Options:
-h, --help: Help text.
-s, --stop: Stop the running Carte server.
-u, --username <arg>: Admin user name.
-p, --password <arg>: Admin password.
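For example, to stop an instance listening on localhost:8080 (credentials illustrative):

```shell
./carte.sh localhost 8080 -s -u cluster -p cluster
```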
Stop from a URL
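Assuming Carte listens on localhost:8080, the stop endpoint has this shape (you are prompted for credentials):

```
http://localhost:8080/kettle/stopCarte
```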
Expose transformation output with Pentaho Data Services
Use Pentaho Data Services to expose a transformation step as a virtual table.
Query it over JDBC using SQL.
You need a Pentaho Server and repository to publish a data service.
Pentaho Data Services guide (install, create, test, optimize)
Prototyping a data model can be time consuming, particularly when it involves setting up databases, creating the data model, setting up a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.
One way to streamline this process is to make the output of a transformation step a Pentaho Data Service.
The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database.
Instead, results are published to the Pentaho Server as a virtual table.
You must have a Pentaho Server and repository to publish the data service.
The virtual table is a JDBC-compliant data source that you and others can connect to or query with SQL, provided they can access the server and the transformation.
The Pentaho Data Service can be connected to or queried by a JDBC-compliant tool such as Pentaho Report Designer, Pentaho Interactive Reports, and CTools as well as other compatible tools like RStudio, DBVisualizer, or SQuirreL.
The Pentaho Data Service is also useful when building and maintaining a data warehouse is impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or quickly evolving data sets on the fly.
For example, if you want to compare your product prices with your competitors, you can create a transformation that blends prices from your in-house data sources and competitor prices.
Then, you can convert the output step in the transformation into a Pentaho Data Service that creates a virtual table for querying when you connect to the Pentaho Server.
You or others can connect to and query the virtual table, as you would any other JDBC data source to visualize the results in Analyzer or another tool.
The Pentaho Data Service also has a testing tool.
This tool generates several logs and reports that you can use to refine the data service and determine where to apply specialized optimizations.
You can also define parameters that others can use to pose customized queries.
For example, you can create a data service that publishes a virtual “fact” table of a moderately-sized research dataset to a Pentaho Server.
You can test and add optimizations and parameters, such as gender or test type so that the data service runs more quickly.
Then, you can share connection and parameter information with a group of researchers, who can query the virtual table.
Researchers can use Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.
Pentaho Data Services support a subset of SQL.
For details, see SQL support reference.
Install the Pentaho Data Services plugin before you create data services.
In this article
Install Pentaho Data Services
Pentaho Data Services plugin is an optional plugin.
Pentaho Data Integration
Install the plugin in Pentaho Data Integration by using the Plugin Manager. For instructions, see Install plugins.
Pentaho Server
Install the plugin on Pentaho Server by using the Plugin manager. For instructions, see Install plugins.
After you have installed the plugin, add the driver for the Pentaho Data Services plugin by completing the following steps:
In the PDI client, open a new transformation.
In the View tab, expand the Configurations section.
Right-click Data Service and select Driver Details.
In the Driver Details window, click Get Pentaho Driver.
In the Save As window, indicate where you want to save the PDI-Data-Service-Driver-Pentaho.jar file, then click Save.
Move or copy the PDI-Data-Service-Driver-Pentaho.jar file into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder.
Close the transformation without saving.
Restart the Pentaho Server.
Create a regular or streaming Pentaho Data Service
You can create either a regular data service or a streaming analytics service.
Streaming data services are commonly used when creating streaming dashboards with CTools.
Save the transformation to the Pentaho Server.
Right-click the transformation step that outputs the data you want.
Select Data Services > New.
Enter a unique name in Service Name (Virtual Table Name).
Confirm Output step is the step you selected.
Optional: Select Streaming for Data Service Type.
Select OK.
Data service badge
After you create a Pentaho Data Service from a step, a data service badge is added to that step.
The badge indicates whether the step has a regular or streaming data service.

Open or edit a Pentaho Data Service
To open a data service for viewing or editing, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Edit.
Right-click the step with the data service badge, then select Data Services > Edit.
Delete a Pentaho Data Service
To delete a data service, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Delete.
Right-click the step with the data service badge, then select Data Services > Delete.
Test a Pentaho Data Service
After creating your data service, test it to ensure that it runs properly and generates the data you need.
Testing can uncover bottlenecks in your transformation and help you decide which optimization techniques to apply.
Run a basic test
To run a basic test on a regular data service:
Verify Data Service Type is set to Regular.
Open the Test Data Service window using one of the following:
In the Data Service window, select Test Data Service.
In Explore > View > Data Services, right-click a data service and select Test.
Right-click the step with the data service badge, then select Data Services > Test.
Optional: Adjust settings:
Log Level. Controls how much detail appears in logs.
Max Rows. Limits how many rows appear in results.
Select Execute SQL.
Review the output in Examine test results.
Optional: If you need a clean run, clear the cache and test again.
Select Close.
Optional: Add an optimization.
Run a streaming optimization test
When you test streaming data, the stream is partitioned into windows (batches).
Windows can be time-based or row-based.
To test a streaming data service:
Verify Data Service Type is set to Streaming.
Open the Test Data Service window.
Select a window mode (Time Based or Row Based).
Configure window settings:
Window Size
Every
Limit
Optional: Adjust Log Level and Max Rows.
Select Execute SQL.
Review the output in Examine test results.
Select Stop to stop execution.
Select Close.
Optional: Add an optimization.
Run an optimization test
If you have added an optimization, run a test that passes the optimization parameter.
Example query:
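A query of this shape passes a value to a pushdown parameter through the PARAMETER literal described in the SQL support reference; the service and parameter names here are illustrative:

```sql
SELECT * FROM employee_service WHERE PARAMETER('countryParam') = 'US'
```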
Select Preview Optimization in the Test Data Service window.
Use Examine test results to interpret results.
Examine test results
Test results appear in the tabs in the bottom half of the Test Data Service window.
When you test or run a data service, two transformations run:
The service transformation. This is the transformation you built.
The generated transformation. PDI generates this based on executed SQL.
The following table describes the tabs and what to look for:
Query Results
Events during the test run and query results.
Verify results. For streaming, watch updates per window.
Optimized Queries
Processing information and results of optimizations.
Verify optimizations applied correctly. Compare with Service Metrics and Generated Transformation Logging.
Service Transformation Logging
Logs from the service transformation.
Check for design or runtime issues. Compare with Generated Transformation Logging.
Generated Transformation Logging
Logs from the generated transformation.
Check SQL support issues. See SQL support reference.
Service Metrics
GANTT chart timings for the service transformation.
Find bottlenecks. Compare with SQL Trans Metrics.
SQL Trans Metrics
GANTT chart timings for the generated transformation.
Find bottlenecks. Compare with Service Metrics.
SQL support reference
The Pentaho Data Service is designed to support a subset of SQL clauses and literals that are useful for data blending, optimizations, and other scenarios.
Limitations and constraints are listed at the end.
Supported SQL literals
The Pentaho Data Service supports the following literals:
Strings use single quotation marks. Escape a single quote using another single quote: ''.
Dates use square brackets. Supported formats:
[yyyy/MM/dd HH:mm:ss.SSS]
[yyyy/MM/dd HH:mm:ss]
[yyyy/MM/dd]
For an IN list, date formats can use single quotes and dashes, for example: SELECT * FROM BUILDS WHERE BuildDate IN ('2015-03-18', '2015-03-22'). You cannot use bracketed date formats in an IN list.
Number and BigNumber cannot use grouping symbols. Use . for decimals, for example 123.45. Integers contain digits only.
Boolean values can be TRUE or FALSE.
Supported SQL clauses
The Pentaho Data Service supports the following clauses.
SELECT
COUNT(field)
COUNT(*)
COUNT(DISTINCT field)
DISTINCT <fields>
IIF(condition, true-value or field, false-value or field)
CASE WHEN condition THEN true-value ELSE false-value END
SUM
AVG
MIN
MAX
Aliases with AS or with spaces. Example: SUM(sales) AS "Total Sales" or SUM(sales) TotalSales
Constant expressions. See Supported SQL literals.
FROM
Only one Pentaho service name.
Aliases for the service name.
Omit the service name to query an empty row.
SELECT 1 and SELECT 1 FROM dual are the same.
WHERE
Nested brackets
AND, OR, NOT followed by brackets (example: NOT (A=5 OR C=3))
Precedence
Literals (String and Integer)
PARAMETER('parameter-name')='value' (always evaluates to TRUE)
Operators: =, <, >, <=, =<, >=, =>, <>
LIKE (wildcards: %, ?)
REGEX
IS NULL, IS NOT NULL
IN
Conditions on IIF or its alias
DATE_TO_STR(date-field, <mask>)
GROUP BY
Group on fields. Not on IIF().
LIMIT
LIMIT
LIMIT offset, count
LIMIT count OFFSET offset
HAVING
Apply conditions to aggregates, not aliases.
Use identical strings for expressions.
You can use conditions on aggregations that are not in the SELECT clause.
ORDER BY
Order on any column, even if not in the result.
Order on IIF or CASE WHEN expressions.
Other development considerations
Keep these constraints in mind:
You cannot JOIN one data service virtual table to another.
Data services use the Memory Group By step to group. Watch memory consumption for many groups.
You cannot specify the same field twice in the same SELECT clause.
Calculations and functions like string concatenation are not supported in queries. Do them in the transformation.
Optimize a Pentaho Data Service
As you test your data service, you might see bottlenecks or steps that could run more efficiently.
If you want to improve performance, apply an optimization technique.
Service cache
This optimization stores results in a cache.
By default, caching is enabled and results are stored for an hour.
Use this technique when result sets are modest and you expect repeat queries.
How the service cache optimization works
If you run the data service while results are cached, PDI can run your query against cached data instead of running the full transformation.
This behavior depends on:
Whether the cached dataset contains all the records required by the new query
Other optimizations you apply

When you run tests from the Test Data Service window and change the SQL, PDI does not use the cache.
Tests only return up to Max Rows, so using cached results could return incomplete results.
Adjust the cache duration
In the Data Service window, select the Service Cache tab.
Select Enable Caching.
Update Cache Duration (seconds).
Select OK.
Disable the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Clear the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Open the Data Service window again and return to the Service Cache tab.
Select Enable Caching.
Query pushdown
Use Query Pushdown to translate the SQL WHERE clause into a corresponding WHERE clause in:
Table Input steps
MongoDB Input steps
The input queries are filtered and handled at the source.
How query pushdown works
To apply Query Pushdown, set the optimization values, then add the optimization parameter to the input step query.
The optimization uses a parameter in place of the WHERE clause, for example:
SELECT * FROM 'employee' WHERE ${countryParam}

Add the query pushdown parameter to Table Input or MongoDB Input
Create a transformation with a Table Input or MongoDB Input step.
Open the input step.
Add a parameter where your WHERE clause value belongs:
SQL: SELECT * FROM media WHERE ${countryParam}
MongoDB: {$match : ${mongoDbParam}}
Press CTRL+SPACE to list parameters.
Select Replace Variables in Script?.
Select OK.
Set up query pushdown parameter optimization
Open the Data Service window and select the Query Pushdown tab.
Select the plus sign near Parameters.
Enter the name of the optimization parameter you used in the input query.
Select OK.
Select the step that contains the parameter in Step Name.
In Definitions, map fields:
Data Service Field. Field name as it appears in transformation output.
Step Field. Field name as it appears in the data source.
Optional: Select Get Optimizations to generate optimizations automatically.
Select OK.
Test and publish the data service.
Disable the query pushdown optimization
To disable this optimization, select Disable an Optimization in the Data Services window.
Parameter pushdown
Parameter Pushdown can be applied to any step in the transformation.
It maps a field value to a transformation parameter for simple equality conditions (example: WHERE region="South").
How parameter pushdown works
To set up Parameter Pushdown, configure the optimization, then add the parameter to the step.

Add the parameter pushdown parameter to the step
Add the parameter to a step, such as JSON Input or REST Client.
Select OK.
Set up parameter pushdown optimization
Open the Data Service window and select the Parameter Pushdown tab.
In WHERE Clause Column, enter the data service field name from the existing list.
Update Transformation Parameter as needed. The name must be unique in the data service.
Optional: Update Value Format. The default %s is usually sufficient.
Select OK.
Test and publish:
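The Value Format field controls how the pushed-down value is rendered into the transformation parameter. Assuming printf-style formatting (with the default %s the value passes through unchanged), the behavior can be sketched as follows; the helper name is hypothetical.

```java
public class ValueFormatSketch {
    // Render a pushed-down value with a printf-style format string,
    // as the Value Format field does (default "%s").
    static String render(String valueFormat, Object value) {
        return String.format(valueFormat, value);
    }

    public static void main(String[] args) {
        // WHERE region="South" maps the value "South" to the parameter.
        System.out.println(render("%s", "South"));   // South
        System.out.println(render("'%s'", "South")); // 'South'
    }
}
```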
Streaming optimization
This optimization limits the batch size used for processing.
Streaming records are partitioned into windows (batches) for processing.
How streaming optimization works
You can optimize processing by setting either:
Rows Limit. Maximum rows in a window.
Time Limit. Maximum elapsed time to create a window.
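The two limits above partition the stream into batches: a window is emitted when it reaches the row limit or when the time limit elapses, whichever comes first. A minimal sketch of the row-limit case (the time check is elided for a deterministic example; this is not the PDI implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    // Partition a stream of rows into windows bounded by Rows Limit.
    // A real streaming implementation would also flush on Time Limit.
    static List<List<String>> windowsByRows(List<String> rows, int rowsLimit) {
        List<List<String>> windows = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String row : rows) {
            current.add(row);
            if (current.size() >= rowsLimit) { // Rows Limit reached: emit window
                windows.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) windows.add(current); // flush the remainder
        return windows;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
        System.out.println(windowsByRows(rows, 2)); // [[r1, r2], [r3, r4], [r5]]
    }
}
```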
Adjust the row or time limits
In the Data Service window, select the Streaming tab.
Update Rows Limit or Time Limit.
Select OK.
Test and publish:
Publish a Pentaho Data Service
To publish your data service, save the transformation containing the data service to the Pentaho Repository.
To validate that your data service was published:
Ensure external assets required by the transformation are accessible on the server.
Open a browser and log in to the Pentaho Server.
List data services on the server, for example:
Share a Pentaho Data Service with others
Once your data service is created and tested, you can share it so others can connect and query it.
Share prerequisites
Ensure the user or group has permissions to:
Access the Pentaho Server and run queries.
Read and run the transformation that contains the data service.
Provide the data service name as saved in the Pentaho Repository.
If you use optimizations, provide names and definitions for any parameters.
Provide connection instructions:
Connect to the Pentaho Data Service from a Pentaho tool
A Pentaho Data Service is a virtual table containing the output of a step in a PDI transformation.
You can connect to and query:
A regular data service from Pentaho tools such as Report Designer, Analyzer, and the PDI client.
A streaming data service from a dashboard created with CTools.
You need the data service name and permission to run the transformation and access the Pentaho Server.
Connection parameters:
Connection Name. Unique name of the data service.
Connection Type. Pentaho Data Services.
Access. Native (JDBC).
Hostname. Pentaho Server hostname or IP address. Default is localhost for local installs.
Port Number. Pentaho Server port. Default is 8080.
Web App Name. Default is pentaho.
Username. A user allowed to run the data service.
Password. Password for that user.
Optional connection parameters:
proxyhostname. Proxy server for HTTP connections.
proxyport. Proxy server port.
nonproxyhosts. Hosts that bypass the proxy. Separate multiple hosts with commas.
debugtrans. File path where PDI saves the generated debug transformation. Example: /tmp/debug.ktr.
PARAMETER_[optionname]=value. Sets a transformation parameter. Example: PARAMETER_model=E6530.
secure. Set to TRUE for HTTPS. Default is HTTP.
Connect to the Pentaho Data Service from a non-Pentaho tool
You can connect to and query a data service from tools like RStudio or SQuirreL.
To connect and query, you need the data service name and permission to run the transformation and access the server.
To use a non-Pentaho tool, install the data service JDBC driver, then create a connection in your tool.
Step 1: Download the Pentaho Data Service JDBC driver
You can download the driver using the PDI client or manually.
Download using the PDI client
Open the transformation and identify the step with the data service badge.
Open Driver Details using one of the following:
Right-click the step and select Data Service > Driver Details.
In Explore > View > Data Services, right-click a data service and select Driver Details.
In the Data Services window, select Driver Details.
In Driver Details, select Get Third-Party Driver.
Save Pentaho-Data-Service-Driver.zip.
Select Close.
Download manually
Go to <Pentaho installation directory>/design-tools/data-integration/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver.
Copy pdi-dataservice-driver-thirdparty-<release-version>.jar.
These steps assume you installed the Pentaho Data Services plugin. See Install Pentaho Data Services.
Step 2: Install the Pentaho Data Service JDBC driver
Extract the driver files and copy them to your application's JDBC driver directory.
Stop and start the application so it loads the driver.
Create a connection using the instructions in Connect to the Pentaho Data Service from a non-Pentaho tool.
Step 3: Create a connection from a non-Pentaho tool
Most tools let you create a JDBC connection by specifying a driver class and connection string.
Driver class:
org.pentaho.di.trans.dataservice.jdbc.ThinDriver
Connection string format:
Example:
webappname is required when the data service runs on Pentaho Server.
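A minimal Java sketch of assembling the connection. The URL form shown (jdbc:pdi://<host>:<port>/kettle?webappname=<webapp>) is the commonly documented format for the thin driver, but treat it as an assumption and confirm it against your driver version; buildUrl is a hypothetical helper.

```java
// Requires the Pentaho Data Service JDBC driver on the classpath to connect.
public class DataServiceConnect {
    // Assemble a thin-driver URL (assumed format; verify for your version).
    static String buildUrl(String host, int port, String webapp) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle?webappname=" + webapp;
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 8080, "pentaho");
        System.out.println(url);
        // With the driver installed, a connection would look like this:
        // Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver");
        // try (java.sql.Connection c =
        //          java.sql.DriverManager.getConnection(url, "user", "password")) {
        //     // query the data service's virtual table here
        // }
    }
}
```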
Query a Pentaho Data Service
You can query a data service using SQL.
If the transformation uses a parameter, you can assign a value in your SQL query.
Limitations:
SQL support is limited. See SQL support reference.
Data services can be queried only with SQL.
Example: query with a parameter
Syntax:
Example:
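One way to assign a parameter value, grounded in the optional connection parameters listed earlier, is to append PARAMETER_<name>=<value> to the connection string; the transformation then receives that value when the query runs. A sketch (the helper name is hypothetical):

```java
public class ParameterOption {
    // Append a transformation parameter to a data service connection URL
    // using the documented PARAMETER_<name>=<value> option.
    static String withParameter(String url, String name, String value) {
        char sep = url.contains("?") ? '&' : '?';
        return url + sep + "PARAMETER_" + name + "=" + value;
    }

    public static void main(String[] args) {
        String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho";
        System.out.println(withParameter(url, "model", "E6530"));
        // jdbc:pdi://localhost:8080/kettle?webappname=pentaho&PARAMETER_model=E6530
    }
}
```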
Monitor a Pentaho Data Service
To monitor a data service, enter one of the following URLs in your browser:
List names of data services:
List data service status:
Replace <Pentaho Server Host:Port> with the host name or IP address and port for the Pentaho Server running the data service.
You must be authenticated and authorized on the Pentaho Server to access these endpoints.
Capture lineage with the OpenLineage plugin
Use the OpenLineage plugin to emit standardized lineage events from PDI.
Use these events in Pentaho Data Catalog (PDC) for end-to-end data flow visibility.
OpenLineage plugin guide (setup, configuration, supported steps)
The Pentaho Data Integration (PDI) OpenLineage plugin enables PDI to emit rich, standardized OpenLineage events that can be consumed by Pentaho Data Catalog (PDC) to capture how data moves and is transformed in PDI ETL pipelines.
PDC uses information it captures to provide visual end-to-end transparency of data flows, which improves data observability, strengthens compliance and governance, aids in troubleshooting data issues, and enhances data trust and quality for business users.
When supported transformations are executed, PDI emits OpenLineage events by discovering input and output datasets and, when possible, generating column-level lineage.
The OpenLineage plugin emits events for:
Start: transformation starts
Complete: transformation ends
Abort: transformation was stopped without errors
Fail: transformation ended with errors
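For reference, an OpenLineage run event is a JSON document whose eventType corresponds to the states above. The following minimal event is an illustration based on the public OpenLineage specification; the namespace, job name, producer URI, and dataset paths are hypothetical, and the exact facets the PDI plugin emits may differ.

```json
{
  "eventType": "COMPLETE",
  "eventTime": "2024-01-01T12:00:00.000Z",
  "producer": "https://example.com/pdi-openlineage-plugin",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "pdi", "name": "TextInput and Output using variables" },
  "inputs":  [ { "namespace": "file", "name": "/path/to/input.txt" } ],
  "outputs": [ { "namespace": "file", "name": "/path/to/output.txt" } ]
}
```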
Compatibility matrix
OpenLineage plugin functionality is certified to work as intended for the following versions of PDI:
10.2.0.1 (SP1)
10.2.0.2 (SP2)
10.2.0.3 (SP3)
10.2.0.4 (SP4)
10.2.0.5 (SP5)
10.2.0.6 (SP6)
11.0
Setting up the plugin
Before you begin, verify that you have a valid license for the OpenLineage plugin.
For information about licenses, see Acquire and install enterprise licenses.
To set up the OpenLineage plugin, you must complete the following tasks:
Download the plugin
Download the OpenLineage plugin from the Pentaho Support Portal.
On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet.
In the Pentaho card, click Download. The Downloads page opens.
In the .x list, click Pentaho EE Marketplace Plugins Release.
Scroll to the bottom of the page.
In the Marketplace Plugins section, click Open Lineage.
Download the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file.
Install the plugin
Install the OpenLineage plugin in the PDI client and Pentaho Server by running commands appropriate for your operating system.
Note: The plugin can be installed in the PDI client, Pentaho Server, or both.
Installation commands include the following placeholders that must be replaced:
<path-to-data-integration>: Replace with the full path to the PDI client.
<path-to-pentaho-server>: Replace with the full path to the Pentaho Server.
<version_check_option>: Replace with one of the following options:
none: Installs the plugin on any version of Pentaho. If the Pentaho version is unsupported, an error is shown.
loose: Default option. Installs the plugin on certified and compatible, newer Pentaho versions.
strict: Installs the plugin only on certified Pentaho versions.
To install the OpenLineage plugin, complete the following steps:
Stop the PDI client and Pentaho Server.
Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.
In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.
In the command prompt, run the following installation commands for your operating system, replacing the placeholders for paths and version check options.
Windows
PDI client:
install.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
install.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Linux
PDI client:
./install.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
./install.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Start the PDI client and Pentaho Server.
Generate an encrypted password
If you plan to emit events to PDC and want to avoid storing your password in plain text, you can generate an encrypted password to authenticate to PDC.
The encrypted password is used in the configuration file for the OpenLineage plugin.
On the computer where the PDI client or PDI Server is installed, open a command prompt.
Run one of the following commands for your operating system:
Windows
To generate a password using the default Pentaho encryption seed, run the following command:
To generate a password using your own custom encryption seed, run the following command:
Linux
To generate a password using the default Pentaho encryption seed, run the following command:
To generate a password using your own custom encryption seed, run the following command:
An encrypted password is generated and displayed in the command prompt, like the following example:
Create a configuration file for the plugin
After you install the plugin, create a configuration file that specifies where to send open lineage events.
You can create a simple configuration file for testing or a custom configuration to use in production.
In a text editor, create a configuration file with content from one of the following examples, based on your needs:
To create a simple configuration file that you can use to quickly validate that the plugin is working, include only the following content:
To create a custom configuration file that includes OpenLineage event consumers in your Pentaho deployment, such as a PDC Server, include the following content:
Save the file as openlineageConfig.yml in the PDI directory that contains your user-specific configuration files.
Notes:
By default, user-specific configuration files are stored in the .kettle directory, which is usually in one of the following locations:
Windows: C:\Documents and Settings\example_user\.kettle
Linux: ~/.kettle
However, if you run PDI in a container, configuration files might resolve to the /root/.kettle directory.
You can add multiple HTTP consumers in the configuration file.
Enable the plugin
After you install the OpenLineage plugin and create its configuration file, you must enable the plugin so that it can send open lineage events to the consumers you specified in the configuration file.
Enable in PDI client
Enable the plugin in the PDI client by completing the following steps:
Log in to the PDI client and click Edit > Edit the kettle.properties file. The Kettle properties window opens.
To make the plugin active, add the following variable and value:
KETTLE_OPEN_LINEAGE_ACTIVE=true
To point PDI to your openlineageConfig.yml file, add the following variable with the <path-to-config-file> placeholder replaced by the full path to your configuration file directory:
KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml
Click OK. The kettle.properties file is saved and the OpenLineage plugin is enabled.
Enable in Pentaho Server
Enable the plugin in the Pentaho Server by completing the following steps:
Navigate to the kettle.properties file.
Note: The kettle.properties file is usually in one of the following locations:
Windows: C:\Documents and Settings\example_user\.kettle
Linux: ~/.kettle
If you run PDI in a container, the kettle.properties file is in the /root/.kettle directory.
Open the kettle.properties file in a text editor.
Enable the plugin with its configuration file by adding the following variables and values:
KETTLE_OPEN_LINEAGE_ACTIVE=true
KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml
Save the kettle.properties file.
Validate the plugin works
You can validate that the plugin is working by verifying that text related to OpenLineage appears in the appropriate logs and files.
To validate that the plugin is working, complete the following steps:
In the PDI client, click File > Open, and then navigate to the sample transformations in your Pentaho folder. For example, in Windows the samples are in <path_to_Pentaho>\Pentaho\design-tools\data-integration\samples\transformations.
Select the sample transformation, TextInput and Output using variables.ktr, and click Open.
To run the transformation, click Action > Run, and then in the Run Options window, click Run. The transformation runs and the Execution Results pane appears at the bottom of the PDI client.
Validate that consumers you have enabled are receiving OpenLineage events by taking one of the following actions:
If the console consumer is enabled, in the Execution Results pane of the PDI client, click the Logging tab and verify that the log contains lines with the text "OpenLineage-Plugin".
If a file consumer is enabled, open the openlineage.json file in a text editor and verify that it contains lines with the text "OpenLineage-Plugin". The openlineage.json file location is defined in the openlineageConfig.yml file.
If an HTTP consumer is enabled, confirm OpenLineage events are arriving for that consumer. For example, if PDC is a configured consumer, verify the events arrive in PDC.
Troubleshoot plugin
If you are unable to validate that the plugin is working, perform the following troubleshooting actions:
Verify dataset lineage (input text file -> output text file) and column lineage mappings.
Validate that the kettle.properties file contains the following variable and value: KETTLE_OPEN_LINEAGE_ACTIVE=true.
Verify that the credentials specified in the openlineageConfig.yml file are correct.
Check your network and firewall settings.
Supported steps
Note: This list of supported steps is for version 0.5.0 of the plugin.
Steps that support dataset lineage and column-level lineage
Abort
Append Streams
Block this step until steps finish
Blocking Step
Data Grid
Delay Row
Delete
Dummy
Filter Rows
Generate Rows
Get Variables
Group By
Java Filter
Mail
Merge Join
Microsoft Excel Input
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections.
Microsoft Excel Output (deprecated)
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]
Microsoft Excel Writer
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]
Prioritize streams
S3 CSV Input
S3 File Output [1]
Send message to syslog
Set Variables
Sort Rows
Switch/Case
Table input
Lineage is supported for the following connections, using the listed SQL functions and clauses:
Connection types: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Google BigQuery, Redshift, and Generic Connection [2]
SQL functions: aliases, joins, subqueries, functions, aggregations, constants, expressions, cases, window functions, CTEs, and the set operators: unions, intersects, and excepts.
Clauses: GROUP BY, ORDER BY, WHERE, WITH, and HAVING.
Table output
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Text file input
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. Fixed filetype is not supported.
Text file output
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible file systems. [1] Fixed filetype is not supported.
Write to Log
Steps that support only dataset lineage, not column-level lineage
Combination lookup/update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
CSV File Input
Database Lookup
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
De-serialize from file
Dimension lookup/update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Fixed file input
Gzip Csv Input
Insert/Update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
JSON Input
JSON Output [1]
LDIF Input
Load file content in memory
Property Input
Properties Output [1]
Sql File Output [1]
Synchronize after merge
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
XBase Input
Notes:
[1] A step that can create multiple files as its output can be configured to add filenames to its result file so that the name of each file is recorded in lineage. If the Add filenames to result option is enabled for the step, the output is recorded in lineage as <filename>_001.csv, <filename>_002.csv, <filename>_003.csv, and so on. If the option is disabled, only a single, generic target, <filename>.csv, is recorded in lineage.
[2] The step allows generic connections, but lineage works only with generic connections to the database types listed as supported.
Note: The Google BigQuery connection is not supported in the Table output step. An OpenLineage event will not include dataset outputs from Google BigQuery storage.
Uninstall plugin
Uninstall the OpenLineage plugin from the PDI client and Pentaho Server by running commands appropriate for your operating system.
Before you begin, you must download the OpenLineage plugin from the Pentaho Support Portal, which contains script files for uninstalling the plugin. For details, see Download the plugin.
Note: The plugin can be uninstalled from the PDI client, Pentaho Server, or both.
Commands for uninstalling the plugin include the following placeholders that must be replaced:
<path-to-data-integration>: Replace with the full path to the PDI client.
<path-to-pentaho-server>: Replace with the full path to the Pentaho Server.
<version_check_option>: Replace with one of the following options:
none: Installs the plugin on any version of Pentaho. If the Pentaho version is unsupported, an error is shown.
loose: Default option. Installs the plugin on certified and compatible, newer Pentaho versions.
strict: Installs the plugin only on certified Pentaho versions.
To uninstall the OpenLineage plugin, complete the following steps:
Stop the PDI client and Pentaho Server.
Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.
In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.
In the command prompt, run the following uninstallation commands for your operating system, replacing the placeholders for paths and version check options.
Windows
PDI client:
uninstall.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
uninstall.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Linux
PDI client:
./uninstall.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
./uninstall.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Start the PDI client and Pentaho Server.
Upgrade plugin
Important: Do not install a new version of the OpenLineage plugin over an existing installation of the plugin.
To upgrade the OpenLineage plugin, you must uninstall the plugin and then download and install the new version of the plugin.
For details, see the following sections:
Archived source pages
These pages were merged into this single topic page: