Loading data from PDI
Load data from Pentaho Data Integration (PDI) to external tools.
Use these features to run outside the PDI client, run remotely, expose virtual tables, and capture lineage.
In this article
Run transformations and jobs from the command line
You can use command line tools to execute PDI content outside the PDI client.
Use them in scripts and schedulers, like cron.
Use Pan to run transformations.
Use Kitchen to run jobs.
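For example, a crontab on Linux can schedule Pan and Kitchen runs overnight. The install path, file paths, and schedule below are illustrative:

```shell
# m h dom mon dow  command
0 2 * * *  /opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/nightly.kjb -level=Basic >> /var/log/etl/nightly.log 2>&1
30 2 * * * /opt/pentaho/data-integration/pan.sh -file=/opt/etl/load_sales.ktr >> /var/log/etl/load_sales.log 2>&1
```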
Command-line reference (Pan, Kitchen, ZIP, export, Hadoop)
Startup script options
Pan and Kitchen recognize the startup-script options used by the PDI client.
These options are in Spoon.bat (Windows) and Spoon.sh (Linux).
To use these options with Pan or Kitchen, add them to your startup script.
Note: The default directory for the startup script is design-tools/data-integration.
FILTER_GTK_WARNINGS
Suppresses GTK warnings from spoon.sh and kitchen.sh. Set to true to suppress warnings. Leave empty to show warnings.
SKIP_WEBKITGTK_CHECK
Suppresses warnings about missing libwebkitgtk when launching the PDI client. Set to true to suppress warnings. Leave empty to show warnings.
KETTLE_HOME
Identifies the user's home directory for PDI configuration files. Use it to change the location of files normally in <user home>/.kettle.
KETTLE_LOG_SIZE_LIMIT
Limits the log size for transformations and jobs that do not set a log size limit property.
KETTLE_JNDI_ROOT
Changes the Simple JNDI path, which contains jdbc.properties.
KETTLE_DIR
Directory where the PDI client is installed.
KETTLE_REPOSITORY
Repository that Kettle connects to at startup.
LIBPATH
Value passed as the -Djava.library.path Java parameter.
PENTAHO_DI_JAVA_OPTIONS
Additional Java arguments when running Kettle. Use it for settings like memory limits.
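As a sketch, these options can be set as environment variables before launching spoon.sh, pan.sh, or kitchen.sh. The paths and memory sizes below are illustrative:

```shell
# Illustrative values; set before launching the PDI scripts.
export KETTLE_HOME=/opt/etl/kettle-config            # config files read from /opt/etl/kettle-config/.kettle
export PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2048m" # raise JVM memory limits
export FILTER_GTK_WARNINGS=true                      # suppress GTK warnings
```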
Pan (run transformations)
Pan runs transformations from a PDI repository (database or enterprise) or a local file.
The options are the same for the shell script and batch file.
Note: Windows uses the forward slash (/) and colon (:) syntax. If option values contain spaces, quote the full argument. Example: "-param:MASTER_HOST=192.168.1.3" "-param:MASTER_PORT=8181".
pan.sh -option=value arg1 arg2
pan.bat /option:value arg1 arg2
Example:
sh pan.sh -rep=initech_pdi_repo -user=pgibbons -pass=lumburgh -trans=TPS_reports_2011
pan.bat /rep:initech_pdi_repo /user:pgibbons /pass:lumburgh /trans:TPS_reports_2011
rep
Enterprise repository name.
user
Repository username.
pass
Repository password.
trans
Name of the transformation to run.
dir
Repository directory that contains the transformation, including the leading slash.
file
Local .ktr file path.
level
Logging level: Basic, Detailed, Debug, Rowlevel, Error, Nothing.
logfile
Log file path.
listdir
Lists directories in the specified repository.
listtrans
Lists transformations in the specified repository directory.
listrep
Lists available repositories.
exprep
Exports all repository objects to one XML file.
norep
Prevents Pan from logging into a repository. Useful when environment variables like KETTLE_REPOSITORY are set, but you want to run a local .ktr.
safemode
Runs in safe mode with extra checking.
version
Shows version, revision, and build date.
param
Sets a named parameter in name=value format. Example: -param:Foo=bar.
listparam
Lists information about named parameters in the specified transformation.
metrics
Gathers metrics during execution.
maxloglines
Maximum number of log lines kept internally. 0 keeps all lines (default).
maxlogtimeout
Maximum age (minutes) of a log line kept internally. 0 keeps lines indefinitely (default).
Pan status codes
Pan returns one of these status codes:
0
Transformation ran without a problem.
1
Errors occurred during processing.
2
Unexpected error during loading or running the transformation.
3
Unable to prepare and initialize the transformation.
7
Transformation could not be loaded from XML or the repository.
8
Error loading steps or plugins.
9
Command line usage was printed.
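A calling script can branch on these codes. The helper function below is written for this example (it is not part of PDI), and the Pan invocation path is assumed:

```shell
# Map Pan's documented exit codes to messages.
pan_status_message() {
  case "$1" in
    0) echo "OK: transformation ran without a problem" ;;
    1) echo "Errors occurred during processing" ;;
    2) echo "Unexpected error during loading or running" ;;
    3) echo "Unable to prepare and initialize the transformation" ;;
    7) echo "Could not load transformation from XML or the repository" ;;
    8) echo "Error loading steps or plugins" ;;
    9) echo "Command line usage was printed" ;;
    *) echo "Unknown status code: $1" ;;
  esac
}

# In a real script (path assumed):
#   ./pan.sh -file=/opt/etl/load_sales.ktr
#   pan_status_message "$?"
```

Kitchen can be wrapped the same way; its codes overlap with Pan's except that 3 is not used.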
Kitchen (run jobs)
Kitchen runs jobs from a PDI repository (database or enterprise) or a local file.
The options are the same for the shell script and batch file.
Note: Windows uses the forward slash (/) and colon (:) syntax. If option values contain spaces, quote the full argument. Example: "-param:MASTER_HOST=192.168.1.3" "-param:MASTER_PORT=8181".
kitchen.sh -option=value arg1 arg2
kitchen.bat /option:value arg1 arg2
rep
Enterprise or database repository name.
user
Repository username.
pass
Repository password.
job
Name of the job (as it appears in the repository) to run.
dir
Repository directory that contains the job, including the leading slash.
file
Local .kjb file path.
level
Logging level: Basic, Detailed, Debug, Rowlevel, Error, Nothing.
logfile
Log file path.
listdir
Lists subdirectories within the specified repository directory.
listjob
Lists jobs in the specified repository directory.
listrep
Lists available repositories.
export
Exports all linked resources of the specified job. Argument is a ZIP filename.
norep
Prevents Kitchen from logging into a repository. Useful when environment variables like KETTLE_REPOSITORY are set, but you want to run a local .kjb.
version
Shows version, revision, and build date.
param
Sets a named parameter in name=value format. Example: -param:FOO=bar.
listparam
Lists information about named parameters in the specified job.
maxloglines
Maximum number of log lines kept internally. 0 keeps all lines (default).
maxlogtimeout
Maximum age (minutes) of a log line kept internally. 0 keeps lines indefinitely (default).
Example:
sh kitchen.sh -rep=initech_pdi_repo -user=pgibbons -pass=lumburghsux -job=TPS_reports_2011
kitchen.bat /rep:initech_pdi_repo /user:pgibbons /pass:lumburghsux /job:TPS_reports_2011
Kitchen status codes
Kitchen returns one of these status codes:
0
Job ran without a problem.
1
Errors occurred during processing.
2
Unexpected error during loading or running the job.
7
Job could not be loaded from XML or the repository.
8
Error loading steps or plugins.
9
Command line usage was printed.
Import .kjb or .ktr files from a ZIP archive
Pan and Kitchen can read PDI content from ZIP files.
Use the ! switch.
Windows example:
Kitchen.bat /file:"zip:file:///C:/Pentaho/PDI Examples/Sandbox/linked_executable_job_and_transform.zip!Hourly_Stats_Job_Unix.kjb"
Linux and Solaris example (escape !):
./kitchen.sh -file:"zip:file:////home/user/pentaho/pdi-ee/my_package/linked_executable_job_and_transform.zip\!Hourly_Stats_Job_Unix.kjb"
Export repository content from the command line
To export repository objects into XML format using command-line tools, pass named parameters when calling Kitchen or Pan.
Example (Kitchen):
call kitchen.bat /file:C:\Pentaho_samples\repository\repository_export.kjb ^
"/param:rep_name=PDI2000" "/param:rep_user=admin" "/param:rep_password=password" ^
"/param:rep_folder=/public/dev" ^
"/param:target_filename=C:\Pentaho_samples\repository\export\dev.xml"
rep_folder
Repository folder
rep_name
Repository name
rep_password
Repository password
rep_user
Repository username
target_filename
Target filename
Note: You can use obfuscated passwords with Encr, the command line tool for encrypting strings for storage and use by PDI.
Example batch file that checks for errors:
@echo off
ECHO This is an example of a batch file calling repository_export.kjb
cd C:\Pentaho\pdi-ee-<version>\data-integration
call kitchen.bat /file:C:\Pentaho_samples\repository\repository_export.kjb "/param:rep_name=PDI2000" ^
"/param:rep_user=admin" "/param:rep_password=password" "/param:rep_folder=/public/dev" ^
"/param:target_filename=C:\Pentaho_samples\repository\export\dev.xml"
if errorlevel 1 goto error
echo Export finished successfully.
goto finished
:error
echo ERROR: An error occurred during repository export.
:finished
REM Allow the user to read the message when testing.
pause
Use Pan and Kitchen with a Hadoop cluster
To use Pan or Kitchen on a Hadoop cluster, configure Pentaho to run transformations and jobs with the PDI client or the Pentaho Server.
You do not need these configurations if the PDI client connects to the Pentaho Repository.
To use Pan and Kitchen from a repository directly on the Pentaho Server, create the named cluster definition in the server repository.
See Connecting to a Hadoop cluster with the PDI client.
Note: If the PDI client and Pentaho Server run on the same platform, cluster configuration files in /home/<user>/.pentaho/metastore can be overwritten. Use the same cluster connection names on both hosts.
Configure the PDI client
Create a connection to the Hadoop cluster where you want to run the job or transformation.
Create and test the job or transformation in the PDI client.
Go to design-tools/data-integration/plugins/pentaho-big-data-plugin.
Open plugin.properties in a text editor.
Set hadoop.configurations.path to the directory that contains metastore. Example: hadoop.configurations.path=/home/<user>/.pentaho
The default metastore location is /home/<user>/.pentaho/metastore.
Save and close plugin.properties.
Configure the Pentaho Server
If the server is on a different host, copy the metastore directory and its contents from the PDI client to a location the server can access. The default metastore location for the PDI client is /home/<user>/.pentaho/metastore.
Go to pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin.
Open plugin.properties in a text editor.
Set hadoop.configurations.path to the directory that contains metastore.
Save and close plugin.properties.
Run and monitor remotely with Carte
Carte is a lightweight web server for running PDI transformations and jobs remotely.
It receives the transformation or job (as XML) plus the run configuration.
It also exposes endpoints to monitor, start, and stop executions.
Carte setup and reference (clusters, security, client config)
Carte clusters
Use a Carte cluster to distribute transformation processing across multiple Carte servers.
A cluster includes:
One master node that tracks execution.
Two or more slave nodes that do the work.
You can also run a single Carte instance as a standalone remote execution engine.
Define one or more Carte servers in the PDI client (Spoon), then send jobs and transformations to them.
You can cluster Pentaho Server for failover. If you use Pentaho Server as the cluster master (dynamic cluster), enable the proxy trusting filter. See Schedule jobs to run on a remote Carte server.
Cluster types
Static cluster
Static clusters have a fixed schema.
You define the master and slave nodes at design time.
Static clusters fit smaller, stable environments.
Dynamic cluster
Dynamic clusters discover slave nodes at run time.
Slave nodes are registered with the master.
PDI monitors slaves every 30 seconds to see if they are available.
Dynamic clusters fit cloud-like environments where nodes come and go.
Set up servers
Prerequisites
Copy required JDBC drivers and PDI plugins from your dev system to each Carte instance.
If you will run content from a Pentaho Repository, copy repositories.xml from your workstation’s .kettle directory to the same location on each Carte server.
Set up a static cluster (start slave servers)
Start each slave server with the host and port you want to expose:
Verify each server is reachable from your PDI client.
(Optional) Create an init/startup script to start Carte on boot.
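The start command (step 1 above) takes the hostname and port as arguments. A typical invocation from the data-integration directory, with an illustrative host and port:

```shell
./carte.sh localhost 8081      # Linux
Carte.bat localhost 8081       # Windows, in a separate command prompt
```

You can then check http://localhost:8081/kettle/status from a browser to verify the server is up.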
When Carte runs embedded in Pentaho Server, configuration is controlled by slave-server-config.xml under .../pentaho-solutions/system/kettle/. Stop Pentaho Server before editing that file.
Set up a dynamic cluster
Dynamic clusters use two configuration files:
carte-master-config.xml for the master.
carte-slave-config.xml for each slave.
You can rename the files.
Keep the required XML structure and values.
Configure a Carte master server
Copy required JDBC drivers and plugins to the master host.
Create carte-master-config.xml using this template:
The master <name> must be unique in the cluster.
Start Carte using the master config file:
Verify the master is running.
(Optional) Create an init/startup script for boot-time startup.
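The master template (step 2 above) has the general shape below; the name, host, port, and credentials are placeholders to replace with your own values:

```xml
<slave_config>
  <slaveserver>
    <name>master1</name>
    <hostname>localhost</hostname>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>Y</master>
  </slaveserver>
</slave_config>
```

Carte accepts the config file as its argument, for example ./carte.sh carte-master-config.xml (Linux) or Carte.bat carte-master-config.xml (Windows).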
Configure Carte slave servers
Ensure the master is running.
Copy required JDBC drivers and plugins to each slave host.
Create carte-slave-config.xml using this template:
Each slave <name> must be unique in the cluster.
(Optional) To use the master’s Kettle properties on a slave, add these tags inside the slave’s <slaveserver>:
Start Carte using the slave config file:
If you use Pentaho Repository content, copy repositories.xml to each slave’s .kettle directory.
Restart the master and slave servers. Restart Pentaho Server if it participates.
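The slave template (step 3 above) defines both the master to report to and the slave itself. Names, hosts, ports, and credentials below are placeholders:

```xml
<slave_config>
  <!-- The master this slave reports to; must match the master's own definition -->
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>localhost</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
  <!-- This slave's own definition; <name> must be unique in the cluster -->
  <slaveserver>
    <name>slave1</name>
    <hostname>localhost</hostname>
    <port>8081</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>
```

For the optional Kettle-properties inheritance, the tags are typically <get_properties_from_master>master1</get_properties_from_master> and <override_existing_properties>Y</override_existing_properties> inside the slave's <slaveserver>; verify the exact tag names against your version's documentation.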
Carte and PDI track object age for transformations and jobs. Objects are purged only when servers are idle. Purge verification runs every 20 seconds.
Configure schedule and remote execution log cleanup
These settings live in slave-server-config.xml.
Stop Pentaho Server before editing this file.
max_log_lines: Max log lines per execution. Use 0 for no limit.
max_log_timeout_minutes: Remove log lines older than this value. Use 0 for no timeout.
object_timeout_minutes: Remove execution entries older than this value. Use 0 for no timeout.
Example:
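A slave-server-config.xml fragment with these settings might look like the following; the values are illustrative (1440 minutes is 24 hours):

```xml
<slave_config>
  <max_log_lines>0</max_log_lines>
  <max_log_timeout_minutes>1440</max_log_timeout_minutes>
  <object_timeout_minutes>1440</object_timeout_minutes>
</slave_config>
```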
Security and advanced server settings
Configure Carte servers for SSL
Carte SSL uses the JKS keystore format.
Keep the keystore in a restricted-access directory.
Carte runs on Jetty.
For Jetty SSL details, see: https://wiki.eclipse.org/Jetty/Howto/Configure_SSL.
Stop Carte.
Open carte-master-config.xml.
Add these values inside the master server <slaveserver>:
keyStore (required): Path to the keystore file.
keyStorePassword (required): Keystore password.
keyPassword (optional): Private key password. Omit it if it matches keyStorePassword.
Use the encr tool in the data-integration directory to obfuscate passwords: encr.bat -carte <password> (Windows) or encr.sh -carte <password> (Linux).
Add the same <sslConfig> block to each carte-slave-config.xml.
Start Carte.
Access Carte over HTTPS:
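A sketch of the <sslConfig> block, with an assumed keystore path and placeholder passwords:

```xml
<sslConfig>
  <keyStore>/opt/pentaho/certs/keystore.jks</keyStore>
  <keyStorePassword>changeit</keyStorePassword>
  <keyPassword>changeit</keyPassword>
</sslConfig>
```

With SSL enabled, browse to https://localhost:8443/kettle/status (host and port illustrative) instead of the http URL.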
Configure Carte servers for JAAS
You can use JAAS for user authentication.
Create a JAAS config file and save it as carte-ldap.jaas.conf on the Carte host.
Set debug="false" in production environments.
Add these Java options to Spoon.bat (Windows) or spoon.sh (Linux), updating the path:
Start Carte. Verify the server does not prompt for BASIC authentication.
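At minimum, the standard JAAS system property must point the JVM at the config file. The path below is assumed, and your Carte version may require additional options; check your startup script:

```shell
# Append the JAAS login config to the OPT variable the startup scripts pass to Java.
OPT="$OPT -Djava.security.auth.login.config=/opt/pentaho/carte-ldap.jaas.conf"
```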
Change Jetty server parameters
Carte uses an embedded Jetty server.
Only change these settings if you need to tune connection handling.
acceptors: Threads dedicated to accepting connections. Keep it at or below the CPU count.
acceptQueueSize: Backlog size before the OS starts rejecting connections.
lowResourcesMaxIdleTime: Close idle connections faster under high load.
Set Jetty parameters in a Carte config file
Add this block inside <slave_config> in carte-slave-config.xml:
Adjust values, then save the file.
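A sketch of the block, using element names that mirror the Jetty settings described above; the values are illustrative, not recommendations:

```xml
<jetty_options>
  <acceptors>2</acceptors>
  <acceptQueueSize>50</acceptQueueSize>
  <lowResourcesMaxIdleTime>8000</lowResourcesMaxIdleTime>
</jetty_options>
```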
Set Jetty parameters in kettle.properties
Set these variables to numeric values:
KETTLE_CARTE_JETTY_ACCEPTORS
KETTLE_CARTE_JETTY_ACCEPT_QUEUE_SIZE
KETTLE_CARTE_JETTY_RES_MAX_IDLE_TIME
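For example, in kettle.properties (values illustrative):

```properties
KETTLE_CARTE_JETTY_ACCEPTORS=2
KETTLE_CARTE_JETTY_ACCEPT_QUEUE_SIZE=50
KETTLE_CARTE_JETTY_RES_MAX_IDLE_TIME=8000
```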
Configure the PDI client
Initialize slave servers
Open a transformation.
In Explorer View, select the Slave tab.
Select New.
Enter the slave server connection details:
Server name
Hostname or IP address
Port (leave blank for port 80)
Web App Name (required only for Pentaho Server)
User name and password
Is the master
For clustered executions, define one master and the rest as slaves.
Select OK.
Create a cluster schema
In Explorer View, right-click Kettle cluster schemas, then select New.
Configure:
Schema name
Port: Starting port for slave step numbering.
Sockets buffer size
Sockets flush interval rows
Sockets data compressed?
Dynamic cluster: Enable if a master Carte server performs failover.
Slave Servers: Add one master and any number of slaves.
Run transformations in a cluster
Open the Run Options window (toolbar Run context menu or F8).
Select a run configuration that runs the transformation in clustered mode.
To run a clustered transformation from a job, open the Transformation job entry, then set Run this transformation in a clustered mode? on the Advanced tab.
To assign a cluster to a step, right-click the step, select Clusters, then pick a cluster schema.
When running clustered transformations, enable Show transformations to see the generated transformations that run on the cluster.
Schedule and run remotely
Schedule jobs to run on a remote Carte server
These changes are required to schedule a job to run on a remote Carte server.
They are also required if Pentaho Server acts as the load balancer in a dynamic Carte cluster.
Stop Pentaho Server and the remote Carte server.
Copy repositories.xml from your workstation’s .kettle directory to the same location on the Carte host.
Open .../tomcat/webapps/pentaho/WEB-INF/web.xml.
In the Proxy Trusting Filter section, add the Carte server IP to TrustedIpAddrs.
Uncomment the proxy trusting filter mappings between the <!-- begin trust --> and <!-- end trust --> markers.
Save web.xml.
Add -Dpentaho.repository.client.attemptTrust=true to the Carte startup script:
Windows (Carte.bat): add it to the OPT line.
Linux (Carte.sh): add it to the OPT variable before export OPT.
Start the Carte server and Pentaho Server.
Run transformations and jobs from a repository on the Carte server
Copy repositories.xml from the user’s .kettle directory to the Carte host’s $HOME/.kettle directory.
Carte also looks for repositories.xml in the directory where you started Carte.
Stop Carte
You can stop Carte from the command line or from a URL.
Stop from the CLI
Arguments: the hostname and port of the running Carte instance (or its configuration file).
Options:
-h, --help: Help text.
-s, --stop: Stop the running Carte server.
-u, --username <arg>: Admin user name.
-p, --password <arg>: Admin password.
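For example, to stop an instance listening on localhost:8080 (credentials illustrative):

```shell
./carte.sh localhost 8080 -s -u cluster -p cluster
```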
Stop from a URL
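Assuming Carte listens on localhost:8080, the stop endpoint has this shape (you are prompted for credentials):

```
http://localhost:8080/kettle/stopCarte
```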
Expose transformation output with Pentaho Data Services
Use Pentaho Data Services to expose a transformation step as a virtual table.
Query it over JDBC using SQL.
You need a Pentaho Server and repository to publish a data service.
Pentaho Data Services guide (install, create, test, optimize)
Prototyping a data model can be time consuming, particularly when it involves setting up databases, creating the data model, setting up a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.
One way to streamline this process is to make the output of a transformation step a Pentaho Data Service.
The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database.
Instead, results are published to the Pentaho Server as a virtual table.
You must have a Pentaho Server and repository to publish the data service.
The virtual table is a JDBC-compliant data source that you and others can connect to or query with SQL, provided they can access the server and the transformation.
The Pentaho Data Service can be connected to or queried by a JDBC-compliant tool such as Pentaho Report Designer, Pentaho Interactive Reports, and CTools as well as other compatible tools like RStudio, DBVisualizer, or SQuirreL.
The Pentaho Data Service is also useful when building and maintaining a data warehouse is impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or quickly evolving data sets on the fly.
For example, if you want to compare your product prices with your competitors, you can create a transformation that blends prices from your in-house data sources and competitor prices.
Then, you can convert the output step in the transformation into a Pentaho Data Service that creates a virtual table for querying when you connect to the Pentaho Server.
You or others can connect to and query the virtual table, as you would any other JDBC data source to visualize the results in Analyzer or another tool.
The Pentaho Data Service also has a testing tool.
This tool generates several logs and reports that you can use to refine the data service and determine where to apply specialized optimizations.
You can also define parameters that others can use to pose customized queries.
For example, you can create a data service that publishes a virtual “fact” table of a moderately-sized research dataset to a Pentaho Server.
You can test and add optimizations and parameters, such as gender or test type so that the data service runs more quickly.
Then, you can share connection and parameter information with a group of researchers, who can query the virtual table.
Researchers can use Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.
Pentaho Data Services support a subset of SQL.
For details, see SQL support reference.
Install the Pentaho Data Services plugin before you create data services.
In this article
Install Pentaho Data Services
Pentaho Data Services plugin is an optional plugin.
Pentaho Data Integration
Install the plugin in Pentaho Data Integration by using the Plugin Manager. For instructions, see Install plugins.
Pentaho Server
Install the plugin on Pentaho Server by using the Plugin manager. For instructions, see Install plugins.
After you have installed the plugin, add the driver for the Pentaho Data Services plugin by completing the following steps:
In the PDI client, open a new transformation.
In the View tab, expand the Configurations section.
Right-click Data Service and select Driver Details.
In the Driver Details window, click Get Pentaho Driver.
In the Save As window, indicate where you want to save the PDI-Data-Service-Driver-Pentaho.jar file, then click Save.
Move or copy the PDI-Data-Service-Driver-Pentaho.jar file into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder.
Close the transformation without saving.
Restart the Pentaho Server.
Create a regular or streaming Pentaho Data Service
You can create either a regular data service or a streaming analytics service.
Streaming data services are commonly used when creating streaming dashboards with CTools.
Save the transformation to the Pentaho Server.
Right-click the transformation step that outputs the data you want.
Select Data Services > New.
Enter a unique name in Service Name (Virtual Table Name).
Confirm Output step is the step you selected.
Optional: Select Streaming for Data Service Type.
Select OK.
Data service badge
After you create a Pentaho Data Service from a step, a data service badge is added to that step.
The badge indicates whether the step has a regular or streaming data service.

Open or edit a Pentaho Data Service
To open a data service for viewing or editing, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Edit.
Right-click the step with the data service badge, then select Data Services > Edit.
Delete a Pentaho Data Service
To delete a data service, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Delete.
Right-click the step with the data service badge, then select Data Services > Delete.
Test a Pentaho Data Service
After creating your data service, test it to ensure that it runs properly and generates the data you need.
Testing can uncover bottlenecks in your transformation and help you decide which optimization techniques to apply.
Run a basic test
To run a basic test on a regular data service:
Verify Data Service Type is set to Regular.
Open the Test Data Service window using one of the following:
In the Data Service window, select Test Data Service.
In Explore > View > Data Services, right-click a data service and select Test.
Right-click the step with the data service badge, then select Data Services > Test.
Optional: Adjust settings:
Log Level. Controls how much detail appears in logs.
Max Rows. Limits how many rows appear in results.
Select Execute SQL.
Review the output in Examine test results.
Optional: If you need a clean run, clear the cache and test again.
Select Close.
Optional: Add an optimization.
Run a streaming optimization test
When you test streaming data, the stream is partitioned into windows (batches).
Windows can be time-based or row-based.
To test a streaming data service:
Verify Data Service Type is set to Streaming.
Open the Test Data Service window.
Select a window mode (Time Based or Row Based).
Configure window settings:
Window Size
Every
Limit
Optional: Adjust Log Level and Max Rows.
Select Execute SQL.
Review the output in Examine test results.
Select Stop to stop execution.
Select Close.
Optional: Add an optimization.
Run an optimization test
If you have added an optimization, run a test that passes the optimization parameter.
Example query:
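A query of this shape passes a value to a pushdown parameter through the PARAMETER literal described in the SQL support reference; the service and parameter names here are illustrative:

```sql
SELECT * FROM employee_service WHERE PARAMETER('countryParam') = 'US'
```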
Select Preview Optimization in the Test Data Service window.
Use Examine test results to interpret results.
Examine test results
Test results appear in the tabs in the bottom half of the Test Data Service window.
When you test or run a data service, two transformations run:
The service transformation. This is the transformation you built.
The generated transformation. PDI generates this based on executed SQL.
The following table describes the tabs and what to look for:
Query Results
Events during the test run and query results.
Verify results. For streaming, watch updates per window.
Optimized Queries
Processing information and results of optimizations.
Verify optimizations applied correctly. Compare with Service Metrics and Generated Transformation Logging.
Service Transformation Logging
Logs from the service transformation.
Check for design or runtime issues. Compare with Generated Transformation Logging.
Generated Transformation Logging
Logs from the generated transformation.
Check SQL support issues. See SQL support reference.
Service Metrics
GANTT chart timings for the service transformation.
Find bottlenecks. Compare with SQL Trans Metrics.
SQL Trans Metrics
GANTT chart timings for the generated transformation.
Find bottlenecks. Compare with Service Metrics.
SQL support reference
The Pentaho Data Service is designed to support a subset of SQL clauses and literals that are useful for data blending, optimizations, and other scenarios.
Limitations and constraints are listed at the end.
Supported SQL literals
The Pentaho Data Service supports the following literals:
Strings use single quotation marks. Escape a single quote using another single quote: ''.
Dates use square brackets. Supported formats:
[yyyy/MM/dd HH:mm:ss.SSS]
[yyyy/MM/dd HH:mm:ss]
[yyyy/MM/dd]
For an IN list, date formats can use single quotes and dashes, for example: SELECT * FROM BUILDS WHERE BuildDate IN ('2015-03-18', '2015-03-22'). You cannot use bracketed date formats in an IN list.
Number and BigNumber cannot use grouping symbols. Use . for decimals, for example 123.45. Integers contain digits only.
Boolean values can be TRUE or FALSE.
Supported SQL clauses
The Pentaho Data Service supports the following clauses.
SELECT
COUNT(field)
COUNT(*)
COUNT(DISTINCT field)
DISTINCT <fields>
IIF(condition, true-value or field, false-value or field)
CASE WHEN condition THEN true-value ELSE false-value END
SUM
AVG
MIN
MAX
Aliases with AS or with spaces. Example: SUM(sales) AS "Total Sales" or SUM(sales) TotalSales
Constant expressions. See Supported SQL literals.
FROM
Only one Pentaho service name.
Aliases for the service name.
Omit the service name to query an empty row.
SELECT 1 and SELECT 1 FROM dual are the same.
WHERE
Nested brackets
AND, OR, NOT followed by brackets (example: NOT (A=5 OR C=3))
Precedence
Literals (String and Integer)
PARAMETER('parameter-name')='value' (always evaluates to TRUE)
Operators: =, <, >, <=, =<, >=, =>, <>
LIKE (wildcards: %, ?)
REGEX
IS NULL, IS NOT NULL
IN
Conditions on IIF or its alias
DATE_TO_STR(date-field, <mask>)
GROUP BY
Group on fields. Not on IIF().
LIMIT
LIMIT
LIMIT offset, count
LIMIT count OFFSET offset
HAVING
Apply conditions to aggregates, not aliases.
Use identical strings for expressions.
You can use conditions on aggregations that are not in the SELECT clause.
ORDER BY
Order on any column, even if not in the result.
Order on IIF or CASE WHEN expressions.
Other development considerations
Keep these constraints in mind:
You cannot JOIN one data service virtual table to another.
Data services use the Memory Group By step to group. Watch memory consumption for many groups.
You cannot specify the same field twice in the same SELECT clause.
Calculations and functions like string concatenation are not supported in queries. Do them in the transformation.
Optimize a Pentaho Data Service
As you test your data service, you might see bottlenecks or steps that could run more efficiently.
If you want to improve performance, apply an optimization technique.
Service cache
This optimization stores results in a cache.
By default, caching is enabled and results are stored for an hour.
Use this technique when result sets are modest and you expect repeat queries.
How the service cache optimization works
If you run the data service while results are cached, PDI can run your query against cached data instead of running the full transformation.
This behavior depends on:
Whether the cached dataset contains all the records required by the new query
Other optimizations you apply

When you run tests from the Test Data Service window and change the SQL, PDI does not use the cache.
Tests only return up to Max Rows, so using cached results could return incomplete results.
Adjust the cache duration
In the Data Service window, select the Service Cache tab.
Select Enable Caching.
Update Cache Duration (seconds).
Select OK.
Disable the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Clear the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Open the Data Service window again and return to the Service Cache tab.
Select Enable Caching.
Query pushdown
Use Query Pushdown to translate the SQL WHERE clause into a corresponding WHERE clause in:
Table Input steps
MongoDB Input steps
The input queries are filtered and handled at the source.
How query pushdown works
To apply Query Pushdown, set the optimization values, then add the optimization parameter to the input step query.
The optimization uses a parameter in place of the WHERE clause, for example:
SELECT * FROM 'employee' WHERE ${countryParam}

Add the query pushdown parameter to Table Input or MongoDB Input
Create a transformation with a Table Input or MongoDB Input step.
Open the input step.
Add a parameter where your WHERE clause value belongs:
SQL: SELECT * FROM media WHERE ${countryParam}
MongoDB: {$match : ${mongoDbParam}}
Press CTRL+SPACE to list parameters.
Select Replace Variables in Script?.
Select OK.
Set up query pushdown parameter optimization
Open the Data Service window and select the Query Pushdown tab.
Select the plus sign near Parameters.
Enter the name of the optimization parameter you used in the input query.
Select OK.
Select the step that contains the parameter in Step Name.
In Definitions, map fields:
Data Service Field. Field name as it appears in transformation output.
Step Field. Field name as it appears in the data source.
Optional: Select Get Optimizations to generate optimizations automatically.
Select OK.
Test and publish the data service.
Disable the query pushdown optimization
To disable this optimization, select Disable an Optimization in the Data Services window.
Parameter pushdown
Parameter Pushdown can be applied to any step in the transformation.
It maps a field value to a transformation parameter for simple equality conditions (example: WHERE region="South").
How parameter pushdown works
To set up Parameter Pushdown, configure the optimization, then add the parameter to the step.

Add the parameter pushdown parameter to the step
Add the parameter to a step, such as JSON Input or REST Client.
Select OK.
Set up parameter pushdown optimization
Open the Data Service window and select the Parameter Pushdown tab.
In WHERE Clause Column, enter the data service field name from the existing list.
Update Transformation Parameter as needed. The name must be unique in the data service.
Optional: Update Value Format. The default %s is usually sufficient.
Select OK.
Test and publish:
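The Value Format field controls how the pushed-down value is rendered into the transformation parameter. Assuming printf-style formatting (with the default %s the value passes through unchanged), the behavior can be sketched as follows; the helper name is hypothetical.

```java
public class ValueFormatSketch {
    // Render a pushed-down value with a printf-style format string,
    // as the Value Format field does (default "%s").
    static String render(String valueFormat, Object value) {
        return String.format(valueFormat, value);
    }

    public static void main(String[] args) {
        // WHERE region="South" maps the value "South" to the parameter.
        System.out.println(render("%s", "South"));   // South
        System.out.println(render("'%s'", "South")); // 'South'
    }
}
```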
Streaming optimization
This optimization limits the batch size used for processing.
Streaming records are partitioned into windows (batches) for processing.
How streaming optimization works
You can optimize processing by setting either:
Rows Limit. Maximum rows in a window.
Time Limit. Maximum elapsed time to create a window.
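The two limits above partition the stream into batches: a window is emitted when it reaches the row limit or when the time limit elapses, whichever comes first. A minimal sketch of the row-limit case (the time check is elided for a deterministic example; this is not the PDI implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    // Partition a stream of rows into windows bounded by Rows Limit.
    // A real streaming implementation would also flush on Time Limit.
    static List<List<String>> windowsByRows(List<String> rows, int rowsLimit) {
        List<List<String>> windows = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String row : rows) {
            current.add(row);
            if (current.size() >= rowsLimit) { // Rows Limit reached: emit window
                windows.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) windows.add(current); // flush the remainder
        return windows;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
        System.out.println(windowsByRows(rows, 2)); // [[r1, r2], [r3, r4], [r5]]
    }
}
```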
Adjust the row or time limits
In the Data Service window, select the Streaming tab.
Update Rows Limit or Time Limit.
Select OK.
Test and publish:
Publish a Pentaho Data Service
To publish your data service, save the transformation containing the data service to the Pentaho Repository.
To validate that your data service was published:
Ensure external assets required by the transformation are accessible on the server.
Open a browser and log in to the Pentaho Server.
List data services on the server, for example:
Share a Pentaho Data Service with others
Once your data service is created and tested, you can share it so others can connect and query it.
Share prerequisites
Ensure the user or group has permissions to:
Access the Pentaho Server and run queries.
Read and run the transformation that contains the data service.
Provide the data service name as saved in the Pentaho Repository.
If you use optimizations, provide names and definitions for any parameters.
Provide connection instructions:
Connect to the Pentaho Data Service from a Pentaho tool
A Pentaho Data Service is a virtual table containing the output of a step in a PDI transformation.
You can connect to and query:
A regular data service from Pentaho tools such as Report Designer, Analyzer, and the PDI client.
A streaming data service from a dashboard created with CTools.
You need the data service name and permission to run the transformation and access the Pentaho Server.
Connection parameters:
Connection Name. Unique name of the data service.
Connection Type. Pentaho Data Services.
Access. Native (JDBC).
Hostname. Pentaho Server hostname or IP address. Default is localhost for local installs.
Port Number. Pentaho Server port. Default is 8080.
Web App Name. Default is pentaho.
Username. A user allowed to run the data service.
Password. Password for that user.
Optional connection parameters:
proxyhostname. Proxy server for HTTP connections.
proxyport. Proxy server port.
nonproxyhosts. Hosts that bypass the proxy. Separate multiple hosts with commas.
debugtrans. File path where PDI saves the generated debug transformation. Example: /tmp/debug.ktr.
PARAMETER_[optionname]=value. Sets a transformation parameter. Example: PARAMETER_model=E6530.
secure. Set to TRUE for HTTPS. Default is HTTP.
Connect to the Pentaho Data Service from a non-Pentaho tool
You can connect to and query a data service from tools like RStudio or SQuirreL.
To connect and query, you need the data service name and permission to run the transformation and access the server.
To use a non-Pentaho tool, install the data service JDBC driver, then create a connection in your tool.
Step 1: Download the Pentaho Data Service JDBC driver
You can download the driver using the PDI client or manually.
Download using the PDI client
Open the transformation and identify the step with the data service badge.
Open Driver Details using one of the following:
Right-click the step and select Data Service > Driver Details.
In Explore > View > Data Services, right-click a data service and select Driver Details.
In the Data Services window, select Driver Details.
In Driver Details, select Get Third-Party Driver.
Save Pentaho-Data-Service-Driver.zip.
Select Close.
Download manually
Go to <Pentaho installation directory>/design-tools/data-integration/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver.
Copy pdi-dataservice-driver-thirdparty-<release-version>.jar.
These steps assume you installed the Pentaho Data Services plugin. See Install Pentaho Data Services.
Step 2: Install the Pentaho Data Service JDBC driver
Extract the driver files and copy them to your application's JDBC driver directory.
Stop and start the application so it loads the driver.
Create a connection using the instructions in Connect to the Pentaho Data Service from a non-Pentaho tool.
Step 3: Create a connection from a non-Pentaho tool
Most tools let you create a JDBC connection by specifying a driver class and connection string.
Driver class:
org.pentaho.di.trans.dataservice.jdbc.ThinDriver
Connection string format:
Example:
webappname is required when the data service runs on Pentaho Server.
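A minimal Java sketch of assembling the connection. The URL form shown (jdbc:pdi://<host>:<port>/kettle?webappname=<webapp>) is the commonly documented format for the thin driver, but treat it as an assumption and confirm it against your driver version; buildUrl is a hypothetical helper.

```java
// Requires the Pentaho Data Service JDBC driver on the classpath to connect.
public class DataServiceConnect {
    // Assemble a thin-driver URL (assumed format; verify for your version).
    static String buildUrl(String host, int port, String webapp) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle?webappname=" + webapp;
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 8080, "pentaho");
        System.out.println(url);
        // With the driver installed, a connection would look like this:
        // Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver");
        // try (java.sql.Connection c =
        //          java.sql.DriverManager.getConnection(url, "user", "password")) {
        //     // query the data service's virtual table here
        // }
    }
}
```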
Query a Pentaho Data Service
You can query a data service using SQL.
If the transformation uses a parameter, you can assign a value in your SQL query.
Limitations:
SQL support is limited. See SQL support reference.
Data services can be queried only with SQL.
Example: query with a parameter
Syntax:
Example:
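One way to assign a parameter value, grounded in the optional connection parameters listed earlier, is to append PARAMETER_<name>=<value> to the connection string; the transformation then receives that value when the query runs. A sketch (the helper name is hypothetical):

```java
public class ParameterOption {
    // Append a transformation parameter to a data service connection URL
    // using the documented PARAMETER_<name>=<value> option.
    static String withParameter(String url, String name, String value) {
        char sep = url.contains("?") ? '&' : '?';
        return url + sep + "PARAMETER_" + name + "=" + value;
    }

    public static void main(String[] args) {
        String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho";
        System.out.println(withParameter(url, "model", "E6530"));
        // jdbc:pdi://localhost:8080/kettle?webappname=pentaho&PARAMETER_model=E6530
    }
}
```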
Monitor a Pentaho Data Service
To monitor a data service, enter one of the following URLs in your browser:
List names of data services:
List data service status:
Replace <Pentaho Server Host:Port> with the host name or IP address and port for the Pentaho Server running the data service.
You must be authenticated and authorized on the Pentaho Server to access these endpoints.
Capture lineage with the OpenLineage plugin
Use the OpenLineage plugin to emit standardized lineage events from PDI.
Use these events in Pentaho Data Catalog (PDC) for end-to-end data flow visibility.
OpenLineage plugin guide (setup, configuration, supported steps)
The Pentaho Data Integration (PDI) OpenLineage plugin enables PDI to emit rich, standardized OpenLineage events that can be consumed by Pentaho Data Catalog (PDC) to capture how data moves and is transformed in PDI ETL pipelines.
PDC uses information it captures to provide visual end-to-end transparency of data flows, which improves data observability, strengthens compliance and governance, aids in troubleshooting data issues, and enhances data trust and quality for business users.
When supported transformations are executed, PDI emits OpenLineage events by discovering input and output datasets and, when possible, generating column-level lineage.
The OpenLineage plugin emits events for:
Start: transformation starts
Complete: transformation ends
Abort: transformation was stopped without errors
Fail: transformation ended with errors
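For reference, an OpenLineage run event is a JSON document whose eventType corresponds to the states above. The following minimal event is an illustration based on the public OpenLineage specification; the namespace, job name, producer URI, and dataset paths are hypothetical, and the exact facets the PDI plugin emits may differ.

```json
{
  "eventType": "COMPLETE",
  "eventTime": "2024-01-01T12:00:00.000Z",
  "producer": "https://example.com/pdi-openlineage-plugin",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "pdi", "name": "TextInput and Output using variables" },
  "inputs":  [ { "namespace": "file", "name": "/path/to/input.txt" } ],
  "outputs": [ { "namespace": "file", "name": "/path/to/output.txt" } ]
}
```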
Compatibility matrix
OpenLineage plugin functionality is certified to work as intended for the following versions of PDI:
10.2.0.1 (SP1)
10.2.0.2 (SP2)
10.2.0.3 (SP3)
10.2.0.4 (SP4)
10.2.0.5 (SP5)
10.2.0.6 (SP6)
11.0
Setting up the plugin
Before you begin, verify that you have a valid license for the OpenLineage plugin.
For information about licenses, see Acquire and install enterprise licenses.
To set up the OpenLineage plugin, you must complete the following tasks:
Download the plugin
Download the OpenLineage plugin from the Pentaho Support Portal.
On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet.
In the Pentaho card, click Download. The Downloads page opens.
In the .x list, click Pentaho EE Marketplace Plugins Release.
Scroll to the bottom of the page.
In the Marketplace Plugins section, click Open Lineage.
Download the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file.
Install the plugin
Install the OpenLineage plugin in the PDI client and Pentaho Server by running commands appropriate for your operating system.
Note: The plugin can be installed in the PDI client, Pentaho Server, or both.
Installation commands include the following placeholders that must be replaced:
<path-to-data-integration>: Replace with the full path to the PDI client.
<path-to-pentaho-server>: Replace with the full path to the Pentaho Server.
<version_check_option>: Replace with one of the following options:
none: Installs the plugin on any version of Pentaho. If the Pentaho version is unsupported, an error is shown.
loose: Default option. Installs the plugin on certified and compatible, newer Pentaho versions.
strict: Installs the plugin only on certified Pentaho versions.
To install the OpenLineage plugin, complete the following steps:
Stop the PDI client and Pentaho Server.
Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.
In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.
In the command prompt, run the following installation commands for your operating system, replacing the placeholders for paths and version check options.
Windows
PDI client:
install.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
install.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Linux
PDI client:
./install.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
./install.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Start the PDI client and Pentaho Server.
Generate an encrypted password
If you plan to emit events to PDC and want to avoid storing your password in plain text, you can generate an encrypted password to authenticate to PDC.
The encrypted password is used in the configuration file for the OpenLineage plugin.
On the computer where the PDI client or PDI Server is installed, open a command prompt.
Run one of the following commands for your operating system:
Windows
To generate a password using the default Pentaho encryption seed, run the following command:
To generate a password using your own custom encryption seed, run the following command:
Linux
To generate a password using the default Pentaho encryption seed, run the following command:
To generate a password using your own custom encryption seed, run the following command:
An encrypted password is generated and displayed in the command prompt, like the following example:
Create a configuration file for the plugin
After you install the plugin, create a configuration file that specifies where to send open lineage events.
You can create a simple configuration file for testing or a custom configuration to use in production.
In a text editor, create a configuration file with content from one of the following examples, based on your needs:
To create a simple configuration file that you can use to quickly validate that the plugin is working, include only the following content:
To create a custom configuration file that includes OpenLineage event consumers in your Pentaho deployment, such as a PDC Server, include the following content:
Save the file as openlineageConfig.yml in the PDI directory that contains your user-specific configuration files.
Notes:
By default, user-specific configuration files are stored in the .kettle directory, which is usually in one of the following locations:
Windows: C:\Documents and Settings\example_user\.kettle
Linux: ~/.kettle
However, if you run PDI in a container, configuration files might resolve to the /root/.kettle directory.
You can add multiple HTTP consumers in the configuration file.
Enable the plugin
After you install the OpenLineage plugin and create its configuration file, you must enable the plugin so that it can send open lineage events to the consumers you specified in the configuration file.
Enable in PDI client
Enable the plugin in the PDI client by completing the following steps:
Log in to the PDI client and click Edit > Edit the kettle.properties file. The Kettle properties window opens.
To make the plugin active, add the following variable and value:
KETTLE_OPEN_LINEAGE_ACTIVE=true
To point PDI to your openlineageConfig.yml file, add the following variable with the <path-to-config-file> placeholder replaced by the full path to your configuration file directory:
KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml
Click OK. The kettle.properties file is saved and the OpenLineage plugin is enabled.
Enable in Pentaho Server
Enable the plugin in the Pentaho Server by completing the following steps:
Navigate to the kettle.properties file.
Note: The kettle.properties file is usually in one of the following locations:
Windows: C:\Documents and Settings\example_user\.kettle
Linux: ~/.kettle
If you run PDI in a container, the kettle.properties file is in the /root/.kettle directory.
Open the kettle.properties file in a text editor.
Enable the plugin with its configuration file by adding the following variables and values:
KETTLE_OPEN_LINEAGE_ACTIVE=true
KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml
Save the kettle.properties file.
Validate the plugin works
You can validate that the plugin is working by verifying that text related to OpenLineage appears in the appropriate logs and files.
To validate that the plugin is working, complete the following steps:
In the PDI client, click File > Open, and then navigate to the sample transformations in your Pentaho folder. For example, in Windows the samples are in <path_to_Pentaho>\Pentaho\design-tools\data-integration\samples\transformations.
Select the sample transformation, TextInput and Output using variables.ktr, and click Open.
To run the transformation, click Action > Run, and then in the Run Options window, click Run. The transformation runs and the Execution Results pane appears at the bottom of the PDI client.
Validate that consumers you have enabled are receiving OpenLineage events by taking one of the following actions:
If the console consumer is enabled, in the Execution Results pane of the PDI client, click the Logging tab and verify that the log contains lines with the text "OpenLineage-Plugin".
If a file consumer is enabled, open the openlineage.json file in a text editor and verify that it contains lines with the text "OpenLineage-Plugin". The openlineage.json file location is defined in the openlineageConfig.yml file.
If an HTTP consumer is enabled, confirm OpenLineage events are arriving for that consumer. For example, if PDC is a configured consumer, verify the events arrive in PDC.
Troubleshoot plugin
If you are unable to validate that the plugin is working, perform the following troubleshooting actions:
Verify dataset lineage (input text file -> output text file) and column lineage mappings.
Validate that the kettle.properties file contains the following variable and value: KETTLE_OPEN_LINEAGE_ACTIVE=true.
Verify that the credentials specified in the openlineageConfig.yml file are correct.
Check your network and firewall settings.
Supported steps
Note: This list of supported steps is for version 0.5.0 of the plugin.
Steps that support dataset lineage and column-level lineage
Abort
Append Streams
Block this step until steps finish
Blocking Step
Data Grid
Delay Row
Delete
Dummy
Filter Rows
Generate Rows
Get Variables
Group By
Java Filter
Mail
Merge Join
Microsoft Excel Input
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections.
Microsoft Excel Output (deprecated)
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]
Microsoft Excel Writer
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]
Prioritize streams
S3 CSV Input
S3 File Output [1]
Send message to syslog
Set Variables
Sort Rows
Switch/Case
Table input
Lineage is supported for the following connections, using the listed SQL functions and clauses:
Connection types: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Google BigQuery, Redshift, and Generic Connection [2]
SQL functions: aliases, joins, subqueries, functions, aggregations, constants, expressions, cases, window functions, CTEs, and the set operators: unions, intersects, and excepts.
Clauses: GROUP BY, ORDER BY, WHERE, WITH, and HAVING.
Table output
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Text file input
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. Fixed filetype is not supported.
Text file output
Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible file systems. [1] Fixed filetype is not supported.
Write to Log
Steps that support only dataset lineage, not column-level lineage
Combination lookup/update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
CSV File Input
Database Lookup
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
De-serialize from file
Dimension lookup/update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Fixed file input
Gzip Csv Input
Insert/Update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
JSON Input
JSON Output [1]
LDIF Input
Load file content in memory
Property Input
Properties Output [1]
Sql File Output [1]
Synchronize after merge
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
Update
Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]
XBase Input
Notes:
[1] A step that can create multiple files as its output can be configured to add filenames to its result file so that the name of each file is recorded in lineage. If the Add filenames to result option is enabled for the step, the output is recorded in lineage as <filename>_001.csv, <filename>_002.csv, <filename>_003.csv, and so on. If the option is disabled, only a single, generic target, <filename>.csv, is recorded in lineage.
[2] The step allows generic connections, but lineage works only with generic connections to the database types listed as supported.
Note: The Google BigQuery connection is not supported in the Table output step. An OpenLineage event will not include dataset outputs from Google BigQuery storage.
Uninstall plugin
Uninstall the OpenLineage plugin from the PDI client and Pentaho Server by running commands appropriate for your operating system.
Before you begin, you must download the OpenLineage plugin from the Pentaho Support Portal, which contains script files for uninstalling the plugin. For details, see Download the plugin.
Note: The plugin can be uninstalled from the PDI client, Pentaho Server, or both.
Commands for uninstalling the plugin include the following placeholders that must be replaced:
<path-to-data-integration>: Replace with the full path to the PDI client.
<path-to-pentaho-server>: Replace with the full path to the Pentaho Server.
<version_check_option>: Replace with one of the following options:
none: Installs the plugin on any version of Pentaho. If the Pentaho version is unsupported, an error is shown.
loose: Default option. Installs the plugin on certified and compatible, newer Pentaho versions.
strict: Installs the plugin only on certified Pentaho versions.
To uninstall the OpenLineage plugin, complete the following steps:
Stop the PDI client and Pentaho Server.
Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.
In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.
In the command prompt, run the following uninstallation commands for your operating system, replacing the placeholders for paths and version check options.
Windows
PDI client:
uninstall.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
uninstall.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Linux
PDI client:
./uninstall.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>
Pentaho Server:
./uninstall.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>
Start the PDI client and Pentaho Server.
Upgrade plugin
Important: Do not install a new version of the OpenLineage plugin over an existing installation of the plugin.
To upgrade the OpenLineage plugin, you must uninstall the plugin and then download and install the new version of the plugin.
For details, see the following sections:
Archived source pages
These pages were merged into this single topic page: