Pentaho Data Services
Prototyping a data model can be time consuming, particularly when it involves setting up databases, creating the data model, setting up a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.
One way to streamline this process is to make the output of a transformation step a Pentaho Data Service. The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database. Instead, results are published to the Pentaho Server as a virtual table.
You must have a Pentaho Server and repository to publish the data service.
The virtual table is a JDBC-compliant data source that you and others can connect to or query with SQL, provided you can access the server and the transformation. The Pentaho Data Service can be connected to or queried by JDBC-compliant tools such as Pentaho Report Designer, Pentaho Interactive Reports, and CTools, as well as other compatible tools like RStudio, DBVisualizer, or SQuirreL.
The Pentaho Data Service can also be used where building and maintaining a data warehouse is impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or rapidly evolving data sets on the fly. For example, if you want to compare your product prices with your competitors', you can create a transformation that blends prices from your in-house data sources with competitor prices. Then, you can convert the output step in the transformation into a Pentaho Data Service that creates a virtual table for querying when you connect to the Pentaho Server. You or others can connect to and query the virtual table, as you would any other JDBC data source, to visualize the results in Analyzer or another tool.
The Pentaho Data Service also has a testing tool. This tool generates several logs and reports that you can use to refine the data service and determine where to apply specialized optimizations. You can also define parameters that others can use to pose customized queries. For example, you can create a data service that publishes a virtual “fact” table of a moderately sized research dataset to a Pentaho Server. You can test and add optimizations and parameters, such as gender or test type, so that the data service runs more quickly. Then, you can share connection and parameter information with a group of researchers, who can query the virtual table. Researchers can use Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.
Pentaho Data Services support a subset of SQL. For details, see SQL support reference.
Install the Pentaho Data Services plugin before you create data services.
Install Pentaho Data Services
The Pentaho Data Services plugin is optional. You can download it from the Support Portal.
Pentaho Data Integration
To install the plugin in Pentaho Data Integration:
Unpack pdi-dataservice-ee-plugin-<release-version>.zip to the plugins folder (data-integration/plugins).
Restart Pentaho Data Integration.
Pentaho Server
To install the plugin on Pentaho Server:
Unpack pdi-dataservice-ee-plugin-<release-version>.zip to the plugins folder (pentaho-server/pentaho-solutions/system/kettle/plugins).
Copy pdi-dataservice-driver-pentaho-<release-version>.jar from <Pentaho installation directory>/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver to <Pentaho installation directory>/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib.
Restart Pentaho Server.
Create a regular or streaming Pentaho Data Service
You can create either a regular data service or a streaming analytics service.
Streaming data services are commonly used when creating streaming dashboards with CTools.
Save the transformation to the Pentaho Server.
Right-click the transformation step that outputs the data you want.
Select Data Services > New.
Enter a unique name in Service Name (Virtual Table Name).
Confirm Output step is the step you selected.
Optional: Select Streaming for Data Service Type.
Select OK.
Data service badge
After you create a Pentaho Data Service from a step, a data service badge is added to that step.
The badge indicates whether the step has a regular or streaming data service.

Open or edit a Pentaho Data Service
To open a data service for viewing or editing, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Edit.
Right-click the step with the data service badge, then select Data Services > Edit.
Delete a Pentaho Data Service
To delete a data service, do one of the following in the PDI client:
In Explore > View, select Data Services. Right-click a data service, then select Delete.
Right-click the step with the data service badge, then select Data Services > Delete.
Test a Pentaho Data Service
After creating your data service, test it to ensure that it runs properly and generates the data you need.
Testing can uncover bottlenecks in your transformation and help you decide which optimization techniques to apply.
Run a basic test
To run a basic test on a regular data service:
Verify Data Service Type is set to Regular.
Open the Test Data Service window using one of the following:
In the Data Service window, select Test Data Service.
In Explore > View > Data Services, right-click a data service and select Test.
Right-click the step with the data service badge, then select Data Services > Test.
Optional: Adjust settings:
Log Level. Controls how much detail appears in logs.
Max Rows. Limits how many rows appear in results.
Select Execute SQL.
Review the output in Examine test results.
Optional: If you need a clean run, clear the cache and test again.
Select Close.
Optional: Add an optimization.
Run a streaming optimization test
When you test streaming data, the stream is partitioned into windows (batches).
Windows can be time-based or row-based.
To test a streaming data service:
Verify Data Service Type is set to Streaming.
Open the Test Data Service window.
Select a window mode (Time Based or Row Based).
Configure window settings:
Window Size
Every
Limit
Optional: Adjust Log Level and Max Rows.
Select Execute SQL.
Review the output in Examine test results.
Select Stop to stop execution.
Select Close.
Optional: Add an optimization.
Run an optimization test
If you have added an optimization, run a test that passes the optimization parameter.
Example query:
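A query like the following works; it assumes a data service with a country field mapped to a pushdown parameter, as in the query pushdown example later in this article, and the names are illustrative:
SELECT * FROM media WHERE country = 'Belgium'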
Select Preview Optimization in the Test Data Service window.
Use Examine test results to interpret results.
Examine test results
Test results appear in the tabs in the bottom half of the Test Data Service window.
When you test or run a data service, two transformations run:
The service transformation. This is the transformation you built.
The generated transformation. PDI generates this based on executed SQL.
The following list describes each tab and what to look for:
Query Results. Events during the test run and query results. Verify the results; for streaming, watch the updates per window.
Optimized Queries. Processing information and results of optimizations. Verify that optimizations were applied correctly; compare with Service Metrics and Generated Transformation Logging.
Service Transformation Logging. Logs from the service transformation. Check for design or runtime issues; compare with Generated Transformation Logging.
Generated Transformation Logging. Logs from the generated transformation. Check for SQL support issues; see SQL support reference.
Service Metrics. Gantt chart timings for the service transformation. Find bottlenecks; compare with SQL Trans Metrics.
SQL Trans Metrics. Gantt chart timings for the generated transformation. Find bottlenecks; compare with Service Metrics.
SQL support reference
The Pentaho Data Service is designed to support a subset of SQL clauses and literals that are useful for data blending, optimizations, and other scenarios.
Limitations and constraints are listed at the end.
Supported SQL literals
The Pentaho Data Service supports the following literals:
Strings use single quotation marks. Escape a single quote with another single quote: ''.
Dates use square brackets. Supported formats: [yyyy/MM/dd HH:mm:ss.SSS], [yyyy/MM/dd HH:mm:ss], [yyyy/MM/dd].
For an IN list, date formats can use single quotes and dashes, for example: SELECT * FROM BUILDS WHERE BuildDate IN ('2015-03-18', '2015-03-22'). You cannot use bracketed date formats in an IN list.
Number and BigNumber cannot use grouping symbols. Use . for decimals, for example 123.45. Integers contain digits only.
Boolean values can be TRUE or FALSE.
Supported SQL clauses
The Pentaho Data Service supports the following clauses.
SELECT
COUNT(field)
COUNT(*)
COUNT(DISTINCT field)
DISTINCT <fields>
IIF(condition, true-value or field, false-value or field)
CASE WHEN condition THEN true-value ELSE false-value END
SUM
AVG
MIN
MAX
Aliases with AS or with spaces, for example: SUM(sales) AS "Total Sales" or SUM(sales) TotalSales
Constant expressions. See Supported SQL literals.
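For example, the following query combines several of these features; the service and field names are illustrative:
SELECT region AS "Region", COUNT(DISTINCT customer_id) customers, SUM(sales) AS "Total Sales" FROM sales_service GROUP BY region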
FROM
Only one Pentaho service name.
Aliases for the service name.
Omit the service name to query an empty row. SELECT 1 and SELECT 1 FROM dual are equivalent.
WHERE
Nested brackets
AND, OR, and NOT followed by brackets, for example: NOT (A=5 OR C=3)
Precedence
Literals (String and Integer)
PARAMETER('parameter-name')='value' (always evaluates to TRUE)
Operators: =, <, >, <=, =<, >=, =>, <>
LIKE (wildcards: % and ?)
REGEX
IS NULL and IS NOT NULL
IN
Conditions on IIF or its alias
DATE_TO_STR(date-field, <mask>)
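For example, the following query uses brackets, NOT, comparison operators, and LIKE; the service and field names are illustrative:
SELECT * FROM sales_service WHERE NOT (region = 'South' OR region = 'West') AND amount >= 100 AND customer_name LIKE 'A%'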
GROUP BY
Group on fields. You cannot group on IIF().
LIMIT
LIMIT count
LIMIT offset, count
LIMIT count OFFSET offset
HAVING
Apply conditions to aggregates, not to aliases.
Use identical strings for expressions.
Conditions can use aggregations that do not appear in the SELECT clause.
ORDER BY
Order on any column, even if not in the result.
Order on IIF or CASE WHEN expressions.
Other development considerations
Keep these constraints in mind:
You cannot JOIN one data service virtual table to another.
Data services use the Memory Group By step for grouping, so watch memory consumption when a query produces many groups.
You cannot specify the same field twice in the same SELECT clause.
Calculations and functions such as string concatenation are not supported in queries. Perform them in the transformation instead.
Optimize a Pentaho Data Service
As you test your data service, you might see bottlenecks or steps that could run more efficiently.
If you want to improve performance, apply an optimization technique.
Service cache
This optimization stores results in a cache.
By default, caching is enabled and results are stored for an hour.
Use this technique when result sets are modest and you expect repeat queries.
How the service cache optimization works
If you run the data service while results are cached, PDI can run your query against cached data instead of running the full transformation.
This behavior depends on:
Whether the cached dataset contains all the records required by the new query
Other optimizations you apply

When you run tests from the Test Data Service window and change the SQL, PDI does not use the cache.
Tests return at most the number of rows set in Max Rows, so reusing cached test output could produce incomplete results.
Adjust the cache duration
In the Data Service window, select the Service Cache tab.
Select Enable Caching.
Update Cache Duration (seconds).
Select OK.
Disable the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Clear the cache
In the Data Service window, select the Service Cache tab.
Clear Enable Caching.
Select OK.
Open the Data Service window again and return to the Service Cache tab.
Select Enable Caching.
Select OK.
Query pushdown
Use Query Pushdown to translate the SQL WHERE clause into a corresponding WHERE clause in:
Table Input steps
MongoDB Input steps
The input queries are filtered and handled at the source.
How query pushdown works
To apply Query Pushdown, set the optimization values, then add the optimization parameter to the input step query.
The optimization uses a parameter in place of the WHERE clause, for example:
SELECT * FROM 'employee' WHERE ${countryParam}

Add the query pushdown parameter to Table Input or MongoDB Input
Create a transformation with a Table Input or MongoDB Input step.
Open the input step.
Add a parameter where your WHERE clause value belongs:
SQL: SELECT * FROM media WHERE ${countryParam}
MongoDB: {$match : ${mongoDbParam}}
Press CTRL+SPACE to list parameters.
Select Replace Variables in Script?.
Select OK.
Set up query pushdown parameter optimization
Open the Data Service window and select the Query Pushdown tab.
Select the plus sign near Parameters.
Enter the name of the optimization parameter you used in the input query.
Select OK.
Select the step that contains the parameter in Step Name.
In Definitions, map fields:
Data Service Field. Field name as it appears in transformation output.
Step Field. Field name as it appears in the data source.
Optional: Select Get Optimizations to generate optimizations automatically.
Select OK.
Test and publish the data service. See Test a Pentaho Data Service and Publish a Pentaho Data Service.
Disable the query pushdown optimization
To disable this optimization, select Disable an Optimization in the Data Services window.
Parameter pushdown
Parameter Pushdown can be applied to any step in the transformation.
It maps a field value to a transformation parameter for simple equality conditions (example: WHERE region = 'South').
How parameter pushdown works
To set up Parameter Pushdown, configure the optimization, then add the parameter to the step.

Add the parameter pushdown parameter to the step
Add the parameter to a step, such as JSON Input or REST Client (see the example after these steps).
Select OK.
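For example, a REST Client step URL might embed the parameter; the endpoint and parameter name here are illustrative:
http://example.com/api/sales?region=${regionParam}
A query such as SELECT * FROM sales_service WHERE region = 'South' would then pass the value South to regionParam.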
Set up parameter pushdown optimization
Open the Data Service window and select the Parameter Pushdown tab.
In WHERE Clause Column, select the data service field name from the list.
Update Transformation Parameter as needed. The name must be unique in the data service.
Optional: Update Value Format. The default %s is usually sufficient.
Select OK.
Test and publish the data service. See Test a Pentaho Data Service and Publish a Pentaho Data Service.
Streaming optimization
This optimization limits the batch size used for processing.
Streaming records are partitioned into windows (batches) for processing.
How streaming optimization works
You can optimize processing by setting either:
Rows Limit. Maximum rows in a window.
Time Limit. Maximum elapsed time to create a window.
Adjust the row or time limits
In the Data Service window, select the Streaming tab.
Update Rows Limit or Time Limit.
Select OK.
Test and publish the data service. See Test a Pentaho Data Service and Publish a Pentaho Data Service.
Publish a Pentaho Data Service
To publish your data service, save the transformation containing the data service to the Pentaho Repository.
To validate that your data service was published:
Ensure external assets required by the transformation are accessible on the server.
Open a browser and log in to the Pentaho Server.
List data services on the server, for example:
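For example, assuming the default web application name of pentaho, the following URL lists the data services on the server:
http://<Pentaho Server Host:Port>/pentaho/kettle/listServices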
Share a Pentaho Data Service with others
Once your data service is created and tested, you can share it so others can connect and query it.
Share prerequisites
Ensure the user or group has permissions to:
Access the Pentaho Server and run queries.
Read and run the transformation that contains the data service.
Provide the data service name as saved in the Pentaho Repository.
If you use optimizations, provide names and definitions for any parameters.
Provide connection instructions, as described in the following sections.
Connect to the Pentaho Data Service from a Pentaho tool
A Pentaho Data Service is a virtual table containing the output of a step in a PDI transformation.
You can connect to and query:
A regular data service from Pentaho tools such as Report Designer, Analyzer, and the PDI client.
A streaming data service from a dashboard created with CTools.
You need the data service name and permission to run the transformation and access the Pentaho Server.
Connection parameters:
Connection Name. Unique name of the data service.
Connection Type. Pentaho Data Services.
Access. Native (JDBC).
Hostname. Pentaho Server hostname or IP address. Default is localhost for local installations.
Port Number. Pentaho Server port. Default is 8080.
Web App Name. Default is pentaho.
Username. A user allowed to run the data service.
Password. Password for that user.
Optional connection parameters:
proxyhostname. Proxy server for HTTP connections.
proxyport. Proxy server port.
nonproxyhosts. Hosts that bypass the proxy. Separate multiple hosts with commas.
debugtrans. File path where PDI saves the generated debug transformation, for example /tmp/debug.ktr.
PARAMETER_[optionname]=value. Sets a transformation parameter, for example PARAMETER_model=E6530.
secure. Set to TRUE for HTTPS. Default is HTTP.
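If you connect with a raw JDBC connection string instead of the connection dialog, these options can be appended as query parameters. For example (the hostname and parameter values here are illustrative):
jdbc:pdi://dataservice.example.com:8080/kettle?webappname=pentaho&secure=true&PARAMETER_model=E6530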
Connect to the Pentaho Data Service from a non-Pentaho tool
You can connect to and query a data service from tools like RStudio or SQuirreL.
To connect and query, you need the data service name and permission to run the transformation and access the server.
To use a non-Pentaho tool, install the data service JDBC driver, then create a connection in your tool.
Step 1: Download the Pentaho Data Service JDBC driver
You can download the driver using the PDI client or manually.
Download using the PDI client
Open the transformation and identify the step with the data service badge.
Open Driver Details using one of the following:
Right-click the step and select Data Services > Driver Details.
In Explore > View > Data Services, right-click a data service and select Driver Details.
In the Data Services window, select Driver Details.
In Driver Details, select Get Third-Party Driver.
Save Pentaho-Data-Service-Driver.zip.
Select Close.
Download manually
Go to <Pentaho installation directory>/design-tools/data-integration/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver.
Copy pdi-dataservice-driver-thirdparty-<release-version>.jar.
These steps assume you installed the Pentaho Data Services plugin. See Install Pentaho Data Services.
Step 2: Install the Pentaho Data Service JDBC driver
Extract the driver files and copy them to your application's JDBC driver directory.
Restart the application.
Create a connection using the instructions in Step 3: Create a connection from a non-Pentaho tool.
Step 3: Create a connection from a non-Pentaho tool
Most tools let you create a JDBC connection by specifying a driver class and connection string.
Driver class:
org.pentaho.di.trans.dataservice.jdbc.ThinDriver
Connection string format:
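jdbc:pdi://<hostname>:<port>/kettle?webappname=<web app name>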
Example:
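jdbc:pdi://localhost:8080/kettle?webappname=pentaho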
webappname is required when the data service runs on Pentaho Server.
Query a Pentaho Data Service
You can query a data service using SQL.
If the transformation uses a parameter, you can assign a value in your SQL query.
Limitations:
SQL support is limited. See SQL support reference.
Data services can be queried only with SQL.
Example: query with a parameter
Syntax:
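SELECT <fields> FROM <service name> WHERE PARAMETER('<parameter name>') = '<value>'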
Example:
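SELECT * FROM media WHERE PARAMETER('countryParam') = 'Belgium'
(The service and parameter names here are illustrative. The PARAMETER condition always evaluates to TRUE and passes the value to the transformation parameter.)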
Monitor a Pentaho Data Service
To monitor a data service, enter one of the following URLs in your browser:
List names of data services:
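http://<Pentaho Server Host:Port>/pentaho/kettle/listServices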
List data service status:
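http://<Pentaho Server Host:Port>/pentaho/kettle/status
(These URLs assume the default web application name, pentaho.)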
Replace <Pentaho Server Host:Port> with the host name or IP address and port for the Pentaho Server running the data service.
You must be authenticated and authorized on the Pentaho Server to access these endpoints.