Pentaho Data Services

Prototyping a data model can be time-consuming, particularly when it involves setting up databases, creating the data model, building a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.

One way to streamline this process is to expose the output of a transformation step as a Pentaho Data Service. The data service lets you query the step's output as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database. Instead, the results are published to the Pentaho Server as a virtual table.

Note: You must have a Pentaho Server and repository to publish the data service.

The virtual table is a JDBC-compliant data source that you and others can connect to and query with SQL, provided you can access the server and the transformation. Compatible tools include Pentaho Report Designer, Pentaho Interactive Reports, and CTools, as well as non-Pentaho JDBC-compliant tools such as RStudio, DBVisualizer, or SQuirreL.

A Pentaho Data Service is also useful where building and maintaining a data warehouse would be impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or rapidly evolving data sets. For example, to compare your product prices with your competitors', you can create a transformation that blends prices from your in-house data sources with competitor prices, then convert the output step of that transformation into a Pentaho Data Service. The data service creates a virtual table that you can query when you connect to the Pentaho Server, and you or others can visualize the results in Analyzer or another tool, as with any other JDBC data source.
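For example, assuming the data service is named PRICE_COMPARISON and exposes fields productName, ourPrice, and competitorPrice (all hypothetical names), an analyst could run a query such as:

SELECT productName, ourPrice, competitorPrice FROM PRICE_COMPARISON WHERE productName LIKE 'Widget%'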

The Pentaho Data Service also has a testing tool. This tool generates several logs and reports that you can use to refine the data service and decide where to apply specialized optimizations. You can also define parameters that others can use to pose customized queries. For example, you can create a data service that publishes a virtual “fact” table of a moderately sized research dataset to a Pentaho Server. You can test and add optimizations and parameters, such as gender or test type, so that the data service runs more quickly. Then, you can share connection and parameter information with a group of researchers, who can query the virtual table using Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.
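As a sketch of such a parameterized query, assuming the virtual fact table is named researchFacts and defines a gender parameter (hypothetical names), a researcher could run:

SELECT testType, AVG(score) AS "Average Score" FROM researchFacts WHERE PARAMETER('gender')='F' GROUP BY testType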

Pentaho Data Services support a subset of SQL. For details, see SQL support reference.

Note: Install the Pentaho Data Services plugin before you create data services.

Install Pentaho Data Services

The Pentaho Data Services plugin is optional. You can download the plugin from the Support Portal.

Pentaho Data Integration

To install the plugin in Pentaho Data Integration:

  1. Unpack pdi-dataservice-ee-plugin-<release-version>.zip to the plugins folder (data-integration/plugins).

  2. Restart Pentaho Data Integration.

Pentaho Server

To install the plugin on Pentaho Server:

  1. Unpack pdi-dataservice-ee-plugin-<release-version>.zip to the plugins folder (pentaho-server/pentaho-solutions/system/kettle/plugins).

  2. Copy pdi-dataservice-driver-pentaho-<release-version>.jar:

    • From: <Pentaho installation directory>/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver

    • To: <Pentaho installation directory>/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib

  3. Restart Pentaho Server.

Create a regular or streaming Pentaho Data Service

You can create either a regular data service or a streaming data service.

Streaming data services are commonly used when creating streaming dashboards with CTools.

  1. Save the transformation to the Pentaho Server.

  2. Right-click the transformation step that outputs the data you want.

  3. Select Data Services > New.

  4. Enter a unique name in Service Name (Virtual Table Name).

  5. Confirm that Output step is set to the step you selected.

  6. Optional: Select Streaming for Data Service Type.

  7. Select OK.

Data service badge

After you create a Pentaho Data Service from a step, a data service badge is added to that step.

The badge indicates whether the step has a regular or streaming data service.

Figure: Regular and streaming data service badge types

Open or edit a Pentaho Data Service

To open a data service for viewing or editing, do one of the following in the PDI client:

  • In Explore > View, select Data Services. Right-click a data service, then select Edit.

  • Right-click the step with the data service badge, then select Data Services > Edit.

Delete a Pentaho Data Service

To delete a data service, do one of the following in the PDI client:

  • In Explore > View, select Data Services. Right-click a data service, then select Delete.

  • Right-click the step with the data service badge, then select Data Services > Delete.

Test a Pentaho Data Service

After creating your data service, test it to ensure that it runs properly and generates the data you need.

Testing can uncover bottlenecks in your transformation and help you decide which optimization techniques to apply.

Run a basic test

To run a basic test on a regular data service (an example test query follows these steps):

  1. Verify Data Service Type is set to Regular.

  2. Open the Test Data Service window using one of the following:

    • In the Data Service window, select Test Data Service.

    • In Explore > View > Data Services, right-click a data service and select Test.

    • Right-click the step with the data service badge, then select Data Services > Test.

  3. Optional: Adjust settings:

    • Log Level. Controls how much detail appears in logs.

    • Max Rows. Limits how many rows appear in results.

  4. Select Execute SQL.

  5. Review the output in Examine test results.

  6. Optional: If you need a clean run, clear the cache and test again.

  7. Select Close.

  8. Optional: Add an optimization.
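A minimal test query, assuming a data service named myService (a hypothetical name), might be:

SELECT * FROM myService LIMIT 100

Note that test results are also capped by the Max Rows setting.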

Run a streaming data service test

When you test streaming data, the stream is partitioned into windows (batches).

Windows can be time-based or row-based.

To test a streaming data service:

  1. Verify Data Service Type is set to Streaming.

  2. Open the Test Data Service window.

  3. Select a window mode (Time Based or Row Based).

  4. Configure window settings:

    • Window Size. The number of rows (row based) or the time interval (time based) covered by each window.

    • Every. How often a new window is generated, in rows or in time.

    • Limit. An upper bound for assembling the window: the maximum time to wait in row-based mode, or the maximum number of rows in time-based mode.

  5. Optional: Adjust Log Level and Max Rows.

  6. Select Execute SQL.

  7. Review the output in Examine test results.

  8. Select Stop to stop execution.

  9. Select Close.

  10. Optional: Add an optimization.

Run an optimization test

If you have added an optimization, run a test that passes the optimization parameter (an example query follows these steps):

  • Select Preview Optimization in the Test Data Service window.

  • Use Examine test results to interpret the results.
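For example, if you added a query pushdown optimization that maps a field named country (hypothetical names), a query whose WHERE clause exercises the optimization might be:

SELECT * FROM myService WHERE country = 'US'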

Examine test results

Test results appear in the tabs in the bottom half of the Test Data Service window.

When you test or run a data service, two transformations run:

  • The service transformation. This is the transformation you built.

  • The generated transformation. PDI generates this transformation from the SQL you execute.

The following list describes each tab and what to look for:

  • Query Results. Events during the test run and the query results. Verify the results; for streaming, watch the updates per window.

  • Optimized Queries. Processing information and the results of optimizations. Verify that optimizations applied correctly; compare with Service Metrics and Generated Transformation Logging.

  • Service Transformation Logging. Logs from the service transformation. Check for design or runtime issues; compare with Generated Transformation Logging.

  • Generated Transformation Logging. Logs from the generated transformation. Check for SQL support issues; see SQL support reference.

  • Service Metrics. Gantt chart timings for the service transformation. Use them to find bottlenecks; compare with SQL Trans Metrics.

  • SQL Trans Metrics. Gantt chart timings for the generated transformation. Use them to find bottlenecks; compare with Service Metrics.

SQL support reference

The Pentaho Data Service is designed to support a subset of SQL clauses and literals that are useful for data blending, optimizations, and other scenarios.

Limitations and constraints are listed at the end.

Supported SQL literals

The Pentaho Data Service supports the following literals; an example query follows the list:

  • Strings use single quotation marks. Escape a single quote using another single quote: ''.

  • Dates use square brackets. Supported formats:

    • [yyyy/MM/dd HH:mm:ss.SSS]

    • [yyyy/MM/dd HH:mm:ss]

    • [yyyy/MM/dd]

  • For an IN list, date formats can use single quotes and dashes, for example:

    • SELECT * FROM BUILDS WHERE BuildDate IN ('2015-03-18', '2015-03-22')

    • You cannot use bracketed date formats in an IN list.

  • Number and BigNumber cannot use grouping symbols. Use . for decimals, for example 123.45.

  • Integers contain digits only.

  • Boolean values can be TRUE or FALSE.
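A query exercising these literal forms might look like this (hypothetical service and field names):

SELECT * FROM ordersService WHERE customerName = 'O''Brien' AND orderDate > [2015/03/18] AND total > 123.45 AND active = TRUE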

Supported SQL clauses

The Pentaho Data Service supports the following clauses; a combined example follows the list.

SELECT

  • COUNT(field)

  • COUNT(*)

  • COUNT(DISTINCT field)

  • DISTINCT <fields>

  • IIF (condition, true-value or field, false-value or field)

  • CASE WHEN condition THEN true-value ELSE false-value END

  • SUM

  • AVG

  • MIN

  • MAX

  • Aliases with AS or with spaces. Example: SUM(sales) AS "Total Sales" or SUM(sales) TotalSales

  • Constant expressions. See Supported SQL literals.

FROM

  • Only one Pentaho service name.

  • Aliases for the service name.

  • Omit the service name to query an empty row. SELECT 1 and SELECT 1 FROM dual are the same.

WHERE

  • Nested brackets

  • AND, OR, NOT followed by brackets (example: NOT (A=5 OR C=3))

  • Precedence

  • Literals (String and Integer)

  • PARAMETER('parameter-name')='value' (always evaluates to TRUE)

  • Operators: =, <, >, <=, =<, >=, =>, <>

  • LIKE (wildcards: %, ?)

  • REGEX

  • IS NULL, IS NOT NULL

  • IN

  • Conditions on IIF or its alias

  • DATE_TO_STR(date-field, <mask>)

GROUP BY

  • Group on fields. Not on IIF().

LIMIT

  • LIMIT

  • LIMIT offset, count

  • LIMIT count OFFSET offset

HAVING

  • Apply conditions to aggregate expressions, not to their aliases.

  • Write an aggregate expression in HAVING as an identical string to the one in the SELECT clause; otherwise it is not recognized as the same expression.

  • You can apply conditions to aggregations that do not appear in the SELECT clause.

ORDER BY

  • Order on any column, even if not in the result.

  • Order on IIF or CASE-WHEN expressions.
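The following query sketches several of these clauses together, using a hypothetical service named salesService:

SELECT region, SUM(sales) AS "Total Sales" FROM salesService WHERE status = 'Shipped' AND orderDate > [2015/01/01] GROUP BY region HAVING SUM(sales) > 10000 ORDER BY region LIMIT 10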

Other development considerations

Keep these constraints in mind:

  • You cannot JOIN one data service virtual table to another.

  • Data services use the Memory Group By step for grouping, so watch memory consumption when a query produces many groups.

  • You cannot specify the same field twice in the same SELECT clause.

  • Calculations and functions like string concatenation are not supported in queries. Do them in the transformation.

Optimize a Pentaho Data Service

As you test your data service, you might see bottlenecks or steps that could run more efficiently.

If you want to improve performance, apply an optimization technique.

Service cache

This optimization stores results in a cache.

By default, caching is enabled and results are stored for an hour.

Use this technique when result sets are modest and you expect repeat queries.

How the service cache optimization works

If you run the data service while results are cached, PDI can run your query against cached data instead of running the full transformation.

This behavior depends on:

  • Whether the cached dataset contains all the records required by the new query

  • Other optimizations you apply

Figure: PDI Data Service Cache Optimization Workflow

When you run tests from the Test Data Service window and change the SQL, PDI does not use the cache: because tests return at most Max Rows rows, cached test results could be incomplete.

Adjust the cache duration

  1. In the Data Service window, select the Service Cache tab.

  2. Select Enable Caching.

  3. Update Cache Duration (seconds).

  4. Select OK.

Disable the cache

  1. In the Data Service window, select the Service Cache tab.

  2. Clear Enable Caching.

  3. Select OK.

Clear the cache

Toggling caching off and back on clears any stored results:

  1. In the Data Service window, select the Service Cache tab.

  2. Clear Enable Caching.

  3. Select OK.

  4. Open the Data Service window again and return to the Service Cache tab.

  5. Select Enable Caching.

Query pushdown

Use Query Pushdown to translate the SQL WHERE clause into a corresponding WHERE clause in:

  • Table Input steps

  • MongoDB Input steps

The input queries are then filtered and handled at the data source, reducing the rows the transformation must process.

How query pushdown works

To apply Query Pushdown, set the optimization values, then add the optimization parameter to the input step query.

The optimization uses a parameter in place of the WHERE clause, for example:

SELECT * FROM employee WHERE ${countryParam}

Figure: PDI Data Service Query Pushdown Optimization Workflow
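As a sketch of the substitution, using the employee query above and a hypothetical service named employeeService:

-- Table Input query template inside the transformation:
SELECT * FROM employee WHERE ${countryParam}

-- A user queries the data service:
SELECT * FROM employeeService WHERE country = 'US'

-- The optimization translates the WHERE clause into the parameter, so the source database executes:
SELECT * FROM employee WHERE country = 'US'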

Add the query pushdown parameter to Table Input or MongoDB Input

  1. Create a transformation with a Table Input or MongoDB Input step.

  2. Open the input step.

  3. Add a parameter where your WHERE clause value belongs:

    • SQL: SELECT * FROM media WHERE ${countryParam}

    • MongoDB: {$match : ${mongoDbParam}}

  4. Press CTRL+SPACE to list parameters.

  5. Select Replace Variables in Script?.

  6. Select OK.

Set up query pushdown parameter optimization

  1. Open the Data Service window and select the Query Pushdown tab.

  2. Select the plus sign near Parameters.

  3. Enter the name of the optimization parameter you used in the input query.

  4. Select OK.

  5. Select the step that contains the parameter in Step Name.

  6. In Definitions, map fields:

    • Data Service Field. Field name as it appears in transformation output.

    • Step Field. Field name as it appears in the data source.

  7. Optional: Select Get Optimizations to generate optimizations automatically.

  8. Select OK.

Disable the query pushdown optimization

To disable this optimization, select Disable an Optimization in the Data Service window.

Parameter pushdown

Parameter Pushdown can be applied to any step in the transformation.

It maps a field value to a transformation parameter for simple equality conditions (example: WHERE region='South').

How parameter pushdown works

To set up Parameter Pushdown, configure the optimization, then add the parameter to the step.

Figure: PDI Parameter Pushdown Optimization Workflow
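As a sketch with hypothetical names, suppose the data service field region is mapped to the transformation parameter regionParam:

-- User query against the data service:
SELECT * FROM salesService WHERE region = 'South'

-- PDI extracts the value and sets ${regionParam} to South (formatted with Value Format, default %s),
-- so any step that references ${regionParam} receives that value at run time.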

Add the parameter pushdown parameter to the step

  1. Add the parameter to a step, such as JSON Input or REST Client.

  2. Select OK.

Set up parameter pushdown optimization

  1. Open the Data Service window and select the Parameter Pushdown tab.

  2. In WHERE Clause Column, enter the data service field name from the existing list.

  3. Update Transformation Parameter as needed. The name must be unique in the data service.

  4. Optional: Update Value Format. Default %s is usually sufficient.

  5. Select OK.

Streaming optimization

This optimization limits the batch size used for processing.

Streaming records are partitioned into windows (batches) for processing.

How streaming optimization works

You can optimize processing by setting either:

  • Rows Limit. Maximum rows in a window.

  • Time Limit. Maximum elapsed time to create a window.

Adjust the row or time limits

  1. In the Data Service window, select the Streaming tab.

  2. Update Rows Limit or Time Limit.

  3. Select OK.

Publish a Pentaho Data Service

To publish your data service, save the transformation containing the data service to the Pentaho Repository.

To validate that your data service was published:

  1. Ensure external assets required by the transformation are accessible on the server.

  2. Open a browser and log in to the Pentaho Server.

  3. List data services on the server, for example, by browsing to the listServices endpoint:

    http://<Pentaho Server Host:Port>/pentaho/kettle/listServices

Share a Pentaho Data Service with others

Once your data service is created and tested, you can share it so others can connect and query it.

Share prerequisites

  1. Ensure the user or group has permissions to:

    • Access the Pentaho Server and run queries.

    • Read and run the transformation that contains the data service.

  2. Provide the data service name as saved in the Pentaho Repository.

  3. If you use optimizations, provide names and definitions for any parameters.

Connect to the Pentaho Data Service from a Pentaho tool

A Pentaho Data Service is a virtual table containing the output of a step in a PDI transformation.

You can connect to and query:

  • A regular data service from Pentaho tools such as Report Designer, Analyzer, and the PDI client.

  • A streaming data service from a dashboard created with CTools.

Note: You need the data service name and permission to run the transformation and access the Pentaho Server.

Connection parameters:

  • Connection Name. Unique name of the data service.

  • Connection Type. Pentaho Data Services.

  • Access. Native (JDBC).

  • Hostname. Pentaho Server hostname or IP address. Default is localhost for local installs.

  • Port Number. Pentaho Server port. Default is 8080.

  • Web App Name. Default is pentaho.

  • Username. A user allowed to run the data service.

  • Password. Password for that user.

Optional connection parameters:

  • proxyhostname. Proxy server for HTTP connections.

  • proxyport. Proxy server port.

  • nonproxyhosts. Hosts that bypass the proxy. Separate multiple hosts with commas.

  • debugtrans. File path where PDI saves the generated debug transformation. Example: /tmp/debug.ktr.

  • PARAMETER_[optionname]=value. Sets a transformation parameter. Example: PARAMETER_model=E6530.

  • secure. Set to TRUE for HTTPS. Default is HTTP.

Connect to the Pentaho Data Service from a non-Pentaho tool

You can connect to and query a data service from tools like RStudio or SQuirreL.

To connect and query, you need the data service name and permission to run the transformation and access the server.

To use a non-Pentaho tool, install the data service JDBC driver, then create a connection in your tool.

Step 1: Download the Pentaho Data Service JDBC driver

You can download the driver using the PDI client or manually.

Download using the PDI client

  1. Open the transformation and identify the step with the data service badge.

  2. Open Driver Details using one of the following:

    • Right-click the step and select Data Services > Driver Details.

    • In Explore > View > Data Services, right-click a data service and select Driver Details.

    • In the Data Services window, select Driver Details.

  3. In Driver Details, select Get Third-Party Driver.

  4. Save Pentaho-Data-Service-Driver.zip.

  5. Select Close.

Download manually

  1. Go to <Pentaho installation directory>/design-tools/data-integration/plugins/pdi-dataservices-ee-plugin/Data Service JDBC Driver.

  2. Copy pdi-dataservice-driver-thirdparty-<release-version>.jar.

Note: These steps assume you have installed the Pentaho Data Services plugin. See Install Pentaho Data Services.

Step 2: Install the Pentaho Data Service JDBC driver

  1. Extract the driver files and copy them to your application's JDBC driver directory.

  2. Restart the application so that it loads the driver.

  3. Create a connection using the instructions in Connect to the Pentaho Data Service from a non-Pentaho tool.

Step 3: Create a connection from a non-Pentaho tool

Most tools let you create a JDBC connection by specifying a driver class and connection string.

Driver class:

org.pentaho.di.trans.dataservice.jdbc.ThinDriver

Connection string format:

jdbc:pdi://<hostname>:<port>/kettle?webappname=<webappname>

Example:

jdbc:pdi://localhost:8080/kettle?webappname=pentaho

The webappname option is required when the data service runs on Pentaho Server.

Query a Pentaho Data Service

You can query a data service using SQL.

If the transformation uses a parameter, you can assign a value in your SQL query.

Limitations: your query must use the supported subset of SQL. See SQL support reference.

Example: query with a parameter

Syntax:

SELECT * FROM <data-service-name> WHERE PARAMETER('<parameter-name>')='<value>'

Example (hypothetical service and parameter names):

SELECT * FROM dataServiceName WHERE PARAMETER('countryParam')='Belgium'

Monitor a Pentaho Data Service

To monitor a data service, enter one of the following URLs in your browser:

  • List the names of available data services:

    http://<Pentaho Server Host:Port>/pentaho/kettle/listServices

  • List data service status:

    http://<Pentaho Server Host:Port>/pentaho/kettle/status

Replace <Pentaho Server Host:Port> with the host name or IP address and port of the Pentaho Server running the data service.

You must be authenticated and authorized on the Pentaho Server to access these endpoints.
