Work with the Streamlined Data Refinery

Install, configure, and use the Streamlined Data Refinery (SDR) sample.

The Streamlined Data Refinery (SDR) is a simplified ETL “refinery” that uses Pentaho Data Integration (PDI) jobs to refine raw data and then publishes a model for use in Analyzer.

To get hands-on fast, use the Movie Ratings-SDR sample. Download it from SDR.zip.

How SDR works

SDR combines:

  • PDI for jobs, transformations, and parameter handling.

  • An SDR app (based on CTools) for user-driven refinement.

  • Pentaho Server to run jobs and host the published model.

At a high level:

  1. A user makes selections in the SDR form.

  2. The app calls Pentaho Server to run the refinery job chain.

  3. PDI builds a model with Build Model.

  4. PDI publishes the model back to Pentaho Server with Publish Model.

  5. Report designers use the published data source in Analyzer.

SDR overview

App Builder, Community Dashboard Editor, and CTools

App Builder helps you build Pentaho plugins without Java. You still need to know CTools and PDI basics.

Community Dashboard Editor (CDE) helps you build and preview dashboards. It integrates with the Pentaho User Console (PUC).

See the CDE tutorials for more depth: https://community.hitachivantara.com/s/article/cdf-tutorial.

Get started with App Builder

  1. Log in to PUC.

  2. Open App Builder:

    • {your PUC location}/pentaho/plugin/sparkl/api/main (for example, http://localhost:8080/pentaho/plugin/sparkl/api/main)

  3. Click the Plus Sign to create a new app.

  4. Name the app.

  5. Fill in plugin details on the About tab.

  6. On Element, add elements and dashboards.

  7. Click Edit to adjust dashboards.

  8. Click Preview to test.

  9. Share the app with users when ready.

Install and configure the SDR sample

Use this section if you want the sample running end-to-end. If you used the default evaluation install with Postgres, you may only need the sample install steps.

Install Pentaho software

  1. Install the latest Pentaho software.

  2. If you used the Evaluation installation method with Postgres, use password as the Postgres password.

  3. Log in to PUC and verify it works.

  4. Log out of PUC.

  5. Stop Pentaho Server.

Note: If you use Vertica, install the Vertica JDBC driver before you install the SDR sample.

Install the Vertica JDBC driver (Vertica only)

  1. Exit the PDI client if it is running.

  2. Stop Pentaho Server.

  3. Copy the Vertica JDBC driver to:

    • INSTALL_DIR/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib

    • INSTALL_DIR/design-tools/data-integration/lib

Download and install the SDR sample

  1. Extract SDR.zip into:

    • pentaho/server/pentaho-server/pentaho-solutions/system

  2. Start Pentaho Server.

  3. Log in to PUC.

  4. In PUC, open Tools.

  5. Confirm Movie Ratings-SDR Sample appears.

Tools menu

Next, go to Use the SDR sample form. If you are not using the default evaluation setup, finish the config steps first.

Configure KTR files for your environment (non-default installs)

If you are not using Postgres and the default evaluation setup, update the sample KTRs.

  1. In Spoon, click File > Open.

  2. Open SDR_data.ktr:

    • pentaho/server/pentaho-server/pentaho-solutions/system/SDR/endpoints/kettle

  3. Right-click Set Local Variables.

  4. Set the URL for your Pentaho Server.

  5. Click OK.

  6. In SDR_data.ktr, right-click Call_ML_SDR.

  7. Select Open Referenced Object > Job.

  8. In _ML_SDR.job, edit Create Table:

    • Point it to your staging database.

  9. Edit Publish Model:

    • Enter your Pentaho Server user ID and password if prompted.

    • Click Test Connection.

    • Click OK.

  10. Open _ML_SDR_REFINERY.ktr.

  11. Edit Out to Staging DB and select your staging database.

  12. Save all changed files.

  13. Exit the PDI client.

  14. Restart Pentaho Server.

Use Hadoop with the SDR

Complete the Hadoop prerequisites first. See Connecting to a Hadoop cluster with the PDI client.

  1. Open ML_SDR_REFINERY.ktr.

  2. Locate the Hadoop File Input step.

  3. Activate the hop between Hadoop File Input and Parse weblog.

  4. Deactivate all hops between Log File List and Read Weblog Files.

  5. Edit Out to Staging DB and select your staging database.

  6. Click OK, then save the transformation.

SDR transformation with Hadoop

App endpoints for SDR forms

You can run the app through its API endpoints. Each endpoint follows this pattern: http://<your Pentaho Server>/pentaho/plugin/SDR/api/<endpoint>.

Examples:

  • Genre selector

    • http://localhost:8080/pentaho/plugin/SDR/api/genre

  • Gender selector

    • http://localhost:8080/pentaho/plugin/SDR/api/gender

Example response:

Supported endpoint parameters:

  • genre: options for the Genre selector.

  • gender: options for the Gender selector.

  • occupation: options for the Occupation selector.

  • income: options for the Income selector.

  • firstdate: limit dates for processing.

  • data_source_name: names of data sources available on the server.

  • latest_requests: latest 10 requests as a table response.

  • sdr_data: processes the request and returns status.

  • refresh: clears cache and refreshes kettle and dashboard elements.
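
As a minimal sketch of calling these endpoints (the base URL and credentials below are assumptions for a local evaluation install; substitute your own Pentaho Server location and user), you could query the Genre selector options with Python:

```python
# Minimal sketch: query the SDR sample's "genre" endpoint over HTTP.
# BASE_URL and AUTH are placeholder assumptions; adjust for your server.
import requests

BASE_URL = "http://localhost:8080/pentaho/plugin/SDR/api"
AUTH = ("admin", "password")  # replace with your Pentaho Server credentials

response = requests.get(f"{BASE_URL}/genre", auth=AUTH)
response.raise_for_status()
print(response.text)  # options for the Genre selector

# The other endpoints (gender, occupation, income, and so on) follow the same pattern.
```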

Clean up the All Requests Processed list

You can delete items to clean up Latest Requests.

  1. Click All Requests Processed.

  2. Click Edit.

  3. Select the items you want to delete.

  4. Click Delete.

Select data sources to delete
Cleaned-up list

Use the SDR sample form

SDR refines data in PDI with Build Model and Annotate Stream. It then publishes the model with Publish Model. Report designers use the SDR form to create Analyzer reports.

Open and run the Movie Ratings-SDR Sample form

  1. Log in to PUC with administrator permissions.

  2. Open the Movie Ratings-SDR Sample form.

Movie Ratings-SDR Sample form

The All Requests Processed list shows up to 10 recent runs. Click an item to reuse it, or create a new request.

All requests processed
  3. Select Genre filters in Movie Review Filter.

  4. Select profile items in Member Profile.

Filters
  5. Enter Start Date.

  6. Enter End Date.

  7. Choose whether to Grant Access to Everyone.

  8. Enter a data source name.

  9. Click Let's do this.

  10. Use Go to Analyzer when you are ready to build reports.

Building blocks

SDR models typically use:

  • Build Model job entry to build a model from upstream data.

  • Annotate Stream step to add measures, attributes, and link dimensions.

  • Shared Dimension step to create reusable dimensions.

  • Publish Model job entry to publish the final model to Pentaho Server.

Build Model job entry

The Build Model job entry creates Data Source Wizard data sources. This can include both Metadata and Mondrian models.

It searches upstream for an output step or data service. You can also base a model on an existing Data Source Wizard or Analyzer model.

Note: If the source is a Pentaho Data Service, connect to a Pentaho Repository before publishing.

After Build Model runs, publish the model with Publish Model.

SDR basic workflow

Create a Build Model job entry

This task assumes you are on the job canvas in the PDI client.

  1. In Design, open Modeling.

  2. Double-click Build Model (or drag it to the canvas).

  3. Double-click the Build Model entry on the canvas.

  4. Set Entry name.

Note: Some characters are not valid in Data Source Wizard source names.

  5. In Source, select an output step or Pentaho Data Service.

  6. In Model Name, enter the model name.

  7. In Modeling Method, choose:

    • Auto Model to build a new model.

    • Use Existing to base the model on an existing server model.

  8. Click OK.

Select existing model options

If you use Use Existing, connect to Pentaho Server first. Your source and the existing model must share the same physical schema.

  1. Fill in Pentaho Server Connection:

    • URL: base URL for the server.

    • User Name: server user name.

    • Password: server password.

  2. Click Connect.

  3. Select a model from Existing Models.

  4. Optional: select Create model if not found.

  5. Click OK.

Variables for Build Model job entry

These variables are generated when the model is created. They are available after the first successful run.

  • JobEntryBuildModel.DatabaseConnection.<Model Name>

    • Returns the database connection name used by the source.

  • JobEntryBuildModel.XMI.<Model Name>

    • Returns the published XMI contents.

  • JobEntryBuildModel.XMI.DSW.<Model Name>

    • True if the XMI is delivered as a Data Source Wizard data source.
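
For illustration only (the model name Sales below is hypothetical), a downstream job entry or transformation could read these values with standard Kettle variable syntax, for example ${JobEntryBuildModel.DatabaseConnection.Sales} to retrieve the connection name or ${JobEntryBuildModel.XMI.Sales} to retrieve the generated XMI contents.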

Annotate Stream step

The Annotate Stream step refines SDR models by adding:

  • Measures

  • Attributes

  • Link dimensions

  • Calculated measures

It modifies the default model produced by Build Model. After annotating, publish the model with Publish Model.

SDR workflow

Metadata injection support

All fields in this step support metadata injection. Use this step with the ETL Metadata Injection step.

When injecting a shared annotation group:

  • You can reuse an existing shared annotation group.

  • You cannot create a new shared annotation group.

  • If you provide SHARED_ANNOTATION_GROUP, injected annotations are ignored.

Use the Annotate Stream step

This task assumes you are on the transformation canvas in the PDI client.

  1. In Design, open Flow.

  2. Double-click Annotate Stream (or drag it to the canvas).

  3. Double-click Annotate Stream on the canvas.

  4. Set Step name.

  5. Choose where to save annotations:

    • Local: saved in the transformation.

    • Shared: stored in the metastore for reuse.

  6. Optional: set Description.

  7. Click Select Fields and move fields to Selected Fields.

  8. The Annotations list shows:

    • Field: annotated field name.

    • Model Action: measure, attribute, or link dimension.

    • Summary: action details.

  9. Double-click a field to annotate it.

  10. Choose an action: Create Measure, Create Attribute, Link Dimension, or Create Dimension Key (each is described in the sections that follow).

  11. Optional: click Add Calculated Measure and fill in:

    • Measure Name

    • Format

    • Formula (MDX)

    • When calculating subtotals use this formula (optional)

    • Hide this calculated measure in the model (optional)

  12. Click Apply to save changes.

  13. Click OK to close.

Annotate Stream step

Create a measure on a stream field

Measures support aggregations such as SUM, AVERAGE, COUNT, and MINIMUM.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Create Measure from Actions.

  3. Set measure properties:

    • Measure Name: display name.

    • Aggregation Type: SUM, AVERAGE, COUNT, COUNT_DISTINCT, MINIMUM, MAXIMUM.

    • Format String: formatting for reports.

    • Description: short description stored on the server.

    • Hidden: True to hide it in Analyzer.

  4. Click OK.

Annotate dialog box

Create an attribute on a field

Attributes create a single-level hierarchy.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Create Attribute from Actions.

  3. Set attribute properties as needed:

    • Attribute Name (required)

    • Dimension (required)

    • Hierarchy (optional)

    • Description

    • Parent Attribute (for multi-level hierarchies)

    • Ordinal Field (for sorting members)

    • Format (for numbers and dates)

    • Is Unique (Mondrian SQL optimization)

    • Hidden (True hides it in Analyzer)

  4. For time attributes, also set:

    • Time Level Type

    • Time Source Format

  5. For geo attributes, also set:

    • Geo Type

    • Optional: Latitude and Longitude for Location

  6. Click OK.

Create an attribute

Create a link dimension

Link dimensions reuse a previously created shared dimension. Create the shared dimension first. See Shared Dimension step.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Link Dimension from Actions.

  3. Set:

    • Dimension Name: dimension name in the model.

    • Shared Dimension: existing shared dimension to link.

  4. Click OK.

Create link dimension
Note: To link a shared dimension, you must create a dimension key for it. See Create a dimension key.

Create a dimension key

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Double-click the field you want as the key.

  3. Select Create Dimension Key from Actions.

  4. Edit the key value if needed.

  5. Click OK.

Create annotation groups

Annotation groups help you reuse annotations across transformations. This is useful when the same source table is reused.

You can save groups:

  • Locally in the transformation.

  • Shared in the metastore for other users.

Create an annotation group locally

  1. Complete steps 1–3 in Use the Annotate Stream step.

  2. Select Local.

  3. Click Select Fields and create annotations.

  4. Click Apply.

Create an annotation group for sharing with other users

  1. Complete steps 1–3 in Use the Annotate Stream step.

  2. Select Shared.

  3. Click Add Annotation Group (Plus Sign).

  4. Enter a unique group name.

  5. Click Select Fields and create annotations.

  6. Click Apply.

Shared annotation group

Shared Dimension step

The Shared Dimension step is similar to Annotate Stream. It creates reusable dimensions and stores them for later use.

You must designate one annotation as the dimension key. See Create a dimension key in Shared Dimension step.

Run this transformation before Build Model, either earlier in the same job or in a separate job that runs first.

Shared dimension workflow

After you build the model, publish it with Publish Model.

Metadata injection support

All fields in this step support metadata injection. Use this step with the ETL Metadata Injection step.

When you use metadata injection with Shared Dimension:

  • The system loads the currently defined annotations.

  • It overwrites the metastore entry if the name matches.

Create a shared dimension

  1. In Design, open Flow.

  2. Double-click Shared Dimension (or drag it to the canvas).

  3. Double-click Shared Dimension on the canvas.

  4. Set Step name.

  5. In Shared Dimension Name, select an existing shared dimension or enter a new name.

  6. In Data Provider Step, select the source step for the dimension.

    • Valid sources include Combination Lookup, Database Lookup, Dimension Lookup, Insert/Update, Table Output, and Vertica Bulk Loader.

  7. Set Description.

  8. Click Select Fields and move fields to Selected Fields.

  9. In Annotations, double-click a field and select an action:

    • Create Attribute

    • Create Dimension Key

  10. Click Apply, then click OK.

Shared Dimension step

Create a dimension key in Shared Dimension step

  1. Complete steps 1–8 in Create a shared dimension.

  2. Select Create Dimension Key from Actions.

  3. In Dimension, enter or select the dimension name.

  4. Click OK.

Publish Model job entry

The Publish Model job entry publishes the model created by Build Model to Pentaho Server, where it is available in PUC and Analyzer.

Put Build Model before Publish Model in the job.

SDR workflow

After publishing:

  • The Data Source Wizard data source uses the model name.

  • You cannot edit connection info in Data Source Wizard.

    • The connection is stored in the Build Model entry.

Create a Publish Model job entry

This task assumes you are on the job canvas in PDI.

You need Pentaho Server permissions:

  • Publish Content

  • Manage Data Sources

  1. In Design, open Modeling.

  2. Double-click Publish Model (or drag it to the canvas).

  3. Double-click Publish Model on the canvas.

  4. Set Entry name.

  5. Optional: select Replace Existing Published Model.

  6. In Pentaho Server Connection, set:

    • URL

    • User Name

    • Password

  7. Click Test Connection.

  8. In Share, set:

    • Grant Access To: Everyone, User, or Role.

    • User/Role Name: only for User or Role.

  9. Click OK.

Publish Model dialog box
