Work with the Streamlined Data Refinery

Install, configure, and use the Streamlined Data Refinery (SDR) sample.

The Streamlined Data Refinery (SDR) is a simplified ETL “refinery” that uses Pentaho Data Integration (PDI) jobs to refine raw data and then publishes a model for use in Analyzer.

To get hands-on fast, use the Movie Ratings-SDR sample. Download it from SDR.zip.

How SDR works

SDR combines:

  • PDI for jobs, transformations, and parameter handling.

  • An SDR app (based on CTools) for user-driven refinement.

  • Pentaho Server to run jobs and host the published model.

At a high level:

  1. A user makes selections in the SDR form.

  2. The app calls Pentaho Server to run the refinery job chain.

  3. PDI builds a model with Build Model.

  4. PDI publishes the model back to Pentaho Server with Publish Model.

  5. Report designers use the published data source in Analyzer.

SDR overview

App Builder, Community Dashboard Editor, and CTools

App Builder helps you build Pentaho plugins without Java. You still need to know CTools and PDI basics.

Community Dashboard Editor (CDE) helps you build and preview dashboards. It integrates with the Pentaho User Console (PUC).

See the CDE tutorials for more depth: https://community.hitachivantara.com/s/article/cdf-tutorial.

Get started with App Builder

  1. Log in to PUC.

  2. Open App Builder:

    • {your PUC location}/pentaho/plugin/sparkl/api/main (for example, http://localhost:8080/pentaho/plugin/sparkl/api/main)

  3. Click the Plus Sign to create a new app.

  4. Name the app.

  5. Fill in plugin details on the About tab.

  6. On Element, add elements and dashboards.

  7. Click Edit to adjust dashboards.

  8. Click Preview to test.

  9. Share the app with users when ready.

Install and configure the SDR sample

Use this section if you want the sample running end-to-end. If you used the default evaluation install with Postgres, you may only need the sample install steps.

Install Pentaho software

  1. Install the latest Pentaho software.

  2. If you used the Evaluation installation method with Postgres, use password as the Postgres password.

  3. Log in to PUC and verify it works.

  4. Log out of PUC.

  5. Stop Pentaho Server.

Note: If you use Vertica, install the Vertica JDBC driver before you install the SDR sample.

Install the Vertica JDBC driver (Vertica only)

  1. Exit the PDI client if it is running.

  2. Stop Pentaho Server.

  3. Copy the Vertica JDBC driver to:

    • INSTALL_DIR/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib

    • INSTALL_DIR/design-tools/data-integration/lib

Download and install the SDR sample

  1. Extract SDR.zip into:

    • pentaho/server/pentaho-server/pentaho-solutions/system

  2. Start Pentaho Server.

  3. Log in to PUC.

  4. In PUC, open Tools.

  5. Confirm Movie Ratings-SDR Sample appears.

Tools menu

Next, go to Use the SDR sample form. If you are not using the default evaluation setup, finish the config steps first.

Configure KTR files for your environment (non-default installs)

If you are not using Postgres and the default evaluation setup, update the sample KTRs.

  1. In Spoon, click File > Open.

  2. Open SDR_data.ktr:

    • pentaho/server/pentaho-server/pentaho-solutions/system/SDR/endpoints/kettle

  3. Right-click Set Local Variables.

  4. Set the URL for your Pentaho Server.

  5. Click OK.

  6. In SDR_data.ktr, right-click Call_ML_SDR.

  7. Select Open Referenced Object > Job.

  8. In _ML_SDR.job, edit Create Table:

    • Point it to your staging database.

  9. Edit Publish Model:

    • Enter your Pentaho Server user ID and password if prompted.

    • Click Test Connection.

    • Click OK.

  10. Open _ML_SDR_REFINERY.ktr.

  11. Edit Out to Staging DB and select your staging database.

  12. Save all changed files.

  13. Exit the PDI client.

  14. Restart Pentaho Server.

Use Hadoop with the SDR

Complete the Hadoop prerequisites first. See Connecting to a Hadoop cluster with the PDI client.

  1. Open ML_SDR_REFINERY.ktr.

  2. Locate the Hadoop File Input step.

  3. Activate the hop between Hadoop File Input and Parse weblog.

  4. Deactivate all hops between Log File List and Read Weblog Files.

  5. Edit Out to Staging DB and select your staging database.

  6. Click OK, then save the transformation.

SDR transformation with Hadoop

App endpoints for SDR forms

You can run the app through its API endpoints. Each endpoint follows this pattern: http://<your Pentaho Server>/pentaho/plugin/SDR/api/<endpoint>.

Examples:

  • Genre selector

    • http://localhost:8080/pentaho/plugin/SDR/api/genre

  • Gender selector

    • http://localhost:8080/pentaho/plugin/SDR/api/gender

Example response:

Supported endpoint parameters:

  • genre: options for the Genre selector.

  • gender: options for the Gender selector.

  • occupation: options for the Occupation selector.

  • income: options for the Income selector.

  • firstdate: limit dates for processing.

  • data_source_name: names of data sources available on the server.

  • latest_requests: latest 10 requests as a table response.

  • sdr_data: processes the request and returns status.

  • refresh: clears cache and refreshes kettle and dashboard elements.
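
As a minimal sketch of calling these endpoints (the base URL and credentials below are assumptions for a local evaluation install; substitute your own Pentaho Server location and user), you could query the Genre selector options with Python:

```python
# Minimal sketch: query the SDR sample's "genre" endpoint over HTTP.
# BASE_URL and AUTH are placeholder assumptions; adjust for your server.
import requests

BASE_URL = "http://localhost:8080/pentaho/plugin/SDR/api"
AUTH = ("admin", "password")  # replace with your Pentaho Server credentials

response = requests.get(f"{BASE_URL}/genre", auth=AUTH)
response.raise_for_status()
print(response.text)  # options for the Genre selector

# The other endpoints (gender, occupation, income, and so on) follow the same pattern.
```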

Clean up the All Requests Processed list

You can delete items to clean up Latest Requests.

  1. Click All Requests Processed.

  2. Click Edit.

  3. Select the items you want to delete.

  4. Click Delete.

Select data sources to delete
Cleaned-up list

Use the SDR sample form

SDR refines data in PDI with Build Model and Annotate Stream. It then publishes the model with Publish Model. Report designers use the SDR form to create Analyzer reports.

Open and run the Movie Ratings-SDR Sample form

  1. Log in to PUC with administrator permissions.

  2. Open the Movie Ratings-SDR Sample form.

Movie Ratings-SDR Sample form

The All Requests Processed list shows up to 10 recent runs. Click an item to reuse it, or create a new request.

All requests processed
  3. Select Genre filters in Movie Review Filter.

  4. Select profile items in Member Profile.

Filters
  5. Enter Start Date.

  6. Enter End Date.

  7. Choose whether to Grant Access to Everyone.

  8. Enter a data source name.

  9. Click Let's do this.

  10. Use Go to Analyzer when you are ready to build reports.

Building blocks

SDR models typically use:

  • Build Model job entry to build a model from upstream data.

  • Annotate Stream step to add measures, attributes, and link dimensions.

  • Shared Dimension step to create reusable dimensions.

  • Publish Model job entry to publish the final model to Pentaho Server.

Build Model job entry

The Build Model job entry creates Data Source Wizard data sources. This can include both Metadata and Mondrian models.

It searches upstream for an output step or data service. You can also base a model on an existing Data Source Wizard or Analyzer model.

Note: If the source is a Pentaho Data Service, connect to a Pentaho Repository before publishing.

After Build Model runs, publish the model with Publish Model.

SDR basic workflow

Create a Build Model job entry

This task assumes you are on the job canvas in the PDI client.

  1. In Design, open Modeling.

  2. Double-click Build Model (or drag it to the canvas).

  3. Double-click the Build Model entry on the canvas.

  4. Set Entry name.

Note: Some characters are not valid in Data Source Wizard source names.

  5. In Source, select an output step or Pentaho Data Service.

  6. In Model Name, enter the model name.

  7. In Modeling Method, choose:

    • Auto Model to build a new model.

    • Use Existing to base the model on an existing server model.

  8. Click OK.

Select existing model options

If you use Use Existing, connect to Pentaho Server first. Your source and the existing model must share the same physical schema.

  1. Fill in Pentaho Server Connection:

    • URL: base URL for the server.

    • User Name: server user name.

    • Password: server password.

  2. Click Connect.

  3. Select a model from Existing Models.

  4. Optional: select Create model if not found.

  5. Click OK.

Variables for Build Model job entry

These variables are generated when the model is created. They are available after the first successful run.

  • JobEntryBuildModel.DatabaseConnection.<Model Name>

    • Returns the database connection name used by the source.

  • JobEntryBuildModel.XMI.<Model Name>

    • Returns the published XMI contents.

  • JobEntryBuildModel.XMI.DSW.<Model Name>

    • True if the XMI is delivered as a Data Source Wizard data source.
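
For illustration only (the model name Sales below is hypothetical), a downstream job entry or transformation could read these values with standard Kettle variable syntax, for example ${JobEntryBuildModel.DatabaseConnection.Sales} to retrieve the connection name or ${JobEntryBuildModel.XMI.Sales} to retrieve the generated XMI contents.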

Annotate Stream step

The Annotate Stream step refines SDR models by adding:

  • Measures

  • Attributes

  • Link dimensions

  • Calculated measures

It modifies the default model produced by Build Model. After annotating, publish the model with Publish Model.

SDR workflow

Metadata injection support

All fields in this step support metadata injection. Use this step with the ETL Metadata Injection step.

When injecting a shared annotation group:

  • You can reuse an existing shared annotation group.

  • You cannot create a new shared annotation group.

  • If you provide SHARED_ANNOTATION_GROUP, injected annotations are ignored.

Use the Annotate Stream step

This task assumes you are on the transformation canvas in the PDI client.

  1. In Design, open Flow.

  2. Double-click Annotate Stream (or drag it to the canvas).

  3. Double-click Annotate Stream on the canvas.

  4. Set Step name.

  5. Choose where to save annotations:

    • Local: saved in the transformation.

    • Shared: stored in the metastore for reuse.

  6. Optional: set Description.

  7. Click Select Fields and move fields to Selected Fields.

  8. The Annotations list shows:

    • Field: annotated field name.

    • Model Action: measure, attribute, or link dimension.

    • Summary: action details.

  9. Double-click a field to annotate it.

  10. Choose an action: Create Measure, Create Attribute, Link Dimension, or Create Dimension Key (each is described in the sections that follow).

  11. Optional: click Add Calculated Measure and fill in:

    • Measure Name

    • Format

    • Formula (MDX)

    • When calculating subtotals use this formula (optional)

    • Hide this calculated measure in the model (optional)

  12. Click Apply to save changes.

  13. Click OK to close.

Annotate Stream step

Create a measure on a stream field

Measures support aggregations such as SUM, AVERAGE, COUNT, and MINIMUM.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Create Measure from Actions.

  3. Set measure properties:

    • Measure Name: display name.

    • Aggregation Type: SUM, AVERAGE, COUNT, COUNT_DISTINCT, MINIMUM, MAXIMUM.

    • Format String: formatting for reports.

    • Description: short description stored on the server.

    • Hidden: True to hide it in Analyzer.

  4. Click OK.

Annotate dialog box

Create an attribute on a field

Attributes create a single-level hierarchy.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Create Attribute from Actions.

  3. Set attribute properties as needed:

    • Attribute Name (required)

    • Dimension (required)

    • Hierarchy (optional)

    • Description

    • Parent Attribute (for multi-level hierarchies)

    • Ordinal Field (for sorting members)

    • Format (for numbers and dates)

    • Is Unique (Mondrian SQL optimization)

    • Hidden (True hides it in Analyzer)

  4. For time attributes, also set:

    • Time Level Type

    • Time Source Format

  5. For geo attributes, also set:

    • Geo Type

    • Optional: Latitude and Longitude for Location

  6. Click OK.

Create an attribute

Create a link dimension

Link dimensions reuse a previously created shared dimension. Create the shared dimension first. See Shared Dimension step.

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Select Link Dimension from Actions.

  3. Set:

    • Dimension Name: dimension name in the model.

    • Shared Dimension: existing shared dimension to link.

  4. Click OK.

Create link dimension
Note: To link a shared dimension, you must create a dimension key for it. See Create a dimension key.

Create a dimension key

  1. Complete steps 1–7 in Use the Annotate Stream step.

  2. Double-click the field you want as the key.

  3. Select Create Dimension Key from Actions.

  4. Edit the key value if needed.

  5. Click OK.

Create annotation groups

Annotation groups help you reuse annotations across transformations. This is useful when the same source table is reused.

You can save groups:

  • Locally in the transformation.

  • Shared in the metastore for other users.

Create an annotation group locally

  1. Complete steps 1–3 in Use the Annotate Stream step.

  2. Select Local.

  3. Click Select Fields and create annotations.

  4. Click Apply.

Create an annotation group for sharing with other users

  1. Complete steps 1–3 in Use the Annotate Stream step.

  2. Select Shared.

  3. Click Add Annotation Group (Plus Sign).

  4. Enter a unique group name.

  5. Click Select Fields and create annotations.

  6. Click Apply.

Shared annotation group

Shared Dimension step

The Shared Dimension step is similar to Annotate Stream. It creates reusable dimensions and stores them for later use.

You must designate one annotation as the dimension key. See Create a dimension key in Shared Dimension step.

Run this transformation before Build Model, either earlier in the same job or in a separate job that runs first.

Shared dimension workflow

After you build the model, publish it with Publish Model.

Metadata injection support

All fields in this step support metadata injection. Use this step with the ETL Metadata Injection step.

When you use metadata injection with Shared Dimension:

  • The system loads the currently defined annotations.

  • It overwrites the metastore entry if the name matches.

Create a shared dimension

  1. In Design, open Flow.

  2. Double-click Shared Dimension (or drag it to the canvas).

  3. Double-click Shared Dimension on the canvas.

  4. Set Step name.

  5. In Shared Dimension Name, select an existing shared dimension or enter a new name.

  6. In Data Provider Step, select the source step for the dimension.

    • Valid sources include Combination Lookup, Database Lookup, Dimension Lookup, Insert/Update, Table Output, and Vertica Bulk Loader.

  7. Set Description.

  8. Click Select Fields and move fields to Selected Fields.

  9. In Annotations, double-click a field and select an action:

    • Create Attribute

    • Create Dimension Key

  10. Click Apply, then click OK.

Shared Dimension step

Create a dimension key in Shared Dimension step

  1. Complete steps 1–8 in Create a shared dimension.

  2. Select Create Dimension Key from Actions.

  3. In Dimension, enter or select the dimension name.

  4. Click OK.

Publish Model job entry

The Publish Model job entry publishes the model created by Build Model to Pentaho Server, where it is available in PUC and Analyzer.

Put Build Model before Publish Model in the job.

SDR workflow

After publishing:

  • The Data Source Wizard data source uses the model name.

  • You cannot edit connection info in Data Source Wizard.

    • The connection is stored in the Build Model entry.

Create a Publish Model job entry

This task assumes you are on the job canvas in PDI.

You need Pentaho Server permissions:

  • Publish Content

  • Manage Data Sources

  1. In Design, open Modeling.

  2. Double-click Publish Model (or drag it to the canvas).

  3. Double-click Publish Model on the canvas.

  4. Set Entry name.

  5. Optional: select Replace Existing Published Model.

  6. In Pentaho Server Connection, set:

    • URL

    • User Name

    • Password

  7. Click Test Connection.

  8. In Share, set:

    • Grant Access To: Everyone, User, or Role.

    • User/Role Name: only for User or Role.

  9. Click OK.

Publish Model dialog box
