Work with the Streamlined Data Refinery
Install, configure, and use the Streamlined Data Refinery (SDR) sample.
The Streamlined Data Refinery (SDR) is a simplified ETL “refinery”. It uses Pentaho Data Integration (PDI) jobs to refine raw data. It then publishes a model for use in Analyzer.
To get hands-on quickly, use the Movie Ratings-SDR sample, which is packaged as SDR.zip (see the download and install steps below).
How SDR works
SDR combines:
PDI for jobs, transformations, and parameter handling.
An SDR app (based on CTools) for user-driven refinement.
Pentaho Server to run jobs and host the published model.
At a high level:
A user makes selections in the SDR form.
The app calls Pentaho Server to run the refinery job chain.
PDI builds a model with Build Model.
PDI publishes the model back to Pentaho Server with Publish Model.
Report designers use the published data source in Analyzer.

App Builder, Community Dashboard Editor, and CTools
App Builder helps you build Pentaho plugins without Java. You still need to know CTools and PDI basics.
Community Dashboard Editor (CDE) helps you build and preview dashboards. It integrates with the Pentaho User Console (PUC).
See the CDE tutorials for more depth: https://community.hitachivantara.com/s/article/cdf-tutorial.
Get started with App Builder
Log in to PUC.
Open App Builder (an example URL follows these steps):
{your PUC location}/pentaho/plugin/sparkl/api/main
Click the Plus Sign to create a new app.
Name the app.
Fill in plugin details on the About tab.
On Element, add elements and dashboards.
Click Edit to adjust dashboards.
Click Preview to test.
Share the app with users when ready.
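For example, on a default local evaluation install the App Builder URL resolves to http://localhost:8080/pentaho/plugin/sparkl/api/main; your host and port may differ.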
Install and configure the SDR sample
Use this section if you want the sample running end-to-end. If you used the default evaluation install with Postgres, you may only need the sample install steps.
Install Pentaho software
Install the latest Pentaho software.
If you used the Evaluation method with Postgres, use password as the Postgres password.
Log in to PUC and verify it works.
Log out of PUC.
Stop Pentaho Server.
If you use Vertica, install the Vertica JDBC driver before you install the SDR sample.
Install the Vertica JDBC driver (Vertica only)
Exit the PDI client if it is running.
Stop Pentaho Server.
Copy the Vertica JDBC driver to:
INSTALL_DIR/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib
INSTALL_DIR/design-tools/data-integration/lib
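If you prefer to script the copy, here is a minimal sketch. The install root and the driver jar path are assumptions; substitute your actual locations.

```python
# Minimal sketch: copy the Vertica JDBC driver jar into the server and PDI lib folders.
# INSTALL_DIR and the driver jar path are assumptions; substitute your actual locations.
import shutil
from pathlib import Path

INSTALL_DIR = Path("/opt/pentaho")              # assumption: your Pentaho install root
DRIVER_JAR = Path("/tmp/vertica-jdbc.jar")      # assumption: downloaded Vertica driver

destinations = [
    INSTALL_DIR / "server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib",
    INSTALL_DIR / "design-tools/data-integration/lib",
]
for lib_dir in destinations:
    shutil.copy(DRIVER_JAR, lib_dir)
    print(f"Copied {DRIVER_JAR.name} to {lib_dir}")
```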
Download and install the SDR sample
Download SDR.zip from https://pentaho.app.box.com/folder/41400845668.
Extract SDR.zip into pentaho/server/pentaho-server/pentaho-solutions/system (a scripted example of this step follows the procedure).
Start Pentaho Server.
Log in to PUC.
In PUC, open Tools.
Confirm Movie Ratings-SDR Sample appears.
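If you prefer to script the extraction step, a minimal sketch follows. The SDR.zip location and the Pentaho Server path are assumptions you should adjust for your environment.

```python
# Minimal sketch: extract the SDR sample into the Pentaho Server solutions folder.
# The zip location and server path are assumptions; adjust them for your environment.
import zipfile
from pathlib import Path

SDR_ZIP = Path.home() / "Downloads" / "SDR.zip"                       # assumption
SYSTEM_DIR = Path("pentaho/server/pentaho-server/pentaho-solutions/system")

with zipfile.ZipFile(SDR_ZIP) as archive:
    archive.extractall(SYSTEM_DIR)
print(f"Extracted {SDR_ZIP.name} to {SYSTEM_DIR}")
```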

Next, go to Use the SDR sample form. If you are not using the default evaluation setup, finish the config steps first.
Configure KTR files for your environment (non-default installs)
If you are not using Postgres and the default evaluation setup, update the sample KTRs.
In Spoon, click File > Open.
Open SDR_data.ktr, located in pentaho/server/pentaho-server/pentaho-solutions/system/SDR/endpoints/kettle.
Right-click Set Local Variables.
Set the URL for your Pentaho Server.
Click OK.
In SDR_data.ktr, right-click Call_ML_SDR and select Open Referenced Object > Job.
In _ML_SDR.job, edit Create Table and point it to your staging database.
Edit Publish Model:
Enter your Pentaho Server user ID and password if prompted.
Click Test Connection.
Click OK.
Open _ML_SDR_REFINERY.ktr.
Edit Out to Staging DB and select your staging database.
Save all changed files.
Exit the PDI client.
Restart Pentaho Server.
Use Hadoop with the SDR
You need Hadoop prerequisites first. See Connecting to a Hadoop cluster with the PDI client.
Open ML_SDR_REFINERY.ktr.
Locate the Hadoop File Input step.
Activate the hop between Hadoop File Input and Parse weblog.
Deactivate all hops between Log File List and Read Weblog Files.
Edit Out to Staging DB and select your staging database.
Click OK, then save the transformation.

App endpoints for SDR forms
You can run the app via API endpoints. Use this pattern:
{your PUC location}/pentaho/plugin/SDR/api/{endpoint name}
Examples:
Genre selector
http://localhost:8080/pentaho/plugin/SDR/api/genre
Gender selector
http://localhost:8080/pentaho/plugin/SDR/api/gender
Supported endpoints:
genre: options for the Genre selector.
gender: options for the Gender selector.
occupation: options for the Occupation selector.
income: options for the Income selector.
firstdate: limit dates for processing.
data_source_name: names of data sources available on the server.
latest_requests: latest 10 requests as a table response.
sdr_data: processes the request and returns status.
refresh: clears cache and refreshes kettle and dashboard elements.
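The sketch below shows one way to call these endpoints from a script. It assumes a default local install at http://localhost:8080, and that the server accepts HTTP Basic authentication with the evaluation credentials; adjust the host, port, and credentials for your environment.

```python
# Minimal sketch: call the SDR sample endpoints on a local Pentaho Server.
# Host, port, and credentials are assumptions from the default evaluation install.
import requests

BASE_URL = "http://localhost:8080/pentaho/plugin/SDR/api"
AUTH = ("admin", "password")  # substitute your own Pentaho Server credentials

def call_endpoint(name, params=None):
    """Call one of the SDR endpoints (genre, gender, occupation, ...) and return the body."""
    response = requests.get(f"{BASE_URL}/{name}", params=params, auth=AUTH)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    # List the options behind the Genre and Gender selectors.
    print(call_endpoint("genre"))
    print(call_endpoint("gender"))
    # Show the ten most recent refinement requests.
    print(call_endpoint("latest_requests"))
```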
Clean up the All Requests Processed list
You can delete items to clean up Latest Requests.
Deleting entries also deletes the tables behind them. It also removes the Analyzer data source from PUC.
Click All Requests Processed.
Click Edit.
Select the items you want to delete.
Click Delete.


Use the SDR sample form
SDR refines data in PDI with Build Model and Annotate Stream. It then publishes the model with Publish Model. Report designers use the SDR form to create Analyzer reports.
Open and run the Movie Ratings-SDR Sample form
Log in to PUC with administrator permissions.
Open the Movie Ratings-SDR Sample form.

The All Requests Processed list shows up to 10 recent runs. Click an item to reuse it, or create a new request.

Select Genre filters in Movie Review Filter.
Select profile items in Member Profile.

Enter Start Date.
Enter End Date.
Choose whether to Grant Access to Everyone.
Enter a data source name.
Click Let's do this.
Use Go to Analyzer when you are ready to build reports.
Building blocks
SDR models typically use:
Build Model job entry to build a model from upstream data.
Annotate Stream step to add measures, attributes, and link dimensions.
Shared Dimension step to create reusable dimensions.
Publish Model job entry to publish the final model to Pentaho Server.
Build Model job entry
The Build Model job entry creates Data Source Wizard data sources. This can include both Metadata and Mondrian models.
It searches upstream for an output step or data service. You can also base a model on an existing Data Source Wizard or Analyzer model.
If the source is a Pentaho Data Service, connect to a Pentaho Repository before publishing.
After Build Model runs, publish the model with Publish Model.

Create a Build Model job entry
This task assumes you are on the job canvas in the PDI client.
In Design, open Modeling.
Double-click Build Model (or drag it to the canvas).
Double-click the Build Model entry on the canvas.
Set Entry name.
Note that certain characters are not valid in Data Source Wizard source names.
In Source, select an output step or Pentaho Data Service.
In Model Name, enter the model name.
In Modeling Method, choose:
Auto Model to build a new model.
Use Existing to base the model on an existing server model.
Click OK.
Select existing model options
If you use Use Existing, connect to Pentaho Server first. Your source and the existing model must share the same physical schema.
When you use Use Existing, annotations are not applied when the model opens in Analyzer. Use Auto Model if you want Analyzer to pick up your annotations.
Fill in Pentaho Server Connection:
URL: base URL for the server.
User Name: server user name.
Password: server password.
Click Connect.
Select a model from Existing Models.
Optional: select Create model if not found.
Click OK.
Variables for Build Model job entry
These variables are generated when the model is created. They are available after the first successful run.
JobEntryBuildModel.DatabaseConnection.<Model Name>: returns the database connection name used by the source.
JobEntryBuildModel.XMI.<Model Name>: returns the published XMI contents.
JobEntryBuildModel.XMI.DSW.<Model Name>: returns True if the XMI is delivered as a Data Source Wizard data source.
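For example, if the Build Model entry creates a model named MovieRatings, a later job entry can read the connection name with the usual PDI variable syntax, ${JobEntryBuildModel.DatabaseConnection.MovieRatings}; the model name here is illustrative.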
Annotate Stream step
The Annotate Stream step refines SDR models by adding:
Measures
Attributes
Link dimensions
Calculated measures
It modifies the default model produced by Build Model. After annotating, publish the model with Publish Model.

Metadata injection support
All fields support metadata injection. You can use this step with the ETL Metadata Injection step to pass metadata at runtime.
When injecting a shared annotation group:
You can reuse an existing shared annotation group.
You cannot create a new shared annotation group.
If you provide SHARED_ANNOTATION_GROUP, injected annotations are ignored.
Use the Annotate Stream step
This task assumes you are on the transformation canvas in the PDI client.
In Design, open Flow.
Double-click Annotate Stream (or drag it to the canvas).
Double-click Annotate Stream on the canvas.
Set Step name.
Choose where to save annotations:
Local: saved in the transformation.
Shared: stored in the metastore for reuse.
Optional: set Description.
Click Select Fields and move fields to Selected Fields.
The Annotations list shows:
Field: annotated field name.
Model Action: measure, attribute, or link dimension.
Summary: action details.
Double-click a field to annotate it.
Choose an action:
Create Measure. See Create a measure on a stream field.
Create Attribute. See Create an attribute on a field.
Link Dimension. See Create a link dimension.
Optional: click Add Calculated Measure and fill in the following (an example formula follows these steps):
Measure Name
Format
Formula (MDX)
When calculating subtotals use this formula (optional)
Hide this calculated measure in the model (optional)
Click Apply to save changes.
Click OK to close.
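For example, a calculated measure for an average rating could use an MDX formula along the lines of [Measures].[Sum of Rating] / [Measures].[Count of Rating]; the measure names are illustrative and must match measures that exist in your model.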

Create a measure on a stream field
Measures support aggregations such as SUM, AVERAGE, COUNT, and MINIMUM.
Complete steps 1–7 in Use the Annotate Stream step.
Select Create Measure from Actions.
Set measure properties:
Measure Name: display name.
Aggregation Type: SUM, AVERAGE, COUNT, COUNT_DISTINCT, MINIMUM, MAXIMUM.
Format String: formatting for reports.
Description: short description stored on the server.
Hidden: True to hide it in Analyzer.
Click OK.

Create an attribute on a field
Attributes create a single-level hierarchy.
Complete steps 1–7 in Use the Annotate Stream step.
Select Create Attribute from Actions.
Set attribute properties as needed:
Attribute Name (required)
Dimension (required)
Hierarchy (optional)
Description
Parent Attribute (for multi-level hierarchies)
Ordinal Field (for sorting members)
Format (for numbers and dates)
Is Unique (Mondrian SQL optimization)
Hidden (True hides it in Analyzer)
For time attributes, also set:
Time Level Type
Time Source Format
For geo attributes, also set:
Geo Type
Optional: Latitude and Longitude for Location
Click OK.

Create a link dimension
Link dimensions reuse a previously created shared dimension. Create the shared dimension first. See Shared Dimension step.
Complete steps 1–7 in Use the Annotate Stream step.
Select Link Dimension from Actions.
Set:
Dimension Name: dimension name in the model.
Shared Dimension: existing shared dimension to link.
Click OK.

To link a shared dimension, you must create a dimension key for it. See Create a dimension key.
Create a dimension key
Complete steps 1–7 in Use the Annotate Stream step.
Double-click the field you want as the key.
Select Create Dimension Key from Actions.
Edit the key value if needed.
Click OK.
Create annotation groups
Annotation groups help you reuse annotations across transformations. This is useful when the same source table is reused.
You can save groups:
Locally in the transformation.
Shared in the metastore for other users.
Create an annotation group locally
Complete steps 1–3 in Use the Annotate Stream step.
Select Local.
Click Select Fields and create annotations.
Click Apply.
Create an annotation group for sharing with other users
Complete steps 1–3 in Use the Annotate Stream step.
Select Shared.
Click Add Annotation Group (Plus Sign).
Enter a unique group name.
Click Select Fields and create annotations.
Click Apply.

Shared Dimension step
The Shared Dimension step is similar to Annotate Stream. It creates reusable dimensions and stores them for later use.
You must designate one annotation as the dimension key. See Create a dimension key in Shared Dimension step.
Run this transformation before Build Model, either in the same job or in a separate job that runs first.

After you build the model, publish it with Publish Model.
Metadata injection support
All fields support metadata injection. You can use this step with the ETL Metadata Injection step to pass metadata at runtime.
When you use metadata injection with Shared Dimension:
The system loads the currently defined annotations.
It overwrites the metastore entry if the name matches.
Create a shared dimension
In Design, open Flow.
Double-click Shared Dimension (or drag it to the canvas).
Double-click Shared Dimension on the canvas.
Set Step name.
In Shared Dimension Name, select an existing shared dimension or enter a new name.
In Data Provider Step, select the source step for the dimension.
Valid sources include Combination Lookup, Database Lookup, Dimension Lookup, Insert/Update, Table Output, and Vertica Bulk Loader.
Set Description.
Click Select Fields and move fields to Selected Fields.
In Annotations, double-click a field and select an action:
Create Attribute
Create Dimension Key
Click Apply, then click OK.

Create a dimension key in Shared Dimension step
Complete steps 1–8 in Create a shared dimension.
Select Create Dimension Key from Actions.
In Dimension, enter or select the dimension name.
Click OK.
Publish Model job entry
The Publish Model job entry publishes the model created by Build Model. It publishes to Pentaho Server for use in PUC and Analyzer.
Put Build Model before Publish Model in the job.

After publishing:
The Data Source Wizard data source uses the model name.
You cannot edit connection info in Data Source Wizard.
The connection is stored in the Build Model entry.
Create a Publish Model job entry
This task assumes you are on the job canvas in PDI.
You need Pentaho Server permissions:
Publish Content
Manage Data Sources
In Design, open Modeling.
Double-click Publish Model (or drag it to the canvas).
Double-click Publish Model on the canvas.
Set Entry name.
Optional: select Replace Existing Published Model.
In Pentaho Server Connection, set:
URL
User Name
Password
Click Test Connection.
In Share, set:
Grant Access To: Everyone, User, or Role.
User/Role Name: only for User or Role.
Click OK.
