Getting Started with PDI
If you're new to Pentaho Data Integration (PDI), start here.
This topic consolidates the transformation and job tutorials into one page.
Pentaho Data Integration (PDI) tutorial
This tutorial is for new users of the Pentaho suite.
You build a transformation and a job using the PDI client (Spoon).
Transformations describe data flows for ETL.
Jobs coordinate ETL activities and dependencies.
In this scenario, you load a CSV file of sales data into a database.
Several customer records are missing postal codes.
You resolve missing postal codes before loading the data.
Prerequisites
To complete this tutorial, you need:
An installed version of the Pentaho 30-day trial.
Tutorial steps
Step 1: Extract and load data
In Step 1, you retrieve data from a CSV flat file and set up a basic transformation.
In this step
Create a new transformation
Follow these steps to create a new transformation.
To insert a variable into a field that accepts variables, put your cursor in the field and press CTRL+Spacebar.
Fields that accept variables have a blue diamond.
Select File > New > Transformation in the upper-left corner of the PDI window.

Under the Design tab, expand the Input node, then drag a Text File Input step onto the canvas.
Double-click the Text File input step.

Text File Input File tab
In the Step Name field, type Read Sales Data.
Click Browse to locate the sales_data.csv source file in the ...\design-tools\data-integration\samples\transformations\files folder.
Change File type to *.csv. Select sales_data.csv, then click OK.
Click Add.
The path to the file appears under Selected Files.
View the content in the sample file
Follow these steps to look at the contents of the sample file.
Click the Content tab, then set the Format field to Unix.
Click the File tab again and click Show file content in the lower section of the window.
When the Number of lines (0=all lines) window appears, click OK.
In the Content of first file window, examine the file:
Check the delimiter.
Check the enclosure character.
Check whether a header row is present.
In the sample, the input file is comma-delimited.
It uses the quotation mark (") as the enclosure character.
It contains a single header row with field names.
Click Close.
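The delimiter, enclosure, and header conventions you just checked can be reproduced with Python's csv module. This is a minimal sketch; the columns and values below are invented for illustration, not read from the actual sample file:

```python
import csv
import io

# A fragment shaped like the sample file: comma-delimited, with a
# quotation-mark enclosure and a single header row of field names.
# (These columns and values are illustrative, not the full file.)
sample = io.StringIO(
    'ORDERNUMBER,COUNTRY,SALES\n'
    '10107,"USA",2871.00\n'
    '10121,"United States",2765.90\n'
)

reader = csv.reader(sample, delimiter=',', quotechar='"')
header = next(reader)   # the header row supplies the field names
rows = list(reader)

print(header)   # ['ORDERNUMBER', 'COUNTRY', 'SALES']
print(rows[0])  # ['10107', 'USA', '2871.00']
```

The enclosure character matters here: without quotechar='"', a quoted value containing a comma would be split into two fields.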
Edit and save the transformation
Follow these steps to provide information about the data’s content.
Click the Content tab. Use the fields to define how your data is formatted.
Verify these settings:
Separator: comma (,)
Enclosure: quotation mark (")
Header: selected
Number of header lines: 1

Text File Input Content tab
Click the Fields tab, then click Get Fields to retrieve the input fields from your source file.
When the Number of lines to sample window appears, enter 0, then click OK.
If the Scan Result window appears, click Close.

Text File Input Fields tab
To verify the data is read correctly, click the Content tab, then click Preview Rows.
In the preview window, click OK to accept the default number of rows.
Review the data for missing, incomplete, or inconsistent values. For example:
STATE and POSTALCODE both contain <null>.
COUNTRY contains both USA and United States.
Click OK to save the step settings.
Open the Transformation Properties window using one of these methods:
Right-click an empty area on the canvas, then select Properties.
Double-click an empty area on the canvas.
Press CTRL+T.
In the Transformation Name field, enter Getting Started Transformation, then click OK.
Save the transformation by selecting File > Save.
When you save a transformation for the first time, you’re prompted for a file location and name.
The usual file extension is .ktr.
Load data into a relational database
Now you are ready to load the sales records into a database table.
You will use a Table Output step and a hop from the Read Sales Data step to direct the data stream into the table.
In Step 2, you will insert a Filter Rows step so that only records where POSTALCODE is not null (the true condition) reach the database.
This section uses a pre-existing sample database that is established during the Pentaho installation.
It starts along with the server.
Create the Table Output step
Follow these instructions to create the Table Output step.
Under the Design tab, expand the Output node.
Drag a Table Output step into your transformation.
Create a hop between Read Sales Data and Table Output:
Press SHIFT.
Click Read Sales Data and drag to Table Output.
Release SHIFT.
Double-click the Table Output step to open its Edit properties dialog box.
Rename the Table Output step to Write to Database.
Create a connection to the database
Follow these steps to create a connection to the database.
Click New next to the Connection field.
In the Database Connection window, provide these settings:
Connection Name: Sample Data
Connection Type: Hypersonic
Host Name: localhost
Database Name: sampledata
Port Number: 9001
User Name: pentaho_admin
Password: password
Click Test. If the test succeeds, click OK.
If the test fails, confirm the sample database is running.
Click OK to exit the Database Connection window.
Define the Data Definition Language (DDL)
DDLs are the SQL commands that define structures in a database, such as CREATE TABLE.
Enter SALES_DATA in the Target Table field.
Since the table doesn't exist, you can generate and execute the DDL based on the data stream coming from Read Sales Data.
In the Table Output window, select Truncate Table.

Table Output step Truncate table field
Click SQL to generate the DDL for creating your target table.
In the Simple SQL editor window, review the generated SQL statements.

Simple SQL editor
Click Execute.
Examine the results, then click OK to close the results window.
Click Close in the Simple SQL editor window.
Click OK to close the Table Output window.
Save your transformation.
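Conceptually, the SQL button derives a CREATE TABLE statement from the stream's field metadata and lets you execute it. A minimal sketch of that idea, using SQLite in place of the tutorial's Hypersonic database and an abbreviated, assumed field list:

```python
import sqlite3

# Abbreviated stream metadata as the Fields tab might report it:
# (field name, SQL type, length). The list is assumed for illustration.
fields = [
    ("ORDERNUMBER", "INTEGER", None),
    ("COUNTRY", "VARCHAR", 50),
    ("POSTALCODE", "VARCHAR", 9),
    ("SALES", "NUMERIC", None),
]

columns = ", ".join(
    f"{name} {sqltype}({length})" if length else f"{name} {sqltype}"
    for name, sqltype, length in fields
)
ddl = f"CREATE TABLE SALES_DATA ({columns})"

# Executing the DDL is the equivalent of clicking Execute in the editor.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
print(ddl)
```

The exact DDL text PDI generates depends on the target database dialect; this only illustrates the metadata-to-DDL translation.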
Step 2: Filter for missing codes
After completing Step 1: Extract and load data, you are ready to add a transformation component to your data pipeline.
The source file contains several records that are missing postal codes.
This section filters out records that have missing postal codes.
It ensures that only complete records are loaded into the database table.
Preview the rows read by the input step
Follow these steps to preview the rows read by the input step.
Right-click the Read Sales Data step and select Preview.

Transformation Menu showing how to access Preview
Specify the number of rows to preview.
Optionally, set breakpoints to pause execution when a condition is met.
Click the Quick Launch button.
Preview the data. Notice that several input rows are missing values for POSTALCODE.

Preview showing missing postalcode fields
Click Stop on the preview window to end the preview.
Separate the records with missing postal codes
Follow these instructions to use the Filter Rows transformation step to separate records missing postal codes.
You resolve these records later in the tutorial.
Add a Filter Rows step to your transformation.
Under the Design tab, select Flow > Filter Rows.
Insert your Filter Rows step between your Read Sales Data step and your Write to Database step.
To create a hop, click a step, hold SHIFT, then click-and-drag to the next step.
Right-click and delete the hop between the Read Sales Data step and Write to Database steps.
Create a hop between the Read Sales Data step and the Filter Rows step.
Create a hop between the Filter Rows step and the Write to Database step.
In the dialog box that appears, select Result is TRUE.

Hop dialog set to Result is True
Double-click the Filter Rows step. The Filter Rows window appears.
In the Step Name field, enter Filter Missing Zips.
Click in The condition field to open the Fields window.
In the Fields window, select POSTALCODE and click OK.
Click the comparison operator field. It is set to = by default. The Functions window appears.
Select IS NOT NULL, then click OK.

Filter rows is set postalcode is not null
Click OK to exit the Filter Rows window.
You will return to this step later to configure the Send true data to step and Send false data to step settings.
Save your transformation.
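In effect, the Filter Missing Zips step splits the stream on the POSTALCODE IS NOT NULL condition. A sketch of that routing, with invented records and Python's None standing in for PDI's <null>:

```python
# Invented records; None plays the role of PDI's <null>.
records = [
    {"CUSTOMERNAME": "Mini Gifts", "POSTALCODE": "97562"},
    {"CUSTOMERNAME": "Land of Toys", "POSTALCODE": None},
]

# True branch (POSTALCODE IS NOT NULL) goes to Write to Database;
# the false branch is resolved in Step 3.
true_stream = [r for r in records if r["POSTALCODE"] is not None]
false_stream = [r for r in records if r["POSTALCODE"] is None]

print(len(true_stream), len(false_stream))  # 1 1
```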
Step 3: Resolve missing data
After completing Step 2: Filter for missing codes, you are ready to resolve the missing postal codes.
You will use a second text file that lists cities, states, and postal codes.
You will use it to look up postal codes for records with missing values.
This work uses the false branch of your Filter rows step.
First, use a Text file input step to read from the source file.
Then, use a Stream lookup step to bring the resolved postal codes into the stream.
Finally, use a Select values step to rename fields and remove unnecessary fields.
Retrieve data from your lookup file
Follow these steps to retrieve data from your lookup file.
Add a new Text File Input step to your transformation.
This step retrieves the records from your lookup file.
Do not add a hop yet.

Add Text File Input step to canvas
Open the Text File Input step window, then enter Read Postal Codes in the Step name property.
Click Browse to navigate to the Zipssortedbycitystate.csv source file located in the directory ...\design-tools\data-integration\samples\transformations\files.
Change File type to *.csv, select Zipssortedbycitystate.csv, and click OK.
The path to the source file appears in the File or directory field.
Click Add.
The path to the file appears under Selected files.
View the contents of the sample file
Follow these steps to view the contents of the sample file.
Click the Content tab, then set the Format field to Unix.
Click the File tab again and click Show file content near the bottom of the window.
The Number of lines (0=all lines) window appears. Click OK to accept the default.
The Content of first file window shows the file.
Examine the file to see how the input file is delimited.
Check what enclosure character is used.
Check whether a header row is present.
Click Close to close the window.
Edit and save the transformation
Follow these steps to edit and save your transformation.
In the Content tab, change the Separator character to a comma (,) and confirm that the Enclosure setting is a quotation mark (").
Verify that the Header option is selected.
Under the Fields tab, click Get Fields to retrieve the data from your CSV file.
The Number of lines to sample window appears. Enter 0 in the field, then click OK.
Results from Get Fields in the Fields tab
If the Scan Result window displays, click Close to close it.
Click Preview rows to verify that your entries are correct.
When prompted to enter the preview size, click OK.
Review the information in the window, then click Close.
Click OK to exit the Text File input window.
Save the transformation.
Resolve missing zip code information
Follow these steps to resolve the missing postal code information.
Add a Stream Lookup step to your transformation by clicking the Design tab, expanding the Lookup folder, then selecting Stream Lookup.
Draw a hop from the Filter Missing Zips step to the Stream lookup step. In the dialog box that appears, select Result is FALSE.
Create a hop from the Read Postal Codes step to the Stream lookup step.

Add a hop from Read Postal Codes to Stream Lookup
Double-click the Stream lookup step to open the Stream Value Lookup window.
Rename Stream Lookup to Lookup Missing Zips.
From the Lookup step drop-down box, select Read Postal Codes as the lookup step. Perform the following:
In the key(s) to look up the value(s) table, define the CITY and STATE fields.
In row #1, open the drop-down menu in the Field column and select CITY.
Click in the LookupField column and select CITY.
In row #2, open the drop-down menu in the Field column and select STATE.
Click in the LookupField column and select STATE.

Stream value lookup example
Click Get Lookup Fields to pull the three fields from the Read Postal Codes step.
POSTALCODE is the only field you want to retrieve.
To delete the CITY and STATE lines, select each line, right-click it, and select Delete Selected Lines.
In the New Name field, change the name POSTALCODE to ZIP_RESOLVED and verify that Type is set to String.
Select Use sorted list (i.s.o. hashtable).

Value lookup example
Click OK to close the Stream Value Lookup edit properties dialog box.
Save your transformation.
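The Stream lookup step behaves like an in-memory join: the lookup stream is held in a keyed structure, and each incoming row fetches a value by key. A sketch with invented city, state, and postal-code data:

```python
# Lookup stream from Read Postal Codes, keyed on (CITY, STATE).
# The data here is invented for illustration.
postal_lookup = {
    ("San Rafael", "CA"): "97562",
    ("NYC", "NY"): "10022",
}

# False-branch rows from Filter Missing Zips: POSTALCODE is missing.
rows = [{"CITY": "NYC", "STATE": "NY", "POSTALCODE": None}]

for row in rows:
    key = (row["CITY"], row["STATE"])
    # Only POSTALCODE is retrieved, renamed to ZIP_RESOLVED.
    row["ZIP_RESOLVED"] = postal_lookup.get(key)

print(rows[0]["ZIP_RESOLVED"])  # 10022
```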
Preview your transformation
Follow these steps to preview your transformation.
To preview the data, select and right-click the Lookup Missing Zips step. From the menu that appears, select Preview.
In the Transformation debug dialog window, click Quick Launch to preview the data flowing through this step.
In the Examine preview data window that appears, note that the new field, ZIP_RESOLVED, has been added to the stream containing your resolved postal codes.

Examine ZIP_RESOLVED field
Click Close to close the window.
If the Select the preview step window appears, click Close.
The execution results near the bottom of the PDI window show updated metrics in the Step Metrics tab.
Apply formatting to your transformation
Follow these steps to clean up the field layout on your lookup stream so that it matches the format and layout of the other stream going to the Write to Database step.
Add a Select Values step to your transformation by expanding the Transform folder and clicking Select Values.
Create a hop from the Lookup Missing Zips step to the Select Values step.

Add hop from Lookup Missing Zips to Select Values
Double-click the Select Values step to open its properties dialog box.
Rename the Select Values step to Prepare Field Layout.
Click Get fields to select to retrieve all fields and begin modifying the stream layout.
In the Fields list, find the # column and click the number for the ZIP_RESOLVED field.
Press CTRL+UP (COMMAND+UP on macOS) to move ZIP_RESOLVED just below the POSTALCODE field.

Move ZIP_RESOLVED field under POSTALCODES field
Select the old POSTALCODE field in the list (line 20), right-click in the line, and select Delete Selected Lines.
The original POSTALCODE field was formatted as a 9-character string.
You must modify your new field to match that format.
Click the Meta-Data tab.
In the first row of the Fields to alter the meta-data for table, click in the Fieldname column and select ZIP_RESOLVED. Perform the following steps:
Enter POSTALCODE in the Rename to column.
Select String in the Type column and enter 9 in the Length column.
POSTALCODE String type and length
Click OK to exit the edit properties dialog box.
Draw a hop from the Prepare Field Layout (Select values) step to the Write to Database (Table output) step.
When prompted, select the Main output of the step option.
Save your transformation.
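The Prepare Field Layout step amounts to dropping the old POSTALCODE, renaming ZIP_RESOLVED to POSTALCODE, and enforcing a 9-character string. A sketch of that reshaping (prepare_field_layout is a hypothetical helper name, not a PDI API):

```python
def prepare_field_layout(row):
    """Mimic the Select values step: drop the old POSTALCODE, then
    rename ZIP_RESOLVED to POSTALCODE as a 9-character string."""
    out = dict(row)
    out.pop("POSTALCODE", None)                # delete the old field
    zip_resolved = out.pop("ZIP_RESOLVED")     # field to rename
    out["POSTALCODE"] = str(zip_resolved)[:9]  # String, length 9
    return out

row = {"CITY": "NYC", "POSTALCODE": None, "ZIP_RESOLVED": "10022"}
print(prepare_field_layout(row))  # {'CITY': 'NYC', 'POSTALCODE': '10022'}
```

After this reshaping, the lookup stream's layout matches the true-branch stream, which is why both can feed the same Write to Database step.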

Renaming fields workflow example
Step 4: Clean the data
After completing Step 3: Resolve missing data, you can further cleanse and categorize the data into buckets before loading it into a relational database.
In this section, you cleanse the COUNTRY field data by mapping United States to USA using the Value mapper step.
In addition, you use buckets for categorizing the SALES data using the Number range step.
Add a Value mapper step to the transformation
Follow these steps to add the Value mapper step to the transformation.
Delete both hops connected to the Write to Database step. For each hop, right-click and select Delete.
Create some extra space on the canvas. Drag the Write to Database step toward the right side of your canvas.

Add space on canvas for Value mapper step
Add the Value mapper step to your transformation by expanding the Transform folder and choosing Value mapper.
Create a hop between the Filter Missing Zips and Value mapper steps. In the dialog box that appears, select Result is TRUE.
Create a hop between the Prepare Field Layout and Value mapper steps. When prompted, select the Main output of the step option.

Add Value mapper step to the canvas
Set the properties in the Value Mapper step
Follow these steps to set the properties in the Value mapper step.
Double-click the Value mapper step to open its properties dialog box.
Click in the Fieldname to use field and select COUNTRY.
In the Field Values table, define the United States and USA field values.
In row #1, click the field in the Source value column and enter United States.
Then, click the field in the Target value column and enter USA.
Set values for fields in the Value mapper step
Click OK.
Save your transformation.
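The Value mapper step is a straightforward source-to-target substitution on one field; values without a mapping pass through unchanged. A sketch:

```python
# Source value -> Target value, as defined in the Field Values table.
COUNTRY_MAP = {"United States": "USA"}

def map_country(row):
    """Apply the Value mapper on the COUNTRY field; unmapped
    values are left as-is."""
    row["COUNTRY"] = COUNTRY_MAP.get(row["COUNTRY"], row["COUNTRY"])
    return row

print(map_country({"COUNTRY": "United States"}))  # {'COUNTRY': 'USA'}
print(map_country({"COUNTRY": "France"}))         # {'COUNTRY': 'France'}
```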
Apply ranges
Follow these steps to apply ranges to your transformation.
Add a Number range step to your transformation by expanding the Transform folder and selecting Number range.
Create a hop between the Value mapper and Number range steps.
Create a hop between the Number range and Write to Database steps.
When prompted, select the Main output of the step option.

Add Number range step to the canvas
Double-click the Number range step to open its properties dialog box.
Click in the Input field and select SALES from the list.
In the Output field, enter DEALSIZE.
In the Ranges (min <= x < max) table, define the ranges and bucket values.
In row #1, click the field in the Upper Bound column and enter 3000.0. Then, click the field in the Value column and enter Small.
In row #2, click the field in the Lower Bound column and enter 3000.0. Then, click the field in the Upper Bound column and enter 7000.0. Click the field in the Value column and enter Medium.
In row #3, click the field in the Lower Bound column and enter 7000.0. Then, click the field in the Value column and enter Large.
Set ranges in Number Range step
Click OK.
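The ranges obey min <= x < max, so a SALES value of exactly 3000.0 falls into Medium, not Small. A sketch of that bucketing logic:

```python
# (lower bound, upper bound, bucket value); None means unbounded.
RANGES = [
    (None, 3000.0, "Small"),
    (3000.0, 7000.0, "Medium"),
    (7000.0, None, "Large"),
]

def deal_size(sales):
    """Bucket a SALES value using min <= x < max semantics."""
    for lower, upper, value in RANGES:
        if (lower is None or sales >= lower) and (upper is None or sales < upper):
            return value

print(deal_size(2871.0))  # Small
print(deal_size(3000.0))  # Medium
print(deal_size(7000.0))  # Large
```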
Execute the SQL statement
Your database table does not yet contain the field DEALSIZE.
Perform these steps to execute the SQL statement.
Double-click the Write to Database step to open its properties dialog box.
Click the SQL button at the bottom of the window to generate the new DDL.
The Simple SQL editor window appears with the SQL statements needed to alter the table.
Simple SQL editor to generate the DDL
Click Execute to execute the SQL statement.
The Results of the SQL statements window appears. Examine the results, then click OK to close the window.
Click Close in the Simple SQL editor window to close it.
Click OK to close the Write to Database window.
Save your transformation.
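In this case the generated DDL is an ALTER TABLE that adds the missing column. Sketched against SQLite rather than Hypersonic (the exact DDL text PDI generates may differ by database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES_DATA (SALES NUMERIC, COUNTRY VARCHAR(50))")

# The stream now carries DEALSIZE, so the SQL editor proposes
# adding the missing column.
conn.execute("ALTER TABLE SALES_DATA ADD COLUMN DEALSIZE VARCHAR(10)")

# Confirm the column list now includes DEALSIZE.
cols = [c[1] for c in conn.execute("PRAGMA table_info(SALES_DATA)")]
print(cols)  # ['SALES', 'COUNTRY', 'DEALSIZE']
```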
Step 5: Run the transformation
Pentaho Data Integration provides a number of deployment options.
In this section, you run the transformation using the Local run option.
Select Action > Run.
The Run Options window appears.
Keep the default Pentaho local option for this exercise.
Click Run.
The transformation executes.

Transformation runs without errors
After the transformation runs, the Execution Results panel opens below the canvas.
View the execution results
Use the tabs in Execution Results to verify execution, troubleshoot errors, and monitor performance.
Step Metrics
Shows statistics for each step. It includes rows read, rows written, errors, and rows per second.
This tutorial has no intentional errors, so the transformation should run successfully.
If errors occur, the steps that failed are highlighted in red.
In the example shown here, the Lookup Missing Zips step caused an error.

Error message display
Logging
Shows logging details for the most recent execution.
Use it to locate errors. Error lines are highlighted in red.
In the example shown, the Lookup Missing Zips step caused an error.
It tried to look up POSTALCODE2, which does not exist in the lookup stream.

Transformation logging display
Execution History
Shows step metrics and log information from previous executions.
This works only if you configured database logging.
Use the Logging tab in Transformation Settings.
Performance Graph
Analyzes step performance based on metrics like rows read, rows written, errors, and rows per second.
This works only if you configured database logging.
Use the Logging tab in Transformation Settings.
Metrics
Shows a Gantt chart after the transformation or job runs.
It can include database connection time, SQL execution time, or transformation load time.

Step metrics tab
Preview Data
Shows a preview of the data.
Step 6: Orchestrate with jobs
Jobs are used to coordinate ETL activities such as:
Defining the flow and dependencies that control the order in which transformations run.
Preparing for execution by checking conditions such as "Is my source file available?" or "Does a table exist?"
Performing bulk load database operations.
Assisting file management, such as posting or retrieving files using FTP, copying files, and deleting files.
Sending success or failure notifications through email.
For this part of the tutorial, imagine that an external system is responsible for placing your sales_data.csv input in its source location every Saturday night at 9 p.m.
You want to create a job that will verify that the file has arrived and then run the transformation to load the records into the database.
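The job's control flow (Start, then File Exists, then the Transformation entry) reduces to a simple check-then-run sequence. A sketch, where run_transformation is a hypothetical stand-in for the Transformation job entry:

```python
import os

def run_transformation():
    # Hypothetical stand-in for the Transformation job entry that
    # runs the Getting Started transformation.
    print("loading sales records into the database...")
    return True

def run_job(source_file):
    """Start -> File Exists -> Transformation, as in the sample job."""
    if not os.path.exists(source_file):  # File Exists job entry
        print(f"{source_file} has not arrived; job stops here.")
        return False
    return run_transformation()          # success hop
```

Calling run_job with the path to sales_data.csv runs the load only if the file has arrived, which is exactly the dependency the job entries express.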
The following steps assume that you have built a Getting Started transformation as described in Step 1: Extract and load data.
Go to File > New > Job.

PDI job window
Expand the General folder and drag a Start job entry onto the canvas.
The Start job entry defines where the execution will begin.
Expand the Conditions folder and add a File Exists job entry.
Draw a hop from the Start job entry to the File Exists job entry.

Draw hop from Start to File exists
Double-click the File Exists job entry to open its properties dialog box.
Click Browse and set the filter near the bottom of the window to All Files.
Select sales_data.csv from the following directory: ...\design-tools\data-integration\samples\transformations\files.
Click OK to exit the Open File window.
Click OK to exit the Check if a file exists window.
Expand the General folder and add a Transformation job entry.
Draw a hop between the File Exists and the Transformation job entries.
Double-click the Transformation job entry to open its properties dialog box.
Click Browse to open the Select repository object window.
Browse to and select the Getting Started transformation.
Click OK to close the Transformation window.
Save your job as Sample Job.
Click Run in the toolbar.
When the Run Options window appears, select Local environment type and click Run.
The Execution Results panel should open showing job metrics and log information.

Job sample
PDI job tutorial (standalone)
This tutorial overlaps with Step 6: Orchestrate with jobs.
Use it if you want a shorter, job-only walkthrough.
Jobs are used to coordinate ETL activities such as:
Defining the flow and dependencies that determine the order in which transformations run.
Preparing for execution by checking conditions such as, "Is my source file available?" or "Does a table exist?"
Performing bulk load database operations.
File management, such as posting or retrieving files using FTP, copying files, and deleting files.
Sending success or failure notifications through email.
For this exercise, imagine that an external system is responsible for placing your sales_data.csv input in its source location every Saturday night at 9 p.m.
You want to create a job that will check to see that the file has arrived and run your transformation to load the records into the database.
To complete this exercise, complete the steps in Pentaho Data Integration (PDI) tutorial first.
Go to File > New > Job.

PDI Job Window
Expand the General folder and drag a Start job entry onto the graphical workspace.
The Start job entry defines where the execution will begin.
Expand the Conditions folder and add a File Exists job entry.
Draw a hop from the Start job entry to the File Exists job entry.
Double-click the File Exists job entry to open its Edit Properties dialog box.
Click Browse and set the filter near the bottom of the window to All Files.
Select sales_data.csv from the following location: ...\design-tools\data-integration\samples\transformations\files.
Click OK to exit the Open File window.
Click OK to exit from the Check if a file exists window.
In Spoon, expand the General folder and add a Transformation job entry.
Draw a hop between the File Exists and the Transformation job entries.
Double-click the Transformation job entry to open its edit Properties dialog box.
Click Browse to open the Select repository object window.
Expand the repository tree to find your sample transformation. Select it and click OK.

Select repository object window Save your job as Sample Job.
Click Run in the toolbar.
When the Run Options window appears, choose Local environment type and click Run.
The Execution Results panel should open showing job metrics and log information.

Job Sample
More PDI documentation
These tutorials are designed to quickly demonstrate basic PDI features.
For more detailed information, see these topics in the Pentaho Data Integration document:
Learn about the PDI Client
Use Pentaho Repositories in PDI
Schedule Perspective in the PDI Client