Pentaho Data Integration workflows
Pentaho Data Integration is a robust extract, transform, and load (ETL) tool that you can use to integrate, manipulate, and visualize your data. You can use PDI to import, transform, and export data from multiple data sources, including flat files, relational databases, Hadoop, NoSQL databases, analytic databases, social media streams, and operational stores. You can also use PDI to clean and enrich the data, move data between databases, and visualize your data.
Evaluate and learn PDI
As you explore Pentaho Data Integration (PDI), you will be introduced to the major components, watch videos, work through hands-on examples, and read about the different features.
Review the documentation and contact Pentaho sales support if you have questions.
PDI basics
This section familiarizes you with PDI and introduces you to basic terminology and concepts. Then, you learn how to start and configure Spoon and take a spin through the interface.
Get a basic understanding of what PDI does.
View a video that explains how PDI fits into the Business Analytics Platform.
Read about Pentaho Data Integration architecture in the Pentaho Data Integration document.
Get acquainted with the PDI client
Spoon is the PDI design tool. In this section you will set up Spoon, take a tour of the Spoon interface, and learn about the different Spoon perspectives.
Check out the hardware and software requirements for PDI.
Download the trial version of the Pentaho Suite and install the software. (The platform includes PDI.)
Learn how to install PDI only. See Custom installation for details.
Configure the Pentaho Server. Depending on your platform, see Increase Pentaho Server memory limit for installations on Linux or Increase Pentaho Server memory limit for installations on Windows for details.
Start the Pentaho Server. Depending on your platform, see Start and stop the Pentaho Server for configuration on Linux or Start and stop the Pentaho Server for configuration on Windows for details.
Access the PDI client. See the Pentaho Data Integration document for details.
Tour the PDI client perspectives. See the Pentaho Data Integration document for details.
Read about terminology and basic concepts in the Pentaho Data Integration document.
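The start-up steps above can be sketched as shell commands. The installation paths below are assumptions based on a default Linux install; adjust them to match your environment.

```shell
# Start and stop the Pentaho Server (assumed default install path):
#   /opt/pentaho/server/pentaho-server/start-pentaho.sh
#   /opt/pentaho/server/pentaho-server/stop-pentaho.sh

# Start the PDI client (Spoon) from the design-tools directory:
#   /opt/pentaho/design-tools/data-integration/spoon.sh
```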
Build transformations and jobs
Now that your environment is set up and you are familiar with the PDI client, you are ready to build transformations and jobs. Trying the following tasks may be helpful.
Create a connection to the Pentaho Repository.
Work through the exercise on Creating a Transformation that involves a flat file. Click through the links at the bottom of the page to complete the exercise.
Create a job to execute the transformation.
Schedule a job to execute the transformation at a later time.
Explore Big Data and Streamlined Data Refinery
In this section, you will learn how to use transformation steps to connect to a variety of big data sources, including Hadoop, NoSQL databases such as MongoDB, and analytic databases. You can then try working through the detailed, step-by-step tutorials, and peruse the out-of-the-box steps that Spoon provides. Learn how to work with Streamlined Data Refinery. Then, you will have an opportunity to move beyond the basics and learn how to edit transformations and metadata models.
Watch one of our Big Data Videos.
Learn how to work with Streamlined Data Refinery. See Pentaho Data Integration for details.
Learn how to auto model using the Build Model job entry and how this feature intersects with Analyzer. See Pentaho Data Integration for details.
Find out what big data steps are available out-of-the-box. See Commonly used PDI steps and entries for details.
Find out which Hadoop distributions are available and how to configure them. See Pentaho, big data, and Hadoop for details.
Note: You should already have a cluster set up to perform this task.
Edit transformations and metadata models. See Pentaho Data Integration for details.
Watch a video about how to use PDI to blend Big Data.
About Kitchen, Pan, and Carte
Kitchen, Pan, and Carte are command line tools for executing transformations and jobs modeled in the PDI client.
Use the Pan and Kitchen command line tools to work with transformations and jobs.
Use Carte to:
Run transformations and jobs on a Carte cluster.
Schedule jobs to run on a remote Carte server.
Start or stop Carte from the command line interface or a URL.
Run transformations and jobs from the repository on the Carte server.
See the Pentaho Data Integration document for details on Kitchen, Pan, and Carte.
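As a quick sketch of how these tools are invoked from a shell: the install path, file names, repository name, and credentials below are all assumptions, and only a few of the available options are shown.

```shell
# Assumed install location of the PDI command line tools:
#   PDI_HOME=/opt/pentaho/data-integration

# Pan executes transformations (.ktr files):
#   $PDI_HOME/pan.sh -file=/etl/load_sales.ktr -level=Basic

# Kitchen executes jobs (.kjb files), here from a Pentaho Repository:
#   $PDI_HOME/kitchen.sh -rep=PentahoRepo -user=admin -pass=password \
#       -dir=/etl -job=nightly_load -level=Basic

# Carte starts a lightweight web server for remote execution:
#   $PDI_HOME/carte.sh localhost 8081
```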
Learn more
Now that you have completed an initial evaluation of PDI, dig a little deeper. Find out how to:
Use newer steps and entries, like Spark Submit. See the Pentaho Data Integration document for details.
Read about how to turn a transformation into a data service. See the Pentaho Data Integration document for details.
Use the ETL Metadata Injection step. See the Pentaho Data Integration document for details.
Check out our What's New document.
Create other Data Integration solutions. See the Pentaho Data Integration document for details.
Administer PDI. See the administration documentation for details.
Integrate with different security protocols, like Pentaho security, LDAP, MSAD, and Kerberos. See the administration documentation for details.
Check out our developer center section in the administration documentation.
Develop your PDI solution
This workflow helps you set up and configure the DI development and test environments, then build, test, and tune your Pentaho DI solution prototype. This process is similar to the trial download evaluation experience, except that you will fully configure the Pentaho Server for data integration and work with your own ETL developers.
If you need extra help, Pentaho professional services are available. The goal is to learn DI implementation best practices and deploy your DI solution to a production server. Most DI development and testing occurs in Spoon.
Before you begin developing your DI solution, we recommend that you attend Pentaho training classes to learn how to install and configure the Pentaho Server, as well as how to develop data models.
This section is grouped into parts that will guide you during the development of your DI solution. These parts are iterative and you might bounce between them during development. For example, as you tune a job, you might find that although you have built a solution that produces the right results, it takes a long time to run. You might need to rebuild and test a transformation to improve efficiency, and then retest it.
Design DI solution
Design helps you think critically about the problem you want to solve and possible solutions. Consider these questions as you gather your requirements and design the solution.
Output
What does the overall solution look like? What questions are you posing, and how do you want the answers formatted?
Data Sources
What type(s) of data sources are you querying? Where are they located? How much data do you need to process? Are you using big data? Are you using relational or non-relational data sources? Will you have a target data source? If so, where is it located?
Content/Processing
What data quality issues do you have? How is the input data mapped to the output data? Where do you want to process the content, in PDI or in the data source? What hardware will you include in your development environment? Will you need one or more quality assurance test environments or production environments?
Also, consider templates or standards, naming conventions, and other requirements of your end users if you have them. Consider how you will back up your data as well.
Set up a development environment
Setting up the environment includes installing and configuring PDI on development computers, configuring clustering if needed, and connecting to data sources. If you have one or more quality assurance environments, you will need to set those up also.
Verify System Requirements
Consult the following references to verify requirements:
Components Reference
Acquire one or more servers that meet the requirements.
Obtain the correct drivers for your system.
Obtain Software and Install PDI
See the Install Pentaho Data Integration and Analytics document for the following instructions:
Installing PDI
Starting the Pentaho Server
Starting the PDI client (also known as Spoon)
Get the software from your Sales Support representative.
Install the software.
Start the Pentaho Server and Spoon.
Install licenses for the Pentaho Server
See the Administer Pentaho Data Integration and Analytics document for instructions on installing licenses.
Add all acquired Pentaho licenses.
Connect to the Pentaho Repository
See the Pentaho Data Integration document for instructions on connecting to the Pentaho Repository.
Connect to the Pentaho Repository.
Apply Advanced Security (if needed)
See the Administer Pentaho Data Integration and Analytics document for details on Advanced Security.
Determine whether you need to apply Advanced Security.
Build and test solution
During this step, you develop transformations, jobs, and models, then test what you have developed. You will tune the transformations, jobs, and models for optimal performance.
Development occurs in the PDI client design tool. The PDI client's streamlined design tightly couples the build and test activities so that you can easily perform them iteratively. The PDI client has perspectives to help you perform ETL and visualize data, and it also provides a scheduling perspective that you can use to automate testing.
Testing encompasses verifying the quality of transformations and jobs, reviewing visualizations, and debugging issues. One common method of testing is to include steps in a transformation or job that calculate hash totals, checksums, record counts, and so forth to determine whether data is being processed properly. You can also visualize your data in Analyzer and Report Designer and review the results as you develop. This not only helps you find errors and processing issues, but also gives you a jump on user acceptance testing if you show these reports to your customers or business analysts for early feedback.
One basic question is how to determine the number of transformations and jobs needed, as well as the order in which they should be executed. A good rule of thumb is to create one transformation for each combination of source system and target table. You can often identify these combinations in your mapping documents. Once you have identified the number of transformations that you need, you can use the same process to determine the number of jobs that you need. When considering the order of execution for transformations and jobs, consider how referential integrity is enforced: run target table transformations that have no dependencies first, then run transformations that depend on those tables, and so forth.
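One way to prototype the record-count and checksum checks described above outside PDI is with plain shell tools. A minimal sketch on hypothetical CSV extracts follows; in a real flow, these checks would be steps inside the transformation or job itself.

```shell
# Create two stand-in CSV extracts (in practice these would be a source
# extract and the corresponding target-table dump).
printf 'id,amount\n1,10\n2,20\n3,30\n' > source.csv
printf 'id,amount\n1,10\n2,20\n3,30\n' > target.csv

# Compare row counts, excluding the header row.
src_rows=$(($(wc -l < source.csv) - 1))
tgt_rows=$(($(wc -l < target.csv) - 1))
echo "source=$src_rows target=$tgt_rows"

# A checksum catches content differences that row counts alone miss.
cksum source.csv target.csv

rm -f source.csv target.csv
```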
Understand the Basics
Read the overview of the PDI client process in the Pentaho Data Integration document.
Review information about the process and perspectives.
Review most often used steps and entries
Review commonly used steps and entries.
Review available transformations and determine how you can use them for your solution.
Review job step references to identify which steps can be used in your solution.
Create and Run Transformations
Create and run a transformation. See the Pentaho Data Integration document for details.
Identify the transformations needed for your job and implement them.
Save the transformation.
Run transformations locally.
Create and Run a Job
Create and run a job. See the Pentaho Data Integration document for details.
Create a job.
Arrange transformations in a job so that they execute logically.
Run a job.
Tune solution
Fine-tune transformations and jobs to optimize performance. This involves using various tools, such as the DI Operations and Audit Marts, to determine where bottlenecks or other performance issues occur, and addressing them.
Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
Review tuning tips. See the Administer Pentaho Data Integration and Analytics document for tuning tips.
Get familiar with things that you can do to optimize performance.
Apply tuning tips as needed.
Consider other performance tuning options
Read about transactional databases. See the Pentaho Data Integration document for details on transactional databases.
Read about using logs. See the Administer Pentaho Data Integration and Analytics document for details on logging.
Learn how to apply transactional databases.
Learn how to use logs to tune transformations and jobs.
Next steps
These resources will be helpful to you as you prepare to Go Live for Production:
Prepare to Go Live for Production - DI.
Support Portal: check with Support for service packs.
Go Live for production - DI
Go Live is the process by which you migrate a prototype to production. This process is divided into four parts:
Setting up the production environment
Deploying the solution
Tuning the solution
Scheduling the runs
Set up production environment
Setting up the environment includes installing the software on production computers, configuring clustering, and connecting to data sources. To set up the environment, install and configure the Pentaho Server, Spoon, and any plugins required. Then set up data sources and clusters.
Verify system requirements
Consult the Components Reference.
Consult the JDBC Drivers Reference.
Acquire one or more servers that meet the requirements.
Obtain the correct drivers for your system.
Obtain software and install the Pentaho Server
Download the Pentaho software.
Start the Pentaho Server. See Install Pentaho Data Integration and Analytics for details.
Start the PDI client. See Pentaho Data Integration for details.
Install the licenses (if necessary). See Administer Pentaho Data Integration and Analytics for details.
Get the software from your Sales Support representative.
Install the software.
Change the Server Fully Qualified URL
Change the ports and URLs. See Administer Pentaho Data Integration and Analytics for details.
Change the server's fully qualified URL so that it does not conflict with other environments.
Connect to the Pentaho Repository
Create a connection to the Pentaho Repository. See Pentaho Data Integration for details.
Connect to the Pentaho Repository.
Set up clusters
Optional: Set up clusters. See Pentaho Data Integration for details.
Become familiar with clustering.
Set up clusters, if they are needed in your environment.
Copy configuration files
Copy shared.xml, repositories.xml, kettle.properties, and JAR files from the development environment to the production environment.
System is set up and ready for production.
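The copy step might look like the following sketch. The temporary directories here stand in for the real development and production .kettle directories, which by default live under each user's home directory.

```shell
# Stand-ins for the dev and production .kettle directories (assumptions):
DEV_KETTLE=$(mktemp -d)
PROD_KETTLE=$(mktemp -d)
touch "$DEV_KETTLE/shared.xml" "$DEV_KETTLE/repositories.xml" \
      "$DEV_KETTLE/kettle.properties"

# Copy the shared configuration files across environments.
for f in shared.xml repositories.xml kettle.properties; do
  cp "$DEV_KETTLE/$f" "$PROD_KETTLE/"
done

# Custom JAR files (JDBC drivers, plugins) must also be copied, typically
# into the production data-integration/lib directory.
ls "$PROD_KETTLE"
```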
Logging and monitoring your server
Review logging and monitoring operations. See Pentaho Data Integration for details.
Enable logging. See Administer Pentaho Data Integration and Analytics for details.
Monitor PDI and SNMP traps. See Administer Pentaho Data Integration and Analytics for details.
Learn about the different ways to log and monitor Pentaho Server operations:
Log through Spoon and Carte
Use SNMP traps with PDI
Deploy solution
Export solutions from the Pentaho Repository in your development or test environment to the Pentaho Repository in your production environment.
Export and Import Pentaho Repository
See Export and Import Pentaho Repository Content in the Administer Pentaho Data Integration and Analytics document.
Export Pentaho Repository content from test environment
Import Pentaho Repository content to production environment
Tune solution
Fine-tune transformations and jobs to optimize performance. This involves using various tools, such as the DI Operations and Audit Marts, to determine where bottlenecks or other performance issues occur, and addressing them.
Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
Consult the tuning tips. See the Administer Pentaho Data Integration and Analytics document for tuning tips.
Get familiar with things that you can do to optimize performance.
Apply tuning tips as needed.
Consider other performance tuning options
Learn about transactional databases. See the Pentaho Data Integration document for details on transactional databases.
Learn about using logs. See the Administer Pentaho Data Integration and Analytics document for details on logging.
Learn how to apply transactional databases.
Learn how to use logs to tune transformations and jobs.
Schedule runs
Use the PDI client, Pan, or Kitchen to schedule executions of transformations and jobs.
Schedule Transformations and Jobs From Spoon
Schedule transformations and jobs. See the Pentaho Data Integration document for details.
Schedule transformations and jobs.
Command Line Scripting Through Pan and Kitchen
Learn about Pan's options. See the Pentaho Data Integration document for details.
Learn about Kitchen's options. See the Pentaho Data Integration document for details.
Use Pan and Kitchen to schedule transformations and jobs.
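On Linux, a common way to schedule Kitchen outside the PDI client is cron. A hypothetical crontab entry follows; the install path, job file, and log location are all assumptions.

```shell
# Run a nightly job at 02:00 and append Kitchen's output to a log file:
#   0 2 * * * /opt/pentaho/data-integration/kitchen.sh \
#       -file=/etl/nightly_load.kjb -level=Basic >> /var/log/etl/nightly.log 2>&1
```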
Next steps
These resources will be helpful to you after your production server is live.
Fine-tune Pentaho systems: Provides guidance on how to maintain and fine-tune your Pentaho Server. See the Administer Pentaho Data Integration and Analytics document for details.
Pentaho Training and Education
Support Portal: Check with support for service packs.
Commonly used PDI steps and entries
Although there are over 330 transformation steps and job entries, some steps and entries are used more often than others. If you are creating a transformation or job but do not know where to begin, this list might be helpful to you.
Top ten transformation steps
PDI transformation steps are documented in Pentaho Data Integration.
Text File Input
Table Input
Microsoft Excel Input
Text File Output
Table Output
Microsoft Excel Writer
Select Values
Filter Rows
Group By
Stream Lookup
Other commonly used transformation steps
PDI transformation steps are documented in Pentaho Data Integration.
INPUT: Generate Rows, Data Grid, Get Data from XML, CSV File Input, Fixed File Input
OUTPUT: XML Output
TRANSFORM: Split Fields, Calculator, Add Constants, Add Sequence, Replacing Strings, Sort Rows, String Operations, Strings Cut
SCRIPTING: User Defined Java Class, Modified Java Script Value, User Defined Java Expression
FLOW: Abort, Append Streams, Block this step until steps finish, Blocking Step, Detect Empty Stream, Dummy, ETL Metadata Injection, Filter Rows, Identify Last Row in a Stream, Java Filter, Job Executor, Prioritize Streams, Single Threader, Switch/Case, Transformation Executor
LOOKUP
JOINS: Join Rows, Merge Join
JOB: Get Variables, Set Variables
Commonly used job entries
PDI job entries are documented in Pentaho Data Integration.
GENERAL: Start, Job, Transformation, Success
UTILITY: Abort
MAIL: Mail
FILE MANAGEMENT: Add filenames to result, Compare folders, Convert file between Windows and Unix, Copy Files, Create a folder, Create file, Delete file, Delete filenames from result, Delete files, Delete folders, File Compare, HTTP, Move Files, Process result filenames, Unzip file, Wait for file, Write to file, Zip file
UTILITIES: Write to log