Pentaho Data Integration workflows
Pentaho Data Integration is a robust extract, transform, and load (ETL) tool that you can use to integrate, manipulate, and visualize your data. You can use PDI to import, transform, and export data from multiple data sources, including flat files, relational databases, Hadoop, NoSQL databases, analytic databases, social media streams, and operational stores. You can also use PDI to clean and enrich the data, move data between databases, and visualize your data.
Evaluate and learn PDI
As you explore Pentaho Data Integration (PDI), you will be introduced to the major components, watch videos, work through hands-on examples, and read about the different features.
Review the documentation and contact Pentaho sales support if you have questions.
PDI basics
This section familiarizes you with PDI and introduces you to basic terminology and concepts. Then, you learn how to start and configure Spoon and take a spin through the interface.
Get a basic understanding of what PDI does.
View a video that explains how PDI fits into the Business Analytics Platform.
Read about Pentaho Data Integration architecture in the Pentaho Data Integration document.
Get acquainted with the PDI client
Spoon is the PDI design tool. In this section you will set up Spoon, take a tour of the Spoon interface, and learn about the different Spoon perspectives.
Check out the hardware and software requirements for PDI.
Download the trial version of the Pentaho Suite and install the software. (The platform includes PDI.)
Learn how to install PDI only. See Custom installation for details.
Configure the Pentaho Server. Depending on your platform, see Increase Pentaho Server memory limit for installations on Linux or Increase Pentaho Server memory limit for installations on Windows for details.
Start the Pentaho Server. Depending on your platform, see Start and stop the Pentaho Server for configuration on Linux or Start and stop the Pentaho Server for configuration on Windows for details.
Access the PDI client. See the Pentaho Data Integration document for details.
Tour the PDI client perspectives. See the Pentaho Data Integration document for details.
Read about terminology and basic concepts in the Pentaho Data Integration document.
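The start-up steps above can be sketched as shell commands. The installation paths below are assumptions based on a default Linux install; adjust them to match your environment.

```shell
# Start and stop the Pentaho Server (assumed default install path):
#   /opt/pentaho/server/pentaho-server/start-pentaho.sh
#   /opt/pentaho/server/pentaho-server/stop-pentaho.sh

# Start the PDI client (Spoon) from the design-tools directory:
#   /opt/pentaho/design-tools/data-integration/spoon.sh
```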
Build transformations and jobs
Now that your environment is set up and you are familiar with the PDI client, you are ready to build transformations and jobs. Trying the following tasks may be helpful.
Create a connection to the Pentaho Repository.
Work through the exercise on Creating a Transformation that involves a flat file. Click through the links at the bottom of the page to complete the exercise.
Create a job to execute the transformation.
Schedule a job to execute the transformation at a later time.
Explore Big Data and Streamlined Data Refinery
In this section, you will learn how to use transformation steps to connect to a variety of big data sources, including Hadoop, NoSQL databases such as MongoDB, and analytic databases. You can then try working through the detailed, step-by-step tutorials, and peruse the out-of-the-box steps that Spoon provides. Learn how to work with Streamlined Data Refinery. Then, you will have an opportunity to move beyond the basics and learn how to edit transformations and metadata models.
Watch one of our Big Data Videos.
Learn how to work with Streamlined Data Refinery. See Pentaho Data Integration for details.
Learn how to auto model using the Build Model job entry and how this feature intersects with Analyzer. See Pentaho Data Integration for details.
Find out what big data steps are available out-of-the-box. See Commonly used PDI steps and entries for details.
Find out which Hadoop distributions are available and how to configure them. See Pentaho, big data, and Hadoop for details.
Note: You should already have a cluster set up to perform this task.
Edit transformations and metadata models. See Pentaho Data Integration for details.
Watch a video about how to use PDI to blend Big Data.
About Kitchen, Pan, and Carte
Kitchen, Pan, and Carte are command line tools for executing transformations and jobs modeled in the PDI client.
Use the Pan and Kitchen command line tools to work with transformations and jobs.
Use Carte to:
Run transformations and jobs on a Carte cluster.
Schedule jobs to run on a remote Carte server.
Start or stop Carte from the command line interface or a URL.
Run transformations and jobs from the repository on the Carte server.
See the Pentaho Data Integration document for details on Kitchen, Pan, and Carte.
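As a quick sketch of how these tools are invoked from a shell: the install path, file names, repository name, and credentials below are all assumptions, and only a few of the available options are shown.

```shell
# Assumed install location of the PDI command line tools:
#   PDI_HOME=/opt/pentaho/data-integration

# Pan executes transformations (.ktr files):
#   $PDI_HOME/pan.sh -file=/etl/load_sales.ktr -level=Basic

# Kitchen executes jobs (.kjb files), here from a Pentaho Repository:
#   $PDI_HOME/kitchen.sh -rep=PentahoRepo -user=admin -pass=password \
#       -dir=/etl -job=nightly_load -level=Basic

# Carte starts a lightweight web server for remote execution:
#   $PDI_HOME/carte.sh localhost 8081
```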
Learn more
Now that you have completed an initial evaluation of PDI, dig a little deeper. Find out how to:
Use newer steps and entries, like Spark Submit. See the Pentaho Data Integration document for details.
Read about how to turn a transformation into a data service. See the Pentaho Data Integration document for details.
Use the ETL Metadata Injection step. See the Pentaho Data Integration document for details.
Check out our What's New document.
Create other Data Integration solutions. See the Pentaho Data Integration document for details.
Administer PDI. See the administration documentation for details.
Integrate with different security protocols, like Pentaho security, LDAP, MSAD, and Kerberos. See the administration documentation for details.
Check out our developer center section in the administration documentation.
Develop your PDI solution
This workflow helps you set up and configure the DI development and test environments, then build, test, and tune your Pentaho DI solution prototype. This process is similar to the trial download evaluation experience, except that you will fully configure the Pentaho Server for data integration and work with your own ETL developers.
If you need extra help, Pentaho professional services are available. The goal is to learn DI implementation best practices and deploy your DI solution to a production server. Most DI development and testing occurs in Spoon.
Before you begin developing your DI solution, we recommend that you attend Pentaho training classes to learn how to install and configure the Pentaho Server, as well as how to develop data models.
This section is grouped into parts that will guide you during the development of your DI solution. These parts are iterative and you might bounce between them during development. For example, as you tune a job, you might find that although you have built a solution that produces the right results, it takes a long time to run. You might need to rebuild and test a transformation to improve efficiency, and then retest it.
Design DI solution
Design helps you think critically about the problem you want to solve and possible solutions. Consider these questions as you gather your requirements and design the solution.
Output
What does the overall solution look like? What questions are you posing, and how do you want the answers formatted?
Data Sources
What type(s) of data sources are you querying? Where are they located? How much data do you need to process? Are you using big data? Are you using relational or non-relational data sources? Will you have a target data source? If so, where is it located?
Content/Processing
What data quality issues do you have? How is the input data mapped to the output data? Where do you want to process the content, in PDI or in the data source? What hardware will you include in your development environment? Will you need one or more quality assurance test environments or production environments?
Also, consider templates or standards, naming conventions, and other requirements of your end users if you have them. Consider how you will back up your data as well.
Set up a development environment
Setting up the environment includes installing and configuring PDI on development computers, configuring clustering if needed, and connecting to data sources. If you have one or more quality assurance environments, you will need to set those up also.
Verify System Requirements
Consult the following references to verify requirements:
Components Reference
Acquire one or more servers that meet the requirements.
Obtain the correct drivers for your system.
Obtain Software and Install PDI
See the Install Pentaho Data Integration and Analytics document for the following instructions:
Installing PDI
Starting the Pentaho Server
Starting the PDI client (also known as Spoon)
Get the software from your Sales Support representative.
Install the software.
Start the Pentaho Server and Spoon.
Install licenses for the Pentaho Server
See the Administer Pentaho Data Integration and Analytics document for instructions on installing licenses.
Add all acquired Pentaho licenses.
Connect to the Pentaho Repository
See the Pentaho Data Integration document for instructions on connecting to the Pentaho Repository.
Connect to the Pentaho Repository.
Apply Advanced Security (if needed)
See the Administer Pentaho Data Integration and Analytics document for details on Advanced Security.
Determine whether you need to apply Advanced Security.
Build and test solution
During this step, you develop transformations, jobs, and models, then test what you have developed. You will tune the transformations, jobs, and models for optimal performance.
Development occurs in the PDI client design tool. The PDI client's streamlined design tightly couples the build and test activities so that you can easily perform them iteratively. The PDI client has perspectives to help you perform ETL and visualize data, and it also provides a scheduling perspective that you can use to automate testing.
Testing encompasses verifying the quality of transformations and jobs, reviewing visualizations, and debugging issues. One common method of testing is to include steps in a transformation or job that calculate hash totals, checksums, record counts, and so forth to determine whether data is being processed properly. You can also visualize your data in Analyzer and Report Designer and review the results as you develop. This not only helps you find errors and processing issues, but also gives you a jump on user acceptance testing if you show these reports to your customers or business analysts for early feedback.
One basic question is how to determine the number of transformations and jobs needed, as well as the order in which they should be executed. A good rule of thumb is to create one transformation for each combination of source system and target table. You can often identify these combinations in your mapping documents. Once you have identified the number of transformations that you need, you can use the same process to determine the number of jobs that you need. When considering the order of execution for transformations and jobs, consider how referential integrity is enforced: run target table transformations that have no dependencies first, then run transformations that depend on those tables, and so forth.
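One way to prototype the record-count and checksum checks described above outside PDI is with plain shell tools. A minimal sketch on hypothetical CSV extracts follows; in a real flow, these checks would be steps inside the transformation or job itself.

```shell
# Create two stand-in CSV extracts (in practice these would be a source
# extract and the corresponding target-table dump).
printf 'id,amount\n1,10\n2,20\n3,30\n' > source.csv
printf 'id,amount\n1,10\n2,20\n3,30\n' > target.csv

# Compare row counts, excluding the header row.
src_rows=$(($(wc -l < source.csv) - 1))
tgt_rows=$(($(wc -l < target.csv) - 1))
echo "source=$src_rows target=$tgt_rows"

# A checksum catches content differences that row counts alone miss.
cksum source.csv target.csv

rm -f source.csv target.csv
```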
Understand the Basics
Read the overview of the PDI client process in the Pentaho Data Integration document.
Review information about the process and perspectives.
Review most often used steps and entries
Review commonly used steps and entries.
Review available transformations and determine how you can use them for your solution.
Review job step references to identify which steps can be used in your solution.
Create and Run Transformations
Create and run a transformation. See the Pentaho Data Integration document for details.
Identify the transformations needed for your job and implement them.
Save the transformation.
Run transformations locally.
Create and Run a Job
Create and run a job. See the Pentaho Data Integration document for details.
Create a job.
Arrange transformations in a job so that they execute logically.
Run a job.
Tune solution
Fine-tune transformations and jobs to optimize performance. This involves using various tools, such as the DI Operations and Audit Marts, to determine where bottlenecks or other performance issues occur, and addressing them.
Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
Review tuning tips. See the Administer Pentaho Data Integration and Analytics document for tuning tips.
Get familiar with things that you can do to optimize performance.
Apply tuning tips as needed.
Consider other performance tuning options
Read about transactional databases. See the Pentaho Data Integration document for details on transactional databases.
Read about using logs. See the Administer Pentaho Data Integration and Analytics document for details on logging.
Learn how to apply transactional databases.
Learn how to use logs to tune transformations and jobs.
Next steps
These resources will be helpful to you as you prepare to Go Live for Production:
Prepare to Go Live for Production - DI.
Support Portal: check with Support for service packs.
Go Live for production - DI
Go Live is the process by which you migrate a prototype to production. This process is divided into four parts:
Setting up the production environment
Deploying the solution
Tuning the solution
Scheduling the runs
Set up production environment
Setting up the environment includes installing the software on production computers, configuring clustering, and connecting to data sources. To set up the environment, install and configure the Pentaho Server, Spoon, and any plugins required. Then set up data sources and clusters.
Verify system requirements
Consult the Components Reference.
Consult the JDBC Drivers Reference.
Acquire one or more servers that meet the requirements.
Obtain the correct drivers for your system.
Obtain software and install the Pentaho Server
Download the Pentaho software.
Start the Pentaho Server. See Install Pentaho Data Integration and Analytics for details.
Start the PDI client. See Pentaho Data Integration for details.
Install the licenses (if necessary). See Administer Pentaho Data Integration and Analytics for details.
Get the software from your Sales Support representative.
Install the software.
Change the Server Fully Qualified URL
Change the ports and URLs. See Administer Pentaho Data Integration and Analytics for details.
Change the server's fully qualified URL so that it does not conflict with other environments.
Connect to the Pentaho Repository
Create a connection to the Pentaho Repository. See Pentaho Data Integration for details.
Connect to the Pentaho Repository.
Set up clusters
Optional: Set up clusters. See Pentaho Data Integration for details.
Become familiar with clustering.
Set up clusters, if they are needed in your environment.
Copy configuration files
Copy shared.xml, repositories.xml, kettle.properties, and JAR files from the development environment to the production environment.
System is set up and ready for production.
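The copy step might look like the following sketch. The temporary directories here stand in for the real development and production .kettle directories, which by default live under each user's home directory.

```shell
# Stand-ins for the dev and production .kettle directories (assumptions):
DEV_KETTLE=$(mktemp -d)
PROD_KETTLE=$(mktemp -d)
touch "$DEV_KETTLE/shared.xml" "$DEV_KETTLE/repositories.xml" \
      "$DEV_KETTLE/kettle.properties"

# Copy the shared configuration files across environments.
for f in shared.xml repositories.xml kettle.properties; do
  cp "$DEV_KETTLE/$f" "$PROD_KETTLE/"
done

# Custom JAR files (JDBC drivers, plugins) must also be copied, typically
# into the production data-integration/lib directory.
ls "$PROD_KETTLE"
```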
Logging and monitoring your server
Review logging and monitoring operations. See Pentaho Data Integration for details.
Enable logging. See Administer Pentaho Data Integration and Analytics for details.
Monitor PDI and SNMP traps. See Administer Pentaho Data Integration and Analytics for details.
Learn about the different ways to log and monitor Pentaho Server operations:
Log through Spoon and Carte
Use SNMP traps with PDI
Deploy solution
Export solutions from the Pentaho Repository in your development or test environment to the Pentaho Repository in your production environment.
Export and Import Pentaho Repository
See Export and Import Pentaho Repository Content in the Administer Pentaho Data Integration and Analytics document.
Export Pentaho Repository content from test environment
Import Pentaho Repository content to production environment
Tune solution
Fine-tune transformations and jobs to optimize performance. This involves using various tools, such as the DI Operations and Audit Marts, to determine where bottlenecks or other performance issues occur, and addressing them.
Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
Consult the tuning tips. See the Administer Pentaho Data Integration and Analytics document for tuning tips.
Get familiar with things that you can do to optimize performance.
Apply tuning tips as needed.
Consider other performance tuning options
Learn about transactional databases. See the Pentaho Data Integration document for details on transactional databases.
Learn about using logs. See the Administer Pentaho Data Integration and Analytics document for details on logging.
Learn how to apply transactional databases.
Learn how to use logs to tune transformations and jobs.
Schedule runs
Use the PDI client, Pan, or Kitchen to schedule executions of transformations and jobs.
Schedule Transformations and Jobs From Spoon
Schedule transformations and jobs. See the Pentaho Data Integration document for details.
Schedule transformations and jobs.
Command Line Scripting Through Pan and Kitchen
Learn about Pan's options. See the Pentaho Data Integration document for details.
Learn about Kitchen's options. See the Pentaho Data Integration document for details.
Use Pan and Kitchen to schedule transformations and jobs.
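On Linux, a common way to schedule Kitchen outside the PDI client is cron. A hypothetical crontab entry follows; the install path, job file, and log location are all assumptions.

```shell
# Run a nightly job at 02:00 and append Kitchen's output to a log file:
#   0 2 * * * /opt/pentaho/data-integration/kitchen.sh \
#       -file=/etl/nightly_load.kjb -level=Basic >> /var/log/etl/nightly.log 2>&1
```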
Next steps
These resources will be helpful to you after your production server is live.
Fine-tune Pentaho systems: Provides guidance on how to maintain and fine-tune your Pentaho Server. See the Administer Pentaho Data Integration and Analytics document for details.
Pentaho Training and Education
Support Portal: Check with support for service packs.
Commonly used PDI steps and entries
Although there are over 330 transformation steps and job entries, some steps and entries are used more often than others. If you are creating a transformation or job but do not know where to begin, this list might be helpful to you.
Top ten transformation steps
PDI transformation steps are documented in Pentaho Data Integration.
Text File Input
Table Input
Microsoft Excel Input
Text File Output
Table Output
Microsoft Excel Writer
Select Values
Filter Rows
Group By
Stream Lookup
Other commonly used transformation steps
PDI transformation steps are documented in Pentaho Data Integration.
INPUT: Generate Rows, Data Grid, Get Data from XML, CSV File Input, Fixed File Input
OUTPUT: XML Output
TRANSFORM: Split Fields, Calculator, Add Constants, Add Sequence, Replacing Strings, Sort Rows, String Operations, Strings Cut
SCRIPTING: User Defined Java Class, Modified Java Script Value, User Defined Java Expression
FLOW: Abort, Append Streams, Block this step until steps finish, Blocking Step, Detect Empty Stream, Dummy, ETL Metadata Injection, Filter Rows, Identify Last Row in a Stream, Java Filter, Job Executor, Prioritize Streams, Single Threader, Switch/Case, Transformation Executor
LOOKUP
JOINS: Join Rows, Merge Join
JOB: Get Variables, Set Variables
Commonly used job entries
PDI job entries are documented in Pentaho Data Integration.
GENERAL: Start, Job, Transformation, Success
UTILITY: Abort
MAIL: Mail
FILE MANAGEMENT: Add filenames to result, Compare folders, Convert file between Windows and Unix, Copy Files, Create a folder, Create file, Delete file, Delete filenames from result, Delete files, Delete folders, File Compare, HTTP, Move Files, Process result filenames, Unzip file, Wait for file, Write to file, Zip file
UTILITIES: Write to log