Transforming data with PDI
Transform data and manage jobs in Pentaho Data Integration (PDI).
Use this page to build, run, monitor, and optimize transformations and jobs.
Work with transformations
In the PDI client (Spoon), you can develop transformations.
Transformations are data workflows representing your ETL activities.
Transformation steps define the individual ETL activities.
Transformations are stored in .ktr files.
Create a transformation
Follow these steps:
Do one of the following:
Select File > New > Transformation.
Select New file on the toolbar, then select Transformation.
Press Ctrl+N.
Select the Design tab.
Expand folders or search in Steps.
Drag a step onto the canvas.
Double-click the step to open its properties.
Add more steps as needed:
Drag a step, then press Shift and draw a hop to connect steps.
Double-click a step to add it with a hop from the previous step.
Save the transformation before you run it.
Open a transformation
How you open a transformation depends on where it lives.
You can open a local file, a repository object, or a file on a Virtual File System (VFS).
If you get a missing plugin error, see Troubleshooting transformation steps and job entries.
Open a local transformation
In the PDI client, do one of the following:
Select File > Open.
Select Open file on the toolbar.
Select Open Files on the Welcome screen.
Press Ctrl+O.
Select the .ktr file, then select Open.
Open a transformation from the Pentaho Repository
Verify you are connected to a repository.
Open the repository browser:
Select File > Open.
Select Open file on the toolbar.
Select Open Files on the Welcome screen.
Press Ctrl+O.
Use Recents, search, or browse folders to find your transformation.
Select the transformation, then select Open.
Open a transformation on a Virtual File System
Select File > Open.
For details, see Connecting to Virtual File Systems.
Rename a folder or file (local only)
You can rename folders and files from the Open window.
You can rename only when you are not connected to the Pentaho Repository.
In the Open window, select a folder or file.
Right-click the folder or file.
Select Rename.
Save a transformation
How you save a transformation depends on where it lives.
Save a local transformation
In the PDI client, do one of the following:
Select File > Save or File > Save as.
Select Save current file on the toolbar.
Press Ctrl+S.
Enter a name and choose a location.
Select Save.
Save a transformation to the Pentaho Repository
Verify you are connected to a repository.
In the PDI client, do one of the following:
Select File > Save or File > Save as.
Select Save current file on the toolbar.
Press Ctrl+S.
Browse to the repository folder.
Enter a name, then select Save.
Save a transformation on a Virtual File System
Select File > Save as to save a transformation on a Virtual File System (VFS).
For details, see Connecting to Virtual File Systems.
Run a transformation
Run a transformation to test how it performs.
The Run Options window lets you set run configurations, logging, options, and temporary parameter values.
Open Run Options in one of these ways:
Select Run on the toolbar.
Select Action > Run.
Press F9.
The Run Options window appears.
In Run Options, you choose a run configuration.
To set up run configurations, see Run configurations (transformations).
After you run a transformation, use the Execution Results section to review output.
Run configurations (transformations)
Some ETL activities are lightweight.
Others need dedicated servers or cluster execution.
You can create or edit run configurations in View > Run configurations.
Pentaho local is the default run configuration.
You cannot edit it.
Create or edit a run configuration
Right-click Run configurations, then select New.
Or, right-click a configuration and select Edit.
The dialog contains:
Name
Name of the run configuration.
Description
Optional details about the configuration.
Select an engine
You can select the Pentaho engine to run transformations in the default environment.
You can also use Spark Submit to run big data transformations on a Hadoop cluster.
See Spark Submit.
The Pentaho engine does not execute sub-transformations or sub-jobs when you select Pentaho server or Slave server.
If you need sub-transformations to run on the same host as the parent job, use Local.
Run options
Errors, warnings, and other information are stored in logs.
You set log verbosity and other behavior in the Options section.
For background, see Logging and performance monitoring.
Clear log before running
Clears logs before each run.
Log level
Controls how much information is logged.
Enable safe mode
Checks every row to ensure layout consistency.
Gather performance metrics
Captures performance metrics during the run. See Use performance graphs.
Parameters and variables (run-time overrides)
You can temporarily override parameters and variables for each run.
These overrides apply only to the current run.
Use parameters to set run-time parameter values.
Use variables to set run-time variable values.
Use arguments for command-line style arguments.
Analyze transformation results
After you run a transformation, the Execution Results panel appears.
It helps you inspect step behavior, errors, and performance.
Step metrics
The Step Metrics tab shows per-step statistics.
It includes records read, written, errors, and row throughput.
Steps that failed are highlighted in red.
Logging
The Logging tab shows log details for the most recent run.
Error lines are highlighted in red.
Execution history
The Execution History tab shows metrics and logs from previous runs.
It requires database logging configured in transformation properties.
See Set up transformation logging.
Performance graph
The Performance Graph tab shows step performance over time.
It requires database logging enabled.
Metrics
The Metrics tab shows a Gantt chart for run timings.
Preview data
Use Preview Data to inspect step output rows.
Select a step to view its data.
Stop your transformation
You can stop transformations in two ways.
Use Stop to stop immediately.
Use Stop input processing to stop input safely after in-flight records complete.
Transformation properties reference
Transformation properties describe the transformation and configure its behavior.
To open properties, press Ctrl+T or right-click the canvas and select Properties.
The settings are grouped into these tabs:
Transformation
Parameters
Logging
Dates
Dependencies
Miscellaneous
Monitoring
After you adjust settings, select SQL to generate SQL for logging tables.
For SQL execution details, see Use the SQL Editor.
Transformation tab
Use the Transformation tab to specify general properties.
Transformation name
Required to save settings to a repository.
Transformation filename
.ktr file name.
Description
Short description shown in Repository Explorer.
Extended description
Long description.
Status
Draft or production.
Version
Version description.
Directory
Repository directory.
Created by
Original creator.
Created at
Creation time.
Last modified by
Last editor.
Last modified at
Last update time.
Parameters tab
Use the Parameters tab to add parameters.
Parameter
Local variable shared across steps in the transformation.
Default Value
Used if a parameter value is not set elsewhere.
Description
Description of the parameter.
Logging tab
Use the Logging tab to configure logging.
For a guided setup, see Set up transformation logging.
Dates tab
Use the Dates tab to configure date ranges and limits.
Dependencies tab
Use the Dependencies tab to list transformation dependencies.
Miscellaneous tab
Use the Miscellaneous tab to configure buffer sizes and admin settings.
This tab includes the Make the transformation database transactional option.
For rollback patterns, see Transactional databases and job rollback.
Monitoring tab
Use the Monitoring tab to enable step performance monitoring.
Transformation canvas context menu (Transformation menu)
Use the Transformation menu to access settings, options, and properties.
Right-click any step in the canvas.
Each item is described in this table.
New Hop
Creates a new hop.
Edit
Shows the step configuration window.
Description
Adds a description to the step.
Open Referenced Object
Opens a referenced sub-transformation. See Mapping.
Data Movement
Round robin, load balance, or copy rows across hops.
Change Number of Copies to Start
Starts step copies in parallel.
Copy
Copies selected items to the clipboard.
Duplicate
Duplicates the selection on the canvas.
Delete
Deletes selected items.
Hide
Hides the step. You must edit the XML to show it again.
Detach
Detaches the step or entry.
Input Fields
Shows incoming field metadata.
Output Fields
Shows outgoing field metadata.
Sniff Test During Execution
Shows row data while executing. See Sniff Test tool.
Check Selected Step(s)
Checks steps for configuration problems.
Error Handling
Configures step error handling.
Preview
Launches the debug dialog.
Align/Distribute
Aligns or distributes steps on the canvas.
Data Services
Creates or manages data services. See Pentaho Data Services.
Mapping
Creates a Select/Rename Values step for field mappings.
Partitions
Configures partitioning. See Partitioning data.
Clusters
Configures Carte clusters. See Use Carte Clusters.
Work with jobs
In the PDI client (Spoon), you can develop jobs to orchestrate ETL activities.
Job entries define the work a job performs.
Jobs are stored in .kjb files.
Create a job
Follow these steps:
Do one of the following:
Select File > New > Job.
Select New file on the toolbar, then select Job.
Press Ctrl+Alt+N.
Select the Design tab.
Expand folders or search in Entries.
Drag an entry onto the canvas.
Double-click the entry to open its properties.
Add more entries as needed:
Drag an entry, then press Shift and draw a hop.
Double-click an entry to add it with a hop from the previous entry.
Save the job when you are done.
Open a job
How you open a job depends on where it lives.
You can open a local file, a repository object, or a file on a Virtual File System (VFS).
If you get a missing plugin error, see Troubleshooting transformation steps and job entries.
Open a local job
In the PDI client, do one of the following:
Select File > Open.
Select Open file on the toolbar.
Select Open Files on the Welcome screen.
Press Ctrl+O.
Select the .kjb file, then select Open.
Open a job from the Pentaho Repository
Verify you are connected to a repository.
Open the repository browser:
Select File > Open.
Select Open file on the toolbar.
Press Ctrl+O.
Use Recents, search, or browse folders to find your job.
Select the job, then select Open.
Open a job on a Virtual File System
Select File > Open.
For details, see Connecting to Virtual File Systems.
Rename a folder or file (local only)
Follow the steps in Rename a folder or file (local only), described earlier for transformations.
Save a job
How you save a job depends on where it lives.
Save a local job
In the PDI client, do one of the following:
Select File > Save or File > Save as.
Select Save current file on the toolbar.
Press Ctrl+S.
Enter a name and choose a location.
Select Save.
Save a job to the Pentaho Repository
Verify you are connected to a repository.
In the PDI client, do one of the following:
Select File > Save or File > Save as.
Select Save current file on the toolbar.
Press Ctrl+S.
Browse to the repository folder.
Enter a name, then select Save.
Save a job on a Virtual File System
Select File > Save as to save a job on a Virtual File System (VFS).
For details, see Connecting to Virtual File Systems.
Run a job
Run a job to test how it performs.
The Run Options window lets you set run configurations, logging, options, and temporary values.
Open Run Options in one of these ways:
Select Run on the toolbar.
Select Action > Run.
Press F9.
In Run Options, you choose a run configuration.
To set up run configurations, see Run configurations (jobs).
Run configurations (jobs)
You can create or edit configurations in View > Run configurations.
Pentaho local is the default run configuration.
You cannot edit it.
Pentaho engine
The Pentaho engine does not execute sub-transformations or sub-jobs when you select Pentaho server or Slave server.
If you need sub-transformations to run on the same host as the parent job, use Local.
Run options
Clear log before running
Clears logs before each run.
Log level
Controls how much information is logged.
Enable safe mode
Checks every row to ensure layout consistency.
Start job at
Starts the run at an alternative entry.
Gather performance metrics
Captures performance metrics during the run.
Parameters and variables (run-time overrides)
You can temporarily override parameters and variables for each run.
These overrides apply only to the current run.
Use parameters to set run-time parameter values.
Use variables to set run-time variable values.
Use arguments for command-line style arguments.
Stop your job
You can stop jobs in two ways.
Use Stop to stop immediately.
Use Stop input processing to stop input safely after in-flight records complete.
Job properties reference
Job properties control job behavior and logging.
To open properties, press Ctrl+T or right-click the canvas and select Properties.
The properties are grouped into these tabs:
Job
Parameters
Settings
Log
Transactions
Job canvas context menu (Job menu)
Right-click any entry in the job canvas to view the Job menu.
New Hop
Creates a new hop.
Edit
Shows the entry configuration window.
Description
Adds a description to the entry.
Open Referenced Object
Opens referenced transformations.
Copy
Copies selected items to the clipboard.
Duplicate
Duplicates the selection on the canvas.
Delete
Deletes selected items.
Hide
Hides the entry. You must edit the XML to show it again.
Detach
Detaches the entry from the job.
Align/Distribute
Aligns or distributes entries on the canvas.
Restartable Checkpoint
Adds a checkpoint to restart failed jobs. See Use checkpoints to restart jobs.
Run Next Entries in Parallel
Runs next entries in parallel.
PDI run modifiers
This section describes the types of run modifiers, their uses, and configuration.
You can use arguments, parameters, or variables to modify how you run transformations and jobs.
Arguments
A PDI argument is a named, user-supplied, single-value input.
Arguments are passed as command-line values after the Pan or Kitchen options.
Each transformation or job supports up to 10 arguments.
Example:
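A hypothetical Pan invocation that passes two date arguments after the named options (the file path and argument values are placeholders):
pan.sh -file=/home/pdi/sales_load.ktr 20240101 20240131
Inside the transformation, a Get System Info step can read these values as command line argument 1 and command line argument 2.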
In Spoon, you can test arguments in the Run Options window.
Use the Arguments button to enter values.
Parameters
Parameters are local variables that apply only to the transformation where you define them.
You can assign a default value.
If a parameter name collides with a variable, the parameter takes precedence.
Define parameters in transformation settings:
Right-click the transformation canvas and select Transformation settings, or press Ctrl+T.
Select the Parameters tab.
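At run time, Pan and Kitchen accept parameter values through the -param option. A minimal sketch, assuming a parameter named INPUT_DIR was defined on the Parameters tab (the paths are placeholders):
pan.sh -file=/home/pdi/sales_load.ktr "-param:INPUT_DIR=/tmp/staging"
If you pass no value, the default value from the Parameters tab is used.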
VFS properties
You can specify VFS properties as parameters.
Specifying VFS properties as parameters
The format of the reference to a VFS property is vfs.scheme.property.host.
The following list describes the subparts of the format:
The vfs subpart is required to identify this as a virtual file system configuration property.
The scheme subpart represents the VFS driver's scheme (or VFS type), such as HTTP, SFTP, or ZIP.
The property subpart is the name of a VFS driver's ConfigBuilder's setter (the specific VFS element that you want to set).
The host optionally defines a specific IP address or hostname that this setting applies to.
You must consult each scheme's API reference to determine which properties you can create variables for.
Apache provides VFS scheme documentation at https://commons.apache.org/proper/commons-vfs/commons-vfs2/apidocs/.
The org.apache.commons.vfs.provider package lists each of the configurable VFS providers (FTP, HTTP, SFTP, and others).
Each provider has a FileSystemConfigBuilder class that in turn has set*(FileSystemOptions, Object) methods.
If a method's second parameter is a String or a number (Integer, Long, and others), then you can create a PDI variable to set the value for VFS dialog boxes.
The table below explains VFS properties for the SFTP scheme.
Each property must be declared as a PDI variable and preceded by the vfs.sftp prefix as defined above.
compression
Specifies whether ZLIB compression is used for the destination files. Possible values are zlib and none.
identity
The private key file (fully qualified local or remote path and filename) to use for host authentication.
authkeypassphrase
The passphrase for the private key specified by the identity property.
StrictHostKeyChecking
If this is set to no, the certificate of any remote host will be accepted. If set to yes, the remote host must exist in the known hosts file (~/.ssh/known_hosts).
Note: All of these properties are optional.
The following examples show how to specify parameters as VFS properties:
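The entries below are illustrative: the property names follow the vfs.sftp prefix described above, while the host name and file path are placeholders.
vfs.sftp.compression = zlib
vfs.sftp.identity.sftp.example.com = /home/pdiuser/.ssh/id_rsa
vfs.sftp.StrictHostKeyChecking = no
The first and third entries omit the optional host subpart, so they apply to every SFTP connection; the second applies only to sftp.example.com.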

Configure SFTP VFS
To configure the connection settings for SFTP dialog boxes in PDI, you must create either variables or parameters for each relevant value.
Possible values are determined by the VFS driver you are using.
You can use parameters to substitute VFS connection details, then use them in the VFS dialog box.
For example, assuming the parameters have been set:
sftp://${username}@${host}/${path}
This technique enables you to hide sensitive connection details, such as usernames and passwords.
You can see examples of these techniques in the VFS Configuration Sample transformation in the /data-integration/samples/transformations/ directory.
Variables
A variable is user-supplied information used dynamically in different scopes.
Variables can be local to a step or available to the whole JVM.
You can define variables in these ways:
Set Variable step or Set Session Variables step.
The kettle.properties file.
Edit > Set Environment Variables.
You can reference variables in fields using:
${VARIABLE}
%%VARIABLE%%
If a variable name collides with a parameter or argument, the parameter or argument takes precedence.
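For example, a Text File Input filename of /data/${REGION}/sales.csv picks up the current value of REGION at run time (the path and variable name here are hypothetical).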
Environment variables
Environment variables are traditional variables in PDI.
You define them in Edit > Set Environment Variables.
Or you pass JVM system properties using the -D flag.
Environment variables are not safe for dynamic concurrent runs.
They are visible to all software running in the same JVM.
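A minimal sketch of the -D approach, assuming the standard spoon.sh launcher honors PENTAHO_DI_JAVA_OPTIONS and using a made-up variable name:
export PENTAHO_DI_JAVA_OPTIONS="-DSAMPLE_ENV_VAR=/tmp/work"
./spoon.sh
Because the value is visible to the whole JVM, every job and transformation running in that client sees the same value, which is why concurrent runs should not depend on it.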
Kettle variables
Kettle variables are scoped to Kettle and can be limited to a job or transformation.
You can set Kettle variables in these ways:
Set Kettle variables in the PDI client
Select Edit > Edit the kettle.properties file.
Update existing values.
To add a variable:
Right-click a line number and select Insert before this row or Insert after this row.
Enter the variable name and value.
To reorder, right-click the line and select Move Up or Move Down.
Select OK.
Set Kettle variables manually
Open kettle.properties in a text editor.
Edit the values.
Save the file.
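A kettle.properties fragment with made-up names and values, for illustration:
# Shared connection details, available to every transformation and job
STAGING_DB_HOST=dbserver.example.com
STAGING_DB_PORT=5432
You can then reference ${STAGING_DB_HOST} and ${STAGING_DB_PORT} in dialog fields.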
Set Kettle or Java environment variables in the Pentaho MapReduce job entry
MapReduce jobs run distributed across nodes.
Set node-specific variables in the MapReduce job entry.
Double-click the Pentaho MapReduce job entry, then select User Defined.
In Name, set a variable name:
Kettle variable: for example, KETTLE_SAMPLE_VAR.
Java system property: prefix the name with java.system. (for example, java.system.SAMPLE_PATH_VAR).
In Value, enter the variable value.
Select OK.
Set the LAZY_REPOSITORY variable in the PDI client
This variable restores repository directory-loading behavior from before Pentaho 6.1.
Select Edit > Edit the kettle.properties file.
Set KETTLE_LAZY_REPOSITORY=true.
Select OK, then restart the PDI client.
Internal variables
These variables are always defined (example values are shown for each variable in this section):
Internal.Kettle.Build.Date
2010/05/22 18:01:39
Internal.Kettle.Build.Version
2045
Internal.Kettle.Version
4.3
These variables are defined in a transformation:
Internal.Transformation.Filename.Directory
D:\Kettle\samples
Internal.Transformation.Filename.Name
Denormaliser - 2 series of key-value pairs.ktr
Internal.Transformation.Name
Denormaliser - 2 series of key-value pairs sample
Internal.Transformation.Repository.Directory
/
These variables are defined in a job:
Internal.Job.Filename.Directory
file:///home/matt/jobs
Internal.Job.Filename.Name
Nested jobs.kjb
Internal.Job.Name
Nested job test case
Internal.Job.Repository.Directory
/
These variables are defined in transformations and jobs within a project:
Internal.Project.Data.Directory
pvfs://Repository/home/admin/projects/myproject/
Internal.Project.Execution.Directory
/home/admin/projects/myproject
Internal.Project.Name
My project
Internal.Project.Description
Description of my project
Project directory variables are used in different situations.
Internal.Project.Data.Directory is used by steps that are not repository-aware, like Text File Input.
Internal.Project.Execution.Directory is used by repository-aware steps, like Transformation Executor.
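For example, a Text File Input step can read ${Internal.Project.Data.Directory}sales.csv so the path follows the project wherever it is deployed (the file name is illustrative).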
These variables are defined in a clustered transformation on a slave server:
Internal.Slave.Transformation.Number
0..<cluster size-1>
Internal.Cluster.Size
<cluster size>
Partitioning data
Partitioning distributes rows into subsets according to a rule.
Use partitioning to scale up (more CPU cores) and scale out (multiple servers).
Get started
By default, each step in a transformation runs in parallel in a separate thread.
Partitioning during data processing
You can scale up using Change Number of Copies to Start.
This creates multiple copies of a step at runtime.
Without partition rules, parallel aggregation can produce incorrect results.
Understand repartitioning logic
When a step needs to repartition data, it creates buffers from each source copy to each target copy.
Partitioning applies a rule-based distribution so like rows go to the same copy.
Partitioning data over tables
The Table Output step supports partitioning rows to different tables.
It can accept the table name from a Partitioning field.
You can also partition per month or per day.
Use partitioning
Partitioning methods can be based on any criteria.
You can also use a partitioning plugin.
Set up a partition schema.
Apply the schema to a step.
Select a partitioning method for row distribution.
Use data swimlanes
When a partitioned step passes data to another partitioned step with the same schema, data stays in swimlanes.
No repartitioning is needed.
Rules for partitioning
These rules drive distribution and buffer allocation:
A partitioned step runs one copy per partition.
Repartitioning creates buffers from each source copy to each target copy.
Non-partitioned to partitioned causes repartitioning.
Same schema between partitioned steps avoids repartitioning.
Different schemas between partitioned steps causes repartitioning.
Partitioning clustered transformations
Partitioning can scale out on a cluster of slave servers.
Keep repartitioning to a minimum to avoid network overhead.
Try to keep data in swimlanes for as long as possible.
Learn more
Logging and performance monitoring
You can use logging and performance monitoring to troubleshoot, tune, and plan capacity.
You can also run an impact analysis from Action > Impact.
Set up transformation logging
Follow these steps to create a log table for a transformation:
Ask your system administrator to create a database or table space called PdiLog.
Open transformation properties (Ctrl+T).
Select the Logging tab.
Configure connection, schema, and table name.
Select fields to log.
Select SQL, then execute the generated SQL.
For effective deletion of expired logs, keep LOGDATE and TRANSNAME enabled.
Monitoring LOG_FIELD can negatively affect Pentaho Server performance.
Set up job logging
Follow these steps to create a log table for a job:
Ask your system administrator to create a database or table space called PdiLog.
Open job properties (Ctrl+T).
Select the Log tab.
Configure connection, schema, and table name.
Select fields to log.
Select SQL, then execute the generated SQL.
For effective deletion of expired logs, keep LOGDATE and JOBNAME enabled.
Logging levels
Nothing
No logging.
Error
Only errors.
Minimal
Minimal logging.
Basic
Basic logging; the default level.
Detailed
Detailed logging output.
Debug
Very detailed output for debugging.
Row Level
Row-level logging. Generates a lot of log data.
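From the command line, Pan and Kitchen accept a matching -level option. A hedged example with a placeholder file path:
pan.sh -file=/home/pdi/sales_load.ktr -level=Detailed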
Monitor performance
Use these tools:
Sniff Test tool
The Sniff Test displays row data as it travels from one step to another.
It is designed as a supplement to logs.
Sniff Test slows transformation run speed.
Right-click a step while the transformation runs.
Select Sniff Test During Execution.
Select an option:
Sniff test input rows
Sniff test output rows
Sniff test error handling
Monitoring tab
Enable step performance monitoring in transformation properties:
Open transformation properties (Ctrl+T).
Select Enable step performance monitoring?
Step performance monitoring can increase memory consumption in long-running transformations.
Use performance graphs
If you configured performance monitoring with database logging, you can view performance graphs.
PDI performance tuning tips
The following tips can help diagnose performance issues.
Turn off compatibility mode (JavaScript)
Rewriting JavaScript to the new, non-compatible format is, in most instances, easy to do and makes scripts easier to work with and to read. By default, old JavaScript programs run in compatibility mode, meaning the step processes data as it did in a previous version. You may see a small performance drop because of the overhead of forcing compatibility. If you want to make use of the new architecture, disable compatibility mode and change the code as shown below:
intField.getInteger() > intField
numberField.getNumber() > numberField
dateField.getDate() > dateField
bigNumberField.getBigNumber() > bigNumberField
and so on.
Instead of Java methods, use the built-in library. Notice that the resulting program code is more intuitive. For example:
Checking for null is now:
field.isNull() > field==null
Converting a string to a date:
field.Clone().str2dat() > str2date(field)
and so on.
If you convert your code as shown above, you may get significant performance benefits.
Note: It is no longer possible to modify data in-place using the value methods. This was a design decision to ensure that no data with the wrong type would end up in the output rows of the step. Instead of modifying fields in-place, create new fields using the table at the bottom of the Modified JavaScript transformation.
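A sketch of this pattern in a Modified Java Script Value step, with hypothetical field names; the new field must also be declared in the fields table at the bottom of the step:
var full_name = first_name + ' ' + last_name;
Here first_name and last_name are incoming fields, and full_name becomes a new output field instead of an in-place modification.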
Combine steps (JavaScript)
One large JavaScript step runs faster than three consecutive smaller steps. Combining processes in one larger step helps to reduce overhead.
Avoid the JavaScript step or write a custom plugin (JavaScript)
Remember that while JavaScript is the fastest scripting language for Java, it is still a scripting language. If you do the same amount of work in a native step or plugin, you avoid the overhead of the JS scripting engine. This has been known to result in significant performance gains. It is also the primary reason why the Calculator step was created — to avoid the use of JavaScript for simple calculations.
Create a copy of a field (JavaScript)
No JavaScript is required for this; a Select Values step does the trick. You can specify the same field twice: once without a rename, and once (or more) with a rename. Another trick is to use B=NVL(A,A) in a Calculator step, where B is forced to be a copy of A.
Data conversion (JavaScript)
Consider performing conversions between data types (dates, numeric data, and so on) in a Select Values step. You can do this in the Metadata tab of the step.
Variable creation (JavaScript)
If you have variables that can be declared once at the beginning of the transformation, put them in a separate script and mark that script as a startup script (right-click the script name in the tab). JavaScript object creation is time consuming, so avoiding the creation of a new object for every row you transform translates into a performance boost for the step.
Launch several copies of a step
There are two important reasons why launching multiple copies of a step may result in better performance:
1. The step uses a lot of CPU resources and you have multiple processor cores in your computer. Example: a JavaScript step.
2. Network latency limits throughput, and multiple copies can reduce the average latency. With a round-trip latency to the database of, say, 5 ms, a single copy cannot exceed about 200 rows per second (1000 / 5), even if the database is running smoothly; five copies raise that ceiling roughly fivefold. You can try to reduce the round trips with caching; if that is not possible, run multiple copies. Example: a database lookup or table output.
Manage thread priorities
This feature, found in the Transformation settings dialog box under the Miscellaneous tab, improves performance by reducing the locking overhead in certain situations. It is enabled by default for new transformations created in recent versions, but older transformations may behave differently.
If possible, don't remove fields in Select Values
Don't remove fields in Select Values unless you must. It is a CPU-intensive task, as the engine needs to reconstruct the complete row. It is almost always faster to add fields to a row than to delete fields from one.
Watch your use of Get Variables
This step may cause bottlenecks if you use it in a high-volume stream (accepting input). To solve the problem, take the Get Variables step out of the transformation (right-click, Detach), then insert it with a Join Rows step. Make sure to specify the main step from which to read in the Join Rows step: set it to the step that originally fed the Get Variables step with data.
Use new text file input
The CSV File Input and Fixed File Input steps provide optimal performance. If you have a fixed-width (field/row) input file, you can even read data in parallel using multiple copies. These steps have been rewritten using non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.
When appropriate, use lazy conversion
When you read data from a text file and write it back to a text file, use lazy conversion to speed up the process. The principle behind lazy conversion is that it delays data conversion in the hope that the conversion never becomes necessary (reading from a file and writing it back is the classic case). Beyond avoiding conversion, lazy conversion also keeps the data in binary storage form, which helps the internal Kettle engine perform faster data serialization (sort, clustering, and so on). The Lazy conversion option is available in the CSV File Input and Fixed File Input steps.
Use Join Rows
You need to specify the main step from which to read; this prevents the step from performing unnecessary spooling to disk. If you are joining with a set of data that fits into memory, make sure the cache size (in rows of data) is large enough to prevent slow spooling to disk.
Review the big picture: database, commit size, row set size, and other factors
Consider how the whole environment influences performance. There can be limiting factors in the transformation itself and limiting factors that result from other applications and PDI. Performance depends on your database, your tables, indexes, the JDBC driver, your hardware, speed of the LAN connection to the database, the row size of data and your transformation itself. Test performance using different commit sizes and changing the number of rows in row sets in your transformation settings. Change buffer sizes in your JDBC drivers or database.
Step Performance Monitoring
Step performance monitoring is an important tool that allows you to identify the slowest step in your transformation.
Logging best practices
You can improve logging with rotation and other best practices.
For more information, see the Administer Pentaho Data Integration and Analytics document.
Use checkpoints to restart jobs
Checkpoints let you restart jobs that fail without rerunning from the beginning.
Add a checkpoint
Open a job.
Right-click an entry, then select Restartable Checkpoint.
Delete a checkpoint
Open a job.
Right-click an entry, then select Clear Checkpoint Marker.
Set up a checkpoint log
Open a job.
Right-click the job canvas, then select Properties.
Select the Log tab, then select Checkpoints log table.
Configure the log connection and table name.
Select OK.
Use the SQL Editor
Use the SQL Editor to preview and execute DDL generated by the PDI client.
Keep these points in mind:
Separate statements with semicolons.
Spoon removes line breaks and semicolons before execution.
PDI clears database cache for the connection you execute on.
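For instance, two statements in one editor session, separated by a semicolon (the table and index names are hypothetical):
CREATE TABLE trans_log (id_batch INTEGER, logdate TIMESTAMP);
CREATE INDEX idx_trans_log_logdate ON trans_log (logdate);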
Use the Database Explorer
Use Database Explorer to explore configured database connections.
You open it from the Database Connections dialog box.
Transactional databases and job rollback
By default, changes are committed as a job or transformation executes.
If you need rollback behavior, make the database transactional.
Make a transformation database transactional
Open a transformation.
Open transformation properties.
Select the Miscellaneous tab.
Select Make the transformation database transactional.
Select OK.
Make a job database transactional
Open a job.
Open job properties.
Select the Transactions tab.
Select Make the job database transactional.
Select OK.
Add notes to transformations and jobs
Notes help document structure, design decisions, business rules, and dependencies.
Create a note
Right-click the canvas and select New Note.
Select Font Style to change font and color.
Select Note, then type your note.
Select OK.
Edit a note
Double-click the note.
Select Font Style to change font and color.
Select Note, then edit your note.
Select OK.
Reposition a note
Drag the note to a new position.
Optional: right-click the note:
Select Raise Note to move it above other items.
Select Lower Note to move it below other items.
Delete a note
Right-click the note.
Select Delete Note.
Manage PDI transformations and job schedules
Schedule transformations and jobs to run at specific times or intervals.
Schedule a transformation or job
Connect to the Pentaho Repository.
Open a job or transformation, then select Action > Schedule.
Set start time:
Now, or
Date with date, time, and time zone.
Set repeat schedule.
Set end time or select No end.
Optional: enable Safe mode.
Select Log Level.
Update argument, parameter, or variable values if needed.
Select OK.
Edit a scheduled run
Select the Scheduler perspective.
Select the schedule, then select Edit Scheduled Task.
Stop a schedule
Select the Scheduler perspective.
Select the schedule, then select Stop Scheduled Task.
Enable or disable a schedule
Select the Scheduler perspective.
Select the schedule.
Select Start Scheduler to enable.
Select Stop Scheduler to disable.
Delete a scheduled run
Select the Scheduler perspective.
Select the schedule, then select Remove.
Refresh the schedule list
Select the Scheduler perspective.
Select Refresh.
Archived source pages
These pages were merged into this single topic page and moved under Transforming data with PDI (archive).