Python Executor
The Python Executor step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.
This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.
You can run Python in either of two processing modes:
Row-by-row processing: PDI maps each incoming row to Python variables and runs the script once per row.
All-rows processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.
This step supports the CPython runtime only.
Before you begin
Install the following Python libraries before using this step:
pandas (1.5.3 or later)
NumPy (1.24.2 or later)
Py4J (0.10.9.7 or later)
matplotlib (3.7.1 or later)
If you install Python using Anaconda, all required libraries are installed.
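One way to confirm these prerequisites is to run a short check with the same Python interpreter that PDI will use. The following is a minimal sketch, not part of PDI: `REQUIRED`, `parse_version`, and `check_requirements` are hypothetical names, and the version comparison only looks at the leading numeric part of each version component.

```python
import re
from importlib import metadata

# Minimum versions listed above, keyed by distribution name.
REQUIRED = {
    "pandas": "1.5.3",
    "numpy": "1.24.2",
    "py4j": "0.10.9.7",
    "matplotlib": "3.7.1",
}


def parse_version(text):
    """Turn '1.24.2' into (1, 24, 2), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(required=REQUIRED, lookup=metadata.version):
    """Return a status string ('ok', 'too old (...)', or 'missing') per library."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script prints one line per library, so a missing or outdated package is visible before the transformation runs.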
Step name
Step name specifies the unique name of the step on the canvas. You can change it.
Configure the step (tabs)
Script tab
Use this tab to specify whether you will embed a script or link to a script file.
Script source
Embed (default): Runs the script entered in Manual Python script.
Link from file: Runs a Python script loaded from a file system (including virtual file systems).
Manual Python script (Embed): Python script to embed and run.
Location (Link from file): File system or cluster where the script is located. See the VFS browser for supported file systems.
File name (Link from file): Fully qualified URL of the script file. Select Browse to locate it in the VFS browser. Supports variable substitution using ${...}.
Use a Python virtual environment: Uses a specific Python executable instead of the default. When selected, specify the Python executable path (supports ${...}) or select Browse to locate it.
Input tab
Use this tab to move data from PDI fields to Python variables.
Choose one processing mode:
Row by row (standard PDI streaming behavior)
All rows (dataset-style processing, often used with data frames)
Row by row
When you select Row by row, the script runs once for each incoming row. Each row’s fields are mapped to Python variables defined in the Mapping table.
When using the PDI engine, you can include multiple input steps only if they share the same schema (same field order and data types). If you use multiple input steps with different schemas, the step fails.
The Mapping table contains:
Variable: Python variable name.
Python data type: Python data type for the variable (for example, str, int, float). See Map data types from PDI to Python.
PDI field: Incoming PDI field mapped to the Python variable.
PDI data type: PDI data type of the incoming field.
Select Get fields to populate the mapping table from upstream fields.
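As an illustration of Row by row processing, the sketch below shows the kind of script that could run once per incoming row. It assumes two hypothetical entries in the Mapping table, `quantity` (int) and `unit_price` (float). The assignments at the top only stand in for the values PDI would supply and would not appear in a real script.

```python
# Stand-ins for variables PDI would populate from the current row.
# The names and values are hypothetical; in the step they come from the Mapping table.
quantity = 3
unit_price = 19.99

# Script body: executed once for each incoming row.
line_total = round(quantity * unit_price, 2)
is_bulk_order = quantity >= 10
```

In this sketch, `line_total` and `is_bulk_order` are the variables you would then map to PDI fields on the Output tab.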
All rows
When you select All rows, the step accumulates all incoming rows and sends them to Python as a dataset.
Because the full dataset is held in memory until all rows have arrived, large datasets (on the order of gigabytes or more) can significantly increase memory requirements.
Available variables
Select the plus button to add an input variable. Remove a variable by selecting the X icon.
Variable name: Python variable name receiving the dataset.
Step: Name of the input step to map from (a step with an outgoing hop into Python Executor).
Data structure: Dataset structure to use in Python: pandas DataFrame, NumPy array, or Python list of dictionaries.
The Mapping table contains:
Data structure field: Name of the field in the selected Python data structure.
Data structure type: Data type for that field in the selected structure. See Map data types from PDI to Python.
PDI field: PDI field mapped into the dataset.
PDI data type: PDI data type of the field.
Select Get fields to populate the mapping table.
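As an illustration of All rows processing, the sketch below assumes the Data structure is set to Python list of dictionaries and that the dataset arrives in a variable named `orders` with fields `region` and `amount`. These names and the sample rows are hypothetical stand-ins for what PDI would pass in.

```python
from collections import defaultdict

# Stand-in for the dataset PDI would pass in (hypothetical variable name and fields).
orders = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "APAC", "amount": 75.5},
    {"region": "EMEA", "amount": 30.0},
]

# Sum the amount per region across the whole dataset.
totals = defaultdict(float)
for row in orders:
    totals[row["region"]] += row["amount"]

# Build one dictionary per output row, ordered by region name.
region_totals = [
    {"region": region, "total_amount": amount}
    for region, amount in sorted(totals.items())
]
```

Unlike a row-by-row script, this script runs once and sees every row, which is what makes the aggregation possible.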
Output tab
Use this tab to move data from Python variables back to PDI fields.
You can output results as:
Variable to fields: individual Python variables (built-in types)
Frames to fields: a pandas DataFrame or a Python list of dictionaries
Variable to fields
Use this option when your script outputs separate variables (for example, numerics, strings, Booleans).
Selecting Get fields executes your Python script using random input values to infer output variables. If your script is long-running or requires specific input data, define the output mapping manually.
Variable: Python variable name.
Python data type: Variable data type. See Map data types from Python to PDI.
PDI field: Output PDI field.
PDI data type: Output PDI data type.
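Because Get fields runs the script with random input values, a script that assumes particular input values may fail while the outputs are being inferred. The sketch below shows one defensive pattern. The variable names are hypothetical, and the guard is a suggestion rather than a requirement of the step.

```python
# Stand-ins for mapped input variables (hypothetical names). During Get fields
# these would hold arbitrary values rather than real data.
total_sales = 0.0
order_count = 0

# Guard against values that real data might never contain, such as a zero count.
if order_count > 0:
    average_order = total_sales / order_count
else:
    average_order = 0.0

# Output variables use built-in types so they can be mapped with Variable to fields.
has_orders = order_count > 0
```

Here `average_order` (float) and `has_orders` (bool) are defined on every path through the script, so they exist whichever input values are supplied.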
Frames to fields
Use this option when your script outputs a pandas DataFrame or a Python list of dictionaries.
Data structure field: Field in the output data structure.
Data structure type: Data type of that field. See Map data types from Python to PDI.
PDI field: Output PDI field.
PDI data type: Output PDI data type.
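As an illustration of Frames to fields, the sketch below builds an output pandas DataFrame whose columns could each be mapped to a PDI field. The input frame `orders_df`, its columns, and the output name `summary_df` are hypothetical. In the step, the input frame would come from an All rows mapping on the Input tab.

```python
import pandas as pd

# Stand-in for a DataFrame PDI would pass in under All rows
# (hypothetical variable name and columns).
orders_df = pd.DataFrame(
    {
        "region": ["EMEA", "APAC", "EMEA"],
        "amount": [120.0, 75.5, 30.0],
        "units": [2, 1, 5],
    }
)

# Aggregate to one row per region; each resulting column is a candidate
# for a row in the Frames to fields mapping table.
summary_df = (
    orders_df.groupby("region", as_index=False)
    .agg(total_amount=("amount", "sum"), total_units=("units", "sum"))
)
```

The resulting frame has the columns `region`, `total_amount`, and `total_units`, one per output field.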
Map data types from PDI to Python
Use the most precise mappings possible. The following table lists common mappings.
Python data structure | PDI data type | Python data type
pandas DataFrame | BigNumber | float64
pandas DataFrame | Boolean | bool
pandas DataFrame | Date | datetime64[ns]
pandas DataFrame | Integer | int64
pandas DataFrame | Number | float64
pandas DataFrame | String | object
pandas DataFrame | Timestamp | datetime64[ns]
NumPy array | BigNumber | float64
NumPy array | Boolean | bool
NumPy array | Integer | int64
NumPy array | Number | float64
Basic Python types | BigNumber | float
Basic Python types | Boolean | bool
Basic Python types | Integer | int
Basic Python types | Number | float
Basic Python types | String | str
Basic Python types | Timestamp | datetime
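The Basic Python types rows above can be written as a lookup, which may help when checking a mapping by hand. This is a sketch only: `PDI_TO_PYTHON` and `matches_mapping` are hypothetical helper names and are not part of PDI.

```python
from datetime import datetime

# Basic Python type expected for each PDI data type, as listed in the table above.
PDI_TO_PYTHON = {
    "BigNumber": float,
    "Boolean": bool,
    "Integer": int,
    "Number": float,
    "String": str,
    "Timestamp": datetime,
}


def matches_mapping(pdi_type, value):
    """Return True if value has the Python type listed for pdi_type."""
    expected = PDI_TO_PYTHON[pdi_type]
    # bool is a subclass of int in Python, so reject True/False for Integer.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)
```

For example, `matches_mapping("Integer", 42)` is true, while `matches_mapping("Number", 42)` is false because the table lists float for Number.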
Map data types from Python to PDI
Use the most precise mappings possible. The following table lists common mappings.
Python data structure | Python data type | PDI data type
pandas DataFrame | bool | Boolean
pandas DataFrame | datetime64[ns] | Timestamp
pandas DataFrame | float64 | BigNumber
pandas DataFrame | int64 | Integer
pandas DataFrame | object | String
Basic Python types | bool | Boolean
Basic Python types | datetime | Timestamp
Basic Python types | float | BigNumber
Basic Python types | int | Integer
Basic Python types | str | String
In the Output tab, you can also convert a matplotlib figure to an SVG string.
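The step performs this conversion itself. For reference, a comparable conversion in plain Python might look like the sketch below. It shows the general matplotlib technique and is an assumption about the approach, not the step's actual implementation. The figure contents and variable names are hypothetical.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, suitable for headless runs.
import matplotlib.pyplot as plt

# Build a small example figure.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Example")

# Render the figure as SVG into an in-memory text buffer.
buffer = io.StringIO()
fig.savefig(buffer, format="svg")
svg_text = buffer.getvalue()

# Release the figure's resources once the SVG text has been captured.
plt.close(fig)
```

The resulting `svg_text` is an ordinary Python string containing the SVG markup.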