Python Executor

The Python Executor step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.

This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.

You can run Python in one of two processing modes:

Row-by-row processing: PDI maps each incoming row to Python variables and runs the script once per row.
All-rows processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.

This step supports the CPython runtime only.

Before you begin

Install the following Python libraries before using this step:

pandas (1.5.3 or later)
NumPy (1.24.2 or later)
Py4J (0.10.9.7 or later)
matplotlib (3.7.1 or later)

If you install Python using Anaconda, all required libraries are installed.
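Before configuring the step, it can help to confirm that the interpreter you plan to use has the libraries listed above at the required versions. The following is a minimal sketch, not part of the product: the helper names `version_key` and `check_requirements` are made up for this example, and the version comparison is deliberately rough (it looks only at the leading digits of each dot-separated part).

```python
import re
from importlib import metadata

# Minimum versions taken from the list above; keys are distribution names.
REQUIRED = {
    "pandas": "1.5.3",
    "numpy": "1.24.2",
    "py4j": "0.10.9.7",
    "matplotlib": "3.7.1",
}


def version_key(version):
    """Turn '1.24.2' into (1, 24, 2) using the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first part that has no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(required=REQUIRED):
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        meets = version_key(installed) >= version_key(minimum)
        report[name] = (installed, meets)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        status = "OK" if ok else "install or upgrade needed"
        print(f"{name}: {installed or 'not installed'} -> {status}")
```

Run the script with the same Python executable that the step will use, so that the report reflects that interpreter's installed packages.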

Step name

Step name specifies the unique name of the step on the canvas. You can change it.

Configure the step (tabs)

Script tab

Use this tab to specify whether you will embed a script or link to a script file.

Script source

Embed (default): Runs the script entered in Manual Python script.
Link from file: Runs a Python script loaded from a file system (including virtual file systems).

| Option | Description |
| --- | --- |
| Manual Python script (Embed) | Python script to embed and run. |
| Location (Link from file) | File system or cluster where the script is located. See the VFS browser for supported file systems. |
| File name (Link from file) | Fully qualified URL of the script file. Select Browse to locate it in the VFS browser. Supports variable substitution using ${...}. |
| Use a Python virtual environment | Use a specific Python executable instead of the default. When selected, specify the Python executable path (supports ${...}) or select Browse to locate it. |
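When you point the step at a specific Python executable, a short diagnostic script can confirm which interpreter actually ran. The sketch below uses only the standard library. The function name `describe_interpreter` is hypothetical, and the virtual environment check covers environments created with `venv` or `virtualenv` only; conda environments are not detected this way.

```python
import sys


def describe_interpreter():
    """Summarize which Python interpreter is executing this script."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        # 'cpython' when running on the CPython runtime
        "implementation": sys.implementation.name,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtual_env": sys.prefix != sys.base_prefix,
    }


interpreter_info = describe_interpreter()
for key, value in interpreter_info.items():
    print(f"{key}: {value}")
```

Running this once as an embedded script is a quick way to check that the executable path you entered is the one being used.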

Input tab

Use this tab to move data from PDI fields to Python variables.

Choose one processing mode:

Row by row (standard PDI streaming behavior)
All rows (dataset-style processing, often used with data frames)

Row by row

When you select Row by row, the script runs once for each incoming row. Each row’s fields are mapped to Python variables defined in the Mapping table.

When using the PDI engine, you can include multiple input steps only if they share the same schema (same field order and data types). If you use multiple input steps with different schemas, the step fails.

| Mapping column | Description |
| --- | --- |
| Variable | Python variable name. |
| Python data type | Python data type for the variable (for example, str, int, float). See Map data types from PDI to Python. |
| PDI field | Incoming PDI field mapped to the Python variable. |
| PDI data type | PDI data type of the incoming field. |

Select Get fields to populate the mapping table from upstream fields.
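As an illustration of Row by row processing, the sketch below shows what a per-row script might look like. The variable names (`quantity`, `unit_price`, `customer`) are hypothetical and stand for entries you would define in the Mapping table. In the step, PDI would assign them before each run; here they are set by hand so the example runs on its own.

```python
# Stand-in values: in the step, PDI would assign these from the Mapping table
# before each run of the script. All three names are hypothetical.
quantity = 3                # PDI Integer -> Python int
unit_price = 19.99          # PDI Number  -> Python float
customer = "  acme corp "   # PDI String  -> Python str

# Script body: this part would run once for each incoming row.
line_total = round(quantity * unit_price, 2)
customer_clean = customer.strip().title()
is_bulk_order = quantity >= 10

print(line_total, customer_clean, is_bulk_order)
```

Because the script sees only one row at a time, it suits per-row calculations like these rather than aggregations across rows.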

All rows

When you select All rows, the step accumulates all incoming rows and sends them to Python as a dataset.

With All rows, the full dataset accumulates in memory until all rows are available. For large datasets (gigabytes or more), this can significantly increase memory requirements.
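The sketch below illustrates the kind of script that All rows processing makes possible when the dataset arrives as a pandas DataFrame. The variable name `orders` and its columns are hypothetical. In the step, the DataFrame would be built by PDI from the incoming rows; here it is built by hand so the example is self-contained.

```python
import pandas as pd

# Stand-in for the dataset PDI would pass in. `orders` is a hypothetical
# name of the kind you would enter in the Variable name option.
orders = pd.DataFrame(
    {
        "region": ["north", "south", "north", "east"],
        "amount": [120.0, 80.5, 42.25, 10.0],
    }
)

# Every row is available at once, so the script can aggregate across rows,
# which a script that runs once per row cannot do.
region_totals = (
    orders.groupby("region", as_index=False)["amount"]
    .sum()
    .sort_values("region", ignore_index=True)
)

print(region_totals)
```

A result such as `region_totals` is itself a DataFrame, so it is the kind of value you would return through Frames to fields on the Output tab.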

| Option | Description |
| --- | --- |
| Available variables | Select the plus button to add an input variable. Remove a variable by selecting the X icon. |
| Variable name | Python variable name receiving the dataset. |
| Step | Name of the input step to map from (a step with an outgoing hop into Python Executor). |
| Data structure | Dataset structure to use in Python: pandas DataFrame, NumPy array, or Python list of dictionaries. |

The Mapping table contains:

| Mapping column | Description |
| --- | --- |
| Data structure field | Name of the field in the selected Python data structure. |
| Data structure type | Data type for that field in the selected structure. See Map data types from PDI to Python. |
| PDI field | PDI field mapped into the dataset. |
| PDI data type | PDI data type of the field. |

Select Get fields to populate the mapping table.

Output tab

Use this tab to move data from Python variables back to PDI fields.

You can output results as:

Variable to fields: individual Python variables (built-in types)
Frames to fields: a pandas DataFrame or a Python list of dictionaries

Variable to fields

Use this option when your script outputs separate variables (for example, numerics, strings, Booleans).

Selecting Get fields executes your Python script using random input values to infer output variables. If your script is long-running or requires specific input data, define the output mapping manually.
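Because Get fields runs the script with random input values, a script that tolerates unexpected input is easier to use with that button. The following sketch shows one defensive pattern. The names `raw_score` and `parse_score` are hypothetical, and the fallback values are an arbitrary choice for this example.

```python
# Stand-in input: in the step this would come from the Input tab mapping.
# `raw_score` is a hypothetical variable name.
raw_score = "87.5"


def parse_score(text):
    """Return (score, is_valid); never raises on unexpected input."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return 0.0, False
    # Reject NaN and infinities, which float() accepts as text.
    if value != value or value in (float("inf"), float("-inf")):
        return 0.0, False
    return value, True


# Output variables: each keeps one stable type (float, bool, str),
# so each can be mapped to a single PDI field on the Output tab.
score, score_is_valid = parse_score(raw_score)
grade = "pass" if score_is_valid and score >= 50 else "fail"

print(score, score_is_valid, grade)
```

Keeping each output variable at a fixed type regardless of the input means the inferred mapping and the mapping needed at run time are the same.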

| Mapping column | Description |
| --- | --- |
| Variable | Python variable name. |
| Python data type | Variable data type. See Map data types from Python to PDI. |
| PDI field | Output PDI field. |
| PDI data type | Output PDI data type. |

Frames to fields

Use this option when your script outputs a pandas DataFrame or a Python list of dictionaries.
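As a sketch of the list-of-dictionaries form, the example below builds output rows in which every dictionary has the same keys in the same order. The names `readings`, `FIELDS`, and `to_output_rows` are hypothetical. Filling gaps with `None` is an assumption made for this example; the source does not state how the step treats missing keys.

```python
# Hypothetical data produced by the script: one dictionary per row.
readings = [
    {"sensor": "a1", "celsius": 20.0},
    {"sensor": "b2", "celsius": 25.0},
    {"sensor": "c3"},  # reading is missing
]

# Field names that every output row should carry.
FIELDS = ("sensor", "celsius", "fahrenheit")


def to_output_rows(rows):
    """Give every row the same keys, in the same order, filling gaps with None."""
    result = []
    for row in rows:
        celsius = row.get("celsius")
        fahrenheit = None if celsius is None else celsius * 9 / 5 + 32
        complete = {**row, "fahrenheit": fahrenheit}
        result.append({name: complete.get(name) for name in FIELDS})
    return result


output_rows = to_output_rows(readings)
for row in output_rows:
    print(row)
```

Each key in `FIELDS` would then correspond to one Data structure field entry in the mapping table.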

| Mapping column | Description |
| --- | --- |
| Data structure field | Field in the output data structure. |
| Data structure type | Data type of that field. See Map data types from Python to PDI. |
| PDI field | Output PDI field. |
| PDI data type | Output PDI data type. |

Map data types from PDI to Python

Use the most precise mappings possible. The following table lists common mappings.

| Structure | PDI data type | Python data type |
| --- | --- | --- |
| pandas DataFrame | BigNumber | float64 |
| pandas DataFrame | Boolean | bool |
| pandas DataFrame | Date | datetime64[ns] |
| pandas DataFrame | Integer | int64 |
| pandas DataFrame | Number | float64 |
| pandas DataFrame | String | object |
| pandas DataFrame | Timestamp | datetime64[ns] |
| NumPy array | BigNumber | float64 |
| NumPy array | Boolean | bool |
| NumPy array | Integer | int64 |
| NumPy array | Number | float64 |
| Basic Python types | BigNumber | float |
| Basic Python types | Boolean | bool |
| Basic Python types | Integer | int |
| Basic Python types | Number | float |
| Basic Python types | String | str |
| Basic Python types | Timestamp | datetime |
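One practical consequence of these mappings is that BigNumber values arrive in Python as 64-bit floats, which hold roughly 15 to 17 significant decimal digits. The sketch below uses only the standard library and an arbitrary example value to show the rounding that can occur when a number with more digits is stored as a float. It illustrates float behavior in general, not the step's internal conversion.

```python
from decimal import Decimal

# An example value with more significant digits than a 64-bit float can hold.
exact_text = "12345678901234567.89"

as_decimal = Decimal(exact_text)  # keeps every digit
as_float = float(exact_text)      # rounds to the nearest representable double

# Converting the float back to Decimal exposes the rounding.
print("decimal:", as_decimal)
print("float  :", Decimal(as_float))

precision_lost = Decimal(as_float) != as_decimal
print("precision lost:", precision_lost)
```

If your data depends on digits beyond that range, check the values after they reach Python before relying on them in calculations.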

Map data types from Python to PDI

Use the most precise mappings possible. The following table lists common mappings.

| Structure | Python data type | PDI data type |
| --- | --- | --- |
| pandas DataFrame | bool | Boolean |
| pandas DataFrame | datetime64[ns] | Timestamp |
| pandas DataFrame | float64 | BigNumber |
| pandas DataFrame | int64 | Integer |
| pandas DataFrame | object | String |
| Basic Python types | bool | Boolean |
| Basic Python types | datetime | Timestamp |
| Basic Python types | float | BigNumber |
| Basic Python types | int | Integer |
| Basic Python types | str | String |
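When you define the output mapping manually, a small helper can suggest the PDI data type for each output variable from the basic Python types listed above. This is a sketch for planning a mapping, not something the step provides. The function name `pdi_type_for` and the sample variables are hypothetical.

```python
from datetime import datetime


def pdi_type_for(value):
    """Return the PDI type name that the mappings above pair with a basic Python value."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "BigNumber"
    if isinstance(value, datetime):
        return "Timestamp"
    if isinstance(value, str):
        return "String"
    return None  # type not covered by the listed mappings


# Hypothetical output variables from a script.
outputs = {
    "total": 12.5,
    "count": 3,
    "ok": True,
    "label": "done",
    "run_at": datetime(2024, 1, 2, 3, 4, 5),
}
suggested = {name: pdi_type_for(value) for name, value in outputs.items()}

for name, pdi_type in suggested.items():
    print(f"{name}: {pdi_type}")
```

The order of the checks matters: testing `int` before `bool` would label Boolean values as Integer.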

In the Output tab, you can also convert a matplotlib figure to an SVG string.
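For orientation, the sketch below shows what an SVG string of a matplotlib figure looks like when produced in plain Python with `Figure.savefig`. How the step performs its own conversion is not described here, so treat this as an illustration of the result rather than of the step's mechanism. The plotted data is made up.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend; no display is needed
import matplotlib.pyplot as plt

# Build a small figure from made-up data.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o")
ax.set_xlabel("week")
ax.set_ylabel("orders")

# Render the figure into an in-memory text buffer as SVG.
buffer = io.StringIO()
fig.savefig(buffer, format="svg")
plt.close(fig)

svg_text = buffer.getvalue()
print(len(svg_text), "characters of SVG markup")
```

The resulting value is an ordinary Python `str`, so downstream it behaves like any other string field.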
