# Python Executor

The **Python Executor** step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.

This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.

You can run Python with either:

* **Row-by-row** processing: PDI maps each incoming row to Python variables and runs the script once per row.
* **All-rows** processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.

{% hint style="info" %}
This step supports the **CPython** runtime only.
{% endhint %}

### Before you begin

Install the following Python libraries before using this step:

* [pandas](http://pandas.pydata.org/) (1.5.3 or later)
* [NumPy](http://www.numpy.org/) (1.24.2 or later)
* [Py4J](https://www.py4j.org/) (0.10.9.7 or later)
* [matplotlib](https://matplotlib.org/) (3.7.1 or later)

{% hint style="info" %}
If you install Python using Anaconda, all required libraries are installed.
{% endhint %}
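
Before adding the step to a transformation, it can help to confirm that the interpreter PDI will use meets these minimums. The following is a minimal sketch that uses only the standard library. The helper names `parse_version` and `check_requirements` are hypothetical, and the version parsing is deliberately simplified: it reads only the leading numeric components and ignores suffixes such as `rc1`.

```python
from importlib import metadata
import re

# Minimum versions listed above; keys are the distribution names on PyPI.
MINIMUMS = {
    "pandas": "1.5.3",
    "numpy": "1.24.2",
    "py4j": "0.10.9.7",
    "matplotlib": "3.7.1",
}


def parse_version(text):
    """Turn '1.24.2' into (1, 24, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: status}; status is 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = "ok" if ok else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Run the check with the same Python executable that the step is configured to use, because each interpreter or environment has its own set of installed packages.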

### Step name

**Step name** specifies the unique name of the step on the canvas. You can change it.

### Configure the step (tabs)

#### Script tab

Use this tab to choose whether to embed a script in the step or link to a script file.

**Script source**

* **Embed** (default): Runs the script entered in **Manual Python script**.
* **Link from file**: Runs a Python script loaded from a file system (including virtual file systems).

| Option                               | Description                                                                                                                                                                                                                                                           |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Manual Python script** (Embed)     | Python script to embed and run.                                                                                                                                                                                                                                       |
| **Location** (Link from file)        | File system or cluster where the script is located. See the [VFS browser](https://docs.pentaho.com/pdia-data-integration/archived-merged-pages/connecting-to-virtual-file-systems-archive/vfs-browser-connecting-to-virtual-file-systems) for supported file systems. |
| **File name** (Link from file)       | Fully qualified URL of the script file. Select **Browse** to locate it in the VFS browser. Supports variable substitution using `${...}`.                                                                                                                             |
| **Use a Python virtual environment** | Use a specific Python executable instead of the default. When selected, specify the Python executable path (supports `${...}`) or select **Browse** to locate it.                                                                                                     |
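
When **Use a Python virtual environment** is selected, a short script can report which interpreter actually ran. This is a minimal sketch using only the standard library. The variable names `python_executable`, `python_version`, and `in_virtual_env` are hypothetical, and they reach PDI only if you map them on the **Output** tab.

```python
import sys

# Absolute path of the interpreter running this script.
python_executable = sys.executable

# Version string such as "3.10.12".
python_version = ".".join(str(part) for part in sys.version_info[:3])

# In a venv-style environment, sys.prefix differs from sys.base_prefix.
in_virtual_env = sys.prefix != sys.base_prefix
```

The `sys.prefix` comparison detects environments created with `venv` or `virtualenv`. Conda environments generally do not change `sys.base_prefix`, so for those, rely on the reported executable path instead.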

#### Input tab

Use this tab to move data from PDI fields to Python variables.

Choose one processing mode:

* **Row by row** (standard PDI streaming behavior)
* **All rows** (dataset-style processing, often used with data frames)

**Row by row**

When you select **Row by row**, the script runs once for each incoming row. Each row’s fields are mapped to Python variables defined in the **Mapping** table.

{% hint style="warning" %}
When using the PDI engine, you can connect multiple input steps only if they share the **same schema** (same field order and data types). If their schemas differ, the step fails.
{% endhint %}

| Mapping column       | Description                                                                                                                                          |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Variable**         | Python variable name.                                                                                                                                |
| **Python data type** | Python data type for the variable (for example, `str`, `int`, `float`). See [Map data types from PDI to Python](#map-data-types-from-pdi-to-python). |
| **PDI field**        | Incoming PDI field mapped to the Python variable.                                                                                                    |
| **PDI data type**    | PDI data type of the incoming field.                                                                                                                 |

Select **Get fields** to populate the mapping table from upstream fields.
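
A row-by-row script sees only the variables mapped for the current row. The sketch below simulates that behavior outside PDI so that it can run on its own. The fields `quantity` and `unit_price`, the result variable `line_total`, and the `run_script` wrapper are all hypothetical. Inside the step, PDI supplies the mapped variables and the per-row loop is implicit.

```python
# Stand-in for rows arriving from an upstream step (hypothetical fields).
incoming_rows = [
    {"quantity": 2, "unit_price": 9.5},
    {"quantity": 1, "unit_price": 20.0},
    {"quantity": 4, "unit_price": 0.25},
]


def run_script(quantity, unit_price):
    """Body of the per-row script: mapped variables in, result variable out."""
    line_total = quantity * unit_price
    return line_total


# PDI runs the script once per row; this loop imitates that behavior.
line_totals = [run_script(**row) for row in incoming_rows]
print(line_totals)
```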

**All rows**

When you select **All rows**, the step accumulates all incoming rows and sends them to Python as a dataset.

{% hint style="warning" %}
With **All rows**, the step holds every incoming row in memory until the last row arrives. For large datasets (a gigabyte or more), this can significantly increase memory requirements.
{% endhint %}

| Option                  | Description                                                                                                    |
| ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Available variables** | Select the plus button to add an input variable. Remove a variable by selecting the X icon.                    |
| **Variable name**       | Python variable name receiving the dataset.                                                                    |
| **Step**                | Name of the input step to map from (a step with an outgoing hop into Python Executor).                         |
| **Data structure**      | Dataset structure to use in Python: **pandas DataFrame**, **NumPy array**, or **Python list of dictionaries**. |

The **Mapping** table contains:

| Mapping column           | Description                                                                                                                      |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| **Data structure field** | Name of the field in the selected Python data structure.                                                                         |
| **Data structure type**  | Data type for that field in the selected structure. See [Map data types from PDI to Python](#map-data-types-from-pdi-to-python). |
| **PDI field**            | PDI field mapped into the dataset.                                                                                               |
| **PDI data type**        | PDI data type of the field.                                                                                                      |

Select **Get fields** to populate the mapping table.
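
In **All rows** mode, the script works on the whole dataset at once. The following is a minimal sketch that assumes the data structure is a pandas DataFrame and the variable name is `df`. The column names and sample data are hypothetical, and the DataFrame is built by hand only so that the snippet runs outside PDI.

```python
import pandas as pd

# Stand-in for the dataset PDI would pass in as the variable `df`.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "sales": [100.0, 250.0, 50.0, 150.0],
    }
)

# Whole-dataset operation that needs every row at once:
# each row's share of its region's total sales.
df["region_total"] = df.groupby("region")["sales"].transform("sum")
df["share"] = df["sales"] / df["region_total"]

print(df)
```

A calculation like this depends on a group total that is known only after all rows have arrived, which is the kind of work this mode is suited to.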

#### Output tab

Use this tab to move data from Python variables back to PDI fields.

You can output results as:

* **Variable to fields**: individual Python variables (built-in types)
* **Frames to fields**: a pandas DataFrame or a Python list of dictionaries

**Variable to fields**

Use this option when your script outputs separate variables of built-in types (for example, numbers, strings, or Booleans).

{% hint style="info" %}
Selecting **Get fields** executes your Python script using random input values to infer output variables. If your script is long-running or requires specific input data, define the output mapping manually.
{% endhint %}

| Mapping column       | Description                                                                                      |
| -------------------- | ------------------------------------------------------------------------------------------------ |
| **Variable**         | Python variable name.                                                                            |
| **Python data type** | Variable data type. See [Map data types from Python to PDI](#map-data-types-from-python-to-pdi). |
| **PDI field**        | Output PDI field.                                                                                |
| **PDI data type**    | Output PDI data type.                                                                            |
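
A script intended for **Variable to fields** typically ends with one variable per output field, each holding a built-in type. The sketch below is illustrative only. The input `raw_text` and every output variable name are hypothetical, and the PDI types in the comments follow the mapping table in [Map data types from Python to PDI](#map-data-types-from-python-to-pdi).

```python
from datetime import datetime

# Hypothetical input that an upstream mapping might provide.
raw_text = "  Order 1042 shipped  "

# Separate output variables, one per PDI field mapped on the Output tab.
clean_text = raw_text.strip()                        # str -> String
word_count = len(clean_text.split())                 # int -> Integer
letters = len(clean_text.replace(" ", ""))
# Guard against empty input so the script never divides by zero.
avg_word_length = letters / word_count if word_count else 0.0  # float -> BigNumber
mentions_shipping = "shipped" in clean_text.lower()  # bool -> Boolean
processed_at = datetime.now()                        # datetime -> Timestamp
```

Because **Get fields** runs the script with random input values, defensive checks such as the empty-input guard help the script finish without errors during field detection.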

**Frames to fields**

Use this option when your script outputs a pandas DataFrame or a Python list of dictionaries.

| Mapping column           | Description                                                                                           |
| ------------------------ | ----------------------------------------------------------------------------------------------------- |
| **Data structure field** | Field in the output data structure.                                                                   |
| **Data structure type**  | Data type of that field. See [Map data types from Python to PDI](#map-data-types-from-python-to-pdi). |
| **PDI field**            | Output PDI field.                                                                                     |
| **PDI data type**        | Output PDI data type.                                                                                 |
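
For **Frames to fields**, the script leaves behind one structure that holds all output rows. The following is a minimal sketch that uses a Python list of dictionaries. The variable name `result`, the keys `customer` and `total`, and the sample data are hypothetical.

```python
# Hypothetical per-customer order amounts the script has already loaded.
orders = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
]

# Aggregate totals per customer.
totals = {}
for customer, amount in orders:
    totals[customer] = totals.get(customer, 0.0) + amount

# One dictionary per output row; every dictionary uses the same keys so
# that each key can be mapped to one PDI field.
result = [
    {"customer": name, "total": total}
    for name, total in sorted(totals.items())
]
print(result)
```

Keeping the keys identical across all dictionaries gives each **Data structure field** in the mapping table a value in every row.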

### Map data types from PDI to Python

Map each PDI field to the Python type that matches it most closely. The following table lists common mappings for each data structure.

| Structure          | PDI data type | Python data type |
| ------------------ | ------------- | ---------------- |
| pandas DataFrame   | BigNumber     | `float64`        |
| pandas DataFrame   | Boolean       | `bool`           |
| pandas DataFrame   | Date          | `datetime64[ns]` |
| pandas DataFrame   | Integer       | `int64`          |
| pandas DataFrame   | Number        | `float64`        |
| pandas DataFrame   | String        | `object`         |
| pandas DataFrame   | Timestamp     | `datetime64[ns]` |
| NumPy array        | BigNumber     | `float64`        |
| NumPy array        | Boolean       | `bool`           |
| NumPy array        | Integer       | `int64`          |
| NumPy array        | Number        | `float64`        |
| Basic Python types | BigNumber     | `float`          |
| Basic Python types | Boolean       | `bool`           |
| Basic Python types | Integer       | `int`            |
| Basic Python types | Number        | `float`          |
| Basic Python types | String        | `str`            |
| Basic Python types | Timestamp     | `datetime`       |

### Map data types from Python to PDI

Map each Python type to the PDI type that matches it most closely. The following table lists common mappings for each data structure.

| Structure          | Python data type | PDI data type |
| ------------------ | ---------------- | ------------- |
| pandas DataFrame   | `bool`           | Boolean       |
| pandas DataFrame   | `datetime64[ns]` | Timestamp     |
| pandas DataFrame   | `float64`        | BigNumber     |
| pandas DataFrame   | `int64`          | Integer       |
| pandas DataFrame   | `object`         | String        |
| Basic Python types | `bool`           | Boolean       |
| Basic Python types | `datetime`       | Timestamp     |
| Basic Python types | `float`          | BigNumber     |
| Basic Python types | `int`            | Integer       |
| Basic Python types | `str`            | String        |
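
The dtypes that pandas infers do not always match the ones in this table, so a script may want to cast its output DataFrame explicitly before returning it. The following is a minimal sketch. The column names, the `TARGET_DTYPES` dictionary, and the variable `out` are hypothetical, and the target dtypes are taken from the pandas DataFrame rows above.

```python
import pandas as pd

# Target dtypes taken from the table above (hypothetical column names).
TARGET_DTYPES = {
    "is_active": "bool",
    "created": "datetime64[ns]",
    "score": "float64",
    "visits": "int64",
    "name": "object",
}

# A result frame whose inferred dtypes may not match the table.
out = pd.DataFrame(
    {
        "is_active": [1, 0],
        "created": ["2024-01-15", "2024-02-01"],
        "score": [1, 2],
        "visits": [3.0, 4.0],
        "name": ["a", "b"],
    }
)

# Parse dates first, then cast every column to its target dtype.
out["created"] = pd.to_datetime(out["created"])
out = out.astype(TARGET_DTYPES)

print(out.dtypes)
```

Casting explicitly keeps the result independent of the inference rules of the installed pandas version.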

{% hint style="info" %}
In the **Output** tab, you can also convert a matplotlib figure to an SVG string.
{% endhint %}
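
To preview what an SVG string for a figure looks like, you can render one yourself with matplotlib. This sketch is not a description of how the step performs the conversion internally. It only shows one way to produce SVG text from a figure, and the plot data and the variable name `svg_text` are hypothetical.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Hypothetical sample plot")

# Render the figure as SVG into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")
plt.close(fig)

# Decode the bytes to get the SVG markup as a Python string.
svg_text = buffer.getvalue().decode("utf-8")
```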

