# Python Executor

The **Python Executor** step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.

This step lets developers and data scientists concentrate on Python-based analytics and algorithms while PDI handles common ETL work such as connecting to sources, joining, and filtering.

You can run Python with either:

* **Row-by-row** processing: PDI maps each incoming row to Python variables and runs the script once per row.
* **All-rows** processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script once for the whole dataset.

{% hint style="info" %}
This step supports the **CPython** runtime only.
{% endhint %}

### Before you begin

Install the following Python libraries before using this step:

* [pandas](http://pandas.pydata.org/) (1.5.3 or later)
* [NumPy](http://www.numpy.org/) (1.24.2 or later)
* [Py4J](https://www.py4j.org/) (0.10.9.7 or later)
* [matplotlib](https://matplotlib.org/) (3.7.1 or later)

{% hint style="info" %}
If you install Python using Anaconda, all required libraries are installed.
{% endhint %}
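
To confirm that an interpreter meets these minimums, you could run a small check like the one below. This is a sketch and not part of PDI: `REQUIRED` restates the versions listed above, `as_tuple` and `check_requirements` are hypothetical helper names, and the version comparison is deliberately simple (it reads only the leading digits of each dotted component).

```python
import re
from importlib import metadata

# Minimum versions from the list above, keyed by PyPI distribution name.
REQUIRED = {
    "pandas": "1.5.3",
    "numpy": "1.24.2",
    "py4j": "0.10.9.7",
    "matplotlib": "3.7.1",
}


def as_tuple(version):
    """Turn '1.24.2' into (1, 24, 2); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(required=REQUIRED):
    """Return {name: status}, where status is 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        meets_minimum = as_tuple(installed) >= as_tuple(minimum)
        report[name] = "ok" if meets_minimum else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Run it with the same Python executable that the step will use, so the report reflects that environment.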

### Step name

**Step name** specifies the unique name of the step on the canvas. You can change it.

### Configure the step (tabs)

#### Script tab

Use this tab to specify whether to embed a script in the step or link to an external script file.

**Script source**

* **Embed** (default): Runs the script entered in **Manual Python script**.
* **Link from file**: Runs a Python script loaded from a file system (including virtual file systems).

| Option                               | Description                                                                                                                                                                                                                                                           |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Manual Python script** (Embed)     | Python script to embed and run.                                                                                                                                                                                                                                       |
| **Location** (Link from file)        | File system or cluster where the script is located. See the [VFS browser](https://docs.pentaho.com/pdia-data-integration/archived-merged-pages/connecting-to-virtual-file-systems-archive/vfs-browser-connecting-to-virtual-file-systems) for supported file systems. |
| **File name** (Link from file)       | Fully qualified URL of the script file. Select **Browse** to locate it in the VFS browser. Supports variable substitution using `${...}`.                                                                                                                             |
| **Use a Python virtual environment** | Use a specific Python executable instead of the default. When selected, specify the Python executable path (supports `${...}`) or select **Browse** to locate it.                                                                                                     |
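
When you point the step at a specific executable, it can help to confirm which interpreter actually ran the script. The sketch below uses only the standard `sys` module. The variable names are illustrative, and the virtual-environment check assumes a `venv`-style environment, so it does not detect conda environments.

```python
import sys

# Path of the interpreter that is running this script.
python_executable = sys.executable

# 'cpython' for the reference implementation; this step supports CPython only.
python_implementation = sys.implementation.name

# True inside a venv-style virtual environment (not a reliable test for conda).
in_virtual_env = sys.prefix != sys.base_prefix
```

These three variables are plain built-in types, so they could be mapped to PDI fields on the **Output** tab while you troubleshoot an environment.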

#### Input tab

Use this tab to move data from PDI fields to Python variables.

Choose one processing mode:

* **Row by row** (standard PDI streaming behavior)
* **All rows** (dataset-style processing, often used with data frames)

**Row by row**

When you select **Row by row**, the script runs once for each incoming row. Each row’s fields are mapped to Python variables defined in the **Mapping** table.

{% hint style="warning" %}
When using the PDI engine, you can include multiple input steps only if they share the **same schema** (same field order and data types). If you use multiple input steps with different schemas, the step fails.
{% endhint %}

| Mapping column       | Description                                                                                                                                          |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Variable**         | Python variable name.                                                                                                                                |
| **Python data type** | Python data type for the variable (for example, `str`, `int`, `float`). See [Map data types from PDI to Python](#map-data-types-from-pdi-to-python). |
| **PDI field**        | Incoming PDI field mapped to the Python variable.                                                                                                    |
| **PDI data type**    | PDI data type of the incoming field.                                                                                                                 |

Select **Get fields** to populate the mapping table from upstream fields.
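
As an illustration, a row-by-row script might look like the sketch below. The variable names (`customer`, `quantity`, `unit_price`) are hypothetical and stand for whatever you define in the **Mapping** table. The first three assignments only simulate one incoming row so that the sketch runs on its own; in the step, PDI would supply those values.

```python
# Stand-ins for one incoming row. In the step, PDI assigns these variables
# from the Mapping table, so you would not write these three lines.
customer = "  acme corp "
quantity = 3
unit_price = 19.99

# --- script body: runs once per incoming row ---
customer_clean = customer.strip().title()      # str
line_total = round(quantity * unit_price, 2)   # float
is_bulk_order = quantity >= 10                 # bool
```

Because the script runs once per row, keep expensive setup (such as loading a model) to a minimum in this mode.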

**All rows**

When you select **All rows**, the step accumulates all incoming rows and sends them to Python as a dataset.

{% hint style="warning" %}
With **All rows**, the step holds the full dataset in memory until every row has arrived. For large datasets (a gigabyte or more), this can significantly increase memory requirements.
{% endhint %}

| Option                  | Description                                                                                                    |
| ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Available variables** | Select the plus button to add an input variable. Remove a variable by selecting the X icon.                    |
| **Variable name**       | Python variable name receiving the dataset.                                                                    |
| **Step**                | Name of the input step to map from (a step with an outgoing hop into Python Executor).                         |
| **Data structure**      | Dataset structure to use in Python: **pandas DataFrame**, **NumPy array**, or **Python list of dictionaries**. |

The **Mapping** table contains:

| Mapping column           | Description                                                                                                                      |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| **Data structure field** | Name of the field in the selected Python data structure.                                                                         |
| **Data structure type**  | Data type for that field in the selected structure. See [Map data types from PDI to Python](#map-data-types-from-pdi-to-python). |
| **PDI field**            | PDI field mapped into the dataset.                                                                                               |
| **PDI data type**        | PDI data type of the field.                                                                                                      |

Select **Get fields** to populate the mapping table.
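
For example, with **pandas DataFrame** as the data structure, a script could aggregate the whole dataset in one pass. The sketch below is an assumption-laden illustration: `orders` is a hypothetical variable name, the `region` and `amount` columns are hypothetical field names, and the DataFrame is built by hand only so that the code runs standalone.

```python
import pandas as pd

# Stand-in for the DataFrame PDI would pass in under the configured
# variable name. In the step, you would not build it yourself.
orders = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "amount": [120.0, 80.0, 30.0, 45.5],
    }
)

# --- script body: runs once for the whole dataset ---
region_totals = (
    orders.groupby("region", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_amount"})
)
```

The resulting `region_totals` DataFrame has one row per region and could be returned to PDI from the **Output** tab.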

#### Output tab

Use this tab to move data from Python variables back to PDI fields.

You can output results as:

* **Variable to fields**: individual Python variables (built-in types)
* **Frames to fields**: a pandas DataFrame or a Python list of dictionaries

**Variable to fields**

Use this option when your script outputs separate variables of built-in types (for example, numbers, strings, or Booleans).

{% hint style="info" %}
Selecting **Get fields** executes your Python script using random input values to infer output variables. If your script is long-running or requires specific input data, define the output mapping manually.
{% endhint %}

| Mapping column       | Description                                                                                      |
| -------------------- | ------------------------------------------------------------------------------------------------ |
| **Variable**         | Python variable name.                                                                            |
| **Python data type** | Variable data type. See [Map data types from Python to PDI](#map-data-types-from-python-to-pdi). |
| **PDI field**        | Output PDI field.                                                                                |
| **PDI data type**    | Output PDI data type.                                                                            |
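
A script that produces separate output variables might look like the following sketch. The names `raw_score`, `score`, `score_valid`, and `grade` are hypothetical. Because **Get fields** may run the script with random input values, the sketch parses its input defensively instead of assuming well-formed data.

```python
# Stand-in for an input variable. In the step, PDI assigns it.
raw_score = "87.5"

# --- script body ---
# Parse defensively: the input may not be a valid number.
try:
    score = float(raw_score)      # float output variable
    score_valid = True            # bool output variable
except ValueError:
    score = 0.0
    score_valid = False

grade = "pass" if score_valid and score >= 60.0 else "fail"   # str output variable
```

Each of `score`, `score_valid`, and `grade` would then get its own row in the mapping table.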

**Frames to fields**

Use this option when your script outputs a pandas DataFrame or a Python list of dictionaries.

| Mapping column           | Description                                                                                           |
| ------------------------ | ----------------------------------------------------------------------------------------------------- |
| **Data structure field** | Field in the output data structure.                                                                   |
| **Data structure type**  | Data type of that field. See [Map data types from Python to PDI](#map-data-types-from-python-to-pdi). |
| **PDI field**            | Output PDI field.                                                                                     |
| **PDI data type**        | Output PDI data type.                                                                                 |
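
The sketch below shows the list-of-dictionaries form of this output. All names are hypothetical (`readings`, `converted`, and the dictionary keys), and the input list is written out by hand only so that the code runs on its own.

```python
# Stand-in for an All rows input delivered as a Python list of dictionaries.
readings = [
    {"sensor": "a", "celsius": 20.0},
    {"sensor": "b", "celsius": 25.0},
    {"sensor": "c", "celsius": 30.0},
]

# --- script body ---
# Every output dictionary uses the same keys, so each key can map to one
# PDI field in the mapping table.
converted = [
    {
        "sensor": row["sensor"],
        "fahrenheit": row["celsius"] * 9 / 5 + 32,
        "is_hot": row["celsius"] > 25.0,
    }
    for row in readings
]
```

Keeping the keys and value types consistent across all dictionaries makes the **Data structure field** and **Data structure type** columns straightforward to fill in.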

### Map data types from PDI to Python

Map each PDI field to the Python type that represents it most precisely. The following table lists common mappings.

| Structure          | PDI data type | Python data type |
| ------------------ | ------------- | ---------------- |
| pandas DataFrame   | BigNumber     | `float64`        |
| pandas DataFrame   | Boolean       | `bool`           |
| pandas DataFrame   | Date          | `datetime64[ns]` |
| pandas DataFrame   | Integer       | `int64`          |
| pandas DataFrame   | Number        | `float64`        |
| pandas DataFrame   | String        | `object`         |
| pandas DataFrame   | Timestamp     | `datetime64[ns]` |
| NumPy array        | BigNumber     | `float64`        |
| NumPy array        | Boolean       | `bool`           |
| NumPy array        | Integer       | `int64`          |
| NumPy array        | Number        | `float64`        |
| Basic Python types | BigNumber     | `float`          |
| Basic Python types | Boolean       | `bool`           |
| Basic Python types | Integer       | `int`            |
| Basic Python types | Number        | `float`          |
| Basic Python types | String        | `str`            |
| Basic Python types | Timestamp     | `datetime`       |
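
Inside a script, you could check that a DataFrame arrived with the dtypes you expect from the table above. This is a sketch under stated assumptions: the column names are hypothetical, `dtype_mismatches` is an illustrative helper rather than a PDI or pandas function, and the sample frame is cast explicitly so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Expected pandas dtypes, taken from the table above.
# The column names are hypothetical.
EXPECTED_DTYPES = {
    "order_id": "int64",             # PDI Integer
    "amount": "float64",             # PDI Number or BigNumber
    "paid": "bool",                  # PDI Boolean
    "customer": "object",            # PDI String
    "ordered_at": "datetime64[ns]",  # PDI Date or Timestamp
}


def dtype_mismatches(frame, expected):
    """Return {column: actual dtype} for columns that differ from expected."""
    return {
        column: str(frame[column].dtype)
        for column, dtype in expected.items()
        if str(frame[column].dtype) != dtype
    }


# Stand-in for the DataFrame the step would pass to the script.
orders = pd.DataFrame(
    {
        "order_id": [1, 2],
        "amount": [19.99, 5.0],
        "paid": [True, False],
        "customer": ["Acme", "Globex"],
        "ordered_at": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    }
).astype(EXPECTED_DTYPES)

problems = dtype_mismatches(orders, EXPECTED_DTYPES)  # empty dict when all match
```

Note that BigNumber and Number both map to `float64`, so values that need more precision than a 64-bit float can hold may be rounded.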

### Map data types from Python to PDI

Map each Python variable or field to the PDI type that represents it most precisely. The following table lists common mappings.

| Structure          | Python data type | PDI data type |
| ------------------ | ---------------- | ------------- |
| pandas DataFrame   | `bool`           | Boolean       |
| pandas DataFrame   | `datetime64[ns]` | Timestamp     |
| pandas DataFrame   | `float64`        | BigNumber     |
| pandas DataFrame   | `int64`          | Integer       |
| pandas DataFrame   | `object`         | String        |
| Basic Python types | `bool`           | Boolean       |
| Basic Python types | `datetime`       | Timestamp     |
| Basic Python types | `float`          | BigNumber     |
| Basic Python types | `int`            | Integer       |
| Basic Python types | `str`            | String        |
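
The basic-type rows of this table can be expressed as a small lookup, which may help when you fill in an output mapping by hand. `pdi_type_for` is a hypothetical helper, not a PDI API. The one subtlety it illustrates is a Python fact: `bool` is a subclass of `int`, so Boolean must be tested first.

```python
from datetime import datetime

# Ordered so that bool is tested before int (bool is a subclass of int).
PYTHON_TO_PDI = (
    (bool, "Boolean"),
    (int, "Integer"),
    (float, "BigNumber"),
    (str, "String"),
    (datetime, "Timestamp"),
)


def pdi_type_for(value):
    """Return the PDI type name from the table above, or None if unlisted."""
    for python_type, pdi_type in PYTHON_TO_PDI:
        if isinstance(value, python_type):
            return pdi_type
    return None
```

For instance, `pdi_type_for(True)` returns `"Boolean"` rather than `"Integer"`, and a value of an unlisted type such as a list returns `None`.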

{% hint style="info" %}
In the **Output** tab, you can also convert a matplotlib figure to an SVG string.
{% endhint %}
