OpenLineage Plugin

The Pentaho Data Integration (PDI) OpenLineage plugin enables PDI to emit rich, standardized OpenLineage events that can be consumed by Pentaho Data Catalog (PDC) to capture how data moves and is transformed in PDI ETL pipelines. PDC uses information it captures to provide visual end-to-end transparency of data flows, which improves data observability, strengthens compliance and governance, aids in troubleshooting data issues, and enhances data trust and quality for business users.

When a supported transformation runs, the plugin discovers its input and output datasets, generates column-level lineage when possible, and emits the corresponding OpenLineage events from PDI.
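
To make the idea of column-level lineage concrete, the following sketch builds the kind of structure that the OpenLineage specification uses for its column lineage dataset facet, in which each output field lists the input fields it was derived from. The namespace, dataset path, and field names are hypothetical, the helper function is not part of the plugin, and the payload the plugin actually emits may carry more detail than shown here.

```python
import json


def column_lineage_facet(mappings):
    """Build a columnLineage-style facet.

    mappings: {output_field: [(namespace, dataset_name, input_field), ...]}
    """
    fields = {}
    for output_field, sources in mappings.items():
        fields[output_field] = {
            "inputFields": [
                {"namespace": ns, "name": dataset, "field": field}
                for ns, dataset, field in sources
            ]
        }
    return {"columnLineage": {"fields": fields}}


# Hypothetical example: the output column full_name is derived from two input columns.
facet = column_lineage_facet({
    "full_name": [
        ("file", "/data/in/customers.csv", "first_name"),
        ("file", "/data/in/customers.csv", "last_name"),
    ]
})
print(json.dumps(facet, indent=2))
```

Dataset lineage alone would record only that customers.csv fed the output; the facet adds which columns contributed to each output column.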

The OpenLineage plugin emits events for:

  • Start: transformation starts

  • Complete: transformation ends

  • Abort: transformation was stopped without errors

  • Fail: transformation ended with errors
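
As a rough illustration of how these four events relate to a single run, the sketch below builds minimal run-event dictionaries using the uppercase eventType spellings from the OpenLineage specification. The outcome keys, the "pdi" namespace, and the job name are hypothetical, and a real event carries additional required fields (such as a producer URI) that are omitted here.

```python
import uuid
from datetime import datetime, timezone

# Events listed above, mapped to OpenLineage eventType values.
# The dictionary keys are hypothetical labels used only in this sketch.
EVENT_TYPES = {
    "started": "START",
    "finished": "COMPLETE",
    "stopped": "ABORT",
    "errored": "FAIL",
}


def run_event(outcome, job_name, run_id, namespace="pdi"):
    """Return a minimal run-event dict for one transformation run."""
    return {
        "eventType": EVENT_TYPES[outcome],
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
    }


# A run emits a START event and then exactly one terminal event with the same runId.
run_id = str(uuid.uuid4())
start = run_event("started", "sample_transformation", run_id)
end = run_event("finished", "sample_transformation", run_id)
print(start["eventType"], end["eventType"])
```

Sharing the runId across the start and terminal events is what lets a consumer stitch them into one run.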

Compatibility matrix

OpenLineage plugin functionality is certified to work as intended for the following versions of PDI:

  • 10.2.0.1 (SP1)

  • 10.2.0.2 (SP2)

  • 10.2.0.3 (SP3)

  • 10.2.0.4 (SP4)

  • 10.2.0.5 (SP5)

  • 10.2.0.6 (SP6)

  • 11.0

Setting up the plugin

Before you begin, verify that you have a valid license for the OpenLineage plugin. For information about licenses, see Acquire and install enterprise licenses.

To set up the OpenLineage plugin, you must complete the following tasks:

Download the plugin

Download the OpenLineage plugin from the Pentaho Support Portal.

  1. On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet.

  2. In the Pentaho card, click Download. The Downloads page opens.

  3. In the <version>.x list, click Pentaho <version> EE Marketplace Plugins Release.

  4. Scroll to the bottom of the page.

  5. In the Marketplace Plugins <version> section, click Open Lineage.

Install the plugin

Install the OpenLineage plugin in the PDI client and Pentaho Server by running commands appropriate for your operating system.

Note: The plugin can be installed in the PDI client, Pentaho Server, or both.

Installation commands include the following placeholders that must be replaced:

  • <path-to-data-integration>: Replace with full path to the PDI client.

  • <path-to-pentaho-server>: Replace with full path to the Pentaho Server.

  • <version_check_option>: Replace with one of the following options:

    • none: Installs the plugin on any version of Pentaho. If the Pentaho version is unsupported, an error is shown.

    • loose: Default option. Installs the plugin on certified and compatible, newer Pentaho versions.

    • strict: Installs plugin only on certified Pentaho versions.

To install the OpenLineage plugin, complete the following steps:

  1. Stop the PDI client and Pentaho Server.

  2. Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.

  3. In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.

  4. In the command prompt, run the following installation commands for your operating system, replacing the placeholders for paths and version check options.

    • Windows

      • PDI client

        install.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>

      • Pentaho Server

        install.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>

    • Linux

      • PDI client

        ./install.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>

      • Pentaho Server

        ./install.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>

  5. Start the PDI client and Pentaho Server.
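
If you script the installation, the command line in step 4 can be assembled as in the following sketch. It uses only the -t and --platformVersionCheck options shown above; the function name and the example paths are hypothetical, and the sketch only builds the argument list rather than running the installer.

```python
import platform

VERSION_CHECK_OPTIONS = ("none", "loose", "strict")


def install_command(target_path, version_check="loose", system=None):
    """Return the installer argument list for one target directory."""
    if version_check not in VERSION_CHECK_OPTIONS:
        raise ValueError(f"unknown version check option: {version_check!r}")
    system = system or platform.system()
    script = "install.bat" if system == "Windows" else "./install.sh"
    return [script, "-t", target_path, "--platformVersionCheck", version_check]


# Hypothetical install locations; replace them with your own paths.
print(install_command("/opt/pentaho/design-tools/data-integration", system="Linux"))
print(install_command(r"C:\Pentaho\server\pentaho-server", "strict", system="Windows"))
```

The returned list can be passed to subprocess.run with cwd set to the extracted plugin folder. Rejecting unknown option values before the installer is launched catches typos early.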

Generate an encrypted password

If you plan to emit events to PDC, and want to secure your password so that it's not in plain text, you can generate an encrypted password to authenticate to PDC. The encrypted password is used in the configuration file for the OpenLineage plugin.

  1. On the computer where the PDI client or Pentaho Server is installed, open a command prompt.

  2. Run one of the following commands for your operating system:

    • Windows

      • To generate a password using the default Pentaho encryption seed, run the following command:

      • To generate a password using your own custom encryption seed, run the following command:

    • Linux

      • To generate a password using the default Pentaho encryption seed, run the following command:

      • To generate a password using your own custom encryption seed, run the following command:

    An encrypted password is generated and displayed in the command prompt, like the following example:

Create a configuration file for the plugin

After you install the plugin, create a configuration file that specifies where to send OpenLineage events. You can create a simple configuration file for testing or a custom configuration to use in production.

  1. In a text editor, create a configuration file with content from one of the following examples, based on your needs:

    • To create a simple configuration file that you can use to quickly validate that the plugin is working, include only the following content:

    • To create a custom configuration file that includes OpenLineage event consumers in your Pentaho deployment, such as a PDC Server, include the following content:

  2. Save the file as openlineageConfig.yml in the PDI directory that contains your user-specific configuration files.

    Notes:

    • By default, user-specific configuration files are stored in the .kettle directory, which is usually in one of the following locations:

      • Windows: C:\Documents and Settings\example_user\.kettle

      • Linux: ~/.kettle

      However, if you run PDI in a container, configuration files might resolve to the /root/.kettle directory.

    • You can add multiple http consumers in the configuration file.
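
To confirm where the file should go for the current user, a small sketch like the following resolves the default location described in the notes above. It assumes the .kettle directory sits directly under the user's home directory; the function name is hypothetical, and the sketch does not account for deployments that relocate the directory.

```python
from pathlib import Path


def default_config_path(home=None):
    """Return <home>/.kettle/openlineageConfig.yml (current user's home by default)."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kettle" / "openlineageConfig.yml"


path = default_config_path()
print(path, "exists" if path.is_file() else "is missing")

# In a container running as root, the same rule resolves under /root.
print(default_config_path("/root"))
```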

Enable the plugin

After you install the OpenLineage plugin and create its configuration file, you must enable the plugin so that it can send OpenLineage events to the consumers you specified in the configuration file.

Enable in PDI client

Enable the plugin in the PDI client by completing the following steps:

  1. Log in to the PDI client and click Edit > Edit the Kettle.properties file. The Kettle properties window opens.

  2. To make the plugin active, add the following variable and value: KETTLE_OPEN_LINEAGE_ACTIVE=true

  3. To point PDI to your openlineageConfig.yml file, add the following variable, replacing the <path-to-config-file> placeholder with the full path to the directory that contains your configuration file: KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml

  4. Click OK. The kettle.properties file is saved and the OpenLineage plugin is enabled.

Enable in Pentaho Server

Enable the plugin in the Pentaho Server by completing the following steps:

  1. Navigate to the kettle.properties file.

    Note: The kettle.properties file is usually in one of the following locations:

    • Windows: C:\Documents and Settings\example_user\.kettle

    • Linux: ~/.kettle

    If you run PDI in a container, the kettle.properties file is in the /root/.kettle directory.

  2. Open the kettle.properties file in a text editor.

  3. Enable the plugin with its configuration file by adding the following variables and values:

    KETTLE_OPEN_LINEAGE_ACTIVE=true

    KETTLE_OPEN_LINEAGE_CONFIG_FILE=/<path-to-config-file>/openlineageConfig.yml

  4. Save the kettle.properties file.
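
If you manage several servers, the edit in steps 2 through 4 can be scripted. The following sketch sets the two variables in the text of a properties file without duplicating keys that already exist. It handles only simple key=value lines, the function name is hypothetical, and the configuration path in the example is a placeholder.

```python
def set_properties(text, updates):
    """Return properties text in which each key in updates appears exactly once."""
    seen = set()
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        is_assignment = "=" in stripped and not stripped.startswith("#")
        if is_assignment and key in updates:
            if key not in seen:
                lines.append(f"{key}={updates[key]}")
                seen.add(key)
            # Later duplicates of an updated key are dropped.
        else:
            lines.append(line)
    # Append keys that were not present in the original text.
    lines.extend(f"{k}={v}" for k, v in updates.items() if k not in seen)
    return "\n".join(lines) + "\n"


original = "# existing settings\nKETTLE_OPEN_LINEAGE_ACTIVE=false\n"
updated = set_properties(original, {
    "KETTLE_OPEN_LINEAGE_ACTIVE": "true",
    # Hypothetical path; use the full path to your own configuration file.
    "KETTLE_OPEN_LINEAGE_CONFIG_FILE": "/home/example_user/.kettle/openlineageConfig.yml",
})
print(updated)
```

Because existing keys are rewritten in place, running the function again on its own output leaves the text unchanged.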

Validate the plugin works

You can validate that the plugin is working by verifying that text related to OpenLineage appears in the appropriate logs and files.

To validate that the plugin is working, complete the following steps:

  1. In the PDI client, click File > Open, and then navigate to sample transformations in your Pentaho folder. For example, in Windows the samples are in <path_to_Pentaho>\Pentaho\design-tools\data-integration\samples\transformations.

  2. Select the sample transformation, TextInput and Output using variables.ktr, and click Open.

  3. To run the transformation, click Action > Run, and then in the Run Options window, click Run. The transformation runs and the Execution Results pane appears at the bottom of the PDI client.

  4. Validate that consumers you have enabled are receiving OpenLineage events by taking one of the following actions:

    • If the console consumer is enabled, in the Execution Results pane of the PDI client, click the Logging tab and verify that the log contains lines with the text, "OpenLineage-Plugin".

    • If a file consumer is enabled, open the openlineage.json file in a text editor and verify that it contains lines with the text, "OpenLineage-Plugin". The openlineage.json file location is defined in the openlineageConfig.yml file.

    • If an HTTP consumer is enabled, confirm that OpenLineage events are arriving for that consumer. For example, if PDC is a configured consumer, verify that the events arrive in PDC.
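
For the console and file consumers, the check in step 4 amounts to searching text for the "OpenLineage-Plugin" marker. The sketch below shows that search on a made-up excerpt; the sample lines are placeholders and do not reproduce the real log format, and the function name is hypothetical.

```python
MARKER = "OpenLineage-Plugin"


def lineage_lines(log_text, marker=MARKER):
    """Return the lines of log_text that mention the plugin marker."""
    return [line for line in log_text.splitlines() if marker in line]


# Placeholder excerpt; real log or openlineage.json lines will look different.
sample_log = "\n".join([
    "placeholder line without the marker",
    "placeholder line that mentions OpenLineage-Plugin",
    "another placeholder line without the marker",
])
matches = lineage_lines(sample_log)
print(f"{len(matches)} line(s) mention {MARKER}")
```

To check a file consumer, read the openlineage.json file into a string and pass it to the same function.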

Troubleshoot plugin

If you are unable to validate that the plugin is working, perform the following troubleshooting actions:

  • Verify dataset lineage (input text file -> output text file) and column lineage mappings.

  • Validate that the kettle.properties file contains the following variable and value: KETTLE_OPEN_LINEAGE_ACTIVE=true.

  • Verify that the credentials specified in the openlineageConfig.yml file are correct.

  • Check your network and firewall settings.
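
The kettle.properties check in the list above can be automated with a sketch like the following, which also confirms that the configured file exists (an extra check that this list does not name). The function names are hypothetical, only simple key=value lines are parsed, and the value is compared to lowercase true as documented above.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def find_problems(properties_text, file_exists=lambda p: Path(p).is_file()):
    """Return a list of problems; an empty list means the basic settings look right."""
    props = parse_properties(properties_text)
    problems = []
    if props.get("KETTLE_OPEN_LINEAGE_ACTIVE") != "true":
        problems.append("KETTLE_OPEN_LINEAGE_ACTIVE is not set to true")
    config = props.get("KETTLE_OPEN_LINEAGE_CONFIG_FILE")
    if not config:
        problems.append("KETTLE_OPEN_LINEAGE_CONFIG_FILE is not set")
    elif not file_exists(config):
        problems.append(f"configuration file not found: {config}")
    return problems


# Example with a deliberately incomplete file.
for problem in find_problems("KETTLE_OPEN_LINEAGE_ACTIVE=false\n"):
    print(problem)
```

Credentials and network reachability still need to be checked against the consumer itself; this sketch covers only the local settings.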

Supported steps

Note: This list of supported steps is for version 0.5.0 of the plugin.

Steps that support dataset lineage and column-level lineage

  • Abort

  • Append Streams

  • Block this step until steps finish

  • Blocking Step

  • Data Grid

  • Delay Row

  • Delete

  • Dummy

  • Filter Rows

  • Generate Rows

  • Get Variables

  • Group By

  • Java Filter

  • Mail

  • Merge Join

  • Microsoft Excel Input

    Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections.

  • Microsoft Excel Output (deprecated)

    Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]

  • Microsoft Excel Writer

    Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. [1]

  • Prioritize streams

  • S3 CSV Input

  • S3 File Output [1]

  • Send message to syslog

  • Set Variables

  • Sort Rows

  • Switch/Case

  • Table input

    Lineage is supported for the following connections, using the listed SQL functions and clauses:

    • Connection types: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Google BigQuery, Redshift, and Generic Connection [2]

    • SQL functions: aliases, joins, subqueries, functions, aggregations, constants, expressions, CASE expressions, window functions, CTEs, and the set operators UNION, INTERSECT, and EXCEPT.

    • Clauses: GROUP BY, ORDER BY, WHERE, WITH, and HAVING.

  • Table output

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • Text file input

    Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible connections. The Fixed file type is not supported.

  • Text file output

    Lineage is supported for local files, AWS, MinIO, HCP, and other S3-compatible file systems. [1] The Fixed file type is not supported.

  • Write to Log

Steps that support only dataset lineage, not column-level lineage:

  • Combination lookup/update

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • CSV File Input

  • Database Lookup

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • De-serialize from file

  • Dimension lookup/update

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • Fixed file input

  • Gzip Csv Input

  • Insert/Update

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • JSON Input

  • JSON Output [1]

  • LDIF Input

  • Load file content in memory

  • Property Input

  • Properties Output [1]

  • Sql File Output [1]

  • Synchronize after merge

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • Update

    Lineage is supported for the following connections: MySQL, PostgreSQL, Denodo, Sybase, Oracle, Vertica, SQL Server, Snowflake, Redshift, and Generic Connection. [2]

  • XBase Input

Notes:

[1] A step that can create multiple output files can be configured to add the file names to its result so that the name of each file is recorded in lineage. For example, if the Add filenames to result option is enabled for the step, the output is recorded in lineage as <filename>_001.csv, <filename>_002.csv, <filename>_003.csv, and so on. If the option is disabled, only a single, generic target is recorded, such as <filename>.csv.

[2] The step allows generic connections, but lineage works only with generic connections to the databases that are listed as supported.

Note: The Google BigQuery connection is not supported for the Table output step. An OpenLineage event will not contain dataset outputs for Google BigQuery storage.
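
The effect of the Add filenames to result option described in note [1] can be pictured with the following sketch, which lists the targets that would be recorded in each case. It mirrors only the naming pattern from the example in the note; the function name and the "sales" base name are hypothetical, and actual file names depend on how the step is configured.

```python
def recorded_outputs(base_name, extension, parts, add_filenames_to_result):
    """Return the output names recorded in lineage, following the example in note [1]."""
    if not add_filenames_to_result:
        # Option disabled: a single generic target is recorded.
        return [f"{base_name}.{extension}"]
    # Option enabled: each numbered file is recorded individually.
    return [f"{base_name}_{n:03d}.{extension}" for n in range(1, parts + 1)]


print(recorded_outputs("sales", "csv", 3, add_filenames_to_result=True))
print(recorded_outputs("sales", "csv", 3, add_filenames_to_result=False))
```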

Uninstall plugin

Uninstall the OpenLineage plugin from the PDI client and Pentaho Server by running commands appropriate for your operating system.

Before you begin, you must download the OpenLineage plugin from the Pentaho Support Portal, which contains script files for uninstalling the plugin. For details, see Download the plugin.

Note: The plugin can be uninstalled from the PDI client, Pentaho Server, or both.

Commands for uninstalling the plugin include the following placeholders that must be replaced:

  • <path-to-data-integration>: Replace with full path to the PDI client.

  • <path-to-pentaho-server>: Replace with full path to the Pentaho Server.

  • <version_check_option>: Replace with one of the following options:

    • none: Uninstalls the plugin from any version of Pentaho. If the Pentaho version is unsupported, an error is shown.

    • loose: Default option. Uninstalls the plugin from certified and compatible, newer Pentaho versions.

    • strict: Uninstalls the plugin only from certified Pentaho versions.

To uninstall the OpenLineage plugin, complete the following steps:

  1. Stop the PDI client and Pentaho Server.

  2. Extract the pdi-openlineage-plugin-<plugin_version>-<build number>.zip file to a folder on the computer where the PDI client or Pentaho Server is installed.

  3. In the pdi-openlineage-plugin-<plugin_version>-<build number> folder, open a command prompt as an administrator.

  4. In the command prompt, run the following uninstall commands for your operating system, replacing the placeholders for paths and version check options.

    • Windows

      • PDI client

        uninstall.bat -t <path-to-data-integration> --platformVersionCheck <version_check_option>

      • Pentaho Server

        uninstall.bat -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>

    • Linux

      • PDI client

        ./uninstall.sh -t <path-to-data-integration> --platformVersionCheck <version_check_option>

      • Pentaho Server

        ./uninstall.sh -t <path-to-pentaho-server> --platformVersionCheck <version_check_option>

  5. Start the PDI client and Pentaho Server.

Upgrade plugin

Important: Do not install a new version of the OpenLineage plugin over an existing installation of the plugin.

To upgrade the OpenLineage plugin, you must uninstall the existing plugin and then download and install the new version. For details, see Uninstall plugin, Download the plugin, and Install the plugin.
