Data lineage

In the Data Canvas, Data Catalog displays a visual representation of the lineage of the selected data, including its origin, flow, and transformations. Data lineage provides visibility into the data’s historical context and authenticity, which helps in understanding how data is manipulated and transformed across different processes and systems.

If you have Pentaho Data Integration (PDI), you can configure PDI to send lineage to Data Catalog. PDI sends lineage information to the configured Data Catalog API at key lineage events, such as the start or completion of a transformation. PDI and Data Catalog support the OpenLineage open framework for data lineage collection and analysis.
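For orientation, an OpenLineage run event is a JSON document that carries an event type, a timestamp, a run identifier, a job, and the input and output datasets. The following is a minimal sketch of what a start and completion event pair for one transformation run could look like. It is not the exact payload that PDI emits: the namespaces, job name, dataset names, and producer URI are hypothetical, and the schema URL version is an assumption.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical values for illustration only; not taken from PDI or Data Catalog.
PRODUCER = "https://example.com/pdi-openlineage-sketch"
# Assumed spec version; use the version your environment actually targets.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(event_type, run_id, job_name, inputs, outputs):
    """Build an OpenLineage-style run event as a plain dictionary."""
    return {
        "eventType": event_type,  # for example "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "pdi-example", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# Both events share one run ID so a consumer can pair them.
run_id = str(uuid.uuid4())
inputs = [("file", "/data/in/customers.csv")]
outputs = [("postgres://db.example.com:5432", "sales.public.customers")]

start_event = build_run_event("START", run_id, "load_customers", inputs, outputs)
complete_event = build_run_event("COMPLETE", run_id, "load_customers", inputs, outputs)

print(json.dumps(start_event, indent=2))
```

Sharing the run identifier between the two events is what lets a lineage consumer treat them as the start and end of the same transformation run.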

Supported lineage steps

PDI sends lineage to Data Catalog only for the supported steps listed below.

Native steps

These steps are implemented directly by the PDI platform and supported by the OpenLineage Plug-in:

  • Text file input (local files, MinIO/HCP S3)

  • Text file output (local files, MinIO/HCP S3) (*)

  • Table input (MySQL, PostgreSQL, Oracle, Vertica, SQL Server, Snowflake, Google BigQuery)

  • Table output (MySQL, PostgreSQL, Oracle, Vertica, SQL Server, Snowflake)

Non-native steps

These steps are implemented by plugins that are loaded when PDI initializes, so lineage for them is supported through a plugin extension:

  • S3 CSV input

  • S3 file output (*)

  • Microsoft Excel Writer (*)

  • Microsoft Excel input

Note: The steps marked with (*) can split files based on content. If the Add filenames to result option is set for these steps, they work without any limitation. Otherwise, PDI supports only file names without any customization (such as date, time, or partitions).
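The support matrix above can be captured as a small lookup, which makes it easy to check a transformation design before expecting lineage from it. This is only a sketch transcribed from the lists in this section; the function and variable names are hypothetical and are not part of PDI or Data Catalog.

```python
# Supported steps and back ends, transcribed from the lists above.
# None means the documentation lists no specific back ends for the step.
SUPPORTED_STEPS = {
    "Text file input": {"local files", "MinIO/HCP S3"},
    "Text file output": {"local files", "MinIO/HCP S3"},
    "Table input": {
        "MySQL", "PostgreSQL", "Oracle", "Vertica",
        "SQL Server", "Snowflake", "Google BigQuery",
    },
    "Table output": {
        "MySQL", "PostgreSQL", "Oracle", "Vertica",
        "SQL Server", "Snowflake",
    },
    "S3 CSV input": None,
    "S3 file output": None,
    "Microsoft Excel Writer": None,
    "Microsoft Excel input": None,
}

# Steps marked with (*): they can split files based on content.
SPLITTING_STEPS = {"Text file output", "S3 file output", "Microsoft Excel Writer"}


def is_lineage_supported(step, backend=None):
    """Return True if the step (and back end, if given) appears in the lists."""
    if step not in SUPPORTED_STEPS:
        return False
    backends = SUPPORTED_STEPS[step]
    if backends is None or backend is None:
        return True
    return backend in backends


print(is_lineage_supported("Table input", "Google BigQuery"))
print(is_lineage_supported("Table output", "Google BigQuery"))
```

Note that Google BigQuery is listed for Table input only, so the second check returns False.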

Data Catalog continuously runs an API that captures the lineage information from PDI. To set up the connection between PDI and Data Catalog, see Configure Pentaho Data Integration to send lineage to Pentaho Data Catalog in the Administration Guide.

View lineage detail

You can see lineage information for a resource in the Data Canvas.

Perform the following steps to view more detail about the lineage.

  1. In the Data Canvas, navigate to the resource that you want to view, and click the Summary tab if it is not already displayed.

  2. On the Lineage pane, click View Lineage.

    Data Canvas with View Lineage button

    The Lineage page opens. The large rectangles represent data sources, and the smaller rectangle within a rectangle represents the resource, such as a table, column, or field.

    Lineage page
  3. Click the smaller rectangle for the resource.

    A side panel opens, with more detail about the resource, such as its Sensitivity.

    Lineage page with side panel and resource selected

    On the Lineage page, you can also use the following actions to explore the lineage:

    Field or Icon    Action
    -------------    ------
    Find in graph    Search the lineage graph
    Upstream         Select a number corresponding to the number of hops upstream from the resource that you want to view
    Downstream       Select a number corresponding to the number of hops downstream from the resource that you want to view
    Zoom out         Reset the lineage graph size
    Zoom in          Display the action that was performed on the data
    Add Lineage      Manually add a resource to the lineage graph. See Add manual lineage.
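To make the Upstream and Downstream hop counts concrete, the following sketch treats lineage as a directed graph of (source, target) edges and collects every resource within a given number of hops of a starting resource. It only illustrates the idea of hops; it is not how Data Catalog implements the graph, and the resource names and function name are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges as (source, target) pairs; names are illustrative.
EDGES = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "mart.sales"),
    ("raw.customers", "mart.sales"),
    ("mart.sales", "report.revenue"),
]


def resources_within(edges, start, hops, direction):
    """Return the resources reachable from start within the given hop count."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    # Follow edges forward for downstream, backward for upstream.
    adjacency = {}
    for source, target in edges:
        if direction == "downstream":
            adjacency.setdefault(source, set()).add(target)
        else:
            adjacency.setdefault(target, set()).add(source)

    # Breadth-first search that stops expanding once the hop limit is reached.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))

    seen.discard(start)
    return seen


print(sorted(resources_within(EDGES, "mart.sales", 1, "upstream")))
print(sorted(resources_within(EDGES, "mart.sales", 2, "upstream")))
print(sorted(resources_within(EDGES, "mart.sales", 1, "downstream")))
```

In this sample graph, one hop upstream of mart.sales reaches its two direct sources, and a second hop also reaches raw.orders.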

Add manual lineage

If you know the source of a specific resource, and the source does not appear on the lineage graph for the resource, you can add it to the lineage graph. Adding manual lineage involves selecting a Target resource and adding another resource to it, which becomes the Source resource.

Use the following steps to add manual lineage.

  1. If you are not already viewing resource data in the Data Canvas, navigate to a resource and, on the Summary tab, click View Lineage.

    The Lineage page opens.

  2. In the lineage graph, click a resource to use as the Target resource.

    Lineage page with side panel and resource selected
  3. Click Add Lineage.

    The Add Lineage window opens.

    Add Lineage window on Lineage page
  4. Navigate to a resource to add to the lineage and select its checkbox.

    The Add button displays the number of resources that are selected.

    This resource will be added as the Source of the Target resource.

  5. Click Add.

The resource is added to the lineage graph.
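Conceptually, the procedure above adds one directed edge to the lineage graph, pointing from the resource you selected in the Add Lineage window (the Source) to the resource you clicked in the graph (the Target). The following sketch shows that direction using a plain set of (source, target) pairs; the function and resource names are hypothetical and do not represent a Data Catalog API.

```python
def add_manual_lineage(edges, source, target):
    """Return a new edge set in which source is recorded as upstream of target."""
    return set(edges) | {(source, target)}


def sources_of(edges, target):
    """Return the direct sources of a target resource."""
    return {source for source, edge_target in edges if edge_target == target}


# Hypothetical existing lineage for a target table.
edges = {("staging.orders", "mart.sales")}

# The target is the resource clicked in the graph; the source is the one added.
updated = add_manual_lineage(edges, source="raw.customers", target="mart.sales")

print(sorted(sources_of(edges, "mart.sales")))
print(sorted(sources_of(updated, "mart.sales")))
```

The original edge set is left unchanged, so the before and after states can be compared directly.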
