Data lineage
In the Data Canvas, Data Catalog displays a visual representation of the lineage of the selected data, including its origin, flow, and transformations. Data lineage provides visibility into the data’s historical context and authenticity, which helps in understanding how data is manipulated and transformed across different processes and systems.
If you have Pentaho Data Integration (PDI), you can configure PDI to send lineage to Data Catalog. PDI sends lineage information to the configured Data Catalog API at key lineage events, such as the start or completion of a transformation. PDI and Data Catalog support the OpenLineage open framework for data lineage collection and analysis.
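As an illustration of the kind of message exchanged at those lineage events, the following is a minimal sketch of OpenLineage-style run events built as plain Python dictionaries. The field names (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL) follow the OpenLineage run event model. The producer URI, the schema version in the URL, the job namespace, and the dataset names are assumptions made up for this example, and the actual payload that PDI sends may differ.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical values for illustration only; not taken from PDI or Data Catalog.
PRODUCER = "https://example.com/pdi-openlineage-sketch"
# Assumed spec version; check the OpenLineage spec for the current schema URL.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def make_run_event(event_type, run_id, job_name, inputs=(), outputs=()):
    """Build an OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # for example START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "pdi-example", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


# The same runId ties the START and COMPLETE events of one run together.
run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "load_customers")
complete = make_run_event(
    "COMPLETE",
    run_id,
    "load_customers",
    inputs=[("file", "/data/in/customers.csv")],
    outputs=[("postgres://db.example.com:5432", "sales.public.customers")],
)

print(json.dumps(complete, indent=2))
```

In this sketch the input and output datasets are attached to the COMPLETE event, which is one common pattern. Nothing here posts to Data Catalog; the endpoint and authentication are configured as described in the Administration Guide.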
Supported lineage steps
PDI sends lineage to Data Catalog only for the supported steps listed below.
Native steps
These steps are implemented directly by the PDI platform and supported by the OpenLineage Plug-in:
Text file input (local files, MinIO/HCP S3)
Text file output (local files, MinIO/HCP S3) (*)
Table input (MySQL, PostgreSQL, Oracle, Vertica, SQL Server, Snowflake, Google BigQuery)
Table output (MySQL, PostgreSQL, Oracle, Vertica, SQL Server, Snowflake)
Non-native steps
These steps are implemented by plugins that are loaded when PDI is initialized, so they are supported through a plugin extension:
S3 CSV input
S3 file output (*)
Microsoft Excel Writer (*)
Microsoft Excel input
Note: The steps marked with (*) can split files based on content. If the Add filenames to result option is set for such a step, the step is supported without any limitation. Otherwise, PDI supports only file names without customization (such as date, time, or partitions).
Data Catalog continuously runs an API that captures the lineage information from PDI. To set up the connection between PDI and Data Catalog, see Configure Pentaho Data Integration to send lineage to Pentaho Data Catalog in the Administration Guide.
View lineage detail
You can see lineage information for a resource in the Data Canvas.
Perform the following steps to view more detail about the lineage.
In the Data Canvas, navigate to a resource to view and click the Summary tab if it is not already displayed.
On the Lineage pane, click View Lineage.
(Figure: Data Canvas with View Lineage button)
The Lineage page opens. The large rectangles represent data sources, and the smaller rectangle within a rectangle represents the resource, such as a table, column, or field.
(Figure: Lineage page)
Click the smaller rectangle for the resource.
A side panel opens, with more detail about the resource, such as its Sensitivity.
(Figure: Lineage page with side panel and resource selected)
On the Lineage page, you can also use the following actions to explore the lineage:
Field or Icon | Action
Find in graph | Search the lineage graph
Upstream | Select a number corresponding to the number of hops upstream from the resource that you want to view
Downstream | Select a number corresponding to the number of hops downstream from the resource that you want to view
(icon) | Zoom out
(icon) | Reset the lineage graph size
(icon) | Zoom in
(icon) | Display the action that was performed on the data
Add Lineage | Manually add a resource to the lineage graph. See Add manual lineage.
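The Upstream and Downstream hop settings can be pictured as a depth-limited walk over the lineage graph. The following is a minimal sketch of that idea, assuming lineage is modeled as a list of (source, target) edges. The resource names and the within_hops function are hypothetical and are not part of Data Catalog.

```python
from collections import deque

# Hypothetical lineage edges, each pointing from a source to a target.
EDGES = [
    ("raw_orders.csv", "staging.orders"),
    ("staging.orders", "warehouse.orders"),
    ("warehouse.orders", "reports.daily_sales"),
    ("crm.customers", "warehouse.orders"),
]


def within_hops(edges, start, hops, direction="upstream"):
    """Return the resources reachable from start within `hops` edges."""
    # Build an adjacency map that follows edges backward (upstream)
    # or forward (downstream).
    neighbors = {}
    for source, target in edges:
        if direction == "upstream":
            neighbors.setdefault(target, []).append(source)
        else:
            neighbors.setdefault(source, []).append(target)

    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # hop limit reached; do not expand further
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.add(nxt)
                queue.append((nxt, depth + 1))
    return found


print(sorted(within_hops(EDGES, "warehouse.orders", 1, "upstream")))
print(sorted(within_hops(EDGES, "warehouse.orders", 2, "upstream")))
print(sorted(within_hops(EDGES, "warehouse.orders", 1, "downstream")))
```

With one upstream hop, only the direct sources of warehouse.orders are returned. Raising the value to two also brings in raw_orders.csv, which feeds staging.orders.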
Add manual lineage
If you know the source of a specific resource, and the source does not appear on the lineage graph for the resource, you can add it to the lineage graph. Adding manual lineage involves selecting a Target resource and adding another resource to it, which becomes the Source resource.
Use the following steps to add manual lineage.
If you are not already viewing resource data in the Data Canvas, navigate to a resource in the Data Canvas and from the Summary tab, click View Lineage.
The Lineage page opens.
In the lineage graph, click a resource to use as the Target resource.
(Figure: Lineage page with side panel and resource selected)
Click Add Lineage.
The Add Lineage window opens.
(Figure: Add Lineage window on Lineage page)
Navigate to a resource to add to the lineage and select its checkbox.
The Add button displays the number of resources that are selected.
This resource will be added as the Source of the Target resource.
Click Add.
The resource is added to the lineage graph.
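Conceptually, the procedure above adds one edge that points from the selected Source resource to the Target resource. The following is a minimal sketch of that relationship, assuming lineage is stored as a list of (source, target) pairs. The add_manual_lineage function and the resource names are hypothetical and do not reflect how Data Catalog stores lineage.

```python
def add_manual_lineage(edges, source, target):
    """Return an edge list that includes source -> target, without duplicates."""
    edge = (source, target)
    return edges if edge in edges else edges + [edge]


# Existing lineage: one known source for the target resource.
edges = [("staging.orders", "warehouse.orders")]

# Add a second, manually identified source for the same target.
edges = add_manual_lineage(edges, "legacy.orders_export", "warehouse.orders")

sources = [s for s, t in edges if t == "warehouse.orders"]
print(sources)
```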