Data Canvas

Use the Data Canvas page to explore and investigate your data. Here, you can review detailed resource metadata to understand how your data is structured and how it can be used. Click Data Canvas in the left navigation menu to open the Data Canvas view and begin exploring your data. Be sure to add at least one data source to Data Catalog before exploring. See the Administer Pentaho Data Catalog document for more information.

The Data Canvas is divided into two primary areas:

Navigation and Content areas of Data Canvas page marked with numbers
Item | Name | Description
1 | Navigation | Navigate the tree of data resources to find the one you want to explore in the canvas.
2 | Content | Displays information about the selected resource. For example, if you select a folder or schema, the metadata appears in the Content pane.

Select a data element in the Navigation pane and view its details in the Content pane. The details vary by resource type. For example, selecting a folder or schema displays the metadata in the Content pane.

Navigate the tree of data resources to find the one you want to explore in Data Canvas in Data Catalog. Expand the data source and select the resources you want to work with, then view the structure of your data source in the Content pane. In addition, you can enter a search term in the Search field to search for resources such as folders, schemas, tables, files, or fields within the navigation pane.

When you select one or more resources, each resource name is highlighted in the tree view and the metadata of the selection displays in the Content pane. You can view the name of the selected item and its path in the banner. From the More menu, you can choose one of the following actions:

Process and Move data options on Data Canvas

Process

Opens the Choose Process page, where you can view and select available processes to run on the selected resource. For more information, see the Get started with Pentaho Data Catalog document.

Move Data

Opens the Data Pipes page with the selected resource automatically set in the Scope field. You can create a data pipe template that helps to speed the migration, duplication, or purging of datasets. The available actions and engines on the Data Pipes page depend on the type of data source selected. If you select assets from:

  • Structured data sources (such as Oracle, PostgreSQL, or MS SQL Server): Data Pipe Templates use the Data Integration engine, and the main actions Duplicate Data, Move Data, and Purge Data are available.

  • Unstructured data sources (such as object stores and file systems): Data Pipe Templates use the Data Optimizer engine, and only the main actions Move Data and Purge Data are available. For more information, see the Manage Data Pipes Template section in the Administer Pentaho Data Catalog document.

Content pane

You can view details about the selected resource in the Content pane in the respective tabs. The details displayed depend on the type of resource selected. For example, if you select a table, then you can view the contents of a column or field, the resource-level metadata with data analysis, cardinality for fields, and sample values.

Content pane with numbered areas

The following table identifies the key details available in the Content pane for a table resource:

Screen area
Name
Description

1

Actions button

Click to view actions available for processing, saving, and copying the data, depending on the selected asset type. The actions you can take in the data content area are:

  • Process: Process the selected data.

  • View Galaxy: Change to a Galaxy view of the data.

  • Copy Path: Copy the data path.

  • View Data Movement: View Data Pipe templates.

  • Migrate*: Choose the location and move the selected data assets.

  • Delete*: Delete the file from the file server.

CAUTION: Once you delete a data asset, you cannot recover it.

* These options appear only when:

  1. You have a license for Data Optimizer.

  2. You have imported the data source in Data Optimizer. For more information, see Importing a data source in Administer Pentaho Data Catalog.

  3. You have selected a data asset (file type) that Data Optimizer supports.

2

Data tabs

Click to view additional information about the resource. The tabs you see might vary according to the resource that is selected.

  • Summary

  • Details

  • Properties

  • Glossary

  • More - view additional tabs of information, such as Applications, Policies, Comment, and Similar Items.

Summary tab

In Data Catalog, you can view metadata in graphical formats like value histograms and unique value counts to help you analyze data quickly. You can also view sample values and profiled samples.

To open a data type profile, navigate to the column in the resource you want to view and click it to explore the field-level data.

When viewing column details, you can see the resource field-level metadata along with data analysis, cardinality for fields, and sample values. To show metadata in the resource field, you need native access to the resource or metadata level as governed by the RBAC settings for your user role.

Depending on the selected resource level or data element, you can view different summaries of information, including the following resource metrics:

Description

Displays a description of the resource that is imported from the source. You can also contribute resource information to the knowledge base by writing content and including links to other articles in Data Catalog. To edit the description, click Edit Description to open a dialog box where you can format the text with tools such as bold, italic, underline, and strikeout. You can also align text, insert code blocks, and add links as needed.

System Information

When you choose an unstructured file, System Information displays the timestamps for file creation, modification, and last access.

CAUTION:

  • In some file systems, when a file's modification date is earlier than its creation date, some APIs, such as the SMB network client, might display the more recent date as the modification date.

  • In NFS and CIFS data sources, when you modify a file, Data Catalog might display the same timestamp for both Date Created and Date Last Modified fields.

Statistics

When you select a table, you can view the Field Count and Row Count statistics. The following table identifies the key details available in the Statistics pane when you select a column in a table:

Feature | Description
Null Count | The number of entries that are null.
Cardinality | The number of unique values in a field. A low cardinality indicates many repeated values.
HLL | An estimate of the cardinality of the data, with a margin of error of roughly 2%.
Blank Count | The number of entries that are blank.
Min Width | The minimum number of characters in a value in the column.
Max Width | The maximum number of characters in a value in the column.
Avg Width | The average number of characters in a value in the column.
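To make the column statistics concrete, the following Python sketch computes them for a small list of values. It is only an illustration, not how Data Catalog calculates them: the function name `column_stats`, the sample values, and the rules that `None` counts as null and an empty or whitespace-only string counts as blank are all assumptions. The cardinality here is an exact count, whereas HLL is an estimate.

```python
from statistics import mean


def column_stats(values):
    """Compute simple profile statistics for one column (illustrative only).

    Assumptions, not taken from Data Catalog: None is a null entry, and a
    string that is empty or contains only whitespace is a blank entry.
    """
    non_null = [v for v in values if v is not None]
    texts = [str(v) for v in non_null]
    widths = [len(t) for t in texts]
    return {
        "null_count": len(values) - len(non_null),
        "blank_count": sum(1 for t in texts if t.strip() == ""),
        "cardinality": len(set(non_null)),  # exact count of unique values
        "min_width": min(widths) if widths else 0,
        "max_width": max(widths) if widths else 0,
        "avg_width": mean(widths) if widths else 0.0,
    }


# Hypothetical column values
sample = ["KA12345", "KL12345", "KA12345", None, "", "KT123456"]
stats = column_stats(sample)
print(stats)
```

In this sample, the repeated value "KA12345" is counted once, which is why the cardinality is lower than the number of non-null entries.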

Data Patterns

In Data Catalog, data pattern analysis offers insightful recommendations based on detected patterns and their frequency. These recommendations include RegEx expressions, catering to different levels of pattern matching precision: loose, moderate, and strict. Data Catalog gives you the flexibility to choose the most appropriate patterns. Simplifying the patterns by focusing on just the characters 'A,' 'a,' 'n,' and 's' reveals the underlying data patterns more clearly. After obtaining a set of simplified patterns along with their respective frequency counts, candidate RegEx expressions can be generated. The following options demonstrate possible RegEx expressions tailored to the desired level of strictness:

Pattern | Description
^\w{2}\d{5}$ | Loose Pattern: This pattern is less strict and excludes the last value in the example with 80% confidence.
^[K]\w\d{5}$ | Strict first letter and five digits: This expression maintains strict criteria for the first letter while allowing for variability in the subsequent characters.
^[K]\w\d{5,6}$ | Loose on the second character: This pattern ensures 100% confidence but introduces flexibility for the second character.
^[K][A,L,T,W]\d{5,6}$ | More Strict Pattern: This expression imposes stricter conditions while maintaining 100% confidence.
^[A-Z][A-Z]\d{5,6}$ | Another 100% confidence pattern that differs in its structure.
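As a rough illustration of the idea, the following Python sketch simplifies values into character-class patterns and then measures what share of the values each candidate RegEx matches. The sample values are hypothetical, and the meaning assigned to each symbol (A for an uppercase letter, a for a lowercase letter, n for a digit, s for any other character) is an assumption, as are the function names `simplify` and `coverage`. Data Catalog's own analysis may work differently.

```python
import re
from collections import Counter


def simplify(value):
    """Reduce a value to a pattern of character classes.

    Assumed mapping: A = uppercase letter, a = lowercase letter,
    n = digit, s = any other character.
    """
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("n")
        elif ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        else:
            symbols.append("s")
    return "".join(symbols)


def coverage(pattern, values):
    """Return the fraction of values that the RegEx matches."""
    matched = sum(1 for v in values if re.search(pattern, v))
    return matched / len(values)


# Hypothetical sample values; the last one has six digits instead of five
samples = ["KA12345", "KL12345", "KT12345", "KW12345", "KA123456"]

# Frequency of each simplified pattern
pattern_counts = Counter(simplify(v) for v in samples)
print(pattern_counts)

# Share of sample values matched by each candidate expression
for candidate in (r"^\w{2}\d{5}$", r"^[K]\w\d{5}$", r"^[K]\w\d{5,6}$",
                  r"^[K][A,L,T,W]\d{5,6}$", r"^[A-Z][A-Z]\d{5,6}$"):
    print(candidate, coverage(candidate, samples))
```

With these sample values, the first two expressions match four of the five values, and the remaining three match all of them. In standard RegEx syntax, a comma inside a character class is a literal character, so [A,L,T,W] also accepts a comma in the second position. [ALTW] is the tighter form if only the four letters are intended.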

CAUTION:

If your user role does not grant access to the field or viewing level of the information, the Data Patterns pane does not appear.

Sample Data

Shows random values for the field, along with their frequency and distribution, when you view a column. Text names and values are truncated after 200 characters. You can identify resources that have been sample-profiled and other resource-level information.

To view this pane, your role must allow Sample Data Access through native system permissions. If your user role has administrative privileges, you can configure these values. If not, contact your administrator for details.

Important: Data Catalog governs access to view sample data with the View samples permission. Users with this permission can see sample data, but users without the View samples permission see the sample data in a masked format, such as ****** ** **, ensuring sensitive information remains protected. For information on permissions, see Default user roles and permissions.
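The masked format can be pictured with a short Python sketch. This is only an assumption about how such masking might behave: it replaces every non-whitespace character with an asterisk and keeps the spacing. The function name `mask_sample` and the sample value are hypothetical and are not part of Data Catalog.

```python
def mask_sample(value, mask_char="*"):
    """Replace every non-whitespace character with mask_char (illustrative only)."""
    return "".join(ch if ch.isspace() else mask_char for ch in value)


# Hypothetical sample value
masked = mask_sample("Boston MA US")
print(masked)
```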

Lineage

Displays a visual representation of the history of the selected data, including its origin, flow, and transformations. Data lineage provides visibility into the data’s historical context and authenticity, which helps in understanding how data is manipulated and transformed across different processes and systems. You can click View Lineage to focus on the lineage and add manual lineage.

Key Metrics

Shows the following important characteristics of the resource:

  • Data Quality: The Data Quality metric is visible if you purchase and configure Pentaho Data Quality, and process data with the Data Quality Loader process.

  • Sensitivity: By default, Sensitivity is set to Unknown. You can set the Sensitivity level to Low, Medium, or High.

  • Data Lineage: Data Lineage is visible if the resource is a table, column, or file. By default, Data Lineage is Unverified. You can set Data Lineage to Verified or Unverified.

  • Trust Score: By default, the Trust Score for a resource is Untrusted. You can enter a score, which sets the Trust Score to Untrusted, Trusted, or Highly Trusted, depending on the score.

Properties panel

Displays a summary of the resource properties, like the last update timestamp, name, version, and type of the resource.

Business Terms panel

Lists associated business terms for the resource. You can also click Add Term to open the Business Terms dialog box and add terms to the resource. For more information, see the Administer Pentaho Data Catalog document.

Tags panel

Lists the tags associated with the resource. In addition, you can click and start adding tags like “quality:45” (the key should be unique) to the resource, which helps to identify the resource with tagged keywords.
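The key and value structure of a tag such as "quality:45" can be sketched in Python as follows. This is a hypothetical illustration that only handles tags written as key:value. The helper names `parse_tag` and `add_tag`, and the choice to reject a repeated key with an error, are assumptions. The document states only that the key should be unique, not how Data Catalog enforces it.

```python
def parse_tag(tag):
    """Split a 'key:value' tag into its key and value."""
    key, separator, value = tag.partition(":")
    if not separator or not key.strip():
        raise ValueError(f"expected 'key:value', got {tag!r}")
    return key.strip(), value.strip()


def add_tag(tags, tag):
    """Add a tag to a dict of tags, refusing a key that is already present."""
    key, value = parse_tag(tag)
    if key in tags:
        # Assumed behavior: a duplicate key is rejected
        raise KeyError(f"tag key {key!r} already exists")
    tags[key] = value
    return tags


resource_tags = {}
add_tag(resource_tags, "quality:45")
print(resource_tags)
```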

Custom Properties panel

Lists the first five custom properties associated with the resource. Custom properties refer to user-defined metadata attributes or fields that can be associated with various data assets, such as databases, tables, files, or documents, to provide additional context and information about those assets. To add a custom property, click Add Custom Property and provide the required information. In addition, go to the Properties tab to see the complete list of custom properties added to the resource.

Data Storage Administrator view

If you have the Data Storage Administrator role in Data Catalog, you have access to enhanced views within the Data Canvas for root-level folders of object stores, such as AWS S3 and Azure Blob Storage, and file systems, such as CIFS, NFS, and SMB. The following UI components are available in this role-specific view.

The UI cards available to the Data Storage Administrator role
  • Used Capacity: This tile shows the total storage consumed by all files and folders under the selected data source or root directory. It helps you to quickly identify storage-intensive locations and supports capacity planning.

  • Count of Subfolders: This tile shows the number of immediate subfolders present under the root directory, offering a quick view of the folder hierarchy and helping to assess structural complexity.

  • Count of Files/Entities: This tile shows the total number of files or entities under the selected data source or root directory.

  • Duplicate Groups: This tile shows the number of duplicate file groups identified in the data source. With this, you can reduce redundancy and improve storage efficiency by detecting duplicate files.

  • Top 10 Summary View: This graph is an interactive bar chart that provides a visual overview of key folder-level metrics within the selected data source. You can toggle between three views:

    • Child Folders: Displays the top 10 subfolders by count.

    • Child Files: Shows the top 10 folders based on the number of contained files.

    • Used Capacity: Highlights the top 10 folders by total storage consumed. This visualization helps to compare folder usage patterns, identify high-volume or high-capacity directories, and prioritize areas for optimization.

  • Files by Temperature: The Files by Temperature graph shows the distribution of files based on their access and modification activity, referred to as data temperature. Files are grouped into categories such as Hot (frequently accessed), Warm, Cold (rarely accessed), or Unclassified (used when temperature metadata is unavailable). Understanding data temperature helps you assess how actively data is being used and make informed decisions about data retention, archival, and storage cost optimization.

Count of files by data temperature
  • Files by Type

    The Files by Type graph visualizes the distribution of files based on their format, such as CSV, JSON, PDF, DOCX, ZIP, and others. This chart helps you to understand the diversity of file types stored within a data source and evaluate the degree of file format standardization. This visibility of file types supports better metadata governance, content classification, and downstream processing decisions.

    Count of files by type

Note: In the graphs, you can hover over the columns to view the exact count of items, along with additional information.
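The groupings behind the Files by Temperature and Files by Type graphs can be approximated with a short Python sketch. The thresholds used here (30 days for Hot and 180 days for Warm) are hypothetical, because the document does not state how Data Catalog defines each temperature. The file list, the fixed reference date, and the helper names `temperature` and `file_type` are also assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from pathlib import PurePosixPath

HOT_DAYS = 30    # hypothetical threshold, not a Data Catalog setting
WARM_DAYS = 180  # hypothetical threshold, not a Data Catalog setting


def temperature(last_activity, now):
    """Classify a file by the age of its last access or modification."""
    if last_activity is None:
        return "Unclassified"  # no temperature metadata available
    age = now - last_activity
    if age <= timedelta(days=HOT_DAYS):
        return "Hot"
    if age <= timedelta(days=WARM_DAYS):
        return "Warm"
    return "Cold"


def file_type(path):
    """Return the file extension in uppercase, or UNKNOWN if there is none."""
    return PurePosixPath(path).suffix.lstrip(".").upper() or "UNKNOWN"


# Hypothetical files: (path, last access or modification time)
files = [
    ("reports/q1.csv", datetime(2024, 5, 25)),
    ("reports/q2.csv", datetime(2024, 2, 1)),
    ("archive/old.pdf", datetime(2022, 1, 1)),
    ("archive/notes.docx", None),
]
now = datetime(2024, 6, 1)  # fixed reference date so the result is repeatable

by_temperature = Counter(temperature(ts, now) for _, ts in files)
by_type = Counter(file_type(path) for path, _ in files)
print(by_temperature)
print(by_type)
```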

Details tab

The Details tab contains detailed information about child resources. You can view the items available in the selected resource, along with some additional information. The information varies based on the resource selected. For example, if you select a data source, you can view available items such as schemas for structured data and folders for file systems. For a folder, the tab gives a detailed breakdown of the folder contents, which can help in storage auditing and metadata review. Each row in the list represents a child folder or file and includes:

  • Item Name: Name of the folder or file.

  • Item Type: Indicates if it's a folder or file.

  • Duplicate Groups: Number of duplicate file groups within that folder.

  • Used Capacity: Total size of files in the folder.

  • Oldest Child Date: Earliest recorded access or modification timestamp of any item within the folder.

  • Youngest Child Date: Most recent access or modification timestamp.

  • Data Temperature: A link to view more metadata on created, modified, and accessed dates.

When you select a schema, you can view the number of tables and columns it contains, along with associated tags, row counts, and the last profiled date and time. In addition, you can click View in each row to open the corresponding data asset in a focused view within the Data Canvas.

Properties tab

View the custom properties added to the resource and the details like name and value. You can also add custom properties and edit the value of a property. For more information, see Resource properties.

Glossary tab

Explore the business terms information on the resource, such as category, glossary, definition, and purpose. You can also add business terms to the resource. For more information, see Business Glossary.

Applications tab

Lists any applications associated with the selected resource, with details such as the name, parent, and owner of the application. You can sort the columns and add applications if you have permission to do so. For more information, see Applications.

Policies tab

View the policies and standards associated with the resource. With permission to modify a policy, you can add or delete a standard association. For more information on policies and standards, see Policies and standards.

Comment tab

The Comment tab is a collaborative feature that allows users to discuss and provide feedback on specific data assets within Data Catalog. You can add comments, share suggestions, or ask questions directly in the tab using the provided text box, which includes basic formatting options like bold, italic, and bullet points. In addition, you can tag other users by typing the "@" symbol followed by their username. The tagged users are notified of the comment by email and in the Mentions tab on the Data Catalog landing page, prompting them to respond if necessary. For more information, see Tour of the Home page.

Note: In the Comment tab, you can:

  • Tag users who have been configured in Data Catalog.

  • Delete only the comments you posted.

  • Delete any comment if you are an admin.

Duplicates tab

If the Compute checksum of document content checkbox was selected when the Data Discovery process was used to process unstructured data, you can see any duplicate files listed on the Duplicates tab. Files are determined to be duplicates if they have the same checksum. You can view the contents of each file by clicking View on the file listing. For more information, see Processing unstructured data.
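The checksum comparison can be illustrated with a minimal Python sketch that groups files by the checksum of their content. SHA-256 is used here as an assumption, because the document does not state which checksum algorithm Data Catalog computes. The file names, contents, and the helper names `checksum` and `duplicate_groups` are hypothetical.

```python
import hashlib
from collections import defaultdict


def checksum(content):
    """Return a hex digest of the file content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def duplicate_groups(files):
    """Group file names whose contents have the same checksum.

    `files` maps a file name to its content as bytes. Only groups with
    more than one file are returned.
    """
    groups = defaultdict(list)
    for name, content in files.items():
        groups[checksum(content)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical files and contents
files = {
    "a.txt": b"hello",
    "copy_of_a.txt": b"hello",
    "b.txt": b"world",
}
groups = duplicate_groups(files)
print(groups)
```

Files with different names but identical content fall into the same group, because only the content is hashed.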
