Data Canvas

Use the Data Canvas page to explore and investigate your data. Here, you can find detailed insights into resource metadata to help you understand and clarify practical applications. Click Data Canvas in the left navigation menu to open the Data Canvas view and begin exploring your data. Be sure to add at least one data source to Data Catalog before exploring. See the Administer Pentaho Data Catalog document for more information.

The Data Canvas is divided into two primary areas:

Navigation: Navigate the tree of data resources to find the one you want to explore in the canvas.
Content: Displays information about the selected resource. For example, if you select a folder or schema, the metadata appears in the Content pane.

Select a data element in the Navigation pane and view its details in the Content pane. The details vary by resource type.

Navigation pane

Navigate the tree of data resources to find the one you want to explore in Data Canvas. Expand the data source and select the resources you want to work with, then view the structure of your data source in the Content pane. You can also enter a search term in the Search field to find resources such as folders, schemas, tables, files, or fields within the Navigation pane.

When you select one or more resources, each resource name is highlighted in the tree view and the metadata of the selection displays in the Content pane. You can view the name of the selected item and its path in the banner. From the More menu, you can choose one of the following actions:

Process

Opens the Choose Process page, where you can view and select available processes to run on the selected resource. For more information, see the Get started with Pentaho Data Catalog document.

Move Data

Opens the Data Pipes page with the selected resource automatically set in the Scope field. You can create a data pipe template that helps speed up the migration, duplication, or purging of datasets. The available actions and engines on the Data Pipes page depend on the type of data source selected. If you select assets from:

Structured data sources (such as Oracle, PostgreSQL, or MS SQL Server): Data Pipe Templates use the Data Integration engine, and the main actions Duplicate Data, Move Data, and Purge Data are available.
Unstructured data sources (such as object stores and file systems): Data Pipe Templates use the Data Optimizer engine, and only the main actions Move Data and Purge Data are available.

For more information, see the Manage Data Pipes Template section in the Administer Pentaho Data Catalog document.

Content pane

You can view details about the selected resource in the Content pane in the respective tabs. The details displayed depend on the type of resource selected. For example, if you select a table, then you can view the contents of a column or field, the resource-level metadata with data analysis, cardinality for fields, and sample values.

The following table identifies the key details available in the Content pane for a table resource:

Screen area

Name

Description

Actions button

Click to view actions available for processing, saving, and copying the data, depending on the selected asset type. The actions you can take in the data content area are:

Process: Process the selected data.
View Galaxy: Change to a Galaxy view of the data.
Copy Path: Copy the data path.
View Data Movement: View Data Pipe templates.
Migrate*: Choose the location and move the selected data assets.
Delete*: Delete the file from the file server.

CAUTION: Once you delete a data asset, you cannot recover it.

* These options appear only when:

You have a license for Data Optimizer.
You have imported the data source in Data Optimizer. For more information, see Importing a data source in Administer Pentaho Data Catalog.
You have selected a data asset (file type) that Data Optimizer supports.

Data tabs

Click to view additional information about the resource. The tabs you see might vary according to the resource that is selected.

Summary
Details
Properties
Glossary
More: View additional tabs of information, such as Applications, Policies, Comment, and Similar Items.

Summary tab

In Data Catalog, you can view metadata in graphical formats, such as value histograms and unique value counts, to help you analyze data quickly. You can also view sample values and profiled samples.

To open a data type profile, navigate to the column in the resource you want to view and click it to explore the field-level data.

When viewing column details, you can see the resource field-level metadata along with data analysis, cardinality for fields, and sample values. To show metadata in the resource field, you need native access to the resource or metadata level as governed by the RBAC settings for your user role.

Depending on the selected resource level or data element, you can view different summaries of information, including the following resource metrics:

Description

Displays a description of the resource that is imported from the source. You can contribute resource information to the knowledge base by writing content and including links to other articles in Data Catalog. To edit the description, click Edit Description, which opens a dialog box where you can format the text using tools such as bold, italic, underline, and strikeout. You can also align text, insert code blocks, and add links as needed.

System Information

When you choose an unstructured file, System Information displays the timestamps for file creation, modification, and last access.

In certain file systems, when a file's modification date is earlier than its creation date, certain APIs, such as the SMB network client, might display the more recent date as the modification date.
In NFS and CIFS data sources, when you modify a file, Data Catalog might display the same timestamp for both the Date Created and Date Last Modified fields.

Statistics

When you select a table, you can view the Field Count and Row Count statistics. The following table identifies the key details available in the Statistics pane when you select a column in a table to view:

Null Count: The number of entries that are null.
Cardinality: The number of unique values in a field. A low cardinality number indicates many repeated values.
HLL: A HyperLogLog (HLL) estimate of the cardinality of the data, with a margin of error of roughly 2%.
Blank Count: The number of entries that are blank.
Min Width: The minimum number of characters in a value in the column.
Max Width: The maximum number of characters in a value in the column.
Avg Width: The average number of characters in a value in the column.
Uniqueness: The uniqueness of the values in a field.
Density: The percentage of entries that have actual values.
Selectivity: The percentage selectivity of a column. The higher the value, the more effective a query on that column is in narrowing down a result set.
Stdev Value: The spread or dispersion of the data points in the column relative to the column's mean (average).
Lexical Min: The smallest value in the column when values are compared in dictionary (lexicographical) order.
Lexical Max: The largest value in the column when values are compared in dictionary (lexicographical) order.
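To make several of these statistics concrete, the following Python sketch computes them for a small in-memory column. It is an illustration only and not how Data Catalog profiles data. The function name column_stats and the sample values are hypothetical, and the sketch assumes that a blank entry is a non-null value that is empty or contains only whitespace. Uniqueness, Density, and Selectivity are left out because this page does not give their exact formulas.

```python
from statistics import mean


def column_stats(values):
    """Compute simple profile statistics for one column.

    `values` is a list in which None stands for a null entry.
    """
    non_null = [v for v in values if v is not None]
    texts = [str(v) for v in non_null]
    # Assumption: "blank" means empty or whitespace-only, but not null.
    blanks = [t for t in texts if t.strip() == ""]
    widths = [len(t) for t in texts]
    return {
        "null_count": len(values) - len(non_null),
        "blank_count": len(blanks),
        "cardinality": len(set(texts)),  # exact count, unlike the HLL estimate
        "min_width": min(widths) if widths else None,
        "max_width": max(widths) if widths else None,
        "avg_width": mean(widths) if widths else None,
        "lexical_min": min(texts) if texts else None,  # dictionary order
        "lexical_max": max(texts) if texts else None,
    }


# Hypothetical column values, invented for illustration.
sample = ["KA12345", "KL54321", None, "", "KA12345"]
stats = column_stats(sample)
print(stats)
```

In this sketch the blank entry counts as a distinct value and as a zero-width value. A real profiler might treat blanks differently.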

Data Patterns

In Data Catalog, data pattern analysis offers insightful recommendations based on detected patterns and their frequency. These recommendations include RegEx expressions, catering to different levels of pattern matching precision: loose, moderate, and strict. Data Catalog gives you the flexibility to choose the most appropriate patterns. Simplifying the patterns by focusing on just the characters 'A,' 'a,' 'n,' and 's' reveals the underlying data patterns more clearly. After obtaining a set of simplified patterns along with their respective frequency counts, candidate RegEx expressions can be generated. The following options demonstrate possible RegEx expressions tailored to the desired level of strictness:

Pattern

Description

^\w{2}\d{5}$

Loose Pattern: This pattern is less strict and excludes the last value in the example with 80% confidence.

^[K]\w\d{5}$

Strict first letter and five digits: This expression maintains strict criteria for the first letter while allowing for variability in the subsequent characters.

^[K]\w\d{5,6}$

Loose on the second character: This pattern ensures 100% confidence but introduces flexibility for the second character.

^[K][A,L,T,W]\d{5,6}$

More Strict Pattern: This expression imposes stricter conditions while maintaining 100% confidence.

^[A-Z][A-Z]\d{5,6}$

Another 100% confidence pattern that differs in its structure.

If your user role does not grant access to the field or viewing level of the information, the Data Patterns pane does not appear.
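The idea of simplifying values into patterns and then scoring candidate expressions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Data Catalog algorithm. The mapping of 'A' to uppercase letters, 'a' to lowercase letters, 'n' to digits, and 's' to any other character is assumed, the sample values are invented, and the helper names simplify and match_ratio are hypothetical.

```python
import re
from collections import Counter


def simplify(value):
    """Map each character to a class symbol.

    Assumed mapping (not confirmed by this page):
    'A' = uppercase letter, 'a' = lowercase letter,
    'n' = digit, 's' = any other character.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha() and ch.isupper():
            out.append("A")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append("s")
    return "".join(out)


def match_ratio(regex, items):
    """Fraction of items that the regular expression matches."""
    compiled = re.compile(regex)
    return sum(1 for item in items if compiled.match(item)) / len(items)


# Hypothetical sample values, invented for illustration.
values = ["KA12345", "KL54321", "KT00042", "KW98765", "KA123456"]

# Frequency of each simplified pattern in the sample.
pattern_counts = Counter(simplify(v) for v in values)

# Score two of the candidate expressions listed above against the sample.
loose = match_ratio(r"^\w{2}\d{5}$", values)
flexible = match_ratio(r"^[K]\w\d{5,6}$", values)
print(pattern_counts, loose, flexible)
```

With these sample values, the first expression misses the one value that has six digits, while the second matches every value. That mirrors the trade-off between strictness and coverage that the table describes.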

Sample Data

During data profiling of structured data, when you select the Extract Samples option, a small random sample of data is extracted and displayed in the Sample Data pane under the Details tab of a column. The sample values help you preview and validate the data and understand its distribution. The Sample Data pane has two tabs: Raw and Aggregated.

Raw tab: Displays a random set of individual sample values from the column. Text names and values are truncated after 200 characters. Use the Raw tab to review how actual values appear in the data set.
Aggregated tab: Groups identical values and displays each unique value only once, along with its frequency and percentage. This view helps you quickly identify the most common values and their relative distribution in the column. For example, a value such as “white” appears once in the list, with a count of rows containing that value and its percentage of the total.
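The difference between the two tabs can be illustrated with a short Python sketch. It is not the product's implementation. The function names raw_view and aggregated_view and the sample values are hypothetical. Only the 200-character truncation and the grouping by identical value come from the descriptions above.

```python
from collections import Counter

MAX_DISPLAY_CHARS = 200  # truncation length stated for the Raw tab


def raw_view(samples):
    """Return the sample values in their original order, truncated for display."""
    return [s[:MAX_DISPLAY_CHARS] for s in samples]


def aggregated_view(samples):
    """Group identical values into (value, count, percentage) rows, most common first."""
    total = len(samples)
    counts = Counter(samples)
    return [
        (value, count, 100.0 * count / total)
        for value, count in counts.most_common()
    ]


# Hypothetical sample values, invented for illustration.
samples = ["white", "black", "white", "red", "white"]
rows = aggregated_view(samples)
for value, count, percentage in rows:
    print(value, count, percentage)
```

Here "white" appears once in the aggregated rows, with a count of 3 and a share of 60 percent of the sampled values.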

To view this pane, your role must allow Sample Data Access through native system permissions. If your user role has administrative privileges, you can configure these values. If not, contact your administrator for details.

Important: Data Catalog governs access to view sample data with the View samples permission. Users with this permission can see sample data, but users without the View samples permission see the sample data in a masked format, such as ****** ** **, ensuring sensitive information remains protected. For information on permissions, see Default user roles and permissions.

Lineage

Displays a visual representation of the history of the selected data, including its origin, flow, and transformations. Data lineage provides visibility into the data’s historical context and authenticity, which helps in understanding how data is manipulated and transformed across different processes and systems. You can click View Lineage to focus on the lineage and add a manual lineage.

Key Metrics

Shows the following important characteristics of the resource:

Data Quality: The Data Quality metric is visible if you purchase and configure Pentaho Data Quality, and process data with the Data Quality Loader process.
Sensitivity: By default, Sensitivity is set to Unknown. You can set the Sensitivity level to Low, Medium, or High.
Data Lineage: Data Lineage is visible if the resource is a table, column, or file. By default, Data Lineage is Unverified. You can set Data Lineage to Verified or Unverified.
Trust Score: By default, the Trust Score for a resource is Untrusted. You can enter a score, which sets the Trust Score to Untrusted, Trusted, or Highly Trusted, depending on the score.

Properties panel

The Properties panel displays a summary of the selected resource, including details such as the last update timestamp, name, version, and type of the resource.

For Microsoft SQL, Oracle, or Snowflake data sources, when the Usage Statistics process is run, the panel also displays usage-related properties. These properties provide insights into how the resource is accessed and modified. Examples include:

Read Count: Number of times the entity has been queried or read.
Write Count: Number of times the entity has been updated or written to.
Alter Count: Number of times the entity’s structure has been altered.
Last Accessed Time: Timestamp of the most recent access.
First Accessed Time: Timestamp of the first recorded access.

The availability of usage properties depends on the data source and configuration. For more information, see Usage Statistics.

Business Terms panel

Lists associated business terms for the resource. You can also click Add Term to open the Business Terms dialog box and add terms to the resource. For more information, see the Administer Pentaho Data Catalog document.

Tags panel

Lists the tags associated with the resource. In addition, you can add tags such as “quality:45” (the key must be unique) to the resource, which helps to identify the resource by its tagged keywords.

Custom Properties panel

Lists the first five custom properties associated with the resource. Custom properties refer to user-defined metadata attributes or fields that can be associated with various data assets, such as databases, tables, files, or documents, to provide additional context and information about those assets. To add a custom property, click Add Custom Property and provide the required information. In addition, go to the Properties tab to see the complete list of custom properties added to the resource.

Data Storage Administrator view

If you have the Data Storage Administrator role in Data Catalog, you have access to enhanced views within the Data Canvas for root-level folders of object stores such as AWS S3 and Azure Blob Storage, and of file systems such as CIFS, NFS, and SMB, among others. The following UI components are available in this role-specific view:

Used Capacity: This tile shows the total storage consumed by all files and folders under the selected data source or root directory. It helps you to quickly identify storage-intensive locations and supports capacity planning.
Count of Subfolders: This tile shows the number of immediate subfolders present under the root directory, offering a quick view of the folder hierarchy and helping to assess structural complexity.
Count of Files/Entities: This tile shows the number of files or entities in the selected data source or root directory.
Duplicate Groups: This tile shows the number of duplicate file groups identified in the data source. With this, you can reduce redundancy and improve storage efficiency by detecting duplicate files.
Top 10 Summary View: This graph is an interactive bar chart that provides a visual overview of key folder-level metrics within the selected data source. You can toggle between three views:
- Child Folders: Displays the top 10 subfolders by count.
- Child Files: Shows the top 10 folders based on the number of contained files.
- Used Capacity: Highlights the top 10 folders by total storage consumed. This visualization helps to compare folder usage patterns, identify high-volume or high-capacity directories, and prioritize areas for optimization.
Files by Temperature: The Files by Temperature graph shows the distribution of files based on their access and modification activity, referred to as data temperature. Files are grouped into categories such as Hot (frequently accessed), Warm, Cold (rarely accessed), or Unclassified (used when temperature metadata is unavailable). Understanding how actively data is used helps you make informed decisions about data retention, archival, and storage cost optimization.

Files by Type: The Files by Type graph visualizes the distribution of files based on their format, such as CSV, JSON, PDF, DOCX, ZIP, and others. This chart helps you to understand the diversity of file types stored within a data source and evaluate the degree of file format standardization. This visibility of file types supports better metadata governance, content classification, and downstream processing decisions.

Note: In the graphs, you can hover over the columns to view the exact count of items, along with additional information.
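As a rough illustration of what these tiles and graphs summarize, the following Python sketch derives similar figures from a list of file records. Everything in it is an assumption made for illustration: the FileEntry record, the 30-day and 180-day temperature thresholds, and the use of last access time alone to decide temperature. This page does not state how Data Catalog computes these values.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class FileEntry:
    path: str                          # path relative to the root folder
    size_bytes: int
    last_accessed: Optional[datetime]  # None when no timestamp is available


# Hypothetical thresholds; the actual cutoffs are not documented here.
HOT_WITHIN = timedelta(days=30)
WARM_WITHIN = timedelta(days=180)


def temperature(entry, now):
    """Classify one file by the age of its last access."""
    if entry.last_accessed is None:
        return "Unclassified"
    age = now - entry.last_accessed
    if age <= HOT_WITHIN:
        return "Hot"
    if age <= WARM_WITHIN:
        return "Warm"
    return "Cold"


def summarize(entries, now):
    """Build tile-style totals and graph-style distributions."""
    paths = [PurePosixPath(e.path) for e in entries]
    return {
        "used_capacity": sum(e.size_bytes for e in entries),
        # Immediate subfolders that contain at least one file.
        "subfolders": len({p.parts[0] for p in paths if len(p.parts) > 1}),
        "files": len(entries),
        "by_type": Counter(p.suffix.lower().lstrip(".") or "unknown" for p in paths),
        "by_temperature": Counter(temperature(e, now) for e in entries),
    }


# Hypothetical file records, invented for illustration.
now = datetime(2024, 6, 1)
entries = [
    FileEntry("reports/q1.csv", 1200, datetime(2024, 5, 20)),
    FileEntry("reports/q2.CSV", 800, datetime(2024, 1, 15)),
    FileEntry("archive/old.pdf", 5000, datetime(2022, 3, 1)),
    FileEntry("readme.json", 100, None),
]
summary = summarize(entries, now)
print(summary)
```

Because the sketch works from file records only, empty subfolders are not counted. A real scan of the storage system would see them.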

Details tab

The Details tab displays detailed information about child resources. You can view the items available in the selected resource, along with some additional information. The information varies based on the resource selected. For example, if you select a data source, you can view available items like a schema for structured data and folders for file systems. It is a detailed breakdown of folder contents, which can help in storage auditing and metadata review. Each row in the list represents a subfolder and includes:

Item Name: Name of the folder or file.
Item Type: Indicates whether the item is a folder or a file.
Duplicate Groups: Number of duplicate file groups within that folder.
Used Capacity: Total size of files in the folder.
Oldest Child Date: Earliest recorded access or modification timestamp of any item within the folder.
Youngest Child Date: Most recent access or modification timestamp.
Data Temperature: A link to view more metadata on created, modified, and accessed dates.

When you select a schema, you can view the number of tables and columns it contains, along with associated tags, row counts, and the last profiled date and time. In addition, you can click View in each row to open the corresponding data asset in a focused view within the Data Canvas.

The Details tab provides filter options for each column. You can apply these filters to narrow down the displayed assets, making asset selection efficient. When you apply filters and select one or more checkboxes, the Add to Cart button becomes active.

Clicking Add to Cart adds the selected items to your cart. After items are added, you can create a data set or data collection based on your selected data type. The processes to create data sets or data collections remain unchanged. For more information, see Manage collections in the Administer Pentaho Data Catalog guide.

Properties tab

View the custom properties added to the resource and the details like name and value. You can also add custom properties and edit the value of a property. For more information, see Resource properties.

Glossary tab

Explore the business term information for the resource, such as category, glossary, definition, and purpose. You can also add business terms to the resource. For more information, see Business Glossary.

Applications tab

Lists any applications associated with the selected resource, with details such as the application name, parent, and owner. You can sort the columns and add applications if you have permission to do so. For more information, see Applications.

Policies tab

View the policies and standards associated with the resource. With permission to modify a policy, you can add or delete a standard association. For more information on policies and standards, see Policies and standards.

Comment tab

The Comment tab is a collaborative feature that allows users to discuss and provide feedback on specific data assets within Data Catalog. You can add comments, share suggestions, or ask questions directly in the tab using the provided text box, which includes basic formatting options like bold, italic, and bullet points. In addition, you can tag other users by mentioning them with the "@" symbol followed by their username. The tagged users are notified of the comment through email and in the Mentions tab on the Data Catalog landing page, prompting them to respond if necessary. For more information, see Tour of the Home page.

Note: In the Comment tab, you can:

Tag users who have been configured in Data Catalog.
Only delete the comments you posted.
Delete any comment if you are an admin.

Duplicates tab

If the Compute checksum of document content checkbox was selected when the Data Discovery process was used to process unstructured data, you can see any duplicate files listed on the Duplicates tab. Files are determined to be duplicates if they have the same checksum. You can view the contents of each file by clicking View on the file listing. For more information, see Processing unstructured data.
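The checksum rule can be sketched in a few lines of Python. This is an illustration of the general technique, not the Data Discovery implementation. SHA-256 is an assumed stand-in because this page does not name the checksum algorithm, and the function names and file contents are hypothetical.

```python
import hashlib
from collections import defaultdict


def checksum(content: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever checksum Data Catalog computes.
    return hashlib.sha256(content).hexdigest()


def duplicate_groups(files):
    """Group file names by content checksum.

    `files` maps a file name to its content as bytes. Only groups with
    two or more files are returned, because a single file is not a duplicate.
    """
    groups = defaultdict(list)
    for name, content in files.items():
        groups[checksum(content)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical files, invented for illustration.
files = {
    "a/report.txt": b"quarterly numbers",
    "b/report_copy.txt": b"quarterly numbers",
    "c/notes.txt": b"meeting notes",
}
groups = duplicate_groups(files)
print(groups)
```

The two report files have identical content and therefore the same checksum, so they form one duplicate group. The notes file stands alone and is not listed.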
