
# Collections

In Pentaho Data Catalog, a **Collection** is a way to logically group data assets, such as schemas, tables, and files, so that you can work with them more efficiently. Whether you are analyzing similar datasets or combining diverse data sources, Collections allow you to organize and manage your data entities based on structure or business use case.

Collections help you bring together related data assets in a meaningful and organized manner. You can group tables, files, or schemas that belong to the same business context or project, making it easier to organize and access them. For Datasets, you can streamline analysis by identifying common columns across multiple tables and using profiling and aggregation jobs to evaluate data structure and quality.

Data Catalog supports two types of Collections:

* **Dataset**: A Dataset is a group of homogeneous data assets, such as tables or files that share the same schema.
* **Data Collection**: A Data Collection is a group of heterogeneous data assets, such as files, tables, or schemas, with different structures.

Collections also support governance through business terms, trust scores, and sensitivity levels. After you curate a collection, you can share it with your team or publish it as a data product to make it discoverable and reusable across the organization.

{% hint style="info" %}
By default, datasets and data collections are visible only to their owners unless shared explicitly. You can publish a Dataset or Data Collection as a Data Product to make it visible to all users in Data Catalog. For more information, see [Components of collections](#components-of-collections).
{% endhint %}

## Components of collections

In Data Catalog, Collections are organized using a hierarchical structure that helps you manage data logically and efficiently. The following table describes the components of this hierarchy:

<table><thead><tr><th width="78.11114501953125" align="center">Item</th><th width="148.5555419921875">Component</th><th>Description</th><th>Example</th></tr></thead><tbody><tr><td align="center">1</td><td>Category</td><td>The top-level container used to group collections by domain, department, or project.</td><td>Finance, Sales, Healthcare, or Marketing Analytics</td></tr><tr><td align="center">2</td><td>Group</td><td>A subfolder within a category that organizes datasets or collections around a specific subject, source, or use case.</td><td>Monthly Reports, Vendor Data, or Customer Feedback</td></tr><tr><td align="center">3</td><td>Dataset</td><td>A collection of homogeneous data, such as tables or files that share the same schema. Datasets support profiling and aggregation.</td><td>A group of .csv files with the same columns from different months</td></tr><tr><td align="center">4</td><td>Data Collection</td><td>A collection of heterogeneous data, such as files, tables, or schemas that differ in structure. Used for organizing logically related but structurally different assets.</td><td>A collection containing database schemas with SQL tables, and folders with Excel files and PDFs for a project</td></tr><tr><td align="center">5</td><td>Data Product</td><td>A curated Dataset or Data Collection that meets publishing criteria and is made available for broader discovery and consumption.</td><td>A verified Customer Profile Dataset shared with analysts across teams.</td></tr></tbody></table>

The hierarchy flows from Category > Group > Dataset or Data Collection and can result in a **Data Product** when published.
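The hierarchy described above can be pictured as a small tree. The following Python sketch is only an illustration of that structure; the class names, fields, and the `hierarchy_paths` helper are hypothetical and are not part of any Data Catalog API.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    name: str
    kind: str  # "Dataset" or "Data Collection"
    published: bool = False  # assumption: True once published as a Data Product


@dataclass
class Group:
    name: str
    collections: list[Collection] = field(default_factory=list)


@dataclass
class Category:
    name: str
    groups: list[Group] = field(default_factory=list)


def hierarchy_paths(category: Category) -> list[str]:
    """Return one 'Category > Group > Collection' path per collection."""
    paths = []
    for group in category.groups:
        for coll in group.collections:
            label = "Data Product" if coll.published else coll.kind
            paths.append(f"{category.name} > {group.name} > {coll.name} ({label})")
    return paths


# Hypothetical example data, loosely based on the Finance example in this page.
finance = Category(
    name="Finance",
    groups=[
        Group(
            name="Monthly Reports",
            collections=[
                Collection("Monthly Transactions", "Dataset", published=True),
                Collection("Audit Evidence", "Data Collection"),
            ],
        )
    ],
)

for path in hierarchy_paths(finance):
    print(path)
```

Each printed path mirrors what the banner shows for a selected item: the item name and its location in the hierarchy.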

### Category

In Data Catalog, a **Category** is the top-level container in the **Collections** hierarchy. It helps you group related data assets based on a broader business domain, department, or organizational function. Categories act as entry points that help you organize and locate collections, and they simplify navigation within Data Catalog so that you can quickly browse to relevant Groups and Collections. By structuring data under Categories, teams can standardize how they store and manage related collections, making collaboration easier and more efficient.

A **Category** contains one or more **Groups**, which in turn contain **Datasets** or **Data Collections**. These collections can be published as **Data Products**. For example, a Finance category may include groups such as Monthly Reports or Audit Logs, each containing relevant Datasets or Data Collections.

### Group

In Data Catalog, a **Group** is a child container within a **Category** that helps further organize data assets in the **Collection** hierarchy. It allows users to structure **Collections**, such as **Datasets** and **Data Collections**, based on more specific topics, projects, or data sources within a broader business domain. A Group can also contain nested groups, allowing you to build deeper hierarchies for complex data structures.

Groups provide an intermediate layer of organization between the high-level **Category** and the individual collections, making it easier to maintain clarity and consistency in large or complex data environments. For example, within a Sales Category, you might create Groups such as Quarterly Reports, Customer Feedback, or Channel Partner Data, each containing its respective datasets.

### Dataset

In Data Catalog, a **Dataset** is a collection of homogeneous data assets, such as tables or files that share the same schema or structure. A **Dataset** is ideal when you want to analyze multiple data items that have identical columns and formats, making it possible to evaluate them as a unified group.

A **Dataset** contains a group of two or more tables or files that share the same column structure, allowing them to be analyzed together. It includes the results of processing jobs such as data profiling, which evaluates the structure and quality of each file or table, and aggregation, which computes summary statistics across common columns. For example, a Group called Customer Transactions under the Finance Category might include a Dataset of monthly transaction files that all follow the same format.

In addition to raw data, a Dataset also stores metadata, including tags, sensitivity levels, and trust scores. Additionally, you can associate custom properties, data labels, business terms, BI reports, ML models, external applications, policies, and physical assets by using the respective tabs. These attributes help provide context and support governance. A **Dataset** also features a columns canvas, which displays the common columns found across all included files or tables, along with their aggregated metrics, giving users a unified view of the dataset's structure and quality. To learn more about running data profiling or aggregation jobs, see [Processing collections](/pdc-use/pdc-collections/pdc-processing-collections.md).
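To make the idea of common columns concrete, the following sketch computes the columns shared by every file in a hypothetical Dataset. It is a conceptual illustration only: the `common_columns` function and the file and column names are assumptions, and Data Catalog's own profiling and aggregation jobs may determine common columns differently.

```python
def common_columns(tables: dict[str, list[str]]) -> list[str]:
    """Return the columns present in every table, in the first table's order."""
    if not tables:
        return []
    column_lists = list(tables.values())
    # Intersect the first table's columns with those of all remaining tables.
    shared = set(column_lists[0]).intersection(*column_lists[1:])
    return [col for col in column_lists[0] if col in shared]


# Hypothetical monthly files; the March file carries one extra column.
monthly_files = {
    "transactions_jan.csv": ["txn_id", "customer_id", "amount", "txn_date"],
    "transactions_feb.csv": ["txn_id", "customer_id", "amount", "txn_date"],
    "transactions_mar.csv": ["txn_id", "customer_id", "amount", "txn_date", "channel"],
}

print(common_columns(monthly_files))
```

In this example the extra `channel` column is excluded because it does not appear in every file, which is the kind of unified column view the columns canvas is meant to convey.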

### Data Collection

In Data Catalog, a **Data Collection** is a logical grouping of heterogeneous data assets, such as files, tables, or schemas that differ in structure or format. Unlike a Dataset, which requires a common schema, a Data Collection lets you combine diverse data sources that are related by purpose or business context rather than by structure. A Data Collection can include a variety of data assets with different schemas and formats. It may contain:

* Tables from different databases
* Files of various types (for example, CSV, Excel, JSON, XML)
* Schemas or partially structured data

{% hint style="info" %}
Data Collection is ideal when you want to group data assets for organizational or governance purposes, even if those assets vary in layout, columns, or type.
{% endhint %}

A **Data Collection** is created under a **Group**, which belongs to a **Category**. It is present alongside Datasets within the same Group and, like Datasets, can be enriched with metadata or published as a Data Product. For example, within a Group named Vendor Management under the Procurement Category, you might create a Data Collection that includes PDFs, Excel reports, and SQL tables, all related to supplier performance.

In Data Collections, you can apply metadata, including tags, sensitivity levels, and trust scores, to add context and enforce governance standards. You can also associate custom properties, data labels, business terms, BI reports, ML models, external applications, policies, and physical assets by using the respective tabs. Role-based visibility and access controls ensure that only authorized users can view or modify the collection. Data Collections can be shared with specific users or teams to support collaboration and, once they meet the required standards, published as Data Products for broader discovery across the organization.

### Data Product

In Data Catalog, a Data Product is a curated and published version of a Dataset or Data Collection, made available for broader discovery and consumption across the organization. After enrichment and validation, the Dataset or Data Collection can be promoted to a Data Product by publishing it. This publishing action changes its lifecycle state and makes it available in global search results and for broader consumption. For example, a Customer Insights Dataset under the Marketing Category can be published as a Data Product once it includes profiling results, sensitivity tagging, and trust scoring.

A Data Product contains all the components of its source Collection (Dataset or Data Collection), including the data assets, metadata, tags, and trust scores. Additionally, you can associate custom properties, data labels, business terms, BI reports, ML models, external applications, policies, and physical assets by using the respective tabs. It may also contain information about data quality, profiling results (for Datasets), and any associated governance attributes.

For **Data Collections**, use **Get Marketplace Ready** from the **Actions** menu to review whether the collection is ready to be published as a Data Product and made discoverable in the **Marketplace**. This option provides a consolidated view of the conditions that apply to the selected Data Collection and shows whether each condition is currently satisfied.

The **Get Marketplace Ready** dialog evaluates the following conditions:

**Mandatory conditions**

* **Data Quality must be high:** The current data quality score must be greater than or equal to the data quality threshold defined for the collection.
* **Sensitivity must be configured:** The **Sensitivity** value must be set for the Data Collection.
* **Trust Score must be trusted:** The current trust score must be greater than or equal to the trust score threshold defined for the collection.

**Recommended condition**

* **Lineage Verification should be set to Verified:** This indicates that a user has reviewed the available lineage information and confirmed that it is reliable.

For the **Data Quality** and **Trust Score** conditions, the initial thresholds are based on the system-defined threshold values configured for Data Catalog. Unless changed in the deployment, both thresholds default to **80%**. The owner of the Data Collection, and any user with **Update** permission on that collection, can change these thresholds for that specific Data Collection. A threshold change affects only the readiness evaluation of that collection.
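The readiness rules above can be expressed as a short check. The sketch below is a minimal model of the three mandatory conditions and the one recommended condition, assuming scores are percentages and both thresholds start at the 80% default; the `CollectionMetrics` class and `readiness` function are hypothetical names, not Data Catalog APIs.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_THRESHOLD = 80.0  # percent; default stated in this page unless changed in the deployment


@dataclass
class CollectionMetrics:
    data_quality: Optional[float]  # percent; None if not yet computed
    trust_score: Optional[float]   # percent; None if not yet computed
    sensitivity: Optional[str]     # None if not configured
    lineage_verified: bool = False
    data_quality_threshold: float = DEFAULT_THRESHOLD  # per-collection override
    trust_score_threshold: float = DEFAULT_THRESHOLD   # per-collection override


def readiness(m: CollectionMetrics) -> dict:
    """Evaluate mandatory and recommended conditions for one collection."""
    mandatory = {
        "Data Quality must be high": (
            m.data_quality is not None and m.data_quality >= m.data_quality_threshold
        ),
        "Sensitivity must be configured": m.sensitivity is not None,
        "Trust Score must be trusted": (
            m.trust_score is not None and m.trust_score >= m.trust_score_threshold
        ),
    }
    recommended = {
        "Lineage Verification should be set to Verified": m.lineage_verified,
    }
    # Only mandatory conditions gate readiness; recommended ones are advisory.
    return {
        "mandatory": mandatory,
        "recommended": recommended,
        "ready": all(mandatory.values()),
    }


metrics = CollectionMetrics(data_quality=85.0, trust_score=78.0, sensitivity="High")
print(readiness(metrics)["ready"])

# Lowering the trust score threshold for this collection only changes its own evaluation.
metrics.trust_score_threshold = 75.0
print(readiness(metrics)["ready"])
```

The second call illustrates why a threshold change affects only the collection on which it is made: the thresholds are modeled as fields of the individual collection rather than as a shared setting.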

When a threshold has been customized, the **Update Custom Threshold** dialog shows who last updated the value and when it was updated. If the threshold was changed from the system-defined value, you can reset it to the default value.

After the collection satisfies the mandatory conditions, continue the publish flow from the **Get Marketplace Ready** dialog.

Once published, Data Products are searchable through Global Search, can be filtered using metadata facets, and provide quick access to users who need verified data for reporting, analysis, or operational use. To learn more about the global search, see [Global search and discovery](https://docs.hitachivantara.com/r/en-us/pentaho-data-catalog/10.2.x/mk-95pdc000/global-search-and-discovery).

## Access and permissions for collections

In Data Catalog, access to Datasets, Data Collections, and Data Products is governed by user roles and sharing settings. By default, Datasets and Data Collections are visible only to their creators or owners. Other users cannot see or access these collections unless the owner explicitly shares them. The following are the access levels of a Dataset or Data Collection:

* **Private (default)**: Only the creator can view, update, delete, and interact with the collection.
* **Shared**: The owner can share a collection with specific users, groups, or roles. The owner can grant the following access to other users:
  * **View**: The user can only view the Collection. They cannot make changes or run jobs.
  * **Update**: The user can view and modify the Collection, including editing metadata, adding tags and terms, and publishing the collection as a Data Product.
  * **Run**: The user can view and modify the collection and execute supported operations such as **Profile**, **Aggregation**, or metric refresh calculations, depending on the collection type.
* **Published as Data Product**: Once the owner publishes a Dataset or Data Collection as a Data Product, it is visible and searchable by all users in the catalog (unless access restrictions are defined), making it easier to discover and reuse trusted data.
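The access levels above can be summarized as a simple permission check. The following sketch is an interpretation of the list, not the product's authorization logic: the action names (`view`, `modify`, `publish`, `run_job`), the `ALLOWED` mapping, and the `can` function are hypothetical, and it assumes that publishing belongs to **Update** only because the list mentions it only there.

```python
from enum import Enum


class Access(Enum):
    VIEW = "View"
    UPDATE = "Update"
    RUN = "Run"


# Assumed mapping from each shared access level to the actions it allows.
ALLOWED = {
    Access.VIEW: {"view"},
    Access.UPDATE: {"view", "modify", "publish"},
    Access.RUN: {"view", "modify", "run_job"},
}


def can(user: str, action: str, owner: str,
        shares: dict[str, Access], published: bool = False) -> bool:
    """Return True if the user may perform the action on the collection."""
    if user == owner:
        return True  # the owner can always view, update, delete, and share
    if published and action == "view":
        return True  # Data Products are visible to all users (absent extra restrictions)
    level = shares.get(user)  # None means the collection was not shared with this user
    return level is not None and action in ALLOWED[level]


shares = {"raj": Access.VIEW, "li": Access.RUN}
print(can("raj", "modify", owner="ana", shares=shares))   # shared with View only
print(can("li", "run_job", owner="ana", shares=shares))   # shared with Run
print(can("zoe", "view", owner="ana", shares=shares))     # private to zoe
print(can("zoe", "view", owner="ana", shares=shares, published=True))
```

A user with no share entry sees nothing until the collection is published, which matches the private-by-default behavior described above.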

To learn more about sharing and publishing collections, see the [**Manage collections**](/pdc-admin/pdc-10.2-admin/pdc-manage-collections.md) section in the [**Administer Pentaho Data Catalog**](https://docs.pentaho.com/pdc-admin/pdc-10.2-admin/) document.

## Tour the Collections page

In Pentaho Data Catalog, the **Collections** page provides a user-friendly interface for managing and viewing datasets, data collections, and data products. Click **Collections** in the left navigation menu to open the **Collections** page. This page is divided into two primary areas: the [Navigation pane](#collections-navigation-pane) and the Content pane.

<figure><img src="/files/nnCeVqG06CbTzR90kkmj" alt=""><figcaption></figcaption></figure>

### Collections navigation pane <a href="#collections-navigation-pane" id="collections-navigation-pane"></a>

On the **Collections** page, the left navigation pane lists the collection components in a hierarchical tree structure. These components can include categories, groups, datasets, data collections, and data products. Use the **Search** box to quickly find a specific collection asset. When you select an item, the asset name is highlighted in the tree, and the metadata for that asset appears in the **Content** pane. The banner also shows the name of the selected item and its location in the hierarchy. Select the **more options** icon next to an item to access available actions, such as deleting the asset, depending on the item type and your permissions. You can also create a new category or group from the navigation pane.

Additionally, you can access and explore datasets and data collections through three views: **Browse Collections**, **My Collections**, and **Shared with Me**. These views help you browse available collection assets based on your access permissions, quickly find assets that you own, and access assets that other users have shared with you.

<div data-with-frame="true"><figure><img src="/files/37XpWT7UeDKzaDDSU7wn" alt=""><figcaption></figcaption></figure></div>

#### **Browse Collections** <a href="#browse-collections" id="browse-collections"></a>

The **Browse Collections** view displays all collections that are visible to you across the organization. This includes collections you created, collections that have been shared with you by others, and any published Data Products. This view is ideal when you need a complete overview of all accessible collections across business domains or departments.

#### **My Collections** <a href="#my-collections" id="my-collections"></a>

The **My Collections** view filters the interface to show only the collections that you have created. These may include private drafts, collections that you have shared with others, or collections you are in the process of enriching or preparing for publication. This view is most useful when you're actively curating datasets or managing work-in-progress data assets.

#### **Shared with Me** <a href="#shared-collections" id="shared-collections"></a>

The **Shared with Me** view lists all collections that other users have explicitly shared with you. These may include Datasets or Data Collections where you’ve been granted permission to view, update, or run operations. This view helps streamline collaboration by grouping shared resources in one place.

Each of these three views, **Browse Collections**, **My Collections**, and **Shared with Me**, contains the same set of tabs across the top: **All**, **Collections**, and **Data Products**. These tabs help you filter and focus your exploration based on the current state and purpose of the collection.

<div data-with-frame="true"><figure><img src="/files/B91fyny5Lrsd7lvLxOId" alt=""><figcaption></figcaption></figure></div>

**All**

The **All** tab is the default view and displays all types of collections visible in the current context. Whether you are in **Browse Collections**, **My Collections**, or **Shared with Me**, the **All** tab will include Datasets, Data Collections, and Data Products, regardless of whether they are published or still in draft status. This tab is useful when you want a unified view of all relevant assets in one place.

**Collections**

The **Collections** tab narrows the display to items in draft or shared states, typically unpublished Datasets and Data Collections. This tab is helpful when you are working on assets that are still under development or being enriched with metadata and governance attributes.

**Data Products**

The **Data Products** tab displays only those collections that have been published as trusted data products. These collections are intended for broader organizational consumption and usually meet defined quality and governance standards. In this tab, Data Catalog users can find vetted and reusable assets for analysis, reporting, or integration.
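The combination of views and tabs described above behaves like two independent filters. The sketch below models that behavior under stated assumptions: the `Item` class, its fields, and the `list_items` function are hypothetical, and a published item is treated as a Data Product while an unpublished one is treated as a draft or shared collection.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    owner: str
    shared_with: set[str] = field(default_factory=set)
    published: bool = False  # assumption: True for a Data Product


def list_items(items: list[Item], user: str, view: str, tab: str = "All") -> list[str]:
    """Return the names shown to a user for a given view and tab."""
    in_view = {
        "Browse Collections": lambda i: i.owner == user or user in i.shared_with or i.published,
        "My Collections": lambda i: i.owner == user,
        "Shared with Me": lambda i: user in i.shared_with,
    }[view]
    in_tab = {
        "All": lambda i: True,
        "Collections": lambda i: not i.published,
        "Data Products": lambda i: i.published,
    }[tab]
    return [i.name for i in items if in_view(i) and in_tab(i)]


# Hypothetical catalog content.
catalog = [
    Item("Sales Drafts", owner="ana"),
    Item("Vendor Files", owner="raj", shared_with={"ana"}),
    Item("Customer Profile", owner="raj", published=True),
    Item("Private Notes", owner="raj"),
]

print(list_items(catalog, "ana", "Browse Collections"))
print(list_items(catalog, "ana", "Browse Collections", tab="Data Products"))
print(list_items(catalog, "ana", "Shared with Me"))
```

Because the view and the tab are applied separately, the same three tabs can appear unchanged in every view.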

In all views and tabs, the data is presented in a table format with the following columns:

<table><thead><tr><th width="155.111083984375">Column Name</th><th>Description</th></tr></thead><tbody><tr><td><strong>Name</strong></td><td>The title of the Dataset or Data Collection. Clicking it opens the asset in focused view.</td></tr><tr><td><strong>Description</strong></td><td>A brief summary of what the collection represents or contains.</td></tr><tr><td><strong>Type</strong></td><td>Indicates whether the item is a Dataset (a collection of homogeneous data), a Data Collection (a collection of heterogeneous data), or a Data Product (a published and curated Dataset or Data Collection available for broader consumption).</td></tr><tr><td><strong>Owner</strong></td><td>Displays the initials and name of the user who created the collection.</td></tr><tr><td><strong>Date Created</strong></td><td>Shows the exact date and time the collection was initially created.</td></tr></tbody></table>

The right side of each row may include additional actions such as the **Share** icon, which opens the permission settings for the collection. To learn more about sharing Datasets and Data Collections, see [**Share collections**](/pdc-admin/pdc-10.2-admin/pdc-manage-collections.md#share-collections) under the [**Manage Collections**](/pdc-admin/pdc-10.2-admin/pdc-manage-collections.md) section in the [**Administer Pentaho Data Catalog**](https://docs.pentaho.com/pdc-admin/pdc-10.2-admin/) document.

### Content pane

When you open a Dataset or Data Collection in the Collections hierarchy, the content pane displays the details of the selected collection and the tabs that apply to that component. Use the tabs in the content pane to review summary information, inspect the assets included in the collection, manage custom metadata, explore related objects across PDC, and collaborate with other users.

The tabs shown in the content pane depend on whether you select the collection itself or an asset inside the collection.

The following table identifies the key details available in the Content pane for a collection:

<table><thead><tr><th width="81.5">Item</th><th width="154.25">Name</th><th>Description</th></tr></thead><tbody><tr><td>1</td><td>Data banner</td><td>Displays the name, hierarchy path, and type icon of the selected collection asset, such as a category, group, collection, dataset, or schema. The banner can also show additional metadata, such as the last updated time.</td></tr><tr><td>2</td><td>Actions menu</td><td>Click to view the actions available for the selected collection asset. The available actions depend on the selected asset type and your permissions. For example, actions can include <strong>Process</strong>, <strong>Refresh Metrics</strong>, <strong>Define SQL View</strong>, <strong>Get Marketplace Ready</strong>, <strong>Duplicate</strong>, <strong>Share</strong>, and <strong>View in Galaxy</strong>.</td></tr><tr><td>3</td><td>Content tabs</td><td>Click to view additional information about the selected collection asset. Depending on the selected item, the available tabs can include <strong>Summary</strong>, <strong>Contents</strong>, <strong>Custom Properties</strong>, <strong>Relationships</strong>, and <strong>Comments</strong>. The tabs that appear depend on the type of asset selected.</td></tr></tbody></table>

#### Collection name

The name of the selected collection asset in Data Catalog, such as a category, group, collection, dataset, or schema. To edit the name, hover over it and click the pencil icon. The timestamp of the last update appears below the asset name.

#### Actions

A menu with the following options:

<table><thead><tr><th width="161.5">Feature</th><th>Description</th></tr></thead><tbody><tr><td><strong>Process</strong></td><td>Opens the <strong>Process</strong> page for the selected collection asset. For a <strong>data collection</strong>, only <strong>Data Profiling</strong> is available. This process generates statistical and intermediate data by using the default options. For a <strong>dataset</strong>, both <strong>Data Aggregation</strong> and <strong>Data Profiling</strong> are available. <strong>Data Aggregation</strong> summarizes statistics at the collection level. For more information, see <a href="https://docs.pentaho.com/pdc-use/ldc-explore-your-data-cp/pdc-processing-data/pdc-processing-collections"><strong>Processing collections</strong></a>.</td></tr><tr><td><strong>Refresh Metrics</strong></td><td>Opens a submenu with options to refresh the <strong>Data Quality</strong>, <strong>Trust Score</strong>, and <strong>Sensitivity</strong> metrics for the selected Data Collection. Each refresh action starts a background job and updates the corresponding metric after the job finishes successfully. For more information, see Refresh metrics for a data collection.</td></tr><tr><td><strong>Define SQL View</strong></td><td>Opens the SQL View definition dialog for the selected <strong>Data Collection</strong>. Use this option to define a SQL statement for the SQL tables in the collection, preview the query results, and create a logical SQL View for the collection. For more information, see SQL view of a data collection.</td></tr><tr><td><strong>Get Marketplace Ready</strong></td><td>Opens the Marketplace readiness dialog for the selected Data Collection. Use this option to review required and recommended conditions, update applicable thresholds, and continue the publish or unpublish flow for the collection as a Data Product in Marketplace. For more information, see <a href="https://docs.pentaho.com/pdc-admin/pdc-manage-collections#publish-a-collection-as-a-data-product"><strong>Publish a collection as a data product</strong></a>.</td></tr><tr><td><strong>Duplicate</strong></td><td>Creates a copy of an existing dataset or data collection so that you can reuse, enrich, or customize it without affecting the original. For more information, see <a href="https://docs.pentaho.com/pdc-use/ldc-explore-your-data-cp/pdc-collections#duplicate-or-versioning-of-a-collection"><strong>Duplicate or versioning of a collection</strong></a>.</td></tr><tr><td><strong>Share</strong></td><td>Shares the selected collection asset with other users. For more information, see <a href="https://docs.pentaho.com/pdc-admin/pdc-manage-collections#share-collections"><strong>Share collections</strong></a>.</td></tr><tr><td><strong>View in Galaxy</strong></td><td>Opens the <strong>Galaxy</strong> view for the selected collection asset. Here, you can see the selected collection and its related categories, groups, datasets, data collections, and data products. For more information, see <a href="https://docs.pentaho.com/pdc-use/pdc-galaxy-view"><strong>Galaxy view</strong></a>.</td></tr></tbody></table>

#### Summary tab

The **Summary** tab provides a summarized view of the selected collection component. You can view the following key information about the selected component.

{% hint style="info" %}
The information visible depends on the collection component you select.
{% endhint %}

<table><thead><tr><th width="141.5">Feature</th><th>Description</th></tr></thead><tbody><tr><td><strong>Description</strong></td><td>Provides a clear and authoritative description of the selected collection component. You can add or edit the definition to help users understand the purpose of the category, group, dataset, data collection, or data product.</td></tr><tr><td><strong>Purpose</strong></td><td>Provides a brief description of the purpose or use case of the selected collection component. The purpose helps you understand why the category, group, dataset, data collection, or data product was created, what it is intended for, and how it is relevant. You can also add or edit the purpose of the selected collection component.</td></tr><tr><td><strong>Key Metrics</strong></td><td>Displays the key quality and governance indicators for the selected collection, such as <strong>Data Quality</strong>, <strong>Lineage</strong>, <strong>Sensitivity</strong>, and <strong>Trust Score</strong>. These metrics help you quickly understand the current state of the collection, for example, whether the collection has <strong>High Quality</strong>, <strong>Verified Lineage</strong>, <strong>High Sensitivity</strong>, or <strong>Trusted Data</strong>. If a metric has not yet been calculated or configured, Data Catalog can display states such as <strong>Unknown</strong> or <strong>Uncomputed Score</strong>. To calculate the <strong>Data Quality</strong>, <strong>Sensitivity</strong>, or <strong>Trust Score</strong> metrics for a Data Collection, use <strong>Actions</strong> > <strong>Refresh Metrics</strong> and run the corresponding job. <strong>Lineage</strong> remains a verification metric rather than a calculated job-based metric.</td></tr><tr><td><strong>Rating</strong></td><td><p>Shows the average rating for the selected collection component on a scale of 1 to 5, where 5 is the highest rating. 
Any user with access to the collection component can rate it to express their satisfaction or confidence in the quality of that component.</p><p>To add or update your rating, click the <strong>Rating</strong> widget and then select the number of stars that you want to assign. Each user can rate a collection component only once, but can update that rating at any time. To remove your rating, use <strong>Clear</strong> in the rating dialog. The displayed rating is the average of all user ratings. When you hover over the rating, you can see the rating summary with the average rating and the total number of reviews.</p></td></tr><tr><td><strong>Properties</strong></td><td><p>Shows the properties of the selected collection component.</p><ul><li><strong>Domain:</strong> The business area to which the collection component belongs. You can edit this field and select a predefined value.</li><li><strong>Status:</strong> The current state of the collection component. You can edit this field and select a predefined value.</li><li><strong>Object Type:</strong> The type of the selected component, such as Category, Group, Dataset, Data Collection, or Data Product.</li><li><strong>Created By:</strong> The user who created the collection component in Data Catalog.</li><li><strong>Updated By:</strong> The user who last updated the collection component in Data Catalog.</li><li><strong>Owner:</strong> The user or identifier associated with the ownership of the collection component.</li><li><strong>Duplicates:</strong> Indicates whether duplicate copies of the collection component exist.</li></ul></td></tr><tr><td><strong>Business Terms</strong></td><td>Lists associated business terms for the collection component. 
If you are the creator of the collection component or have <strong>Update</strong> permission, you can also click <strong>Add Term</strong> to open the Business Terms dialog box and add terms to the resource.</td></tr><tr><td><strong>Tags</strong></td><td>Lists the tags associated with the collection component. You can also add tags, such as “quality:45”, to the resource; each tag key should be unique. Tags help users identify the resource by keyword.</td></tr><tr><td><strong>Labels</strong></td><td>Lists the labels associated with the selected collection component. These labels are based on custom properties created with the <strong>Data Label</strong> type and help users view label-related metadata directly from the summary.</td></tr><tr><td><strong>Custom Properties</strong></td><td>Lists the custom properties associated with the collection component. Custom properties refer to user-defined metadata attributes or fields that can be associated with various data assets. For more information, see <a href="https://docs.pentaho.com/pdc-use/ldc-resource-properties-user-guide-cp#custom-properties">Custom properties</a>.</td></tr></tbody></table>
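The rating rules in the **Rating** row above (one rating per user, updatable at any time, removable with **Clear**, displayed as an average) can be modeled with a mapping from user to stars. This is a conceptual sketch only; the `RatingBoard` class and its methods are hypothetical and do not represent how Data Catalog stores ratings.

```python
from typing import Optional


class RatingBoard:
    """Keeps at most one 1-to-5 star rating per user."""

    def __init__(self) -> None:
        self._ratings: dict[str, int] = {}

    def rate(self, user: str, stars: int) -> None:
        """Add the user's rating, or replace it if one already exists."""
        if stars not in (1, 2, 3, 4, 5):
            raise ValueError("stars must be a whole number from 1 to 5")
        self._ratings[user] = stars

    def clear(self, user: str) -> None:
        """Remove the user's rating, if any."""
        self._ratings.pop(user, None)

    def summary(self) -> tuple[Optional[float], int]:
        """Return (average rating, number of ratings); average is None when unrated."""
        count = len(self._ratings)
        average = sum(self._ratings.values()) / count if count else None
        return average, count


board = RatingBoard()
board.rate("ana", 4)
board.rate("raj", 5)
board.rate("ana", 3)  # updates ana's existing rating instead of adding a second one
print(board.summary())
board.clear("raj")
print(board.summary())
```

Keying the ratings by user is what enforces the one-rating-per-user rule: a second rating from the same user overwrites the first instead of adding to the count.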

#### Contents tab

The **Contents** tab shows the assets associated with the selected collection component. The information displayed in this tab depends on the selected component. When you select a **Dataset** or **Data Collection**, the Contents tab helps you review the assets that belong to the collection. When a **SQL view** has been created for a Data Collection and you select that SQL view in the tree, the Contents tab helps you review the source data assets used in the view and the columns defined by the SQL view.

For a **Dataset** or **Data Collection**, the Contents tab includes the **Data Assets** subtab; for a selected SQL view, it includes the **Data Assets in Use** and **SQL View Columns** subtabs.

**Data Assets**

The **Data Assets** subtab lists the assets that are included in the selected Dataset or Data Collection so that you can review them and inspect basic metadata for each item. Each row represents an asset in the collection and can include details such as the asset name, asset type, associated tags, and the dates and times when the asset was created and last updated.

You can click an asset name to open that asset and continue exploring its details. You can also use the filter and column settings controls to adjust the list and focus on the assets that you want to review. If you have the required permissions, you can also remove an asset from the collection directly from this subtab.

**Data Assets in Use**

The **Data Assets in Use** subtab lists the source data assets that are referenced by the SQL statement that defines the selected SQL view, along with metadata for those assets. Each row represents a source asset and can include details such as the asset name, asset type, number of columns, number of rows when available, current sensitivity value, associated business terms, and tags.

You can click an asset name to open that asset and continue exploring its details. You can also use the filter and column settings controls to adjust the list and focus on the assets that you want to review.

**SQL View Columns**

The **SQL View Columns** subtab lists the columns defined by the selected SQL view, which shows you the output structure of the view. Each row represents a SQL view column and can include details such as the column name, data type, column size, whether the column allows null values, current sensitivity value, associated business terms, and tags.

You can click a column name to open that column and continue exploring its details. You can also use the column settings control to customize the displayed columns.

#### Custom Properties tab

The **Custom Properties** tab lists the custom properties and assigned values associated with the selected collection component. [Custom properties](https://docs.pentaho.com/pdc-use/ldc-resource-properties-user-guide-cp#custom-properties) are user-defined metadata that help extend the standard metadata model with organization-specific information. You can use them to capture additional business, governance, or operational details, such as project identifiers, ownership context, review status, cost centers, or certification status. This tab helps you enrich the collection component with structured metadata that supports better understanding, consistency, and governance. You can also apply filters to refine the list.

To manage custom properties, select **Manage Properties** to open the **Custom Properties** page in the **Management** section. For more information, see [Manage custom properties](https://docs.pentaho.com/pdc-admin/manage-custom-properties) in the [Administer Pentaho Data Catalog](https://docs.pentaho.com/pdc-admin/) document.

#### Relationships tab

The **Relationships** tab displays the assets and governance objects associated with the selected collection. You can use this tab to understand how the collection connects to other objects across Data Catalog. Additionally, in the subtabs, you can identify where the collection is used, which governance artifacts apply to it, and which related business or technical objects are linked to it. The Relationships tab contains the following subtabs:

**BI Reports**

In the **BI Reports** tab, you can view the Business Intelligence (BI) reports associated with the selected collection, including their name, type, parent, and owner. This tab helps you understand how the collection is used in reporting and analytics and identify reports that depend on the data in the collection. Associating BI reports with collections helps users understand the downstream reporting context of the collection and assess the potential impact of changes to the collection.

To associate a BI report, click **Add BI Reports**, select the BI report that you want to add to the collection, and then click **Add**. Data Catalog creates a relationship between the collection and the selected BI report. You can sort the list, customize the displayed columns, and apply filters to refine the results. You can remove an existing association by clicking the **Delete** icon next to the BI report. Deleting the association does not delete the actual BI report.

**ML Models**

In the **ML Models** tab, you can view the machine learning (ML) models associated with the selected collection, including their name, type, parent, and owner. This tab helps you understand how the collection relates to machine learning assets and identify ML models that use or are connected to the collection. Associating ML models with collections helps users understand the role of the collection in machine learning workflows and identify collections that support model development, training, or analysis.

To associate an ML model, click **Add ML Models**, select the ML model that you want to add to the collection, and then click **Add**. Data Catalog creates a relationship between the collection and the selected ML model. You can sort the list, customize the displayed columns, and apply filters to refine the results. You can also remove an existing association by clicking the **Delete** icon next to the ML model. Deleting the association does not delete the actual ML model.

**Rules & Policies**

In the **Rules & Policies** tab, you can view the policies associated with the selected collection. This tab shows which policies apply to the collection, so you can understand its governance and compliance context and identify the standards or governance controls linked to it. Associating policies with collections helps users understand how the collection aligns with organizational rules, standards, or policy requirements.

In this tab, you can review policy details associated with the collection, such as the policy name, type, parent, and owner. You can sort the list, customize the displayed columns, and apply filters to refine the results. To associate a policy, click **Add Policy**, select the policy that you want to add to the collection, and then click **Add**. Data Catalog creates a relationship between the collection and the selected policy. You can remove an existing association by clicking the **Delete** icon next to the policy. Deleting the association does not delete the actual policy.

**Glossary**

In the **Glossary** tab, you can view the business terms associated with the selected collection, including their type, parent, and sensitivity. Associating business terms with collections helps users interpret the collection more accurately and identify collections that relate to a specific business concept or domain. It also helps connect technical assets in the collection with the business glossary used across the organization.

To associate a business term with the collection, click **Add Terms**, select the business term that you want to add, and then click **Add**. Data Catalog creates an association between the collection and the selected business term. You can remove an existing association by clicking the **Delete** icon next to the term. Deleting the association removes only the link between the collection and the business term; it does not delete the business term from the business glossary.

**Applications**

In the **Applications** tab, you can view the applications associated with the selected collection, including their name, type, parent, source created date, source updated date, and owner. This tab helps you understand how the collection relates to applications in your organization and where the collection is used in broader business and operational contexts by showing how it connects to business systems, tools, or application groups.

To associate an application, select **Add Applications**, then choose the application you want to add to the collection. Data Catalog creates a relationship between the collection and the application. You can sort the list, customize the displayed columns, and apply filters to refine the results. You can remove an existing association by clicking the **Delete** icon next to the application. Deleting the association does not delete the actual application.

**Physical Assets**

In the **Physical Assets** tab, you can view the physical assets associated with the selected collection, including name, type, parent, and owner. This tab helps you understand how the collection relates to physical assets in your organization and identify physical assets that are connected to the collection. Associating physical assets with collections helps users understand the operational context of the collection and identify collections that relate to specific equipment, devices, or other physical resources.

To associate a physical asset, click **Add Physical Assets**, select the physical asset that you want to add to the collection, and then click **Add**. Data Catalog creates a relationship between the collection and the selected physical asset. You can sort the list, customize the displayed columns, and apply filters to refine the results. You can remove an existing association by clicking the **Delete** icon next to the physical asset. Deleting the association does not delete the actual physical asset.

#### Comments tab

The **Comments** tab is a collaborative feature that allows users to discuss and provide feedback on specific data assets within Data Catalog. You can add comments, share suggestions, or ask questions directly in the tab using the provided text box, which includes basic formatting options like bold, italic, and bullet points. You can also tag other users by typing the "@" symbol followed by their username. Tagged users are notified of the comment by email and in the Mentions tab on the Data Catalog landing page, prompting them to respond if necessary. For more information, see [Tour of the Home page](https://docs.pentaho.com/pdc-use/ldc-quick-start-user-guide-cp#tour-of-the-home-page).

**Note:** In the Comments tab, you can:

* Tag users who have been configured in Data Catalog.
* Delete only the comments you posted.
* Delete any comment if you are an admin.

### Key metrics

The **Key Metrics** section appears in the upper-right panel of the **Summary** view for a Data Collection. It provides a quick view of the current quality and governance state of the collection by displaying **Data Quality**, **Lineage**, **Sensitivity**, and **Trust Score**.

For a Data Collection, **Data Quality**, **Sensitivity**, and **Trust Score** are calculated by running their corresponding jobs from **Actions** > **Refresh Metrics**. Each refresh action starts a background job and updates the metric after the job finishes successfully. For more information, see [Refresh metrics for a data collection](#refresh-metrics-for-a-data-collection).

The displayed value of each metric reflects the current state of the collection. For example, Data Catalog can display values such as **High Quality**, **Low Quality**, **High Sensitivity**, **Unknown Sensitivity**, **Trusted Data**, or **Uncomputed Score**, depending on whether the metric has been calculated and what result was returned.

**Lineage** is a verification metric. It indicates whether the lineage information for the collection has been reviewed and verified. Data Catalog displays this metric as a state, such as **Verified Lineage**. Unlike **Data Quality**, **Sensitivity**, and **Trust Score**, Lineage is not calculated by running a refresh job.

#### Refresh metrics for a data collection

Use **Refresh Metrics** to recalculate the system-generated metrics for a selected **Data Collection**. This option is available from the **Actions** menu and includes **Refresh Data Quality**, **Refresh Trust Score**, and **Refresh Sensitivity**. Use it when the collection changes and you want Data Catalog to compute the latest metric values for the collection.

When you refresh a metric, Data Catalog applies the system calculation to all applicable internal levels of the Data Collection, including **Common Columns**, **SQL View Columns**, the **SQL View**, and the **Data Collection** itself. The refreshed value replaces the previous value, even if it was set manually via the UI or API. Each refresh action starts a background job. After the job finishes successfully, the updated value appears in **Key Metrics**, and the **Last Computed** field in the metric tooltip shows the date of the most recent refresh.

To check the refresh status, go to the **Workers** page and find the corresponding worker process. Data Catalog uses these worker process names for collection metric refresh operations: **Collection Quality Score**, **Collection Trust Score**, and **Collection Sensitivity**. If the metric value does not update as expected after the refresh, review the job details on the **Workers** page to check for errors.

**Refresh Data Quality**

Use **Refresh Data Quality** to recalculate the **Data Quality** score for the selected Data Collection. For Data Collections, Data Quality can be computed by the system and can also be set through the API. When you use the refresh action, Data Catalog recalculates the metric and updates the Data Quality value by using the current system calculation. The updated value appears in the **Key Metrics** section after the refresh job finishes successfully.

**Refresh Trust Score**

Use **Refresh Trust Score** to recalculate the **Trust Score** for the selected Data Collection. For Data Collections, the Trust Score is computed only by the system and is not editable in the UI. Use this refresh action when you want Data Catalog to compute the latest Trust Score for the collection. After the background job finishes successfully, the updated Trust Score appears in the **Key Metrics** section.

**Refresh Sensitivity**

Use **Refresh Sensitivity** to recalculate the **Sensitivity** value for the selected Data Collection. Sensitivity can be computed by the system or set manually by a user. If a manual sensitivity value already exists for a collection element, such as the SQL View or SQL View columns, Data Catalog displays a warning and asks for confirmation before continuing. If you continue, the refreshed value replaces the previous value.

For Data Collections that include a **SQL View**, Sensitivity refresh depends on whether Data Catalog can successfully parse each SQL View column expression and identify the physical source columns used by that expression. If Data Catalog cannot identify the source columns for a view column, that column is skipped during Sensitivity calculation, and the process continues with the columns that were parsed successfully.

For example, Data Catalog cannot parse an expression that references source columns without clearly identifying the table to which they belong. When a SQL expression or function, such as `SUM`, `CONCAT`, or a mathematical expression, defines a view column, use an explicit alias with the AS clause so that Data Catalog can match the SQL View column with its corresponding expression. If no explicit alias is provided, that view column is skipped during Sensitivity calculation.
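As a rough illustration of the guidance above, the following sketch runs a query in which every source column is qualified with a table alias and every expression has an explicit alias. The `customers` and `payments` tables are hypothetical, and SQLite (through the Python standard library) is used only to make the example runnable. It does not reproduce how Data Catalog parses SQL, and your database engine might use `CONCAT` rather than the `||` operator shown here.

```python
import sqlite3

# Hypothetical tables standing in for SQL tables in a Data Collection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ada', 'Lovelace');
    INSERT INTO payments VALUES (100, 1, 40), (101, 1, 60);
""")

# Each source column is qualified with its table alias (c. or p.),
# and each expression or function result has an explicit AS alias.
QUERY = """
    SELECT c.id AS customer_id,
           c.first_name || ' ' || c.last_name AS full_name,
           SUM(p.amount) AS total_paid
    FROM customers AS c
    JOIN payments AS p ON p.customer_id = c.id
    GROUP BY c.id, c.first_name, c.last_name
"""

cursor = conn.execute(QUERY)
# The output column names are exactly the aliases from the AS clauses.
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

Writing `SUM(p.amount)` without `AS total_paid`, or `amount` without the `p.` qualifier, is the kind of statement that the paragraph above describes as being skipped during Sensitivity calculation.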

### SQL view

A **SQL view** in a **Data Collection** is a logical view that is defined by a SQL statement and built from the SQL tables already included in the collection. It lets you transform, combine, and restrict the raw table data in the collection without changing the original source data. The SQL view is not materialized. It is only a SQL statement that defines the data selection.

A SQL view is useful when a Data Collection contains SQL tables, and you want to create a business-focused or analysis-ready representation of that data. For example, you can join related tables, filter rows, select only the required columns, rename columns, or create calculated expressions. This helps you organize the collection around a specific use case without modifying the underlying tables. Only the SQL tables already added to the Data Collection can be used in the SQL statement that defines the SQL view.
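The following sketch shows the general idea of a non-materialized view that joins, filters, renames, and calculates columns. It uses SQLite through the Python standard library with hypothetical `customers` and `orders` tables. The `CREATE VIEW` wrapper is an assumption made only so that the example runs locally; Data Catalog stores and manages the statement that you enter in the SQL view dialog.

```python
import sqlite3

# Hypothetical tables standing in for SQL tables already in the collection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         quantity INTEGER, unit_price INTEGER);
    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO orders VALUES (10, 1, 2, 50), (11, 1, 1, 30), (12, 2, 5, 10);
""")

# A SELECT that joins tables, filters rows, renames a column,
# and adds a calculated expression.
VIEW_SQL = """
    SELECT c.name AS customer_name,
           o.quantity * o.unit_price AS line_total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.region = 'EMEA'
"""
conn.execute(f"CREATE VIEW emea_orders AS {VIEW_SQL}")


def view_rows(connection):
    """Read the current contents of the view, smallest line total first."""
    return connection.execute(
        "SELECT customer_name, line_total FROM emea_orders ORDER BY line_total"
    ).fetchall()


rows_before = view_rows(conn)
# The view holds only the statement, so a new source row appears on the next read.
conn.execute("INSERT INTO orders VALUES (13, 1, 4, 5)")
rows_after = view_rows(conn)

print(rows_before)
print(rows_after)
```

Because the view is only a stored statement, the source tables are never modified by defining it, and the view always reflects the current source data.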

If [Smart Type to SQL](https://docs.pentaho.com/pdc-admin/ldc-advanced-configuration-ut_cp#configure-smart-type-to-sql-feature-in-data-catalog) is configured in Data Catalog, you can also describe the data selection in natural language and generate the SQL statement by using the AI-assisted **Generate SQL** option in the SQL view dialog. This helps you to create queries more quickly when you do not want to write the full SQL statement manually. Depending on the database engine, you might need to include schema-qualified table names or enclose schema and table names in double quotes.
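To illustrate schema qualification and double-quoted identifiers, here is a minimal sketch that builds a quoted table reference and uses it in a query. The helper functions and the `Sales Orders` table are hypothetical, and `main` is simply SQLite's default schema name. Quoting rules vary by database engine, so treat this as an example of the standard SQL double-quote convention rather than the syntax that your engine requires.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(schema: str, table: str) -> str:
    """Build a schema-qualified, double-quoted table reference."""
    return f"{quote_identifier(schema)}.{quote_identifier(table)}"


conn = sqlite3.connect(":memory:")
# A table name that contains a space resolves only when it is quoted.
conn.execute('CREATE TABLE "Sales Orders" (id INTEGER, amount INTEGER)')
conn.execute('INSERT INTO "Sales Orders" VALUES (1, 10), (2, 20)')

# "main" is SQLite's default schema; substitute the schema used by your engine.
table_ref = qualified_table("main", "Sales Orders")
total = conn.execute(
    f"SELECT SUM(t.amount) AS total_amount FROM {table_ref} AS t"
).fetchone()[0]

print(table_ref)
print(total)
```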

If you use the AI-assisted generation option, you might still need to adjust the generated SQL to match the syntax required by your database. The SQL view dialog also lets you run the query and preview a result set before you save it. However, preview support currently only applies when the collection contains individual tables.

After you save the SQL statement, Data Catalog starts a background job to ingest the SQL view, discover its columns, and profile both the view and its columns. While the job is running, the column details might not be available immediately. After processing finishes successfully, the SQL view appears as a child of the Data Collection in the left navigation tree, and the view columns appear as children of the SQL view. The view and each view column have their own focused pages. The collection icon also changes to indicate that the collection has a SQL view defined.

Column names in a SQL view come from the SQL statement; you cannot rename SQL view columns later in the UI. If you want a specific column name to appear in Data Catalog, define it directly in the SQL statement by using the AS clause.

If the SQL statement fails during processing, Data Catalog shows an error in the **SQL View Columns** area. To investigate the problem, check the **Workers** page for the worker process named **Ingesting collection view columns** and review the failure details. After you correct the SQL statement, you can define the view again.

If you no longer need the SQL view, you can delete it from the contextual menu of the SQL view element in the collection tree. After you confirm the action, Data Catalog removes the SQL view and restores the previous tree structure for the collection.

#### Define a SQL view for a data collection

A SQL view helps you create a logical, query-based representation of the data in the collection so that you can focus on a specific business question, analysis scenario, or reporting need. The SQL view is defined by a SQL statement and does not materialize data in the source system.

Perform the following procedure to define a **SQL view** for a **Data Collection** that contains SQL tables in Data Catalog:

**Prerequisites**

* Ensure that the selected **Data Collection** contains SQL tables. Only SQL tables already added to the collection can be used in the SQL statement.
* Ensure that you are the owner of the Data Collection or that you have **Update** permission for it.
* If you want to use natural-language query generation, ensure that Smart Type to SQL is configured in Data Catalog. For more information, see [Configure Smart Type to SQL feature in Data Catalog](https://docs.pentaho.com/pdc-admin/ldc-advanced-configuration-ut_cp#configure-smart-type-to-sql-feature-in-data-catalog).

**Procedure**

1. In the left navigation menu, click **Collections**.
2. Open the **Data Collection** for which you want to define a SQL view.
3. Click **Actions** and then click **Define SQL View**.

   The **Define SQL View** dialog opens. The dialog lists the SQL tables and columns available in the collection and provides an editor for defining the SQL statement.
4. Define the SQL view by using one of these methods:
   * Enter the SQL statement directly in the SQL editor.
   * If Smart Type to SQL is available, describe the required data selection in natural language and click **Generate SQL** to create the SQL statement. Then review and update the generated SQL as needed.
5. If needed, click **Run** to preview the query result.

   Query preview is supported when the collection contains individual SQL tables. If the collection contains a schema element instead of individual tables, preview is not supported. A preview failure does not always mean that the SQL view cannot be created.
6. Review the SQL statement and ensure that it matches the syntax required by your database engine.

   Depending on the database engine, you might need to prefix table names with the schema name or enclose schema and table names in double quotes. If you want specific output column names, define them in the SQL statement by using the AS clause.
7. Click **Save**.

   Data Catalog saves the SQL statement and starts a background job to ingest the SQL view, discover the view columns, and profile both the view and its columns.

**Result**

The SQL view is created as a child of the selected Data Collection. The SQL view can be expanded in the collection tree, and its columns appear as child elements. The SQL view and each SQL view column have their own focused pages. The collection icon also changes to indicate that the collection has a SQL view defined.

### Duplicate or versioning of a collection

In Data Catalog, you can duplicate an existing Dataset or Data Collection to create a versioned copy that can be reused, enriched, or customized without impacting the original. A duplicate, or versioned collection, is an independently editable copy of an existing Dataset or Data Collection. It retains all user-defined properties, metadata, and relationships, but is treated as a new collection. You can create a duplicate in the same group or assign it to a different group. After duplication, the new collection includes a reference link to the original on its Properties tab, so users can trace its lineage while the duplicate evolves independently. This helps teams experiment with, enrich, or adapt datasets for new use cases without affecting the original.

Duplicating a collection supports data versioning by allowing users to create variations of a collection for different projects or analysis scenarios without altering the original. It also enables reuse by letting users start with an already curated collection rather than building a new one from scratch. The original collection remains unchanged throughout the process. Additionally, duplication enhances governance by clearly identifying duplicates and linking them to their source, which makes lineage easier to track and helps maintain data integrity.

When you duplicate a collection, the following properties and settings are preserved:

<table><thead><tr><th width="267.75">Copied Property</th><th>Description</th></tr></thead><tbody><tr><td>Collection metadata</td><td>Name (edited), description, category, and group</td></tr><tr><td>Data assets</td><td>All included files or tables from the original collection</td></tr><tr><td>Business terms</td><td>Any business glossary terms assigned to the collection</td></tr><tr><td>Sensitivity levels</td><td>Tags that define data sensitivity for compliance and security</td></tr><tr><td>Trust score</td><td>Confidence level assigned to the data asset</td></tr><tr><td>Data quality metrics</td><td>If defined manually, these values are retained</td></tr><tr><td>Tags and custom properties</td><td>Any custom metadata added to enrich the collection</td></tr><tr><td>Relationships and policies</td><td>Includes associations with other assets, such as linked reports or rules</td></tr></tbody></table>

The following items are not copied to the duplicated collection:

<table><thead><tr><th width="265.25">Not Copied</th><th>Reason</th></tr></thead><tbody><tr><td>Generated columns (Datasets)</td><td>Created by profiling or aggregation; treated as dynamic, not user-defined</td></tr><tr><td>Profiling results</td><td>Must be regenerated for the new collection</td></tr><tr><td>Sharing permissions</td><td>You must manually reassign users and roles in the new collection</td></tr><tr><td>Version history of original</td><td>The new collection starts its own version lifecycle</td></tr></tbody></table>

The duplicate collection feature has the following limitations and behavioral constraints, which maintain data integrity, version traceability, and governance consistency:

* A collection can be duplicated only once. You cannot create multiple duplicates of the same source collection.
* A duplicate is a new collection created by using an existing collection as a starting point. The original collection can continue to be modified after duplication.
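* Duplication is not a formal version-control mechanism. The duplicate starts its own lifecycle and evolves independently of the original. Duplication is intended to help you create another collection by reusing the structure, assets, and metadata of an existing collection.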
* You cannot remove columns from a Dataset, because Dataset columns are generated and managed by the system.
* The Duplicate banner is displayed once after creation and disappears when you navigate away.

