Manage collections
In Pentaho Data Catalog, you can structure and manage your data assets by creating Collections, which include two types: Datasets, used to group homogeneous tables or files with identical schemas, and Data Collections, used to logically organize heterogeneous data assets that may vary in format or structure. These assets are structured within a hierarchy of components: Category, Group, Dataset, Data Collection, and Data Product. To learn more about these components, see the Collections section in the Use Pentaho Data Catalog document.
The Manage Collections section provides step-by-step guidance for organizing and maintaining your data assets within Data Catalog. You can create and organize Categories and Groups, which serve as containers for Datasets and Data Collections. You can also customize and maintain this structure by editing or deleting existing items as needed. Once your data is organized and enriched, you can publish it as a Data Product to make it available across the organization.
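For readers who think in data structures, the hierarchy described above can be pictured with a small Python sketch. This is only an illustrative model, not Data Catalog's API or storage format. The class and field names (Category, Group, Collection, kind, published) are assumptions introduced here, and the sketch assumes that groups can be nested and that a Data Product is a published Dataset or Data Collection.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative model only: these classes are hypothetical and are not part of Data Catalog.

@dataclass
class Collection:
    name: str
    kind: str                # "dataset" or "data_collection"
    published: bool = False  # True once published as a Data Product


@dataclass
class Group:
    name: str
    groups: List["Group"] = field(default_factory=list)          # nested groups
    collections: List[Collection] = field(default_factory=list)


@dataclass
class Category:
    name: str
    groups: List[Group] = field(default_factory=list)


def paths(category: Category) -> List[str]:
    """Return a 'Category/Group/.../Collection' path for every collection."""
    found: List[str] = []

    def walk(group: Group, prefix: str) -> None:
        here = f"{prefix}/{group.name}"
        for collection in group.collections:
            found.append(f"{here}/{collection.name}")
        for child in group.groups:
            walk(child, here)

    for group in category.groups:
        walk(group, category.name)
    return found


sales = Category("Sales", [
    Group("Orders", collections=[
        Collection("orders_2023_2024", "dataset"),
        Collection("order_reference_files", "data_collection"),
    ]),
])
print(paths(sales))
```

The category sits at the top, groups add the intermediate layer, and each Dataset or Data Collection is a leaf under a group.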
Create a category
A Category in Collections helps Data Catalog users by providing a logical and organized structure for managing data assets. It serves as the top-level container that groups related Groups, Datasets, and Data Collections under a common business context, such as a department, project, or domain.
Perform the following steps to create a new category for collections in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
At the bottom of the Collections panel, click Create New Category.
The Create New Category dialog box opens.
You can also access this option from any of the following views: Browse Collections, My Collections, or Shared with Me.
In the Create New Category dialog box, in the Category Name box, enter a name for the category that you want to create.
In the Description box, enter a meaningful description and then click Create.
Tip: A clear and concise description helps other users understand the purpose of the category and improves discoverability across the catalog.
You have successfully created a new category, and it is visible in the Collections panel.
After creating a category, create a group to organize your datasets or data collections. For more information, see Create a group.
Create a group
A Group in Collections helps Data Catalog users to organize Datasets and Data Collections within a category based on specific topics, data sources, or use cases. It adds an intermediate layer of structure, making collections easier to manage, access, and govern.
Perform the following steps to create a new group for collections in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
At the bottom of the Collections panel, click the Create New Category drop-down, and then select Create Group.
The Create New Group dialog box opens.
You can also access this option from any of the following views: Browse Collections, My Collections, or Shared with Me.
In the Create New Group dialog box, from the Parent Category drop-down menu, select the existing category under which you want to create the new group. To create a nested group, select an existing group instead.
In the Group Name box, enter a name for the group that you want to create.
In the Description box, enter a meaningful description and then click Create.
Tip: A clear and concise description helps other users understand the purpose of the group and improves discoverability across the catalog.
You have successfully created a new group, and it is now visible in the Collections panel under the selected category.
After creating a category and group, you can create a dataset or data collection. For more information, see Create a dataset or Create a data collection.
Create a dataset
In Data Catalog, you can create datasets by grouping related data elements that share a common structure. This helps organize data in a meaningful way, making it easier to apply metadata, perform profiling, assign business terms, and enable efficient data discovery and governance.
Perform the following steps to create a new dataset for collections in Data Catalog:
In the left navigation menu, click Data Canvas.
The Data Canvas opens, showing a list of available data sources and their respective data assets.
From the Data Sources panel, select data tables that share a similar structure, such as those with the same columns or schema (homogeneous data), and click Add to Cart located below the panel. You can also use the Details tab to filter and select the data tables, and click the Add to Cart option on the tab.
The selected tables appear in the Data Cart.
In the Data Cart, review the selected assets and click Save as Collection.
The Create New Collection dialog box opens.
In the Parent Group drop-down menu, select an existing group where you want to create a dataset.
In the Collection Name box, enter a name for the dataset.
In the Description box, enter a meaningful description.
Tip: A clear and concise description helps other users understand the purpose of the dataset and improves discoverability across the catalog.
Under Select Type, select Dataset.
Note: Select the Data Collection option if you want to create a Data Collection instead. For more information, see Create a data collection.
To run profiling and aggregation jobs on the dataset, choose one or both of the following options:
Profile Job: Profiles the dataset and runs data aggregation.
Aggregation Job: Runs only the aggregation job if the dataset is already profiled.
For more details about these jobs, see Processing Collections in the Use Pentaho Data Catalog document.
Review the information you entered and click Create.
Data Catalog creates the dataset and starts the selected job. You can track job progress on the Workers page.
You have successfully created a Dataset, and it appears in the Collection hierarchy under the selected group.
Once the profiling or aggregation job is complete, you can view the results on the Summary tab. You can then proceed to assign Business Terms, Applications, and explore other available tabs. For more information, see Content pane under the Data Canvas section in Use Pentaho Data Catalog.
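The choice between a Dataset and a Data Collection comes down to whether the selected tables share a schema. The following Python sketch illustrates that rule outside the product. It is not how Data Catalog itself evaluates assets: the function name suggest_collection_type and the representation of a schema as an ordered list of (column name, data type) pairs are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumed representation: a schema is an ordered list of (column name, data type) pairs.
Schema = List[Tuple[str, str]]


def suggest_collection_type(schemas: Dict[str, Schema]) -> str:
    """Return 'Dataset' if every table has an identical schema, else 'Data Collection'."""
    if not schemas:
        raise ValueError("select at least one table")
    distinct = {tuple(schema) for schema in schemas.values()}
    return "Dataset" if len(distinct) == 1 else "Data Collection"


cart = {
    "orders_2023": [("order_id", "int"), ("amount", "decimal")],
    "orders_2024": [("order_id", "int"), ("amount", "decimal")],
}
print(suggest_collection_type(cart))  # Dataset

cart["customers"] = [("customer_id", "int"), ("name", "string")]
print(suggest_collection_type(cart))  # Data Collection
```

A check like this can be useful before you fill the Data Cart, for example when you export table definitions from a source system and want to know which tables are candidates for a single dataset.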
Create a data collection
In Data Catalog, you can create a data collection to group together multiple data assets that may have different structures. With data collections, you can organize diverse data assets under a single logical entity for easier management, discovery, and collaboration. Data collections help in categorizing related information, sharing curated groups of data assets with other users, and streamlining metadata management tasks across heterogeneous data.
Perform the following steps to create a new data collection under a group in Data Catalog:
In the left navigation menu, click Data Canvas.
The Data Canvas opens, showing a list of available data sources and their respective data assets.
From the Data Sources panel, select data assets with different formats (heterogeneous data), and click Add to Cart located below the panel. You can also use the Details tab to filter and select the data assets, and then click the Add to Cart option on the tab.
The selected data assets appear in the Data Cart.
In the Data Cart, review the selected assets and click Save as Collection.
The Create New Collection dialog box opens.
In the Parent Group drop-down menu, select an existing group where you want to create a data collection.
In the Collection Name box, enter a name for the data collection.
In the Description box, enter a meaningful description.
Tip: A clear and concise description helps other users understand the purpose of the data collection and improves discoverability across the catalog.
Under Select Type, select Data Collection.
Note:
Select the Dataset option if you want to create a dataset using homogeneous data assets, such as tables with the same schema. For more information, see Create a dataset.
Unlike datasets, data collections do not support profiling or aggregation jobs.
Review the information you entered and click Create.
You have successfully created a Data Collection, and it appears in the Collection hierarchy under the selected group.
You can then proceed to assign Business Terms, Applications, and explore other available tabs. For more information, see Content pane under the Data Canvas section in Use Pentaho Data Catalog.
Create a duplicate of a collection
In Data Catalog, you can duplicate a Dataset or Data Collection to create a new version that includes the original content, metadata, and relationships. With this feature, you can reuse or extend an existing collection without modifying the source.
Perform the following steps to create a duplicate of a collection:
In the left navigation menu, click Data Canvas, and then click Collections.
In the Collections panel, expand the respective category and group, and then click the Dataset or Data Collection you want to duplicate. The Summary page of the selected collection appears.
Click the Actions drop-down menu and then click Duplicate. The Duplicate Collection dialog box appears. Note: If the collection has already been duplicated, the Duplicate option is not available. For more information, see Duplicate or versioning of a collection.
In the Collection Name field, enter a new, unique name for the duplicate collection. A default name is suggested. You can edit it as needed. The name must be unique within the selected group.
From the Parent Group drop-down menu, select the category and group where you want to place the duplicate collection.
Click Duplicate to complete the duplication. A new collection is created in the selected group, and a Duplicate banner briefly appears to notify you, including the option to undo and delete the duplicate collection.
You have successfully created a duplicate collection. The new collection includes all metadata, business terms, properties, and data assets from the original collection. System-generated values, such as profiling results and generated columns, are not copied. The duplicated collection also includes a reference link to the original, visible on the Properties tab. Next, you can edit (add or remove assets), update (add or remove properties), or process the collection.
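The copy rules described above (what a duplicate carries over and what it leaves behind) can be summarized in a short Python sketch. This is a conceptual model only. The CollectionRecord class, its field names, and the duplicate function are hypothetical and do not reflect Data Catalog's internal schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


# Hypothetical model of a collection; field names are illustrative only.
@dataclass
class CollectionRecord:
    name: str
    group: str
    assets: List[str] = field(default_factory=list)
    business_terms: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)
    profiling_results: Optional[dict] = None                     # system-generated
    generated_columns: List[str] = field(default_factory=list)   # system-generated
    original: Optional[str] = None  # reference link back to the source collection


def duplicate(source: CollectionRecord, new_name: str, target_group: str,
              names_in_group: Set[str]) -> CollectionRecord:
    """Copy user-managed content; leave system-generated values empty."""
    if new_name in names_in_group:
        raise ValueError(f"'{new_name}' already exists in {target_group}")
    return CollectionRecord(
        name=new_name,
        group=target_group,
        assets=list(source.assets),                  # copied
        business_terms=list(source.business_terms),  # copied
        properties=dict(source.properties),          # copied
        original=f"{source.group}/{source.name}",    # link to the original
        # profiling_results and generated_columns keep their empty defaults
    )


orders = CollectionRecord(
    name="orders", group="Sales/Orders", assets=["orders_2023"],
    business_terms=["Order"], properties={"Domain": "Sales"},
    profiling_results={"row_count": 1200}, generated_columns=["order_year"],
)
copy_of_orders = duplicate(orders, "orders_copy", "Sales/Orders", {"orders"})
print(copy_of_orders)
```

Because the lists and the dictionary are copied rather than shared, later edits to the duplicate do not change the original, which mirrors the behavior described in this section.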
Publish a collection as a data product
In Data Catalog, you can publish a collection as a data product, which helps you package curated datasets or data collections into shareable, reusable, and well-documented entities. As data products are visible to all users, they enhance data discoverability, promote trust through clear ownership and descriptions, and support data-as-a-service initiatives across teams. Publishing a collection as a data product helps other users easily access, understand, and consume high-quality, business-ready data for analytics, reporting, or integration.
Perform the following steps to publish a collection as a data product under a group in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
Select a collection that you want to publish as a data product.
Verify that the Sensitivity level is defined, as it is mandatory for publishing. You should also review and, if applicable, define additional metrics such as Trust Score, Data Quality, and other relevant attributes to enhance the product’s credibility. If these values are not specified, go to the Summary tab, configure the required properties, and click Save.
Click Actions, then select Publish as Data Product. In the Publish Data Product dialog box, review the Sensitivity property, which is mandatory, and ensure it is properly configured.
You now have two options:
Click Start Publishing to publish the collection with all required and recommended properties set.
Click Publish Anyway if you want to proceed, even if some recommended properties (other than Sensitivity) don’t meet the suggested thresholds.
Note: If the collection doesn’t have properties defined, you get an error message, and publishing cannot proceed.
You have successfully published the collection as a data product. The component now displays the updated data product icon and appears as a data product in the Collections hierarchy. It is visible to all users in Data Catalog.
After publishing a collection as a data product, you can view and manage its metadata, including the description, owner, and assigned domain. You can also share the data product with other users or teams to promote collaboration and reuse.
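The publishing rules in this procedure can be read as a simple decision: a missing Sensitivity value blocks publishing, while missing recommended values only change which button applies. The Python sketch below illustrates that reading. It is a simplification under stated assumptions: the function name publish_option is hypothetical, the list of recommended properties contains only the two named in the text, and the sketch checks only whether a value is set rather than comparing it against any threshold.

```python
from typing import Dict, Optional

REQUIRED = ("Sensitivity",)
# Only the recommended metrics named in this section; the real list may be longer.
RECOMMENDED = ("Trust Score", "Data Quality")


def publish_option(properties: Dict[str, Optional[str]]) -> str:
    """Return the dialog option that applies, or raise if publishing is blocked."""
    missing_required = [name for name in REQUIRED if not properties.get(name)]
    if missing_required:
        raise ValueError("cannot publish, missing: " + ", ".join(missing_required))
    missing_recommended = [name for name in RECOMMENDED if not properties.get(name)]
    return "Publish Anyway" if missing_recommended else "Start Publishing"


print(publish_option({"Sensitivity": "High", "Trust Score": "4", "Data Quality": "92"}))
print(publish_option({"Sensitivity": "High"}))
```

The first call corresponds to a collection with every property set, and the second to a collection where only the mandatory Sensitivity value is defined.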
Share collections
In Data Catalog, you can share a dataset, data collection, or data product with other users to enable collaboration and controlled access. Sharing allows other users to view, update, or run the component based on the permission you assign. It supports collaboration, ensures alignment across departments, and enables secure access to curated data assets for business and analytical use cases.
Perform the following steps to share a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays the hierarchy of components.
Select the dataset, data collection, or data product you want to share.
The Summary tab opens.
In the upper-right corner, click Share.
The Share dialog box opens.
In the Type a member box, enter the name or email address of the user you want to share with.
From the Permission drop-down menu, choose one of the following:
View: The user can only view the component.
Update: The user can edit the component.
Run: The user can execute supported actions such as profiling or publishing.
Note: With Permissions, you can control what actions others can perform on the shared component.
(Optional) In the Message box, enter a message for the user.
Click Share.
The selected user is granted access with the specified permission level.
The dataset, data collection, or data product is now shared with the selected user. The component will appear in their Shared with Me view when they log in to Data Catalog.
Add assets to a collection
In Data Catalog, you can add selected data assets, such as tables or files, to an existing Dataset or Data Collection. With this, you can enhance existing collections or create a duplicate version before adding new assets, depending on your needs.
Perform the following steps to add data assets to a collection:
In the left navigation menu, click Data Canvas.
In the Data Sources panel, navigate to and select the data assets (tables or files) you want to add. You can select multiple assets from different schemas or folders as needed.
Click Add to Cart. The selected items appear in the Data Cart on the right panel.
In the Data Cart, click the drop-down next to Save as Collection, and then click Add to Collection. The Add to Collection dialog box appears with two options:
Add assets to an existing collection: Select this option to append the selected data assets to an existing Dataset or Data Collection. The original collection is updated with the new items, and its version remains unchanged. This option is useful when extending an already curated collection.
Duplicate collection and add items: Choose this option to create a new copy of an existing collection and add the selected assets to it. This approach avoids modifying the original collection and helps you to manage and enrich the duplicate version independently.
Option 1: Add assets to an existing collection
Perform the following steps to add assets to an existing collection:

In the Add to Collection dialog box, select Add to Existing Collection.
Browse through the available categories and groups to locate the target collection. You can also use the search bar to locate a collection. Note: Only compatible collections (Dataset or Data Collection) are enabled for selection based on the type of selected assets.
Select the collection you want to add the assets to.
Click Add to Collection. The assets are appended to the selected collection.
The selected data assets are successfully added to the existing collection. A banner briefly appears to notify you about the added assets, including an option to delete them.
Option 2: Duplicate collection and add assets
Perform the following procedure if you want to avoid modifying the original collection and instead duplicate it, adding the assets to the new copy.

In the Add to Collection dialog box, select Duplicate Collection and Add Items.
Browse through the available categories and groups to locate the source collection. You can also use the search bar to locate the source collection. Note: Only compatible collections (Dataset or Data Collection) are enabled for selection based on the type of selected assets.
Select the source collection you want to duplicate and add the assets to.
In the Collection Name field, enter a unique name for the new collection. A name is required and must be unique within the selected group.
In the Parent Group drop-down menu, select the group where the new duplicate collection will be created.
Click Duplicate and Add.
A new collection is created in the selected group, and the selected assets are added to it. A Duplicate banner briefly appears to notify you, including the option View Original Collection.
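The difference between the two options is whether the original collection changes. The Python sketch below contrasts them using plain dictionaries. It is an illustration only: the function names, the flat name-to-assets mapping (which ignores categories and groups), and the choice to skip assets that are already present are all assumptions, not documented Data Catalog behavior.

```python
from typing import Dict, List

# Assumed representation: collection name -> list of asset names.
Collections = Dict[str, List[str]]


def add_to_existing(collections: Collections, target: str, assets: List[str]) -> None:
    """Option 1: append assets to the target collection in place."""
    current = collections[target]
    for asset in assets:
        if asset not in current:  # assumption: duplicates are skipped
            current.append(asset)


def duplicate_and_add(collections: Collections, source: str,
                      new_name: str, assets: List[str]) -> None:
    """Option 2: copy the source collection, then add assets to the copy only."""
    if new_name in collections:
        raise ValueError(f"'{new_name}' is already in use")
    collections[new_name] = list(collections[source])
    add_to_existing(collections, new_name, assets)


catalog = {"orders": ["orders_2023"]}
duplicate_and_add(catalog, "orders", "orders_v2", ["orders_2024"])
print(catalog)
```

After the call, "orders" still contains only its original asset, while "orders_v2" holds both. Calling add_to_existing on "orders" instead would have changed the original collection.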
Remove assets from a collection
In Data Catalog, you can remove individual assets, such as tables or files, from an existing Dataset or Data Collection. With this, you can maintain collections that are relevant, clean, and aligned with your business requirements.
Perform the following steps to remove assets from a collection:
In the left navigation menu, click Data Canvas and then click Collections.
In the Collections panel, expand the respective category and group and then click the Dataset or Data Collection from which you want to remove assets. The Summary page of the selected collection appears.
Click the Details tab. A list of assets included in the collection appears.
Locate the asset you want to remove and click the Delete icon next to it.
In the confirmation dialog box, click Remove to confirm. The asset is removed from the collection.
The selected asset is successfully removed from the collection. It no longer appears in the Details tab of the collection, but the source asset remains available in Data Canvas for future use.
Edit or update a category, group, dataset, data collection, or data product
In Data Catalog, each collection component, such as category, group, dataset, data collection, or data product, includes editable metadata, including name, description, purpose, key metrics, and properties. You can edit or update this information on the Summary tab, and manage associated metadata on other tabs such as Terms, Applications, or Policies, depending on the component type.
Perform the following steps to edit a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays a hierarchy of categories, groups, datasets, data collections, and data products.
Select the component (category, group, dataset, data collection, or data product) you want to edit.
The Summary tab opens by default.
To edit text fields such as Name, Description, or Purpose, click the Edit icon next to the respective field, update the value, and click Save to apply the changes.
Update key metrics, such as Sensitivity, Trust Score, or Data Quality, in the Key Metrics section as required.
These metrics help validate readiness for publishing as a data product.
To update properties such as Domain or Status, click the Edit icon next to the field under the Properties section, make the necessary changes, and click Save.
Go to other tabs, such as Terms, Applications, or Policies, to add or remove relevant metadata.
Note: The available tabs vary based on the type of collection component selected.
After completing your updates, ensure that all required fields are defined and properly saved.
You have successfully updated the selected collection component. The changes are saved immediately and are visible to other users based on their assigned access rights.
Delete a category, group, dataset, data collection, or data product
In Data Catalog, you can delete any collection component, such as a category, group, dataset, data collection, or data product, when it is no longer needed. Deleting unused or obsolete components helps maintain a clean and organized catalog, reduces clutter, and ensures users can easily navigate and find relevant data assets.
Perform the following steps to delete a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays a hierarchy of existing components, including categories, groups, datasets, data collections, and data products.
In the Collections panel, click the More options icon next to the component name (such as a dataset or data collection) that you want to delete.
Click Delete from the menu.
A confirmation prompt appears.
Click Confirm to complete the deletion.
The component is permanently removed from the catalog.
You have successfully deleted the selected collection component. It no longer appears in the Collections hierarchy and is no longer accessible to users.