Manage collections
In Pentaho Data Catalog, you can structure and manage your data assets by creating Collections. There are two types of Collections: Datasets, which group homogeneous tables or files with identical schemas, and Data Collections, which logically organize heterogeneous data assets that may vary in format or structure. These assets are organized within a hierarchy of components: Category, Group, Dataset, Data Collection, and Data Product. To learn more about these components, see the Collections section in the Use Pentaho Data Catalog document.
The Manage Collections section provides step-by-step guidance for organizing and maintaining your data assets within Data Catalog. You can create and organize Categories and Groups, which serve as containers for Datasets and Data Collections. You can also customize and maintain this structure by editing or deleting existing items as needed. Once your data is organized and enriched, you can publish it as a Data Product to make it available across the organization.
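As a mental model of the hierarchy described above, the following minimal Python sketch represents categories, groups, and collections as nested objects. The class names, fields, and the `outline` helper are illustrative assumptions and are not part of any Data Catalog API. For simplicity, the sketch places collections only inside groups.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Collection:
    """A leaf component: a Dataset or a Data Collection (hypothetical model)."""
    name: str
    kind: str  # "Dataset" or "Data Collection"
    is_data_product: bool = False  # set once the collection is published


@dataclass
class Group:
    """Container inside a category; groups can be nested."""
    name: str
    children: List[Union["Group", Collection]] = field(default_factory=list)


@dataclass
class Category:
    """Top-level container for groups."""
    name: str
    groups: List[Group] = field(default_factory=list)


def outline(category: Category) -> List[str]:
    """Return an indented outline with one line per component."""
    lines = [category.name]

    def walk(group: Group, depth: int) -> None:
        lines.append("  " * depth + group.name)
        for child in group.children:
            if isinstance(child, Group):
                walk(child, depth + 1)
            else:
                label = "Data Product" if child.is_data_product else child.kind
                lines.append("  " * (depth + 1) + f"{child.name} [{label}]")

    for group in category.groups:
        walk(group, 1)
    return lines
```

For example, `outline(Category("Sales", [Group("Orders", [Collection("orders_2024", "Dataset")])]))` returns one line each for the category, the group, and the dataset, indented by level.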
Create a category
A Category in Collections helps Data Catalog users by providing a logical and organized structure for managing data assets. It serves as the top-level container that groups related Groups, Datasets, and Data Collections under a common business context, such as a department, project, or domain.
Perform the following steps to create a new category for collections in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
At the bottom of the Collections panel, click Create New Category.
The Create New Category dialog box opens.
You can also access this option from any of the following views: Browse Collections, My Collections, or Shared with Me.
In the Create New Category dialog box, in the Category Name box, enter a name for the category that you want to create.
In the Description box, enter a meaningful description and then click Create.
Tip: A clear and concise description helps other users understand the purpose of the category and improves discoverability across the catalog.
You have successfully created a new category, and it is visible in the Collections panel.
After creating a category, create a group to organize your datasets and data collections. For more information, see Create a group.
Create a group
A Group in Collections helps Data Catalog users to organize Datasets and Data Collections within a category based on specific topics, data sources, or use cases. It adds an intermediate layer of structure, making collections easier to manage, access, and govern.
Perform the following steps to create a new group for collections in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
At the bottom of the Collections panel, click the Create New Category drop-down menu and select Create Group.
The Create New Group dialog box opens.
You can also access this option from any of the following views: Browse Collections, My Collections, or Shared with Me.
In the Create New Group dialog box, from the Parent Category drop-down menu, select the existing category under which you want to create the new group. To create a nested group, select an existing group instead.
In the Group Name box, enter a name for the group that you want to create.
In the Description box, enter a meaningful description and then click Create.
Tip: A clear and concise description helps other users understand the purpose of the group and improves discoverability across the catalog.
You have successfully created a new group, and it is now visible in the Collections panel under the selected category.
After creating a category and group, you can create a dataset or data collection. For more information, see Create a dataset or Create a data collection.
Create a dataset
In Data Catalog, you can create datasets by grouping related data elements that share a common structure. This helps organize data in a meaningful way, making it easier to apply metadata, perform profiling, assign business terms, and enable efficient data discovery and governance.
Perform the following steps to create a new dataset for collections in Data Catalog:
In the left navigation menu, click Data Canvas.
The Data Canvas opens, showing a list of available data sources and their respective data assets.
From the Data Sources panel, select data tables that share a similar structure, such as those with the same columns or schema (homogeneous data), and click Add to Cart.
The selected tables appear in the Data Cart.
In the Data Cart, review the selected assets and click Save as Collection.
The Create New Collection dialog box opens.
In the Parent Group drop-down menu, select an existing group where you want to create a dataset.
In the Collection Name box, enter a name for the dataset.
In the Description box, enter a meaningful description.
Tip: A clear and concise description helps other users understand the purpose of the dataset and improves discoverability across the catalog.
Under Select Type, select Dataset.
Note: Select the Data Collection option if you want to create a data collection instead. For more information, see Create a data collection.
To run profiling and aggregation jobs on the dataset, choose one or both of the following options:
Profile Job: Profiles the dataset and runs data aggregation.
Aggregation Job: Runs only the aggregation job if the dataset is already profiled.
For more details about these jobs, see Processing Collections in the Use Pentaho Data Catalog document.
Review the information you entered and click Create.
Data Catalog creates the dataset and starts the selected job. You can track job progress on the Workers page.
You have successfully created a Dataset, and it appears in the Collection hierarchy under the selected group.
Once the profiling or aggregation job is complete, you can view the results on the Summary tab. You can then proceed to assign Business Terms, Applications, and explore other available tabs. For more information, see Data tabs under Content pane in the Use Pentaho Data Catalog document.
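The choice between a dataset and a data collection depends on whether the selected tables share the same structure. The following Python sketch illustrates that check outside the product. The `Schema` representation and the `suggest_collection_type` function are hypothetical, and treating column order as irrelevant is an assumption; Data Catalog may apply its own rules.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a schema is a list of (column name, type) pairs.
Schema = List[Tuple[str, str]]


def suggest_collection_type(tables: Dict[str, Schema]) -> str:
    """Suggest "Dataset" when every table has the same columns and types,
    otherwise "Data Collection"."""
    if not tables:
        raise ValueError("select at least one table")
    # Compare schemas as sets so that column order does not matter (assumption).
    distinct = {frozenset(schema) for schema in tables.values()}
    return "Dataset" if len(distinct) == 1 else "Data Collection"
```

Two yearly order tables with identical columns would yield "Dataset", while an orders table combined with a customers table would yield "Data Collection".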
Create a data collection
In Data Catalog, you can create a data collection to group together multiple data assets that may have different structures. With data collections, you can organize diverse data assets under a single logical entity for easier management, discovery, and collaboration. Data collections help in categorizing related information, sharing curated groups of data assets with other users, and streamlining metadata management tasks across heterogeneous data.
Perform the following steps to create a new data collection under a group in Data Catalog:
In the left navigation menu, click Data Canvas.
The Data Canvas opens, showing a list of available data sources and their respective data assets.
From the Data Sources panel, select data assets with different formats (heterogeneous data), and click Add to Cart.
The selected data assets appear in the Data Cart.
In the Data Cart, review the selected assets and click Save as Collection.
The Create New Collection dialog box opens.
In the Parent Group drop-down menu, select an existing group where you want to create a data collection.
In the Collection Name box, enter a name for the data collection.
In the Description box, enter a meaningful description.
Tip: A clear and concise description helps other users understand the purpose of the data collection and improves discoverability across the catalog.
Under Select Type, select Data Collection.
Note:
Select the Dataset option if you want to create a dataset using homogeneous data assets, such as tables with the same schema. For more information, see Create a dataset.
Unlike datasets, data collections do not support profiling or aggregation jobs.
Review the information you entered and click Create.
You have successfully created a Data Collection, and it appears in the Collection hierarchy under the selected group.
You can then assign Business Terms and Applications, and explore the other available tabs. For more information, see Content pane under the Explore your data section in the Use Pentaho Data Catalog document.
Publish a collection as a data product
In Data Catalog, you can publish a collection as a data product, which helps you package curated datasets or data collections into shareable, reusable, and well-documented entities. As data products are visible to all users, they enhance data discoverability, promote trust through clear ownership and descriptions, and support data-as-a-service initiatives across teams. Publishing a collection as a data product helps other users easily access, understand, and consume high-quality, business-ready data for analytics, reporting, or integration.
Perform the following steps to publish a collection as a data product under a group in Data Catalog:
In the left navigation menu, click Data Canvas and then click Collections.
The list of existing collections appears in the Collections panel.
Select a collection that you want to publish as a data product.
Verify that all items meet the required thresholds before publishing, including Sensitivity, Trust Score, Data Quality, and other relevant metrics. If these values are not defined, go to the Summary tab to configure them and click Save.
Click Actions, select Publish as Data Product, and click Start Publishing.
Note: If the collection doesn’t have properties defined, you get an error message, and publishing cannot proceed.
You have successfully published the collection as a data product. The component now displays the updated data product icon and appears as a data product in the Collections hierarchy. It is visible to all users in Data Catalog.
After publishing a collection as a data product, you can view and manage its metadata, including the description, owner, and assigned domain. You can also share the data product with other users or teams to promote collaboration and reuse.
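The pre-publishing check in the procedure above can be pictured as a simple validation over the Summary values. The following Python sketch is an illustration only: the `summary` dictionary and the `missing_metrics` function are hypothetical, the metric list covers only the three metrics named in this section, and the sketch tests whether a value is defined rather than whether it meets a threshold, because the thresholds are not specified here.

```python
from typing import Dict, List, Optional

# Metrics named in the procedure; the product may require additional ones.
REQUIRED_METRICS = ("Sensitivity", "Trust Score", "Data Quality")


def missing_metrics(summary: Dict[str, Optional[str]]) -> List[str]:
    """Return the required metrics that are absent or empty in `summary`."""
    return [metric for metric in REQUIRED_METRICS if not summary.get(metric)]
```

An empty result suggests that the collection has the named metrics defined. A non-empty result lists the values to configure on the Summary tab first.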
Share collections
In Data Catalog, you can share a dataset, data collection, or data product with other users to enable collaboration and controlled access. Sharing allows other users to view, update, or run the component based on the permission you assign. It supports collaboration, ensures alignment across departments, and enables secure access to curated data assets for business and analytical use cases.
Perform the following steps to share a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays the hierarchy of components.
Select the dataset, data collection, or data product you want to share.
The Summary tab opens.
In the upper-right corner, click Share.
The Share dialog box opens.
In the Type a member box, enter the name or email address of the user you want to share with.
From the Permission drop-down menu, choose one of the following:
View: The user can only view the component.
Update: The user can edit the component.
Run: The user can execute supported actions such as profiling or publishing.
Note: With Permissions, you can control what actions others can perform on the shared component.
(Optional) In the Message box, enter a message for the user.
Click Share.
The selected user is granted access with the specified permission level.
The dataset, data collection, or data product is now shared with the selected user. The component appears in their Shared with Me view when they log in to Data Catalog.
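The fields of the Share dialog box can be summarized as a small data structure. The following Python sketch is a hypothetical model, not a Data Catalog API: the `Permission` enum mirrors the three options listed above, and `ShareRequest` and `build_share_request` are names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    """The three permission options listed in the Share dialog box."""
    VIEW = "View"
    UPDATE = "Update"
    RUN = "Run"


@dataclass(frozen=True)
class ShareRequest:
    member: str             # name or email address of the user
    permission: Permission  # one of View, Update, or Run
    message: str = ""       # optional message for the user


def build_share_request(member: str, permission: str, message: str = "") -> ShareRequest:
    """Validate the inputs and build a share request.

    Raises ValueError for an empty member or an unknown permission label.
    """
    if not member.strip():
        raise ValueError("member must not be empty")
    return ShareRequest(member.strip(), Permission(permission), message)
```

Calling `build_share_request("ana@example.com", "View")` produces a request with the View permission, while an unlisted label such as "Edit" raises a `ValueError`.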
Edit a category, group, dataset, data collection, or data product
In Data Catalog, each collection component, such as category, group, dataset, data collection, or data product, includes editable metadata such as name, description, purpose, key metrics, and properties. You can edit or update this information on the Summary tab, and manage associated metadata on other tabs such as Terms, Applications, or Policies, depending on the component type.
Note: You can edit a category, group, dataset, data collection, or data product only if you are the owner or if the component has been shared with you with edit permissions. If you do not see the edit options, contact the owner or your administrator to request the appropriate access.
Perform the following steps to edit a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays a hierarchy of categories, groups, datasets, data collections, and data products.
Select the component (category, group, dataset, data collection, or data product) you want to edit.
The Summary tab opens by default.
To edit text fields such as Name, Description, or Purpose, click the Edit icon next to the respective field, update the value, and click Save to apply the changes.
Update key metrics, such as Sensitivity, Trust Score, or Data Quality, in the Key Metrics section as required.
These metrics help validate readiness for publishing as a data product.
To update properties such as Domain or Status, click the Edit icon next to the field under the Properties section, make the necessary changes, and click Save.
Go to other tabs, such as Terms, Applications, or Policies, to add or remove relevant metadata.
Note: The available tabs vary based on the type of collection component selected.
After completing your updates, ensure that all required fields are defined and properly saved.
You have successfully updated the selected collection component. The changes are saved immediately and are visible to other users based on their assigned access rights.
Delete a category, group, dataset, data collection, or data product
In Data Catalog, you can delete any collection component, such as a category, group, dataset, data collection, or data product, when it is no longer needed. Deleting unused or obsolete components helps maintain a clean and organized catalog, reduces clutter, and ensures users can easily navigate and find relevant data assets.
Note: Only the owner of the collection component can delete it. If you do not see the Delete option, you might not have the required permissions.
Perform the following steps to delete a collection component in Data Catalog:
In the left navigation menu, click Data Canvas, and then click Collections.
The Collections panel displays a hierarchy of existing components, including categories, groups, datasets, data collections, and data products.
In the Collections panel, click the More options icon next to the component name (such as a dataset or data collection) that you want to delete.
Click Delete from the menu.
A confirmation prompt appears.
Click Confirm.
The component is permanently removed from the catalog.
You have successfully deleted the selected collection component. It no longer appears in the Collections hierarchy and is no longer accessible to users.
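The ownership rules stated in the notes for editing and deleting components can be expressed as two small checks. The following Python sketch is illustrative only: the function names and the `shared` mapping are hypothetical, and it assumes that the "edit permissions" mentioned in the edit note correspond to the Update permission in the Share dialog box.

```python
from typing import Dict


def can_edit(user: str, owner: str, shared: Dict[str, str]) -> bool:
    """The owner can edit; other users need the component shared with Update.

    `shared` maps a user name to the permission label granted to that user.
    Treating Update as the edit permission is an assumption.
    """
    return user == owner or shared.get(user) == "Update"


def can_delete(user: str, owner: str) -> bool:
    """Only the owner of a component can delete it."""
    return user == owner
```

Under these assumptions, a user who received the Update permission can edit a component but cannot delete it.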