Get started with Pentaho Data Catalog
Learn how to install and start working with Pentaho Data Catalog, which ingests, profiles, and curates structured and unstructured data through a combination of automation and machine learning.
Product overview
Pentaho Data Catalog rapidly ingests, profiles, and meticulously curates structured and unstructured data through a combination of automation and machine learning. This process involves data fingerprinting and the application of metadata rules to provide contextualization aligned with the business's terminology as documented in the business glossary.
AI-driven discovery
Data Catalog uses unique data fingerprinting to automate discovery and classification of structured, semi-structured, and unstructured data.
User interface
Data Catalog's interactive user interface provides a customized user experience for the business role of every user, promoting rich content authoring and resource knowledge.
Data Canvas
Data Catalog gives users access to the file-level and field-level metadata available for the entire catalog of data assets. In addition, for the assets that each user is authorized to view, Data Catalog displays data-based details such as minimum, maximum, and most frequent values. Users can also add their own information to the catalog in the form of descriptions and custom metadata designed for their organization.
Use the Data Canvas to explore and investigate your data. Here, you can find detailed insights into resource metadata to help you understand and clarify practical applications.
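As a rough illustration of the kind of field-level details mentioned above (minimum, maximum, and most frequent values), the following Python sketch computes them for a single column. It is not Data Catalog's profiling implementation; the function name profile_field and the sample values are hypothetical.

```python
from collections import Counter


def profile_field(values):
    """Summarize one column: min, max, most frequent value, and null count.

    Hypothetical helper for illustration only; it does not call any
    Data Catalog API.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {"min": None, "max": None, "most_frequent": None,
                "null_count": len(values)}
    # most_common(1) returns a list with one (value, count) pair.
    most_frequent, _count = Counter(non_null).most_common(1)[0]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "most_frequent": most_frequent,
        "null_count": len(values) - len(non_null),
    }


# Sample column with one missing value (made-up data).
ages = [34, 27, None, 34, 51]
summary = profile_field(ages)
print(summary)
```

A real profiler would also need to handle mixed types and very large columns, which this sketch ignores.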

Galaxy view
In Data Catalog, you can use the Galaxy view feature to quickly view the structure of your data and its details. Galaxy view is especially useful when you want to view information that is not easily visualized using the navigation tree in the Data Canvas. In the business glossary, you can use the Galaxy view feature to visualize and explore business terms and their relationships and to understand how terms are interconnected. You can also quickly identify related terms, synonyms, and hierarchies within specific domains or categories, which improves your understanding of the glossary's structure.

Business glossary
Data Catalog lets you build or import a taxonomy of business terms in a glossary, organize the terms into domains and categories, and establish relationships between them. Data Catalog distributes these terms to similar data across the cluster, producing a powerful index for business language searches. Business users can then find the right data quickly and understand its meaning and quality at a glance.
Duplicate detection
Data Catalog uses checksums to detect duplicate unstructured files, which is essential for controlling classified documents stored in multiple locations, preventing unauthorized access, and managing document leakage. If you know where data is duplicated, you can decide whether to archive or purge unused duplicates, which can reduce storage costs in expensive cloud services.
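The general idea behind checksum-based duplicate detection can be sketched in a few lines of Python: files with identical content produce the same checksum, so grouping files by checksum reveals duplicates. This is a minimal sketch, not Data Catalog's implementation. The choice of SHA-256 and the function names file_checksum and find_duplicates are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks.

    SHA-256 is an assumption; the source only says "checksums".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum and keep groups with 2+ files."""
    by_checksum = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_checksum[file_checksum(path)].append(path)
    return {checksum: paths
            for checksum, paths in by_checksum.items()
            if len(paths) > 1}
```

Calling find_duplicates on a directory returns a mapping from each shared checksum to the list of files that have it, which is the information you would need to decide which copies to archive or purge.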
Access control
User profiles, roles, and access restrictions combine to deny or grant metadata access to users.
User roles
Assigning roles to Data Catalog users lets administrators exercise role-based functional and access control for those users, such as who can assign a business term to data and who can update a business rule. In addition, you can use roles to establish access control over Data Catalog resources at the data source level. Roles also incorporate a set of predefined access levels that define which features of the catalog are available to different users.
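To make the combination of functional control and data source level control more concrete, here is a small, generic Python sketch of a role-based check. It is not how Data Catalog stores or evaluates roles. The Role class, the action names, the data source names, and the permissions given to each role are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """A hypothetical role: allowed actions plus allowed data sources."""
    name: str
    actions: frozenset
    data_sources: frozenset


def is_allowed(roles, action, data_source):
    """Allow the action if any single role grants both the action
    and access to the data source."""
    return any(
        action in role.actions and data_source in role.data_sources
        for role in roles
    )


# Made-up permission assignments, for illustration only.
business_steward = Role(
    name="Business Steward",
    actions=frozenset({"assign_business_term"}),
    data_sources=frozenset({"sales_db"}),
)
data_steward = Role(
    name="Data Steward",
    actions=frozenset({"assign_business_term", "update_business_rule"}),
    data_sources=frozenset({"hr_files"}),
)

user_roles = [business_steward]
print(is_allowed(user_roles, "assign_business_term", "sales_db"))
print(is_allowed(user_roles, "update_business_rule", "sales_db"))
```

In this sketch a user holding only the hypothetical Business Steward role can assign a business term on sales_db but cannot update a business rule there, mirroring the distinction described above.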
Data pipes
The data pipes feature in Data Catalog simplifies the migration, archiving, and movement of structured, semi-structured, and unstructured data to specified storage locations in a retrievable and queryable format, including copying one database to another. This feature facilitates data movement for evaluation and archiving and helps ensure compliance with organizational data storage and management policies. It can be used to meet regulatory retention requirements and reduce maintenance costs related to logs, databases, and structured data types. For more information, see the Manage data pipe templates section in Administer Pentaho Data Catalog.
Licensing
The license you purchase for Data Catalog determines the following:
Additional features that you can use (such as Pentaho Data Optimizer and Pentaho Data Mastering)
The number of data sources you can add
The amount of data you can scan
The number of Expert user roles that you can assign to users. The Expert user roles are:
Business Steward
Data Steward
Admin
Data Developer
Policy Manager
Data Catalog provides a policy manager to make it easier to perform complex data governance tasks. Use the policy manager to configure and enforce policies and standards and select rules to run on your data.
Machine Learning (ML) Models
With the ML Models feature in Data Catalog, you can integrate machine learning assets into the broader data catalog environment. You can discover, organize, and manage machine learning metadata such as models, versions, experiments, runs, parameters, metrics, and artifacts in a structured and traceable way. By capturing this metadata, you can ensure reproducibility of experiments, compare model performance, and maintain visibility into the evolution of the models over time until the model is deployed in production. The feature also supports governance through tagging, business term association, and future capabilities like model performance metrics in production and model drift detection. Additionally, it promotes collaboration among data scientists, ML engineers, and business users by providing a centralized view of ML assets, while also offering the flexibility to import and export metadata across environments. To learn more about ML Models, see the Machine Learning (ML) Models section in Use Pentaho Data Catalog.
Architecture
The following diagram provides an overview of the Data Catalog architectural components across the distributed application.

Pentaho Data Optimizer
If you have a license for it, Pentaho Data Optimizer is enabled when you install Data Catalog.
Data Optimizer is an intelligent data tiering solution that reduces operating costs and gives you seamless access to Hadoop data with S3-compatible object storage like Hitachi Content Platform. Data Optimizer extends Data Catalog by providing migrate, delete, and rehydrate functions.
You can use Data Optimizer to inventory stored data, identify content, view usage, and tier files and objects into long-term or deep archival storage. You can apply rule-driven data life cycle actions to account for compliance, manage costs, and mitigate risks, using a set of convenient tools and self-service processes for sustainable improvements in data management.
The benefits of using Data Optimizer apply regardless of the vendor you use and extend across local, cloud, and core environments. If needed, you can restore tiered files at any time.
Data Optimizer provides the following key capabilities:
File identification and classification
Rule-based governance of data location, life cycle, retention, and access
Rules-driven tiering and purging