# Get started with Pentaho Data Catalog

## Product overview

Pentaho Data Catalog rapidly ingests, profiles, and meticulously curates structured and unstructured data through a combination of automation and machine learning. This process involves data fingerprinting and the application of metadata rules to provide contextualization aligned with the business's terminology as documented in the business glossary.

### **AI-driven discovery**

Data Catalog uses unique data fingerprinting to automate discovery and classification of structured, semi-structured, and unstructured data.

### **User interface**

Data Catalog's interactive user interface provides a customized user experience for the business role of every user, promoting rich content authoring and resource knowledge.

### **Data Canvas**

Data Catalog gives users access to the file-level and field-level metadata available for the entire catalog of data assets. In addition, for the assets that each user is authorized to view, Data Catalog displays a rich view of data-based details such as minimum, maximum, and most frequent values. Users can add their own information to the catalog in the form of descriptions and custom metadata designed for their organization.
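
To make the data-based details concrete, here is a minimal sketch of how a minimum, maximum, and most-frequent-values summary could be computed for one field. The function name `profile_column` and the shape of the returned summary are hypothetical and chosen for illustration only; they do not represent how Data Catalog profiles data internally.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Summarize one field: count, nulls, min, max, and most frequent values.

    Illustrative only; this is not Data Catalog's profiling implementation.
    """
    # Ignore missing values when computing statistics.
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0, "nulls": len(values), "min": None,
                "max": None, "most_frequent": []}
    return {
        "count": len(present),
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        # most_common returns (value, occurrences) pairs, highest count first.
        "most_frequent": Counter(present).most_common(top_n),
    }


summary = profile_column([3, 1, 3, None, 2, 3, 1], top_n=2)
print(summary)
```

A summary like this is what lets a reader judge a field at a glance without opening the underlying data.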

Use the Data Canvas to explore and investigate your data. Here, you can find detailed insights into resource metadata to help you understand and clarify practical applications.

<figure><img src="/files/HLdjg8Gq7C6pBFWFTawz" alt=""><figcaption><p>Data Canvas View</p></figcaption></figure>

### **Galaxy view**

In Data Catalog, you can use the Galaxy view feature to quickly view the structure of your data and its details. Galaxy view is especially useful when you want to view information that is not easily visualized using the navigation tree in the Data Canvas.

In the business glossary, you can use the Galaxy view feature to visualize and explore business terms and their relationships and to understand how terms are interconnected. You can also quickly identify related terms, synonyms, and hierarchies within specific domains or categories, which improves your understanding of the glossary's structure.

<figure><img src="/files/PyGIxqXDeIM6VJxH7jZC" alt=""><figcaption><p>Galaxy View</p></figcaption></figure>

### **Business glossary**

Data Catalog lets you build or import a taxonomy of business terms in a glossary, organize the terms into domains and categories, and establish relationships between them. Data Catalog distributes these terms to similar data across the cluster, producing a powerful index for business language searches. Business users can then find the right data quickly and understand its meaning and quality at a glance.

### **Duplicate detection**

Data Catalog uses checksums to detect duplicate unstructured files, which is essential for controlling classified documents stored in multiple locations, preventing unauthorized access, and managing document leaks. When you know where data is duplicated, you can decide whether to archive or purge unused duplicates, which can reduce storage costs in expensive cloud services.
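
The general idea of checksum-based duplicate detection can be sketched as follows. This is an illustration under stated assumptions, not Data Catalog's implementation: the documentation does not say which checksum algorithm the product uses, so SHA-256 is an assumption here, and the function names `file_checksum` and `find_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file (algorithm choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum and keep groups with more than one file."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_checksum(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Files with identical content produce the same checksum regardless of name or location, which is why this approach can find copies scattered across multiple directories.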

### **Access control**

User profiles, roles, and access restrictions combine to deny or grant metadata access to users.

### **User roles**

Assigning roles to Data Catalog users lets administrators exercise role-based functional and access control for those users, such as who can assign a business term to data and who can update a business rule. In addition, you can use roles to establish access control over Data Catalog resources at the data source level. Roles also incorporate a set of predefined access levels that define which features of the catalog are available to different users.

### **Data pipes**

The data pipes feature in Data Catalog simplifies the migration, archiving, and movement of structured, semi-structured, and unstructured data to specified storage locations in a retrievable and queryable format, including copying one database to another. This feature facilitates data movement for evaluation and archiving and helps ensure compliance with organizational data storage and management policies. You can use it to meet regulatory retention requirements and to reduce maintenance costs related to logs, databases, and structured data types. For more information, see the [Manage data pipe templates](/pdc-admin/pdc-10.2-admin/pdc_manage-data-pipe-templates.md) section in [Administer Pentaho Data Catalog](https://docs.pentaho.com/pdc-admin/pdc-10.2-admin/).

### **Policy Manager**

Data Catalog provides a policy manager to make it easier to perform complex data governance tasks. Use the policy manager to configure and enforce policies and standards and select rules to run on your data.

### **Machine Learning (ML) Models**

With the ML Models feature in Data Catalog, you can integrate machine learning assets into the broader data catalog environment. You can discover, organize, and manage machine learning metadata such as models, versions, experiments, runs, parameters, metrics, and artefacts in a structured and traceable way. By capturing this metadata, you can ensure reproducibility of experiments, compare model performance, and maintain visibility into how models evolve until they are deployed in production.

The feature also supports governance through tagging, business term association, and future capabilities like model performance metrics in production and model drift detection. Additionally, it promotes collaboration among data scientists, ML engineers, and business users by providing a centralized view of ML assets, while also offering the flexibility to import and export metadata across environments. To learn more about ML Models, see the [Machine Learning (ML) Models](/pdc-use/pdc-10.2-use/pdc-machine-learning-ml-models-ug.md) section in [Use Pentaho Data Catalog](https://docs.pentaho.com/pdc-use/pdc-10.2-use/).

## Pentaho Data Optimizer

If you have a license for it, Pentaho Data Optimizer is enabled when you install Data Catalog.

Data Optimizer is an intelligent data tiering solution that reduces operating costs and gives you seamless access to Hadoop data with S3-compatible object storage like Hitachi Content Platform. Data Optimizer extends Data Catalog by providing migrate, delete, and rehydrate functions.

You can use Data Optimizer to inventory stored data, identify content, view usage, and tier files and objects into long-term or deep archival storage. You can apply rule-driven data life cycle actions to meet compliance requirements, manage costs, and mitigate risks, using a set of convenient tools and self-service processes for sustainable improvements in data management.

The benefits of using Data Optimizer apply regardless of the vendor you use and extend across local, cloud, and core environments. If needed, you can restore tiered files at any time.

Data Optimizer provides the following key capabilities:

* File identification and classification
* Rule-based governance of data location, life cycle, retention, and access
* Rules-driven tiering and purging
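
As a rough illustration of what rules-driven tiering and purging means, the sketch below assigns an action to each file based on how long it has been idle. Everything in it is hypothetical: the `FileRecord` and `Rule` types, the action names `tier`, `purge`, and `keep`, and the idle-time thresholds are assumptions made for the example and are not Data Optimizer's rule model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    last_accessed: datetime


@dataclass
class Rule:
    action: str          # hypothetical action name, for example "tier" or "purge"
    min_idle: timedelta  # rule applies when a file has been idle at least this long


def plan_actions(files, rules, now):
    """Pick, for each file, the action of the strictest rule it satisfies."""
    # Check rules with the longest idle threshold first so "purge" wins over "tier".
    ordered = sorted(rules, key=lambda r: r.min_idle, reverse=True)
    plan = {}
    for record in files:
        idle = now - record.last_accessed
        plan[record.path] = next(
            (r.action for r in ordered if idle >= r.min_idle), "keep"
        )
    return plan


now = datetime(2024, 1, 1)
rules = [Rule("tier", timedelta(days=90)), Rule("purge", timedelta(days=730))]
files = [
    FileRecord("/data/recent.csv", now - timedelta(days=10)),
    FileRecord("/data/stale.csv", now - timedelta(days=200)),
    FileRecord("/data/ancient.csv", now - timedelta(days=800)),
]
print(plan_actions(files, rules, now))
```

Separating the plan from its execution, as this sketch does, makes it possible to review which files would be tiered or purged before any data is moved.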

