What's new in Pentaho Data Catalog

Learn the highlights of this Pentaho Data Catalog release.

Pentaho Data Catalog Overview

A modern organization must be data fit. As data volumes increase, so too does the necessity and cost of maintaining data in a state that is ready for business use. To harness data for business decisions and enable artificial intelligence, it is imperative that data is reliable, of high quality, and readily accessible to data users. The need to discover content across structured and unstructured formats, both on-premises and in the cloud, is more critical than ever. Organizations must continuously monitor their data to identify trends and anomalies and maintain data hygiene in tandem with data growth.

Policies governing data lifecycle and quality must be enforced to ensure that high-quality data is available to consumers. Consequently, data users and models can efficiently locate and utilize data through the data catalog, which is essential for a modern data-driven organization.

Pentaho Data Catalog swiftly ingests, profiles, and curates both structured and unstructured data utilizing automation and machine learning. Data and metadata fingerprinting rules are employed to contextualize data in the language of the business, as documented in the business glossary. The policy manager facilitates the implementation of governance and security policies.

A robust rules engine determines quality, sensitivity, and usage patterns. Activate your metadata by leveraging Data Catalog monitoring and notification capabilities. Construct a relationship graph across business entities and terms to infuse semantic understanding into the data.

Data fingerprints are analyzed to identify potential duplicates, copies, and similarities across data stores, thereby assessing data movement, optimization, and mastering needs. Data lineage support for OpenLineage enables tracking of data as it flows through the organization, fostering trust and facilitating early-stage data quality and remediation activities.

An advanced observability stack captures popular assets, searches, and trends, enabling stewardship organizations to concentrate their efforts on pertinent data. Collaborate within the context of your data and business vocabulary to capture tribal knowledge, recommendations, and user perspectives on data assets.

Documentation for this release is at the following link:

https://docs.pentaho.com/

What's new in Pentaho Data Catalog

Pentaho Data Catalog release 10.2.5 provides new features to discover content across structured and unstructured formats, both on-premises and in the cloud.

Learn the highlights of the Pentaho Data Catalog 10.2.7, 10.2.6, 10.2.5, 10.2.1, and 10.2 releases.

10.2.7

The key features in this release are:

  • Version control of datasets

    Users can capture the state of a dataset by duplicating it or appending data to it. Each action creates a new version, preserving all properties without copying generated values. Dataset columns cannot be removed, duplication is limited to one copy per collection, and all changes are tracked through versioning.

  • Support for AWS S3 authentication modes with Secrets Manager

    The AWS S3 data source feature now supports secrets stored in AWS Secrets Manager. Secrets are retrieved dynamically and used to access S3 resources, so sensitive information does not need to be hardcoded.

  • Data Pipes - Containerization and deployment

    Pentaho Data Integration (PDI) components for Data Pipes are ready to use immediately after the PDC image is deployed, without manual setup steps such as downloading, unpacking, configuring, or starting services separately.

  • UI improvements

    • Design appropriate interface for "published" APIs

      Support for key APIs, including search, notifications, dataset management, job execution, status retrieval, and more.

    • Select and add tables to a Collection/dataset

      A Data Catalog user can search by a pattern, filter and multi-select the matching results, and add them in bulk to a dataset or collection. Users can also use Add to Cart in the table view of the data canvas.

  • Lineage Canvas enhancements: Delete manual lineage links and support for column-level lineage

    Users can delete manually added lineage in the PDC lineage canvas while the integrity of system-generated lineage is maintained. The canvas also supports column-level lineage transformations, enabling detailed tracking of field-level data changes. Multiple UI and UX bugs are also fixed to provide a more seamless and intuitive user experience.

  • Profiling Support for Semi-Structured data – Parquet

    Profiling support for Parquet and part files enhances users' ability to analyze and tag these file types.

  • Azure Blob to Azure Blob

    Pentaho Data Optimizer (PDO) users can migrate data between different Azure Blob or ADLS Gen2 instances, enabling movement from hot/expensive storage to cool, cold, or archive tiers to optimize and reduce storage costs. This feature leverages Azure’s tiered storage options, allowing efficient data management based on access frequency and retention needs.

  • Enhance gateway service to support custom model endpoints

    Users can configure custom model endpoints for both request and response models and specify environment variables required for their machine learning (ML) services. This enables seamless integration and flexible deployment of custom ML models by supporting endpoint customization and necessary runtime configuration.

  • Infrastructure configuration for AWS deployment

    • Ingest all PDC logs into Amazon CloudWatch to centralize monitoring and log analysis for the AWS deployment.

    • Restrict OpenSearch access by default to enhance security.

    • Confirm compatibility with Kubernetes 1.32 for the deployment.

    • Document Fluentbit configuration steps for connecting to CloudWatch.

    • PDC Helm chart to support using a customer's OpenSearch instance on the public cloud.

  • Production ML models - Nvidia Triton inference server

    Machine learning (ML) models hierarchy service, which imports metadata up to pre-production, has been extended to capture production model inference details, including data, metadata, and metrics such as successes and failures.

  • Structured and semi-structured file profiling options

    Data Catalog users can profile both structured and semi-structured files, perform data discovery when selecting multiple files or entire folders simultaneously, and apply profiling actions in bulk across all applicable data types.

  • Support for IRSA, CAR, and Istio

    • IAM Roles for Service Accounts (IRSA) support for PDC: Provides a secure and convenient way to grant Kubernetes pods access to AWS resources.

    • CAR Integration: Allow IAM users to assume roles for cross-account AWS service access when interacting with PDC “Data Resource” through the Cross-Account Role (CAR) feature.

    • Istio on EKS: Provides a streamlined process to install Istio Service Mesh on Amazon EKS using the EKS add-on in Ambient mode, combined with deploying the PDC application configured to use the AWS ALB Ingress Controller.
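The Secrets Manager item in the list above describes retrieving credentials at run time instead of hardcoding them. The following is a minimal sketch of that general pattern, not Data Catalog's implementation. It assumes a client that behaves like `boto3.client("secretsmanager")`, whose `get_secret_value` call returns the secret under `SecretString`. The secret name `pdc/s3-source`, the JSON keys `access_key_id` and `secret_access_key`, and the `FakeSecretsClient` stand-in are all hypothetical and exist only so the sketch runs without AWS access.

```python
import json


def load_s3_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a JSON secret and return keyword arguments for an S3 client.

    `secrets_client` is expected to behave like boto3.client("secretsmanager").
    The JSON keys read below are hypothetical; use whatever your secret stores.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {
        "aws_access_key_id": secret["access_key_id"],
        "aws_secret_access_key": secret["secret_access_key"],
    }


class FakeSecretsClient:
    """Stand-in for the real client so the sketch runs without AWS access."""

    def get_secret_value(self, SecretId: str) -> dict:
        payload = {"access_key_id": "AKIAEXAMPLE", "secret_access_key": "example-secret"}
        return {"SecretString": json.dumps(payload)}


credentials = load_s3_credentials(FakeSecretsClient(), "pdc/s3-source")
print(sorted(credentials))
```

With boto3 installed and AWS access configured, the same function could be called as `boto3.client("s3", **load_s3_credentials(boto3.client("secretsmanager"), "pdc/s3-source"))`, so no key material appears in source code or configuration files.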

10.2.6

The key features in this release are:

  • Improved data sampling features

    • Added support for extracting data samples directly from files, with a new Extract Samples feature and an option to Skip recent days in the data discovery interface.

    • Enhanced visibility of sample data by capturing and displaying both the count and percentage of sample values.

  • Masking of sensitive data

    Introduced the ability to mask sensitive data within the View Samples interface. Data can now be masked dynamically based on configured Tags and Sensitivity values.

  • Data profiling

    Enhanced profiling capabilities with support for multi-level JSON files.

  • Data Collections

    • Enhanced Data Collections functionality with integration into Galaxy View, and the ability to add or remove custom properties, policies, and other relationships.

    • You can now also assign data labels and publish physical assets to a collection.

  • Collapsible cards

    On the Summary tabs, any empty cards are now collapsed by default. Users can expand the cards as needed, and their configuration will be saved automatically for future sessions.

  • EKS support

    Enhanced support for deploying Data Catalog on Amazon EKS, including a configuration option to integrate with AWS CloudWatch for monitoring and logging.
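The data sampling item in the list above mentions showing both the count and the percentage of sample values. As a rough illustration of that calculation only, and not of how Data Catalog computes or displays it, the sketch below tallies a list of sampled values. The function name `summarize_samples` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_samples(values: list) -> list[tuple]:
    """Return (value, count, percentage) rows, most frequent value first."""
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    return [
        (value, n, round(100 * n / total, 1))
        for value, n in counts.most_common()
    ]


# Hypothetical sample of a "country" column.
samples = ["US", "DE", "US", "FR", "US", "DE", "US", "US"]
for value, n, pct in summarize_samples(samples):
    print(f"{value}: {n} ({pct}%)")
```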

10.2.5

The key features in this release are:

  • Discovery (faceted search)

    Support for full text search across asset types in Data Catalog and the ability to filter results using facets.

  • Data delivery

    Data pipe functionality now supports encryption. This requires the Pentaho Data Integration (PDI) engine to be deployed and configured.

  • ML Models hierarchy

    A structured model hierarchy for integrating ML models, versions, experiments, runs, and associated metadata. It also includes governance elements like policies, glossaries, and applications managed within Data Catalog.

  • Tableau support and reports lineage

    Integrates metadata from Tableau workbooks, reports, dashboards, and other assets, along with their relationships to data sources and datasets, to track data lineage.

  • Request access

    Users can request permission to access the metadata of data assets in Data Catalog. A request initiates a workflow, using ServiceNow and JIRA, for review and approval by data owners or administrators to ensure secure data access management.

  • Data labelling

    Apply meaningful labels to data assets to enhance data discovery, understanding, governance, and trust within the organizational data landscape. Labels help users quickly grasp the context and characteristics of data assets without needing to inspect the raw data itself.

  • Data profiling enhancements

    Ability to provide a WHERE clause to filter candidate rows for profiling.

  • Data Products

    Select data assets (tables and files) to create a collection, provide guidance on the sensitivity and quality of the collection, and deliver a data product to be shared within the organization.

  • Document discovery enhancements

    Support the ability to summarize a document and detect sentiment using ML models.

  • Similarity detection for tabular data (Tables and Files)

    Use the column names to determine potential duplicate tables and files that have similar structures.

  • PDI – PDC Lineage

    Automatically track, visualize, and analyze data’s entire journey through PDI lineage, making this information accessible in the Pentaho Data Catalog for governance, compliance, and insights.
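The similarity detection item in the list above says that column names are used to find tables and files with similar structures. The release notes do not state which measure is used, so the sketch below swaps in Jaccard similarity over normalized column names as one plausible approach. The function names, the 0.8 threshold, and the example tables are all assumptions made for illustration.

```python
from itertools import combinations


def column_similarity(columns_a: list[str], columns_b: list[str]) -> float:
    """Jaccard similarity of two column-name sets, ignoring case and padding."""
    a = {c.strip().lower() for c in columns_a}
    b = {c.strip().lower() for c in columns_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(tables: dict[str, list[str]], threshold: float = 0.8) -> list[tuple]:
    """Return (name_1, name_2, score) for every pair at or above the threshold."""
    pairs = []
    for (name_1, cols_1), (name_2, cols_2) in combinations(tables.items(), 2):
        score = column_similarity(cols_1, cols_2)
        if score >= threshold:
            pairs.append((name_1, name_2, score))
    return pairs


# Hypothetical tables: the backup differs from the original only in letter case.
tables = {
    "customers": ["id", "name", "email", "country"],
    "customers_backup": ["ID", "Name", "Email", "Country"],
    "orders": ["id", "customer_id", "total"],
}
print(find_similar(tables))
```

A name-only comparison is cheap because it needs no access to row data, but it can flag tables that merely share a schema, so candidates would still need review.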

10.2.1

The key features in this release are:

  • Product licensing support

    • Your software license determines your user, data source count, and data capacity entitlements.

    • There are two tiers of users: Business Users and Expert Users.

    • Additional licensed features are available for Data Optimizer and Data Mastering.

  • Data delivery

    Enhancements to data delivery via Data Pipes to support additional databases and object and file store options.

  • OT asset hierarchy

    Offers a comprehensive operational and industrial data context, encompassing policies, applications, and a business glossary. It plays a crucial role in facilitating the convergence of IT and OT systems.

  • Improved rule engine

    We have enhanced the rule engine to decouple Definitions from Rules. You can now create a definition with multiple actions and then use that definition to create a rule that applies to all your applicable data sources in a few clicks.

  • Additional data source support

    Pentaho Data Catalog 10.2.1 now supports Metadata Ingest, Data Profiling, and Data Identification for InfluxDB, Redshift, Google BigQuery, Sybase, and Salesforce.

  • Import and export

    To facilitate migration and onboarding we have introduced support for importing and exporting the following assets: Data Connections, Dictionaries, Patterns, Rule Definitions, Business Glossary, Policies and Applications.

  • Add relationships and display for BI Assets

    Pentaho Data Catalog now supports the Term, Policy/Standards and Reference data relations to BI report asset types.

  • User guided sampling

    We have introduced user guided sampling to allow users to profile data based on a subset of rows rather than profiling all rows in a large table.
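The user guided sampling item in the list above describes profiling a subset of rows instead of every row in a large table. The sketch below shows the general idea with a fixed-size random sample. It is not Data Catalog's sampling method, and the function name `profile_sample`, its parameters, and the example rows are hypothetical.

```python
import random


def profile_sample(rows: list[dict], column: str, sample_size: int, seed: int = 0) -> dict:
    """Profile one column using at most `sample_size` randomly chosen rows."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    if len(rows) <= sample_size:
        sample = rows
    else:
        sample = rng.sample(rows, sample_size)
    values = [row[column] for row in sample]
    non_null = [v for v in values if v is not None]
    return {
        "rows_profiled": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }


# Hypothetical table of 10,000 rows; every third row has a null status.
rows = [{"status": "open" if i % 3 else None} for i in range(10_000)]
profile = profile_sample(rows, "status", sample_size=500)
print(profile["rows_profiled"])
```

Statistics computed this way are estimates for the full table, which is the trade-off for not scanning every row.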

10.2

The key features in this release are:

  • Data delivery

    A business user can find data and, with a few clicks, configure a data pipeline that delivers the data to its desired destination.

  • Application catalog

    Catalog your applications, their governance requirements, ownership, and their relationships with data assets, reference data, and other elements in the data catalog.

  • Governance policies

    Document regulatory or corporate regulations, policies, and standards in the catalog, building a source of record and an intuitive interface to find policies and the data to which they apply.

  • Data lineage

    Build data lineage for data movement executed by Pentaho ETL. Compatible with OpenLineage, this capability offers a quick return on investment for customers interested in lineage for Pentaho Data Integration.

  • Galaxy view

    This release enhances the visual representation of assets (such as data assets, business terms, policies/standards/rules, reference data, and applications) to easily grasp the dependencies, understand potential impact, and collaborate with all the stakeholders to make the right business decision.

  • Trust score

    Support for bringing data quality scores from the Pentaho Data Quality product, verification of lineage, and assessing sensitivity to build a trust score.

  • License support

    This release introduces a license solution for the Data Catalog and Pentaho Data Optimizer products. You can configure the number of data sources and Expert users (Steward, Admin, Developer) you can add, and the size of data scanned from file systems.

In addition, there are numerous improvements including extensive rules support, additional support for data sources and technologies, and continued emphasis on ease of use and performance improvements.

Product roles & licensing

Your Data Catalog software license determines the following entitlements:

  • additional features that you can use (Pentaho Data Optimizer and Pentaho Data Mastering)

  • the number of data sources you can add

  • the amount of data you can scan

  • the number of Expert user roles that you can assign to users. The Expert user roles are:

    • Business Steward

    • Data Steward

    • Admin

    • Data Developer

To calculate the total number of users for licensing, use the following table for mapping from product role to licensing role. A named user with multiple product roles maps into a single user for licensing consideration. For example, if a user is given the role of data source administrator and data source access manager, that is considered a single licensed Data Steward. If two distinct users are given these two roles, then that is considered two licensed Data Stewards.

Product roles                        Licensing considerations
-----------------------------------  ------------------------
Owner                                Admin User
Reader                               Business User
User Access Administrator            Admin User
Data Source Administrator            Data Steward
Data Source Access Manager           Data Steward
Data Quality Administrator           Data Steward
Data Quality Operator                Data Steward
Data Quality Rule Approver           Data Steward
Data Profiler                        Data Steward
Data Sample Viewer                   Business User
Data Tagger                          Data Steward
Data Tag Viewer                      Data Steward
Business Glossary Administrator      Data Steward
Business Glossary Mapper             Data Steward
Data Identification Methods Manager  Data Steward
Data Source Steward                  Data Steward
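The counting rule described before the table can be sketched as a small calculation. The role mapping below is copied from the table, but one point is an assumption: the text does not say how to count a user whose roles map to different licensing roles, so this sketch counts such a user once under the highest tier, using a hypothetical precedence of Admin User, then Data Steward, then Business User. The function name and user names are also illustrative.

```python
ROLE_TO_LICENSE = {
    "Owner": "Admin User",
    "Reader": "Business User",
    "User Access Administrator": "Admin User",
    "Data Source Administrator": "Data Steward",
    "Data Source Access Manager": "Data Steward",
    "Data Quality Administrator": "Data Steward",
    "Data Quality Operator": "Data Steward",
    "Data Quality Rule Approver": "Data Steward",
    "Data Profiler": "Data Steward",
    "Data Sample Viewer": "Business User",
    "Data Tagger": "Data Steward",
    "Data Tag Viewer": "Data Steward",
    "Business Glossary Administrator": "Data Steward",
    "Business Glossary Mapper": "Data Steward",
    "Data Identification Methods Manager": "Data Steward",
    "Data Source Steward": "Data Steward",
}

# Assumption: highest tier first. The source does not define this ordering.
PRECEDENCE = ["Admin User", "Data Steward", "Business User"]


def licensed_users(assignments: dict[str, list[str]]) -> dict[str, int]:
    """Count each named user once, under the highest licensing role they hold."""
    totals = {tier: 0 for tier in PRECEDENCE}
    for roles in assignments.values():
        tiers = {ROLE_TO_LICENSE[role] for role in roles}
        if tiers:
            totals[min(tiers, key=PRECEDENCE.index)] += 1
    return totals


# One user holding both roles is a single licensed Data Steward.
one_user = {"ana": ["Data Source Administrator", "Data Source Access Manager"]}
# Two distinct users holding one role each are two licensed Data Stewards.
two_users = {"ana": ["Data Source Administrator"], "ben": ["Data Source Access Manager"]}
print(licensed_users(one_user), licensed_users(two_users))
```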
