What's new in Pentaho Data Catalog
Learn the highlights of this Pentaho Data Catalog release.
Pentaho Data Catalog Overview
A modern organization must be data fit. As data volumes increase, so too do the necessity and cost of maintaining data in a state that is ready for business use. To harness data for business decisions and enable artificial intelligence, it is imperative that data is reliable, of high quality, and readily accessible to data users. The need to discover content across structured and unstructured formats, both on-premises and in the cloud, is more critical than ever. Organizations must continuously monitor their data to identify trends and anomalies and maintain data hygiene in tandem with data growth.
Policies governing data lifecycle and quality must be enforced to ensure that high-quality data is available to consumers. Consequently, data users and models can efficiently locate and utilize data through the data catalog, which is essential for a modern data-driven organization.
Pentaho Data Catalog swiftly ingests, profiles, and curates both structured and unstructured data utilizing automation and machine learning. Data and metadata fingerprinting rules are employed to contextualize data in the language of the business, as documented in the business glossary. The policy manager facilitates the implementation of governance and security policies.
A robust rules engine determines quality, sensitivity, and usage patterns. Activate your metadata by leveraging Data Catalog monitoring and notification capabilities. Construct a relationship graph across business entities and terms to infuse semantic understanding into the data.
Data fingerprints are analyzed to identify potential duplicates, copies, and similarities across data stores, thereby assessing data movement, optimization, and mastering needs. Data lineage support for OpenLineage enables tracking of data as it flows through the organization, fostering trust and facilitating early-stage data quality and remediation activities.
An advanced observability stack captures popular assets, searches, and trends, enabling stewardship organizations to concentrate their efforts on pertinent data. Collaborate within the context of your data and business vocabulary to capture tribal knowledge, recommendations, and user perspectives on data assets.
Documentation for this release is at the following link:
What's new in Pentaho Data Catalog
Pentaho Data Catalog release 10.2.5 provides new features to discover content across structured and unstructured formats, both on-premises and in the cloud.
Learn the highlights of the Pentaho Data Catalog 10.2.7, 10.2.6, 10.2.5, 10.2.1, and 10.2 releases.
10.2.7
The key features in this release are:
Version control of datasets
Users can capture the state of a dataset by duplicating it or appending data to it. Each action creates a new version that preserves all properties but does not copy generated values. Dataset columns cannot be removed, duplication is limited to one copy per collection, and all changes are tracked through versioning.
Support for AWS S3 authentication modes with Secrets Manager
The AWS S3 data source feature now supports secrets stored in AWS Secrets Manager. Data Catalog retrieves them dynamically and uses them to access S3 resources, so sensitive information does not need to be hardcoded.
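The general pattern of fetching credentials at run time instead of hardcoding them can be sketched as follows. This is a minimal illustration, not Data Catalog's internal code. The `get_secret_value` call and its `SecretString` response field belong to the AWS Secrets Manager API as exposed by boto3. The function name, the secret name, and the JSON key names (`access_key_id`, `secret_access_key`) are assumptions made for this example, and a small stand-in client is used so the sketch runs without an AWS account.

```python
import json


def load_s3_credentials(secrets_client, secret_id):
    """Fetch S3 credentials from AWS Secrets Manager at run time.

    secrets_client is expected to behave like boto3.client("secretsmanager").
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # ASSUMPTION: the secret is stored as JSON with these two key names.
    return secret["access_key_id"], secret["secret_access_key"]


class FakeSecretsClient:
    """Stand-in for the real client so the example runs offline."""

    def get_secret_value(self, SecretId):
        payload = {"access_key_id": "AKIAEXAMPLE", "secret_access_key": "example-secret"}
        return {"Name": SecretId, "SecretString": json.dumps(payload)}


# "pdc/s3/readonly" is a hypothetical secret name.
access_key, secret_key = load_s3_credentials(FakeSecretsClient(), "pdc/s3/readonly")
print(access_key)
```

In a real deployment the stand-in would be replaced by `boto3.client("secretsmanager")`, and the returned values would be passed to the S3 client rather than printed.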
Data Pipes - Containerization and deployment
Pentaho Data Integration (PDI) components for Data Pipes are ready to use immediately after you deploy the PDC image, without manual setup steps such as downloading, unpacking, configuring, or starting services separately.
UI improvements
Design appropriate interface for "published" APIs
Support for key APIs, including search, notifications, dataset management, job execution, status retrieval, and many more.
Select and add tables to a Collection/dataset
A Data Catalog user can search by a pattern, filter and multi-select the matching results, and add them in bulk to a dataset or collection. Users can also use Add to Cart in the table view of the data canvas.
Lineage Canvas enhancements: Delete manual lineage links and support for column-level lineage
Users can now delete manually added lineage on the PDC lineage canvas while the integrity of system-generated lineage is maintained. The canvas also supports column-level lineage transformations for detailed tracking of field-level data changes. Multiple UI and UX bugs have been fixed to provide a more seamless and intuitive user experience.
Profiling Support for Semi-Structured data – Parquet
Profiling support for Parquet and part files enables users to analyze and tag these file types.
Azure Blob to Azure Blob
Pentaho Data Optimizer (PDO) users can migrate data between different Azure Blob or ADLS Gen2 instances, enabling movement from hot/expensive storage to cool, cold, or archive tiers to optimize and reduce storage costs. This feature leverages Azure’s tiered storage options, allowing efficient data management based on access frequency and retention needs.
Enhance gateway service to support custom model endpoints
Users can configure custom model endpoints for both request and response models and specify environment variables required for their machine learning (ML) services. This enables seamless integration and flexible deployment of custom ML models by supporting endpoint customization and necessary runtime configuration.
Infrastructure configuration for AWS deployment
Ingest all PDC logs into Amazon CloudWatch to centralize monitoring and log analysis for the AWS deployment.
Restrict OpenSearch access by default to enhance security.
Confirm compatibility with Kubernetes 1.32 for the deployment.
Document Fluent Bit configuration steps for connecting to CloudWatch.
The PDC Helm chart supports using a customer's OpenSearch instance on the public cloud.
Production ML models - NVIDIA Triton Inference Server
The machine learning (ML) models hierarchy service, which imports metadata up to pre-production, has been extended to capture production model inference details, including data, metadata, and metrics such as successes and failures.
Support for IRSA, CAR, and Istio
IAM Roles for Service Accounts (IRSA) support for PDC: Provides a secure and convenient way to grant Kubernetes pods access to AWS resources.
CAR integration: Allows IAM users to assume roles for cross-account AWS service access when interacting with a PDC “Data Resource” through the Cross-Account Role (CAR) feature.
Istio on EKS: Provides a streamlined process to install Istio Service Mesh on Amazon EKS using the EKS add-on in Ambient mode, combined with deploying the PDC application configured to use the AWS ALB Ingress Controller.
10.2.6
The key features in this release are:
Improved data sampling features
Added support for extracting data samples directly from files, with a new Extract Samples feature and an option to Skip recent days in the data discovery interface.
Enhanced visibility of sample data by capturing and displaying both the count and percentage of sample values.
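As an illustration of what a count and percentage per sample value look like, here is a minimal sketch. It is not Data Catalog's implementation. The function name and the sample values are made up for the example.

```python
from collections import Counter


def summarize_samples(values):
    """Return (value, count, percentage) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, round(100.0 * count / total, 1))
        for value, count in counts.most_common()
    ]


# Hypothetical sample values for a "country" column.
rows = summarize_samples(["US", "US", "DE", "US", "FR", "DE", "US", "US", "DE", "US"])
for value, count, pct in rows:
    print(f"{value}: {count} ({pct}%)")
# US: 6 (60.0%)
# DE: 3 (30.0%)
# FR: 1 (10.0%)
```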
Masking of sensitive data
Introduced the ability to mask sensitive data within the View Samples interface. Data is masked dynamically based on configured Tags and Sensitivity values.
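The idea of masking sample values based on tags and sensitivity can be sketched as follows. This is only an illustration under stated assumptions: the tag names, the sensitivity levels, the rule sets, and the "keep the last four characters" masking style are all hypothetical and are not taken from Data Catalog.

```python
# ASSUMPTION: hypothetical rule sets; real tag and sensitivity values are
# whatever has been configured in the catalog.
MASKED_TAGS = {"PII", "PCI"}
MASKED_SENSITIVITIES = {"HIGH"}


def mask_value(value, keep_last=4):
    """Replace all but the last keep_last characters with asterisks."""
    text = str(value)
    if len(text) <= keep_last:
        return "*" * len(text)
    return "*" * (len(text) - keep_last) + text[-keep_last:]


def view_samples(column, samples):
    """Return samples, masked when the column's tags or sensitivity require it."""
    should_mask = (
        bool(MASKED_TAGS & set(column["tags"]))
        or column["sensitivity"] in MASKED_SENSITIVITIES
    )
    return [mask_value(v) if should_mask else v for v in samples]


card_column = {"tags": ["PCI"], "sensitivity": "LOW"}
city_column = {"tags": [], "sensitivity": "LOW"}
print(view_samples(card_column, ["4111111111111111"]))  # ['************1111']
print(view_samples(city_column, ["Berlin"]))            # ['Berlin']
```

Because the decision is made when samples are viewed, changing a column's tags or sensitivity changes what is shown without altering the stored samples.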
Data profiling
Enhanced profiling capabilities with support for multi-level JSON files.
Data Collections
Enhanced Data Collections functionality with integration into Galaxy View, and the ability to add or remove custom properties, policies, and other relationships.
You can now also assign data labels and publish physical assets to a collection.
Collapsible cards
On the Summary tabs, empty cards are now collapsed by default. Users can expand the cards as needed, and their configuration is saved automatically for future sessions.
EKS support
Enhanced support for deploying Data Catalog on Amazon EKS, including a configuration option to integrate with Amazon CloudWatch for monitoring and logging.
10.2.5
The key features in this release are:
Discovery (faceted search)
Support for full text search across asset types in Data Catalog and the ability to filter results using facets.
Data delivery
Data pipe functionality now supports encryption. This requires the Pentaho Data Integration (PDI) engine to be deployed and configured.
ML Models hierarchy
A structured model hierarchy for integrating ML models, versions, experiments, runs, and associated metadata. It also includes governance elements like policies, glossaries, and applications managed within Data Catalog.
Tableau support and reports lineage
Data Catalog integrates metadata from Tableau workbooks, reports, dashboards, and other assets, along with their relationships to data sources and datasets, to track data lineage.
Request access
Users can request permission to access metadata of data assets in Data Catalog. A request initiates a workflow, using ServiceNow and Jira, for review and approval by data owners or administrators to ensure secure data access management.
Data labelling
Provide meaningful labels for data assets to enhance data discovery, understanding, governance, and trust within the organizational data landscape. Labels help users quickly grasp the context and characteristics of data assets without needing to inspect the raw data itself.
Data profiling enhancements
Ability to provide a WHERE clause to filter candidate rows for profiling.
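The effect of a WHERE clause on profiling can be illustrated with a small, self-contained sketch that uses an in-memory SQLite database. The table, the column names, and the `profile_column` helper are hypothetical and are not part of Data Catalog. The point is only that the same statistics are computed over the filtered rows instead of the whole table.

```python
import sqlite3


def profile_column(conn, table, column, where=None):
    """Compute simple profile statistics, optionally over filtered rows only."""
    # NOTE: identifiers and the filter are interpolated directly for brevity,
    # so this sketch is suitable for trusted input only.
    sql = (
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    )
    if where:
        sql += f" WHERE {where}"
    total, non_null, distinct, lowest, highest = conn.execute(sql).fetchone()
    return {
        "rows": total,
        "nulls": total - non_null,
        "distinct": distinct,
        "min": lowest,
        "max": highest,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "EU", None), (3, "US", 25.5), (4, "EU", 40.0)],
)

full = profile_column(conn, "orders", "amount")
eu_only = profile_column(conn, "orders", "amount", where="region = 'EU'")
print(full)
print(eu_only)
```

Here the unfiltered profile covers four rows, while the filtered profile covers only the three rows whose region is EU.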
Data Products
Select data assets (tables and files) to create a collection, provide guidance on the sensitivity and quality of the collection, and deliver a data product to be shared within the organization.
Document discovery enhancements
Support for summarizing a document and detecting sentiment using ML models.
Similarity detection for tabular data (Tables and Files)
Use the column names to determine potential duplicate tables and files that have similar structures.
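One simple way to compare structures by column name is Jaccard similarity over normalized names, sketched below. The release notes do not say which measure Data Catalog uses, so the choice of Jaccard similarity, the normalization rules, the 0.8 threshold, and the asset names are all assumptions made for illustration.

```python
def normalize(name):
    """Lower-case a column name and unify separators."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def column_similarity(cols_a, cols_b):
    """Jaccard similarity of two column-name sets (0.0 to 1.0)."""
    a = {normalize(c) for c in cols_a}
    b = {normalize(c) for c in cols_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(assets, threshold=0.8):
    """Return (asset, asset, score) for every pair at or above the threshold."""
    names = sorted(assets)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = column_similarity(assets[left], assets[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 2)))
    return pairs


# Hypothetical tables and files with their column names.
assets = {
    "crm.customers": ["id", "first_name", "last_name", "email", "country"],
    "backup/customers.csv": ["ID", "First Name", "Last Name", "Email", "Country"],
    "sales.orders": ["id", "customer_id", "amount", "country"],
}
print(find_similar(assets))
# [('backup/customers.csv', 'crm.customers', 1.0)]
```

The table and the CSV file are flagged as potential duplicates because their normalized column names match exactly, while the orders table shares only two of seven distinct names with either of them.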
PDI – PDC Lineage
Automatically track, visualize, and analyze data’s entire journey through PDI lineage, making this information accessible in the Pentaho Data Catalog for governance, compliance, and insights.
10.2.1
The key features in this release are:
Product licensing support
Your software license determines your user, data source count, and data capacity entitlements.
There are two tiers of users: Business Users and Expert Users.
Additional licensed features are available for Data Optimizer and Data Mastering.
Data delivery
Enhancements to data delivery via Data Pipes to support additional databases and object and file store options.
OT asset hierarchy
Offers a comprehensive operational and industrial data context, encompassing policies, applications, and a business glossary. It plays a crucial role in facilitating the convergence of IT and OT systems.
Improved rule engine
We have enhanced the rule engine to decouple the Definitions from the Rule. You can now create a definition with multiple actions and then use that definition to create a rule that applies to all your applicable data sources in a few clicks.
Additional data source support
Pentaho Data Catalog 10.2.1 now supports Metadata Ingest, Data Profiling, and Data Identification for InfluxDB, Redshift, Google BigQuery, Sybase, and Salesforce.
Import and export
To facilitate migration and onboarding we have introduced support for importing and exporting the following assets: Data Connections, Dictionaries, Patterns, Rule Definitions, Business Glossary, Policies and Applications.
Add relationships and display for BI Assets
Pentaho Data Catalog now supports Term, Policy/Standards, and Reference data relationships for BI report asset types.
User guided sampling
We have introduced user guided sampling to allow users to profile data based on a subset of rows rather than profiling all rows in a large table.
10.2
The key features in this release are:
Data delivery
A business user can find data and, with a few clicks, configure a data pipeline that delivers the data to its desired destination.
Application catalog
Catalog your applications, their governance requirements, their ownership, and their relationships with data assets, reference data, and other elements in the data catalog.
Governance policies
Document regulatory or corporate regulations, policies, and standards in the catalog to build a source of record with an intuitive interface for finding policies and the data to which they apply.
Data lineage
Build data lineage for data movement executed by Pentaho ETL. Compatible with OpenLineage, this capability offers a quick return on investment for customers interested in lineage for Pentaho Data Integration.
Galaxy view
This release enhances the visual representation of assets (such as data assets, business terms, policies/standards/rules, reference data, and applications) to easily grasp the dependencies, understand potential impact, and collaborate with all the stakeholders to make the right business decision.
Trust score
Support for bringing data quality scores from the Pentaho Data Quality product, verification of lineage, and assessing sensitivity to build a trust score.
License support
This release introduces a license solution for the Data Catalog and Pentaho Data Optimizer products. You can configure the number of data sources and Expert users (Steward, Admin, Developer) you can add, and the size of data scanned from file systems.
In addition, there are numerous improvements including extensive rules support, additional support for data sources and technologies, and continued emphasis on ease of use and performance improvements.
Product roles & licensing
Your Data Catalog software license determines the following entitlements:
additional features that you can use (Pentaho Data Optimizer and Pentaho Data Mastering)
the number of data sources you can add
the amount of data you can scan
the number of Expert user roles that you can assign to users. The Expert user roles are:
Business Steward
Data Steward
Admin
Data Developer
To calculate the total number of users for licensing, use the following table for mapping from product role to licensing role. A named user with multiple product roles maps into a single user for licensing consideration. For example, if a user is given the role of data source administrator and data source access manager, that is considered a single licensed Data Steward. If two distinct users are given these two roles, then that is considered two licensed Data Stewards.
Product role | Licensing role
Owner | Admin User
Reader | Business User
User Access Administrator | Admin User
Data Source Administrator | Data Steward
Data Source Access Manager | Data Steward
Data Quality Administrator | Data Steward
Data Quality Operator | Data Steward
Data Quality Rule Approver | Data Steward
Data Profiler | Data Steward
Data Sample Viewer | Business User
Data Tagger | Data Steward
Data Tag Viewer | Data Steward
Business Glossary Administrator | Data Steward
Business Glossary Mapper | Data Steward
Data Identification Methods Manager | Data Steward
Data Source Steward | Data Steward
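The counting rule described above can be sketched in a few lines. The role mapping is copied from the table. The function names and user names are made up for the example. One case is not covered by the documentation: a user whose product roles map to different licensing roles. The sketch assumes the highest tier wins (Admin User over Data Steward over Business User), which is an assumption and should be confirmed against your license terms.

```python
from collections import Counter

# Product role -> licensing role, copied from the mapping above.
ROLE_TO_LICENSE = {
    "Owner": "Admin User",
    "Reader": "Business User",
    "User Access Administrator": "Admin User",
    "Data Source Administrator": "Data Steward",
    "Data Source Access Manager": "Data Steward",
    "Data Quality Administrator": "Data Steward",
    "Data Quality Operator": "Data Steward",
    "Data Quality Rule Approver": "Data Steward",
    "Data Profiler": "Data Steward",
    "Data Sample Viewer": "Business User",
    "Data Tagger": "Data Steward",
    "Data Tag Viewer": "Data Steward",
    "Business Glossary Administrator": "Data Steward",
    "Business Glossary Mapper": "Data Steward",
    "Data Identification Methods Manager": "Data Steward",
    "Data Source Steward": "Data Steward",
}

# ASSUMPTION (not stated in the documentation): when one user's roles map to
# different licensing roles, the highest-ranked licensing role is counted.
LICENSE_RANK = {"Business User": 0, "Data Steward": 1, "Admin User": 2}


def licensed_role(product_roles):
    """Collapse one user's product roles into a single licensing role."""
    license_roles = {ROLE_TO_LICENSE[role] for role in product_roles}
    return max(license_roles, key=LICENSE_RANK.get)


def count_licenses(assignments):
    """Count licensed users per licensing role; each named user counts once."""
    return dict(
        Counter(licensed_role(roles) for roles in assignments.values() if roles)
    )


one_user = {"alice": ["Data Source Administrator", "Data Source Access Manager"]}
two_users = {
    "bob": ["Data Source Administrator"],
    "carol": ["Data Source Access Manager"],
}
print(count_licenses(one_user))   # {'Data Steward': 1}
print(count_licenses(two_users))  # {'Data Steward': 2}
```

The two calls reproduce the example in the text: one user holding both roles is a single licensed Data Steward, and two distinct users holding one role each are two licensed Data Stewards.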