Components Reference
Pentaho Data Catalog runs in diverse computing environments. This reference describes the supported environment components and versions. Where applicable, versions are listed as certified or supported:
Certified
The version has been tested and validated for compatibility with Data Catalog.
Supported
Support is available for listed non-certified versions.
If you have questions about your particular computing environment, contact Pentaho Support.
Hitachi Vantara products
The following Hitachi Vantara product is certified for Pentaho Data Catalog 10.2.x:
Hitachi Content Platform 9.7
Server
Pentaho Data Catalog is hardware-independent and runs on server-class computers.
Data Catalog is officially certified to run on the Red Hat Enterprise Linux and Ubuntu distributions. It is compatible with any binary-compatible Linux distribution that meets the necessary software and hardware requirements, including in virtualized and cloud environments. If you have any questions, contact Pentaho Support.
Formulating hardware recommendations is complex because many factors can significantly influence system performance, so detailed sizing guidance is beyond the scope of this document. These factors include the volume and size of the data you are working with, and the nature and quality of that data. Characteristics such as whether the data is homogeneous or heterogeneous, structured or unstructured, or random can affect the efficiency of Data Catalog, especially as data volume increases.
It is the responsibility of your server operations team to monitor these and other server performance metrics. If any limitations arise, your server administrator must be knowledgeable about scaling these parameters appropriately within your deployment.
Your server-class computer must comply with the specifications for minimum hardware and required operating systems. As a best practice, use the following server sizing guidelines for Data Catalog deployments:
Functional Proof of Concept (POC): 16 cores, 32 GB RAM. Handles small or medium workloads to demonstrate PDC functionality (a few million files).
Basic: 16 cores, 64 GB RAM. Basic requirement for a PDC + Pentaho Data Optimizer (PDO) deployment.
Standard Edition: 32 cores, 128 GB RAM. PDC Standard Edition for classification; no Pentaho Data Mastering (PDM) and no PDO.
Premium Edition: 48 cores, 256 GB RAM. PDC Premium Edition, or PDC Standard + PDO. Handles a couple of hundred million files for scan, checksum, and extended metadata.
Enterprise Scale: 128 cores, 512 GB RAM. PDC Enterprise Scale + PDO + PDM. High performance for large datasets that include PDO, PDM, or both.
Allocating more resources to the VM enables more parallel processing (more worker instances) and more concurrent jobs, and each job can use more threads. Processing unstructured content can be very resource-intensive depending on file size, for example large PDF files.
Server storage requirements
The server file systems and storage must meet the following requirements:
At least 10 GB of storage should be allocated for the root file system.
Any POSIX-compliant file system can be used, but XFS, the standard file system in RHEL, is well-tested.
Ample storage should be mounted in the designated Docker storage area (typically the default on Linux servers).
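As a quick check of these storage requirements, the following commands report the free space on the root file system, the file system type, and the location and capacity of the Docker storage area. The /var/lib/docker path is the Docker default on Linux and may differ in your environment.
# Free space on the root file system (at least 10 GB recommended)
df -h /
# File system type of the root mount (XFS is the well-tested RHEL default)
findmnt -n -o FSTYPE /
# Location of the Docker data root and the space available there (default: /var/lib/docker)
docker info --format '{{ .DockerRootDir }}'
df -h /var/lib/docker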
Operating system requirements
You must deploy Data Catalog to a dedicated server, which can be either a physical server or a virtual machine. The hosting environment might be on-premises or in the cloud using platforms such as Azure or AWS. Data Catalog runs on the following operating systems and versions:
Amazon Linux: 2023
CentOS: Stream 9
Red Hat Enterprise Linux: 9.4
Rocky Linux: 8.10
Ubuntu Server: 22.04
For optimal compatibility and performance, the server must run a modern Linux operating system based on 64-bit (x86_64/amd64) architecture.
Linux kernel version
The Linux kernel must be version 4.0 or later. For RHEL, kernel version 3.10.0-514 or later is required.
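The following commands confirm that a server meets these operating system, architecture, and kernel requirements; they assume a standard Linux distribution that provides /etc/os-release.
# Distribution name and version
cat /etc/os-release
# Running kernel version (4.0 or later; 3.10.0-514 or later on RHEL)
uname -r
# CPU architecture (must report x86_64)
uname -m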
Additional software
For SSH connectivity and secure file transfer between your client machine and the server, it is a best practice to install the following software on the client:
An SSH client such as PuTTY.
WinSCP for a graphical user interface to securely transfer files between the client and the server using SSH.
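If you work from a Linux or macOS client rather than PuTTY and WinSCP, the standard OpenSSH tools provide equivalent connectivity; the user name, hostname, and file name below are placeholders.
# Open an SSH session to the Data Catalog server
ssh admin@pdc-server.example.com
# Copy a file to the server over SSH
scp ./installer-bundle.tar.gz admin@pdc-server.example.com:/tmp/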
Network security and firewall requirements
Pentaho Data Catalog requires the following network and firewall configurations to work correctly:
Ports
Port 443: Required for HTTPS communication between the browser and the application.
Port 9200: Required for communication with the OpenSearch metadata repository.
Port 5432: Required for connectivity to the reporting database (bidb).
The application server must have network connectivity to the database server and its respective port.
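As an illustration, the commands below open the required ports with firewalld, the default firewall manager on RHEL-family systems, and then test connectivity to the database port. Adapt them if your environment uses ufw or an external firewall; the database hostname is a placeholder.
# Open the required ports and reload the firewall
sudo firewall-cmd --permanent --add-port=443/tcp
sudo firewall-cmd --permanent --add-port=9200/tcp
sudo firewall-cmd --permanent --add-port=5432/tcp
sudo firewall-cmd --reload
# Verify that the application server can reach the reporting database port
nc -zv db-server.example.com 5432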
Container deployment
Supported technology for deploying Data Catalog in containers.
Docker: 22.0, 20.10
Docker Compose: 2.22
You can also deploy pre-configured Docker images of specific Pentaho products in AWS environments. See Hyperscalers for more information.
User account
The server user that installs Pentaho Data Catalog must either be the root user or have appropriate permissions to run Docker.
To set up Docker permissions for non-root users, see the official Docker documentation at https://docs.docker.com/engine/install/linux-postinstall/.
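The following sequence sketches the non-root setup described in the Docker documentation and then verifies the installation; group membership changes take effect after you log out and back in (or run newgrp).
# Create the docker group if it does not exist and add the current user to it
sudo groupadd -f docker
sudo usermod -aG docker $USER
# Apply the new group membership in the current shell
newgrp docker
# Confirm the installed versions and that the daemon is reachable without sudo
docker --version
docker compose version
docker run --rm hello-world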
Solution database repositories
Pentaho Data Catalog stores processing artifacts in the following database repositories:
PostgreSQL: 16
MongoDB-ee: 6.0.20
Apache Hadoop vendors
Pentaho Data Catalog supports the following Hadoop vendor data sources:
Amazon EMR: 7.0.0
Cloudera Data Platform (CDP): 7.1.8
Open-source Hadoop: 3.3.6
Data sources
Pentaho Data Catalog supports the following data sources, listed with the supported version and, where applicable, the JDBC driver used. Review these requirements to verify general compatibility with a specific vendor.
Amazon DynamoDB: AWS SDK 2.20.45 (Supported)
Aurora MySQL: 3.07 (Supported); driver: mysql-connector-java-8.0.17.jar
Aurora Postgres: 15.3; driver: postgresql-42.5.4.jar
AWS S3: Latest
Azure Blob Storage: 12.21.2
Azure SQL Server: 2022
Denodo: Latest; driver: denodo-vdp-jdbcdriver-8.0-update-20230301.jar
Google BigQuery: Latest; driver: GoogleBigQueryJDBC42.jar
Hadoop: 10.1.0
Hitachi Content Platform: 9.7
HNAS: Latest
IBM Db2: 11.5; driver: jcc-11.5.7.0.jar
InfluxDB: Latest; driver: Default (custom JDBC driver)
MariaDB: 11.4.2; driver: mariadb-java-client-3.4.0.jar
Microsoft Access: Latest
Microsoft SQL: 2022; driver: mssql-jdbc-9.2.1.jre15.jar
MySQL: 8.0.27; driver: mysql-connector-java-8.0.17.jar
NFS: 4.2 and later
OneDrive: Latest
Oracle: 12, 19c, 21c, and 23; driver: ojdbc8-21.1.0.0.jar
Oracle: 11; driver: ojdbc6-11.2.0.4.jar
PostgreSQL: 12 and 14; driver: postgresql-42.5.4.jar
Redshift: Latest; driver: redshift-jdbc42-2.1.0.30.jar
Salesforce: Latest; driver: Default (custom JDBC driver)
SAP HANA: 2.0 SPS 07 and later; driver: ngdbc-2.17.12
SharePoint: Latest
SMB/CIFS: 3.1.1 and later
Snowflake: 8.20.10; driver: snowflake-jdbc-3.13.34.jar
Sybase: Latest; driver: jconn4-16.0.jar
Vertica: 11; driver: vertica-jdbc-11.1.1-0.jar
Data source connectivity
Your license determines the number of data sources that Pentaho Data Catalog can connect to.
The following table contains the supported data sources and respective requirements to connect with Data Catalog.
Active Directory
An account (username and password) with read access to query Active Directory objects.
AWS S3
AWS region where the S3 bucket was created
Access key and secret access key
Read-only permissions to the S3 bucket
Azure Blob Storage
Account credentials with read access to the target containers
Google Cloud Storage
A service account key file in JSON format.
Read access to the target Google Cloud Storage buckets.
Assign the Storage Object Viewer IAM role (roles/storage.objectViewer) to the service account.
HCP
HCP namespace location (Endpoint)
Access key and secret access key
Read-only permissions to the namespace of the HCP cluster
HDFS
Hadoop version 2.7.2 and later
A URI that provides the hostname and share folder details
Path of the directory that needs to be scanned
Read-only access to the directory that needs to be scanned
To install Data Optimizer for Hadoop, see Install Pentaho Data Optimizer in Hadoop Cluster.
Okta
The private key file (.pem) and the Client ID should be generated with all necessary scopes (read scopes for apps, groups, and users) and role (Read Only Administrator). For more information, see the Generate Okta credentials for Data Catalog topic in the Administer Pentaho Data Catalog document.
OneDrive and SharePoint
Application (client) ID, Directory (tenant) ID, and client secret from a registered app in the Azure portal
Delegated permissions and Application permissions in the registered app
Read-only permissions to the OneDrive and SharePoint sites
RDBMS
To perform data profiling, read-only access to all database objects and system catalog tables
SMB/CIFS
A URI that provides the hostname and share folder details
Username and password to access the SMB/CIFS Share Directory
Path of the directory that needs to be scanned
Read-only access to the directory that needs to be scanned
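Before registering a source, it can help to confirm that the supplied credentials actually grant read access. The commands below sketch such a check for an S3 bucket with the AWS CLI and for an SMB/CIFS share with smbclient; the bucket, region, profile, hostname, share, and user names are placeholders.
# List bucket contents to confirm the access key and secret grant read access
aws s3 ls s3://example-bucket --region us-east-1 --profile pdc-readonly
# List the contents of an SMB/CIFS share with the scan account
smbclient //fileserver.example.com/share -U scanuser -c 'ls'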
JDBC drivers
The following table provides commands you can run to download JDBC drivers for use with database data sources in Pentaho Data Catalog:
MySQL JDBC Driver version 8.0.27
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar
Oracle JDBC Driver version 21.1.0
wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.1.0.0/ojdbc8-21.1.0.0.jar
Postgres JDBC Driver version 42.3.1
wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.1/postgresql-42.3.1.jar
SQL Server JDBC Driver version 10.2.0
wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/10.2.0.jre8/mssql-jdbc-10.2.0.jre8.jar
Snowflake JDBC Driver version 3.13.14
wget https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.14/snowflake-jdbc-3.13.14.jar
Vertica JDBC Driver version 11.0.2
wget https://repo1.maven.org/maven2/com/vertica/jdbc/vertica-jdbc/11.0.2-0/vertica-jdbc-11.0.2-0.jar
Alternatively, you can download the drivers directly:
Amazon RedShift
Denodo (v8.0-20220126)
Google BigQuery (GoogleBigQueryJDBC42.jar)
IBMDB2 (v11.5.7.0)
MS Access
MS SQL (v10.2.0) - not supported in PDC 10.1
MS SQL (v9.2.1)
MySQL (v8.0.27)
Oracle (v232)
PostgreSQL (v42.3.1)
Snowflake (v3.13.14)
Sybase
Vertica (v11.0.2-0)
Big data sources
Pentaho Data Catalog supports the following big data sources. Review this list for general compatibility with a specific vendor.
Amazon EMR: 7.0.0
Cloudera Data Platform (CDP) on-premises (private cloud): 7.1.8
Open-source Hadoop: 3.3.6
Web browsers
Pentaho Data Catalog supports major versions of web browsers that are publicly available. However, the following versions have been specifically certified to work with Data Catalog.
(Optional) Client Virtual Desktop Infrastructure (VDI)
The client VDI must meet the following requirements:
Server configuration: Windows operating system, 16 GB RAM
Disk or storage: 100 GB minimum
Others: Internet connectivity, Google Chrome browser, and permission to download files from the FTP server (secure FTP access)