Components Reference

Pentaho Data Catalog runs in diverse computing environments. This reference lists the supported environment components and versions. Where applicable, versions are listed as certified or supported:

  • Certified

    The version has been tested and validated for compatibility with Data Catalog.

  • Supported

    Support is available for listed non-certified versions.

If you have questions about your particular computing environment, contact Pentaho Support.

Hitachi Vantara products

The following Hitachi Vantara product is certified for Pentaho Data Catalog 10.2.x:

  • Hitachi Content Platform 9.7

Server

Pentaho Data Catalog is hardware-independent and runs on server-class computers.

Data Catalog is officially certified to run on the Red Hat Enterprise Linux and Ubuntu distributions. It is compatible with any binary-compatible Linux distribution that meets the necessary software and hardware requirements, including in virtualized and cloud environments. If you have any questions, contact Pentaho Support.

Formulating hardware recommendations is beyond the scope of this document, because many factors can significantly influence system performance. These factors include the volume and size of the data you are working with, as well as its nature and quality. Characteristics of the data, such as whether it is homogeneous or heterogeneous, structured or unstructured, or random, can affect the efficiency of Data Catalog, especially as data volume increases.

It is the responsibility of your server operations team to monitor these and other server performance metrics. If any limitations arise, your server administrator must be knowledgeable about scaling these parameters appropriately within your deployment.

Mac servers are not supported.

Your server-class computer must comply with the specifications for minimum hardware and required operating systems. As a best practice, use the following server sizing guidelines for Data Catalog deployments:

Deployment | Sizing | Purpose
Functional Proof of Concept (POC) | 16 cores, 32 GB RAM | Handles small or medium workloads to demonstrate PDC functionality (a few million files)
Basic | 16 cores, 64 GB RAM | Basic requirement for a PDC + Pentaho Data Optimizer (PDO) deployment
Standard Edition | 32 cores, 128 GB RAM | PDC Standard Edition for Classification: no Pentaho Data Mastering (PDM), no PDO
Premium Edition | 48 cores, 256 GB RAM | PDC Premium Edition, or PDC Standard + PDO; handles a couple of hundred million files for scan, checksum, and extended metadata
Enterprise Scale | 128 cores, 512 GB RAM | High performance for large datasets including PDO or PDM, or both (PDC Enterprise Scale + PDO + PDM)

Additional VM resources enable more parallel processing (more worker instances) and more concurrent jobs, and each job can use more threads. Keep in mind that unstructured content processing can be very resource-intensive depending on file size, for example with large PDF files.

Server storage requirements

The server file systems and storage must meet the following requirements:

  • At least 10 GB of storage should be allocated for the root file system.

  • Any POSIX-compliant file system can be used, but XFS, the standard file system in RHEL, is well-tested.

  • Ample storage should be mounted in the designated Docker storage area (typically /var/lib/docker, the default on Linux servers).
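For example, assuming Docker uses its default storage location of /var/lib/docker, you can check the available space on the root file system and the Docker storage area with df:

    df -h /                  # root file system (at least 10 GB recommended)
    df -h /var/lib/docker    # Docker storage area; mount ample storage here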

Operating system requirements

You must deploy Data Catalog to a dedicated server, which can be either a physical server or a virtual machine. The hosting environment might be on-premises or on the cloud using platforms such as Azure or AWS.

Operating System | Certified Version
Amazon Linux | 2023
CentOS | Stream 9
Red Hat Enterprise Linux | 9.4
Rocky Linux | 8.10
Ubuntu Server | 22.04

For optimal compatibility and performance, the server must run a modern Linux operating system based on 64-bit (x86_64/amd64) architecture.

For an updated list of compatible Linux distributions, visit Pentaho Support.

Linux kernel version

Version 4.0 or higher of the Linux kernel is required. For RHEL, use version 3.10.0-514 of the kernel or later.
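You can confirm the kernel version of your server with uname:

    uname -r    # must report 4.0 or later (3.10.0-514 or later on RHEL)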

The overlay and overlay2 drivers are supported on XFS backing file systems, but only with d_type=true enabled. Data related to the overlay filesystem is typically stored under /var/lib/docker.
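As a quick check, you can confirm that the backing file system exposes d_type before installing; the mount point below assumes the default Docker storage location:

    xfs_info /var/lib/docker | grep ftype    # ftype=1 means d_type is available
    docker info | grep -i d_type             # with Docker running, should report "Supports d_type: true"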

  • The XFS file system must be formatted with the flag -n ftype=1. You can verify the ftype option by using the xfs_info command.

  • If the dedicated server is restarted, auto start-up for Docker must be enabled. You can enable auto start-up for Docker with the following commands:

    sudo systemctl enable docker.service
    sudo systemctl enable containerd.service
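    After enabling the services, you can verify that both are configured to start automatically:

    systemctl is-enabled docker.service containerd.service    # both should report "enabled"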

Additional software

For seamless SSH connectivity and secure file transfer between your client machine and the server, it is a best practice to install the following software:

  • An SSH client such as PuTTY.

  • WinSCP for a graphical user interface to securely transfer files between the client and the server using SSH.
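If you prefer command-line tools, OpenSSH's ssh and scp commands provide the same connectivity from Linux, macOS, or modern Windows clients. The hostname, user name, and file name below are placeholders:

    ssh pdcadmin@pdc-server.example.com                                      # interactive SSH session
    scp pentaho-data-catalog.tar.gz pdcadmin@pdc-server.example.com:/tmp/    # secure file transfer to the server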

Network security and firewall requirements

Pentaho Data Catalog requires the following network and firewall configurations to work correctly:

  • Port 443 – Required for HTTPS communication between the browser and the application.

  • Port 9200 – Required for communication with the OpenSearch metadata repository.

  • Port 5432 – Required for connectivity to the reporting database (bidb).

  • The application server must have network connectivity to the database server and its respective port.
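As an illustration, on a server that uses firewalld (such as RHEL or Rocky Linux), the required ports could be opened as follows; adapt the commands to ufw, iptables, or your cloud security groups as needed:

    sudo firewall-cmd --permanent --add-port=443/tcp     # HTTPS: browser to application
    sudo firewall-cmd --permanent --add-port=9200/tcp    # OpenSearch metadata repository
    sudo firewall-cmd --permanent --add-port=5432/tcp    # reporting database (bidb)
    sudo firewall-cmd --reload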

The default installation includes a signed certificate to enable HTTPS on port 443. If you prefer, you can instead use an SSL certificate obtained from a certificate authority.
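One way to confirm which certificate is served on port 443 is with openssl; the hostname below is a placeholder for your Data Catalog server:

    openssl s_client -connect pdc-server.example.com:443 -servername pdc-server.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates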

Container deployment

Supported technology for deploying Data Catalog in containers.

Technology | Certified | Supported
Docker | 22.0, 20.10 | 22.0, 20.10
Docker Compose | 2.22 | 2.22

Kubernetes environments that use these Docker versions are also supported.
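You can check the installed versions against the certified ones with:

    docker --version           # for example, Docker version 20.10.x
    docker compose version     # for example, Docker Compose version v2.22.x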

You can also deploy pre-configured Docker images of specific Pentaho products in AWS environments. See Hyperscalers for more information.

User account

The server user that installs Pentaho Data Catalog must either be the root user or have appropriate permissions to run Docker.

To set up Docker permissions for non-root users, see the official Docker documentation at https://docs.docker.com/engine/install/linux-postinstall/.
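A typical sequence from that guide for granting a non-root user permission to run Docker looks like this (log out and back in, or run newgrp, for the group change to take effect):

    sudo groupadd docker              # create the docker group if it does not already exist
    sudo usermod -aG docker $USER     # add the current user to the docker group
    newgrp docker                     # apply the new group membership in this shell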

Solution database repositories

Pentaho Data Catalog stores processing artifacts in the following database repositories:

Database | Version
PostgreSQL | 16
MongoDB-ee | 6.0.20

Apache Hadoop vendors

Pentaho Data Catalog supports the following Hadoop vendor data sources:

Vendor | Driver Version
Amazon EMR | 7.0.0
Cloudera Data Platform (CDP) | 7.1.8
Open-source Hadoop | 3.3.6

Data sources

Pentaho Data Catalog supports the following data sources. Review the requirements to verify general compatibility with a specific vendor.

Data Source | Version | Driver Version
Amazon DynamoDB | AWS SDK 2.20.45 (Supported) |
Aurora MySQL | 3.07 (Supported) | mysql-connector-java-8.0.17.jar
Aurora Postgres | 15.3 | postgresql-42.5.4.jar
AWS S3 | Latest |
Azure Blob Storage | 12.21.2 |
Azure SQL Server | 2022 |
Denodo | Latest | denodo-vdp-jdbcdriver-8.0-update-20230301.jar
Google BigQuery | Latest | GoogleBigQueryJDBC42.jar
Hadoop | 10.1.0 |
Hitachi Content Platform | 9.7 |
HNAS | Latest |
IBM Db2 | 11.5 | jcc-11.5.7.0.jar
InfluxDB | Latest | Default (custom JDBC driver)
MariaDB | 11.4.2 | mariadb-java-client-3.4.0.jar
Microsoft Access | Latest |
Microsoft SQL | 2022 | mssql-jdbc-9.2.1.jre15.jar
MySQL | 8.0.27 | mysql-connector-java-8.0.17.jar
NFS | 4.2 and later |
OneDrive | Latest |
Oracle | 12, 19c, 21c, and 23 | ojdbc8-21.1.0.0.jar
Oracle | 11 | ojdbc6-11.2.0.4.jar
PostgreSQL | 12 and 14 | postgresql-42.5.4.jar
Redshift | Latest | redshift-jdbc42-2.1.0.30.jar
Salesforce | Latest | Default (custom JDBC driver)
SAP HANA | 2.0 SPS 07 and later | ngdbc-2.17.12
SharePoint | Latest |
SMB/CIFS | 3.1.1 and later |
Snowflake | 8.20.10 | snowflake-jdbc-3.13.34.jar
Sybase | Latest | jconn4-16.0.jar
Vertica | 11 | vertica-jdbc-11.1.1-0.jar

Data source connectivity

Your license determines the number of data sources that Pentaho Data Catalog can connect to.

The following table contains the supported data sources and respective requirements to connect with Data Catalog.

For data sources that you want to optimize, permissions to read, write, and execute are required in addition to the listed requirements.

Data source
Required information and permissions

Active Directory

An account with credentials that include a username and password that have read access to query Active Directory objects.

AWS S3

  • AWS region where the S3 bucket was created

  • Access key and secret access key

  • Read-only permissions to the S3 bucket

Google Cloud Storage

  • A service account key file in JSON format

  • Read access to the target Google Cloud Storage buckets

  • The Storage Object Viewer IAM role (roles/storage.objectViewer) assigned to the service account

Azure Blob Storage

  • Credentials with read-only access to the target storage account and containers

HCP

  • HCP namespace location (Endpoint)

  • Access key and secret access key

  • Read-only permissions to the namespace of the HCP cluster

HDFS

  • Hadoop version 2.7.2 and later

  • URI should provide a hostname and share folder details

  • Path of the directory that needs to be scanned

  • Read-only access to the directory that needs to be scanned

To install Data Optimizer for Hadoop, see Install Pentaho Data Optimizer in Hadoop Cluster.

Okta

The private key file (.pem) and the Client ID should be generated with all necessary scopes (read scopes for apps, groups, and users) and role (Read Only Administrator). For more information, see the Generate Okta credentials for Data Catalog topic in the Administer Pentaho Data Catalog document.

OneDrive and SharePoint

  • Application (client) ID, Directory (tenant) ID, and clientSecret from a registered app on the Azure portal

  • Delegated permissions and Application permissions in the registered app

  • Read-only permissions to the OneDrive and SharePoint sites

RDBMS

To perform data profiling, read-only access to all database objects and system catalog tables is required.

SMB/CIFS

  • URI should provide a hostname and share folder details

  • Username and password to access the SMB/CIFS Share Directory

  • Path of the directory that needs to be scanned

  • Read-only access to the directory that needs to be scanned

JDBC drivers

The following commands download JDBC drivers for use with database data sources in Pentaho Data Catalog:

MySQL JDBC Driver version 8.0.27

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar

Oracle JDBC Driver version 21.1.0

wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.1.0.0/ojdbc8-21.1.0.0.jar

Postgres JDBC Driver version 42.3.1

wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.1/postgresql-42.3.1.jar

SQL Server JDBC Driver version 10.2.0

wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/10.2.0.jre8/mssql-jdbc-10.2.0.jre8.jar

Snowflake JDBC Driver version 3.13.14

wget https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.14/snowflake-jdbc-3.13.14.jar

Vertica JDBC Driver version 11.0.2

wget https://repo1.maven.org/maven2/com/vertica/jdbc/vertica-jdbc/11.0.2-0/vertica-jdbc-11.0.2-0.jar

Alternatively, you can download the drivers directly:

  • Google BigQuery (GoogleBigQueryJDBC42.jar)

  • IBM Db2 (v11.5.7.0)

  • MS SQL (v10.2.0) - not supported in PDC 10.1

  • Oracle (v232)

  • PostgreSQL (v42.3.1)

  • Snowflake (v3.13.14)

  • Vertica (v11.0.2-0)

Big data sources

Pentaho Data Catalog supports the following big data sources. Review this list for general compatibility with a specific vendor.

Data Source | Certified Version
Amazon EMR | 7.0.0
Cloudera Data Platform (CDP) on-premises (Private cloud) | 7.1.8
Open-source Hadoop | 3.3.6

Web browsers

Pentaho Data Catalog supports major versions of web browsers that are publicly available. However, the following versions have been specifically certified to work with Data Catalog.

(Optional) Client Virtual Desktop Infrastructure (VDI)

The following table contains the client’s VDI requirements.

Category
Requirements

Server configuration

  • Windows operating system

  • 16 GB RAM

Disk or storage

  • 100 GB minimum

Others

  • Internet connectivity

  • Google Chrome browser

  • Permission to download files from the FTP server (secure FTP access)
