# Components Reference

Pentaho Data Catalog accommodates diverse computing environments. This reference describes the supported environment components and versions. Where applicable, versions are listed as certified or supported:

* **Certified**

  The version has been tested and validated for compatibility with Data Catalog.
* **Supported**

  Support is available for listed non-certified versions.

If you have questions about your particular computing environment, contact [Pentaho Support](https://support.pentaho.com/).

## Hitachi Vantara products

The following Hitachi Vantara product is certified for Pentaho Data Catalog 10.2.x:

* Hitachi Content Platform 9.7

## Server

Pentaho Data Catalog is hardware-independent and runs on server-class computers.

Data Catalog is officially certified to run on the Red Hat Enterprise and Ubuntu Linux distributions. It is compatible with any binary-compatible Linux distribution that meets the necessary software and hardware requirements, including in virtualized and cloud environments. If you have any questions, contact [Pentaho Support](https://support.pentaho.com/).

Formulating hardware recommendations is a complex task because many factors can significantly influence system performance, so specific recommendations are beyond the scope of this document. These factors include the volume and size of the data you are working with, and the nature and quality of that data. Characteristics such as whether the data is homogeneous or heterogeneous, structured or unstructured, or random can affect the efficiency of Data Catalog, especially as data volume increases.

Your server operations team is responsible for monitoring these and other server performance metrics. If limitations arise, your server administrator must know how to scale these parameters appropriately within your deployment.

{% hint style="info" %}
Mac servers are not supported.
{% endhint %}

Your server-class computer must comply with the specifications for minimum hardware and required operating systems. As a best practice, use the following server sizing guidelines for Data Catalog deployments:

| Deployment size                                         | Description |
| ------------------------------------------------------- | ----------- |
| Functional proof of concept (POC): 16 cores, 32 GB RAM  | Handles small or medium workloads to demonstrate Data Catalog functionality (a few million files) |
| Basic: 16 cores, 64 GB RAM                              | Basic requirement for a Data Catalog + Pentaho Data Optimizer (PDO) deployment |
| Standard Edition: 32 cores, 128 GB RAM                  | PDC Standard Edition for classification: no Pentaho Data Mastering (PDM), no PDO |
| Premium Edition: 48 cores, 256 GB RAM                   | PDC Premium Edition, or PDC Standard + PDO. Handles a couple of hundred million files for scan, checksum, and extended metadata |
| Enterprise Scale: 128 cores, 512 GB RAM                 | <ul><li>High performance for large datasets including PDO or PDM, or both</li><li>PDC Enterprise Scale + PDO + PDM</li><li>Higher VM resources enable more parallel processing (more worker instances) and more jobs, and each job can use more threads. Unstructured content processing can be very resource-intensive depending on file size, such as large PDF files.</li></ul> |

### Server storage requirements

The server file systems and storage must meet the following requirements:

* At least 10 GB of storage should be allocated for the root file system.
* Any POSIX-compliant file system can be used, but XFS, the standard file system in RHEL, is well-tested.
* Ample storage should be mounted in the designated Docker storage area (typically the default on Linux servers).
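To confirm that the Docker storage area has enough room before installation, you can check where Docker keeps its data and how much free space that filesystem has. This is a quick sketch, assuming Docker is installed and uses its default data root of `/var/lib/docker`:

```shell
# Show where Docker stores images, containers, and volumes (default /var/lib/docker)
docker info --format '{{ .DockerRootDir }}'

# Show free space on the filesystem backing that directory
df -h /var/lib/docker
```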

### Operating system requirements

You must deploy Data Catalog to a dedicated server, which can be either a physical server or a virtual machine. The hosting environment can be on-premises or in the cloud, using platforms such as Azure or AWS.

| Operating System         | Certified Version |
| ------------------------ | ----------------- |
| Amazon Linux             | 2023              |
| CentOS                   | Stream 9          |
| Red Hat Enterprise Linux | 9.6               |
| Rocky Linux              | 8.10              |
| Ubuntu Server            | 22.04             |

For optimal compatibility and performance, the server must run a modern Linux operating system based on 64-bit (x86\_64/amd64) architecture.
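You can confirm the server architecture and distribution from a shell before installing; `uname -m` should report `x86_64` on a compatible system:

```shell
# CPU architecture; must be x86_64 (amd64) for Data Catalog
uname -m

# Distribution name and version, for checking against the certified list
grep -E '^(NAME|VERSION)=' /etc/os-release
```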

{% hint style="info" %}
For an updated list of compatible Linux distributions, visit [Pentaho Support](https://support.pentaho.com/).
{% endhint %}

#### Linux kernel version

The Linux kernel must be version 4.0 or later. For RHEL, kernel version 3.10.0-514 or later is supported.

{% hint style="info" %}
The `overlay` and `overlay2` drivers are supported on XFS backing file systems, but only with `d_type=true` enabled. Data related to the overlay filesystem is typically stored under `/var/lib/docker`.

* The XFS file system must be formatted with the flag `-n ftype=1`. You can verify the `ftype` setting of an existing file system with the `xfs_info` command.
* If the dedicated server is restarted, auto start-up for Docker must be enabled. You can enable auto start-up for Docker with the following commands:

  ```shell
  sudo systemctl enable docker.service
  sudo systemctl enable containerd.service
  ```

{% endhint %}
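For reference, the `ftype` setting can be verified, and a new XFS volume formatted correctly, along the following lines. This is a sketch assuming the Docker storage area is on `/var/lib/docker`; the device name `/dev/sdb1` is a placeholder for your environment:

```shell
# Verify ftype on the filesystem hosting the Docker storage area;
# the output should contain ftype=1
xfs_info /var/lib/docker | grep ftype

# Format a new XFS volume with the required flag
# (destructive: this erases the device; /dev/sdb1 is a placeholder)
sudo mkfs.xfs -n ftype=1 /dev/sdb1
```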

### Additional software

For seamless SSH connectivity and secure file transfer between your machine and the server, it is a best practice to install the following software on your machine:

* An SSH client such as PuTTY.
* WinSCP for a graphical user interface to securely transfer files between the client and the server using SSH.

### Network security and firewall requirements

Pentaho Data Catalog requires the following network and firewall configurations to work correctly:

* Port `443` – Required for HTTPS communication between the browser and the application.
* Port `9200` – Required for communication with the OpenSearch metadata repository.
* Port `5432` – Required for connectivity to the reporting database (bidb).
* The application server must have network connectivity to the database server and its respective port.
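On distributions that use `firewalld` (such as RHEL), the required ports can be opened as follows. This is a sketch, not an exact procedure for your environment; adjust it for your firewall tooling and security policies:

```shell
# Open the ports Data Catalog requires (requires root and firewalld)
sudo firewall-cmd --permanent --add-port=443/tcp    # HTTPS for the application
sudo firewall-cmd --permanent --add-port=9200/tcp   # OpenSearch metadata repository
sudo firewall-cmd --permanent --add-port=5432/tcp   # Reporting database (bidb)
sudo firewall-cmd --reload
```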

{% hint style="info" %}
The default installation includes a self-signed certificate for HTTPS on port `443`. If desired, you can replace it with an SSL certificate obtained from a certificate authority.
{% endhint %}

## Container deployment

The following table lists the supported technologies for deploying Data Catalog in containers.

| Technology                                     | Certified   | Supported   |
| ---------------------------------------------- | ----------- | ----------- |
| Docker<sup>#</sup>                             | 22.0, 20.10 | 22.0, 20.10 |
| Docker Compose                                 | 2.22        | 2.22        |
| Amazon Elastic Kubernetes Service (Amazon EKS) | 1.32, 1.33  | 1.32, 1.33  |
| OpenShift Container Platform (OCP)             | 4.15.0      | 4.15.0      |

<sup>#</sup> Docker version 29 is not supported.

{% hint style="info" %}
Kubernetes environments that use these Docker versions are also supported.
{% endhint %}

You can also deploy pre-configured Docker images of specific Pentaho products in AWS environments. See [Hyperscalers](https://docs.pentaho.com/pdc-10.2-install/install-pentaho-data-catalog/hyperscalers) for more information.

#### User account

The server user that installs Pentaho Data Catalog must either be the root user or have appropriate permissions to run Docker.

To set up Docker permissions for non-root users, see the official Docker documentation at <https://docs.docker.com/engine/install/linux-postinstall/>.
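As a sketch of the steps described in that documentation, a non-root installation user is typically added to the `docker` group:

```shell
# Create the docker group if it does not already exist
sudo groupadd --force docker

# Add the installation user to the group, then start a new session
sudo usermod -aG docker "$USER"
newgrp docker

# Verify that Docker commands work without sudo
docker info > /dev/null && echo "Docker is usable without sudo"
```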

## Single sign-on (SSO) and directory services

Pentaho Data Catalog supports external identity providers for single sign-on (SSO) and directory-based authentication. The following connectors are certified for this version.

| Component                   | Status    |
| --------------------------- | --------- |
| **LDAP / Active Directory** | Certified |
| **Okta (OIDC)**             | Certified |
| **Ping Federate**           | Certified |

## Solution database repositories

Pentaho Data Catalog stores processing artifacts in the following database repositories:

| Database   | Version |
| ---------- | ------- |
| PostgreSQL | 16      |
| MongoDB-ee | 6.0.20  |

## Apache Hadoop vendors

Pentaho Data Catalog supports the following Hadoop vendor data sources:

| Vendor                       | Driver Version |
| ---------------------------- | -------------- |
| Amazon EMR                   | 7.0.0          |
| Cloudera Data Platform (CDP) | 7.1.8          |
| Open-source Hadoop           | 3.3.6          |

## Data sources

Pentaho Data Catalog supports the following data sources. Review the requirements to verify general compatibility with a specific vendor.

| Data Source              | Version                     | Driver Version                                |
| ------------------------ | --------------------------- | --------------------------------------------- |
| Active Directory (AD)    | Latest                      |                                               |
| Amazon DynamoDB          | AWS SDK 2.20.45 (Supported) |                                               |
| Aurora MySQL             | 3.07 (Supported)            | mysql-connector-java-8.0.17.jar               |
| Aurora Postgres          | 15.3                        | postgresql-42.5.4.jar                         |
| AWS S3                   | Latest                      |                                               |
| Azure Blob Storage       | 12.21.2                     |                                               |
| Azure SQL Server         | 2022                        |                                               |
| Denodo                   | Latest                      | denodo-vdp-jdbcdriver-8.0-update-20230301.jar |
| Google BigQuery          | Latest                      | GoogleBigQueryJDBC42.jar                      |
| Hadoop                   | 10.1.0                      |                                               |
| Hitachi Content Platform | 9.7                         |                                               |
| HNAS                     | Latest                      |                                               |
| IBM Db2                  | 11.5                        | jcc-11.5.7.0.jar                              |
| InfluxDB                 | Latest                      | Default (custom JDBC driver)                  |
| MariaDB                  | 11.4.2                      | mariadb-java-client-3.4.0.jar                 |
| Microsoft Access         | Latest                      |                                               |
| Microsoft SQL            | 2022                        | mssql-jdbc-9.2.1.jre15.jar                    |
| MySQL                    | 8.0.27                      | mysql-connector-java-8.0.17.jar               |
| NFS                      | 4.2 and later               |                                               |
| Okta                     | Latest                      |                                               |
| OneDrive                 | Latest                      |                                               |
| Oracle                   | 12, 19c, 21c, and 23        | ojdbc8-21.1.0.0.jar                           |
| Oracle                   | 11                          | ojdbc6-11.2.0.4.jar                           |
| PostgreSQL               | 12 and 14                   | postgresql-42.5.4.jar                         |
| Redshift                 | Latest                      | redshift-jdbc42-2.1.0.30.jar                  |
| Salesforce               | Latest                      | Default (custom JDBC driver)                  |
| SAP HANA                 | 2.0 SPS 07 and later        | ngdbc-2.17.12                                 |
| SharePoint               | Latest                      |                                               |
| SMB/CIFS                 | 3.1.1 and later             |                                               |
| Snowflake                | 8.20.10                     | snowflake-jdbc-3.27.1.jar                     |
| Sybase                   | Latest                      | jconn4-16.0.jar                               |
| Vertica                  | 11                          | vertica-jdbc-11.1.1-0.jar                     |

### Data source connectivity

Your license determines the number of data sources that Pentaho Data Catalog can connect to.

The following table lists the supported data sources and their respective requirements for connecting to Data Catalog.

{% hint style="info" %}
For data sources that you want to optimize, permissions to read, write, and execute are required in addition to the listed requirements.
{% endhint %}

<table><thead><tr><th width="196.22216796875">Data source</th><th>Required information and permissions</th></tr></thead><tbody><tr><td>Active Directory</td><td>An account with a username and password that has read access to query Active Directory objects.</td></tr><tr><td>AWS S3</td><td><ul><li>AWS region where the S3 bucket was created</li><li>Access key and secret access key</li><li>Read-only permissions to the S3 bucket</li></ul></td></tr><tr><td>Google Cloud Storage</td><td><ul><li>A service account key file in JSON format</li><li>Read access to the target Google Cloud Storage buckets</li><li>The Storage Object Viewer IAM role (roles/storage.objectViewer) assigned to the service account</li></ul></td></tr><tr><td>Azure Blob Storage</td><td><ul><li>Storage account name and access key (or a connection string)</li><li>Read-only access to the target containers</li></ul></td></tr><tr><td>HCP</td><td><ul><li>HCP namespace location (Endpoint)</li><li>Access key and secret access key</li><li>Read-only permissions to the namespace of the HCP cluster</li></ul></td></tr><tr><td>HDFS</td><td><ul><li>Hadoop version 2.7.2 and later</li><li>URI that provides a hostname and share folder details</li><li>Path of the directory to be scanned</li><li>Read-only access to the directory to be scanned</li></ul><p>To install Data Optimizer for Hadoop, see <a href="install-pentaho-data-optimizer-in-hadoop-cluster">Install Pentaho Data Optimizer in Hadoop Cluster</a>.</p></td></tr><tr><td>Okta</td><td>The private key file (.pem) and the Client ID must be generated with all necessary scopes (read scopes for apps, groups, and users) and the Read Only Administrator role. For more information, see the <strong>Generate Okta credentials for Data Catalog</strong> topic in the Administer Pentaho Data Catalog document.</td></tr><tr><td>OneDrive and SharePoint</td><td><ul><li>Application (client) ID, Directory (tenant) ID, and clientSecret from a registered app on the Azure portal</li><li>Delegated permissions and Application permissions in the registered app</li><li>Read-only permissions to the OneDrive and SharePoint sites</li></ul></td></tr><tr><td>RDBMS</td><td>To perform data profiling, read-only access to all database objects and system catalog tables</td></tr><tr><td>SMB/CIFS</td><td><ul><li>URI that provides a hostname and share folder details</li><li>Username and password to access the SMB/CIFS share directory</li><li>Path of the directory to be scanned</li><li>Read-only access to the directory to be scanned</li></ul></td></tr></tbody></table>

### JDBC drivers

The following table provides commands you can run to download JDBC drivers for use with database data sources in Pentaho Data Catalog:

<table><thead><tr><th width="321.77783203125">JDBC driver version</th><th>Command</th></tr></thead><tbody><tr><td>MySQL JDBC Driver version 8.0.27</td><td><code>wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar</code></td></tr><tr><td>Oracle JDBC Driver version 21.1.0</td><td><code>wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.1.0.0/ojdbc8-21.1.0.0.jar</code></td></tr><tr><td>Postgres JDBC Driver version 42.3.1</td><td><code>wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.1/postgresql-42.3.1.jar</code></td></tr><tr><td>SQL Server JDBC Driver version 10.2.0</td><td><code>wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/10.2.0.jre8/mssql-jdbc-10.2.0.jre8.jar</code></td></tr><tr><td>Snowflake JDBC Driver version 3.27.1</td><td><code>wget</code> <code>https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.27.1/snowflake-jdbc-3.27.1.jar</code></td></tr><tr><td>Vertica JDBC Driver version 11.0.2</td><td><code>wget https://repo1.maven.org/maven2/com/vertica/jdbc/vertica-jdbc/11.0.2-0/vertica-jdbc-11.0.2-0.jar</code></td></tr></tbody></table>

Alternatively, you can download the drivers directly:

<table data-header-hidden><thead><tr><th width="228.4444580078125"></th><th></th></tr></thead><tbody><tr><td>Amazon RedShift</td><td><a href="https://repo1.maven.org/maven2/com/amazon/redshift/redshift-jdbc42/2.1.0.30/redshift-jdbc42-2.1.0.30.jar">redshift-jdbc42-2.1.0.30.jar</a></td></tr><tr><td>Denodo (v8.0-20220126)</td><td><a href="https://astra-repo-eu-west-2.s3.eu-west-2.amazonaws.com/JDBC/denodo/8.0-20220126/denodo-vdp-jdbcdriver-8.0-update-20220126.jar">denodo-vdp-jdbcdriver-8.0-update-20220126.jar</a></td></tr><tr><td>Google BigQuery</td><td>GoogleBigQueryJDBC42.jar</td></tr><tr><td>IBMDB2 (v11.5.7.0)</td><td><a href="https://repo1.maven.org/maven2/com/ibm/db2/jcc/11.5.7.0/jcc-11.5.7.0.jar">jcc-11.5.7.0.jar</a></td></tr><tr><td>MariaDB</td><td><a href="https://repo1.maven.org/maven2/org/mariadb/jdbc/mariadb-java-client/3.0.4/mariadb-java-client-3.0.4.jar">mariadb-java-client-3.0.4.jar</a></td></tr><tr><td>MS Access</td><td><a href="https://repo1.maven.org/maven2/net/sf/ucanaccess/ucanaccess/5.0.1/ucanaccess-5.0.1.jar">ucanaccess-5.0.1.jar</a></td></tr><tr><td>MS SQL (v10.2.0) - not supported in PDC 10.1</td><td><a href="https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/10.2.0.jre8/mssql-jdbc-10.2.0.jre8.jar">mssql-jdbc-10.2.0.jre8.jar</a></td></tr><tr><td>MS SQL (v9.2.1)</td><td><a href="https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre15/mssql-jdbc-9.2.1.jre15.jar">mssql-jdbc-9.2.1.jre15.jar</a></td></tr><tr><td>MySQL (v8.0.27)</td><td><a href="https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar">mysql-connector-java-8.0.27.jar</a></td></tr><tr><td>Oracle (v23.2)</td><td><a href="https://download.oracle.com/otn-pub/otn_software/jdbc/232-DeveloperRel/ojdbc11.jar">ojdbc11.jar</a></td></tr><tr><td>PostgreSQL (v42.3.1)</td><td><a href="https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.1/postgresql-42.3.1.jar">postgresql-42.3.1.jar</a></td></tr><tr><td>Snowflake (v3.27.1)</td><td><a href="https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.27.1/snowflake-jdbc-3.27.1.jar">snowflake-jdbc-3.27.1.jar</a></td></tr><tr><td>Sybase</td><td><a href="https://maven.jumpmind.com/repo/jdbc/sybase/jconn4/16.0/jconn4-16.0.jar">jconn4-16.0.jar</a></td></tr><tr><td>Vertica (v11.0.2-0)</td><td><a href="https://repo1.maven.org/maven2/com/vertica/jdbc/vertica-jdbc/11.0.2-0/vertica-jdbc-11.0.2-0.jar">vertica-jdbc-11.0.2-0.jar</a></td></tr></tbody></table>

## Big data sources

Pentaho Data Catalog supports the following big data sources. Review this list for general compatibility with a specific vendor.

| Data Source                                              | Certified Version |
| -------------------------------------------------------- | ----------------- |
| Amazon EMR                                               | 7.0.0             |
| Cloudera Data Platform (CDP) on-premises (Private cloud) | 7.1.9             |
| Open-source Hadoop                                       | 3.3.6             |

## Integrations

Data Catalog integrates with associated Pentaho components and third-party analytics tools to extend data management, profiling, and visualization capabilities. The following components are certified or supported to work with Data Catalog.

| Component                      | Version | Supported PDC versions | Status    |
| ------------------------------ | ------- | ---------------------- | --------- |
| Pentaho Data Quality (PDQ)     | 3.1.2   | 10.2.9                 | Certified |
| Pentaho Data Integration (PDI) | 10.2    |                        | Supported |
| Power BI PDO Dashboards        |         |                        | Certified |

## Web browsers

Pentaho Data Catalog supports the major versions of publicly available web browsers. However, the following versions have been specifically certified to work with Data Catalog.

| Browser         | Certified Version |
| --------------- | ----------------- |
| Apple Safari    |                   |
| Google Chrome   | 138.0.7204.50     |
| Microsoft Edge  | 131.0.2903.51     |
| Mozilla Firefox | 140.0.4           |

## (Optional) Client Virtual Desktop Infrastructure (VDI)

The following table lists the client VDI requirements.

<table><thead><tr><th width="180.6666259765625">Category</th><th>Requirements</th></tr></thead><tbody><tr><td>Server configuration</td><td><ul><li>Windows operating system</li><li>16 GB RAM</li></ul></td></tr><tr><td>Disk or storage</td><td><ul><li>100 GB minimum</li></ul></td></tr><tr><td>Others</td><td><ul><li>Internet connectivity</li><li>Google Chrome browser</li><li>Permission to download files from the FTP server (secure FTP access)</li></ul></td></tr></tbody></table>
