Advanced configuration

After installing Data Catalog, you may need to set up additional components, depending on your environment. Use the following topics as needed to finish setting up your environment.

Configure system environment variables

Although uncommon, you might need to change the default values of Data Catalog system environment variables. These changes override default system behavior to align with your specific needs.

  1. In a terminal window, navigate to the pdc-docker-deployment folder and open the hidden environment variable configuration file (.env). By default, the deployment folder is /opt/pentaho/pdc-docker-deployment.

  2. Verify the system environment variables set in the /opt/pentaho/pdc-docker-deployment/vendor/.env.default file:

    For example, the number of worker instances that Data Catalog uses to run processes is set to 5:

    PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE=5
    PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE=5

    Note: Make sure that PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE and PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE have the same value for consistent worker instance management.

  3. To override an environment variable set in the vendor/.env.default file, create a new .env file in the /opt/pentaho/pdc-docker-deployment/conf/ folder:

    vi /opt/pentaho/pdc-docker-deployment/conf/.env

  4. (Optional) The data in the Business Intelligence Database refreshes daily by default, as set in the .env file. To modify the data refresh frequency, update the variable in the .env file to one of the options listed in the following table:

    Value                    Description
    @yearly (or @annually)   Run once a year, at midnight on January 1
    @monthly                 Run once a month, at midnight on the first day of the month
    @weekly                  Run once a week, at midnight between Saturday and Sunday
    @daily (or @midnight)    Run once a day, at midnight
    @hourly                  Run once an hour, at the beginning of the hour
    @every <number>m         Run at a custom interval, where <number> specifies the number of minutes. For example, @every 5m runs the job every 5 minutes.

    Example:

    PDC_CRON_BI_VIEWS_INIT_SCHEDULE=@daily
  5. After adding all required system variables, save the changes and restart the Data Catalog system services.

    ./pdc.sh stop
    ./pdc.sh up
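
Steps 3 to 5 can also be scripted. The following is a minimal sketch, not part of the product: the helper names is_supported_schedule and set_env_var are hypothetical, and a POSIX shell is assumed. The first helper checks a schedule value against the options in the table above; the second adds or updates a KEY=VALUE line in an override file.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Data Catalog.

# is_supported_schedule VALUE
# Succeeds only for the schedule options listed in the table above.
is_supported_schedule() {
  case $1 in
    @yearly|@annually|@monthly|@weekly|@daily|@midnight|@hourly) return 0 ;;
  esac
  # "@every <number>m", where <number> is a positive number of minutes
  printf '%s\n' "$1" | grep -Eq '^@every [1-9][0-9]*m$'
}

# set_env_var FILE KEY VALUE
# Replaces an existing KEY=... line in FILE, or appends one if KEY is absent.
set_env_var() {
  file=$1 key=$2 value=$3
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    tmp=$(mktemp)
    awk -F= -v k="$key" -v v="$value" \
      '$1 == k { print k "=" v; next } { print }' "$file" > "$tmp"
    cat "$tmp" > "$file"   # write back in place so file permissions are kept
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage (commented out so that sourcing this file changes nothing):
# ENV_FILE=/opt/pentaho/pdc-docker-deployment/conf/.env
# is_supported_schedule "@hourly" &&
#   set_env_var "$ENV_FILE" PDC_CRON_BI_VIEWS_INIT_SCHEDULE "@hourly"
```

Because overrides live in conf/.env, the vendor/.env.default file stays untouched. Restart the services as described in step 5 after changing a value.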

Configure chatbot in Data Catalog

The chatbot in Pentaho Data Catalog enables users to interact with cataloged metadata using natural language queries. The chatbot supports two response modes:

  • Standard (PDO-based) responses, which retrieve structured metadata from the Business Intelligence Database (BIDB).

  • Conversational responses, which use vector embeddings stored in Qdrant and a configured large language model (LLM) to provide contextual, semantic answers.

Before users can access chatbot capabilities, administrators must configure and validate several backend components. These configurations include enabling the chatbot service, configuring the large language model, populating indexing data stores, and defining role-based access for conversational search.

This section describes how to configure, enable, disable, and manage the chatbot feature in both Docker and Kubernetes (Amazon EKS) deployments.

Initial setup for chatbot

Before users can interact with the chatbot in Data Catalog, administrators must complete several mandatory configuration steps. The chatbot depends on a large language model (LLM) configuration, metadata indexing, vector embedding generation, and role-based access control to function correctly.

Perform the following steps to complete the initial setup for the chatbot in Data Catalog:

Prerequisites

  • Pentaho Data Catalog is installed and running.

  • OpenSearch is deployed and accessible.

  • The chatbot feature is supported in the selected deployment profile.

  • You have administrative access to the deployment environment.

  • You have valid LLM provider details, including API key, model name, and embedding model.

Procedure

  1. Enable the chatbot frontend and backend services in your deployment.

    • For Docker deployments, configure the chatbot services in the Docker profile and restart the application.

    • For Kubernetes deployments, enable chatbot-frontend and chatbot-backend in the active profile YAML file and apply the Helm upgrade.

    For detailed steps, see Enable chatbot in Data Catalog.

  2. Configure the required LLM and embedding settings:

    • API key

    • LLM model

    • Embedding model

    • Base URL (if applicable)

    • Target vector dimensions

    • Vector indexing schedule

    Ensure that the embedding model and target vector dimensions match. Important: You cannot change the embedding model after the initial setup without rebuilding the vector index. For detailed steps, see Configure a large language model (LLM) for the chatbot in Data Catalog.

  3. Configure conversational roles to define which user roles are allowed to receive conversational chatbot responses. For detailed steps, see Configure Data Catalog user roles for conversational search chatbot.

    Note: Users without the configured roles receive standard (PDO-based) chatbot responses.

  4. For the chatbot to return responses, the required data stores must be populated. Verify data population and indexing:

    1. For standard (PDO) queries, the Business Intelligence Database (BIDB) must contain indexed metadata.

    2. For conversational queries, the Qdrant vector database must contain generated embeddings.

    Both BIDB and Qdrant indexing processes run as scheduled jobs. By default, these jobs run once per day at midnight. The schedule is configurable. If you encounter any issues, see Chatbot vector indexing issues.

  5. Once the configuration is completed, sign in to Data Catalog and verify that:

    • The chatbot icon is visible in the user interface.

    • The chatbot accepts and processes queries.

    • Users with configured conversational roles receive contextual responses.

    • Users without conversational roles receive standard responses.

    If responses are missing or incomplete, review the backend logs.

Result

The chatbot is fully configured and operational.

  • Standard (PDO) queries return results from BIDB.

  • Conversational queries return semantic responses from Qdrant embeddings.

  • Indexing runs on schedule and updates incrementally.

The chatbot is now ready for production use.

What next

  • If responses are incomplete or missing, see Chatbot vector indexing issues.

  • Monitor indexing logs during initial production deployment.

  • Periodically verify indexing schedules and role assignments.

Enable chatbot in Data Catalog

In Pentaho Data Catalog, the chatbot enables users to interact with cataloged data via natural language, retrieve insights, explore business glossary terms, and access related dashboards directly within the Data Catalog interface. This procedure explains how to enable the chatbot feature in Data Catalog. After you enable the chatbot, the chatbot icon appears in the user interface, and users can start conversational data discovery.

Prerequisites

Docker deployment

Perform the following steps to enable the chatbot in the Docker deployment:

Procedure

  1. Open a terminal session on the server where Data Catalog is deployed using Docker Compose.

  2. Change to the Data Catalog Docker deployment directory.

  3. Stop all running Data Catalog services.

  4. Open the Docker Compose configuration file.

  5. Change the profiles value from the disabled profile name (for example, core1) back to core for the chatbot backend and frontend services.

    Note: The profiles value must match the profile name defined for the COMPOSE_PROFILES parameter in vendor/.env.default (for example, core).

  6. Save the file and exit the editor.

  7. Start the Data Catalog services.

  8. Open the Data Catalog UI and confirm that the chatbot icon is visible.

Result

The chatbot frontend and backend services are enabled, and the chatbot icon appears in the Data Catalog interface. Users can now use the chatbot to explore catalog data using natural language.

What next

You can disable the chatbot if no longer required by reversing the configuration changes. For more information, see Disable chatbot in Data Catalog.

Disable chatbot in Data Catalog

In Pentaho Data Catalog, the chatbot enables users to interact with cataloged data via natural language, retrieve insights, explore business glossary terms, and access related dashboards directly within the Data Catalog interface. You might want to disable the chatbot when the feature is not required. This procedure explains how to disable the chatbot in Data Catalog.

Disabling the chatbot stops the chatbot frontend and backend services and removes the chatbot icon from the user interface.

Prerequisites

  • Data Catalog is deployed using Docker Compose or Kubernetes.

  • You have administrative access to the deployment environment.

  • You can open a terminal session on the deployment server or management system.

Docker deployment

Perform the following steps to disable the chatbot in the Docker deployment:

Procedure

  1. Open a terminal session on the server where Data Catalog is deployed using Docker Compose.

  2. Change to the Data Catalog Docker deployment directory.

  3. Stop all running Data Catalog services.

  4. Open the Docker Compose configuration file.

  5. Change the profiles value from core to an unused value (for example, core1) for the chatbot backend and frontend services.

    Note: The profiles value used to disable the chatbot must not match any active profile defined for the COMPOSE_PROFILES parameter in vendor/.env.default (for example, core). If an administrator later changes the profiles value to a name that is listed in vendor/.env.default, the chatbot frontend and backend services start automatically.

  6. Save the file and exit the editor.

  7. Start the Data Catalog services.

  8. Open the Data Catalog application and confirm that the chatbot icon is no longer visible.
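
Before restarting, it can help to confirm that the profile name you chose in step 5 really is inactive. The following is a minimal sketch under stated assumptions: profile_is_active is a hypothetical helper, the file is assumed to contain a single COMPOSE_PROFILES=<comma-separated list> line, and a POSIX shell is assumed.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not shipped with Data Catalog.

# profile_is_active ENV_FILE PROFILE
# Succeeds (exit status 0) when PROFILE appears in the comma-separated
# COMPOSE_PROFILES value found in ENV_FILE.
profile_is_active() {
  env_file=$1 profile=$2
  # Take the last COMPOSE_PROFILES assignment and drop quotes and spaces.
  profiles=$(sed -n 's/^COMPOSE_PROFILES=//p' "$env_file" | tail -n 1 | tr -d '" ')
  case ",${profiles}," in
    *",${profile},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (commented out; run from the deployment directory):
# if profile_is_active vendor/.env.default core1; then
#   echo "core1 is active, so the chatbot services would start"
# else
#   echo "core1 is not active, so the chatbot services stay disabled"
# fi
```

The same check works in the other direction when you enable the chatbot: the profile name you set for the chatbot services should be reported as active.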

Result

The chatbot frontend and backend services are disabled, and the chatbot icon is removed from the Data Catalog interface. Users can no longer access or use the chatbot feature. Additionally, you can verify that no chatbot services are running by checking containers or pods.

What next

You can re-enable the chatbot if required by reversing the configuration changes. For more information, see Enable chatbot in Data Catalog.

Configure a large language model (LLM) for the chatbot in Data Catalog

In Data Catalog, the chatbot requires a configured large language model (LLM) to generate responses, create vector embeddings, and support conversational search for authorized users. This procedure explains how administrators configure an LLM for the chatbot feature in Data Catalog. After you configure the LLM, the chatbot backend can process user queries, generate embeddings, and return contextual responses based on catalog metadata and permissions.

Prerequisites

  • The chatbot feature is enabled in Pentaho Data Catalog.

  • You have valid LLM provider details, including API key, model name, and embedding model.

Docker deployment

Perform the following steps to configure an LLM for the chatbot feature in a Data Catalog Docker deployment:

Procedure

  1. Open a terminal session on the server where Data Catalog is deployed using Docker Compose.

  2. Change to the Pentaho Data Catalog Docker deployment directory.

  3. Stop all running Data Catalog services.

  4. Open the environment configuration file.

  5. Configure the chatbot LLM and embedding settings.

    Environment variable                               Description
    PDC_CHATBOT_API_KEY                                API key used to authenticate with the configured LLM provider. The chatbot backend does not start if this value is missing or invalid.
    PDC_CHATBOT_LLM                                    Name of the LLM used to generate chatbot responses.
    PDC_CHATBOT_LLM_BASE_URL                           Base URL of the LLM provider endpoint. Use this value for self-hosted or non-default LLM providers.
    PDC_CHATBOT_EMBEDDING_MODEL                        Embedding model used to generate vector representations for conversational search. You cannot change this value after the initial setup.
    PDC_CHATBOT_TARGET_VECTOR_DIMENSIONS               Number of dimensions produced by the embedding model. This value must match the dimensions supported by the embedding model.
    PDC_CHATBOT_TARGET_VECTOR_BATCH_SIZE               Number of records processed in a single batch during vector indexing. Higher values improve throughput but increase memory usage.
    PDC_CHATBOT_TARGET_VECTOR_MAX_CONCURRENT_BATCHES   Maximum number of vector indexing batches processed in parallel. Increasing this value increases the load on the system.
    PDC_CHATBOT_CONVERSATIONAL_ROLES                   User roles allowed to receive conversational (vector-based) chatbot responses. Separate multiple roles with commas.
    PDC_CHATBOT_SCHEDULE_VECTOR_INDEX_UPDATE           Schedule that controls how often the chatbot refreshes the vector index (for example, @daily).

  6. Save the file and exit the editor.

  7. Restart Pentaho Data Catalog services.

  8. Sign in to Data Catalog and verify that the chatbot returns responses.
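
Before restarting in step 7, you can check that the core chatbot variables from the table above have values. The following is a minimal sketch, not a product tool: missing_chatbot_vars is a hypothetical helper, and the choice of which four variables to treat as mandatory is an assumption based on the settings this topic describes as required.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not shipped with Data Catalog.

# missing_chatbot_vars ENV_FILE
# Prints the name of each checked variable that is absent or empty in ENV_FILE.
# Returns 1 if at least one is missing, otherwise 0.
missing_chatbot_vars() {
  env_file=$1
  missing=0
  # Assumption: these four settings are treated as mandatory.
  for name in PDC_CHATBOT_API_KEY PDC_CHATBOT_LLM \
              PDC_CHATBOT_EMBEDDING_MODEL PDC_CHATBOT_TARGET_VECTOR_DIMENSIONS; do
    # A variable counts as set only if something follows the "=" sign.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (commented out; run from the deployment directory):
# missing_chatbot_vars conf/.env || echo "Set the variables listed above first."
```

The check only confirms that a value is present. It cannot tell whether the API key is valid or whether the vector dimensions match the embedding model.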

Result

The chatbot is configured with a large language model and embedding settings. Users can now receive contextual responses from the chatbot.

What next

You can configure additional roles for conversational search to control which users receive vector-based responses. For more information, see Configure Data Catalog user roles for conversational search chatbot.

Configure Data Catalog user roles for conversational search chatbot

In Pentaho Data Catalog, conversational search uses vector-based retrieval and large language models to return contextual responses to users with authorized roles, while other users continue to receive standard reporting (BIDB-based) responses.

By configuring roles for conversational search, you can control access to advanced chatbot capabilities based on user responsibilities. For example, you can allow users with the Data Steward and Business Steward roles to receive deeper, semantic responses. This procedure explains how to configure which user roles can use conversational search in the Data Catalog chatbot.

Prerequisites

  • Data Catalog is deployed using Docker Compose or Amazon EKS.

  • The chatbot feature is enabled.

  • You have administrative access to the deployment environment.

Docker deployment

Perform the following steps to configure Data Catalog user roles for the conversational search chatbot in the Docker deployment:

Procedure

  1. Open a terminal session on the server where Data Catalog is deployed using Docker Compose.

  2. Change to the Data Catalog Docker deployment directory.

  3. Stop all running Data Catalog services.

  4. Open the environment configuration file.

  5. Set the conversational search roles separated by a comma.

    Note: By default, conversational search is enabled only for the Data Steward role. For more information about user roles in Data Catalog, see User roles and permissions in Data Catalog.

  6. Save the file and exit the editor.

  7. Start the Data Catalog services.

Result

Conversational search is enabled only for the configured user roles. Users with these roles receive vector-based, context-aware chatbot responses, while other users continue to receive standard chatbot responses.


Enable or disable business rules in Data Catalog

In Data Catalog, business rules are disabled by default. Administrators can enable or disable business rules by setting a deployment flag, then restarting the affected services in Docker deployments or redeploying them in Amazon EKS deployments.

Perform the following steps to enable or disable business rules in Data Catalog:

Docker deployments

Procedure

  1. Sign in to the server where Data Catalog is installed and open a terminal.

  2. Go to the deployment folder.

  3. Open the environment configuration file.

    Data Catalog supports overriding variables from vendor/.env.default by placing them in conf/.env.

  4. Enable business rules by adding the following line:

    To disable business rules, set the value to false:

  5. Save the file and restart Data Catalog services.

Result

Business rules are enabled (or disabled) after services restart or the updated configuration is rolled out.


Enable verbose logging for profiling

Verbose logging increases the level of detail written to the application logs during profiling operations. As a Data Catalog admin, when you enable verbose logging, Data Catalog records additional diagnostic information about how profiling jobs are executed, including internal processing steps, execution flow, and runtime details. This detailed logging helps administrators and support teams troubleshoot profiling failures, investigate unexpected results, and analyze performance issues.

Verbose logging generates a higher volume of log data and can increase disk usage.

Perform the following steps to enable verbose logging in Data Catalog:

Procedure

  1. Open a terminal on the Data Catalog deployment server.

  2. Go to the Data Catalog deployment directory.

  3. Open the conf/.env file in a text editor.

  4. Add the following environment variable, or update the value if it already exists:

    This setting enables detailed logging for profiling operations performed by the Web Services component. To disable verbose profiling logging, set the parameter to false:

  5. Save and close the file.

  6. Restart the Data Catalog services to apply the change.

Result

Verbose logging is enabled for profiling jobs. Additional diagnostic information is written to the Data Catalog application logs during profiling operations.


Disable the Physical Assets feature from Data Catalog deployment

By default, the Physical Assets feature is included in the Data Catalog deployment to support OT asset metadata through Pentaho Edge. If your deployment does not require this feature, you can disable it by removing its profile reference from the Compose configuration. This reduces the size of the deployment and saves compute resources. This procedure describes how to disable the Physical Assets feature in Data Catalog.

Note: This procedure only disables the container associated with Physical Assets. It does not remove any binaries or metadata files. The service can be re-enabled at any time by restoring the profile reference.

Before you begin, make sure the following conditions are met:

  • Data Catalog is already installed using the Docker deployment method. For more information about installation, see Install Pentaho Data Catalog.

  • You have access to the deployment directory (pdc-docker-deployment) and permission to edit the .env.default or .env file.

Perform the following steps to disable the Physical Assets feature:

  1. Navigate to the following directory:

  2. Open the .env.default file in a text editor:

  3. Locate the COMPOSE_PROFILES line. It may look like this:

  4. Remove physical-assets from the list:

  5. Save the file and return to the deployment root folder:

  6. Restart the deployment using the following command:
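
As a rough sketch of steps 3 to 6, the following helper removes one profile name from a comma-separated profile list. It is illustrative only: remove_profile is a hypothetical helper, the example path is the default vendor folder named earlier in this document, and the ./pdc.sh stop and ./pdc.sh up commands are the restart commands shown in the environment variable procedure, assumed to apply here as well.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not shipped with Data Catalog.

# remove_profile LIST NAME
# Prints the comma-separated LIST without the entry that equals NAME.
remove_profile() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v -x -F -- "$2" | paste -s -d ',' -
}

# Example usage (commented out so that nothing is changed by accident):
# cd /opt/pentaho/pdc-docker-deployment/vendor
# current=$(sed -n 's/^COMPOSE_PROFILES=//p' .env.default)
# echo "COMPOSE_PROFILES=$(remove_profile "$current" physical-assets)"
# # Review the printed line, put it in the file, then restart:
# cd .. && ./pdc.sh stop && ./pdc.sh up
```

The helper matches whole entries only, so a profile whose name merely contains physical-assets is left in place.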

Running Data Catalog workloads using node affinity, taints, and tolerations

Data Catalog is a distributed application that runs several containerized components, including the application, database, and worker pods. Worker pods perform data scanning, profiling, and metadata ingestion from enterprise data sources. These operations often involve direct access to sensitive or large-scale datasets.

In large or security-sensitive deployments, administrators often need precise control over where workloads run within a Kubernetes (EKS) cluster. This control helps ensure that critical services and data processing tasks run in secure, compliant, and performance-optimized environments. It becomes especially important when certain components of Data Catalog handle data that is confidential, high-volume, or requires specialized compute resources such as GPUs or high-performance storage.

By using Kubernetes node affinity, taints, and tolerations, you can configure Data Catalog so that:

  • Worker pods run only on specific nodes. For example, nodes in a restricted security group or a separate availability zone.

  • Non-worker workloads, such as the user interface or metadata services, are prevented from running on those nodes.

  • The system continues to meet data segregation and compliance requirements without affecting performance or scalability.

Running PDC workloads using node affinity, taints, and tolerations provides a secure, compliant, and efficient way to manage distributed deployments across different availability zones or network segments.

Perform the following steps to configure node affinity, taints, and tolerations for Data Catalog workloads.

Before you begin

  • Ensure that your Data Catalog deployment is running on Amazon Elastic Kubernetes Service (EKS).

  • Verify that you have kubectl and AWS CLI installed and configured with permissions to update node groups and apply taints.

  • Identify the node group where you want to run PDC worker workloads.

  • Obtain access to the custom-values.yaml file used for your PDC Helm deployment.

  • Confirm that your deployment uses Helmfile for orchestration.

Procedure

  1. Open a terminal on the machine that manages your Kubernetes cluster.

  2. Apply a taint to the node group you want to reserve for worker workloads.

    Replace the placeholders with your cluster and node group names.

    • key=dedicated specifies the taint key.

    • value=ws indicates that the node is dedicated for worker services.

    • effect=NO_SCHEDULE prevents other pods from being scheduled on these nodes unless they have a matching toleration.

    Tip: Use descriptive taint keys and values that reflect the node group’s purpose, such as key=data-processing or value=worker.

  3. Edit the custom-values.yaml file for your Helm deployment.

    For PDC 10.2.8 and later, locate the job-server section and add:

    Replace <your-nodegroup-name> with the name of the dedicated node group.

  4. Save the file and deploy the configuration.

    The -n pentaho flag ensures that the deployment targets the correct namespace.
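
The taint in step 2 can be applied with the AWS CLI. The following is a minimal sketch, assuming an EKS managed node group and the key, value, and effect described in step 2; the function names taint_spec and apply_nodegroup_taint are hypothetical, and the cluster and node group names are yours to supply.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Data Catalog.

# taint_spec KEY VALUE EFFECT
# Prints the value for the --taints option in AWS CLI shorthand syntax.
taint_spec() {
  printf 'addOrUpdateTaints=[{key=%s,value=%s,effect=%s}]\n' "$1" "$2" "$3"
}

# apply_nodegroup_taint CLUSTER_NAME NODEGROUP_NAME
# Adds or updates the dedicated=ws:NO_SCHEDULE taint on a managed node group.
apply_nodegroup_taint() {
  aws eks update-nodegroup-config \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --taints "$(taint_spec dedicated ws NO_SCHEDULE)"
}

# Example usage (commented out; replace the placeholders first):
# apply_nodegroup_taint <your-cluster-name> <your-nodegroup-name>
```

Keeping the taint string in one function makes it easier to reuse the same key and value when you write the matching toleration in custom-values.yaml.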

Result

You have successfully configured node affinity, taints, and tolerations for Data Catalog workloads. After deployment:

  • Data Catalog worker pods are scheduled only on the nodes defined by the node affinity rules.

  • Other pods are prevented from being scheduled on those nodes unless they include a matching toleration.

  • Your Data Catalog deployment now supports workload segregation across different availability zones or network segments in AWS.

This configuration helps validate compliance and segregation requirements for deployments where worker nodes must belong to distinct security groups or availability zones.

Next steps

  • To verify the configuration, run:

    Confirm that worker pods are assigned to nodes in the expected node group.

  • Monitor node usage using Amazon EKS Console or the kubectl describe node command.
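
One way to run the verification is to list each pod with its node and compare the result with the nodes in the dedicated node group. This is a sketch under assumptions: the function names are hypothetical, the namespace defaults to pentaho as used in the procedure, and the eks.amazonaws.com/nodegroup label is the one that EKS applies to managed node groups.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Data Catalog.

# list_pod_nodes [NAMESPACE]
# Prints one "pod node" pair per line for the namespace (default: pentaho).
list_pod_nodes() {
  ns=${1:-pentaho}
  kubectl get pods -n "$ns" --no-headers \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
}

# list_nodegroup_nodes NODEGROUP_NAME
# Prints the nodes that belong to an EKS managed node group.
list_nodegroup_nodes() {
  kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" --no-headers \
    -o custom-columns=NODE:.metadata.name
}

# Example usage (commented out; replace the placeholder first):
# list_pod_nodes pentaho
# list_nodegroup_nodes <your-nodegroup-name>
```

Worker pods should appear only on nodes returned by the second command.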

Additional information

For more information, see the official AWS documentation: Place Kubernetes pods on Amazon EKS by using node affinity, taints, and tolerations


Install user-provided SSL certificates

To provide a greater level of security to your data, you can use signed Secure Sockets Layer (SSL) certificates from your Certificate Authority (CA) with Data Catalog.

Data Catalog automatically installs self-signed certificates in the <install-directory>/conf/https directory as server.key (PEM-encoded private key) and server.crt (PEM-encoded self-signed certificate). You can replace these files with certificates signed by your CA.

Use this procedure to install signed SSL certificates for Data Catalog:

  1. On your Data Catalog server, navigate to the Data Catalog installation directory <install-directory>/conf/https, where <install-directory> is the directory where Data Catalog is installed.

    • server.key is a PEM-formatted file that contains the private key of a specific certificate.

    • server.crt is a PEM-formatted file containing the certificate.

  2. Replace the <install-directory>/conf/https/server.key file with the PEM-encoded private key used to sign the SSL certificate, or generate a new private key in PEM-encoded format.

  3. Replace the <install-directory>/conf/https/server.crt file with the PEM-encoded signed certificate associated with the private key in Step 2.

    If you generate a new private key, you need to obtain a new PEM-encoded signed SSL certificate from your CA.

  4. Append the following certificates to the <install-directory>/conf/extra-certs/bundle.pem file, in this order:

    1. Top-level PEM-encoded signed SSL certificate (that is, the content of the <install-directory>/conf/https/server.crt file).

    2. Intermediate PEM-encoded certificate, if any, from your CA.

    3. Root PEM-encoded certificate, if any, from your CA.

  5. Navigate to the Data Catalog <install-directory>.

  6. Use the following command to restart Data Catalog:

    ./pdc.sh restart
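
Before restarting, it can be worth confirming that the new private key and certificate belong together. The following is a minimal sketch that compares the public key derived from each file with OpenSSL; key_matches_cert is a hypothetical helper, and the openssl command-line tool is assumed to be installed on the server.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not shipped with Data Catalog.

# key_matches_cert KEY_FILE CERT_FILE
# Returns 0 if the certificate was issued for the private key,
# 1 if the two do not match, and 2 if either file cannot be read.
key_matches_cert() {
  key=$1 cert=$2
  # Derive the public key from the private key.
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 2
  # Extract the public key embedded in the certificate.
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 2
  [ "$key_pub" = "$cert_pub" ]
}

# Example usage (commented out; run from <install-directory>):
# key_matches_cert conf/https/server.key conf/https/server.crt &&
#   echo "Key and certificate match"
```

Comparing public keys works for both RSA and elliptic-curve keys. A passphrase-protected key prompts for its passphrase when read.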

Check and remove outdated certificates

If Data Catalog services fail to start or show SSL-related errors, the issue might be caused by expired or outdated certificates in the bundle.pem file. You can identify and remove outdated certificates by checking their validity dates.

Before you begin

  • Ensure you have access to the server where Data Catalog is installed.

  • Verify that you have permission to view and edit files in the conf/extra-certs directory.

Procedure

  1. Go to the directory where the bundle.pem file is stored:

  2. Run the following command to check the validity dates of all certificates in the bundle.pem file:

  3. Review the command output. You see sections similar to the following:

  4. Interpret the fields in the output:

    Field       Description
    notBefore   Date from which the certificate becomes valid.
    notAfter    Date after which the certificate expires.

  5. Compare the current date with the notAfter value. If the current date is later than notAfter, the certificate has expired. For example:

  6. Edit the bundle.pem file to remove the expired certificates.

    • Use any text editor such as vi or nano.

    • Remove the complete certificate block starting from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----.

  7. Save the file and restart the Data Catalog services.
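
The check in steps 2 to 5 can be sketched as follows. This is not necessarily the command the procedure refers to: list_bundle_dates is a hypothetical helper that splits the bundle into individual certificates, prints the subject and validity dates of each one with OpenSSL, and marks certificates that have already expired. The openssl command-line tool is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not shipped with Data Catalog.

# list_bundle_dates BUNDLE_FILE
# Prints subject, notBefore, notAfter, and a status line for every
# certificate block found in BUNDLE_FILE.
list_bundle_dates() {
  bundle=$1
  workdir=$(mktemp -d)
  # Split the bundle: each BEGIN CERTIFICATE line starts a new numbered file.
  awk -v dir="$workdir" '
    /-----BEGIN CERTIFICATE-----/ { n++; out = sprintf("%s/cert-%03d.pem", dir, n) }
    out != "" { print > out }
    /-----END CERTIFICATE-----/ { close(out); out = "" }
  ' "$bundle"
  for cert in "$workdir"/cert-*.pem; do
    [ -e "$cert" ] || continue
    openssl x509 -in "$cert" -noout -subject -dates
    # -checkend 0 succeeds only if the certificate has not expired yet.
    if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
      echo "status=valid"
    else
      echo "status=EXPIRED"
    fi
  done
  rm -rf "$workdir"
}

# Example usage (commented out; run from <install-directory>):
# list_bundle_dates conf/extra-certs/bundle.pem
```

The helper only reads the bundle. Back up bundle.pem before you edit it, and remove each expired block as described in step 6.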

Result

The bundle.pem file now contains only active certificates, preventing SSL validation issues during Data Catalog startup or connectivity.


Add email domains to the safe list in Data Catalog after deployment

During the initial deployment, Data Catalog is typically configured to allow only a predefined set of email domains for user authentication. Later, you might need to grant access to users with email addresses from new domains. Instead of redeploying Data Catalog, which can cause downtime and operational delays, you can dynamically update the list of allowed email domains using the Identity & Access Management (IAM) APIs.

Note: Adding email domains and SMTP details during the initial Data Catalog deployment is always a best practice. For more information, see Install Pentaho Data Catalog.

Perform the following steps to add email domains to the safe list using IAM APIs after deployment:

Prerequisites

  • You must have administrative access to use the IAM APIs.

  • Identify your Data Catalog DNS (for example, catalog.example.com).

  • Obtain admin credentials to generate a Bearer token.

Procedure

  1. Open a command prompt and run the following cURL command to generate an authentication token for the IAM APIs:

    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <admin-username> and <admin-password> with the actual Keycloak master realm admin user credentials. The response includes the token value.

  2. Before updating, you can view the currently configured email domains using the following GET command.

    The response displays the current domain configuration:

  3. Run the following IAM API cURL request to update email domains:

    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <ACCESS_TOKEN> with the token obtained in the previous step.

    • Replace <provider-id> with your Data Catalog server’s domain or IP address used during installation.

    • Modify the "emailDomains" list as needed.

    Note: Do not remove hv.com and hitachivantara.com from the email domain list.

  4. Run the GET command again (see Step 2) to verify that the email domains are added.

    The response displays the current domain configuration:
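
As a sketch of step 1 only, the following helpers request a token from the Keycloak master realm and extract it from the JSON response. Several details are assumptions rather than product documentation: the function names are hypothetical, the request uses the standard Keycloak password grant with the admin-cli client, and BASE_URL must be the Keycloak base URL of your deployment, including any path prefix that your installation places in front of /realms. The IAM API requests in steps 2 and 3 are not reproduced here.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Data Catalog.

# extract_access_token
# Reads a JSON token response on standard input and prints the access_token value.
extract_access_token() {
  sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# request_token BASE_URL ADMIN_USERNAME ADMIN_PASSWORD
# Assumption: BASE_URL is the Keycloak base URL, including any path prefix.
request_token() {
  curl -s -X POST "$1/realms/master/protocol/openid-connect/token" \
    -d "client_id=admin-cli" \
    -d "grant_type=password" \
    --data-urlencode "username=$2" \
    --data-urlencode "password=$3" | extract_access_token
}

# Example usage (commented out; replace the placeholders first):
# ACCESS_TOKEN=$(request_token "https://<your-server-url>" "<admin-username>" "<admin-password>")
```

Passing the password as an argument exposes it to other users through the process list, so prefer an interactive prompt or a protected file on shared systems.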

Set up an email server to send Data Catalog notifications

To set up Data Catalog to send email notifications to users, you can configure any Simple Mail Transfer Protocol (SMTP) server that meets your needs.

For example, Data Catalog sends a notification when a user is tagged with '@' in a comment, or when a job completes and the user is set up in a data pipe template to be notified.

Note: The steps to set up an SMTP server in the Installing Data Catalog topic in Install Pentaho Data Catalog only set up the forgot password functionality.

To integrate an SMTP server with Data Catalog, use the following steps:

  1. Gather the following information for the SMTP server you want to use:

    • Host name of SMTP server (IP address or domain name)

    • Port number for SMTP server

    • Username on SMTP server in <mail userID>@<domain>.com format

    • Password for username

    • Sender mail ID in <mail userID>@<domain>.com format

    • Whether to use Transport Layer Security (TLS) or Secure Sockets Layer (SSL) security.

    • TLS or SSL port number

    For example, you can use Gmail’s SMTP server to send emails from your application. The SMTP server configuration settings for Gmail are:

    • SMTP server address: smtp.gmail.com

    • Secure connection: TLS or SSL, based on your mail client or website SMTP plugin

    • SMTP username: your Gmail account (for example, <mail userID>@gmail.com)

    • SMTP password: your Gmail password

    • Gmail SMTP port: 465 (SSL) or 587 (TLS)

  2. Log in to Data Catalog using root user credentials to configure Data Catalog to use the SMTP server, as in the following example:

    https://<full domain name for PDC server>/

  3. Navigate to the configuresystem/smtp page on the Data Catalog server, as in the following example:

    https://<full domain name for PDC server>/configuresystem/smtp

    The Configure Your System page opens.

  4. Specify the SMTP server information as detailed in the following table:

    Field         Description
    Host          IP address or domain name of SMTP server
    Port          Port number for SMTP server
    Username      User name in <mail userID>@<domain>.com format
    Password      Password for the user name specified above
    Sender Mail   Sender mail ID in <mail userID>@<domain>.com format
    Encryption    TLS: Default value (leave the Use SSL checkbox blank). SSL: Select the Use SSL checkbox.

  5. Click Test Connection to test the integration. A success confirmation message is displayed next to the Test Connection button.

  6. Click Save Changes.

Update SMTP details in Data Catalog after deployment

Adding Simple Mail Transfer Protocol (SMTP) details in Data Catalog enables email notifications and alerts within the application, such as:

  • Alerts about Data Catalog changes, approvals, and errors like data ingestion, metadata extraction, or synchronization failures.

  • Password reset links when users forget their credentials.

  • Notification alerts when tagged in the comments tab.

SMTP details are typically configured during the initial deployment of Data Catalog. However, if you want to update SMTP details post-deployment, you can use the Identity & Access Management (IAM) APIs without redeploying Data Catalog, which might cause downtime and operational delays.

Note: Adding email domains and SMTP details during the initial Data Catalog deployment is the best practice. For more information, see the Installing Data Catalog topic in the Get started with Pentaho Data Catalog document.

Perform the following steps to update SMTP details in Data Catalog using IAM APIs after deployment:

Ensure you have sufficient access to use the IAM APIs.

  1. To generate an authentication token to interact with the IAM APIs, open a command prompt and run the following cURL command:

    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <admin-username> and <admin-password> with the actual admin credentials. The response includes the token value.

  2. To update SMTP details, run the following IAM API cURL request:

    Parameter
    Description

    <PDC_HOST>

    The host name or IP address of your Data Catalog instance.

    <TENANT_NAME>

    The tenant name, typically "pdc".

    <TOKEN_VALUE>

    A valid authentication token (must be obtained through IAM authentication).

    <SMTP_PASSWORD>

    The password for the SMTP server authentication.

    <REPLY_TO_DISPLAY_NAME>

    The display name for the reply-to email address.

    <SMTP_PORT>

    The port number used by the SMTP server.

    <SMTP_HOST>

    The SMTP server host address.

    <REPLY_TO_EMAIL>

    The reply-to email address.

    <FROM_EMAIL>

    The email address used to send notifications.

    <FROM_DISPLAY_NAME>

    The display name associated with the sender’s email.

    <ENVELOPE_FROM>

    The envelope sender address (optional).

    <SMTP_USERNAME>

    The username for SMTP authentication.

Configure proxy server settings for the Licensing-API service

In Pentaho Data Catalog, the Licensing-API service is responsible for managing and validating software licenses, ensuring that only authorized users and services can access Data Catalog features. When Data Catalog is deployed in an enterprise environment that restricts direct internet access, services like the Licensing-API require a proxy server to reach external licensing servers and authenticate endpoints.

After deploying Data Catalog, perform the following steps to configure the proxy server for the Licensing-API service:

Note: When configuring the proxy server for the Licensing-API service, use the domain name instead of the IP address. SSL certificates are typically issued for domain names, so using the domain name helps ensure secure communication.

Ensure that you have:

  • Access to the conf/.env and vendor/docker-compose.licensing.yml files.

  • Administrative privileges to modify configuration files and restart services.

  • The required proxy server details (domain, port, username, and password).

  • The SSL certificate file (proxy-cert.pem) if required for secure proxy connections.

Procedure

  1. To configure proxy environment variables, go to the Data Catalog root folder and then open the conf/.env file.

  2. In the conf/.env file, update the following proxy variables with their respective values:

    Variable
    Description
    Example Value

    LICENSING_SERVER_PROXY_ENABLED

    Enables or disables proxy configuration.

    true or false

    LICENSING_SERVER_PROXY_DOMAIN

    The domain or IP address of the proxy server.

    10.177.176.126

    LICENSING_SERVER_PROXY_PORT

    The port number used for proxy communication.

    443

    LICENSING_SERVER_PROXY_USER

    The username for proxy authentication.

    admin

    LICENSING_SERVER_PROXY_PASSWORD

    The password for proxy authentication.

    password

    Note: It is a best practice to avoid hard coding sensitive credentials like PROXY_USER and PROXY_PASSWORD. Use secret management tools or environment variables to secure them.

  3. To update proxy server configuration in Docker Compose, open the vendor/docker-compose.licensing.yml file and update the licensing-api service configuration as follows:

    Note:

    • The PROXY_ENABLED, PROXY_HOST, PROXY_PORT, PROXY_USER, and PROXY_PASSWORD environment variables are mapped inside the Docker container.

    • The JAVA_EXTRA_CERTS is set to "cert.pem" to configure SSL certificates for proxy authentication.

    • A volume mount is added to ensure that the SSL certificate file proxy-cert.pem is accessible within the container.

  4. (Optional) If the proxy server requires SSL authentication, place the SSL certificate file (proxy-cert.pem) in the specified directory:

    Note: Ensure that the file permissions allow access by the Licensing-API service.

  5. After updating the configuration, restart the Data Catalog services to apply the changes:

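As an illustration of step 2, the following Python sketch renders the proxy variables as lines for the conf/.env file. Only the LICENSING_SERVER_PROXY_* names come from the table above; the function name and the example domain are hypothetical, and the sketch only produces text without touching any file.

```python
def render_proxy_env(domain: str, port: int, user: str = "", password: str = "",
                     enabled: bool = True) -> str:
    """Render the LICENSING_SERVER_PROXY_* lines for conf/.env."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid proxy port: {port}")
    values = {
        "LICENSING_SERVER_PROXY_ENABLED": "true" if enabled else "false",
        "LICENSING_SERVER_PROXY_DOMAIN": domain,
        "LICENSING_SERVER_PROXY_PORT": str(port),
        "LICENSING_SERVER_PROXY_USER": user,
        "LICENSING_SERVER_PROXY_PASSWORD": password,
    }
    return "\n".join(f"{key}={value}" for key, value in values.items())


if __name__ == "__main__":
    # proxy.example.com is a placeholder; read real credentials from a secret store.
    print(render_proxy_env("proxy.example.com", 443, "admin", "change-me"))
```

In line with the note about not hard-coding credentials, you would typically pass the user name and password in from a secret manager rather than typing them into a script.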

Configure metadata index refresh threshold for profiling

In Data Catalog, the PDC_WS_REFRESH_MDS_INDEX_EVERY_N environment variable controls how frequently the metadata index (MDS index) is refreshed during processing. By default, Data Catalog refreshes the metadata index at predefined intervals while handling profiling and metadata operations. In high-volume environments or during large profiling jobs, you might want to override this threshold to control how often the index refresh occurs.

Changing the metadata index refresh threshold can affect system performance and indexing behavior.

Perform the following steps to override the default refresh interval in Data Catalog:

Procedure

  1. Open a terminal on the Data Catalog deployment server.

  2. Go to the Data Catalog deployment directory.

  3. Open the conf/.env file in a text editor.

  4. Add or update the following environment variable:

    This value specifies the number of processed records or operations after which the metadata index is refreshed.

  5. Save the file and close the editor.

  6. Restart the Data Catalog services to apply the change.

Result

Data Catalog refreshes the metadata index after every 100000 operations, based on the configured threshold. Use this configuration only when necessary and revert to the default setting if it is no longer required.
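The threshold can be pictured as a simple operation counter. The following Python sketch is an illustration only and is not Data Catalog's implementation; the class name is hypothetical, and the value 100000 is taken from the Result above, which corresponds to setting PDC_WS_REFRESH_MDS_INDEX_EVERY_N=100000.

```python
class RefreshCounter:
    """Counts operations and signals a refresh after every `every_n` of them."""

    def __init__(self, every_n: int) -> None:
        if every_n <= 0:
            raise ValueError("every_n must be positive")
        self.every_n = every_n
        self.pending = 0      # operations seen since the last refresh
        self.refreshes = 0    # total refreshes triggered so far

    def record(self, operations: int = 1) -> bool:
        """Add operations; return True if at least one refresh was triggered."""
        self.pending += operations
        triggered = self.pending >= self.every_n
        while self.pending >= self.every_n:
            self.pending -= self.every_n
            self.refreshes += 1
        return triggered


if __name__ == "__main__":
    counter = RefreshCounter(every_n=100000)
    counter.record(250000)
    print(counter.refreshes, counter.pending)
```

A larger threshold means fewer refreshes during a large profiling job, at the cost of the index lagging further behind the processed data.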


Configure job server auto-scaling in Amazon EKS

The Job Server in Pentaho Data Catalog can automatically scale its pods in an Amazon Elastic Kubernetes Service (EKS) cluster based on CPU and memory utilization. This auto-scaling applies to all jobs that are executed through the Job Server, including data profiling, data identification, and metadata ingestion tasks. By enabling auto-scaling, Data Catalog maintains consistent performance for job executions while optimizing compute resource usage.

Scaling operations are managed through the Kubernetes Horizontal Pod Autoscaler (HPA) for pod-level scaling and the EKS Cluster Autoscaler for node-level scaling.

Perform the following steps to configure job server auto-scaling in Amazon EKS:

Before you begin

Ensure the following prerequisites are met before enabling Job Server auto-scaling:

  • Data Catalog is deployed in an Amazon EKS cluster.

  • The Cluster Autoscaler is enabled for the EKS node groups.

  • You have administrative privileges to modify and redeploy the Helm configuration.

  • You have identified the namespace where Data Catalog is installed.

  • Your workload profiling results (for example, file profiling or JDBC profiling) are available. These results help you define the appropriate CPU and memory utilization thresholds for scaling.

Procedure

  1. Open the Job Server Helm values file.

  2. Enable auto-scaling and define utilization parameters.

    Add or update the following configuration under the deployments section:

    Note:

    • Adjust the CPU and memory utilization percentages according to the workload behavior.

    • For high-throughput profiling or ingestion, a lower threshold (for example, 50–60%) helps scale out faster.

    • The minReplicas and maxReplicas values define the lower and upper bounds for the Job Server pods.

  3. Save the configuration file.

  4. Redeploy Data Catalog using Helmfile.

    This command applies the updated Helm values and triggers a rolling update for the Job Server deployment.

  5. Verify that the Horizontal Pod Autoscaler (HPA) is active.

    The output lists the Job Server HPA with its configured CPU and memory thresholds.

  6. Monitor scaling activity.

    As workload intensity increases, additional Job Server pods are created automatically. When utilization decreases, the pods scale in to conserve resources.

Result

Data Catalog dynamically scales the Job Server pods based on CPU and memory utilization thresholds. The scaling behavior ensures optimal resource consumption while maintaining consistent performance for heavy profiling or ingestion workloads.

Example Reference Configuration

Parameter
Value
Description

minReplicas

1

Minimum number of Job Server pods

maxReplicas

100

Maximum number of Job Server pods

targetCPUUtilizationPercentage

80

Scale-out threshold for CPU usage

targetMemoryUtilizationPercentage

80

Scale-out threshold for memory usage
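To see how these values interact, recall that Kubernetes documents the HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), taking the largest result when several metrics are configured. The following Python sketch applies that formula with the reference values above as defaults. It is a simplification: it ignores the HPA tolerance band and stabilization windows, and the function name is illustrative.

```python
import math


def desired_replicas(current: int, cpu_pct: float, mem_pct: float,
                     target_cpu: float = 80, target_mem: float = 80,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Estimate the replica count the HPA would aim for."""
    by_cpu = math.ceil(current * cpu_pct / target_cpu)
    by_mem = math.ceil(current * mem_pct / target_mem)
    # The HPA uses the metric that asks for the most replicas,
    # then clamps the result to the configured bounds.
    wanted = max(by_cpu, by_mem)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods averaging 120% of requested CPU and 60% of requested memory.
    print(desired_replicas(4, cpu_pct=120, mem_pct=60))
```

With a lower target such as 60, the same load yields a higher replica count, which is why lower thresholds scale out sooner.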

Next steps

Adjust jobPoolMaxSize in the configuration to control the number of concurrent jobs per pod. It defines the number of simultaneous jobs that each Job Server pod can execute. For example, if jobPoolMaxSize=10 and there are 4 Job Server pods, up to 40 jobs can run in parallel across the cluster.
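The arithmetic in this example can be written as a short Python sketch, which may help when sizing maxReplicas for a target level of concurrency. The function names are illustrative only.

```python
import math


def max_parallel_jobs(job_pool_max_size: int, pods: int) -> int:
    """Upper bound on jobs running at once across all Job Server pods."""
    return job_pool_max_size * pods


def pods_needed(target_jobs: int, job_pool_max_size: int) -> int:
    """Smallest number of pods that can run target_jobs at the same time."""
    return math.ceil(target_jobs / job_pool_max_size)


if __name__ == "__main__":
    print(max_parallel_jobs(10, 4))
    print(pods_needed(95, 10))
```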


Configure worker profiling overrides for memory management

In Data Catalog, you can use advanced worker-level environment variables to tune profiling behavior and memory usage. These variables help control how structured and unstructured profiling tasks are processed, especially in environments with large datasets or limited system resources.

Note: Adjust these parameters only when you need to optimize memory consumption, control concurrency, or manage the processing of large unstructured files. For production environments, apply these overrides in consultation with engineering or support.

Perform the following steps to configure worker profiling overrides for memory management in Data Catalog:

Procedure

  1. Open a terminal on the Data Catalog deployment server.

  2. Go to the Data Catalog deployment directory.

  3. Open the conf/.env file in a text editor.

  4. Add or update the following environment variables:

    • PDC_WS_MAX_COLUMNS_IN_MEMORY=40: Limits the number of columns processed in memory during profiling. Reducing this value can help control memory usage for wide tables.

    • PDC_WS_PROFILE_HEAVY_THREADS=4: Defines the number of worker threads allocated for heavy profiling tasks. Adjust this value based on available CPU and memory resources.

    • PDC_WS_HEAVY_UNSTRUCTURED_MAX_SIZE=50MB: Sets the maximum file size for heavy unstructured processing. Files larger than this threshold are handled differently to prevent excessive memory consumption.

    • PDC_WS_SKIP_FILE_BITCOUNT=false: Controls whether file bit-count calculations are skipped during profiling. Setting this to false ensures full processing, while true can reduce processing overhead.

  5. Save the file and close the editor.

  6. Restart the Data Catalog services to apply the changes.

Result

The worker profiling configuration is updated. Profiling operations now use the specified memory and concurrency limits.

Note: Improper tuning of these parameters can lead to performance degradation or resource exhaustion. Monitor system performance and logs after applying changes, and revert to default values if issues occur.


Configure the generic metadata worker retry duration

In Data Catalog, as an admin, you can configure the retry duration for the generic metadata importer worker. This setting controls the maximum elapsed time that the worker continues retrying a metadata import job before it stops attempting further retries. By adjusting this parameter, you can control how long Data Catalog attempts to recover from transient failures, such as temporary connectivity issues or short-lived infrastructure disruptions. Increasing the retry duration can help improve resilience in unstable environments. Reducing the value can prevent prolonged retry cycles that consume system resources.

Perform the following steps to configure the generic metadata worker retry duration in Data Catalog:

Procedure

  1. Open a terminal on the Data Catalog deployment server.

  2. Go to the Data Catalog deployment directory.

  3. Open the conf/.env file in a text editor.

  4. Add or update the following environment variable:

    This value specifies the maximum number of hours that the generic metadata importer worker continues retrying a failed import job before terminating the retry process.

  5. Save the file and close the editor.

  6. Restart the Data Catalog services to apply the change.

Result

The generic metadata worker retry duration is configured in Data Catalog. When you rerun the metadata import job, the worker retries failed attempts for up to the number of hours specified in the PDC_WS_DEFAULT_GENERIC_METADATA_IMPORTER_MAX_ELAPSED_HOURS setting.
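A maximum-elapsed-time retry policy can be sketched in Python as follows. This is a generic illustration of the idea, not the worker's actual code: the function name, the exponential backoff, and the delay values are assumptions, and the clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def retry_until_elapsed(job: Callable[[], bool], max_elapsed_hours: float,
                        base_delay_s: float = 1.0, max_delay_s: float = 300.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run job until it succeeds or the elapsed-time budget is used up."""
    deadline = clock() + max_elapsed_hours * 3600
    delay = base_delay_s
    while True:
        if job():
            return True
        if clock() + delay > deadline:
            return False  # the next retry would exceed the budget
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, capped at max_delay_s
```

A larger budget gives transient failures more time to clear, while a smaller one releases the worker sooner when a source is down for an extended period.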


Configure Smart Type to SQL feature in Data Catalog

In Data Catalog, you can use the Smart Type to SQL feature, which converts natural-language text into executable SQL queries within Data Pipes. This feature uses a large-language-model (LLM) service to interpret user input and automatically generate valid SQL statements for the selected database tables.

Perform the following steps to configure the Smart Type to SQL feature in Data Catalog:

Before you begin

  • Ensure your Data Catalog deployment includes the ml-gateway-service.

  • Confirm that the aiml profile is enabled. The feature is unavailable without it.

  • Obtain valid credentials or API keys for your LLM provider.

Procedure

For Docker Compose deployment

  1. Go to the conf/.env file in your Data Catalog installation directory.

  2. Add the following environment variables:

    ML_LLM_MODEL=""
    ML_LLM_API_KEY=""
    ML_LLM_INFERENCE_BASE_URL=""

  3. Save the file and restart the containers for the configuration to take effect.

For Kubernetes deployment

  1. Open the values.yaml file of the ml-gateway-service.

  2. Update the following configuration parameters:

    llmModel:
    llmApiKey:
    llmInferenceBaseUrl:

  3. Save the file and redeploy the ml-gateway-service.

Result

You have successfully configured the Smart Type to SQL feature. When the aiml profile is active, users can enter plain-language prompts in the SQL Editor to generate valid SQL queries automatically.
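If the feature does not respond after a restart, a quick check is whether all three ML_LLM_* settings are present and non-empty in conf/.env. The following Python sketch parses .env-style text and reports which settings are missing. The helper names are hypothetical and the parser handles only simple KEY=value lines.

```python
REQUIRED_LLM_SETTINGS = ("ML_LLM_MODEL", "ML_LLM_API_KEY", "ML_LLM_INFERENCE_BASE_URL")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_llm_settings(settings: dict) -> list:
    """Return the required setting names that are unset or empty."""
    return [name for name in REQUIRED_LLM_SETTINGS
            if not settings.get(name, "").strip()]


if __name__ == "__main__":
    sample = 'ML_LLM_MODEL="my-model"\nML_LLM_API_KEY=""\n'
    print(missing_llm_settings(parse_env_text(sample)))
```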


Configure database password encoding for special characters

By default, the pg-migration service in Pentaho Data Catalog automatically encodes special characters (such as @, :, /, and ?) in PostgreSQL database passwords to ensure they are safely included in the database connection URL, preventing connection failures caused by unescaped reserved characters. If your environment requires passing passwords without URI encoding, you can disable this behavior by setting the PDC_PG_MIGRATIONS_DB_PASSWORD_REQUIRED_ENCODING environment variable to false.

Perform the following steps to disable URI encoding for passwords used by the pg-migration service:

Before you begin

Ensure that you have permission to edit the PDC environment configuration file (.env or .env.default).

Procedure

  1. Open the environment configuration file:

    • For Docker Compose deployment:

    • For Kubernetes deployment, update the value in the values.yaml file of the pg-migrations Helm chart.

  2. Locate or add the following environment variable and set it to false:

  3. Save the file and restart the PDC services to apply the change:

    • Docker Compose:

    • Kubernetes (Helm): Redeploy the pg-migrations release so that the updated values.yaml takes effect:

      Replace <pg-migrations-release-name> and <chart-path-or-repo> with your actual values.

Result

Data Catalog disables URI encoding for PostgreSQL database passwords used by the pg-migration service. The password is passed to the database connection URL as-is.
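To illustrate what URI encoding does to a password, the following Python sketch builds a PostgreSQL connection URL with and without percent-encoding, using urllib.parse.quote from the standard library. It is a generic illustration rather than the pg-migration service's code, and the function name, user, host, and database values are placeholders.

```python
from urllib.parse import quote


def postgres_url(user: str, password: str, host: str, port: int,
                 database: str, encode_password: bool = True) -> str:
    """Build a PostgreSQL URL, percent-encoding the password if requested."""
    # safe="" makes quote() encode every reserved character, including "/".
    secret = quote(password, safe="") if encode_password else password
    return f"postgresql://{quote(user, safe='')}:{secret}@{host}:{port}/{database}"


if __name__ == "__main__":
    print(postgres_url("pdc", "p@ss:w/rd?", "db", 5432, "pdc"))
    print(postgres_url("pdc", "p@ss:w/rd?", "db", 5432, "pdc", encode_password=False))
```

In the unencoded form, the @, :, /, and ? characters in the password are indistinguishable from the URL's own delimiters, so disable encoding only when the password contains none of them or when the consuming component expects the raw value.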

Next steps

If you need to modify additional system-level configuration variables, see Configure system environment variables.


Configure table and column sorting order in Data Canvas

By default, Pentaho Data Catalog displays tables and columns in their ordinal order within the Data Canvas, that is, in the same sequence as they appear in the source database. This ordering helps users analyze the data structure as designed in the original schema. However, you might sometimes prefer to view tables and columns in alphabetical order, which simplifies browsing and locating objects across large schemas. You can modify this behavior by updating the system environment variable PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER in the deployment configuration file.

Perform the following steps to configure table and column sorting order:

Prerequisites

Ensure that you have access to the Data Catalog deployment directory on the server.

Procedure

  1. Log in to the server where Pentaho Data Catalog is installed.

  2. Go to the Data Catalog Docker deployment configuration directory.

  3. Open the .env (environment configuration) file.

    If the .env file does not exist, create it and save the file before proceeding.

  4. Add or update the following environment variable.

  5. Save the file and exit the editor.

  6. Restart the Data Catalog containers for the changes to take effect.

Note: This configuration affects only the display order in the Data Canvas and does not modify metadata, lineage, or profiling results.

Result

Tables and columns in the Data Canvas are now displayed in alphabetical order.
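The difference between the two orderings can be shown with a small Python sketch. The Column type and the "ordinal" and "alphabetical" labels are hypothetical names used only for this illustration; they are not the values accepted by PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    name: str
    ordinal: int  # position of the column in the source table


def sort_columns(columns: List[Column], order: str = "ordinal") -> List[Column]:
    """Return columns in source (ordinal) order or case-insensitive A-Z order."""
    if order == "alphabetical":
        return sorted(columns, key=lambda c: c.name.casefold())
    return sorted(columns, key=lambda c: c.ordinal)


if __name__ == "__main__":
    cols = [Column("order_id", 1), Column("Customer", 2), Column("amount", 3)]
    print([c.name for c in sort_columns(cols)])
    print([c.name for c in sort_columns(cols, "alphabetical")])
```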


Configure the OCR feature in Data Catalog

The Optical Character Recognition (OCR) feature in Data Catalog enables the system to extract and classify text from scanned documents and image files. OCR enhances Document Processing in Data Discovery by automatically identifying sensitive or business-critical text, such as passport numbers, personal names, or identifiers. This text is matched against predefined or user-defined data patterns and then tagged or associated with business glossary terms for consistent governance.

By default, Pentaho Data Catalog uses Tesseract as the OCR engine. For improved accuracy with low-resolution or complex images, you can enable the EasyOCR model.

Perform the following steps to configure OCR in Data Catalog:

Before you begin

  • Verify that you have administrative access to the Pentaho Data Catalog deployment.

  • Ensure that the environment (Docker or EKS) is running a supported version of PDC (10.2.9 or later).

  • Identify the deployment type (Docker-based or EKS-based).

  • Back up your configuration files before making any changes.

Procedure

Perform the following steps to configure the OCR feature in Data Catalog:

For Docker-based deployments

  1. Navigate to the conf/ directory of your Pentaho Data Catalog deployment.

  2. Open the .env file for editing.

  3. Add the following environment variable:

  4. Save the file and restart the services using the following command:

    After the restart, Pentaho Data Catalog uses the EasyOCR model to process scanned documents and images.

Result

Data Catalog now uses the EasyOCR engine for document text recognition. When users perform Data Discovery and then Document Processing, the system extracts text from scanned files and identifies information that matches OCR patterns.


Configure Large Language Models in Data Catalog

In Data Catalog, machine learning–driven document intelligence enables automated understanding and enrichment of unstructured content. Data Catalog applies pre-trained and configurable language models to analyze document content and deliver capabilities such as document classification, address detection, and document summarization.

By default, Data Catalog uses built-in models optimized for enterprise content, and you can also configure a custom or third-party language model to control how document content is processed, analyzed, and enriched across these ML-powered features. Refer to the following procedures to configure an LLM for Docker and Amazon EKS deployments.

For Docker deployment

Perform the following steps to configure a custom or third-party language model in the Data Catalog Docker deployment:

Before you begin

Obtain the following information from your language model provider:

  • Model name or identifier

  • Inference endpoint base URL

  • API key

Procedure

  1. Log in to the server where Pentaho Data Catalog is installed.

  2. Go to the Data Catalog Docker deployment configuration directory.

  3. Open the .env (environment configuration) file.

    If the .env file does not exist, create it and save the file before proceeding.

  4. Add or update the following environment variables based on your deployment requirements:

    • ML_USE_LLM_FOR_CONTENT_PROCESSING: Enables language model–based content processing for document classification. Set this value to true.

    • ML_LLM_MODEL: Specifies the name or identifier of the language model to use.

    • ML_LLM_API_KEY: Specifies the API key used to authenticate with the language model service.

    • ML_LLM_INFERENCE_BASE_URL: Specifies the base URL for the language model inference endpoint.

    • ML_CLASSIFICATION_THRESHOLD: Minimum confidence score required to accept a model-generated classification.

    • ML_TERM_MATCHING_THRESHOLD: Minimum semantic similarity score required to assign a user-defined term to a document.

    Example:

  5. After adding all required variables, save the changes and restart the Data Catalog services to apply the configuration.

Result

You have configured the specified language model for AI-assisted document processing features in the Docker deployment. Additionally, you can fine-tune the prompts used by AI-assisted document processing features to better align with your organization’s requirements. The default prompts are defined in the following file:


Connect to Business Intelligence Database (BIDB)

Data Catalog includes the Business Intelligence Database (BIDB) server, which stores and manages reporting metadata. Depending on your PDC version, BIDB is implemented using either PostgreSQL (PDC 10.2.5 and later) or MongoDB (PDC 10.2.1). You can use the respective connection methods and connect to the BIDB server to access reporting data and build dashboards. See the Reporting and data visualization section in the Use Pentaho Data Catalog guide for details about BIDB and the components available in BIDB.

PDC 10.2.5 and later (PostgreSQL)

Beginning with PDC 10.2.5, BIDB has been migrated from MongoDB to PostgreSQL, providing a relational database structure that improves query performance and enhances compatibility with broader tool sets.

Perform the following steps to connect to BIDB (PostgreSQL) in PDC 10.2.5 and later:

  1. Locate the BIDB credentials in the PDC server:

    1. Navigate to the /vendor/.env.default file.

    2. Identify the variables beginning with POSTGRES_BIDB_USER_*.

    3. To list the values, run: grep 'POSTGRES_BIDB' .env.default

  2. Install the required PostgreSQL driver:

    • For JDBC, download the PostgreSQL JDBC driver from the official PostgreSQL site.

    • For ODBC, install the PostgreSQL ODBC driver (psqlODBC) on your system.

  3. Configure your connection:

    • JDBC connection string format: jdbc:postgresql://<HOSTNAME>:5432/bidb (for example, jdbc:postgresql://pdc.pentaho.com:5432/bidb)

    • ODBC DSN settings:

      • Server: <HOSTNAME>

      • Port: 5432

      • Database: bidb

      • Username: bidb_ro

      • Password: ${POSTGRES_BIDB_USER_PASSWORD}

  4. Save the configuration in your reporting or analytics tool.

  5. Test the connection to confirm access.

Important:

  • It is best practice not to hardcode credentials in your application. Instead, reference the environment variables (POSTGRES_BIDB_USER_*) to ensure secure and flexible credential management.

  • Ensure that the .env.default file is stored securely and is not shared publicly.
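The following Python sketch shows one way to follow this advice: it builds the JDBC URL from the host name and reads the password from the POSTGRES_BIDB_USER_PASSWORD environment variable instead of embedding it. The function names are illustrative, and the defaults (port 5432, database bidb, user bidb_ro) are the values listed in step 3.

```python
import os
from typing import Mapping, Tuple


def bidb_jdbc_url(host: str, port: int = 5432, database: str = "bidb") -> str:
    """Build the JDBC connection string for the PostgreSQL-based BIDB."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def bidb_credentials(env: Mapping[str, str] = os.environ,
                     user: str = "bidb_ro") -> Tuple[str, str]:
    """Return (user, password), reading the password from the environment."""
    password = env.get("POSTGRES_BIDB_USER_PASSWORD")
    if not password:
        raise KeyError("POSTGRES_BIDB_USER_PASSWORD is not set")
    return user, password


if __name__ == "__main__":
    print(bidb_jdbc_url("pdc.pentaho.com"))
```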

PDC versions prior to 10.2.5 (MongoDB)

In PDC versions prior to 10.2.5, the Business Intelligence Database (BIDB) is implemented using MongoDB. To connect to the BIDB server, you can use either the Java Database Connectivity (JDBC) connector or the Open Database Connectivity (ODBC) connector, depending on your application or reporting tool requirements.

Configure the Java Database Connectivity (JDBC) connector

Perform the following steps to configure the JDBC connector for connecting to BIDB:

  1. Download the MySQL JDBC Connector JAR file from the MySQL website after selecting the appropriate version for the operating system.

  2. Download the DBeaver application from the DBeaver website and install it on the system. See DBeaver installation for more details.

  3. To add the MySQL JDBC Driver and MySQL authentication plugin to DBeaver, open DBeaver and go to Database > Driver Manager.

  4. Click New to add a new driver.

  5. Select MySQL from the list and enter a name for the driver.

  6. Click Browse to locate and select the downloaded JDBC driver (JAR file) and the MySQL authentication plugin, then click OK or Finish to add the driver.

  7. After adding the MySQL driver, to create a New Connection, go to the DBeaver home page, click New Database Connection, and select MySQL as the database type.

  8. Enter the MySQL server connection details, such as host, port, username, password, and so on.

  9. Specify the jars in the local client configuration as shown in the following section.

  10. Click Test Connection to verify the connection is working.

  11. Click Finish to save the connection configuration.

Configure the Open Database Connectivity (ODBC) connector

The MongoDB ODBC connector allows you to connect tools that support ODBC to MongoDB and query the data using SQL. Perform the following steps to configure the ODBC connector for connecting to BIDB:

  1. Download and install the MongoDB ODBC connector.

    See MongoDB BI Connector ODBC Driver for more information.

  2. Download and install an ODBC driver manager on your system.

    For example, on the Windows operating system, you can use the default Windows ODBC Data Source Administrator.

  3. Open the ODBC Data Source Administrator on your machine and go to the System DSN tab.

  4. Click Add to add a new data source and select the MongoDB Driver.

  5. To configure the DSN (Data Source Name) settings:

    1. Set the server field to the address of your MongoDB server.

    2. Enter the port number if it differs from the default (27017).

    3. Enter the authentication details (username and password).

    4. As a part of the connection details, enter the plugin directory details.

    5. Set the SSL Mode to Disabled in the SSL configuration.

  6. Click Test to verify that the connection is working.

  7. Click OK to save the connection configuration.

Configure a machine learning (ML) server connection in Data Catalog

You can connect a machine learning (ML) server to Data Catalog and import ML model server components into the ML Models hierarchy. Supported server types include:

  • Pre-Production Model Servers such as MLflow, which capture experiments, runs, versions, and artifacts.

  • Production Model Servers such as NVIDIA Triton, which provide model deployment, inference statistics, and operational metrics.

Once configured, the ML server appears under the Synchronize card in the Management section of Data Catalog, allowing you to import model components into the ML Models hierarchy. For more information about ML Models, see the ML Models section in Use Pentaho Data Catalog.

Configure an MLflow server connection

Perform the following steps to configure a pre-production MLflow server connection in Data Catalog.

Before you begin:

  • Make sure you have access to the MLflow server you want to connect to.

  • If the MLflow server requires authentication, make sure you have the necessary credentials, either a valid username and password or an access token.

Procedure

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If the file does not exist, create it.

  2. Open the external-data-source-config.yml file and add the ML server configuration:

Parameter
Description
Example

id

Unique identifier (UUID) for the ML server.

916d3b20-7fd6-49d2-b911-cc051f56e837

name

Display name for the server. This name appears in the UI.

MLflowServer

type

Type of server (enum value). For ML server, use ‘MlFlow’.

URL

The base URL of the ML server.

http://mlflow.mycompany.com

config

Configuration keys specific to the ML server you are configuring. Include one of the following only if authentication is enabled:

  • Username and password

  • Access token

3. After configuring the ML server in the YAML file, restart the PDC services to apply the changes.

You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.
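The id parameter in the configuration must be a UUID. If you do not already have one, the following Python sketch generates a random UUID with the standard uuid module and checks that a value is in the canonical hyphenated form shown in the example column. The function names are illustrative.

```python
import uuid


def new_server_id() -> str:
    """Generate a random (version 4) UUID string for the id parameter."""
    return str(uuid.uuid4())


def is_valid_server_id(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


if __name__ == "__main__":
    print(new_server_id())
```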


Configure a Triton server connection

Perform the following steps to configure a production Triton inference server connection in Data Catalog:

Before you begin:

  • Ensure the Triton inference server is running and accessible.

  • If running Triton in Docker, confirm that the ports are exposed (HTTP, gRPC, Metrics). For example:

Procedure

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If the file does not exist, create it.

  2. Open the external-data-source-config.yml file and add the Triton server configuration:

Parameter
Description
Example

id

Unique identifier (UUID) for the Triton server.

44e2fa51-e3af-4094-8dd3-c62320952de5

name

Display name for the server. This name appears in the UI.

Triton-Prod-server

type

Type of server (enum value). For Triton server, use ‘Triton’.

Triton

URL

The base URL where the Triton server is deployed.

http://192.168.0.10

config.metadata_port

HTTP port configured when deploying the Triton server. The default port is 8000.

8000

config.metrics_port

Metrics port configured by the user when deploying the Triton server. The default is 8002.

8002

3. Save the YAML file and restart the PDC services to apply the changes.

You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.
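If the import does not find the server, it can help to confirm that both Triton ports are reachable from the Data Catalog host. The following Python sketch builds the readiness and metrics URLs from the URL, config.metadata_port, and config.metrics_port values and probes them with the standard library. It relies on Triton's standard /v2/health/ready and /metrics endpoints; the function names are illustrative.

```python
import urllib.error
import urllib.request


def triton_endpoints(base_url: str, metadata_port: int = 8000,
                     metrics_port: int = 8002) -> dict:
    """Build the readiness and metrics URLs for a Triton server."""
    base = base_url.rstrip("/")
    return {
        "ready": f"{base}:{metadata_port}/v2/health/ready",
        "metrics": f"{base}:{metrics_port}/metrics",
    }


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for name, url in triton_endpoints("http://192.168.0.10").items():
        print(name, url, is_reachable(url))
```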


Configure a Tableau server connection in Data Catalog

You can configure a connection between a Tableau server and Data Catalog to import Tableau metadata such as dashboards, workbooks, projects, and data sources into the Business Intelligence (BI) section of Data Catalog. To learn more, see the Business Intelligence section in the Use Pentaho Data Catalog guide. This section describes the procedure to configure the Tableau server connection in Data Catalog.

Before you begin:

  • Make sure you have access to the Tableau Cloud or Tableau Server instance you want to connect to. The URL format looks like:

  • Identify the Site ID for the Tableau site. For Tableau Cloud, you can find this in the URL after /site/.

  • Generate a valid Personal Access Token (PAT) in Tableau, including PAT name and PAT secret.

Perform the following steps to configure a connection between the Tableau server and Data Catalog:

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If the file does not exist, create it.

  2. Open the external-data-source-config.yml file and add the Tableau server configuration:

    Parameter
    Description
    Example

    id

    The site ID (unique identifier) of the Tableau site to connect to, as seen in the Tableau Cloud URL.

    dev-8f012f9ca7

    name

    Display name for the server. This name appears in the UI.

    TableauServer

    type

    Type of server (enum value). For Tableau server, use ‘Tableau’.

    Tableau

    URL

    The Tableau REST API authentication endpoint. Use the signin endpoint for the Tableau site.

    https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin

    config

    Configuration keys specific to the Tableau server you are configuring.

    - pat_name

    The name of the Tableau Personal Access Token (PAT) used for authentication.

    - pat_secret

    The secret key associated with the PAT. Ensure this is stored securely and never exposed.

  3. After configuring the Tableau server in the YAML file, restart the following PDC services to apply the changes:

    • Frontend service (fe)

    • Worker service (ws-default)

You have successfully configured the Tableau server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.

You can now import Tableau server components into the Business Intelligence hierarchy of Data Catalog.
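
To see how the site ID, pat_name, and pat_secret values relate to the sign-in endpoint, the following Python sketch builds the JSON body that the Tableau REST API expects for Personal Access Token sign-in. It is a minimal illustration, not a description of how Data Catalog calls Tableau internally. The function name build_pat_signin_body and the sample token values are hypothetical.

```python
import json


def build_pat_signin_body(pat_name: str, pat_secret: str, site_id: str) -> str:
    """Build the JSON body for a Tableau REST API PAT sign-in request."""
    body = {
        "credentials": {
            "personalAccessTokenName": pat_name,
            "personalAccessTokenSecret": pat_secret,
            # The site ID is the value that follows /site/ in the Tableau Cloud URL.
            "site": {"contentUrl": site_id},
        }
    }
    return json.dumps(body)


# Hypothetical placeholder values; never hard-code a real PAT secret.
print(build_pat_signin_body("pdc-token", "<pat-secret>", "dev-8f012f9ca7"))
```

Sending this body in a POST request to the sign-in URL with Content-Type: application/json is a quick way to confirm that the token and site ID are valid before you restart the PDC services.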


Configure a Power BI service connection in Data Catalog

You can connect the Microsoft Power BI service to Data Catalog and import Power BI metadata into the Business Intelligence section. This integration lets you discover, explore, and manage Power BI reports, datasets, and workspaces directly from Data Catalog. This guide provides step-by-step instructions to configure a Power BI server connection in Data Catalog using either username and password–based authentication or Service Principal (SPN) authentication.

Before you begin:

  • Ensure that you have a valid Microsoft account with an active Power BI service license. Contact your Microsoft administrator if you need access.

  • Register an Azure Active Directory (Azure AD) application in the Azure portal. For more information, see the Microsoft guide to register an app.

  • Generate a client secret in the Azure AD application and store it securely. For more information, see the Microsoft guide to add client secrets.

  • Assign the following API permissions to the Azure AD app:

    Permission
    Description

    App.Read.All

    View all Power BI apps

    Capacity.Read.All

    View all capacities

    Dashboard.Read.All

    Read dashboards

    Dataflow.Read.All

    Read dataflows

    Dataset.Read.All

    View all datasets

    Report.Read.All

    Read reports

    Tenant.Read.All

    View all content in tenant

    Workspace.Read.All

    View all workspaces

    See Power BI automation permissions for more information.

  • Ensure that a service account is available with access to all Power BI workspaces that you need to integrate with Data Catalog.

  • Confirm outbound HTTPS access to https://api.powerbi.com on port 443 from the Data Catalog server.

  • Collect the following details before continuing:

    Field
    Description

    Client ID

    From Azure AD App registration

    Client Secret

    Securely generated during Azure AD registration

    Tenant ID

    Azure AD Directory unique identifier

    OAuth2 Token URL

    https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token

    API Host URL

    https://api.powerbi.com

    Username

    Credentials for the Power BI server.

    Password

    Credentials for the Power BI server.

Note:
  • Admin consent must be granted for all permissions during Azure AD app setup.

  • A Power BI Pro or Premium Per User (PPU) license is required to use the Power BI REST APIs for integration with Data Catalog.

  • For username and password–based authentication, Data Catalog uses the Resource Owner Password Credentials (ROPC) grant type.

Perform the following steps to configure a connection between the Power BI server and Data Catalog:

  1. Connect to the Data Catalog server using SSH.

    1. From your local machine, open a terminal (for Linux or macOS) or an SSH client, such as PuTTY (for Windows).

    2. Enter the SSH command to connect to the server where Pentaho Data Catalog is installed.

      Replace <username> with your server login account and <server-ip> with the server’s IP address or hostname.

    3. When prompted, enter the password for the specified account. After a successful login, you will have access to the Data Catalog server’s command line.

  2. Navigate to the configuration folder and verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If it does not exist, create it.

  3. Open the external-data-source-config.yml file and add the Power BI server configuration. Choose one of the following authentication options based on your environment.

    1. Username and password–based authentication (ROPC)

      Use this option when connecting with a Power BI service account that uses username and password authentication.

    2. Service Principal (SPN)–based authentication

      Use this option when connecting with an Azure AD service principal. Service Principal (SPN) authentication is recommended for production environments because it avoids storing user credentials and aligns with Azure security best practices. This option does not require a username or password.

      Parameter
      Description
      Example

      id

      The unique identifier for the Power BI server connection. You define this value in the YAML.

      dev-powerbi01

      name

      Display name for the server. This name appears in the Data Catalog UI.

      PowerBIServer

      type

      Type of server (enum value). For Power BI server, use ‘PowerBI’.

      PowerBI

      URL

      The Microsoft identity platform OAuth 2.0 token endpoint. Used for authentication requests.

      https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token

      config

      Configuration keys specific to the Power BI server you are configuring.

      - client_id

      The application (client) ID generated during Azure AD app registration.

      12345678-abcd-1234-abcd-1234567890ab

      - username

      The Microsoft account username with access to the Power BI service.

      - password

      The password associated with the Microsoft account username. Ensure this is stored securely.

      - client_secret

      The client secret generated for the Azure AD app. Ensure this is stored securely and never exposed.

      abcdEFGH12345!@#xyz

      - tenant_id

      The Azure Active Directory tenant ID used for Service Principal authentication.

      contoso.onmicrosoft.com

      - host_url

      The base API endpoint for Power BI service.

      https://api.powerbi.com

  4. After configuring the Power BI server in the YAML file, restart the following PDC services to apply the changes:

You can now import Power BI server components into the Business Intelligence hierarchy of Data Catalog.
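
The two authentication options differ mainly in the OAuth 2.0 grant type sent to the token endpoint. The following Python sketch shows the form fields for each option. It is an illustration under stated assumptions: the scope value is the standard Power BI API scope and does not appear in this topic, and the helper names token_url and token_form are hypothetical. It does not describe Data Catalog internals.

```python
from typing import Optional

# ASSUMPTION: standard Power BI REST API scope; not taken from this topic.
POWER_BI_SCOPE = "https://analysis.windows.net/powerbi/api/.default"


def token_url(tenant_id: str) -> str:
    """Build the OAuth2 token URL from the tenant ID."""
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"


def token_form(
    client_id: str,
    client_secret: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Return the form fields for ROPC (with user credentials) or SPN (without)."""
    form = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": POWER_BI_SCOPE,
    }
    if username is not None and password is not None:
        # Username and password-based authentication (ROPC).
        form.update(grant_type="password", username=username, password=password)
    else:
        # Service Principal (SPN) authentication needs no user credentials.
        form["grant_type"] = "client_credentials"
    return form


print(token_url("contoso.onmicrosoft.com"))
print(token_form("12345678-abcd-1234-abcd-1234567890ab", "<client-secret>")["grant_type"])
```

Posting these fields as application/x-www-form-urlencoded data to the token URL is a way to verify the Azure AD app registration independently of Data Catalog.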


Configure the Physical Assets service in Data Catalog

In Pentaho Data Catalog, you can import operational technology (OT) components, including device services, locations, devices, and values, and view them in the Physical Assets section of Data Catalog in a hierarchical structure. With the Physical Assets feature, you can understand how data flows from physical sources into analytical systems, enabling better traceability and context. Additionally, users can enrich asset nodes with business terms, policies, lineage, and metadata to strengthen data governance and compliance. For more information, see Physical Assets in the Use Pentaho Data Catalog guide.

To use the Physical Assets feature in Data Catalog, you must first configure the Physical Assets service. This involves completing the following procedures:

Note: The configuration steps assume Data Catalog is already installed. For installation instructions, see Install Data Catalog in the Install Pentaho Data Catalog guide.

Enable the Physical Assets service in Data Catalog

Perform the following steps to enable the Physical Assets service in the existing Data Catalog deployment.

  1. Log in to the server where Data Catalog is installed.

  2. Go to the Data Catalog Docker deployment configuration directory.

  3. Open the .env (environment configuration) file.

    If the .env file does not exist, create it and save the file before proceeding.

  4. Add or update the following lines.

  5. Add the Pentaho Edge connection details:

    Replace <PE-IP> with the IP where Pentaho Edge is installed.

  6. Restart Data Catalog to apply changes:

You have successfully enabled the Physical Assets service in the Data Catalog deployment. The service is now active and ready to connect with Pentaho Edge to receive physical assets metadata.


Configure Pentaho Edge for the Physical Assets service

Perform the following steps to configure Pentaho Edge to connect it to Data Catalog.

  1. Clone the Pentaho Edge installer repository and navigate to the installer folder:

  2. Edit the docker-compose-pentaho-edge.yml file:

  3. Save and close the docker-compose-pentaho-edge.yml file.

  4. Open the .env file:

  5. Update the following properties:

    Note:

    • Replace the <PDC-IP> address of the URL with the IP where pdc-docker-deployment is installed.

    • Use the FQDN instead of the IP address if needed.

  6. Add authentication properties:

  7. Run the Edge installer script:

  8. When prompted, provide a user ID and password.

You have successfully configured Pentaho Edge to support the Physical Assets hierarchy and configured the connection to Data Catalog. You can now view OT assets in the Physical Assets section of Data Catalog.


Configure PDI to send lineage to Data Catalog

You can configure Pentaho Data Integration (PDI) to write lineage metadata for supported lineage events into the Data Catalog metadata store. Data Catalog continuously runs an API that reads the lineage information from PDI. Both PDI and Data Catalog support the OpenLineage framework for data lineage collection and analysis.

See Data lineage in the Use Pentaho Data Catalog guide for information on the specific lineage events that are supported.

Note: You must perform these steps on PDI.

Before you begin this task, turn off PDI and the Pentaho Server.

Perform the following steps to set up PDI to send lineage metadata to Data Catalog.

  1. On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet. If you don't have credentials, contact your PDI administrator.

  2. On the Pentaho card, click Download.

  3. Navigate to the Marketplace location with plugin downloads.

  4. Download the PDI OpenLineage plugin.

  5. Extract the downloaded package.

  6. Run the installer for PDI:

    1. Run install.sh if on Linux, or install.bat if on Windows.

    2. Install in the <data-integration> folder.

  7. Run the installer for Pentaho Server:

    1. Run install.sh if on Linux, or install.bat if on Windows.

    2. Install in the <pentaho-server> folder.

      Note: To view lineage information in Data Catalog, ensure that the data connections and tables used in your PDI transformations are already mapped in the Data Canvas. If no matching connections exist, they are created by the metadata push through the APIs, for lineage purposes only.

  8. Create a config.yml file, adding the correct users and passwords for your environment, and the URL for Data Catalog.

    There is an example in the readme.txt file:

  9. Edit the ~/.kettle/kettle.properties file and add the following properties:

  10. Restart PDI and the Pentaho Server.

Integrate Active Directory with Pentaho Data Catalog

You can integrate Microsoft Active Directory (AD) with Pentaho Data Catalog (PDC) to enable users of AD to have single sign-on access to PDC. Part of this integration includes configuring the Keycloak identity and access management tool to use AD as an identity provider.

The configuration includes the following topics:

Note: After importing AD users to PDC, you need to perform the following operations from Active Directory, because they can no longer be done from the Data Catalog User Management page:

  • Edit a user

  • Add a new user

  • Delete a user

Verify the LDAP server configuration

To integrate Active Directory with Pentaho Data Catalog, you need to integrate Lightweight Directory Access Protocol (LDAP) with Keycloak. You first need to check that your LDAP server is configured correctly.

For detailed information on how to configure LDAP in your environment, consult your LDAP server documentation.

You should have the following components in an example configuration:

  • Base DN: Base Distinguished Name, such as: dc=example,dc=com, where dc is the domain component. The Base DN is the root entry where you want to start your LDAP searches.

  • User DN: User Distinguished Name, such as: ou=users,dc=example,dc=com, where ou is the organizational unit and dc is the domain component.

  • Groups DN: Groups Distinguished Name, such as: ou=groups,dc=example,dc=com, where ou is the organizational unit and dc is the domain component.

Next steps

Configure the LDAP provider

To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure the LDAP provider for PDC in the Keycloak interface.

Use the following steps to configure the LDAP provider:

  1. Navigate to your Keycloak admin console (such as https://<FQDN>/keycloak/) and log in with admin credentials.

  2. Select the PDC realm.

    If you haven't already configured an LDAP provider, click Add provider and select ldap. If you have an existing LDAP provider, click on it to edit.

  3. Enter the following information for the LDAP provider:

    Field
    Value

    Vendor

    Active Directory

    Connection URL

    ldap://<LDAP_SERVER>:<PORT>, such as: ldap://localhost:389

  4. Click Test connection.

    You should get a success message.

  5. Enter the following information on the remainder of the page:

    Field
    Value

    Bind type

    Select simple

    Bind DN

    DN for your LDAP admin user, such as: cn=administrator,dc=example,dc=com

    Bind credentials

    Password for the LDAP admin user

  6. Click Test authentication.

    You should get a success message.

The LDAP provider is configured for use with AD.

Connect to AD using the LDAP server's SSL certificate (Optional)

When you use an LDAP server with Pentaho Data Catalog (PDC), you can use the LDAP server's SSL certificate to securely connect to Active Directory (AD). This is an optional step in integrating AD with PDC.

For more information on integrating AD with PDC, see Integrate Active Directory with Pentaho Data Catalog. Refer to the Keycloak documentation if necessary.

Perform the following steps to use the LDAP server's SSL certificate to connect to AD.

  1. To retrieve the certificate from your LDAP server, enter the following command:

    openssl s_client -connect ldap.example.com:636 -showcerts

  2. Copy the entire certificate chain (from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----) and save it to a file, such as ldap-cert.pem.

  3. Update the <PDC_INSTALL_LOCATION>/conf/extra-certs/bundle.pem file with the LDAP server’s SSL certificate, where <PDC_INSTALL_LOCATION> is the directory where PDC is installed.

  4. Restart PDC services by entering the following command:

    sh pdc.sh restart

  5. Log in to the Keycloak admin console (https://<FQDN>/keycloak/).

  6. Navigate to the PDC realm.

  7. Click User Federation.

  8. Click the LDAP provider to edit it.

  9. Enter the following LDAP settings:

Field
Value

UI display name

Name to display, such as LDAPS

Vendor

Select Active Directory

Connection URL

ldaps://<LDAP_SERVER>:<PORT>, such as: ldaps://ldap.example.com:636

10. Click Test connection. You should see a success message.

11. Enter the remaining LDAP connection and authentication settings:

Field
Value

Bind type

Select simple

Bind DN

DN to bind to the LDAP server, such as: cn=admin,dc=example,dc=com.

Bind credentials

Password for the Bind DN

12. Click Test authentication. You should see a success message.

13. Enter values for the required LDAP searching and updating settings:

Field
Value

Edit mode

It is a best practice to set this to Readonly

Users DN

Specify the DN where the user entries are located, such as: ou=users,dc=example,dc=com

Username LDAP attribute

cn

RDN LDAP attribute

cn

UUID LDAP attribute

objectGUID

User object classes

person, organizationalPerson, user

14. Click Save to save the configuration.

AD is set up to use the SSL certificate of the LDAP server for a secure connection.

Optionally, you can configure the following settings:

  • Set how often Keycloak should sync with LDAP.

  • Set periodic full sync and periodic changed users sync.
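
Steps 1 through 3 of this procedure (retrieving the certificate chain, saving it, and updating bundle.pem) can be sketched in Python as follows. This is a minimal illustration that works on text you have already captured from the openssl command. Appending to the existing bundle rather than replacing it is an assumption, and the helper names are hypothetical.

```python
import re

# Matches each PEM block, including the BEGIN and END marker lines.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def extract_certificates(openssl_output: str) -> list[str]:
    """Return every PEM certificate block found in openssl s_client output."""
    return PEM_BLOCK.findall(openssl_output)


def append_to_bundle(bundle_text: str, certs: list[str]) -> str:
    """Return the bundle text with any certificates not already present appended."""
    out = bundle_text.rstrip("\n")
    for cert in certs:
        if cert not in out:
            out = (out + "\n" if out else "") + cert
    return out + "\n"


# Shortened, fake certificate bodies used only to demonstrate the parsing.
sample_output = (
    "Certificate chain\n 0 s:CN=ldap.example.com\n"
    "-----BEGIN CERTIFICATE-----\nMIIBfake1\n-----END CERTIFICATE-----\n"
    " 1 s:CN=Example CA\n"
    "-----BEGIN CERTIFICATE-----\nMIIBfake2\n-----END CERTIFICATE-----\n"
)
certs = extract_certificates(sample_output)
print(len(certs))
```

Back up the existing bundle.pem file before writing the updated text to it, and then restart the PDC services as described in step 4.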

Configure LDAP mappers in Keycloak

To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure LDAP mappers so that PDC has the necessary information (such as usernames, email addresses, or group memberships) to connect to an LDAP directory.

The Keycloak LDAP mapper translates attributes stored in an LDAP directory into the corresponding attributes needed by PDC.

Use the following steps in Keycloak to configure the LDAP mappers for Data Catalog. See the Keycloak documentation for more information.

  1. In your Keycloak admin console (such as https://<FQDN>/keycloak/), log in with admin credentials.

  2. Select the PDC realm and go to the User Federation settings.

  3. Click the LDAP provider and go to the Mappers tab.

  4. Map the LDAP attribute for the username.

  5. Map other user attributes as needed (such as email, first name, last name).

  6. To add additional mappers to assign default roles for the users being imported from AD, enter the following settings under User federation > Settings > Mapper details.

    For the Business User role:

    Field
    Value

    Name

    Business_User_Role_Mapper_To_LDAP_USERS

    Mapper type

    hardcoded-ldap-role-mapper

    Role

    Business_User (select from menu and click Assign)

  7. Click Save.

  8. Repeat step 6 for the Data User role, using the following values:

    Field
    Value

    Name

    Data_User_Role_Mapper_To_LDAP_USERS

    Mapper type

    hardcoded-ldap-role-mapper

    Role

    Data_User (select from menu and click Assign)

  9. Click Save.

  10. Save the configuration.

  11. From the Action menu, click Sync all users to import users from LDAP.

    A success message displays.

    When users are synced from AD, the default PDC realm assigns the Business User and Data User roles to all the users.

    Note: PDC applies limits for licensing when users receive one or more of the following roles:

    • Business Steward

    • Data Steward

    • Data Developer

    • Admin

  12. Go to Users and verify that the users were imported correctly into Keycloak.

Configure PDC permissions for an AD user

The last step in integrating Active Directory (AD) with Pentaho Data Catalog (PDC) is to set up permissions in PDC for the AD users.

Use the following steps to create and verify an AD user.

  1. Log in to Data Catalog as the admin user.

  2. Click Management, and on the Users & Communities card, click Users.

  3. Check that the imported users display correctly and make any needed adjustments.

  4. Select an AD user and assign a community or role to the user.

  5. Click Save, and log out.

  6. Log in as the AD user to verify the login is working properly.

Integrate Okta with Pentaho Data Catalog

You can integrate Okta authentication with Pentaho Data Catalog for the added security provided by multi-factor authentication. To integrate Okta with Pentaho Data Catalog, you need to configure Okta in parallel with the Keycloak identity and access management tool.

The steps in the integration process are:

Add an OIDC provider in Keycloak

To integrate Okta with Pentaho Data Catalog, you need to set up an identity provider in Keycloak. Keycloak uses the OpenID Connect (OIDC) protocol to connect to identity providers.

If necessary, see the Keycloak documentation to complete this task.

Perform the following steps in the Keycloak interface:

  1. Log in to Keycloak and select the PDC realm.

    If a PDC realm does not already exist, consult your PDC administrator or see Creating a realm in the Keycloak documentation to create one.

  2. Click Identity Providers and select OpenID Connect v1.0.

    If necessary, see OpenID Connect v1.0 identity providers in the Keycloak documentation.

  3. Use the following steps to add an OIDC ID provider:

    1. Enter an alias in the Alias field.

      This populates the Redirect URI field, in a format like the following:

    2. Copy the Redirect URI to be used in the next task, Add an OpenID Connect application in Okta.

Perform the following tasks:

Add an OIDC application in Okta

The next step in integrating Okta with Pentaho Data Catalog is to add an OpenID Connect (OIDC) application in Okta.

Before beginning this task, make sure to perform this task:

In this task, you need the Keycloak Identity Provider Redirect URI you copied in the Add an OpenID Connect provider in Keycloak task. If necessary, see Launch the wizard in the Okta documentation.

Perform the following steps in the Okta Admin console:

  1. From the left menu, click Applications and then Applications.

  2. Click Create App Integration.

  3. Select or enter the following values:

    Field
    Value

    Sign-in method

    OIDC – OpenID Connect

    Application type

    Web Application

    App integration name

    CatalogPlus_10.2.1

    Grant type

    Authorization Code

    Sign-in redirect URIs

    Keycloak Identity Provider Redirect URI copied in the Add OpenID Connect provider in Keycloak task

    https://<application_url>/keycloak/realms/<realm_name>/broker/<alias_name>/endpoint/logout_response

  4. For Sign-out redirect URIs, configure your logout URI in this format:

    For example:

  5. Continue entering values in Okta screens:

    Field
    Value

    Controlled access

    Select the default value, Allow everyone in your organization to access

    Enable immediate access

    Clear the checkbox

  6. In the General tab, make a note of the Client Id and Client Secret to use in the Configure an identity provider in Keycloak task.

  7. Click Save.

  8. On the left menu, click Applications.

  9. From the down arrow, select Assign to Groups.

  10. Assign a group to your application.
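
The redirect URIs used in this task follow a predictable pattern based on the application URL, realm name, and identity provider alias. The following Python sketch composes them. The logout_response format comes from this topic. The plain broker endpoint (without the logout_response suffix) is an assumption based on the usual Keycloak broker URL layout, so compare it with the Redirect URI that Keycloak displays. The function names and sample values are hypothetical.

```python
def broker_endpoint(application_url: str, realm_name: str, alias_name: str) -> str:
    """Compose the Keycloak broker endpoint for an identity provider alias."""
    base = application_url.rstrip("/")
    return f"{base}/keycloak/realms/{realm_name}/broker/{alias_name}/endpoint"


def logout_response_uri(application_url: str, realm_name: str, alias_name: str) -> str:
    """Compose the logout response URI in the format shown in this task."""
    return broker_endpoint(application_url, realm_name, alias_name) + "/logout_response"


# Hypothetical host, realm, and alias values.
print(broker_endpoint("https://pdc.example.com", "pdc", "okta"))
print(logout_response_uri("https://pdc.example.com", "pdc", "okta"))
```

The alias is case-sensitive in the URL, so use it exactly as entered in the Alias field in Keycloak.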

Perform the following tasks:

Set up security in Okta

When integrating Okta with Pentaho Data Catalog, you need to set up security in Okta for the connection to PDC.

Before beginning this task, make sure to perform these tasks:

Perform the following steps in the Okta admin console:

  1. On the left menu, click Security, then API, then Default.

  2. On the Access Policies tab, click Add New Access Policy.

  3. Add details for the policy and click Create Policy.

  4. Click Add Rule.

  5. Add details for the rule and click Create Rule.

Perform the following tasks:

Configure an identity provider in Keycloak

To integrate Okta with Pentaho Data Catalog, you need to configure an identity provider in Keycloak. If necessary, see the Keycloak documentation.

Before beginning this task, make sure to perform these tasks:

In this task, you need the Client Id and Client Secret you noted during the Add an OpenID Connect application in Okta task.

Perform the following steps in the Keycloak admin console:

  1. From the left menu, click Identity providers.

  2. Click OpenID Connect v1.0.

  3. Make sure the Use discovery endpoint switch is on.

  4. In the Discovery endpoint field, enter the discovery endpoint URL in the following format:

    https://hostname/auth/realms/master/.well-known/openid-configuration

    The Authorization URL, Token URL, Logout URL, User Info URL, and other fields populate automatically.

  5. Enter the Client Id and Client Secret noted during the Add an OpenID Connect application in Okta task.

  6. On the Settings tab, select the following settings:

    • First login flow override: First login flow override

    • Sync mode: Force

  7. Expand the Advanced settings and set the Scopes setting to openid email profile (separated by a single space).

  8. Click Save.

Perform the following task:

Sign in to Pentaho Data Catalog using Okta

After Pentaho Data Catalog is integrated with Okta, you have the option to log in to PDC with Okta.

Before beginning this task, make sure to perform these tasks:

To log in to PDC using Okta, perform the following steps:

  1. On the PDC login screen, click the button corresponding to the Okta alias.

    Note: The alias matches whatever is set for Okta OpenID Connect in Keycloak.

    In the following example, the button is labeled CATALOG+OKTA:

    Figure: Updated PDC login screen after Okta integration

  2. On the Okta login screen that appears, enter the credentials for the Okta user assigned to PDC.

    Okta prompts you to enter a code.

  3. To finish logging in, enter the code that Okta provides.

You have completed the integration of Okta with PDC.

Integrate an identity provider with Data Catalog using Keycloak (SAML 2.0)

You can integrate an external identity provider (IdP), such as PingFederate or Ping Identity, with Data Catalog by configuring Keycloak as a SAML 2.0 identity broker. After you complete this integration, users authenticate with your IdP and then access Data Catalog using single sign-on (SSO). Users can also continue to sign in using local username and password if local users remain enabled.

Prerequisites

Make sure that you have:

  • Administrative access to the target Keycloak instance.

  • Valid credentials to invoke Keycloak Admin REST APIs.

  • Administrative access to the IdP (for example, PingFederate).

  • IdP SAML metadata (Entity ID, SSO endpoint, SLO endpoint if used, and signing certificate).

  • Network connectivity between Keycloak and IdP endpoints.

1

Collect SAML metadata from your identity provider

Before you configure Keycloak, collect the SAML metadata values required to create the SAML identity provider (IdP) instance.

Prerequisites

Make sure you have administrative access to your IdP console.

Procedure

  1. Export or download the SAML metadata for the SAML application/connection that represents Keycloak as the service provider.

    • PingFederate: On SP Connections, select the service provider connection, and then select Export Metadata.

    • Ping Identity: Open the SAML application, and then select Download Metadata on the Overview tab.

  2. From the metadata, record these values:

    • IdP Entity ID (idpEntityId): This is the IdP entityID value in the metadata.

    • Single sign-on service URL (singleSignOnServiceUrl): This is the SingleSignOnService Location value.

    • Single logout service URL (singleLogoutServiceUrl), if used: This is the SingleLogoutService Location value.

    • Signing certificate (X.509) (signingCertificate): This is the X509Certificate value.

Result

You have the IdP metadata values needed to configure a SAML identity provider in Keycloak.

2

Authenticate to Keycloak and retrieve realm certificates

To configure Keycloak for IdP integration, you must first authenticate to Keycloak and obtain an admin access token. You then use that token to call Keycloak APIs, such as retrieving the realm certificates that are used for signing and validation.

Prerequisites

Make sure you have:

  • Keycloak base URL: https://<host>/keycloak

  • Keycloak admin credentials:

    • <admin-username>

    • <admin-password>

  • PDC realm name: <pdc-realm>

Important: Treat the admin credentials and tokens as sensitive information. Do not store them in logs or share them in screenshots.

Procedure

  1. To generate an admin access token, send a token request to Keycloak.

    Endpoint

    Headers

    Body (x-www-form-urlencoded)

    • username=<admin-username>

    • password=<admin-password>

    • client_id=admin-cli

    • grant_type=password

    Example (PowerShell)

  2. Record the access_token value from the response.

    Note: The token expires based on the expires_in value in the response. If the token expires, generate a new token and retry the API call.

    You have an admin access token to use as Authorization: Bearer <access-token> in Keycloak Admin API requests.

  3. To retrieve realm certificates, call the realm certificates endpoint.

    Endpoint

    Authentication

    Example (PowerShell)

  4. From the response, identify the signing certificate:

    1. Find the key entry with "use": "sig".

    2. The certificate chain is in x5c[].

      Note: If your IdP administrator requests the certificate, provide the first value in x5c[]. This is the base64-encoded X.509 certificate.

Result

You retrieved the realm certificates for <pdc-realm>. You can now use them for verification or IdP-side configuration (if required by your organization).
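
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the documented example. The token endpoint path and the use of the master realm are assumptions based on the standard Keycloak layout, because the endpoint is not shown in this topic, and the function names are hypothetical. The request object is only built here, not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_token_request(base_url: str, username: str, password: str,
                        realm: str = "master") -> urllib.request.Request:
    """Build (but do not send) the admin token request described in step 1."""
    # ASSUMPTION: standard Keycloak token path; admin users live in "master".
    url = f"{base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    form = {
        "username": username,
        "password": password,
        "client_id": "admin-cli",
        "grant_type": "password",
    }
    return urllib.request.Request(
        url,
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def signing_cert_pem(certs_response: dict) -> str:
    """Return the first x5c value of the "use": "sig" key, wrapped as PEM."""
    for key in certs_response.get("keys", []):
        if key.get("use") == "sig" and key.get("x5c"):
            b64 = key["x5c"][0]
            # PEM bodies are conventionally wrapped at 64 characters per line.
            lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
            body = "\n".join(lines)
            return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"
    raise LookupError("no signing key with an x5c chain found")


req = build_token_request("https://pdc.example.com/keycloak", "<admin-username>", "<admin-password>")
print(req.full_url)
```

To send the request, pass the Request object to urllib.request.urlopen and read access_token from the JSON response. Treat the token as sensitive and do not log it.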

3

Create a SAML identity provider in Keycloak for Data Catalog

Create a SAML identity provider (IdP) instance in Keycloak, so Keycloak can broker authentication from your external IdP (for example, PingFederate) into the PDC realm.

Prerequisites

Make sure you have:

  • Keycloak base URL: https://<host>/keycloak

  • PDC realm name: <pdc-realm>

  • SAML provider alias: <saml-alias> (for example, PingFed or CorporateSSO)

  • A valid Keycloak admin access token: <admin-token>

  • The following values from your IdP SAML metadata:

    • <idp-entity-id> (IdP Entity ID)

    • <idp-sso-url> (SingleSignOnService Location)

    • <idp-slo-url> (SingleLogoutService Location), if used

    • <idp-signing-cert> (X509Certificate)

Important: In this procedure, set "syncMode": "FORCE". This ensures Keycloak refreshes user details and role mappings on every login.

Procedure

  1. Create a file named create-saml-idp.json, paste the following content into it, and replace the placeholders with your values.

    Use this reference when replacing placeholders:

    • idpEntityId → IdP Entity ID. Source: SAML metadata root entityID attribute.

    • singleSignOnServiceUrl → IdP SSO endpoint. Source: SingleSignOnService Location.

    • singleLogoutServiceUrl → IdP SLO endpoint. Source: SingleLogoutService Location (if your IdP supports SLO and you use it).

    • signingCertificate → IdP signing certificate. Source: X509Certificate value in SAML metadata.

    Note:
    • Keep the firstBrokerLoginFlowAlias and postBrokerLoginFlowAlias values as shown unless your Keycloak administrator has configured different broker flows for Data Catalog.

    • When backchannelSupported is enabled, Keycloak uses a server-to-server logout flow: Keycloak pod → PingFederate. If this request times out, it typically indicates a network connectivity issue between the Keycloak pod and PingFederate (for example, firewall rules, DNS resolution, routing, or a blocked port).

    Tip: If you copy the X509Certificate value from XML, remove line breaks and extra spaces so the certificate is stored as a single continuous base64 string.

  2. Send a request to the Keycloak Admin API.

    Endpoint

    Authentication

    Headers

    Example

    Example

  3. Confirm the request succeeds. A successful request typically returns HTTP 201 (Created) or HTTP 204 (No Content), depending on your Keycloak version and configuration.
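The request body and endpoint for this procedure are not reproduced on this page, so the following Python sketch only illustrates the general shape of the call. It assumes the standard Keycloak Admin REST path for identity provider instances and a Bearer token header; the function names are hypothetical. It deliberately omits firstBrokerLoginFlowAlias, postBrokerLoginFlowAlias, and backchannelSupported, which you should copy from the JSON supplied for your Data Catalog release.

```python
import json
import urllib.request


def build_saml_idp_payload(alias, idp_entity_id, sso_url, signing_cert, slo_url=None):
    """Build a JSON body for a SAML identity provider instance (illustrative only)."""
    # Store the certificate as one continuous base64 string, as the tip advises.
    cert = "".join(signing_cert.split())
    config = {
        "idpEntityId": idp_entity_id,
        "singleSignOnServiceUrl": sso_url,
        "signingCertificate": cert,
        "syncMode": "FORCE",  # refresh user details and role mappings on every login
    }
    if slo_url:  # only when your IdP supports SLO and you use it
        config["singleLogoutServiceUrl"] = slo_url
    return {"alias": alias, "providerId": "saml", "enabled": True, "config": config}


def build_create_idp_request(base_url, realm, admin_token, payload):
    """Prepare (but do not send) the POST request to the Keycloak Admin API."""
    # Assumed endpoint: standard Keycloak Admin REST path under the base URL.
    url = f"{base_url}/admin/realms/{realm}/identity-provider/instances"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the prepared object to urllib.request.urlopen and check that the response status is 201 or 204, as described in step 3.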

Result

A SAML identity provider instance named <saml-alias> is created in the <pdc-realm> realm.

4

Configure SAML mappers in Keycloak for Data Catalog

SAML mappers tell Keycloak how to translate identity provider (IdP) attributes (for example, Ping Identity claims) into:

  • Keycloak user attributes (email, username, first name, last name)

  • Keycloak roles that Data Catalog uses for authorization (Admin, Data Steward, and so on)

Configure the user attribute mappers first, and then configure the role mappers.

Prerequisites

Make sure you have:

  • Keycloak base URL: https://<host>/keycloak

  • PDC realm name: <pdc-realm>

  • Keycloak SAML identity provider alias: <saml-alias>

  • A valid Keycloak admin access token: <admin-token>

  • The Ping Identity attribute names used for:

    • Email

    • Given name

    • Family name

    • Group membership (for example, memberOf)

  • The Ping Identity group values that should map to Data Catalog roles (one group value per role)

Important: Set syncMode to FORCE in every mapper. This ensures Keycloak refreshes user details and role assignments on every login and prevents stale role assignments.

Mapper endpoint reference

All mapper requests use the same endpoint.

Endpoint

Authentication

Headers

Procedure

  1. Create the following user attribute mappers so Keycloak can populate user profile fields for IdP users.

    Tip: If you are using a stable email address as NameID at your IdP, configure the username mapper to use ${NAMEID} as shown in this procedure.

    Create a file named mapper-email.json:

  1. Create each user attribute mapper in Keycloak. For each JSON file, send a POST request to create the mapper.

    Example

    Repeat the request for:

    • mapper-username.json

    • mapper-firstname.json

    • mapper-lastname.json

    Keycloak can create and update IdP users with populated email, username, first name, and last name values.

  2. Before you create role mappers, confirm the Ping Identity values that represent membership for each Data Catalog role. Record the values your IdP provides:

    • Group attribute name: <ping-group-attribute-name> (for example, memberOf)

    • Admin group value: <ping-admin-group-value>

    • Business Steward group value: <ping-business-steward-group-value>

    • Business User group value: <ping-business-user-group-value>

    • Data Developer group value: <ping-data-developer-group-value>

    • Data Steward group value: <ping-data-steward-group-value>

    • Data Storage Administrator group value: <ping-data-storage-admin-group-value>

    • Data User group value: <ping-data-user-group-value>

    Important: Create one role mapper per Data Catalog role. If you do not create a mapper for a role, users will not receive that role through the IdP.

  3. Create role mappers (one mapper per Data Catalog role) to assign a Keycloak role when a Ping Identity attribute contains a specific group value.

    Role mapper rules

    • Set syncMode to FORCE in every role mapper.

    • Update these values in every mapper:

      • attribute.name to the Ping Identity group attribute name (for example, memberOf)

      • attribute.value to the group value that represents the role

      • identityProviderAlias to the Keycloak SAML IdP alias (<saml-alias>)

Create a file named role-admin.json:

  1. To create each role mapper in Keycloak, send a POST request to create the mapper for each role mapper JSON file.

    Example

  2. Repeat the request for each mapper file you created. Keycloak assigns the correct Data Catalog roles based on Ping Identity group membership.
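The mapper JSON files are not reproduced on this page. As a rough guide, the following Python sketch builds payloads of the kind this procedure describes. It assumes the standard Keycloak SAML mapper types (saml-user-attribute-idp-mapper, saml-username-idp-mapper, saml-role-idp-mapper) and the standard mappers endpoint; the function names, mapper names, and the role value are hypothetical and must be replaced with the values for your deployment.

```python
def mappers_url(base_url, realm, alias):
    # Assumed endpoint: standard Keycloak Admin REST path for IdP mappers.
    return f"{base_url}/admin/realms/{realm}/identity-provider/instances/{alias}/mappers"


def user_attribute_mapper(alias, name, idp_attribute, user_attribute):
    """Map one IdP attribute (for example, the email claim) to a Keycloak user attribute."""
    return {
        "name": name,
        "identityProviderAlias": alias,
        "identityProviderMapper": "saml-user-attribute-idp-mapper",
        "config": {
            "syncMode": "FORCE",  # required in every mapper
            "attribute.name": idp_attribute,
            "user.attribute": user_attribute,
        },
    }


def username_mapper(alias):
    """Use the SAML NameID as the Keycloak username."""
    return {
        "name": "username",
        "identityProviderAlias": alias,
        "identityProviderMapper": "saml-username-idp-mapper",
        "config": {"syncMode": "FORCE", "template": "${NAMEID}"},
    }


def role_mapper(alias, role, group_attribute, group_value):
    """Assign a Keycloak role when the group attribute contains the given value."""
    return {
        "name": f"role-{role}",
        "identityProviderAlias": alias,
        "identityProviderMapper": "saml-role-idp-mapper",
        "config": {
            "syncMode": "FORCE",
            "attribute.name": group_attribute,   # for example, memberOf
            "attribute.value": group_value,      # group value that represents the role
            "role": role,                        # hypothetical role name
        },
    }
```

Each returned dictionary corresponds to one JSON file in the procedure; serialize it with json.dumps and POST it to the URL returned by mappers_url, once per mapper.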

Result

Keycloak mappers are configured for:

  • User attributes (email, username, first name, last name)

  • Role assignments (one mapper per Data Catalog role)

Users signing in through the SAML IdP are provisioned with the expected identity details and permissions.

Configure Metadata Request Access in Data Catalog

You can configure Metadata Request Access in Data Catalog to manage user requests for metadata access. This feature uses the Access Request Service, a backend service that integrates with ticketing tools such as Jira or ServiceNow. The service creates, tracks, and updates access requests, and synchronizes their status between Data Catalog and the external system. This guide describes how to configure Metadata Request Access in Data Catalog.

By default, only metadata access requests are managed in Data Catalog. Data access requests are routed to an integrated ticketing system.

Before you begin:

  • Ensure you have:

    • Network connectivity between the service and Jira or ServiceNow.

    • Credentials for authenticating with your identity provider and your ticketing system.

  • Confirm that the external system (Jira or ServiceNow) contains a custom field to store the access request status (for example, Access Status with values Approved and Rejected). For more information, see Integrating ServiceNow with Data Catalog and Integrating Jira with Data Catalog.

  • Gather the following details:

    • Service account credentials for Jira or ServiceNow.

    • Database connection parameters.

    • Authentication service endpoint, client ID, and credentials.

Perform the following steps to configure Metadata Request Access in Data Catalog:

  1. Configure the service by setting the required environment variables. Use either your deployment tool (for example, Docker Compose, Kubernetes, or Helm) or a configuration file.

    Key environment variables

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | TENANT_NAME | Tenant identifier | Yes | your-tenant |
    | ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE | Default assignee email if no auto-assignment | Yes | |
    | STATUS_FETCHER_INTERVAL | Poll interval for status updates; supports cron or @every syntax | Yes | @every 5m |
    | PAGINATION_MAX_RESULTS_SIZE | Maximum results per API request | No | 1000 |
    | LOG_LEVEL | Logging verbosity (info, debug, warn, error) | No | info |

    CAUTION: The value of ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE must match a user created in Data Catalog with the Admin role using User Management.

    Authentication settings

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | AUTH_URL | Authentication endpoint URL | Yes | https://auth.example.com |
    | AUTH_HOST | Authentication host and port | Yes | auth-host:5000 |
    | AUTH_CLIENT_ID | Client ID for authentication | Yes | generic-client-id |
    | AUTH_USER_NAME | Username for authentication | Yes | generic-user |
    | AUTH_PASSWORD | Password for authentication | Yes | generic-password |

    Database settings

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | DB_URL | Database host | Yes | generic-db-host |
    | DB_PORT | Database port | Yes | 5432 |
    | DB_NAME | Database name | Yes | generic_db |
    | DB_USER | Database user | Yes | generic-db-user |
    | DB_PASSWORD | Database password | Yes | generic-db-password |
    | DB_SSL_MODE | SSL mode (disable, require) | Yes | disable |

    Ticketing provider selection

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | PROVIDER_TOOL | Select ticketing system (Jira or ServiceNow) | Yes | Jira |

    If Jira is used

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | JIRA_URL | Jira server URL | Yes | https://jira.example.com |
    | JIRA_PROJECT_NAME | Project key or name | Yes | GENERIC |
    | JIRA_USER_NAME | Jira service account username | Yes | |
    | JIRA_PASSWORD | Jira password or API token | Yes | generic-jira-password |
    | JIRA_ACCESS_STATUS_KEY | Jira field key for access status | Yes | Access Status |
    | JIRA_ACCESS_STATUS_APPROVED_VALUE | Field value for approval | Yes | Approved |
    | JIRA_ACCESS_STATUS_REJECTED_VALUE | Field value for rejection | Yes | Rejected |

    If ServiceNow is used

    | Variable | Description | Required | Example |
    | --- | --- | --- | --- |
    | SERVICENOW_URL | ServiceNow instance URL | Yes | https://servicenow.example.com |
    | SERVICENOW_USER_NAME | ServiceNow service account username | Yes | |
    | SERVICENOW_PASSWORD | ServiceNow password | Yes | generic-snow-password |
    | SERVICENOW_CLIENT_ID | ServiceNow client ID | Yes | snow-client-id |
    | SERVICENOW_CLIENT_SECRET | ServiceNow client secret | Yes | snow-client-secret |
    | SERVICENOW_ACCESS_STATUS_KEY | ServiceNow field key for access status | Yes | u_access_status |
    | SERVICENOW_ACCESS_STATUS_APPROVED_VALUE | Field value for approval | Yes | Approved |
    | SERVICENOW_ACCESS_STATUS_REJECTED_VALUE | Field value for rejection | Yes | Rejected |

  2. Save your configuration and restart the access-request-service container.

  3. (Optional) After you restart the access-request-service container, verify the configuration:

    1. Submit a metadata request in the PDC UI.

    2. Confirm that a corresponding ticket is created in Jira or ServiceNow.

    3. Update the ticket status to Approved or Rejected.

    4. Confirm that PDC reflects the updated status after the poll interval.
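Before restarting the container, it can help to check that every variable marked as required in the tables above is set. The following Python sketch is one possible pre-flight check, not part of Data Catalog. The function name is hypothetical, and treating PROVIDER_TOOL as case-insensitive is an assumption.

```python
COMMON_REQUIRED = [
    "TENANT_NAME", "ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE", "STATUS_FETCHER_INTERVAL",
    "AUTH_URL", "AUTH_HOST", "AUTH_CLIENT_ID", "AUTH_USER_NAME", "AUTH_PASSWORD",
    "DB_URL", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD", "DB_SSL_MODE",
    "PROVIDER_TOOL",
]

PROVIDER_REQUIRED = {
    "jira": [
        "JIRA_URL", "JIRA_PROJECT_NAME", "JIRA_USER_NAME", "JIRA_PASSWORD",
        "JIRA_ACCESS_STATUS_KEY", "JIRA_ACCESS_STATUS_APPROVED_VALUE",
        "JIRA_ACCESS_STATUS_REJECTED_VALUE",
    ],
    "servicenow": [
        "SERVICENOW_URL", "SERVICENOW_USER_NAME", "SERVICENOW_PASSWORD",
        "SERVICENOW_CLIENT_ID", "SERVICENOW_CLIENT_SECRET",
        "SERVICENOW_ACCESS_STATUS_KEY", "SERVICENOW_ACCESS_STATUS_APPROVED_VALUE",
        "SERVICENOW_ACCESS_STATUS_REJECTED_VALUE",
    ],
}


def missing_variables(env):
    """Return the required variable names that are unset or empty in env."""
    # Assumption: the provider name is matched case-insensitively.
    provider = env.get("PROVIDER_TOOL", "").strip().lower()
    required = COMMON_REQUIRED + PROVIDER_REQUIRED.get(provider, [])
    return [name for name in required if not env.get(name)]
```

For example, calling missing_variables(os.environ) in the environment that launches the service returns an empty list when nothing required is missing.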

Integrating Jira with Pentaho Data Catalog

You can integrate Jira as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure Jira integration using a config.yaml file or environment variables, including optional proxy settings for the access-request service, and how to create a custom field in Jira to use for the data access request statuses.

Note:
  • You can set any administrative user as the default administrator to manage data access requests. However, there can be only one default administrator set, because the ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE environment variable supports only a single user. If necessary, the default administrator can edit a data access request to assign it to another administrative user.

  • Configure proxy settings only if your environment requires the access-request service to connect to Jira through a forward proxy. If the /app/config.yaml file is present in the container, the service loads configuration from that file. If the file is not present, the service uses environment variables instead.
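The precedence rule in the note above can be sketched in a few lines of Python. This only mirrors the documented behavior for illustration; the function name is hypothetical and the service's actual implementation is not shown here.

```python
from pathlib import Path


def config_source(config_path="/app/config.yaml"):
    """Return which source the documented rule would select."""
    # If the file is present in the container, it wins; otherwise
    # the service falls back to environment variables.
    return "config.yaml" if Path(config_path).is_file() else "environment"
```

A practical consequence is that values set in the .env file have no effect while a config.yaml file is mounted at /app/config.yaml.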

To integrate Jira with Data Catalog, perform the following tasks:

Integrate Jira with Data Catalog using a config.yaml file

To integrate the Jira ticketing system with Data Catalog, you can use a config.yaml file with settings for connection details, credentials, project information, status mappings, and optional proxy settings. If your system does not have a config.yaml file, you can also integrate Jira with Data Catalog using environment variables. For more information, see Integrate Jira with Data Catalog using environment variables.

Perform the following steps to integrate Jira with Data Catalog using a config.yaml file:

  1. Go to the /pentaho/pdc-docker-deployment/conf folder and open the config.yaml file. If the file does not exist, create it.

  2. Add the following configuration to the config.yaml file.

    Note: Replace the placeholder values in angle brackets (< >) with your actual Jira credentials, project details, and proxy details if your environment uses a forward proxy.

    Where:

    • is_proxy_enabled enables or disables Jira proxy support.

    • proxy_url specifies the proxy server URL.

    • proxy_username and proxy_password are optional and are required only when the proxy requires authentication.

  3. Also add the following auth section for your PDC version to the config.yaml file.

    • For PDC 10.2.5 and 10.2.6

    • For PDC 10.2.7 and 10.2.8

    • For PDC 10.2.9 and PDC 10.2.11, add the mds section along with the auth section.

  4. Save the changes and close the config.yaml file.

  5. Open the vendor/docker-compose-um.yml file. Under the access-request-service container configuration, add a volumes section at the same level as the environment section, save the changes, and close the file.

    Note:
    • The config.yaml file is created on the host system and must be mounted into the access-request-service container as /app/config.yaml.

    • If your Jira proxy uses HTTPS and requires client certificates, mount the certificate directory to /app/proxy-certs in the access-request-service container. For example:

    • You can store the certificates in any accessible directory, but you must mount that directory correctly in the container.

  6. Restart the access request service to apply changes:

You now need to add a custom field to Jira to include the data access request statuses. For more information, see Add a custom field to Jira.

Integrate Jira with Data Catalog using environment variables

To integrate the Jira ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, status mappings, and optional proxy settings. You can also integrate Jira with Data Catalog using a config.yaml file.

Perform the following steps to integrate Jira with Data Catalog using environment variables:

  1. Log in to the server where Data Catalog is installed.

  2. Go to the Data Catalog Docker deployment configuration directory.

  3. Open the .env (environment configuration) file.

    If the .env file does not exist, create it and save the file before proceeding.

  4. Add the following lines:

    Where:

    • IS_JIRA_PROXY_ENABLED enables or disables Jira proxy support. The default value is false.

    • JIRA_PROXY_URL specifies the proxy server URL.

    • JIRA_PROXY_USERNAME and JIRA_PROXY_PASSWORD are optional and are required only when the proxy requires authentication.

    Note:
    • Environment variables are used only when the access-request-service does not load /app/config.yaml.

    • If your Jira proxy uses HTTPS and requires client certificates, mount the certificate directory to /app/certs in the access-request-service container.

  5. Save the changes and close the file.

  6. Restart the access request service to apply changes:

You now need to add a custom field to Jira to include the data access request statuses. For more information, see Add a custom field to Jira.
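The proxy variables described in this procedure depend on one another, which is easy to get wrong in a hand-edited .env file. The following Python sketch shows one way to sanity-check them before restarting the service. It assumes that the flag is written as true or false, that JIRA_PROXY_URL is needed whenever the proxy is enabled, and that the username and password are supplied together; the function name is hypothetical.

```python
def jira_proxy_settings(env):
    """Return the proxy settings implied by env, or None when the proxy is disabled."""
    # The documented default for IS_JIRA_PROXY_ENABLED is false.
    enabled = env.get("IS_JIRA_PROXY_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return None

    url = env.get("JIRA_PROXY_URL", "").strip()
    if not url:
        raise ValueError("JIRA_PROXY_URL must be set when IS_JIRA_PROXY_ENABLED is true")

    settings = {"url": url}
    user = env.get("JIRA_PROXY_USERNAME")
    password = env.get("JIRA_PROXY_PASSWORD")
    if user or password:
        # Credentials are optional, but assumed to be meaningful only as a pair.
        if not (user and password):
            raise ValueError("Set both JIRA_PROXY_USERNAME and JIRA_PROXY_PASSWORD, or neither")
        settings["username"] = user
        settings["password"] = password
    return settings
```

The same pattern applies to the ServiceNow proxy variables if you substitute their names.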

Add a custom field to Jira

If you have configured Data Catalog to connect to Jira for managing data access requests, you need to add a custom field to Jira to map the Data Catalog data access request statuses to complete the Jira integration.

Perform the following steps to add a custom field to Jira:

  1. Log in to Jira with administrative rights.

  2. Go to the Jira Admin settings.

    If you cannot find the Jira Admin settings, use these steps:

    1. Open any issue.

    2. In the Details section, click the settings icon, then click Manage Fields. In the bottom right corner, you see the Go to Custom Fields option.

    3. Click Go to Custom Fields, and you are taken to the Jira Admin settings.

  3. Click Custom Fields and then click Create custom field.

  4. Select the Select List type and enter Access Status as the name.

  5. Add the options: Approved, Rejected, and Pending, and click Create.

  6. Open any issue. In the Details section, click the settings icon, then click Manage Fields.

  7. Locate the newly created Access Status field in the list of fields on the right side.

  8. Drag and drop the Access Status field into the Context Fields section.

Jira is now updated to use data access request statuses with Data Catalog.

Integrating ServiceNow with Data Catalog

You can integrate ServiceNow as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure ServiceNow integration using a config.yaml file or environment variables, including optional proxy settings for the access-request service, and how to create a custom field in ServiceNow to track data access request statuses.

Note:
  • You can set any administrative user as the default administrator to manage data access requests. However, there can be only one default administrator set, because the ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE environment variable only supports a single user. If necessary, the default administrator can edit a data access request to assign it to another administrative user.

  • Configure proxy settings only if your environment requires the access-request service to connect to ServiceNow through a forward proxy. If the /app/config.yaml file is present in the container, the service loads configuration from that file. If the file is not present, the service uses environment variables instead.

To integrate ServiceNow with Data Catalog, you need to perform the following tasks:

Integrate ServiceNow with Data Catalog using a config.yaml file

To integrate the ServiceNow ticketing system with Data Catalog, you can use a config.yaml file with settings for connection details, credentials, project information, status mappings, and optional proxy settings. If your system does not have a config.yaml file, you can also integrate ServiceNow with Data Catalog using environment variables. For more information, see Integrate ServiceNow with Data Catalog using environment variables.

Perform the following steps to integrate ServiceNow with Data Catalog using a config.yaml file:

  1. Go to the /pentaho/pdc-docker-deployment/conf folder and open the config.yaml file. If the file does not exist, create it.

  2. Add the following configuration to the config.yaml file.

    Note: Replace the placeholder values in angle brackets (< >) with your actual ServiceNow credentials, project details, and proxy details if your environment uses a forward proxy.

    Where:

    • is_proxy_enabled enables or disables ServiceNow proxy support.

    • proxy_url specifies the proxy server URL.

    • proxy_username and proxy_password are optional and are required only when the proxy requires authentication.

  3. Also add the following auth section for your PDC version to the config.yaml file.

    • For PDC 10.2.5 and 10.2.6

    • For PDC 10.2.7 and 10.2.8

    • For PDC 10.2.9 and PDC 10.2.11, add the mds section along with the auth section.

  4. Save the changes and close the config.yaml file.

  5. Open the vendor/docker-compose-um.yml file. Under the access-request-service container configuration, add a volumes section at the same level as the environment section, save the changes, and close the file.

    Note:
    • The config.yaml file is created on the host system and must be mounted into the access-request-service container as /app/config.yaml.

    • If your ServiceNow proxy uses HTTPS and requires client certificates, mount the certificate directory to /app/proxy-certs in the access-request-service container. For example:

    • You can store the certificates in any accessible directory, but you must mount that directory correctly in the container.

  6. Restart the access request service to apply changes:

You now need to add a custom field to ServiceNow to include the data access request statuses. For more information, see Add a custom field to ServiceNow.

Integrate ServiceNow with Data Catalog using environment variables

To integrate the ServiceNow ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, status mappings, and optional proxy settings. You can also integrate ServiceNow with Data Catalog using a config.yaml file.

Perform the following steps to integrate ServiceNow with Data Catalog using environment variables:

  1. Log in to the server where Data Catalog is installed.

  2. Go to the Data Catalog Docker deployment configuration directory.

  3. Open the .env file.

    If the .env file does not exist, create it, and save the file before proceeding.

  4. Add the following lines:

    Where:

    • IS_SERVICE_NOW_PROXY_ENABLED enables or disables ServiceNow proxy support. The default value is false.

    • SERVICE_NOW_PROXY_URL specifies the proxy server URL.

    • SERVICE_NOW_PROXY_USERNAME and SERVICE_NOW_PROXY_PASSWORD are optional and are required only when the proxy requires authentication.

    Note:
    • Environment variables are used only when the access-request-service does not load /app/config.yaml.

    • If your ServiceNow proxy uses HTTPS and requires client certificates, mount the certificate directory to /app/proxy-certs in the access-request-service container.

  5. Save the changes and close the file.

  6. Restart the access request service to apply changes:

You now need to add a custom field to ServiceNow to include the data access request statuses. For more information, see Add a custom field to ServiceNow.

Add a custom field to ServiceNow

If you have configured Data Catalog to connect to ServiceNow for managing data access requests, you need to add a custom field to ServiceNow to map the Data Catalog data access request statuses to complete the ServiceNow integration.

Perform the following steps to add a custom field to ServiceNow:

  1. Log in to the ServiceNow instance with administrative rights.

  2. Go to System Definition > Tables and locate the Incident table.

  3. Open the Incident table, and at the bottom, in the Columns section, click New to add a new column (custom field).

  4. Configure the following properties:

| Property | Description |
| --- | --- |
| Column Label | Enter a descriptive name like Access Request Status. |
| Column Name | Automatically generated as u_access_request_status (prefixed with u_ to indicate it’s a custom field). |
| Type | Select the appropriate field type, which should be Choice for values like Pending, Granted, and Denied. |
| Choices | Once the field type is set to Choice, there is an option to add Choice List Values. Add the following values: Pending, Granted, and Denied. You can also set a default value if desired. |

5. Verify the changes you have made and click Submit.
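The column name and choice values created here need to line up with the SERVICENOW_ACCESS_STATUS_* settings of the access-request service. The following Python sketch shows one way to compare them. It assumes the configured values must match the ServiceNow column name and choice values exactly; the function name is hypothetical.

```python
def status_mapping_problems(env, column_name, choices):
    """Compare the service settings in env with the ServiceNow custom field."""
    problems = []

    key = env.get("SERVICENOW_ACCESS_STATUS_KEY")
    if key != column_name:
        problems.append(f"SERVICENOW_ACCESS_STATUS_KEY is {key!r}, expected {column_name!r}")

    for var in (
        "SERVICENOW_ACCESS_STATUS_APPROVED_VALUE",
        "SERVICENOW_ACCESS_STATUS_REJECTED_VALUE",
    ):
        value = env.get(var)
        if value not in choices:
            problems.append(f"{var} is {value!r}, which is not one of {sorted(choices)}")
    return problems
```

With the field defined in this procedure, you would call it with column_name set to u_access_request_status and choices set to Pending, Granted, and Denied; an empty list means the settings and the field agree.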

Configure Power BI templates with Pentaho Data Catalog

This guide describes how to connect Power BI reports to the PostgreSQL database that holds Data Catalog metadata. It covers the integration setup, including:

  • Pre-created database objects in PostgreSQL

  • Configuration of Power BI Desktop and Service

  • Scheduled refresh setup

Additionally, it explains synchronizing materialized view refresh jobs in PostgreSQL with Power BI dataset refresh schedules.

Prerequisites

Before configuring Power BI with Data Catalog, ensure:

  • PostgreSQL database is installed, configured, and accessible from the Power BI network.

  • A SQL client tool, such as DBeaver Community Edition, is available to run database scripts.

  • The user account has the necessary roles and privileges to create tables and materialized views in the PostgreSQL database.

  • Power BI Desktop is installed for creating and testing reports.

  • Power BI Service access is available for publishing reports and scheduling dataset refreshes.

  • The on-premises data gateway is installed and running to enable secure connectivity between Power BI Service and the PostgreSQL database.

  • Network permissions allow communication between the Power BI gateway host and the PostgreSQL server.

Table and view mapping

The following views and tables are pre-created in PDC for Power BI.

| View Name | Dependent Tables |
| --- | --- |
| mv_master | entities_master_view, datasource_category_mapping, currency_exchange_rates |
| mv_entity_category_summary_view | entities_custom_categorization, glossary_summary_view, terms_view, entities_master_view, currency_exchange_rates, datasource_category_mapping |
| mv_duplicate_savings_by_original_view | duplicate_files_view, entities_master_view, currency_exchange_rates, datasource_category_mapping |
| mv_duplicate_by_term_summary_view | entities_custom_categorization, glossary_summary_view, duplicate_files_view, terms_view, entities_master_view, currency_exchange_rates |
| mv_duplicate_entities_summary_view | duplicate_files_view, entities_master_view, currency_exchange_rates |
| mv_duplicate_entity_detail_view | duplicate_files_view, entities_master_view, currency_exchange_rates |
| mv_policies_summary | policies_summary_view, entities_policies_view, entities_master_view |

Configure the Power BI Desktop environment

Use Power BI Desktop to update the PostgreSQL data source and configure connection settings before publishing the report to Power BI Service.

Perform the following steps to configure Power BI Desktop:

  1. Open the Power BI Desktop application and open the Power BI report (.pbix).

  2. On the ribbon, select Transform Data > Data Source Settings.

  3. In the Data Source Settings window, select your PostgreSQL connection.

  4. Click Change Source, update Server name/host and Database name, and click OK.

  5. Click Edit Permission, clear Encrypt connection, and enter valid PostgreSQL database credentials.

  6. Click OK and then click Close.

  7. On the Home tab, select Apply Changes > Run All to apply the new configuration.

  8. (Optional) To modify parameters directly, select Home > Edit Parameters, update values, and then click Apply Changes.

Power BI Desktop connects to the PostgreSQL database with the updated host, database, and credential settings. The report is now ready to be published to Power BI Service.

Configure the Power BI Service gateway

Use the Power BI Service to configure the on-premises data gateway and connect the Power BI dataset to the PostgreSQL database used by Data Catalog.

Before you begin, ensure that you have administrator access.

Perform the following procedure to configure the Power BI Service gateway:

1

Install and configure on-premises Data Gateway

  1. Download the On-premises data gateway installer from the Microsoft Power BI portal.

  2. Run the installer as an administrator.

  3. Select On-premises data gateway (recommended) when prompted.

  4. Choose Standard mode, and then click Next.

  5. Sign in with your Power BI credentials.

  6. Enter a Gateway name and Recovery key (password), and then click Configure.

  7. After installation completes, verify that the gateway status shows Online in the Power BI Service.

2

Configure Dataset in the Power BI Service

  1. In Power BI Service, go to the workspace that contains your dataset.

  2. Select the More options (⋮) menu next to the dataset, and then select Settings.

  3. Under Parameters, update the Host name and Database name values.

  4. Click Apply, and then select Reload to refresh the dataset parameters.

3

Add gateway connection

  1. Go to Manage gateways from the Power BI Service navigation pane.

  2. Select your gateway, and then choose Add to gateway.

  3. Enter a Connection name (preferably the same as the PostgreSQL host name).

  4. Select Basic authentication.

  5. Enter the PostgreSQL username and password.

  6. Click Apply to create and map the connection to the dataset.

The Power BI dataset is connected to the PostgreSQL database through the on-premises data gateway. The gateway is active and can be used for scheduled dataset refreshes in Power BI Service.

Modifying parameters in Power BI Service

Use Power BI Service to update data source parameters for the PostgreSQL connection used by Data Catalog.

Perform the following procedure to change data source details:

  1. In Power BI Service, go to the workspace that contains your dataset.

  2. Select the More options (⋮) menu next to the dataset, and then select Settings.

  3. Under Parameters, update the required values such as Host name or Database name.

  4. Click Apply, and then select Reload to refresh the dataset.

  5. If a gateway error occurs, re-create the gateway connection from Manage gateways.

  6. If the gateway is valid, map the dataset again to ensure the updated parameters are applied.

The Power BI dataset is updated with the new PostgreSQL connection parameters and successfully mapped through the on-premises data gateway.

Optional - Refresh scheduling and synchronisation

Typically, cron jobs refresh the materialized views in the Business Intelligence Database (BIDB). However, you can also set up your own refresh schedule for the materialized views in PostgreSQL.

Perform the following steps to schedule a manual refresh:

PostgreSQL materialized view refresh scheduling

1

Create a PostgreSQL refresh function

Open your SQL client (for example, DBeaver) and run the following script to refresh all required materialized views:

2

Schedule automatic refresh jobs

Use pgAgent or an operating system scheduler such as cron or Windows Task Scheduler to automate materialized view refreshes.

  • Example cron job for Linux:

  • For Windows, use Task Scheduler:

    • Action: Launch psql.exe

    • Arguments: -U db_user -d your_database -c "SELECT refresh_all_pentaho_views();"
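The refresh script and the cron example for these two steps are not reproduced on this page. As a rough guide, the following Python sketch generates a PostgreSQL function that refreshes the views listed under Table and view mapping, plus a matching crontab line. It assumes that every listed view is a materialized view in the default schema, that the function is named refresh_all_pentaho_views as in the Windows example, and that db_user and your_database are placeholders; the Python function names are hypothetical.

```python
MATERIALIZED_VIEWS = [
    "mv_master",
    "mv_entity_category_summary_view",
    "mv_duplicate_savings_by_original_view",
    "mv_duplicate_by_term_summary_view",
    "mv_duplicate_entities_summary_view",
    "mv_duplicate_entity_detail_view",
    "mv_policies_summary",
]


def refresh_function_sql(views=MATERIALIZED_VIEWS, function_name="refresh_all_pentaho_views"):
    """Return SQL that creates a function refreshing each view in order."""
    statements = "\n".join(f"    REFRESH MATERIALIZED VIEW {view};" for view in views)
    return (
        f"CREATE OR REPLACE FUNCTION {function_name}() RETURNS void\n"
        "LANGUAGE plpgsql AS $$\n"
        "BEGIN\n"
        f"{statements}\n"
        "END;\n"
        "$$;"
    )


def cron_entry(hour, db_user="db_user", database="your_database"):
    """Return a crontab line that calls the refresh function daily at the given hour."""
    return (
        f"0 {hour} * * * psql -U {db_user} -d {database} "
        '-c "SELECT refresh_all_pentaho_views();"'
    )
```

Print the result of refresh_function_sql() and run it in your SQL client, then add the line returned by cron_entry(1) to the crontab of a user that can authenticate to PostgreSQL without a password prompt.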

Power BI Dataset refresh configuration

Perform the following procedure to configure scheduled refresh in Power BI Service:

1

Power BI Service (Scheduled Refresh)

  1. In Power BI Service, open your workspace and select the Datasets tab.

  2. Select Schedule refresh for the dataset connected to PostgreSQL.

  3. Turn on Keep your data up to date.

  4. Configure the refresh schedule:

    1. Power BI Pro: up to 8 refreshes per day

    2. Power BI Premium: up to 48 refreshes per day

  5. Set the time zone, frequency, and preferred refresh times.

  6. Align the dataset refresh to run after the PostgreSQL refresh completes.

2

Gateway mapping validation

  1. Go to Settings > Manage gateways > Connections.

  2. Verify that the on-premises data gateway status is Online.

  3. Confirm that the dataset connection is mapped correctly and uses valid credentials.

3

Refresh notification and logs

  1. In Power BI Service, go to Settings > Scheduled refresh > Failure notifications.

  2. Enable email alerts for refresh failures to receive proactive notifications.

Power BI datasets refresh automatically after PostgreSQL materialized views are updated.

| Component | Frequency | Trigger | Dependency |
| --- | --- | --- | --- |
| PostgreSQL Materialized Views | Every 24 hours (1:00 AM) | Cron job/pgAgent | N/A |
| Power BI Dataset | Every 24 hours (2:00 AM) | Power BI scheduled refresh | Runs after PostgreSQL refresh completes |

This ensures that Power BI always retrieves updated, fully processed data from Data Catalog’s database views.
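When you plan the two schedules, you can check them against the limits and ordering described above. The following Python sketch is one way to do that. It assumes times are given as HH:MM strings in the same time zone, that the PostgreSQL refresh does not run across midnight, and that you supply your own estimate of how long it takes; the function names are hypothetical.

```python
REFRESH_LIMITS = {"pro": 8, "premium": 48}  # scheduled refreshes per day


def to_minutes(hhmm):
    """Convert an HH:MM string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def schedule_problems(license_tier, powerbi_times, postgres_start, postgres_minutes):
    """Return a list of issues with the planned Power BI refresh times."""
    problems = []

    limit = REFRESH_LIMITS[license_tier]
    if len(powerbi_times) > limit:
        problems.append(
            f"{len(powerbi_times)} refreshes exceed the {license_tier} limit of {limit} per day"
        )

    # A dataset refresh that starts while the views are being rebuilt
    # may read partially updated data.
    start = to_minutes(postgres_start)
    end = start + postgres_minutes
    for refresh_time in powerbi_times:
        if start <= to_minutes(refresh_time) < end:
            problems.append(f"{refresh_time} overlaps the PostgreSQL refresh window")
    return problems
```

With the schedule in the table, schedule_problems("pro", ["02:00"], "01:00", 30) reports no issues, because the dataset refresh starts after the assumed 30-minute PostgreSQL refresh has finished.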
