Advanced configuration

After installing Data Catalog, you may need to set up additional components, depending on your environment. Use the following topics as needed to finish setting up your environment.

Configure system environment variables

Although not common, there might be instances where you need to change the default settings for Data Catalog system environment variables. These configuration modifications let you override default system behavior to align with your specific needs.

  1. In a terminal window, navigate to the pdc-docker-deployment folder, which is located in the /opt/pentaho folder by default. This folder contains the hidden environment variable configuration files (.env).

  2. Verify the system environment variables set in the /opt/pentaho/pdc-docker-deployment/vendor/.env.default file:

    • For example, the number of worker instances that Data Catalog uses to run processes is set to 5:

      PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE=5
      PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE=5

      Note: Make sure that PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE and PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE have the same value for consistent worker instance management.

  3. To override an environment variable set in the vendor/.env.default file, you can create a new .env file in the /opt/pentaho/pdc-docker-deployment/conf/ folder (see the example after these steps):

    vi /opt/pentaho/pdc-docker-deployment/conf/.env

  4. (Optional) The data in the Business Intelligence Database refreshes daily by default, as set in the .env file. To modify the data refresh frequency, update the variable in the .env file to one of the options listed in the following table:

Value
Description

@yearly (or @annually)

Run once a year, midnight, Jan 1st

@monthly

Run once a month, midnight, first of the month

@weekly

Run once a week, midnight between Sat/Sun

@daily (or @midnight)

Run once a day, midnight

@hourly

Run once an hour, the beginning of the hour

@every <number>m

Run at a custom interval, where <number> specifies the number of minutes. For example, @every 5m runs the job every 5 minutes.

The variable that controls the refresh schedule is set to daily by default:

    PDC_CRON_BI_VIEWS_INIT_SCHEDULE=@daily
  5. After adding all required system variables, save the changes and restart the Data Catalog system services:

    ./pdc.sh stop
    ./pdc.sh up
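
As referenced in step 3, the conf/.env override file might contain, for example, a larger worker pool (the value 8 is illustrative; keep both variables identical, as noted above):

    PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE=8
    PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE=8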

Disable the Physical Assets feature from Data Catalog deployment

By default, the Physical Assets feature is included in the Data Catalog deployment to support OT asset metadata through Pentaho Edge. However, if your deployment does not require this feature, you can disable it by removing its reference from the Compose file. Disabling it reduces the size of the deployment and saves compute resources. This section describes how to disable the Physical Assets feature in Data Catalog.

This procedure only disables the container associated with Physical Assets. It does not remove any binaries or metadata files. The service can be re-enabled at any time by restoring the profile reference.

Before you begin, make sure the following conditions are met:

  • Data Catalog is already installed using the Docker deployment method. For installation instructions, see Install Pentaho Data Catalog.

  • You have access to the deployment directory (pdc-docker-deployment) and permission to edit the .env.default or .env file.

Perform the following steps to disable the Physical Assets feature:

  1. Navigate to the following directory:

    cd /pentaho/pdc-docker-deployment/vendor
  2. Open the .env.default file in a text editor:

    vi .env.default
  3. Locate the COMPOSE_PROFILES line. It may look like this:

    COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm,physical-assets
  4. Remove physical-assets from the list:

    COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm
  5. Save the file and return to the deployment root folder:

    cd /pentaho/pdc-docker-deployment
  6. Restart the deployment using the following command:

    ./pdc.sh up
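
To confirm that the Physical Assets container is no longer running, you can list the running containers and filter by name (a quick check, assuming standard Docker tooling; the exact container name can vary by deployment):

    docker ps --format '{{.Names}}' | grep -i physical

No output indicates that the physical-assets profile is no longer active.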

Running Data Catalog workloads using node affinity, taints, and tolerations

Data Catalog is a distributed application that runs several containerized components, including the application, database, and worker pods. Worker pods perform data scanning, profiling, and metadata ingestion from enterprise data sources. These operations often involve direct access to sensitive or large-scale datasets.

In large or security-sensitive deployments, administrators often need precise control over where workloads run within a Kubernetes (EKS) cluster. This control helps ensure that critical services and data processing tasks run in secure, compliant, and performance-optimized environments. It becomes especially important when certain components of Data Catalog handle data that is confidential, high-volume, or requires specialized compute resources such as GPUs or high-performance storage.

By using Kubernetes node affinity, taints, and tolerations, you can configure Data Catalog so that:

  • Worker pods run only on specific nodes. For example, nodes in a restricted security group or a separate availability zone.

  • Non-worker workloads, such as the user interface or metadata services, are prevented from running on those nodes.

  • The system continues to meet data segregation and compliance requirements without affecting performance or scalability.

Running PDC workloads using node affinity, taints, and tolerations provides a secure, compliant, and efficient way to manage distributed deployments across different availability zones or network segments.

Perform the following steps to configure node affinity, taints, and tolerations for Data Catalog workloads.

Before you begin

  • Ensure that your Data Catalog deployment is running on Amazon Elastic Kubernetes Service (EKS).

  • Verify that you have kubectl and AWS CLI installed and configured with permissions to update node groups and apply taints.

  • Identify the node group where you want to run PDC worker workloads.

  • Obtain access to the custom-values.yaml file used for your PDC Helm deployment.

  • Confirm that your deployment uses Helmfile for orchestration.

Procedure

  1. Open a terminal on the machine that manages your Kubernetes cluster.

  2. Apply a taint to the node group you want to reserve for worker workloads.

    aws eks update-nodegroup-config \
      --cluster-name <your-cluster-name> \
      --nodegroup-name <your-nodegroup-name> \
      --taints 'addOrUpdateTaints=[{key=dedicated,value=ws,effect=NO_SCHEDULE}]'

    Replace the placeholders with your cluster and node group names.

    • key=dedicated specifies the taint key.

    • value=ws indicates that the node is dedicated for worker services.

    • effect=NO_SCHEDULE prevents other pods from being scheduled on these nodes unless they have a matching toleration.

    Tip: Use descriptive taint keys and values that reflect the node group’s purpose, such as key=data-processing or value=worker.

  3. Edit the custom-values.yaml file for your Helm deployment.

    For PDC 10.2.8 and later, locate the job-server section and add:

    job-server:
      tolerations:
        - key: "dedicated"
          operator: Equal
          value: "ws"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: eks.amazonaws.com/nodegroup
                  operator: In
                  values:
                    - <your-nodegroup-name>

    Replace <your-nodegroup-name> with the name of the dedicated node group.

  4. Save the file and deploy the configuration.

    helmfile sync -n pentaho

    The -n pentaho flag ensures that the deployment targets the correct namespace.

Result

You have successfully configured node affinity, taints, and tolerations for Data Catalog workloads. After deployment:

  • Data Catalog worker pods are scheduled only on the nodes defined by the node affinity rules.

  • Other pods are prevented from being scheduled on those nodes unless they include a matching toleration.

  • Your Data Catalog deployment now supports workload segregation across different availability zones or network segments in AWS.

This configuration helps meet compliance and segregation requirements for deployments where worker nodes must belong to distinct security groups or availability zones.

Next steps

  • To verify the configuration, run:

    kubectl get pods -o wide -n pentaho

    Confirm that worker pods are assigned to nodes in the expected node group.

  • Monitor node usage using Amazon EKS Console or the kubectl describe node command.
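
    For example, to confirm that the taint applied in step 2 is still in place on a dedicated node (the node name is a placeholder):

    kubectl describe node <node-name> | grep -A 2 Taints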

Additional information

For more information, see the official AWS documentation: Place Kubernetes pods on Amazon EKS by using node affinity, taints, and tolerations

Install user-provided SSL certificates

To provide a greater level of security to your data, you can use signed Secure Sockets Layer (SSL) certificates from your Certificate Authority (CA) with Data Catalog.

Data Catalog automatically installs self-signed certs in the <install-directory>/conf/https directory as server.key (PEM-encoded private key) and server.crt (PEM-encoded self-signed certificate). You can replace these files with certificates signed by your CA.

Use this procedure to install signed SSL certificates for Data Catalog:

  1. On your Data Catalog server, navigate to the Data Catalog installation directory <install-directory>/conf/https, where <install-directory> is the directory where Data Catalog is installed.

    • server.key is a PEM-formatted file that contains the private key of a specific certificate.

    • server.crt is a PEM-formatted file containing the certificate.

  2. Replace the <install-directory>/conf/https/server.key file with the PEM-encoded private key associated with your signed SSL certificate, or generate a new private key in PEM-encoded format.

  3. Replace the <install-directory>/conf/https/server.crt file with the PEM-encoded signed certificate associated with the private key from the previous step.

    If you generated a new private key, you must obtain a new PEM-encoded signed SSL certificate from your CA.

  4. Append the following certificates to the <install-directory>/conf/extra-certs/bundle.pem file, in this order:

    1. The top-level PEM-encoded signed SSL certificate (that is, the content of the <install-directory>/conf/https/server.crt file).

    2. Intermediate PEM-encoded certificate, if any, from your CA.

    3. Root PEM-encoded certificate, if any, from your CA.

  5. Navigate to the Data Catalog <install-directory>.

  6. Use the following command to restart Data Catalog:

    ./pdc.sh restart
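
    You can also confirm that the installed certificate and private key belong together by comparing their modulus hashes (a quick sanity check, assuming an RSA key; run the commands from <install-directory>):

    openssl x509 -noout -modulus -in conf/https/server.crt | openssl md5
    openssl rsa -noout -modulus -in conf/https/server.key | openssl md5

    The two hash values should match.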

Check and remove outdated certificates

If Data Catalog services fail to start or show SSL-related errors, the issue might be caused by expired or outdated certificates in the bundle.pem file. You can identify and remove outdated certificates by checking their validity dates.

Before you begin

  • Ensure you have access to the server where Data Catalog is installed.

  • Verify that you have permission to view and edit files in the conf/extra-certs directory.

Procedure

  1. Go to the directory where the bundle.pem file is stored:

    cd /<PDCInstallDir>/conf/extra-certs/
  2. Run the following command to check the validity dates of all certificates in the bundle.pem file:

    awk 'BEGIN{block=""}
    /-----BEGIN CERTIFICATE-----/ {block=$0 RS; inblock=1; next}
    inblock {block=block $0 RS}
    /-----END CERTIFICATE-----/ {
       block=block $0 RS
       print "================"
       print block | "openssl x509 -noout -dates"
       close("openssl x509 -noout -dates")
       block=""
       inblock=0
    }' bundle.pem
  3. Review the command output. You see sections similar to the following:

    notBefore=Sep  1 09:01:05 2025 GMT
    notAfter=Sep  1 09:01:05 2026 GMT
    
    notBefore=Jun 10 06:00:00 2023 GMT
    notAfter=Jun 10 06:00:00 2024 GMT
  4. Interpret the fields in the output:

    Field
    Description

    notBefore
    Date from which the certificate becomes valid.

    notAfter
    Date after which the certificate expires.

  5. Compare the current date with the notAfter value. If the current date is later than notAfter, the certificate has expired. For example:

    notAfter=Jun 10 06:00:00 2024 GMT   ❌ Expired (past date)
    notAfter=Sep  1 09:01:05 2026 GMT   ✅ Active (future date)
  6. Edit the bundle.pem file to remove the expired certificates.

    • Use any text editor such as vi or nano.

    • Remove the complete certificate block starting from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----.

  7. Save the file and restart the Data Catalog services.

Result

The bundle.pem file now contains only active certificates, preventing SSL validation issues during Data Catalog startup or connectivity.
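
As a quick sanity check after editing, you can count how many certificate blocks remain in the file:

    grep -c 'BEGIN CERTIFICATE' bundle.pem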

Add email domains to the safe list in Data Catalog after deployment

During the initial deployment of Data Catalog, it is typically configured to allow only a predefined set of email domains for user authentication. However, you might later need to grant access to users with email addresses from new domains. Instead of redeploying Data Catalog, which can cause downtime and operational delays, you can dynamically update the list of allowed email domains using the Identity & Access Management (IAM) APIs.

Adding email domains and SMTP details during the initial Data Catalog deployment is always a best practice. For more information, see Install Pentaho Data Catalog.

Perform the following steps to add email domains to the safe list using IAM APIs after deployment:

Prerequisites

  • You must have administrative access to use the IAM APIs.

  • Identify your Data Catalog DNS (for example, catalog.example.com).

  • Obtain admin credentials to generate a Bearer token.

Procedure

  1. Open a command prompt and run the following cURL command to generate an authentication token for interacting with the IAM APIs:

    curl -k --location 'http://<your-server-url>/keycloak/realms/master/protocol/openid-connect/token' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'username=<admin-username>' \
    --data-urlencode 'password=<admin-password>' \
    --data-urlencode 'client_id=admin-cli' \
    --data-urlencode 'grant_type=password'
    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <admin-username> and <admin-password> with the actual Keycloak master realm admin user credentials. The response includes the token value (see the scripting example after these steps for one way to extract it).

    {"access_token":"<TOKEN_VALUE>"}
  2. Before updating, you can view the currently configured email domains using the following GET command.

    curl -k -X 'GET' \
      'https://<your-server-url>/css-admin-api/api/internal/css-auth-proxy/v1/provider/<provider-id>' \
      -H 'accept: application/json' \
      -H 'Authorization: Bearer <ACCESS_TOKEN>'

    The response displays the current domain configuration:

    {
      "id": "catalog.example.com",
      "emailDomains": ["example.com", "partner.com"]
    }
  3. Run the following IAM API cURL request to update email domains:

    curl -k -X 'PUT' \
      'https://<your-server-url>/css-admin-api/api/internal/css-auth-proxy/v1/provider' \
      -H 'Authorization: Bearer <ACCESS_TOKEN>' \
      -H 'Content-Type: application/json' \
      -d '{
      "id": "<provider-id>",
      "emailDomains": [
        "hv.com",
        "hitachivantara.com",
        "gmail.com"
      ]
    }'
    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <ACCESS_TOKEN> with the token obtained in the previous step.

    • Replace <provider-id> with your Data Catalog server’s domain or IP address used during installation.

    • Modify the "emailDomains" list as needed.
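
For scripting, one way to capture the access token for the subsequent requests is to pipe the response from step 1 through jq (a sketch, assuming jq is installed and using the same placeholders as above):

    ACCESS_TOKEN=$(curl -sk --location 'https://<your-server-url>/keycloak/realms/master/protocol/openid-connect/token' \
      --header 'Content-Type: application/x-www-form-urlencoded' \
      --data-urlencode 'username=<admin-username>' \
      --data-urlencode 'password=<admin-password>' \
      --data-urlencode 'client_id=admin-cli' \
      --data-urlencode 'grant_type=password' | jq -r '.access_token')
    echo "$ACCESS_TOKEN"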

Set up an email server to send Data Catalog notifications

To set up Data Catalog to send email notifications to users, you can configure any Simple Mail Transfer Protocol (SMTP) server that meets your needs.

Examples of notifications include when a user is tagged with '@' in a comment, or when a user is set up in a data pipe template to be notified when a job completes.

Note that the SMTP setup steps in the Installing Data Catalog topic of Install Pentaho Data Catalog only enable the forgot-password functionality.

To integrate an SMTP server with Data Catalog, use the following steps:

  1. Gather the following information for the SMTP server you want to use:

    • Host name of SMTP server (IP address or domain name)

    • Port number for SMTP server

    • Username on SMTP server in <mail userID>@<domain>.com format

    • Password for username

    • Sender mail ID in <mail userID>@<domain>.com format

    • Whether to use Transport Layer Security (TLS) or Secure Sockets Layer (SSL) security.

    • TLS or SSL port number

    For example, you can use Gmail’s SMTP server to send emails from your application. Here are the SMTP server configuration settings for Gmail:

    • SMTP Server Address

      smtp.gmail.com

    • Secure Connection

      TLS/SSL based on your mail client/website SMTP plugin

    • SMTP Username

      your Gmail account (your full Gmail address)

    • SMTP Password

      your Gmail password

    • Gmail SMTP port

      465 (SSL) or 587 (TLS)

  2. Log in to Data Catalog using root user credentials to configure Data Catalog to use the SMTP server, as in the following example:

    https://<full domain name for PDC server>/

  3. In your browser, go to the configuresystem/smtp page on the Data Catalog server, as in the following example:

    https://<full domain name for PDC server>/configuresystem/smtp

    The Configure Your System page opens.

  4. Specify the SMTP server information as detailed in the following table:

Field
Description

Host

IP address or domain name of SMTP server

Port

Port number for SMTP server

Username

User name in <mail userID>@<domain>.com format

Password

Password for user name specified above

Sender Mail

Sender mail ID in <mail userID>@<domain>.com format

Encryption

  • TLS: Default value (leave the Use SSL checkbox blank)

  • SSL: Select the Use SSL checkbox

5. Click Test Connection to test the integration. A success confirmation message is displayed next to the Test Connection button.

6. Click Save Changes.

Update SMTP details in Data Catalog after deployment

Adding Simple Mail Transfer Protocol (SMTP) details in Data Catalog enables email notifications and alerts within the application, such as:

  • Alerts about Data Catalog changes, approvals, and errors like data ingestion, metadata extraction, or synchronization failures.

  • Password reset links when users forget their credentials.

  • Notification alerts when tagged in the comments tab.

SMTP details are typically configured during the initial deployment of Data Catalog. However, if you want to update SMTP details post-deployment, you can use the Identity & Access Management (IAM) APIs without redeploying Data Catalog, avoiding downtime and operational delays.

Adding email domains and SMTP details during the initial Data Catalog deployment is always the best practice. For more information, see the Installing Data Catalog topic in the Get started with Pentaho Data Catalog document.

Perform the following steps to update SMTP details in Data Catalog using IAM APIs after deployment:

Ensure you have sufficient access to use the IAM APIs.

  1. To generate an authentication token to interact with the IAM APIs, open a command prompt and run the following cURL command:

    curl -k --location 'http://<your-server-url>/keycloak/realms/master/protocol/openid-connect/token' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'username=<admin-username>' \
    --data-urlencode 'password=<admin-password>' \
    --data-urlencode 'client_id=admin-cli' \
    --data-urlencode 'grant_type=password'
    
    • Replace <your-server-url> with your Data Catalog server URL.

    • Replace <admin-username> and <admin-password> credentials with the actual admin credentials. The response includes the token value.

    {"access_token":"<TOKEN_VALUE>"}
  2. To update SMTP details, run the following IAM API cURL request:

    curl -X PUT \
      'https://<PDC_HOST>/css-admin-api/api/v1/tenants/<TENANT_NAME>' \
      -H 'Accept: */*' \
      -H 'Authorization: Bearer <TOKEN_VALUE>' \
      -H 'Content-Type: application/json' \
      -d '{
        "realm": "<TENANT_NAME>",
        "smtpServer": {
          "password": "<SMTP_PASSWORD>",
          "replyToDisplayName": "<REPLY_TO_DISPLAY_NAME>",
          "starttls": "<true|false>",
          "auth": "<true|false>",
          "port": "<SMTP_PORT>",
          "host": "<SMTP_HOST>",
          "replyTo": "<REPLY_TO_EMAIL>",
          "from": "<FROM_EMAIL>",
          "fromDisplayName": "<FROM_DISPLAY_NAME>",
          "envelopeFrom": "<ENVELOPE_FROM>",
          "ssl": "<true|false>",
          "user": "<SMTP_USERNAME>"
        }
      }'
    
    Parameter
    Description

    <PDC_HOST>

    The host name or IP address of your Data Catalog instance.

    <TENANT_NAME>

    The tenant name, typically "pdc".

    <TOKEN_VALUE>

    A valid authentication token (must be obtained through IAM authentication).

    <SMTP_PASSWORD>

    The password for the SMTP server authentication.

    <REPLY_TO_DISPLAY_NAME>

    The display name for the reply-to email address.

    <SMTP_PORT>

    The port number used by the SMTP server.

    <SMTP_HOST>

    The SMTP server host address.

    <REPLY_TO_EMAIL>

    The reply-to email address.

    <FROM_EMAIL>

    The email address used to send notifications.

    <FROM_DISPLAY_NAME>

    The display name associated with the sender’s email.

    <ENVELOPE_FROM>

    The envelope sender address (optional).

    <SMTP_USERNAME>

    The username for SMTP authentication.
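
For illustration, a filled-in payload for a typical STARTTLS setup on port 587 might look like the following (all values are placeholders, not defaults):

    {
      "realm": "pdc",
      "smtpServer": {
        "host": "smtp.example.com",
        "port": "587",
        "starttls": "true",
        "ssl": "false",
        "auth": "true",
        "user": "notifications@example.com",
        "password": "<SMTP_PASSWORD>",
        "from": "notifications@example.com",
        "fromDisplayName": "Data Catalog",
        "replyTo": "no-reply@example.com",
        "replyToDisplayName": "Data Catalog",
        "envelopeFrom": ""
      }
    }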

Configure proxy server settings for the Licensing-API service

In Pentaho Data Catalog, the Licensing-API service is responsible for managing and validating software licenses, ensuring that only authorized users and services can access Data Catalog features. When Data Catalog is deployed in an enterprise environment that restricts direct internet access, services like the Licensing-API require a proxy server to reach external licensing servers and authenticate endpoints.

Post deployment of Data Catalog, perform the following steps to configure the proxy server for the Licensing-API service:

When configuring the proxy server for the Licensing-API service, use the domain name instead of the IP address. SSL certificates are typically issued for domain names, ensuring secure communication.

Ensure that you have:

  • Access to the conf/.env and vendor/docker-compose.licensing.yml files.

  • Administrative privileges to modify configuration files and restart services.

  • The required proxy server details (domain, port, username, and password).

  • The SSL certificate file (proxy-cert.pem) if required for secure proxy connections.

Procedure

  1. To configure proxy environment variables, go to the Data Catalog root folder and then open the conf/.env file.

  2. In the conf/.env file, update the following proxy variables with respective values:

    Variable
    Description
    Example Value

    LICENSING_SERVER_PROXY_ENABLED

    Enables or disables proxy configuration.

    true or false

    LICENSING_SERVER_PROXY_DOMAIN

    The domain or IP address of the proxy server.

    10.177.176.126

    LICENSING_SERVER_PROXY_PORT

    The port number used for proxy communication.

    443

    LICENSING_SERVER_PROXY_USER

    The username for proxy authentication.

    admin

    LICENSING_SERVER_PROXY_PASSWORD

    The password for proxy authentication.

    password

    LICENSING_SERVER_PROXY_ENABLED=true
    LICENSING_SERVER_PROXY_DOMAIN=10.177.176.126
    LICENSING_SERVER_PROXY_PORT=443
    LICENSING_SERVER_PROXY_USER=user
    LICENSING_SERVER_PROXY_PASSWORD=password
    

    Note: It is a best practice to avoid hard coding sensitive credentials like PROXY_USER and PROXY_PASSWORD. Use secret management tools or environment variables to secure them.

  3. To update proxy server configuration in Docker Compose, open the vendor/docker-compose.licensing.yml file and update the licensing-api service configuration as follows:

    services:
      licensing-api:
        image: ${GLOBAL_IMAGE_PREFIX}/${LICENSING_API_IMAGE}
        restart: always
        environment:
          LICENSING_SERVER_URL: ${LICENSING_SERVER_URL}
          PROXY_ENABLED: ${LICENSING_SERVER_PROXY_ENABLED}
          # Use domain because SSL requires a domain, not an IP, for configuration
          PROXY_HOST: ${LICENSING_SERVER_PROXY_DOMAIN}
          PROXY_PORT: ${LICENSING_SERVER_PROXY_PORT}
          PROXY_USER: ${LICENSING_SERVER_PROXY_USER}
          PROXY_PASSWORD: ${LICENSING_SERVER_PROXY_PASSWORD}
          # Used for configuring SSL certificate for proxy
          JAVA_EXTRA_CERTS: "cert.pem"
        platform: linux/amd64
        profiles:
          - core
        volumes:
          - ${PDC_CLIENT_PATH}/proxy-cert.pem:/app/cert.pem
    

    Note:

    • The PROXY_ENABLED, PROXY_HOST, PROXY_PORT, PROXY_USER, and PROXY_PASSWORD environment variables are mapped inside the Docker container.

    • The JAVA_EXTRA_CERTS is set to "cert.pem" to configure SSL certificates for proxy authentication.

    • A volume mount is added to ensure that the SSL certificate file proxy-cert.pem is accessible within the container.

  4. (Optional) If the proxy server requires SSL authentication, place the SSL certificate file (proxy-cert.pem) in the specified directory:

    cp /path/to/proxy-cert.pem ${PDC_CLIENT_PATH}/proxy-cert.pem

    Note: Ensure that the file permissions allow access by the Licensing-API service.

  5. After updating the configuration, restart the Data Catalog services to apply the changes:

    ./pdc.sh restart
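
After the restart, you can confirm that the proxy variables are visible inside the Licensing-API container (a quick check; the container name is a placeholder, so look it up first):

    docker ps --format '{{.Names}}' | grep -i licensing
    docker exec <licensing-api-container-name> env | grep PROXY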

Configure Smart Type to SQL feature in Data Catalog

In Data Catalog, you can use the Smart Type to SQL feature, which converts natural-language text into executable SQL queries within Data Pipes. This feature uses a large-language-model (LLM) service to interpret user input and automatically generate valid SQL statements for the selected database tables.

Perform the following steps to configure the Smart Type to SQL feature in Data Catalog:

Before you begin

  • Ensure your Data Catalog deployment includes the ml-gateway-service.

  • Confirm that the aiml profile is enabled. The feature is unavailable without it.

  • Obtain valid credentials or API keys for your LLM provider.

Procedure

For Docker Compose deployment

  1. Go to the conf/.env file in your Data Catalog installation directory.

  2. Add the following environment variables:

    ML_LLM_MODEL="" ML_LLM_API_KEY="" ML_LLM_INFERENCE_BASE_URL=""

  3. Save the file and restart the containers for the configuration to take effect.
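
For example, for the Docker Compose deployment, the variables might be filled in as follows, assuming your ml-gateway-service points at an OpenAI-compatible endpoint (the values are illustrative, not defaults; use the values supplied by your LLM provider):

    ML_LLM_MODEL="gpt-4o-mini"
    ML_LLM_API_KEY="<your-api-key>"
    ML_LLM_INFERENCE_BASE_URL="https://api.openai.com/v1"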

For Kubernetes deployment

  1. Open the values.yaml file of the ml-gateway-service.

  2. Update the following configuration parameters:

    llmModel:
    llmApiKey:
    llmInferenceBaseUrl:

  3. Save the file and redeploy the ml-gateway-service.

Result

You have successfully configured the Smart Type to SQL feature. When the aiml profile is active, users can enter plain-language prompts in the SQL Editor to generate valid SQL queries automatically.

Configure database password encoding for special characters

By default, the pg-migration service in Pentaho Data Catalog automatically encodes special characters (such as @, :, /, and ?) in PostgreSQL database passwords to ensure they are safely included in the database connection URL, preventing connection failures caused by unescaped reserved characters. If your environment requires passing passwords without URI encoding, you can disable this behavior by setting the PDC_PG_MIGRATIONS_DB_PASSWORD_REQUIRED_ENCODING environment variable to false.
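
For example, a password such as p@ss:w0rd/ would be percent-encoded before being placed in the connection URL (the user, host, and database names below are illustrative):

    # Raw password:        p@ss:w0rd/
    # Encoded in the URL:  postgresql://pdc_user:p%40ss%3Aw0rd%2F@postgres-host:5432/pdc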

Perform the following steps to disable URI encoding for passwords used by the pg-migration service:

Before you begin

Ensure that you have permission to edit the PDC environment configuration file (.env or .env.default).

Procedure

  1. Open the environment configuration file:

    • For Docker Compose deployment:

      cd /opt/pentaho/pdc-docker-deployment/vendor
      vi .env.default
    • For Kubernetes deployment, update the value in the values.yaml file of the pg-migrations helm chart.

  2. Locate or add the following environment variable and set it to false:

    PDC_PG_MIGRATIONS_DB_PASSWORD_REQUIRED_ENCODING=false
  3. Save the file and restart the PDC services to apply the change:

    • Docker Compose:

      docker-compose down
      docker-compose up -d
    • Kubernetes (Helm): Redeploy the pg-migrations release so that the updated values.yaml takes effect:

      helm upgrade <pg-migrations-release-name> <chart-path-or-repo> -f values.yaml

      Replace <pg-migrations-release-name> and <chart-path-or-repo> with your actual values.

Result

Data Catalog disables URI encoding for PostgreSQL database passwords used by the pg-migration service. The password is passed to the database connection URL as-is.

Next steps

If you need to modify additional system-level configuration variables, see Configure system environment variables.

Configure table and column sorting order in Data Canvas

By default, Pentaho Data Catalog displays tables and columns in their ordinal order within the Data Canvas, that is, in the same sequence as they appear in the source database. This ordering helps users analyze the data structure as designed in the original schema. However, sometimes, you might prefer to view tables and columns in alphabetical order, which simplifies browsing and locating objects across large schemas. You can modify this behavior by updating the system environment variable PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER in the deployment configuration file.

Perform the following steps to configure table and column sorting order:

Before you begin

Ensure that you have access to the Data Catalog deployment directory on the server.

Procedure

  1. Go to the Data Catalog deployment directory:

    cd /opt/pentaho/pdc-docker-deployment/vendor
  2. Open the .env.default file in a text editor:

    vi .env.default
  3. Locate or update the following environment variable:

    PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER=name,asc
  4. Save the file and exit the editor.

  5. Restart the Data Catalog containers for the changes to take effect:

    sh pdc.sh down
    sh pdc.sh up

This configuration affects only the display order in the Data Canvas and does not modify metadata, lineage, or profiling results.

Result

Tables and columns in the Data Canvas are now displayed in alphabetical order.

Next steps

  • If you prefer to revert to the default ordinal order, set the variable in the .env.default file as follows:

    PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER=ordinalPosition,asc
  • For details on editing environment variables, see Configure system environment variables.

Connect to Business Intelligence Database (BIDB)

The Data Catalog includes the Business Intelligence Database (BIDB) server, which stores and manages reporting metadata. Depending on your PDC version, BIDB is implemented using either PostgreSQL (PDC 10.2.5 and later) or MongoDB (PDC 10.2.1). You can use the respective connection methods and connect to the BIDB server to access reporting data and build dashboards. See the Reporting and data visualization section in the Use Pentaho Data Catalog guide for details about BIDB and the components available in BIDB.

PDC 10.2.5 and later (PostgreSQL)

Beginning with PDC 10.2.5, BIDB has been migrated from MongoDB to PostgreSQL, providing a relational database structure that improves query performance and enhances compatibility with broader tool sets.

Perform the following steps to connect to BIDB (PostgreSQL) in PDC 10.2.5 and later:

  1. Locate the BIDB credentials in the PDC server:

    1. Navigate to the /vendor/.env.default file.

    2. Identify the variables beginning with POSTGRES_BIDB_USER_*.

    3. To list the values, run: cat .env.default | grep 'POSTGRES_BIDB'

  2. Install the required PostgreSQL driver:

    • For JDBC, download the PostgreSQL JDBC driver from the official PostgreSQL site.

    • For ODBC, install the PostgreSQL ODBC driver (psqlODBC) on your system.

  3. Configure your connection:

    • JDBC connection string format: jdbc:postgresql://<HOSTNAME>:5432/bidb (for example, jdbc:postgresql://pdc.pentaho.com:5432/bidb)

    • ODBC DSN settings:

      • Server: <HOSTNAME>

      • Port: 5432

      • Database: bidb

      • Username: bidb_ro

      • Password: ${POSTGRES_BIDB_USER_PASSWORD}

  4. Save the configuration in your reporting or analytics tool.

  5. Test the connection to confirm access.

Important:

  • It is best practice not to hardcode credentials in your application. Instead, reference the environment variables (POSTGRES_BIDB_USER_*) to ensure secure and flexible credential management.

  • Ensure that the .env.default file is stored securely and is not shared publicly.
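
To test the connection from the command line (step 5), you might use the psql client, assuming it is installed on the machine you are connecting from; you are prompted for the POSTGRES_BIDB_USER_PASSWORD value:

    psql -h <HOSTNAME> -p 5432 -U bidb_ro -d bidb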

PDC versions prior to 10.2.5 (MongoDB)

In PDC versions prior to 10.2.5, the Business Intelligence Database (BIDB) is implemented using MongoDB. To connect to the BIDB server, you can use either the Java Database Connectivity (JDBC) connector or the Open Database Connectivity (ODBC) connector, depending on your application or reporting tool requirements.

Configure the Java Database Connectivity (JDBC) connector

Perform the following steps to configure the JDBC connector for connecting to BIDB:

  1. Download the MySQL JDBC Connector JAR file from the MySQL website after selecting the appropriate version for the operating system.

  2. Download the DBeaver application from the DBeaver website and install it on the system. See DBeaver installation for more details.

  3. To add the MySQL JDBC Driver and MySQL authentication plugin to DBeaver, open DBeaver and go to Database > Driver Manager.

  4. Click New to add a new driver.

  5. Select MySQL from the list and enter a name for the driver.

  6. Click Browse to locate and select the downloaded JDBC driver (JAR file) and the MySQL authentication plugin, then click OK or Finish to add the driver.

  7. After adding the MySQL driver, to create a New Connection, go to the DBeaver home page, click New Database Connection, and select MySQL as the database type.

  8. Enter the MySQL server connection details, such as host, port, username, password, and so on.

  9. Specify the JDBC URL, including the MongoDB authentication plugin, in the connection settings, as in the following example:

    jdbc:mysql://20.8.222.21:3307?useSSL=false&authenticationPlugins=org.mongodb.mongosql.auth.plugin.MongoSqlAuthenticationPlugin
  10. Click Test Connection to verify the connection is working.

  11. Click Finish to save the connection configuration.

Configure the Open Database Connectivity (ODBC) connector

The MongoDB ODBC connector allows you to connect tools that support ODBC to MongoDB and query the data using SQL. Perform the following steps to configure the ODBC connector for connecting to BIDB.

  1. Download and install the MongoDB ODBC connector.

    See MongoDB BI Connector ODBC Driver for more information.

  2. Download and install an ODBC driver manager on your system.

    For example, on the Windows operating system, you can use the default Windows ODBC Data Source Administrator.

  3. Open the ODBC Data Source Administrator on your machine and go to the System DSN tab.

  4. Click Add to add a new data source and select the MongoDB Driver.

  5. To configure the DSN (Data Source Name) settings:

    1. Set the server field to the address of your MongoDB server.

    2. Enter the port number if it differs from the default (27017).

    3. Enter the required authentication details, such as the username and password.

    4. As a part of the connection details, enter the plugin directory details.

    5. Set the SSL Mode to Disabled in the SSL configuration.

  6. Click Test to verify that the connection is working.

  7. Click OK to save the connection configuration.

Configure a machine learning (ML) server connection in Data Catalog

You can connect a machine learning (ML) server to Data Catalog and import ML model server components into the ML Models hierarchy. Supported server types include:

  • Pre-Production Model Servers such as MLflow, which capture experiments, runs, versions, and artifacts.

  • Production Model Servers such as NVIDIA Triton, which provide model deployment, inference statistics, and operational metrics.

Once configured, the ML server appears under the Synchronize card in the Management section of Data Catalog, allowing you to import model components into the ML Models hierarchy. For more information about ML Models, see the ML Models section in Use Pentaho Data Catalog.

Configure an MLflow server connection

Perform the following steps to configure a pre-production MLflow server connection in Data Catalog.

Before you begin:

  • Make sure you have access to the MLflow server you want to connect to.

  • If the MLflow server requires authentication, make sure you have the necessary credentials, either a valid username and password or an access token.

Procedure

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If not available, create it.

  2. Open the external-data-source-config.yml file and add ML server configuration:

    servers:
      - id: {SERVER_ID}
        name: {SERVER_NAME}
        type: {SERVER_TYPE}
        url: {SERVER_URL}
        config:
          username: {username}
          password: {password}
          access_token: {access_token}
Parameter
Description
Example

id

Unique identifier (UUID) for the ML server.

916d3b20-7fd6-49d2-b911-cc051f56e837

name

Display name for the server. This name appears in the UI.

MLflowServer

type

Type of server (enum value). For an ML server, use ‘MlFlow’.

MlFlow

URL

The base URL of the ML server.

http://mlflow.mycompany.com

config

Configuration keys specific to the ML server you are configuring. Include these only if authentication is enabled, using either:

  • Username and password

  • Access token

3. After configuring the ML server in the YAML file, restart the PDC services to apply the changes.
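
For a Docker Compose deployment, restarting from the deployment root directory is typically sufficient (as used elsewhere in this guide):

    ./pdc.sh restart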

You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.

Configure a Triton server connection

Perform the following steps to configure a production Triton inference server connection in Data Catalog:

Before you begin:

  • Ensure the Triton inference server is running and accessible.

  • If running Triton in Docker, confirm that the ports are exposed (HTTP, gRPC, Metrics). For example:

version: "3.9"
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.11-py3
    volumes:
      - ./models:/models
    command: ["tritonserver", "--model-repository=/models", "--backend-config=execution_mode=cpu", "--strict-model-config=false"]
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics

Procedure

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If not available, create it.

  2. Open the external-data-source-config.yml file and add ML server configuration:

    servers:
      - id: {SERVER_ID}
        name: {SERVER_NAME}
        type: Triton
        url: {TRITON_SERVER_URL}
        config:
          metadata_port: {metadata-port}
          metrics_port: {metrics-port}
Parameter
Description
Example

id

Unique identifier (UUID) for the Triton server.

44e2fa51-e3af-4094-8dd3-c62320952de5

name

Display name for the server. This name appears in the UI.

Triton-Prod-server

type

Type of server (enum value). For Triton server, use ‘Triton’.

Triton

URL

The base URL where the Triton server is deployed.

http://192.168.0.10

config.metadata_port

HTTP port configured when deploying the Triton server. The default port is 8000.

8000

config.metrics_port

Metrics port configured by the user when deploying the Triton server. The default is 8002.

8002

3. Save the YAML file and restart the PDC services to apply the changes.

You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.

Configure a Tableau server connection in Data Catalog

You can configure a connection between a Tableau server and Data Catalog to import Tableau metadata such as dashboards, workbooks, projects, and data sources into the Business Intelligence (BI) section of Data Catalog. To learn more, see the Business Intelligence section in the Use Pentaho Data Catalog guide. This section describes the step-by-step procedure to configure the Tableau server connection in Data Catalog.

Before you begin:

  • Make sure you have access to the Tableau Cloud or Tableau Server instance you want to connect to. The URL format looks like:

    https://<region>.online.tableau.com/#/site/<site-id>/home
  • Identify the Site ID for the Tableau site. For Tableau Cloud, you can find this in the URL after /site/.

  • Generate a valid Personal Access Token (PAT) in Tableau, including PAT name and PAT secret.

Perform the following steps to configure a connection between the Tableau server and Data Catalog:

  1. Verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If not available, create it.

  2. Open the external-data-source-config.yml file and add Tableau server configuration:

    servers:
      - id: dev-8f012f9ca7
        name: Test_Server
        type: Tableau
        url: https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin
        config:
          pat_name: 'test'
          pat_secret: 'kITbTaYmTPSdZ7ADeP11VA==:hwt9jkehQqGuq72Lh9V4wiFlfZcIpny8'
    
    Parameter
    Description
    Example

    id

    The site ID (unique identifier) of the Tableau site to connect to, as seen in the Tableau Cloud URL.

    dev-8f012f9ca7

    name

    Display name for the server. This name appears in the UI.

    TableauServer

    type

    Type of server (enum value). For Tableau server, use ‘Tableau’.

    Tableau

    URL

    The Tableau REST API authentication endpoint. Use the signin endpoint for the Tableau site.

    https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin

    config

    Configuration keys specific to the Tableau server you are configuring.

    - pat_name

    The name of the Tableau Personal Access Token (PAT) used for authentication.

    - pat_secret

    The secret key associated with the PAT. Ensure this is stored securely and never exposed.

  3. After configuring the Tableau server in the YAML file, restart the following PDC services to apply the changes:

    • Frontend service (fe)

    • Worker service (ws-default)

    # Restart the frontend and worker services
    ./pdc.sh restart fe
    ./pdc.sh restart ws-default
    

You have successfully configured the Tableau server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.

You can now import Tableau server components into the Business Intelligence hierarchy of Data Catalog.

Configure a Power BI service connection in Data Catalog

You can connect the Microsoft Power BI service to Data Catalog and import Power BI metadata into the Business Intelligence section. This integration lets you discover, explore, and manage Power BI reports, datasets, and workspaces directly from Data Catalog. This section describes the step-by-step procedure to configure the Power BI service connection in Data Catalog.

Before you begin:

  • Ensure that you have a valid Microsoft account with an active Power BI service license. Contact your Microsoft administrator if you need access.

  • Register an Azure Active Directory (Azure AD) application in the Azure portal. For more information, see Microsoft guide to register an app.

  • Generate a client secret in the Azure AD application and store it securely. For more information, see Microsoft guide to add client secrets.

  • Assign the following API permissions to the Azure AD app:

    Permission
    Description

    App.Read.All

    View all Power BI apps

    Capacity.Read.All

    View all capacities

    Dashboard.Read.All

    Read dashboards

    Dataflow.Read.All

    Read dataflows

    Dataset.Read.All

    View all datasets

    Report.Read.All

    Read reports

    Tenant.Read.All

    View all content in tenant

    Workspace.Read.All

    View all workspaces

    See Power BI automation permissions for more information.

  • Ensure that a service account is available with access to all Power BI workspaces that you need to integrate with Data Catalog.

  • Confirm outbound HTTPS access to https://api.powerbi.com on port 443 from the Data Catalog server.

  • Collect the following details before continuing:

    Field
    Description

    Client ID

    From Azure AD App registration

    Client Secret

    Securely generated during Azure AD registration

    Tenant ID

    Azure AD Directory unique identifier

    OAuth2 Token URL

    https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token

    API Host URL

    https://api.powerbi.com

    Username

    Credentials for the Power BI server.

    Password

    Credentials for the Power BI server.

  • Admin consent must be granted for all permissions during Azure AD app setup.

  • A Power BI Pro or Premium Per User (PPU) license is required to use the Power BI REST APIs for integration with Data Catalog.

  • Data Catalog uses the Resource Owner Password Credentials (ROPC) grant type for authentication.

Perform the following steps to configure a connection between the Power BI server and Data Catalog:

  1. Connect to the Data Catalog server using SSH.

    1. From your local machine, open a terminal (for Linux or macOS) or an SSH client, such as PuTTY (for Windows).

    2. Enter the SSH command to connect to the server where Pentaho Data Catalog is installed.

      ssh <pdc-admin>@<pdc-server-ip>

      Replace <pdc-admin> with your server login account and <pdc-server-ip> with the server’s IP address or hostname.

    3. When prompted, enter the password for the specified account. After a successful login, you will have access to the Data Catalog server’s command line.

  2. Navigate to the configuration folder and verify whether the file external-data-source-config.yml exists in the path ${PDC_CLIENT_PATH}/external-datasource/. If not available, create it.

    /opt/pentaho/pdc-docker-deployment/conf/external-data-source
  3. Open the external-data-source-config.yml file and add Power BI server configuration:

    servers:
      - id: <your-server-id>
        name: PowerBIServer
        type: PowerBI
        url: https://login.microsoftonline.com/organizations/oauth2/v2.0/token
        config:
          client_id: <your-client-id>
          username: <your-username>
          password: <your-password>
          client_secret: <your-client-secret>
          host_url: https://api.powerbi.com
    
    Parameter
    Description
    Example

    id

    The unique identifier for the Power BI server connection. You define this value in the YAML.

    dev-powerbi01

    name

    Display name for the server. This name appears in the Data Catalog UI.

    PowerBIServer

    type

    Type of server (enum value). For Power BI server, use ‘PowerBI’.

    PowerBI

    URL

    The Microsoft identity platform OAuth 2.0 token endpoint. Used for authentication requests.

    https://login.microsoftonline.com/organizations/oauth2/v2.0/token

    config

    Configuration keys specific to the Power BI server you are configuring.

    - client_id

    The application (client) ID generated during Azure AD app registration.

    12345678-abcd-1234-abcd-1234567890ab

    - username

    The Microsoft account username with access to the Power BI service.

    - password

    The password associated with the Microsoft account username. Ensure this is stored securely.

    - client_secret

    The client secret generated for the Azure AD app. Ensure this is stored securely and never exposed.

    abcdEFGH12345!@#xyz

    - host_url

    The base API endpoint for Power BI service.

    https://api.powerbi.com

  4. After configuring the Power BI server in the YAML file, restart the PDC services to apply the changes:

    ./pdc.sh up
    

You can now import Power BI server components into the Business Intelligence hierarchy of Data Catalog.

Configure the Physical Assets service in Data Catalog

In Pentaho Data Catalog, you can import operational technology (OT) components, including device services, locations, devices, and values, and view them in the Physical Assets section of Data Catalog in a hierarchical structure. With the Physical Assets feature, you can understand how data flows from physical sources into analytical systems, enabling better traceability and context. Additionally, users can enrich asset nodes with business terms, policies, lineage, and metadata to strengthen data governance and compliance. For more information, see Physical Assets in the Use Pentaho Data Catalog document.

To use the Physical Assets feature in Data Catalog, you must first configure the Physical Assets service. This involves completing the following procedures:

  • Enable the Physical Assets service in Data Catalog

  • Configure Pentaho Edge for the Physical Assets service

Note: The configuration steps assume Data Catalog is already installed. For installation instructions, see Install Data Catalog in the Install Pentaho Data Catalog document.

Enable the Physical Assets service in Data Catalog

Perform the following steps to enable the Physical Assets service in the existing Data Catalog deployment.

  1. Go to the vendor folder:

    /pentaho/pdc-docker-deployment/vendor
  2. Open the .env.default file in a text editor:

    vi .env.default
  3. Update the following lines:

    COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm,physical-assets
    ASSET_HIERARCHY_URL=/physical-assets-service
    
  4. Add the Pentaho Edge connection details:

    PENTAHO_EDGE_URL=http://<PE-IP>:4000
    PENTAHO_EDGE_USERNAME_PASSWORD=admin:admin
    PENTAHO_EDGE_BACKEND_URL=https://<PE-IP>:8443
    

    Replace <PE-IP> with the IP where Pentaho Edge is installed.

  5. Restart PDC to apply changes:

    cd /pentaho/pdc-docker-deployment
    ./pdc.sh up
    

You have successfully enabled the Physical Assets service in the Data Catalog deployment. The service is now active and ready to connect with Pentaho Edge to receive physical assets metadata.

Configure Pentaho Edge for the Physical Assets service

Perform the following steps to configure Pentaho Edge to connect it to Data Catalog.

  1. Clone the Pentaho Edge installer repository and navigate to the installer folder:

    git clone <repo-url>
    cd installer
    
  2. Edit the docker-compose-pentaho-edge.yml file:

    ENABLE_ASSET_HIERARCHY_FEATURE=true
  3. Save and close the docker-compose-pentaho-edge.yml file.

  4. Open the .env file:

    vi .env
  5. Update the following properties:

    PDC_ASSET_HIE_SERVICE_BASE_URL=https://<PDC-IP>/physical-assets-service/api/v1/assets
    AUTH_URL=https://<PDC-IP>/keycloak
    PDC_INSECURE_SKIP_VERIFY=true
    ENABLE_ASSET_HIERARCHY_FEATURE=true
    

    Note:

    • Replace <PDC-IP> in the URLs with the IP address where pdc-docker-deployment is installed.

    • Use the FQDN instead of the IP address if needed.

      PDC_ASSET_HIE_SERVICE_BASE_URL=https://<FQDN>/physical-assets-service/api/v1/assets
      AUTH_URL=https://<FQDN>
      PDC_INSECURE_SKIP_VERIFY=true
      ENABLE_ASSET_HIERARCHY_FEATURE=true
      
  6. Add authentication properties:

    AUTH_USERNAME=
    AUTH_PASSWORD=
    AUTH_CLIENT_ID=
    AUTH_REALM=
    
  7. Run the Edge installer script:

    ./install.sh
  8. When prompted, provide a user ID and password.

You have successfully configured Pentaho Edge to support the Physical Assets hierarchy and configured the connection to Data Catalog. You can now view OT assets in the Physical Assets in Data Catalog.

Configure PDI to send lineage to Data Catalog

You can configure Pentaho Data Integration (PDI) to write lineage information from key lineage events into the Data Catalog metadata store. When configured, PDI writes lineage metadata for supported lineage events to the Data Catalog metadata store. Data Catalog continuously runs an API that reads the lineage information from PDI. Both PDI and Data Catalog support the OpenLineage framework for data lineage collection and analysis.

See Data lineage in Use Pentaho Data Catalog guide for information on the specific lineage events that are supported.

You must perform these steps on PDI.

Before you begin this task, turn off PDI and the Pentaho Server.

Perform the following steps to set up PDI to send lineage metadata to Data Catalog.

  1. On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet. If you don't have credentials, contact your PDI administrator.

  2. On the Pentaho card, click Download.

  3. Navigate to the Marketplace location with plugin downloads.

  4. Download the PDI OpenLineage plugin.

  5. Extract the downloaded package.

  6. Run the installer for PDI:

    1. Run install.sh if on Linux, or install.bat if on Windows.

    2. Install in the <data-integration> folder.

  7. Run the installer for Pentaho Server:

    1. Run install.sh if on Linux, or install.bat if on Windows.

    2. Install in the <pentaho-server> folder.

  8. Create a config.yml file, adding the correct users and passwords for your environment, and the URL for Data Catalog.

    There is an example in the readme.txt file:

    Example of a configuration file
    =================================
    version: 0.0.1
    consumers:
      console:
      file:
        - path: /path/to/file
      http:
        - name: Marquez
          url: http://localhost:5001
        - name: PDC
          url: https://pdc.example.com
          endpoint: /lineage/api/events
          authenticationParameters:
            endpoint: /keycloak/realms/pdc/protocol/openid-connect/token
            username: user
            password: pass
            client_id: pdc-client-in-keycloak
            scope: openid
  9. Edit the ~/.kettle/kettle.properties file and add the following properties:

    KETTLE_OPEN_LINEAGE_CONFIG_FILE=</full/path/to/your/openlineage/config.yml>  
    KETTLE_OPEN_LINEAGE_ACTIVE=true
  10. Restart PDI and the Pentaho Server.

Integrate Active Directory with Pentaho Data Catalog

You can integrate Microsoft Active Directory (AD) with Pentaho Data Catalog (PDC) to enable users of AD to have single sign-on access to PDC. Part of this integration includes configuring the Keycloak identity and access management tool to use AD as an identity provider.

The configuration includes the following topics:

  • Verify the LDAP server configuration

  • Configure the LDAP provider

  • Connect to AD using the LDAP server's SSL certificate (optional)

  • Configure LDAP mappers in Keycloak

After importing AD users to PDC, you need to perform the following operations from Active Directory, because they can no longer be done from the Data Catalog User Management page:

  • Edit a user

  • Add a new user

  • Delete a user

Verify the LDAP server configuration

To integrate Active Directory with Pentaho Data Catalog, you need to integrate Lightweight Directory Access Protocol (LDAP) with Keycloak. You first need to check that your LDAP server is configured correctly.

For detailed information on how to configure LDAP in your environment, consult your LDAP server documentation.

You should have the following components in an example configuration:

  • Base DN: Base Distinguished Name, such as: dc=example,dc=com, where dc is the domain component. The Base DN is the root entry where you want to start your LDAP searches.

  • User DN: User Distinguished Name, such as: ou=users,dc=example,dc=com, where ou is the organizational unit and dc is the domain component.

  • Groups DN: Groups Distinguished Name, such as: ou=groups,dc=example,dc=com, where ou is the organizational unit and dc is the domain component.
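
One way to verify these values before configuring Keycloak is to run a simple search against the LDAP server (a sketch using the OpenLDAP client tools and the example DNs above; you are prompted for the bind password):

    ldapsearch -x -H ldap://ldap.example.com:389 \
      -D "cn=administrator,dc=example,dc=com" -W \
      -b "ou=users,dc=example,dc=com" "(objectClass=person)" cn mail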

Next steps

Configure the LDAP provider

To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure the LDAP provider for PDC in the Keycloak interface.

Use the following steps to configure the LDAP provider:

  1. Navigate to your Keycloak admin console (such as https://<FQDN>/keycloak/) and log in with admin credentials.

  2. Select the PDC realm.

    If you haven't already configured an LDAP provider, click Add provider and select ldap. If you have an existing LDAP provider, click on it to edit.

  3. Enter the following information for the LDAP provider:

    Field
    Value

    Vendor

    Active Directory

    Connection URL

    ldap://<LDAP_SERVER>:<PORT>, such as: ldap://localhost:389

  4. Click Test connection.

    You should get a success message.

  5. Enter the following information on the remainder of the page:

    Field
    Value

    Bind type

    Select simple

    Bind DN

    DN for your LDAP admin user, such as: cn=administrator,dc=example,dc=com

    Bind credentials

    Password for the LDAP admin user

  6. Click Test authentication.

    You should get a success message.

The LDAP provider is configured for use with AD.

Connect to AD using the LDAP server's SSL certificate (Optional)

When you use an LDAP server with Pentaho Data Catalog (PDC), you can use the LDAP server's SSL certificate to securely connect to Active Directory (AD). This is an optional step in integrating AD with PDC.

For more information on integrating AD with PDC, see Integrate Active Directory with Pentaho Data Catalog. Refer to Keycloak documentation if necessary.

Perform the following steps to use the LDAP server's SSL certificate to connect to AD.

  1. To retrieve the certificate from your LDAP server, enter the following command:

    openssl s_client -connect ldap.example.com:636 -showcerts

  2. Copy the entire certificate chain (from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----) and save it to a file, such as ldap-cert.pem.

  3. Update the <PDC_INSTALL_LOCATION>/conf/extra-certs/bundle.pem file with the LDAP server's SSL certificate, where <PDC_INSTALL_LOCATION> is the directory where PDC is installed. (A combined example of steps 1 through 4 is shown after this procedure.)

  4. Restart PDC services by entering the following command:

    sh pdc.sh restart

  5. Log in to the Keycloak admin console (https://<FQDN>/keycloak/).

  6. Navigate to the PDC realm.

  7. Click User Federation.

  8. Click the LDAP provider to edit it.

  9. Enter the following LDAP settings:

Field
Value

UI display name

Name to display, such as LDAPS

Vendor

Select Active Directory

Connection URL

ldaps://<LDAP_SERVER>:<PORT>, such as: ldaps://ldap.example.com:636

10. Click Test connection. You should see a success message.

11. Enter the remaining LDAP connection and authentication settings:

Field
Value

Bind type

Select simple

Bind DN

DN to bind to the LDAP server, such as: cn=admin,dc=example,dc=com.

Bind credentials

Password for the Bind DN

12. Click Test authentication. You should see a success message.

13. Enter values for the required LDAP searching and updating settings:

Field
Value

Edit mode

It is a best practice to set this to Readonly

Users DN

Specify the DN where the user entries are located, such as: ou=users,dc=example,dc=com

Username LDAP attribute

cn

RDN LDAP attribute

cn

UUID LDAP attribute

objectGUID

User object classes

person, organizationalPerson, user

14. Click Save to save the configuration.

AD is set up to use the SSL certificate of the LDAP server for a secure connection.

Optionally, you can configure the following settings:

  • Set how often Keycloak should sync with LDAP.

  • Set periodic full sync and periodic changed users sync.
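
The following is a minimal sketch that combines steps 1 through 4 of this procedure into a single shell session. It assumes the default install location /opt/pentaho/pdc-docker-deployment and the example host ldap.example.com; adjust both for your environment.

    # Retrieve the certificate chain presented by the LDAP server and save it as PEM.
    openssl s_client -connect ldap.example.com:636 -showcerts </dev/null 2>/dev/null \
      | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' > ldap-cert.pem

    # Append the chain to the trust bundle read by PDC.
    cat ldap-cert.pem >> /opt/pentaho/pdc-docker-deployment/conf/extra-certs/bundle.pem

    # Restart PDC services so the new certificate is picked up.
    cd /opt/pentaho/pdc-docker-deployment
    sh pdc.sh restart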

Configure LDAP mappers in Keycloak

To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure LDAP mappers so that PDC has the necessary information (such as usernames, email addresses, or group memberships) to connect to an LDAP directory.

The Keycloak LDAP mapper translates attributes stored in an LDAP directory into the corresponding attributes needed by PDC.

Use the following steps in Keycloak to configure the LDAP mappers for Data Catalog. See Keycloak documentation for more information.

  1. In your Keycloak admin console (such as https://<FQDN>/keycloak/), log in with admin credentials.

  2. Select the PDC realm and go to the User Federation settings.

  3. Click the LDAP provider and go to the Mappers tab.

  4. Map the LDAP attribute for the username.

  5. Map other user attributes as needed (such as email, first name, last name).

  6. To add additional mappers to assign default roles for the users being imported from AD, enter the following settings under User federation > Settings > Mapper details.

    For the Business User role:

    Field
    Value

    Name

    Business_User_Role_Mapper_To_LDAP_USERS

    Mapper type

    hardcoded-ldap-role-mapper

    Role

    Business_User (select from menu and click Assign)

  7. Click Save.

  8. Repeat step 6 for the Data User role, using the following values:

    Field
    Value

    Name

    Data_User_Role_Mapper_To_LDAP_USERS

    Mapper type

    hardcoded-ldap-role-mapper

    Role

    Data_User (select from menu and click Assign)

  9. Click Save.

  10. Save the configuration.

  11. From the Action menu, click Sync all users to import users from LDAP.

    A success message displays.

    When users are synced from AD, the default PDC realm assigns the Business User and Data User roles to all the users.

    Note: PDC applies limits for licensing when users receive one or more of the following roles:

    • Business Steward

    • Data Steward

    • Data Developer

    • Admin

  12. Go to Users and verify that the users were imported correctly into Keycloak.

Configure PDC permissions for an AD user

The last step in integrating Active Directory (AD) with Pentaho Data Catalog (PDC) is to set up permissions in PDC for the AD users.

Use the following steps to create and verify an AD user.

  1. Log into Data Catalog as the admin user.

  2. Click Management, and on the Users & Communities card, click Users.

  3. Check that the imported users display correctly and make any needed adjustments.

  4. Select an AD user and assign a community or role to the user.

  5. Click Save, and log out.

  6. Log in as the AD user to verify the login is working properly.

Integrate Okta with Pentaho Data Catalog

You can integrate Okta authentication with Pentaho Data Catalog for the added security provided by multi-factor authentication. To integrate Okta with Pentaho Data Catalog, you need to configure Okta in parallel with the Keycloak identity and access management tool.

The steps in the integration process are:

Add an OIDC provider in Keycloak

To integrate Okta with Pentaho Data Catalog, you need to set up an identity provider in Keycloak. Keycloak uses the OpenID Connect (OIDC) protocol to connect to identity providers.

If necessary, see the Keycloak documentation to complete this task.

Perform the following steps in the Keycloak interface:

  1. Log in to Keycloak and select the PDC realm.

    If a PDC realm does not already exist, consult your PDC administrator or see Creating a realm in the Keycloak documentation to create one.

  2. Click Identity Providers and select OpenID Connect v1.0.

    If necessary, see OpenID Connect v1.0 identity providers in the Keycloak documentation.

  3. Use the following steps to add an OIDC ID provider:

    1. Enter an alias in the Alias field.

      This populates the Redirect URI field, in a format like the following:

      http://localhost:8180/realms/master/broker/<alias>/endpoint
    2. Copy the Redirect URI to be used in the next task, Add an OpenID Connect application in Okta.

Perform the following tasks:

Add an OIDC application in Okta

The next step in integrating Okta with Pentaho Data Catalog is to add an OpenID Connect (OIDC) application in Okta.

Before beginning this task, make sure to perform this task:

In this task, you need the Keycloak Identity Provider Redirect URI you copied in the Add an OpenID Connect provider in Keycloak task. If necessary, see Launch the wizard in the Okta documentation.

Perform the following steps in the Okta Admin console:

  1. From the left menu, click Applications and then Applications.

  2. Click Create App Integration.

  3. Select or enter the following values:

    Field
    Value

    Sign-in method

    OIDC – OpenID Connect

    Application type

    Web Application

    App integration name

    CatalogPlus_10.2.1

    Grant type

    Authorization Code

    Sign-in redirect URIs

    Keycloak Identity Provider Redirect URI copied in the Add an OpenID Connect provider in Keycloak task

  4. For Sign-out redirect URIs, configure your logout URI in this format:

    https://<application_url>/keycloak/realms/<realm_name>/broker/<alias_name>/endpoint/logout_response

    For example:

    https://<ip address>/keycloak/realms/pdc/broker/okta/endpoint/logout_response
  5. Continue entering values in Okta screens:

    Field
    Value

    Controlled access

    Select the default value, Allow everyone in your organization to access

    Enable immediate access

    Clear the checkbox

  6. In the General tab, make a note of the Client Id and Client Secret to use in the Configure an identity provider in Keycloak task.

  7. Click Save.

  8. On the left menu, click Applications.

  9. From the down arrow, select Assign to Groups.

  10. Assign a group to your application.

Perform the following tasks:

Set up security in Okta

When integrating Okta with Pentaho Data Catalog, you need to set up security in Okta for the connection to PDC.

Before beginning this task, make sure to perform these tasks:

Perform the following steps in the Okta admin console:

  1. On the left menu, click Security, then API, then Default.

  2. On the Access Policies tab, click Add New Access Policy.

  3. Add details for the policy and click Create Policy.

  4. Click Add Rule.

  5. Add details for the rule and click Create Rule.

Perform the following tasks:

Configure an identity provider in Keycloak

To integrate Okta with Pentaho Data Catalog, you need to configure an identity provider in Keycloak. If necessary, see the Keycloak documentation.

Before beginning this task, make sure to perform these tasks:

In this task, you need the Client Id and Client Secret you noted during the Add an OpenID Connect application in Okta task.

Perform the following steps in the Keycloak admin console:

  1. From the left menu, click Identity providers.

  2. Click OpenID Connect v1.0.

  3. Make sure the Use discovery endpoint switch is on.

  4. In the Discovery endpoint field, enter the discovery endpoint URL in the following format:

    https://hostname/auth/realms/master/.well-known/openid-configuration

    The Authorization URL, Token URL, Logout URL, User Info URL, and other fields populate automatically. (A quick check of the discovery endpoint is shown after these steps.)

  5. Enter the Client Id and Client Secret noted during the Add an OpenID Connect application in Okta task.

  6. On the Settings tab, select the following settings:

    • First login flow override: First login flow override

    • Sync mode: Force

  7. Expand the Advanced settings and set the Scopes setting to openid email profile (separated by a single space).

  8. Click Save.
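
If the URLs do not populate as expected, you can check the discovery endpoint directly from a terminal. This is a minimal sketch; the hostname is a placeholder for your identity provider's host, following the format shown in step 4.

    # Fetch the OpenID Connect discovery document and show the start of the response.
    curl -s https://hostname/auth/realms/master/.well-known/openid-configuration | head -c 500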

Perform the following task:

Sign in to Pentaho Data Catalog using Okta

After Pentaho Data Catalog is integrated with Okta, you have the option to log in to PDC with Okta.

Before beginning this task, make sure to perform these tasks:

To log in to PDC using Okta, perform the following steps:

  1. On the PDC login screen, click the button corresponding to the Okta alias.

    Note: The alias matches whatever is set for Okta OpenID Connect in Keycloak.

    In the following example, the button is labeled CATALOG+OKTA:

    Updated PDC login screen after Okta integration
  2. On the Okta login screen that appears, enter the credentials for the Okta user assigned to PDC.

    Okta prompts you to enter a code.

  3. To finish logging in, enter the code that Okta provides.

You have completed the integration of Okta with PDC.

Configure Metadata Request Access in Data Catalog

You can configure Metadata Request Access in Data Catalog to manage user requests for metadata access. This feature uses the Access Request Service, a backend service that integrates with ticketing tools such as Jira or ServiceNow. The service creates, tracks, and updates access requests, and synchronizes their status between Data Catalog and the external system. This guide describes how to configure Metadata Request Access in Data Catalog.

By default, only metadata access requests are managed in Data Catalog. Data access requests are routed to an integrated ticketing system.

Before you begin:

  • Ensure you have:

    • Network connectivity between the service and Jira or ServiceNow.

    • Credentials for authenticating with your identity provider and your ticketing system.

  • Confirm that the external system (Jira or ServiceNow) contains a custom field to store the access request status (for example, Access Status with values Approved and Rejected). For more information, see Integrating ServiceNow with Data Catalog and Integrating Jira with Data Catalog.

  • Gather the following details:

    • Service account credentials for Jira or ServiceNow.

    • Database connection parameters.

    • Authentication service endpoint, client ID, and credentials.

Perform the following steps to configure Metadata Request Access in Data Catalog:

  1. Configure the service by setting the required environment variables. Use either your deployment tool (for example, Docker Compose, Kubernetes, or Helm) or a configuration file. The key environment variables are listed in the following tables; an illustrative example appears after these steps.

    Key environment variables

    Variable
    Description
    Required
    Example

    TENANT_NAME

    Tenant identifier

    Yes

    your-tenant

    ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE

    Default assignee email if no auto-assignment

    Yes

    STATUS_FETCHER_INTERVAL

    Poll interval for status updates; supports cron or @every syntax

    Yes

    @every 5m

    PAGINATION_MAX_RESULTS_SIZE

    Maximum results per API request

    No

    1000

    LOG_LEVEL

    Logging verbosity (info, debug, warn, error)

    No

    info

    CAUTION: The value of ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE must match a user created in Data Catalog with the Admin role using User Management.

    Authentication settings

    Variable
    Description
    Required
    Example

    AUTH_URL

    Authentication endpoint URL

    Yes

    https://auth.example.com

    AUTH_HOST

    Authentication host and port

    Yes

    auth-host:5000

    AUTH_CLIENT_ID

    Client ID for authentication

    Yes

    generic-client-id

    AUTH_USER_NAME

    Username for authentication

    Yes

    generic-user

    AUTH_PASSWORD

    Password for authentication

    Yes

    generic-password

    Database settings

    Variable
    Description
    Required
    Example

    DB_URL

    Database host

    Yes

    generic-db-host

    DB_PORT

    Database port

    Yes

    5432

    DB_NAME

    Database name

    Yes

    generic_db

    DB_USER

    Database user

    Yes

    generic-db-user

    DB_PASSWORD

    Database password

    Yes

    generic-db-password

    DB_SSL_MODE

    SSL mode (disable, require)

    Yes

    disable

    Ticketing provider selection

    Variable
    Description
    Required
    Example

    PROVIDER_TOOL

    Select ticketing system (Jira or ServiceNow)

    Yes

    Jira

    If Jira is used

    Variable
    Description
    Required
    Example

    JIRA_URL

    Jira server URL

    Yes

    https://jira.example.com

    JIRA_PROJECT_NAME

    Project key or name

    Yes

    GENERIC

    JIRA_USER_NAME

    Jira service account username

    Yes

    JIRA_PASSWORD

    Jira password or API token

    Yes

    generic-jira-password

    JIRA_ACCESS_STATUS_KEY

    Jira field key for access status

    Yes

    Access Status

    JIRA_ACCESS_STATUS_APPROVED_VALUE

    Field value for approval

    Yes

    Approved

    JIRA_ACCESS_STATUS_REJECTED_VALUE

    Field value for rejection

    Yes

    Rejected

    If ServiceNow is used

    Variable
    Description
    Required
    Example

    SERVICENOW_URL

    ServiceNow instance URL

    Yes

    https://servicenow.example.com

    SERVICENOW_USER_NAME

    ServiceNow service account username

    Yes

    SERVICENOW_PASSWORD

    ServiceNow password

    Yes

    generic-snow-password

    SERVICENOW_CLIENT_ID

    ServiceNow client ID

    Yes

    snow-client-id

    SERVICENOW_CLIENT_SECRET

    ServiceNow client secret

    Yes

    snow-client-secret

    SERVICENOW_ACCESS_STATUS_KEY

    ServiceNow field key for access status

    Yes

    u_access_status

    SERVICENOW_ACCESS_STATUS_APPROVED_VALUE

    Field value for approval

    Yes

    Approved

    SERVICENOW_ACCESS_STATUS_REJECTED_VALUE

    Field value for rejection

    Yes

    Rejected

  2. Save your configuration and restart the access-request-service container.

  3. (Optional) After restarting the access-request-service container, verify the configuration:

    1. Submit a metadata request in the PDC UI.

    2. Confirm that a corresponding ticket is created in Jira or ServiceNow.

    3. Update the ticket status to Approved or Rejected.

    4. Confirm that PDC reflects the updated status after the poll interval.
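
The following is an illustrative set of values for the variables described above, using the example values from the tables where they are given; the default assignee address is a placeholder. Depending on where you set the variables, the names may carry the ACCESS_REQUEST_SERVICE_ prefix shown in the Jira and ServiceNow sections that follow; the names below match the tables above.

    # Core service settings (illustrative values)
    TENANT_NAME=your-tenant
    # Must match a user created in Data Catalog with the Admin role (placeholder address)
    ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE=pdc-admin@example.com
    STATUS_FETCHER_INTERVAL=@every 5m
    PAGINATION_MAX_RESULTS_SIZE=1000
    LOG_LEVEL=info

    # Authentication settings
    AUTH_URL=https://auth.example.com
    AUTH_HOST=auth-host:5000
    AUTH_CLIENT_ID=generic-client-id
    AUTH_USER_NAME=generic-user
    AUTH_PASSWORD=generic-password

    # Database settings
    DB_URL=generic-db-host
    DB_PORT=5432
    DB_NAME=generic_db
    DB_USER=generic-db-user
    DB_PASSWORD=generic-db-password
    DB_SSL_MODE=disable

    # Ticketing provider selection (Jira or ServiceNow)
    PROVIDER_TOOL=Jira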

Integrating Jira with Pentaho Data Catalog

You can integrate Jira as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure Jira integration using a config.yaml file or environment variables, and how to create a custom field in Jira to use for the data access request statuses.

You can set any administrative user as the default administrator to manage data access requests. However, there can be only one default administrator set, because the ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE environment variable only supports a single user. If necessary, the default administrator can edit a data access request to assign it to another administrative user.

To integrate Jira with Data Catalog, perform the following tasks:

Integrate Jira with Data Catalog using a config.yaml file

To integrate the Jira ticketing system with Pentaho Data Catalog, you can use a config.yaml file with settings for connection details, credentials, project information, and status mappings. If your system does not have a config.yaml file, you can also integrate Jira with Data Catalog using environment variables. For more information, see Integrate Jira with Data Catalog using environment variables.

Perform the following steps to integrate Jira with Data Catalog using a config.yaml file:

  1. Go to the /pentaho/pdc-docker-deployment/conf folder and open the config.yaml file. If the file does not exist, create it.

  2. Add the following configuration to the config.yaml file.

    Note: Replace the placeholder values in angle brackets (< >) with your actual Jira credentials and project details.

    tools:
      jira:
        url: <your_jira_url>
        username: <your_jira_username>
        password: <your_jira_password>
        project_name: <your_jira_project_name>
        access_status_key: <your_jira_access_status_key>
        status_mapping:
          approved_status: <your_jira_approved_status_value>
          rejected_status: <your_jira_rejected_status_value>
    
    database:
      postgres:
        host: 'um-postgresql'
        port: 5432
        user: 'postgres'
        password: 'admin123#'
        dbname: 'pdc_access_request_db'
        sslmode: 'disable'
    
    auth:
      url: 'https://10.177.177.7'
      auth_host: 'um-css-admin-api:5000'
      username: 'admin'
      password: 'admin'
      client_id: 'admin-cli'
    
    For example, a completed tools section might look like the following:
    
    tools:
      jira:
        url: 'https://teamwork7com.atlassian.net'
        username: '[email protected]'
        password: 'ATATT3xFfGF0u8CftyjZn0PO-p-M_J8VJUtYcn3ZiZzfM0pF7iqpnUT3TPd-7q7QO8PevM7IDIyzSwwwUoksY7tfKnJwXV1EokuGhp1YmcIkP-78H0H1-io2bkkSVL-bpRmwL4Tha0yWWHZRnEvhSRE1SX984WX4vQMp1hi9w6ua5a_jAq9gWwg=0C4170DE'
        project_name: 'KAN'
        access_status_key: 'Access Status'
        status_mapping:
          approved_status: 'Approved'
          rejected_status: 'Rejected'
    
  3. Save the changes and close the config.yaml file.

  4. Open the vendor/docker-compose-um.yml file. Under the access request service container configuration, add a volumes section at the same level as the environment section, then save the changes and close the file.

    volumes:
      - <path-to-the-config.yaml>:/app/config.yaml:ro
    
  5. Restart the access request service to apply changes:

    ./pdc.sh restart access-request-service

You now need to add a custom field to Jira, to include the data access request statuses. For more information, see Add a custom field to Jira.

Integrate Jira with Data Catalog using environment variables

To integrate the Jira ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, and status mappings. You can also Integrate Jira with Data Catalog using a config.yaml file.

Perform the following steps to integrate Jira with Data Catalog using environment variables:

  1. Edit the /opt/pentaho/pdc-docker-deployment/vendor/.env.default file and add the following lines:

    Note: Instead of the default location, your environment variables may be set in an opt/pentaho/pdc-docker-deployment/conf/.env file.

    ACCESS_REQUEST_SERVICE_PROVIDER_TOOL=Jira
    ACCESS_REQUEST_SERVICE_JIRA_URL=your_jira_url
    ACCESS_REQUEST_SERVICE_JIRA_USER_NAME=your_jira_username
    ACCESS_REQUEST_SERVICE_JIRA_PASSWORD=your_jira_password
    ACCESS_REQUEST_SERVICE_JIRA_PROJECT_NAME=your_jira_project_name
    ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_KEY=your_jira_access_status_key
    ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_APPROVED_VALUE=your_jira_approved_status_value
    ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_REJECTED_VALUE=your_jira_rejected_status_value
    ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE=PDC_admin_user_email
    
  2. Save the changes and close the file.

  3. Restart the access request service to apply changes:

    ./pdc.sh restart access-request-service

You now need to add a custom field to Jira to include the data access request statuses. For more information, see Add a custom field to Jira.

Add a custom field to Jira

If you have configured Data Catalog to connect to Jira for managing data access requests, you need to add a custom field to Jira to map the Data Catalog data access request statuses to complete the Jira integration.

Perform the following steps to add a custom field to Jira:

  1. Log in to Jira with administrative rights.

  2. Go to the Jira Admin settings.

    If you cannot find the Jira Admin settings, use these steps:

    1. Open any issue.

    2. In the Details section, click the settings icon, then click Manage Fields. In the bottom right corner, you see the Go to Custom Fields option.

    3. Click Go to Custom Fields, and you are taken to the Jira Admin settings.

  3. Click Custom Fields and then click Create custom field.

  4. Select the Select List type and enter Access Status as the name.

  5. Add the options: Approved, Rejected, and Pending, and click Create.

  6. Open any issue. In the Details section, click the settings icon, then click Manage Fields.

  7. Locate the newly created Access Status field in the list of fields on the right side.

  8. Drag and drop the Access Status field into the Context Fields section.

Jira is now updated to use data access request statuses with Data Catalog.

Integrating ServiceNow with Data Catalog

You can integrate ServiceNow as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure ServiceNow integration using a config.yaml file or environment variables, and how to create a custom field in ServiceNow to track data access request statuses.

You can set any administrative user as the default administrator to manage data access requests. However, there can be only one default administrator set, because the ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE environment variable only supports a single user. If necessary, the default administrator can edit a data access request to assign it to another administrative user.

To integrate ServiceNow with Data Catalog, you need to perform the following tasks:

Integrate ServiceNow with Data Catalog using a config.yaml file

To integrate the ServiceNow ticketing system with Data Catalog, you can use a config.yaml file with settings for connection details, credentials, project information, and status mappings. If your system does not have a config.yaml file, you can also Integrate ServiceNow with Data Catalog using environment variables.

To integrate ServiceNow with Data Catalog using a config.yaml file, use the following steps:

  1. Go to the /pentaho/pdc-docker-deployment/conf folder and open the config.yaml file. If the file does not exist, create it.

  2. Add the following configuration to the config.yaml file.

    Note: Replace the placeholder values in angle brackets (< >) with your actual ServiceNow credentials and project details.

    tools:
      servicenow:
        url: <your_servicenow_url>
        username: <your_servicenow_username>
        password: <your_servicenow_password>
        client_id: <your_servicenow_client_id>
        client_secret: <your_servicenow_client_secret>
        access_status_key: <your_servicenow_access_status_key>
        status_mapping:
          approved_status: <your_servicenow_approved_status_value>
          rejected_status: <your_servicenow_rejected_status_value>
    
    database:
      postgres:
        host: 'um-postgresql'
        port: 5432
        user: 'postgres'
        password: 'admin123#'
        dbname: 'pdc_access_request_db'
        sslmode: 'disable'
    
    auth:
      url: 'https://10.177.177.7'
      auth_host: 'um-css-admin-api:5000'
      username: 'admin'
      password: 'admin'
      client_id: 'admin-cli'
    
    For example, a completed tools section might look like the following:
    
    tools:
      servicenow:
        url: 'https://mycompany.service-now.com'
        username: 'mySNusername'
        password: 'mjfo39847tnd'
        client_id: 'ljdfsae9087534rmlvspe495rfnv'
        client_secret: '(YGVFJMKLOUJIHY'
        access_status_key: 'u_access_request_status'
        status_mapping:
          approved_status: 'Granted'
          rejected_status: 'Denied'
    
  3. Save the changes and close the config.yaml file.

  4. Open the vendor/docker-compose-um.yml file. Under the access request service container configuration, add a volumes section at the same level as the environment section, then save the changes and close the file.

    volumes:
      - <path-to-the-config.yaml>:/app/config.yaml:ro
    
  5. Restart the access request service to apply changes:

    ./pdc.sh restart access-request-service

You now need to add a custom field to ServiceNow, to include the data access request statuses. For more information, see Add a custom field to ServiceNow.

Integrate ServiceNow with Data Catalog using environment variables

To integrate the ServiceNow ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, and status mappings. You can also integrate ServiceNow with Data Catalog using a config.yaml file.

Perform the following steps to integrate ServiceNow with Data Catalog using environment variables:

  1. Edit the /opt/pentaho/pdc-docker-deployment/vendor/.env.default file and add the following lines:

    Note: Instead of the default location, your environment variables may be set in an opt/pentaho/pdc-docker-deployment/conf/.env file.

    ACCESS_REQUEST_SERVICE_PROVIDER_TOOL=ServiceNow
    ACCESS_REQUEST_SERVICE_SERVICENOW_URL=your_servicenow_url
    ACCESS_REQUEST_SERVICE_SERVICENOW_USER_NAME=your_servicenow_username
    ACCESS_REQUEST_SERVICE_SERVICENOW_PASSWORD=your_servicenow_password
    ACCESS_REQUEST_SERVICE_SERVICENOW_CLIENT_ID=your_servicenow_client_id
    ACCESS_REQUEST_SERVICE_SERVICENOW_CLIENT_SECRET=your_servicenow_client_secret
    ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_KEY=your_servicenow_access_status_key
    ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_APPROVED_VALUE=your_servicenow_approved_status_value
    ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_REJECTED_VALUE=your_servicenow_rejected_status_value
    ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE=PDC_admin_user_email
    
  2. Save the changes and close the file.

  3. Restart the access request service to apply changes:

    ./pdc.sh restart access-request-service

You now need to add a custom field to ServiceNow to include the data access request statuses. For more information, see Add a custom field to ServiceNow.

Add a custom field to ServiceNow

If you have configured Data Catalog to connect to ServiceNow for managing data access requests, you need to add a custom field to ServiceNow to map the Data Catalog data access request statuses to complete the ServiceNow integration.

Perform the following steps to add a custom field to ServiceNow:

  1. Log in to the ServiceNow instance with administrative rights.

  2. Go to System Definition > Tables and locate the Incident table.

  3. Open the Incident table, and at the bottom, in the Columns section, click New to add a new column (custom field).

  4. Configure the following properties:

Property
Description

Column Label

Enter a descriptive name like Access Request Status.

Column Name

Automatically generated as u_access_request_status (prefixed with u_ to indicate it’s a custom field).

Type

Select the appropriate field type, which should be Choice for values like Pending, Granted, and Denied.

Choices

Once the field type is set to Choice, there is an option to add Choice List Values. Add the following values: Pending, Granted, and Denied. You can also set a default value if desired.

5. Verify the changes you have made and click Submit.

Configure Power BI templates with Pentaho Data Catalog

This guide outlines steps to configure Pentaho Data Catalog by connecting Power BI reports to a PostgreSQL database for catalog metadata. It details the setup for integration, including:

  • Pre-created database objects in PostgreSQL

  • Configuration of Power BI Desktop and Service

  • Scheduled refresh setup

Additionally, it explains synchronizing materialized view refresh jobs in PostgreSQL with Power BI dataset refresh schedules.

Prerequisites

Before configuring Power BI with Data Catalog, ensure:

  • PostgreSQL database is installed, configured, and accessible from the Power BI network.

  • A SQL client tool, such as DBeaver Community Edition, is available to run database scripts.

  • The user account has the necessary roles and privileges to create tables and materialized views in the PostgreSQL database.

  • Power BI Desktop is installed for creating and testing reports.

  • Power BI Service access is available for publishing reports and scheduling dataset refreshes.

  • The on-premises data gateway is installed and running to enable secure connectivity between Power BI Service and the PostgreSQL database.

  • Network permissions allow communication between the Power BI gateway host and the PostgreSQL server.

Table and view mapping

The following views and tables are pre-created in PDC for Power BI. A quick way to confirm that they exist in your database is shown after the table.

View Name

Dependent Tables

mv_master

entities_master_view, datasource_category_mapping, currency_exchange_rates

mv_entity_category_summary_view

entities_custom_categorization, glossary_summary_view, terms_view, entities_master_view, currency_exchange_rates, datasource_category_mapping

mv_duplicate_savings_by_original_view

duplicate_files_view, entities_master_view, currency_exchange_rates, datasource_category_mapping

mv_duplicate_by_term_summary_view

entities_custom_categorization, glossary_summary_view, duplicate_files_view, terms_view, entities_master_view, currency_exchange_rates

mv_duplicate_entities_summary_view

duplicate_files_view, entities_master_view, currency_exchange_rates

mv_duplicate_entity_detail_view

duplicate_files_view, entities_master_view, currency_exchange_rates

mv_policies_summary

policies_summary_view, entities_policies_view, entities_master_view
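
To confirm that these objects exist in your PostgreSQL database before building reports, you can list them with psql. This is a minimal sketch; the host, user, and database name are placeholders for your own connection details.

    # List the materialized views in the public schema.
    psql -h your-postgres-host -U db_user -d your_database \
      -c "SELECT matviewname FROM pg_matviews WHERE schemaname = 'public' ORDER BY matviewname;"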

Configure the Power BI Desktop environment

Use Power BI Desktop to update the PostgreSQL data source and configure connection settings before publishing the report to Power BI Service.

Perform the following steps to configure Power BI Desktop:

  1. Open the Power BI Desktop application and open the Power BI report (.pbix).

  2. On the ribbon, select Transform Data > Data Source Settings.

  3. In the Data Source Settings window, select your PostgreSQL connection.

  4. Click Change Source, update Server name/host and Database name, and click OK.

  5. Click Edit Permission, uncheck Encrypt connection and enter valid PostgreSQL database credentials.

  6. Click OK and then click Close.

  7. On the Home tab, select Apply Changes > Run All to apply the new configuration.

  8. (Optional) To modify parameters directly, select Home > Edit Parameters, update values, and then click Apply Changes.

Power BI Desktop connects to the PostgreSQL database with the updated host, database, and credential settings. The report is now ready to be published to Power BI Service.

Configure the Power BI Service gateway

Use the Power BI Service to configure the on-premises data gateway and connect the Power BI dataset to the PostgreSQL database used by Data Catalog. Perform the following procedure to configure Power BI Service gateway:

Before you begin, ensure you have administrator access.

Install and configure on-premises Data Gateway

  1. Download the On-premises data gateway installer from the Microsoft Power BI portal.

  2. Run the installer as an administrator.

  3. Select On-premises data gateway (recommended) when prompted.

  4. Choose Standard mode, and then click Next.

  5. Sign in with your Power BI credentials.

  6. Enter a Gateway name and Recovery key (password), and then click Configure.

  7. After installation completes, verify that the gateway status shows Online in the Power BI Service.

Configure Dataset in the Power BI Service

  1. In Power BI Service, go to the workspace that contains your dataset.

  2. Select the More options (⋮) menu next to the dataset, and then select Settings.

  3. Under Parameters, update the Host name and Database name values.

  4. Click Apply, and then select Reload to refresh the dataset parameters.

Add gateway connection

  1. Go to Manage gateways from the Power BI Service navigation pane.

  2. Select your gateway, and then choose Add to gateway.

  3. Enter a Connection name (preferably the same as the PostgreSQL host name).

  4. Select Basic authentication.

  5. Enter the PostgreSQL username and password.

  6. Click Apply to create and map the connection to the dataset.

The Power BI dataset is connected to the PostgreSQL database through the on-premises data gateway. The gateway is active and can be used for scheduled dataset refreshes in Power BI Service.

Modifying parameters in Power BI Service

Use Power BI Service to update data source parameters for the PostgreSQL connection used by Data Catalog.

Perform the following procedure to change data source details:

  1. In Power BI Service, go to the workspace that contains your dataset.

  2. Select the More options (⋮) menu next to the dataset, and then select Settings.

  3. Under Parameters, update the required values such as Host name or Database name.

  4. Click Apply, and then select Reload to refresh the dataset.

  5. If a gateway error occurs, re-create the gateway connection from Manage gateways.

  6. If the gateway is valid, map the dataset again to ensure the updated parameters are applied.

The Power BI dataset is updated with the new PostgreSQL connection parameters and successfully mapped through the on-premises data gateway.

(Optional) Refresh scheduling and synchronization

Generally, cron jobs refresh the materialized views in the Business Intelligence Database (BIDB); however, you can also set a manual refresh schedule for a materialized view in PostgreSQL.

Perform the following steps to schedule a manual refresh:

PostgreSQL materialized view refresh scheduling

Create a PostgreSQL refresh function

Open your SQL client (for example, DBeaver) and run the following script to refresh all required materialized views:

CREATE OR REPLACE FUNCTION refresh_all_pentaho_views() RETURNS void AS $$
BEGIN
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_master;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_entity_category_summary_view;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_savings_by_original_view;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_by_term_summary_view;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_entities_summary_view;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_entity_detail_view;
    REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_policies_summary;
END;
$$ LANGUAGE plpgsql;

Schedule automatic refresh jobs

Use pgAgent or an operating system scheduler such as cron or Windows Task Scheduler to automate materialized view refreshes.

  • Example cron job for Linux:

    # Refresh Pentaho Data Catalog views every day at 1:00 AM
    0 1 * * * psql -U db_user -d your_database -c "SELECT refresh_all_pentaho_views();"
  • For Windows, use Task Scheduler:

    • Action: Launch psql.exe

    • Arguments: -U db_user -d your_database -c "SELECT refresh_all_pentaho_views();"

Power BI Dataset refresh configuration

Perform the following procedure to configure scheduled refresh in Power BI Service:

Power BI Service (Scheduled Refresh)

  1. In Power BI Service, open your workspace and select the Datasets tab.

  2. Select Schedule refresh for the dataset connected to PostgreSQL.

  3. Turn on Keep your data up to date.

  4. Configure the refresh schedule:

    1. Power BI Pro: up to 8 refreshes per day

    2. Power BI Premium: up to 48 refreshes per day

  5. Set the time zone, frequency, and preferred refresh times.

  6. Align the dataset refresh to run after the PostgreSQL refresh completes.

Gateway mapping validation

  1. Go to Settings > Manage gateways > Connections.

  2. Verify that the on-premises data gateway status is Online.

  3. Confirm that the dataset connection is mapped correctly and uses valid credentials.

Refresh notification and logs

  1. In Power BI Service, go to Settings > Scheduled refresh > Failure notifications.

  2. Enable email alerts for refresh failures to receive proactive notifications.

Power BI datasets refresh automatically after PostgreSQL materialized views are updated.

Component

Frequency

Trigger

Dependency

PostgreSQL Materialized Views

Every 24 hours (1:00 AM)

Cron job/pgAgent

N/A

Power BI Dataset

Every 24 hours (2:00 AM)

Power BI scheduled refresh

Runs after PostgreSQL refresh completes

This ensures that Power BI always retrieves updated, fully processed data from Data Catalog’s database views.
