Advanced configuration
After installing Data Catalog, you may need to set up additional components, depending on your environment. Use the following topics as needed to finish setting up your environment.
Configure system environment variables
Although not common, there might be instances where you need to change the default settings for Data Catalog system environment variables. These configuration modifications allow you to override default system behavior to align with the specific needs.
Modifying these settings can have system-wide implications, and incorrect changes might negatively impact the functionality of the other platforms. It is a best practice to collaborate with your Pentaho Data Catalog partner to ensure that any modifications align with the intended objectives.
In a terminal window, navigate to the
pdc-docker-deployment
folder and open the hidden environment variable configuration file (.env
). This file is located in the/opt
folder by default.Verify the system environment variables set in the
/opt/pentaho/pdc-docker-deployment/vendor/.env.default
file:For example, the number of worker instances that Data Catalog uses to run processes is set to 5:
PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE=5 PDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE=5
Note: Make sure that
PDC_WS_DEFAULT_OPS_JOBPOOLMINSIZE
andPDC_WS_DEFAULT_OPS_JOBPOOLMAXSIZE
have the same value for consistent worker instance management.
To override an environment variable set in the
vendor/.env.default
file, you can create a new.env
file in theopt/pentaho/pdc-docker-deployment/conf/
folder:vi opt/pentaho/pdc-docker-deployment/conf/.env
(Optional) The data in the Business Intelligence Database refreshes daily by default, as set in the
.env
file. To modify the data refresh frequency, update the variable in the .env file to one of the options listed in the following table:
@yearly (or @annually)
Run once a year, midnight, Jan 1st
@monthly
Run once a month, midnight, first of the month
@weekly
Run once a week, midnight between Sat/Sun
@daily (or @midnight)
Run once a day, midnight
@hourly
Run once an hour, the beginning of the hour
@every <number>m
Run at a custom interval, where <number>
specifies the number of minutes.
For example, @every 5m
runs the job every 5 minutes.
PDC_CRON_BI_VIEWS_INIT_SCHEDULE=@daily
After adding all required system variables, save the changes and restart the Data Catalog system services.
./pdc.sh stop ./pdc.sh up
Disable the Physical Assets feature from Data Catalog deployment
By default, the Physical Assets feature is included in the Data Catalog deployment to support OT assets metadata through Pentaho Edge. However, if your deployment does not require this feature, you can disable it by removing its reference from the Compose file. This helps reduce the size of the deployment and saves compute resources. This guide depicts how to disable the Physical Assets feature in Data Catalog.
Before you begin, make sure the following conditions are met:
Data Catalog is already installed using the Docker deployment method. To know more about installation, see Install Pentaho Data Catalog.
You have access to the deployment directory (pdc-docker-deployment) and permission to edit the
.env.default
or.env
file.
Perform the following steps to disable the Physical Assets feature:
Navigate to the following directory:
cd /pentaho/pdc-docker-deployment/vendor
Open the .env.default file in a text editor:
vi .env.default
Locate the COMPOSE_PROFILES line. It may look like this:
COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm,physical-assets
Remove physical-assets from the list:
COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm
Save the file and return to the deployment root folder:
cd /pentaho/pdc-docker-deployment
Restart the deployment using the following command:
./pdc.sh up
You have successfully disabled the Physical Assets service, and it will no longer be started when deploying or restarting Data Catalog. This reduces the number of running containers and optimizes resource usage for environments where OT asset lineage is not required.
Running Data Catalog workloads using node affinity, taints, and tolerations
Data Catalog is a distributed application that runs several containerized components, including the application, database, and worker pods. Worker pods perform data scanning, profiling, and metadata ingestion from enterprise data sources. These operations often involve direct access to sensitive or large-scale datasets.
In large or security-sensitive deployments, administrators often need precise control over where workloads run within a Kubernetes (EKS) cluster. This control helps ensure that critical services and data processing tasks run in secure, compliant, and performance-optimized environments. It becomes especially important when certain components of Data Catalog handle data that is confidential, high-volume, or requires specialized compute resources such as GPUs or high-performance storage.
By using Kubernetes node affinity, taints, and tolerations, you can configure Data Catalog so that:
Worker pods run only on specific nodes. For example, nodes in a restricted security group or a separate availability zone.
Non-worker workloads, such as the user interface or metadata services, are prevented from running on those nodes.
The system continues to meet data segregation and compliance requirements without affecting performance or scalability.
Running PDC workloads using node affinity, taints, and tolerations provides a secure, compliant, and efficient way to manage distributed deployments across different availability zones or network segments.
Perform the following steps to configure node affinity, taints, and tolerations for Data Catalog workloads.
Before you begin
Ensure that your Data Catalog deployment is running on Amazon Elastic Kubernetes Service (EKS).
Verify that you have kubectl and AWS CLI installed and configured with permissions to update node groups and apply taints.
Identify the node group where you want to run PDC worker workloads.
Obtain access to the
custom-values.yaml
file used for your PDC Helm deployment.Confirm that your deployment uses Helmfile for orchestration.
Procedure
Open a terminal on the machine that manages your Kubernetes cluster.
Apply a taint to the node group you want to reserve for worker workloads.
aws eks update-nodegroup-config \ --cluster-name <your-cluster-name> \ --nodegroup-name <your-nodegroup-name> \ --taints key=dedicated,value=ws,effect=NO_SCHEDULE
Replace the placeholders with your cluster and node group names.
key=dedicated specifies the taint key
.value=ws
indicates that the node is dedicated for worker services.effect=NO_SCHEDULE
prevents other pods from being scheduled on these nodes unless they have a matching toleration.
Tip: Use descriptive taint keys and values that reflect the node group’s purpose, such as key=data-processing or value=worker.
Edit the custom-values.yaml file for your Helm deployment.
For PDC 10.2.8 and later, locate the job-server section and add:
job-server: tolerations: - key: "dedicated" operator: Equal value: "ws" effect: "NoSchedule" affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: eks.amazonaws.com/nodegroup operator: In values: - <your-nodegroup-name>
Replace
<your-nodegroup-name>
with the name of the dedicated node group.Save the file and deploy the configuration.
helmfile sync -n pentaho
The
-n pentaho
flag ensures that the deployment targets the correct namespace.
Result
You have successfully configure node affinity, taints, and tolerations for Data Catalog workloads. After deployment:
Data Catalog worker pods are scheduled only on the nodes defined by the node affinity rules.
Other pods are prevented from being scheduled on those nodes unless they include a matching toleration.
Your Data Catalog deployment now supports workload segregation across different availability zones or network segments in AWS.
This configuration helps validate compliance and segregation requirements for deployments where worker nodes must belong to distinct security groups or availability zones.
Next steps
To verify the configuration, run:
kubectl get pods -o wide -n pentaho
Confirm that worker pods are assigned to nodes in the expected node group.
Monitor node usage using Amazon EKS Console or the kubectl describe node command.
Additional information
For more information, see the official AWS documentation: Place Kubernetes pods on Amazon EKS by using node affinity, taints, and tolerations
Install user-provided SSL certificates
To provide a greater level of security to your data, you can use signed Secure Sockets Layer (SSL) certificates from your Certificate Authority (CA) with Data Catalog.
Data Catalog automatically installs self-signed certs in the <install-directory>/conf/https
directory as server.key
(PEM-encoded private key) and server.crt
(PEM-encoded self-signed certificate). You can replace these files with certificates signed by your CA.
Use this procedure to install signed SSL certificates for Data Catalog:
On your Data Catalog server, navigate to the Data Catalog installation directory
<install-directory>/conf/https
, where <install-directory> is the directory where Data Catalog is installed.server.key
is a PEM-formatted file that contains the private key of a specific certificate.server.crt
is a PEM-formatted file containing the certificate.
Replace the <install-directory>
/conf/https/server.key
file with the PEM-encoded private key used to sign the SSL certificate or generate a new private key in PEM-encoded format.Replace the <install-directory>
/conf/https/server.crt
file with the PEM-encoded signed certificate associated with the private key in Step 1.If a new private key is generated, then you need to download a new PEM-encoded signed SSL certificate from your CA.
Append the <install-directory>
/conf/extra-certs/bundle.pem
file with the following three certificates in this order:Top level PEM-encoded signed SSL certificate (basically the content of the <install-directory>
/conf/https/server.crt
file).Intermediate PEM-encoded certificate, if any, from your CA.
Root PEM-encoded certificate, if any, from your CA.
Navigate to the Data Catalog
<install-directory>
.Use the following command to restart Data Catalog:
./pdc.sh restart
The SSL certificates are installed.
Check and remove outdated certificates
If Data Catalog services fail to start or show SSL-related errors, the issue might be caused by expired or outdated certificates in the bundle.pem
file. You can identify and remove outdated certificates by checking their validity dates.
Before you begin
Ensure you have access to the server where Data Catalog is installed.
Verify that you have permission to view and edit files in the
conf/extra-certs
directory.
Procedure
Go to the directory where the
bundle.pem
file is stored:cd /<PDCInstallDir>/conf/extra-certs/
Run the following command to check the validity dates of all certificates in the
bundle.pem
file:awk 'BEGIN{block=""} /-----BEGIN CERTIFICATE-----/ {block=$0 RS; inblock=1; next} inblock {block=block $0 RS} /-----END CERTIFICATE-----/ { block=block $0 RS print "================" print block | "openssl x509 -noout -dates" close("openssl x509 -noout -dates") block="" inblock=0 }' bundle.pem
Review the command output. You see sections similar to the following:
notBefore=Sep 1 09:01:05 2025 GMT notAfter=Sep 1 09:01:05 2026 GMT notBefore=Jun 10 06:00:00 2023 GMT notAfter=Jun 10 06:00:00 2024 GMT
Interpret the fields in the output:
Field Description notBefore Date from which the certificate becomes valid. notAfter Date after which the certificate expires.
Compare the current date with the notAfter value: If the current date is later than notAfter, the certificate has expired. For example:
notAfter=Jun 10 06:00:00 2024 GMT ❌ Expired (past date) notAfter=Sep 1 09:01:05 2026 GMT ✅ Active (future date)
Edit the bundle.pem file to remove the expired certificates.
Use any text editor such as vi or nano.
Remove the complete certificate block starting from
-----BEGIN CERTIFICATE-----
to-----END CERTIFICATE-----
.
Save the file and restart the Data Catalog services.
Result
The bundle.pem
file now contains only active certificates, preventing SSL validation issues during Data Catalog startup or connectivity.
Add email domains to the safe list in Data Catalog after deployment
During the initial deployment of Data Catalog, it is typically configured to allow only a predefined set of email domains for user authentication. However, you might grant access to users with email addresses from new domains. Instead of redeploying Data Catalog, which can cause downtime and operational delays, you can dynamically update the list of allowed email domains using the Identity & Access Management (IAM) APIs.
Perform the following steps to add email domains to the safe list using IAM APIs after deployment:
Prerequisites
You must have administrative access to use the IAM APIs.
Identify your Data Catalog DNS (for example,
catalog.example.com
).Obtain admin credentials to generate a Bearer token.
Procedure
Open the CMD prompt and run the following cURL command to generate an authentication token to interact with the IAM APIs:
curl -k --location 'http://<your-server-url>/keycloak/realms/master/protocol/openid-connect/token' \ --header 'Content-Type: application/x-www-form-urlencoded' \ --data-urlencode 'username=<admin-username>' \ --data-urlencode 'password=<admin-password>' \ --data-urlencode 'client_id=admin-cli' \ --data-urlencode 'grant_type=password'
Replace
<your-server-url>
with your Data Catalog server URL.Replace
<admin-username>
and<admin-password>
with the actual Keycloak master realm admin user credentials. The response includes the token value.
{"access_token":"<TOKEN_VALUE>"}
Before updating, you can view the currently configured email domains using the following GET command.
curl -k -X 'GET' \ 'https://<your-server-url>/css-admin-api/api/internal/css-auth-proxy/v1/provider/<provider-id>' \ -H 'accept: application/json' \ -H 'Authorization: Bearer <ACCESS_TOKEN>'
The response displays the current domain configuration:
{ "id": "catalog.example.com", "emailDomains": ["example.com", "partner.com"] }
Run the following IAM API cURL request to update email domains:
curl -k -X 'PUT' \ 'https://<your-server-url>/css-admin-api/api/internal/css-auth-proxy/v1/provider' \ -H 'Authorization: Bearer <ACCESS_TOKEN>' \ -H 'Content-Type: application/json' \ -d '{ "id": "<provider-id>", "emailDomains": [ "hv.com", "hitachivantara.com", "gmail.com" ] }'
Replace
<your-server-url>
with your Data Catalog server URL.Replace
<ACCESS_TOKEN>
with the token obtained in the previous step.Replace
<provider-id>
with your Data Catalog server’s domain or IP address used during installation.Modify the "
emailDomains
" list as needed.
New email domains have been added to the Data Catalog safe list, and users with those domains can now sign in successfully.
Set up an email server to send Data Catalog notifications
To set up Data Catalog to send email notifications to users, you can configure any Simple Mail Transfer Protocol (SMTP) server that meets your needs.
Examples of notifications are when a user is tagged with '@' in a comment or set up in a data pipe template to be notified when a job completes.
To integrate an SMTP server with Data Catalog, use the following steps:
Gather the following information for the SMTP server you want to use:
Host name of SMTP server (IP address or domain name)
Port number for SMTP server
Username on SMTP server in <mail userID>
@<domain>.com
formatPassword for username
Sender mail ID in <mail userID>
@<domain>.com
formatWhether to use Transport Layer Security (TLS) or Secure Sockets Layer (SSL) security.
TLS or SSL port number For example, you can use Gmail’s SMTP server to send emails from your application. Here are the SMTP server configuration settings for Gmail:
SMTP Server Address
smtp.gmail.com
Secure Connection
TLS/SSL based on your mail client/website SMTP plugin
SMTP Username
your Gmail account (
[email protected]
)SMTP Password
your Gmail password
Gmail SMTP port
465 (SSL) or 587 (TLS)
Log into Data Catalog using root user credentials to configure Data Catalog to use the SMTP server, as in the following example:
https://*<full domain name for PDC server>*/
Navigate to the
configuresystem/smtp
directory on the Data Catalog server, as in the following example:https://*<full domain name for PDC server>*/configuresystem/smtp
The Configure Your System page opens.
Specify the SMTP server information as detailed in the following table:
Host
IP address or domain name of SMTP server
Port
Port number for SMTP server
Username
User name in *<mail userID>*@*<domain>*.com
format
Password
Password for user name specified above
Sender Mail
Sender mail ID in *<mail userID>*@*<domain>*.com
format
Encryption
TLS: Default value (leave the Use SSL checkbox blank)
SSL: Select the Use SSL checkbox
5. Click Test Connection to test the integration. A success confirmation message is displayed next to the Test Connection button.
6. Click Save Changes.
The SMTP server is configured.
Update SMTP details in Data Catalog after deployment
Adding Simple Mail Transfer Protocol (SMTP) details in Data Catalog enables email notifications and alerts within the application, such as:
Alerts about Data Catalog changes, approvals, and errors like data ingestion, metadata extraction, or synchronization failures.
Password reset links when users forget their credentials.
Notification alerts when tagged in the comments tab.
SMTP details are typically configured during the initial deployment of Data Catalog. However, if you want to update SMTP details post-deployment, you can use the Identity & Access Management (IAM) APIs without redeploying Data Catalog, which might cause downtime and operational delays.
Perform the following steps to update SMTP details in Data Catalog using IAM APIs after deployment:
Ensure you have sufficient access to use the IAM APIs.
To generate an authentication token to interact with the IAM APIs, open the CMD prompt, and run the following cURL command:
curl -k --location 'http://<your-server-url>/keycloak/realms/master/protocol/openid-connect/token' \ --header 'Content-Type: application/x-www-form-urlencoded' \ --data-urlencode 'username=<admin-username>' \ --data-urlencode 'password=<admin-password>' \ --data-urlencode 'client_id=admin-cli' \ --data-urlencode 'grant_type=password'
Replace
<your-server-url>
with your Data Catalog server URL.Replace
<admin-username>
and<admin-password>
credentials with the actual admin credentials. The response includes the token value.
{"access_token":"<TOKEN_VALUE>"}
To update SMTP details, run the following IAM API cURL request:
curl -X PUT \ 'https://<PDC_HOST>/css-admin-api/api/v1/tenants/<TENANT_NAME>' \ -H 'Accept: */*' \ -H 'Authorization: Bearer <TOKEN_VALUE>' \ -H 'Content-Type: application/json' \ -d '{ "realm": "<TENANT_NAME>", "smtpServer": { "password": "<SMTP_PASSWORD>", "replyToDisplayName": "<REPLY_TO_DISPLAY_NAME>", "starttls": "<true|false>", "auth": "<true|false>", "port": "<SMTP_PORT>", "host": "<SMTP_HOST>", "replyTo": "<REPLY_TO_EMAIL>", "from": "<FROM_EMAIL>", "fromDisplayName": "<FROM_DISPLAY_NAME>", "envelopeFrom": "<ENVELOPE_FROM>", "ssl": "<true|false>", "user": "<SMTP_USERNAME>" } }'
ParameterDescription<PDC_HOST>
The host name or IP address of your Data Catalog instance.
<TENANT_NAME>
The tenant name, typically "pdc".
<TOKEN_VALUE>
A valid authentication token (must be obtained through IAM authentication).
<SMTP_PASSWORD>
The password for the SMTP server authentication.
<REPLY_TO_DISPLAY_NAME>
The display name for the reply-to email address.
<SMTP_PORT>
The port number used by the SMTP server.
<SMTP_HOST>
The SMTP server host address).
<REPLY_TO_EMAIL>
The reply-to email address.
<FROM_EMAIL>
The email address used to send notifications.
<FROM_DISPLAY_NAME>
The display name associated with the sender’s email.
<ENVELOPE_FROM>
The envelope sender address (optional).
<SMTP_USERNAME>
The username for SMTP authentication.
You have successfully updated SMTP details in Data Catalog.
Configure proxy server settings for the Licensing-API service
In Pentaho Data Catalog, the Licensing-API service is responsible for managing and validating software licenses, ensuring that only authorized users and services can access Data Catalog features. When Data Catalog is deployed in an enterprise environment that restricts direct internet access, services like the Licensing-API require a proxy server to reach external licensing servers and authenticate endpoints.
Post deployment of Data Catalog, perform the following steps to configure the proxy server for the Licensing-API service:
Ensure that you have:
Access to the
conf/.env
andvendor/docker-compose.licensing.yml
files.Administrative privileges to modify configuration files and restart services.
The required proxy server details (domain, port, username, and password).
The SSL certificate file (
proxy-cert.pem
) if required for secure proxy connections.
Procedure
To configure proxy environment variables, go to Data Catalog root folder and then open the
conf/.env
file.In the
conf/.env
file, update the following proxy variables with respective values:VariableDescriptionExample ValueLICENSING_SERVER_PROXY_ENABLED
Enables or disables proxy configuration.
true
orfalse
LICENSING_SERVER_PROXY_DOMAIN
The domain or IP address of the proxy server.
10.177.176.126
LICENSING_SERVER_PROXY_PORT
The port number used for proxy communication.
443
LICENSING_SERVER_PROXY_USER
The username for proxy authentication.
admin
LICENSING_SERVER_PROXY_PASSWORD
The password for proxy authentication.
password
LICENSING_SERVER_PROXY_ENABLED=true LICENSING_SERVER_PROXY_DOMAIN=10.177.176.126 LICENSING_SERVER_PROXY_PORT=443 LICENSING_SERVER_PROXY_USER=user LICENSING_SERVER_PROXY_PASSWORD=password
Note: It is a best practice to avoid hard coding sensitive credentials like
PROXY_USER and PROXY_PASSWORD
. Use secret management tools or environment variables to secure them.To update proxy server configuration in Docker Compose, open the
vendor/docker-compose.licensing.yml
file and update the licensing-api service configuration as follows:services: licensing-api: image: ${GLOBAL_IMAGE_PREFIX}/${LICENSING_API_IMAGE} restart: always environment: LICENSING_SERVER_URL: ${LICENSING_SERVER_URL} PROXY_ENABLED: ${LICENSING_SERVER_PROXY_ENABLED} # Use domain because SSL requires a domain, not an IP, for configuration PROXY_HOST: ${LICENSING_SERVER_PROXY_DOMAIN} PROXY_PORT: ${LICENSING_SERVER_PROXY_PORT} PROXY_USER: ${LICENSING_SERVER_PROXY_USER} PROXY_PASSWORD: ${LICENSING_SERVER_PROXY_PASSWORD} # Used for configuring SSL certificate for proxy JAVA_EXTRA_CERTS: "cert.pem" platform: linux/amd64 profiles: - core volumes: - ${PDC_CLIENT_PATH}/proxy-cert.pem:/app/cert.pem
Note:
The
PROXY_ENABLED
,PROXY_HOST
,PROXY_PORT
,PROXY_USER
, andPROXY_PASSWORD
environment variables are mapped inside the Docker container.The
JAVA_EXTRA_CERTS
is set to"cert.pem"
to configure SSL certificates for proxy authentication.A volume mount is added to ensure that the SSL certificate file
proxy-cert.pem
is accessible within the container.
(Optional) If the proxy server requires SSL authentication, place the SSL certificate file (
proxy-cert.pem
) in the specified directory:cp /path/to/proxy-cert.pem ${PDC_CLIENT_PATH}/proxy-cert.pem
Note: Ensure that the file permissions allow access by the Licensing-API service.
After updating the configuration, restart the Data Catalog services to apply the changes:
./pdc.sh restart
You have successfully configured the proxy server settings for Licensing APIs in Data Catalog.
Configure Smart Type to SQL feature in Data Catalog
In Data Catalog, you can use the Smart Type to SQL feature, which converts natural-language text into executable SQL queries within Data Pipes. This feature uses a large-language-model (LLM) service to interpret user input and automatically generate valid SQL statements for the selected database tables.
Perform the following steps to configure the Smart Type to SQL feature in Data Catalog:
Before you begin
Ensure your Data Catalog deployment includes the ml-gateway-service.
Confirm that the
aiml
profile is enabled. The feature is unavailable without it.Obtain valid credentials or API keys for your LLM provider.
Procedure
For Docker Compose deployment
Go to the
conf/.env
file in your Data Catalog installation directory.Add the following environment variables:
ML_LLM_MODEL="" ML_LLM_API_KEY="" ML_LLM_INFERENCE_BASE_URL=""
Save the file and restart the containers for the configuration to take effect.
For Kubernetes deployment
Open the
values.yaml
file of theml-gateway-service
.Update the following configuration parameters:
llmModel: llmApiKey: llmInferenceBaseUrl:
Save the file and redeploy the
ml-gateway-service
.
Result
You have successfully configured Smart Type to SQL feature. When the aiml
profile is active, users can enter plain-language prompts in the SQL Editor to generate valid SQL queries automatically.
Configure database password encoding for special characters
By default, the pg-migration
service in Pentaho Data Catalog automatically encodes special characters (such as @
, :
, /
, and ?
) in PostgreSQL database passwords to ensure they are safely included in the database connection URL, preventing connection failures caused by unescaped reserved characters. If your environment requires passing passwords without URI encoding, you can disable this behavior by setting the PDC_PG_MIGRATIONS_DB_PASSWORD_REQUIRED_ENCODING
environment variable to false
.
Disabling URI encoding is not recommended, as it may cause connection failures if passwords contain special characters.
Perform the following steps to disable URI encoding for passwords used by the pg-migration
service:
Before you begin
Ensure that you have permission to edit the PDC environment configuration file (.env
or .env.default
).
Procedure
Open the environment configuration file:
For Docker Compose deployment:
cd /opt/pentaho/pdc-docker-deployment/vendor vi .env.default
For Kubernetes deployment, update the value in the
values.yaml
file of thepg-migrations
helm chart.
Locate or add the following environment variable and set it to false:
PDC_PG_MIGRATIONS_DB_PASSWORD_REQUIRED_ENCODING=false
Save the file and start the PDC services to apply the change:
Docker Compose:
docker-compose down docker-compose up -d
Kubernetes (Helm): Deploy the pg-migrations release with the updated values.yaml takes effect:
helm upgrade <pg-migrations-release-name> <chart-path-or-repo> -f values.yaml
Replace
<pg-migrations-release-name>
and<chart-path-or-repo>
with your actual values.
Result
Data Catalog disables URI encoding for PostgreSQL database passwords used by the pg-migration
service. The password is passed to the database connection URL as-is.
Next steps
If you need to modify additional system-level configuration variables, see Configure system environment variables.
Configure table and column sorting order in Data Canvas
By default, Pentaho Data Catalog displays tables and columns in their ordinal order within the Data Canvas, that is, in the same sequence as they appear in the source database. This ordering helps users analyze the data structure as designed in the original schema. However, sometimes, you might prefer to view tables and columns in alphabetical order, which simplifies browsing and locating objects across large schemas. You can modify this behavior by updating the system environment variable PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER
in the deployment configuration file.
Perform the following steps to configure table and column sorting order:
Before you begin
Ensure that you have access to the Data Catalog deployment directory on the server.
Procedure
Go to the Data Catalog deployment directory:
cd /opt/pentaho/pdc-docker-deployment/vendor
Open the .env.default file in a text editor:
vi .env.default
Locate or update the following environment variable:
PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER=name,asc
Save the file and exit the editor.
Restart the Data Catalog containers for the changes to take effect:
sh pdc.sh down sh pdc.sh up
Result
Tables and columns in the Data Canvas are now displayed in alphabetical order.
Next steps
If you prefer to revert to the default ordinal order, set the variable in the
.env.default
file.PDC_FE_DATA_CANVAS_COLUMN_SORTING_ORDER=ordinalPosition,asc
For details on editing environment variables, see Configure system environment variables.
Connect to Business Intelligence Database (BIDB)
The Data Catalog includes the Business Intelligence Database (BIDB) server, which stores and manages reporting metadata. Depending on your PDC version, BIDB is implemented using either PostgreSQL (PDC 10.2.5 and later) or MongoDB (PDC 10.2.1). You can use the respective connection methods and connect to the BIDB server to access reporting data and build dashboards. See the Reporting and data visualization section in the Use Pentaho Data Catalog guide for details about BIDB and the components available in BIDB.
PDC 10.2.5 and later (PostgreSQL)
Beginning with PDC 10.2.5, BIDB has been migrated from MongoDB to PostgreSQL, providing a relational database structure that improves query performance and enhances compatibility with broader tool sets.
Perform the following steps to connect to BIDB (PostgreSQL) in PDC 10.2.5 and later:
Locate the BIDB credentials in the PDC server:
Navigate to the
/vendor/.env.default
file.Identify the variables beginning with
POSTGRES_BIDB_USER_*
.To list the values, run:
cat .env.default | grep 'POSTGRES_BIDB'
Install the required PostgreSQL driver:
For JDBC, download the PostgreSQL JDBC driver from the official PostgreSQL site.
For ODBC, install the PostgreSQL ODBC driver (psqlODBC) on your system.
Configure your connection:
JDBC connection string format:
jdbc:postgresql://pdc.pentaho.com:5432/bidb
ODBC DSN settings:
Server:
<HOSTNAME>
Port:
5432
Database:
bidb
Username:
bidb_ro
Password:
${POSTGRES_BIDB_USER_PASSWORD}
Save the configuration in your reporting or analytics tool.
Test the connection to confirm access.
Important:
It is best practice not to hardcode credentials in your application. Instead, reference the environment variables (
POSTGRES_BIDB_USER_*
) to ensure secure and flexible credential management.Ensure that the
.env.default
file is stored securely and is not shared publicly.
You have successfully connected to the BIDB PostgreSQL database in PDC 10.2.5 or later. You can now run queries, build reports, and use reporting tools through JDBC or ODBC connections.
PDC versions prior to 10.2.5 (MongoDB)
In PDC versions prior to 10.2.5, the Business Intelligence Database (BIDB) is implemented using MongoDB. To connect to the BIDB server, you can use either the Java Database Connectivity (JDBC) connector or the Open Database Connectivity (ODBC) connector, depending on your application or reporting tool requirements.
Configure the Java Database Connectivity (JDBC) connector
Perform the following steps to configure the JDBC connector for connecting to BIDB:
Download the MySQL JDBC Connector JAR file from the MySQL website after selecting the appropriate version for the operating system.
Download the JDBC Authentication Plugin.
Download the DBeaver application from the DBeaver website and install it on the system. See DBeaver installation for more details.
To add the MySQL JDBC Driver and MySQL authentication plugin to DBeaver, open DBeaver and go to Database > Driver Manager.
Click New to add a new driver.
Select MySQL from the list and enter a name for the driver.
Click Browse to locate and select the downloaded JDBC driver (JAR file) and the MySQL authentication plugin, then click OK or Finish to add the driver.
After adding the MySQL driver, to create a New Connection, go to the DBeaver home page, click New Database Connection, and select MySQL as the database type.
Enter the MySQL server connection details, such as host, port, username, password, and so on.
Specify the jars in the local client configuration as shown in the following section.
jdbc:mysql://20.8.222.21:3307?useSSL=false&authenticationPlugins=org.mongodb.mongosql.auth.plugin.MongoSqlAuthenticationPlugin
Click Test Connection to verify the connection is working.
Click Finish to save the connection configuration.
You are now connected to BIDB using the JDBC connector. Use any third-party BI tool to connect to BIDB to analyze data and create dashboards.
Configure the Open Database Connectivity (ODBC) connector
The MongoDB ODBC connector allows you to connect tools that support ODBC to MongoDB and query the data using SQL. Perform the following steps to configure the JDBC connector for connecting to BIDB.
Download and install the MongoDB ODBC connector.
See MongoDB BI Connector ODBC Driver for more information.
Download and install an ODBC driver manager on your system.
For example, on the Windows operating system, you can use the default Windows ODBC Data Source Administrator.
Open the ODBC Data Source Administrator on your machine and go to the System DSN tab.
Click Add to add a new data source and select the MongoDB Driver.
To configure the DSN (Data Source Name) settings:
Set the server field to the address of your MongoDB server.
Enter the port number if it differs from the default (
27017
).Enter the required details for authentication, username, and password.
As a part of the connection details, enter the plugin directory details.
Set the SSL Mode to Disabled in the SSL configuration.
Click Test to verify that connection is working.
Click OK to save the connection configuration.
You are now connected to BIDB using the MongoDB ODBC connector. Use any third-party BI tool to connect to BIDB to analyze data and create dashboards.
Configure a machine learning (ML) server connection in Data Catalog
You can connect a machine learning (ML) server to Data Catalog and import ML model server components into the ML Models hierarchy. Supported server types include:
Pre-Production Model Servers such as MLflow, which capture experiments, runs, versions, and artifacts.
Production Model Servers such as NVIDIA Triton, which provide model deployment, inference statistics, and operational metrics.
Once configured, the ML server appears under the Synchronize card in the Management section of Data Catalog, allowing you to import model components into the ML Models hierarchy. For more information about ML Models, see the ML Models section in Use Pentaho Data Catalog.
Configure an MLflow server connection
Perform the following steps to configure a pre-production MLflow server connection in Data Catalog.
Before you begin:
Make sure you have access to the MLflow server you want to connect to.
If the MLflow server requires authentication, make sure you have the necessary credentials, either a valid username and password or an access token.
Procedure
Verify whether the file
external-data-source-config.yml
exists in the path${PDC_CLIENT_PATH}/external-datasource/
. If not available, create it.Open the
external-data-source-config.yml
file and add ML server configuration:servers: - id: {SERVER_ID} name: {SERVER_NAME} type: {SERVER_TYPE} url: {SERVER_URL} config: username: {username} password: {password} access_token: {access_token}
id
Unique identifier (UUID) for the ML server.
916d3b20-7fd6-49d2-b911-cc051f56e837
name
Display name for the server. This name appears in the UI.
MLflowServer
type
Type of server (enum value). For ML server, use ‘MlFlow
’.
URL
The base URL of the ML server.
http://mlflow.mycompany.com
config
Configuration keys specific to ML server you are configuring. Include either, only if authentication is enabled:- Username and password
Access token
3. After configuring the ML server in the YAML file, restart the PDC services to apply the changes.
You have successfully configured the MLflow server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.
You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.
Configure a Triton server connection
Perform the following steps to configure a production Triton inference server connection in Data Catalog:
Before you begin:
Ensure the Triton inference server is running and accessible.
If running Triton in Docker, confirm that the ports are exposed (HTTP, gRPC, Metrics). For example:
version: "3.9"
services:
triton:
image: nvcr.io/nvidia/tritonserver:24.11-py3
volumes:
- ./models:/models
command: ["tritonserver", "--model-repository=/models", "--backend-config=execution_mode=cpu", "--strict-model-config=false"]
ports:
- "8000:8000" # HTTP
- "8001:8001" # gRPC
- "8002:8002" # Metrics
Procedure
Verify whether the file
external-data-source-config.yml
exists in the path${PDC_CLIENT_PATH}/external-datasource/
. If not available, create it.Open the
external-data-source-config.yml
file and add ML server configuration:servers: - id: {SERVER_ID} name: {SERVER_NAME} type: Triton url: {TRITON_SERVER_URL} config: metadata_port: {metadata-port} metrics_port: {metrics-port}
id
Unique identifier (UUID) for the Triton server.
44e2fa51-e3af-4094-8dd3-c62320952de5
name
Display name for the server. This name appears in the UI.
Triton-Prod-server
type
Type of server (enum value). For Triton server, use ‘Triton
’.
Triton
URL
The base URL where the Triton server is deployed.
http://192.168.0.10
config.metadata_port
HTTP port configured when deploying the Triton server. The default port is 8000.
8000
config.metrics_port
Metrics port configured by the user when deploying the Triton server. The default is 8002.
8002
3. Save the YAML file and restart the PDC services to apply the changes.
You have successfully configured the Triton server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.
You can now import ML model server components into the ML Models hierarchy of Data Catalog. For more information, see Import ML model server components into ML Models hierarchy.
Configure a Tableau server connection in Data Catalog
You can configure a connection between a Tableau server and Data Catalog to import Tableau metadata such as dashboards, workbooks, projects, and data sources into the Business Intelligence (BI) section of Data Catalog. To learn more, see the Business Intelligence section in the Use Pentaho Data Catalog guide. This guide depicts a step-by-step procedure to configure the Tableau server connection in Data Catalog.
Before you begin:
Make sure you have access to the Tableau Cloud or Tableau Server instance you want to connect to. The URL format looks like:
https://<region>.online.tableau.com/#/site/<site-id>/home
Identify the Site ID for the Tableau site. For Tableau Cloud, you can find this in the URL after
/site/
.Generate a valid Personal Access Token (PAT) in Tableau, including PAT name and PAT secret.
Perform the following steps to configure a connection between the Tableau server and Data Catalog:
Verify whether the file
external-data-source-config.yml
exists in the path$ {PDC_CLIENT_PATH}/external-datasource/
. If not available, create it.Open the
external-data-source-config.yml
file and add Tableau server configuration:servers: - id: dev-8f012f9ca7 name: Test_Server type: Tableau url: https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin config: pat_name: 'test' pat_secret: 'kITbTaYmTPSdZ7ADeP11VA==:hwt9jkehQqGuq72Lh9V4wiFlfZcIpny8'
ParameterDescriptionExampleid
The site ID (unique identifier) of the Tableau site to connect to, as seen in the Tableau Cloud URL.
dev-8f012f9ca7
name
Display name for the server. This name appears in the UI.
TableauServer
type
Type of server (enum value). For Tableau server, use ‘
Tableau
’.Tableau
URL
The Tableau REST API authentication endpoint. Use the
signin
endpoint for the Tableau site.https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin
config
Configuration keys specific to the Tableau server you are configuring.
-
pat_name
The name of the Tableau Personal Access Token (PAT) used for authentication.
-
pat_secret
The secret key associated with the PAT. Ensure this is stored securely and never exposed.
After configuring the Tableau server in the YAML file, restart the following PDC services to apply the changes:
Frontend service (fe)
Worker service (ws-default)
# Restart the frontend and worker services ./pdc.sh restart fe ./pdc.sh restart ws-default
You can now import Tableau server components into the Business Intelligence hierarchy of Data Catalog.
Configure a Power BI service connection in Data Catalog
You can connect the Microsoft Power BI service to Data Catalog and import Power BI metadata into the Business Intelligence section. This integration lets you discover, explore, and manage Power BI reports, datasets, and workspaces directly from Data Catalog. This guide depicts a step-by-step procedure to configure the Power BI server connection in Data Catalog.
Before you begin:
Ensure that you have a valid Microsoft account with an active Power BI service license. Contact your Microsoft administrator if you need access.
Register an Azure Active Directory (Azure AD) application in the Azure portal. For more information, see Microsoft guide to register an app.
Generate a client secret in the Azure AD application and store it securely. For more information, see Microsoft guide to add client secrets.
Assign the following API permissions to the Azure AD app:
PermissionDescriptionApp.Read.All
View all Power BI apps
Capacity.Read.All
View all capacities
Dashboard.Read.All
Read dashboards
Dataflow.Read.All
Read dataflows
Dataset.Read.All
View all datasets
Report.Read.All
Read reports
Tenant.Read.All
View all content in tenant
Workspace.Read.All
View all workspaces
See Power BI automation permissions for more information.
Ensure that a service account is available with access to all Power BI workspaces that you need to integrate with Data Catalog.
Confirm outbound HTTPS access to
https://api.powerbi.com
on port443
from the Data Catalog server.Collect the following details before continuing:
FieldDescriptionClient ID
From Azure AD App registration
Client Secret
Securely generated during Azure AD registration
Tenant ID
Azure AD Directory unique identifier
OAuth2 Token URL
https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token
API Host URL
https://api.powerbi.com
Username
Credentials for the Power BI server.
Password
Credentials for the Power BI server.
Perform the following steps to configure a connection between the Power BI server and Data Catalog:
Connect to the Data Catalog server using SSH.
From your local machine, open a terminal (for Linux or macOS) or an SSH client, such as PuTTY (for Windows).
Enter the SSH command to connect to the server where Pentaho Data Catalog is installed.
ssh <pdc-admin>@<pdc-server-ip>
Replace
<username>
with your server login account and<server-ip>
with the server’s IP address or hostname.When prompted, enter the password for the specified account. After a successful login, you will have access to the Data Catalog server’s command line.
Navigate to the configuration folder and verify whether the file
external-data-source-config.yml
exists in the path$ {PDC_CLIENT_PATH}/external-datasource/
. If not available, create it./opt/pentaho/pdc-docker-deployment/conf/external-data-source
Open the
external-data-source-config.yml
file and add Power BI server configuration:servers: - id: <your-server-id> name: PowerBIServer type: PowerBI url: https://login.microsoftonline.com/organizations/oauth2/v2.0/token config: client_id: <your-client-id> username: <your-username> password: <your-password> client_secret: <your-client-secret> host_url: https://api.powerbi.com
ParameterDescriptionExampleid
The unique identifier for the Power BI server connection. You define this value in the YAML.
dev-powerbi01
name
Display name for the server. This name appears in the Data Catalog UI.
PowerBIServer
type
Type of server (enum value). For Power BI server, use ‘
PowerBI
’.PowerBI
URL
The Microsoft identity platform OAuth 2.0 token endpoint. Used for authentication requests.
https://prod-apnortheast-a.online.tableau.com/api/3.22/auth/signin
config
Configuration keys specific to the Power BI server you are configuring.
-
client_id
The application (client) ID generated during Azure AD app registration.
12345678-abcd-1234-abcd-1234567890ab
-
password
The password associated with the Microsoft account username. Ensure this is stored securely.
-
client_secret
The client secret generated for the Azure AD app. Ensure this is stored securely and never exposed.
abcdEFGH12345!@#xyz
-
host_url
The base API endpoint for Power BI service.
https://api.powerbi.com
After configuring the Power BI server in the YAML file, restart the following PDC services to apply the changes:
./pdc.sh up
You have successfully configured the Power BI server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.
You can now import Power BI server components into the Business Intelligence hierarchy of Data Catalog.
Configure the Physical Assets service in Data Catalog
In Pentaho Data Catalog, you can import operational technology (OT) components, including device services, locations, devices, and values, and view them in the Physical Assets section of Data Catalog in a hierarchical structure. With the Physical Assets feature, you can understand how data flows from physical sources into analytical systems, enabling better traceability and context. Additionally, users can enrich asset nodes with business terms, policies, lineage, and metadata to strengthen data governance and compliance. For more information, see Physical Assets in the Use Pentaho Data Catalog document.
To use the Physical Assets feature in Data Catalog, you must first configure the Physical Assets service. This involves completing the following procedures:
Note: The configuration steps assume Data Catalog is already installed. For installation instructions, seeInstall Data Catalog in the Install Pentaho Data Catalog document.
Enable the Physical Assets service in Data Catalog
Perform the following steps to enable the Physical Assets service in the existing Data Catalog deployment.
Go to the
vendor
folder:/pentaho/pdc-docker-deployment/vendor
Open the
.env.default
file:Update the following lines:
COMPOSE_PROFILES=core,mongodb,collab,pdso,mdm,physical-assets ASSET_HIERARCHY_URL=/physical-assets-service
Add the Pentaho Edge connection details:
PENTAHO_EDGE_URL=http://<PE-IP>:4000 PENTAHO_EDGE_USERNAME_PASSWORD=admin:admin PENTAHO_EDGE_BACKEND_URL=https://<PE-IP>:8443
Replace
<PE-IP>
with the IP where Pentaho Edge is installed.Restart PDC to apply changes:
cd /pdc-docker-deployment ./pdc.sh up
You have successfully enabled the Physical Assets service in the Data Catalog deployment. The service is now active and ready to connect with Pentaho Edge to receive physical assets metadata.
Configure Pentaho Edge for the Physical Assets service
Perform the following steps to configure Pentaho Edge to connect it to Data Catalog.
Clone the Pentaho Edge installer repository and navigate to the installer folder:
git clone <repo-url> cd installer
Edit the
docker-compose-pentaho-edge.yml
file:ENABLE_ASSET_HIERARCHY_FEATURE=true
Save and close the
docker-compose-pentaho-edge.yml
file.Open the
.env
file:vi .env
Update the following properties:
PDC_ASSET_HIE_SERVICE_BASE_URL=https://<PDC-IP>/physical-assets-service/api/v1/assets AUTH_URL=https://<PDC-IP>/keycloak PDC_INSECURE_SKIP_VERIFY=true ENABLE_ASSET_HIERARCHY_FEATURE=true
Note:
Replace the
<PDC-IP>
address of the URL with the IP wherepdc-docker-deployment
is installed.Use the FQDN instead of the IP address if needed.
PDC_ASSET_HIE_SERVICE_BASE_URL=https://<FQDN>/physical-assets-service/api/v1/assets AUTH_URL=https://<FQDN> PDC_INSECURE_SKIP_VERIFY=true ENABLE_ASSET_HIERARCHY_FEATURE=true
Add authentication properties:
AUTH_USERNAME= AUTH_PASSWORD= AUTH_CLIENT_ID= AUTH_REALM=
Run the Edge installer script:
./install.sh
When prompted, provide a user ID and password.
You have successfully configured Pentaho Edge to support the Physical Assets hierarchy and configured the connection to Data Catalog. You can now view OT assets in the Physical Assets in Data Catalog.
Configure PDI to send lineage to Data Catalog
You can configure Pentaho Data Integration (PDI) to write lineage information from key lineage events into the Data Catalog metadata store. When configured, PDI writes lineage metadata for supported lineage events to the Data Catalog metadata store. Data Catalog continuously runs an API that reads the lineage information from PDI. Both PDI and Data Catalog support the OpenLineage framework for data lineage collection and analysis.
See Data lineage in Use Pentaho Data Catalog guide for information on the specific lineage events that are supported.
Before you begin this task, turn off PDI and the Pentaho Server.
Perform the following steps to set up PDI to send lineage metadata to Data Catalog.
On the Support Portal home page, sign in using the Pentaho support username and password provided in your Pentaho Welcome Packet. If you don't have credentials, contact your PDI administrator.
On the Pentaho card, click Download.
Navigate to the Marketplace location with plugin downloads.
Download the PDI OpenLineage plugin.
Extract the downloaded package.
Run the installer for PDI:
Run
install.sh
if on Linux, orinstall.bat
if on Windows.Install in the
<data-integration>
folder.
Run the installer for Pentaho Server:
Run
install.sh
if on Linux, orinstall.bat
if on Windows.Install in the
<pentaho-server>
folder.
Create a
config.yml
file, adding the correct users and passwords for your environment, and the URL for Data Catalog.There is an example in the
readme.txt
file:Example of a configuration file ================================= ```yaml version: 0.0.1 consumers: console: file: - path: /path/to/file http: - name: Marquez url: http://localhost:5001 - name: PDC url: https://pdc.example.com endpoint: /lineage/api/events authenticationParameters: endpoint: /keycloak/realms/pdc/protocol/openid-connect/token username: user password: pass client_id: pdc-client-in-keycloak scope: openid
Edit the
~/.kettle/kettle.properties
file and add the following properties:KETTLE_OPEN_LINEAGE_CONFIG_FILE=</full/path/to/your/openlineage/config.yml> KETTLE_OPEN_LINEAGE_ACTIVE=true
Restart PDI and the Pentaho Server.
You have successfully configured the PDI OpenLineage plugin. PDI now sends lineage metadata to Data Catalog.
Integrate Active Directory with Pentaho Data Catalog
You can integrate Microsoft Active Directory (AD) with Pentaho Data Catalog (PDC) to enable users of AD to have single sign-on access to PDC. Part of this integration includes configuring the Keycloak identity and access management tool to use AD as an identity provider.
The configuration includes the following topics:
Verify the LDAP server configuration
To integrate Active Directory with Pentaho Data Catalog, you need to integrate Lightweight Directory Access Protocol (LDAP) with Keycloak. You first need to check that your LDAP server is configured correctly.
For detailed information on how to configure LDAP in your environment, consult your LDAP server documentation.
You should have the following components in an example configuration:
Base DN: Base Distinguished Name, such as:
dc=example,dc=com
, wheredc
is the domain component. The Base DN is the root entry where you want to start your LDAP searches.User DN: User Distinguished Name, such as:
ou=users,dc=example,dc=com
, whereou
is the organizational unit and dc is the domain component.Groups DN: Groups Distinguished Name, such as:
ou=groups,dc=example,dc=com
, whereou
is the organizational unit anddc
is the domain component.
Next steps
Configure the LDAP provider
To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure the LDAP provider for PDC in the Keycloak interface.
Use the following steps to configure the LDAP provider:
Navigate to your Keycloak admin console (such as
https://<FQDN>/keycloak/
) and log in with admin credentials.Select the PDC realm.
If you haven't already configured an LDAP provider, click Add provider and select ldap. If you have an existing LDAP provider, click on it to edit.
Enter the following information for the LDAP provider:
FieldValueVendor
Active Directory
Connection URL
ldap://*<LDAP\_SERVER>*:*<PORT>*
such as:ldap://localhost:389
Click Test connection.
You should get a success message.
Enter the following information on the remainder of the page:
FieldValueBind type
Select simple
Bind DN
DN for your LDAP admin user, such as:
cn=administrator,dc=example,dc=com
Bind credentials
Password for the LDAP admin user
Click Test authentication.
You should get a success message.
The LDAP provider is configured for use with AD.
Connect to AD using the LDAP server's SSL certificate (Optional)
When you use an LDAP server with Pentaho Data Catalog (PDC), you can use the LDAP server's SSL certificate to securely connect to Active Directory (AD). This is an optional step in integrating AD with PDC.
For more information on integrating AD with PDC, see Integrate Active Directory with Pentaho Data Catalog. Refer to Keycloak documentation if necessary.
Perform the following steps to use the LDAP server's SSL certificate to connect to AD.
To retrieve the certificate from your LDAP server, enter the following command:
openssl s_client -connect ldap.example.com:636 -showcerts
Copy the entire certificate chain (from
-----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----
) and save it to a file, such asldap-cert.pem
.Update the
*<PDC\_INSTALL\_LOCATION>*/conf/extra-certs/bundle.pem
file with the LDAP server’s SSL certificate, where*<PDC\_INSTALL\_LOCATION>*
is the directory where PDC is installed.Restart PDC services by entering the following command:
sh pdc.sh restart
Log in to the Keycloak admin console (
https://*<FQDN>*/keycloak/
).Navigate to the PDC realm.
Click User Federation.
Click the LDAP provider to edit it.
Enter the following LDAP settings:
UI display name
Name to display, such as LDAPS
Vendor
Select Active Directory
Connection URL
ldaps://*<LDAP\_SERVER>*:*<PORT>*
such as: ldaps://ldap.example.com:636
10. Click Test connection. You should see a success message.
11. Enter the remaining LDAP connection and authentication settings:
Bind type
Select simple
Bind DN
DN to bind to the LDAP server, such as: cn=admin,dc=example,dc=com
.
Bind credentials
password for the Bind DN
12. Click Test authentication. You should see a success message.
13. Enter values for the required LDAP searching and updating settings:
Edit mode
It is a best practice to set this to Readonly
Users DN
Specify the DN where the user entries are located, such as: ou=users,dc=example,dc=com
Username LDAP attribute
cn
RDN LDAP attribute
cn
UUID LDAP attribute
objectGUID
User object classes
person, organizationalPerson, user
14. Click Save to save the configuration.
AD is set up to use the SSL certificate of the LDAP server for a secure connection.
Optionally, you can configure the following settings:
Set how often Keycloak should sync with LDAP.
Set periodic full sync and periodic changed users sync.
Configure LDAP mappers in Keycloak
To integrate Active Directory (AD) with Pentaho Data Catalog (PDC), you need to configure LDAP mappers so that PDC has the necessary information (such as usernames, email addresses, or group memberships) to connect to an LDAP directory.
Make sure you know the correct attribute name used in your LDAP directory for usernames.
The Keycloak LDAP mapper translates attributes stored in an LDAP directory into the corresponding attributes needed by PDC.
Use the following steps in Keycloak to configure the LDAP mappers for Data Catalog. See Keycloak documentation for more information.
In your Keycloak admin console (such as
https://<FQDN>/keycloak/
), log in with admin credentials.Select the PDC realm and go to the User Federation settings.
Click the LDAP provider and go to the Mappers tab.
Map the LDAP attribute for the username.
Map other user attributes as needed (such as email, first name, last name).
To add additional mappers to assign default roles for the users being imported from AD, enter the following settings under User federation > Settings > Mapper details.
For the Business User role:
FieldValueName
Business_User_Role_Mapper_To_LDAP_USERS
Mapper type
hardcoded-ldap-role-mapper
Role
Business_User (select from menu and click Assign)
Click Save.
Repeat step 6 for the Data User role, using the following values:
FieldValueName
Data_User_Role_Mapper_To_LDAP_USERS
Mapper type
hardcoded-ldap-role-mapper
Role
Data_User (select from menu and click Assign)
Click Save.
Save the configuration.
From the Action menu, click Sync all users to import users from LDAP.
A success message displays.
When users are synced from AD, the default PDC realm assigns the Business User and Data User roles to all the users.
Note: PDC applies limits for licensing when users receive one or more of the following roles:
Business Steward
Data Steward
Data Developer
Admin
Go to Users and verify that the users were imported correctly into Keycloak.
The LDAP mappers are configured.
Configure PDC permissions for an AD user
The last step in integrating Active Directory (AD) with Pentaho Data Catalog (PDC) is to set up permissions in PDC for the AD users.
Use the following steps to create and verify an AD user.
Log into Data Catalog as the admin user.
Click Management, and on the Users & Communities card, click Users.
Check that the imported users display correctly and make any needed adjustments.
Select an AD user and assign a community or role to the user.
Click Save, and log out.
Log in as the AD user to verify the login is working properly.
Active Directory is now integrated with PDC.
Integrate Okta with Pentaho Data Catalog
You can integrate Okta authentication with Pentaho Data Catalog for the added security provided by multi-factor authentication. To integrate Okta with Pentaho Data Catalog, you need to configure Okta in parallel with the Keycloak identity and access management tool.
The steps in the integration process are:
Add an OIDC provider in Keycloak
To integrate Okta with Pentaho Data Catalog, you need to set up an identity provider in Keycloak. Keycloak uses the OpenID Connect (OIDC) protocol to connect to identity providers.
If necessary, see the Keycloak documentation to complete this task.
Perform the following steps in the Keycloak interface:
Log in to Keycloak and select the PDC realm.
If a PDC realm does not already exist, consult your PDC administrator or see Creating a realm in the Keycloak documentation to create one.
Click Identity Providers and select OpenID Connect v1.0.
If necessary, see OpenID Connect v1.0 identity providers in the Keycloak documentation.
Use the following steps to add an OIDC ID provider:
Enter an alias in the Alias field.
This populates the Redirect URI field, in a format like the following:
http://localhost:8180/realms/master/broker/<alias>/endpoint
Copy the Redirect URI to be used in the next task, Add an OpenID Connect application in Okta.
You have added an OpenID Connect provider in Keycloak.
Perform the following tasks:
Add an OIDC application in Okta
The next step in integrating Okta with Pentaho Data Catalog is to add an OpenID Connect (OIDC) application in Okta.
Before beginning this task, make sure to perform this task:
In this task, you need the Keycloak Identity Provider Redirect URI you copied in the Add an OpenID Connect provider in Keycloak task. If necessary, see Launch the wizard in the Okta documentation.
Perform the following steps in the Okta Admin console:
From the left menu, click Applications and then Applications.
Click Create App Integration.
Select or enter the following values:
FieldValueSign-in method
OIDC – OpenID Connect
Application type
Web Application
App integration name
CatalogPlus_10.2.1
Grant type
Authorization Code
Sign-in redirect URIs
Keycloak Identity Provider Redirect URI copied in the Add OpenID Connect provider in Keycloak task
https://<application_url>/keycloak/realms/<realm_name>/broker/<alias_name>/endpoint/logout_response
For Sign-out redirect URIs, configure your logout URI in this format:
For example:
https://<ip address>/keycloak/realms/pdc/broker/okta/endpoint/logout_response
Continue entering values in Okta screens:
FieldValueControlled access
Select the default value, Allow everyone in your organization to access
Enable immediate access
Clear the checkbox
In the General tab, make a note of the Client Id and Client Secret to use in the Configure an identity provider in Keycloak task.
Click Save.
On the left menu, click Applications.
From the down arrow, select Assign to Groups.
Assign a group to your application.
You have set up an OpenID Connect application in Okta.
Perform the following tasks:
Set up security in Okta
When integrating Okta with Pentaho Data Catalog, you need to set up security in Okta for the connection to PDC.
Before beginning this task, make sure to perform these tasks:
Perform the following steps in the Okta admin console:
On the left menu, click Security, then API, then Default.
On the Access Policies tab, click Add New Access Policy.
Add details for the policy and click Create Policy.
Click Add Rule.
Add details for the rule and click Create Rule.
You have set up security for the Okta connection to PDC.
Perform the following tasks:
Configure an identity provider in Keycloak
To integrate Okta with Pentaho Data Catalog, you need to configure an identity provider in Keycloak. If necessary, see the Keycloak documentation.
Before beginning this task, make sure to perform these tasks:
In this task, you need the Client Id and Client Secret you noted during the Add an OpenID Connect application in Okta task.
Perform the following steps in the Keycloak admin console:
From the left menu, click Identity providers.
Click OpenID Connect v1.0.
Make sure the Use discovery endpoint switch is on.
In the Discovery endpoint field, enter the discovery endpoint URL in the following format:
https://hostname/auth/realms/master/.well-known/openid-configuration
The Authorization URL, Token URL, Logout URL, and User Info URL and other fields populate automatically.
Enter the Client Id and Client Secret noted during the Add an OpenID Connect application in Okta task.
On the Settings tab, select the following settings:
First login flow override: First login flow override
Sync mode: Force
Expand the Advanced settings and set the Scopes setting to openid email profile (separated by a single space).
Click Save.
You have configured the identity provider in Keycloak.
Perform the following task:
Sign in to Pentaho Data Catalog using Okta
After Pentaho Data Catalog is integrated with Okta, you have the option to log in to PDC with Okta.
Before beginning this task, make sure to perform these tasks:
To log in to PDC using Okta, perform the following steps:
On the PDC login screen, click the button corresponding to the Okta alias.
Note: The alias matches whatever is set for Okta OpenID Connect in Keycloak.
In the following example, the button is labeled CATALOG+OKTA:
Updated PDC login screen after Okta integration On the Okta login screen that appears, enter the credentials for the Okta user assigned to PDC.
Okta prompts you to enter a code.
To finish logging in, enter the code that Okta provides.
You have completed the integration of Okta with PDC.
Configure Metadata Request Access in Data Catalog
You can configure Metadata Request Access in Data Catalog to manage user requests for metadata access. This feature uses the Access Request Service, a backend service that integrates with ticketing tools such as Jira or ServiceNow. The service creates, tracks, and updates access requests, and synchronizes their status between Data Catalog and the external system. This guide depicts how to configure Metadata Request Access in Data Catalog.
By default, only metadata access requests are managed in Data Catalog. Data access requests are routed to an integrated ticketing system.
Before you begin,
Ensure you have:
Network connectivity between the service and Jira or ServiceNow.
Credentials for authenticating with your identity provider and your ticketing system.
Confirm that the external system (Jira or ServiceNow) contains a custom field to store the access request status (for example, Access Status with values Approved and Rejected). For more information, see Integrating ServiceNow with Data Catalog and Integrating Jira with Data Catalog.
Gather the following details:
Service account credentials for Jira or ServiceNow.
Database connection parameters.
Authentication service endpoint, client ID, and credentials.
Perform the following steps to configure Metadata Request Access in Data Catalog:
Configure the service by setting the required environment variables[PA1] . Use either your deployment tool (for example, Docker Compose, Kubernetes, or Helm) or a configuration file. Key environment variables
VariableDescriptionRequiredExampleTENANT_NAME
Tenant identifier
Yes
your-tenant
ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE[SR5]
Default assignee email if no auto-assignment
Yes
STATUS_FETCHER_INTERVAL
Poll interval for status updates; supports cron or @every syntax
Yes
@every 5m
PAGINATION_MAX_RESULTS_SIZE
Maximum results per API request
No
1000
LOG_LEVEL
Logging verbosity (info, debug, warn, error)
No
info
CAUTION: The value of
ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE
must match a user created in Data Catalog with the Admin role using User Management. Authentication settingsVariableDescriptionRequiredExampleAUTH_URL
Authentication endpoint URL
Yes
https://auth.example.com
AUTH_HOST
Authentication host and port
Yes
auth-host:5000
AUTH_CLIENT_ID
Client ID for authentication
Yes
generic-client-id
AUTH_USER_NAME
Username for authentication
Yes
generic-user
AUTH_PASSWORD
Password for authentication
Yes
generic-password
Database settings
VariableDescriptionRequiredExampleDB_URL
Database host
Yes
generic-db-host
DB_PORT
Database port
Yes
5432
DB_NAME
Database name
Yes
generic_db
DB_USER
Database user
Yes
generic-db-user
DB_PASSWORD
Database password
Yes
generic-db-password
DB_SSL_MODE
SSL mode (disable, require)
Yes
disable
Ticketing provider selection
VariableDescriptionRequiredExamplePROVIDER_TOOL
Select ticketing system (Jira or ServiceNow)
Yes
Jira
If Jira is used
VariableDescriptionRequiredExampleJIRA_URLrestart
Jira server URL
Yes
https://jira.example.com
JIRA_PROJECT_NAME
Project key or name
Yes
GENERIC
JIRA_PASSWORD
Jira password or API token
Yes
generic-jira-password
JIRA_ACCESS_STATUS_KEY
Jira field key for access status
Yes
Access Status
JIRA_ACCESS_STATUS_APPROVED_VALUE
Field value for approval
Yes
Approved
JIRA_ACCESS_STATUS_REJECTED_VALUE
Field value for rejection
Yes
Rejected
If ServiceNow is used
VariableDescriptionRequiredExampleSERVICENOW_URL
ServiceNow instance URL
Yes
https://servicenow.example.com
SERVICENOW_PASSWORD
ServiceNow password
Yes
generic-snow-password
SERVICENOW_CLIENT_ID
ServiceNow client ID
Yes
snow-client-id
SERVICENOW_CLIENT_SECRET
ServiceNow client secret
Yes
snow-client-secret
SERVICENOW_ACCESS_STATUS_KEY
ServiceNow field key for access status
Yes
u_access_status
SERVICENOW_ACCESS_STATUS_APPROVED_VALUE
Field value for approval
Yes
Approved
SERVICENOW_ACCESS_STATUS_REJECTED_VALUE
Field value for rejection
Yes
Rejected
Save your configuration and restart the access-request-service container.
(optional) Once you restarted the access-request-service container, you can verify the configuration:
Submit a metadata request in the PDC UI.
Confirm that a corresponding ticket is created in Jira or ServiceNow.
Update the ticket status to Approved or Rejected.
Confirm that PDC reflects the updated status after the poll interval.
You have configured Metadata Request Access in Data Catalog. Users can request metadata access, and the request is automatically created and tracked in the configured ticketing system. Approved or rejected statuses are synchronized back to PDC.
Integrating Jira with Pentaho Data Catalog
You can integrate Jira as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure Jira integration using a config.yaml
file or environment variables, and how to create a custom field in Jira to use for the data access request statuses.
To integrate Jira with Data Catalog, perform the following tasks:
Choose one of the following configuration methods:
Integrate Jira with Data Catalog using a config.yaml file
To integrate the Jira ticketing system with Pentaho Data Catalog, you can use a config.yaml
file with settings for connection details, credentials, project information, and status mappings. If your system does not have a config.yaml
file, you can also integrate Jira with Data Catalog using environment variables. For more information, see Integrate Jira with Data Catalog using environment variables.
Perform the following steps to integrate Jira with Data Catalog using a config.yaml
file:
Go to
/pentaho/pdc-docker-deployment/conf
folder and openconfig.yaml
file. If not available, create it.Add the following configuration to the
config.yaml
file.Note: Replace the placeholder values in angle brackets (< >) with your actual Jira credentials and project details.
tools: jira: url: <your_jira_url> username: <your_jira_username> password: <your_jira_password> project_name: <your_jira_project_name> access_status_key: <your_jira_access_status_key> status_mapping: approved_status: <your_jira_approved_status_value> rejected_status: <your_jira_rejected_status_value>
database: postgres: host: 'um-postgresql' port: 5432 user: 'postgres' password: 'admin123#' dbname: 'pdc_access_request_db' sslmode: 'disable' tools: jira: url: 'https://teamwork7com.atlassian.net' username: '[email protected]' password: 'ATATT3xFfGF0u8CftyjZn0PO-p-M_J8VJUtYcn3ZiZzfM0pF7iqpnUT3TPd-7q7QO8PevM7IDIyzSwwwUoksY7tfKnJwXV1EokuGhp1YmcIkP-78H0H1-io2bkkSVL-bpRmwL4Tha0yWWHZRnEvhSRE1SX984WX4vQMp1hi9w6ua5a_jAq9gWwg=0C4170DE' project_name: 'KAN' access_status_key: 'Access Status' status_mapping: approved_status: 'Approved' rejected_status: 'Rejected' auth: url: 'https://10.177.177.7' auth_host: 'um-css-admin-api:5000' username: 'admin' password: 'admin' client_id: 'admin-cli'
Save the changes and close the
config.yaml
file.Open
vendor/docker-compose-um.yml
file. Under the access request service container configuration, add theVolumes
section parallel to theenvironment
section, save the changes, and close the file.volumes: - <path-to-the-config.yaml>:/app/config.yaml:ro
Restart the access request service to apply changes:
./pdc.sh restart access-request-service
You have successfully configured Jira with Data Catalog using the config.yaml
file.
You now need to add a custom field to Jira, to include the data access request statuses. For more information, see Add a custom field to Jira.
Integrate Jira with Data Catalog using environment variables
To integrate the Jira ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, and status mappings. You can also Integrate Jira with Data Catalog using a config.yaml file.
Perform the following steps to integrate Jira with Data Catalog using environment variables:
Edit the
/opt/pentaho/pdc-docker-deployment/vendor/.env.default
file and add the following lines:Note: Instead of the default location, your environment variables may be set in an
opt/pentaho/pdc-docker-deployment/conf/.env
file.ACCESS_REQUEST_SERVICE_PROVIDER_TOOL=Jira ACCESS_REQUEST_SERVICE_JIRA_URL=your_jira_url ACCESS_REQUEST_SERVICE_JIRA_USER_NAME=your_jira_username ACCESS_REQUEST_SERVICE_JIRA_PASSWORD=your_jira_password ACCESS_REQUEST_SERVICE_JIRA_PROJECT_NAME=your_jira_project_name ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_KEY=your_jira_access_status_key ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_APPROVED_VALUE=your_jira_approved_status_value ACCESS_REQUEST_SERVICE_JIRA_ACCESS_STATUS_REJECTED_VALUE=your_jira_rejected_status_value ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE=PDC_admin_user_email
Save the changes and close the file.
Restart the access request service to apply changes:
./pdc.sh restart access-request-service
You have successfully configured Jira with Data Catalog using environment variables.
You now need to add a custom field to Jira to include the data access request statuses. For more information, see Add a custom field to Jira.
Add a custom field to Jira
If you have configured Data Catalog to connect to Jira for managing data access requests, you need to add a custom field to Jira to map the Data Catalog data access request statuses to complete the Jira integration.
Perform the following steps to add a custom field to Jira:
Log in to Jira with administrative rights.
Go to the Jira Admin settings.
If you cannot find the Jira Admin settings, use these steps:
Open any issue.
In the Details section, click the settings icon, then click Manage Fields. In the bottom right corner, you see the Go to Custom Fields option.
Click Go to Custom Fields, and you are taken to the Jira Admin settings.
Click Custom Fields and then click Create custom field.
Select the Select List type and enter
Access Status
as the name.Add the options:
Approved
,Rejected
, andPending
, and click Create.Open any issue. In the Details section, click the settings icon, then click Manage Fields.
Locate the newly created Access Status field in the list of fields on the right side.
Drag and drop the Access Status field into the Context Fields section.
Jira is now updated to use data access request statuses with Data Catalog.
Integrating ServiceNow with Data Catalog
You can integrate ServiceNow as an external ticketing system with Data Catalog to manage data access requests. This guide describes how to configure ServiceNow integration using a config.yaml file or environment variables, and how to create a custom field in ServiceNow to track data access request statuses.
To integrate ServiceNow with Data Catalog, you need to perform the following tasks:
Choose one of the following configuration methods:
Integrate ServiceNow with Data Catalog using a config.yaml file
To integrate the ServiceNow ticketing system with Data Catalog, you can use a config.yaml
file with settings for connection details, credentials, project information, and status mappings. If your system does not have a config.yaml
file, you can also Integrate ServiceNow with Data Catalog using environment variables.
To integrate ServiceNow with Data Catalog using a config.yaml
file, use the following steps:
Go to
/pentaho/pdc-docker-deployment/conf
folder and openconfig.yaml
file. If not available, create it.Add the following configuration to the
config.yaml
file.Note: Replace the placeholder values in angle brackets (< >) with your actual ServiceNow credentials and project details.
tools: servicenow: url: <your_servicenow_url> username: <your_servicenow_username> password: <your_servicenow_password> client_id: <your_servicenow_client_id> client_secret: <your_servicenow_client_secret> access_status_key: <your_servicenow_access_status_key> status_mapping: approved_status: <your_servicenow_approved_status_value> rejected_status: <your_servicenow_rejected_status_value>
database: postgres: host: 'um-postgresql' port: 5432 user: 'postgres' password: 'admin123#' dbname: 'pdc_access_request_db' sslmode: 'disable' auth: url: 'https://10.177.177.7' auth_host: 'um-css-admin-api:5000' username: 'admin' password: 'admin' client_id: 'admin-cli' tools: servicenow: url: 'https://mycompany.service-now.com' username: 'mySNusername' password: 'mjfo39847tnd' client_id: 'ljdfsae9087534rmlvspe495rfnv' client_secret: '(YGVFJMKLOUJIHY' access_status_key: 'u_access_request_status' status_mapping: approved_status: 'Granted' rejected_status: 'Denied'
Save the changes and close the
config.yaml
file.Open
vendor/docker-compose-um.yml
file. Under the access request service container configuration, add theVolumes
section parallel to theenvironment
section, save the changes, and close the file.volumes: - <path-to-the-config.yaml>:/app/config.yaml:ro
Restart the access request service to apply changes:
./pdc.sh restart access-request-service
You have successfully configured ServiceNow with Data Catalog using the config.yaml
file.
You now need to add a custom field to ServiceNow, to include the data access request statuses. For more information, see Add a custom field to ServiceNow.
Integrate ServiceNow with Data Catalog using environment variables
To integrate the ServiceNow ticketing system with Data Catalog, you can use environment variables to set connection details, credentials, project information, and status mappings. You can also integrate ServiceNow with Data Catalog using a config.yaml
file.
Perform the following steps to integrate ServiceNow with Data Catalog using environment variables:
Edit the
/opt/pentaho/pdc-docker-deployment/vendor/.env.default
file and add the following lines:Note: Instead of the default location, your environment variables may be set in an
opt/pentaho/pdc-docker-deployment/conf/.env
file.ACCESS_REQUEST_SERVICE_PROVIDER_TOOL= ServiceNow ACCESS_REQUEST_SERVICE_SERVICENOW_URL=your_servicenow_url ACCESS_REQUEST_SERVICE_SERVICENOW_USER_NAME=your_servicenow_username ACCESS_REQUEST_SERVICE_SERVICENOW_PASSWORD=your_servicenow_password ACCESS_REQUEST_SERVICE_SERVICENOW_CLIENT_ID=your_servicenow_client_id ACCESS_REQUEST_SERVICE_SERVICENOW_CLIENT_SECRET=your_servicenow_client_secret ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_KEY=your_servicenow_access_status_key ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_APPROVED_VALUE=your_servicenow_approved_status_value ACCESS_REQUEST_SERVICE_SERVICENOW_ACCESS_STATUS_REJECTED_VALUE=your_servicenow_rejected_status_value ACCESS_REQUEST_SERVICE_DEFAULT_ASSIGNEE=PDC_admin_user_email
Save the changes and close the file.
Restart the access request service to apply changes:
./pdc.sh restart access-request-service
You have successfully configured ServiceNow with Data Catalog using environment variables.
You now need to add a custom field to ServiceNow to include the data access request statuses. For more information, see Add a custom field to ServiceNow.
Add a custom field to ServiceNow
If you have configured Data Catalog to connect to ServiceNow for managing data access requests, you need to add a custom field to ServiceNow to map the Data Catalog data access request statuses to complete the ServiceNow integration.
Perform the following steps to add a custom field to ServiceNow:
Log in to the ServiceNow instance with administrative rights.
Go to System Definition > Tables and locate the Incident table.
Open the Incident table, and at the bottom, in the Columns section, click New to add a new column (custom field).
Configure the following properties:
Column Label
Enter a descriptive name like Access Request Status
.
Column Name
Automatically generated as u_access_request_status
(prefixed with u_ to indicate it’s a custom field).
Type
Select the appropriate field type, which should be Choice for values like Pending, Granted, and Denied.
Choices
Once the field type is set to Choice, there is an option to add Choice List Values. Add the following values:- Pending
Granted
Denied You can also set a default value if desired.
5. Verify the changes you have made and click Submit.
ServiceNow is now updated to use data access request statuses with Data Catalog.
Configure Power BI templates with Pentaho Data Catalog
This guide outlines steps to configure Pentaho Data Catalog by connecting Power BI reports to a PostgreSQL database for catalog metadata. It details the setup for integration, including:
Pre-created database objects in PostgreSQL
Configuration of Power BI Desktop and Service
Scheduled refresh setup
Additionally, it explains synchronizing materialized view refresh jobs in PostgreSQL with Power BI dataset refresh schedules.
Prerequisites
Before configuring Power BI with Data Catalog, ensure:
PostgreSQL database is installed, configured, and accessible from the Power BI network.
A SQL client tool, such as DBeaver Community Edition, is available to run database scripts.
The user account has the necessary roles and privileges to create tables and materialized views in the PostgreSQL database.
Power BI Desktop is installed for creating and testing reports.
Power BI Service access is available for publishing reports and scheduling dataset refreshes.
The on-premises data gateway is installed and running to enable secure connectivity between Power BI Service and the PostgreSQL database.
Network permissions allow communication between the Power BI gateway host and the PostgreSQL server.
Table and view mapping
The following views and tables are pre-created in PDC for the Power BI.
View Name
Dependent Tables
mv_master
entities_master_view, datasource_category_mapping, currency_exchange_rates
mv_entity_category_summary_view
entities_custom_categorization, glossary_summary_view, terms_view, entities_master_view, currency_exchange_rates, datasource_category_mapping
mv_duplicate_savings_by_original_view
duplicate_files_view, entities_master_view, currency_exchange_rates, datasource_category_mapping
mv_duplicate_by_term_summary_view
entities_custom_categorization, glossary_summary_view, duplicate_files_view, terms_view, entities_master_view, currency_exchange_rates
mv_duplicate_entities_summary_view
duplicate_files_view, entities_master_view, currency_exchange_rates
mv_duplicate_entity_detail_view
duplicate_files_view, entities_master_view, currency_exchange_rates
mv_policies_summary
policies_summary_view, entities_policies_view, entities_master_view
Configure the Power BI Desktop environment
Use Power BI Desktop to update the PostgreSQL data source and configure connection settings before publishing the report to Power BI Service.
Perform the following steps to configure Power BI Desktop:
Open Power BI Desktop application and open the Power BI report (,pbix).
On the ribbon, select Transform Data > Data Source Settings.
In the Data Source Settings window, select your PostgreSQL connection.
Click Change Source, update Server name/host and Database name, and click OK.
Click Edit Permission, uncheck Encrypt connection and enter valid PostgreSQL database credentials.
Click OK and then click Close.
On the Home tab, select Apply Changes > Run All to apply the new configuration.
(Optional) To modify parameters directly, select Home > Edit Parameters, update values, and then click Apply Changes.
Power BI Desktop connects to the PostgreSQL database with the updated host, database, and credential settings. The report is now ready to be published to Power BI Service.
Configure the Power BI Service gateway
Use the Power BI Service to configure the on-premises data gateway and connect the Power BI dataset to the PostgreSQL database used by Data Catalog. Perform the following procedure to configure Power BI Service gateway:
Before you begin ensure you have an administrator access.
Install and configure on-premises Data Gateway
Download the On-premises data gateway installer from the Microsoft Power BI portal.
Run the installer as an administrator.
Select On-premises data gateway (recommended) when prompted.
Choose Standard mode, and then click Next.
Sign in with your Power BI credentials.
Enter a Gateway name and Recovery key (password), and then click Configure.
After installation completes, verify that the gateway status shows Online in the Power BI Service.
Configure Dataset in the Power BI Service
In Power BI Service, go to the workspace that contains your dataset.
Select the More options (⋮) menu next to the dataset, and then select Settings.
Under Parameters, update the Host name and Database name values.
Click Apply, and then select Reload to refresh the dataset parameters.
Add gateway connection
Go to Manage gateways from the Power BI Service navigation pane.
Select your gateway, and then choose Add to gateway.
Enter a Connection name (preferably the same as the PostgreSQL host name).
Select Basic authentication.
Enter the PostgreSQL username and password.
Click Apply to create and map the connection to the dataset.
The Power BI dataset is connected to the PostgreSQL database through the on-premises data gateway. The gateway is active and can be used for scheduled dataset refreshes in Power BI Service.
Modifying parameters in Power BI Service
Use Power BI Service to update data source parameters for the PostgreSQL connection used by Data Catalog.
Perform the following procedure to change data source details:
In Power BI Service, go to the workspace that contains your dataset.
Select the More options (⋮) menu next to the dataset, and then select Settings.
Under Parameters, update the required values such as Host name or Database name.
Click Apply, and then select Reload to refresh the dataset.
If a gateway error occurs, re-create the gateway connection from Manage gateways.
If the gateway is valid, map the dataset again to ensure the updated parameters are applied.
The Power BI dataset is updated with the new PostgreSQL connection parameters and successfully mapped through the on-premises data gateway.
Optional - Refresh scheduling and synchronisation
Generally, Cron jobs refresh materialized views in BIDB; however, you can also set a manual refresh schedule for a materialized view in PostgreSQL.
Perform the following steps to schedule a manual refresh:
PostgreSQL materialize view refresh scheduling
Create a PostgreSQL refresh function
Open your SQL client (for example, DBeaver) and run the following script to refresh all required materialized views:
CREATE OR REPLACE FUNCTION refresh_all_pentaho_views() RETURNS void AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_master;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_entity_category_summary_view;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_savings_by_original_view;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_by_term_summary_view;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_entities_summary_view;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_duplicate_entity_detail_view;
REFRESH MATERIALIZED VIEW CONCURRENTLY public.mv_policies_summary;
END;
$$ LANGUAGE plpgsql;
Schedule automatic refresh jobs
Use pgAgent or an operating system scheduler such as cron or Windows Task Scheduler to automate materialized view refreshes.
Example cron job for Linux:
# Refresh Pentaho Data Catalog views every day at 1:00 AM 0 1 * * * psql -U db_user -d your_database -c "SELECT refresh_all_pentaho_views();"
For Windows, use Task Scheduler:
Action: Launch psql.exe
Arguments: -U db_user -d your_database -c "SELECT refresh_all_pentaho_views();"
Power BI Dataset refresh configuration
Perform the following procedure to configure scheduled refresh in Power BI Service:
Power BI Service (Scheduled Refresh)
In Power BI Service, open your workspace and select the Datasets tab.
Select Schedule refresh for the dataset connected to PostgreSQL.
Turn on Keep your data up to date.
Configure the refresh schedule:
Power BI Pro: up to 8 refreshes per day
Power BI Premium: up to 48 refreshes per day
Set the time zone, frequency, and preferred refresh times.
Align the dataset refresh to run after the PostgreSQL refresh completes.
Gateway mapping validation
Go to Settings > Manage gateways > Connections.
Verify that the on-premises data gateway status is Online.
Confirm that the dataset connection is mapped correctly and uses valid credentials.
Refresh notification and logs
In Power BI Service, go to Settings > Scheduled refresh > Failure notifications.
Enable email alerts for refresh failures to receive proactive notifications.
Power BI datasets refresh automatically after PostgreSQL materialized views are updated.
Recommended refresh synchronization strategy
Component
Frequency
Trigger Time
Dependency
PostgreSQL Materialized Views
Every 24 hours (1:00 AM)
Cron job/pgAgent
N/A
Power BI Dataset
Every 24 hours (2:00 AM)
Power BI scheduled refresh
Runs after PostgreSQL refresh completes
This ensures that Power BI always retrieves updated, fully processed data from Data Catalog’s database views.
Last updated
Was this helpful?