Troubleshooting Pentaho Data Catalog
The Pentaho Data Catalog log files contain information that can help you determine the root cause of error messages you might see. Refer to the following topics for information on how to resolve the issues causing the error messages.
Low disk space message
If you see a Low disk space message from Pentaho Data Catalog while loading images into the Docker repository, you can resolve this issue by linking the Docker root directory to another directory.
Important: The other directory should have at least 100 GB of free space.
Use the following steps to resolve this issue:
Enter the following commands to link the /var/lib/docker directory to a directory with at least 100 GB of free space.
Note: In this example, the directory with at least 100 GB of free space is <dir with min 100 GB free>. Replace <dir with min 100 GB free> in the commands with the full path to your directory with a minimum of 100 GB of free space.
sudo systemctl stop docker
sudo mv /var/lib/docker <dir with min 100 GB free>
sudo ln -s <dir with min 100 GB free> /var/lib/docker
sudo systemctl start docker
Repeat the action that produced the Low disk space message.
The action should succeed without producing a Low disk space message.
Authentication failure after upgrading Remote Worker from 10.2.7 to 10.2.9
When upgrading the Remote Worker from version 10.2.7 to 10.2.9, the Remote Worker container fails to start. The startup log displays an error indicating that authentication failed with the SASL mechanism SCRAM-SHA-512.
This issue occurs because the Kafka user credentials used by the Remote Worker become invalid during the upgrade.
After the upgrade, the Remote Worker container (pdc-ws-remote) fails to start, and its startup log shows an authentication error indicating that SASL SCRAM-SHA-512 authentication failed.
Workaround
You can fix this issue by resetting the Kafka SCRAM-SHA-512 password for the pdcuser on the PDC main server and restarting the Remote Worker.
Use the following steps to resolve this issue:
Log in to the Kafka container on the PDC main server:
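For example, if the Kafka container is named kafka (the container name is an assumption; check docker ps for the actual name):
docker exec -it kafka bash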
Run the following command to reset the SCRAM-SHA-512 password for the Kafka user pdcuser:
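A hedged example using the standard Kafka CLI; the script location, broker address, and any required admin client configuration depend on the Kafka image in your deployment, and <new-password> must match the password configured for the Remote Worker:
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type users --entity-name pdcuser --add-config 'SCRAM-SHA-512=[password=<new-password>]'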
Exit the Kafka container.
Restart the Remote Worker service:
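For example, if the Remote Worker runs as the pdc-ws-remote Docker container on the Remote Worker host (adjust to your deployment):
docker restart pdc-ws-remote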
After resetting the Kafka password and restarting the Remote Worker, authentication succeeds and the Remote Worker starts successfully.
OpenSearch jobs may fail or stall when running high parallel loads
When Pentaho Data Catalog executes a large number of parallel jobs, such as concurrent scans, profiling jobs, or metadata loads, OpenSearch may reach its default limit for open scroll contexts. By default, the OpenSearch setting search.max_open_scroll_context is set to 500. Pentaho Data Catalog uses OpenSearch scroll queries to read large result sets, and each parallel job can open one or more scroll contexts. When the combined number of concurrent read operations exceeds this limit, jobs can fail, stall, or behave unpredictably, especially in performance or high-concurrency environments. This issue is more likely to occur when:
Multiple workers execute jobs in parallel.
Large datasets are processed concurrently.
Performance or load testing environments are used.
High read concurrency is configured in Data Catalog.
Resolution
To resolve this issue, you can increase the OpenSearch scroll context limit by updating the cluster configuration. Perform the following steps to update the search.max_open_scroll_context value:
Procedure
Connect to the opensearch-master pod.
Example:
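A minimal sketch, assuming the pod is named opensearch-master-0; replace the pod name and namespace with the values from your cluster:
kubectl exec -it opensearch-master-0 -n <namespace> -- bash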
Run the following command to increase the scroll context limit:
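The following call uses the OpenSearch cluster settings API; the endpoint, credentials, and TLS flags are assumptions to adjust for your cluster:
curl -k -u <admin-user>:<password> -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"search.max_open_scroll_context": 2000}}'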
Result
OpenSearch allows up to 2000 concurrent scroll contexts, enabling Pentaho Data Catalog to run a higher number of parallel jobs without hitting scroll context limits.
The Pentaho Data Catalog team validated 2000 for performance environments. You can increase this value further if required, based on your infrastructure capacity.
Unable to connect to OpenSearch using HTTPS (Security plugin not initialized)
When accessing OpenSearch over HTTPS, the system may fail to connect because the OpenSearch Security plugin is enabled but not yet initialized. This occurs when the .opendistro_security index does not exist, preventing OpenSearch from recognizing user credentials, roles, TLS settings, and other security configurations.
The OpenSearch logs show an error indicating that the Security plugin is not initialized.
This issue typically appears during an upgrade (for example, from PDC 10.2.1 to 10.2.6), not in fresh installations of Data Catalog 10.2.5 or later, where the security index is initialized by default.
For fresh installations, the initialization process should be performed only under the supervision of Pentaho Data Catalog Customer Support.
Workaround
Perform the following steps to resolve the issue:
Log in to the deployment server where Data Catalog is running.
Stop all Pentaho Data Catalog containers:
./pdc.sh stop
Identify the OpenSearch container IDs.
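For example (the name filter is an assumption; adjust it to how your containers are named):
docker ps -a | grep opensearch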
Remove the OpenSearch containers.
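For example, using the container IDs from the previous step:
docker rm -f <opensearch-container-id>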
List all Docker volumes related to OpenSearch to confirm their presence:
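For example:
docker volume ls | grep opensearch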
Typical volumes include pdc_opensearch_data and pdc_opensearch_snapshots.
Caution: Do not delete OpenSearch volumes unless explicitly instructed by Customer Support. Deleting these volumes permanently removes all OpenSearch data, including indexes, and should only be done under the supervision of support.
Delete the pdc_opensearch_data volume:
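For example:
docker volume rm pdc_opensearch_data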
Delete the pdc_opensearch_snapshots volume:
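For example:
docker volume rm pdc_opensearch_snapshots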
Restart Data Catalog services:
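For example, assuming the pdc.sh script that stopped the containers also provides a start action (adjust if your deployment uses a different command):
./pdc.sh start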
Verify that all the services are up and running now.
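For example, list the running containers and confirm that the OpenSearch and other Data Catalog containers are up:
docker ps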
After you remove the existing OpenSearch volumes and restart the system, the .opendistro_security index is reinitialized. OpenSearch initializes the Security plugin, loads its configuration successfully, and connects over HTTPS without errors.
Jobs remain in the Accepted state after deployment
After deploying Pentaho Data Catalog, some jobs may remain in the Accepted state and do not move to Running or Completed. You may notice one or more of the following:
A job appears as Accepted and does not progress.
The job does not complete even after several minutes.
Resubmitting the job shows the same behavior.
Restarting Ops services temporarily resolves the issue.
Refreshing the browser or resubmitting the job does not resolve the issue.
Workaround
Restart only the Ops services to refresh the job scheduling components. Perform the following restart procedure that matches your deployment.
Docker deployments
Log in to the server where Data Catalog is installed.
Go to the Data Catalog installation directory (where pdc.sh is located).
Run the following command:
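A hedged example using the Docker CLI directly, assuming the Ops services run as containers whose names contain ops; verify the actual container names with docker ps first, and use the restart option of your pdc.sh script instead if it provides one:
docker restart $(docker ps -q --filter "name=ops")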
After the restart completes, re-run the operation that was previously stuck and verify that the job progresses normally.
Kubernetes deployments
Connect to the Kubernetes cluster where Data Catalog is deployed.
Restart the Ops deployment:
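For example (the deployment name and namespace are placeholders; check kubectl get deployments for the actual name of the Ops deployment):
kubectl rollout restart deployment/<ops-deployment-name> -n <namespace>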
Monitor the rollout status:
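For example:
kubectl rollout status deployment/<ops-deployment-name> -n <namespace>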
After the pods are ready, run the operation again and confirm that the job moves beyond the Accepted state.
Request Header Fields Too Large error when importing metadata from Power BI
When importing metadata from Power BI, Data Catalog might display a Request Header Fields Too Large error.
This issue occurs when the metadata payload sent to the Power BI service exceeds the allowed header size limits. In environments with large dashboards, long report paths, or extensive metadata, the default batch size can be too large and can trigger the request-header-size error.
Workaround
The batch size is controlled by the POWERBI_BATCH_SIZE environment variable in the docker-compose.bi-metadata.yml file.
By default, Data Catalog uses a batch size of 50 when fetching metadata from Power BI. Reduce the value of the POWERBI_BATCH_SIZE environment variable to lower the number of metadata items sent per request. A smaller batch size reduces header size and prevents the error from occurring.
Perform the following steps to reduce the value of the POWERBI_BATCH_SIZE environment variable:
Procedure for Docker deployments
Open the Data Catalog installation directory:
Open the docker-compose.bi-metadata.yml file from the same directory:
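For example, using a text editor of your choice:
vi docker-compose.bi-metadata.yml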
Locate the following environment variable:
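In a Docker Compose file, the variable typically appears under the service's environment section; the exact surrounding structure in your file may differ:
- POWERBI_BATCH_SIZE=50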
Reduce the value. For example:
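- POWERBI_BATCH_SIZE=20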
You can try 20, 10, or even 5, depending on the environment size.
Save the file.
Restart the bi-metadata service:
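For example, recreating the service so the new environment value takes effect (the compose file name comes from this procedure; your deployment may wrap this in pdc.sh instead):
docker compose -f docker-compose.bi-metadata.yml up -d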
Re-run the Power BI metadata import.
Procedure for Kubernetes deployments
If you deploy using Helm or Kubernetes manifests, update the environment variable in the bi-metadata deployment and restart the pods:
Go to the Helm chart directory for your PDC deployment:
Open the values file for the bi-metadata-api service and set or update the batch size. The environment variables for this service are defined in:
Add or update the following entry:
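A sketch of a possible values entry; the key names shown here are illustrative, so use whichever keys your chart maps to the POWERBI_BATCH_SIZE environment variable:
bi-metadata-api:
  env:
    POWERBI_BATCH_SIZE: "20"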
The chart will inject this value into the deployment templates, including:
Save the changes.
After updating the values file, redeploy the Helm chart:
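For example (the release name, chart path, values file, and namespace are placeholders for your deployment):
helm upgrade <release-name> <chart-path> -f <values-file> -n <namespace>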
Wait for the bi-metadata-api pods to restart. You can check the rollout status using:
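For example (the deployment name is an assumption based on the bi-metadata-api service name; adjust it if your chart names it differently):
kubectl rollout status deployment/bi-metadata-api -n <namespace>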
Re-run the Power BI metadata import.
Result
Data Catalog applies the updated batch size to Power BI metadata requests. The metadata import operation proceeds without triggering the Request Header Fields Too Large error. If the error persists, reduce the batch size further and redeploy the Helm chart.
Recommended values
The optimal batch size depends on the amount of metadata per workspace. Use the following guidance:
50 (default): Small to medium Power BI environments
20: Large metadata collections, occasional failures
10: Frequent Request Header Fields Too Large errors
5: Very large or complex Power BI deployments
Metadata rule execution fails for large generic conditions
When you configure a Metadata Rule with a very generic condition that matches a large number of files, the associated Data Discovery or Data Profiling job may not start or complete successfully.
In this scenario:
The rule execution status initially appears as Submitted.
The status then changes to Failed in the rule execution history.
No corresponding Data Discovery or Data Profile job completes for the matched files.
This issue typically occurs when the rule matches a very large number of assets and the job submission payload exceeds processing limits.
Workaround
In Data Catalog, the rules engine supports batching when submitting Data Discovery or Data Profile jobs. You can reduce the batch size to limit the number of asset IDs submitted in a single request.
By default, the batch size is set to 10,000 IDs per batch and is controlled by the RULES_START_PROCESS_PAYLOAD_BATCH_SIZE environment variable.
To mitigate this issue, reduce the batch size value.
Procedure
Open the environment configuration file:
Locate the following environment variable:
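Based on the default described above, the entry looks similar to the following:
RULES_START_PROCESS_PAYLOAD_BATCH_SIZE=10000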
Reduce the batch size to a lower value, for example:
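For example (5,000 is an illustrative value; choose a value appropriate for your environment):
RULES_START_PROCESS_PAYLOAD_BATCH_SIZE=5000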
Save the file.
Restart the Pentaho Data Catalog application to apply the change.
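For example, on a Docker deployment (assuming the pdc.sh script provides a start action to match the stop action):
./pdc.sh stop
./pdc.sh start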
Rerun the metadata rule.
Result
The rules engine submits Data Discovery or Data Profile jobs in smaller batches, reducing payload size and allowing the jobs to start and complete successfully for large rule matches.