Install Data Catalog in AWS EKS
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that lets you run Kubernetes applications on AWS without the overhead of managing the control plane. Deploying Pentaho Data Catalog on AWS EKS combines the benefits of Data Catalog's powerful data discovery, profiling, and governance capabilities with the resilience and scalability of Kubernetes in the AWS cloud. By using EKS as the orchestration platform for Data Catalog, you can:
Scale Data Catalog services dynamically based on workload demand.
Ensure high availability across AWS regions and zones.
Securely integrate with AWS data sources like S3, Redshift, and DynamoDB.
Centralize monitoring and logging using CloudWatch.
Reduce operational complexity by relying on AWS-managed Kubernetes infrastructure.
This guide describes how to install Data Catalog on AWS EKS using Helmfile, configure ingress with ALB, NGINX, or Istio, and validate the deployment. After completing the installation, Data Catalog will be available through a secure DNS endpoint in your AWS environment.
Prerequisites
Before you begin the installation, ensure you have the following tools installed and configured, and the necessary permissions granted:
Cluster and tools
An existing AWS EKS cluster (Kubernetes version 1.27 or later).
AWS CLI installed and configured.
kubectl installed and pointing to the EKS cluster.
Helm and Helmfile installed.
Okta CLI configured with AWS authentication (if using Okta-AWS integration).
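The tool prerequisites above can be checked before you start. The following is a minimal sketch; the helper name check_tools is hypothetical and not part of Data Catalog. It only confirms that each command is on your PATH, not that it is configured for your cluster.

```shell
# Report which of the given command-line tools are on PATH.
# Returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools aws kubectl helm helmfile
```

Running check_tools aws kubectl helm helmfile prints one line per tool and returns a non-zero status if any of them is missing.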
Infrastructure requirements
At least three worker nodes, each with 32 vCPU and 128 GB RAM (for a standard Pentaho Data Catalog (PDC) deployment).
EBS storage with provisioned IOPS for metadata repositories.
Multi-AZ node groups for high availability.
A Route 53 hosted zone with records pointing to the cluster ingress.
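As a rough check of the node requirements above, you can count the nodes and view their availability zones with kubectl. This sketch assumes kubectl already points at the EKS cluster; the function names are hypothetical, and the check covers only node count and zone spread, not vCPU, memory, or storage.

```shell
# Succeed only if the cluster has at least MIN nodes (default 3).
has_min_nodes() {
  min="${1:-3}"
  count=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  echo "nodes: $count (minimum: $min)"
  [ "$count" -ge "$min" ]
}

# Show each node together with its availability zone label.
show_node_zones() {
  kubectl get nodes --label-columns topology.kubernetes.io/zone
}

# Example: has_min_nodes 3 && show_node_zones
```

If the zone column shows a single zone for every node, the node groups are not spread across multiple availability zones.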
IAM permissions
Permissions to deploy workloads to the cluster.
Roles for ALB ingress controller and IRSA (IAM Roles for Service Accounts).
Policies for S3, Redshift, and other AWS services that PDC will access.
Network
Open ports 443 (HTTPS), 9200 (OpenSearch), and 5432 (PostgreSQL BIDB).
Connectivity from PDC pods to your licensing server.
Secrets management
Store credentials (SMTP, Jira, Tableau PATs, MLflow) in Kubernetes secrets or AWS Secrets Manager. Avoid embedding plaintext values in custom-values.yaml.
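One way to keep a credential out of files and shell history is to pipe it into kubectl when creating the secret. The sketch below is illustrative only: the function name, namespace, secret name, and key are all hypothetical, and the secret names that the Data Catalog chart actually expects must come from your custom-values.yaml.

```shell
# Create a Kubernetes secret whose value is read from standard input,
# so the credential is not written to a file or typed on the command line.
create_secret_from_stdin() {
  # $1 = namespace, $2 = secret name, $3 = key within the secret
  kubectl create secret generic "$2" \
    --namespace "$1" \
    --from-file="$3=/dev/stdin"
}

# Example (all names are hypothetical):
#   printf '%s' "$SMTP_PASSWORD" | create_secret_from_stdin pdc pdc-smtp password
```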
Procedure
Perform the following steps to authenticate, configure, and deploy the Data Catalog application:
Run the Okta-AWS authentication command to assume the appropriate AWS role. Replace <profile-name> with the name of your Okta-AWS configuration profile.
Verify your authentication by confirming that you can list the EKS clusters in your account.
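The verification step can be sketched with the AWS CLI as follows. It assumes the Okta-AWS login has already placed valid credentials in your environment or active profile; the function name is hypothetical, and the region is passed as an argument.

```shell
# Print the ARN of the assumed identity, then list the EKS clusters
# visible in the given region.
verify_aws_access() {
  # $1 = AWS region
  aws sts get-caller-identity --query Arn --output text &&
    aws eks list-clusters --region "$1" --output text
}

# Example: verify_aws_access <aws-region>
```

If the first command fails, the session has not assumed a role; if only the second fails, the role likely lacks permission to list EKS clusters.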
Update your local kubeconfig file. The update command retrieves the cluster configuration and sets it as your default context. Replace <cluster-name>, <cluster-name-alias>, and <aws-region> with your specific cluster details.
Test the cluster connectivity by listing the nodes.
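These two steps map onto the standard AWS CLI and kubectl commands. The sketch below wraps them in a hypothetical helper function; the three arguments correspond to the <cluster-name>, <cluster-name-alias>, and <aws-region> placeholders.

```shell
# Write the cluster's connection details into the local kubeconfig,
# then list the nodes to confirm that the cluster is reachable.
connect_to_cluster() {
  # $1 = cluster name, $2 = context alias, $3 = AWS region
  aws eks update-kubeconfig --name "$1" --alias "$2" --region "$3" &&
    kubectl get nodes
}

# Example: connect_to_cluster <cluster-name> <cluster-name-alias> <aws-region>
```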
Download the Helm artifact from Hitachi's JFrog repository. In the download command, replace <jfrog_username>, <jfrog_token>, and [build_number] with your credentials and the specific build number.
Extract the artifact. Unpack the downloaded .tgz file into your desired directory, for example /opt.
Based on your ingress controller, copy one of the example configuration files to create your custom-values.yaml.
For AWS Application Load Balancer (ALB): https://github.com/pentaho/pdc-docker-deployment/blob/development/k8s/conf/default/example-eks-alb.custom-values.yaml
For NGINX Ingress Controller: https://github.com/pentaho/pdc-docker-deployment/blob/development/k8s/conf/default/example-eks-nginx.custom-values.yaml
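The extract and copy steps can be sketched as shown here. The function names are hypothetical, and the sketch assumes the extracted artifact contains a conf/default directory holding the example files named in the links above; adjust the paths if your build is laid out differently. The download itself is not shown because the repository path is specific to your build.

```shell
# Unpack the downloaded chart archive into a target directory.
extract_chart() {
  # $1 = path to the downloaded .tgz file, $2 = target directory
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Copy the example values file for the chosen ingress type to
# conf/default/custom-values.yaml without overwriting an existing file.
init_custom_values() {
  # $1 = directory that contains conf/default, $2 = ingress type: alb or nginx
  src="$1/conf/default/example-eks-$2.custom-values.yaml"
  dst="$1/conf/default/custom-values.yaml"
  if [ -e "$dst" ]; then
    echo "refusing to overwrite $dst" >&2
    return 1
  fi
  cp "$src" "$dst"
}
```

The overwrite guard is a design choice for this sketch, so that rerunning the step does not discard values you have already customized.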
Open conf/default/custom-values.yaml and customize the configuration properties to match your environment:
applicationFqdn: Fully qualified domain name for your PDC instance.
imageRegistryName: Your container registry.
licenseServerUrl: Your licensing server endpoint.
Secrets for SMTP, Jira, Tableau, and MLflow, as Kubernetes secret references.
Ingress annotations, including the TLS certificate ARN for ALB.
Pay close attention to the following required ingress settings:
The ingress annotation alb.ingress.kubernetes.io/certificate-arn.
The host in the hosts section, which must be updated with your clusterFqdn.
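Before deploying, a quick text search can confirm that the properties listed above appear in your values file. This is a minimal sketch with a hypothetical function name; it only checks that each name is present somewhere in the file, not that its value is correct, and the certificate-arn check applies to ALB ingress only.

```shell
# Check that the expected property names appear in the values file.
# Returns non-zero if any of them is not found.
check_custom_values() {
  # $1 = path to custom-values.yaml
  status=0
  for key in applicationFqdn imageRegistryName licenseServerUrl \
             alb.ingress.kubernetes.io/certificate-arn; do
    if grep -qF -- "$key" "$1"; then
      echo "ok: $key"
    else
      echo "not found: $key"
      status=1
    fi
  done
  return "$status"
}

# Example: check_custom_values conf/default/custom-values.yaml
```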
Run the helmfile sync command to apply the configurations and deploy Data Catalog into your custom namespace.
Note: If you encounter a "context deadline exceeded" error, increase the timeout by prefixing the helmfile sync command with a longer timeout setting.
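A deployment run might look like the following sketch. It assumes you run it from the directory that contains the helmfile and that the namespace is selected with Helmfile's global --namespace flag; the function name is hypothetical, and the timeout prefix mentioned in the note is not shown because its exact form depends on the package.

```shell
# Apply the releases defined in the helmfile to the given namespace,
# then list the Helm releases that now exist there.
deploy_data_catalog() {
  # $1 = target namespace
  helmfile --namespace "$1" sync &&
    helm list --namespace "$1"
}

# Example: deploy_data_catalog <namespace>
```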
Results
After the helmfile sync command completes successfully, the Data Catalog application is installed and running in your EKS cluster.
You can access the application at the following URL: https://pdc-client.<clusterFqdn>
Post-installation validation
Log in with the default administrator credentials provided in the installation package.
Confirm that the license is applied and services such as proxy, um-admin-api, fe-workers, and job-server are healthy.
Ingest a small data source to verify that scanning and profiling are working.
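The service health check above can be approximated from the command line by listing pods that are neither Running nor Completed. This is a sketch with a hypothetical function name, and it assumes the services run as pods in your Data Catalog namespace; it reads only the STATUS column, so a pod that is Running but not yet ready is not flagged.

```shell
# Print the name and status of every pod that is not Running or Completed.
# Returns non-zero if at least one such pod exists.
list_unhealthy_pods() {
  # $1 = namespace
  kubectl get pods --namespace "$1" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
         END { exit bad + 0 }'
}

# Example: list_unhealthy_pods <namespace>
```

An empty result with a zero exit status suggests that pods such as proxy, um-admin-api, fe-workers, and job-server have started; the license check and the test ingestion still need to be done in the application.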

