Backup and restore in Amazon EKS

In Pentaho Data Catalog deployments running on Amazon Elastic Kubernetes Service (EKS), administrators can configure and manage backups to protect critical system data and metadata. The backup and restore framework helps ensure business continuity by enabling recovery of Data Catalog components, such as PostgreSQL, MongoDB, OpenSearch, FE-Workers, and Kubernetes objects.

Data Catalog supports multiple storage options for backup data:

  • Amazon Simple Storage Service (S3) for scalable, cloud-based backups.

  • Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS) for persistent storage within the Amazon EKS cluster.

This section includes detailed procedures to configure backups, run and verify backups, verify retention, and restore data from backups.

Backup and restore operations must be performed by administrators with access to the EKS cluster and the configured storage backend.

Configure a backup in Amazon EKS

In Data Catalog deployments running on Amazon EKS, administrators can configure automated or manual backups for key Data Catalog components. The configuration specifies which services to back up, how often backups run, and where backup data is stored. You can store backups in Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (Amazon EBS), or Amazon Elastic File System (Amazon EFS).

Data Catalog supports multiple storage configurations that let you choose how backups are created and managed. Depending on your environment, you can either use an existing PersistentVolumeClaim (PVC) or let Helm automatically create and manage the PVC during deployment. After setup, backups run automatically through a CronJob in Amazon EKS or can be triggered manually when needed. Retention policies, backup frequency, and storage locations are defined in the Helm configuration.

If your Data Catalog deployment uses an external PostgreSQL database such as Amazon Aurora, Data Catalog doesn’t back up that external database. In this case, set the postgres.enabled parameter to false in the backup configuration, and manage the external database backup separately.

Configure a backup using Amazon S3 with the existing PVC

In Data Catalog deployments running on Amazon EKS, administrators can store backup data in Amazon S3 using a pre-existing PersistentVolumeClaim (PVC). This configuration allows you to use an existing PVC that is already linked to an S3 bucket through the Amazon S3 Container Storage Interface (CSI) driver. By referencing this PVC in the backup configuration, Data Catalog writes backup data directly to the configured S3 bucket.

When using an existing PVC for S3 storage, ensure that the PVC and its associated StorageClass are correctly configured with the AWS S3 CSI driver and the target S3 bucket.

Perform the following steps to configure a backup using Amazon S3 with the existing PVC:

Before you begin

  • Verify that the Amazon S3 CSI driver is installed in your Amazon EKS cluster.

  • Ensure that an S3 bucket is available for storing backup data.

  • Confirm that the PersistentVolumeClaim (PVC) for S3 is pre-created and bound to the S3 StorageClass.

  • Verify that the PDC namespace and Helm deployment are accessible.

  • Ensure that worker nodes have the required IAM permissions to access the S3 bucket.

  • Locate the custom-values.yaml file used for your PDC Helm deployment.

Procedure

  1. Open the custom-values.yaml file for your PDC deployment in a text editor.

  2. Add or update the following backup configuration block:
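    A minimal sketch of the block (key names can vary between chart versions, so check your chart's values reference; the schedule, retention, and size values are examples, and existingClaim assumes the chart accepts a pre-created PVC by name):

      backup:
        enabled: true
        schedule: "0 0 * * *"               # default: daily at midnight
        retention:
          days: 7                           # example retention period
        persistence:
          existingClaim: s3-pdc-backup-pvc  # pre-created PVC bound to the S3-backed PV
          size: 50Gi                        # must match the PV and PVC size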

  3. Save the configuration file.

  4. Apply the configuration to the Amazon EKS cluster.
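    For example, if you deploy Data Catalog with Helm (the release name and chart reference are placeholders for your environment):

      helm upgrade --install pdc <CHART_REFERENCE> -n <PDC_NAMESPACE> -f custom-values.yaml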

    or
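    if you deploy with Helmfile instead:

      helmfile apply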

  5. Verify that the backup CronJob is created in the EKS cluster.
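    For example:

      kubectl get cronjobs -n <PDC_NAMESPACE>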

  6. Review the CronJob details to confirm the schedule, storage configuration, and PVC reference.
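    For example:

      kubectl describe cronjob pdc-backup -n <PDC_NAMESPACE>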

    The CronJob specification should reference the PVC name s3-pdc-backup-pvc.

Example: S3 StorageClass and PersistentVolume

The underlying PV must include the S3 specifications, such as the bucket name and AWS Region, and the PV and PVC sizes must match the backup.persistence.size value.
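A sketch of a static configuration for the Mountpoint for Amazon S3 CSI driver (the names, sizes, and mount options are examples and must be adapted to your environment):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: s3-sc
  provisioner: s3.csi.aws.com
  ---
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: s3-pdc-backup-pv
  spec:
    capacity:
      storage: 50Gi                        # must match backup.persistence.size
    accessModes:
      - ReadWriteMany
    storageClassName: s3-sc
    mountOptions:
      - region <AWS_REGION>
    csi:
      driver: s3.csi.aws.com
      volumeHandle: pdc-backup-s3-volume   # any unique value
      volumeAttributes:
        bucketName: <BUCKET_NAME>
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: s3-pdc-backup-pvc
    namespace: <PDC_NAMESPACE>
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: s3-sc
    resources:
      requests:
        storage: 50Gi                      # must match the PV capacity
    volumeName: s3-pdc-backup-pv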

The PVC name s3-pdc-backup-pvc must match the value specified in the backup configuration block.

Result

Data Catalog is configured to store backups in Amazon S3 using the existing PVC. The backup CronJob runs automatically according to the configured schedule and writes backup files directly to the S3 bucket linked with the PVC.

Configure a backup using Amazon S3 with the Helm-managed PVC

In Data Catalog deployments running on Amazon EKS, administrators can configure backups to use Amazon S3 through a Helm-managed PersistentVolumeClaim (PVC). In this configuration, the Data Catalog Helm chart automatically creates the PVC and connects it to the S3 bucket using the Amazon S3 Container Storage Interface (CSI) driver. This method simplifies setup because the PVC does not need to be created manually before deployment.

The Amazon S3 CSI driver must be installed in the EKS cluster, and the specified StorageClass must be compatible with the S3 driver.

Perform the following steps to configure a backup using Amazon S3 with the Helm-managed PVC:

Before you begin

  • Verify that the Amazon S3 CSI driver is installed in the Amazon EKS cluster.

  • Ensure that an S3 bucket is available and accessible to the EKS worker nodes.

  • Confirm that Helm 3.0 or later and kubectl are installed.

  • Verify that the PDC namespace is accessible.

  • Identify or create a StorageClass compatible with S3.

  • Confirm that the custom-values.yaml file for your Helm deployment is available for editing.

Procedure

  1. Open the custom-values.yaml file used for your PDC Helm deployment.

  2. Add or update the following backup configuration block:
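    A minimal sketch of the block (key names can vary between chart versions; storageClass and volumeName must refer to an existing StorageClass and PersistentVolume):

      backup:
        enabled: true
        schedule: "0 0 * * *"             # default: daily at midnight
        retention:
          days: 7                         # example retention period
        persistence:
          storageClass: s3-sc             # existing StorageClass for the S3 CSI driver
          volumeName: s3-pdc-backup-pv    # existing PV that points to the S3 bucket
          size: 50Gi                      # must match the PV size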

    In this case, if you want the PVC to be created by Helmfile, the storageClass and volumeName must already exist and must be specified in the configuration, as shown above.

  3. Save the configuration file.

  4. Apply the configuration to the Amazon EKS cluster.
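    For example, if you deploy Data Catalog with Helm (the release name and chart reference are placeholders for your environment):

      helm upgrade --install pdc <CHART_REFERENCE> -n <PDC_NAMESPACE> -f custom-values.yaml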

    or
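    if you deploy with Helmfile instead:

      helmfile apply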

  5. Verify that the backup CronJob is created successfully.

  6. Review the CronJob details to confirm that the schedule and the storageClass reference match your configuration.

  7. Verify that the Helm deployment automatically created the backup PVC.
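    For example:

      kubectl get pvc -n <PDC_NAMESPACE>

    The Helm-created PVC should be listed with a Bound status.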

Example: S3 StorageClass and PersistentVolume

The underlying PV must include the S3 specifications, such as the bucket name and AWS Region, and the PV and PVC sizes must match the backup.persistence.size value.
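The StorageClass and PersistentVolume follow the same pattern as in the existing-PVC configuration; a condensed sketch (the values are examples):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: s3-sc
  provisioner: s3.csi.aws.com
  ---
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: s3-pdc-backup-pv
  spec:
    capacity:
      storage: 50Gi                        # must match backup.persistence.size
    accessModes:
      - ReadWriteMany
    storageClassName: s3-sc
    mountOptions:
      - region <AWS_REGION>
    csi:
      driver: s3.csi.aws.com
      volumeHandle: pdc-backup-s3-volume   # any unique value
      volumeAttributes:
        bucketName: <BUCKET_NAME>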

The Helm deployment automatically creates the PVC using the storageClass and volumeName values defined in the backup configuration block. The volumeName must match the existing PersistentVolume that points to the S3 bucket.

Result

The PDC backup configuration is updated to use Amazon S3 with a Helm-managed PVC. When the backup CronJob runs, it automatically mounts the PVC and stores all backup files directly in the configured S3 bucket.

Configure a backup using Amazon EBS or EFS with the existing PVC

In Data Catalog deployments running on Amazon EKS, administrators can configure backups to use Amazon EBS or Amazon EFS through an existing PersistentVolumeClaim (PVC). This configuration allows you to use a pre-created PVC that points to an EBS or EFS volume already available in your Amazon EKS cluster. The PDC backup process writes all backup data to this PVC, which is mounted as persistent storage within the cluster.

When using an existing PVC, ensure the PVC and its associated StorageClass are configured properly and have sufficient capacity to store the backup files.

Perform the following steps to configure a backup using Amazon EBS or EFS with the existing PVC:

Before you begin

  • Verify that the EBS or EFS StorageClass is configured in your Amazon EKS cluster.

  • Ensure that a PersistentVolumeClaim (PVC) is pre-created and bound to the desired EBS or EFS volume.

  • Confirm that the PDC namespace and Helm deployment are accessible.

  • Ensure that you have Helm 3.0 or later and kubectl installed.

  • Locate the custom-values.yaml file used for your PDC Helm deployment.

Procedure

  1. Open the custom-values.yaml file for your PDC deployment in a text editor.

  2. Add or update the following backup configuration block:
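    A minimal sketch of the block (key names can vary between chart versions; existingClaim assumes the chart accepts a pre-created PVC by name):

      backup:
        enabled: true
        schedule: "0 0 * * *"            # default: daily at midnight
        retention:
          days: 7                        # example retention period
        persistence:
          existingClaim: pdc-backup-pvc  # pre-created PVC bound to the EBS or EFS volume
          size: 50Gi                     # must match the PVC size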

    In this case, if you use your own PVC, the name of the PVC must be specified in the configuration, as shown above.

  3. Save the configuration file.

  4. Apply the configuration to the Amazon EKS cluster.
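    For example, if you deploy Data Catalog with Helm (the release name and chart reference are placeholders for your environment):

      helm upgrade --install pdc <CHART_REFERENCE> -n <PDC_NAMESPACE> -f custom-values.yaml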

    or
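    if you deploy with Helmfile instead:

      helmfile apply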

  5. Verify that the backup CronJob is created in the EKS cluster.

  6. Review the CronJob details to confirm the schedule, PVC reference, and component backup targets.

    The CronJob should reference the existing PVC pdc-backup-pvc.

  7. Verify that the PVC is correctly mounted and available in the cluster.
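    For example:

      kubectl get pvc pdc-backup-pvc -n <PDC_NAMESPACE>

    The PVC should be listed with a Bound status.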

Example: EBS or EFS PersistentVolume and PVC
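A sketch using the in-tree awsElasticBlockStore volume type (the names and sizes are examples):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: pdc-backup-pv
  spec:
    capacity:
      storage: 50Gi                 # must match backup.persistence.size
    accessModes:
      - ReadWriteOnce
    awsElasticBlockStore:
      volumeID: <EBS_VOLUME_ID>
      fsType: ext4
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: pdc-backup-pvc
    namespace: <PDC_NAMESPACE>
  spec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi               # must match the PV capacity
    volumeName: pdc-backup-pv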

If using Amazon EFS, replace the awsElasticBlockStore section with an efs.csi.aws.com driver configuration.

Result

The PDC backup configuration is updated to use Amazon EBS or Amazon EFS storage through the specified existing PVC. When the backup CronJob runs, it stores all backup files on the mounted persistent volume, enabling quick recovery from local cluster storage.

Configure a backup using Amazon EBS or EFS with Helm-managed PVC

In Data Catalog deployments running on Amazon EKS, administrators can configure backups to use Amazon EBS or Amazon EFS through a Helm-managed PersistentVolumeClaim (PVC). In this configuration, the Helm deployment automatically creates and manages the PVC based on the provided StorageClass configuration. This approach is recommended when administrators prefer automated storage management and do not want to manually create PVCs before deployment.

Ensure that the StorageClass used for EBS or EFS is available and properly configured in your Amazon EKS cluster before enabling Helm-managed PVC creation.

Perform the following steps to configure a backup using Amazon EBS or EFS with Helm-managed PVC:

Before you begin

  • Verify that the EBS or EFS StorageClass is configured in your Amazon EKS cluster.

  • Confirm that Helm 3.0 or later and kubectl are installed.

  • Ensure that the PDC namespace and Helm deployment are accessible.

  • Verify that the custom-values.yaml file used for the PDC Helm deployment is available.

  • Ensure that the EBS volume or EFS mount target is accessible from the cluster nodes.

Procedure

  1. Open the custom-values.yaml file used for your PDC Helm deployment.

  2. Add or update the following backup configuration block:
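    A minimal sketch of the block (key names can vary between chart versions):

      backup:
        enabled: true
        schedule: "0 0 * * *"      # default: daily at midnight
        retention:
          days: 7                  # example retention period
        persistence:
          storageClass: ebs-sc     # existing EBS or EFS StorageClass
          volumeName: ""           # optional; leave empty to let Helm assign one
          size: 50Gi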

    In this case, if you want the PVC to be created by Helmfile, the storageClass must already exist and must be specified in the configuration, as shown above. The volumeName field is optional and can be left empty if you want Helm to automatically assign one.

  3. Save the configuration file.

  4. Apply the configuration to the Amazon EKS cluster.
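    For example, if you deploy Data Catalog with Helm (the release name and chart reference are placeholders for your environment):

      helm upgrade --install pdc <CHART_REFERENCE> -n <PDC_NAMESPACE> -f custom-values.yaml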

    or
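    if you deploy with Helmfile instead:

      helmfile apply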

  5. Verify that the backup CronJob is created successfully.

  6. Review the CronJob details to confirm that the schedule, StorageClass, and volume configuration are correctly referenced.

  7. Verify that the Helm deployment automatically created the backup PVC.

Example: EBS or EFS StorageClass and PersistentVolume
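A sketch of an EBS StorageClass for the AWS EBS CSI driver (the volume type is an example):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: ebs-sc
  provisioner: ebs.csi.aws.com
  volumeBindingMode: WaitForFirstConsumer
  parameters:
    type: gp3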

Replace ebs-sc with your EFS StorageClass if you are using Amazon EFS (efs.csi.aws.com). The volumeName in the backup configuration can be left blank if Helm should generate it automatically.

Result

The PDC backup configuration is updated to use Amazon EBS or Amazon EFS with a Helm-managed PVC. When the backup CronJob runs, it automatically mounts the newly created PVC and stores backup data on the corresponding EBS or EFS volume.

Configure backup targets

In Data Catalog deployments running on Amazon EKS, administrators can control which PDC components are included in each backup. Backup targets represent the core services and configuration objects that store catalog metadata, application settings, and operational data.

Each backup target corresponds to a specific PDC service or metadata store. You can include or exclude services as needed and optionally define individual Kubernetes objects.

The available backup targets are:

  • PostgreSQL: Stores configuration and metadata for user management, settings, and workflows.

  • MongoDB: Stores data asset, profiling, and relationship metadata collected from source systems.

  • OpenSearch: Stores indexed metadata used for catalog search, glossary, and lineage visualization.

  • FE-Workers: Stores dictionaries, patterns, and system-defined data used for data profiling and discovery.

  • Objects: Stores Kubernetes objects such as Secrets and ConfigMaps used by PDC services. You can define these objects by specifying the kind (for example, secret, configmap) and name (for example, cat-key).

You can define these backup targets in the Helm configuration to enable or disable backups for specific components at deployment time. This flexibility allows administrators to back up only the required services, exclude external databases, or include custom Kubernetes objects that need to be preserved during recovery.

Perform the following steps to configure backup targets:

Before you begin

  • Verify that you have access to the PDC Helm deployment and the custom-values.yaml file.

  • Confirm that Helm 3.0 or later and kubectl are installed on the administrator workstation.

  • Ensure that the backup configuration for your selected storage type (Amazon S3, EBS, or EFS) is already defined.

  • Identify which components and objects you want to include in the backup.

Procedure

  1. Open the custom-values.yaml file for your PDC deployment in a text editor.

  2. Locate the backup configuration block under the pdc-backup section.

  3. Define the backup targets by setting the enabled parameter to true or false for each service:
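    A sketch of the targets inside the backup configuration block (the key names, particularly for FE-Workers and objects, can vary between chart versions):

      backup:
        postgres:
          enabled: true      # set to false for external databases such as Amazon Aurora
        mongodb:
          enabled: true
        opensearch:
          enabled: true
        fe-workers:
          enabled: true
        objects:
          enabled: true
          object:
            - kind: secret
              name: cat-key
            - kind: configmap
              name: pdc-settings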

    Note:

    • You can list multiple Kubernetes objects under the object section. Common examples include:

      • kind: secret, name: cat-key

      • kind: configmap, name: pdc-settings

      • kind: secret, name: pdc-license

      • kind: configmap, name: jobserver-config

    • Enable FE-Workers and Objects backup only if these components or resources are part of your recovery plan. For external databases such as Amazon Aurora PostgreSQL, set postgres.enabled to false and manage backups externally.

  4. Save the configuration file.

  5. Apply the configuration to the Amazon EKS cluster.
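    For example, if you deploy Data Catalog with Helm (the release name and chart reference are placeholders for your environment):

      helm upgrade --install pdc <CHART_REFERENCE> -n <PDC_NAMESPACE> -f custom-values.yaml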

    or
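    if you deploy with Helmfile instead:

      helmfile apply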

  6. Verify that the backup CronJob includes the selected targets.
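    For example:

      kubectl describe cronjob pdc-backup -n <PDC_NAMESPACE>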

    The job definition lists only the enabled components and specified objects as backup targets.

Result

Backup targets are configured successfully. When the backup CronJob runs, it includes only the enabled components and any defined Kubernetes objects, and stores their backups in the configured storage location.

Run a backup in Amazon EKS

In Data Catalog deployments running on Amazon EKS, administrators can perform both automated and manual backups of key Data Catalog components. Each backup captures data and configuration from PostgreSQL, MongoDB, OpenSearch, FE-Workers, and related Kubernetes objects.

After you apply the backup configuration, a CronJob is automatically created in the Amazon EKS cluster. The CronJob runs daily at midnight by default. You can also trigger a manual backup at any time, for example, before performing an upgrade or configuration change.

If your deployment uses an external PostgreSQL database such as Amazon Aurora, Data Catalog doesn’t back up that database. Set the postgres.enabled parameter to false in the custom-values.yaml configuration file.

Perform the following steps to run a backup in Amazon EKS:

Before you begin

Before you run a backup, make sure the following requirements are met:

  • The Data Catalog backup CronJob is configured in the Amazon EKS cluster.

  • kubectl and Helm are installed and configured to access the cluster.

  • You have administrator access to the PDC namespace.

  • The configured storage backend (Amazon S3 storage, Amazon EBS volumes, or Amazon EFS file systems) is accessible from the cluster.

Procedure

  1. Verify that the backup CronJob exists in the PDC namespace.
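    For example:

      kubectl get cronjobs -n <PDC_NAMESPACE>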

    The CronJob named pdc-backup should be listed.

  2. Check the CronJob schedule.

    The default schedule is 0 0 * * *, which runs daily at midnight.

  3. Trigger a manual backup when needed.
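    For example, create a one-off job from the CronJob (the job name is arbitrary):

      kubectl create job pdc-backup-manual --from=cronjob/pdc-backup -n <PDC_NAMESPACE>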

  4. View all backup jobs in the PDC namespace.
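    For example:

      kubectl get jobs -n <PDC_NAMESPACE>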

  5. View backup logs for each component.
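    For example (the job name comes from the output of the previous step):

      kubectl logs job/<BACKUP_JOB_NAME> -n <PDC_NAMESPACE> --all-containers=true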

    Each log confirms whether the backup completed successfully for that component.

  6. Verify backup files in Amazon S3 storage.
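    For example (the prefix layout depends on your configuration):

      aws s3 ls s3://<BUCKET_NAME>/ --recursive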

    The command lists all backup folders organized by service and timestamp.

  7. Create a temporary pod to verify backup files in Amazon EBS volumes or Amazon EFS file systems. Save the following YAML as backup-checker.yaml:
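    A minimal sketch of such a pod (the image is an example utility image from your registry, and claimName must match the PVC used for backups):

      apiVersion: v1
      kind: Pod
      metadata:
        name: backup-checker
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: checker
            image: $<customer-artifactory>/pdc-toolbox:debian-12  # any image with a shell
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups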

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  8. Apply the pod specification.
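    For example:

      kubectl apply -f backup-checker.yaml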

  9. List backup files inside the pod.
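    For example:

      kubectl exec -it backup-checker -n <PDC_NAMESPACE> -- ls -R /backups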

    The command lists all backup folders by component and timestamp.

  10. Delete the temporary pod after verification.
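    For example:

      kubectl delete pod backup-checker -n <PDC_NAMESPACE>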

Result

The backup job completes successfully and stores the data in the configured Amazon S3 bucket or Amazon EBS or Amazon EFS persistent volume. The CronJob continues to run automatically according to the defined schedule. Container logs confirm that all components were backed up successfully.

Verify backups in Amazon EKS

In Data Catalog deployments running on Amazon EKS, administrators can verify that backup jobs are running successfully and that backup files are stored correctly in the configured storage backend. Verifying backups ensures that the scheduled or manual backup operations complete without errors and that data for all Data Catalog components is available for recovery when needed.

Data Catalog supports multiple storage options for backup data. The verification steps differ depending on the storage backend used in your deployment:

  • Amazon S3 storage: Backups are written to an S3 bucket, and verification is performed by inspecting the bucket contents and checking job logs. For more information, see Verify backups in Amazon S3 storage.

  • Amazon EBS volumes or Amazon EFS file systems: Backups are written directly to a persistent volume claim (PVC) mounted in the EKS cluster, and verification involves inspecting files stored inside the PVC. For more information, see Verify backups in Amazon EBS volumes or Amazon EFS file systems.

Verify backups in Amazon S3 storage

In Data Catalog deployments running on Amazon EKS with Amazon S3 as the backup storage, administrators can verify that backups are successfully created and stored in the configured S3 bucket. Verification ensures that the pdc-backup CronJob is running correctly, that each backup job completes successfully, and that the backup data for all Data Catalog components is available in S3.

Perform the following steps to verify the backups in Amazon S3 storage:

Before you begin

Make sure the following requirements are met:

  • Data Catalog backups are configured to use Amazon S3 in the Helm configuration file.

  • kubectl and AWS CLI are installed and configured.

  • The AWS credentials or IAM role attached to the Amazon EKS worker nodes provide access to the Amazon S3 bucket.

  • You have the Amazon S3 bucket name used for storing Data Catalog backups.

  • You have administrator access to the PDC namespace in the Amazon EKS cluster.

Procedure

  1. Check that the backup CronJob exists in the PDC namespace.

    The pdc-backup CronJob should appear in the list.

  2. Verify that the most recent backup job completed successfully.

    The Completed status indicates that the backup job finished without errors.

  3. Check the logs of each backup container to confirm completion.

    Each container log should display a “Backup completed successfully” message for its corresponding component.

  4. Verify that new backup folders are created in the S3 bucket.
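    For example (the prefix layout depends on your configuration):

      aws s3 ls s3://<BUCKET_NAME>/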

    The command lists backup folders grouped by component and timestamp. Confirm that the latest timestamp corresponds to the last backup job run.

  5. Drill down into a component folder to verify detailed backup files.
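    For example (the prefix is illustrative):

      aws s3 ls s3://<BUCKET_NAME>/postgres/<TIMESTAMP>/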

    Each directory should contain files such as .pgdump, .tar.gz, or .yaml representing backed-up data.

  6. Verify that backup timestamps in S3 align with the CronJob schedule. For example, if the schedule is set to midnight (0 0 * * *), confirm that new backup folders appear daily at approximately that time.

  7. Optionally, download and inspect one backup file to confirm data integrity.
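    For example:

      aws s3 cp s3://<BUCKET_NAME>/postgres/<TIMESTAMP>/postgres_full_<TIMESTAMP>.pgdump .
      ls -lh postgres_full_<TIMESTAMP>.pgdump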

    The file size and timestamp confirm that the dump file was generated during the latest backup run.

Result

The backups are verified successfully in Amazon S3 storage. Each Data Catalog component’s data is available in the S3 bucket, and the folder structure reflects the latest backup job timestamp. The CronJob and job logs confirm that all backup operations completed without errors.

Verify backups in Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can verify backups stored on Amazon EBS or Amazon EFS volumes. These backups are written directly to a persistent volume claim (PVC) mounted in the EKS cluster. Verification ensures that backup jobs run successfully, that backup files are created in the /backups directory of the PVC, and that each Data Catalog component is included in the backup.

Perform the following steps to verify backups in Amazon EBS volumes or Amazon EFS file systems:

Before you begin

Make sure the following requirements are met:

  • Backups are configured to use Amazon EBS or Amazon EFS in the Helm configuration file.

  • The Data Catalog backup CronJob is running in the Amazon EKS cluster.

  • kubectl is installed and configured to access the Amazon EKS cluster.

  • You have administrator access to the PDC namespace.

  • You have the PersistentVolumeClaim (PVC) name used for storing backups.

Procedure

  1. Check that the pdc-backup CronJob exists in the PDC namespace.

    The CronJob named pdc-backup should appear in the list.

  2. Verify that the most recent backup job completed successfully.

    The Completed status confirms that the backup job ran without errors.

  3. Review the logs for each backup container to confirm successful completion.

    Each log should confirm that the backup completed successfully for that component.

  4. Create a temporary verification pod to inspect backup files in the PVC. Save the following YAML file as backup-verifier.yaml.
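    A minimal sketch of such a pod (the image is an example utility image from your registry, and claimName must match the PVC used for backups):

      apiVersion: v1
      kind: Pod
      metadata:
        name: backup-verifier
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: verifier
            image: $<customer-artifactory>/pdc-toolbox:debian-12  # any image with a shell
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups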

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  5. Apply the pod specification to the EKS cluster.

  6. Connect to the verification pod.
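    For example:

      kubectl exec -it backup-verifier -n <PDC_NAMESPACE> -- /bin/bash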

  7. List the backup folders stored in the mounted PVC.
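    For example, from inside the pod:

      ls -lR /backups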

    Backup directories should be organized by timestamp and contain subfolders for PostgreSQL, MongoDB, OpenSearch, FE-Workers, and Kubernetes objects.

  8. Verify that backup folders are updated according to the CronJob schedule. Confirm that a new folder exists for each backup cycle (for example, daily if the schedule is 0 0 * * *).

  9. Exit the pod session after verification.

  10. Delete the temporary verification pod.

Result

The backup files are verified successfully in the Amazon EBS or Amazon EFS persistent volume used for Data Catalog backups. Backup folders for each Data Catalog component are available under the /backups directory, organized by timestamp. The job status and logs confirm that the backup CronJob is running successfully in the EKS cluster.

Verify retention in Amazon EKS

In Data Catalog deployments running on Amazon EKS, administrators can verify that backup retention policies are working correctly. Retention ensures that older backups are automatically deleted or archived based on the configured duration, preventing unnecessary storage consumption and maintaining compliance with data governance requirements.

Retention behavior depends on the type of storage used for backups:

  • Amazon EBS volumes or Amazon EFS file systems: Retention is managed through the Data Catalog configuration parameters defined in the custom-values.yaml file. The backup.retention.days setting specifies how long backups are retained before being automatically deleted.

  • Amazon S3: Retention is managed externally through AWS S3 lifecycle policies, which automatically delete or transition older backups according to the lifecycle rules defined in the bucket.
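For example, a lifecycle rule that expires backups after 30 days (the prefix and period are examples), saved as lifecycle.json and applied with the AWS CLI:

  {
    "Rules": [
      {
        "ID": "expire-pdc-backups",
        "Filter": { "Prefix": "backups/" },
        "Status": "Enabled",
        "Expiration": { "Days": 30 }
      }
    ]
  }

  aws s3api put-bucket-lifecycle-configuration --bucket <BUCKET_NAME> \
    --lifecycle-configuration file://lifecycle.json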

Restore data from backup in Amazon EKS

In Data Catalog deployments running on Amazon EKS, administrators can restore data and configurations from previously created backups. Restoring data helps recover Data Catalog components after system failures, data corruption, or configuration issues. PDC supports restoration from two storage types: Amazon S3 storage, and Amazon EBS volumes or Amazon EFS file systems.

Each Data Catalog component (PostgreSQL, MongoDB, OpenSearch, FE-Workers, and Kubernetes objects) has its own restore procedure. Administrators can restore individual services or the complete Data Catalog environment, depending on the recovery requirement.

Restore from Amazon S3 storage

When backups are stored in Amazon S3, each Data Catalog component must be restored separately from the data in the Amazon S3 bucket. The following guides describe how to download backup files, connect to service pods, and restore data for each component.

Restore PostgreSQL data from Amazon S3

In Data Catalog deployments running on Amazon EKS, administrators can restore PostgreSQL data from backups stored in Amazon S3. PostgreSQL stores configuration and metadata for Data Catalog, so restoring it is a critical step in recovering the environment after data loss or system failure.

Before restoring PostgreSQL data, stop all Data Catalog services that connect to the database to avoid conflicts during restoration.

Perform the following steps to restore PostgreSQL data from Amazon S3 storage:

Before you begin

Make sure the following requirements are met:

  • The PostgreSQL backup is available in the Amazon S3 bucket.

  • AWS CLI and kubectl are installed and configured to access the Amazon EKS cluster.

  • You have the following information:

    • The Amazon S3 bucket name and the timestamp of the backup you want to restore.

    • The PostgreSQL pod name and PDC namespace.

    • The PostgreSQL username and password.

  • The PostgreSQL pod is in a Running state.

Procedure

  1. Download the PostgreSQL backup files from the S3 bucket.
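    For example (the paths are illustrative):

      aws s3 cp s3://<BUCKET_NAME>/postgres/<TIMESTAMP>/ <LOCAL_PATH>/ --recursive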

  2. Drop existing databases in PostgreSQL.
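    For example, from inside the PostgreSQL pod (repeat for each Data Catalog database):

      kubectl exec -it <POSTGRES_POD> -n <PDC_NAMESPACE> -- \
        psql -U <POSTGRES_USER> -c "DROP DATABASE IF EXISTS <DATABASE_NAME>;"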

  3. Restore the PostgreSQL database from the downloaded dump file.
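    For example, copy the dump into the pod and run pg_restore (the flags are a sketch; adjust them to how the dump was created):

      kubectl cp <LOCAL_PATH>/postgres_full_<TIMESTAMP>.pgdump <PDC_NAMESPACE>/<POSTGRES_POD>:/tmp/
      kubectl exec -it <POSTGRES_POD> -n <PDC_NAMESPACE> -- \
        pg_restore -U <POSTGRES_USER> -d postgres --clean --create \
        /tmp/postgres_full_<TIMESTAMP>.pgdump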

  4. Verify the restore by listing all databases.
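    For example:

      kubectl exec -it <POSTGRES_POD> -n <PDC_NAMESPACE> -- psql -U <POSTGRES_USER> -c "\l"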

Result

The PostgreSQL data is restored successfully from the backup stored in Amazon S3 storage. After the PostgreSQL service restarts, all related Data Catalog databases are available and ready for use.

Restore MongoDB data from Amazon S3

In Data Catalog deployments running on Amazon EKS, administrators can restore MongoDB data from backups stored in Amazon S3. MongoDB stores operational and user metadata for Data Catalog, so restoring it is an essential step in recovering a functional catalog environment.

Perform the following steps to restore MongoDB data from Amazon S3 storage:

Before you begin

Make sure the following requirements are met:

  • The MongoDB backup files are available in the Amazon S3 bucket.

  • AWS CLI and kubectl are installed and configured to access the Amazon EKS cluster.

  • You have the following information:

    • The Amazon S3 bucket name and timestamp of the backup you want to restore.

    • The MongoDB pod name and PDC namespace.

    • The MongoDB username and password.

  • The MongoDB pod is in the Running state.

    kubectl get pods -n <PDC_NAMESPACE> | grep mongo

Procedure

  1. Download the MongoDB backup files from the S3 bucket.

  2. Restore the MongoDB data to the cluster.
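    For example, copy the dump directory into the pod and run mongorestore:

      kubectl cp <LOCAL_PATH>/<TIMESTAMP> <PDC_NAMESPACE>/<MONGODB_POD>:/tmp/restore
      kubectl exec -it <MONGODB_POD> -n <PDC_NAMESPACE> -- \
        mongorestore --username <MONGODB_USER> --password <MONGODB_PASSWORD> \
        --authenticationDatabase admin --drop /tmp/restore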

  3. Verify the restore by listing databases.
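    For example (use mongo instead of mongosh on older images):

      kubectl exec -it <MONGODB_POD> -n <PDC_NAMESPACE> -- \
        mongosh -u <MONGODB_USER> -p <MONGODB_PASSWORD> --eval "show dbs"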

  4. After restoring from the existing backup, restart the licensing-api deployment so that the restored data takes effect.
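    For example:

      kubectl rollout restart deployment licensing-api -n <PDC_NAMESPACE>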

Result

The MongoDB data is restored successfully from the backup stored in Amazon S3. All operational and user metadata for PDC is available once the MongoDB service restarts and reconnects to the application.


Restore OpenSearch data from Amazon S3

In Data Catalog deployments running on Amazon EKS, administrators can restore OpenSearch data from backups stored in Amazon S3. OpenSearch stores indexed metadata used for search and discovery in PDC. Restoring OpenSearch ensures that catalog search results, entity references, and metadata associations are available after a recovery or redeployment.

Perform the following steps to import the data from Amazon S3 storage into the OpenSearch service running in the Amazon EKS cluster:

Before you restore, make sure curl and jq are installed.

Procedure

  1. Download the OpenSearch backup files from the S3 bucket.
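    For example (the paths are illustrative):

      aws s3 cp s3://<BUCKET_NAME>/opensearch/<TIMESTAMP>/ <LOCAL_PATH>/<TIMESTAMP>/ --recursive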

  2. Create an opensearch_restore.sh file with the following content. Replace the <LOCAL_PATH>/<TIMESTAMP> and <PDC_NAMESPACE> variables.
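    The following is a minimal sketch only; the script for your deployment may differ. It assumes the backup directory contains <index>_mapping.json files and <index>_data.json files in OpenSearch bulk format, and the OpenSearch service name is a placeholder:

      #!/bin/bash
      # Minimal sketch of an OpenSearch restore script.
      BACKUP_DIR="<LOCAL_PATH>/<TIMESTAMP>"
      NAMESPACE="<PDC_NAMESPACE>"

      # Reach the OpenSearch service through a temporary port-forward
      kubectl port-forward svc/<OPENSEARCH_SERVICE> 9200:9200 -n "$NAMESPACE" &
      PF_PID=$!
      sleep 5

      for mapping in "$BACKUP_DIR"/*_mapping.json; do
        index=$(basename "$mapping" _mapping.json)
        echo "Restoring index: $index"
        # Re-create the index with its saved mappings
        curl -s -X PUT "http://localhost:9200/$index" \
          -H 'Content-Type: application/json' \
          -d "$(jq '{mappings: .[].mappings}' "$mapping")"
        # Bulk-load the saved documents
        curl -s -X POST "http://localhost:9200/_bulk" \
          -H 'Content-Type: application/x-ndjson' \
          --data-binary "@$BACKUP_DIR/${index}_data.json"
      done

      kill $PF_PID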

  3. Assign executable permissions to the opensearch_restore.sh file.
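    For example:

      chmod +x opensearch_restore.sh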

  4. Execute the following script.
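      ./opensearch_restore.sh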

  5. Verify that all indexes are restored.
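    For example, through a port-forward to the OpenSearch service:

      kubectl port-forward svc/<OPENSEARCH_SERVICE> 9200:9200 -n <PDC_NAMESPACE> &
      curl -s "http://localhost:9200/_cat/indices?v"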

  6. Restart the OpenSearch deployment to apply the restored data.
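    For example (the deployment name is a placeholder):

      kubectl rollout restart deployment <OPENSEARCH_DEPLOYMENT> -n <PDC_NAMESPACE>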

Result

OpenSearch data is restored successfully from the backup stored in Amazon S3. All indexed metadata used for search and discovery in Data Catalog is available once the OpenSearch service restarts and completes indexing.

Restore FE-Workers data from Amazon S3

In Data Catalog deployments running on Amazon EKS, administrators can restore FE-Workers data from backups stored in Amazon S3 storage. The FE-Workers component stores system-defined data patterns, dictionaries, and processed datasets that are essential for profiling and data analysis within Data Catalog. Restoring FE-Workers ensures that these reference files are recovered and available for downstream data discovery and governance tasks.

Perform the following steps to restore FE-Workers data from Amazon S3 storage:

Before you begin

Make sure the following requirements are met:

  • The FE-Workers backup files are available in the Amazon S3 bucket.

  • AWS CLI and kubectl are installed and configured to access your Amazon EKS cluster.

  • You have the Amazon S3 bucket name and the timestamp of the backup information.

Procedure

  1. Download the FE-Workers backup files from the S3 bucket.

  2. Restore the FE-Workers data to the target pod.
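    For example, copy the archive into the FE-Workers pod and extract it (the pod name and data directory are placeholders):

      kubectl cp <LOCAL_PATH>/fe-worker-backup-<TIMESTAMP>.tar.gz <PDC_NAMESPACE>/<FE_WORKER_POD>:/tmp/
      kubectl exec -it <FE_WORKER_POD> -n <PDC_NAMESPACE> -- \
        tar -xzf /tmp/fe-worker-backup-<TIMESTAMP>.tar.gz -C <FE_WORKERS_DATA_DIR>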

  3. Verify that files are extracted successfully.

Result

FE-Workers data is restored successfully from the backup stored in Amazon S3. All dictionaries, system-defined patterns, and processed datasets are available in the FE-Workers container and ready for use by the PDC application.

Restore Kubernetes objects from Amazon S3

In Data Catalog deployments running on Amazon EKS, administrators can restore Kubernetes objects such as Secrets and ConfigMaps from backups stored in Amazon S3. These objects contain configuration data and credentials required for Data Catalog components to operate correctly. Restoring Kubernetes objects ensures that secure keys, connection information, and application configuration are recovered after a cluster rebuild or configuration loss.

Ensure that the target PDC namespace exists before restoring Kubernetes objects. Restoring Secrets or ConfigMaps with the same name will overwrite existing resources in the namespace.

Perform the following steps to restore Kubernetes objects from Amazon S3:

Before you begin

Make sure the following requirements are met:

  • The Kubernetes object backup files are available in the Amazon S3 bucket.

  • AWS CLI and kubectl are installed and configured to access your Amazon EKS cluster.

  • You have the following information:

    • The Amazon S3 bucket name and timestamp of the backup.

    • The PDC namespace where the secrets must be restored.

  • You have cluster administrator privileges in the Amazon EKS cluster.

Procedure

  1. Download the object backup files from the Amazon S3 bucket.

  2. Restore the Kubernetes objects from the downloaded YAML files.
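    For example:

      kubectl apply -f <LOCAL_PATH>/<TIMESTAMP>/ -n <PDC_NAMESPACE>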

  3. Verify the restored Kubernetes secrets.

Result

Kubernetes objects are restored successfully from the backup stored in Amazon S3. All restored objects are re-applied to the specified PDC namespace, ensuring that the required credentials and configuration settings are available for Data Catalog services.

Restore from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore backup data stored in Amazon EBS or Amazon EFS volumes. When Data Catalog backups are configured to use persistent storage, all backup files are stored in a PersistentVolumeClaim (PVC) that remains available within the EKS cluster.

Each Data Catalog component can be restored individually by creating a temporary restore pod that mounts the same PVC used during the backup process.

Restoration from EBS or EFS storage allows administrators to recover component data such as PostgreSQL databases, MongoDB collections, OpenSearch indexes, FE-Workers data, and Kubernetes objects directly from the cluster without downloading backup files externally.

Use the same PVC that was used for the backup. Restoring data from an incorrect or outdated PVC may result in partial or inconsistent data recovery.

Each Data Catalog component has its own restore procedure that runs from within the EKS cluster. Select the appropriate guide based on the component you want to restore.

Restore PostgreSQL data from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore PostgreSQL data from backups stored in Amazon EBS or Amazon EFS. When Data Catalog backups are configured to use persistent storage, backup data is written directly to a PersistentVolumeClaim (PVC) in the EKS cluster. You can restore PostgreSQL data by creating a temporary restore pod that mounts the same PVC and running PostgreSQL commands to import data from the backup files.

Use the same PVC that was used during the backup process. Restoring from an incorrect or outdated volume may cause data inconsistency.

Perform the following steps to restore PostgreSQL data from backups stored in Amazon EBS or Amazon EFS:

Before you begin

Make sure the following requirements are met:

  • The backup data exists in the /backups/postgres/ directory of the PVC used for Data Catalog backups.

  • kubectl is installed and configured to access the Amazon EKS cluster.

  • The PostgreSQL service is running in the same PDC namespace.

  • You have identified the PVC name, PDC namespace, and PostgreSQL credentials.

  • All active PDC services that connect to PostgreSQL are stopped before the restore process begins.

Procedure

  1. Save the following pod configuration as pg-restore.yaml.
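    A minimal sketch of such a pod (the image is an example that provides the PostgreSQL client tools, and claimName must match the PVC used for backups):

      apiVersion: v1
      kind: Pod
      metadata:
        name: pg-restore
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: pg-restore
            image: $<customer-artifactory>/postgres:15  # any image with psql and pg_restore
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups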

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  2. Apply the pod specification in the EKS cluster.

  3. Verify that the restore pod is running in the specified namespace.

  4. Access the restore pod.

  5. List the available backup files in the mounted directory.

    The directory should contain a file such as postgres_full_<TIMESTAMP>.pgdump.

  6. Set the PostgreSQL password as an environment variable.
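    For example, from inside the restore pod:

      export PGPASSWORD=<POSTGRES_PASSWORD>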

  7. Drop existing PostgreSQL databases to avoid conflicts during restoration.
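    For example (repeat for each Data Catalog database; the service host is a placeholder):

      psql -h <POSTGRES_SERVICE> -U <POSTGRES_USER> -c "DROP DATABASE IF EXISTS <DATABASE_NAME>;"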

  8. Restore the PostgreSQL database from the backup file.
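    For example (the flags are a sketch; adjust them to how the dump was created):

      pg_restore -h <POSTGRES_SERVICE> -U <POSTGRES_USER> -d postgres --clean --create \
        /backups/postgres/postgres_full_<TIMESTAMP>.pgdump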

  9. Verify that the databases are restored successfully.

    The restored databases should appear in the list.

  10. Exit the restore pod.

  11. Delete the temporary restore pod after the restore process is complete.

Result

PostgreSQL data is restored successfully from the Amazon EBS or Amazon EFS storage used for Data Catalog backups. The restored databases are available and accessible once the PostgreSQL service restarts and reconnects with the PDC application.

Restore MongoDB data from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore MongoDB data from backups stored in Amazon EBS or Amazon EFS. When Data Catalog backups are configured to use persistent storage, backup data is stored in a PersistentVolumeClaim (PVC) in the EKS cluster. You can restore MongoDB data by creating a temporary restore pod that mounts the same PVC and importing the data using the mongorestore utility.

Use the same PVC that was used for backups. Restoring from an incorrect PVC may result in incomplete or outdated data.

Perform the following steps to restore the MongoDB data from backups:

Before you begin

Make sure the following requirements are met:

  • The backup files exist in the /backups/mongodb/ directory of the PersistentVolumeClaim (PVC) used for Data Catalog backups.

  • kubectl is installed and configured to access your Amazon EKS cluster.

  • The MongoDB service is running in the same PDC namespace.

  • You have identified the PVC name, PDC namespace, and MongoDB credentials.

  • All active PDC services that connect to MongoDB are stopped before restoring data.

Procedure

  1. Save the following pod configuration as mongo-restore.yaml.
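    A minimal sketch of such a pod (the image is an example that provides the MongoDB database tools, and claimName must match the PVC used for backups):

      apiVersion: v1
      kind: Pod
      metadata:
        name: mongo-restore
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: mongo-restore
            image: $<customer-artifactory>/mongo:6.0  # any image with mongorestore
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups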

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  2. Apply the pod specification to the EKS cluster.

  3. Verify that the restore pod is running.

  4. Access the restore pod.

  5. List the available backup files in the mounted directory.

    The directory should contain MongoDB backup folders or BSON files representing each database.

  6. Restore the MongoDB data from the backup.
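    For example, from inside the restore pod (the service host is a placeholder):

      mongorestore --host <MONGODB_SERVICE> --username <MONGODB_USER> \
        --password <MONGODB_PASSWORD> --authenticationDatabase admin \
        --drop /backups/mongodb/<TIMESTAMP>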

    This command drops existing collections and restores data from the specified backup directory.

  7. Verify that the data has been restored successfully.

    The restored databases should appear in the list.

  8. Exit the restore pod.

  9. Delete the temporary restore pod.

  10. Restart the licensing-api deployment to apply the restored data.

Result

MongoDB data is restored successfully from the Amazon EBS or Amazon EFS storage used for Data Catalog backups. All MongoDB collections are recovered, and the licensing-api deployment is refreshed to reflect the restored data.

Restore FE-Workers data from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore FE-Workers data from backups stored in Amazon EBS or Amazon EFS. When Data Catalog backups are configured to use persistent storage, FE-Workers data, including patterns, dictionaries, and temporary profiling results, is stored in a PersistentVolumeClaim (PVC). You can restore this data by creating a temporary restore pod that mounts both the backup PVC and the FE-Workers data PVC, then extracting the backup files into the target directory.

Use the same backup PVC that was used during the backup. Restoring data from an incorrect PVC may result in missing or inconsistent worker files.

Perform the following steps to restore FE-Workers data from backups stored in Amazon EBS or Amazon EFS.

Before you begin

Make sure the following requirements are met:

  • The backup files exist in the /backups/fe-workers/ directory of the backup PersistentVolumeClaim (PVC).

  • kubectl is installed and configured to access the Amazon EKS cluster.

  • You have identified the PVC name used for the backup and the PVC name used for FE-Workers data.

  • The PDC namespace is correct.

  • All active FE-Worker jobs or services are stopped before the restore is performed.

Procedure

  1. Save the following pod configuration as fe-worker-restore.yaml.
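    A minimal sketch of such a pod; it mounts both the backup PVC and the FE-Workers data PVC (the image and claim names are placeholders):

      apiVersion: v1
      kind: Pod
      metadata:
        name: fe-worker-restore
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: fe-worker-restore
            image: $<customer-artifactory>/pdc-toolbox:debian-12  # any image with a shell and tar
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
              - name: fe-worker-data
                mountPath: /data
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc         # PVC used for backups
          - name: fe-worker-data
            persistentVolumeClaim:
              claimName: <FE_WORKERS_DATA_PVC>  # PVC used by FE-Workers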

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  2. Apply the restore pod specification to the EKS cluster.

  3. Verify that the restore pod is running.

  4. Access the restore pod.

  5. List the available FE-Workers backup files.

    The directory should contain an archive file such as fe-worker-backup-<TIMESTAMP>.tar.gz.

  6. Extract the FE-Workers backup files into the target directory.
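    For example, from inside the restore pod:

      tar -xzf /backups/fe-workers/fe-worker-backup-<TIMESTAMP>.tar.gz -C /data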

  7. Verify that the files have been extracted successfully.

    The directory should include data folders such as patterns-systemdefined, dictionaries-en, and data.

  8. Exit the restore pod.

  9. Delete the temporary restore pod.

Result

FE-Workers data is restored successfully from the Amazon EBS or Amazon EFS storage used for Data Catalog backups. The restored dictionaries, patterns, and data files are available in the FE-Workers data directory and ready for use by the PDC application.

Restore Kubernetes objects from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore Kubernetes objects such as Secrets and ConfigMaps from backups stored in Amazon EBS or Amazon EFS. When Data Catalog backups are configured to use persistent storage, these objects are saved in a PersistentVolumeClaim (PVC) in the EKS cluster. You can restore Kubernetes objects by creating a temporary restore pod that mounts the same PVC and applies the backed-up manifests.

Use the same backup PVC that was used during the backup process. Restoring from an incorrect PVC may result in missing or outdated configurations. The restore pod must use the pdc-backup-sa service account to access and apply Kubernetes objects.

Perform the following steps to restore Kubernetes objects, such as Secrets and ConfigMaps, from backups stored in Amazon EBS or Amazon EFS.

Before you begin

Make sure the following requirements are met:

  • The object backup files exist in the /backups/objects/ directory of the backup PersistentVolumeClaim (PVC).

  • kubectl is installed and configured to access the Amazon EKS cluster.

  • The pdc-backup-sa service account is configured with permissions to create and update Kubernetes objects.

  • You have identified the PDC namespace and the PVC name used for storing the backup.

  • You have cluster administrator access to apply Secrets and ConfigMaps.

Procedure

  1. Save the following pod configuration as object-restore.yaml.
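    A minimal sketch of such a pod; it uses the pdc-backup-sa service account so that it can apply Kubernetes objects (the image is an example that includes kubectl):

      apiVersion: v1
      kind: Pod
      metadata:
        name: object-restore
        namespace: <PDC_NAMESPACE>
      spec:
        serviceAccountName: pdc-backup-sa  # required to apply Kubernetes objects
        containers:
          - name: object-restore
            image: $<customer-artifactory>/kubectl:latest  # any image that includes kubectl
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups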

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  2. Apply the pod specification to the EKS cluster.

  3. Verify that the restore pod is running.

  4. Access the restore pod.

  5. List the available Kubernetes object backup files.

    The directory should contain YAML manifest files for Secrets or ConfigMaps, such as secret_cat-key_<TIMESTAMP>.yaml.

  6. Apply the backed-up object manifests to restore them in the cluster.
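    For example, from inside the restore pod:

      kubectl apply -f /backups/objects/<TIMESTAMP>/ -n <PDC_NAMESPACE>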

  7. Verify that the objects have been restored.

    The restored secret (for example, cat-key) should appear in the list.

  8. Exit the restore pod.

  9. Delete the temporary restore pod.

Result

Kubernetes Secrets and ConfigMaps are restored successfully from the Amazon EBS or Amazon EFS storage used for Data Catalog backups. The restored objects are available in the PDC namespace, allowing Data Catalog components to access their required configuration and credentials.

Restore OpenSearch data from Amazon EBS volumes or Amazon EFS file systems

In Data Catalog deployments running on Amazon EKS, administrators can restore OpenSearch data from backups stored in Amazon EBS or Amazon EFS. When Data Catalog backups are configured to use persistent storage, OpenSearch backup files are stored in a PersistentVolumeClaim (PVC) that remains available in the Amazon EKS cluster. You can restore the data by creating a temporary restore pod that mounts the same PVC used during the backup process and running a restore script against the OpenSearch service.

Use the same PVC that was used for backups. Restoring from an incorrect PVC can lead to missing or outdated search indexes. The restore process requires the jq utility in the container to process JSON data.

Before you begin

  • Confirm that backup files exist in the /backups/opensearch/ directory of the backup PVC.

  • Verify that kubectl is installed and configured to access the Amazon EKS cluster.

  • Ensure that the OpenSearch service is running in the same namespace.

  • Identify the PVC name used for the backup and the PDC namespace.

  • Confirm that the jq package is available in the container image (PDC_TOOLBOX:debian-12).

Perform the following steps to restore OpenSearch data from Amazon EBS volumes or Amazon EFS file systems:

  1. Save the following pod configuration as opensearch-restore.yaml.
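    A minimal sketch of such a pod (the image must include curl and jq, and claimName must match the PVC used for backups):

      apiVersion: v1
      kind: Pod
      metadata:
        name: opensearch-restore
        namespace: <PDC_NAMESPACE>
      spec:
        containers:
          - name: opensearch-restore
            image: $<customer-artifactory>/pdc-toolbox:debian-12  # must include curl and jq
            command: ["sleep", "3600"]
            volumeMounts:
              - name: backup-storage
                mountPath: /backups
        volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pdc-backup-pvc  # PVC used for backups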

    Replace $<customer-artifactory> with the path to your container registry, such as Amazon ECR or another private registry.

  2. Apply the restore pod specification to the EKS cluster.

  3. Verify that the restore pod is running.

  4. Create the OpenSearch restore script locally and save it as opensearch_restore.sh. This script automates restoring all OpenSearch indexes from the PVC backup directory.
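    A sketch of this script is shown in Restore OpenSearch data from Amazon S3; for an in-cluster restore, point it at the /backups/opensearch/<TIMESTAMP> directory and at the OpenSearch service host instead of a local path and a port-forward.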

  5. Assign executable permissions to the restore script.

  6. Copy the restore script to the OpenSearch restore pod.
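    For example:

      kubectl cp opensearch_restore.sh <PDC_NAMESPACE>/opensearch-restore:/tmp/opensearch_restore.sh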

  7. Access the restore pod.

  8. Navigate to the script directory.

  9. Run the OpenSearch restore script.

    The script restores OpenSearch indexes, mappings, and data from the backup directory and automatically re-creates aliases. It processes indexes in chunks, using parallel ingestion for large data sets to speed up restoration.

  10. Confirm that the indexes are restored successfully.

    The list should display all PDC-related indexes, such as pdc_entity, pdc_policy, and pdc_glossary.

  11. Exit the restore pod.

  12. Delete the temporary restore pod after completing the process.

Result

OpenSearch data is restored successfully from the Amazon EBS or Amazon EFS storage used for Data Catalog backups. All indexes, mappings, and aliases are re-created, and search functionality is available in Data Catalog once the OpenSearch service completes synchronization.

