Installing Pentaho on AWS

Legacy page. Content moved to the Hyperscalers topic.


Deploy Pentaho on Amazon Web Services (AWS).

Choose a deployment option

Common prerequisites

  • An AWS account

  • Docker installed on your workstation

  • AWS CLI installed on your workstation

Each deployment option can have extra prerequisites.

Install the Platform Server or PDI Server on AWS

Use these steps to deploy Docker images of the Pentaho Platform Server or PDI Server on AWS.

This workflow uses Amazon EKS, Amazon ECR, and (optionally) Amazon S3.

Before you begin

Prerequisites

Meet these prerequisites before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the Amazon AWS CLI on your workstation.

  • Review the supported versions:

    • Amazon EKS: v1.x

    • Docker: v20.10.21 or a later stable version

    • AWS CLI: v2.x

    • Python: v3.x

  • Fill in the Worksheet for AWS hyperscaler. You will reuse these values later.

Process overview

Use these steps to deploy the Platform Server or PDI Server on AWS:

  1. Download and extract Pentaho for AWS.

  2. Create an Amazon ECR repository.

  3. Load and push the Pentaho Docker image to ECR.

  4. Create an RDS database.

  5. (Optional) Create an S3 bucket.

  6. Create an EKS cluster and add a node group.

  7. Install the Platform or PDI Server.

You can also dynamically update server configuration content from S3.


Step 1: Download and extract Platform or PDI Server for AWS

Download and open the package files that contain the files you need to install Pentaho.

  1. Navigate to the Support Portal and download the AWS version of the Docker image and the corresponding license file for the applications you want to install on your workstation.

  2. Extract the image to view the directories and the README file.

    The image package file (<package-name>.tar.gz) contains the following:

    • image: directory containing all the Pentaho source images.

    • sql-scripts: directory containing SQL scripts for various operations.

    • yaml: directory containing YAML configuration files and various utility files.

    • README.md: file containing a link to detailed information about what is provided for this release.


Step 2: Create an Amazon ECR repository

Before pushing the Pentaho image to AWS, create an Amazon ECR repository.

  1. Create an ECR repository to load the Pentaho image.

    For details, see Creating a private repository in the AWS documentation.

  2. Record the name of the ECR repository in Worksheet for AWS hyperscaler.
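As a sketch, the repository can also be created from the AWS CLI instead of the console; the repository name and region below are placeholder values, not names shipped with the product:

```shell
# Create a private ECR repository to hold the Pentaho image.
# "pentaho-server" and "us-east-1" are placeholders -- use your own values.
aws ecr create-repository \
  --repository-name pentaho-server \
  --region us-east-1

# The command output includes "repositoryUri"; record it in the worksheet.
```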


Step 3: Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image, then push it to the ECR registry.

  1. Navigate to the image directory containing the Pentaho .tar.gz files.

  2. Select and load the .tar.gz file into the local registry:

  3. Record the name of the source image that was loaded into the registry:

  4. Tag the source image so it can be pushed to AWS:

  5. Push the image file into the ECR registry:

    The AWS Management Console displays the uploaded image URI.

    For general instructions, see Pushing a Docker image in the AWS documentation.

  6. Record the newly created ECR repository URI in Worksheet for AWS hyperscaler.
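The command blocks for the steps above are not reproduced in this page. As one possible sketch (the archive name, image name, account ID, and region are placeholders, not the exact names shipped in the package):

```shell
# 1-2. Load the image archive into the local Docker registry, then list
# images to find the name of the loaded source image.
docker load --input pentaho-server-<version>.tar.gz
docker images

# 3. Tag the source image for the ECR repository.
docker tag pentaho-server:<version> \
  <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/pentaho-server:<version>

# 4-5. Authenticate Docker to ECR, then push the tagged image.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/pentaho-server:<version>
```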


Step 4: Create an RDS database

Use these instructions to create a Relational Database Service (RDS) database in AWS.

  1. Create an RDS PostgreSQL database for Pentaho to use.

    See the AWS instructions at Creating and connecting to a PostgreSQL DB instance and apply the settings in the table below.

    Section
    Actions

    Create database

    Choose Standard create.

    Select the PostgreSQL engine.

    Set the engine version to a PostgreSQL version supported by the Components reference found in the Try Pentaho Data Integration and Analytics document, such as PostgreSQL 13.5-R1.

    Templates

    Select the Free tier option (recommended).

    Note: For this installation, the Free tier PostgreSQL database is used with a set of options as an example. However, you are free to use other database servers with different options as necessary.

    Settings

    Set the DB instance identifier.

    Retain the default user name postgres and set the Master password.

    Use the default password authentication setting.

    Use the default values for the rest of the settings in this section.

    Instance configuration

    Use the default settings for each section.

    Storage

    Use the default settings for each section.

    Connectivity

    Set the Virtual private cloud (VPC) and the DB subnet group to any of the options available to you. If in doubt, use the default values.

    Select Public access.

    Make sure that the VPC security groups selected have a rule enabling communication to the database through the PostgreSQL port, which is 5432 by default.

    For other options, use the default settings.

    Database authentication

    Use the default setting Password authentication.

  2. Run the scripts in the distribution's sql-scripts folder in numeric order.

  3. From the AWS Management Console > Connection & security tab, record the database Endpoint and Port number in Worksheet for AWS hyperscaler.
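One way to run the distribution's SQL scripts against the new instance is with the psql client; the endpoint placeholder comes from the worksheet, and the script file name here is illustrative:

```shell
# Run each script in sql-scripts in numeric order against the RDS instance.
# <rds-endpoint> is the database Endpoint recorded in the worksheet;
# <script-name> stands in for each numbered script file.
psql -h <rds-endpoint> -p 5432 -U postgres -f sql-scripts/<script-name>.sql
```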


Step 5 (Optional): Create an S3 bucket

Create an S3 bucket only if you want to do one or more of the following actions. Otherwise, go to Step 6.

  • Add third-party JAR files like JDBC drivers or custom JAR files.

  • Customize the default Pentaho configuration.

  • Replace server files.

  • Upload or update the metastore.

  • Add files to the Platform and PDI Server's /home/pentaho/.kettle directory.

    This directory is mapped to the KETTLE_HOME_DIR environment variable. The content-config.properties file uses it.

  1. Create an S3 bucket.

    To create an S3 bucket, see Creating a bucket in the AWS documentation.

    To upload a file to S3, see Uploading objects in the AWS documentation.

  2. Record the newly created S3 bucket name in Worksheet for AWS hyperscaler.

  3. Upload files into the S3 bucket.

    After the S3 bucket is created, manually create any needed directories and upload files by using the AWS Management Console.

    The following table lists the relevant Pentaho directories and actions for each directory.

    Directory
    Actions

    /root

    All files in the S3 bucket are copied to the Platform and PDI Server's /home/pentaho/.kettle directory.

    If you must copy a file to the /home/pentaho/.kettle directory, drop the file in the root directory of the S3 bucket.

    custom-lib

    If Pentaho needs custom JAR libraries, add the custom-lib directory to the S3 bucket and place the libraries there.

    Any files within this directory will be copied to Pentaho’s lib directory.

    jdbc-drivers

    If the Pentaho installation needs JDBC drivers, do the following:

    1. Add the jdbc-drivers directory to the S3 bucket.

    2. Place the drivers in this directory. Any files within this directory will be copied to Pentaho’s lib directory.

    plugins

    If the Pentaho installation needs additional plugins installed, do the following:

    1. Add the plugins directory to the S3 bucket.

    2. Copy the plugins to the plugins directory. Any files within this directory are copied to Pentaho’s plugins directory. For this reason, organize plugins in their own directories, as Pentaho expects.

    drivers

    If the Pentaho installation needs big data drivers installed, do the following:

    1. Add the drivers directory to the S3 bucket.

    2. Place the big data drivers in this directory. Any files placed within this directory will be copied to Pentaho’s drivers directory.

    metastore

    Pentaho can execute jobs and transformations. Some require additional information that is usually stored in the Pentaho metastore.

    If you must provide the Pentaho metastore to Pentaho, copy the local metastore directory to the root of the S3 bucket. From there, the metastore directory is copied to the proper location within the Docker image.

    server-structured-override

    Use server-structured-override only if other mechanisms do not work.

    For example, you can use it for configuring authentication and authorization.

    Any files and directories within this directory will be copied into the pentaho-server directory the same way they appear here.

    If the same files exist in pentaho-server, they are overwritten.

    The following table lists relevant Pentaho files and actions for each file.

    File
    Actions

    context.xml

    The Pentaho configuration YAML is included with the image in the templates project directory and is used to install this product. You must set the RDS host and RDS port parameters when you install Pentaho. During installation, the parameters in the YAML are used to generate a custom context.xml so the server can connect to the database repository.

    If these are the only changes required in context.xml, you don’t need to provide a context.xml in S3.

    If you need additional context.xml changes, provide your own context.xml in S3.

    In the context.xml template, replace the <RDS_HOST_NAME> and <RDS_PORT> entries with the values in Worksheet for AWS hyperscaler.

    content-config.properties

    The content-config.properties file tells the Pentaho Docker image which S3 files to copy and where to place them.

    Each instruction is a line in this format:

    ${KETTLE_HOME_DIR}/<some-dir-or-file>=${SERVER_DIR}/<some-dir>

    A template for this file is in the templates project directory.

    The template has an entry where context.xml is copied to the required location:

    ${KETTLE_HOME_DIR}/context.xml=${SERVER_DIR}/tomcat/webapps/pentaho/META-INF/context.xml

    content-config.sh

    A bash script that can configure files, change file and directory ownership, move files, install missing apps, and so on.

    Add the script to the S3 bucket.

    The script runs in the Docker image after the other files are processed.

    metastore.zip

    Pentaho can execute jobs and transformations. Some require additional information that is usually stored in the Pentaho metastore.

    If you must provide the Pentaho metastore to Pentaho, zip the content of the local .pentaho directory with the name metastore.zip and add it to the root of the S3 bucket. The metastore.zip file is extracted to the proper location within the Docker image.

    Note: You cannot copy VFS connections to the hyperscaler server the same way as named connections. Connect to Pentaho on the hyperscaler and create the VFS connection there.

For instructions on how to dynamically update server configuration content from the S3 bucket, see Dynamically update server configuration content from S3.
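The bucket and its directory layout can also be prepared from the AWS CLI rather than the console; the bucket name, region, and local file names below are placeholders:

```shell
# Create the configuration bucket.
aws s3 mb s3://my-pentaho-config --region us-east-1

# JDBC drivers and custom JARs are copied to Pentaho's lib directory at startup.
aws s3 cp ./postgresql-driver.jar s3://my-pentaho-config/jdbc-drivers/

# Files at the bucket root are copied to /home/pentaho/.kettle.
aws s3 cp ./context.xml s3://my-pentaho-config/
aws s3 cp ./content-config.properties s3://my-pentaho-config/
```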


Step 6: Create an EKS cluster and add a node group

Use Amazon Elastic Kubernetes Service (EKS) to create a cluster for running the Platform or PDI Server.

  1. Create an EKS cluster on AWS.

    For instructions, see Create an Amazon EKS cluster.

    For an introduction to EKS, see Getting started with Amazon EKS.

    For information about creating roles to delegate permissions to an AWS service, see Create a role.

    Settings
    Actions

    Cluster service role

    Select any existing role, as long as these policies are attached:

    • AmazonEKSClusterPolicy

    • AmazonS3FullAccess

    • AmazonEKSServicePolicy

    VPC

    In the Networking section, do the following:

    1. Select an existing VPC. The selected VPC populates the list of subnets. Create the VPC before you create the cluster.

    2. Make sure that Auto-assign public IPv4 address under Subnets is set to Yes.

    Cluster endpoint access

    Select Public and private.

    Add-ons

    Select the Amazon VPC CNI, CoreDNS, and kube-proxy EKS add-ons with their default configurations.

  2. Record the newly created EKS cluster name in Worksheet for AWS hyperscaler.

  3. On the Compute tab under Node groups, select Add node group.

    Note: The EKS cluster must be in Active state before you create nodes.

    For more details, see Create a managed node group.

  4. In Node group configuration, set the node group Name.

  5. Select a Node IAM role or create a new one. Make sure the role includes these policies:

    • AmazonS3FullAccess

    • AmazonEC2ContainerRegistryReadOnly

    • AmazonEKSWorkerNodePolicy

    • AmazonEKS_CNI_Policy

  6. Set the instance type to one that has at least 8 GB of memory.

  7. In Node group scaling configuration, set Desired size, Minimum size, and Maximum size.

  8. In Node group network configuration, select the subnets for your node group.

  9. For the subnets, set Auto-assign public IPv4 address to Yes.

    For details, see IP addressing for your VPCs and subnets.

  10. Select a load balancer.

    For instructions on how to create an AWS Application Load Balancer, see Application load balancing on Amazon EKS.
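As an alternative to the console workflow above, eksctl can create the cluster and a managed node group in one command; the cluster name, region, and sizing values are placeholders, and the instance type should provide at least 8 GB of memory:

```shell
# Create an EKS cluster with a managed node group in one step.
# t3.large provides 8 GiB of memory per node.
eksctl create cluster \
  --name pentaho-eks \
  --region us-east-1 \
  --node-type t3.large \
  --nodes 2 --nodes-min 1 --nodes-max 3
```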


Step 7: Install the Platform or PDI Server on AWS

When your AWS environment is configured, install the Platform Server or PDI Server.

  1. Retrieve the kubeconfig from the EKS cluster.

    In your workstation console, run:
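The command itself is not reproduced in this page; with the AWS CLI, the kubeconfig is typically retrieved as follows (the region is a placeholder, and the cluster name comes from the worksheet):

```shell
# Merge the EKS cluster's credentials into ~/.kube/config
# so kubectl can reach the cluster.
aws eks update-kubeconfig --region us-east-1 --name <eks-cluster-name>
```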

  2. To configure the Platform or PDI Server YAML file, open pentaho-server-aws-rds-<lb-type>.yaml in the yaml project directory.

    lb-type
    When to use

    alb

    Use this if you installed the AWS Application Load Balancer.

    nginx

    Use this if you installed the NGINX Ingress Controller.

  3. Update the YAML file by copying the values you recorded in Worksheet for AWS hyperscaler.

  4. Retrieve the Platform or PDI Server entry point URI.

    Run either command on your workstation:

    or:

    The default port is 80.
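The two commands are not reproduced in this page; as an assumption based on the two load balancer types, the entry point is typically read with one of the following kubectl queries:

```shell
# For an AWS Application Load Balancer (alb) deployment,
# the address is exposed on the Ingress resource:
kubectl get ingress

# For an NGINX Ingress Controller (nginx) deployment,
# read the service's external address instead:
kubectl get service
```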

  5. Deploy the Platform or PDI Server:

  6. Test the Platform or PDI Server by retrieving the LoadBalancer Ingress URI:

    Note: The port number for this load balancer is 80, not 8080.
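Steps 5 and 6 can be sketched with kubectl as follows; the YAML file name matches the one configured earlier, and the exact service name depends on your deployment:

```shell
# Deploy the Platform or PDI Server from the configured YAML file.
kubectl apply -f pentaho-server-aws-rds-<lb-type>.yaml

# Retrieve the LoadBalancer's external address (served on port 80, not 8080).
kubectl get service --output wide
```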

  7. Open the URI in a Pentaho-supported browser and sign in.

    Field
    Default value

    Username

    admin

    Password

    password

Dynamically update server configuration content from S3

If the S3 bucket changed and you need to reflect these changes in the Platform or PDI Server, use these steps.

Before you deploy the Platform or PDI Server, set allow_live_config to true in pentaho-server-aws-rds.yaml.

  1. Navigate to the directory that contains the configuration you want to update.

  2. Prepare the update script by setting <config_command> to one of these values:

    Command option
    Description

    load_from_s3

    Copies the content of the bucket to the server’s /home/pentaho/.kettle directory.

    restart

    Restarts the Platform or PDI Server without restarting the pod.

    update_config

    Runs load_from_s3, runs all configuration and initialization scripts, then runs restart.

  3. Run the configuration update script.

    Note: If you have multiple Platform or PDI Server replicas, remove the comment (#) in front of sleep 60.

  4. Verify that the servers restart properly.
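The update script is not reproduced in this page. A hypothetical sketch of how it is typically run (the pod label, script path, and script name are assumptions, not the exact names shipped in the package):

```shell
# Hypothetical: find a server pod and run the configuration update inside it.
# "app=pentaho-server" and the script path are placeholder names.
POD=$(kubectl get pods -l app=pentaho-server -o name | head -n 1)
kubectl exec "$POD" -- /home/pentaho/scripts/config-update.sh update_config
```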

Install the Carte Server on AWS

These instructions help you deploy Docker images of the Carte Server on AWS.

Prerequisites

Meet these requirements before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the AWS CLI on your workstation.

Supported versions:

  • Amazon EKS: v1.x

  • Docker: v20.10.21 or later stable version

  • AWS CLI: v2.x

Process overview


Step 1: Download and extract Pentaho for AWS

  1. Download the AWS Docker image package and license file you need.

  2. Extract the archive.

The package contains:

  • image/: Pentaho source images

  • yaml/: YAML configuration files and utility files

  • README.md: link to release details


Step 2: Create an Amazon ECR repository

Create an ECR repository for the Pentaho image, then record the repository name in the Worksheet for AWS hyperscaler.


Step 3: Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image, then push it to ECR.

  1. Go to the image/ directory that contains the Pentaho tar.gz files.

  2. Load the tar.gz file into your local registry:

  3. List images and note the source image name:

  4. Tag the source image:

  5. Push the image:

    The AWS Management Console shows the uploaded image URI.

  6. Record the ECR repository URI in the Worksheet for AWS hyperscaler.

For AWS instructions, see Pushing a Docker image.


Step 4: Create an S3 bucket for the Carte Server

Create an S3 bucket for files the container needs at startup.

  1. Create an S3 bucket.

    See AWS docs: Creating a bucket.

  2. Record the bucket name in the Worksheet for AWS hyperscaler.

  3. Upload the required directories and files.

    See AWS docs: Uploading objects.

S3 bucket directories

Create these directories in the bucket as needed:

  • root/

    • Files in this directory are copied to /home/pentaho/.kettle in the container.

  • jdbc-drivers/

    • Put JDBC drivers here.

    • Files are copied to Pentaho’s lib directory.

  • plugins/

    • Put additional plugins here.

    • Files are copied to Pentaho’s plugins directory.

    • Organize each plugin in its own directory.

S3 bucket files

Upload these files as needed:

  • content-config.properties

    • Controls which S3 files are copied and where.

    • Add one line per copy instruction:

    • Example from the template:

  • content-config.sh

    • Optional script to configure files, change ownership, install missing apps, and more.

    • Runs after the other files are processed.

Run PDI-CLI on AWS

Use the PDI-CLI Docker image to run kitchen.sh (transformations) and pan.sh (jobs) on AWS.

Prerequisites

Meet these requirements before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the AWS CLI on your workstation.

Supported versions:

  • Docker: v20.10.21 or later stable version

  • AWS CLI: v2.x

Process overview


Step 1: Download and extract Pentaho for AWS

  1. Download the AWS Docker image package and license file you need.

  2. Extract the archive.

The package contains:

  • image/: Pentaho source images

  • yaml/: YAML configuration files and utility files

  • README.md: link to release details


Step 2: Create an Amazon ECR repository

Create an ECR repository for the PDI-CLI image, then record the repository URI in the Worksheet for AWS hyperscaler.


Step 3: Load and push the PDI-CLI Docker image to ECR

Load the image locally, tag it, then push it to ECR.

  1. Go to the image/ directory that contains the PDI-CLI tar.gz file.

  2. Load the image into your local registry:

  3. List images and note the source image name:

  4. Tag the source image:

  5. Push the image:

    The AWS console shows the uploaded image URI.

  6. Record the image URI in the Worksheet for AWS hyperscaler.

For AWS instructions, see Pushing a Docker image.


Step 4: Create an S3 bucket for PDI-CLI

Create an S3 bucket for files the container needs at startup.

  1. Create an S3 bucket.

    See AWS docs: Creating a bucket.

  2. Record the bucket name in the Worksheet for AWS hyperscaler.

  3. Upload the required directories and files.

    See AWS docs: Uploading objects.

S3 bucket directories

Create these directories in the bucket as needed:

  • root/

    • Files in this directory are copied to /home/pentaho/data-integration/data in the container.

  • jdbc-drivers/

    • Put JDBC drivers here.

    • Files are copied to Pentaho’s lib directory.

  • plugins/

    • Put additional plugins here.

    • Files are copied to Pentaho’s plugins directory.

    • Organize each plugin in its own directory.

  • metastore/

    • Put metastore content here when jobs require it.

    • Copy your local .pentaho/ folder into this directory.

    • Content is copied to /home/pentaho/.pentaho in the container.

S3 bucket files

Upload these files as needed:

  • content-config.properties

    • Controls which S3 files are copied and where.

    • Add one line per copy instruction:

    • Example from the template:

  • content-config.sh

    • Optional script to configure files, change ownership, install missing apps, and more.

    • Runs after the other files are processed.


Step 5: Configure and execute PDI-CLI in AWS Batch

Create the AWS Batch resources and run a job using the PDI-CLI image.

Follow the AWS guidance at Getting started with AWS Batch.

  1. Create a compute environment.

  2. Create a job queue.

  3. Create a job definition.

    Set the container image to the ECR image URI from Step 3.

  4. Create a job.

  5. Set environment variables for your job:

    • PROJECT_S3_LOCATION

      • S3 location that contains the project files.

      • Example: s3://pentaho-samples/

    • METASTORE_LOCATION

      • S3 path to the metastore directory.

      • Content is copied to /home/pentaho/.pentaho in the container.

      • Example: metastore

    • PROJECT_STARTUP_JOB

      • Job (.kjb) path to run at startup.

      • Example: jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb

    • LICENSE_TOKEN

      • License token or license server URL.

      • Example: http://localhost:7070/license-server/request (sample)

    • PARAMETERS

      • Parameters passed to the job or transformation.

      • Example: -param:my_param_name=MYVALUE

You can now run jobs and transformations using PDI-CLI.
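A job can also be submitted from the AWS CLI; the job name, queue, and definition below are placeholders, while the environment variable values reuse the examples from step 5:

```shell
# Submit a PDI-CLI job to AWS Batch, overriding the container environment.
# Job name, queue, and definition are placeholder values.
aws batch submit-job \
  --job-name pentaho-pdi-cli-run \
  --job-queue <job-queue-name> \
  --job-definition <job-definition-name> \
  --container-overrides '{"environment":[
    {"name":"PROJECT_S3_LOCATION","value":"s3://pentaho-samples/"},
    {"name":"PROJECT_STARTUP_JOB","value":"jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb"},
    {"name":"PARAMETERS","value":"-param:my_param_name=MYVALUE"}]}'
```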

Worksheet for AWS hyperscaler

Use this worksheet to track values during setup:

  • ECR_IMAGE_URI (Platform/PDI Server and Carte Server only)

  • RDS_HOSTNAME (Platform/PDI Server and Carte Server only)

  • RDS_PORT (Platform/PDI Server and Carte Server only)

  • S3_BUCKET_NAME

  • EKS_CLUSTER_NAME (Platform/PDI Server and Carte Server only)

  • LICENSE_TOKEN
