Installing Pentaho on AWS

Legacy page. Content moved to the Hyperscalers topic.


Deploy Pentaho on Amazon Web Services (AWS).

Choose a deployment option

Common prerequisites

  • An AWS account

  • Docker installed on your workstation

  • AWS CLI installed on your workstation

Each deployment option can have extra prerequisites.

Install the Platform Server or PDI Server on AWS

Use these steps to deploy Docker images of the Pentaho Platform Server or PDI Server on AWS.

This workflow uses Amazon EKS, Amazon ECR, and (optionally) Amazon S3.

Before you begin

Prerequisites

Meet these prerequisites before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the Amazon AWS CLI on your workstation.

  • Review the supported versions:

    • Amazon EKS: v1.x

    • Docker: v20.10.21 or a later stable version

    • AWS CLI: v2.x

    • Python: v3.x

  • Fill in the Worksheet for AWS hyperscaler. You will reuse these values later.

Process overview

Use these steps to deploy the Platform Server or PDI Server on AWS:

  1. Download and extract Pentaho for AWS.

  2. Create an Amazon ECR repository.

  3. Load and push the Pentaho Docker image to ECR.

  4. Create an RDS database.

  5. (Optional) Create an S3 bucket.

  6. Create an EKS cluster and add a node group.

  7. Install the Platform or PDI Server.

You can also dynamically update server configuration content from S3.


Step 1: Download and extract Platform or PDI Server for AWS

Download and open the package files that contain the files you need to install Pentaho.

  1. Navigate to the Support Portal and download the AWS version of the Docker image and the corresponding license file for the applications you want to install on your workstation.

  2. Extract the image to view the directories and the README file.

    The image package file (<package-name>.tar.gz) contains the following:

    • image: directory containing all the Pentaho source images.

    • sql-scripts: directory containing SQL scripts for various operations.

    • yaml: directory containing YAML configuration files and various utility files.

    • README.md: file containing a link to detailed information about what is provided for this release.


Step 2: Create an Amazon ECR repository

Before pushing the Pentaho image to AWS, create an Amazon ECR repository.

  1. Create an ECR repository to load the Pentaho image.

    For details, see Creating a private repository in the AWS documentation.

  2. Record the name of the ECR repository in Worksheet for AWS hyperscaler.
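As a sketch, the repository can also be created from the AWS CLI instead of the console; the repository name and region below are placeholder values, not names shipped with the product:

```shell
# Create a private ECR repository to hold the Pentaho image.
# "pentaho-server" and "us-east-1" are placeholders -- use your own values.
aws ecr create-repository \
  --repository-name pentaho-server \
  --region us-east-1

# The command output includes "repositoryUri"; record it in the worksheet.
```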


Step 3: Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image, then push it to the ECR registry.

  1. Navigate to the image directory containing the Pentaho .tar.gz files.

  2. Select and load the .tar.gz file into the local registry:

  3. Record the name of the source image that was loaded into the registry:

  4. Tag the source image so it can be pushed to AWS:

  5. Push the image file into the ECR registry:

    The AWS Management Console displays the uploaded image URI.

    For general instructions, see Pushing a Docker image in the AWS documentation.

  6. Record the newly created ECR repository URI in Worksheet for AWS hyperscaler.
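The command blocks for the steps above are not reproduced in this page. As one possible sketch (the archive name, image name, account ID, and region are placeholders, not the exact names shipped in the package):

```shell
# 1-2. Load the image archive into the local Docker registry, then list
# images to find the name of the loaded source image.
docker load --input pentaho-server-<version>.tar.gz
docker images

# 3. Tag the source image for the ECR repository.
docker tag pentaho-server:<version> \
  <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/pentaho-server:<version>

# 4-5. Authenticate Docker to ECR, then push the tagged image.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/pentaho-server:<version>
```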


Step 4: Create an RDS database

Use these instructions to create a Relational Database Service (RDS) database in AWS.

  1. Create an RDS PostgreSQL database for Pentaho to use.

    See the AWS instructions at Creating and connecting to a PostgreSQL DB instance and apply the settings in the table below.

    Section
    Actions

    Create database

    Choose Standard create.

    Select the PostgreSQL engine.

    Set the engine version to a PostgreSQL version supported by the Components reference found in the Try Pentaho Data Integration and Analytics document, such as PostgreSQL 13.5-R1.

    Templates

    Select the Free tier option (recommended).

    Note: For this installation, the Free tier PostgreSQL database is used with a set of options as an example. However, you are free to use other database servers with different options as necessary.

    Settings

    Set the DB instance identifier.

    Retain the default user name postgres and set the Master password.

    Use the default password authentication setting.

    Use the default values for the rest of the settings in this section.

    Instance configuration

    Use the default settings for each section.

    Storage

    Use the default settings for each section.

    Connectivity

    Set the Virtual private cloud (VPC) and the DB subnet group to any of the options available to you. If in doubt, use the default values.

    Select Public access.

    Make sure that the VPC security groups selected have a rule enabling communication to the database through the PostgreSQL port, which is 5432 by default.

    For other options, use the default settings.

    Database authentication

    Use the default setting Password authentication.

  2. Run the scripts in the distribution's sql-scripts folder in numeric order.

  3. From the AWS Management Console > Connection & security tab, record the database Endpoint and Port number in Worksheet for AWS hyperscaler.
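One way to run the distribution's SQL scripts against the new instance is with the psql client; the endpoint placeholder comes from the worksheet, and the script file name here is illustrative:

```shell
# Run each script in sql-scripts in numeric order against the RDS instance.
# <rds-endpoint> is the database Endpoint recorded in the worksheet;
# <script-name> stands in for each numbered script file.
psql -h <rds-endpoint> -p 5432 -U postgres -f sql-scripts/<script-name>.sql
```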


Step 5 (Optional): Create an S3 bucket

Create an S3 bucket only if you want to do one or more of the following actions. Otherwise, go to Step 6.

  • Add third-party JAR files like JDBC drivers or custom JAR files.

  • Customize the default Pentaho configuration.

  • Replace server files.

  • Upload or update the metastore.

  • Add files to the Platform and PDI Server's /home/pentaho/.kettle directory.

    This directory is mapped to the KETTLE_HOME_DIR environment variable. The content-config.properties file uses it.

  1. Create an S3 bucket.

    To create an S3 bucket, see Creating a bucket in the AWS documentation.

    To upload a file to S3, see Uploading objects in the AWS documentation.

  2. Record the newly created S3 bucket name in Worksheet for AWS hyperscaler.

  3. Upload files into the S3 bucket.

    After the S3 bucket is created, manually create any needed directories and upload files by using the AWS Management Console.

    The following table lists the relevant Pentaho directories and actions for each directory.

    Directory
    Actions

    /root

    All files in the S3 bucket are copied to the Platform and PDI Server's /home/pentaho/.kettle directory.

    If you must copy a file to the /home/pentaho/.kettle directory, drop the file in the root directory of the S3 bucket.

    custom-lib

    If Pentaho needs custom JAR libraries, add the custom-lib directory to the S3 bucket and place the libraries there.

    Any files within this directory will be copied to Pentaho’s lib directory.

    jdbc-drivers

    If the Pentaho installation needs JDBC drivers, do the following:

    1. Add the jdbc-drivers directory to the S3 bucket.

    2. Place the drivers in this directory. Any files within this directory will be copied to Pentaho’s lib directory.

    plugins

    If the Pentaho installation needs additional plugins installed, do the following:

    1. Add the plugins directory to the S3 bucket.

    2. Copy the plugins to the plugins directory. Any files within this directory are copied to Pentaho’s plugins directory. For this reason, organize plugins in their own directories, as Pentaho expects.

    drivers

    If the Pentaho installation needs big data drivers installed, do the following:

    1. Add the drivers directory to the S3 bucket.

    2. Place the big data drivers in this directory. Any files placed within this directory will be copied to Pentaho’s drivers directory.

    metastore

    Pentaho can execute jobs and transformations. Some require additional information that is usually stored in the Pentaho metastore.

    If you must provide the Pentaho metastore to Pentaho, copy the local metastore directory to the root of the S3 bucket. From there, the metastore directory is copied to the proper location within the Docker image.

    server-structured-override

    Use server-structured-override only if other mechanisms do not work.

    For example, you can use it for configuring authentication and authorization.

    Any files and directories within this directory will be copied into the pentaho-server directory the same way they appear here.

    If the same files exist in pentaho-server, they are overwritten.

    The following table lists relevant Pentaho files and actions for each file.

    File
    Actions

    context.xml

    The Pentaho configuration YAML is included with the image in the templates project directory and is used to install this product. You must set the RDS host and RDS port parameters when you install Pentaho. During installation, the parameters in the YAML are used to generate a custom context.xml so the server can connect to the database repository.

    If these are the only changes required in context.xml, you don’t need to provide a context.xml in S3.

    If you need additional context.xml changes, provide your own context.xml in S3.

    In the context.xml template, replace the <RDS_HOST_NAME> and <RDS_PORT> entries with the values in Worksheet for AWS hyperscaler.

    content-config.properties

    The content-config.properties file tells the Pentaho Docker image which S3 files to copy and where to place them.

    Each instruction is a line in this format:

    ${KETTLE_HOME_DIR}/<some-dir-or-file>=${SERVER_DIR}/<some-dir>

    A template for this file is in the templates project directory.

    The template has an entry where context.xml is copied to the required location:

    ${KETTLE_HOME_DIR}/context.xml=${SERVER_DIR}/tomcat/webapps/pentaho/META-INF/context.xml

    content-config.sh

    A bash script that can configure files, change file and directory ownership, move files, install missing apps, and so on.

    Add the script to the S3 bucket.

    The script runs in the Docker image after the other files are processed.

    metastore.zip

    Pentaho can execute jobs and transformations. Some require additional information that is usually stored in the Pentaho metastore.

    If you must provide the Pentaho metastore to Pentaho, zip the content of the local .pentaho directory with the name metastore.zip and add it to the root of the S3 bucket. The metastore.zip file is extracted to the proper location within the Docker image.

    Note: You cannot copy VFS connections to the hyperscaler server the same way as named connections. Connect to Pentaho on the hyperscaler and create the VFS connection there.

For instructions on how to dynamically update server configuration content from the S3 bucket, see Dynamically update server configuration content from S3.
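The bucket and its directory layout can also be prepared from the AWS CLI rather than the console; the bucket name, region, and local file names below are placeholders:

```shell
# Create the configuration bucket.
aws s3 mb s3://my-pentaho-config --region us-east-1

# JDBC drivers and custom JARs are copied to Pentaho's lib directory at startup.
aws s3 cp ./postgresql-driver.jar s3://my-pentaho-config/jdbc-drivers/

# Files at the bucket root are copied to /home/pentaho/.kettle.
aws s3 cp ./context.xml s3://my-pentaho-config/
aws s3 cp ./content-config.properties s3://my-pentaho-config/
```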


Step 6: Create an EKS cluster and add a node group

Use Amazon Elastic Kubernetes Service (EKS) to create a cluster for running the Platform or PDI Server.

  1. Create an EKS cluster on AWS.

    For instructions, see Create an Amazon EKS cluster.

    For an introduction to EKS, see Getting started with Amazon EKS.

    For information about creating roles to delegate permissions to an AWS service, see Create a role.

    Settings
    Actions

    Cluster service role

    Select any existing role, as long as these policies are attached:

    • AmazonEKSClusterPolicy

    • AmazonS3FullAccess

    • AmazonEKSServicePolicy

    VPC

    In the Networking section, do the following:

    1. Select an existing VPC. The selected VPC populates the list of subnets. Create the VPC before you create the cluster.

    2. Make sure that Auto-assign public IPv4 address under Subnets is set to Yes.

    Cluster endpoint access

    Select Public and private.

    Add-ons

    Select the Amazon VPC CNI, CoreDNS, and kube-proxy EKS add-ons with their default configurations.

  2. Record the newly created EKS cluster name in Worksheet for AWS hyperscaler.

  3. On the Compute tab under Node groups, select Add node group.

    Note: The EKS cluster must be in Active state before you create nodes.

    For more details, see Create a managed node group.

  4. In Node group configuration, set the node group Name.

  5. Select a Node IAM role or create a new one. Make sure the role includes these policies:

    • AmazonS3FullAccess

    • AmazonEC2ContainerRegistryReadOnly

    • AmazonEKSWorkerNodePolicy

    • AmazonEKS_CNI_Policy

  6. Set the instance type to one that has at least 8 GB of memory.

  7. In Node group scaling configuration, set Desired size, Minimum size, and Maximum size.

  8. In Node group network configuration, select the subnets for your node group.

  9. For the subnets, set Auto-assign public IPv4 address to Yes.

    For details, see IP addressing for your VPCs and subnets.

  10. Select a load balancer.

    For instructions on how to create an AWS Application Load Balancer, see Application load balancing on Amazon EKS.
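As an alternative to the console workflow above, eksctl can create the cluster and a managed node group in one command; the cluster name, region, and sizing values are placeholders, and the instance type should provide at least 8 GB of memory:

```shell
# Create an EKS cluster with a managed node group in one step.
# t3.large provides 8 GiB of memory per node.
eksctl create cluster \
  --name pentaho-eks \
  --region us-east-1 \
  --node-type t3.large \
  --nodes 2 --nodes-min 1 --nodes-max 3
```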


Step 7: Install the Platform or PDI Server on AWS

When your AWS environment is configured, install the Platform Server or PDI Server.

  1. Retrieve the kubeconfig from the EKS cluster.

    In your workstation console, run:
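The command itself is not reproduced in this page; with the AWS CLI, the kubeconfig is typically retrieved as follows (the region is a placeholder, and the cluster name comes from the worksheet):

```shell
# Merge the EKS cluster's credentials into ~/.kube/config
# so kubectl can reach the cluster.
aws eks update-kubeconfig --region us-east-1 --name <eks-cluster-name>
```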

  2. To configure the Platform or PDI Server YAML file, open pentaho-server-aws-rds-<lb-type>.yaml in the yaml project directory.

    lb-type
    When to use

    alb

    Use this if you installed the AWS Application Load Balancer.

    nginx

    Use this if you installed the NGINX Ingress Controller.

  3. Update the YAML file by copying the values you recorded in Worksheet for AWS hyperscaler.

  4. Retrieve the Platform or PDI Server entry point URI.

    Run either command on your workstation:

    or:

    The default port is 80.
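The two commands are not reproduced in this page; as an assumption based on the two load balancer types, the entry point is typically read with one of the following kubectl queries:

```shell
# For an AWS Application Load Balancer (alb) deployment,
# the address is exposed on the Ingress resource:
kubectl get ingress

# For an NGINX Ingress Controller (nginx) deployment,
# read the service's external address instead:
kubectl get service
```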

  5. Deploy the Platform or PDI Server:

  6. Test the Platform or PDI Server by retrieving the LoadBalancer Ingress URI:

    Note: The port number for this load balancer is 80, not 8080.
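Steps 5 and 6 can be sketched with kubectl as follows; the YAML file name matches the one configured earlier, and the exact service name depends on your deployment:

```shell
# Deploy the Platform or PDI Server from the configured YAML file.
kubectl apply -f pentaho-server-aws-rds-<lb-type>.yaml

# Retrieve the LoadBalancer's external address (served on port 80, not 8080).
kubectl get service --output wide
```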

  7. Open the URI in a Pentaho-supported browser and sign in.

    Field
    Default value

    Username

    admin

    Password

    password

Dynamically update server configuration content from S3

If the S3 bucket changed and you need to reflect these changes in the Platform or PDI Server, use these steps.

Before you deploy the Platform or PDI Server, set allow_live_config to true in pentaho-server-aws-rds.yaml.

  1. Navigate to the directory that contains the configuration you want to update.

  2. Prepare the update script by setting <config_command> to one of these values:

    Command option
    Description

    load_from_s3

    Copies the content of the bucket to the server’s /home/pentaho/.kettle directory.

    restart

    Restarts the Platform or PDI Server without restarting the pod.

    update_config

    Runs load_from_s3, runs all configuration and initialization scripts, then runs restart.

  3. Run the configuration update script.

    Note: If you have multiple Platform or PDI Server replicas, remove the comment (#) in front of sleep 60.

  4. Verify that the servers restart properly.
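The update script is not reproduced in this page. A hypothetical sketch of how it is typically run (the pod label, script path, and script name are assumptions, not the exact names shipped in the package):

```shell
# Hypothetical: find a server pod and run the configuration update inside it.
# "app=pentaho-server" and the script path are placeholder names.
POD=$(kubectl get pods -l app=pentaho-server -o name | head -n 1)
kubectl exec "$POD" -- /home/pentaho/scripts/config-update.sh update_config
```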

Install the Carte Server on AWS

These instructions help you deploy Docker images of the Carte Server on AWS.

Prerequisites

Meet these requirements before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the AWS CLI on your workstation.

Supported versions:

  • Amazon EKS: v1.x

  • Docker: v20.10.21 or later stable version

  • AWS CLI: v2.x

Process overview


Step 1: Download and extract Pentaho for AWS

  1. Download the AWS Docker image package and license file you need.

  2. Extract the archive.

The package contains:

  • image/: Pentaho source images

  • yaml/: YAML configuration files and utility files

  • README.md: link to release details


Step 2: Create an Amazon ECR repository

Create an ECR repository for the Pentaho image, then record the repository name in the Worksheet for AWS hyperscaler.


Step 3: Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image, then push it to ECR.

  1. Go to the image/ directory that contains the Pentaho tar.gz files.

  2. Load the tar.gz file into your local registry:

  3. List images and note the source image name:

  4. Tag the source image:

  5. Push the image:

    The AWS Management Console shows the uploaded image URI.

  6. Record the ECR repository URI in the Worksheet for AWS hyperscaler.

For AWS instructions, see Pushing a Docker image.


Step 4: Create an S3 bucket for the Carte Server

Create an S3 bucket for files the container needs at startup.

  1. Create an S3 bucket.

    See AWS docs: Creating a bucket.

  2. Record the bucket name in the Worksheet for AWS hyperscaler.

  3. Upload the required directories and files.

    See AWS docs: Uploading objects.

S3 bucket directories

Create these directories in the bucket as needed:

  • root/

    • Files in this directory are copied to /home/pentaho/.kettle in the container.

  • jdbc-drivers/

    • Put JDBC drivers here.

    • Files are copied to Pentaho’s lib directory.

  • plugins/

    • Put additional plugins here.

    • Files are copied to Pentaho’s plugins directory.

    • Organize each plugin in its own directory.

S3 bucket files

Upload these files as needed:

  • content-config.properties

    • Controls which S3 files are copied and where.

    • Add one line per copy instruction:

    • Example from the template:

  • content-config.sh

    • Optional script to configure files, change ownership, install missing apps, and more.

    • Runs after the other files are processed.

Run PDI-CLI on AWS

Use the PDI-CLI Docker image to run kitchen.sh (transformations) and pan.sh (jobs) on AWS.

Prerequisites

Meet these requirements before you start:

  • Install a stable version of Docker on your workstation.

  • Have an AWS account.

  • Install the AWS CLI on your workstation.

Supported versions:

  • Docker: v20.10.21 or later stable version

  • AWS CLI: v2.x

Process overview


Step 1: Download and extract Pentaho for AWS

  1. Download the AWS Docker image package and license file you need.

  2. Extract the archive.

The package contains:

  • image/: Pentaho source images

  • yaml/: YAML configuration files and utility files

  • README.md: link to release details


Step 2: Create an Amazon ECR repository

Create an ECR repository for the PDI-CLI image, then record the repository URI in the Worksheet for AWS hyperscaler.


Step 3: Load and push the PDI-CLI Docker image to ECR

Load the image locally, tag it, then push it to ECR.

  1. Go to the image/ directory that contains the PDI-CLI tar.gz file.

  2. Load the image into your local registry:

  3. List images and note the source image name:

  4. Tag the source image:

  5. Push the image:

    The AWS console shows the uploaded image URI.

  6. Record the image URI in the Worksheet for AWS hyperscaler.

For AWS instructions, see Pushing a Docker image.


Step 4: Create an S3 bucket for PDI-CLI

Create an S3 bucket for files the container needs at startup.

  1. Create an S3 bucket.

    See AWS docs: Creating a bucket.

  2. Record the bucket name in the Worksheet for AWS hyperscaler.

  3. Upload the required directories and files.

    See AWS docs: Uploading objects.

S3 bucket directories

Create these directories in the bucket as needed:

  • root/

    • Files in this directory are copied to /home/pentaho/data-integration/data in the container.

  • jdbc-drivers/

    • Put JDBC drivers here.

    • Files are copied to Pentaho’s lib directory.

  • plugins/

    • Put additional plugins here.

    • Files are copied to Pentaho’s plugins directory.

    • Organize each plugin in its own directory.

  • metastore/

    • Put metastore content here when jobs require it.

    • Copy your local .pentaho/ folder into this directory.

    • Content is copied to /home/pentaho/.pentaho in the container.

S3 bucket files

Upload these files as needed:

  • content-config.properties

    • Controls which S3 files are copied and where.

    • Add one line per copy instruction:

    • Example from the template:

  • content-config.sh

    • Optional script to configure files, change ownership, install missing apps, and more.

    • Runs after the other files are processed.


Step 5: Configure and execute PDI-CLI in AWS Batch

Create the AWS Batch resources and run a job using the PDI-CLI image.

Follow the AWS guidance at Getting started with AWS Batch.

  1. Create a compute environment.

  2. Create a job queue.

  3. Create a job definition.

    Set the container image to the ECR image URI from Step 3.

  4. Create a job.

  5. Set environment variables for your job:

    • PROJECT_S3_LOCATION

      • S3 location that contains the project files.

      • Example: s3://pentaho-samples/

    • METASTORE_LOCATION

      • S3 path to the metastore directory.

      • Content is copied to /home/pentaho/.pentaho in the container.

      • Example: metastore

    • PROJECT_STARTUP_JOB

      • Job (.kjb) path to run at startup.

      • Example: jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb

    • LICENSE_TOKEN

      • License token or license server URL.

      • Example: http://localhost:7070/license-server/request (sample)

    • PARAMETERS

      • Parameters passed to the job or transformation.

      • Example: -param:my_param_name=MYVALUE

You can now run jobs and transformations using PDI-CLI.
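A job can also be submitted from the AWS CLI; the job name, queue, and definition below are placeholders, while the environment variable values reuse the examples from step 5:

```shell
# Submit a PDI-CLI job to AWS Batch, overriding the container environment.
# Job name, queue, and definition are placeholder values.
aws batch submit-job \
  --job-name pentaho-pdi-cli-run \
  --job-queue <job-queue-name> \
  --job-definition <job-definition-name> \
  --container-overrides '{"environment":[
    {"name":"PROJECT_S3_LOCATION","value":"s3://pentaho-samples/"},
    {"name":"PROJECT_STARTUP_JOB","value":"jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb"},
    {"name":"PARAMETERS","value":"-param:my_param_name=MYVALUE"}]}'
```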

Worksheet for AWS hyperscaler

Use this worksheet to track values during setup:

  • ECR_IMAGE_URI (Platform/PDI Server and Carte Server only)

  • RDS_HOSTNAME (Platform/PDI Server and Carte Server only)

  • RDS_PORT (Platform/PDI Server and Carte Server only)

  • S3_BUCKET_NAME

  • EKS_CLUSTER_NAME (Platform/PDI Server and Carte Server only)

  • LICENSE_TOKEN
