Use YARN with S3
When you use the Start a PDI cluster on YARN and Stop a PDI cluster on YARN job entries to run a transformation that reads data from an Amazon S3 bucket, the transformation fails because the Pentaho metastore is not accessible to PDI on the cluster. To resolve this problem, make the Pentaho metastore accessible to PDI on the cluster.
Perform the following steps to make the Pentaho metastore accessible to PDI:

1. Navigate to the <user>/.pentaho/metastore directory on the machine with the PDI client.
2. On the cluster node where the YARN server is located, create a new directory in the design-tools/data-integration/plugins/pentaho-big-data-plugin directory, then copy the metastore directory into this location. This new directory is the <NEW_META_FOLDER_LOCATION> value used below.
3. Navigate to the design-tools/data-integration directory and open the carte.sh file with any text editor.
4. Add the following code on the line before the export OPT line, then save and close the file:
   OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>"
5. Create a zip file containing the contents of the data-integration directory.
6. In your Start a PDI cluster on YARN job entry, go to the Files tab of the Properties window, then enter the path to the zip file in the PDI Client Archive field.
This task resolves S3 access issues for the following transformation steps:
Avro Input
Avro Output
Orc Input
Orc Output
Parquet Input
Parquet Output
Text File Input
Text File Output