Use Hadoop with Pentaho

Pentaho provides a complete big data analytics solution that supports the entire big data analytics process. From big data aggregation, preparation, and integration, to interactive visualization, analysis, and prediction, Pentaho allows you to harvest the meaningful patterns buried in big data stores. Analyzing your big data sets gives you the ability to identify new revenue sources, develop loyal and profitable customer relationships, and run your organization more efficiently and cost effectively.

Pentaho, big data, and Hadoop

The term big data applies to very large, complex, or dynamic datasets that need to be stored and managed over a long time. To derive benefits from big data, you need the ability to access, process, and analyze data as it is being created. However, the size and structure of big data makes it very inefficient to maintain and process it using traditional relational databases.

Big data solutions re-engineer the components of traditional databases (data storage, retrieval, query, and processing) and scale them massively.

Pentaho big data overview

Pentaho increases speed-of-thought analysis against even the largest of big data stores by focusing on the features that deliver performance.

  • Instant access

    Pentaho provides visual tools to make it easy to define the sets of data that are important to you for interactive analysis. These data sets and associated analytics can be easily shared with others, and as new business questions arise, new views of data can be defined for interactive analysis.

  • High performance platform

    Pentaho is built on a modern, lightweight, high-performance platform that fully leverages 64-bit, multi-core processors and large memory spaces to make efficient use of contemporary hardware.

  • Extreme-scale, in-memory caching

    Pentaho is unique in leveraging external data grid technologies, such as Infinispan and Memcached, to load vast amounts of data into memory so that it is instantly available for speed-of-thought analysis.

  • Federated data integration

    Data can be extracted from multiple sources, including big data and traditional data stores, integrated together and then flowed directly into reports, without needing an enterprise data warehouse or data mart.

About Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

A Hadoop platform consists of a Hadoop kernel, a MapReduce model, a distributed file system, and often a number of related projects, such as Apache Hive, Apache HBase, and others.

A Hadoop Distributed File System, commonly referred to as HDFS, is a Java-based, distributed, scalable, and portable file system for the Hadoop framework.

Get started with Hadoop and PDI

Pentaho Data Integration (PDI) can operate in two distinct modes: job orchestration and data transformation. Within PDI they are called jobs and transformations.

PDI jobs sequence a set of entries that encapsulate actions. An example of a PDI big data job would be to check for new log files, copy the new files to HDFS, execute a MapReduce task to aggregate the weblog data into a click stream, and stage that click-stream data in an analytic database.

PDI transformations consist of a set of steps that execute in parallel and operate on a stream of data columns. With the default Pentaho engine, data typically flows from a source system through the PDI engine, where new columns are calculated or values are looked up and added to the stream. The data stream is then sent to a receiving system like a Hadoop cluster, a database, or the Pentaho Reporting engine. PDI job entries and transformation steps are described in the Pentaho Data Integration document.

Before you begin (Get started with Hadoop and PDI)

PDI includes job entries and transformation steps for Hadoop and MongoDB.

Your cluster administrator can configure the Pentaho Server to communicate with most Hadoop distributions. For details, see Set up Pentaho to connect to a Hadoop cluster.

For a list of supported big data technologies, see Components Reference.

Configure PDI for Hadoop connections

Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, or Sqoop.

Hadoop configurations are defined in the plugin.properties file and are designed to be easily switched within PDI by changing the active.hadoop.configuration property. The plugin.properties file resides in the pentaho-big-data-plugin/ folder.
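As an illustration, the relevant line of plugin.properties looks like the following (the configuration name hdp26 is an assumption; use the folder name that matches your distribution):

```properties
# pentaho-big-data-plugin/plugin.properties
# The value must match a configuration folder name under hadoop-configurations/
active.hadoop.configuration=hdp26
```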

All Hadoop configurations share a basic structure. Elements of the structure are described below:
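For illustration, a configuration folder might be laid out as follows (the distribution folder name is an example, and the listing is not exhaustive):

```
pentaho-big-data-plugin/
  hadoop-configurations/
    hdp26/                            (example configuration folder; name varies)
      lib/
      pmr/
      *.jar
      config.properties
      core-site.xml
      configuration-implementation.jar
```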

  • lib/: Libraries specific to the version of Hadoop with which this configuration was created to communicate.

  • pmr/: JAR files that contain libraries required for parsing data in input/output formats, or that are otherwise needed outside of any PDI-based execution.

  • *.jar: All other libraries required for the Hadoop configuration that are not client-only or special PMR JAR files that need to be available to the entire JVM of Hadoop job tasks.

  • config.properties: Contains metadata and configuration options for this Hadoop configuration. It provides a way to define a configuration name, additional classpath, and native libraries that the configuration requires. See the comments in this file for more details.

  • core-site.xml: Configuration file that can be replaced to set a site-specific configuration. For example, hdfs-site.xml would be used to configure HDFS.

  • configuration-implementation.jar: File that must be replaced to communicate with this configuration.

Include or exclude classes or packages for a Hadoop configuration

You can include or exclude classes or packages from loading with a Hadoop configuration.

Configure these options in the plugin.properties file in plugins/pentaho-big-data-plugin/. For details, see the comments in the file.

  • Include additional class paths or libraries

    Use the classpath property to include additional class paths or native libraries; a user-friendly configuration name is set in config.properties.

  • Exclude classes or packages

    Use the ignored.classes property to prevent duplicate loading by the Hadoop configuration class loader. This is required when a library expects a single shared class across class loaders, such as Apache Commons Logging.
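Both properties are set in plugin.properties. The values below are purely illustrative assumptions, not defaults:

```properties
# Append extra class paths or native libraries to the configuration (illustrative path)
classpath=/opt/custom/conf

# Keep these classes/packages out of the Hadoop configuration class loader so a
# single shared copy is used (Apache Commons Logging is the example from above)
ignored.classes=org.apache.commons.logging.
```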

Hadoop connection and access information

After your Hadoop cluster has been configured, users need information and permissions to connect to the cluster and access its services.

Pentaho

You need read access to these Pentaho directories:

  • Pentaho Server, Spoon, PRD, and PME directories for cluster drivers

  • Pentaho log directories

Hadoop cluster

You need this information about your Hadoop cluster. You can get it from your Hadoop administrator or the cluster management tool.

  • Installed version

  • Hostname and IP address for each cluster node, including YARN servers

  • If your cluster is enabled for high availability, the name of the name service (DNS lookup table)

Optional services

If you use one or more optional services, collect this information first.

  • HDFS

    • Hostname or IP address, NameNode port, and NameNode web console port

    • Paths to the directories you will use

    • Owners for the various data sets in HDFS

    • If you use S3, the access key and secret key

    • Required directory permissions

  • Hive2 and Impala

    • Username and password the service runs under

    • Hostname or IP address and port

    • JDBC URL (use the Thrift interface)

  • HBase

    • Zookeeper connection hostname

    • Zookeeper connection port

  • Oozie

    • URL to the Oozie web interface

    • JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)

    • NameNode hostname or IP address and port

  • Pentaho MapReduce (PMR)

    • Job History Server IP address and port

    • JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)

    • Hostname or IP address, NameNode port, and NameNode web console port

  • Sqoop

    • JDBC connection details for target or source databases

    • JDBC drivers

    • JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)

    • Hostname or IP address, NameNode port, and NameNode web console port

    • Username used to access HDFS

  • Spark

    • Master URL

    • Spark client location

    • JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)

    • Hostname or IP address, NameNode port, and NameNode web console port

    • Job History Server IP address and port

  • Zookeeper

    • Hostname or IP address

    • Port
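As an example of the Hive2 and Impala connection details above, a HiveServer2 JDBC URL over the Thrift interface typically has this shape (hostname, port, and database are placeholders; 10000 is the usual HiveServer2 default port):

```
jdbc:hive2://hive-host.example.com:10000/default
```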

Connect to your Hadoop clusters in the PDI client

You can establish connections in the PDI client to multiple Hadoop clusters and versions through drivers that act as adapters between Pentaho and your clusters. See the Pentaho Data Integration document for instructions.

Use PDI outside and inside the Hadoop cluster

When connections are established to one or more clusters, you can run PDI both outside of your Hadoop clusters and within the nodes of the clusters. See the Pentaho Data Integration document for details.

Advanced topics

The following topics help to extend your knowledge beyond basic setup and use:

Copy files to a Hadoop YARN cluster

If you start a job that runs on a YARN cluster, it may need extra files. Common examples include variables from kettle.properties.

Use the YARN Workspace folder to stage those files. At runtime, PDI copies the files to the YARN cluster.

This approach works well across dev, test, and prod. The job uses the correct KETTLE_HOME files for each environment.


Add files to the YARN Workspace folder

You can configure the Start a PDI Cluster on YARN job entry to copy these files at runtime:

  • kettle.properties

  • shared.xml

  • repositories.xml

You can also copy additional files to the folder manually.

Note: If you run the job locally, PDI copies configuration files from your local KETTLE_HOME. If you schedule the job or run it on Pentaho Server, PDI uses the server's configured KETTLE_HOME.

Step 1: Prepare your environment

  • Ensure the active Hadoop driver is configured.

  • Update these properties in yarn-site.xml:

    • yarn.application.classpath: Classpaths needed to execute YARN applications. Separate paths with a comma (,).

    • yarn.resourcemanager.hostname: Update the hostname to match your environment.

    • yarn.resourcemanager.address: Update hostname and port to match your environment.

    • yarn.resourcemanager.admin.address: Update hostname and port to match your environment.
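The yarn-site.xml updates above can be sketched as follows (hostnames are placeholders, 8032 and 8033 are the usual Resource Manager defaults, and the classpath value is illustrative):

```xml
<configuration>
  <!-- Classpaths needed to execute YARN applications; separate paths with a comma -->
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_HDFS_HOME/*,$HADOOP_YARN_HOME/*</value>
  </property>
  <!-- Resource Manager location for your environment -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm-host.example.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>rm-host.example.com:8033</value>
  </property>
</configuration>
```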

Step 2: Configure the job entry

  1. In Spoon, create or open a job that contains a Start a PDI Cluster on YARN job entry.

  2. Open the job entry.

  3. Under Copy Local Resource Files to YARN, select any combination of:

    • kettle.properties

    • shared.xml

    • repositories.xml

  4. Save and close the job entry.

Step 3: Add extra files (optional)

Copy any additional files to:

pentaho-big-data-plugin/workspace

Step 4: Run the job

Run the job. PDI copies the selected files to the workspace folder, then to the YARN cluster.

Delete files from the YARN Workspace folder

Remove files from:

pentaho-big-data-plugin/workspace

PDI big data transformation steps

You can use the following Pentaho Data Integration transformation steps to enable PDI to work with big data technologies:

  • CouchDB

  • Hadoop File Input

  • Hadoop File Output

  • HBase Input

  • HBase Output

  • HBase Row Decoder

  • Kafka Consumer

  • Kafka Producer

  • MapReduce Input

  • MapReduce Output

  • MongoDB Input

  • MongoDB Output

  • ORC Input

  • ORC Output

  • Parquet Input

  • Parquet Output

  • Splunk Input

  • Splunk Output

See the Transformation step reference in the Pentaho Data Integration document for details and additional job entries.

PDI big data job entries

You can use the following Pentaho Data Integration job entries to enable PDI to work with big data technologies:

  • Amazon EMR Job Executor

  • Amazon Hive Job Executor

  • Hadoop Copy Files

  • Hadoop Job Executor

  • Oozie Job Executor

  • Pentaho MapReduce

  • Sqoop Export

  • Sqoop Import

  • Start a PDI Cluster on YARN

  • Stop a PDI Cluster on YARN

See the Job entry reference in the Pentaho Data Integration document for details and additional job entries.


Troubleshooting possible Big Data issues

Follow the suggestions in these topics to help resolve common issues when working with Big Data.

See the Administer Pentaho Data Integration and Analytics document for additional troubleshooting information.

General configuration problems

Use these tables to troubleshoot common Big Data configuration issues.

Driver and configuration issues

  • Symptom: Could not find cluster configuration file config.properties for the cluster in expected metastore locations or a legacy shim configuration.

    Common causes: Incorrect cluster name. Named cluster configuration is missing addresses, ports, or security settings (if applicable). Driver version setup is incorrect. The Big Data plugin configuration is not valid for legacy mode. Named cluster configuration files are missing.

    Common resolutions: Verify the cluster name. Verify addresses, ports, and security settings (if applicable). Verify the driver version setup. If this happens in legacy mode, update transformations and jobs to use a named cluster definition. Verify the *-site.xml files are in the expected location. See Set up Pentaho to connect to a Hadoop cluster.

  • Symptom: Could not find service for interface associated with named cluster.

    Common causes: Incorrect cluster name. Cluster driver is not installed. After updating to Pentaho 9.0, older shims cannot be used. If you use a cluster driver, the driver version might be too old.

    Common resolutions: Verify the cluster name. Install a Pentaho 9.0 driver for the cluster. See Components Reference for supported versions. Edit the Hadoop cluster information to set the required vendor and driver version. If you use a cluster driver, update to a newer version. If you use HDP 2.6, CDH, or older versions, update the cluster driver version before running Pentaho 9.0.

  • Symptom: No driver.

    Common causes: Driver is installed in the wrong location.

    Common resolutions: Verify the correct driver .kar file is installed in the expected location. Check your distribution instructions in Set up Pentaho to connect to a Hadoop cluster. Verify SHIM_DRIVER_DEPLOYMENT_LOCATION in the user's kettle.properties file is set to DEFAULT. See Pentaho Data Integration for details on Kettle variables.

  • Symptom: Driver does not load.

    Common causes: You tried to load a shim that is not supported by your Pentaho version. Configuration file changes were made incorrectly.

    Common resolutions: Verify required licenses are installed and not expired. See Administer Pentaho Data Integration and Analytics for licensing details. Verify the driver is supported by your Pentaho version. See Components Reference. Restart the PDI client (Spoon), then test again. If the issue persists, download a fresh driver from the Support Portal.

  • Symptom: The file system's URL does not match the URL in the configuration file.

    Common causes: *-site.xml configuration files are not configured correctly.

    Common resolutions: Verify the *-site.xml files, including core-site.xml, are configured correctly. See Set up Pentaho to connect to a Hadoop cluster.

  • Symptom: Sqoop Unsupported major.minor version error.

    Common causes: In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses.

    Common resolutions: Verify the JDK meets requirements. See Components Reference. Verify the Pentaho Server JDK major version matches the cluster JDK major version.

Connection problems

  • Symptom: Hostname does not resolve.

    Common causes: No hostname is specified. Hostname or IP address is incorrect. DNS does not resolve the hostname correctly.

    Common resolutions: Verify the hostname or IP address. Verify DNS resolves the hostname correctly.

  • Symptom: Port number does not resolve.

    Common causes: Port number is incorrect or not numeric. No port number is specified. A port number is not required for high availability (HA) clusters.

    Common resolutions: Verify the port number. If the cluster uses HA, clear the port number and test again.

  • Symptom: Cannot connect to the cluster.

    Common causes: Firewall blocks the connection. Other network issues exist. A *-site.xml file is invalid. Address or port information is incorrect for the cluster or service.

    Common resolutions: Verify no firewall or network issue blocks the connection. Verify all *-site.xml files are well-formed XML. Verify addresses and ports for cluster services.

  • Symptom: Windows failure message: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

    Common causes: A required setting is missing. Windows cannot locate %HADOOP_HOME%\bin\winutils.exe. The HADOOP_HOME directory is not set.

    Common resolutions: Follow the instructions at https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems. Set the %HADOOP_HOME% environment variable to the directory that contains WINUTILS.EXE.

  • Symptom: Cannot access a Hive database (secured clusters only).

    Common causes: Two database connection parameters required to access Hive are not set in the PDI client.

    Common resolutions: Open hive-site.xml on the Hive server. Note the kerberos.principal and sasl.qop values. In the PDI client, open the database connection, select Options, then add sasl.qop and principal. Use the values from hive-site.xml, then save.
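For illustration only, the two parameters added under Options might look like the following (the realm and quality-of-protection value are assumptions; always copy the exact values from your own hive-site.xml):

```
principal   hive/_HOST@EXAMPLE.COM
sasl.qop    auth-conf
```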

Directory access or permissions issues

  • Symptom: Access error when trying to reach the user home directory.

    Common causes: Authorization or authentication issue.

    Common resolutions: Verify you have a user account on the cluster. Verify the cluster username matches the OS username that runs Pentaho.

  • Symptom: Cannot access directory.

    Common causes: Authorization or authentication issue. The directory does not exist on the cluster.

    Common resolutions: Verify the user has read, write, and execute permissions for the directory. Verify the cluster and driver security settings allow access. Verify the hostname and port are correct for the Hadoop file system NameNode.

  • Symptom: Cannot create, read, update, or delete files or directories.

    Common causes: Authorization or authentication issue.

    Common resolutions: Verify the user has execute permissions for the directory. Verify cluster and driver security settings allow access. Verify the hostname and port are correct for the Hadoop file system NameNode.

  • Symptom: Test file cannot be overwritten.

    Common causes: The Pentaho test file already exists in the directory.

    Common resolutions: If a previous test did not delete the file, delete it manually. Check the log for the test file name. If a different file with the same name exists, rename or remove it, then retest.

Oozie issues

  • Symptom: Cannot connect to Oozie.

    Common causes: Firewall blocks the connection. Other network issues exist. Oozie URL is incorrect.

    Common resolutions: Verify the Oozie URL. Verify no firewall blocks the connection.

Zookeeper problems

  • Symptom: Cannot connect to ZooKeeper.

    Common causes: Firewall blocks the connection to the ZooKeeper service. Other network issues exist.

    Common resolutions: Verify no firewall blocks the connection.

  • Symptom: ZooKeeper hostname or port is missing, not found, or does not resolve.

    Common causes: Hostname or IP address is missing or incorrect. Port number is missing or incorrect.

    Common resolutions: Try to connect to the ZooKeeper nodes using ping or another method. Verify the hostnames or IP addresses and ports are correct.

Kafka problems

  • Symptom: Cannot connect to Kafka.

    Common causes: Bootstrap server information is incorrect. The specified bootstrap server is down. Firewall blocks the connection.

    Common resolutions: Verify the bootstrap server value. Verify the bootstrap server is running. Verify no firewall blocks the connection.

Cannot access cluster with Kerberos enabled

If a step or entry cannot access a Kerberos authenticated cluster, review the steps in Set up Kerberos for Pentaho in the Administer Pentaho Data Integration and Analytics document.

If this issue persists, verify that the username, password, UID, and GID for each impersonated or spoofed user are the same on each node. When a user is deleted and recreated, it may receive different UIDs and GIDs, causing this issue.

Cannot access the Hive service on a cluster

If you cannot use Kerberos impersonation to authenticate and access the Hive service on a cluster, review the steps in the Pentaho Business Analytics document.

If this issue persists, copy the hive-site.xml file on the Hive server to the configuration directory of the named cluster connection in these directories:

  • Pentaho Server

    pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

  • PDI client

    data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

If the problem persists, disable pooled connections for Hive.

HBase Get Master Failed error

If HBase cannot establish the authenticated portion of the connection, copy the hbase-site.xml file from the HBase server to the configuration directory of the named cluster connection in these directories:

  • Pentaho Server:

    pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

  • PDI client:

    data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

Sqoop export fails

If a Sqoop export job generates the following error because a file already exists at the destination, then Sqoop failed to clear the compile directory:

Could not rename \tmp\sqoop-devuser\compile\1894e2403c37a663c12c752ab11d8e6a\aggregatehdfs.java to C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java. Error: Destination 'C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java' already exists.

Despite the error message, the job that generated it ended successfully. To stop this error message, you can add a Delete step to the job to remove the compile directory before execution of the Sqoop export step.

Sqoop import into Hive fails

If a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.

Verify the Hadoop connection information used by the local Hive installation is configured the same as the Sqoop job entry.

Group By step is not supported in a single threaded transformation engine

If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occur if:

  • An entire set of rows sharing the same grouping key are filtered from the transformation before the Group By step.

  • The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.

To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.

Kettle cluster on YARN will not start

When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.

Verify in the File System Path (in the Files tab) that the Default FS setting matches the configured hostname for the HDFS NameNode, then try starting the Kettle cluster again.
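For example, a Default FS value typically takes this form (the hostname is a placeholder; 8020 is a common NameNode port):

```
hdfs://namenode.example.com:8020
```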

Hadoop on Windows

If you are using Hadoop on Windows, you may get an "unexpected error" message. This message indicates that multiple cluster support across different versions of Hadoop is not available on Windows.

You are limited to using the same version of Hadoop for multiple cluster use on Windows. If you have problems accessing the Hadoop file system on a Windows machine, see the Problems running Hadoop on Windows article on the Hadoop Wiki site.

Legacy mode activated when named cluster configuration cannot be located

If you run a transformation or job for which PDI cannot locate and load a named cluster configuration, then PDI activates a legacy mode. This legacy, or fallback, mode is only available in Pentaho 9.0 and later.

When the legacy mode is activated, PDI attempts to run the transformation by finding any existing cluster configuration you have set up in the PDI Big Data plugin. PDI then migrates the existing configuration to the latest PDI instance that you are currently running.

Note: You cannot connect to more than one cluster.

The legacy mode is helpful for transformations built with previous versions of PDI that include individual steps not associated with a named cluster. You can run the transformation in legacy mode without revising the cluster configuration in each individual step. For information about setting up a named connection, see the Pentaho Data Integration document.

When legacy mode is active, the transformation log displays the following message:

Could not find cluster configuration file {0} for cluster {1} in expected metastore locations or a legacy shim configuration.

If the Big Data plugin is present and PDI accesses it to successfully activate legacy mode, the transformation log displays the following message:

Cluster configuration not found in expected location; trying legacy configuration location.

For more information about working with clusters, see Get started with Hadoop and PDI.

Unable to read or write files to HDFS on the Amazon EMR cluster

When you run a transformation on an EMR cluster, the transformation can appear to run successfully while only an empty file is written to the cluster. This occurs when PDI is not installed on the Amazon EC2 instance where you are running your transformation: PDI cannot read or write files on the HDFS cluster, and any files it writes are empty.

To resolve this issue, perform the following steps to edit the hdfs-site.xml file on the PDI client:

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.

  2. Open the hdfs-site.xml file with any text editor.

  3. Add the following code:

  4. Save and close the file.
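The code for step 3 is not shown above. As an assumption based on the usual fix for this symptom (an HDFS client running outside the cluster's network), the property added to hdfs-site.xml is commonly:

```xml
<!-- Illustrative: tells the HDFS client to connect to DataNodes by hostname,
     which external clients usually need when DataNode internal IPs are not routable -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```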

Use YARN with S3

When using the Start a PDI Cluster on YARN and Stop a PDI Cluster on YARN job entries to run a transformation that reads data from an Amazon S3 bucket, the transformation fails because the Pentaho metastore is not accessible to PDI on the cluster.

Perform the following steps to make the Pentaho metastore accessible to PDI:

  1. Navigate to the <user>/.pentaho/metastore directory on the machine with the PDI client.

  2. On the cluster where the Yarn server is located, create a new directory in the design-tools/data-integration/plugins/pentaho-big-data-plugin directory, then copy the metastore directory into this location. This directory is the <NEW_META_FOLDER_LOCATION> variable.

  3. Navigate to the design-tools/data-integration directory and open the carte.sh file with any text editor.

  4. Add the following line immediately before the export OPT line: OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>", then save and close the file.

  5. Create a zip file containing the contents of the data-integration directory.

  6. In your Start a PDI cluster on YARN job entry, go to the Files tab of the Properties window, then locate the PDI Client Archive field. Enter the filepath for the zip file.
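The carte.sh edit in step 4 can be sketched as follows (<NEW_META_FOLDER_LOCATION> stands for the directory you created in step 2):

```
# In design-tools/data-integration/carte.sh, immediately before the existing
# "export OPT" line, point PDI at the copied metastore:
OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>"
```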

This task resolves S3 access issues for the following transformation steps:

  • Avro Input

  • Avro Output

  • Orc Input

  • Orc Output

  • Parquet Input

  • Parquet Output

  • Text File Input

  • Text File Output

Data Catalog searches returning incomplete or missing data

If you have a transformation that contains the Catalog Input, Catalog Output, Read Metadata, or Write Metadata steps, a search may not return the complete set of matching records in Data Catalog. This error can occur if the default limit, which prevents PDI from exceeding memory limits or timing out connections to PDC, is too low for your environment.

To resolve this issue:

  1. Design your transformation.

  2. Right-click on the canvas to open the Transformation properties dialog box.

  3. In the Parameters tab, add the following parameter:

    catalog-result-limit

  4. In the Default Value column, enter a number greater than the default value of 25, for example 500.

  5. Run your transformation.

Note: This parameter does not directly control the number of records received from Data Catalog; it limits the sub-queries used to retrieve those records. You may therefore need to make additional adjustments to establish the correct limit for your environment.
