Use Hadoop with Pentaho
Pentaho provides a complete big data analytics solution that supports the entire big data analytics process. From big data aggregation, preparation, and integration, to interactive visualization, analysis, and prediction, Pentaho allows you to harvest the meaningful patterns buried in big data stores. Analyzing your big data sets gives you the ability to identify new revenue sources, develop loyal and profitable customer relationships, and run your organization more efficiently and cost effectively.
Pentaho, big data, and Hadoop
The term big data applies to very large, complex, or dynamic datasets that need to be stored and managed over a long time. To derive benefits from big data, you need the ability to access, process, and analyze data as it is being created. However, the size and structure of big data makes it very inefficient to maintain and process it using traditional relational databases.
Big data solutions re-engineer the components of traditional databases (data storage, retrieval, query, and processing) and massively scale them.
Pentaho big data overview
Pentaho increases speed-of-thought analysis against even the largest of big data stores by focusing on the features that deliver performance.
Instant access
Pentaho provides visual tools to make it easy to define the sets of data that are important to you for interactive analysis. These data sets and associated analytics can be easily shared with others, and as new business questions arise, new views of data can be defined for interactive analysis.
High performance platform
Pentaho is built on a modern, lightweight, high-performance platform that fully exploits 64-bit, multi-core processors and large memory spaces to harness the power of contemporary hardware.
Extreme-scale, in-memory caching
Pentaho is unique in leveraging external data grid technologies, such as Infinispan and Memcached, to load vast amounts of data into memory so that it is instantly available for speed-of-thought analysis.
Federated data integration
Data can be extracted from multiple sources, including big data and traditional data stores, integrated, and then flowed directly into reports without needing an enterprise data warehouse or data mart.
About Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
A Hadoop platform consists of a Hadoop kernel, a MapReduce model, a distributed file system, and often a number of related projects, such as Apache Hive, Apache HBase, and others.
A Hadoop Distributed File System, commonly referred to as HDFS, is a Java-based, distributed, scalable, and portable file system for the Hadoop framework.
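For example, a file in HDFS is addressed with an hdfs:// URI. The hostname, port, and path below are placeholders (8020 is a common NameNode port default, but your cluster may differ):

```
hdfs://namenode.example.com:8020/user/pentaho/weblogs/part-00000
```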
Get started with Hadoop and PDI
Pentaho Data Integration (PDI) can operate in two distinct modes: job orchestration and data transformation. Within PDI they are called jobs and transformations.
PDI jobs sequence a set of entries that encapsulate actions. An example of a PDI big data job would be to check for new log files, copy the new files to HDFS, execute a MapReduce task to aggregate the weblog into a click stream, and stage that click stream data in an analytic database.
PDI transformations consist of a set of steps that execute in parallel and operate on a stream of data columns. With the default Pentaho engine, data usually flows from a source system through steps where new columns are calculated or values are looked up and added to the stream. The data stream is then sent to a receiving system like a Hadoop cluster, a database, or the Pentaho Reporting engine. PDI job entries and transformation steps are described in the Pentaho Data Integration document.
Before you begin (Get started with Hadoop and PDI)
PDI includes job entries and transformation steps for Hadoop and MongoDB.
Your cluster administrator can configure the Pentaho Server to communicate with most Hadoop distributions. For details, see Set up Pentaho to connect to a Hadoop cluster.
For a list of supported big data technologies, see Components Reference.
Configure PDI for Hadoop connections
Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, or Sqoop.
Hadoop configurations are defined in the plugin.properties file and are designed to be easily switched within PDI by changing the active.hadoop.configuration property. The plugin.properties file resides in the pentaho-big-data-plugin/ folder.
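As an illustration, the active configuration is selected with a single property. The folder name shown here is a placeholder; use the name of the configuration folder that matches your distribution:

```
# pentaho-big-data-plugin/plugin.properties (excerpt; value is illustrative)
active.hadoop.configuration=my-distribution
```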
All Hadoop configurations share the same basic directory structure, made up of the following elements:
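As a sketch, a configuration folder might be laid out like this (the configuration name is illustrative; each element is described below):

```
hadoop-configurations/
  my-distribution/
    lib/
      pmr/
      *.jar
    config.properties
    core-site.xml
    configuration-implementation.jar
```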
lib/
Libraries specific to the version of Hadoop with which this configuration was created to communicate.
pmr/
Jar files that contain libraries required for parsing data in input/output formats or otherwise outside of any PDI-based execution.
*.jar
All other libraries required for Hadoop configuration that are not client-only or special PMR JAR files that need to be available to the entire JVM of Hadoop job tasks.
config.properties
Contains metadata and configuration options for this Hadoop configuration. It provides a way to define a configuration name, additional classpath, and native libraries that the configuration requires. See the comments in this file for more details.
core-site.xml
Configuration file that can be replaced to set a site-specific configuration. For example, hdfs-site.xml would be used to configure HDFS.
configuration-implementation.jar
File that must be replaced to communicate with this configuration.
Include or exclude classes or packages for a Hadoop configuration
You can include or exclude classes or packages from loading with a Hadoop configuration.
Configure these options in the plugin.properties file in plugins/pentaho-big-data-plugin/. For details, see the comments in the file.
Include additional class paths or libraries
Use the classpath property to include additional class paths, native libraries, or a user-friendly configuration name.
Exclude classes or packages
Use the ignored.classes property to prevent duplicate loading by the Hadoop configuration class loader. This is required when a library expects a single shared class across class loaders, such as Apache Commons Logging.
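A hedged sketch of these two properties in plugin.properties; the path and package name below are examples only, not defaults:

```
# Extra class paths or native libraries for this configuration (example value)
classpath=/opt/custom/lib/extra-auth.jar
# Prevent the Hadoop configuration class loader from loading its own copy
# of a library that must be shared across class loaders (example value)
ignored.classes=org.apache.commons.logging
```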
Hadoop connection and access information
After your Hadoop cluster has been configured, users need information and permissions to connect to the cluster and access its services.
Pentaho
You need read access to these Pentaho directories:
Pentaho Server, Spoon, PRD, and PME directories for cluster drivers
Pentaho log directories
Hadoop cluster
You need this information about your Hadoop cluster. You can get it from your Hadoop administrator or the cluster management tool.
Installed version
Hostname and IP address for each cluster node, including YARN servers
If your cluster is enabled for high availability, the name of the name service (DNS lookup table)
Optional services
If you use one or more optional services, collect this information first.
HDFS
Hostname or IP address, NameNode port, and NameNode web console port
Paths to the directories you will use
Owners for the various data sets in HDFS
If you use S3, the access key and secret key
Required directory permissions
Hive2 and Impala
Username and password the service runs under
Hostname or IP address and port
JDBC URL (use the Thrift interface)
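For reference, a Hive2 JDBC URL over the Thrift interface typically takes this form; the hostname, port, and database name are placeholders for your environment:

```
jdbc:hive2://hive-host.example.com:10000/default
```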
HBase
Zookeeper connection hostname
Zookeeper connection port
Oozie
URL to the Oozie web interface
JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)
NameNode hostname or IP address and port
Pentaho MapReduce (PMR)
Job History Server IP address and port
JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)
Hostname or IP address, NameNode port, and NameNode web console port
Sqoop
JDBC connection details for target or source databases
JDBC drivers
JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)
Hostname or IP address, NameNode port, and NameNode web console port
Username used to access HDFS
Spark
Master URL
Spark client location
JobTracker hostname or IP address and port (or Resource Manager hostname or IP address and port)
Hostname or IP address, NameNode port, and NameNode web console port
Job History Server IP address and port
Zookeeper
Hostname or IP address
Port
Connect to your Hadoop clusters in the PDI client
You can establish connections in the PDI client to multiple Hadoop clusters and versions through drivers that act as adapters between Pentaho and your clusters. See the Pentaho Data Integration document for instructions.
Use PDI outside and inside the Hadoop cluster
When connections are established to one or more clusters, you can use PDI to execute both outside of your Hadoop clusters and within the nodes of the clusters. See the Pentaho Data Integration document for details.
Advanced topics
The following topics help to extend your knowledge beyond basic setup and use:
Copy files to a Hadoop YARN cluster
If you start a job that runs on a YARN cluster, it may need extra files. Common examples include variables from kettle.properties.
Use the YARN Workspace folder to stage those files. At runtime, PDI copies the files to the YARN cluster.
This approach works well across dev, test, and prod. The job uses the correct KETTLE_HOME files for each environment.
PDI copies files in the YARN Workspace folder every time you run a job that starts the YARN Kettle Cluster.
To avoid overwriting files with the same name on the cluster, do one or both:
Delete files from the YARN Workspace folder.
In the Start a PDI Cluster on YARN job entry, clear the relevant Copy Local Resource Files to YARN checkboxes.
Add files to the YARN Workspace folder
You can configure the Start a PDI Cluster on YARN job entry to copy these files at runtime:
kettle.properties
shared.xml
repositories.xml
You can also copy additional files to the folder manually.
If you run the job locally, PDI copies configuration files from your local KETTLE_HOME.
If you schedule the job or run it on Pentaho Server, PDI uses the server’s configured KETTLE_HOME.
Prepare your environment
Ensure the active Hadoop driver is configured.
Update these properties in yarn-site.xml:
yarn.application.classpath: Classpaths needed to execute YARN applications. Separate paths with a comma (,).
yarn.resourcemanager.hostname: Update the hostname to match your environment.
yarn.resourcemanager.address: Update the hostname and port to match your environment.
yarn.resourcemanager.admin.address: Update the hostname and port to match your environment.
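For illustration, the updated properties might look like this in yarn-site.xml. The hostnames are placeholders, and the ports shown are common YARN defaults that may differ in your environment:

```xml
<configuration>
  <property>
    <name>yarn.application.classpath</name>
    <!-- Comma-separated classpaths needed by YARN applications -->
    <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_HDFS_HOME/*,$HADOOP_YARN_HOME/*</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.example.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>resourcemanager.example.com:8033</value>
  </property>
</configuration>
```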
Configure the job entry
In Spoon, create or open a job that contains a Start a PDI Cluster on YARN job entry.
Open the job entry.
Under Copy Local Resource Files to YARN, select any combination of:
kettle.properties
shared.xml
repositories.xml
Save and close the job entry.
Add extra files (optional)
Copy any additional files to:
pentaho-big-data-plugin/workspace
Run the job
Run the job. PDI copies the selected files to the workspace folder, then to the YARN cluster.
Delete files from the YARN Workspace folder
Remove files from:
pentaho-big-data-plugin/workspace
PDI big data transformation steps
You can use the following Pentaho Data Integration transformation steps to help enable PDI to work with big data technologies:
CouchDB
Hadoop File Input
Hadoop File Output
HBase Input
HBase Output
HBase Row Decoder
Kafka Consumer
Kafka Producer
MapReduce Input
MapReduce Output
MongoDB Input
MongoDB Output
ORC Input
ORC Output
Parquet Input
Parquet Output
Splunk Input
Splunk Output
See the Transformation step reference in the Pentaho Data Integration document for details and additional job entries.
PDI big data job entries
You can use the following Pentaho Data Integration job entries to help enable PDI to work with big data technologies:
Amazon EMR Job Executor
Amazon Hive Job Executor
Hadoop Copy Files
Hadoop Job Executor
Oozie Job Executor
Pentaho MapReduce
Sqoop Export
Sqoop Import
Start a PDI Cluster on YARN
Stop a PDI Cluster on YARN
See the Job entry reference in the Pentaho Data Integration document for details and additional job entries.
Big data resources
The following resources may help in understanding big data architecture and components:
Apache Hadoop project: A project that contains libraries allowing for the distributed processing of large data sets across clusters of computers using simple programming models. It includes several modules, such as the Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to application data, and Hadoop MapReduce, a key algorithm for distributing work around a cluster.
HBase: A scalable, distributed database that supports structured data storage for large tables
Hive: A data warehouse infrastructure that provides data summarization and on-demand querying
ZooKeeper: A high-performance coordination service for distributed applications
MongoDB: A NoSQL open source document-oriented database system developed and supported by 10gen
Splunk: A data collection, visualization and indexing engine for operational intelligence that is developed by Splunk, Inc.
CouchDB: A NoSQL open source document-oriented database system developed and supported by Apache
Sqoop: Software for transferring data between relational databases and Hadoop
Oozie: A workflow scheduler system to manage Hadoop jobs
Troubleshooting possible Big Data issues
Follow the suggestions in these topics to help resolve common issues when working with Big Data:
See the Administer Pentaho Data Integration and Analytics document for additional troubleshooting information.
General configuration problems
Use these tables to troubleshoot common Big Data configuration issues.
Driver and configuration issues
Could not find cluster configuration file config.properties for the cluster in expected metastore locations or a legacy shim configuration.
Incorrect cluster name. Named cluster configuration is missing addresses, ports, or security settings (if applicable). Driver version setup is incorrect. The Big Data plugin configuration is not valid for legacy mode. Named cluster configuration files are missing.
Verify the cluster name. Verify addresses, ports, and security settings (if applicable). Verify the driver version setup. If this happens in legacy mode, update transformations and jobs to use a named cluster definition. Verify the *-site.xml files are in the expected location. See Set up Pentaho to connect to a Hadoop cluster.
Could not find service for interface associated with named cluster.
Incorrect cluster name. Cluster driver is not installed. After updating to Pentaho 9.0, older shims cannot be used. If you use a cluster driver, the driver version might be too old.
Verify the cluster name. Install a Pentaho 9.0 driver for the cluster. See Components Reference for supported versions. Edit the Hadoop cluster information to set the required vendor and driver version. If you use a cluster driver, update to a newer version. If you use HDP 2.6, CDH, or older versions, update the cluster driver version before running Pentaho 9.0.
No driver.
Driver is installed in the wrong location.
Verify the correct driver .kar file is installed in the expected location. Check your distribution instructions in Set up Pentaho to connect to a Hadoop cluster. Verify SHIM_DRIVER_DEPLOYMENT_LOCATION in the user's kettle.properties file is set to DEFAULT. See Pentaho Data Integration for details on Kettle variables.
Driver does not load.
You tried to load a shim that is not supported by your Pentaho version. Configuration file changes were made incorrectly.
Verify required licenses are installed and not expired. See Administer Pentaho Data Integration and Analytics for licensing details. Verify the driver is supported by your Pentaho version. See Components Reference. Restart the PDI client (Spoon), then test again. If the issue persists, download a fresh driver from the Support Portal.
The file system's URL does not match the URL in the configuration file.
*-site.xml configuration files are not configured correctly.
Verify the *-site.xml files, including core-site.xml, are configured correctly. See Set up Pentaho to connect to a Hadoop cluster.
Sqoop Unsupported major.minor version error.
In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses.
Verify the JDK meets requirements. See Components Reference. Verify the Pentaho Server JDK major version matches the cluster JDK major version.
Connection problems
Hostname does not resolve.
No hostname is specified. Hostname or IP address is incorrect. DNS does not resolve the hostname correctly.
Verify the hostname or IP address. Verify DNS resolves the hostname correctly.
Port number does not resolve.
Port number is incorrect. Port number is not numeric. Port number is not required for high availability (HA) clusters. No port number is specified.
Verify the port number. If the cluster uses HA, clear the port number and test again.
Cannot connect to the cluster.
Firewall blocks the connection. Other network issues exist. A *-site.xml file is invalid. Address or port information is incorrect for the cluster or service.
Verify no firewall or network issue blocks the connection. Verify all *-site.xml files are well-formed XML. Verify addresses and ports for cluster services.
Windows failure message: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
A required setting is missing. Windows cannot locate %HADOOP_HOME%\bin\winutils.exe. The HADOOP_HOME environment variable is not set.
Follow the instructions at https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems. Set the HADOOP_HOME environment variable to the Hadoop installation directory that contains bin\winutils.exe.
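For example, from a Windows command prompt you might set the variable as follows; the install path is a placeholder, and winutils.exe must already exist under the bin subdirectory of that path:

```
rem Illustrative path; use your actual Hadoop installation directory
setx HADOOP_HOME "C:\hadoop"
rem winutils.exe is then expected at C:\hadoop\bin\winutils.exe
```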
Cannot access a Hive database (secured clusters only).
To access Hive, you must set two database connection parameters in the PDI client.
Open hive-site.xml on the Hive server. Note the kerberos.principal and sasl.qop values. In the PDI client, open the database connection, select Options, then add sasl.qop and principal. Use the values from hive-site.xml, then save.
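As a sketch, the two parameters added under Options might carry values like these; the Kerberos realm and QOP level are placeholders, and the actual values must be copied from your hive-site.xml:

```
principal=hive/_HOST@EXAMPLE.COM
sasl.qop=auth-conf
```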
Directory access or permissions issues
Access error when trying to reach the user home directory.
Authorization or authentication issue.
Verify you have a user account on the cluster. Verify the cluster username matches the OS username that runs Pentaho.
Cannot access directory.
Authorization or authentication issue. The directory does not exist on the cluster.
Verify the user has read, write, and execute permissions for the directory. Verify the cluster and driver security settings allow access. Verify the hostname and port are correct for the Hadoop file system NameNode.
Cannot create, read, update, or delete files or directories.
Authorization or authentication issue.
Verify the user has execute permissions for the directory. Verify cluster and driver security settings allow access. Verify the hostname and port are correct for the Hadoop file system NameNode.
Test file cannot be overwritten.
The Pentaho test file already exists in the directory.
If a previous test did not delete the file, delete it manually. Check the log for the test file name. If a different file with the same name exists, rename or remove it, then retest.
Oozie issues
Cannot connect to Oozie
Firewall blocks the connection. Other network issues exist. Oozie URL is incorrect.
Verify the Oozie URL. Verify no firewall blocks the connection.
Zookeeper problems
Cannot connect to ZooKeeper
Firewall blocks the connection to the ZooKeeper service. Other network issues exist.
Verify no firewall blocks the connection.
ZooKeeper hostname or port is missing, not found, or does not resolve
Hostname or IP address is missing or incorrect. Port number is missing or incorrect.
Try to connect to the ZooKeeper nodes using ping or another method. Verify the hostnames or IP addresses and ports are correct.
Kafka problems
Cannot connect to Kafka
Bootstrap server information is incorrect. The specified bootstrap server is down. Firewall blocks the connection.
Verify the bootstrap server value. Verify the bootstrap server is running. Verify no firewall blocks the connection.
Cannot access cluster with Kerberos enabled
If a step or entry cannot access a Kerberos authenticated cluster, review the steps in Set up Kerberos for Pentaho in the Administer Pentaho Data Integration and Analytics document.
If this issue persists, verify that the username, password, UID, and GID for each impersonated or spoofed user are the same on each node. When a user is deleted and recreated, the user may be assigned different UIDs and GIDs, causing this issue.
Cannot access the Hive service on a cluster
If you cannot use Kerberos impersonation to authenticate and access the Hive service on a cluster, review the steps in the Pentaho Business Analytics document.
If this issue persists, copy the hive-site.xml file on the Hive server to the configuration directory of the named cluster connection in these directories:
Pentaho Server:
pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
PDI client:
data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
If the problem persists, disable pooled connections for Hive.
HBase Get Master Failed error
If HBase cannot establish the authenticated portion of the connection, copy the hbase-site.xml file from the HBase server to the configuration directory of the named cluster connection in these directories:
Pentaho Server:
pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
PDI client:
data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
Sqoop export fails
If a Sqoop export job generates the following error because a file already exists at the destination, Sqoop failed to clear the compile directory:
Could not rename \tmp\sqoop-devuser\compile\1894e2403c37a663c12c752ab11d8e6a\aggregatehdfs.java to C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java. Error: Destination 'C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java' already exists.
Despite the error message, the job that generated it ended successfully. To stop this error message, you can add a Delete step to the job to remove the compile directory before execution of the Sqoop export step.
Sqoop import into Hive fails
If a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.
Verify the Hadoop connection information used by the local Hive installation is configured the same as the Sqoop job entry.
Group By step is not supported in a single threaded transformation engine
If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occur if:
An entire set of rows sharing the same grouping key are filtered from the transformation before the Group By step.
The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.
To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.
Kettle cluster on YARN will not start
When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.
Verify in the File System Path (in the Files tab) that the Default FS setting matches the configured hostname for the HDFS Name node, then try starting the kettle cluster again.
Hadoop on Windows
If you are using Hadoop on Windows, you may get an "unexpected error" message. This message indicates that multiple cluster support across different versions of Hadoop is not available on Windows.
You are limited to using the same version of Hadoop for multiple cluster use on Windows. If you have problems accessing the Hadoop file system on a Windows machine, see the Problems running Hadoop on Windows article on the Hadoop Wiki site.
Legacy mode activated when named cluster configuration cannot be located
If you run a transformation or job for which PDI cannot locate and load a named cluster configuration, then PDI activates a legacy mode. This legacy, or fallback, mode is only available in Pentaho 9.0 and later.
When the legacy mode is activated, PDI attempts to run the transformation by finding any existing cluster configuration you have set up in the PDI Big Data plugin. PDI then migrates the existing configuration to the latest PDI instance that you are currently running.
Note: You cannot connect to more than one cluster.
The legacy mode is helpful for transformations built with previous versions of PDI and includes individual steps that are not associated to a named cluster. You can run the transformation in legacy mode without revising the cluster configuration in each individual step. For information about setting up a named connection, see the Pentaho Data Integration document.
When legacy mode is active, the transformation log displays the following message:
Could not find cluster configuration file {0} for cluster {1} in expected metastore locations or a legacy shim configuration.
If the Big Data plugin is present and PDI accesses it to successfully activate legacy mode, the transformation log displays the following message:
Cluster configuration not found in expected location; trying legacy configuration location.
For more information about working with clusters, see Get started with Hadoop and PDI.
Unable to read or write files to HDFS on the Amazon EMR cluster
When running a transformation on an EMR cluster, the transformation appears to run successfully, but an empty file is written to the cluster. When PDI is not installed on the Amazon EC2 instance where you are running your transformation, you are unable to read or write files to the HDFS cluster. Any files written to the cluster are empty.
To resolve this issue, perform the following steps to edit the hdfs-site.xml file on the PDI client:
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.
Open the hdfs-site.xml file with any text editor.
Add the following code:
Save and close the file.
Use YARN with S3
When using the Start a PDI cluster on YARN and Stop a PDI cluster on YARN job entries to run a transformation that reads data from an Amazon S3 bucket, the transformation fails because the Pentaho metastore is not accessible to PDI on the cluster.
Perform the following steps to make the Pentaho metastore accessible to PDI:
Navigate to the <user>/.pentaho/metastore directory on the machine with the PDI client.
On the cluster where the YARN server is located, create a new directory in the design-tools/data-integration/plugins/pentaho-big-data-plugin directory, then copy the metastore directory into this location. This directory is the <NEW_META_FOLDER_LOCATION> variable.
Navigate to the design-tools/data-integration directory and open the carte.sh file with any text editor.
Add the following code in the line before the export OPT line: OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>", then save and close the file.
Create a zip file containing the contents of the data-integration directory.
In your Start a PDI cluster on YARN job entry, go to the Files tab of the Properties window, then locate the PDI Client Archive field. Enter the filepath for the zip file.
This task resolves S3 access issues for the following transformation steps:
Avro Input
Avro Output
Orc Input
Orc Output
Parquet Input
Parquet Output
Text File Input
Text File Output
Data Catalog searches returning incomplete or missing data
If you have a transformation that contains the Catalog Input, Catalog Output, Read Metadata, or Write Metadata steps, there may be instances when a complete search of the records in Data Catalog is not performed. This error can occur if the default limit, which prevents PDI from exceeding memory limits or timing out connections to Data Catalog, is too low for your environment.
To resolve this issue:
Design your transformation.
Right-click on the canvas to open the Transformation properties dialog box.
In the Parameters tab, add the catalog-result-limit parameter.
In the Default Value column, enter a number greater than the default value of 25, for example, 500.
Run your transformation.
Note: This parameter does not directly control the number of records received from Data Catalog; rather, it limits the sub-queries used to retrieve those records. Therefore, you may need to make additional adjustments to find the correct limit for your environment.