Get started with Hadoop and PDI

Pentaho Data Integration (PDI) can operate in two distinct modes: job orchestration and data transformation. Within PDI they are called jobs and transformations.

PDI jobs sequence a set of entries that encapsulate actions. An example of a PDI big data job would be to check for new log files, copy the new files to HDFS, execute a MapReduce task to aggregate the weblog into a click stream, and stage that click stream data in an analytic database.
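Jobs like this are normally designed and scheduled from the PDI client (Spoon), but they can also be launched programmatically. The sketch below is a minimal example, assuming the classic PDI (Kettle) Java API (org.pentaho.di.*) is on the classpath; the job file name weblog_to_hdfs.kjb is a hypothetical placeholder for a job saved from the PDI client.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunWeblogJob {
    public static void main(String[] args) throws Exception {
        // Initialize the PDI (Kettle) environment: plugins, logging, etc.
        KettleEnvironment.init();

        // Load a job definition saved from the PDI client (file name is hypothetical).
        JobMeta jobMeta = new JobMeta("weblog_to_hdfs.kjb", null);

        // Execute the job entries in sequence (no repository) and wait for completion.
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        if (job.getErrors() > 0) {
            System.err.println("Job finished with errors.");
        }
    }
}
```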

PDI transformations consist of a set of steps that execute in parallel and operate on a stream of data columns. With the default Pentaho engine, data usually flows from a source system through the PDI engine, where new columns are calculated or values are looked up and added to the stream. The data stream is then sent to a receiving system such as a Hadoop cluster, a database, or the Pentaho Reporting engine. PDI job entries and transformation steps are described in the Pentaho Data Integration document.
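Transformations can likewise be executed outside the PDI client. The following minimal sketch assumes the same classic PDI Java API as the job example above; the transformation file clickstream_enrich.ktr is a hypothetical placeholder for a transformation saved from the PDI client.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunClickstreamTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the PDI (Kettle) environment.
        KettleEnvironment.init();

        // Load a transformation definition (file name is hypothetical).
        TransMeta transMeta = new TransMeta("clickstream_enrich.ktr");

        // Run all steps in parallel and wait for the data stream to drain.
        Trans trans = new Trans(transMeta);
        trans.execute(null);      // no command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors.");
        }
    }
}
```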

You can also run transformations on the Spark engine. Pentaho uses the Adaptive Execution Layer (AEL) for this: AEL builds a transformation definition for Spark and moves execution directly to the cluster, leveraging Spark's ability to coordinate large amounts of data over multiple nodes. See the Administer Pentaho Data Integration and Analytics document for details.
