Big Data Sources: Details

This table shows the Big Data sources that are compatible with specific Pentaho tools.

| Data Source | Versions | Analyzer | PIR/PDD | Pentaho Reporting | DSW | PDI Server/Client | PRD | PSW | PME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon EMR | 7.0.0 [e] (Certified) | No | No | No | No | Yes | Yes | No | No |
| Apache Vanilla Hadoop | 3.3.0 (Certified) | No | No | No | Yes | Yes | No | No | No |
| Cassandra (Datastax) | 6.8 (Certified) | No | No | No | No | Yes | No | No | No |
| Cloudera Data Platform (CDP) Private Cloud | 7.1.9 (for job execution) | No | No | No | No | Yes | Yes | No | Yes |
|  | via Impala (as data source) | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes |
|  | via Hive3 [a] (as data source) | No | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Google BigQuery | 1.5.4.1008 [b] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Google Dataproc [c] (for job execution) | 2.1 [d] | No | No | No | No | Yes | Yes | No | No |
|  | via Hive2 and Google BigQuery (as data source) | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Greenplum | 4.3 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Microsoft Azure HDInsight | 4.0 | Yes | Yes | No | No | Yes | No | No | Yes |
| MongoDB | 7 | No | No | Yes | No | Yes | Yes | No | No |
| Vertica | 11 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
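For scripted checks, the compatibility matrix can be encoded as a small lookup. The following is only a minimal sketch: the names `TOOLS`, `ROWS`, `supported_tools`, and `is_supported` are hypothetical, and only a handful of rows are transcribed from the table for illustration.

```python
# Tool columns in the order they appear in the table header.
TOOLS = [
    "Analyzer", "PIR/PDD", "Pentaho Reporting", "DSW",
    "PDI Server/Client", "PRD", "PSW", "PME",
]

# Illustrative subset of rows from the table: "Y" = Yes, "N" = No,
# one flag per tool column. Extend with the remaining rows as needed.
ROWS = {
    "Amazon EMR": "NNNNYYNN",
    "Greenplum": "YYYYYYYY",
    "Microsoft Azure HDInsight": "YYNNYNNY",
    "MongoDB": "NNYNYYNN",
    "Vertica": "YYYYYYYY",
}


def supported_tools(source):
    """Return the tools marked Yes for the given data source."""
    flags = ROWS[source]
    return [tool for tool, flag in zip(TOOLS, flags) if flag == "Y"]


def is_supported(source, tool):
    """Return True if the table marks this source/tool pair as Yes."""
    return tool in supported_tools(source)
```

For example, `is_supported("MongoDB", "PRD")` evaluates to `True`, while `is_supported("Amazon EMR", "Analyzer")` evaluates to `False`.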

Notes: A generic Apache Hadoop driver is included in the Pentaho distribution for version 10.2. Other supported drivers can be downloaded from the Support Portal.

[a] Hive3 as a data source for CDP also supports Hive LLAP and Hive3 on Tez.

[b] The Simba driver required for Google BigQuery is the JDBC 4.2-compatible version, which you can download from https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.2.2.1004.zip.

[c] HBase is not supported with Google Dataproc.

[d] Use the Google Dataproc 2.1 driver for your Google Dataproc 2.2 cluster. The Google Dataproc 2.1 driver is certified to work with Google Dataproc 2.2.

[e] EMR clusters (version 7.x and later) built with JDK 17 exclude the commons-lang-2.6.jar library from their standard Hadoop library directory ($HADOOP_HOME/lib). To use the EMR driver with EMR 7.x, obtain the commons-lang-2.6.jar file from a trusted source, such as the official Maven repository (commons-lang » commons-lang » 2.6). Then manually copy the downloaded JAR file to the $HADOOP_HOME/lib or $HADOOP_MAPRED_HOME/lib directory on each node in the EMR cluster so that all worker nodes have access to the library.
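The manual copy step for commons-lang-2.6.jar can be scripted. The sketch below is one possible approach, not a documented Pentaho procedure. It assumes the JAR has already been downloaded locally, that `scp` is available, and that each node accepts SSH connections. The node hostnames, the `hadoop` SSH user, and the `/usr/lib/hadoop/lib` target path (standing in for `$HADOOP_HOME/lib`) are assumptions to replace with your cluster's actual values.

```python
import shlex
import subprocess


def build_copy_commands(jar_path, nodes, target_dir, user="hadoop"):
    """Return one scp argument list per node for copying the JAR."""
    return [
        ["scp", jar_path, f"{user}@{node}:{target_dir}/"]
        for node in nodes
    ]


def distribute(jar_path, nodes, target_dir, user="hadoop", dry_run=True):
    """Copy the JAR to every node, or only print the commands if dry_run."""
    for cmd in build_copy_commands(jar_path, nodes, target_dir, user):
        if dry_run:
            # Show the command that would run, without touching the cluster.
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if a copy fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical hostnames and path; substitute your own values.
    distribute(
        "commons-lang-2.6.jar",
        ["node1.example.internal", "node2.example.internal"],
        "/usr/lib/hadoop/lib",
        dry_run=True,
    )
```

Running with `dry_run=True` first makes it possible to review the generated commands before anything is copied to the cluster.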
