Extracting data into PDI
Connect Pentaho Data Integration (PDI) to databases, file systems, clusters, and other data sources, and configure advanced options for integration.
Defining PDI database connections
You can use Pentaho Data Integration (PDI) to access data from various databases. You must connect to the database before accessing its records. You define database connections in PDI through the Database Connection dialog box.
Edit database connections in PDI
Once a connection has been established, you can open the Database Connection dialog box to refine and change aspects of the connection.
Specify advanced configuration of PDI database connections
Use the Advanced option in the Database Connection dialog box to configure properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.
Quoting PDI database connections
Pentaho uses a database-specific quoting system. With this system, you can use any name or character that complies with the supported databases' naming conventions.
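As a rough illustration of what database-specific quoting means, the sketch below wraps an identifier in the quote characters that a few common databases conventionally use. The `QUOTE_CHARS` table and the `quote_identifier` function are hypothetical names chosen for this example. They are not part of PDI and do not describe how PDI implements quoting internally.

```python
# Hypothetical illustration of database-specific identifier quoting.
# This is not PDI code; the names below are assumptions made for the example.

# Opening and closing quote characters conventionally used by each database.
QUOTE_CHARS = {
    "mysql": ("`", "`"),
    "postgresql": ('"', '"'),
    "oracle": ('"', '"'),
    "mssql": ("[", "]"),
}


def quote_identifier(name: str, dialect: str) -> str:
    """Wrap a table or column name in the dialect's quote characters."""
    open_q, close_q = QUOTE_CHARS[dialect]
    # A closing quote character inside the name is escaped by doubling it.
    escaped = name.replace(close_q, close_q * 2)
    return f"{open_q}{escaped}{close_q}"


if __name__ == "__main__":
    for dialect in QUOTE_CHARS:
        print(dialect, quote_identifier("order date", dialect))
```

Quoting of this kind is what lets a name containing spaces or reserved words be used safely in generated SQL.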
Set specific options for PDI database connections
Use the Options option in the Database Connection dialog box to set parameters that are specific to the database you are connecting to.
Define PDI database connection pooling
You can use the Pooling option in the Database Connection dialog box to set up a connection pool and define options such as the initial pool size, maximum pool size, and connection pool parameters. By default, a connection remains open for each individual report or set of reports in the Pentaho User Console (PUC) and for each individual step in a transformation in PDI.
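To make the initial and maximum pool size options more concrete, here is a minimal sketch of a connection pool. It is a conceptual model only: the `SimplePool` class, its parameter names, and the use of SQLite are assumptions for illustration, not PDI's pooling implementation, and the class is not thread-safe.

```python
# Conceptual sketch of a connection pool with an initial and a maximum size.
# SimplePool is a hypothetical class, not part of PDI, and is not thread-safe.
import queue
import sqlite3


class SimplePool:
    def __init__(self, connect, initial_size=2, max_size=5):
        if not 0 <= initial_size <= max_size:
            raise ValueError("initial_size must be between 0 and max_size")
        self._connect = connect          # factory that opens a new connection
        self._max_size = max_size
        self._idle = queue.LifoQueue()   # connections waiting to be reused
        self._created = 0
        # Open the initial connections up front.
        for _ in range(initial_size):
            self._idle.put(self._new())

    def _new(self):
        self._created += 1
        return self._connect()

    @property
    def created(self):
        """Total number of connections opened so far."""
        return self._created

    def acquire(self):
        """Reuse an idle connection, or open a new one if below max_size."""
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self._created >= self._max_size:
                raise RuntimeError("pool exhausted")
            return self._new()

    def release(self, conn):
        """Return a connection to the pool so it can be reused."""
        self._idle.put(conn)


if __name__ == "__main__":
    pool = SimplePool(lambda: sqlite3.connect(":memory:"), initial_size=1, max_size=2)
    conn = pool.acquire()
    print(conn.execute("SELECT 1").fetchone())
    pool.release(conn)
```

The point of the sketch is the trade-off the two sizes control: a larger initial size avoids connection setup delays on first use, while the maximum size caps the load placed on the database.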
Connect to clusters (PDI only)
Use the Clustering options in the Database Connection dialog box to cluster the database connection and create connections to data partitions in PDI.
Access other database-related connection tasks in PDI by right-clicking on the connection name in the View tab of the Explorer pane.
PDI and Hitachi Content Platform (HCP)
Hitachi Content Platform (HCP) is the distributed, fixed-content, data storage system from Hitachi Vantara. HCP provides a scalable, easy-to-use repository that can accommodate all types of data, from simple text files to medical images to multigigabyte database images.
Pentaho supports a hierarchical data type (HDT) by means of the Pentaho EE Marketplace hierarchical data type plugin that adds the data type and creates five steps. These steps are designed to simplify string manipulation, with the ability to convert between HDT fields and formatted strings.
Snowflake is an analytic data warehouse that runs entirely on cloud infrastructure. Snowflake supports loading popular data formats such as JSON, Avro, Parquet, ORC, and XML. Using Pentaho Data Integration (PDI), you can load your data into Snowflake and define jobs in PDI to efficiently orchestrate warehouse operations, paying only for the storage and computing resources you actually use.
Pentaho Data Integration supports simplified integration with fixed-length records in binary mainframe data files, so more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive amounts of customer and transactional datasets generated in mainframes that you want to search and query to create reports.
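For readers unfamiliar with fixed-length binary records, the following sketch shows the general idea outside of PDI. It assumes a made-up layout in which each record is exactly 14 bytes: a 10-byte name encoded in EBCDIC (code page 037) followed by a 4-byte big-endian signed integer. The layout, the `RECORD` structure, and the `read_records` function are illustrative assumptions. Real mainframe files are described by their own record definitions and often use additional field types that this example does not cover.

```python
# Sketch of reading fixed-length binary records, assuming a hypothetical layout:
#   bytes 0-9   name, EBCDIC (code page 037), space padded
#   bytes 10-13 amount, 4-byte big-endian signed integer
import struct

RECORD = struct.Struct(">10si")  # 14 bytes per record, no padding


def read_records(data: bytes):
    """Yield (name, amount) tuples from a buffer of fixed-length records."""
    if len(data) % RECORD.size != 0:
        raise ValueError("data length is not a multiple of the record size")
    for raw_name, amount in RECORD.iter_unpack(data):
        # Decode EBCDIC text and strip the trailing space padding.
        yield raw_name.decode("cp037").rstrip(), amount


if __name__ == "__main__":
    # Build two sample records in memory instead of reading a real file.
    sample = b"".join(
        name.ljust(10).encode("cp037") + struct.pack(">i", amount)
        for name, amount in [("ALICE", 42), ("BOB", -7)]
    )
    for record in read_records(sample):
        print(record)
```

Because every record has the same length, no delimiters are needed: the reader locates each field purely by its byte offset.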
Work with the Streamlined Data Refinery
The Streamlined Data Refinery (SDR) is a simplified and specific ETL refinery composed of a series of Pentaho Data Integration (PDI) jobs that take raw data, augment and blend it through the request form, and then publish it for report designers to use in Analyzer.
Connecting to a Hadoop cluster with the PDI client
To connect to a Hadoop cluster, you must access a driver, create a named connection, then configure and test the connection.
Connecting to Virtual File Systems
You can connect to most Virtual File Systems (VFS) through VFS connections in PDI. A VFS connection is a stored set of VFS properties that you can use to connect to a specific file system.
With streaming analytics, you can continuously perform statistical analysis on data as it moves through a data stream.
PDI jobs and transformations can interact with a variety of Web services through specialized steps. How you use these steps, and which ones you use, is largely determined by your definition of Web services.

