Extracting data into PDI

Connect Pentaho Data Integration (PDI) to databases, file systems, clusters, and other data sources, and configure advanced options for integration.

  • Defining PDI database connections

    You can use Pentaho Data Integration (PDI) to access data from various databases. You must connect to the database before accessing its records. You define database connections in PDI through the Database Connection dialog box.

  • Edit database connections in PDI

    Once a connection has been established, you can open the Database Connection dialog box to refine and change aspects of the connection.

  • Specify advanced configuration of PDI database connections

    Use the Advanced option in the Database Connection dialog box to configure properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.

  • Quoting PDI database connections

    Pentaho uses a database-specific quoting system. With this system, you can use any name or character that complies with the supported databases' naming conventions.

  • Set specific options for PDI database connections

    Use the Options settings in the Database Connection dialog box to set parameters that are specific to the type of database you are connecting to.

  • Define PDI database connection pooling

    Use the Pooling option in the Database Connection dialog box to set up a connection pool and define options such as the initial pool size, maximum pool size, and connection pool parameters. By default, a connection remains open for each individual report or set of reports in the Pentaho User Console (PUC) and for each individual step in a transformation in PDI.

  • Connect to clusters (PDI only)

    Use the Clustering options in the Database Connection dialog box to cluster the database connection and create connections to data partitions in PDI.

  • Modify connections from PDI

    Access other database-related connection tasks in PDI by right-clicking on the connection name in the View tab of the Explorer pane.

  • PDI and Hitachi Content Platform (HCP)

    Hitachi Content Platform (HCP) is the distributed, fixed-content, data storage system from Hitachi Vantara. HCP provides a scalable, easy-to-use repository that can accommodate all types of data, from simple text files to medical images to multigigabyte database images.

  • Hierarchical data

    Pentaho supports a hierarchical data type (HDT) through the hierarchical data type plugin, available from the Pentaho EE Marketplace. The plugin adds the data type and five steps. These steps are designed to simplify string manipulation and can convert between HDT fields and formatted strings.

  • PDI and Snowflake

    Snowflake is an analytic data warehouse that runs entirely on cloud infrastructure. Snowflake supports loading popular data formats like JSON, Avro, Parquet, ORC, and XML. Using Pentaho Data Integration (PDI), you can load your data into Snowflake and define jobs in PDI to efficiently orchestrate warehouse operations, paying only for the storage and computing resources you actually use.

  • Copybook steps in PDI

    Pentaho Data Integration supports simplified integration with fixed-length records in binary mainframe data files, so more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive amounts of customer and transactional datasets generated in mainframes that you want to search and query to create reports.

  • Work with the Streamlined Data Refinery

    The Streamlined Data Refinery (SDR) is a simplified and specific ETL refinery composed of a series of Pentaho Data Integration (PDI) jobs that take raw data, augment and blend it through the request form, and then publish it for report designers to use in Analyzer.

  • Connecting to a Hadoop cluster with the PDI client

    To connect to a Hadoop cluster, you must access a driver, create a named connection, then configure and test the connection.

  • Connecting to Virtual File Systems

    You can connect to most Virtual File Systems (VFS) through VFS connections in PDI. A VFS connection is a stored set of VFS properties that you can use to connect to a specific file system.

  • Streaming analytics

    With streaming analytics, you can continuously perform statistical analysis on data as it moves through a data stream.

  • Web services steps

    PDI jobs and transformations can interact with a variety of Web services through specialized steps. How you use these steps, and which ones you use, is largely determined by your definition of Web services.
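
As a rough illustration of the database-specific quoting described under "Quoting PDI database connections", the sketch below wraps an identifier in the quote characters that a few common databases use. It is not PDI's implementation: the `QUOTE_CHARS` table and the `quote_identifier` function are hypothetical names introduced only for this example, and PDI applies quoting for you based on the connection type.

```python
# Illustrative sketch only; not how PDI implements quoting internally.
# QUOTE_CHARS and quote_identifier are hypothetical names for this example.

# Opening and closing identifier quote characters for a few databases.
QUOTE_CHARS = {
    "mysql": ("`", "`"),        # MySQL quotes identifiers with backticks
    "postgresql": ('"', '"'),   # PostgreSQL uses double quotes (SQL standard)
    "mssql": ("[", "]"),        # SQL Server uses square brackets
}


def quote_identifier(name: str, dialect: str) -> str:
    """Wrap a table or column name in the quote characters for a dialect."""
    start, end = QUOTE_CHARS[dialect]
    # An embedded closing quote character is escaped by doubling it.
    escaped = name.replace(end, end * 2)
    return f"{start}{escaped}{end}"


if __name__ == "__main__":
    for dialect in QUOTE_CHARS:
        print(dialect, quote_identifier("order date", dialect))
```

Quoting lets a name that contains a space or a reserved word, such as `order date`, be used safely in generated SQL for each of these databases.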
