Adding a data source

If your role has permission to administer data sources, you can add and edit data sources.

The number of data sources you can add is determined by your license agreement. You receive a message when you have reached 75% of your data source creation quota.

If you have reached the limit of data sources allowed by your license agreement, the Add Data Source button on the Resources card is unavailable, and a message appears when you hover your cursor over the button.

Note: If you encounter an error while connecting to a data source, refer to the documentation of the specific data source provider for more information about the error.

Active Directory as a data source

Data Catalog supports integration with both Windows-based Active Directory (AD) and Azure Active Directory (Azure AD). You can add Active Directory as a data source to import file system security identifiers (SIDs), GUIDs, and security descriptors, and map them to user identities. With this integration, Data Catalog displays the ownership and group information for files and folders from SMB, CIFS, and NFS data sources in Data Canvas.

Perform the following steps to add Active Directory as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

  4. After you have specified the basic connection information, select Active Directory in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

  5. Specify the following additional connection information.

    Field
    Description

    Configuration Method

    Select the method used to connect to Active Directory. Options include:

    • LDAP: Establishes a standard, non-encrypted connection.

    • Secure LDAP: Establishes an encrypted connection using SSL/TLS. When you select Secure LDAP, two additional fields appear: Certificate File and Certificate Password.

    Host

    The fully qualified domain name (FQDN) or IP address of the Active Directory server.

    Port

    The port number used to connect to the Active Directory server. The default port is 389 for LDAP or 636 for Secure LDAP (LDAPS).

    Domain

    The domain name associated with the Active Directory environment.

    User Name

    The username of an account that has permission to query Active Directory. If you did not fill in the Domain field, include the domain as part of the username.

    Password

    The password associated with the username. This credential is used to authenticate the connection to the AD server.

    Certificate File (Visible only when Configuration Method is Secure LDAP)

    Upload the SSL certificate file (in .crt or .pem format) required to establish a secure connection.

    Certificate Password (Visible only when Configuration Method is Secure LDAP)

    Specify the password associated with the uploaded certificate file, if applicable.

  6. Click Test Connection to test your connection to the specified data source.

  7. Click Create Data Source to establish your data source connection.

  8. Click Import Users.

    This process imports file system security identifiers (SIDs), GUIDs, and security descriptors, and maps them to user identities, which helps Data Catalog display ownership and group information for files and folders from SMB and CIFS data sources in Data Canvas. You can also monitor the status of the job on the Workers page.

You have successfully created a connection to Active Directory as a data source in Data Catalog.

After completing the Import Users job, you can run the Metadata Ingest process for SMB and CIFS data sources to see the user information in the Properties panel. For more information, see the Processing unstructured data topic in the Explore your data section in the Use Pentaho Data Catalog document. Additionally, you can see the list of users who have access to a particular file or folder. For more information, see the Users with Access topic in the Use Pentaho Data Catalog document.
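If the Test Connection step fails, it can help to verify the same host, port, credentials, and certificate outside Data Catalog before retrying. The following is a minimal sketch using the Python ldap3 library to bind over Secure LDAP; the hostname, service account, password, and certificate file are placeholder values, and the library is not part of the product.

```python
# Minimal Secure LDAP bind check (pip install ldap3).
import ssl
from ldap3 import Server, Connection, Tls, ALL

tls = Tls(validate=ssl.CERT_REQUIRED, ca_certs_file="ad-ca.pem")  # Certificate File from step 5
server = Server("ad.example.com", port=636, use_ssl=True, tls=tls, get_info=ALL)

# Bind with the same account Data Catalog will use, in DOMAIN\username form.
conn = Connection(server, user="EXAMPLE\\svc_catalog", password="change-me", auto_bind=True)
print("Bound as:", conn.extend.standard.who_am_i())
conn.unbind()
```

A successful bind confirms that the connection details are valid, so Data Catalog should be able to connect with the same values.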

Amazon Redshift data source

Amazon Redshift is a fully managed, petabyte-scale data warehouse solution offered by Amazon Web Services (AWS). It allows users to run complex queries and perform real-time data analytics on large datasets by utilizing massively parallel processing (MPP) technology. By integrating Amazon Redshift as a data source within Data Catalog, you can access and manage metadata from the Amazon Redshift database. The integration enables data discovery, so you can search, explore, and understand Amazon Redshift data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Amazon Redshift as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the connection information, select Redshift in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as the configuration method.

Driver

Both the Credentials and URI configuration methods require a driver. Select an existing driver or upload a new driver to ensure efficient, secure communication between the application and the database that meets the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

User Name

The username required to authenticate to the Amazon Redshift database. This field is optional when using Secret Manager Key.

Password

The password associated with the specified Redshift user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the Redshift credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager instead of requiring manual entry.

Role

The IAM role that grants Data Catalog permission to read secret versions from Secret Manager. This role must include permissions such as secretmanager.versions.access.

Region

The region where the Redshift secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host (Only for the Credentials method)

The address of the machine where the Amazon Redshift database server is running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port number on which the Amazon Redshift server is listening for incoming connections.

URI (Only for the URI method)

URIs are used to access and manage various objects and services within the Amazon Redshift environment. For example, the URI would look like `jdbc:redshift://<host>:<port>/<database-name>`.

Database Name

The name of the database within the Amazon Redshift server that you want to connect with.

  6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, either with a starts-with match or with regular expressions. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Amazon Redshift as a data source.
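To confirm the Credentials-method values (Host, Port, Database Name, User Name, and Password) independently of Data Catalog, the following sketch uses the redshift-connector Python driver published by AWS; the cluster endpoint, database, and account are placeholder values.

```python
# Verify Redshift connection details (pip install redshift-connector).
import redshift_connector

conn = redshift_connector.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",  # Host field
    port=5439,                                                      # Port field
    database="dev",                                                 # Database Name field
    user="catalog_reader",                                          # User Name field
    password="change-me",                                           # Password field
)
cursor = conn.cursor()
# List a few tables, roughly what schema ingestion will walk.
cursor.execute("SELECT table_schema, table_name FROM information_schema.tables LIMIT 5")
print(cursor.fetchall())
conn.close()
```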

Apache Iceberg data source

Apache Iceberg is an open table format designed to manage large analytic datasets in modern data lake architectures. It provides a standardized way to define table metadata, track schema and partition changes, and manage transactional updates independently of the underlying storage systems and query engines.

In Data Catalog, you can configure Apache Iceberg as a data source to discover, catalog, and govern Iceberg-managed tables. This integration allows Data Catalog to connect directly to an Iceberg catalog and ingest table metadata based on Iceberg’s native metadata model, rather than relying on file system structures or inferred schemas.

The Apache Iceberg connector in Pentaho Data Catalog supports Iceberg warehouses deployed on AWS, HDFS, and Azure Data Lake. Perform the following steps to add Apache Iceberg as a data source in Data Catalog:

▶️ Watch a walkthrough

You can watch a guided walkthrough that demonstrates how to configure an Apache Iceberg data source in Pentaho Data Catalog.

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

  4. After you have specified the basic connection information, select Apache Iceberg in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

  5. Specify the following additional connection information.

    Field
    Description

    URI

    The URL of the Apache Iceberg catalog service. For example, the URL would look like http://<ip_address_or_hostname_of_iceberg_catalog>:8181

    Affinity

    This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

    Warehouse

    This is where data and metadata are stored. Currently, Data Catalog supports AWS, Azure Data Lake, and HDFS.

Based on the Warehouse selected, provide the additional connection details:

Field
Description

Region

The AWS region where the S3 bucket hosting the Iceberg warehouse is located.

Endpoint

The S3-compatible endpoint used to access the Iceberg warehouse.

Example: http://<ip_address_or_hostname_of_warehouse_location>:9000

Access Key

User credentials to access data in Iceberg Catalog. This key authenticates your access to the S3 bucket.

Secret Access Key

Password credential to access data on Iceberg Catalog. This key authenticates your access to the S3 bucket.

Path Style Access

Set to True when the user accesses resources using path-style access.

For example, /user/home/resource.

Set to False when using an alternate access style.

  6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

  7. Click Ingest Schema. The Select schemas for ingestion dialog opens.

    1. Search for and select schemas using the search bar at the top (with a starts-with match or regular expressions), and then click Next.

    2. On the Ingest Schema dialog box, add include or exclude patterns to filter the tables to be ingested, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Apache Iceberg as a data source.
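To double-check the URI, Endpoint, and key pair outside Data Catalog, the following minimal sketch uses the PyIceberg library against a REST catalog with an AWS (S3-compatible) warehouse; all hostnames and keys are placeholder values, and PyIceberg is not part of the product.

```python
# Connect to an Iceberg REST catalog and list namespaces (pip install "pyiceberg[s3fs]").
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pdc_check",  # local alias; any name works
    **{
        "uri": "http://iceberg-catalog.example.com:8181",    # URI field
        "s3.endpoint": "http://warehouse.example.com:9000",  # Endpoint field (AWS warehouse)
        "s3.access-key-id": "example-access-key",            # Access Key field
        "s3.secret-access-key": "example-secret-key",        # Secret Access Key field
    },
)
# Namespaces correspond to the schemas offered for ingestion.
print(catalog.list_namespaces())
```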

AWS S3 data source

Perform the following steps to add AWS S3 as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select AWS S3 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region (1)

The geographical location where AWS maintains a cluster of data centers.

  • If connecting directly to S3, enter the region where the S3 bucket resides.

  • If using Secrets Manager, enter the region where the Secrets Manager key resides.

Bucket Name (1)

Enter the name of the Amazon S3 bucket where your data resides.

For S3 access from non-EMR environments, Data Catalog uses the AWS Command Line Interface (CLI) to connect to S3 using your access credentials (Access Key and Secret Access Key).

If you are connecting to Hadoop Distributed File System (HDFS) or MapR, provide the logical root directory name defined in the hdfs-site.xml configuration file using the dfs.nameservices property.

For MapR file systems, identify the root using the maprfs:/// URI scheme.

Access Key (3)

User credentials to access data on the bucket.

  • If connecting directly to S3, this key authenticates your access to the S3 bucket.

  • If using Secrets Manager, this key authenticates your access to the Secrets Manager key so you can retrieve the S3 bucket credentials.

Secret Access Key (3)

Password credential to access data on the bucket.

  • If connecting directly to S3, this key authenticates your access to the S3 bucket.

  • If using Secrets Manager, this key authenticates your access to the Secrets Manager key so you can retrieve the S3 bucket credentials.

Secret Manager Key (2)

Name of the AWS Secrets Manager key to retrieve the credentials securely, instead of specifying them here.

AWS role

ARN of the AWS IAM Role to assume when accessing the S3 bucket, allowing cross-account or role-based access.

Endpoint

Location of the bucket. For example, s3.<region containing S3 bucket>.amazonaws.com

Path

The directory where this data source is included.

(1) These fields are mandatory.

(2) When using Secrets Manager, you must provide the Region where the Secrets Manager key resides, the Secret Manager Key, and the Access Key and Secret Access Key (to retrieve the S3 bucket credentials from Secrets Manager). In this case, the access keys authenticate your access to Secrets Manager, not directly to the S3 bucket.

(3) When not using Secrets Manager, you must provide the Access Key and Secret Access Key to access the S3 bucket directly.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the AWS S3 data source.

Feature walkthrough: Support for Secret Manager

▶️ Launch the Support for Secret Manager walkthrough

This interactive walkthrough demonstrates how to configure Secret Manager in Pentaho Data Catalog. It covers:

  • Storing and managing sensitive credentials in Secret Manager

  • Configuring PDC to securely fetch these credentials for AWS S3 and other data sources

  • Reducing manual handling of secrets during data source setup
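The footnoted Secrets Manager behavior can also be reproduced outside Data Catalog to confirm your values. The following boto3 sketch first retrieves the bucket credentials from Secrets Manager with one key pair (footnote 3), and then lists bucket objects with the retrieved pair; the secret name, its JSON field names, and the bucket are assumptions for illustration.

```python
# Fetch bucket credentials from AWS Secrets Manager, then list objects (pip install boto3).
import json
import boto3

session = boto3.Session(
    aws_access_key_id="example-access-key",       # authenticates to Secrets Manager (footnote 3)
    aws_secret_access_key="example-secret-key",
    region_name="us-east-1",                      # region where the Secrets Manager key resides
)
secret = session.client("secretsmanager").get_secret_value(SecretId="pdc/s3-bucket")
creds = json.loads(secret["SecretString"])        # assumed to hold the bucket's key pair

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["access_key"],        # assumed JSON field names
    aws_secret_access_key=creds["secret_access_key"],
    region_name="us-east-1",                      # region where the S3 bucket resides
)
response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="landing/")  # Bucket Name / Path
for obj in response.get("Contents", []):
    print(obj["Key"])
```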

Azure Blob Storage data source

Perform the following steps to add Azure Blob Storage as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Azure Blob Storage in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

The Default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Shared Key (default)

Account Name

The name of the Azure storage account that contains all of your Azure Storage data objects.

Shared Key

A password-like credential that gives full access to an Azure storage account's data and configuration.

Container

The top-level object that logically groups your blob data. A container can hold an unlimited number of blobs.

Path

Folder where this data source is included.

  6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the Azure Blob Storage data source.
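To confirm the Account Name, Shared Key, Container, and Path values independently, the following sketch uses the azure-storage-blob Python package; the storage account, key, container, and prefix are placeholder values.

```python
# Verify Shared Key access and list blobs (pip install azure-storage-blob).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://examplestorage.blob.core.windows.net",  # built from the Account Name
    credential="example-shared-key",                             # Shared Key field
)
container = service.get_container_client("raw-data")             # Container field
for blob in container.list_blobs(name_starts_with="sales/"):     # Path field as a prefix
    print(blob.name)
```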

Databricks data source

You can configure Databricks as a data source in Data Catalog to discover, catalog, and govern data stored in Databricks SQL warehouses and the underlying Delta Lake storage layer. This integration enables Pentaho Data Catalog to connect to Databricks using JDBC and ingest metadata for data identification, profiling, and governance. By adding Databricks as a data source, organizations gain centralized visibility into Databricks-managed data and can support governed analytics, machine learning, and AI workflows across structured, semi-structured, unstructured, and streaming data.

Perform the following steps to add Databricks as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

▶️ Watch a walkthrough

You can watch a guided walkthrough that demonstrates how to configure a Databricks data source in Pentaho Data Catalog.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

  4. After you have specified the basic connection information, select Databricks in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

  5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The Java class name corresponding to the selected Databricks driver.

Host

The Databricks workspace hostname. This value identifies the Databricks compute resource. Example: https://dbc-xxxx.cloud.databricks.com.

Port

The port used to connect to the Databricks workspace.

Path

The HTTP path of the Databricks SQL warehouse. Pentaho Data Catalog appends this value to the host and port to form the JDBC connection URL. Example: /sql/1.0/warehouses/<warehouse-id>.

Catalog name

A user-defined name to identify the Databricks data source within Pentaho Data Catalog.

Personal access token

A Databricks personal access token (PAT) used to authenticate the connection. The token must have permission to access the specified SQL warehouse.

  6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

  7. Click Ingest Schema. The Select schemas for ingestion dialog opens.

    1. Search for and select schemas using the search bar at the top (with a starts-with match or regular expressions), and then click Next.

    2. On the Ingest Schema dialog box, add include or exclude patterns to filter the tables to be ingested, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Databricks data source.
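To verify the Host, Path (HTTP path), and personal access token outside Data Catalog, the following sketch uses the databricks-sql-connector Python package; the workspace hostname, warehouse path, and token are placeholder values.

```python
# Verify a Databricks SQL warehouse connection (pip install databricks-sql-connector).
from databricks import sql

conn = sql.connect(
    server_hostname="dbc-xxxx.cloud.databricks.com",  # Host field; the connector takes the bare hostname
    http_path="/sql/1.0/warehouses/abcd1234",         # Path field (HTTP path of the SQL warehouse)
    access_token="dapi-example-token",                # Personal access token field
)
cursor = conn.cursor()
cursor.execute("SHOW SCHEMAS")   # roughly what schema ingestion will enumerate
print(cursor.fetchall())
cursor.close()
conn.close()
```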

Denodo data source

Perform the following steps to add Denodo as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Denodo in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The default is URI, because the connection is configured using a connection URL.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The Java class name corresponding to the selected Denodo driver.

User Name

The username required to authenticate to the Denodo environment. This field is optional when using Secret Manager Key.

Password

The password associated with the Denodo user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the Denodo connection credentials. When this field is provided, Data Catalog securely retrieves credentials from Secret Manager, eliminating the need to enter the username and password manually.

Region

The region where the Denodo secret is stored in Secret Manager. Data Catalog uses this region to connect to the correct Secret Manager endpoint.

Role

The IAM role that grants Data Catalog permission to read secret versions from Secret Manager. The role must include permissions such as secretmanager.versions.access.

URI

URIs are used to access and manage various objects and services within the Denodo environment. For example, the URI would look like jdbc:vdb://<denodo-host>:<port>/<database-name>?publishCatalogsAsSchemas=true, as in jdbc:vdb://ec2-1-2-3-4.compute-1.amazonaws.com:49999/mydatabase?publishCatalogsAsSchemas=true.

Database Name

The name of the database within the Denodo environment that contains the data you want to access.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens. Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan Files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, either with a starts-with match or with regular expressions. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Denodo data source.
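The JDBC URI shown in the table can also be exercised outside Data Catalog. The following sketch uses the JayDeBeApi Python package with the Denodo JDBC driver; the driver class name and JAR path are assumptions, so substitute the values that ship with your Denodo driver.

```python
# Open the Denodo JDBC URI from Python (pip install jaydebeapi JPype1).
import jaydebeapi

conn = jaydebeapi.connect(
    "com.denodo.vdp.jdbc.Driver",                    # Driver Class Name field (assumed)
    "jdbc:vdb://denodo.example.com:9999/mydatabase?publishCatalogsAsSchemas=true",  # URI field
    ["catalog_user", "change-me"],                   # User Name / Password fields
    "/opt/drivers/denodo-vdp-jdbcdriver.jar",        # path to the uploaded driver JAR (assumed)
)
cursor = conn.cursor()
cursor.execute("SELECT 1")   # trivial query to prove the connection works
print(cursor.fetchall())
conn.close()
```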

DynamoDB data source

Perform the following steps to add Amazon DynamoDB as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select DynamoDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region

Geographical location where AWS maintains a cluster of data centers.

Access Key and Secret Access Key

AWS Access Key ID and Secret Access Key that are used for authentication and authorization when interacting with DynamoDB.

Role

The AWS IAM Role that Pentaho Data Catalog assumes to access DynamoDB securely. Use this field when authentication is performed through AWS IAM role-based access instead of (or in addition to) Access Key credentials.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. You can monitor the status of the file scan on the Workers page. Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the Amazon DynamoDB data source.
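To confirm the Region and key pair (or IAM role) independently, the following boto3 sketch lists the tables visible to the credentials; the keys, region, and role ARN are placeholder values.

```python
# Confirm the Region and credentials can reach DynamoDB (pip install boto3).
import boto3

dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",                      # Region field
    aws_access_key_id="example-access-key",       # Access Key field
    aws_secret_access_key="example-secret-key",   # Secret Access Key field
)
print(dynamodb.list_tables()["TableNames"])

# If you use the Role field instead, assume it first and build the client from
# the temporary credentials returned by STS:
# sts = boto3.client("sts")
# creds = sts.assume_role(
#     RoleArn="arn:aws:iam::123456789012:role/pdc-reader",  # placeholder ARN
#     RoleSessionName="pdc-check",
# )["Credentials"]
```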

Google Cloud Storage data source

Google Cloud Storage (GCS) is a storage service that enables the storage, retrieval, and management of unstructured data, including files, images, videos, and large datasets. By integrating GCS as a data source within Data Catalog, you can access and manage the metadata of stored files. You can perform data discovery to search, explore, and understand your Google Cloud Storage data. Additionally, the integration enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Google Cloud Storage as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Google Object Store in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Bucket Name

The name of the Google Cloud Storage bucket in which the data resides.

Path

The path within the bucket where the files are stored.

Key Path

The authentication key file (a JSON file) used to connect to Google Cloud Storage.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Google Cloud Storage as a data source in Data Catalog.
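To confirm the Bucket Name, Path, and Key Path values independently, the following sketch uses the google-cloud-storage Python package with the same JSON key file; the file name, bucket, and prefix are placeholder values.

```python
# List objects using the service account key file (pip install google-cloud-storage).
from google.cloud import storage

client = storage.Client.from_service_account_json("pdc-gcs-key.json")  # Key Path field
bucket = client.bucket("my-data-bucket")                               # Bucket Name field
for blob in bucket.list_blobs(prefix="landing/"):                      # Path field as a prefix
    print(blob.name)
```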

Google BigQuery data source

Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for analytics. By integrating BigQuery as a data source within Data Catalog, you can access and manage metadata from the BigQuery database. The integration enables data discovery, so you can search, explore, and understand BigQuery data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add BigQuery as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Google BigQuery in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field

Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Driver

The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, and click Add New, upload the driver, and then click Add Driver.

Host

The Google BigQuery API endpoint. By default, the host is set to https://www.googleapis.com/bigquery/v2, which communicates with BigQuery's REST API for data processing.

Port

The port number to connect to the BigQuery data source. The default port for HTTPS connections is 443.

Project

The Google Cloud project ID that contains the BigQuery datasets you want to access.

Database Name

The name of the dataset within BigQuery that you want to connect to.

Secret Manager Key

The name of the secret stored in Secret Manager that contains your Google Cloud service account key (in JSON format). When provided, Data Catalog retrieves the key securely from Secret Manager instead of using a local key file.

Role

The IAM role that grants access to read secret versions from Secret Manager. This role must include permissions such as secretmanager.versions.access so that Data Catalog can retrieve the service account key.

Region

The region where the secret is stored in Secret Manager. Data Catalog uses this region to connect to the correct Secret Manager endpoint.

Key Path

The file path to your Google Cloud service account's key (a JSON file).

Oauth Type

The authentication method to connect to BigQuery. By default, it is Service-based. It uses a service account and a key file for authentication.

Client Email

The email associated with your Google Cloud service account for service-based OAuth. The service account email is usually in the format <service-account-name>@<project-id>.iam.gserviceaccount.com.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, either with a starts-with match or with regular expressions. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source in data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the BigQuery data source.
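To confirm the Key Path and Project values independently, the following sketch uses the google-cloud-bigquery Python package to list datasets with the same service account key; the key file and project ID are placeholder values.

```python
# Verify the service account key can reach BigQuery (pip install google-cloud-bigquery).
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("pdc-bq-key.json")  # Key Path field
for dataset in client.list_datasets(project="my-gcp-project"):         # Project field
    print(dataset.dataset_id)   # datasets correspond to the Database Name choices
```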

HCP data source

You can add data to Data Catalog from Hitachi Content Platform (HCP) by adding HCP as a data source. Perform the following steps to add HCP as a data source:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select HCP in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region

Geographical location where HCP maintains data centers.

Endpoint

The location of the bucket, specified as a hostname or IP address.

Access Key

The access key of the S3 credentials to access the bucket.

Secret Access Key

The secret key of the S3 credentials to access the bucket.

Bucket Name

The name of the S3 bucket in which the data resides.

Path

The directory within the bucket that this data source includes.
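Because HCP exposes an S3-compatible API, you can optionally verify the endpoint, credentials, bucket, and path before entering them above. The following is a minimal sketch using Python with the boto3 package; the endpoint URL, keys, bucket name, and prefix are placeholder values, not values from this document.

    import boto3

    # Placeholder connection details; substitute your HCP endpoint and S3 credentials.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://tenant.hcp.example.com",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_ACCESS_KEY",
    )

    # List a few objects under the path you plan to scan to confirm access.
    response = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/", MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])

If the listing succeeds, the same endpoint, keys, bucket name, and path should work in the fields above.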

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the HCP data source.

HDFS data source

Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant storage system for big data, designed to distribute and manage large datasets across a cluster of commodity hardware within the Apache Hadoop framework. You can create an HDFS data source with a local file system path by mounting the data as a local file system on either a remote or local worker.

The HDFS protocol uses a client-server model where the server provides the shared file system, and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any file-sharing network system if it is transferable using HDFS.

Perform the following steps to add HDFS as a data source:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, specify the following additional connection information to access the HDFS data source.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

By default, it is URI.

HDFS Version

Select the Hadoop version of the cluster that you want to connect to.

URI

URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like hdfs://<name node>:8020. The <name node> address can be a variable name for high availability.

Path

HDFS directory path for the data source. It can be the root (/) or a specific high-level directory based on your access control needs. For example, the path would look like /user/demodata/.

Credential Type

Select the credential type to connect to the HDFS data source.
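You can optionally confirm that the URI and path resolve before adding the data source. The following is a minimal sketch using Python with pyarrow; it assumes the Hadoop client libraries (libhdfs) are installed and configured, and the name node, port, and path are placeholders.

    from pyarrow import fs

    # Placeholder name node and port; match the URI field above (hdfs://<name node>:8020).
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    # List the directory you plan to register to confirm the path resolves.
    for info in hdfs.get_file_info(fs.FileSelector("/user/demodata", recursive=False)):
        print(info.path, info.type)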

5. Click Test Connection to test your connection to the specified data source.

  6. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  7. (Optional) In the Physical Location field, specify the physical location details of the data source.

  8. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

  9. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  10. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  11. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  12. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the HDFS data source.

IBM Db2 data source

Perform the following steps to add IBM Db2 as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the connection information, select IBM DB2 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as the configuration method.

Driver

Both the Credentials and URI methods require a driver. Select an existing driver or upload a new one to ensure that communication between the application and the database is efficient, secure, and compliant with the required standards.

Driver Class Name

The Java class name corresponding to the selected driver.

User Name

The username required to authenticate to the IBM Db2 database. This field is optional when using Secret Manager Key.

Password

The password required to authenticate to the IBM Db2 database. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the IBM Db2 credentials (username and password). When provided, Data Catalog retrieves credentials securely from Secret Manager instead of requiring manual entry.

Region

The region where the secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Host (only for the Credential method)

The address of the machine where the IBM Db2 database server is running. It can be an IP address or a domain name.

Port (only for the Credential method)

The port number on which the IBM Db2 server is listening for incoming connections.

URI (only for URI method)

URIs are used to access and manage various objects and services within the IBM Db2 environment. For example, the URI would look like jdbc:db2://<HOSTNAME or IP_ADDRESS>:<PORT>/<DATABASE_NAME>.

Database Name

The name of the database within the IBM Db2 server that you want to connect with.
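You can optionally verify the host, port, database name, and credentials before entering them above. The following is a minimal sketch using Python with the ibm_db package; all connection values are placeholders.

    import ibm_db

    # Placeholder connection values; match the fields above.
    dsn = (
        "DATABASE=MYDB;"
        "HOSTNAME=db2.example.com;"
        "PORT=50000;"
        "PROTOCOL=TCPIP;"
        "UID=db2user;"
        "PWD=secret;"
    )

    conn = ibm_db.connect(dsn, "", "")  # raises an exception if the details are wrong
    print("connected")
    ibm_db.close(conn)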

6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the IBM Db2 data source.

InfluxDB data source

InfluxDB is a time-series database designed for handling high volumes of time-stamped data, such as monitoring, IoT applications, real-time analytics, and event-driven architectures. By integrating InfluxDB as a data source within Data Catalog, you can manage and use time-series data. Integration enables data discovery, so you can search, explore, and understand InfluxDB data, and it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add InfluxDB as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select InfluxDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that communication between the application and the database is efficient, secure, and compliant with the required standards.

Driver Class Name

The Java class name associated with the selected InfluxDB driver.

Host

The address of the machine where the InfluxDB database server is running. It can be an IP address or a domain name.

Port

The port number to connect to the InfluxDB data source.

Username

The user name that provides access to the InfluxDB database.

Token

The authentication token provided by the InfluxDB instance to authenticate and authorize access to the specified data source.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the InfluxDB credentials (username and token). When provided, Data Catalog securely retrieves the credentials from Secret Manager, eliminating the need to manually enter the username and token.

Role

The IAM role that permits Data Catalog to read secret versions from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the InfluxDB secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Bucket Name

The name of the bucket in the InfluxDB instance that contains the data you want to connect with.
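You can optionally verify the host, port, token, and bucket name before entering them above. The following is a minimal sketch using Python with the influxdb-client package; the URL, token, organization, and bucket name are placeholders (the organization is an InfluxDB concept not captured in the fields above).

    from influxdb_client import InfluxDBClient

    # Placeholder connection values; match the Host, Port, Token, and Bucket Name fields above.
    client = InfluxDBClient(url="http://influx.example.com:8086", token="MY_TOKEN", org="my-org")

    print("reachable:", client.ping())  # True if the instance responds

    bucket = client.buckets_api().find_bucket_by_name("my-bucket")
    print("bucket found:", bucket is not None)

    client.close()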

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you must perform the Scan Files process. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the InfluxDB data source.

Local File System data source

You can add data to Data Catalog from your local file system by adding Local File System as a data source.

To access files on your local system, make the following changes to the vendor/docker-compose.yml file so that the file system is accessible to the ws-default container.

  1. Open the vendor/docker-compose.yml file and add the following lines under the ws-default service.

    You can also include a remote file share as a Local File System. As an example, refer to the following code snippet for adding cifs-share to the Local File System.
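The exact entries depend on your environment. The following is a minimal sketch in which /data/local-share is a host directory and /mnt/cifs-share is a CIFS share already mounted on the host; both paths are placeholders.

    services:
      ws-default:
        volumes:
          # Host directory exposed inside the ws-default container
          - /data/local-share:/data/local-share
          # Previously mounted CIFS share surfaced as a Local File System path
          - /mnt/cifs-share:/mnt/cifs-share

The container-side path on the right of each mapping is the path the ws-default container sees, and therefore the value to use later in the Path field for the data source.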

  2. Save changes.

  3. Restart the ws-default container for the changes to take effect.

Perform the following steps to identify your data source within Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Local File System in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Path

The local directory that this data source includes.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the local file system as a data source.

MariaDB data source

MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL. Perform the following steps to configure the MariaDB data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the connection information, select MariaDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

The default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

By default, it is Credentials.

Driver

Both the Credentials and URI methods require a driver. Select an existing driver or upload a new one to ensure that communication between the application and the database is efficient, secure, and follows the required standards.

Driver Class Name

The Java class name associated with the selected MariaDB driver.

User Name

The username that provides access to the MariaDB database. This field is optional when using Secret Manager Key.

Password

The password associated with the MariaDB user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the MariaDB credentials (username and password). When provided, Data Catalog securely retrieves the credentials from Secret Manager instead of requiring manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the MariaDB secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host

The address of the machine where the MariaDB database server is running. It can be an IP address or a domain name.

Port

The port number on which the MariaDB server is listening for incoming connections.
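You can optionally verify the host, port, and credentials before entering them above. The following is a minimal sketch using Python with the mariadb connector package; all connection values are placeholders.

    import mariadb

    # Placeholder connection values; match the Host, Port, User Name, and Password fields above.
    conn = mariadb.connect(
        host="mariadb.example.com",
        port=3306,
        user="catalog_user",
        password="secret",
    )

    cur = conn.cursor()
    cur.execute("SELECT VERSION()")  # a trivial query to confirm the session works
    print(cur.fetchone()[0])
    conn.close()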

6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the MariaDB data source.

Microsoft Access data source

Perform the following steps to add Microsoft Access as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft Access in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as the configuration method.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The Java class name associated with the selected Microsoft Access driver.

User Name

The username used to authenticate to the Microsoft Access database. This field is optional when using Secret Manager Key.

Password

The password used to authenticate to the Microsoft Access database. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the Microsoft Access credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager instead of requiring manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the Microsoft Access credentials are stored in Secret Manager. Data Catalog uses this region to connect to the correct Secret Manager endpoint.

Database File (Only for the Credentials method)

The Microsoft Access database file (.mdb or .accdb) to connect to.

URI (Only for the URI method)

The JDBC URI string used to connect to the Access database. The exact format depends on the selected driver; for example, the UCanAccess driver uses URIs of the form jdbc:ucanaccess://<path to database file>.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you must perform the Scan Files process. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Microsoft Access data source.

Microsoft SQL Server data source

Perform the following steps to add Microsoft SQL Server as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft SQL Server in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

Select Credentials or URI as a configuration method.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

User Name

The username used to authenticate to the Microsoft SQL Server database. This field is optional when using Secret Manager Key.

Password

The password associated with the specified SQL Server username. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains SQL Server credentials (username and password). When provided, Data Catalog retrieves credentials securely from Secret Manager, eliminating the need for manual credential entry.

Role

The IAM role that grants Data Catalog permission to read secret versions from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the SQL Server credentials are stored in Secret Manager. Data Catalog uses this region to access the appropriate Secret Manager endpoint.

Host (Only for the Credentials method)

The address of the machine where the Microsoft SQL database server is running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port number on which the Microsoft SQL Server is listening for incoming connections. The default port is 1433.

URI (Only for the URI method)

The connection string used to connect to the SQL Server database. The URI must include the server address, database name, and any required parameters. For example: Server=myServerAddress;Database=myDatabase;User Id=myUsername;Password=myPassword;Port=1433;Integrated Security=False;Connection Timeout=30;.

Database Name

The name of the database within the Microsoft SQL server that you want to connect with.
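You can optionally verify the server, database, and credentials before entering them above. The following is a minimal sketch using Python with the pyodbc package; the server, database, credentials, and ODBC driver name are placeholders that depend on what is installed on your machine.

    import pyodbc

    # Placeholder connection values; the ODBC driver name must match one installed locally.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myServerAddress,1433;"
        "DATABASE=myDatabase;"
        "UID=myUsername;PWD=myPassword;"
        "TrustServerCertificate=yes;"
    )

    # A trivial query to confirm the session works.
    print(conn.execute("SELECT @@VERSION").fetchone()[0])
    conn.close()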

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you must perform the Scan Files process. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Microsoft SQL Server data source.

MySQL data source

Perform the following steps to add MySQL as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select MySQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This setting specifies which agents should be associated with the data source in a multi-agent deployment. The only option is Default.

Configuration method

The method used to configure the connection. Select Credentials or URI.

Driver

The standard used to establish communication between the application and the database.

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

User Name

The username that provides access to the specified MySQL database. This field is optional when using Secret Manager Key.

Password

The password associated with the specified MySQL username. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains MySQL credentials (username and password). When provided, Data Catalog retrieves these credentials securely from Secret Manager, overriding the Username and Password fields.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the MySQL secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host (Only for the Credentials method)

The address of the machine where the MySQL database server is running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port number on which the MySQL server is listening for incoming connections.

URI (Only for the URI method)

The JDBC connection string to access the MySQL database. The URI must include the server address, port, database name, and any required parameters. Example: jdbc:mysql://myserver.example.com:3306/mydatabase?useSSL=false&connectTimeout=30000
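You can optionally verify the host, port, and credentials before entering them above. The following is a minimal sketch using Python with the mysql-connector-python package; all connection values are placeholders.

    import mysql.connector

    # Placeholder connection values; match the Host, Port, User Name, and Password fields above.
    conn = mysql.connector.connect(
        host="myserver.example.com",
        port=3306,
        user="catalog_user",
        password="secret",
    )

    cur = conn.cursor()
    cur.execute("SELECT VERSION()")  # a trivial query to confirm the session works
    print(cur.fetchone()[0])
    conn.close()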

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you must perform the Scan Files process. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, using either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the MySQL data source.

NFS data source

Network File System (NFS) is a distributed file system protocol that enables remote file access over Unix and Linux networks. You can create an NFS data source with a local file system path by mounting the data as a local file system on either a remote or local agent. Furthermore, you can easily add data to Data Catalog from Hitachi Network Attached Storage (HNAS) and NetApp data storage.

This protocol uses a client-server model where the server provides the shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to the Data Catalog from any file-sharing network system if it is transferable using NFS.

Perform the following steps to add NFS as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select NFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

By default, it is URI.

URI

URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like nfs://server.example.com.

Path

NFS path to access the data source. For example, the path would look like nfs:/share/data.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, the data source is included in Data Optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source; when turned on, it also enables migration.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details, such as currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the NFS data source.

Okta as a data source

Okta is an identity and access management (IAM) service that helps organizations get a clear view of which users have access to which applications. By adding Okta as a data source in Data Catalog, you can automatically import a list of applications and see who is allowed to use them. This makes it easier to manage access, track ownership, and identify users who no longer need access. It also helps ensure that only authorized personnel can view and use sensitive data, supporting compliance and security goals across the organization.

Generate Okta credentials for Data Catalog

Perform the following steps to generate the Okta credentials needed to add Okta as a data source in Data Catalog.

  1. Log in to your Okta organization as a user with administrative privileges.

  2. In the Admin Console, go to Applications > Applications, and then click Create App Integration.

    The Create a new app integration page appears.

  3. Select API Services as the Sign-in method, and then click Next.

  4. Enter a name for the PDC app integration and click Save.

    The app's main page appears.

  5. From the service app page, select the Okta API Scopes tab and grant the necessary scopes:

    • okta.apps.read

    • okta.groups.read

    • okta.users.read

  6. In the Admin Roles tab, assign the role Read Only Administrator, then click Save Changes.

  7. In the General tab, edit the Client Credentials, set the Client authentication type to Public key / Private key, and add or generate a public/private key pair.

Download or copy the private key file (.pem) and note the Client ID from the saved Client Credentials section. You need both when creating the data source in Data Catalog.

You have successfully generated the Okta credentials for Data Catalog.

Proceed to add Okta as a data source in Data Catalog.

Add Okta as a data source

Perform the following steps to add Okta as a data source in Data Catalog:

Prerequisites:

You have generated the Okta credentials for Data Catalog, including the Client ID and the private key file.

Procedure:

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Okta in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Domain

The organization’s Okta domain (for example, https://yourcompany.okta.com).

Client ID

The Client ID generated from your Okta app integration.

Private Key Path

The private key used for authentication with Okta.

Click Manage Key Paths to upload or manage keys.

Ensure that the key is correctly configured in Okta and that the app integration has the necessary scopes and roles set in the Okta Admin Console.
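If you want to confirm the domain, Client ID, scopes, and private key before adding the data source, one option is the Okta Python SDK, which supports private key authentication. This is an illustrative sketch only; the domain, client ID, and key path are placeholders, and the exact privateKey format accepted by your SDK version may differ, so consult the SDK documentation.

    import asyncio
    from okta.client import Client as OktaClient

    config = {
        "orgUrl": "https://yourcompany.okta.com",  # your Okta domain
        "authorizationMode": "PrivateKey",
        "clientId": "YOUR_CLIENT_ID",
        "scopes": ["okta.apps.read", "okta.groups.read", "okta.users.read"],
        "privateKey": "/path/to/private_key.pem",  # key from the app integration
    }

    async def main():
        client = OktaClient(config)
        apps, resp, err = await client.list_applications()
        if err:
            print("error:", err)
        else:
            print("visible applications:", len(apps))

    asyncio.run(main())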

6. Click Test Connection to test your connection to the specified data source.

  7. Click Create Data Source to establish your data source connection.

  8. Click Import Applications.

    This process loads all the groups and applications associated with the Okta service in the Application section. For more information, see Applications in the Use Pentaho Data Catalog document.

    You can also monitor the status of the job on the Workers page.

You have successfully created a connection to Okta as a data source in Data Catalog.

After the Import Applications job completes, click Applications in the left navigation menu to view the imported hierarchy.

Note: The imported details are read-only. To sync the latest data from the Okta service, you must rerun the Import Applications job for the Okta data source. Any edits made by Data Catalog users to the imported assets will be overwritten during the next import.

The root level displays the name of the Okta data source, and the next level shows the groups retrieved from the Okta service. When you expand a group, all applications associated with that group appear below it. If an application is not part of any group in Okta, Data Catalog creates a group named Default and places such applications in it. If the same application belongs to multiple groups, it appears under each group. Each appearance is treated as a unique combination in the hierarchy view. For more information, see the Applications section in the Use Pentaho Data Catalog document.

OneDrive or SharePoint data source

SharePoint and OneDrive in Microsoft 365 are cloud-based services that help organizations share and manage content, knowledge, and applications with seamless collaboration.

Perform the following steps to configure your OneDrive or SharePoint site as a data source within Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft OneDrive or SharePoint in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information to access Microsoft OneDrive or SharePoint.

Field
Description

Affinity

The Default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Shared Key (default)

Application (client) ID

A unique identifier assigned to an application that has been registered in Azure Active Directory (Azure AD).

Client Secret

Password credentials to access data on the OneDrive or SharePoint site.

Tenant ID

A unique identifier of the Azure Active Directory (Azure AD) tenant that hosts the OneDrive or SharePoint site.

Path

Folder where this data source is included.

  • Use / to scan all users' OneDrive and SharePoint sites from the root level, and use /<folder path>/ for a specific directory.

  • Use /users/<username>/ for user-specific OneDrive.

  • Use /sites/ for the root of the SharePoint sites and /sites/<SharePoint site path>/ for a specific SharePoint site.
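
For illustration, the following minimal Java sketch checks a Path value against the documented forms. The regular expressions are one interpretation of those rules, not the product's validation logic.

```java
import java.util.List;
import java.util.regex.Pattern;

// Checks a Path value against the documented forms; sample paths only.
public class OneDrivePathCheck {
    private static final List<Pattern> ALLOWED = List.of(
            Pattern.compile("^/$"),                 // root: all OneDrive and SharePoint sites
            Pattern.compile("^/users/[^/]+/$"),     // a specific user's OneDrive
            Pattern.compile("^/sites/([^/]+/)*$"),  // SharePoint root or a site path
            Pattern.compile("^/([^/]+/)+$"));       // any specific directory

    public static void main(String[] args) {
        for (String path : List.of("/", "/users/jdoe/", "/sites/finance/", "no-slash")) {
            boolean ok = ALLOWED.stream().anyMatch(p -> p.matcher(path).matches());
            System.out.println(path + " -> " + (ok ? "valid" : "invalid"));
        }
    }
}
```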

6. Click Test Connection to test your connection to the specified data source. Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system, and you can monitor the status of the file scan on the Workers page.

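For illustration, the following minimal Java sketch shows include and exclude filtering of the kind the Scan Files dialog performs, using glob patterns. The patterns and file names are assumptions for demonstration only.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

// Include/exclude filtering in the spirit of the Scan Files dialog.
public class ScanFilterSketch {
    public static void main(String[] args) {
        PathMatcher include = FileSystems.getDefault().getPathMatcher("glob:**/*.csv");
        PathMatcher exclude = FileSystems.getDefault().getPathMatcher("glob:**/tmp/**");

        for (String name : List.of("data/sales/q1.csv", "data/tmp/scratch.csv", "logs/app.log")) {
            Path p = Path.of(name);
            // A file is scanned when it matches an include pattern and no exclude pattern.
            boolean scanned = include.matches(p) && !exclude.matches(p);
            System.out.println(name + " -> " + (scanned ? "scan" : "skip"));
        }
    }
}
```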

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to OneDrive or SharePoint as a data source.

Oracle data source

Perform the following steps to add Oracle as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Oracle in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: The address of the machine where the Oracle database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the Oracle server is listening for incoming connections.

  • Database Name: The name of the database within the Oracle server that you want to connect with.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: A service URL that looks like jdbc:oracle:thin:@oracle.example.com:1521/mydb.

Driver

Whether you select Credentials or URI as the configuration method, you must specify a driver. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
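
For illustration, the following minimal Java sketch performs the kind of connectivity check that Test Connection runs, using the example URI above. It assumes the Oracle JDBC driver (ojdbc) is on the classpath; the credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Connectivity check using the example URI from the table above.
public class OracleConnectionCheck {
    public static void main(String[] args) throws Exception {
        String uri = "jdbc:oracle:thin:@oracle.example.com:1521/mydb";
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}
```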

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Oracle database as a data source.

PostgreSQL data source

Perform the following steps to add PostgreSQL as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select PostgreSQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information to access the PostgreSQL data source.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as a configuration method.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

User Name

The username used to authenticate to the PostgreSQL database. This field is optional when using Secret Manager Key.

Password

The password associated with the PostgreSQL user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the PostgreSQL credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager instead of requiring manual entry.

Role

The IAM role that grants Data Catalog permission to read secret versions from Secret Manager. This role must include permissions such as secretmanager.versions.access.

Region

The region where the PostgreSQL secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host (Only for the Credentials method)

The address of the machine where the PostgreSQL server is running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port on which the PostgreSQL server listens for incoming connections. The default PostgreSQL port is 5432.

Database Name (Only for the Credentials method)

The name of the database or schema within the PostgreSQL server that you want to connect with.

URI (Only for the URI method)

A unique identifier to locate the data source. The connection string itself should include the database name. For example, the URL would look like jdbc:postgresql://localhost:<port_no>/<database_name>.
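
For illustration, the following minimal Java sketch retrieves a stored credential secret as described for the Secret Manager Key field. Because the secretmanager.versions.access permission named above matches Google Secret Manager, the sketch uses the google-cloud-secretmanager client; the project ID, secret name, and payload layout are assumptions, and other secret managers will differ.

```java
import com.google.cloud.secretmanager.v1.AccessSecretVersionResponse;
import com.google.cloud.secretmanager.v1.SecretManagerServiceClient;
import com.google.cloud.secretmanager.v1.SecretVersionName;

// Retrieves a stored credential secret. The project ID and secret name are
// placeholders; the caller needs the secretmanager.versions.access
// permission named in the Role row above.
public class SecretBackedCredentials {
    public static void main(String[] args) throws Exception {
        try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
            SecretVersionName name =
                    SecretVersionName.of("my-project", "pdc-postgres-credentials", "latest");
            AccessSecretVersionResponse response = client.accessSecretVersion(name);
            String secret = response.getPayload().getData().toStringUtf8();
            // Typically a JSON document holding username and password; parse it
            // and hand the values to the JDBC connection instead of typing them in.
            System.out.println("Retrieved secret of length " + secret.length());
        }
    }
}
```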

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.
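
For illustration, the following minimal Java sketch applies a starts-with search and then excludes system schemas, mirroring the selection practice described in this step. pg_catalog and information_schema are standard PostgreSQL system schemas; the other names are sample data.

```java
import java.util.List;
import java.util.regex.Pattern;

// A starts-with search followed by a system-schema exclusion.
public class SchemaFilterSketch {
    public static void main(String[] args) {
        List<String> schemas =
                List.of("public", "sales", "sales_archive", "pg_catalog", "information_schema");
        Pattern search = Pattern.compile("^sales");                    // "starts with sales"
        Pattern system = Pattern.compile("^(pg_|information_schema)"); // system schemas

        schemas.stream()
                .filter(s -> search.matcher(s).find())
                .filter(s -> !system.matcher(s).find())
                .forEach(s -> System.out.println("ingest: " + s));
    }
}
```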

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the PostgreSQL data source.

SAP HANA data source

Perform the following steps to add SAP HANA as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select SAP HANA in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as a configuration method.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

User Name

The username required to authenticate to SAP HANA when using URI mode. Optional when using Secret Manager Key.

Password

The password associated with the SAP HANA user account. Optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains SAP HANA credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager, overriding manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. This role must include permissions such as secretmanager.versions.access.

Region

The region where the SAP HANA secret is stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host (Only for the Credentials method)

The address of the physical or virtual machine (server) where an instance of SAP HANA is installed and running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port number on which the SAP HANA database server is listening for incoming connections.

URI (Only for the URI method)

The JDBC connection string used to connect to SAP HANA. It must specify the host, port, database name, and optional credentials. For example, the URL would look like jdbc:sap://localhost:<port_no>/<database_name>?user=<user>&password=<password>.

Database Name

The name of the database within the SAP HANA server that you want to connect with.
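
For illustration, the following minimal Java sketch opens a connection to SAP HANA. It uses the databaseName query-parameter form, which the SAP HANA JDBC driver commonly accepts; depending on the driver version, the exact URI shape may differ from the example above. Host, port, database, and credentials are placeholders, and the driver (ngdbc) is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Connectivity check for SAP HANA with placeholder values.
public class HanaConnectionCheck {
    public static void main(String[] args) throws Exception {
        String uri = "jdbc:sap://hana.example.com:30015/?databaseName=MYDB";
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected to " + conn.getMetaData().getDatabaseProductName());
        }
    }
}
```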

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the SAP HANA data source.

Salesforce data source

Salesforce is a cloud-based customer relationship management (CRM) platform that provides organizations with a unified platform to manage customer data, sales processes, marketing campaigns, support interactions, and other key business functions. By integrating Salesforce as a data source within Data Catalog, you can access and manage metadata from Salesforce. It enables data discovery to search, explore, and understand Salesforce data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Salesforce as a data source in Data Catalog:

Before you begin

  • Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  • Configure trusted IP ranges in Salesforce to trust login requests coming from PDC IP addresses, eliminating the need for a security token.

    1. Sign in to your Salesforce organization (Production or Sandbox).

    2. In Setup, search for Network Access, or navigate to Security Controls > Network Access.

    3. Obtain the IP address of your Pentaho Data Catalog server.

    4. Add this IP address as a new Trusted IP Range.

      • If your deployment uses a single server, enter the same IP address in both the Start IP Address and End IP Address fields.

  • Ensure that the Salesforce user account used for Data Catalog integration has sufficient privileges to access the required objects and metadata. If not, grant the View All Data permission:

    1. In Salesforce, go to Setup.

    2. Navigate to Users > Users.

    3. Locate and select the user account used for the PDC integration.

    4. On the User Details page, click the Profile link and then click Edit.

    5. In the Administrative Permissions section, select the View All Data checkbox and click Save.

Procedure

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Salesforce in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that communication between the application and the database is efficient, secure, and compliant with the required standards.

CAUTION: Don’t change the driver for the Salesforce data source type. Changing it might disrupt the connection and cause unexpected behavior.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

Username

The Salesforce login username associated with the Salesforce account. When Secret Manager Key is configured, the User Name is optional.

Password

The password of the Salesforce account to authenticate the connection. When Secret Manager Key is configured, the Password is optional.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the Salesforce connection credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager, overriding manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. The role must include permissions such as secretmanager.versions.access.

Region

The region where the Salesforce credentials are stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host

The domain or endpoint of the Salesforce instance you are connecting to.

Port

The port number to connect to the Salesforce instance.
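
The following Java sketch is a hypothetical connectivity check only: JDBC URL schemes for Salesforce vary by driver, so the jdbc:salesforce:// scheme and port below are placeholders, and the bundled Default driver defines the actual format.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Hypothetical sketch: the URL scheme is a placeholder, because the bundled
// Default driver defines the real Salesforce JDBC URL format.
public class SalesforceConnectionCheck {
    public static void main(String[] args) throws Exception {
        String uri = "jdbc:salesforce://login.salesforce.com:443"; // hypothetical scheme
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected as " + conn.getMetaData().getUserName());
        }
    }
}
```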

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Salesforce as a data source.

SMB/CIFS data source

Server Message Block (SMB) and Common Internet File System (CIFS) are Windows file-sharing protocols used in storage systems. You can add data to Data Catalog from an SMB or CIFS share through either a remote agent or the local agent, creating an SMB or CIFS data source with a local file system path.

This protocol uses a client-server model where the server provides a shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any file-sharing network system that supports transfer over SMB or CIFS.

Perform the following steps to add SMB or CIFS as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select SMB or CIFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The default configuration method is URI.

URI

The URI used to identify and locate the SMB or CIFS server on the network. For example, the URI would look like smb://server.example.com or cifs://server.example.com.

Domain

The domain name, if the SMB or CIFS server is part of a Windows domain.

Path

The path on the SMB or CIFS share used to access the data source. For example, the path would look like smb://server/path/to/resource or cifs://server/path/to/resource.

Username/Password

Credentials that provide access to the SMB or CIFS resource.
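
For illustration, the following minimal Java sketch lists the top of a share that has been mounted on the agent host, reflecting the client-server model described above in which mounted files behave like a local disk. The mount point is a hypothetical path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Lists the top of a mounted SMB or CIFS share.
public class SharePreview {
    public static void main(String[] args) throws IOException {
        Path mountPoint = Path.of("/mnt/fileshare"); // hypothetical mount of the share
        try (var entries = Files.list(mountPoint)) {
            entries.limit(10).forEach(p ->
                    System.out.println((Files.isDirectory(p) ? "dir  " : "file ") + p));
        }
    }
}
```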

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. The Scan Files dialog box appears. Here you can refine metadata ingestion using Include and Exclude Patterns. This feature enables you to specify which folders or files should be scanned or excluded. By applying these filters, you can shorten scan duration and control the scope of metadata ingestion. For more information, see the feature walkthrough Include or exclude patterns. This process loads files and folders into the system, and you can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) Configure the following options for the data source.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

    Cost per Terabyte

    Menu to select currency and text field to enter the price per terabyte.

    Total Capacity

    Field to enter the total capacity of the data source in terabytes.

  9. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  10. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the SMB or CIFS resource as a data source.

Snowflake data source

Perform the following steps to add Snowflake as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Snowflake in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

Username

The Snowflake login username associated with the Snowflake account. This field is optional when using Secret Manager Key.

Password

The password for the Snowflake user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains Snowflake credentials (username and password). When provided, Data Catalog retrieves credentials securely from Secret Manager instead of requiring manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. This role must include permissions such as secretmanager.versions.access.

Region

The region where the Snowflake credentials are stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Host

The address of the machine where the Snowflake database server is running. It can be an IP address or a domain name.

Port

The port number to connect to the Snowflake data source.

Database Name

The name of the database within Snowflake that you want to connect with.

Warehouse

The name of the Snowflake virtual warehouse to use for executing queries, loading data, and performing compute operations.
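
For illustration, the following minimal Java sketch connects with the fields above, passing the database and warehouse as standard Snowflake JDBC connection properties. The account host and all values are placeholders; the Snowflake JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// Connectivity check for Snowflake with placeholder values.
public class SnowflakeConnectionCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "username");
        props.put("password", "password");
        props.put("db", "ANALYTICS");         // Database Name field
        props.put("warehouse", "COMPUTE_WH"); // Warehouse field
        String uri = "jdbc:snowflake://myaccount.snowflakecomputing.com";
        try (Connection conn = DriverManager.getConnection(uri, props)) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}
```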

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Snowflake data source.

Sybase data source

Sybase is a relational database management system (RDBMS) used for data warehousing, business intelligence, and enterprise applications. By integrating Sybase as a data source within Data Catalog, you can access and manage metadata from the Sybase database. It enables data discovery to search, explore, and understand Sybase data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Sybase as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Sybase in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is URI.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver.

URI

URIs are used to access and manage various objects and services within the Sybase environment. For example, the URI would look like jdbc:sybase:tds:<hostname>:<port>?ServiceName=<dbname>

Username

The username required to authenticate to the Sybase database. This field is optional when using Secret Manager Key.

Password

The password associated with the Sybase user account. This field is optional when using Secret Manager Key.

Secret Manager Key

The name or identifier of the secret stored in Secret Manager that contains the Sybase credentials (username and password). When provided, Data Catalog retrieves the credentials securely from Secret Manager, overriding manual entry.

Role

The IAM role that grants Data Catalog permission to read secrets from Secret Manager. This role must include permissions such as secretmanager.versions.access.

Region

The region where the Sybase credentials are stored in Secret Manager. Data Catalog uses this region to access the correct Secret Manager endpoint.

Database Name

The name of the database within the Sybase environment that contains the data you want to access.
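
For illustration, the following minimal Java sketch opens a connection using the URI shape from the table above. The host, port, and database are placeholders, and a Sybase JDBC driver such as jConnect is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Connectivity check using the URI shape from the table above.
public class SybaseConnectionCheck {
    public static void main(String[] args) throws Exception {
        String uri = "jdbc:sybase:tds:dbserver.example.com:5000?ServiceName=sales";
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductName());
        }
    }
}
```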

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Sybase data source.

Vertica data source

Perform the following steps to add Vertica as a data source in Data Catalog:

Refer to the Component Reference section in the Install Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

    Field
    Description

    Data Source Name

    Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

    Data Source ID (Optional)

    Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

    Description (Optional)

    Specify a description of your data source.

4. After you have specified the basic connection information, select Vertica in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as a configuration method.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a driver, click Manage Drivers, select Add New, upload the driver, and click Add Driver.

Driver Class Name

The fully qualified Java class name of the JDBC driver used to connect to Vertica.

User Name

The username that provides access to the Vertica database. If using Secret Manager Key, this field becomes optional.

Password

The password that provides access to the Vertica database. If using Secret Manager Key, this field becomes optional.

Secret Manager Key

The identifier of the secret stored in the Secret Manager (AWS Secrets Manager or GCP Secret Manager). When provided, the system retrieves database credentials securely from the secret instead of using manually entered Username and Password.

Role

The IAM role used to access the secret in the Secret Manager. Required when using role-based access to retrieve credentials.

Region

The cloud region where the Secret Manager key is stored.

Host (Only for the Credentials method)

The address of the physical or virtual machine (server) where an instance of the Vertica database software is installed and running. It can be an IP address or a domain name.

Port (Only for the Credentials method)

The port number on which the Vertica server is listening for incoming connections.

URI (Only for the URI method)

The JDBC connection string used to establish a connection with Vertica. For example, the URL would look like jdbc:vertica://<hostname>:<port>/<database>?user=<username>&password=<password>.

Database Name

The name of the database within the Vertica server that you want to connect with.
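
For illustration, the following minimal Java sketch performs the kind of connectivity check that Test Connection runs, using the URI shape shown above. All values are placeholders; the Vertica JDBC driver is assumed to be on the classpath, and 5433 is the customary Vertica port.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Connectivity check using the URI shape shown above.
public class VerticaConnectionCheck {
    public static void main(String[] args) throws Exception {
        String uri = "jdbc:vertica://vertica.example.com:5433/analytics";
        try (Connection conn = DriverManager.getConnection(uri, "username", "password")) {
            System.out.println("Connected: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}
```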

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema. The Select schemas for ingestion dialog opens. You can search for schemas using the search bar at the top, with either a starts-with match or a regular expression. Once you have located the relevant schemas, select them, and then click Ingest to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Vertica data source.
