Adding a data source

If your role has permission to administer data sources, you can add and edit data sources.

The number of data sources you can add is determined by your license agreement. You receive a message when you have reached 75% of your data source creation quota.

If you have reached the limit of data sources allowed by your license agreement, the Add Data Source button on the Resources card is unavailable, and a message appears when you hover your cursor over the button.
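The quota behavior described above can be sketched as a small check. This is only an illustration: the function name `quota_status`, the status strings, and the way the 75% threshold is computed are assumptions made for the example, not Data Catalog's actual implementation.

```python
def quota_status(used: int, limit: int, warn_ratio: float = 0.75) -> str:
    """Classify data source usage against a licensed limit (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be a positive number")
    if used >= limit:
        # At the limit, the Add Data Source button would be unavailable.
        return "limit_reached"
    if used >= warn_ratio * limit:
        # At or above 75% of the quota, a message would be shown.
        return "warning"
    return "ok"
```

For example, with a license that allows 20 data sources, the sketch reports a warning from the 15th data source onward.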

Note: If you encounter an error while connecting to a data source, refer to the documentation of the specific data source provider for more information about the error.

Active Directory as a data source

Data Catalog supports integration with both Windows-based Active Directory (AD) and Azure Active Directory (Azure AD). You can add Active Directory as a data source to import file system security identifiers (SIDs), GUIDs, and security descriptors, and map them to user identities. With this integration, Data Catalog displays the ownership and group information for files and folders from SMB, CIFS, and NFS data sources in Data Canvas.

Perform the following steps to add Active Directory as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Active Directory in the Data Source Type field. Note: Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Host

The fully qualified domain name (FQDN) or IP address of the Active Directory server.

Port

The port number used to connect to the Active Directory server. The default port is usually 389 for LDAP or 636 for LDAPS (secure LDAP).

Domain

The domain name associated with the Active Directory environment.

User Name

The user name of an account that has permission to query Active Directory. Include the domain in the user name if you have not specified it in the Domain field.

Password

The password associated with the username. This credential is used to authenticate the connection to the AD server.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Create Data Source to establish your data source connection.

  8. Click Import Users.

    This process imports file system security identifiers (SIDs), GUIDs, and security descriptors, and maps them to user identities, which helps to display ownership and group information for files and folders from SMB, CIFS, and NFS data sources in Data Canvas. You can also monitor the status of the job on the Workers page.

You have successfully created a connection to Active Directory as a data source in Data Catalog.
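The relationship between the Host and Port fields and the LDAP or LDAPS protocol can be sketched as follows. This is a minimal illustration, assuming you want to check the same endpoint with another LDAP tool before clicking Test Connection. The function name `ldap_url` and the host `ad.example.com` are hypothetical, and Data Catalog itself takes the host and port as separate fields rather than as a URL.

```python
from typing import Optional

# Conventional default ports named in the Port field description.
LDAP_PORT = 389
LDAPS_PORT = 636


def ldap_url(host: str, use_tls: bool = True, port: Optional[int] = None) -> str:
    """Build an LDAP or LDAPS URL, falling back to the conventional default port."""
    scheme = "ldaps" if use_tls else "ldap"
    if port is None:
        port = LDAPS_PORT if use_tls else LDAP_PORT
    return f"{scheme}://{host}:{port}"


# Hypothetical host name, used only for illustration.
print(ldap_url("ad.example.com"))
print(ldap_url("ad.example.com", use_tls=False))
```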

After completing the Import Users job, you can run the Metadata Ingest process for SMB, CIFS, and NFS data sources to see the user information in the Properties panel. For more information, see the Processing unstructured data topic in the Explore your data section in the Use Pentaho Data Catalog document. Additionally, you can see the list of users who have access to a particular file or folder. For more information, see the Users with Access topic in the Use Pentaho Data Catalog document.
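The naming rule for the Data Source Name field (start with a letter, then only letters, digits, and underscores, with no spaces) can be checked before you fill in the form. The following is a sketch under one assumption: that "letters" means ASCII letters, which the documentation does not state explicitly. The helper name `is_valid_data_source_name` is hypothetical.

```python
import re

# Assumption: "letters" means ASCII A-Z and a-z.
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name starts with a letter and uses only letters, digits, and underscores."""
    return _NAME_RE.fullmatch(name) is not None
```

A name such as `corp_ad_2024` passes this check, while `2024_ad` (starts with a digit) and `corp ad` (contains a space) do not.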

DynamoDB data source

Perform the following steps to add Amazon DynamoDB as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select DynamoDB in the Data Source Type field. Note: Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region

Geographical location where AWS maintains a cluster of data centers.

Access Key and Secret Access Key

AWS Access Key ID and Secret Access Key that are used for authentication and authorization when interacting with DynamoDB.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the Amazon DynamoDB data source.
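Before you fill in the Region, Access Key, and Secret Access Key fields, it can help to confirm that you have all three values at hand. The sketch below assumes the values are kept in the environment variables that the AWS CLI and SDKs conventionally read. Data Catalog is not stated to read these variables; the helper name `missing_aws_settings` is hypothetical and the check is only a convenience.

```python
import os
from typing import List, Mapping

# Conventional AWS environment variable names, mapped to the form fields:
# Access Key, Secret Access Key, and Region.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means all three values are available to copy into the form.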

Amazon Redshift data source

Amazon Redshift is a fully managed, petabyte-scale data warehouse solution offered by Amazon Web Services (AWS). It allows users to run complex queries and perform real-time data analytics on large datasets by utilizing massively parallel processing (MPP) technology. By integrating Amazon Redshift as a data source within Data Catalog, you can access and manage metadata from the Amazon Redshift database. The integration enables data discovery so that you can search, explore, and understand Amazon Redshift data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Amazon Redshift as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the connection information, select Redshift in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as the configuration method.

Configuration Method: Credentials

  • Username/Password: The credentials that provide access to the Amazon Redshift database.

  • Host: The address of the machine where the Amazon Redshift database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the Amazon Redshift server is listening for incoming connections.

Configuration Method: URI

  • Username/Password: The credentials that provide access to the Amazon Redshift database.

  • URI: URIs are used to access and manage various objects and services within the Amazon Redshift environment. For example, the URI would look like jdbc:redshift://<host>:<port>/<database-name>

Driver

The Credentials and URI configuration methods both require a driver. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Database Name

The name of the database within the Amazon Redshift server that you want to connect with.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Amazon Redshift as a data source in Data Catalog.
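If you use the URI configuration method, the connection string combines the same host, port, and database name that the Credentials method asks for separately. The following sketch assembles such a string. The helper name `redshift_jdbc_uri` and the example cluster address are hypothetical, and the default of 5439 is an assumption based on the port Amazon Redshift clusters commonly use; confirm the port of your own cluster.

```python
def redshift_jdbc_uri(host: str, database: str, port: int = 5439) -> str:
    """Assemble a Redshift JDBC URI from host, port, and database name."""
    if not host or not database:
        raise ValueError("host and database are both required")
    return f"jdbc:redshift://{host}:{port}/{database}"


# Hypothetical cluster endpoint and database name, for illustration only.
print(redshift_jdbc_uri("examplecluster.abc123.us-east-1.redshift.amazonaws.com", "dev"))
```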

AWS S3 data source

Perform the following steps to add AWS S3 as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select AWS S3 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region

Geographical location where AWS maintains a cluster of data centers.

Endpoint

Location of the bucket. For example, s3.<region containing S3 bucket>.amazonaws.com

Access Key

User credential to access data on the bucket.

Secret Access Key

Password credential to access data on the bucket.

Bucket Name

The name of the S3 bucket in which the data resides. For S3 access from non-EMR file systems, Data Catalog uses the AWS command line interface to access S3 data.

These commands send requests using access keys, which consist of an access key ID and a secret access key. You must specify the logical name for the cluster root.

This value is defined by dfs.nameservices in the hdfs-site.xml configuration file. For S3 access from AWS S3 and MapR file systems, you must identify the root of the MapR file system with maprfs:///.

Path

Directory where this data source is included.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the AWS S3 data source.
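The Endpoint value follows the pattern s3.<region>.amazonaws.com given in the field description, so it can be derived from the Region value. The sketch below does that with a loose format check on the region code. The helper name `s3_endpoint` and the regular expression are assumptions for illustration; the pattern applies to the standard AWS partition, and other partitions use different domain suffixes.

```python
import re

# Loose shape check for region codes such as "us-east-1" or "ap-southeast-2".
_REGION_RE = re.compile(r"[a-z]{2}(?:-[a-z]+)+-\d+")


def s3_endpoint(region: str) -> str:
    """Return the regional S3 endpoint host name for a region code."""
    if not _REGION_RE.fullmatch(region):
        raise ValueError(f"unexpected region format: {region!r}")
    return f"s3.{region}.amazonaws.com"
```

For example, the region `eu-west-1` maps to the endpoint `s3.eu-west-1.amazonaws.com`.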

Azure Blob Storage data source

Perform the following steps to add Azure Blob Storage as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Azure Blob Storage in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

A Shared Key (default)

Account Name

Name of the Azure storage account that contains all of your Azure Storage data objects.

Shared Key

A password-like credential that gives full access to an Azure storage account's data and configuration.

Container

The top-level object that logically groups blob data. A container can hold an unlimited number of blobs.

Path

Folder where this data source is included.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the Azure Blob Storage data source.
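To see how the Account Name, Container, and Path fields fit together, the sketch below composes the web address under which the same location is reachable in the public Azure cloud. This is an illustration only: the helper name `blob_location_url` and the example values are hypothetical, sovereign or private clouds use a different domain suffix, and Data Catalog takes the three values as separate fields rather than as a URL.

```python
def blob_location_url(account: str, container: str, path: str = "") -> str:
    """Compose the public-cloud Blob Storage URL for a container and optional folder path."""
    base = f"https://{account}.blob.core.windows.net/{container}"
    cleaned = path.strip("/")  # tolerate leading or trailing slashes in the Path value
    return f"{base}/{cleaned}" if cleaned else base


# Hypothetical account, container, and folder names.
print(blob_location_url("mystorageacct", "reports", "/2024/q1/"))
```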

Denodo data source

Perform the following steps to add Denodo as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Denodo in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

By default, the configuration method is URI, because the connection is configured using a URL.

  • Username and Password: Credentials associated with your Denodo account to log in and access the Denodo environment.

  • URI: URIs are used to access and manage various objects and services within the Denodo environment. For example, the URI would look like jdbc:vdb://<denodo-host>:<port>/<database-name>?publishCatalogsAsSchemas=true

    Example: jdbc:vdb://ec2-1-2-3-4.compute-1.amazonaws.com:49999/mydatabase?publishCatalogsAsSchemas=true

  • Database Name: The name of the database within the Denodo environment that contains the data you want to access.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, and click Add New, upload the driver, and then click Add Driver.

6. Click Test Connection to test your connection to the specified data source.

    A Test Connection confirmation message window opens.

    Note: Before you finalize and save the new data source configuration, you must run the Scan Files process. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Denodo data source.
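The Denodo URI format shown in the URI field description can be assembled from its parts as in the following sketch. The helper name `denodo_jdbc_uri` and its parameters are hypothetical; the host, port, and database values reuse the example from the field description.

```python
def denodo_jdbc_uri(host: str, port: int, database: str,
                    publish_catalogs_as_schemas: bool = True) -> str:
    """Assemble a Denodo JDBC URI in the format shown in the URI field description."""
    uri = f"jdbc:vdb://{host}:{port}/{database}"
    if publish_catalogs_as_schemas:
        # Query parameter taken from the documented example URI.
        uri += "?publishCatalogsAsSchemas=true"
    return uri


print(denodo_jdbc_uri("ec2-1-2-3-4.compute-1.amazonaws.com", 49999, "mydatabase"))
```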

Google Cloud Storage data source

Google Cloud Storage (GCS) is a storage service that enables the storage, retrieval, and management of unstructured data, including files, images, videos, and large datasets. By integrating GCS as a data source within Data Catalog, you can access and manage the metadata of the files stored in it. You can perform data discovery to search, explore, and understand your Google Cloud Storage data. Additionally, the integration enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Google Cloud Storage as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Google Object Store in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Bucket Name

The name of the Google Cloud Storage bucket in which the data resides.

Path

The path within the bucket where the files are stored.

Key Path

The authentication key file, which is used to connect to Google Cloud Storage (a JSON file).

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Google Cloud Storage as a data source in Data Catalog.
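Because the Key Path field expects a JSON key file, a quick sanity check of that file can save a failed Test Connection. The sketch below looks for fields that Google service account key files typically contain. The helper name `key_file_problems` and the list of expected fields are assumptions for illustration; the documentation does not state which fields Data Catalog itself requires.

```python
import json
from typing import List

# Fields typically present in a Google service account key file (assumed list).
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def key_file_problems(key_text: str) -> List[str]:
    """Return a list of problems found in the key file contents; empty means none found."""
    try:
        data = json.loads(key_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if not data.get(name)]
    if data.get("type") and data["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

To check a file on disk, read it as text and pass the contents to the function, for example `key_file_problems(open(path, encoding="utf-8").read())`, where `path` is the location you plan to enter in the Key Path field.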

Google BigQuery data source

Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for analytics. By integrating BigQuery as a data source within Data Catalog, you can access and manage metadata from the BigQuery database. The integration enables data discovery so that you can search, explore, and understand BigQuery data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add BigQuery as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Google BigQuery in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field

Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Driver

The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Host

The Google BigQuery API endpoint. By default, the host is set to https://www.googleapis.com/bigquery/v2, which communicates with BigQuery's REST API for data processing.

Port

The port number to connect to the BigQuery data source. The default port for HTTPS connections is 443.

Project

The Google Cloud project ID that contains the BigQuery datasets you want to access.

Database Name

The name of the dataset within BigQuery that you want to connect to.

Key Path

The file path to your Google Cloud service account's key (a JSON file).

Oauth Type

The authentication method to connect to BigQuery. By default, it is Service-based. It uses a service account and a key file for authentication.

Client Email

The email associated with your Google Cloud service account for service-based OAuth. The service account email is usually in the format <service-account-name>@<project-id>.iam.gserviceaccount.com.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To utilize storage optimization options, a Pentaho Data Optimizer license is required.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the BigQuery data source.
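The Project and Client Email values can usually be read from the service account key file that the Key Path field points to. The following is a minimal sketch, assuming the standard JSON layout of a Google Cloud service account key (`type`, `project_id`, `client_email`, `private_key`). The function name `read_service_account_fields` is hypothetical and is not part of Data Catalog. The project stored in the key file is the one that owns the service account, which may differ from the project that contains your datasets.

```python
import json
from pathlib import Path


def read_service_account_fields(key_path):
    """Return the project ID and client email stored in a service account key file."""
    key = json.loads(Path(key_path).read_text(encoding="utf-8"))

    # A service account key file identifies itself with this type value.
    if key.get("type") != "service_account":
        raise ValueError("not a service account key file")

    # These fields are needed for service-based authentication.
    required = ("project_id", "client_email", "private_key")
    missing = [name for name in required if not key.get(name)]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))

    return {"project": key["project_id"], "client_email": key["client_email"]}
```

Running the function against the key file before you fill in the form is a quick way to catch a wrong or truncated file before you click Test Connection.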

HCP data source

You can add data to Data Catalog from Hitachi Content Platform (HCP) by adding HCP as a data source. Perform the following steps to add HCP as a data source:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select HCP in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Region

Geographical location where HCP maintains data centers.

Endpoint

Location of the bucket, specified as a hostname or an IP address.

Access Key

The access key of the S3 credentials to access the bucket.

Secret Access Key

The secret key of the S3 credentials to access the bucket.

Bucket Name

The name of the S3 bucket in which the data resides.

Path

Directory where this data source is included.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the HCP data source.
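The naming rule for the Data Source Name field (start with a letter, then only letters, digits, and underscores) can be checked before you submit the form. The following is a minimal sketch, assuming that "letters" means ASCII letters; the function name `is_valid_data_source_name` is hypothetical and is not part of Data Catalog.

```python
import re

# One leading ASCII letter, followed by any number of letters, digits, or underscores.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name):
    """Return True if the name follows the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_data_source_name("hcp_archive_01")` returns `True`, while `is_valid_data_source_name("hcp archive")` returns `False` because of the space.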

HDFS data source

Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant storage system for big data, designed to distribute and manage large datasets across a cluster of commodity hardware within the Apache Hadoop framework. You can create a data source using HDFS with the local file system path by mounting data as a local file system to either the remote or local worker.

The HDFS protocol uses a client-server model where the server provides the shared file system, and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any file-sharing network system if it is transferable using HDFS.

Perform the following steps to add HDFS as a data source:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, specify the following additional connection information to access the HDFS data source.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

The method used to configure the connection. By default, it is URI.

HDFS Version

Select the Hadoop version of the cluster that you want to run.

URI

The URI that identifies and locates the HDFS cluster on the network. For example, the URI would look like hdfs://<name node>:8020. The <name node> address can be a variable name for high availability.

Path

HDFS directory path for the data source. It can be the root (/) or a specific high-level directory based on your access control needs. For example, the path would look like /user/demodata/.

Credential Type

Select the credential type to connect to the HDFS data source.

5. Click Test Connection to test your connection to the specified data source.

  6. Click Scan Files. This process loads files and folders into the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you will see a message in the upper corner of the screen.

  7. (Optional) In the Physical Location field, specify the physical location details of the data source.

  8. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  9. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  10. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  11. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  12. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the HDFS data source.
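A malformed value in the URI field is a common reason for a failed connection test. The following is a minimal sketch that checks a value of the form hdfs://<name node>:8020 with the Python standard library. The function name `parse_hdfs_uri` and the host names in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def parse_hdfs_uri(uri):
    """Split an HDFS URI into (host, port); port is None when the URI omits it."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError("URI must start with hdfs://")
    if not parts.hostname:
        raise ValueError("URI must contain a name node address")
    # parts.port raises ValueError itself if the port is not a valid number.
    return parts.hostname, parts.port
```

For example, `parse_hdfs_uri("hdfs://namenode.example.com:8020")` returns `("namenode.example.com", 8020)`. A high-availability name without a port, such as `hdfs://nameservice1`, returns `("nameservice1", None)`.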

IBM Db2 data source

Perform the following steps to add IBM Db2 as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select IBM DB2 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: The credentials that provide access to the specified database.

  • Host: The address of the machine where the IBM Db2 database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the IBM Db2 server is listening for incoming connections.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: The URI used to connect to the IBM Db2 database. For example, the URI would look like jdbc:db2://<HOSTNAME or IP_ADDRESS>:<PORT>/<DATABASE_NAME>.

Driver

A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Database Name

The name of the database within the IBM Db2 server that you want to connect with.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the IBM Db2 data source.
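When you use the URI configuration method, the value follows the pattern jdbc:db2://<HOSTNAME or IP_ADDRESS>:<PORT>/<DATABASE_NAME>. The following is a minimal sketch that assembles such a value from its parts. The function name `build_db2_jdbc_uri` is hypothetical, and the host, port, and database in the usage note are example values only.

```python
def build_db2_jdbc_uri(host, port, database):
    """Assemble a Db2 JDBC URI from a host, a port, and a database name."""
    if not host or not database:
        raise ValueError("host and database are required")
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:db2://{host}:{port}/{database}"
```

For example, `build_db2_jdbc_uri("db2.example.com", 50000, "SAMPLE")` returns `jdbc:db2://db2.example.com:50000/SAMPLE`. Use the port on which your own Db2 server listens.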

InfluxDB data source

InfluxDB is a time-series database designed for handling high volumes of time-stamped data from use cases such as monitoring, IoT applications, real-time analytics, and event-driven architectures. By integrating InfluxDB as a data source within Data Catalog, you can manage and use the time-series data. The integration enables data discovery so that you can search, explore, and understand InfluxDB data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add InfluxDB as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select InfluxDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Host

The address of the machine where the InfluxDB database server is running. It can be an IP address or a domain name.

Port

The port number to connect to the InfluxDB data source.

Username

The user name that provides access to the InfluxDB database.

Token

The authentication token provided by the InfluxDB instance to authenticate and authorize access to the specified data source.

Database Name

The name of the database within InfluxDB that you want to connect to.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before finalizing and saving your new data source configuration, you must perform a process called 'Scan files'. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to InfluxDB as a data source.

Local File System data source

You can add data to Data Catalog from your local file system by adding Local File System as a data source.

To access files on your local system, make the following changes to the vendor/docker-compose.yml file so that the files are accessible to the ws_default container.

  1. Open the vendor/docker-compose.yml file and add the following lines under the ws_default service.

    services:
      ws_default:
        volumes:
          - /my/path/to/file:/tmp/my-path

    You can also include a remote file share as a Local File System. As an example, refer to the following code snippet for adding cifs-share to the Local File System.

    services:
      ws_default:
        volumes:
          # Optional: mount the remote CIFS file share into the container
          - cifs-share:/cifs-share
    volumes:
      cifs-share:
        driver_opts:
          type: cifs
          o: "username=<user1>,password=<password>,file_mode=0777,dir_mode=0777"
          device: "<IP Address>"
    
  2. Save changes.

  3. Restart the ws_default container for the changes to take effect.
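Each volume entry above has the form <source>:<target>, where the target is the path that the ws_default container sees. The following is a minimal sketch that extracts the container-side path from such an entry, on the assumption that this is the value to try in the Path field of the data source. The function name `container_path` is hypothetical, and the sketch assumes Linux-style paths without drive letters.

```python
def container_path(volume_entry):
    """Return the container-side path of a '<source>:<target>[:<mode>]' volume entry."""
    parts = volume_entry.split(":")
    # The second element is the mount point inside the container.
    if len(parts) < 2 or not parts[1].startswith("/"):
        raise ValueError("expected '<source>:<absolute container path>'")
    return parts[1]
```

For example, `container_path("/my/path/to/file:/tmp/my-path")` returns `/tmp/my-path`, and `container_path("cifs-share:/cifs-share")` returns `/cifs-share`.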

Perform the following steps to identify your data source within Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Local File System in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Path

Directory where this data source is included.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files. This process loads files and folders to the system.

    You can monitor the status of the file scan on the Workers page.

    Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you see a message in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the local file system as a data source.

MariaDB data source

MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL. Perform the following steps to configure the MariaDB data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select MariaDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method. By default, it is Credentials.

Configuration Method: Credentials

  • Username/Password: The credentials that provide access to the specified database.

  • Host: The address of the machine where the MariaDB database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the MariaDB server is listening for incoming connections.

Driver

A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.

    Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created the connection to the MariaDB data source.
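If Test Connection fails, it can help to confirm that the Host and Port values are reachable from the machine where the Data Catalog worker runs. The following is a minimal sketch that attempts a plain TCP connection with the Python standard library. The function name `can_reach` is hypothetical, and the check only shows that a port accepts connections, not that the credentials or the driver are correct.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_reach("mariadb.example.com", 3306)` checks the default MariaDB port on a hypothetical host.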

Microsoft Access data source

Perform the following steps to add Microsoft Access as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft Access in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Database File: The location of the Microsoft Access database file to connect.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: The JDBC connection URL for the database. The format depends on the driver; for example, a PostgreSQL URL looks like jdbc:postgresql://localhost:<port_no>/.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Microsoft Access data source.
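The naming rule for the Data Source Name field (start with a letter, then only letters, digits, and underscores) can be checked before you type a name into the form. The following is a minimal sketch of such a check; the function name is hypothetical, and restricting letters to ASCII is an assumption, because the rule above does not say whether other alphabets are accepted.

```python
import re

# Rule from the Data Source Name field: the first character is a letter,
# followed by letters, digits, or underscores only (so no spaces).
# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the whole name satisfies the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, a name such as sales_db_2024 passes this check, while 2024_sales (starts with a digit) and sales db (contains a space) do not.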

Microsoft SQL Server data source

Perform the following steps to add Microsoft SQL Server as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft SQL Server in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method: Select Credentials or URI as a configuration method.

Configuration method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: The address of the machine where the Microsoft SQL database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the Microsoft SQL server is listening for incoming connections. The default port is 1433.

Configuration method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: The connection string for the database. For example: Server=myServerAddress;Database=myDatabase;User Id=myUsername;Password=myPassword;Port=1433;Integrated Security=False;Connection Timeout=30;

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Database Name

The name of the database within the Microsoft SQL server that you want to connect with.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Microsoft SQL Server data source.
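When you use the URI configuration method, it can help to confirm that the semicolon-delimited string contains the keys you expect before you paste it into the form. The following is a minimal sketch that splits the example string from the URI field into key/value pairs; the function name is hypothetical, and this is not how Data Catalog itself parses the value.

```python
def parse_connection_string(text: str) -> dict:
    """Split a 'Key=Value;Key=Value;' string into a dictionary."""
    pairs = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {part!r}")
        pairs[key.strip()] = value.strip()
    return pairs


# The example string shown for the URI field, with placeholder values.
example = (
    "Server=myServerAddress;Database=myDatabase;User Id=myUsername;"
    "Password=myPassword;Port=1433;Integrated Security=False;"
    "Connection Timeout=30;"
)
settings = parse_connection_string(example)
print(settings["Server"], settings["Port"])
```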

MySQL data source

Perform the following steps to add MySQL as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select MySQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This setting specifies which agents should be associated with the data source in a multi-agent deployment. The only option is Default.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database.

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, and click Add New, upload the driver, and then click Add Driver.

User Name

User name that provides access to the specified database.

Password

Password that provides access to the specified database.

Host

The address of the machine where the MySQL database server is running. It can be an IP address or a domain name.

Port

The port number on which the MySQL server is listening for incoming connections.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the MySQL data source.
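If Test Connection fails, a quick way to rule out network problems is to check whether the Host and Port values are reachable from the machine where the agent runs. The following is a minimal sketch of such a check using a plain TCP connection; the function name is hypothetical, and a successful connection only shows that something is listening on the port, not that the credentials or driver are correct.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_reach("db.example.com", 3306) checks a hypothetical host on port 3306, the port that MySQL servers conventionally listen on unless configured otherwise.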

NFS data source

Network File System (NFS) is a distributed file system protocol that enables remote file access over Unix and Linux networks. You can create an NFS data source that uses a local file system path by mounting the data as a local file system on either the remote or the local agent. You can also add data to Data Catalog from Hitachi Network Attached Storage (HNAS) and NetApp data storage.

This protocol uses a client-server model where the server provides the shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any network file-sharing system that supports NFS.

Perform the following steps to add NFS as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select NFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The default is URI.

  • URI: URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like nfs://server.example.com.

  • Path: The NFS path used to access the data source. For example, the path would look like nfs:/share/data.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Scan Files to scan the available files in the data source.

    Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.

  8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the NFS data source.
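Before entering the URI, you may want to confirm that it uses the nfs scheme and names a server. The following is a minimal sketch using the Python standard library; the function name is hypothetical, and the check reflects only the nfs://server.example.com form shown in the URI field, not any validation that Data Catalog performs.

```python
from urllib.parse import urlsplit


def split_nfs_uri(uri: str) -> tuple[str, str]:
    """Return (host, path) for an nfs:// URI, or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme != "nfs" or not parts.hostname:
        raise ValueError(f"not an NFS URI with a host: {uri!r}")
    # A URI with no path component is treated as the server root.
    return parts.hostname, parts.path or "/"


host, path = split_nfs_uri("nfs://server.example.com")
print(host, path)
```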

Okta as a data source

Okta is an identity and access management (IAM) service that helps organizations get a clear view of which users have access to which applications. By adding Okta as a data source in Data Catalog, you can automatically import a list of applications and see who is allowed to use them. This makes it easier to manage access, track ownership, and identify users who no longer need access. It also helps ensure that only authorized personnel can view and use sensitive data, supporting compliance and security goals across the organization.

Generate Okta credentials for Data Catalog

Perform the following steps to generate the Okta credentials needed to add Okta as a data source in Data Catalog.

  1. Log in to your Okta organization as a user with administrative privileges.

  2. In the Admin Console, go to Applications > Applications, and then click Create App Integration.

    The Create a new app integration page appears.

  3. Select API Services as the Sign-in method, and then click Next.

  4. Enter a name for the PDC app integration and click Save.

    The app's main page appears.

  5. From the service app page, select the Okta API Scopes tab and grant the necessary scopes:

    • okta.apps.read

    • okta.groups.read

    • okta.users.read

  6. In the Admin Roles tab, assign the role Read Only Administrator, then click Save Changes.

  7. In the General tab, edit the Client Credentials, set the Client authentication type to Public key / Private key, and add or generate a public/private key pair.

  8. Download or copy the private key file (.pem) and note the Client ID from the saved Client Credentials section. You’ll need both when creating the data source in Data Catalog.

You have successfully generated the Okta credentials for Data Catalog.

Proceed to add Okta as a data source in Data Catalog.
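For background on why both the Client ID and the private key are needed: an Okta API Services app typically authenticates by sending a short-lived JSON Web Token (JWT), signed with the private key, to the token endpoint of the Okta org in exchange for an access token. The following is a minimal sketch of the unsigned claims and the token request form under that flow. The function names and the client ID value are hypothetical, the signing step (RS256 with the .pem key) is left out because it needs a third-party JWT library, and how Data Catalog performs this exchange internally is not stated in this document.

```python
import time
import uuid

TOKEN_PATH = "/oauth2/v1/token"  # token endpoint of the Okta org authorization server


def client_assertion_claims(domain: str, client_id: str, lifetime_s: int = 300) -> dict:
    """Build the claims for the JWT that is later signed with the private key."""
    now = int(time.time())
    return {
        "iss": client_id,                        # issuer is the app's Client ID
        "sub": client_id,                        # subject is the same Client ID
        "aud": domain.rstrip("/") + TOKEN_PATH,  # audience is the token endpoint URL
        "iat": now,
        "exp": now + lifetime_s,                 # keep the assertion short-lived
        "jti": str(uuid.uuid4()),                # unique ID for this assertion
    }


def token_request_form(signed_jwt: str, scopes: list[str]) -> dict:
    """Build the form fields for a client-credentials token request."""
    return {
        "grant_type": "client_credentials",
        "scope": " ".join(scopes),
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": signed_jwt,
    }


# Hypothetical Client ID; the scopes are the ones granted in the procedure above.
claims = client_assertion_claims("https://yourcompany.okta.com", "0oa_example_client_id")
form = token_request_form("<signed JWT>", ["okta.apps.read", "okta.groups.read", "okta.users.read"])
```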

Add Okta as a data source

Perform the following steps to add Okta as a data source in Data Catalog:

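Prerequisites: You have generated the Okta credentials (the Client ID and the private key file) as described in Generate Okta credentials for Data Catalog.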

Procedure:

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Okta in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Domain

The organization’s Okta domain (for example, https://yourcompany.okta.com).

Client ID

The Client ID that is generated from your Okta app integration.

Private Key Path

The private key used for authentication with Okta.

Click Manage Key Paths to upload or manage keys.

Ensure that the key is correctly configured in Okta and that the app integration has the necessary scopes and roles set in the Okta Admin Console.

6. Click Test Connection to test your connection to the specified data source.

  7. Click Create Data Source to establish your data source connection.

  8. Click Import Applications.

    This process loads all the groups and applications associated with the Okta service into the Applications section. For more information, see Applications in the Use Pentaho Data Catalog document.

    You can also monitor the status of the job on the Workers page.

You have successfully created a connection to Okta as a data source in Data Catalog.

After the Import Applications job completes, click Applications in the left navigation menu to view the imported hierarchy.

Note: The imported details are read-only. To sync the latest data from the Okta service, you must rerun the Import Applications job for the Okta data source. Any edits made by Data Catalog users to the imported assets will be overwritten during the next import.

The root level displays the name of the Okta data source, and the next level shows the groups retrieved from the Okta service. When you expand a group, all applications associated with that group appear below it. If an application is not part of any group in Okta, Data Catalog creates a group named Default and places such applications in it. If the same application belongs to multiple groups, it appears under each group. Each appearance is treated as a unique combination in the hierarchy view. For more information, see the Applications section in the Use Pentaho Data Catalog document.
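The grouping rules described above can be sketched in a few lines. The following example is only an illustration of those rules (ungrouped applications go under Default, and an application in several groups appears under each one); the function name and the sample applications and groups are hypothetical, and this is not the Data Catalog implementation.

```python
from collections import defaultdict


def build_hierarchy(app_groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each group name to the sorted list of applications shown under it."""
    hierarchy = defaultdict(list)
    for app, groups in app_groups.items():
        # An application with no groups is placed in the Default group.
        for group in groups or ["Default"]:
            hierarchy[group].append(app)
    return {group: sorted(apps) for group, apps in sorted(hierarchy.items())}


# Hypothetical input: application name -> groups it belongs to in Okta.
sample = {"Payroll": ["Finance", "HR"], "Wiki": [], "CRM": ["Sales"]}
tree = build_hierarchy(sample)
for group, apps in tree.items():
    print(group, apps)
```

In this sample, Payroll appears under both Finance and HR, and Wiki, which has no group, appears under Default.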

OneDrive or SharePoint data source

SharePoint and OneDrive in Microsoft 365 are cloud-based services that help organizations share and manage content, knowledge, and applications with seamless collaboration.

Perform the following steps to configure your OneDrive or SharePoint site as a data source within Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Microsoft OneDrive or SharePoint in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information to access Microsoft OneDrive or SharePoint.

Field
Description

Affinity

The Default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

A Shared Key (default)

Application (client) ID

A unique identifier assigned to an application that has been registered in Azure Active Directory (Azure AD).

Client Secret

Password credentials to access data on the OneDrive or SharePoint site.

Tenant ID

A unique identifier of the Azure AD tenant (organization) that hosts the OneDrive or SharePoint site.

Path

Folder where this data source is included.

  • Use / to scan all users' OneDrive and SharePoint sites from the root level, and use /<folder path>/ for a specific directory.

  • Use /users/<username>/ for user-specific OneDrive.

  • Use /sites/ for the root of the SharePoint sites and /sites/<SharePoint site path>/ for a specific SharePoint site.

6. Click Test Connection to test your connection to the specified data source. Note: Before finalizing and saving your new data source configuration, you must perform a process called 'Scan files'. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Scan Files. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.

Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you see a message in the upper corner of the screen.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to OneDrive or SharePoint as a data source.
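The Path patterns listed for this data source are easy to get wrong by a missing or doubled slash. The following is a minimal sketch of helpers that build the user-specific and site-specific forms; the function names and sample values are hypothetical, and the patterns are taken only from the Path field description above.

```python
def onedrive_path(username: str) -> str:
    """Return the Path value for one user's OneDrive."""
    return f"/users/{username}/"


def sharepoint_path(site_path: str = "") -> str:
    """Return the Path value for all SharePoint sites or for one site."""
    site = site_path.strip("/")  # tolerate leading or trailing slashes in the input
    return f"/sites/{site}/" if site else "/sites/"


print(onedrive_path("jdoe"))
print(sharepoint_path())
print(sharepoint_path("/teams/finance/"))
```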

Oracle data source

Perform the following steps to add Oracle as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Oracle in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: The address of the machine where the Oracle database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the Oracle server is listening for incoming connections.

  • Database Name: The name of the database within the Oracle server that you want to connect with.

Configuration method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: A service URL that looks like jdbc:oracle:thin:@oracle.example.com:1521/mydb.

Driver

A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Oracle database as a data source.
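
The Credentials fields and the URI carry the same connection details. As a minimal sketch, assuming the URI pattern of the example above (jdbc:oracle:thin:@host:port/database), the following Python helper assembles a URI from the Host, Port, and Database Name values. The function name oracle_jdbc_uri is hypothetical and is not part of Data Catalog.

```python
def oracle_jdbc_uri(host: str, port: int, database: str) -> str:
    """Assemble a URI in the same form as the example in this section."""
    if not host or not database:
        raise ValueError("host and database must not be empty")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    # Pattern assumed from the documented example:
    # jdbc:oracle:thin:@oracle.example.com:1521/mydb
    return f"jdbc:oracle:thin:@{host}:{port}/{database}"


print(oracle_jdbc_uri("oracle.example.com", 1521, "mydb"))
```

With the values from the example, the helper prints jdbc:oracle:thin:@oracle.example.com:1521/mydb, which you can paste into the URI field.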

PostgreSQL data source

Perform the following steps to add PostgreSQL as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select PostgreSQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information to access the PostgreSQL data source.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: The address of the machine where the PostgreSQL database server is running. It can be an IP address or a domain name.

  • Port: The port number on which the PostgreSQL server is listening for incoming connections. The default port is 5432.

  • Database Name: The name of the database or schema within the PostgreSQL server that you want to connect with.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: A unique identifier to locate the data source. The connection string must include the name of the database. For example, the URI looks like jdbc:postgresql://localhost:<port_no>/<database_name>.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  1. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  2. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  3. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  4. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  5. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the PostgreSQL data source.
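
If Test Connection fails, it can help to first confirm that the Host and Port values are reachable from the machine that runs the agent. The following Python sketch is an optional check performed outside Data Catalog. The function name can_reach and the timeout value are assumptions, and 5432 is the default PostgreSQL port mentioned above.

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the value from the Host field.
    print(can_reach("localhost"))
```

A successful TCP connection only shows that a service is listening on the port. It does not validate the credentials, the driver, or the database name, so Test Connection remains the authoritative check.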

SAP HANA data source

Perform the following steps to add SAP HANA as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select SAP HANA in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method: Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: A physical or virtual machine (server) where an instance of SAP HANA is installed and running. It can be an IP address or a domain name.

  • Port: The port number on which the SAP HANA database server is listening for incoming connections.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: For example, the URI looks like jdbc:sap://localhost:<port_no>/<database_name>?user=<user>&password=<password>.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

Database Name

The name of the database within the SAP HANA server that you want to connect with.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  1. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  2. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  3. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  4. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  5. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the SAP HANA data source.

Salesforce data source

Salesforce is a cloud-based customer relationship management (CRM) platform that provides organizations with a unified platform to manage customer data, sales processes, marketing campaigns, support interactions, and other key business functions. By integrating Salesforce as a data source within Data Catalog, you can access and manage metadata from Salesforce. This integration supports data discovery, so you can search, explore, and understand Salesforce data. It also enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Salesforce as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Salesforce in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that communication between the application and the database is efficient, secure, and compliant with the required standards.

CAUTION: Don’t change the driver for the Salesforce data source type. Changing it might disrupt the connection and cause unexpected behavior.

Username

The Salesforce login username, associated with the Salesforce account.

Password

The password of the Salesforce account to authenticate the connection.

Host

The domain or endpoint of the Salesforce instance you are connecting to.

Port

The port number to connect to the Salesforce instance.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  1. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  2. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  3. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  4. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  5. Click Create Data Source to establish your data source connection.

You have successfully created a connection to Salesforce as a data source.
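
The naming rule for the Data Source Name field (start with a letter; only letters, digits, and underscores; no spaces) can be checked before you type a name into the form. The following Python sketch is one way to express that rule as a regular expression. The function name is hypothetical, and the sketch assumes that "letters" means the ASCII letters A to Z in either case, which the rule above does not state explicitly.

```python
import re

# Assumption: "letters" is limited to ASCII A-Z and a-z.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Check a candidate name against the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ("salesforce_prod", "2024_salesforce", "salesforce prod"):
    print(candidate, is_valid_data_source_name(candidate))
```

In this sketch, salesforce_prod passes, while 2024_salesforce fails because it starts with a digit and salesforce prod fails because it contains a space.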

SMB/CIFS data source

Server Message Block (SMB) and Common Internet File System (CIFS) are Windows file-sharing protocols used in storage systems. You can add data to Data Catalog from an SMB or CIFS share through either a remote agent or a local agent, which lets you create an SMB or CIFS data source with a local file system path.

These protocols use a client-server model in which the server provides a shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any file-sharing network system that supports transfer over the SMB and CIFS protocols.

Perform the following steps to add SMB or CIFS as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select SMB or CIFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The default configuration method is URI.

  • URI: The URI that identifies and locates the SMB or CIFS server on the network. For example, the URI looks like smb://server.example.com or cifs://server.example.com.

  • Domain: The domain name if the SMB or CIFS server is part of a Windows domain.

  • Path: The path to the shared resource on the SMB or CIFS server. For example, the path would look like smb/cifs://server:/path/to/resource

  • Username/Password: Credentials that provide access to the SMB or CIFS resource.

6. Click Test Connection to test your connection to the specified data source.

  1. Click Scan Files to scan the available files in the data source.

    Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.

  2. (Optional) Configure the following options for the data source.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

    Cost per Terabyte

    Menu to select currency and text field to enter the price per terabyte.

    Total Capacity

    Field to enter the total capacity of the data source in terabytes.

  3. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  4. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the SMB or CIFS resource as a data source.
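
The URI and Path fields both refer to a server and a location on that server. As a rough sketch, assuming the URI is written with a plain smb:// or cifs:// scheme, the following Python function uses the standard urllib.parse module to separate the scheme, the server name, and the path. The function name split_share_uri and the sample URI are illustrative assumptions, not Data Catalog behavior.

```python
from urllib.parse import urlsplit


def split_share_uri(uri: str) -> tuple:
    """Split an smb:// or cifs:// URI into (scheme, server, path)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("smb", "cifs"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("the URI does not contain a server name")
    # An empty path means the URI points at the server itself.
    return parts.scheme, parts.hostname, parts.path or "/"


print(split_share_uri("smb://server.example.com/path/to/resource"))
```

Splitting the value this way makes it easier to see which part belongs in the URI field and which part describes the shared resource.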

Snowflake data source

Perform the following steps to add Snowflake as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Snowflake in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is Credentials.

Driver

The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.

To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

Username

Username that provides access to the specified database.

Password

Password that provides access to the specified database.

Host

The address of the machine where the Snowflake database server is running. It can be an IP address or a domain name.

Port

The port number to connect to the Snowflake data source.

Database Name

The name of the database within Snowflake that you want to connect to.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  1. (Optional) Configure the following storage optimization options for the data source.

    Note: To utilize storage optimization options, a Pentaho Data Optimizer license is required.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  2. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  3. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  4. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  5. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Snowflake data source.

Sybase data source

Sybase is a relational database management system (RDBMS) used for data warehousing, business intelligence, and enterprise applications. By integrating Sybase as a data source within Data Catalog, you can access and manage metadata from the Sybase database. This integration supports data discovery, so you can search, explore, and understand Sybase data. It also enhances data lineage and compliance by providing detailed tracking of data movements.

Perform the following steps to add Sybase as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Sybase in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration method

The method used to configure the connection. The only option is URI.

Driver

Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.

URI

The URI used to locate and connect to the Sybase database. For example, the URI looks like jdbc:sybase:tds:<hostname>:<port>?ServiceName=<dbname>

Username

Username that provides access to the Sybase database.

Password

Password that provides access to the Sybase database.

Database Name

The name of the database within the Sybase environment that contains the data you want to access.

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  1. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  2. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  3. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  4. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  5. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Sybase data source.
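
The URI example above uses angle-bracket placeholders such as <hostname>, <port>, and <dbname>. As a small sketch, the following Python function replaces each placeholder with a real value and raises an error if any placeholder is left without a value, which helps avoid pasting a half-filled URI into the form. The function name fill_uri_template and the sample values are hypothetical.

```python
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"<([A-Za-z_]+)>")


def fill_uri_template(template: str, values: Mapping[str, str]) -> str:
    """Replace every <name> placeholder in template with values[name]."""

    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder <{name}>")
        return values[name]

    return _PLACEHOLDER.sub(substitute, template)


template = "jdbc:sybase:tds:<hostname>:<port>?ServiceName=<dbname>"
print(fill_uri_template(template, {
    "hostname": "sybase.example.com",  # sample value
    "port": "5000",                    # sample value
    "dbname": "sales",                 # sample value
}))
```

With these sample values, the result is jdbc:sybase:tds:sybase.example.com:5000?ServiceName=sales.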

Vertica data source

Perform the following steps to add Vertica as a data source in Data Catalog:

Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.

  2. In the Resources card, click Add Data Source.

    The Create Data Source page opens.

    Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.

  3. Specify the following information for the connection to your data source.

    Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.

Field
Description

Data Source Name

Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.

Data Source ID (Optional)

Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.

Description (Optional)

Specify a description of your data source.

4. After you have specified the basic connection information, select Vertica in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.

5. Specify the following additional connection information.

Field
Description

Affinity

This default setting specifies which agents should be associated with the data source in a multi-agent deployment.

Configuration Method

Select Credentials or URI as a configuration method.

Configuration Method: Credentials

  • Username/Password: Credentials that provide access to the specified database.

  • Host: A physical or virtual machine (server) where an instance of the Vertica database software is installed and running. It can be an IP address or a domain name.

  • Port: The port number on which the Vertica server is listening for incoming connections.

Configuration Method: URI

  • Username/Password: Credentials that provide access to the specified database.

  • URI: The JDBC connection URI for the Vertica database, for example, jdbc:vertica://<hostname>:<port>/<database>?user=<username>&password=<password>.

Driver

Select an existing driver or upload a new driver. Data Catalog uses this driver to communicate with the Vertica database.

Database Name

The name of the database within the Vertica server that you want to connect with.
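
If you choose the URI configuration method, the connection string can be assembled from the same values you would otherwise enter as credentials. The following Python sketch shows one way to do that. The function name and sample values are hypothetical, and percent-encoding the user name and password is an assumption made here so that special characters do not break the query string; confirm the expected encoding in your Vertica JDBC driver documentation.

```python
from urllib.parse import quote, urlencode


def build_vertica_uri(host: str, port: int, database: str,
                      username: str, password: str) -> str:
    """Assemble a URI in the documented jdbc:vertica:// form.

    Assumption: user and password values are percent-encoded so that
    characters such as '@', '&', or spaces stay inside their parameter.
    """
    query = urlencode(
        {"user": username, "password": password},
        quote_via=quote,  # encode spaces as %20 rather than '+'
        safe="",          # encode every reserved character
    )
    return f"jdbc:vertica://{host}:{port}/{database}?{query}"


# Sample values are placeholders, not real connection details.
example_uri = build_vertica_uri(
    "db.example.com", 5433, "VMart", "dbadmin", "p@ss word"
)
```

With the placeholder values above, example_uri is jdbc:vertica://db.example.com:5433/VMart?user=dbadmin&password=p%40ss%20word.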

6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.

Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.

7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
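
If you keep a list of schema names to decide what to select in the Ingest Schema dialog, the system-related schemas can be filtered out first. The following Python sketch is illustrative only. The function name is hypothetical, and the set of excluded names (v_catalog, v_monitor, v_internal) is an assumption about a typical Vertica installation; verify the system schemas in your own database before relying on it.

```python
# Assumption: these are the system-related schemas to skip.
# Verify the list against your own Vertica database.
ASSUMED_SYSTEM_SCHEMAS = frozenset({"v_catalog", "v_monitor", "v_internal"})


def schemas_to_ingest(all_schemas, excluded=ASSUMED_SYSTEM_SCHEMAS):
    """Return schema names that are not in the excluded set.

    The comparison ignores case, and the original order is preserved.
    """
    excluded_lower = {name.lower() for name in excluded}
    return [name for name in all_schemas if name.lower() not in excluded_lower]
```

For example, passing the list public, v_catalog, V_MONITOR, store returns public and store.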

8. (Optional) In the Physical Location field, specify the physical location details of the data source.

  9. (Optional) Configure the following storage optimization options for the data source.

    Note: To use storage optimization options, you need a Pentaho Data Optimizer license.

    Field
    Description

    Available for Migration

    Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.

    Available for Writing

    Enables or disables writing capabilities for the data source and enables migration when turned on.

    Available for Data Mastering

    Enables or disables the data source for data mastering purposes.

  10. (Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.

  11. (Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.

  12. (Optional) Enter a Note for any additional information to share with others who might access this data source.

  13. Click Create Data Source to establish your data source connection.

You have successfully created a connection to the Vertica data source.
