Adding a data source
If your role has permission to administer data sources, you can add and edit data sources.
The number of data sources you can add is determined by your license agreement. You receive a message when you have reached 75% of your data source creation quota.
If you have reached the limit of data sources allowed by your license agreement, the Add Data Source button on the Resources card is unavailable, and a message appears when you hover your cursor over the button.
Note: If you encounter an error while connecting to a data source, refer to the documentation of the specific data source provider for more information about the error.
Active Directory as a data source
Data Catalog supports integration with both Windows-based Active Directory (AD) and Azure Active Directory (Azure AD). You can add Active Directory as a data source to import file system security identifiers (SIDs), GUIDs, and security descriptors, and map them to user identities. With this integration, Data Catalog displays the ownership and group information for files and folders from SMB, CIFS, and NFS data sources in Data Canvas.
Perform the following steps to add Active Directory as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Active Directory in the Data Source Type field. Note: Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Host
The fully qualified domain name (FQDN) or IP address of the Active Directory server.
Port
The port number used to connect to the Active Directory server. The default port is usually 389 for LDAP or 636 for LDAPS (secure LDAP).
Domain
The domain name associated with the Active Directory environment.
User Name
The username that has permission to query the Active Directory. Include the domain if you have not provided the domain name.
Password
The password associated with the username. This credential is used to authenticate the connection to the AD server.
6. Click Test Connection to test your connection to the specified data source.
Click Create Data Source to establish your data source connection.
Click Import Users.
This process imports file system security identifiers (SIDs), GUIDs, and security descriptors, and maps them to user identities, which helps to display ownership and group information for files and folders from SMB, CIFS, and NFS data sources in Data Canvas. You can also monitor the status of the job on the Workers page.
You have successfully created a connection to Active Directory as a data source in Data Catalog.
After completing the Import Users job, you can run the Metadata Ingest process for SMB, CIFS, and NFS data sources to see the user information in the Properties panel. For more information, see the Processing unstructured data topic in the Explore your data section in the Use Pentaho Data Catalog document. Additionally, you can see the list of users who have access to a particular file or folder. For more information, see the Users with Access topic in the Use Pentaho Data Catalog document.
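The naming rule for Data Source Name and the port defaults in the steps above can be checked before you fill in the form. The following is a minimal sketch, not part of Data Catalog: the function names are hypothetical, and it assumes that "letters" in the naming rule means ASCII letters.

```python
import re

# Naming rule from the Data Source Name field: start with a letter, then
# only letters, digits, and underscores (no spaces).
# Assumption: "letters" means ASCII letters; the product may accept more.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Conventional ports named in the Port field description.
LDAP_PORT = 389
LDAPS_PORT = 636


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


def default_directory_port(use_ldaps: bool) -> int:
    """Return the usual port for LDAP (389) or LDAPS (636)."""
    return LDAPS_PORT if use_ldaps else LDAP_PORT


if __name__ == "__main__":
    # Hypothetical candidate names, used only for illustration.
    for candidate in ("corp_ad_01", "1st_directory", "corp ad"):
        print(candidate, is_valid_data_source_name(candidate))
    print(default_directory_port(use_ldaps=True))
```

Your Active Directory server might listen on a different port, so treat the two constants as starting points rather than fixed values.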
DynamoDB data source
Perform the following steps to add Amazon DynamoDB as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select DynamoDB in the Data Source Type field. Note: Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Region
Geographical location where AWS maintains a cluster of data centers.
Access Key and Secret Access Key
AWS Access Key ID and Secret Access Key that are used for authentication and authorization when interacting with DynamoDB.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the Amazon DynamoDB data source.
Amazon Redshift data source
Amazon Redshift is a fully managed, petabyte-scale data warehouse solution offered by Amazon Web Services (AWS). It allows users to run complex queries and perform real-time data analytics on large datasets by utilizing massively parallel processing (MPP) technology. By integrating Amazon Redshift as a data source within Data Catalog, you can access and manage metadata from the Amazon Redshift database. This integration enables data discovery to search, explore, and understand Amazon Redshift data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add Amazon Redshift as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the connection information, select Redshift in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as the configuration method.
Configuration Method: Credentials
Username/Password: The credentials that provide access to the Amazon Redshift database.
Host: The address of the machine where the Amazon Redshift database server is running. It can be an IP address or a domain name.
Port: The port number on which the Amazon Redshift server is listening for incoming connections.
Configuration Method: URI
Username/Password: Credentials that provide access to the Amazon Redshift database.
URI: URIs are used to access and manage various objects and services within the Amazon Redshift environment. For example, the URI would look like jdbc:redshift://<host>:<port>/<database-name>.
Driver
The driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the Amazon Redshift server that you want to connect with.
6. Click Test Connection to test your connection to the specified data source.
Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.
Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the Amazon Redshift data source.
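When you use the URI configuration method, the Host, Port, and Database Name values are combined into a single JDBC string. The following is a minimal sketch of that assembly, assuming the common jdbc:redshift://host:port/database form. The function name and the sample host are hypothetical, and 5439 is assumed as the port that Amazon Redshift clusters commonly use; confirm the actual port for your cluster.

```python
# Assumption: 5439 is the port Amazon Redshift clusters commonly listen on.
REDSHIFT_DEFAULT_PORT = 5439


def build_redshift_uri(host: str, database: str,
                       port: int = REDSHIFT_DEFAULT_PORT) -> str:
    """Combine host, port, and database name into a JDBC-style URI."""
    if not host or not database:
        raise ValueError("host and database are both required")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:redshift://{host}:{port}/{database}"


if __name__ == "__main__":
    # Hypothetical host and database name, used only for illustration.
    print(build_redshift_uri("example-cluster.example.com", "dev"))
```

Building the string from separate values makes it easier to spot a missing port or database name before you click Test Connection.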
AWS S3 data source
Perform the following steps to add AWS S3 as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select AWS S3 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Region
Geographical location where AWS maintains a cluster of data centers.
Endpoint
Location of the bucket. For example, s3.<region containing S3 bucket>.amazonaws.com
Access Key
User credential to access data on the bucket.
Secret Access Key
Password credential to access data on the bucket.
Bucket Name
The name of the S3 bucket in which the data resides. For S3 access from non-EMR file systems, Data Catalog uses the AWS command line interface to access S3 data. These commands send requests using access keys, which consist of an access key ID and a secret access key. You must specify the logical name for the cluster root. This value is defined by dfs.nameservices in the hdfs-site.xml configuration file. For S3 access from AWS S3 and MapR file systems, you must identify the root of the MapR file system with maprfs:///.
Path
Directory where this data source is included.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the AWS S3 data source.
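The Endpoint value in the steps above follows the pattern s3.<region>.amazonaws.com. The following is a minimal sketch that derives it from the Region value. The function name is hypothetical, the region check is a loose format test rather than a list of real regions, and the sketch assumes the standard AWS partition.

```python
import re

# Assumption: region codes look like "us-east-1" or "ap-southeast-2".
# This is a loose shape check, not a lookup against real AWS regions.
_REGION_PATTERN = re.compile(r"[a-z]{2}(?:-[a-z]+)+-\d+")


def s3_endpoint_for_region(region: str) -> str:
    """Return the endpoint host name in the form s3.<region>.amazonaws.com."""
    if _REGION_PATTERN.fullmatch(region) is None:
        raise ValueError(f"unexpected region format: {region!r}")
    return f"s3.{region}.amazonaws.com"


if __name__ == "__main__":
    print(s3_endpoint_for_region("us-east-1"))
```

The region passed in should be the one that contains the bucket named in the Bucket Name field.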
Azure Blob Storage data source
Perform the following steps to add Azure Blob Storage as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Azure Blob Storage in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method
A Shared Key (default)
Account Name
Name of the Azure storage account that contains all of your Azure Storage data objects.
Shared Key
A password-like credential that gives full access to an Azure storage account's data and configuration.
Container
The top-level object that logically groups blob data. A container can hold an unlimited number of large data objects.
Path
Folder where this data source is included.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the Azure Blob Storage data source.
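The Account Name, Container, and Path values in the steps above together identify one location in Azure Blob Storage. The following is a minimal sketch of how the three relate, assuming the public endpoint form https://<account>.blob.core.windows.net. The function name and sample values are hypothetical, and Data Catalog may address the storage differently internally.

```python
from urllib.parse import quote


def blob_location(account_name: str, container: str, path: str = "") -> str:
    """Compose a blob URL from account name, container, and optional path.

    Assumption: the account uses the public Azure endpoint
    (blob.core.windows.net); sovereign clouds use other suffixes.
    """
    if not account_name or not container:
        raise ValueError("account_name and container are both required")
    base = f"https://{account_name}.blob.core.windows.net/{container}"
    prefix = path.strip("/")
    # quote() keeps "/" as-is and percent-encodes characters such as spaces.
    return f"{base}/{quote(prefix)}" if prefix else base


if __name__ == "__main__":
    # Hypothetical account, container, and folder, for illustration only.
    print(blob_location("examplestore", "landing", "/reports/2024/"))
```

Leaving the path empty points at the container itself, which corresponds to including the whole container in the data source.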
Denodo data source
Perform the following steps to add Denodo as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Denodo in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
By default, it is URI, as the connection is configured using a URL.
Username and Password: Credentials associated with your Denodo account to log in and access the Denodo environment.
URI: URIs are used to access and manage various objects and services within the Denodo environment. For example, the URI would look like:
jdbc:vdb://<denodo-host>:<port>/<database-name>?publishCatalogsAsSchemas=true
Example: jdbc:vdb://ec2-1-2-3-4.compute-1.amazonaws.com:49999/mydatabase?publishCatalogsAsSchemas=true
Database Name: The name of the data sources within the Denodo environment that contain the data you want to access.
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens. Note: Before finalizing and saving your new data source configuration, you must perform a process called 'Scan files'. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Denodo data source.
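The Denodo URI in the steps above packs the host, port, database name, and options into one string. The following is a minimal sketch that splits such a URI back into its parts so that each one can be checked against the other fields. The function name is hypothetical, and the sketch relies only on the URI shape shown in the example.

```python
from urllib.parse import parse_qs, urlsplit

_PREFIX = "jdbc:"


def parse_denodo_uri(uri: str) -> dict:
    """Split a jdbc:vdb:// URI into host, port, database, and options."""
    if not uri.startswith("jdbc:vdb://"):
        raise ValueError("expected a URI starting with jdbc:vdb://")
    # Drop the leading "jdbc:" so the rest parses like an ordinary URL.
    parts = urlsplit(uri[len(_PREFIX):])
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": options,
    }


if __name__ == "__main__":
    example = ("jdbc:vdb://ec2-1-2-3-4.compute-1.amazonaws.com:49999/"
               "mydatabase?publishCatalogsAsSchemas=true")
    print(parse_denodo_uri(example))
```

The database part should match the value you enter in the Database Name field.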
Google Cloud Storage data source
Google Cloud Storage (GCS) is a storage service that enables the storage, retrieval, and management of unstructured data, including files, images, videos, and large datasets. By integrating GCS as a data source within Data Catalog, you can access and manage the metadata of files stored. You can perform data discovery to search, explore, and understand your Google Cloud Storage data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add Google Cloud Storage as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Google Object Store in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Bucket Name
The name of the Google Cloud Storage bucket in which the data resides.
Path
The path within the bucket where the files are stored.
Key Path
The authentication key file, which is used to connect to Google Cloud Storage (a JSON file).
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to Google Cloud Storage as a data source in Data Catalog.
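The Key Path field in the steps above points to a JSON key file. The following is a minimal sketch that checks whether the text of such a file looks like a Google Cloud service account key before you upload it. The function name is hypothetical, and the list of required fields is an assumption based on the usual layout of service account keys, not a Data Catalog requirement.

```python
import json
from typing import List

# Assumption: fields normally present in a service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text: str) -> List[str]:
    """Return a list of problems found in the key JSON; empty if none."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}"
                for name in REQUIRED_KEY_FIELDS if not data.get(name)]
    key_type = data.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected key type: {key_type}")
    return problems


if __name__ == "__main__":
    # Placeholder values only; a real key file contains actual credentials.
    sample = json.dumps({
        "type": "service_account",
        "project_id": "example-project",
        "private_key": "placeholder",
        "client_email": "reader@example-project.iam.gserviceaccount.com",
    })
    print(check_service_account_key(sample))
```

An empty result means only that the file has the expected shape. Use Test Connection to confirm that the credentials actually work.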
Google BigQuery data source
Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for analytics. By integrating BigQuery as a data source within Data Catalog, you can access and manage metadata from the BigQuery database. This integration enables data discovery to search, explore, and understand BigQuery data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add BigQuery as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Google BigQuery in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Field
Description
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Driver
The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.
Host
The Google BigQuery API endpoint. By default, the host is set to https://www.googleapis.com/bigquery/v2, which communicates with BigQuery's REST API for data processing.
Port
The port number to connect to the BigQuery data source. The default port for HTTPS connections is 443.
Project
The Google Cloud project ID that contains the BigQuery datasets you want to access.
Database Name
The name of the dataset within BigQuery that you want to connect to.
Key Path
The file path to your Google Cloud service account's key (a JSON file).
Oauth Type
The authentication method to connect to BigQuery. By default, it is Service-based. It uses a service account and a key file for authentication.
Client Email
The email associated with your Google Cloud service account for service-based OAuth. The service account email is usually in the format <service-account-name>@<project-id>.iam.gserviceaccount.com.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To utilize storage optimization options, a Pentaho Data Optimizer license is required.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the BigQuery data source.
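If Test Connection fails for a BigQuery data source, one thing worth ruling out is a mismatch between the Project and Client Email fields and the key file referenced in Key Path. The following is a minimal Python sketch, not part of Data Catalog. It assumes the key is a standard Google Cloud service account JSON key, which carries `type`, `project_id`, and `client_email` entries. The helper name `check_service_account_key` and all sample values are hypothetical.

```python
import json
import os
import tempfile


def check_service_account_key(key_path, project, client_email):
    """Return a list of problems found in a service account key file.

    Illustrative helper only (not a Data Catalog API). It compares the
    values typed into the Project and Client Email fields with the
    contents of the JSON key file referenced by Key Path.
    """
    with open(key_path, encoding="utf-8") as handle:
        key = json.load(handle)

    problems = []
    if key.get("type") != "service_account":
        problems.append("key file is not a service account key")
    if key.get("project_id") != project:
        problems.append("Project does not match project_id in the key file")
    if key.get("client_email") != client_email:
        problems.append("Client Email does not match client_email in the key file")
    return problems


# Demo with a throwaway key file; every value below is made up.
sample_key = {
    "type": "service_account",
    "project_id": "demo-project",
    "client_email": "catalog-reader@demo-project.iam.gserviceaccount.com",
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample_key, tmp)

matching = check_service_account_key(
    tmp.name, "demo-project", sample_key["client_email"]
)
mismatched = check_service_account_key(
    tmp.name, "other-project", sample_key["client_email"]
)
os.remove(tmp.name)

print(matching)    # no problems expected for matching values
print(mismatched)  # one problem expected for the wrong project
```

An empty list means the three values agree. It does not prove that the service account has permission to read the dataset.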
HCP data source
You can add data to Data Catalog from Hitachi Content Platform (HCP) by adding HCP as a data source. Perform the following steps to add HCP as a data source:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select HCP in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Region
Geographical location where HCP maintains data centers.
Endpoint
Location of the bucket, specified as a hostname or IP address.
Access Key
The access key of the S3 credentials to access the bucket.
Secret Access Key
The secret key of the S3 credentials to access the bucket.
Bucket Name
The name of the S3 bucket in which the data resides.
Path
Directory where this data source is included.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the data scanning limit specified in your license agreement, a message will appear in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the HCP data source.
HDFS data source
Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant storage system for big data, designed to distribute and manage large datasets across a cluster of commodity hardware within the Apache Hadoop framework. You can create a data source using HDFS with the local file system path by mounting data as a local file system to either the remote or local worker.
The HDFS protocol uses a client-server model where the server provides the shared file system, and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to Data Catalog from any file-sharing network system if it is transferable using HDFS.
Perform the following steps to add HDFS as a data source:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, specify the following additional connection information to access the HDFS data source.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method
The method used to configure the connection. By default, it is URI.
HDFS Version
Select the Hadoop version of the cluster that you want to run.
URI
URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like hdfs://<name node>:8020. The <name node> address can be a variable name for high availability.
Path
HDFS directory path for the data source. It can be the root (/) or a specific high-level directory based on your access control needs. For example, the path would look like /user/demodata/.
Credential Type
Select the credential type to connect to the HDFS data source.
5. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you will see a message in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the HDFS data source.
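The URI and Path fields in the steps above together describe one HDFS location. As a rough illustration, the following Python sketch splits a full location string into the two values. The helper name `split_hdfs_location` and the host name `namenode1` are hypothetical, and the sketch only does string parsing. It does not contact the cluster.

```python
from urllib.parse import urlsplit


def split_hdfs_location(location):
    """Split a full HDFS location into (URI field value, Path field value).

    Illustrative helper only. The authority part (name node and optional
    port) is kept exactly as written, so a high-availability name without
    a port passes through unchanged.
    """
    parts = urlsplit(location)
    if parts.scheme != "hdfs" or not parts.netloc:
        raise ValueError(f"not an HDFS location: {location!r}")
    uri = f"hdfs://{parts.netloc}"
    path = parts.path or "/"  # fall back to the root directory
    return uri, path


print(split_hdfs_location("hdfs://namenode1:8020/user/demodata/"))
print(split_hdfs_location("hdfs://namenode1"))
```

Keeping the Path as narrow as your access control needs allow also limits how much data a file scan has to walk.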
IBM Db2 data source
Perform the following steps to add IBM Db2 as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the connection information, select IBM DB2 in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as the configuration method.
Configuration Method: Credentials
Username/Password: The credentials that provide access to the specified database.
Host: The address of the machine where the IBM Db2 database server is running. It can be an IP address or a domain name.
Port: The port number on which the IBM Db2 server is listening for incoming connections.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: URIs are used to access and manage various objects and services within the IBM Db2 environment. For example, the URI would look like jdbc:db2://<HOSTNAME or IP_ADDRESS>:<PORT>/<DATABASE_NAME>.
Driver
A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the IBM Db2 server that you want to connect to.
6. Click Test Connection to test your connection to the specified data source.
Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.
Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the IBM Db2 data source.
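The Credentials and URI configuration methods carry the same information in different shapes. The following Python sketch shows how the Host, Port, and Database Name values map onto the jdbc:db2://<HOSTNAME or IP_ADDRESS>:<PORT>/<DATABASE_NAME> pattern given above. The helper name `db2_jdbc_url` and the sample values are hypothetical placeholders, not defaults taken from Data Catalog.

```python
def db2_jdbc_url(host, port, database):
    """Compose a URI in the jdbc:db2://<host>:<port>/<database> form.

    Illustrative helper only; it formats a string and performs no
    connection attempt.
    """
    if not host or not database:
        raise ValueError("host and database are required")
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:db2://{host}:{port}/{database}"


# Placeholder values for illustration.
print(db2_jdbc_url("db2.example.com", 50000, "SALES"))
```

User name and password are deliberately left out of the URI, since the steps above supply them in separate fields for both methods.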
InfluxDB data source
InfluxDB is a time-series database designed for handling high volumes of time-stamped data from workloads such as monitoring, IoT applications, real-time analytics, and event-driven architectures. By integrating InfluxDB as a data source within Data Catalog, you can manage and use your time-series data. The integration enables data discovery, so you can search, explore, and understand InfluxDB data. It also enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add InfluxDB as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select InfluxDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
The method used to configure the connection. The only option is Credentials.
Driver
The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Host
The address of the machine where the InfluxDB database server is running. It can be an IP address or a domain name.
Port
The port number to connect to the InfluxDB data source.
Username
The user name that provides access to the InfluxDB database.
Token
The authentication token provided by the InfluxDB instance to authenticate and authorize access to the specified data source.
Database Name
The name of the database within InfluxDB that you want to connect to.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before finalizing and saving your new data source configuration, you must perform a process called 'Scan files'. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the InfluxDB data source.
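The Data Source Name rule in the steps above (start with a letter, then only letters, digits, and underscores, with no spaces) can be checked before you type a name into the form. The following is a minimal Python sketch. It assumes that "letters" means ASCII letters, which the text does not state, and `is_valid_data_source_name` is a hypothetical helper rather than a Data Catalog function.

```python
import re

# Assumption: "letters" means ASCII A-Z and a-z.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name):
    """Return True if name starts with a letter and then uses only
    letters, digits, and underscores."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["influx_metrics", "Influx2024", "2024_influx",
                  "influx metrics", "influx-metrics"]:
    print(candidate, is_valid_data_source_name(candidate))
```

The first two candidates satisfy the rule. The last three fail because of a leading digit, a space, and a hyphen respectively.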
Local File System data source
You can add data to Data Catalog from your local file system by adding Local File System as a data source.
To access files on your local system, make the following changes to the vendor/docker-compose.yml file to ensure that it is accessible by the ws_default container.
Open the vendor/docker-compose.yml file and add the following lines under the ws_default service:
    services:
      ws_default:
        volumes:
          - /my/path/to/file:/tmp/my-path
You can also include a remote file share as a Local File System. As an example, refer to the following code snippet for adding cifs-share to the Local File System:
    services:
      ws_default:
        volumes:
          # Following are optional settings to add cifs share to local file system
          - cifs-share:/cifs-share
    # Remote file share
    volumes:
      cifs-share:
        driver_opts:
          type: cifs
          o: "username=<user1>,password=<password>,file_mode=0777,dir_mode=0777"
          device: "<IP Address>"
Save changes.
Restart the ws_default container for the changes to take effect.
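A volume entry such as /my/path/to/file:/tmp/my-path means that files under the host directory appear under a different directory inside the container. The following Python sketch translates a host path into its container-side equivalent. It assumes that the Path field of the data source expects the path as seen from inside the ws_default container, which the text does not state explicitly. The helper name `container_path` and the sample file name are hypothetical.

```python
from pathlib import PurePosixPath


def container_path(volume_mapping, host_path):
    """Translate a host path to its location inside the container.

    volume_mapping is a "host_dir:container_dir" bind mount entry like the
    one added to vendor/docker-compose.yml. Mount options such as a
    trailing ":ro" are not handled in this sketch.
    """
    host_dir, container_dir = volume_mapping.split(":", 1)
    # relative_to raises ValueError if host_path is outside the mounted directory.
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(container_dir) / relative)


mapping = "/my/path/to/file:/tmp/my-path"
print(container_path(mapping, "/my/path/to/file/reports/2024.csv"))
print(container_path(mapping, "/my/path/to/file"))
```

A path outside the mounted directory raises ValueError, which matches the fact that the container cannot see it.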
Perform the following steps to identify your data source within Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Local File System in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Path
Directory where this data source is included.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files. This process loads files and folders into the system.
You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you see a message in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Local File System data source.
MariaDB data source
MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL. Perform the following steps to configure the MariaDB data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the connection information, select MariaDB in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
The default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as the configuration method. By default, it is Credentials.
Configuration Method: Credentials
Username/Password: The credentials that provide access to the specified database.
Host: The address of the machine where the MariaDB database server is running. It can be an IP address or a domain name.
Port: The port number on which the MariaDB server is listening for incoming connections.
Driver
A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
6. Click Test Connection to test your connection to the specified data source.
Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata.
Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Field
Description
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created the connection to the MariaDB data source.
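If Test Connection fails for a database data source, a quick first check is whether the Host and Port values are reachable over TCP at all. The following Python sketch uses only the standard library. The helper name `can_reach` is hypothetical, and the check runs from the machine where you execute it, which may have different network access than the Data Catalog agents.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative helper only. A successful connection shows that something
    is listening; it does not verify credentials or the database type.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: listen on a free local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

reachable_while_listening = can_reach("127.0.0.1", open_port)
listener.close()

print(reachable_while_listening)
```

In practice you would call `can_reach` with the same Host and Port values you entered for the data source.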
Microsoft Access data source
Perform the following steps to add Microsoft Access as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Microsoft Access in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
Username/Password: Credentials that provide access to the specified database.
Database File: The location of the Microsoft Access database file to connect.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: For example, the URI would look like jdbc:postgresql://localhost:<port_no>/.
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, and click Add New, upload the driver, and then click Add Driver.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering: Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Microsoft Access data source.
Microsoft SQL Server data source
Perform the following steps to add Microsoft SQL Server as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Microsoft SQL Server in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method: Select Credentials or URI as a configuration method.
Configuration method: Credentials
Username/Password: Credentials that provide access to the specified database.
Host: The address of the machine where the Microsoft SQL database server is running. It can be an IP address or a domain name.
Port: The port number on which the Microsoft SQL server is listening for incoming connections. The default port is 1433.
Configuration method: URI
Username/Password: Credentials that provide access to the specified database.
URI: For example, the URL would look like:
Server=myServerAddress;Database=myDatabase;User Id=myUsername;Password=myPassword;Port=1433;Integrated Security=False;Connection Timeout=30;
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the Microsoft SQL server that you want to connect with.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering: Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Microsoft SQL Server data source.
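Before pasting a semicolon-delimited connection string like the URI example in this section into the form, it can help to confirm that every segment is a well-formed key and value pair. The following is a minimal sketch in Python and is not part of Data Catalog; the function name parse_connection_string is hypothetical, and the sample values are the placeholders from the example in this section.

```python
def parse_connection_string(conn_str: str) -> dict[str, str]:
    """Split a 'Key=Value;Key=Value;' string into a dictionary."""
    pairs = {}
    for part in conn_str.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"Missing '=' in segment: {part!r}")
        pairs[key.strip()] = value.strip()
    return pairs


# Placeholder values copied from the documentation example.
example = (
    "Server=myServerAddress;Database=myDatabase;User Id=myUsername;"
    "Password=myPassword;Port=1433;Integrated Security=False;"
    "Connection Timeout=30;"
)
settings = parse_connection_string(example)
print(settings["Server"], settings["Port"])
```

A segment without an equals sign raises ValueError, which catches a common copy-and-paste mistake before you click Test Connection.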
MySQL data source
Perform the following steps to add MySQL as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select MySQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This setting specifies which agents should be associated with the data source in a multi-agent deployment. The only option is Default.
Configuration method
The method used to configure the connection. The only option is Credentials.
Driver
The standard used to establish communication between the application and the database.
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
To upload a new driver, click Manage Drivers, and click Add New, upload the driver, and then click Add Driver.
User Name
User name that provides access to the specified database.
Password
Password that provides access to the specified database.
Host
The address of the machine where the MySQL database server is running. It can be an IP address or a domain name.
Port
The port number on which the MySQL server is listening for incoming connections.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering: Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the MySQL data source.
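The naming rule for the Data Source Name field (start with a letter; only letters, digits, and underscores; no spaces) can be checked ahead of time, for example when you plan names for several data sources. This is a minimal sketch in Python under one assumption that the documentation does not confirm: "letters" is taken to mean ASCII letters only. The function name is_valid_data_source_name and the sample names are hypothetical.

```python
import re

# Assumption: "letters" means ASCII A-Z and a-z.
# First character must be a letter; the rest may be letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name follows the documented naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["mysql_sales", "Sales2024", "2024_sales", "sales data", "_staging"]:
    print(candidate, is_valid_data_source_name(candidate))
```

In this sketch, names that start with a digit or underscore, or that contain a space, are rejected.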
NFS data source
Network File System (NFS) is a distributed file system protocol that enables remote file access over Unix and Linux networks. You can create an NFS data source that uses a local file system path by mounting the data as a local file system on either the remote or the local agent. You can also add data to Data Catalog from Hitachi Network Attached Storage (HNAS) and NetApp data storage.
This protocol uses a client-server model where the server provides the shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to the Data Catalog from any file-sharing network system if it is transferable using NFS.
Perform the following steps to add NFS as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select NFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
By default, it is URI.
URI: URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like
nfs://server.example.com
Path: The NFS path used to access the data source. For example, the path would look like
nfs:/share/data
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files to scan the available files in the data source.
Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.
(Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering: Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the NFS data source.
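To illustrate the URI field described in this section, the following minimal Python sketch checks that a value uses the nfs scheme and names a server before you enter it. It is an illustration only and not how Data Catalog validates the field; the function name check_nfs_uri is hypothetical, and the host is the placeholder from the example in this section.

```python
from urllib.parse import urlparse


def check_nfs_uri(uri: str) -> str:
    """Return the server host name if the URI looks like nfs://<server>."""
    parsed = urlparse(uri)
    if parsed.scheme != "nfs":
        raise ValueError(f"Expected the nfs scheme, got {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("The URI does not name a server")
    return parsed.hostname


print(check_nfs_uri("nfs://server.example.com"))
```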
Okta as a data source
Okta is an identity and access management (IAM) service that helps organizations get a clear view of which users have access to which applications. By adding Okta as a data source in Data Catalog, you can automatically import a list of applications and see who is allowed to use them. This makes it easier to manage access, track ownership, and identify users who no longer need access. It also helps ensure that only authorized personnel can view and use sensitive data, supporting compliance and security goals across the organization.
Generate Okta credentials for Data Catalog
Perform the following steps to generate the Okta credentials needed to add Okta as a data source in Data Catalog.
Log in to your Okta organization as a user with administrative privileges.
In the Admin Console, go to Applications > Applications, and then click Create App Integration.
The Create a new app integration page appears.
Select API Services as the Sign-in method, and then click Next.
Enter a name for the PDC app integration and click Save.
The app's main page appears.
From the service app page, select the Okta API Scopes tab and grant the necessary scopes:
okta.apps.read
okta.groups.read
okta.users.read
In the Admin Roles tab, assign the role Read Only Administrator, then click Save Changes.
In the General tab, edit the Client Credentials, set the Client authentication type to Public key / Private key, and add or generate a public/private key pair.
Download or copy the private key file (.pem) and note the Client ID from the saved Client Credentials section. You’ll need both when creating the data source in Data Catalog.
You have successfully generated the Okta credentials for Data Catalog.
Proceed to add Okta as a data source in Data Catalog.
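Before moving on, you can sanity-check the two things this procedure produces: the granted scopes and the downloaded private key file. The sketch below is a hedged illustration in Python and does not call any Okta API. The function names are hypothetical, the PEM check only inspects the header and footer lines (it does not verify the key itself), and the sample key body is a placeholder rather than real key material.

```python
# Scopes listed in the procedure above.
REQUIRED_SCOPES = {"okta.apps.read", "okta.groups.read", "okta.users.read"}


def missing_scopes(granted):
    """Return the required scopes that are not in the granted list."""
    return REQUIRED_SCOPES - set(granted)


def looks_like_pem_private_key(text: str) -> bool:
    """Check only the PEM header and footer lines, not the key contents."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 3:
        return False
    first, last = lines[0], lines[-1]
    return (
        first.startswith("-----BEGIN ")
        and first.endswith("PRIVATE KEY-----")
        and last.startswith("-----END ")
        and last.endswith("PRIVATE KEY-----")
    )


# Placeholder content for illustration; not a real key.
sample_pem = "-----BEGIN PRIVATE KEY-----\nAAAA\n-----END PRIVATE KEY-----\n"
print(sorted(missing_scopes(["okta.apps.read"])))
print(looks_like_pem_private_key(sample_pem))
```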
Add Okta as a data source
Perform the following steps to add Okta as a data source in Data Catalog:
Prerequisites:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Make sure you have the necessary Okta details. For more information, see Generate Okta credentials for Data Catalog.
Procedure:
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Okta in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Domain
The organization’s Okta domain (for example, https://yourcompany.okta.com).
Client ID
The Client ID generated from your Okta app integration.
Private Key Path
The private key used for authentication with Okta.
Click Manage Key Paths to upload or manage keys.
Ensure that the key is correctly configured in Okta and that the app integration has the necessary scopes and roles set in the Okta Admin Console.
6. Click Test Connection to test your connection to the specified data source.
Click Create Data Source to establish your data source connection.
Click Import Applications.
This process loads all the groups and applications associated with the Okta service into the Applications section. For more information, see Applications in the Use Pentaho Data Catalog document.
You can also monitor the status of the job on the Workers page.
You have successfully created a connection to Okta as a data source in Data Catalog.
After the Import Applications job completes, click Applications in the left navigation menu to view the imported hierarchy.
Note: The imported details are read-only. To sync the latest data from the Okta service, you must rerun the Import Applications job for the Okta data source. Any edits made by Data Catalog users to the imported assets will be overwritten during the next import.
The root level displays the name of the Okta data source, and the next level shows the groups retrieved from the Okta service. When you expand a group, all applications associated with that group appear below it. If an application is not part of any group in Okta, Data Catalog creates a group named Default and places such applications in it. If the same application belongs to multiple groups, it appears under each group. Each appearance is treated as a unique combination in the hierarchy view. For more information, see the Applications section in the Use Pentaho Data Catalog document.
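The grouping rules described above (data source at the root, groups below it, a Default group for ungrouped applications, and one entry per group for applications in several groups) can be sketched as follows. This Python example only illustrates the resulting shape and is not Data Catalog's implementation; the function name build_hierarchy, the data source name okta_prod, and the application and group names are all hypothetical.

```python
from collections import defaultdict


def build_hierarchy(source_name, app_groups):
    """Build {data source: {group: [applications]}}.

    app_groups maps an application name to the list of groups it belongs to.
    An empty list means the application is not part of any group.
    """
    groups = defaultdict(list)
    for app, memberships in app_groups.items():
        # Ungrouped applications are placed in a group named Default.
        for group in memberships or ["Default"]:
            groups[group].append(app)
    return {source_name: {g: sorted(apps) for g, apps in sorted(groups.items())}}


hierarchy = build_hierarchy(
    "okta_prod",
    {
        "Salesforce": ["Sales", "Finance"],  # appears under both groups
        "Jira": ["Engineering"],
        "Wiki": [],  # no group, so it lands in Default
    },
)
print(hierarchy)
```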
OneDrive or SharePoint data source
SharePoint and OneDrive in Microsoft 365 are cloud-based services that help organizations share and manage content, knowledge, and applications with seamless collaboration.
Perform the following steps to configure your OneDrive or SharePoint site as a data source within Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Microsoft OneDrive or SharePoint in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information to access Microsoft OneDrive or SharePoint.
Affinity
The Default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method
A Shared Key (default)
Application (client) ID
A unique identifier assigned to an application that has been registered in Azure Active Directory (Azure AD).
Client Secret
Password credentials to access data on the OneDrive or SharePoint site.
Tenant ID
A unique identifier of the OneDrive or SharePoint site.
Path
Folder where this data source is included.
Use / to scan all users' OneDrive and SharePoint sites from the root level, and use /<folder path>/ for a specific directory.
Use /users/<username>/ for a user-specific OneDrive.
Use /sites/ for the root of the SharePoint sites and /sites/<SharePoint site path>/ for a specific SharePoint site.
6. Click Test Connection to test your connection to the specified data source. Note: Before finalizing and saving your new data source configuration, you must perform a process called 'Scan files'. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Scan Files. This process loads files and folders into the system. You can monitor the status of the file scan on the Workers page.
Note: If you are nearing or have exceeded the limit of data you can scan with your license agreement, you see a message in the upper corner of the screen.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to OneDrive or SharePoint as a data source.
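The Path conventions in this section can be summarized with a small helper that reports what a given path would scan. This is a minimal Python sketch for illustration, not Data Catalog behavior. It assumes that paths both start and end with a slash, as in the documented examples; the function name describe_scan_path, the user name jdoe, and the site name finance are hypothetical.

```python
def describe_scan_path(path: str) -> str:
    """Describe the scan scope implied by a OneDrive or SharePoint path."""
    # Assumption: paths start and end with '/', as in the documented examples.
    if not path.startswith("/") or not path.endswith("/"):
        raise ValueError("Path must start and end with '/'")
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "all users' OneDrive and SharePoint sites"
    if parts[0] == "users" and len(parts) >= 2:
        return f"OneDrive of user {parts[1]}"
    if parts[0] == "sites":
        if len(parts) == 1:
            return "all SharePoint sites"
        return "SharePoint site " + "/".join(parts[1:])
    return "directory " + "/".join(parts)


for example in ["/", "/users/jdoe/", "/sites/", "/sites/finance/"]:
    print(example, "->", describe_scan_path(example))
```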
Oracle data source
Perform the following steps to add Oracle as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Oracle in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
Username/Password: Credentials that provide access to the specified database.
Host: The address of the machine where the Oracle database server is running. It can be an IP address or a domain name.
Port: The port number on which the Oracle server is listening for incoming connections.
Database Name: The name of the database within the Oracle server that you want to connect with.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: A service URL that looks like:
jdbc:oracle:thin:@oracle.example.com:1521/mydb
Driver
A driver is required for both the Credentials and URI configuration methods. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration: Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing: Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering: Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Oracle database as a data source.
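The Host, Port, and Database Name values used with the Credentials method correspond to the parts of the service URL shown for the URI method. The following minimal Python sketch assembles a URL in the same shape as the example in this section. It is an illustration only; the function name oracle_thin_url is hypothetical, and the values are the placeholders from that example.

```python
def oracle_thin_url(host: str, port: int, database: str) -> str:
    """Assemble a URL shaped like jdbc:oracle:thin:@<host>:<port>/<database>."""
    if not host or not database:
        raise ValueError("Host and database name are required")
    if not 1 <= port <= 65535:
        raise ValueError(f"Port out of range: {port}")
    return f"jdbc:oracle:thin:@{host}:{port}/{database}"


print(oracle_thin_url("oracle.example.com", 1521, "mydb"))
```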
PostgreSQL data source
Perform the following steps to add PostgreSQL as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select PostgreSQL in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information to access the PostgreSQL data source.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
Username/Password: Credentials that provide access to the specified database.
Host: The address of the machine where the PostgreSQL database server is running. It can be an IP address or a domain name.
Port: The port number on which the PostgreSQL server is listening for incoming connections. The default port is 5432.
Database Name: The name of the database or schema within the PostgreSQL server that you want to connect with.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: A unique identifier to locate the data source. It must include the name of the database in the connection string itself. For example, the URL would look like:
jdbc:postgresql://localhost:<port_no>/<database_name>
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the PostgreSQL data source.
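When you use the URI configuration method, the connection string can be assembled from the same values that the Credentials method asks for. The following is a minimal sketch, not part of Data Catalog: the function name build_postgres_jdbc_uri and the sample database name sales are hypothetical, and the default port 5432 is the one stated above.

```python
def build_postgres_jdbc_uri(host: str, database: str, port: int = 5432) -> str:
    """Return a URI of the form jdbc:postgresql://<host>:<port>/<database>.

    Hypothetical helper for illustration; Data Catalog only needs the
    resulting string pasted into the URI field.
    """
    if not host or not database:
        raise ValueError("host and database must be non-empty")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:postgresql://{host}:{port}/{database}"


# "sales" is a made-up database name used only for this example.
print(build_postgres_jdbc_uri("localhost", "sales"))
# prints: jdbc:postgresql://localhost:5432/sales
```

Paste the resulting string into the URI field and enter the user name and password separately.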
SAP HANA data source
Perform the following steps to add SAP HANA as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select SAP HANA in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
Username/Password: Credentials that provide access to the specified database.
Host: A physical or virtual machine (server) where an instance of SAP HANA is installed and running. It can be an IP address or a domain name.
Port: The port number on which the SAP HANA database server is listening for incoming connections.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: For example, the URI would look like
jdbc:sap://localhost:<port_no>/<database_name>?user=<user>&password=<password>
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the SAP HANA server that you want to connect with.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the SAP HANA data source.
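Because the example URI format for SAP HANA carries the password as a query parameter, it can be useful to mask that value before you paste the string into a ticket or a log. The sketch below is an illustration only: the function name redact_password, the port 30015, the database name HDB, and the credentials are hypothetical values, not Data Catalog defaults.

```python
import re

# Matches the value of a "password" query parameter up to the next "&".
_PASSWORD_PARAM = re.compile(r"([?&]password=)[^&]*")


def redact_password(uri: str) -> str:
    """Replace the password value in a connection URI with ***."""
    return _PASSWORD_PARAM.sub(r"\g<1>***", uri)


# All values in this URI are made up for the example.
uri = "jdbc:sap://localhost:30015/HDB?user=catalog_user&password=s3cret"
print(redact_password(uri))
# prints: jdbc:sap://localhost:30015/HDB?user=catalog_user&password=***
```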
Salesforce data source
Salesforce is a cloud-based customer relationship management (CRM) platform that provides organizations with a unified platform to manage customer data, sales processes, marketing campaigns, support interactions, and other key business functions. By integrating Salesforce as a data source within Data Catalog, you can access and manage metadata from Salesforce. It enables data discovery to search, explore, and understand Salesforce data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add Salesforce as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Salesforce in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
The method used to configure the connection. The only option is Credentials.
Driver
The standard used to establish communication between the application and the database. Select Default, which is an existing driver, to ensure that communication between the application and the database is efficient, secure, and compliant with the required standards.
CAUTION: Don’t change the driver for the Salesforce data source type. Changing it might disrupt the connection and cause unexpected behavior.
Username
The Salesforce login username, associated with the Salesforce account.
Password
The password of the Salesforce account to authenticate the connection.
Host
The domain or endpoint of the Salesforce instance you are connecting to.
Port
The port number to connect to the Salesforce instance.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to Salesforce as a data source.
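The naming rule for the Data Source Name field (start with a letter, then only letters, digits, and underscores, with no spaces) can be checked before you open the Create Data Source page. The following is a minimal sketch under one assumption: it treats "letters" as the ASCII letters A to Z and a to z, which the rule above does not state explicitly. The function name is hypothetical.

```python
import re

# Assumption: "letters" means ASCII letters only.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_data_source_name(name: str) -> bool:
    """Return True if the name starts with a letter and contains only
    letters, digits, and underscores."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Salesforce_CRM", "1st_source", "sales force"]:
    print(candidate, is_valid_data_source_name(candidate))
# prints:
# Salesforce_CRM True
# 1st_source False
# sales force False
```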
SMB/CIFS data source
Server Message Block (SMB) and Common Internet File System (CIFS) are Windows file sharing protocols used in storage systems. You can add data to Data Catalog from a CIFS or SMB file share through either the remote agent or the local agent, which lets you create a CIFS or SMB data source with a local file system path.
This protocol uses a client-server model where the server provides a shared file system and the client mounts the file system to access the shared files as if they were on a local disk. You can add data to the Data Catalog from any file-sharing network system that supports transfer via the Server Message Block (SMB) and Common Internet File System (CIFS) protocols.
Perform the following steps to add SMB or CIFS as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select SMB or CIFS in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
By default, it is URI.
URI: URIs are used to identify and locate resources on the internet or within a network. For example, the URI would look like
smb/cifs://server.example.com
Domain: The domain name if the SMB or CIFS server is part of a Windows domain.
Path: The SMB or CIFS path to access the data source. For example, the path would look like
smb/cifs://server:/path/to/resource
Username/Password: Credentials that provide access to the SMB or CIFS resource.
6. Click Test Connection to test your connection to the specified data source.
Click Scan Files to scan the available files in the data source.
Note: If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen.
(Optional) Configure the following options for the data source.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
Cost per Terabyte
Menu to select currency and text field to enter the price per terabyte.
Total Capacity
Field to enter the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the SMB or CIFS resource as a data source.
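If Test Connection fails for an SMB or CIFS data source, a quick network check from the agent machine can help separate a reachability problem from a credentials problem. The sketch below only attempts a TCP connection and does not authenticate. It assumes the file server listens on TCP port 445, the standard port for SMB over TCP; your environment may differ. The function name is hypothetical.

```python
import socket


def is_port_reachable(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 445 is assumed here as the usual SMB port; this check says
    nothing about whether the share or the credentials are valid.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, calling is_port_reachable("server.example.com") from the machine that runs the agent returns False when the host name does not resolve or a firewall blocks the port.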
Snowflake data source
Perform the following steps to add Snowflake as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Snowflake in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
The method used to configure the connection. The only option is Credentials.
Driver
The standard used to establish communication between the application and the database. Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.
Username
User name that provides access to the specified database.
Password
Password that provides access to the specified database.
Host
The address of the machine where the Snowflake database server is running. It can be an IP address or a domain name.
Port
The port number to connect to the Snowflake data source.
Database Name
The name of the database within Snowflake that you want to connect to.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To utilize storage optimization options, a Pentaho Data Optimizer license is required.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Snowflake data source.
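The best practice in the Ingest Schema step, leaving out system-related schemas, can be sketched as a simple filter over a list of schema names. This is an illustration, not a Data Catalog feature: the function name and the sample schema names are hypothetical, and the exclusion set assumes that INFORMATION_SCHEMA is the system schema you want to skip in a Snowflake database. Adjust the set for your own environment.

```python
# Assumption: INFORMATION_SCHEMA is the only system schema to exclude.
SYSTEM_SCHEMAS = frozenset({"INFORMATION_SCHEMA"})


def select_user_schemas(schemas, system_schemas=SYSTEM_SCHEMAS):
    """Return the schema names that are not in the system schema set.

    The comparison ignores case and keeps the original order.
    """
    excluded = {name.upper() for name in system_schemas}
    return [name for name in schemas if name.upper() not in excluded]


# Sample schema names are made up for the example.
print(select_user_schemas(["PUBLIC", "SALES", "INFORMATION_SCHEMA"]))
# prints: ['PUBLIC', 'SALES']
```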
Sybase data source
Sybase is a relational database management system (RDBMS) used for data warehousing, business intelligence, and enterprise applications. By integrating Sybase as a data source within Data Catalog, you can access and manage metadata from the Sybase database. It enables data discovery to search, explore, and understand Sybase data. Additionally, it enhances data lineage and compliance by providing detailed tracking of data movements.
Perform the following steps to add Sybase as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Sybase in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration method
The method used to configure the connection. The only option is URI.
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards. To upload a new driver, click Manage Drivers, click Add New, upload the driver, and then click Add Driver.
URI
URIs are used to access and manage various objects and services within the Sybase environment. For example, the URI would look like jdbc:sybase:tds:<hostname>:<port>?ServiceName=<dbname>
Username
Username that provides access to the Sybase database.
Password
Password that provides access to the Sybase database.
Database Name
The name of the database within the Sybase environment that contains the data you want to access.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before finalizing and saving your new data source configuration, you must perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Sybase data source.
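Because URI is the only configuration method for Sybase, a malformed string is a likely cause of a failed Test Connection. The sketch below checks a URI against the example pattern shown in the URI field description and pulls out its parts. The function name and the sample host, port, and database name are hypothetical, and the pattern accepts only the exact form given above, without additional parameters.

```python
import re

# Mirrors the documented example:
# jdbc:sybase:tds:<hostname>:<port>?ServiceName=<dbname>
_SYBASE_URI = re.compile(
    r"jdbc:sybase:tds:(?P<host>[^:/?]+):(?P<port>\d+)\?ServiceName=(?P<dbname>[^&]+)"
)


def parse_sybase_uri(uri: str) -> dict:
    """Split a Sybase JDBC URI into host, port, and database name."""
    match = _SYBASE_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not in the expected Sybase URI form: {uri!r}")
    port = int(match["port"])
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {"host": match["host"], "port": port, "dbname": match["dbname"]}


# Host, port, and database name are made up for the example.
print(parse_sybase_uri("jdbc:sybase:tds:dbhost.example.com:5000?ServiceName=inventory"))
# prints: {'host': 'dbhost.example.com', 'port': 5000, 'dbname': 'inventory'}
```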
Vertica data source
Perform the following steps to add Vertica as a data source in Data Catalog:
Refer to the Component Reference section in the Get started with Pentaho Data Catalog document to confirm that you have met all the necessary requirements listed for the data source you want to connect.
Click Management in the left navigation menu.
The Manage Your Environment page opens.
In the Resources card, click Add Data Source.
The Create Data Source page opens.
Note: If you are nearing or have exceeded the limit of data sources allowed by your license agreement, a message appears when you try to add a new data source.
Specify the following information for the connection to your data source.
Note: Data Catalog encrypts your data source connection details, such as user name and password, before storing them.
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize. Note: Names must start with a letter, and must contain only letters, digits, and underscores. Spaces in names are not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. CAUTION: If this field is left blank, Data Catalog generates a permanent identifier, which cannot be modified.
Description (Optional)
Specify a description of your data source.
4. After you have specified the basic connection information, select Vertica in the Data Source Type field. Data Catalog then prompts you to specify additional connection information based on the file system or database type you are trying to access.
5. Specify the following additional connection information.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
Username/Password: Credentials that provide access to the specified database.
Host: A physical or virtual machine (server) where an instance of the Vertica database software is installed and running. It can be an IP address or a domain name.
Port: The port number on which the Vertica server is listening for incoming connections.
Configuration Method: URI
Username/Password: Credentials that provide access to the specified database.
URI: For example, the URI would look like
jdbc:vertica://<hostname>:<port>/<database>?user=<username>&password=<password>
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the Vertica server that you want to connect with.
6. Click Test Connection to test your connection to the specified data source. A Test Connection confirmation message window opens.
Note: Before you finalize and save your new data source configuration, you need to perform a process called Scan files. If you are nearing or have exceeded the data scanning limit set by your license agreement, a message appears in the upper corner of the screen. Databases do not have a data scan quota.
7. Click Ingest Schema, select the schemas, and then click Ingest Schemas to load the database schema and related metadata. Note: Although you can select all schemas, it is a best practice to avoid selecting certain system-related schemas that are unnecessary for your needs.
8. (Optional) In the Physical Location field, specify the physical location details of the data source.
(Optional) Configure the following storage optimization options for the data source.
Note: To use storage optimization options, you need a Pentaho Data Optimizer license.
Available for Migration
Enables or disables the data source for storage optimization. When enabled, it includes the data source for data optimizer activities.
Available for Writing
Enables or disables writing capabilities for the data source and enables migration when turned on.
Available for Data Mastering
Enables or disables the data source for data mastering purposes.
(Optional) In the Cost per Terabyte field, specify the data source pricing details like currency, price per terabyte, and billing frequency.
(Optional) In the Total Capacity field, specify the total capacity of the data source in terabytes.
(Optional) Enter a Note for any additional information to share with others who might access this data source.
Click Create Data Source to establish your data source connection.
You have successfully created a connection to the Vertica data source.
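The Cost per Terabyte and Total Capacity fields together describe what a data source costs per billing period. As a rough illustration of how the two values relate, the sketch below multiplies them. How Data Catalog itself uses these fields is not described here, so treat the function, its name, and the sample figures as assumptions.

```python
from decimal import Decimal


def estimated_cost(price_per_tb: Decimal, capacity_tb: Decimal) -> Decimal:
    """Return price per terabyte times capacity, rounded to two decimals.

    The result is in whatever currency and billing period the price uses.
    """
    if price_per_tb < 0 or capacity_tb < 0:
        raise ValueError("price and capacity must not be negative")
    return (price_per_tb * capacity_tb).quantize(Decimal("0.01"))


# Sample figures: 20.00 per terabyte, 12.5 terabytes of capacity.
print(estimated_cost(Decimal("20.00"), Decimal("12.5")))
# prints: 250.00
```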