Connecting to Virtual File Systems
You can connect to most Virtual File Systems (VFS) through VFS connections in PDI. A VFS connection stores VFS properties for a specific file system. You can reuse the connection whenever you access files or folders. For example, you can use an HCP connection in HCP steps without re-entering credentials.
With a VFS connection, you can set your VFS properties with a single instance that can be used multiple times. The VFS connection supports the following file systems:
Amazon S3/Minio/HCP
Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
Note: If a connectivity issue occurs with AWS S3, perform either of the following actions:
Set the AWS_REGION or AWS_DEFAULT_REGION environment variable to the applicable Default Region.
Set the correct Default Region in the shared configuration file (~/.aws/config) or the credentials file (~/.aws/credentials). For example:
AWS sample config file
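The following sample shows both approaches, using us-east-1 as a placeholder; substitute the region that applies to your buckets.

# Option 1: environment variable (Linux/macOS shell shown)
export AWS_DEFAULT_REGION=us-east-1

# Option 2: ~/.aws/config
[default]
region = us-east-1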
See https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html and https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html for more information.
Minio accesses data objects on an Amazon compatible storage server. See the Minio Quickstart Guide for Minio setup instructions.
HCP uses the S3 protocol to access HCP. See Access to HCP REST for setup details.
Azure Data Lake Gen 1
Accesses data objects on Microsoft Azure Gen 1 storage services. You must create an Azure account and configure Azure Data Lake Storage Gen 1. See Access to Microsoft Azure.
Note: Support for Azure Data Lake Gen 1 is discontinued and limited to users with existing Gen 1 accounts. As a best practice, use Azure Data Lake Storage Gen 2. See Azure for details.
Azure Data Lake Gen 2/Blob
Accesses data objects on Microsoft Azure Gen 2 or Blob storage services. You must create an Azure account and configure Azure Data Lake Storage Gen 2 and Blob Storage. See Access to Microsoft Azure.
Google Cloud Storage
Accesses data in the Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
HCP REST
Accesses data in the Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. See Access to HCP REST for more information.
Local
Accesses data in your local physical file system.
SMB/UNC Provider
Accesses data in a Windows platform that uses the Server Message Block (SMB) protocol and Universal Naming Convention (UNC) string to specify the resource location path.
Snowflake Staging
Accesses a staging area used by Snowflake to load files. See Snowflake staging area for more information on this protocol.
After you create a VFS connection, you can use it with PDI steps and entries that support the use of VFS connections. If you are connected to a repository, the VFS connection is saved in the repository. If you are not connected to a repository, the connection is saved locally on the machine where it was created.
If a VFS connection is not available for your file system, you may be able to access it with the VFS browser.
Before you begin
You may need to set up access for specific providers before you start.
Access to Google Cloud
To access Google Cloud from PDI, you must have a Google account and a service account key file in JSON format. You must also set permissions for your Google Cloud accounts. To create service account credentials, see https://cloud.google.com/storage/docs/authentication.
Perform the following steps to set up Google Cloud Storage access:
Download the service account credentials file from the Google API Console.
Create a system environment variable named GOOGLE_APPLICATION_CREDENTIALS.
Set the variable value to the full path of the JSON key file.
You can now access Google Cloud Storage from PDI.
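For example, on Linux or macOS you can export the variable in your shell profile, and on Windows you can set it with setx; the key file path below is a placeholder for your downloaded JSON file.

# Linux/macOS (add to ~/.bashrc or equivalent)
export GOOGLE_APPLICATION_CREDENTIALS=/home/user/keys/my-project-service-account.json

# Windows (Command Prompt; open a new session afterward)
setx GOOGLE_APPLICATION_CREDENTIALS "C:\keys\my-project-service-account.json"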
Access to HCP REST
Hitachi Content Platform (HCP) is a distributed storage system that you can access through a VFS connection in the PDI client.
Within HCP, access control lists (ACLs) grant privileges for file operations. Namespaces are used for logical groupings, permissions, and object metadata. For more information, see the Introduction to Hitachi Content Platform.
This process assumes you have tenant permissions and existing namespaces. See Tenant Management Console. To create a successful VFS connection to HCP, you must also configure object versioning in your HCP namespaces.
Perform the following steps to set up access to HCP:
Sign in to the HCP Tenant Management Console.
Click Namespaces, then select the namespace Name you want to configure.

[Screenshot: HCP Tenant Management Console]
On the Protocols tab, click HTTP(S).
Verify these settings:
Enable HTTPS
Enable REST API with Authenticated access only
On the Settings tab, select ACLs.
Select Enable ACLs.
When prompted, click Enable ACLs to confirm.
HCP is now set up for access from the PDI client.
Access to Microsoft Azure
To access Azure services from PDI, create and configure the following:
Azure Data Lake Gen 1, or
Azure Data Lake Storage Gen2 and Blob Storage services
Enable the hierarchical namespace to maximize file system performance.
Access requires an Azure account with an active subscription. See Create an account for free.
Access to Azure Storage requires an Azure Storage account. See Create a storage account.
Create a VFS connection
Perform the following steps to create a VFS connection in PDI:
Start the PDI client (Spoon).
In the View tab of the Explorer pane, right-click VFS Connections, then click New.
The New VFS connection dialog box opens.

[Screenshot: New VFS connection dialog box]
In Connection Name, enter a unique name. Optionally, add a Description.
The name can include spaces. Do not use special characters such as #, $, /, \, or %.
In Connection Type, select a type:
Amazon S3/Minio/HCP
Azure Data Lake Gen 1
Azure Data Lake Gen 2 / Blob
Google Cloud Storage
HCP REST
Local
SMB/UNC Provider
Snowflake Staging
In the connection details panel, set the options for your connection type.
You can add a predefined variable to fields that have the “insert variable” icon. Place your cursor in the field, then press Ctrl+Space. Variables must be predefined in kettle.properties; runtime variables are not supported. See Kettle Variables.
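For example, a hypothetical entry in kettle.properties could define a bucket path once and then be inserted into connection fields with Ctrl+Space; the names and values below are placeholders.

# kettle.properties (hypothetical entries)
S3_STAGING_BUCKET=my-bucket/staging
HCP_NAMESPACE=my-namespace

In a connection field, reference the variable as ${S3_STAGING_BUCKET}.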
Amazon
Click S3 Connection Type and select Amazon from the list to use an Amazon S3 connection.
Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
Select the Authentication Type:
Access Key/Secret Key
Credentials File
Select the Region.
When Authentication Type is:
Access Key/Secret Key, then enter the Access Key and Secret Key, and optionally enter the Session Token.
Credentials File, then enter the Profile Name and the File Location.
Select the Default S3 Connection checkbox to make Amazon the default S3 connection.
Minio/HCP
Click S3 Connection Type and select Minio/HCP from the list to use a Minio/HCP S3 connection.
Minio accesses data objects on an Amazon compatible storage server. See the Minio Quickstart Guide for Minio setup instructions.
Enter the Access Key.
Enter the Secret Key.
Enter the Endpoint.
Enter the Signature Version.
Select the PathStyle Access checkbox to use path-style requests; otherwise, Amazon S3 bucket-style access is used. (Example request URLs for both styles are shown after this list.)
Select the Default S3 Connection checkbox to make Minio/HCP the default S3 connection.
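As background, path-style and bucket-style (virtual-hosted-style) requests differ only in where the bucket name appears in the request URL. The bucket name, region, and endpoint below are placeholders.

Path-style: https://s3.us-east-1.amazonaws.com/my-bucket/my-object
Bucket-style (virtual-hosted): https://my-bucket.s3.us-east-1.amazonaws.com/my-object

S3-compatible servers such as MinIO and HCP are typically addressed with path-style requests.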
Azure Data Lake Gen 1
Accesses data objects on Microsoft Azure Gen 1 storage services. You must create an Azure account and configure Azure Data Lake Storage Gen 1. See Access to Microsoft Azure.
The Authentication Type is Service-to-service authentication only.
Enter the Account Fully Qualified Domain Name.
Enter the Application (client) ID.
Enter the Client Secret.
Enter the OAuth 2.0 token endpoint.
Azure Data Lake Gen 2 / Blob
Accesses data objects on Microsoft Azure Gen 2 and Blob storage services. You must create an Azure account and configure Azure Data Lake Storage Gen 2 and Blob Storage. See Access to Microsoft Azure.
Select the Authentication Type:
Account Shared Key
Azure Active Directory
Shared Access Signature
Enter the Service Account Name.
Enter the Block Size in MB (minimum 1 MB, maximum 100 MB). The default is 50.
Enter the Buffer Count (minimum 2). The default is 5.
Enter the Max Block Upload Size in MB (minimum 1 MB, maximum 900 MB). The default is 100.
Select the Access Tier. The default value is Hot.
When Authentication Type is:
Account Shared Key, then enter the Service Account Shared Key.
Azure Active Directory, then enter the Application (client) ID, Client Secret, and Directory (tenant) ID.
Shared Access Signature, then enter the Shared Access Signature.
Google Cloud Storage
Accesses data objects on the Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
Enter the Service Account Key Location.
HCP REST
Accesses data objects on the Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. You must also configure object versioning in HCP namespaces. See Access to HCP REST.
Enter the Host and Port.
Enter the Tenant, Namespace, Username, and Password.
Click More options, then enter the Proxy Host and Proxy Port.
Select whether to use Accept self-signed certificate. Default: No.
Select whether the Proxy is secure. Default: No.
Local
Accesses a file system on your local machine.
Enter the Root Folder Path or click Browse to set a folder connection. Optionally, use an empty path to allow access to the root directory and its folders.
SMB/UNC Provider
Accesses Server Message Block data using a Universal Naming Convention string to specify the file location.
Enter the Domain. The domain name of the target machine hosting the resource. If the machine has no domain name, use the machine name.
Enter the Port Number. Default: 445.
Enter the Server, User Name, and Password.
Snowflake Staging
Accesses a staging area used by Snowflake to load files. See Snowflake staging area for more information.
Enter the Host Name.
Enter the Port Number. Default: 443.
Enter the Database.
Enter the Namespace, User Name, and Password.
For all connection types except Local, enter the Root Folder Path for your VFS connection. Enter the full path to connect to a specific folder. Optionally, use an empty path to allow access to all folders in the root.
Optional: Click Test to verify the connection.
Click OK.
You can now use the connection in steps and entries that support VFS connections, such as Snowflake entries or HCP steps. For general access details, see Access files with the VFS browser.
Edit a VFS connection
Perform the following steps to edit an existing VFS connection:
Right-click VFS Connections and select Edit.
In the Edit VFS Connection dialog box, select the pencil icon next to the section you want to edit.
Delete a VFS connection
Perform the following steps to delete a VFS connection:
Right-click VFS Connections.
Select Delete, then Yes, Delete.
The deleted connection no longer appears under VFS Connections in the View tab.
Access files with a VFS connection
After you create a VFS connection, you can use the VFS Open and Save dialog boxes to access files in the PDI client.
In the PDI client, select File > Open URL to open a file, or File > Save as to save a file.
The VFS Open or Save As dialog box opens.

[Screenshot: Open dialog box in the PDI client]
In the left pane, select VFS connection, then navigate to your folders and files.
Optional: Click the navigation path to show and copy the Pentaho file path. See Pentaho address for a VFS connection.
Select the file and click Open or Save.
If you are not connected to a repository, you can rename a folder or file. Click the item again to edit its name.
Pentaho address for a VFS connection
The Pentaho address is the Pentaho virtual file system (pvfs) location within your VFS connection. When you browse in the file access dialog box, the address bar shows the path for your VFS location.

When you click in the address bar, the Pentaho address appears.

You can copy and paste a Pentaho address into file path fields in steps and entries that support VFS connections.
Use the Pentaho virtual file system for Amazon S3. Existing transformations and jobs that use Amazon S3 are supported when Amazon S3 is set as the Default S3 Connection.
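For example, a hypothetical VFS connection named my-s3-connection pointing at an S3 bucket would expose an address such as the following; the connection, bucket, and file names are placeholders.

pvfs://my-s3-connection/my-bucket/input/sales_2023.csv

Pasting this address into a file name field of a step or entry that supports VFS connections resolves to the corresponding object in the underlying file system.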
Create a VFS metastore
A PDI metastore is a location for storing resources shared by multiple transformations. It enables hyperscaler deployments to access the metastore in the cloud. It also lets the PDI client and Pentaho Server reference the same VFS metastore.
The VFS connection information is stored in an XML file. The metastore can be located in one of these places:
On the machine where you run PDI, in your user directory or in a repository
On Pentaho Server, as a remote metastore in the server repository
In a cloud location that is accessible through a VFS connection
Multiple users can access the metastore when it is stored in a remote location. The remote metastore has priority over a local metastore. For example, if you configure a local metastore-config file and then connect to a Pentaho Server repository, transformations still use the remote metastore.
Enable a VFS metastore
Before you can use a remote metastore, enable a VFS connection in the PDI client. You do this by creating a metastore configuration file, then editing it.
Perform the following steps to enable a VFS metastore:
Open the PDI client and create a VFS connection to the storage location you want to use as your metastore. See Create a VFS connection.
Close the PDI client.
Go to Users\<yourusername>\.pentaho\metastore\pentaho\Amazon S3 Connection\ and copy the VFS connection file you created into Users\<yourusername>\.kettle.
Rename the file to metastore-config.
Open metastore-config in a text editor. Add the scheme and rootPath elements and their values. See Metastore configuration.
Save the file.
Restart the PDI client.
The remote VFS metastore is now enabled. Previous local connections still exist in your local metastore directory. They no longer display in the PDI client. New VFS connections are stored in the location specified in metastore-config.
Metastore configuration
The elements listed in this section are required for all remote environments. When you create a VFS connection in the PDI client, you do not need to manually edit anything in the <configuration> section.
Common elements
These elements are required for all VFS connections:
Each entry below shows the element name, its value, and a description.
scheme (<string>) - The type of connection. The values are: s3 for Amazon, MinIO, and HCP; gs for Google Cloud Storage; abfss for Azure Data Lake Storage Gen2.
rootPath (<bucket-name>[/<path>]) - The bucket name and optional folder path where you want to create the VFS metastore. The rootPath element must point to the location where you will store the metastore file in the cloud location. This is analogous to the .pentaho folder in a local metastore. Examples: miniobucket/dir1, gcpbucket/dir1.
children - A container for type-specific configurations. For example:
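The following is a hypothetical sketch of how the common elements and a few of the S3 elements listed below could be arranged in metastore-config for a MinIO-type connection. The exact nesting and the set of child elements in a file generated by the PDI client may differ, and all values shown are placeholders.

<configuration>
  <scheme>s3</scheme>
  <rootPath>miniobucket/dir1</rootPath>
  <children>
    <connectionType>1</connectionType>
    <endPoint>http://198.51.100.10:9000</endPoint>
    <accessKey>...</accessKey>
    <secretKey>...</secretKey>
    <pathStyleAccess>true</pathStyleAccess>
    <name>vfsMetastore</name>
  </children>
</configuration>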
S3 elements
The elements listed below apply to S3 environments. Some elements are conditional.
accessKey (<s3-access-key>) - The S3 user’s access key.
secretKey (<s3-secret-key>) - The S3 user’s secret key.
endPoint (<s3-endpoint>) - The URL used to access the S3 location. Examples: http://<host ip>:<port>, https://my-hcp-namespace.my-hcp-tenant.hcpdemo.hitachivantara.com.
region (<s3-region>) - The user-designated region. For example, us-east-1.
connectionType (0 or 1) - The connection type. Values: 0 to connect to AWS; 1 to connect to MinIO or HCP.
credentialFile - An encrypted string that is not user editable.
profileName (<string>) - The AWS user profile to use when connectionType is 0 (AWS) and authType is 1 (credentials file).
defaultS3Config (true or false) - Controls whether the default S3 configuration is used. Set to true to use the default S3 configuration.
credentialsFilePath (<path to AWS credentials file>) - The path to the AWS credentials file when connectionType is 0 (AWS) and authType is 1 (credentials file).
pathStyleAccess (true or false) - Controls which access style is used. Specify true for path-style access or false for bucket-style access.
signatureVersion (AWSS3V4SignerType) - The signature version used when communicating with the AWS S3 metastore location.
name (vfsMetastore) - The connection name.
description (<string>) - A description of the connection.
sessionToken (<session token string>) - Optional. A temporary credential used if the AWS bucket requires a session token for access.
authType (0 or 1) - The authentication type when connectionType is 0 (AWS). Values: 0 for Access key/Secret key; 1 for Credentials file.
GCP elements
The elements listed below apply to GCP environments:
serviceAccountKey (<string>) - A key generated from the contents of the service account JSON.
keyPath (<path>) - The path to the file containing the GCP service account JSON.
name (<string>) - The name of the connection.
description (<string>) - A description of the connection.
Azure Data Lake Storage Gen2 elements
The elements listed below apply to Azure Data Lake Storage Gen2 environments. See Azure Blob Storage for more information.
sharedKey (<encrypted string>) - The shared key for accessing the service.
accountName (<encrypted string>) - The name of the account.
accessTier (<string>) - The access tier value. Default: Hot.
blockSize (<integer>) - The block size. Default: 50.
maxSingleUploadSize (<integer>) - The maximum single block upload size. Default: 100.
bufferCount (<integer>) - The buffer count. Default: 5.
name (<string>) - The connection name.
authType (0, 1, or 2) - The authorization type. Values: 0 for Account Shared Key; 1 for Azure Active Directory; 2 for Shared Access Signature.
Steps and entries supporting VFS connections
You may have a transformation or job containing a step or entry that accesses a file on a Virtual File System.
The following steps and entries support VFS connections:
VFS browser
Some transformation steps and job entries use a Virtual File System (VFS) browser instead of VFS connections and the Open dialog box. When you use the VFS browser, you specify a VFS URL instead of a VFS connection. The URL includes a scheme that identifies the protocol used to access the file, such as HTTP. Files can be local or remote, and can also be in compressed formats such as TAR and ZIP. For more information, see the Apache Commons VFS documentation.
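For example, the Apache Commons VFS layered URI syntax can address a file inside an archive directly; the hosts and paths below are placeholders.

zip:file:///C:/downloads/archive.zip!/folder/file.txt
tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt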
Before you begin
If you need to access Google Drive, see Access to a Google Drive.
Access to a Google Drive
Perform the following setup steps to initially access Google Drive.
Follow the “Step 1” procedure in Build your first Drive app (Java) in the Google Drive APIs documentation.
This procedure turns on the Google Drive API and creates a credentials.json file.
Rename credentials.json to client_secret.json. Copy it to data-integration/plugins/pentaho-googledrive-vfs/credentials.
Restart PDI.
The Google Drive option does not appear for the VFS browser until you copy client_secret.json into the credentials directory and restart PDI.
Sign in to your Google account.
Enter your Google account credentials.
In the permission window, click Allow.
After initialization, Pentaho stores a token named StoredCredential in data-integration/plugins/pentaho-googledrive-vfs/credentials. This token lets you access Google Drive resources without signing in again. If you delete the token, you are prompted to sign in after restarting PDI. If you change account permissions, delete the token and repeat the setup.
To access Google Drive from a transformation that runs on Pentaho Server, copy StoredCredential and client_secret.json into pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials on the server.
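For example, on a Linux server both files could be copied with a command like the following; adjust the source and destination paths to match your PDI client and Pentaho Server installation directories.

cp data-integration/plugins/pentaho-googledrive-vfs/credentials/StoredCredential \
   data-integration/plugins/pentaho-googledrive-vfs/credentials/client_secret.json \
   pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials/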
Access files with the VFS browser
Perform the following steps to access files with the VFS browser.
Select File > Open in the PDI client.
The Open dialog box appears.

[Screenshot: Open dialog box]
In the left pane, select the file system type. Supported file systems include:
Local: Files on your local machine.
Hadoop Cluster: Files on any Hadoop cluster except S3.
HDFS: Files on Hadoop distributed file systems.
Google Drive: Files on Google Drive. See Access to a Google Drive.
VFS Connections: Files using a stored VFS connection.
Optional: In the Address bar, enter a VFS URI.
Examples:
FTP: ftp://userID:password@<ftp-server>/path_to/file.txt
HDFS: hdfs://myusername:mypassword@mynamenode:port/path
SMB/UNC Provider: smb://<domain>;<username>:<password>@<server>:<port>/<path>
For SMB, “domain” is the Windows host name. “Domain” and “server” can be the same when using an IP address.
Optional: Use File type to filter on file types other than transformations and jobs.
Optional: Select a file or folder and click the X icon to delete it.
Optional: Click the + icon to create a new folder.
VFS dialog boxes are configured through transformation parameters. See Configure VFS options.
Supported steps and entries
The following steps and entries support the VFS browser:
Amazon EMR Job Executor (introduced in v9.0)
Amazon Hive Job Executor (introduced in v9.0)
AMQP Consumer (introduced in v9.0)
Avro Input (introduced in v8.3)
Avro Output (introduced in v8.3)
JMS Consumer (introduced in v9.0)
Job Executor (introduced in v9.0)
Kafka consumer (introduced in v9.0)
Kinesis consumer (introduced in v9.0)
Mapping (sub-transformation)
MQTT Consumer (introduced in v9.0)
ORC Input (introduced in v8.3)
ORC Output (introduced in v8.3)
Parquet Input (introduced in v8.3)
Parquet Output (introduced in v8.3)
Oozie Job Executor (introduced in v9.0)
Simple Mapping (introduced in v9.0)
Single Threader (introduced in v9.0)
Sqoop Export (introduced in v9.0)
Sqoop Import (introduced in v9.0)
Transformation Executor (introduced in v9.0)
Weka Scoring (introduced in v9.0)
If you have a Pentaho address for an existing VFS connection, you can paste the pvfs location into file or folder fields. You do not need to use Browse.
For more information on configuring options for SFTP, see Configure SFTP VFS.
Configure VFS options
The VFS browser can be configured to set variables as parameters at runtime. The sample transformation VFS Configuration Sample.ktr is located in data-integration/samples/transformations.
For more information on setting variables, see VFS properties.
For an example of configuring an SFTP VFS connection, see Configure SFTP VFS.