Elasticsearch REST bulk insert

This step is available as a separate plugin from the Pentaho EE Marketplace.

Use the Elasticsearch REST bulk insert step if you have records that you want to submit to an Elasticsearch server for indexing. Elastic is a platform of products to search, analyze, and visualize data. The Elastic platform includes Elasticsearch, which is a Lucene-based, multi-tenant-capable, distributed search and analytics engine.

This step sends one or more batches of records to an Elasticsearch server for indexing. Because you can specify the batch size, you can send one, a few, or many records to Elasticsearch.

When record data flows out of the Elasticsearch REST bulk insert step, PDI sends it to Elasticsearch along with your index as metadata. This step is commonly used when you want to send a batch of data to an Elasticsearch server and create new indexes. You can also use this step to add a batch of data to an existing index.
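
As a rough illustration of what "records plus your index as metadata" can look like on the wire, the sketch below builds a newline-delimited JSON body in the format accepted by the Elasticsearch `_bulk` REST endpoint, where each document is preceded by an action line naming the target index. The function name `build_bulk_body`, the index name `sales`, and the sample rows are hypothetical; how the step assembles its requests internally is not documented here.

```python
import json

def build_bulk_body(index_name, documents):
    """Return an NDJSON payload in the shape the _bulk endpoint expects."""
    lines = []
    for doc in documents:
        # Action line: names the index that receives the next document.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical sample rows.
body = build_bulk_body("sales", [
    {"customer": "Acme", "total": 120.5},
    {"customer": "Globex", "total": 87.0},
])
print(body)
```

Two documents produce four lines: one action line and one source line per document.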

For more information about Elasticsearch, see:

Before you begin

Gather the following items:

  • The Elasticsearch REST bulk insert plugin. For installation details, see Install plugins.

  • A working server with Elasticsearch version 7.x or 8.x installed, or a SaaS offering for your Elasticsearch server. You should be able to connect to Elasticsearch from the computer running PDI.

    As a best practice, use compatibility mode when connecting to Elasticsearch 8.x with older clients. For details, see Connecting to Elasticsearch v8.x using the v7.17.x client.

  • Privileges to create, insert, and update on the directories that you need to access on the Elasticsearch server.

  • Files or data that you want Elasticsearch to index.

Step name

  • Step name: Specify the unique name of the Elasticsearch REST bulk insert step on the canvas. You can customize the name or leave it as the default.

Options

The Elasticsearch REST bulk insert step includes three tabs: General, Document, and Output.

General tab

Elasticsearch REST bulk insert step

Use the General tab to configure connections to your Elastic nodes and set options for the destination index.

Connection

Specify the connection options for each server in the Servers table.

| Column | Description |
| --- | --- |
| # | Number of the entry. |
| Address | Hostname (optionally specified with a variable) of the node you want to connect to. |
| Port | Port (optionally specified with a variable) of the Elastic REST interface. |
| Scheme | Scheme or protocol (optionally specified with a variable) to use for REST communication. Typically http or https for secured Elastic nodes. |
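
To show how one row of the Servers table maps to a REST base URL, here is a minimal sketch. The helper `node_url`, the host name, and the variable name `ES_HOST` are hypothetical. PDI resolves variables itself, so `string.Template` is only a stand-in for that substitution.

```python
from string import Template

def node_url(scheme, address, port, variables=None):
    """Combine Scheme, Address, and Port into a base URL.

    ASSUMPTION: ${NAME} placeholders are resolved here with string.Template
    purely for illustration; PDI performs its own variable substitution.
    """
    values = variables or {}

    def resolve(text):
        return Template(str(text)).substitute(values)

    return f"{resolve(scheme)}://{resolve(address)}:{resolve(port)}"

# Hypothetical entry; 9200 is the default Elasticsearch HTTP port.
url = node_url("https", "${ES_HOST}", 9200, {"ES_HOST": "es1.example.com"})
print(url)
```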

Authentication

Use the Authentication options to set user verification options.

| Field | Description |
| --- | --- |
| Authentication | Authentication method for the Elastic nodes. None: Connect without authentication. Basic: Provide Username and Password to use basic authentication. |
| Test | Test the connection and authentication settings. |
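
For reference, HTTP basic authentication sends the username and password as a Base64-encoded `Authorization` header. The sketch below shows that encoding only. The helper name `basic_auth_header` and the credentials are hypothetical, and the step builds this header for you when you select Basic.

```python
import base64

def basic_auth_header(username, password):
    """Return the HTTP header used for basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Hypothetical credentials, for illustration only.
header = basic_auth_header("elastic", "changeme")
print(header)
```

Because Base64 is an encoding rather than encryption, basic authentication is normally paired with the https scheme.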

Index

Use the Index options to name and test the output Elastic index.

| Field | Description |
| --- | --- |
| Index | Name of the target index for documents submitted by bulk insert requests. You can specify this value as a variable. If the index does not exist in Elasticsearch, the step creates it. |
| Test | Test connectivity to the output index. |
| Create | Create the index if it does not exist. |
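
If you want to check or create an index outside PDI, the Elasticsearch REST API uses `HEAD /<index>` to test whether an index exists and `PUT /<index>` to create it. The sketch below only constructs those requests without sending them. The helper names and the URL are hypothetical, and it is an assumption that the Test and Create buttons behave similarly.

```python
from urllib.request import Request

def index_exists_request(base_url, index_name):
    """HEAD /<index>: Elasticsearch answers 200 if the index exists, 404 if not."""
    return Request(f"{base_url}/{index_name}", method="HEAD")

def create_index_request(base_url, index_name):
    """PUT /<index>: asks Elasticsearch to create the index."""
    return Request(f"{base_url}/{index_name}", method="PUT")

# Hypothetical node and index name.
check = index_exists_request("https://es1.example.com:9200", "sales")
create = create_index_request("https://es1.example.com:9200", "sales")
print(check.get_method(), check.full_url)
print(create.get_method(), create.full_url)
```

Passing either object to `urllib.request.urlopen` would send the request. Authentication headers are omitted here for brevity.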

Document tab

Use the Document tab to specify the documents to index in bulk insert requests. You can either create a document to index from stream fields or use an existing JSON document from a field.

Create a document to index with stream field data

Elasticsearch REST Bulk Insert step, Document tab - Create index option

Select Create a document to index with stream field data to turn each row of stream data into a unique JSON document to be indexed in the bulk request.

Define which fields to use from the input stream and give each one a target name. Select Get Fields to automatically populate the list with all incoming stream fields.

| Field | Description |
| --- | --- |
| Name | Name of the source field that the step receives on the input stream. |
| Target name | Name of the destination field in the generated JSON document. |
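
Conceptually, the Name and Target name columns act as a rename map from stream fields to JSON keys. The following sketch illustrates that idea. The function `row_to_document`, the field names, and the sample row are hypothetical and do not describe the step's actual implementation.

```python
import json

def row_to_document(row, field_mapping):
    """Build one JSON-ready document from a row.

    field_mapping maps each source field (Name) to its key in the
    generated document (Target name). Unmapped fields are left out.
    """
    return {target: row[source] for source, target in field_mapping.items()}

# Hypothetical input row and mapping.
row = {"CUST_NAME": "Acme", "ORDER_TOTAL": 120.5, "INTERNAL_FLAG": "x"}
mapping = {"CUST_NAME": "customer", "ORDER_TOTAL": "total"}

document = row_to_document(row, mapping)
print(json.dumps(document))
```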

Use an existing JSON document from a field

Elasticsearch REST Bulk Insert step, Document tab - Use existing option

Select Use an existing JSON document from a field if the document you want to index is already available as JSON in a field on the input stream.

| Field | Description |
| --- | --- |
| JSON Field | Name of the incoming field that contains a JSON document to be indexed for each row of input. |
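
Because the field's content is sent as the document, it helps to confirm upstream that each value is well-formed JSON. The sketch below is one way to pre-check values. The helper `parse_json_field` is hypothetical, and the requirement that the value be a JSON object reflects how Elasticsearch documents are structured rather than any documented behavior of the step.

```python
import json

def parse_json_field(value):
    """Return (document, None) for a well-formed JSON object, else (None, message)."""
    try:
        parsed = json.loads(value)
    except (json.JSONDecodeError, TypeError) as exc:
        return None, f"not well-formed JSON: {exc}"
    if not isinstance(parsed, dict):
        # Elasticsearch documents are JSON objects, not bare arrays or scalars.
        return None, "JSON value is not an object"
    return parsed, None

good, good_error = parse_json_field('{"customer": "Acme", "total": 120.5}')
bad, bad_error = parse_json_field('{"customer": "Acme", ')
print(good, good_error)
print(bad, bad_error)
```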

Output tab

Elasticsearch REST Bulk Insert step, Output tab

Use the Output tab to configure step output and error handling.

Index settings

| Field | Description |
| --- | --- |
| ID Field | (Optional) Value that identifies the document indexed in Elasticsearch. If you do not specify a value, Elasticsearch generates an ID automatically. |
| Overwrite if exists | If selected and ID Field is specified, updates a document if the ID exists in the target index. If the ID does not exist, a new document is added to the index. |
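
In the Elasticsearch `_bulk` format, the document ID travels in the action line, and the action name decides what happens when the ID already exists: `index` adds or replaces the document, while `create` is rejected if the ID is taken. The sketch below shows one plausible mapping of these two settings onto action lines. The function `action_line` is hypothetical, and using `create` when Overwrite if exists is cleared is an assumption, since the behavior for that case is not described here.

```python
import json

def action_line(index_name, doc_id=None, overwrite=True):
    """Build the _bulk action line for one document."""
    meta = {"_index": index_name}
    if doc_id is None:
        # No ID supplied: Elasticsearch generates one automatically.
        return json.dumps({"index": meta})
    meta["_id"] = str(doc_id)
    # "index" adds the document or replaces an existing one with the same ID;
    # "create" is rejected by Elasticsearch if that ID already exists.
    # ASSUMPTION: choosing "create" when overwrite is off is illustrative only.
    action = "index" if overwrite else "create"
    return json.dumps({action: meta})

print(action_line("sales"))
print(action_line("sales", doc_id=42))
print(action_line("sales", doc_id=42, overwrite=False))
```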

Step settings

| Field | Description |
| --- | --- |
| Stop on error | Stop processing if there is an error, such as a problem adding the document or pushing the batch to the index, or if the JSON is not well-formed. If this option is cleared and an error occurs, the row is not processed, but the transformation continues so other rows can be processed. |
| Output rows | Pass through the input row data, and optionally output a new document index ID if ID Output Field is specified. |
| ID Output Field | (Optional) Name of the ID field to output newly indexed document IDs. If you leave this blank, the value in ID Field is used. |
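
A `_bulk` response from Elasticsearch reports one result per document, so a single batch can contain both successes and failures. The sketch below separates the two and collects the IDs of indexed documents, which is the kind of information that error handling and an ID output field rely on. The function `summarize_bulk_response` is hypothetical, and `sample_response` is a hand-written illustration of the response shape, not output captured from a server.

```python
def summarize_bulk_response(response):
    """Split a _bulk response into indexed document IDs and per-item failures."""
    indexed_ids, failures = [], []
    for item in response.get("items", []):
        # Each item has a single key naming the action, such as "index".
        _action, result = next(iter(item.items()))
        if "error" in result:
            failures.append({
                "id": result.get("_id"),
                "status": result.get("status"),
                "reason": result["error"].get("reason"),
            })
        else:
            indexed_ids.append(result["_id"])
    return indexed_ids, failures

# Illustrative response only; values are made up.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_index": "sales", "_id": "a1", "status": 201}},
        {"index": {"_index": "sales", "_id": "a2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field"}}},
    ],
}

ids, failures = summarize_bulk_response(sample_response)
print(ids)
print(failures)
```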

Batch settings

| Field | Description |
| --- | --- |
| Size | Number of items in a batch. Specify a size greater than 1 to perform a bulk insert. A size of 1 does not perform a bulk insert. |
| Timeout | Value and unit of measure for the maximum amount of time the bulk request can take to process on the Elastic server before the batch times out. |
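
To make the effect of Size concrete, the sketch below splits a list of rows into batches of a given size, with a final smaller batch for any remainder. The generator `batches` and the sample data are hypothetical and only illustrate the grouping, not the step's internal buffering.

```python
def batches(rows, size):
    """Yield consecutive slices of rows, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Seven hypothetical rows with a batch size of 3.
rows = list(range(7))
grouped = list(batches(rows, 3))
print(grouped)
```

With seven rows and a size of 3, the rows are grouped into two full batches and one batch containing the single remaining row.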
