Elasticsearch REST bulk insert
This step is available as a separate plugin from the Pentaho EE Marketplace.
Use the Elasticsearch REST bulk insert step if you have records that you want to submit to an Elasticsearch server for indexing. Elastic is a platform of products to search, analyze, and visualize data. The Elastic platform includes Elasticsearch, which is a Lucene-based, multi-tenant-capable, distributed search and analytics engine.
This step sends one or more batches of records to an Elasticsearch server for indexing. Because you can specify the batch size, you can send one, a few, or many records to Elasticsearch.
When record data flows out of the Elasticsearch REST bulk insert step, PDI sends it to Elasticsearch along with your index as metadata. This step is commonly used when you want to send a batch of data to an Elasticsearch server and create new indexes. You can also use this step to add a batch of data to an existing index.
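As a rough illustration of what "records plus index metadata" can look like on the wire, the sketch below builds the newline-delimited JSON body that the Elasticsearch bulk API accepts. It is not the step's actual implementation. The function name build_bulk_body, the index name sales, and the sample records are assumptions made for this example.

```python
import json

def build_bulk_body(index, documents):
    """Serialize documents into the newline-delimited JSON body of a bulk request.

    Hypothetical helper for illustration; the step's internal code may differ.
    """
    lines = []
    for doc in documents:
        # Action line: metadata naming the index that receives the next document.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

# Two sample records (made up) destined for a hypothetical "sales" index.
body = build_bulk_body("sales", [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}])
print(body)
```

Each record contributes two lines: one metadata line that carries the index name and one line with the record as JSON.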
For more information about Elasticsearch, see:
Before you begin
Gather the following items:
The Elasticsearch REST bulk insert plugin. For installation details, see Install plugins.
A working server with Elasticsearch version 7.x or 8.x installed, or a SaaS offering for your Elasticsearch server. You should be able to connect to Elasticsearch from the computer running PDI.
As a best practice, use compatibility mode when connecting to Elasticsearch 8.x with older clients. For details, see Connecting to Elasticsearch v8.x using the v7.17.x client.
Privileges to create, insert, and update on the directories that you need to access on the Elasticsearch server.
Files or data that you want Elasticsearch to index.
Step name
Step name: Specify the unique name of the Elasticsearch REST bulk insert step on the canvas. You can customize the name or leave it as the default.
Options
The Elasticsearch REST bulk insert step includes three tabs: General, Document, and Output.
General tab

Use the General tab to configure connections to your Elastic nodes and set options for the destination index.
Connection
Specify the connection options for each server in the Servers table.
#
Number of the entry.
Address
Hostname (optionally specified with a variable) of the node you want to connect to.
Port
Port (optionally specified with a variable) of the Elastic REST interface.
Scheme
Scheme or protocol (optionally specified with a variable) to use for REST communication. Typically http, or https for secured Elastic nodes.
Authentication
Use the Authentication options to set how the step authenticates with the Elastic nodes.
Authentication
Authentication method for the Elastic nodes:
None: Connect without authentication.
Basic: Provide Username and Password to use basic authentication.
Test
Test the connection and authentication settings.
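The connection and authentication options above combine into a node URL and, for Basic authentication, an HTTP Authorization header. The following is a minimal sketch of that combination, assuming hypothetical helper names (node_url, basic_auth_header), a made-up host name, and placeholder credentials. Port 9200 is the Elasticsearch default REST port; your nodes may use a different one.

```python
import base64

def node_url(scheme, address, port):
    """Combine the Scheme, Address, and Port values into a base URL."""
    return f"{scheme}://{address}:{port}"

def basic_auth_header(username, password):
    """Build the HTTP header used by basic authentication."""
    # Basic authentication sends "username:password" encoded as Base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Hypothetical node and placeholder credentials, for illustration only.
print(node_url("https", "es01.example.com", 9200))
print(basic_auth_header("user", "pass"))
```

Because Base64 is an encoding and not encryption, basic authentication is usually paired with the https scheme.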
Index
Use the Index options to name and test the output Elastic index.
Index
Name of the target index for documents submitted by bulk insert requests. You can specify this value as a variable. If the index does not exist in Elasticsearch, the step creates it.
Test
Test connectivity to the output index.
Create
Create the index if it does not exist.
Document tab
Use the Document tab to specify the documents to index in bulk insert requests. You can either create a document to index from stream fields or use an existing JSON document from a field.
Create a document to index with stream field data

Select Create a document to index with stream field data to turn each row of stream data into a separate JSON document to be indexed in the bulk request.
For each field that you want to use from the input stream, specify a target name. Select Get Fields to automatically populate the list with all incoming stream fields.
Name
Name of the source field that the step receives on the input stream.
Target name
Name of the destination field in the generated JSON document.
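The Name and Target name pairs act as a rename map from stream fields to JSON document fields. The sketch below shows the idea under stated assumptions: row_to_document, the field names, and the sample row are all hypothetical, and the step's own handling of missing fields or data types may differ.

```python
import json

def row_to_document(row, field_map):
    """Build a document from one row.

    field_map maps each source field Name to its Target name.
    Fields that are not listed in field_map are left out of the document.
    """
    return {target: row[name] for name, target in field_map.items()}

# Hypothetical input row and mapping.
row = {"cust_id": 7, "cust_name": "Ada", "load_ts": "2024-01-01"}
field_map = {"cust_id": "id", "cust_name": "name"}

print(json.dumps(row_to_document(row, field_map)))
```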
Use an existing JSON document from a field

Select Use an existing JSON document from a field if the document you want to index is already available as JSON in a field on the input stream.
JSON Field
Name of the incoming field that contains a JSON document to be indexed for each row of input.
Output tab

Use the Output tab to configure step output and error handling.
Index settings
ID Field
(Optional) Value that identifies the document indexed in Elasticsearch. If you do not specify a value, Elasticsearch generates an ID automatically.
Overwrite if exists
If selected and ID Field is specified, updates a document if the ID exists in the target index. If the ID does not exist, a new document is added to the index.
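One way to picture the ID Field and Overwrite if exists options is through the action line of a bulk request. In the Elasticsearch bulk API, the index action adds a document or replaces an existing one with the same ID, while the create action fails if the ID already exists. Whether the step maps its options to these actions in exactly this way is an assumption of this sketch, and action_line is a hypothetical helper.

```python
import json

def action_line(index, doc_id=None, overwrite=True):
    """Return the bulk action line for one document (illustrative only)."""
    meta = {"_index": index}
    if doc_id is not None:
        # An explicit ID; without it, Elasticsearch generates one.
        meta["_id"] = str(doc_id)
    # Assumption: overwrite maps to "index" (add or replace),
    # and no overwrite with an explicit ID maps to "create" (fail if present).
    operation = "index" if (overwrite or doc_id is None) else "create"
    return json.dumps({operation: meta})

print(action_line("sales"))                      # no ID Field specified
print(action_line("sales", 42))                  # ID Field with overwrite
print(action_line("sales", 42, overwrite=False)) # ID Field without overwrite
```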
Step settings
Stop on error
Stop processing if there is an error, such as a problem adding the document or pushing the batch to the index, or if the JSON is not well-formed. If this option is cleared and an error occurs, the row is not processed, but the transformation continues so other rows can be processed.
Output rows
Pass through the input row data, and optionally output a new document index ID if ID Output Field is specified.
ID Output Field
(Optional) Name of the field in which to output the IDs of newly indexed documents. If you leave this blank, the value in ID Field is used.
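The effect of Stop on error can be sketched with rows that carry a JSON document in a field. In this simplified model, a row with malformed JSON either halts processing or is skipped so that the remaining rows continue. The function collect_documents, the field name doc, and the sample rows are assumptions made for this example; the step itself also reports errors from Elasticsearch, which this sketch does not cover.

```python
import json

def collect_documents(rows, json_field, stop_on_error):
    """Parse the JSON field of each row, honoring a stop-on-error flag."""
    documents, skipped = [], []
    for row in rows:
        try:
            documents.append(json.loads(row[json_field]))
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            if stop_on_error:
                # Stop on error selected: halt at the first bad row.
                raise RuntimeError(f"Cannot index row: {row!r}") from exc
            # Stop on error cleared: skip this row and keep going.
            skipped.append(row)
    return documents, skipped

# The second row holds JSON that is not well-formed.
rows = [{"doc": '{"a": 1}'}, {"doc": '{"a": '}, {"doc": '{"a": 3}'}]
documents, skipped = collect_documents(rows, "doc", stop_on_error=False)
print(len(documents), "parsed,", len(skipped), "skipped")
```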
Batch settings
Size
Number of items in a batch. Specify a size greater than 1 to perform a bulk insert. A size of 1 does not perform a bulk insert.
Timeout
Value and unit of measure for the maximum amount of time the bulk request can take to process on the Elastic server before the batch times out.
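The batch settings can be pictured as slicing the incoming rows into groups of at most Size items, with each group sent under the same timeout. The sketch below is illustrative only. The helper names, the unit labels in UNIT_SUFFIX, and the idea that the timeout is passed as an Elasticsearch time value such as 30s are assumptions, not confirmed behavior of the step.

```python
from itertools import islice

# Hypothetical unit labels mapped to Elasticsearch time-unit suffixes.
UNIT_SUFFIX = {"milliseconds": "ms", "seconds": "s", "minutes": "m"}

def batches(rows, size):
    """Yield lists of at most `size` rows, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def timeout_param(value, unit):
    """Format a timeout value and unit as an Elasticsearch time value."""
    return f"{value}{UNIT_SUFFIX[unit]}"

# Five sample rows with a Size of 2 produce three batches; the last is partial.
for batch in batches(["r1", "r2", "r3", "r4", "r5"], 2):
    print(len(batch), "rows, timeout", timeout_param(30, "seconds"))
```

With a Size of 1, every batch holds a single row, which matches the note above that a size of 1 does not perform a bulk insert.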