# XML Input Stream (StAX)

The XML Input Stream (StAX) step reads data from XML files using the Streaming API for XML (StAX) parser.

This step is designed for fast processing of large and complex XML structures. Unlike the [Get Data from XML](http://wiki.pentaho.com/display/EAI/Get+Data+From+XML) step (which uses in-memory processing), the XML Input Stream (StAX) step streams the XML and lets you implement the processing logic in the transformation.

This step is useful when you need to parse XML and:

* you need fast data loads whose memory use does not grow with file size
* you need flexibility to read different parts of the XML in different ways without repeatedly parsing the file
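To illustrate the streaming model described above, here is a minimal Java sketch that uses the standard `javax.xml.stream` API directly. It is not the step's implementation. The class name `StaxEventDemo`, the method `readEvents`, and the output line format are hypothetical and exist only for this illustration. The reader pulls one event at a time, so memory use stays roughly constant however large the input is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventDemo {

    /** Streams the XML and returns one line per element, attribute, or text event. */
    static List<String> readEvents(String xml) {
        List<String> rows = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text chunks so each text node arrives as one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            rows.add("START_ELEMENT " + reader.getLocalName());
                            // Attributes are only available while positioned on the start tag.
                            for (int i = 0; i < reader.getAttributeCount(); i++) {
                                rows.add("ATTRIBUTE " + reader.getAttributeLocalName(i)
                                        + "=" + reader.getAttributeValue(i));
                            }
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            // Skip whitespace-only text between elements.
                            if (!reader.isWhiteSpace()) {
                                rows.add("CHARACTERS " + reader.getText().trim());
                            }
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            rows.add("END_ELEMENT " + reader.getLocalName());
                            break;
                        default:
                            break; // Other event types are ignored in this sketch.
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<Products><Product id=\"789\">Product B</Product></Products>";
        readEvents(xml).forEach(System.out::println);
    }
}
```

Each returned line corresponds loosely to a row the step would emit, with the event type, name, and value kept together so that downstream logic can decide what to do with it.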

Because some XML processing logic can be complex, you should be familiar with common PDI steps before using this step.

### Options

![XML Input Stream (StAX) step](/files/njGyzJUEqGnBIaF6NkAA)

| Option                                                 | Description                                                                                                                                                                                                                                       | Default value / Data type                               |
| ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| **Step name**                                          | Unique name of the XML Input Stream (StAX) step on the canvas.                                                                                                                                                                                    |                                                         |
| **Filename**                                           | Path to the input XML file. Select **Browse** to choose a file. If the step is connected to a previous step, you can select an incoming field that contains the file path (and **Browse** is hidden). You can use internal variables in the path. |                                                         |
| **Source is from a previous step**                     | Select to accept XML data from a previous step.                                                                                                                                                                                                   |                                                         |
| **Source field name**                                  | Incoming field to use as XML data.                                                                                                                                                                                                                |                                                         |
| **Add filename to result?**                            | Adds the processed XML file name to the transformation result.                                                                                                                                                                                    | No                                                      |
| **Skip (Elements/Attributes)**                         | Number of elements or attributes to skip before producing rows.                                                                                                                                                                                   | 0                                                       |
| **Limit (Elements/Attributes)**                        | Limits the number of elements or attributes to process. Together with **Skip**, this supports chunk loading in an outer loop.                                                                                                                     | 0                                                       |
| **Default String Length**                              | Default string length for XML data name/value fields.                                                                                                                                                                                             | 1024                                                    |
| **Encoding**                                           | Encoding of the XML file.                                                                                                                                                                                                                         | UTF-8                                                   |
| **Add Namespace information?**                         | Adds the XML data type `NAMESPACE` to the stream, including optional prefix (in name) and URI (in value). Enabling this can reduce throughput due to extra namespace handling.                                                                    | No                                                      |
| **Trim strings?**                                      | Trims whitespace, tabs, carriage returns, and line feeds from the start and end of name/value strings.                                                                                                                                            | Yes                                                     |
| **Include filename in output? / Fieldname**            | Adds the processed file name to the specified field.                                                                                                                                                                                              | `xml_filename` (String 256)                             |
| **Row number in output? / Fieldname**                  | Adds the processed row number (starting at 1).                                                                                                                                                                                                    | `xml_row_number` (Integer)                              |
| **XML data type (numeric) in output? / Fieldname**     | Adds the processed XML data type as a numeric value.                                                                                                                                                                                              | `xml_data_type_numeric` (Integer)                       |
| **XML data type (description) in output? / Fieldname** | Adds the processed XML data type as text. This is easier to read but can be slower and consume more memory than numeric types.                                                                                                                    | `xml_data_type_description` (String 25)                 |
| **XML location line in output? / Fieldname**           | Adds the source XML line number.                                                                                                                                                                                                                  | `xml_location_line` (Integer)                           |
| **XML location column in output? / Fieldname**         | Adds the source XML column number.                                                                                                                                                                                                                | `xml_location_column` (Integer)                         |
| **XML element ID in output? / Fieldname**              | Adds the element number (starting at 0). This increments per new element (not per row) and preserves nesting across levels.                                                                                                                       | `xml_element_id` (Integer)                              |
| **XML parent element ID in output? / Fieldname**       | Adds the parent element number. Together with element ID, you can reconstruct the element tree.                                                                                                                                                   | `xml_parent_element_id` (Integer)                       |
| **XML element level in output? / Fieldname**           | Adds the element nesting level, starting at 0 for the document-level `START_DOCUMENT` and `END_DOCUMENT` events.                                                                                                                                  | `xml_element_level` (Integer)                           |
| **XML path in output? / Fieldname**                    | Adds the XML path.                                                                                                                                                                                                                                | `xml_path` (String 1024)                                |
| **XML parent path in output? / Fieldname**             | Adds the parent XML path.                                                                                                                                                                                                                         | `xml_parent_path` (String 1024)                         |
| **XML data name in output? / Fieldname**               | Adds the element/attribute name and optional namespace prefix to the output.                                                                                                                                                                      | `xml_data_name` (String 1024 or Default String Length)  |
| **XML data value in output? / Fieldname**              | Adds the element/attribute value and optional namespace URI to the output.                                                                                                                                                                        | `xml_data_value` (String 1024 or Default String Length) |
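
The table notes that the element ID and parent element ID together let you reconstruct the element tree. The following Java sketch shows one way this could be done downstream of the step. The `ElementRow` holder and the `indexById` and `pathOf` helpers are hypothetical names for this illustration. The sketch assumes each element contributes one row carrying its ID, its parent's ID, and its name.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ElementTreeDemo {

    /** Hypothetical holder for one element row: ID, parent ID, and element name. */
    static final class ElementRow {
        final long id;
        final long parentId;
        final String name;

        ElementRow(long id, long parentId, String name) {
            this.id = id;
            this.parentId = parentId;
            this.name = name;
        }
    }

    /** Indexes rows by element ID so parents can be looked up directly. */
    static Map<Long, ElementRow> indexById(List<ElementRow> rows) {
        Map<Long, ElementRow> byId = new HashMap<>();
        for (ElementRow row : rows) {
            byId.put(row.id, row);
        }
        return byId;
    }

    /** Walks parent links upward from the given element and returns its slash-separated path. */
    static String pathOf(long id, Map<Long, ElementRow> byId) {
        Deque<String> parts = new ArrayDeque<>();
        ElementRow row = byId.get(id);
        while (row != null) {
            parts.addFirst(row.name);
            // Stop when the parent is unknown, or when a row points at itself.
            row = (row.parentId == row.id) ? null : byId.get(row.parentId);
        }
        return "/" + String.join("/", parts);
    }

    public static void main(String[] args) {
        // The IDs below are made up for the example, not taken from real step output.
        List<ElementRow> rows = Arrays.asList(
                new ElementRow(1, 0, "Products"),
                new ElementRow(2, 1, "Product"),
                new ElementRow(3, 2, "MetaData"));
        System.out.println(pathOf(3, indexById(rows)));
    }
}
```

In a transformation, the XML path and parent path fields already provide this information. The sketch only shows why the two ID fields are sufficient on their own.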

If you need Set/Reset functionality, you can use [Modified Java Script Value](/pdia-data-integration/pdi-transformation-steps-reference-overview/modified-java-script-value.md) or [User Defined Java Class](/pdia-data-integration/pdi-transformation-steps-reference-overview/user-defined-java-class.md). User Defined Java Class is typically faster.

### Samples

Sample transformations are included in `design-tools/data-integration/samples/transformations`:

* `XML Input Stream (StAX) Test 1 - Basic Tests.ktr`
* `XML Input Stream (StAX) Test 2 - Element Blocks.ktr`
* `XML Input Stream (StAX) Test 3 - Attribute Groups.ktr`
* `XML Input Stream (StAX) Test 4 - Hierarchies.ktr`
* `XML Input Stream (StAX) Test 5 - Performance Test Data for Element Blocks.ktr`
* `XML Input Stream (StAX) Test 6 - Namespaces.ktr`

### Example: element blocks

This example parses the `XML Input Stream (StAX) Test 2 - Element Blocks.xml` file, which includes two main blocks: Analyzer Lists and Products.

The transformation separates blocks by splitting the parent XML path into levels using Switch/Case steps. In more complex flows, consider using mappings (sub-transformations) so each block is clearly represented.
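
The routing idea can be sketched in plain Java as follows. This is an illustration of the logic, not what the Switch/Case steps run internally. It assumes that parent paths are slash-separated (for example `/ProductInformation/Products/Product`). The class `PathRouterDemo`, its methods, and the target names `analyzer-lists`, `products`, and `ignored` are hypothetical.

```java
class PathRouterDemo {

    /** Splits a slash-separated XML path into its levels, ignoring the leading slash. */
    static String[] splitLevels(String parentPath) {
        String trimmed = parentPath.startsWith("/") ? parentPath.substring(1) : parentPath;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }

    /** Returns the name of the block a row belongs to, based on the first known level found. */
    static String route(String parentPath) {
        for (String level : splitLevels(parentPath)) {
            switch (level) {
                case "AnalyzerLists":
                    return "analyzer-lists";
                case "Products":
                    return "products";
                default:
                    break; // Not a block marker; check the next level.
            }
        }
        return "ignored";
    }

    public static void main(String[] args) {
        System.out.println(route("/ProductInformation/AnalyzerResult/AnalyzerLists/AnalyzerList"));
        System.out.println(route("/ProductInformation/Products/Product/MetaData"));
        System.out.println(route("/ProductInformation/AnalyzerResult/AnalyzerDummyTest"));
    }
}
```

Rows that match no known block, such as those under `AnalyzerDummyTest` in the sample file, fall through to a default target, much like a default branch in a Switch/Case step.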

Sample XML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ProductInformation ExportTime="2010-11-23 23:56:40"
    ExportContext="german" ContextID="german" WorkspaceID="Test" id="1"
    parent="0">
    <AnalyzerResult>
        <AnalyzerLists>
            <AnalyzerList name="items.added">
                <AnalyzerElement ItemID="product?id=123456"
                    ProductID="123456" />
                <AnalyzerElement ItemID="product?id=789"
                    ProductID="789" />
            </AnalyzerList>
            <AnalyzerList name="items.deleted">
                <AnalyzerElement ItemID="product?id=111111"
                    ProductID="111111" />
                <AnalyzerElement ItemID="product?id=222222"
                    ProductID="222222" />
            </AnalyzerList>
            <AnalyzerList name="items.dummy_test">
                <AnalyzerElement ItemID="product?id=test1"
                    ProductID="test1" />
                <AnalyzerElement ItemID="product?id=test2"
                    ProductID="test2" />
            </AnalyzerList>
        </AnalyzerLists>
        <AnalyzerDummyTest>
            <AnalyzerDummyTest name="Dummy not processed" />
        </AnalyzerDummyTest>
    </AnalyzerResult>
    <Products>
        <Product id="123456" name="Product A">
            <MetaData>
                <Value AttributeID="AttrA">false</Value>
                <Value AttributeID="AttrB">true</Value>
                <Value AttributeID="AttrShortName">
                    Product A Short Name
                </Value>
                <Value AttributeID="AttrLongName">
                    Product A Long Name
                </Value>
            </MetaData>
        </Product>
        <Product id="789" name="Product B">
            <MetaData>
                <Value AttributeID="AttrA">true</Value>
                <Value AttributeID="AttrB">false</Value>
                <Value AttributeID="AttrShortName">
                    Product B Short Name
                </Value>
                <Value AttributeID="AttrLongName">
                    Product B Long Name
                </Value>
            </MetaData>
        </Product>
    </Products>
</ProductInformation>
```
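
For comparison, the Products block of this sample could be flattened outside PDI with a few lines of StAX code. The sketch below is a hand-written equivalent for illustration only and does not reflect how the sample transformation is built. The class `ProductValuesDemo`, the method `readProductValues`, and the pipe-separated output format are hypothetical. Text values are trimmed, similar to enabling **Trim strings?** in the step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ProductValuesDemo {

    /** Returns one "productId|attributeId|value" line per Value element inside a Product. */
    static List<String> readProductValues(String xml) {
        List<String> rows = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                String productId = null; // Set while the reader is inside a Product element.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("Product".equals(name)) {
                            productId = reader.getAttributeValue(null, "id");
                        } else if ("Value".equals(name) && productId != null) {
                            String attributeId = reader.getAttributeValue(null, "AttributeID");
                            // getElementText() reads the text and moves to the closing tag.
                            String value = reader.getElementText().trim();
                            rows.add(productId + "|" + attributeId + "|" + value);
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "Product".equals(reader.getLocalName())) {
                        productId = null;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A shortened excerpt of the sample file above.
        String xml = "<Products><Product id=\"789\" name=\"Product B\"><MetaData>"
                + "<Value AttributeID=\"AttrA\">true</Value>"
                + "<Value AttributeID=\"AttrShortName\">\n    Product B Short Name\n</Value>"
                + "</MetaData></Product></Products>";
        readProductValues(xml).forEach(System.out::println);
    }
}
```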

Preview examples:

* Step preview: ![Step preview](/files/j9wfAMfaVLMXfotE5Tul)
* Example transformation: ![Example transformation](/files/LTUX2mfr0vtvQDIys8zm)
* Analyzer lists results: ![Analyzer lists results](/files/fWmjaoSSIJ2hOYQ9c1Gu)
* Products results: ![Products results](/files/Nc8AWgBQSymwj3KsmXyc)

### Metadata injection support

All fields of this step support metadata injection. You can use this step with [ETL metadata injection](/pdia-data-integration/pdi-transformation-steps-reference-overview/etl-metadata-injection.md) to pass metadata to your transformation at runtime.

