# Regex Evaluation

The Regex evaluation step matches the strings of an input field against a text pattern you define with a regular expression (regex). This step uses the `java.util.regex` package. The syntax for creating the regular expressions used by this step is defined in the [java.util.regex.Pattern javadoc](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html).

You can use this step to parse a complex string of text and create new fields out of the input field with capture groups (defined by parentheses). For example, if you have an input field containing an author's name in quotes and the number of posts made by them, you can create two new fields in your transformation—one for the name and one for the number of posts.

Text to parse:

```
"Author, Ann" - 53 posts
```

Regex to create two capture groups:

```
^"([^"]*)" - (\d*) posts$
```

The resulting field values are `Author, Ann` and `53`.
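
To see what such a pattern extracts, you can try it directly against `java.util.regex`. The following is a minimal sketch, not the step's actual implementation: the class name `CaptureGroupSketch` and the helper `evaluate` are hypothetical, and it assumes the whole field value must match the pattern. The first list entry mimics the `Y`/`N` result field, and the remaining entries are the captured substrings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupSketch {
    // Same pattern as above; quotes and backslashes are escaped for the Java string literal.
    static final Pattern PATTERN = Pattern.compile("^\"([^\"]*)\" - (\\d*) posts$");

    // Hypothetical helper: returns the match indicator followed by one entry per capture group.
    static List<String> evaluate(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        if (m.matches()) {                 // assumption: the entire value must match
            out.add("Y");
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));       // group 1 = quoted name, group 2 = post count
            }
        } else {
            out.add("N");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("\"Author, Ann\" - 53 posts"));
        System.out.println(evaluate("no match here"));
    }
}
```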

### Step name

Enter the following information in the transformation step field:

* **Step name**: Specify the unique name of the Regex evaluation step on the canvas. You can customize the name or leave it as the default.

### Settings tab

![Settings tab in Regex evaluation](/spaces/YwnJ6Fexn4LZwKRHghPK/files/Qu3RjuM4R7X2o1Cii4py)

The **Settings** tab contains the following options:

| Option                               | Description                                                                                                                                                                                                                                                                                                                                                                                             |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Field to evaluate**                | Specify the name of the field from the incoming PDI stream to match against the regular expression.                                                                                                                                                                                                                                                                                                     |
| **Result field name**                | Specify the name of the output field. This field is added to the outgoing PDI stream and has a value of `Y` if the input matched the regular expression, or `N` if it did not match.                                                                                                                                                                                                                    |
| **Create fields for capture groups** | Select to create new fields based on capture groups in the regular expression. When selected, substrings in the captured groups are extracted and stored in new output fields that you specify in the **Capture Group Fields** table. Each capture group must have a corresponding field definition in the table, and the order must match the order of the capturing groups in the regular expression. |
| **Replace previous fields**          | Select to replace incoming fields with newly created capture group fields when the names match. If cleared, the step adds new fields to the outgoing stream for each capture group. This option is available only when **Create fields for capture groups** is selected.                                                                                                                                |
| **Regular expression**               | Specify your regular expression. Click **Test regEx** to open the Regular expression evaluation window.                                                                                                                                                                                                                                                                                                 |
| **Use variable substitution**        | Select to expand variable references to their values before evaluating the regular expression pattern.                                                                                                                                                                                                                                                                                                  |

### Capture Group Fields table

Use the **Capture Group Fields** table to specify the new fields for the substrings captured by the regular expression from the input string.

| Column        | Description                                                                                                                                                                                                                                                                                                         |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **New field** | Name of the new field generated from the regular expression.                                                                                                                                                                                                                                                        |
| **Type**      | Type of data.                                                                                                                                                                                                                                                                                                       |
| **Length**    | Length of the field.                                                                                                                                                                                                                                                                                                |
| **Precision** | Number of floating point digits for number-type fields.                                                                                                                                                                                                                                                             |
| **Format**    | Optional mask for converting the format of the original field. See [Common Formats](/pdia-data-integration/pdi-transformation-steps-reference-overview/common-formats.md) for common valid date and numeric formats. **Note:** Format is applied only when converting a non-string data type to a string data type. |
| **Group**     | The grouping character (`,` for `10,000.00`, or `.` for `5.000,00`).                                                                                                                                                                                                                                                |
| **Decimal**   | The character used as a decimal point.                                                                                                                                                                                                                                                                              |
| **Currency**  | Currency symbol (for example, `$` or `€`).                                                                                                                                                                                                                                                                          |
| **Null If**   | Treat this value as null.                                                                                                                                                                                                                                                                                           |
| **Default**   | Default value when the incoming value is not specified (empty).                                                                                                                                                                                                                                                     |
| **Trim**      | The trim method to apply to a string.                                                                                                                                                                                                                                                                               |

For more information, see [Understanding PDI data types and field metadata](/pdia-data-integration/understanding-pdi-data-types-and-field-metadata.md).
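
The **Group** and **Decimal** columns matter when a captured substring is converted to a number. As an illustration only, the sketch below uses `java.text.DecimalFormat` to parse the two sample values from the table. It is an assumption that this resembles the step's own conversion; the class name `NumberMaskSketch`, the helper `parse`, and the mask `#,##0.00` are chosen for the example.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class NumberMaskSketch {
    // Hypothetical helper: parses text using explicit grouping and decimal characters.
    static double parse(String text, char group, char decimal) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setGroupingSeparator(group);
        symbols.setDecimalSeparator(decimal);
        DecimalFormat format = new DecimalFormat("#,##0.00", symbols);
        try {
            return format.parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Both calls yield the same kind of numeric value despite the different notation.
        System.out.println(parse("10,000.00", ',', '.'));
        System.out.println(parse("5.000,00", '.', ','));
    }
}
```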

### Content tab

![Content tab in Regex evaluation](/spaces/YwnJ6Fexn4LZwKRHghPK/files/B2nlePBA5hYSmj1BVrDJ)

The **Content** tab contains the following options:

| Option                                        | Description                                                                                                                                                                                                                   |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Ignore differences in Unicode encodings**   | Select to ignore different Unicode character encodings. This action may improve performance, but your data can only contain US ASCII characters.                                                                              |
| **Enables case-insensitive matching**         | Select to use case-insensitive matching. By default, only characters in the US-ASCII charset are matched case-insensitively. The execution flag is `(?i)`.                                                                   |
| **Permit whitespace and comments in pattern** | Select to ignore whitespace and embedded comments starting with `#` through the end of the line. In this mode, you must use the `\s` token to match whitespace. The execution flag is `(?x)`.                                 |
| **Enable dotall mode**                        | Select to have the dot character (`.`) also match line terminators. The execution flag is `(?s)`.                                                                                                                             |
| **Enable multiline mode**                     | Select to have `^` and `$` match at the start and the end of each line of the input sequence. By default, these expressions match only at the beginning and the end of the entire input sequence. The execution flag is `(?m)`. |
| **Enable Unicode-aware case folding**         | Select this option with **Enables case-insensitive matching** to perform case-insensitive matching consistent with the Unicode standard. The execution flag is `(?u)`.                                                        |
| **Enables Unix lines mode**                   | Select to recognize only the line terminator `\n` in the behavior of `.`, `^`, and `$`. The execution flag is `(?d)`.                                                                                                         |
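
Most of these options correspond to an embedded flag that can also be written at the start of a pattern. The sketch below tries several of them directly with `java.util.regex`. The class name `FlagSketch`, the helpers `fullMatch` and `found`, and the sample strings are made up for the illustration, and the sketch makes no claim about how the step combines the check boxes internally.

```java
import java.util.regex.Pattern;

public class FlagSketch {
    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: true when the pattern matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching for US-ASCII letters.
        System.out.println(fullMatch("(?i)posts", "POSTS"));        // true
        // Without (?u), the non-ASCII letter U+00E9 is not case-folded to U+00C9.
        System.out.println(fullMatch("(?i)\u00e9", "\u00c9"));      // false
        System.out.println(fullMatch("(?iu)\u00e9", "\u00c9"));     // true
        // (?s): the dot also matches line terminators.
        System.out.println(fullMatch("a.b", "a\nb"));               // false
        System.out.println(fullMatch("(?s)a.b", "a\nb"));           // true
        // (?d): only \n counts as a line terminator, so the dot matches \r.
        System.out.println(fullMatch("a.b", "a\rb"));               // false
        System.out.println(fullMatch("(?d)a.b", "a\rb"));           // true
        // (?x): whitespace and # comments in the pattern are ignored; \s matches the real space.
        System.out.println(fullMatch("(?x) \\d+ # digits\n \\s posts", "53 posts")); // true
        // (?m): ^ and $ also match at line boundaries inside the input.
        System.out.println(found("^posts$", "53\nposts\nend"));     // false
        System.out.println(found("(?m)^posts$", "53\nposts\nend")); // true
    }
}
```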

### Example

Suppose your input field contains a text value like `"Author, Ann" - 53 posts.` (including the trailing period). The following regular expression creates four capturing groups and can be used to parse out the different parts:

```
^"(([^"]+), ([^"]+))" - (\d+) posts\.$
```

This expression creates the following four capturing groups, which become output fields:

* Fullname: `(([^"]+), ([^"]+))`
* Lastname: `([^"]+)`
* Firstname: `([^"]+)`
* Number of posts: `(\d+)`

A field definition must be present for each capturing group.

If the number of capture groups in the regular expression does not match the number of fields specified, the step fails and writes an error to the log.

Capturing groups can be nested. In the example above, the fields **Lastname** and **Firstname** correspond to capturing groups that are contained inside the **Fullname** capturing group.
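
Group numbers follow the order of the opening parentheses, which is why **Fullname** comes before the two groups nested inside it. A minimal sketch with `java.util.regex` follows. The class name `NestedGroupSketch` and the helper `groups` are hypothetical, and the sample value ends with a period because the pattern requires one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupSketch {
    // Four capturing groups: 1 = full name, 2 = last name, 3 = first name, 4 = post count.
    static final Pattern PATTERN =
        Pattern.compile("^\"(([^\"]+), ([^\"]+))\" - (\\d+) posts\\.$");

    // Hypothetical helper: returns groups 1..n in order, or an empty array when there is no match.
    static String[] groups(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"Fullname", "Lastname", "Firstname", "Number of posts"};
        String[] values = groups("\"Author, Ann\" - 53 posts.");
        for (int i = 0; i < values.length; i++) {
            System.out.println(names[i] + " = " + values[i]);
        }
    }
}
```

Comparing `groupCount()` with the number of rows you defined in the **Capture Group Fields** table is a quick way to check a pattern before running the transformation.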

The `design-tools/data-integration/samples/transformations` directory contains `Regex Eval - parse NCSA access log records.ktr` as another example.


