# Discover metadata from a text file

Use the **Discover metadata from a text file** step to determine the structure and metadata of delimited text files for which you have limited knowledge of the structure.

Enter a list of possible delimiters, enclosures, and escape characters to determine the configuration that produces the most consistent match of data in the file. Consistency is determined by the count of fields in the rows. For example, when testing a semicolon as the delimiter, the count of fields in the first three rows is 3, 8, and 4, which means the field count is inconsistent; therefore, the semicolon is not the correct delimiter. When testing a comma delimiter on those rows produces a field count of 6, 6, and 6, this is considered a consistent delimiter and is an acceptable candidate for use.

You can use this step to generate data to send to the **ETL Metadata Injection** step. The **ETL Metadata Injection** step can then set up the metadata in the **Text File Input** step for use in your transformations or jobs.

The **Discover metadata from a text file** step also determines the field names from the header row and predicts the data types for the list of fields.

{% hint style="info" %}
This step opens and reads the file four times to gather all required information. On very large files, scanning may take a long time if you do not set a row limit with **Limit scanned rows**.
{% endhint %}

### Step name

* **Step name**: Specify the unique name of the **Discover metadata from a text file** step on the canvas. You can customize the name or leave it as the default.

### Options

The **Discover metadata from a text file** step includes four tabs: **Input**, **Delimiter candidates**, **Enclosure candidates**, and **Escape candidates**.

#### Input tab

![Discover metadata from a text file Input tab](/files/4JfXVMapmnTVua6vYeMS)

Use the following options in the **Input** tab to specify details for the input text file:

| Option                                    | Description                                                                                                                                                                                                                                    |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **File name**                             | Select the delimited file you want to evaluate. The file location can be any location supported by a VFS connection. See [Connecting to Virtual File Systems](/pdia-data-integration/extracting-data-into-pdi/virtual-file-system-browser.md). |
| **Trim fields**                           | Select to remove leading and trailing spaces from fields.                                                                                                                                                                                      |
| **Header column name detection strategy** | Select the strategy to use to determine the column names in the file. After the step identifies the header row, it ignores any rows above the header row and counts the following rows as data.                                                |

* **First possible line containing only strings**: Selects the first line that contains only string values as the header row. For example, if the data has five fields per row and you set **Maximum number of header rows** to 6, the step searches the first six rows in the file for a row containing five string fields. The first row encountered with five string fields is selected as the header.
* **First possible line containing any data type**: Selects the first line that contains a consistent number of fields as the header row. For example, if the file rows contain five fields, the first line containing five fields is selected as the header row within **Maximum number of header rows**, regardless of the data types in the fields.
* **Last possible line containing only strings**: Selects the last line that contains only string values as the header row.
* **Last possible line containing any data type**: Selects the last line that contains a consistent number of fields as the header row. | | **Maximum number of header rows** | Enter the maximum number of rows that can be a header. If the file does not have a header row, set this to `0`. Only one row can be a header. | | **Maximum number of footer rows** | Enter the maximum number of rows that can be a footer. If the file does not have a footer row, set this to `0`.

**Note:** The number of footer rows can only be determined if the entire file is scanned. | | **Fallback charset** | Select the character set of the file. If the step cannot determine a character set for the file, it defaults to ISO-8859-1. | | **Limit scanned rows** | Enter the number of rows to scan in the file before determining the valid set of delimiters and enclosures used in the file. To scan the entire file, enter `0`. |

#### Delimiter candidates tab

![Discover metadata from a text file Input tab](/files/MjPprpIl89B6DPqphh7S)

Enter the delimiter candidates to use in the file scan in the **Delimiter candidates** field. The delimiter candidates can be one or more characters.

The step tests each candidate and all combinations of the candidates to find the most accurate match. For example, you may receive a file from two sources where the first source uses a comma delimiter and the second source uses a semicolon delimiter. The step can determine the data structure of both files.

#### Enclosure candidates tab

![Discover metadata from a text file Enclosure candidates tab](/files/rXJbLSd9RraNn1q7FG5r)

You can enter characters to use for enclosures on the **Enclosure candidates** tab. The available options are listed in the following table:

| Option                           | Description                                                                                                                                                                  |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Enclosure character required** | Select to require that all fields must be enclosed with enclosure characters. Clear to make enclosure characters optional.                                                   |
| **Ignore enclosure errors**      | Select to ignore enclosure errors when a row is parsed that contains a different number of fields. Clear to generate enclosure errors, which stops the step from continuing. |
| **Enclosure candidates**         | Enter the enclosure characters to use for the file scanning. Enclosure candidates can be one or more characters.                                                             |

#### Escape candidates tab

![Discover metadata from a text file Escape candidates tab](/files/i6wNd369aSe11MN7nCVF)

Use the **Escape candidates** tab to require and define escape characters. The available options are listed in the following table:

| Option                        | Description                                                                                       |
| ----------------------------- | ------------------------------------------------------------------------------------------------- |
| **Escape character required** | Select to require an escape character. Clear to have the step ignore escape characters.           |
| **Escape candidates**         | Enter a character to use as an escape character. Escape candidates can be one or more characters. |

#### Delimiter and data type detection rules

Because there are many different variations of delimited files, this step might not detect the structure of every type of delimited file. The following rules are used to evaluate fields:

* The step uses a left outer Cartesian join of the delimiter, enclosure, and escape candidate options.
  * The step performs inner joins when **Enclosure character required** or **Escape character required** is selected. This means a null enclosure or a null escape is not allowed.
  * The delimiter character cannot be the same as the enclosure or escape character. For example, if you use a double quote (`"`) for both enclosure and escape, the step ignores that candidate combination.
* If the enclosure and escape characters are the same, the delimiter cannot be escaped. Only the enclosure character can be escaped.
* If any enclosure errors are found (such as an unclosed enclosure or an unescaped enclosure character), the enclosure is not considered valid. If you expect enclosure errors, select **Ignore enclosure errors**.
* A header row is any row within **Maximum number of header rows** that has a number of fields consistent with the number of fields in the data. The row after the last inconsistent row may also be considered a header row if:
  * Header row strategy is **First possible line containing only strings** and the first row that contains only string values is the next row after the last inconsistent row.
  * Header row strategy is **First possible line containing any data type** and there are no rows with a consistent number of fields before the last inconsistent row.
  * Header row strategy is **Last possible line containing only strings** and the next row after the last inconsistent row is all strings.
  * Header row strategy is **Last possible line containing any data type**.
* A footer row is the first row with an inconsistent number of fields and any following rows. When **Limit scanned rows** prevents the entire file from being scanned, the file is not evaluated for footer rows.
* If multiple delimiters, enclosures, or escape characters appear to match the file, the step cannot determine the format.
* Multiple-character enclosures or escapes may result in incorrect data type results.
* Ignoring enclosure errors may result in incorrect data type results.
* The field length is determined by the length of the longest field detected during the scan. If you set **Limit scanned rows**, the step only checks that number of rows.

### Examples

Your Pentaho distribution includes several sample transformations and datasets in the `design-tools/data-integration/samples/transformations/discover-metadata-from-textfile` directory.

The following code is a portion of the `Sample1.txt` file found in the directory:

```
policyID,county,eq_site_limit,eq_site_deductible,point_longitude
710400,CLAY COUNTY,0,0,-81.71624
703001,CLAY COUNTY,0,0,-81.706865
352792,CLAY COUNTY,0,0,-81.718452
717603,CLAY COUNTY,0,0,-81.718452
937659,SUWANNEE COUNTY,0,0,-82.926659
294022,SUWANNEE COUNTY,0,0,-82.926659
410500,SUWANNEE COUNTY,0,0,-82.926659
524433,SUWANNEE COUNTY,218475,0,-82.926155
972562,SUWANNEE COUNTY,0,0,-82.933777

```

When the step runs, the file is scanned to determine a consistent number of fields using the tab character, then the semicolon, then the comma (default). When **Header column name detection strategy** is set to **First possible line containing only strings**, the step identifies the first row as the header row. The following table shows the column names and data types.

| Column name          | Data type |
| -------------------- | --------- |
| policylD2            | Integer   |
| county               | String    |
| eq\_site\_limit      | BigNumber |
| eq\_site\_deductible | Integer   |
| point\_longitude     | BigNumber |

If any of the fields in the first row are numbers or dates, the row is treated as data, which means there is no header row in this example.

### Data lineage

This step includes a data lineage analyzer. See [Data lineage](broken://pages/2XMrd1mL6ltIgh0VTZTu).

### Metadata injection support

All fields of this step support metadata injection. You can use this step with [ETL metadata injection](/pdia-data-integration/pdi-transformation-steps-reference-overview/etl-metadata-injection.md) to pass metadata to your transformation at runtime.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.pentaho.com/pdia-data-integration/pdi-transformation-steps-reference-overview/discover-metadata-from-a-text-file.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
