# Hadoop File Input

Use the **Hadoop File Input** step to read data from a variety of text file formats stored on a Hadoop cluster. Common formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.

You can use this step to:

* Specify a list of files to read.
* Specify directories and use wildcards (regular expressions).
* Accept file names from a previous step.

### Step name

* **Step name**: Specify the unique name of the step on the canvas. You can customize the name or leave the default.

### Options

The Hadoop File Input step includes the following tabs: **File**, **Content**, **Error Handling**, **Filters**, and **Fields**.

#### File tab

![File tab](/files/aCfXZdZi8YRWx8i2J7l1)

In this tab, specify the environment and other details for the file you want to read.

| Option          | Description                                                 |
| --------------- | ----------------------------------------------------------- |
| **Environment** | File system or specific cluster where the input is located: |

* **Local**: The file is on a file system local to the PDI client (Spoon).
* : Use the path in **File/Folder** (for example, when you want to paste a known path).
* **S3**: The file is stored on S3.
* : The file is in the selected cluster.

| Option                 | Description                                                                                                                                                                                                                                                    |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **File/Folder**        | Location and/or name of the text file to read. Select the ellipsis button (`...`) to browse in the [VFS browser](/pdia-data-integration/archived-merged-pages/connecting-to-virtual-file-systems-archive/vfs-browser-connecting-to-virtual-file-systems.md). |
| **Wildcard (RegExp)**  | Regular expression used to select files in the directory specified in **File/Folder**. See [Selecting a file using regular expressions](#selecting-a-file-using-regular-expressions).                                                                          |
| **Required**           | Whether the file is required.                                                                                                                                                                                                                                  |
| **Include subfolders** | Whether to include subfolders.                                                                                                                                                                                                                                 |

**Accept file names from previous steps**

![Accept filenames from previous steps](/files/tAmPxhx1BJRCNuZTSofW)

The **Accept file names from previous steps** section lets you pass file names into this step from another step, such as *Get File Names*. File names can come from any source, such as a text file or a database table.

| Option                                     | Description                                          |
| ------------------------------------------ | ---------------------------------------------------- |
| **Accept file names from previous steps**  | Select to get file names from previous steps.        |
| **Pass through fields from previous step** | Select to get field information from previous steps. |
| **Step to read file names from**           | Name of the step to read file names from.            |
| **Field in the input to use as file name** | Field that contains the file name.                   |

**Show action buttons**

![Action buttons](/files/xSp3MOAjREiYfEUDO2ml)

After you enter file details, you can use the following buttons:

| Button                                | Description                                                                |
| ------------------------------------- | -------------------------------------------------------------------------- |
| **Show filename(s)**                  | Displays a list of all files loaded based on the current file definitions. |
| **Show file content**                 | Displays the raw content of the selected file.                             |
| **Show content from first data line** | Displays the content starting from the first data line.                    |

**Selecting a file using regular expressions**

Use **Wildcard (RegExp)** to search for files by regular expression.

| File/Folder | Regular expression  | Files selected                                                                                                   |
| ----------- | ------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `/dirA/`    | `.*userdata.*\.txt` | Finds all files in `/dirA/` with names containing `userdata` and ending with `.txt`.                             |
| `/dirB/`    | `AAA.*`             | Finds all files in `/dirB/` with names that start with `AAA`.                                                    |
| `/dirC/`    | `[A-Z][0-9].*`      | Finds all files in `/dirC/` with names that start with a capital letter and are followed by a digit (`A0`-`Z9`). |
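
The patterns in the table can be tried out locally before you enter them in **Wildcard (RegExp)**. The following Python sketch is only an approximation: it assumes that the expression must match the entire file name, and the `select_files` helper and the sample file names are hypothetical.

```python
import re


def select_files(names, wildcard):
    """Return the names that the regular expression matches in full.

    Assumption: the whole file name must match, not just a part of it.
    """
    pattern = re.compile(wildcard)
    return [name for name in names if pattern.fullmatch(name)]


# Hypothetical directory listing used only for illustration.
names = ["new_userdata_2024.txt", "userdata.csv", "AAA_report.txt", "A7.log", "a7.log"]

contains_userdata = select_files(names, r".*userdata.*\.txt")
starts_with_aaa = select_files(names, r"AAA.*")
capital_then_digit = select_files(names, r"[A-Z][0-9].*")
```

Note that `.` matches any single character, so a literal dot in an extension is written as `\.`.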

**Open file (S3 environment)**

When you select **S3** in **Environment** and then select the ellipsis button (`...`) in **File/Folder**, the Open File dialog box appears.

![Open File dialog box](/files/km3NRMEtj6BFBbaVxQCb)

1. In **Connection**, provide:

   | Option               | Description                                                         |
   | -------------------- | ------------------------------------------------------------------- |
   | **Access Key**       | User name needed to access the S3 file system.                      |
   | **Secret Key**       | Password needed to access the S3 file system.                       |
   | **Open from Folder** | Path of the directory to browse. This becomes the active directory. |
2. In **Open from Folder**, navigate to the directory.
3. Use the toolbar icons to view and manage the active directory:

   | Option                 | Description                                 |
   | ---------------------- | ------------------------------------------- |
   | **Up One Level**       | Displays the parent directory.              |
   | **Delete**             | Deletes a folder from the active directory. |
   | **Create Folder**      | Creates a folder in the active directory.   |
   | **Name/Type/Modified** | Displays directory contents and metadata.   |
   | **Filter**             | Filters results displayed in the directory. |
4. Select **OK** to continue or **Cancel** to return to the **File** tab.

#### Content tab

![Content tab](/files/diPNCBkX8bV8wwtwKvtp)

Use the **Content** tab to specify the format of the text files that are being read.

| Option                                                                                   | Description                                                                                                                                                                                 |
| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Filetype**                                                                             | Select **CSV** or **Fixed length**. Based on this selection, the PDI client launches a different helper UI when you select **Get Fields** on the **Fields** tab.                            |
| **Separator**                                                                            | One or more characters that separate fields in a line of text. Typically semicolon (`;`) or tab.                                                                                            |
| **Enclosure**                                                                            | Optional string used to enclose fields (to allow separator characters within fields).                                                                                                       |
| **Allow breaks in enclosed fields**                                                      | Not implemented.                                                                                                                                                                            |
| **Escape**                                                                               | Escape character(s). Example: with backslash (`\`) as an escape character and a single quote (`'`) as the enclosure, `Not the nine o\'clock news` is parsed as `Not the nine o'clock news`. |
| **Header** and **Number of header lines**                                                | Select if your text file includes header lines, then specify the number of header lines.                                                                                                    |
| **Footer** and **Number of footer lines**                                                | Select if your text file includes footer lines, then specify the number of footer lines.                                                                                                    |
| **Wrapped lines** and **Number of times wrapped**                                        | Select if lines wrap beyond a page limit. Headers and footers are never considered wrapped.                                                                                                 |
| **Paged layout (printout)**, **Number of lines per page**, and **Document header lines** | Use as a last resort for printer-oriented text. Use **Document header lines** to skip introductory text and **Number of lines per page** to position the data lines.                        |
| **Compression**                                                                          | Use if the text file is in a ZIP or GZIP archive. Only the first file in the archive is read.                                                                                               |
| **No empty rows**                                                                        | Select to prevent sending empty rows to downstream steps.                                                                                                                                   |
| **Include filename in output?**                                                          | Select to include the file name in the output stream.                                                                                                                                       |
| **Filename fieldname**                                                                   | Name of the output field that contains the file name.                                                                                                                                       |
| **Rownum in output?**                                                                    | Select to include the row number in the output stream.                                                                                                                                      |
| **Rownum fieldname** and **Rownum by file?**                                             | Name of the output field that contains the row number. Select **Rownum by file?** to restart the numbering for each file.                                                                   |
| **Format**                                                                               | Line ending format: DOS, UNIX, or mixed.                                                                                                                                                    |
| **Encoding** and **Limit**                                                               | Text encoding to use. Leave blank to use the default system encoding. For Unicode, specify UTF-8 or UTF-16. **Limit** sets the maximum number of lines to read; `0` reads all lines.        |
| **Be lenient when parsing dates?**                                                       | Select for lenient parsing (for example, `Jan 32nd` becomes `Feb 1st`). Clear for strict parsing.                                                                                           |
| **The date format Locale**                                                               | Locale used to parse dates written in full (for example, `February 2nd, 2016`).                                                                                                             |
| **Add filenames to result**                                                              | Adds file names to the transformation’s result file list.                                                                                                                                   |
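
To see how **Separator**, **Enclosure**, and **Escape** interact, the following sketch splits one line with Python's standard `csv` module. It is an illustration under stated assumptions, not the step's own parser: the `split_line` helper is hypothetical, and the semicolon separator, single-quote enclosure, and backslash escape are taken from the examples in the table.

```python
import csv


def split_line(line, separator=";", enclosure="'", escape="\\"):
    """Split one line of text into fields.

    Assumption: Python's csv module stands in for the step's parser here.
    """
    reader = csv.reader(
        [line],
        delimiter=separator,
        quotechar=enclosure,
        escapechar=escape,
    )
    return next(reader)


# The enclosure protects the separator in the last field, and the escape
# character protects the enclosure character in the second field.
fields = split_line(r"1;'Not the nine o\'clock news';'a;b'")
```

Without the escape character, the quote in `o'clock` would close the enclosed field early.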

#### Error Handling tab

![Error Handling tab](/files/pzkqw3B99LRwX8UuUcX3)

Use the **Error Handling** tab to specify how the step reacts to parsing errors.

| Option                                   | Description                                                                                                                       |
| ---------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **Ignore errors?**                       | Select to ignore errors during parsing.                                                                                           |
| **Skip error lines?**                    | Select to skip lines that contain errors. You can generate an extra file that contains the line numbers where errors occur.       |
| **Error count field name**               | Output field that contains the number of errors on the line.                                                                      |
| **Error fields field name**              | Output field that contains the field names on which an error occurred.                                                            |
| **Error fields text field name**         | Output field that contains the parsing error descriptions.                                                                        |
| **Warnings file directory**              | Directory for warning files. File name format: `<warning dir>/filename.<date_time>.<warning extension>`.                          |
| **Error files directory**                | Directory for error files. File name format: `<errorfile_dir>/filename.<date_time>.<errorfile_extension>`.                        |
| **Failing line numbers files directory** | Directory for files listing failing line numbers. File name format: `<errorline dir>/filename.<date_time>.<errorline extension>`. |
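
The three error output fields can be pictured with a small sketch. This is a simplified model that assumes each field has a converter that raises `ValueError` on bad input; the function names and the output field names `error_count`, `error_fields`, and `error_text` are hypothetical stand-ins for whatever you enter in the tab.

```python
def parse_row(values, fields):
    """Convert raw strings and record which fields failed.

    fields is a list of (name, converter) pairs.
    """
    row = {}
    failed = []
    messages = []
    for (name, convert), raw in zip(fields, values):
        try:
            row[name] = convert(raw)
        except ValueError as exc:
            row[name] = None  # the value could not be parsed
            failed.append(name)
            messages.append(f"{name}: {exc}")
    row["error_count"] = len(failed)        # number of errors on the line
    row["error_fields"] = ",".join(failed)  # names of the failing fields
    row["error_text"] = "; ".join(messages)  # parsing error descriptions
    return row


def read_rows(lines, fields, separator=";", skip_error_lines=False):
    """Parse every line; optionally drop lines that contain errors."""
    rows = []
    for line in lines:
        row = parse_row(line.split(separator), fields)
        if skip_error_lines and row["error_count"] > 0:
            continue
        rows.append(row)
    return rows


fields = [("id", int), ("amount", float)]
all_rows = read_rows(["1;2.5", "2;abc"], fields)
clean_rows = read_rows(["1;2.5", "2;abc"], fields, skip_error_lines=True)
```

In this model, keeping the error fields lets a downstream step decide what to do with a partially parsed row, while skipping removes the row entirely.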

#### Filters tab

![Filters tab](/files/bBuzY4vaBr70wqUEJkaD)

Use the **Filters** tab to specify lines you want to skip.

| Option              | Description                                                                                                       |
| ------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Filter string**   | String to search for.                                                                                             |
| **Filter position** | Position where the filter string must appear. `0` is the first position. Values below `0` search the entire line. |
| **Stop on filter**  | Enter `Y` to stop processing the current file when the filter string is encountered.                              |
| **Positive match**  | When enabled, only lines that match the filter string are passed on. Negative filters take precedence: a line that matches a negative filter is discarded even if it also matches a positive filter. |
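
The filter options can be modeled roughly as follows. This sketch reflects one reading of the table (negative filters are checked first, and a negative position searches the whole line); the `LineFilter` class and `apply_filters` function are hypothetical and do not reproduce the step's exact behavior.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LineFilter:
    text: str               # Filter string
    position: int = -1      # Filter position; below 0 searches the whole line
    stop: bool = False      # Stop on filter
    positive: bool = False  # Positive match

    def matches(self, line: str) -> bool:
        if self.position < 0:
            return self.text in line
        return line.startswith(self.text, self.position)


def apply_filters(lines: Iterable[str], filters: List[LineFilter]) -> List[str]:
    negatives = [f for f in filters if not f.positive]
    positives = [f for f in filters if f.positive]
    kept = []
    for line in lines:
        # Assumption: negative filters are evaluated before positive ones.
        hit = next((f for f in negatives if f.matches(line)), None)
        if hit is not None:
            if hit.stop:
                break  # stop processing the rest of this file
            continue   # skip only this line
        if positives and not any(f.matches(line) for f in positives):
            continue
        kept.append(line)
    return kept


lines = ["# comment", "a;1", "b;2", "END", "c;3"]
skipped = apply_filters(lines, [LineFilter("#", 0), LineFilter("END", stop=True)])
only_a = apply_filters(lines, [LineFilter("a", 0, positive=True)])
```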

#### Fields tab

Use the **Fields** tab to specify the name and format of the fields being read.

| Option        | Description                                                                                                                                                  |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Name**      | Field name.                                                                                                                                                  |
| **Type**      | Field type, such as **String**, **Date**, or **Number**.                                                                                                     |
| **Format**    | Format pattern. See [Number formats](#number-formats) and [Date formats](#date-formats).                                                                     |
| **Position**  | Position for fixed-length file types (0-based).                                                                                                              |
| **Length**    | For **Number**: total number of significant figures. For **String**: string length. For **Date**: printed output length (for example, `4` returns the year). |
| **Precision** | For **Number**: number of digits after the decimal point. Unused for other types.                                                                            |
| **Currency**  | Currency symbol used to interpret numbers such as `$10,000.00` or `€5.000,00`.                                                                               |
| **Decimal**   | Decimal symbol (period `.` or comma `,`).                                                                                                                    |
| **Group**     | Grouping symbol (comma `,` or period `.`).                                                                                                                   |
| **Null if**   | Value to treat as null.                                                                                                                                      |
| **Default**   | Default value when the file field is empty.                                                                                                                  |
| **Trim type** | Trim behavior: None, Left, Right, or Both.                                                                                                                   |
| **Repeat**    | Repeat the last non-empty value when this value is empty (`Y` or `N`).                                                                                       |
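
As a rough illustration of how **Trim type**, **Null if**, **Default**, and **Repeat** might combine for a single column, consider the sketch below. The order in which the options are applied is an assumption, and `clean_column` is a hypothetical helper rather than part of PDI.

```python
from typing import Iterable, List, Optional

_TRIMMERS = {
    "none": lambda s: s,
    "left": str.lstrip,
    "right": str.rstrip,
    "both": str.strip,
}


def clean_column(
    raw_values: Iterable[str],
    trim: str = "none",
    null_if: Optional[str] = None,
    default: Optional[str] = None,
    repeat: bool = False,
) -> List[Optional[str]]:
    strip = _TRIMMERS[trim]
    cleaned = []
    last_value = None
    for raw in raw_values:
        value = strip(raw)
        if value == "":
            # Assumption: Repeat wins over Default when both could apply.
            value = last_value if (repeat and last_value is not None) else default
        elif null_if is not None and value == null_if:
            value = None
        else:
            last_value = value  # remember the last non-empty value
        cleaned.append(value)
    return cleaned


repeated = clean_column([" a ", "", "N/A", "b", "  "], trim="both", null_if="N/A", repeat=True)
defaulted = clean_column(["", "x"], default="0")
```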

For general guidance on field metadata, see [Understanding PDI data types and field metadata](/pdia-data-integration/understanding-pdi-data-types-and-field-metadata.md).

**Number formats**

For further information on valid numeric formats, see the [Number Formatting Table](http://wiki.pentaho.com/display/Reporting/Number+Formatting+Table).

| Symbol | Location            | Localized | Meaning                                                                                                                                                                     |
| ------ | ------------------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `0`    | Number              | Yes       | Digit.                                                                                                                                                                      |
| `#`    | Number              | Yes       | Digit; zero shows as absent.                                                                                                                                                |
| `.`    | Number              | Yes       | Decimal separator or monetary decimal separator.                                                                                                                            |
| `-`    | Number              | Yes       | Minus sign.                                                                                                                                                                 |
| `,`    | Number              | Yes       | Grouping separator.                                                                                                                                                         |
| `E`    | Number              | Yes       | Separates mantissa and exponent in scientific notation.                                                                                                                     |
| `;`    | Subpattern boundary | Yes       | Separates positive and negative patterns.                                                                                                                                   |
| `%`    | Prefix or suffix    | Yes       | Multiply by 100 and show as a percentage.                                                                                                                                   |
| `‰`    | Prefix or suffix    | Yes       | Multiply by 1000 and show as per mille.                                                                                                                                     |
| `¤`    | Prefix or suffix    | No        | Currency sign. If doubled, replaced by the international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator. |
| `'`    | Prefix or suffix    | No        | Quotes special characters in a prefix or suffix. To create a single quote itself, use two in a row: `# o''clock`.                                                           |
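
Because the decimal and grouping separators are localized, the same digits can mean different values depending on the **Decimal** and **Group** settings of a field. The following sketch shows the idea with a hypothetical `parse_number` helper; it is not the parser the step uses and ignores the format pattern entirely.

```python
from decimal import Decimal


def parse_number(text, decimal=".", group=",", currency=""):
    """Parse a localized number string into a Decimal.

    Assumption: the currency symbol and grouping symbols are simply removed,
    and the decimal symbol is normalized to a period.
    """
    cleaned = text.strip()
    if currency:
        cleaned = cleaned.replace(currency, "")
    cleaned = cleaned.replace(group, "").replace(decimal, ".")
    return Decimal(cleaned)


us_style = parse_number("$10,000.00", currency="$")
eu_style = parse_number("€5.000,00", decimal=",", group=".", currency="€")
```

If the symbols are swapped by mistake, `5.000,00` no longer parses as five thousand, which is why the field settings must match the file.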

**Scientific notation**

In a pattern, the exponent character immediately followed by one or more digits indicates scientific notation.

Example: `0.###E0` formats `1234` as `1.234E3`.

**Date formats**

For further information on valid date formats, see the [Date Formatting Table](http://wiki.pentaho.com/display/Reporting/Date+Formatting+Table).

| Letter | Date or time component | Presentation      | Examples                                       |
| ------ | ---------------------- | ----------------- | ---------------------------------------------- |
| `G`    | Era designator         | Text              | `AD`                                           |
| `y`    | Year                   | Year              | `1996` or `96`                                 |
| `M`    | Month in year          | Month             | `July`, `Jul`, or `07`                         |
| `w`    | Week in year           | Number            | `27`                                           |
| `W`    | Week in month          | Number            | `2`                                            |
| `D`    | Day in year            | Number            | `189`                                          |
| `d`    | Day in month           | Number            | `10`                                           |
| `F`    | Day of week in month   | Number            | `2`                                            |
| `E`    | Day in week            | Text              | `Tuesday` or `Tue`                             |
| `a`    | am/pm marker           | Text              | `PM`                                           |
| `H`    | Hour in day (0-23)     | Number            | n/a                                            |
| `k`    | Hour in day (1-24)     | Number            | n/a                                            |
| `K`    | Hour in am/pm (0-11)   | Number            | n/a                                            |
| `h`    | Hour in am/pm (1-12)   | Number            | n/a                                            |
| `m`    | Minute in hour         | Number            | n/a                                            |
| `s`    | Second in minute       | Number            | n/a                                            |
| `S`    | Millisecond            | Number            | n/a                                            |
| `z`    | Time zone              | General time zone | `Pacific Standard Time`, `PST`, or `GMT-08:00` |
| `Z`    | Time zone              | RFC 822 time zone | `-0800`                                        |
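
If you want to check a date pattern outside PDI, one option is to translate it into an equivalent pattern for another parser. The sketch below maps a small subset of the letters in the table to Python `strptime` directives. The mapping covers only fixed-width numeric patterns and is an assumption made for illustration; `to_strptime` and `parse_date` are hypothetical helpers.

```python
import re
from datetime import datetime

# Subset of pattern letters from the table and their strptime counterparts.
JAVA_TO_STRPTIME = {
    "yyyy": "%Y",
    "yy": "%y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}

# Longer tokens first so that "yyyy" is not read as two "yy" tokens.
_TOKEN_RE = re.compile("|".join(sorted(JAVA_TO_STRPTIME, key=len, reverse=True)))


def to_strptime(pattern):
    """Translate a pattern such as yyyy-MM-dd into strptime syntax."""
    return _TOKEN_RE.sub(lambda m: JAVA_TO_STRPTIME[m.group(0)], pattern)


def parse_date(text, pattern):
    return datetime.strptime(text, to_strptime(pattern))


converted = to_strptime("yyyy-MM-dd HH:mm:ss")
parsed = parse_date("2016-02-02 13:45:10", "yyyy-MM-dd HH:mm:ss")
```

Note the case sensitivity: `MM` is the month and `mm` is the minute, so mixing them up yields a valid but wrong pattern.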

### Metadata injection support

All fields of this step support metadata injection except **Hadoop Cluster**. You can use this step with [ETL metadata injection](/pdia-data-integration/pdi-transformation-steps-reference-overview/etl-metadata-injection.md) to pass metadata to your transformation at runtime.

