Hadoop File Input

Use the Hadoop File Input step to read data from a variety of text file formats stored on a Hadoop cluster. Common formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.

You can use this step to:

  • Specify a list of files to read.

  • Specify directories and use wildcards (regular expressions).

  • Accept file names from a previous step.

Step name

  • Step name: Specify the unique name of the step on the canvas. You can customize the name or leave the default.

Options

The Hadoop File Input step includes the following tabs: File, Content, Error Handling, Filters, and Fields.

File tab

In this tab, specify the environment and other details for the file you want to read.

Option
Description

Environment

File system or specific cluster where the input is located:

  • Local: The file is on a file system local to the PDI client (Spoon).

  • <Static>: Use the path in File/Folder exactly as entered (for example, when you want to paste a known path).

  • S3: The file is stored on S3.

  • A named Hadoop cluster: The file is in the selected cluster.

File/Folder

Location and/or name of the text file to read. Select the ellipsis button (...) to browse in the VFS browser.

Wildcard (RegExp)

Regular expression used to select files in the directory specified in File/Folder. See Selecting a file using regular expressions.

Required

Whether the file is required.

Include subfolders

Whether to include subfolders.

Accept file names from previous steps

The Accept filenames from previous steps section lets you pass file names into this step from another step, such as Get File Names. File names can come from any source, such as a text file or a database table.

Option
Description

Accept file names from previous steps

Select to get file names from previous steps.

Pass through fields from previous step

Select to get field information from previous steps.

Step to read file names from

Name of the step to read file names from.

Field in the input to use as file name

Field that contains the file name.

Show action buttons

After you enter file details, you can use the following buttons:

Button
Description

Show filename(s)

Displays a list of all files loaded based on the current file definitions.

Show file content

Displays the raw content of the selected file.

Show content from first data line

Displays the content starting from the first data line.

Selecting a file using regular expressions

Use Wildcard (RegExp) to search for files by regular expression.

File name
Regular expression
Files selected

/dirA/

.*userdata.*\.txt

Finds all files in /dirA/ with names containing userdata and ending with .txt.

/dirB/

AAA.*

Finds all files in /dirB/ with names that start with AAA.

/dirC/

[A-Z][0-9].*

Finds all files in /dirC/ with names that start with a capital letter and are followed by a digit (A0-Z9).
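
As a rough illustration of how these expressions select files, the following Java sketch applies the three patterns from the table to a made-up list of file names. It assumes that the expression must match the whole file name, which is how matches() in java.util.regex behaves. The class name WildcardSketch and the sample file names are hypothetical and are not part of PDI.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WildcardSketch {
    // Returns the names that the regular expression matches in full.
    // Assumption: the whole file name must match, not just a part of it.
    static List<String> select(String regex, List<String> names) {
        Pattern pattern = Pattern.compile(regex);
        return names.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical directory listing used only for this example.
        List<String> names = Arrays.asList(
                "2016_userdata_east.txt", "userdata.csv", "AAA_report.log", "B7_summary.txt");

        System.out.println(select(".*userdata.*\\.txt", names)); // [2016_userdata_east.txt]
        System.out.println(select("AAA.*", names));              // [AAA_report.log]
        System.out.println(select("[A-Z][0-9].*", names));       // [B7_summary.txt]
    }
}
```

Note that the backslash before the period is doubled only because it appears inside a Java string literal. In the Wildcard (RegExp) field itself, you would enter a single backslash.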

Open file (S3 environment)

When you select S3 in Environment and then select the ellipsis button (...) in File/Folder, the Open File dialog box appears.

  1. In Connection, provide:

    Option
    Description

    Access Key

    User name needed to access the S3 file system.

    Secret Key

    Password needed to access the S3 file system.

    Open from Folder

    Path of the directory to browse. This becomes the active directory.

  2. In Open from Folder, navigate to the directory.

  3. Use the toolbar icons to view and manage the active directory:

    Option
    Description

    Up One Level

    Displays the parent directory.

    Delete

    Deletes a folder from the active directory.

    Create Folder

    Creates a folder in the active directory.

    Name/Type/Modified

    Displays directory contents and metadata.

    Filter

    Filters results displayed in the directory.

  4. Select OK to continue or Cancel to return to the File tab.

Content tab

Use the Content tab to specify the format of the text files that are being read.

Option
Description

Filetype

Select CSV or Fixed length. Based on this selection, the PDI client launches a different helper UI when you select Get Fields on the Fields tab.

Separator

One or more characters that separate fields in a line of text. Typically semicolon (;) or tab.

Enclosure

Optional string used to enclose fields (to allow separator characters within fields).

Allow breaks in enclosed fields

Not implemented.

Escape

Escape character(s). Example: with backslash (\) as an escape character and a single quote (') as the enclosure, Not the nine o\'clock news is parsed as Not the nine o'clock news.

Header and Number of header lines

Select if your text file includes header lines. Specify how many times the header line appears.

Footer and Number of footer lines

Select if your text file includes footer lines. Specify how many times the footer line appears.

Wrapped lines and Number of times wrapped

Select if lines wrap beyond a page limit. Headers and footers are never considered wrapped.

Paged layout (printout), Number of lines per page, and Document header lines

Use as a last resort for printer-oriented text. Use Document header lines to skip introductory text and Number of lines per page to position the data lines.

Compression

Use if the text file is in a ZIP or GZIP archive. Only the first file in the archive is read.

No empty rows

Select to prevent sending empty rows to downstream steps.

Include filename in output?

Select to include the file name in the output stream.

Filename fieldname

Name of the output field that contains the file name.

Rownum in output?

Select to include the row number in the output stream.

Rownum fieldname and Rownum by file?

Name of the output field that contains the row number. Select Rownum by file? to restart the row numbering for each file.

Format

Line ending format: DOS, UNIX, or mixed.

Encoding and Limit

Text encoding to use. Leave blank to use the default system encoding. For Unicode, specify UTF-8 or UTF-16. Limit sets the maximum number of lines to read; 0 reads all lines.

Be lenient when parsing dates?

Select for lenient parsing (for example, Jan 32nd becomes Feb 1st). Clear for strict parsing.

The date format Locale

Locale used to parse dates written in full (for example, February 2nd, 2016).

Add filenames to result

Adds file names to the transformation’s result file list.
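
To make the interaction of Separator, Enclosure, and Escape more concrete, here is a minimal Java sketch of a line splitter. It is not the step's actual parser: the class EnclosureSketch and its split method are hypothetical, and the sketch assumes single-character separator, enclosure, and escape values and a line without embedded line breaks.

```java
import java.util.ArrayList;
import java.util.List;

class EnclosureSketch {
    // Splits one line into fields, honoring an enclosure and an escape character.
    static List<String> split(String line, char separator, char enclosure, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // keep the escaped character literally
            } else if (c == enclosure) {
                enclosed = !enclosed;             // enter or leave an enclosed field
            } else if (c == separator && !enclosed) {
                fields.add(current.toString());   // separator outside an enclosure ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Backslash as escape, single quote as enclosure, semicolon as separator.
        System.out.println(split("'Not the nine o\\'clock news';42", ';', '\'', '\\'));
        // prints [Not the nine o'clock news, 42]
        System.out.println(split("'a;b';c", ';', '\'', '\\'));
        // prints [a;b, c]
    }
}
```

The second call shows why an enclosure is useful: the semicolon inside the quotes stays part of the first field instead of splitting it.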

Error Handling tab

Use the Error Handling tab to specify how the step reacts to parsing errors.

Option
Description

Ignore errors?

Select to ignore errors during parsing.

Skip error lines?

Select to skip lines that contain errors. You can generate an extra file that contains the line numbers where errors occur.

Error count field name

Output field that contains the number of errors on the line.

Error fields field name

Output field that contains the field names on which an error occurred.

Error fields text field name

Output field that contains the parsing error descriptions.

Warnings file directory

Directory for warning files. File name format: <warning dir>/filename.<date_time>.<warning extension>.

Error files directory

Directory for error files. File name format: <errorfile_dir>/filename.<date_time>.<errorfile_extension>.

Failing line numbers files directory

Directory for files listing failing line numbers. File name format: <errorline dir>/filename.<date_time>.<errorline extension>.

Filters tab

Use the Filters tab to specify lines you want to skip.

Option
Description

Filter string

String to search for.

Filter position

Position where the filter string must appear. 0 is the first position. Values below 0 search the entire line.

Stop on filter

Enter Y to stop processing the current file when the filter string is encountered.

Positive match

When enabled, only lines that match the filter are passed. Negative filters take precedence: a line that matches a negative filter is always discarded.
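
The Filter string and Filter position rules can be sketched in a few lines of Java. This is an illustration under assumptions, not the step's implementation: the class LineFilterSketch and its methods are hypothetical, and the sketch covers only the default case in which matching lines are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineFilterSketch {
    // True when the filter string is found at the configured position.
    static boolean matches(String line, String filterString, int position) {
        if (position < 0) {
            return line.contains(filterString);         // below 0: search the entire line
        }
        return line.startsWith(filterString, position); // 0 is the first character
    }

    // Drops every line that matches the filter and keeps the rest.
    static List<String> skipMatching(List<String> lines, String filterString, int position) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!matches(line, filterString, position)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical file content used only for this example.
        List<String> lines = Arrays.asList("# exported 2016-02-02", "id;name", "1;Ann # vip");

        System.out.println(skipMatching(lines, "#", 0));  // [id;name, 1;Ann # vip]
        System.out.println(skipMatching(lines, "#", -1)); // [id;name]
    }
}
```

With position 0, only the comment line is skipped. With a negative position, any line that contains the filter string anywhere is skipped.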

Fields tab

Use the Fields tab to specify the name and format of the fields being read.

Option
Description

Name

Field name.

Type

Field type, such as String, Date, or Number.

Format

Format pattern. See Number formats and Date formats.

Position

Position for fixed-length file types (0-based).

Length

For Number: total number of significant figures. For String: string length. For Date: printed output length (for example, 4 returns the year).

Precision

For Number: number of digits after the decimal point. Unused for other types.

Currency

Currency symbol used to interpret numbers such as $10,000.00 or €5.000,00.

Decimal

Decimal symbol (period . or comma ,).

Group

Grouping symbol (comma , or period .).

Null if

Value to treat as null.

Default

Default value when the file field is empty.

Trim type

Trim behavior: None, Left, Right, or Both.

Repeat

Repeat the last non-empty value when this value is empty (Y or N).

For general guidance on field metadata, see Understanding PDI data types and field metadata.
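
The effect of the Decimal and Group settings can be shown with Java's java.text.DecimalFormat. The sketch below assumes that the two settings map onto the decimal and grouping separators of DecimalFormatSymbols; the class NumberSymbolsSketch and its parse method are hypothetical, and the currency symbol is left out to keep the example small.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class NumberSymbolsSketch {
    // Parses text using the given decimal and grouping symbols.
    static double parse(String text, char decimal, char group) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimal);
        symbols.setGroupingSeparator(group);
        DecimalFormat format = new DecimalFormat("#,##0.00", symbols);
        try {
            return format.parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("10,000.00", '.', ',')); // 10000.0
        System.out.println(parse("5.000,00", ',', '.'));  // 5000.0
    }
}
```

The same digits are read correctly only when Decimal and Group agree with the way the file was written, which is why the two symbols are usually set as a pair.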

Number formats

For further information on valid numeric formats, see the Number Formatting Table.

Symbol
Location
Localized
Meaning

0

Number

Yes

Digit.

#

Number

Yes

Digit; zero shows as absent.

.

Number

Yes

Decimal separator or monetary decimal separator.

-

Number

Yes

Minus sign.

,

Number

Yes

Grouping separator.

E

Number

Yes

Separates mantissa and exponent in scientific notation.

;

Subpattern boundary

Yes

Separates positive and negative patterns.

%

Prefix or suffix

Yes

Multiply by 100 and show as a percentage.

‰

Prefix or suffix

Yes

Multiply by 1000 and show as per mille.

¤

Prefix or suffix

No

Currency sign. If doubled, replaced by the international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.

'

Prefix or suffix

No

Quotes special characters in a prefix or suffix. To create a single quote itself, use two in a row: # o''clock.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digits indicates scientific notation.

Example: 0.###E0 formats 1234 as 1.234E3.
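
The symbols in the table above match those of Java's java.text.DecimalFormat. Assuming the step uses that class for number patterns, the following sketch shows how a few patterns behave. The class NumberFormatSketch is hypothetical, and the symbols are pinned to Locale.US so that the output does not depend on the machine's locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class NumberFormatSketch {
    // Formats a value with the given pattern, using US decimal and grouping symbols.
    static String format(String pattern, double value) {
        DecimalFormat format = new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US));
        return format.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format("0.###E0", 1234));                // 1.234E3
        System.out.println(format("#,##0.00", 1234.5));             // 1,234.50
        System.out.println(format("0.0%", 0.256));                  // 25.6%
        System.out.println(format("#,##0.00;(#,##0.00)", -1234.5)); // (1,234.50)
    }
}
```

The last pattern uses the semicolon to give negative numbers their own subpattern, here accounting-style parentheses.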

Date formats

For further information on valid date formats, see the Date Formatting Table.

Letter
Date or time component
Presentation
Examples

G

Era designator

Text

AD

y

Year

Year

1996 or 96

M

Month in year

Month

July, Jul, or 07

w

Week in year

Number

27

W

Week in month

Number

2

D

Day in year

Number

189

d

Day in month

Number

10

F

Day of week in month

Number

2

E

Day in week

Text

Tuesday or Tue

a

am/pm marker

Text

PM

H

Hour in day (0-23)

Number

n/a

k

Hour in day (1-24)

Number

n/a

K

Hour in am/pm (0-11)

Number

n/a

h

Hour in am/pm (1-12)

Number

n/a

m

Minute in hour

Number

n/a

s

Second in minute

Number

n/a

S

Millisecond

Number

n/a

z

Time zone

General time zone

Pacific Standard Time, PST, or GMT-08:00

Z

Time zone

RFC 822 time zone

-0800
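
The letters in the table above match those of Java's java.text.SimpleDateFormat. Assuming the step uses that class for date patterns, the sketch below parses a date with one pattern and prints it with another. It also shows the difference between lenient and strict parsing described for the Be lenient when parsing dates? option on the Content tab. The class DateFormatSketch is hypothetical, and the locale and time zone are pinned to Locale.US and UTC so that the results are reproducible.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    // Builds a formatter pinned to US English and UTC.
    static SimpleDateFormat formatter(String pattern, boolean lenient) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(lenient);
        return format;
    }

    // Parses text with one pattern and prints it back with another.
    static String reformat(String text, String inPattern, String outPattern, boolean lenient) {
        try {
            Date date = formatter(inPattern, lenient).parse(text);
            return formatter(outPattern, true).format(date);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Strict parsing of a valid date, printed with text presentations.
        System.out.println(reformat("2016-02-02", "yyyy-MM-dd", "EEE, MMM d, yyyy", false));
        // prints Tue, Feb 2, 2016

        // Lenient parsing rolls January 32nd over to February 1st.
        System.out.println(reformat("2016-01-32", "yyyy-MM-dd", "yyyy-MM-dd", true));
        // prints 2016-02-01

        // Strict parsing rejects the same input.
        try {
            reformat("2016-01-32", "yyyy-MM-dd", "yyyy-MM-dd", false);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Text presentations such as EEE and MMM depend on the locale, which corresponds to the date format Locale option on the Content tab.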

Metadata injection support

All fields of this step support metadata injection except Hadoop Cluster. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.
