Hadoop File Input
Use the Hadoop File Input step to read data from a variety of text file formats stored on a Hadoop cluster. Common formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.
You can use this step to:
Specify a list of files to read.
Specify directories and use wildcards (regular expressions).
Accept file names from a previous step.
Step name
Step name: Specify the unique name of the step on the canvas. You can customize the name or leave the default.
Options
The Hadoop File Input step includes the following tabs: File, Content, Error Handling, Filters, and Fields.
File tab

In this tab, specify the environment and other details for the file you want to read.
Environment
File system or specific cluster where the input is located:
Local: The file is on a file system local to the PDI client (Spoon).
: Use the path in File/Folder (for example, when you want to paste a known path).
S3: The file is stored on S3.
: The file is in the selected cluster.
File/Folder
Location and/or name of the text file to read. Select the ellipsis button (...) to browse in the VFS browser.
Wildcard (RegExp)
Regular expression used to select files in the directory specified in File/Folder. See Selecting a file using regular expressions.
Required
Whether the file is required.
Include subfolders
Whether to include subfolders.
Accept file names from previous steps

The Accept filenames from previous steps section lets you pass file names into this step from another step, such as Get File Names. File names can come from any source, such as a text file or a database table.
Accept file names from previous steps
Select to get file names from previous steps.
Pass through fields from previous step
Select to get field information from previous steps.
Step to read file names from
Name of the step to read file names from.
Field in the input to use as file name
Field that contains the file name.
Show action buttons

After you enter file details, you can use the following buttons:
Show filename(s)
Displays a list of all files loaded based on the current file definitions.
Show file content
Displays the raw content of the selected file.
Show content from first data line
Displays the content starting from the first data line.
Selecting a file using regular expressions
Use Wildcard (RegExp) to select files by regular expression. Each example below lists the File/Folder value, the Wildcard (RegExp) value, and the files that are selected.
/dirA/
.*userdata.*\.txt
Finds all files in /dirA/ with names containing userdata and ending with .txt.
/dirB/
AAA.*
Finds all files in /dirB/ with names that start with AAA.
/dirC/
[A-Z][0-9].*
Finds all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9).
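The three patterns above can be tried outside PDI before you enter them in Wildcard (RegExp). The following is a minimal sketch in Python, not PDI code: PDI evaluates the expression with its own Java regex engine, the file names are hypothetical, and the sketch assumes that the expression must match the whole file name.

```python
import re

# Hypothetical file names; the patterns are .*userdata.*\.txt, AAA.* and [A-Z][0-9].*
EXAMPLES = {
    r".*userdata.*\.txt": ["2016_userdata_01.txt", "userdata.txt", "userdata.csv"],
    r"AAA.*": ["AAA_report.csv", "AAA", "BAAA.csv"],
    r"[A-Z][0-9].*": ["A0_sales.txt", "Z9", "a0_sales.txt", "AB1.txt"],
}

def select_files(pattern, names):
    """Return the names that the pattern matches in full (assumed behavior)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]

for pattern, names in EXAMPLES.items():
    print(pattern, "->", select_files(pattern, names))
```

Note that the dot before txt is escaped (\.) so that it matches a literal period rather than any character.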
Open file (S3 environment)
When you select S3 in Environment and then select the ellipsis button (...) in File/Folder, the Open File dialog box appears.

In Connection, provide:
Access Key
User name needed to access the S3 file system.
Secret Key
Password needed to access the S3 file system.
Open from Folder
Path of the directory to browse. This becomes the active directory.
In Open from Folder, navigate to the directory.
Use the toolbar icons to view and manage the active directory:
Up One Level
Displays the parent directory.
Delete
Deletes a folder from the active directory.
Create Folder
Creates a folder in the active directory.
Name/Type/Modified
Displays directory contents and metadata.
Filter
Filters results displayed in the directory.
Select OK to continue or Cancel to return to the File tab.
Content tab

Use the Content tab to specify the format of the text files that are being read.
Filetype
Select CSV or Fixed length. Based on this selection, the PDI client launches a different helper UI when you select Get Fields on the Fields tab.
Separator
One or more characters that separate fields in a line of text. Typically semicolon (;) or tab.
Enclosure
Optional string used to enclose fields (to allow separator characters within fields).
Allow breaks in enclosed fields
Not implemented.
Escape
Escape character(s). Example: with backslash (\) as an escape character and a single quote (') as the enclosure, Not the nine o\'clock news is parsed as Not the nine o'clock news.
Header and Number of header lines
Select if your text file includes header lines. Specify how many times the header line appears.
Footer and Number of footer lines
Select if your text file includes footer lines. Specify how many times the footer line appears.
Wrapped lines and Number of times wrapped
Select if lines wrap beyond a page limit. Headers and footers are never considered wrapped.
Paged layout (printout), Number of lines per page, and Document header lines
Use as a last resort for printer-oriented text. Use Document header lines to skip introductory text and Number of lines per page to position the data lines.
Compression
Use if the text file is in a ZIP or GZIP archive. Only the first file in the archive is read.
No empty rows
Select to prevent sending empty rows to downstream steps.
Include filename in output?
Select to include the file name in the output stream.
Filename fieldname
Name of the output field that contains the file name.
Rownum in output?
Select to include the row number in the output stream.
Rownum fieldname and Rownum by file?
Rownum fieldname: name of the output field that contains the row number. Rownum by file?: select to restart the row numbering for each file.
Format
Line ending format: DOS, UNIX, or mixed.
Encoding and Limit
Encoding: text encoding to use. Leave blank to use the default system encoding. For Unicode, specify UTF-8 or UTF-16. Limit: maximum number of lines to read; 0 reads all lines.
Be lenient when parsing dates?
Select for lenient parsing (for example, Jan 32nd becomes Feb 1st). Clear for strict parsing.
The date format Locale
Locale used to parse dates written in full (for example, February 2nd, 2016).
Add filenames to result
Adds file names to the transformation’s result file list.
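To see how Separator, Enclosure, and Escape interact, it can help to parse a sample line outside PDI. The sketch below uses Python's csv module as a stand-in; it is not how PDI parses files, and the sample data and the parse_lines helper are hypothetical. It assumes a semicolon separator, a single quote enclosure, and a backslash escape, as in the Escape example above.

```python
import csv
import io

def parse_lines(text, separator=";", enclosure="'", escape="\\"):
    """Split text into rows of fields using the given separator, enclosure and escape."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=separator,
        quotechar=enclosure,
        escapechar=escape,
        doublequote=False,  # rely on the escape character, not doubled enclosures
    )
    return [row for row in reader]

# One header line, then one data line with an escaped quote and an enclosed separator.
sample = "id;title;tags\n1;'Not the nine o\\'clock news';'comedy;sketch'\n"
rows = parse_lines(sample)
header, data = rows[0], rows[1:]
print(header)
print(data)
```

The enclosed field comedy;sketch stays in one piece even though it contains the separator, which is the purpose of the Enclosure option.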
Error Handling tab

Use the Error Handling tab to specify how the step reacts to parsing errors.
Ignore errors?
Select to ignore errors during parsing.
Skip error lines?
Select to skip lines that contain errors. You can generate an extra file that contains the line numbers where errors occur.
Error count field name
Output field that contains the number of errors on the line.
Error fields field name
Output field that contains the field names on which an error occurred.
Error fields text field name
Output field that contains the parsing error descriptions.
Warnings file directory
Directory for warning files. File name format: <warning dir>/filename.<date_time>.<warning extension>.
Error files directory
Directory for error files. File name format: <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
Failing line numbers files directory
Directory for files listing failing line numbers. File name format: <errorline dir>/filename.<date_time>.<errorline extension>.
Filters tab

Use the Filters tab to specify lines you want to skip.
Filter string
String to search for.
Filter position
Position where the filter string must appear. 0 is the first position. Values below 0 search the entire line.
Stop on filter
Enter Y to stop processing the current file when the filter string is encountered.
Positive match
When enabled, only lines that match the filter string are passed on. Lines that match a negative filter are still discarded, because negative filters take precedence.
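The filter options can be read as a small decision procedure. The sketch below is one interpretation of the descriptions above, not PDI's implementation: the LineFilter class and filter_lines function are hypothetical, and the sketch assumes that a line matching a Stop on filter entry is itself dropped.

```python
from dataclasses import dataclass

@dataclass
class LineFilter:
    text: str               # Filter string
    position: int = -1      # Filter position; below 0 searches the entire line
    stop: bool = False      # Stop on filter
    positive: bool = False  # Positive match

    def matches(self, line: str) -> bool:
        if self.position < 0:
            return self.text in line
        return line.startswith(self.text, self.position)

def filter_lines(lines, filters):
    """Return the lines of one file that survive the filters (assumed semantics)."""
    negatives = [f for f in filters if not f.positive]
    positives = [f for f in filters if f.positive]
    kept = []
    for line in lines:
        if any(f.stop and f.matches(line) for f in filters):
            break  # stop processing the rest of this file
        if any(f.matches(line) for f in negatives):
            continue  # negative filters take precedence
        if positives and not any(f.matches(line) for f in positives):
            continue  # with positive filters, only matching lines pass
        kept.append(line)
    return kept

lines = ["# comment", "A;1", "B;2", "END", "C;3"]
filters = [LineFilter("#", position=0), LineFilter("END", position=0, stop=True)]
print(filter_lines(lines, filters))
```

In this example the comment line is skipped, and processing stops at END, so C;3 is never read.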
Fields tab
Use the Fields tab to specify the name and format of the fields being read.
Name
Field name.
Type
Field type, such as String, Date, or Number.
Format
Format pattern. See Number formats and Date formats.
Position
Position for fixed-length file types (0-based).
Length
For Number: total number of significant figures. For String: string length. For Date: printed output length (for example, 4 returns the year).
Precision
For Number: number of digits after the decimal point. Unused for other types.
Currency
Currency symbol used to interpret numbers such as $10,000.00 or €5.000,00.
Decimal
Decimal symbol (period . or comma ,).
Group
Grouping symbol (comma , or period .).
Null if
Value to treat as null.
Default
Default value when the file field is empty.
Trim type
Trim behavior: None, Left, Right, or Both.
Repeat
Repeat the last non-empty value when this value is empty (Y or N).
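The Trim type, Null if, Default, and Repeat options can be pictured as a per-field cleanup applied to each incoming value. The sketch below is an illustration under assumptions, not PDI's implementation: the clean_column function is hypothetical, and the order in which the options are applied (trim first, then Repeat before Default for empty values) is a guess.

```python
def clean_column(values, trim="both", null_if=None, default=None, repeat=False):
    """Apply trim, null-if, default and repeat handling to one column of strings."""
    trimmers = {
        "none": lambda s: s,
        "left": str.lstrip,
        "right": str.rstrip,
        "both": str.strip,
    }
    cleaned, last = [], None
    for raw in values:
        value = trimmers[trim](raw)
        if value == "":
            # Empty value: repeat the last non-empty value, else use the default.
            value = last if (repeat and last is not None) else default
        elif null_if is not None and value == null_if:
            value = None  # treated as null
        else:
            last = value  # remember the last non-empty value
        cleaned.append(value)
    return cleaned

column = ["  north ", "", "N/A", "south", ""]
print(clean_column(column, null_if="N/A", default="unknown", repeat=True))
print(clean_column(column, null_if="N/A", default="unknown", repeat=False))
```

With Repeat enabled, the empty values take the previous region name; with Repeat disabled, they take the default instead.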
For general guidance on field metadata, see Understanding PDI data types and field metadata.
Number formats
For further information on valid numeric formats, see the Number Formatting Table.
| Symbol | Location | Localized? | Meaning |
| --- | --- | --- | --- |
| 0 | Number | Yes | Digit. |
| # | Number | Yes | Digit; zero shows as absent. |
| . | Number | Yes | Decimal separator or monetary decimal separator. |
| - | Number | Yes | Minus sign. |
| , | Number | Yes | Grouping separator. |
| E | Number | Yes | Separates mantissa and exponent in scientific notation. |
| ; | Subpattern boundary | Yes | Separates positive and negative patterns. |
| % | Prefix or suffix | Yes | Multiply by 100 and show as a percentage. |
| ‰ | Prefix or suffix | Yes | Multiply by 1000 and show as per mille. |
| ¤ | Prefix or suffix | No | Currency sign. If doubled, replaced by the international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator. |
| ' | Prefix or suffix | No | Quotes special characters in a prefix or suffix. To create a single quote itself, use two in a row: # o''clock. |
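The Decimal, Group, and Currency settings on the Fields tab determine which characters in a value act as the decimal separator, the grouping separator, and the currency symbol. The sketch below shows the idea in Python; it is not PDI's parser, the parse_number helper is hypothetical, and the sample values are illustrative.

```python
from decimal import Decimal

def parse_number(text, decimal=".", group=",", currency=""):
    """Parse a formatted number using explicit decimal, grouping and currency symbols."""
    cleaned = text.strip()
    if currency:
        cleaned = cleaned.replace(currency, "")  # drop the currency symbol
    if group:
        cleaned = cleaned.replace(group, "")     # drop grouping separators
    cleaned = cleaned.replace(decimal, ".")      # normalize the decimal symbol
    return Decimal(cleaned.strip())

# US-style value: comma groups, period decimal.
print(parse_number("$10,000.00", decimal=".", group=",", currency="$"))
# European-style value: period groups, comma decimal.
print(parse_number("€5.000,00", decimal=",", group=".", currency="€"))
```

The same digits are read differently depending on the symbols chosen, which is why Decimal and Group must match the file's actual formatting.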
Scientific notation
In a pattern, the exponent character immediately followed by one or more digits indicates scientific notation.
Example: 0.###E0 formats 1234 as 1.234E3.
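The example above can be reproduced with a short Python function. This is only a sketch that mimics the single pattern 0.###E0 (one integer digit, up to three optional fraction digits, and an exponent with at least one digit); it is not a general pattern engine, the format_scientific helper is hypothetical, and rounding in edge cases may differ from PDI.

```python
def format_scientific(value, max_fraction_digits=3):
    """Format value like the pattern 0.###E0: optional fraction digits, plain exponent."""
    mantissa, exponent = f"{value:.{max_fraction_digits}E}".split("E")
    if "." in mantissa:
        # '#' digits are optional, so drop trailing zeros and a dangling point.
        mantissa = mantissa.rstrip("0").rstrip(".")
    return f"{mantissa}E{int(exponent)}"

print(format_scientific(1234))
print(format_scientific(0.00012))
```

The exponent is the power of ten that moves the decimal point back to its original position: 1.234E3 stands for 1.234 × 10^3, which is 1234.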
Date formats
For further information on valid date formats, see the Date Formatting Table.
| Letter | Date or time component | Presentation | Examples |
| --- | --- | --- | --- |
| G | Era designator | Text | AD |
| y | Year | Year | 1996 or 96 |
| M | Month in year | Month | July, Jul, or 07 |
| w | Week in year | Number | 27 |
| W | Week in month | Number | 2 |
| D | Day in year | Number | 189 |
| d | Day in month | Number | 10 |
| F | Day of week in month | Number | 2 |
| E | Day in week | Text | Tuesday or Tue |
| a | am/pm marker | Text | PM |
| H | Hour in day (0-23) | Number | n/a |
| k | Hour in day (1-24) | Number | n/a |
| K | Hour in am/pm (0-11) | Number | n/a |
| h | Hour in am/pm (1-12) | Number | n/a |
| m | Minute in hour | Number | n/a |
| s | Second in minute | Number | n/a |
| S | Millisecond | Number | n/a |
| z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00 |
| Z | Time zone | RFC 822 time zone | -0800 |
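A date format combines these letters into a pattern such as yyyy/MM/dd HH:mm:ss. To illustrate what a pattern means, the sketch below translates a small subset of the letters into Python strptime directives and parses a sample value. It is an illustration only: PDI interprets the pattern itself, the to_strptime and parse_date helpers are hypothetical, and only the six tokens in the mapping are handled.

```python
import re
from datetime import datetime

# Subset of pattern tokens and their closest strptime equivalents (illustrative).
PATTERN_TO_STRPTIME = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # month in year as a number
    "dd": "%d",    # day in month
    "HH": "%H",    # hour in day (0-23)
    "mm": "%M",    # minute in hour
    "ss": "%S",    # second in minute
}
_TOKEN = re.compile("|".join(PATTERN_TO_STRPTIME))

def to_strptime(pattern):
    """Translate the supported tokens in one pass; other characters are kept as-is."""
    return _TOKEN.sub(lambda match: PATTERN_TO_STRPTIME[match.group()], pattern)

def parse_date(text, pattern):
    """Parse text with a pattern written in the letters from the table above."""
    return datetime.strptime(text, to_strptime(pattern))

print(to_strptime("yyyy/MM/dd HH:mm:ss"))
print(parse_date("2016/02/02 13:45:10", "yyyy/MM/dd HH:mm:ss"))
```

Case matters in these patterns: MM is the month, while mm is the minute.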
Metadata injection support
All fields of this step support metadata injection except Hadoop Cluster. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.

