Discover metadata from a text file
Use the Discover metadata from a text file step to determine the structure and metadata of delimited text files for which you have limited knowledge of the structure.
Enter a list of possible delimiters, enclosures, and escape characters to determine the configuration that produces the most consistent match of data in the file. Consistency is determined by the count of fields in the rows. For example, when testing a semicolon as the delimiter, the count of fields in the first three rows is 3, 8, and 4, which means the field count is inconsistent; therefore, the semicolon is not the correct delimiter. When testing a comma delimiter on those rows produces a field count of 6, 6, and 6, this is considered a consistent delimiter and is an acceptable candidate for use.
You can use this step to generate data to send to the ETL Metadata Injection step. The ETL Metadata Injection step can then set up the metadata in the Text File Input step for use in your transformations or jobs.
The Discover metadata from a text file step also determines the field names from the header row and predicts the data types for the list of fields.
This step opens and reads the file four times to gather all required information. On very large files, scanning may take a long time if you do not set a row limit with Limit scanned rows.
Step name
Step name: Specify the unique name of the Discover metadata from a text file step on the canvas. You can customize the name or leave it as the default.
Options
The Discover metadata from a text file step includes four tabs: Input, Delimiter candidates, Enclosure candidates, and Escape candidates.
Input tab

Use the following options in the Input tab to specify details for the input text file:
File name
Select the delimited file you want to evaluate. The file location can be any location supported by a VFS connection. See Connecting to Virtual File Systems.
Trim fields
Select to remove leading and trailing spaces from fields.
Header column name detection strategy
Select the strategy to use to determine the column names in the file. After the step identifies the header row, it ignores any rows above the header row and counts the following rows as data.
First possible line containing only strings: Selects the first line that contains only string values as the header row. For example, if the data has five fields per row and you set Maximum number of header rows to 6, the step searches the first six rows in the file for a row containing five string fields. The first row encountered with five string fields is selected as the header.
First possible line containing any data type: Selects the first line that contains a consistent number of fields as the header row. For example, if the file rows contain five fields, the first line containing five fields is selected as the header row within Maximum number of header rows, regardless of the data types in the fields.
Last possible line containing only strings: Selects the last line that contains only string values as the header row.
Last possible line containing any data type: Selects the last line that contains a consistent number of fields as the header row. | | Maximum number of header rows | Enter the maximum number of rows that can be a header. If the file does not have a header row, set this to
0. Only one row can be a header. | | Maximum number of footer rows | Enter the maximum number of rows that can be a footer. If the file does not have a footer row, set this to0.
Note: The number of footer rows can only be determined if the entire file is scanned. | | Fallback charset | Select the character set of the file. If the step cannot determine a character set for the file, it defaults to ISO-8859-1. | | Limit scanned rows | Enter the number of rows to scan in the file before determining the valid set of delimiters and enclosures used in the file. To scan the entire file, enter 0. |
Delimiter candidates tab

Enter the delimiter candidates to use in the file scan in the Delimiter candidates field. The delimiter candidates can be one or more characters.
The step tests each candidate and all combinations of the candidates to find the most accurate match. For example, you may receive a file from two sources where the first source uses a comma delimiter and the second source uses a semicolon delimiter. The step can determine the data structure of both files.
Enclosure candidates tab

You can enter characters to use for enclosures on the Enclosure candidates tab. The available options are listed in the following table:
Enclosure character required
Select to require that all fields must be enclosed with enclosure characters. Clear to make enclosure characters optional.
Ignore enclosure errors
Select to ignore enclosure errors when a row is parsed that contains a different number of fields. Clear to generate enclosure errors, which stops the step from continuing.
Enclosure candidates
Enter the enclosure characters to use for the file scanning. Enclosure candidates can be one or more characters.
Escape candidates tab

Use the Escape candidates tab to require and define escape characters. The available options are listed in the following table:
Escape character required
Select to require an escape character. Clear to have the step ignore escape characters.
Escape candidates
Enter a character to use as an escape character. Escape candidates can be one or more characters.
Delimiter and data type detection rules
Because there are many different variations of delimited files, this step might not detect the structure of every type of delimited file. The following rules are used to evaluate fields:
The step uses a left outer Cartesian join of the delimiter, enclosure, and escape candidate options.
The step performs inner joins when Enclosure character required or Escape character required is selected. This means a null enclosure or a null escape is not allowed.
The delimiter character cannot be the same as the enclosure or escape character. For example, if you use a double quote (
") for both enclosure and escape, the step ignores that candidate combination.
If the enclosure and escape characters are the same, the delimiter cannot be escaped. Only the enclosure character can be escaped.
If any enclosure errors are found (such as an unclosed enclosure or an unescaped enclosure character), the enclosure is not considered valid. If you expect enclosure errors, select Ignore enclosure errors.
A header row is any row within Maximum number of header rows that has a number of fields consistent with the number of fields in the data. The row after the last inconsistent row may also be considered a header row if:
Header row strategy is First possible line containing only strings and the first row that contains only string values is the next row after the last inconsistent row.
Header row strategy is First possible line containing any data type and there are no rows with a consistent number of fields before the last inconsistent row.
Header row strategy is Last possible line containing only strings and the next row after the last inconsistent row is all strings.
Header row strategy is Last possible line containing any data type.
A footer row is the first row with an inconsistent number of fields and any following rows. When Limit scanned rows prevents the entire file from being scanned, the file is not evaluated for footer rows.
If multiple delimiters, enclosures, or escape characters appear to match the file, the step cannot determine the format.
Multiple-character enclosures or escapes may result in incorrect data type results.
Ignoring enclosure errors may result in incorrect data type results.
The field length is determined by the length of the longest field detected during the scan. If you set Limit scanned rows, the step only checks that number of rows.
Examples
Your Pentaho distribution includes several sample transformations and datasets in the design-tools/data-integration/samples/transformations/discover-metadata-from-textfile directory.
The following code is a portion of the Sample1.txt file found in the directory:
When the step runs, the file is scanned to determine a consistent number of fields using the tab character, then the semicolon, then the comma (default). When Header column name detection strategy is set to First possible line containing only strings, the step identifies the first row as the header row. The following table shows the column names and data types.
policylD2
Integer
county
String
eq_site_limit
BigNumber
eq_site_deductible
Integer
point_longitude
BigNumber
If any of the fields in the first row are numbers or dates, the row is treated as data, which means there is no header row in this example.
Data lineage
This step includes a data lineage analyzer. See Data lineage.
Metadata injection support
All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.
Last updated
Was this helpful?

