
Processing structured, unstructured, and semi-structured files

You can use Data Catalog to profile a wide variety of file types, including structured, unstructured, and semi-structured files. These file-based assets share the same user interface and profiling workflow. During profiling, the system analyzes file content and metadata to extract insights such as field patterns, value distribution, and document structure.

The following table lists the major categories and commonly used file types. All of them share a unified profiling interface and results view.

Data Catalog supports more file formats than those listed here. For a comprehensive list of supported file formats and compatibility details, contact Pentaho Support.

| Category | File Types | Additional Information |
| --- | --- | --- |
| Structured files | .csv, .tsv, .psv | Structured files with consistent field delimiters. You can configure header row detection and delimiter type during profiling. |
| Compressed files | .gz, .snappy, .deflate, .bz2, .lzo, .lz4 | |
| Unstructured documents | .pdf, .doc, .docx, .txt, .rtf | Profiling extracts document metadata and textual content. Includes string detection, summarization, and duplicate detection. |
| Semi-structured files | .parquet, .json, .avro, .orc | Files with an embedded or inferable schema, including columnar formats such as .parquet and .orc. Profiling includes schema detection, field types, null values, and value frequency analysis. |
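The category table can be read as a simple extension lookup. The following Python sketch is illustrative only: the `CATEGORY_BY_EXTENSION` mapping and the `categorize` function are hypothetical names, not part of Data Catalog, and the mapping covers only the extensions listed in the table.

```python
from pathlib import PurePath
from typing import Optional

# Extensions copied from the table above. Data Catalog supports more
# formats than these, so unlisted extensions simply return None here.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured", ".tsv": "structured", ".psv": "structured",
    ".gz": "compressed", ".snappy": "compressed", ".deflate": "compressed",
    ".bz2": "compressed", ".lzo": "compressed", ".lz4": "compressed",
    ".pdf": "unstructured", ".doc": "unstructured", ".docx": "unstructured",
    ".txt": "unstructured", ".rtf": "unstructured",
    ".parquet": "semi-structured", ".json": "semi-structured",
    ".avro": "semi-structured", ".orc": "semi-structured",
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    # Only the last suffix is considered, so "data.csv.gz" counts as compressed.
    suffix = PurePath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix)
```

A lookup like this can help when deciding in advance which files in a folder are likely to produce structured samples and which will yield only document properties.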

Note: Structured (delimited) and semi-structured files are treated as unstructured files in this workflow, but Data Discovery can still profile them and produce structured outputs.

Perform the following steps to process structured, unstructured, and semi-structured files:

In Data Canvas, select the structured, unstructured, or semi-structured resources you want to investigate.
You can select a file or a folder. To detect duplicates, select all the files or folders you want to compare.
Click Process.
The Choose Process pane opens with Metadata Ingest, Data Discovery, and Data Identification options.
[Figure: Unstructured data processing options]
In the Metadata Ingest card, click Start to begin the metadata ingestion.
You can view the status of the Metadata Ingest process on the Manage Workers page.
Note: If you have already scanned more than 75% of your data quota, a message appears when you start the scan. Even if you cannot scan new data, you can still run Data Discovery or Data Identification on data you have already scanned.
To perform data discovery, click the Data Discovery card.
The Configure Process page opens with three tabs: Data Discovery, Document Processing, and Data Profiling. Configure the process by using the options available under these tabs.
Note: When configuring data discovery, use the default settings where possible, because they are suitable for most situations.
In the Data Discovery tab, configure the following options:
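1. Checksum Calculation
  | Field | Description |
  | --- | --- |
  | Compute checksum of document content | Calculates a checksum for each file. The checksums are used to detect duplicates. After processing, any duplicate files are displayed on the Duplicates tab. |
2. Advanced Options
  | Field | Description |
  | --- | --- |
  | Files Modified More Than Day(s) Ago | Filters file processing by modification timestamp: only files last modified more than the specified number of days ago are processed. |
  | Files Accessed More Than Day(s) Ago | Filters file processing by access timestamp: only files last accessed more than the specified number of days ago are processed. |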
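The Data Discovery options above can be pictured with a small Python sketch. It is a conceptual illustration, not Data Catalog's implementation: the source does not state which checksum algorithm is used, so SHA-256 is an assumption, and the function names `file_checksum`, `modified_more_than_days_ago`, and `find_duplicates` are hypothetical.

```python
import hashlib
import time
from collections import defaultdict
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Hash file content in chunks so large files are not loaded into memory.

    SHA-256 is an assumption; the product's actual algorithm is not documented here.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def modified_more_than_days_ago(path, days, now=None):
    """Return True if the file was last modified more than `days` days ago."""
    now = time.time() if now is None else now
    return (now - Path(path).stat().st_mtime) > days * 86400


def find_duplicates(paths):
    """Group paths by content checksum and keep groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_checksum(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Because duplicates are found by comparing checksums across the selected files, every file you want compared must be part of the same selection, which is why the first step asks you to select all candidate files or folders.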
Click the Document Processing tab, and configure the following options:
1. Machine Learning Options
  Note: These options use Machine Learning and Large Language Models.
  | Field | Description |
  | --- | --- |
  | Summarize Documents | Generates a concise summary of unstructured files such as .docx, .pdf, and .rtf. The summary appears under the Document Summary section of the asset's Summary tab. This option also performs sentiment analysis, which is shown under the Data Labels section. For more information, see Summarize documents. |
  | Address Detection | Scans documents for U.S. postal addresses. When this option is selected, you must choose a relevant business term. If addresses are found, the selected business term is automatically tagged to the asset and displayed in the Business Terms panel. For more information, see Address Detection. |
  | Data Classification | Classifies unstructured documents based on their semantic content using machine learning. When this option is selected, you provide one or more business terms that represent the classifications you want to identify. Data Catalog analyzes the document content and automatically assigns the matching business terms to documents where a semantic match is found. The assigned classifications are displayed in the Business Terms panel of the asset. For more information, see Document classification. |
2. Document Metadata
  | Field | Description |
  | --- | --- |
  | Extract document properties | Collects additional document properties from the file, such as the owner, page count, and number of paragraphs. Applies only to Office 365 and PDF files. |
3. Content Scan for String Detection
  | Field | Description |
  | --- | --- |
  | Detect presence of strings | If a value from the applied dictionary exists in the file, applies the actions defined in the dictionary and records true in the metadata store (mds). |
  | Determine presence and count of occurrences both | If a value from the applied dictionary exists in the file, records the aggregate count of dictionary values found in the file in the metadata store and applies the actions defined in the dictionary. |
4. String Detection
  Note: String detection ignores the rules defined in the dictionaries.
  | Field | Description |
  | --- | --- |
  | Add Dictionary | Select and add available dictionaries to use in string detection and to apply the actions specified in each dictionary. |
  | Add Patterns | Select and add available patterns to use in string detection and to apply the actions specified in each pattern. |
5. Advanced Options
  | Field | Description |
  | --- | --- |
  | Include File Extensions | Specify the document extensions to process, such as .pdf, .doc, or .txt. Profiling is performed only for the specified extensions. Leave empty to use all supported extensions. |
  | Restrict Processing to Max File Size of | Files larger than this size are skipped. For example, 100 MB. |
  | File Processing Threads | Number of threads used for file processing per job. Keep this value low if you run many jobs. |
  | Persistence Threads | Number of threads used for persistence (writing results) per job. Keep this value low if you run many jobs. |
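The difference between the two content scan modes described above, presence only versus presence plus an aggregate count, can be sketched in Python. This is a conceptual illustration and not Data Catalog's matching logic: the function names are hypothetical, and case-insensitive substring matching is an assumption.

```python
import re


def detect_presence(text, dictionary_values):
    """Presence mode: True if any dictionary value occurs in the text."""
    lowered = text.lower()
    return any(value.lower() in lowered for value in dictionary_values)


def count_occurrences(text, dictionary_values):
    """Count mode: aggregate number of matches across all dictionary values."""
    total = 0
    for value in dictionary_values:
        # re.escape treats the dictionary value as a literal string.
        pattern = re.compile(re.escape(value), re.IGNORECASE)
        total += len(pattern.findall(text))
    return total
```

Presence mode can stop at the first hit, while count mode has to scan the whole document for every dictionary value, so it is reasonable to expect the count option to cost more on large files.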
For structured (delimited) files, click the Data Profiling tab and configure the following options:
| Field | Description |
| --- | --- |
| Extract samples | Extracts a small random sample of data (typically about 200 rows) for preview and validation during profiling and displays it in the Summary tab. It is generally used internally. |
| Treat First Row as Header (only for structured or delimited files) | When you set this flag, the Data Discovery step treats the first row of the data as a header and uses its values as the column names in the profiled data. If you don't set the flag, the Data Discovery step assigns default names such as column-0, column-1, and column-2 to the profiled data. |
| Skip Recent (days) | Skips profiling for recently profiled tables. For example, if the days field is set to 7, any table profiled within the last 7 days is skipped. |
| Include Patterns* | Add global patterns to apply during profiling. |
| Exclude Patterns* | Add global patterns to exclude during profiling. Note: If files or folders match both include and exclude patterns, profiling excludes them. |
* For more information about patterns and limitations, see the Java documentation.
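Two of the Data Profiling options above, header handling and include/exclude precedence, can be illustrated with a short Python sketch. The function names are hypothetical. The sketch uses Python regular expressions, whereas the product refers to Java patterns, so the syntax may differ; treating an empty include list as "include everything" is also an assumption.

```python
import csv
import io
import re


def read_delimited(text, delimiter=",", first_row_is_header=True):
    """Return (column_names, data_rows) for delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if first_row_is_header:
        return rows[0], rows[1:]
    # Without a header, fall back to default names: column-0, column-1, ...
    names = [f"column-{i}" for i in range(len(rows[0]))]
    return names, rows


def should_profile(path, include_patterns, exclude_patterns):
    """Apply include/exclude patterns; a match on both means excluded."""
    if any(re.search(p, path) for p in exclude_patterns):
        return False
    if not include_patterns:
        return True  # assumption: no include patterns means no restriction
    return any(re.search(p, path) for p in include_patterns)
```

For example, `read_delimited("id|name\n1|Ada\n", delimiter="|", first_row_is_header=False)` names the columns column-0 and column-1 and keeps the `id|name` row as data.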
Click Start Discovering.
You can view the status of the Data Discovery process on the Manage Workers page.
(Optional) To perform data identification on structured and semi-structured files, click the Data Identification card.
Important: You must perform Data Discovery before proceeding with the Data Identification process. If Data Discovery has not been completed, Data Catalog marks Configure Process as Required. Expand Configure Process and complete the configuration.
Click Select Methods, select the Dictionaries and Patterns, click Apply, and then click Start.
You can view the status of the Data Identification process on the Manage Workers page.
Go to Data Canvas and select the processed file to view its properties.

The selected structured, unstructured, or semi-structured files are processed, and the document properties are displayed in the Document Properties pane. Samples from structured and semi-structured files are available in the Sample Data pane, providing insights into data distribution and characteristics. You can also explore the file's relationships and tags using the Galaxy View.

The displayed unstructured properties vary based on the selected unstructured data type.


