Processing structured, unstructured, and semi-structured files

You can use Data Catalog to profile a wide variety of file types, including structured, unstructured, and semi-structured files. These file-based assets share the same user interface and profiling workflow. During profiling, the system analyzes file content and metadata to extract insights such as field patterns, value distribution, and document structure.

The following table lists the major categories and commonly used file types that share this unified profiling interface. Data Catalog supports more file formats than those listed in the table. For a comprehensive list of supported file formats and compatibility details, contact Pentaho Support.

Category
File Types
Additional Information

Structured files

.csv, .tsv, .psv

Structured files with consistent field delimiters. You can configure header row detection and delimiter type during profiling.

Compressed files

.gz, .snappy, .deflate, .bz2, .lzo, .lz4

Unstructured documents

.pdf, .doc, .docx, .txt, .rtf

Profiling extracts document metadata and textual content. Includes string detection, summarization, and duplicate detection.

Semi-structured files

.parquet, .json, .avro, .orc

Self-describing files, including columnar formats such as .parquet and .orc. Profiling includes schema detection, field types, null values, and value frequency analysis.
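
As an illustration of the categories in the table, the following minimal sketch maps a file name to its category by extension. The `categorize` helper and the `CATEGORIES` mapping are hypothetical names introduced here, not part of Data Catalog, and the lookup only covers the extensions listed in the table; how Data Catalog itself detects file types is not described in this section.

```python
from pathlib import Path

# Category names and extensions copied from the table above.
# Data Catalog supports more formats than these.
CATEGORIES = {
    "Structured files": {".csv", ".tsv", ".psv"},
    "Compressed files": {".gz", ".snappy", ".deflate", ".bz2", ".lzo", ".lz4"},
    "Unstructured documents": {".pdf", ".doc", ".docx", ".txt", ".rtf"},
    "Semi-structured files": {".parquet", ".json", ".avro", ".orc"},
}


def categorize(filename: str) -> str:
    """Return the table category for a file name (hypothetical helper)."""
    # Only the last extension is checked, so "orders.csv.gz" counts as compressed.
    suffix = Path(filename).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return "Not listed"
```

For example, `categorize("orders.csv.gz")` looks only at the final `.gz` extension and therefore reports the file as compressed.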

Note: Structured (delimited) and semi-structured files are treated as unstructured files, but Data Discovery can profile them and produce structured outputs.

Perform the following steps to process structured, unstructured, and semi-structured files:

  1. In Data Canvas, select the structured, unstructured, or semi-structured resource you want to investigate.

    This can be a file or a folder. To detect duplicates, select all the files or folders you want to compare.

  2. Click Process.

    The Choose Process pane opens with Metadata Ingest, Data Discovery, and Data Identification options.

  3. In the Metadata Ingest card, click Start to begin the metadata ingestion.

    You can view the status of the Metadata Ingest process on the Manage Workers page.

    Note: If you have already scanned more than 75% of your data quota, a message appears when you start the scan. Even if you cannot scan new data, you can still run Data Discovery or Data Identification on data you have already scanned.

  4. To perform the data discovery, click the Data Discovery card.

    The Configure Process page opens with three tabs: Data Discovery, Document Processing, and Data Profiling. Configure the process by using the options available under these tabs.

    Note: When configuring data discovery, use the default settings where possible, because they are suitable for most situations.

  5. In the Data Discovery tab, configure the following options:

    1. Checksum Calculation

      Field
      Description

      Compute checksum of document content

      Calculates checksums for each file which are used to detect duplicates. After processing, any duplicate files are displayed on the Duplicates tab.

    2. Advanced Options

      Field
      Description

      Files Modified More Than Day(s) Ago

      Filters file processing by modification timestamp.

      Files Accessed More Than Day(s) Ago

      Filters file processing by access timestamp.

  6. Click the Document Processing tab, and configure the following options:

    1. Machine Learning Options

      Note: These options use Machine Learning and Large Language Models.

      Field
      Description

      Summarize Documents

      Generates a concise summary of unstructured files such as .docx, .pdf, and .rtf. The summary appears under the Document Summary section of the asset’s Summary tab. This option also performs sentiment analysis, which is shown under the Data Labels section.

      Address Detection

      Scans documents for U.S. postal addresses. When this option is selected, you must choose a relevant business term. If addresses are found, the selected business term is automatically tagged to the asset and displayed in the Business Terms panel.

    2. Document Metadata

      Field
      Description

      Extract document properties

      Collects additional document properties from the file, such as the owner, page count, number of paragraphs, and so on. It applies only to Office365 or PDF files.

    3. Content Scan for String Detection

      Field
      Description

      Detect presence of strings

      If a value from the applied dictionary exists in the file, Data Catalog applies the actions defined in the dictionary and returns true in the metadata store (mds).

      Determine presence and count of occurrences both

      If a value from the applied dictionary exists in the file, Data Catalog returns the aggregate count of the dictionary values found in the file in the metadata store and applies the actions defined in the dictionary.

    4. String Detection

      Note: During the string detection process, Data Catalog ignores the rules defined in the dictionaries.

      Field
      Description

      Add Dictionary

      Select and add available dictionaries to use in string detection and to apply actions specified in the dictionary.

      Add Patterns

      Select and add available patterns to use in string detection and to apply actions specified in the patterns.

    5. Advanced Options

      Field
      Description

      Include File Extensions

      Specify the document extensions to process, such as .pdf, .doc, .txt, and so on. Profiling is performed only for the specified extensions. Leave empty to use all supported extensions.

      Restrict Processing to Max File Size of

      Files larger than this size are skipped. For example, 100 MB.

      File Processing Threads

      Number of threads used for file processing per job. Keep this value low if you run many jobs.

      Persistence Threads

      Number of threads used for persistence writes per job. Keep this value low if you run many jobs.

  7. Click the Data Profiling tab for structured (delimited) files and configure:

    Field
    Description

    Extract samples

    Extracts sample data during profiling and displays it in the Summary tab.

    Treat First Row as Header (only for structured or delimited files)

    When you set this flag, the Data Discovery step treats the first row of the data as a header and uses its values as the column names in the profiled data. If you don't set the flag, the Data Discovery step assigns default names such as column-0, column-1, column-2, and so on.

    Skip Recent (days)

    Skips profiling for recently profiled tables. For example, if the days field is set to 7, any table profiled within the last 7 days is skipped.

    Include Patterns*

    Add global patterns to apply during profiling.

    Exclude Patterns*

    Add global patterns to exclude during profiling.

    Note: If files or folders match both include and exclude patterns, profiling excludes those files or folders.

    * For more information about patterns and limitations, see the Java documentation.

  8. Click Start Discovering.

    You can view the status of the Data Discovery process on the Manage Workers page.

  9. (Optional) To perform data identification on structured and semi-structured files, click the Data Identification card.

    Important: You must perform Data Discovery before proceeding with the Data Identification process. If Data Discovery has not been completed, Data Catalog marks Configure Process as Required. Expand Configure Process and complete the process.

  10. Click Select Methods, select the Dictionaries and Patterns, click Apply, and then click Start.

    You can view the status of the Data Identification process on the Manage Workers page.

  11. Go to Data Canvas and select the processed file to view its properties.

The selected structured, unstructured, or semi-structured files are processed, and the document properties are displayed in the Document Properties pane. Samples from structured and semi-structured files are available in the Sample Data pane, providing insights into data distribution and characteristics. You can also explore the file’s relationships and tags using the Galaxy View.
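
The column names shown for a delimited file depend on the Treat First Row as Header option in step 7. The following is a minimal sketch of that naming behavior, assuming Python's `csv` module as a stand-in for Data Catalog's own parser. The `read_delimited` function and its parameters are hypothetical; only the `column-0`, `column-1` default naming comes from the option description.

```python
import csv
import io


def read_delimited(text: str, delimiter: str = ",", first_row_is_header: bool = True):
    """Return (column_names, data_rows) for delimited text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if first_row_is_header:
        # The first row supplies the column names and is not treated as data.
        return rows[0], rows[1:]
    # Without a header, default names are generated and every row is data.
    names = [f"column-{i}" for i in range(len(rows[0]))]
    return names, rows
```

With the flag off, a file whose first line is `id,name` is read as three data rows under the names `column-0` and `column-1`, so a real header row would show up as a data value in the samples.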

The unstructured properties displayed vary according to the type of unstructured data selected.
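
The Compute checksum of document content option in step 5 detects duplicates by comparing file checksums. The following sketch illustrates the general technique only. The choice of SHA-256 and the function names `file_checksum` and `find_duplicates` are assumptions made for illustration; this section does not state which algorithm Data Catalog uses.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks (assumed algorithm)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder: Path) -> list[list[Path]]:
    """Group files under a folder whose content checksums are identical."""
    by_checksum = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            by_checksum[file_checksum(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    return [paths for paths in by_checksum.values() if len(paths) > 1]
```

Because the checksum is computed from file content, two files with different names or locations are still reported together when their bytes are identical.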
