Processing unstructured data
For unstructured data, the data scanning quota determined by your license is the sum of the capacity of each file and object in the data source or file system.
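As a rough illustration of how that sum works, the sketch below totals the size of every file under a folder and compares it with a quota. The function names and the local-folder layout are hypothetical, chosen only for this example; Data Catalog performs its own accounting, and the 75% threshold comes from the note later in this procedure.

```python
import os

def scanned_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (hypothetical helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

def quota_used_fraction(used_bytes: int, quota_bytes: int) -> float:
    """Return the share of the licensed quota already consumed."""
    return used_bytes / quota_bytes

def over_warning_threshold(used_bytes: int, quota_bytes: int) -> bool:
    """True when more than 75% of the quota has been scanned."""
    return quota_used_fraction(used_bytes, quota_bytes) > 0.75
```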
Perform the following steps to process unstructured data and delimited files:
Select the unstructured resource you want to investigate in Data Canvas.
This can be a file or a folder. To detect duplicates, select the files or folders you want to check.
Click Process.
The Choose Process pane opens with Metadata Ingest, Data Discovery, and Data Identification options.
In the Metadata Ingest card, click Start to begin the metadata ingestion.
You can view the status of the Metadata Ingest process on the Manage Workers page.
Note: If you have already scanned more than 75% of your data quota, a message appears when you start the scan. Even if you cannot scan new data, you can still run Data Discovery or Data Identification on data you have already scanned.
To perform the data discovery, click the Data Discovery card.
The Data Discovery page opens with the following options to configure data and scan the content:
Note: When configuring data discovery, it is recommended to use the default settings as they are suitable for most situations.
Checksum Calculation
Compute checksum of document content
Calculates a checksum for each file, which is used to detect duplicates. After processing, any duplicate files are displayed on the Duplicates tab.
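The general idea of checksum-based duplicate detection can be sketched as follows. This is a minimal illustration, not the product's implementation: the choice of SHA-256 and the function names are assumptions, since the documentation does not state which checksum algorithm Data Discovery uses.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_checksum(path, chunk_size=1 << 20):
    """Hash a file's content in chunks (SHA-256 is an assumed algorithm)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths):
    """Group files by checksum and return only groups with 2+ members."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_checksum(path)].append(Path(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Files with identical content produce the same checksum regardless of name or location, which is why all candidate files or folders need to be selected before processing.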
Machine Learning Options (These options use Machine Learning and Large Language Models)
Summarize Documents
Generates a concise summary of unstructured files such as .docx, .pdf, .rtx, and more. The summary appears under the Document Summary section of the asset’s Summary tab. This option also performs sentiment analysis, which is shown under the Data Labels section.
Address Detection
Scans documents for U.S. postal addresses. When this option is selected, you must choose a relevant business term. If addresses are found, the selected business term is automatically tagged to the asset and displayed in the Business Terms panel.
Document Metadata
Extract document properties
Collects additional document properties from the file, such as the owner, page count, and number of paragraphs. It applies only to Office365 or PDF files.
Content Scan for String Detection
Detect if the string exists
Based on the applied dictionary, if the dictionary value exists in the file, it applies the actions defined in the dictionary and returns true in the metadata store (mds).
Detect the string count
Based on the applied dictionary, if the dictionary value exists in the file, it returns the aggregate count of the dictionary values within the file in the metadata store and applies the actions defined in the dictionary.
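The difference between the two options can be pictured with a short sketch. It assumes case-sensitive, non-overlapping substring matching and a hypothetical `detect_strings` helper; the actual matching rules and the way results are written to the metadata store are not described here.

```python
def detect_strings(text, dictionary_values):
    """Return an exists flag and an aggregate count for dictionary values.

    Matching here is plain, case-sensitive substring counting, which is
    an assumption made for illustration only.
    """
    counts = {
        value: text.count(value)
        for value in dictionary_values
        if value  # skip empty entries, which would match everywhere
    }
    total = sum(counts.values())
    return {"exists": total > 0, "count": total}
```

"Detect if the string exists" corresponds to the `exists` flag, while "Detect the string count" corresponds to the aggregate `count` across all dictionary values.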
String Detection
Add Dictionary
Select and add available dictionaries to use in string detection and to apply the actions specified in each dictionary. Note: The string detection process ignores the rules defined in the dictionaries.
Data Profiling
Treat First Row as Header (only for delimited files)
When you set this flag, the Data Discovery step treats the first row of the data as a header and uses its values as the column names in the profiled data. If you don't set the flag, the Data Discovery step assigns default names such as column-0, column-1, column-2, and so on to the profiled data.
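The effect of the flag on column naming can be sketched as below. The `column_names` helper and the comma delimiter are assumptions for illustration; only the default naming scheme (column-0, column-1, and so on) comes from the description above.

```python
import csv
import io

def column_names(delimited_text, first_row_is_header, delimiter=","):
    """Derive column names from the first row or from default names."""
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter)
    first_row = next(reader, [])
    if first_row_is_header:
        return first_row
    # Without a header, fall back to positional names.
    return [f"column-{index}" for index in range(len(first_row))]
```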
Advanced Options
Files Modified More Than Day(s) Ago
Filters file processing by modification timestamp.
Files Accessed More Than Day(s) Ago
Filters file processing by access timestamp.
Include File Extensions
Specify the document extensions, such as .pdf, .doc, .txt, and so on. Profiling is performed only for the specified extensions. Leave empty to use all supported extensions.
Restrict Processing to Max File Size of
Files larger than this size are skipped. For example, 100 MB.
File Processing Threads
Number of threads used to process files per job. Keep this value low if you run many jobs.
Persistence Threads
Number of persistence writing threads per job. Keep this value low if you run many jobs.
Include Patterns*
Specifies global patterns to apply during profiling.
Exclude Patterns*
Specifies global patterns to exclude during profiling. Note: If files or folders match both include and exclude patterns, profiling excludes them.
* For more information about patterns and limitations, see Java documentation.
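How the advanced filters might combine can be sketched as a single predicate. Everything here is an approximation under stated assumptions: the `should_process` function and its parameter names are hypothetical, patterns are matched against the file name using Python's `fnmatch` rather than the Java pattern syntax the product refers to, and the order of the checks is a guess. The one rule taken from the text is that an exclude match wins over an include match.

```python
import fnmatch
import os
import time

SECONDS_PER_DAY = 86400

def should_process(path, *, modified_more_than_days=None, extensions=(),
                   max_size_bytes=None, include_patterns=(),
                   exclude_patterns=(), now=None):
    """Decide whether a file passes the configured filters (illustrative)."""
    now = time.time() if now is None else now
    info = os.stat(path)
    if modified_more_than_days is not None:
        age_days = (now - info.st_mtime) / SECONDS_PER_DAY
        if age_days <= modified_more_than_days:
            return False  # modified too recently
    if extensions:
        suffix = os.path.splitext(path)[1].lower().lstrip(".")
        allowed = {ext.lower().lstrip(".") for ext in extensions}
        if suffix not in allowed:
            return False
    if max_size_bytes is not None and info.st_size > max_size_bytes:
        return False  # larger than the size limit
    name = os.path.basename(path)
    # Exclude patterns take precedence over include patterns.
    if any(fnmatch.fnmatch(name, p) for p in exclude_patterns):
        return False
    if include_patterns and not any(
        fnmatch.fnmatch(name, p) for p in include_patterns
    ):
        return False
    return True
```

An empty `extensions` or `include_patterns` value means no restriction, mirroring the "leave empty to use all supported extensions" behavior described for Include File Extensions.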
Click Start.
You can view the status of the Data Discovery process on the Manage Workers page.
(Optional) To perform data identification on delimited files, click the Data Identification card.
Important: You must perform Data Discovery before proceeding with the Data Identification process. If the Data Discovery process has not been completed, Data Catalog highlights it as Required. You can start the Data Discovery process from the Data Identification card by clicking Start.
Click Select Methods, select the Dictionaries and Patterns, click Apply, and then click Start.
You can view the status of the Data Identification process on the Manage Workers page.
Go to Data Canvas and select the processed file to view its properties.
The unstructured data is processed, and the document properties are displayed in the Document Properties pane.
Note: The unstructured properties displayed vary according to the type of unstructured data selected.