Processing data

You can extract meaningful insights and make effective use of your data with Data Catalog processing. The main stages in processing data are:

  1. Metadata Ingest

  2. Data Profiling (for structured data) and Data Discovery (for unstructured data)

  3. Data Identification for structured data, including delimited files

  4. Usage Statistics for Microsoft SQL Server, Oracle, and Snowflake databases

  5. PII Detection

  6. Calculate Trust Score

Note: Your Data Catalog license determines the number of data sources you can add, and the amount of data you can scan. Databases do not have a data scan quota.

Metadata Ingest

The Metadata Ingest step updates Data Catalog to reflect current metadata changes. The Metadata Ingest step scans the data source for new or modified files since the last run, updating the existing metadata. In addition, it removes metadata for deleted files, ensuring Data Catalog represents the data source accurately.
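For illustration only, the following Python sketch shows the general idea of an incremental metadata scan: comparing the files currently in a source against the metadata recorded by the previous run to find new, modified, and deleted entries. The source path and metadata structure are assumptions, not Data Catalog's implementation.

    # Illustrative only: the general shape of an incremental metadata scan,
    # not Data Catalog's Metadata Ingest implementation.
    from pathlib import Path

    def scan_source(root: Path) -> dict:
        """Record the current path -> last-modified time for every file in the source."""
        return {str(p): p.stat().st_mtime for p in root.rglob("*") if p.is_file()}

    def diff_metadata(previous: dict, current: dict):
        """Classify files as new, modified, or deleted since the last run."""
        new = [p for p in current if p not in previous]
        modified = [p for p in current if p in previous and current[p] != previous[p]]
        deleted = [p for p in previous if p not in current]
        return new, modified, deleted

    previous_run = {}                                  # metadata stored by the last run (empty on a first run)
    current_run = scan_source(Path("/data/source"))    # hypothetical data source path
    new, modified, deleted = diff_metadata(previous_run, current_run)
    print(f"Ingest: {len(new)} new, {len(modified)} modified, {len(deleted)} removed")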

Note: If you are close to or have exceeded the quota of data you can scan with your license, you see a message in the upper corner of the screen when you try to start a scan. If you have exceeded the amount of data you can scan, you are unable to start a scan.

Data Profiling and Data Discovery

The Data Profiling and Data Discovery steps analyze structured and unstructured data, respectively.

Data Profiling

In the Data Profiling process, Data Catalog examines structured data within JDBC data sources and gathers statistics about the data. It profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics, data statistics, and data patterns.

Note: When configuring data profiling, it is best practice to use the default settings, which are suitable for most situations. With the default settings, data profiling is limited to 500,000 rows.
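For illustration only, the following Python sketch computes the kinds of field-level statistics described above (completeness, distinct counts, ranges, and a crude value pattern) over a capped sample of rows, using pandas. The connection string, table name, and metric names are assumptions rather than Data Catalog's profiling algorithm.

    # Illustrative only: simple field-level profiling with pandas,
    # not Data Catalog's internal profiling algorithm.
    import pandas as pd
    from sqlalchemy import create_engine

    ROW_LIMIT = 500_000   # mirrors the default profiling row limit noted above

    engine = create_engine("postgresql://user:password@host:5432/sales")             # placeholder connection
    df = pd.read_sql_query(f"SELECT * FROM customers LIMIT {ROW_LIMIT}", engine)     # hypothetical table

    profile = {}
    for column in df.columns:
        series = df[column]
        is_numeric = series.dtype.kind in "biufM"
        profile[column] = {
            "non_null_pct": round(100 * series.notna().mean(), 2),   # completeness
            "distinct_count": int(series.nunique()),                 # cardinality
            "min": series.min() if is_numeric else None,
            "max": series.max() if is_numeric else None,
            # Crude pattern: replace every digit with 9 and take the most common shape.
            "top_pattern": series.dropna().astype(str).str.replace(r"\d", "9", regex=True).mode().head(1).tolist(),
        }

    print(profile)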

Data Discovery

In the Data Discovery process, Data Catalog examines unstructured data by scanning file contents to compile data statistics, which involves the following steps:

  • Calculating checksums to identify duplicates, if the Compute checksum of document contents checkbox is selected.

  • Extracting document properties from Office365 and PDF files.

  • Using dictionaries to scan documents for specific strings and keywords, triggering predefined actions.

  • Profiling data within the cluster to ascertain detailed attributes, including quality metrics, statistics, and patterns for delimited files.

  • Extracting and classifying text from scanned documents and image files using Optical Character Recognition (OCR). When the OCR option is configured, during Data Discovery and Document Processing, Data Catalog uses the configured OCR engine (Tesseract or EasyOCR) to extract text from image-based content. The extracted text is then scanned against predefined data patterns, enabling users to identify sensitive information, apply tags, and associate matched values with relevant business glossary terms. For more information, see the PDC OCR feature walkthrough.

These processes ensure a thorough understanding and assessment of both structured and unstructured data, setting a solid foundation for subsequent analysis.
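For illustration only, the following Python sketch shows two of the steps above: duplicate detection through content checksums, and OCR text extraction with Tesseract. The folder, file names, and the choice of hashlib and pytesseract are assumptions and do not represent Data Catalog's implementation.

    # Illustrative only: checksum-based duplicate detection and OCR extraction,
    # not Data Catalog's Data Discovery implementation.
    import hashlib
    from pathlib import Path

    import pytesseract        # assumes the Tesseract OCR engine is installed locally
    from PIL import Image

    def content_checksum(path: Path) -> str:
        """Hash file contents so identical documents produce identical checksums."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def find_duplicates(paths):
        """Group files that share a checksum, that is, likely duplicates."""
        groups = {}
        for path in paths:
            groups.setdefault(content_checksum(path), []).append(path)
        return {checksum: files for checksum, files in groups.items() if len(files) > 1}

    def ocr_text(image_path: Path) -> str:
        """Extract text from an image-based document with Tesseract."""
        return pytesseract.image_to_string(Image.open(image_path))

    documents = [p for p in Path("documents").rglob("*") if p.is_file()]   # hypothetical folder
    print(find_duplicates(documents))
    print(ocr_text(Path("documents/scan_001.png")))                        # hypothetical scanned page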

Data Identification

The Data Identification process helps to manage your structured data, including delimited files. It involves tagging data to make it easier to search, retrieve, and analyze. By associating dictionaries and data patterns with tables and columns, you can ensure that data is appropriately categorized and easily accessed when needed.

CAUTION: You must run Data Profiling (for structured data) or Data Discovery (for unstructured data) before proceeding with any Data Identification activities.

Usage Statistics

When processing supported databases, Data Catalog (PDC) provides an additional feature, Usage Statistics, which captures usage metadata, such as the number of times an entity is read, written to, or altered, and stores it in the Business Intelligence Database (BIDB).

Usage Statistics provides clear insights into data consumption, highlighting the most frequently accessed entities. This supports resource optimization, strengthens governance through usage audit trails, and enables impact analysis by showing which entities are affected by changes in data flow.

Note: The Usage Statistics process is only available for the Microsoft SQL, Oracle, and Snowflake databases.

  • Microsoft SQL Server and Oracle: The auditing feature must be enabled in the database to capture and save usage statistics. For more information, refer to the official documentation for Microsoft SQL Server and Oracle.

  • Snowflake: Usage statistics are available without additional configuration.

When you run the Usage Statistics process, the Entity Usage Worker job retrieves usage metrics from the audit database, including the number of times an entity is read, written to, and altered, and stores them under the Entity Usage Statistic View within the BIDB. You can use this repository to analyze and visualize the data with third-party BI tools. For more information, see Business Intelligence Database.
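As a sketch of how a third-party BI tool might read these metrics, the example below queries the BIDB with pandas and SQLAlchemy. The connection string, view name, and column names are placeholders; refer to the Business Intelligence Database documentation for the actual schema.

    # Illustrative only: reading usage metrics from the BIDB for analysis.
    # The connection string, view name, and column names are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://bi_user:secret@bidb-host:5432/bidb")   # placeholder

    query = """
        SELECT entity_name, read_count, write_count, alter_count
        FROM entity_usage_statistic      -- assumed physical name of the Entity Usage Statistic View
        ORDER BY read_count DESC
        LIMIT 20
    """
    most_read = pd.read_sql_query(query, engine)
    print(most_read)   # the 20 most frequently read entities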

You can also view certain usage-related properties in the Properties panel of the Summary tab in Data Canvas. The properties displayed may vary depending on the selected data asset.

PII Detection

The PII Detection feature in Data Catalog uses Machine Learning (ML) and Large Language Models (LLMs) to analyze data in JDBC tables and identify Personally Identifiable Information (PII). The feature is trained specifically for Korean and Japanese datasets and automatically detects and classifies sensitive data, such as names, addresses, and ID numbers, helping you streamline compliance with privacy regulations. To learn more, see PII Detection.

Note: This feature currently supports only JDBC data sources with Korean and Japanese content.

When you start PII Detection, Data Catalog scans the selected JDBC table for column names that contain PII entities. After the process completes, if PII data is identified:

  • A new glossary titled ML_PII is automatically created (if not already present). If the ML_PII glossary already exists, newly identified PII terms are added to it.

  • Detected PII entities are tagged with relevant business terms from the ML_PII glossary.

These tags appear in the Business Terms panel of the respective columns.
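Data Catalog performs this detection with its trained ML and LLM models. Purely to illustrate the flow of detecting PII and collecting terms for tagging, the sketch below uses simple regular-expression rules instead; the column samples, patterns, and glossary structure are assumptions.

    # Illustrative only: a rule-based stand-in for PII detection and ML_PII tagging.
    # Data Catalog itself uses trained ML/LLM models, not these regexes.
    import re

    PII_RULES = {
        "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
        "phone": re.compile(r"\+?\d[\d\- ]{7,}\d"),
    }

    def detect_pii(sample_values):
        """Return the PII categories whose pattern matches any sample value."""
        hits = set()
        for value in sample_values:
            for term, pattern in PII_RULES.items():
                if pattern.search(str(value)):
                    hits.add(term)
        return hits

    # Hypothetical column samples pulled from a JDBC table.
    columns = {
        "contact_email": ["alice@example.com", "bob@example.org"],
        "order_total": ["19.99", "42.00"],
    }

    ml_pii_terms = {}   # stands in for tags drawn from the ML_PII glossary
    for name, samples in columns.items():
        matches = detect_pii(samples)
        if matches:
            ml_pii_terms[name] = sorted(matches)

    print(ml_pii_terms)   # columns that would be tagged with PII business terms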

Calculate Trust Score

The Calculate Trust Score feature in Data Catalog lets users compute and monitor the quality and reliability of data assets. Users can initiate or refresh the score calculation manually or programmatically through Data Canvas or the API, ensuring up-to-date trust information for decision-making. The Trust Score calculation considers the following parameters; a minimal sketch of how such a composite score might be combined appears after the list:

  • Data Quality (Completeness, Accuracy, Validity, Uniqueness, Consistency)

  • User Ratings (1–5 stars)

  • Data Lineage (Verified or Not)

  • Glossary Term (assigned or not)
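The exact formula and weights are not documented here; purely for illustration, the sketch below shows one way a composite score could be combined from these four inputs, with arbitrary weights.

    # Illustrative only: one way to combine the listed inputs into a single score.
    # Data Catalog's actual Trust Score formula and weights are not shown here.

    def trust_score(data_quality, user_rating, lineage_verified, has_glossary_term):
        """data_quality: 0-100, user_rating: 1-5 stars, the last two are booleans."""
        quality_part = data_quality                    # already on a 0-100 scale
        rating_part = (user_rating - 1) / 4 * 100      # map 1-5 stars onto 0-100
        lineage_part = 100 if lineage_verified else 0
        glossary_part = 100 if has_glossary_term else 0

        weights = (0.4, 0.3, 0.2, 0.1)                 # arbitrary illustrative weights
        parts = (quality_part, rating_part, lineage_part, glossary_part)
        return round(sum(w * p for w, p in zip(weights, parts)), 1)

    print(trust_score(data_quality=85, user_rating=4, lineage_verified=True, has_glossary_term=False))   # 76.5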

Note: This feature is currently available only for Tables and Files and is not available in public APIs.

To initiate the Calculate Trust Score process:

  1. Select the table(s) or file(s).

  2. Navigate to the Actions menu and choose Process. The Choose Process page appears.

  3. On the Calculate Trust Score card, click Start. The Calculate Trust Score process starts and appears in the Workers page.

After the process completes, the Trust Score calculation result appears in the Key Metrics panel for the selected entity.

AI-assisted document processing

Pentaho Data Catalog provides AI-assisted document processing capabilities, such as document summarization, address detection, and document classification, that use the default ML models in Data Catalog or custom large language models (LLMs) to analyze unstructured documents and enrich them with meaningful metadata. These capabilities help users understand, organize, and govern unstructured content at scale without manual review.

Data Catalog includes built-in machine-learning models for AI-assisted document processing features. To use more advanced or scalable language models, an administrator can configure custom or third-party LLMs. See Configure Large Language Models in Data Catalog for more information.

During data discovery and enrichment, Data Catalog processes supported unstructured documents using the built-in or configured language models. Depending on whether you want to detect addresses, summarize, or classify a document, the content is passed to the relevant model or LLM for processing. To know more about how to process unstructured documents, see Processing structured, unstructured, and semi-structured files.

AI-assisted document processing is supported only for unstructured documents with English-language content. It doesn’t support structured, compressed, or semi-structured files, or scanned PDFs or image files.

The following file types are supported:

  • Hypertext markup formats: htm, html, shtml, dhtml, xhtml

  • Microsoft Office formats: doc, docx, xls, xlsx, ppt, pptx

  • Apple document formats: pages, numbers, key

  • OpenDocument formats: odp, ods, odt, odg

  • Electronic publication formats: epub, fb2

  • Mail formats: msg, pst, edb, ost, eml, mbox

  • XML formats: xml, tld, xsd, xsl, xslt, xaml, wsdl, dtd

  • JSON formats: json, jsonl

  • Other text formats: rtf, txt, pdf
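As a quick illustration, the sketch below pre-filters candidate files against the extensions listed above before handing them to AI-assisted processing. The filter itself is only an example; Data Catalog applies its own checks when it processes documents.

    # Illustrative only: filtering files by the supported extensions listed above.
    SUPPORTED_EXTENSIONS = {
        "htm", "html", "shtml", "dhtml", "xhtml",
        "doc", "docx", "xls", "xlsx", "ppt", "pptx",
        "pages", "numbers", "key",
        "odp", "ods", "odt", "odg",
        "epub", "fb2",
        "msg", "pst", "edb", "ost", "eml", "mbox",
        "xml", "tld", "xsd", "xsl", "xslt", "xaml", "wsdl", "dtd",
        "json", "jsonl",
        "rtf", "txt", "pdf",
    }

    def is_supported(filename: str) -> bool:
        """Return True if the file extension is eligible for AI-assisted document processing."""
        return filename.rsplit(".", 1)[-1].lower() in SUPPORTED_EXTENSIONS

    print(is_supported("contract_2024.pdf"))   # True
    print(is_supported("archive.zip"))         # False: compressed files are not supported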

Summarize documents

The summarize documents feature uses Data Catalog default models or configured large language models (LLMs) to generate concise summaries of unstructured documents in Data Catalog. Document summaries help you to quickly understand the purpose and key content of a document without reading it in full. Summaries generated from the document text are displayed on the entity page. This allows users to search, review, and govern documents more efficiently, especially when working with large volumes of unstructured content. You can use this feature to:

  • Quickly understand long or complex documents

  • Identify document relevance before detailed review

  • Support data discovery and governance workflows involving unstructured content

This feature is particularly useful for legal, policy, contractual, and reference documents where a high-level overview is often sufficient for initial assessment.

Address Detection

The address detection feature uses Data Catalog default models or configured large language models (LLMs) to identify and extract address-related information from unstructured documents in Data Catalog. This capability helps you to detect physical or postal addresses embedded within document content and represent them as metadata in the catalog. By automatically detecting address information, Data Catalog supports governance, compliance, and discovery workflows that require visibility into documents containing location-based or personally identifiable information.

You can use the address detection feature to:

  • Identify documents that contain physical or postal addresses

  • Support compliance and regulatory review processes

  • Improve search and filtering for documents with location-related content

This feature is useful in scenarios involving contracts, correspondence, invoices, forms, and other documents where address information is commonly embedded in free text.

Document classification

Document classification is an AI-assisted document processing capability in Data Catalog that classifies unstructured documents based on their content. This feature extends the document discovery workflow by enabling you to associate meaningful business classifications with files, improving document organization, searchability, and governance.

Using document classification, you can provide one or more classification terms that represent business concepts, such as Invoice, Contract, or HR Policy. Data Catalog semantically matches the classifications returned by the default model or a configured large language model (LLM) with the business terms provided. If the match meets the configured threshold value, Data Catalog assigns the corresponding business terms to the document. This process does not rely on exact keyword matching, allowing Data Catalog to classify documents based on meaning rather than literal text.
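As an illustration of semantic rather than keyword matching, the sketch below compares a classification label returned by a language model against the provided business terms using sentence embeddings and a similarity threshold. The embedding model, example label, and threshold value are assumptions and do not reflect Data Catalog's internal configuration.

    # Illustrative only: semantic matching of a model-returned classification
    # against user-provided business terms. Model and threshold are assumptions.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")      # small general-purpose embedding model

    business_terms = ["Invoice", "Contract", "HR Policy"]    # terms provided by the user
    llm_classification = "purchase billing statement"        # label returned for a document

    term_vectors = model.encode(business_terms, convert_to_tensor=True)
    label_vector = model.encode(llm_classification, convert_to_tensor=True)

    similarities = util.cos_sim(label_vector, term_vectors)[0]   # cosine similarity per term
    THRESHOLD = 0.5                                              # stand-in for the configured threshold

    for term, score in zip(business_terms, similarities.tolist()):
        if score >= THRESHOLD:
            print(f"Assign business term '{term}' (similarity {score:.2f})")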

Classify documents

You can use the document classification feature to classify unstructured documents based on their content, rather than relying on exact keyword matches. Perform the following steps to process and classify unstructured documents:

Document classification is triggered through the Data Discovery workflow. You can monitor its progress from the Workers page.

Procedure

  1. Log in to Data Catalog and in the left navigation menu, click Data Canvas.

  2. Browse or search for the documents that you want to classify.

  3. Select one or more documents and click Process. The Choose Process page opens with Metadata Ingest, Data Discovery, and Data Identification options. To know more about the processes, see Processing data.

  4. Click Data Discovery. Ensure you have run the Metadata Ingest process before proceeding to Data Discovery.

  5. In the Document Processing tab, select Data Classification.

  6. Click Add Terms, select one or more classification terms that represent the business concepts you want Data Catalog to identify in the selected documents, and click Add.

    For example, Invoice, Contract, HR Policy. If you don’t find the required business terms, you can create them. For more information, see Manage business glossary.

  7. Click Start Discovering. You can view the status of the Data Discovery process on the Manage Workers page.

Result

After the classification job completes with a SUCCESS status, Data Catalog adds the matching business terms to the processed documents based on the semantic match. Documents without a meaningful match remain unclassified.
