Processing data

Data Catalog processing helps you extract meaningful insights from your data and use it effectively. The main stages of data processing are:

  1. Metadata Ingest

  2. Data Profiling (for structured data) and Data Discovery (for unstructured data)

  3. Data Identification for structured data, including delimited files

  4. Usage Statistics for Microsoft SQL and Oracle databases

  5. PII Detection

Note: Your Data Catalog license determines the number of data sources you can add and the amount of data you can scan. Databases do not have a data scan quota.

Metadata Ingest

The Metadata Ingest step updates Data Catalog to reflect current metadata changes. It scans the data source for files that are new or modified since the last run and updates the existing metadata. It also removes metadata for deleted files, ensuring that Data Catalog accurately represents the data source.

Note: If you are close to or have exceeded your license's data scan quota, a message appears in the upper corner of the screen when you try to start a scan. If you have exceeded the quota, you cannot start a scan.

Data Profiling and Data Discovery

The Data Profiling and Data Discovery steps analyze structured and unstructured data, respectively.

Data Profiling

In the Data Profiling process, Data Catalog examines structured data within JDBC data sources and gathers statistics about it. Data Catalog profiles the data in the cluster and computes detailed properties, including field-level data quality metrics, data statistics, and data patterns.

Note: When configuring data profiling, it is best practice to use the default settings; they are suitable for most situations. With the default settings, data profiling is limited to 500,000 rows.
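The kind of field-level statistics this step produces can be pictured with a small, self-contained sketch. This is illustrative only: the column names, the pattern notation (digits as 9, letters as A), and the particular statistics computed here are assumptions, not Data Catalog's actual output.

```python
import re
from collections import Counter

def profile_column(values):
    """Compute simple field-level statistics for one column (illustrative only)."""
    non_null = [v for v in values if v != ""]
    stats = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    # Derive a coarse data pattern: digits -> 9, letters -> A, keep punctuation.
    def pattern(v):
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))
    stats["top_patterns"] = Counter(pattern(v) for v in non_null).most_common(3)
    return stats

# Hypothetical rows from a profiled table.
rows = [
    {"id": "101", "email": "kim@example.com"},
    {"id": "102", "email": "sato@example.jp"},
    {"id": "103", "email": ""},
]
by_column = {k: [r[k] for r in rows] for k in rows[0]}
profiles = {col: profile_column(vals) for col, vals in by_column.items()}
print(profiles["id"])     # e.g. count, null_count, distinct, top_patterns
print(profiles["email"])
```

Real profilers compute many more metrics (min/max, histograms, type inference), but the shape of the output per column is similar.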

Data Discovery

In the Data Discovery process, Data Catalog examines unstructured data by scanning file contents to compile data statistics, which involves the following steps:

  • Calculating checksums to identify duplicates, if the Compute checksum of document contents checkbox is selected.

  • Extracting document properties from Office 365 and PDF files.

  • Using dictionaries to scan documents for specific strings and keywords, triggering predefined actions.

  • Profiling data within the cluster to ascertain detailed attributes, including quality metrics, statistics, and patterns for delimited files.

These processes ensure a thorough understanding and assessment of both structured and unstructured data, setting a solid foundation for subsequent analysis.
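The checksum-based duplicate identification in the first step can be sketched as follows. This is a minimal illustration: the hash algorithm (SHA-256), file names, and contents are assumptions; Data Catalog's actual implementation is not specified here.

```python
import hashlib

def checksum(content: bytes) -> str:
    # Hash the raw document contents; files with equal checksums are
    # candidate duplicates. (SHA-256 is an assumption for illustration.)
    return hashlib.sha256(content).hexdigest()

# Hypothetical scanned documents.
documents = {
    "report_v1.pdf": b"quarterly figures ...",
    "report_copy.pdf": b"quarterly figures ...",  # byte-identical duplicate
    "notes.txt": b"meeting notes",
}

# Group files by checksum; any group with more than one member is a duplicate set.
by_hash = {}
for name, content in documents.items():
    by_hash.setdefault(checksum(content), []).append(name)

duplicates = [names for names in by_hash.values() if len(names) > 1]
print(duplicates)  # [['report_v1.pdf', 'report_copy.pdf']]
```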

Data Identification

The Data Identification process helps you manage your structured data, including delimited files. It involves tagging data to make it easier to search, retrieve, and analyze. By associating dictionaries and data patterns with tables and columns, you can ensure that data is appropriately categorized and easily accessed when needed.

CAUTION: You must run Data Profiling (for structured data) or Data Discovery (for unstructured data) before proceeding with any Data Identification activities.
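Conceptually, associating dictionaries and data patterns with columns amounts to matching column values against known vocabularies and regular expressions. The sketch below is a hypothetical illustration, not Data Catalog's algorithm: the dictionary contents, the `email` pattern, and the 80% match threshold are all assumptions.

```python
import re

# Hypothetical dictionary: a tag and the vocabulary of values it covers.
dictionaries = {
    "country_code": {"US", "JP", "KR", "DE"},
}
# Hypothetical data patterns expressed as regular expressions.
patterns = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def identify_column(values, threshold=0.8):
    """Return tags whose dictionary or pattern matches at least `threshold` of values."""
    tags = []
    non_null = [v for v in values if v]
    if not non_null:
        return tags
    for tag, vocab in dictionaries.items():
        if sum(v in vocab for v in non_null) / len(non_null) >= threshold:
            tags.append(tag)
    for tag, rx in patterns.items():
        if sum(bool(rx.match(v)) for v in non_null) / len(non_null) >= threshold:
            tags.append(tag)
    return tags

print(identify_column(["US", "JP", "KR"]))           # ['country_code']
print(identify_column(["a@b.com", "c@d.org", "x"]))  # []  (2/3 is below the threshold)
```

Once a column is tagged this way, the tag becomes searchable metadata, which is what makes retrieval and analysis easier.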

Usage Statistics

Note: The Usage Statistics process is available only for Microsoft SQL and Oracle databases, and only if auditing is enabled in those databases.

When processing Microsoft SQL or Oracle databases, Data Catalog provides an additional capability to gather usage statistics and store them in the Business Intelligence Database (BIDB). During this process, the Entity Usage Worker job fetches usage metrics from an audit database, such as how many times an entity is read, written, and altered, along with the corresponding timestamps, and stores them in the Entity Usage Statistic View collection within the BIDB. You can use this repository to analyze and visualize the data with third-party BI tools. For more information, see Business Intelligence Database.
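The aggregation performed by the Entity Usage Worker can be pictured as rolling audit events up into per-entity counters. The snippet below is a conceptual sketch with made-up audit records and field names; it does not reflect the BIDB schema or the actual audit log format.

```python
from collections import defaultdict

# Hypothetical audit records as (entity, operation, timestamp) tuples;
# a real deployment would read these from the database's audit log.
audit_events = [
    ("sales.orders", "READ",  "2024-05-01T09:00:00"),
    ("sales.orders", "READ",  "2024-05-01T09:05:00"),
    ("sales.orders", "WRITE", "2024-05-01T10:00:00"),
    ("hr.employees", "ALTER", "2024-05-02T11:00:00"),
]

# Roll events up into per-entity usage counters plus a last-seen timestamp.
usage = defaultdict(lambda: {"reads": 0, "writes": 0, "alters": 0, "last_seen": None})
for entity, op, ts in audit_events:
    stats = usage[entity]
    key = {"READ": "reads", "WRITE": "writes", "ALTER": "alters"}[op]
    stats[key] += 1
    # ISO-8601 timestamps sort lexicographically, so max() works on strings.
    stats["last_seen"] = max(stats["last_seen"] or ts, ts)

print(usage["sales.orders"])
# {'reads': 2, 'writes': 1, 'alters': 0, 'last_seen': '2024-05-01T10:00:00'}
```

A BI tool pointed at such a collection could then chart, for example, the most-read entities per day.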

PII Detection

The PII Detection feature in Data Catalog uses Machine Learning (ML) and Large Language Models (LLMs) to analyze data in JDBC tables and identify Personally Identifiable Information (PII), such as names, addresses, and ID numbers. The feature is trained specifically on Korean and Japanese datasets. By automatically detecting and classifying sensitive data, it helps you streamline compliance with privacy regulations. To learn more, see PII Detection.

Note: This feature currently supports only JDBC data sources with Korean and Japanese content.

When you start PII Detection, Data Catalog scans the selected JDBC table for column names that contain PII entities. When the process completes, if PII data is identified:

  • A new glossary titled ML_PII is automatically created (if not already present). If the ML_PII glossary already exists, newly identified PII terms are added to it.

  • Detected PII entities are tagged with relevant business terms from the ML_PII glossary.

These tags appear in the Business Terms panel of the respective columns.
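The tagging flow described above can be illustrated with a toy sketch. Note that the real detector is an ML/LLM model; the keyword-based `detect_pii_type` function below is a placeholder assumption, as are the table contents and the glossary structure.

```python
def detect_pii_type(column_name, sample_values):
    # Placeholder for the ML/LLM classifier (an assumption, not the real model).
    if "name" in column_name.lower():
        return "PERSON_NAME"
    if "addr" in column_name.lower():
        return "ADDRESS"
    return None

# The ML_PII glossary collects newly identified PII terms.
glossary = {"name": "ML_PII", "terms": set()}

# Hypothetical JDBC table: column name -> sample values.
table = {"cust_name": ["kim", "sato"], "zip_addr": ["04524", "100-0001"], "qty": [1, 2]}

tags = {}
for col, values in table.items():
    pii_type = detect_pii_type(col, values)
    if pii_type:
        glossary["terms"].add(pii_type)  # add the newly identified term
        tags[col] = pii_type             # tag shown in the Business Terms panel

print(sorted(glossary["terms"]))  # ['ADDRESS', 'PERSON_NAME']
print(tags)
```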
