Manage worker processes

Pentaho Data Catalog uses worker processes to implement virtually all of its data analytics functions. Most functions run as a single primary worker process that Data Catalog launches in response to a user action or a scheduled action. Some processes might also start secondary worker processes.

Worker processes

The following table lists the worker processes:

| Process | Description | Actions performed |
| --- | --- | --- |
| Test Connection | Returns detailed success or failure information for each step of the test. Data Catalog starts this worker process when you configure or update a data source connection. Data Catalog marks the data source “OFFLINE” until a successful test completes. | Connect to data source; Authenticate; Retrieve list of schemas and store in MongoDB |
| Metadata Ingest | Ingests the metadata for one or more schemas. Note: Your license agreement determines the amount of data you can scan. Databases do not have a data scan quota. | Read schema from data source and store in MongoDB |
| Data Profiling | Generates a variety of statistics and intermediate data with a single pass through the source data. Typically, this is the first process you run on your data. | Create bitset; Create HyperLogLogs (HLL) for full data; Generate statistics (numeric and string related); Generate data patterns; Lucene indexing (optional); Extract samples for viewing (<100) |
| Data Identification | Identifies and tags columns and tables using ontology information (dictionaries, aliases), along with underlying data and metadata. | Tag columns based on dictionaries; Tag columns based on metadata and aliases |
| Key Discovery | Performs a variety of key discovery actions. Foreign key discovery requires that Data Profiling of the data sources has completed. | Foreign key discovery; Superkey identification; Composite key discovery; Compound key discovery; Secondary key discovery; Natural and surrogate key identification |
| Data Quality | Performs a full data quality (DQ) analysis on the underlying data, using regular expressions and other configurable business rules. | RegEx matching; Data pattern analysis; Update column statistics; Evaluate column DQ rules; Evaluate row-relative DQ rules |
| Sensitive Data Discovery (SDD) | Performs the SDD tasks that go beyond data identification. This process uses flows, lineage, foreign keys, and more to assemble the items that make up PI and PII. | Generate a separate SDD Lucene index that cross-references data |
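
To make the idea of single-pass profiling more concrete, the following Python sketch computes a few numeric statistics and data patterns while reading each value of a column exactly once. It is an illustration only, not Data Catalog's implementation: the names `ColumnProfile`, `value_pattern`, and `profile_column` are hypothetical, the pattern notation (digits as 9, letters as A) is an assumption, and the bitset, HyperLogLog, and Lucene steps are left out.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, Optional


def value_pattern(value: str) -> str:
    """Generalize a value: digits become 9, letters become A, the rest is kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


@dataclass
class ColumnProfile:
    """Running statistics for one column (hypothetical structure)."""
    count: int = 0
    nulls: int = 0
    numeric_count: int = 0
    numeric_sum: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    patterns: Counter = field(default_factory=Counter)

    def update(self, value: Optional[str]) -> None:
        """Fold one value into the running statistics."""
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.patterns[value_pattern(value)] += 1
        try:
            number = float(value)
        except ValueError:
            return  # not numeric: only the pattern is recorded
        self.numeric_count += 1
        self.numeric_sum += number
        self.minimum = number if self.minimum is None else min(self.minimum, number)
        self.maximum = number if self.maximum is None else max(self.maximum, number)

    @property
    def mean(self) -> Optional[float]:
        """Mean of the numeric values, or None if there were none."""
        if self.numeric_count == 0:
            return None
        return self.numeric_sum / self.numeric_count


def profile_column(values: Iterable[Optional[str]]) -> ColumnProfile:
    """Profile a column with a single pass over its values."""
    profile = ColumnProfile()
    for value in values:
        profile.update(value)
    return profile
```

For example, `profile_column(["12", "7", None, "AB-1", "30"])` counts five values and one null, and records the patterns `99`, `9`, and `AA-9`. Because every statistic is updated incrementally, the source data never needs to be read a second time.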

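As a rough illustration of the RegEx matching and column DQ rule evaluation listed for the Data Quality process, the following Python sketch scores a column by the share of non-null values that fully match a regular expression. The function names `regex_match_rate` and `passes_rule`, the 0.95 default threshold, and the choice to skip nulls are assumptions made for this example, not documented Data Catalog behavior.

```python
import re
from typing import Iterable, Optional


def regex_match_rate(values: Iterable[Optional[str]], pattern: str) -> float:
    """Return the fraction of non-null values that fully match the pattern."""
    regex = re.compile(pattern)
    checked = 0
    matched = 0
    for value in values:
        if value is None:
            continue  # assumption: nulls are not counted for or against the rule
        checked += 1
        if regex.fullmatch(value):
            matched += 1
    return matched / checked if checked else 0.0


def passes_rule(
    values: Iterable[Optional[str]], pattern: str, threshold: float = 0.95
) -> bool:
    """A column passes when its match rate reaches the (hypothetical) threshold."""
    return regex_match_rate(values, pattern) >= threshold
```

With the pattern `\d{5}`, the column `["12345", "9021", "90210", None, "ABCDE"]` has four non-null values, two of which match, so the match rate is 0.5 and the rule fails at the default threshold.
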
Monitor worker status

From the Manage Your Environment page, you can see the number of completed worker processes and the number of worker alerts on the Workers card.

Note: A Processed Items region also appears on your Home page if the Processed Items check box is selected in your Landing page options window.

Use the following steps to monitor the status of a worker process:

From the Manage Your Environment page, click View Workers to see the completed and in-progress worker processes.
The Status column shows the status of each worker process.
Click the up arrow at the beginning of the worker process row to expand the information.

View worker process details

Use the following steps to view details of a worker process:

On the Workers page, locate the worker process for which you want more information.
If an up arrow is visible at the beginning of the row for the worker process, click the arrow to expand the information.
Click the View Details icon (>) at the end of the row.
The View Worker Details window opens. If the process failed, an Exception tab might be available, in addition to the Details tab.
Click Close to close the View Worker Details window.

Cancel a worker process

Use the following steps to cancel a worker process:

While a worker process is running, go to the Workers page and locate the worker process you want to cancel.
Click Cancel at the end of the row.
Data Catalog cancels the worker process and displays Cancelling in the Job Status column.


