Data Catalog user features

This article describes the Pentaho Data Catalog user interface and the tasks non-admin users can perform. Before proceeding, make sure that a Data Catalog service user or administrator has set up your catalog for you. Data Catalog builds a complete inventory of data assets in a data warehouse, automatically and securely. It provides:

  • exact data discovery and faster delivery to an authenticated user.

  • improved and simplified understanding of data quality.

  • an inventory of all assets for efficient data repository governance.

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository, making the data repository available to those tools for further processing.

Data self-service

Data Catalog provides a rich interface for data self-service to help you find the best instances of the integrated data you are looking for. These services include:

Role-based access control

Data Catalog roles give users the ability to perform specific actions, especially edit actions, and allow lines of business to restrict access to sensitive or confidential information. You can create named communities, such as US Business Users or Commercial Lending Business Users to fine-tune the actions users can perform, as well as to allow access to a subset of glossaries and data sources.

Note: The number of data sources you can add is limited by your license.

Business Terms

You can discover metadata about files and fields and have Data Catalog associate fields to customer's business terms. You can associate business terms with data elements, business rules, related terms, and custom properties to form a comprehensive view of the organization’s business concepts and data landscape.

Data quality metrics and statistics

Data Catalog's profiling processes produce and present detailed data quality metrics and statistics that help you decide if the data is useful, valid, and complete, without having to write code to graph each field in the file.

User roles

User roles in Data Catalog are used for access control and permissions management. They help control who can view, edit, or delete items in the Data Catalog and ensure data security.

Note: The number of Expert user roles you can assign to users is limited by your license. The Expert user roles are:

  • Business Steward

  • Data Steward

  • Admin

  • Data Developer

Table creation

When you find a file you are interested in, you can create a table for that file. This ability is useful for non-technical users to find the files that contain the data that they need, create a table and use a tool like Grafana to visualize the data without any added efforts.

Data inventory

Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating terms created by users. Data Catalog identifies the formats of the resources and profiles their contents, creating an inventory of data assets in the data warehouse securely.

Profiling

Most of the data-curation process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.

Data profiling and discovery are the processes in which Data Catalog examines file data and gathers statistics about the data. It profiles data in the cluster, and uses its algorithms to compute detailed properties, including field-level data quality metrics, and data statistics. The resulting inventory includes rich metadata for delimited files, like JSON, and Parquet, and files compressed with supported compression algorithms such as gzip.

Note: The amount of data you can scan is limited by your license. Databases do not have a data scan quota.

Sensitive data discovery

Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. The algorithms in Data Catalog identify sensitive data throughout the data clusters as a part of profiling with minimal additional processing overhead. Identification is the first step, and often the most difficult step in the process of protecting sensitive data. You cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.

Data quality

You can discover data quality metrics automatically using large-scale profiling, such as discovering the number of nulls in a data column or cardinality. For example, you can assess the number of values that should be in a field versus the actual numbers that have been profiled.

Note: Data Catalog writes profiling process notifications to the log files.

Data governance

Data Catalog provides data governance by securing access to the data, by managing metadata creation, enrichment, and approval, and by linking physical data to business-related terminology.

Securing access to data

In Data Catalog, you can protect resources with secured access using glossaries. A glossary is a logical grouping of business terms that you can assign to a specific project user group. Once you have set up roles for glossaries, you can use them to limit access to data via specific roles and users.

Managing metadata creation, enrichment, and approval

By default, users with the Data Steward role can only associate terms with data, while users with the Business Steward role can create new metadata by adding business terms, term hierarchies, and custom properties. Users with the Business Steward role can also perform these functions.

Linking physical data to business-defined terms

In Data Catalog, with applicable permissions, you can manually tag data directly in the user interface. To learn more about this process, see the Administer Pentaho Data Catalog document.

Reporting and data visualization

Reporting in Data Catalog usually is done through dashboards using third-party BI tools. Dashboards further extend the visual discovery and relationship discovery capabilities of the Data Catalog in several ways. They also provide a means to add customized insight assets unique to the organization.

Business Intelligence Database

Data Catalog contains the Business Intelligence Database (BIDB) server, containing a variety of collections filled with specific metadata. Using the Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) connector, you can connect to BIDB, access reporting data, and create dashboards.

Several services, including bi-mongo and bi-views, are deployed during the Data Catalog installation. The bi-views service runs periodically based on the configuration set in the .env file (by default, daily) and aggregates the data from all connected databases and stores in BIDB. Utilizing the mongo-bi-connector provided by the bi-mongo service, you can link to BIDB with a JDBC or ODBC connector. This connection makes BIDB data available on port 3307, enabling analysis through any compatible BI tool. For detailed guidance on connecting to BIDB, see How to connect to BIDB.

Collections available in Business Intelligence Database

BIDB hosts several collections, updated regularly by the bi-views service. This service systematically extracts data from all connected databases, organizing it into the following collections available within BIDB.

Checksum Aggregated View

The Checksum Aggregated View collection contains a summary of duplicate files for a specific entity, including their count and total size. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

Checksum derived from bi.entities_master_view.

String

“968dl402bd0ce783a573al4172c37690”

duplicateFilesCount

The total number of duplicate files identified.

Integer

3

duplicateFilesSize

The total size of duplicate files.

Integer

381

Custom Properties View

The Custom Properties View collection contains the details of custom properties in an entity, including their values. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the custom property entry.

String

“65dfc901d04619a9e6a8d62d”

EntityId

A unique identifier for the entity.

String

"11"

PropertyId

A unique identifier for the property.

String

"5"

PropertyName

The name of the custom property.

String

"Name"

Value

The assigned value of the custom property.

String

"John"

Entities Aggregated View

The Entities Aggregated View collection includes the details of the key attributes and values of aggregated entities. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the entity aggregation.

String

“65dfc901d04619a9e6a8d62d”

attribute

The attribute name of the entity.

String

"DataSources"

type

The data type of the attribute.

String

"Structured"

value

The value associated with the attribute.

String

"34"

Entities Extension Count View

The Entities Extension Count view collection includes extension details, such as the file count, data source, and date of recording for each extension. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the count entry.

String

“65c56b02250cc54a7b43943f”

DataSourceFqdnId

A fully qualified domain name identifier is required for the data source.

String

"5"

Date

The date when the file count was recorded.

Date

2024-02-09T00:00:02.110+00:00

Extension

The file extension.

String

"text/plain; charset=IS0-8859-l;delimiter=comma”

FileCount

The number of files with the specified extension.

Integer

1

Entities Master View

The Entities Master View contains the essential structure and data field details of an entity. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the entity.

String

"11"

Name

The name of the entity.

String

"customers"

Type

The type of the entity (for example, file, table).

String

"Table"

Parent

The parent entity identifier.

String

"12"

DataSourceId

A unique identifier for the data source.

String

"dataSource_01"

DataSourceName

The name of the data source.

String

"SalesDB"

DataSourceType

The type of the data source (for example, SQL, NoSQL).

String

"SQL"

ResourceType

The type of the resource.

String

"Database"

DataProfileStatus

The status of data profiling (for example, Complete, In Progress).

String

"Complete"

DataProfiled

Whether the data has been profiled (True or False).

Boolean

True

LastUpdate

The timestamp of the last update.

Timestamp

"2023-12-14T15:05:00Z"

ProductName

The name of the product.

String

"MySQL"

ProductVersion

A version of the product.

String

"8.0"

DriverName

The name of the driver used.

String

"MySQL ODBC 8.0 Driver"

Url

The URL associated with the entity.

String

"jdbc:mysql://example.com/db"

ParentName

The name of the parent entity.

String

"SalesRegion"

TotalTables

The total number of tables.

Integer

12

TotalColumns

The total number of columns.

Integer

120

SchemaName

The name of the schema.

String

"public"

DatabaseName

The name of the database.

String

"SalesDB"

LastUpdateStatistics

The timestamp of the last statistics update.

Timestamp

2023-12-14T14:00:00Z

RowCount

Number of rows in the entity.

Integer

10000

NullCount

Number of nulls in the entity.

Integer

50

Cardinality

The cardinality of the entity.

Integer

9500

Hll

HyperLogLog of the entity.

String

"hll:6a9..."

BlankCount

The number of blank entries in the entity.

Integer

20

Min

The minimum value in the entity.

String

"1"

Max

The maximum value in the entity.

String

"10000"

AvgValue

The average value of the entity.

Float

5000.5

MinWidth

The minimum width of the entity.

Integer

1

MaxWidth

The maximum width of the entity.

Integer

10

AvgWidth

The average width of the entity.

Float

5.5

ColumnsCount

The count of columns in the entity.

Integer

10

Path

The path of the entity.

String

"/data/salesdb/customers"

CheckClause

Check the clause of the entity.

String

"age > 18"

TableName

The name of the table.

String

"customers"

DataType

The data type of the entity.

String

"VARCHAR"

TypeName

The name of the entity type.

String

"varchar"

ColumnSize

The size of the column.

Integer

255

BufferLength

The length of the buffer.

Integer

256

DecimalDigits

A number of decimal digits.

Integer

2

NumPrecRadix

A numeric precision radix.

Integer

10

IsNullable

Whether the entity is nullable (True or False).

Boolean

True

OrdinalPosition

Ordinal position of the entity.

Integer

1

IsPrimaryKey

Whether the entity is a primary key (True or False).

Boolean

False

IsForeignKey

Whether the entity is a foreign key (True or False).

Boolean

False

ParentPath

The parent path of the entity.

String

"/data/salesdb"

PathType

The path type of the entity.

String

"Directory"

FileExtension

The file extension of the entity.

String

".txt"

Size

The size of the entity.

Integer

2048

Flags

Flags associated with the entity.

Integer

0

Owner

The owner of the entity.

String

"admin"

Group

The group associated with the entity.

String

"sales"

SymLinkTarget

Symbolic link target of the entity.

String

"/var/salesdb/link"

FileType

The file type of the entity.

String

"Text File"

CreatedAt

The timestamp when the entity is created.

Timestamp

2021-01-01T12:00:00Z

ModifiedAt

The timestamp when the entity is modified.

Timestamp

2023-01-01T12:00:00Z

AccessedAt

The timestamp when the entity is accessed.

Timestamp

2023-01-02T12:00:00Z

ScannedAt

The timestamp when the entity is scanned.

Timestamp

2023-01-03T12:00:00Z

IsSymlink

Whether the entity is a symbolic link (True or False).

Boolean

False

LinkType

The link type of the entity.

String

“<example>”

PhysicalLocation

The physical location of the entity.

String

"ServerRoom1"

Title

The title of the entity.

String

"2023 Sales Report"

Author

The author of the entity.

String

"John Doe"

Subject

The subject of the entity.

String

"Sales Analysis"

Application

An application associated with the entity.

String

"Microsoft Excel"

Producer

The producer of the entity.

String

"Microsoft"

Version

A version of the entity.

String

"16.0"

DocumentSize

The size of the document.

Integer

102400

PageSize

The size of the page.

String

"A4"

PageCount

Number of pages in the entity.

Integer

10

Company

The company associated with the entity.

String

"Acme Corp"

Paragraphs

The number of paragraphs in the entity.

Integer

50

Lines

The number of lines in the entity.

Integer

200

Words

The number of words in the entity.

Integer

1000

Characters

The number of characters in the entity.

Integer

5000

CharactersWithSpaces

The number of characters with spaces in the entity.

Integer

6000

Language

The language of the entity.

String

"English"

Checksum

The checksum of the entity.

String

"e4d909c290d0fb1ca068ffaddf22cbd0"

PropertiesChecksum

The checksum of the properties of the entity.

String

"abcd1234efgh5678ijkl9012mnop3456"

ChildDirs

The number of child directories.

Integer

5

ChildFiles

The number of child files.

Integer

20

ChildDirSize

The size of child directories.

Integer

4096

ChildFileSize

The size of child files.

Integer

8192

TotalChildDirs

The total number of child directories.

Integer

5

TotalChildFiles

The total number of child files.

Integer

20

TotalChildDirSize

The total size of child directories.

Integer

4096

TotalChildFileSize

The total size of child files.

Integer

8192

Entities Summary View

The Entities Summary View collection contains the details of an entity, such as type and the parent. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the summary entry.

String

“11/XE/SYNTHEA/ALLERGIES”

Name

The name of the entity.

String

"ALLERGIES"

Type

The type of the entity (for example, file, table).

String

"TABLE"

Parent

The parent entity identifier.

String

"11/XE/SYNTHEA"

Entities Temperature Count View

The Entities Temperature View contains entity details emphasizing the categorization of data based on its temperature, which often indicates the frequency of access or modification, including the number of files. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the temperature count entry.

String

"65d2f44dd30b49309488b9dd"

DataSourceFqdnId

A fully qualified domain name identifier for the data source.

String

"5"

Date

The date when the file count and temperature were recorded.

String

2024-02-19TO6:25:17.918+00:00

FileCount

The number of files associated with the specified temperature.

String

2

Temperature

The temperature category of the data (for example, unclassified, hot, warm, and cold).

String

"unclassified"

Entity Usage Statistic View

The Entity Usage Statistic View collection includes a range of usage metrics, such as the number of times an entity is read, written to, and altered, along with the timestamp. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the statistics view entry.

String

“11/XE/SYNTHEA/ALLERGIES”

PeriodStartTime

The start time of the entity's profiling period, in ISO format.

Timestamp

2023-12-14T15:00:00Z

PeriodEndTime

The end time of the entity's profiling period, in ISO format.

Timestamp

2023-12-14T15:05:00Z

EntityID

A unique identifier for the entity.

String

"12"

DatabaseName

The name of the database where the entity is located.

String

“Postgres”

SchemaName

The name of the schema within the database.

String

“Chinook”

TableName

The name of the table containing the entity.

String

“Album”

LastReadTime

The timestamp of the last read operation on the entity, in ISO format

Timestamp

2023-12-14T11:00:00Z

LastWriteTime

The timestamp of the last write operation on the entity, in ISO format.

Timestamp

2023-12-14T11:50:00Z

LastAlterTime

The timestamp of the last modification to the entity in ISO format.

Timestamp

2023-12-14T11:20:00Z

ReadCount

The total number of times the entity has been read.

Integer

120

WriteCount

The total number of times the entity has been written.

Integer

45

AlterCount

The total number of times the entity has been altered.

Integer

3

CollectionTime

The timestamp indicating when this data was collected, in ISO format.

Timestamp

2023-12-16T15:00:00Z

Terms View

The Terms View collection contains information of terms related to items. It gives a structured overview, linking terms to specific entities and domains. Each record uniquely identifies a term, its association with an entity, the domain it belongs to, and a unique term identifier, enabling a comprehensive semantic mapping of data assets. The following table shows the details of the data available in this collection.

Field
Description
Data Type
Example Value

_id

A unique identifier for the term entry.

String

"abc12345-d678-90ef-ghij-klmn01234567"

EntityId

A unique identifier for the associated entity.

String

"entity78901-2345-6789-abcd-ef0123456789"

TermName

The name of the term.

String

"Customer Satisfaction Index"

DomainId

An identifier for the domain the term belongs to.

String

"domain1234-5678-90ab-cdef-ghijklmnop"

TermId

A unique identifier for the term.

String

"term5678-9012-3456-7890-abcd12345678"

Pentaho Data Optimizer

If you have a license for Pentaho Data Optimizer, use Data Optimizer to inventory stored data, identify content, view usage, and tier files and objects into long term or deep archival storage. You can use rule-driven actions about data lifecycles to account for compliance, manage costs, and mitigate risks, using a set of convenient tools and self-service processes for sustainable improvements in data management.

Last updated

Was this helpful?