Data Catalog user features
This article describes the Pentaho Data Catalog user interface and the tasks non-admin users can perform. Before proceeding, make sure that a Data Catalog service user or administrator has set up your catalog for you. Data Catalog builds a complete inventory of data assets in a data warehouse, automatically and securely. It provides:
exact data discovery and faster delivery to an authenticated user.
improved and simplified understanding of data quality.
an inventory of all assets for efficient data repository governance.
Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository, making the data repository available to those tools for further processing.
Data self-service
Data Catalog provides a rich interface for data self-service to help you find the best instances of the integrated data you are looking for. These services include:
Role-based access control
Data Catalog roles give users the ability to perform specific actions, especially edit actions, and allow lines of business to restrict access to sensitive or confidential information. You can create named communities, such as US Business Users or Commercial Lending Business Users to fine-tune the actions users can perform, as well as to allow access to a subset of glossaries and data sources.
Note: The number of data sources you can add is limited by your license.
Business Terms
You can discover metadata about files and fields and have Data Catalog associate fields to customer's business terms. You can associate business terms with data elements, business rules, related terms, and custom properties to form a comprehensive view of the organization’s business concepts and data landscape.
Data quality metrics and statistics
Data Catalog's profiling processes produce and present detailed data quality metrics and statistics that help you decide if the data is useful, valid, and complete, without having to write code to graph each field in the file.
User roles
User roles in Data Catalog are used for access control and permissions management. They help control who can view, edit, or delete items in the Data Catalog and ensure data security.
Note: The number of Expert user roles you can assign to users is limited by your license. The Expert user roles are:
Business Steward
Data Steward
Admin
Data Developer
Table creation
When you find a file you are interested in, you can create a table for that file. This ability is useful for non-technical users to find the files that contain the data that they need, create a table and use a tool like Grafana to visualize the data without any added efforts.
Data inventory
Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating terms created by users. Data Catalog identifies the formats of the resources and profiles their contents, creating an inventory of data assets in the data warehouse securely.
Profiling
Most of the data-curation process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.
Data profiling and discovery are the processes in which Data Catalog examines file data and gathers statistics about the data. It profiles data in the cluster, and uses its algorithms to compute detailed properties, including field-level data quality metrics, and data statistics. The resulting inventory includes rich metadata for delimited files, like JSON, and Parquet, and files compressed with supported compression algorithms such as gzip.
Note: The amount of data you can scan is limited by your license. Databases do not have a data scan quota.
Sensitive data discovery
Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. The algorithms in Data Catalog identify sensitive data throughout the data clusters as a part of profiling with minimal additional processing overhead. Identification is the first step, and often the most difficult step in the process of protecting sensitive data. You cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.
Data quality
You can discover data quality metrics automatically using large-scale profiling, such as discovering the number of nulls in a data column or cardinality. For example, you can assess the number of values that should be in a field versus the actual numbers that have been profiled.
Note: Data Catalog writes profiling process notifications to the log files.
Data governance
Data Catalog provides data governance by securing access to the data, by managing metadata creation, enrichment, and approval, and by linking physical data to business-related terminology.
Securing access to data
In Data Catalog, you can protect resources with secured access using glossaries. A glossary is a logical grouping of business terms that you can assign to a specific project user group. Once you have set up roles for glossaries, you can use them to limit access to data via specific roles and users.
Managing metadata creation, enrichment, and approval
By default, users with the Data Steward role can only associate terms with data, while users with the Business Steward role can create new metadata by adding business terms, term hierarchies, and custom properties. Users with the Business Steward role can also perform these functions.
Linking physical data to business-defined terms
In Data Catalog, with applicable permissions, you can manually tag data directly in the user interface. To learn more about this process, see the Administer Pentaho Data Catalog document.
Reporting and data visualization
Reporting in Data Catalog usually is done through dashboards using third-party BI tools. Dashboards further extend the visual discovery and relationship discovery capabilities of the Data Catalog in several ways. They also provide a means to add customized insight assets unique to the organization.
Business Intelligence Database
Data Catalog contains the Business Intelligence Database (BIDB) server, containing a variety of collections filled with specific metadata. Using the Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) connector, you can connect to BIDB, access reporting data, and create dashboards.
Several services, including bi-mongo and bi-views, are deployed during the Data Catalog installation. The bi-views service runs periodically based on the configuration set in the .env
file (by default, daily) and aggregates the data from all connected databases and stores in BIDB. Utilizing the mongo-bi-connector provided by the bi-mongo service, you can link to BIDB with a JDBC or ODBC connector. This connection makes BIDB data available on port 3307
, enabling analysis through any compatible BI tool. For detailed guidance on connecting to BIDB, see How to connect to BIDB.
Collections available in Business Intelligence Database
BIDB hosts several collections, updated regularly by the bi-views service. This service systematically extracts data from all connected databases, organizing it into the following collections available within BIDB.
Checksum Aggregated View
The Checksum Aggregated View collection contains a summary of duplicate files for a specific entity, including their count and total size. The following table shows the details of the data available in this collection.
_id
Checksum derived from bi.entities_master_view
.
String
“968dl402bd0ce783a573al4172c37690”
duplicateFilesCount
The total number of duplicate files identified.
Integer
3
duplicateFilesSize
The total size of duplicate files.
Integer
381
Custom Properties View
The Custom Properties View collection contains the details of custom properties in an entity, including their values. The following table shows the details of the data available in this collection.
_id
A unique identifier for the custom property entry.
String
“65dfc901d04619a9e6a8d62d”
EntityId
A unique identifier for the entity.
String
"11"
PropertyId
A unique identifier for the property.
String
"5"
PropertyName
The name of the custom property.
String
"Name"
Value
The assigned value of the custom property.
String
"John"
Entities Aggregated View
The Entities Aggregated View collection includes the details of the key attributes and values of aggregated entities. The following table shows the details of the data available in this collection.
_id
A unique identifier for the entity aggregation.
String
“65dfc901d04619a9e6a8d62d”
attribute
The attribute name of the entity.
String
"DataSources"
type
The data type of the attribute.
String
"Structured"
value
The value associated with the attribute.
String
"34"
Entities Extension Count View
The Entities Extension Count view collection includes extension details, such as the file count, data source, and date of recording for each extension. The following table shows the details of the data available in this collection.
_id
A unique identifier for the count entry.
String
“65c56b02250cc54a7b43943f”
DataSourceFqdnId
A fully qualified domain name identifier is required for the data source.
String
"5"
Date
The date when the file count was recorded.
Date
2024-02-09T00:00:02.110+00:00
Extension
The file extension.
String
"text/plain; charset=IS0-8859-l;delimiter=comma”
FileCount
The number of files with the specified extension.
Integer
1
Entities Master View
The Entities Master View contains the essential structure and data field details of an entity. The following table shows the details of the data available in this collection.
_id
A unique identifier for the entity.
String
"11"
Name
The name of the entity.
String
"customers"
Type
The type of the entity (for example, file, table).
String
"Table"
Parent
The parent entity identifier.
String
"12"
DataSourceId
A unique identifier for the data source.
String
"dataSource_01"
DataSourceName
The name of the data source.
String
"SalesDB"
DataSourceType
The type of the data source (for example, SQL, NoSQL).
String
"SQL"
ResourceType
The type of the resource.
String
"Database"
DataProfileStatus
The status of data profiling (for example, Complete, In Progress).
String
"Complete"
DataProfiled
Whether the data has been profiled (True or False).
Boolean
True
LastUpdate
The timestamp of the last update.
Timestamp
"2023-12-14T15:05:00Z"
ProductName
The name of the product.
String
"MySQL"
ProductVersion
A version of the product.
String
"8.0"
DriverName
The name of the driver used.
String
"MySQL ODBC 8.0 Driver"
Url
The URL associated with the entity.
String
"jdbc:mysql://example.com/db"
ParentName
The name of the parent entity.
String
"SalesRegion"
TotalTables
The total number of tables.
Integer
12
TotalColumns
The total number of columns.
Integer
120
SchemaName
The name of the schema.
String
"public"
DatabaseName
The name of the database.
String
"SalesDB"
LastUpdateStatistics
The timestamp of the last statistics update.
Timestamp
2023-12-14T14:00:00Z
RowCount
Number of rows in the entity.
Integer
10000
NullCount
Number of nulls in the entity.
Integer
50
Cardinality
The cardinality of the entity.
Integer
9500
Hll
HyperLogLog of the entity.
String
"hll:6a9..."
BlankCount
The number of blank entries in the entity.
Integer
20
Min
The minimum value in the entity.
String
"1"
Max
The maximum value in the entity.
String
"10000"
AvgValue
The average value of the entity.
Float
5000.5
MinWidth
The minimum width of the entity.
Integer
1
MaxWidth
The maximum width of the entity.
Integer
10
AvgWidth
The average width of the entity.
Float
5.5
ColumnsCount
The count of columns in the entity.
Integer
10
Path
The path of the entity.
String
"/data/salesdb/customers"
CheckClause
Check the clause of the entity.
String
"age > 18"
TableName
The name of the table.
String
"customers"
DataType
The data type of the entity.
String
"VARCHAR"
TypeName
The name of the entity type.
String
"varchar"
ColumnSize
The size of the column.
Integer
255
BufferLength
The length of the buffer.
Integer
256
DecimalDigits
A number of decimal digits.
Integer
2
NumPrecRadix
A numeric precision radix.
Integer
10
IsNullable
Whether the entity is nullable (True or False).
Boolean
True
OrdinalPosition
Ordinal position of the entity.
Integer
1
IsPrimaryKey
Whether the entity is a primary key (True or False).
Boolean
False
IsForeignKey
Whether the entity is a foreign key (True or False).
Boolean
False
ParentPath
The parent path of the entity.
String
"/data/salesdb"
PathType
The path type of the entity.
String
"Directory"
FileExtension
The file extension of the entity.
String
".txt"
Size
The size of the entity.
Integer
2048
Flags
Flags associated with the entity.
Integer
0
Owner
The owner of the entity.
String
"admin"
Group
The group associated with the entity.
String
"sales"
SymLinkTarget
Symbolic link target of the entity.
String
"/var/salesdb/link"
FileType
The file type of the entity.
String
"Text File"
CreatedAt
The timestamp when the entity is created.
Timestamp
2021-01-01T12:00:00Z
ModifiedAt
The timestamp when the entity is modified.
Timestamp
2023-01-01T12:00:00Z
AccessedAt
The timestamp when the entity is accessed.
Timestamp
2023-01-02T12:00:00Z
ScannedAt
The timestamp when the entity is scanned.
Timestamp
2023-01-03T12:00:00Z
IsSymlink
Whether the entity is a symbolic link (True or False).
Boolean
False
LinkType
The link type of the entity.
String
“<example>”
PhysicalLocation
The physical location of the entity.
String
"ServerRoom1"
Title
The title of the entity.
String
"2023 Sales Report"
Author
The author of the entity.
String
"John Doe"
Subject
The subject of the entity.
String
"Sales Analysis"
Application
An application associated with the entity.
String
"Microsoft Excel"
Producer
The producer of the entity.
String
"Microsoft"
Version
A version of the entity.
String
"16.0"
DocumentSize
The size of the document.
Integer
102400
PageSize
The size of the page.
String
"A4"
PageCount
Number of pages in the entity.
Integer
10
Company
The company associated with the entity.
String
"Acme Corp"
Paragraphs
The number of paragraphs in the entity.
Integer
50
Lines
The number of lines in the entity.
Integer
200
Words
The number of words in the entity.
Integer
1000
Characters
The number of characters in the entity.
Integer
5000
CharactersWithSpaces
The number of characters with spaces in the entity.
Integer
6000
Language
The language of the entity.
String
"English"
Checksum
The checksum of the entity.
String
"e4d909c290d0fb1ca068ffaddf22cbd0"
PropertiesChecksum
The checksum of the properties of the entity.
String
"abcd1234efgh5678ijkl9012mnop3456"
ChildDirs
The number of child directories.
Integer
5
ChildFiles
The number of child files.
Integer
20
ChildDirSize
The size of child directories.
Integer
4096
ChildFileSize
The size of child files.
Integer
8192
TotalChildDirs
The total number of child directories.
Integer
5
TotalChildFiles
The total number of child files.
Integer
20
TotalChildDirSize
The total size of child directories.
Integer
4096
TotalChildFileSize
The total size of child files.
Integer
8192
Entities Summary View
The Entities Summary View collection contains the details of an entity, such as type and the parent. The following table shows the details of the data available in this collection.
_id
A unique identifier for the summary entry.
String
“11/XE/SYNTHEA/ALLERGIES”
Name
The name of the entity.
String
"ALLERGIES"
Type
The type of the entity (for example, file, table).
String
"TABLE"
Parent
The parent entity identifier.
String
"11/XE/SYNTHEA"
Entities Temperature Count View
The Entities Temperature View contains entity details emphasizing the categorization of data based on its temperature, which often indicates the frequency of access or modification, including the number of files. The following table shows the details of the data available in this collection.
_id
A unique identifier for the temperature count entry.
String
"65d2f44dd30b49309488b9dd"
DataSourceFqdnId
A fully qualified domain name identifier for the data source.
String
"5"
Date
The date when the file count and temperature were recorded.
String
2024-02-19TO6:25:17.918+00:00
FileCount
The number of files associated with the specified temperature.
String
2
Temperature
The temperature category of the data (for example, unclassified, hot, warm, and cold).
String
"unclassified"
Entity Usage Statistic View
The Entity Usage Statistic View collection includes a range of usage metrics, such as the number of times an entity is read, written to, and altered, along with the timestamp. The following table shows the details of the data available in this collection.
_id
A unique identifier for the statistics view entry.
String
“11/XE/SYNTHEA/ALLERGIES”
PeriodStartTime
The start time of the entity's profiling period, in ISO format.
Timestamp
2023-12-14T15:00:00Z
PeriodEndTime
The end time of the entity's profiling period, in ISO format.
Timestamp
2023-12-14T15:05:00Z
EntityID
A unique identifier for the entity.
String
"12"
DatabaseName
The name of the database where the entity is located.
String
“Postgres”
SchemaName
The name of the schema within the database.
String
“Chinook”
TableName
The name of the table containing the entity.
String
“Album”
LastReadTime
The timestamp of the last read operation on the entity, in ISO format
Timestamp
2023-12-14T11:00:00Z
LastWriteTime
The timestamp of the last write operation on the entity, in ISO format.
Timestamp
2023-12-14T11:50:00Z
LastAlterTime
The timestamp of the last modification to the entity in ISO format.
Timestamp
2023-12-14T11:20:00Z
ReadCount
The total number of times the entity has been read.
Integer
120
WriteCount
The total number of times the entity has been written.
Integer
45
AlterCount
The total number of times the entity has been altered.
Integer
3
CollectionTime
The timestamp indicating when this data was collected, in ISO format.
Timestamp
2023-12-16T15:00:00Z
Terms View
The Terms View collection contains information of terms related to items. It gives a structured overview, linking terms to specific entities and domains. Each record uniquely identifies a term, its association with an entity, the domain it belongs to, and a unique term identifier, enabling a comprehensive semantic mapping of data assets. The following table shows the details of the data available in this collection.
_id
A unique identifier for the term entry.
String
"abc12345-d678-90ef-ghij-klmn01234567"
EntityId
A unique identifier for the associated entity.
String
"entity78901-2345-6789-abcd-ef0123456789"
TermName
The name of the term.
String
"Customer Satisfaction Index"
DomainId
An identifier for the domain the term belongs to.
String
"domain1234-5678-90ab-cdef-ghijklmnop"
TermId
A unique identifier for the term.
String
"term5678-9012-3456-7890-abcd12345678"
Pentaho Data Optimizer
If you have a license for Pentaho Data Optimizer, use Data Optimizer to inventory stored data, identify content, view usage, and tier files and objects into long term or deep archival storage. You can use rule-driven actions about data lifecycles to account for compliance, manage costs, and mitigate risks, using a set of convenient tools and self-service processes for sustainable improvements in data management.
Last updated
Was this helpful?