EDC - Overview - KT
Metadata management is the practice of capturing and maintaining details about the information assets
present in an enterprise. It enables business and IT users to realize the full potential of their
enterprise data assets by providing a unified view that includes technical metadata, business
context, user annotations, relationships, data quality and usage.
A fully built metadata management model gives a 360-degree view of how
the different systems in an organization are connected (data lineage).
• Technical Metadata
• Business Metadata
EDC Overview
Enterprise Data Catalog - Vision
[Figure: Enterprise Unified Metadata. EDC combines technical, business, usage and operational metadata collected from the sources below.]
• Informatica: PowerCenter, DQ, MDM, BDM, MM, BG, ILM, Axon, Informatica Cloud
• Business Intelligence: Tableau, IBM Cognos, SAP BusinessObjects, Microstrategy, OBIEE
• Documents: MS Excel, Adobe PDF, Flat File, MS Powerpoint, MS Word, Compressed Files
• Applications: SAP R/3, Salesforce
• Other sources: AWS S3 (CSV/XML/JSON), AWS Redshift, Azure DB, Azure ADLS, Blob, Oracle, Microsoft SSIS, Erwin Models, Custom Scanner
[Figure: Intelligent Data Platform]
• Enterprise Cloud Data Management: Cloud, Big Data, Real Time/Streaming, Traditional
• Products: Data Integration, Big Data Management, Cloud Data Management, Data Quality, Master Data Management, Data Security
• CLAIRE™ (Enterprise Unified Metadata Intelligence)
• Intelligent Data Platform: Monitor and Manage, Compute, Connectivity
Enterprise Data Catalog Powered by CLAIRE
• Broad Metadata Sources: Technical, Operational, Usage
• Business Context: Glossary, Policies, Process
• Wisdom of Crowd: Comments, Ratings, Behavior
• AI Curated Catalog: Structure Discovery, Profiling and Associations, Domain Discovery, Similarity Clustering, Recommendations
• Business & Crowd Sourced Curation: Business Glossary, Business Classifications, Annotations, Comments
• Knowledge Graph
• Self Service Analytics [Data Analysts, Data Scientists]
  • Google for enterprise data assets
  • Data Lineage, holistic relationship view
  • Trust with data profile
  • Access to data
• Data Governance [Data Stewards]
  • Associate Business glossary to technical objects
  • Verify business to technical lineage
  • Track key data elements compliance
• Data Asset Management [Architects, Developers]
  • Analyze column-level Lineage & Change Impact
  • View transformation Logic
  • Data asset and BI usage
Customer examples aligned to use cases
• Large Healthcare insurance company: Flipped the 80-20% to 40-60% for data analysts
• Large European bank: Became compliant for BCBS 239 and GDPR and saved millions in fines from ECB
• Large shipping company: Enterprise-scale metadata management allows them to effectively simplify data management
Unleash the power of data with Enterprise Data Catalog
Self Service Data Discovery
For Everyone
Semantic search
Search datasets using business terminology, synonyms,
across related objects and data flows. Example: when
you type “grade”, EDC can suggest “tier” aligning
with the terminology used in the organization.
Facets
Intelligent facets, based on the search results, allow users
to narrow the search to the data sets of interest. Facets
include both system and custom classifications.
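A minimal sketch of how synonym expansion and facet counts could fit together. The synonym table, the asset list and the function names are hypothetical and hand-written for illustration; this is not how EDC implements search.

```python
from collections import Counter

# Hypothetical synonym table; the slide's example maps "grade" to "tier".
SYNONYMS = {"grade": {"tier"}, "tier": {"grade"}}

# Hypothetical catalog entries: (asset name, resource type).
ASSETS = [
    ("CUSTOMER_TIER", "Oracle"),
    ("grade_history.csv", "HDFS"),
    ("ORDER_TOTAL", "Oracle"),
]


def expand(term):
    """Return the lower-cased search term plus its known synonyms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set())


def search(term, assets=ASSETS):
    """Return assets whose name contains the term or one of its synonyms."""
    terms = expand(term)
    return [a for a in assets if any(t in a[0].lower() for t in terms)]


def facets(results):
    """Count results per resource type so a user can narrow the search."""
    return Counter(resource for _, resource in results)


hits = search("grade")
print(hits)
print(facets(hits))
```

Searching for "grade" also returns CUSTOMER_TIER because of the synonym, and the facet counts show which resource types the hits come from.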
Relationship Discovery
@ scale
Data Lineage
Interactively trace data origin through business-friendly
summarized lineage views that highlight the end points
and all the complex details in between. A drill-down lineage
view expands any lineage path to show columns
and lineage diagram metrics.
Impact Analysis
Perform detailed impact analysis on upstream
and downstream data assets. See the impact across data
assets, resources and users.
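Lineage and impact analysis can be pictured as walks over a directed graph. The sketch below assumes a hypothetical in-memory edge map with made-up asset names; it illustrates the idea only and does not use the EDC lineage service.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that are loaded from it.
LINEAGE = {
    "orders.csv": ["STG_ORDERS"],
    "STG_ORDERS": ["DW_ORDERS"],
    "DW_ORDERS": ["Revenue Report", "Orders Dashboard"],
}


def downstream(asset, edges=LINEAGE):
    """Breadth-first walk listing every asset affected by a change to `asset`."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(asset, edges=LINEAGE):
    """Invert the edges and reuse the same walk to trace the origins of `asset`."""
    inverted = {}
    for src, targets in edges.items():
        for tgt in targets:
            inverted.setdefault(tgt, []).append(src)
    return downstream(asset, inverted)


print(downstream("STG_ORDERS"))
print(upstream("Revenue Report"))
```

The downstream walk answers "what breaks if this changes?", while the upstream walk answers "where did this data come from?".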
Understand Business Context
Across Conceptual, Logical, Physical Data Models
• Conceptual (business concepts): Revenue. Revenue is the amount of money that a company receives
during a specific period, including discounts and deductions for returned merchandise.
• Logical (datasets/attributes): Sales. The logical model may use the term “sales” for the same entity.
• Physical (tables/columns/reports): The entity may be implemented/used as multiple physical data and reporting assets:
Sales2017.csv, S_TABLE17, P&LReport2017
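The three levels can be sketched as a small data structure that links a business concept to its logical entities and physical assets. The class and function names are hypothetical; only the Revenue, Sales and asset names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalEntity:
    """Logical level: a dataset or attribute name used in the logical model."""
    name: str
    physical_assets: list = field(default_factory=list)


@dataclass
class BusinessConcept:
    """Conceptual level: a business term with its definition."""
    name: str
    definition: str
    logical_entities: list = field(default_factory=list)


sales = LogicalEntity("Sales", ["Sales2017.csv", "S_TABLE17", "P&LReport2017"])
revenue = BusinessConcept(
    "Revenue",
    "Amount of money that a company receives during a specific period.",
    [sales],
)


def physical_assets_for(concept):
    """Resolve a business concept down to every physical asset that implements it."""
    return [
        asset
        for entity in concept.logical_entities
        for asset in entity.physical_assets
    ]


print(physical_assets_for(revenue))
```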
AI enabled Cataloging
Finding the Data That Matters
Data Problems
• Data Curation
• Data Discovery
• Data Exploration
Data Catalog is the Foundation for Any Data-Driven Organization
• Data Analyst
• Data Scientist
• Data Steward
• IT/Data Architect
Machine Learning for
- Semantic & Relationship Inference
- Assisted Curation
- Recommendation in Consumption
Data Domain Discovery
• Smart Domains
• A user might tag a column containing customer names as “CustomerNames”.
The system learns from the association and tags other columns in the enterprise that
contain similar data.
• Expansion to metadata similarity: column names, data patterns
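A toy sketch of tag propagation by data-pattern similarity. It assumes a simple "shape" encoding (letters become A, digits become 9, repeats collapsed) and a Jaccard threshold; both are illustrative stand-ins, not the machine learning that EDC actually uses, and all column names are hypothetical.

```python
from itertools import groupby


def shape(value):
    """Reduce a value to its data pattern, e.g. 'John Smith' -> 'A A'."""
    symbols = ("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)
    # Collapse runs of the same symbol so word and number lengths do not matter.
    return "".join(key for key, _ in groupby(symbols))


def pattern_similarity(col_a, col_b):
    """Jaccard similarity between the sets of value patterns of two columns."""
    a = {shape(v) for v in col_a}
    b = {shape(v) for v in col_b}
    return len(a & b) / len(a | b) if a | b else 0.0


def propagate_tag(tagged_values, candidates, threshold=0.5):
    """Return the candidate columns similar enough to inherit the user's tag."""
    return [
        name
        for name, values in candidates.items()
        if pattern_similarity(tagged_values, values) >= threshold
    ]


# Column the user tagged as "CustomerNames" (hypothetical sample values).
tagged = ["John Smith", "Ann Lee"]
# Other columns in the enterprise (hypothetical).
others = {
    "CUST_NM": ["Mary Jones", "Li Wei"],
    "ORDER_DT": ["2017-01-05", "2017-02-11"],
}
print(propagate_tag(tagged, others))
```

A real system would combine several signals, such as the column-name and data-pattern similarity mentioned above, rather than rely on a single pattern match.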
Entity Recognition using Composite Domains
• Composite Domains for automatic
classification and identification of
entities like Customers, Orders, Patients
and other datasets
• Entity Recognition is used by search,
facets, classifications and business
glossary recommendations
• Identify entities within and across
tables and files with support for both
structured and unstructured sources
• Includes both rule-based and smart
domains
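One way to picture a composite domain is as a set of data domains that must appear together in a table or file. The sketch below assumes hypothetical domain names and a simple coverage rule; the actual composite domain rules in EDC may be richer.

```python
# Hypothetical composite domains: entity -> data domains it is made of.
COMPOSITE_DOMAINS = {
    "Customer": {"CustomerName", "Email", "Phone"},
    "Order": {"OrderId", "OrderDate", "Amount"},
}


def recognize_entities(column_domains, composites=COMPOSITE_DOMAINS, min_coverage=1.0):
    """Return the entities whose member domains are covered by the table's columns.

    column_domains maps each column name to the data domain discovered for it.
    """
    found = set(column_domains.values())
    return sorted(
        entity
        for entity, required in composites.items()
        if len(required & found) / len(required) >= min_coverage
    )


# Hypothetical result of data domain discovery on one table.
table = {
    "CUST_NM": "CustomerName",
    "EMAIL_ADDR": "Email",
    "PH_NO": "Phone",
    "CREATED": "OrderDate",
}
print(recognize_entities(table))
```

Lowering `min_coverage` lets a table match an entity even when some member domains are missing, which trades precision for recall.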
Domain Discovery from Unstructured Files
[Figure: a passage from “Chapter 2: The Science of Deduction” from “A Study in Scarlet” by Arthur Conan Doyle, annotated with the data domains discovered in it: Date (1), Location (3), Person (6), ..]
Architecture
Connectivity
Different Types of Metadata Scanners
• Supported/Native scanners
• Partner scanners
• Custom scanners
Supported/Native Scanners (in 10.2.1)
• Big Data
  • Hive (CDH /HW /MapR /EMR /HDInsights)
  • HDFS Files (CSV/XML/JSON/Avro/Parquet)
  • Cloudera Navigator
  • Hortonworks Atlas
• File systems
  • Unix/windows
  • Sharepoint
  • Onedrive
  • Azure Blob Store
  • Azure ADLS
• Informatica
  • Informatica Powercenter
  • Informatica BDM
  • Informatica Cloud
  • Informatica Data Integration Hub
  • Informatica Axon Glossary
• Database
  • Oracle
  • DB2
  • DB2 for z/OS
  • SQL Server
  • Sybase
  • JDBC
  • Teradata
  • Netezza
  • Azure SQL DB
  • Azure SQL DW
• Cloud
  • Amazon S3 (CSV/XML/JSON)
  • Amazon Redshift
• Business Intelligence
  • IBM Cognos
  • SAP Business Objects
  • Tableau
  • Microstrategy
  • OBIEE
  • QlikView
• Applications
  • SAP R/3
  • Salesforce
• Other ETL
  • Scripts (SQL, BTEQ, HiveQL)
  • Microsoft SSIS
• Modelling
  • ERWIN
Supported Unstructured Formats
Supported Source Types: HDFS, Amazon S3, Azure WASB, ADLS, Windows, Linux,
SharePoint, Onedrive
Partner scanners overview
• Partner scanners are specialized scanners with
advanced features that allow EDC to better support
certain use cases
• Stored procedure code parsing
• Complex application metadata extraction (e.g. SAP,
Siebel, MS Dynamics CRM, …)
Partner Scanners
• Silwood
Silwood specializes in ERP and CRM application level metadata. They do so by reverse engineering
application metadata and configuration instead of scanning the raw table metadata.
Note: Does not support data profiling and domain discovery.
  • SAP including:
    • SAP ERP
    • SAP BW
    • SAP BW/4HANA (planned-March 2019)
    • SAP BW (Bex) Queries (planned - early 2019)
    • SAP BW Extractor (planned-2019)
    • SAP S/4HANA
  • Oracle eBusiness Suite
  • Siebel
  • PeopleSoft
  • JD Edwards EnterpriseOne
  • Microsoft Dynamics AX
  • Microsoft Dynamics CRM
• Manta
Manta scanner parses SQL Script & Stored Procedure to provide column level data lineage
and transformation logic for the databases listed below. Manta is good at Teradata
BTEQ script parsing.
Note: Does not support data profiling and domain discovery.
  • Database SQL script and Stored Procedures: Teradata, Oracle, MS SQL Server, IBM DB2,
  IBM Netezza, SAP/Sybase ASE, PostgreSQL, HIVEQL, AWS Redshift, Greenplum, Impala, Hive.
• Compact BI
CompactBI Metadex scanner parses SQL Script & Stored Procedure to provide column level
data lineage and transformation logic for the databases listed below PLUS SAS, COBOL, JCL,
SSIS, SSRS, OWB, ODI and Composite.
Note: Does not support data profiling and domain discovery.
  • Database SQL script and Stored Procedures: Teradata, Oracle, MS SQL Server, MySQL,
  IBM DB2, IBM Netezza, IBM Informix, SAP/Sybase ASE.
  • Statistical Tools: SAS
  • Mainframe: COBOL, JCL
  • ETL Tools: Microsoft SSIS (detailed lineage) 2005, 2008, 2012, Oracle
  Warehouse Builder, Oracle Data Integrator (ODI)
  • BI Tools: Microsoft SSRS 2008 or newer
Custom Scanners
• Model any data source within the catalog
• Create and register custom models through an open
framework
• Custom models can include classes, attributes and
relationships to model any kind of data source
Models and custom resource type
Models
Delimited Files
Model details
• Class: A class represents similar types of objects
that a data source includes. For example, a
relational data source can include classes such as
schemas, views, and tables.
• Attribute: An attribute represents the properties
of a fact and the links for the fact in the data
source.
• Attribute Options: Searchable, Facetable, Boost,
Multivalued, Suggestable, Sortable, Visible
• Association: An association represents the
relationship between classes in the data source.
• Association Kind: An association kind represents the
type of relationship between classes in the data
source. For example, a parent-child inheritance
relationship.
[Figure: Simple BI Model. Classes: Report, Sheet, Metric, linked by a parent-child association kind.]
Attributes
• ReportPath: Path of the Report
• URL: URL of the sheet
• Type: Data Type of a Metric: String, Integer, Date etc.
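The model concepts above (class, attribute, attribute options, association, association kind) can be sketched as plain data classes. All type and method names here are hypothetical, and the "ParentChild" kind string is an assumption; an actual EDC model is defined in an XML file, which this sketch does not reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """A property of a class, with two of the attribute options as flags."""
    name: str
    data_type: str = "String"
    searchable: bool = False
    facetable: bool = False


@dataclass
class ModelClass:
    """A class groups similar objects of a data source, e.g. Report or Sheet."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class Association:
    """A relationship between two classes; `kind` is the association kind."""
    kind: str
    from_class: str
    to_class: str


@dataclass
class Model:
    name: str
    classes: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_class(self, cls):
        self.classes[cls.name] = cls

    def associate(self, kind, from_class, to_class):
        # Reject associations that point at classes the model does not define.
        for name in (from_class, to_class):
            if name not in self.classes:
                raise KeyError(f"unknown class: {name}")
        self.associations.append(Association(kind, from_class, to_class))

    def children_of(self, class_name):
        """List the classes reachable through a parent-child association."""
        return [
            a.to_class
            for a in self.associations
            if a.kind == "ParentChild" and a.from_class == class_name
        ]


# The Simple BI Model from the slide: Report -> Sheet -> Metric.
model = Model("SimpleBI")
model.add_class(ModelClass("Report", [Attribute("ReportPath", searchable=True)]))
model.add_class(ModelClass("Sheet", [Attribute("URL")]))
model.add_class(ModelClass("Metric", [Attribute("Type", facetable=True)]))
model.associate("ParentChild", "Report", "Sheet")
model.associate("ParentChild", "Sheet", "Metric")
print(model.children_of("Report"))
```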
Core package
• All classes derive from IClass
• All common attributes are assigned to IClass
• All new classes will be contained in the Resource class
• Defines a common data container with
  • Data Sources
  • Data Sets
  • Data Elements
Relational database package
• Database
• Schema
• Tables
• Views
• Synonym
• Column
• View Column
Other definitions in core package
• Data Types
• String, Boolean, Integer, Decimal, Date, RichText …
• KeyValue, CSV, Path, AttributeDataType
• Generic attributes
• Name, Description, Created By, Created Time, Last Modified, Last Scan Date
Resource types
Custom Scanner Framework: Concept Overview
• Custom Resource Type: A logical container of models that represents
the resource type registered with the catalog.
• Model: A model represents the structure and properties defined for
metadata ingested into the catalog. You must define the model as an
XML file. Metadata ingested includes classes, associations, and
attributes.
• Attribute: An attribute represents the properties of a fact and the links
for the fact in the data source.
• Attribute Options: Searchable, Facetable, Boost, Multivalued, Suggestable,
Sortable, Visible
• Association: An association represents the relationship between
classes in the data source.
• Association Kind: An association kind represents the type of relationship
between classes in the data source. For example, a parent-child inheritance
relationship.
[Figure: Simple BI Resource Type containing the Simple BI Model. Classes: Report, Sheet, Metric, linked by a parent-child association kind. Attributes: ReportPath (path of the report), URL (URL of the sheet), Type (data type of a metric: String, Integer, Date etc.).]
Custom Scanner Example
Representing lineage with code language program
Custom Attributes
System and Custom Attributes
Why Custom Attributes?
Custom Attribute Types
• Boolean
• Date
• Decimal
• Integer
• String
• User
• Business Classifications*
REST API
Why?
• Open Platform
• Automated Curation
• Pervasive Data Governance
Open REST API
• REST APIs for directly accessing
metadata services to create custom
analytics and metadata powered
applications
Accessing the API
• Use the following URL to view the list of REST APIs exposed:
http(s)://<cataloghost>:<port>/access/
• Use the following URL format to call the REST APIs:
http(s)://<cataloghost>:<port>/access/2/catalog/models/<REST API>
• The Java client and its source are available here:
  • Java client: <cataloghost>:<port>/access/2/files/client.jar
  • Source files and Javadocs: <cataloghost>:<port>/access/2/files/client-sources.jar
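A small sketch that assembles the URL formats listed above. The host name, port and the endpoint name `exampleEndpoint` are hypothetical placeholders; substitute the values of your own catalog and a REST API that your EDC version actually exposes.

```python
from urllib.parse import quote


def _scheme(secure):
    """The slides allow both http and https; default to https here."""
    return "https" if secure else "http"


def api_list_url(host, port, secure=True):
    """URL that shows the list of exposed REST APIs."""
    return f"{_scheme(secure)}://{host}:{port}/access/"


def api_call_url(host, port, rest_api, secure=True):
    """URL for calling one REST API, following the format on the slide."""
    return f"{_scheme(secure)}://{host}:{port}/access/2/catalog/models/{quote(rest_api)}"


def client_jar_url(host, port, secure=True):
    """Download location of the Java client jar."""
    return f"{_scheme(secure)}://{host}:{port}/access/2/files/client.jar"


# Hypothetical host, port and endpoint name, for illustration only.
print(api_list_url("catalog.example.com", 9085))
print(api_call_url("catalog.example.com", 9085, "exampleEndpoint"))
print(client_jar_url("catalog.example.com", 9085, secure=False))
```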
General guidelines
• Avoid running one large query at once. Always use a pageSize no larger than a few
thousand and iterate over the pages.
• Common class types to use:
  • Table: com.infa.ldm.relational.Table
  • PC Mapping: com.infa.ldm.etl.pc.Mapping
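The paging guideline can be sketched as a generator that requests one page at a time. `fetch_page` is a hypothetical callable standing in for the real REST call, and the `offset` parameter name is an assumption; only `pageSize` is named in the guideline above.

```python
def iterate_pages(fetch_page, page_size=1000):
    """Yield every item by requesting one bounded page at a time."""
    offset = 0
    while True:
        items = fetch_page(offset=offset, page_size=page_size)
        if not items:
            break
        yield from items
        if len(items) < page_size:
            # A short page means the server has no more results.
            break
        offset += page_size


# Stand-in for the catalog: 2500 hypothetical table names.
DATA = [f"table_{i}" for i in range(2500)]
calls = []


def fake_fetch(offset, page_size):
    """Hypothetical replacement for a REST call that returns one page."""
    calls.append(offset)
    return DATA[offset:offset + page_size]


all_items = list(iterate_pages(fake_fetch, page_size=1000))
print(len(all_items), calls)
```

Because the generator is lazy, a caller can stop early without fetching the remaining pages.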
Useful links