EDC - Overview - KT

Metadata management provides a unified view of enterprise data including technical metadata, business context, user annotations, relationships, data quality and usage. It gives a 360-degree view of how systems are connected through data lineage. Metadata can be classified as technical (details about the data) or business (contextual information like policies and processes).

Metadata Management

Metadata management is the practice of capturing and maintaining details of the information assets
present in an enterprise. It enables Business and IT users to realize the full potential of their
enterprise data assets by providing a unified view that includes technical metadata, business
context, user annotations, relationships, data quality and usage.

A fully built metadata management model gives a 360-degree view of how
different systems in an organization are connected (Data Lineage).

Metadata can be broadly classified as:

• Technical Metadata
• Business Metadata
EDC Overview
Enterprise Data Catalog - Vision

Enterprise Data Catalog enables Business and IT users to realize
the full potential of their enterprise data assets by providing a
unified metadata view that includes technical metadata,
business context, user annotations, relationships, data quality
and usage.

[Diagram: Enterprise Data Catalog at the center of Enterprise Unified Metadata (Technical, Business, Usage, Operational), connected to the following sources]

• Informatica: PowerCenter | DQ | MDM | BDM | MM | BG | ILM | Axon | Informatica Cloud
• Documents: MS Excel | Adobe PDF | Flat File | MS Powerpoint | MS Word | Compressed Files
• Business Intelligence: Tableau | IBM Cognos | SAP BusinessObjects | Microstrategy | OBIEE
• Big Data: HIVE (Cloudera, HortonWorks, MapR, IBM BigInsights, EMR, HDI) | HDFS (CSV, XML, JSON) | Cloudera Navigator | Atlas
• Databases: Oracle | DB2 | DB2 for z/OS | SQL Server | Sybase | Teradata | Netezza | JDBC
• Cloud Platforms: AWS S3 (CSV/XML/JSON) | AWS Redshift | Azure DB | Azure ADLS, Blob
• Applications: SAP R/3 | Salesforce | Oracle
• Other: Microsoft SSIS | Erwin Models | Custom Scanner
[Diagram: Intelligent Data Platform layers]

• Enterprise Cloud Data Management: Cloud | Big Data | Real Time/Streaming | Traditional
• Solutions: Customer 360 | Product 360 | Supplier 360 | Reference 360 | Intelligent Data Lake | Enterprise Information Catalog | Data Governance | Secure@Source
• Products: Data Integration | Big Data Management | Cloud Data Management | Data Quality | Master Data Management | Data Security
• Intelligent Data Platform: CLAIRE™ (Enterprise Unified Metadata Intelligence) | Monitor and Manage | Compute | Connectivity
Enterprise Data Catalog Powered by

Broad Metadata Sources
• Technical
• Operational
• Usage

Business Context
• Glossary
• Policies
• Process

Wisdom of Crowd
• Comments
• Ratings
• Behavior

Enterprise Data Catalog
• AI Curated Catalog: Structure Discovery, Profiling and Domain Discovery, Similarity Clustering, Recommendations
• Business & Crowd Sourced Curation: Business Glossary Associations, Business Classifications, Annotations, Comments
• Knowledge Graph

Self Service Analytics [Data Analysts, Data Scientists]
• Google for enterprise data assets
• Data Lineage, holistic relationship view
• Trust with data profile
• Access to data

Data Governance [Data Stewards]
• Associate Business glossary to technical objects
• Verify business to technical lineage
• Track key data elements compliance

Data Asset Management [Architects, Developers]
• Analyze column-level Lineage & Change Impact
• View transformation Logic
• Data asset and BI usage

Customer examples aligned to use cases

• Self-service Analytics (large healthcare insurance company): Flipped the 80-20% to 40-60% for data analysts
• Data Governance (large European bank): Became compliant for BCBS 239 and GDPR and saved millions in fines from ECB
• Data Asset Analysis (large shipping company): Enterprise-scale metadata management allows them to effectively simplify data management

Unleash the power of data with Enterprise Data Catalog

• Search & Discovery
• Broad Connectivity
• Open APIs

AI-powered Intelligent Data Catalog

Self Service Data Discovery
For Everyone

Semantic search
Search datasets using business terminology, synonyms,
across related objects and data flows. Example: when
you type “grade”, EDC can suggest “tier” aligning
with the terminology used in the organization.

Data profiling and domain discovery


View data profiling statistics alongside data assets
to understand data quality before using data for analysis.
Profiling statistics include value distributions, patterns,
and data type and data domain inference. Example:
you can search for “tables with emails” to get all tables
that have email information, regardless of the underlying
column names.
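
The "tables with emails" example can be pictured as rule-based domain inference over profiled values. The following is a minimal sketch, not EDC's actual implementation: the regular expression, the 0.8 conformance threshold, and the sample column names are all assumptions made for illustration.

```python
import re

# Hypothetical rule for an "Email" data domain (an assumption, not EDC's rule).
EMAIL_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def conformance(values, rule):
    """Return the fraction of non-empty values that match the rule."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if rule.match(v))
    return hits / len(non_empty)


def infer_domain(values, rule=EMAIL_RULE, threshold=0.8):
    """Tag a column with the domain when enough of its values conform."""
    return conformance(values, rule) >= threshold


# Sample profiled columns; the column names carry no hint about their content.
columns = {
    "CONTACT_1": ["ann@example.com", "bob@example.org", "", "cy@example.net"],
    "CUST_NM": ["Ann Lee", "Bob Ray", "Cy Poe"],
}
email_columns = [name for name, vals in columns.items() if infer_domain(vals)]
```

Because the decision is based on values rather than names, a column called CONTACT_1 is still found by a search for email data.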

Facets
Intelligent facets, based on the search results, allow users
to narrow the search to the data sets of interest. Facets
include both system and custom classifications.

Relationship Discovery
@ scale

Holistic Relationships View


Get a holistic view of data in a knowledge graph that lets
you quickly search, discover, and understand enterprise
data and meaningful data relationships. Automatically
discover related data sets, technical, business, semantic
and usage based relationships.

Data Lineage
Interactively trace data origin through business-friendly
summarized lineage views that highlight the end points
and all the complex details in between. A drill-down lineage
view expands any lineage path to show columns
and lineage diagram metrics.

Impact Analysis
Perform detailed impact analysis on upstream
and downstream data assets. See impact across data
asset, resources and users.
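
Lineage and impact analysis can be thought of as walks over a directed graph of data flows. The sketch below assumes a hypothetical in-memory edge map (the asset names are invented) and uses a breadth-first traversal; it illustrates the idea only and says nothing about how EDC stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage edges: data flows from the key to each listed target.
LINEAGE = {
    "crm.customers": ["stg.customers"],
    "stg.customers": ["dw.dim_customer"],
    "dw.dim_customer": ["bi.sales_report", "bi.churn_dashboard"],
    "erp.orders": ["dw.fact_sales"],
    "dw.fact_sales": ["bi.sales_report"],
}


def downstream(asset, edges):
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen, queue = set(), deque([asset])
    while queue:
        current = queue.popleft()
        for target in edges.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def reverse(edges):
    """Flip every edge so the same walk can be used for upstream analysis."""
    flipped = {}
    for source, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def upstream(asset, edges):
    """Return every asset that feeds, directly or indirectly, into `asset`."""
    return downstream(asset, reverse(edges))
```

Calling `downstream("stg.customers", LINEAGE)` answers "what breaks if this table changes", while `upstream("bi.sales_report", LINEAGE)` answers "where does this report's data come from".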

Understand Business Context
Across Conceptual, Logical, Physical Data Models

• Conceptual (business concepts): Revenue
  Revenue (Business Concept): Revenue is the amount of money that a company receives during a specific period, including discounts and deductions for returned merchandise.
• Logical (datasets/attributes): Sales
  Sales (Logical Entity): The Logical Model may use the term “sales” for the same entity.
• Physical (tables/columns/reports): REV_TABLE, VBAK, SalesFACT, Sales2017.csv, S_TABLE17, P&LReport2017
  The logical entity may be implemented or used as multiple physical data and reporting assets.

AI enabled Cataloging

Data Problems: Finding the Data That Matters
• Data Curation
• Data Discovery
• Data Exploration

How CLAIRE™ Helps

• CLAIRE™ automatically tags data for understanding
• CLAIRE™ intelligently discovers data domains (name, phone, email…) and data entities (purchase order, health record…)
• CLAIRE™ intelligently recommends other data sets that are similar to what users are working on.

Data Catalog is the Foundation for Any Data-Driven Organization

• Enabler: IT/Data Architect
• Curator: Data Steward
• Consumer: Data Analyst, Data Scientist

Enterprise Data Catalog
• Machine Learning based data catalog
• “Google” search for enterprise data asset

Machine Learning for
- Semantic & Relationship Inference
- Assisted Curation
- Recommendation in Consumption
Data Domain Discovery

• Smart Domains
• A user might tag a column containing customer names as “CustomerNames”.
The system learns from the association and tags other columns in the enterprise that
contain similar data.
• Expansion to metadata similarity: column names, data patterns
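
One simple way to picture the Smart Domains idea is tag propagation by value overlap. The sketch below is an assumption-heavy illustration: it uses Jaccard similarity over distinct values and a 0.5 threshold, neither of which is stated in the source, and the column names are invented. The product's actual learning method is not described here.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def propagate_tag(tagged_values, candidates, threshold=0.5):
    """Suggest the tag for every candidate column whose values look similar."""
    return sorted(
        name
        for name, values in candidates.items()
        if jaccard(tagged_values, values) >= threshold
    )


# Values of the column a user tagged as "CustomerNames".
tagged = ["Ann Lee", "Bob Ray", "Cy Poe", "Di Fox"]

# Other columns in the enterprise (hypothetical).
candidates = {
    "ORDERS.CUST": ["Ann Lee", "Bob Ray", "Cy Poe"],
    "ORDERS.AMOUNT": ["10.50", "99.00", "7.25"],
}
suggested = propagate_tag(tagged, candidates)
```

Extending the comparison from values to column names or data patterns, as the last bullet mentions, would only change what is fed into the similarity function.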
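Entity Recognition using Composite Domains

[Diagram: composite domain hierarchy for an ORDER entity]
• ORDER: Date, Customer, Product, Amount
• Customer: First Name, Last Name, Address
• Address: Street, City, State, Zip
• Product: Product ID, Product Name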

• Composite Domains for automatic
classification and identification of
entities like Customers, Orders, Patients
and other datasets
• Entity Recognition is used by search,
facets, classifications and business
glossary recommendations
• Identify entities within and across
tables and files with support for both
structured and unstructured sources
• Includes both rule based and smart
domains
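
The composite-domain idea can be sketched as a recursive check: an entity is recognized when each of its parts is either a discovered data domain or itself a recognized composite. The hierarchy below mirrors the ORDER example from the slide; the function names and the matching rule (all parts required) are assumptions for illustration, not the product's algorithm.

```python
# Composite domains taken from the ORDER example; each entity lists its parts.
COMPOSITES = {
    "Address": {"Street", "City", "State", "Zip"},
    "Customer": {"First Name", "Last Name", "Address"},
    "Product": {"Product ID", "Product Name"},
    "Order": {"Date", "Customer", "Product", "Amount"},
}


def recognized(entity, found, composites=COMPOSITES):
    """True when `entity` was discovered directly or all of its parts are recognized."""
    if entity in found:
        return True
    parts = composites.get(entity)
    if parts is None:
        return False
    return all(recognized(part, found, composites) for part in parts)


def entities_in(found, composites=COMPOSITES):
    """List every composite entity supported by the discovered domains."""
    return sorted(name for name in composites if recognized(name, found, composites))


# Data domains discovered on the columns of one table (hypothetical input).
discovered = {
    "Date", "Amount", "First Name", "Last Name",
    "Street", "City", "State", "Zip", "Product ID", "Product Name",
}
```

With the full set of discovered domains every entity up to Order is recognized; removing a single leaf such as Zip removes Address, and with it Customer and Order.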
Domain Discovery from Unstructured Files

[Figure: passage from “Chapter 2: The Science of Deduction” from “A Study in Scarlet” by Arthur Conan Doyle, with discovered domains: Date (1), Location (3), Person (6), ..]

• Domain Discovery from unstructured data
• Support for rule based (data) domains only
• New Composite Domains also supported for Entity Recognition
More Intelligence..
• Machine Learning for Curation
  • Business Glossary Recommendations
  • Structure Detection for Files
  • Join Inference
  • Identifying Key Data Elements
  • Identifying Duplicates
  • Inferred Lineage
• Machine Learning for Consumption
  • Search: Asset Type Detection
  • Synonyms Usage
  • Search Rankings: PageRank
  • Search Rankings: Usage
  • Related Assets in Search
  • Collaborative Filtering for Datasets
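
The slide names PageRank as one search-ranking signal without saying how it is applied. As a rough illustration only, the sketch below runs the standard power-iteration form of PageRank over a hypothetical "asset references asset" graph; the graph, the damping factor of 0.85, and the iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of referenced nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / count
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: reports referencing the tables they read from.
links = {
    "report_a": ["dim_customer"],
    "report_b": ["dim_customer"],
    "report_c": ["dim_customer", "fact_sales"],
}
ranks = pagerank(links)
top_asset = max(ranks, key=ranks.get)
```

In this toy graph the most referenced table ends up with the highest score, which is the intuition behind boosting heavily used assets in search results.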

Architecture
Connectivity
Different Types of Metadata Scanners

Supported/Native scanners
• Most common, high demand systems and strategic ones
• Resource Types: BI (SAP BO, Cognos), Big Data (Cloudera), RDBMS (Oracle, DB2, SQL Server), Cloud (Azure, AWS), INFA (PowerCenter, IICS), plus many more
• Fully supported, maintained by Informatica

Partner scanners
• Lower demand, specialized skills, time to market sensitive
• Resource Types: RDBMS Stored procedure parsing, 3rd party ETL, ERP Applications, SAS, COBOL/JCL
• Built by partners

Custom scanners
• Long-tail systems, one-offs
• Can be built for many types of metadata sources.
• Resource Types: Access, MicroFocus, custom applications, etc.
• To be built by customer.

Supported/Native Scanners (in 10.2.1)
• Big Data: Hive (CDH/HW/MapR/EMR/HDInsights), HDFS Files (CSV/XML/JSON/Avro/Parquet), Cloudera Navigator, Hortonworks Atlas
• Database: Oracle, DB2, DB2 for z/OS, SQL Server, Sybase, JDBC, Teradata, Netezza, Azure SQL DB, Azure SQL DW
• Business Intelligence: IBM Cognos, SAP Business Objects, Tableau, Microstrategy, OBIEE, QlikView
• File systems: Unix/Windows, Sharepoint, Onedrive
• Applications: SAP R/3, Salesforce
• Informatica: Informatica Powercenter, Informatica BDM, Informatica Cloud, Informatica Data Integration Hub, Informatica Axon Glossary, Informatica Business Glossary
• Other ETL: Scripts (SQL, BTEQ, HiveQL), Microsoft SSIS
• Cloud: Amazon S3 (CSV/XML/JSON), Amazon Redshift, Azure Blob Store, Azure ADLS
• Modelling: ERWIN

Supported Unstructured Formats

Supported Source Types: HDFS, Amazon S3, Azure WASB, ADLS, Windows, Linux, SharePoint, Onedrive

Metadata and domain discovery only

• Microsoft Excel: .xlsx, .xls, .xlam, .xlc, .xlsb, .xlsm, .xltm, .xltx, .xlw
• Microsoft Word: .docx, .doc, .docm, .dot, .dotm, .dotx
• Microsoft PowerPoint: .pptx, .ppt, .pptm, .pot, .potm, .potx, .pps, .ppsm, .ppsx
• Adobe Pdf: .pdf
• Text Files
• Compressed - "gz", "tgz", "emz", "sz", "xz", "z", "7z", "ace", "arj", "cab", "rar", "gtar", "zip", "bz2", "tbz2", "boz"
• Email - "eml", "emlx", "mime", "mht", "mhtml", "msg", "pst", "ost", "nsf", "ncf"
• Webpages - "chm", "oth", "xhtml", "xht", "mhtml", "html", "htm", "ihtml"
• Extended formats - other file types that are supported by Apache Tika, for example mp3, mp4, bmp, jpg (these are cataloged but will mostly not have any textual data to process and auto-discover)

Partner scanners overview
• Partner scanners are specialized scanners with advanced features that allow EDC to better support certain use cases
  • Stored procedure code parsing
  • Complex application metadata extraction (e.g. SAP, Siebel, MS Dynamics CRM, …)

• Partner scanners are implemented through custom scanner capabilities
• Partners deliver full support for metadata extraction from the source and for custom resource extract generation.

Partner Scanners

• Silwood
Silwood specializes in ERP and CRM application level metadata. They do so by reverse engineering application metadata and configuration instead of scanning the raw table metadata. Note: Does not support data profiling and domain discovery.
  • SAP including:
    • SAP ERP
    • SAP BW
    • SAP BW/4HANA (planned-March 2019)
    • SAP BW (Bex) Queries (planned - early 2019)
    • SAP BW Extractor (planned-2019)
    • SAP S/4HANA
  • Oracle eBusiness Suite
  • Siebel
  • PeopleSoft
  • JD Edwards EnterpriseOne
  • Salesforce (including Force apps)
  • Microsoft Dynamics AX
  • Microsoft Dynamics CRM

• Manta
The Manta scanner parses SQL scripts and stored procedures to provide column level data lineage and transformation logic for the databases listed below. Manta is good at Teradata BTEQ script parsing.
  • Database SQL script and Stored Procedures: Teradata, Oracle, MS SQL Server, IBM DB2, IBM Netezza, SAP/Sybase ASE, PostgreSQL, HIVEQL, AWS Redshift, Greenplum, Impala, Hive.
  Note: Does not support data profiling and domain discovery.

• Compact BI
The CompactBI Metadex scanner parses SQL scripts and stored procedures to provide column level data lineage and transformation logic for the databases listed below, plus SAS, COBOL, JCL, SSIS, SSRS, OWB, ODI and Composite.
  • Database SQL script and Stored Procedures: Teradata, Oracle, MS SQL Server, MySQL, IBM DB2, IBM Netezza, IBM Informix, SAP/Sybase ASE.
  Note: Does not support data profiling and domain discovery.
  • Statistical Tools: SAS
  • Mainframe: COBOL, JCL
  • ETL Tools: Microsoft SSIS (detailed lineage) 2005, 2008, 2012, Oracle Warehouse Builder, Oracle Data Integrator (ODI)
  • BI Tools: Microsoft SSRS 2008 or newer
  • Data Virtualization: Composite

Custom Scanners
• Model any data source within the catalog
• Create and register custom models through an open
framework
• Custom models can include classes, attributes and
relationships to model any kind of data source

• Ingest metadata including object names, relationships
and attributes through delimited files
• Key Functionality like search, lineage, relationships,
custom attributes, Business Glossary Associations
etc. are all available for custom objects
• Custom Lineage scanner can be used to add lineage
links for custom objects
• Custom objects can be linked to any object in the
catalog including those that come from native
scanners
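
To make the "delimited files" bullet concrete, here is a minimal sketch that writes one file of objects and one file of links between them. The column headers, the class and association names under `com.example.simplebi`, and the identity format are all hypothetical placeholders chosen for illustration; the exact layout expected by a custom resource should be taken from the product documentation.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize a header and rows into delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical custom objects: (class, identity, name).
objects = [
    ("com.example.simplebi.Report", "SalesReport", "Sales Report"),
    ("com.example.simplebi.Sheet", "SalesReport/Summary", "Summary"),
]

# Hypothetical relationships: (association, from identity, to identity).
links = [
    ("com.example.simplebi.ReportToSheet", "SalesReport", "SalesReport/Summary"),
]

# Header names are illustrative assumptions, not the documented format.
objects_csv = to_csv(["class", "identity", "name"], objects)
links_csv = to_csv(["association", "from_identity", "to_identity"], links)
```

Keeping objects and links in separate files mirrors the split between classes and associations in the model: objects are instances of classes, links are instances of associations.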

Models and custom resource type
Models

• Model: A model represents the structure and properties defined for metadata ingested into the catalog. A model defines classes, associations, and attributes. A model can extend or reuse definitions from another model.

[Diagram: the Core Package and the models built on it: Relational Data Source, ETL Core Model, Tableau Server, SAP Business Objects, File System, …; also shown: PowerCenter, CSV Files, Delimited Files]

Model details
• Class: A class represents similar types of objects that a data source includes. For example, a relational data source can include classes such as schemas, views, and tables.
• Attribute: An attribute represents the properties of a fact and the links for the fact in the data source.
  • Attribute Options: Searchable, Facetable, Boost, Multivalued, Suggestable, Sortable, Visible
• Association: An association represents the relationship between classes in the data source.
  • Association Kind: An association kind represents the type of relationship between classes in the data source. For example, a parent-child inheritance relationship.

[Diagram: Simple BI Model. Classes: Report, Sheet, Metric, linked by the Parent-Child association kind. Attributes:]
• ReportPath: Path of the Report
• URL: URL of the sheet
• Type: Data Type of a Metric (String, Integer, Date etc.)

Core package
• All classes derive from IClass
• All common attributes are assigned to IClass
• All new classes are contained in the Resource class
• Defines common data containers:
  • Data Sources
  • Data Sets
  • Data Elements

Relational database package

[Diagram: classes of the relational model, from container to element]
• Database
• Schema
• Tables, Views, Synonym
• Column, View Column

Other definitions in core package

• Data Types
• String, Boolean, Integer, Decimal, Date, RichText …
• KeyValue, CSV, Path, AttributeDataType

• Association Kinds & associations


• Parent-Child, Classification Kind, Data Flow, Data Map, Join, Lookup, PKFK, Related Kind, Synonym…

• Generic attributes
• Name, Description, Created By, Created Time, Last Modified, Last Scan Date

Resource types

• Resources are source system definitions
• Two resource types are available:
  • System resource type: a predefined resource type corresponding to a supported source system
  • Custom resource type: defines a way to ingest additional metadata
• Includes one or more models
• Defines the connection information to be published for linking with other resources

Custom Scanner Framework: Concept Overview
• Custom Resource Type: A logical container of models that represents the resource type registered with the catalog.
• Model: A model represents the structure and properties defined for metadata ingested into the catalog. You must define the model as an XML file. Metadata ingested includes classes, associations, and attributes.
• Class: A class represents similar types of objects that a data source includes. For example, a relational data source can include classes such as schemas, views, and tables.
• Attribute: An attribute represents the properties of a fact and the links for the fact in the data source.
  • Attribute Options: Searchable, Facetable, Boost, Multivalued, Suggestable, Sortable, Visible
• Association: An association represents the relationship between classes in the data source.
  • Association Kind: An association kind represents the type of relationship between classes in the data source. For example, a parent-child inheritance relationship.

[Diagram: Simple BI Resource Type containing the Simple BI Model. Classes: Report, Sheet, Metric, linked by the Parent-Child association kind. Attributes:]
• ReportPath: Path of the Report
• URL: URL of the sheet
• Type: Data Type of a Metric (String, Integer, Date etc.)
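
The relationship between models, classes, attributes, and associations can be sketched with a few plain data structures. This is only an in-memory illustration of the concepts using the Simple BI example; the source states that real models are defined as XML files, and every type and method name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    data_type: str = "String"
    searchable: bool = True   # one of the attribute options listed above
    facetable: bool = False


@dataclass
class ModelClass:
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class Association:
    name: str
    kind: str        # association kind, e.g. "Parent-Child"
    from_class: str
    to_class: str


@dataclass
class Model:
    name: str
    classes: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_class(self, cls):
        self.classes[cls.name] = cls

    def associate(self, name, kind, from_class, to_class):
        # Associations may only connect classes the model already defines.
        for end in (from_class, to_class):
            if end not in self.classes:
                raise KeyError(f"unknown class: {end}")
        self.associations.append(Association(name, kind, from_class, to_class))

    def children_of(self, parent):
        return [a.to_class for a in self.associations
                if a.kind == "Parent-Child" and a.from_class == parent]


simple_bi = Model("Simple BI Model")
simple_bi.add_class(ModelClass("Report", [Attribute("ReportPath")]))
simple_bi.add_class(ModelClass("Sheet", [Attribute("URL")]))
simple_bi.add_class(ModelClass("Metric", [Attribute("Type")]))
simple_bi.associate("ReportSheet", "Parent-Child", "Report", "Sheet")
simple_bi.associate("SheetMetric", "Parent-Child", "Sheet", "Metric")
```

A custom resource type would then act as the container that groups one or more such models.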

Custom Scanner Example
Representing lineage with code language program

Custom Attributes
System and Custom Attributes

• Attributes are metadata properties attached to data assets.


• Two types of attributes:
• System Attributes: System attributes are metadata properties extracted by metadata scanners
directly from the source.
• Custom Attributes: Custom Attributes are metadata properties created by users to capture additional
detail about data assets that are not covered by the system attributes
• Examples are: Data Owner, Data Center Location, Department etc.

Why Custom Attributes?

• Benefits of Data Classification:


• Good data classification ensures that important data assets are easy to find and retrieve
• Helps in identifying people in the organization responsible for data assets like data owners, data
stewards, data custodians, data trustees etc.
• Identifies location boundaries of data, such as data center locations, to determine which governance
policies apply to the data assets
• Better compliance and risk management for different data asset types

Custom Attribute Types

• Boolean
• Date
• Decimal
• Integer
• String
• User
• Business Classifications*

REST API
Why?

• Open Platform: No metadata lock-in; any metadata can be ingested and accessed from the repository
• Automated Curation: Programmatic curation of data assets to deal with metadata at scale
• Pervasive Data Governance: Analytics on the metadata repository; integrate search, lineage and asset relationship services with third party applications

Open REST API
• REST APIs for directly accessing
metadata services to create custom
analytics and metadata powered
applications

REST APIs include:
• Access
  • Search
  • Read Objects and Relationships
  • Read System and Custom Attributes
• Annotations
  • Create and populate Custom Attributes
  • Business Glossary Associations
• Control
  • Execute scanner jobs (from 10.2.1 onward)
Testing & using the API

• Use Swagger to discover endpoints


• Use a REST client to test your API calls; multiple Chrome plugins are available
  • Postman
  • Restlet client
  • …

• Use your language of choice to operationalize the REST API calls
  • Java
  • Python
  • Shell scripts

Accessing the API

• Use the following URL to view the list of REST APIs exposed:
http(s)://<cataloghost>:<port>/access/
• Use the following URL format to call the REST APIs:
http(s)://<cataloghost>:<port>/access/2/catalog/models/<REST API>
• Java client and source are available here:
• Java client: <cataloghost>:<port>/access/2/files/client.jar
• Source files and Javadocs: <cataloghost>:<port>/access/2/files/client-sources.jar
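
The URL patterns above can be assembled with a few small helpers. This sketch only builds strings and makes no network call; the host name `edc.example.com` and port `9085` in the usage note are placeholders, and authentication (which the slides do not cover) is left out.

```python
from urllib.parse import quote


def base_url(host, port, secure=True):
    """Root of the REST services: http(s)://<cataloghost>:<port>/access"""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}/access"


def models_url(host, port, rest_api, secure=True):
    """URL for a call under /access/2/catalog/models/<REST API>."""
    return f"{base_url(host, port, secure)}/2/catalog/models/{quote(rest_api)}"


def client_jar_url(host, port, secure=True):
    """Download location of the Java client jar."""
    return f"{base_url(host, port, secure)}/2/files/client.jar"
```

For example, `models_url("edc.example.com", 9085, "classes")` yields the URL that lists the class types known to the catalog, which can then be pasted into a REST client such as Postman.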

General guidelines

• Avoid running a large query all at once; always use a pageSize no larger than a few thousand and iterate.
• Common class types to use:
Table: com.infa.ldm.relational.Table
Column: com.infa.ldm.relational.Column
Data Domain: com.infa.ldm.profiling.DataDomain
PC Mapping: com.infa.ldm.etl.pc.Mapping
BDM Mapping: com.infa.ldm.bdm.platform.Mapping
BG Term: com.infa.ldm.bg.BGTerm
• More can be found with http(s)://<cataloghost>:<port>/access/2/catalog/models/classes

• Object IDs are built the following way:
• DB object: <resource_name>://<database_name>/<schema_name>/<table_name>/<column_name>
• custom attributes: com.infa.appmodels.ldm.LDM_<ID>
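
The two guidelines above, iterating in pages and building object IDs, can be sketched as follows. The paging helper takes a `fetch_page(offset, page_size)` callable, which is a hypothetical stand-in for whatever REST call returns one page of results; the use of an offset parameter and the sample resource, database, and table names are assumptions for illustration.

```python
def iterate_pages(fetch_page, page_size=1000):
    """Yield items page by page until a short or empty page is returned."""
    offset = 0
    while True:
        items = fetch_page(offset, page_size)
        yield from items
        if len(items) < page_size:
            break
        offset += page_size


def column_id(resource, database, schema, table, column):
    """Build <resource_name>://<database_name>/<schema_name>/<table_name>/<column_name>."""
    return f"{resource}://{database}/{schema}/{table}/{column}"


def split_object_id(object_id):
    """Split an object ID into its resource name and path segments."""
    resource, _, path = object_id.partition("://")
    return resource, path.split("/")


# Stand-in for the catalog: 25 column IDs served in slices instead of over REST.
CATALOG = [column_id("OracleProd", "ORCL", "SALES", "ORDERS", f"COL_{i}")
           for i in range(25)]


def fake_fetch(offset, page_size):
    return CATALOG[offset:offset + page_size]


all_ids = list(iterate_pages(fake_fetch, page_size=10))
```

In a real client `fake_fetch` would be replaced by a function that issues the HTTP request with the current offset and page size and returns the items from the response.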

Useful links

• Official REST API documentation:
• https://kb.informatica.com/proddocs/Product%20Documentation/6/IN_1021_EnterpriseDataCatalog[REST-API]Reference_en.pdf

• EDC REST API Sample repository
• https://github.com/Informatica-EIC/REST-API-Samples