
Business Intelligence

Unit – III

Data Provisioning
Data Provisioning

• Data Provisioning is the process of creating, preparing, and enabling a network to provide data from a source to the destination system of the user

• Or

• It is the process of making data available in an orderly and secure way to users, application
developers, and applications that need it

• Or
• It involves the procedures for making data and resources available to the system and users

• [Note : In general, Provisioning means making something available, or “providing”]


Data Provisioning

• Data provisioning constitutes the prerequisite for any Business Intelligence (BI) project

• Clearly, without any data basis there will be no analysis at all, and without data of good quality, the quality of the analysis will be low as well

• However, data collection, extraction, and integration are often the most complex and
expensive tasks in a BI project
Cont.

• Due to big data, more and more data is available that holds potential for valuable analysis

• Possible sources for large data volumes are e-business and social network data

• On top of data volume, data variety and data velocity pose additional challenges

• For example,
• Data velocity might demand data extraction in very short time frames or even in a continuous way
• Data variety addresses the fact that data from different sources might be structured,
semi-structured, or even unstructured while being available in different formats
Cont.

• Data volume, variety, and velocity are referred to as the three Vs in big data, where additional Vs, such as data veracity, i.e., the trustworthiness of the data, might also be an important issue

• Specifically, it has to be acknowledged that real-world data is dirty
• Therefore, data quality constitutes a crucial challenge as well
Note: The exclusive or-join (XOR) differs in that it has multiple predecessor worksteps and a single successor workstep.
Exclusive split gateways are used when a process splits into several paths and only one of them can be active.
Exclusive gateways

• Exclusive gateways (often called XOR) are used when a process splits into several paths and only one of them can be active

• i.e., for a given instance of the process, only one of the paths can be taken
Simple example 1

• A simple example is a sales process where the customer either accepts the offer as it is, rejects it fully, or asks for negotiations

• In this case, the name of the gateway could be “Customer decision?” and the conditions for the process paths outgoing from this gateway: “offer accepted”, “offer rejected”, “negotiations requested”.
Below you can see an example of a process with an XOR gateway in two variants: with a closing gateway and without
Example 2

• The gateway decision is based on the invoice amount

• Only two flows have conditions on them, going to CFO Approval and Finance Director Approval

• The last sequence flow has no condition and will be selected by default if the other conditional flows evaluate to false
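• As a hedged illustration, the XOR semantics of this gateway can be sketched in Python: exactly one outgoing path is chosen, and the unconditional flow serves as the default. The invoice-amount thresholds and the name of the default path are assumptions, not taken from the example.

```python
# Hedged sketch of XOR (exclusive) gateway semantics: exactly one outgoing
# path is taken; the flow without a condition acts as the default branch.
# Threshold values and the default path name are illustrative assumptions.

def route_invoice(invoice_amount: float) -> str:
    if invoice_amount > 100_000:          # conditional flow 1 (assumed threshold)
        return "CFO Approval"
    elif invoice_amount > 10_000:         # conditional flow 2 (assumed threshold)
        return "Finance Director Approval"
    else:                                 # default sequence flow (no condition)
        return "Standard Approval"

print(route_invoice(250_000))  # -> CFO Approval
print(route_invoice(5_000))    # -> Standard Approval
```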
• In this unit, we develop an understanding of how to
– collect and describe data
– extract data from various sources
– find an adequate target format for subsequent analysis, as well as clean and integrate data
• in such a way that the analysis goals can be achieved
• The corresponding data provisioning process has already been depicted
• Note that data cleaning and integration might be done in an iterative way depending on
whether the data quality has reached a sufficient level
Data Collection and Description

• What is often underestimated is the effort of collecting the data for the BI project, including the identification and selection of relevant data sources

• It also involves the clarification of issues such as data access (particularly if external data sources are to be accessed)
• As shown in the following Fig. , in some projects, the data sources might even become the driver
for later analysis, i.e., the analysis goals might partly depend on which data sources are available
Use Case 1: Patient treatment processes − EBMC2 project

• EBMC2 project co-funded by University of Vienna and Medical University of Vienna


• Formalizing medical guidelines for skin cancer treatment
• Mining and analysis of real-world treatment processes
• Selected Key Performance Indicators:

– Survival time
– Health status of a specific group of persons
– Cost effectiveness of certain health policies
• Balance between:
• What data sources do we need (to fulfil a certain analysis goal)?
• Which data sources are actually available and accessible (privacy, data ownership, data access costs, etc.)?

• Available data sources:

• the Stage IV Medical Database (S4MDB) that contains the treatment data of skin cancer patients of stage IV
• and
• the GAP-DRG database that stores medical billing data in Austria

• The S4MDB database was available in Excel format, whereas GAP-DRG is a relational database

• Since they were designed for different purposes, the sources differ in their conceptual schemas, possibly leading to integration challenges
Cont.
• An important task of the data collection phase is the description of the data sources
• The following figure illustrates the description of the EBMC2 data sources in a schematic way
• As the analysis goals primarily focus on patient treatment, the chosen target format should enable process-oriented analysis

• Additional questions, such as survival analysis, can be tackled by data mining techniques that demand associated target formats

• After selecting and/or collecting the data sources, the data has to be extracted


Data Extraction

• After selecting the relevant data sources and describing them, the next step is to extract the relevant data from these sources
• Typically, data extraction is part of the so-called extraction-transformation-load (ETL)
process
3.3.1 Extraction-Transformation-Load (ETL) Process

• Data are transformed and integrated from heterogeneous data sources for
analysis purposes

• The overall procedure is referred to as the ETL process


• At first, data stored within different source systems, such as databases, legacy systems, or XML
documents, is extracted into the so-called staging area that provides different services for
transforming and cleaning data

• From the staging area, the cleaned and integrated data can be loaded into a presentation area
where users can perform analyses

• In data extraction, one should first of all think about the question of when data is extracted

• In some settings, it might be sufficient to have a partial or complete snapshot of the source data

• But if data is extracted from operational systems, this is not sufficient, as the data might quickly become outdated

• In order to stay informed about updates within the sources, typically, the sources are monitored for
updates
• [Note: A database snapshot is a read-only, static, transactionally consistent view of the source database

• A delta load implies that the entire data of a relational database table is not repeatedly extracted, but only the new data that has been added to the table since the last load. With a delta load, you process only the data that needs to be processed, either new data or changed data]
• Connected with the question of when to extract is the question of what to extract

• It can be overly complex to extract all data within a snapshot each time the data source undergoes
an update

• Instead, it is often desired to only extract the delta compared to the last data snapshot within the
staging area
A common example

• A common example would be the list of participants from a conference registration system that was extracted at time T1 as a complete snapshot

• T1 = < Smith, Brown, Mayer, Jaeger>

• Assume that there is an update at time T2 and the new snapshot would be

• T2 = < Smith, Brown, Jaeger, Jones >

• Instead of the complete participant list at T2, only the new participant, i.e., Jones, is transferred, and Mayer is recorded as no longer attending the conference
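• A minimal Python sketch of this differential computation, using a set difference between the two snapshots:

```python
# Hedged sketch: computing a differential snapshot (delta) between two
# full extracts of the participant list.
t1 = {"Smith", "Brown", "Mayer", "Jaeger"}   # snapshot at time T1
t2 = {"Smith", "Brown", "Jaeger", "Jones"}   # snapshot at time T2

added   = t2 - t1   # new participants to transfer       -> {"Jones"}
removed = t1 - t2   # participants no longer attending   -> {"Mayer"}

print("delta to load:", added)
print("no longer attending:", removed)
```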
• Connected with the question of what, i.e., snapshot versus delta, is the question of how
to extract?

• If the data source is a database system, for example, it typically offers many ways for extracting both snapshots and deltas

• Snapshots can be extracted by SQL queries


• Deltas can be extracted by utilizing database logs

• However, not all sources offer such convenient support.
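• As a hedged sketch of snapshot versus delta extraction from a relational source (the table name, the last_modified column, and the use of sqlite3 are assumptions chosen for illustration; real sources may instead expose database logs or change-data-capture interfaces):

```python
# Hedged sketch of snapshot vs. delta extraction from a relational source.
# Table name "participants" and the "last_modified" column are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO participants VALUES (?, ?)",
                 [("Smith", "2024-01-01"), ("Jones", "2024-02-15")])

# Full snapshot extraction
snapshot = conn.execute("SELECT name FROM participants").fetchall()

# Delta extraction: only rows changed since the last load
last_load = "2024-02-01"
delta = conn.execute(
    "SELECT name FROM participants WHERE last_modified > ?",
    (last_load,)).fetchall()

print("snapshot:", snapshot)   # all rows
print("delta:", delta)         # only rows modified after last_load
```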


Cont.

• As legacy systems are still present in enterprises, and often do not offer any support for
data extraction, it is, in some cases, only possible to take snapshots of the data followed
by calculating differential snapshots between the last and the current version of the
data

• The efficient calculation of such differential snapshots has been tackled by different approaches, e.g., the window algorithm

• After extracting data updates from the sources, the data is to be transferred to the
staging area for data cleaning and integration purposes
Cont.

• For transferring massive data sets, load techniques such as bulk loaders are offered
• for example, the Oracle SQL*Loader
• Loading a large set of data should only be done for cleaned and trusted data

• Thus, loading is typically applied after cleaning and integrating the extracted data within the staging area
3.3.2 Big Data

• The extensive use of social media (e.g., half a billion tweets a day on Twitter, more than one billion active users on Facebook), sensor applications (e.g., for measuring health parameters or environmental conditions), as well as the immediate provision of possibly large result data by modern search engines, has led to a massive increase in produced and potentially interesting-to-analyze data

• Key challenges in the context of handling big data are data volume, data velocity, data variety, and data veracity
Cont.

• In short, data volume refers to processing huge amounts of data


• data velocity to the frequency with which new data enters the integration and analysis
process
• data variety to the diversity of data,
• data veracity to the trustworthiness of the data

• We will have to analyze the challenges that these four “Vs” pose for data extraction and integration
Challenge 1 : Data volume

• Currently, NoSQL databases are on the rise for tackling the challenge of data volume
• Such databases can be categorized into
• key value stores
• graph databases
• XML databases

• These NoSQL databases are expected to dissolve the potential restrictions imposed by
relational databases such as the demand for a schema, ACID transactions, and
consistency

• In turn, NoSQL databases are schema-free and eventually consistent


• Key-value storage systems are adopted by various enterprises
• The basic data model consists of key-value pairs

• The following statement

• store[:Key1] = “Some string”

• would create an object with key Key1 and value “Some string”
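• The statement above is written in a Ruby-like notation; an equivalent hedged sketch in Python, with a plain dictionary standing in for a key-value store, would be:

```python
# Hedged sketch: a plain dict mimics the put/get interface of a key-value store.
store = {}
store["Key1"] = "Some string"   # create an object with key Key1
print(store["Key1"])            # retrieve the value by its key
```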

• In graph databases, the data are represented as a graph structure

• The basic structure is a graph G := (V, E), where V is a set of vertices and E is a set of edges
• Queries are defined on the basis of a graph query language (GQL), for example, as in the sketch below:
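• A hedged sketch of such a graph query in Python, using networkx as an assumed helper library; the vertices and edges are illustrative only:

```python
# Hedged sketch: representing a graph G = (V, E) and running a simple query.
# networkx is an assumed library, chosen only for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Patient_X", "Hospital_A"),
                  ("Patient_Y", "Hospital_A"),
                  ("Hospital_A", "Vienna")])

# Query: which vertices are directly connected to Hospital_A?
print(list(G.neighbors("Hospital_A")))
```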
Challenge 2 : Data velocity

• Data velocity refers to the frequency of updates from the data sources

• Whereas some years ago updates in the data sources of a data warehouse could be treated in a periodic way
• nowadays, continuous data updates have become a frequent scenario, e.g., in the case of sensor data (also referred to as streaming data)
Challenge 3: Data variety

• Data variety reflects the increasing number of different data formats that might have to
be integrated or, put in a more specific way, structured, semi-structured, and
unstructured data

• The integration becomes particularly difficult if schema information is missing

• Approximately half of the XML documents available on the Web do not refer to a
schema

• But different techniques were introduced that derive schema information from a set of
XML documents in a tree-based or text-based manner
Cont.

• Tree-based approaches, for example, take the structure tree of each XML
document and aggregate them according to certain rules
• The underlying schema can be derived based on the aggregated tree
• Several XML tools and editors offer to automatically derive the underlying XML
schema from a set of XML documents
• e.g., Liquid Studio
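• As a hedged toy illustration of the tree-based idea (not the algorithm of any particular tool), the element structures of several XML documents can be aggregated into a parent-to-children map from which a schema could be derived:

```python
# Hedged sketch: aggregating the structure trees of several XML documents.
# The aggregated parent -> children map approximates the underlying schema.
import xml.etree.ElementTree as ET
from collections import defaultdict

docs = [
    "<patient><name>Smith</name><age>54</age></patient>",
    "<patient><name>Jones</name><insurance>1234</insurance></patient>",
]

structure = defaultdict(set)
for doc in docs:
    root = ET.fromstring(doc)
    for parent in root.iter():        # root and all descendant elements
        for child in parent:          # direct children of each element
            structure[parent.tag].add(child.tag)

print(dict(structure))   # e.g. {'patient': {'name', 'age', 'insurance'}}
```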

• Another possible solution to address a mix of structured and unstructured data is to use an XML database such as BaseX
• BaseX is able to store structured, semi-structured, and unstructured data, and hence it actually addresses the two challenges of data volume and variety
Challenge 4 : Data veracity

• Data veracity connects big data with the question where the data comes from, e.g., data
that is stored in a cloud

• In such settings, we have to think about how we can ensure trust in the data we collect
and want to analyze

• In this context, techniques for auditing data by, for example, a third-party auditor have been proposed
Summary on Data Extraction

• Data extraction is more than just grabbing data from a source


• Data availability, ownership:
• In order to obtain data from a certain domain, it is often indispensable to develop some understanding of the domain
• Soft skills help to communicate with the domain experts in order to overcome possible resistance
• Further, legal knowledge can be of advantage when it comes to data privacy questions.
Cont.

• Heterogeneous data sources: Many tools exist that offer a bunch of adapters and
extractors to facilitate data extraction
• However, the basic design of the data extraction process remains a manual task

• Big data: Volume, variety, velocity and veracity of the data are challenges
• However, it is most crucial to define what to analyze in a huge bulk of data, i.e., asking
the right questions
From Transactional Data Towards Analytical Data

• The extracted data is to be cleaned and integrated


• However, before data integration can take place, we have to decide in which format the
data should be integrated (integration format)
• Sometimes, it becomes necessary to provide additional analytical formats that are
based on the integration format
• Different analyses require different analytical formats
• In other words, the choice of the analytical format depends on the analysis questions
and the key performance indicators
• The choice of the integration format depends on the results of the data extraction step
• The following figure depicts a selection of basic integration and analysis formats that can be
transformed into different analytical models
• The columns are labeled with formats including a top-level distinction
between structured and unstructured formats
• Structured formats can be further distinguished into
–flat formats, such as relational tables
–hierarchical formats, such as XML
–hybrid formats, such as XES

• The rows are labeled with formats including a distinction into table and log
formats
• Table formats can be further distinguished into flat and multidimensional
formats
• Structured data formats can be divided into flat, hierarchical, and hybrid
structures
• Typical flat formats comprise
–Relational tables
–Comma-separated values (CSV)
–Excel files
• A prominent hierarchically structured format is XML, since the structure of XML documents can be mapped onto a tree structure
• XES (eXtensible Event Stream) is also an XML-based format but forms an extra column for log data
• The different analysis (and possibly integration) formats denoted by the row
labels range from flat table structures over multidimensional structures to log
structures
• Given two data sources A and B, the following transformations are possible:

• A contains B: The format of A equals the format of B, and B is a subset of A

• A generates B:
• By aggregation:
• A flat or multidimensional,
• B multidimensional
• aggregation function, e.g., SUM, AVG;
• aggregation refers to aggregating the data
• By mapping:
• dimensionality might be changed;
• describes a set of attribute correspondences between two schemas based on which one schema
can be mapped onto the other

• By transformation:
• A and B of any format;
• Transformation changes a schema (format) in order to obtain a desired target schema (format)
3.4.1 Table Formats and Online Analytical Processing (OLAP)

• Consider the prominent example of
– time describing the turnover, or
– time describing the temperature measured on a certain day

Example (Health Care)


The following figure is an excerpt of medical data structured in a multidimensional way

Interesting measurements corresponding to key performance indicators, such as cost-effectiveness of treatment or survival time, are reflected by the facts billing sum or number of patients

As for many applications, a time-related view on the facts is interesting for this example as well
• This results in the dimension time, which enables the aggregation along the dimension attributes
day, month, quarter, and year

• Figure 3.10 shows an example report based on the application of aggregation functions to the
multidimensional structure presented in Fig. 3.9

• In this report, the billing sums per patient cohort, therapy, and year have been aggregated (a sketch of such an aggregation follows below)

• Aggregation is one ingredient of online analytical processing (OLAP) operations that are typically
used to analyze multidimensional data

• [note: In medicine, a cohort is a group that is part of a clinical trial or study and is observed over a
period of time.]
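• A hedged sketch of such an aggregation with pandas; the column names and values mirror the example but are assumptions:

```python
# Hedged sketch: aggregating billing sums per cohort, therapy, and year,
# analogous to the report described above. Column names are assumptions.
import pandas as pd

facts = pd.DataFrame({
    "cohort":      ["A", "A", "B", "B"],
    "therapy":     ["Surgery", "Surgery", "Radiation", "Radiation"],
    "year":        [2006, 2006, 2006, 2007],
    "billing_sum": [1200.0, 800.0, 500.0, 700.0],
})

report = (facts.groupby(["cohort", "therapy", "year"], as_index=False)
               ["billing_sum"].sum())
print(report)
```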
• Other OLAP operations include:
• Roll up
• Drill down
• Slicing
• Dicing

• A set of OLAP techniques exists that change the granularity and/or dimensionality of the data cube in order to gain additional insights

• Roll up: the operation Roll up generates new information by aggregating data along a dimension; the number of dimensions is not changed

• Drill down: the navigation from aggregated data back to detailed data is achieved by the operation Drill down

• One example would be to roll up the dimension Time from Months to Quarters, or to drill down in the opposite direction.
• Operations that do not change the granularity of the data, but rather the dimensionality or the size of the cube, are Slice and Dice

• Slice generates individual views by cutting “slices” from the data cube

• In general, this operation reduces the number of dimensions

• Example:
• one could be interested in the average duration of all process instances for a certain patient X in
the current year

• To answer this question, a slice of the data would be generated by the following query:

• SELECT ... WHERE year = '2006' AND patient = 'X'


• Dicing means that the dimensionality of the cube is not reduced, but the cube itself is
reduced by cutting out a partial cube

• An example question answered by dicing could regard the number of patient treatments
within a certain time frame, and the solution can be accomplished by range queries
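• Continuing the hedged pandas sketch (the fact table, column names, and values are assumptions), slicing and dicing can be illustrated as follows:

```python
# Hedged sketch: slice vs. dice on an assumed fact table.
import pandas as pd

facts = pd.DataFrame({
    "patient":     ["X", "X", "Y"],
    "year":        [2006, 2007, 2006],
    "billing_sum": [1200.0, 300.0, 500.0],
})

# Slice: fix dimension values, as in the query above, reducing dimensionality
slice_ = facts[(facts["year"] == 2006) & (facts["patient"] == "X")]

# Dice: cut out a partial cube via a range query, keeping all dimensions
dice = facts[facts["year"].between(2006, 2007)]

print(slice_)
print(dice)
```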
3.4.2 Log Formats

• Consider the following question


• Was the diagnosis always followed by an excision?
• [Note :

• Excision :
• The removal of tissue from the body using a scalpel (a sharp knife), laser, or other
cutting tool
• A surgical excision is usually done to remove a lump or other suspicious growth.]
• As a consequence, data formats are required to store temporal
information on process executions in an explicit way

• This is achieved through log formats


• For process-oriented analysis, it is crucial that the log contains information on the
event order

• Either it is based on time stamps connected with the events or the assumption holds
that the order of the events within the log reflects the order in which they occurred
during process execution

• Further, the events of different process executions must be distinguishable, e.g., an event for executing activity PerformSurgery for patient X (executed within process instance X) must be distinguishable from the event for executing activity PerformSurgery for patient Y (executed within process instance Y)

• This requires some sort of instance ID within each event


3.4.2 Log Formats

• In general, a log can be defined as a collection of events recorded during runtime of an


information system

• Currently, there are two process-oriented log formats that are predominantly used for
process analysis, i.e., Mining XML (MXML) and eXtensible Event Stream (XES)

• For illustration, see the example process depicted in the following Fig. 3.14
• It expresses a parallel execution of the process activities PerformSurgery and ExaminePatient
• Consequently, two possible execution logs can be produced by this process schema
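• As a hedged sketch, one of these traces could be serialized as a minimal XES-style log in Python; only a few commonly used XES attributes (concept:name, time:timestamp) are shown, and the timestamps are illustrative:

```python
# Hedged sketch: emitting one possible execution trace of the example
# process as a minimal XES-style log (only a subset of XES is shown).
import xml.etree.ElementTree as ET

log = ET.Element("log")
trace = ET.SubElement(log, "trace")
ET.SubElement(trace, "string", key="concept:name", value="patient X")

events = [("PerformSurgery", "2013-05-02T10:00:00"),   # illustrative timestamps
          ("ExaminePatient", "2013-05-02T11:30:00")]
for activity, ts in events:
    event = ET.SubElement(trace, "event")
    ET.SubElement(event, "string", key="concept:name", value=activity)
    ET.SubElement(event, "date", key="time:timestamp", value=ts)

print(ET.tostring(log, encoding="unicode"))
```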
Import of Process-Oriented Data into Log Formats

• Process-oriented data might be logged by different information systems,


• e.g., ERP systems or workflow systems

• Several tools have been developed that support the import of this source data into
target formats such as MXML and XES

• There are also tools (Nitro and Disco) that enable the import of CSV data

• Nitro and Disco provide support to define mappings between the columns of the
source Excel or CSV file and the target log format such as MXML and XES
Summary: From Transactional Towards Analytical Data

• An important decision within the data provisioning process is to choose adequate integration and analysis formats

• In addition to more traditional integration and analysis formats, such as flat or multidimensional tables, process-oriented formats, i.e., log formats, are becoming increasingly important
Schema and Data Integration

• Once the appropriate integration format is chosen, the possibly heterogeneous data
sources must be integrated within this target format

• Hence, in addition to the mappings, aggregations, and transformations between different formats, we have to address the challenges of schema integration and data integration
• Schema integration means to unite the participating schemata S1, …, Sn into one integrated schema Sint
• where Sint should meet the following criteria

– Completeness
– Validity
– No contradictions
– Minimality
– Understandability
• Completeness: No information loss with respect to the entities contained within the schemata Si, i = 1, …, n

• Validity: Sint should reflect a real-world scenario that can be seen as a union of the real-world scenarios reflected by Si, i = 1, …, n

• No contradictions within Sint

• Minimality: no redundancies, i.e., every entity contained in Si, i = 1, …, n should occur just once in Sint

• Understandability: the transformation and integration steps should be documented in order to enable the traceability and reproducibility of the result
Why is building Sint often difficult?

• The schemata S1, …, Sn might stem from possibly heterogeneous data sources
• This often results in a number of conflicts among the participating schemata.

• Conflicts
– Semantic and Descriptive conflicts
– Heterogeneity conflicts
– Structural conflicts
Semantic conflicts

• Semantic and descriptive conflicts refer to the way people perceive the set of real-world objects to model
• More precisely, a semantic conflict arises if modelers choose different entities to describe the same real-world scenario
• Example
• Assume two schemata A and B describing patient administration
• In schema A, the entity Patient describes patients in a hospital scenario
• whereas in schema B, the entity StatPatient describes patients in a hospital scenario
Descriptive conflicts

• Descriptive conflicts happen if modelers use the same entities but different attribute
sets to describe these entities
• For example,
• in schema A, a patient is described by Name and Age, whereas
• in schema B, the patient is described by Social Insurance Number and BirthDate

• Note: The main difference between an entity and an attribute is that an entity is a real-world object, and attributes describe the properties of an entity
Heterogeneity conflicts

• Heterogeneity conflicts occur if the schemata are defined using different formats, e.g., Excel (structured) versus XML (semi-structured)
Structural conflicts

• Structural conflicts arise if different constructs are used despite choosing a common
format

• An example would be if in XML schema A, patient age is modeled as an attribute
• whereas in XML schema B, patient age is modeled as an element
• In particular, semantic and descriptive conflicts are hard to solve without any further
knowledge

• For semantic and descriptive conflicts, ontologies can be used to resolve conflicts
such as homonyms and synonyms (e.g., by using vehicle instead of car)

• [homonyms: each of two or more words having the same spelling or pronunciation but different meanings and origins]
schema integration process

• The schema integration process comprises the following phases:

• (a) pre-integration
• (b) schema comparison
• (c) schema conforming
• (d) schema merging and restructuring
(a) pre-integration

• Within the pre-integration phase, the participating schemata S1, …, Sn are analyzed for their format and structure as well as their metadata
• Another part of the pre-integration phase is to determine the integration strategy, particularly if more than two schemata are to be integrated
• Different strategies are available
• One example is the “one-shot” strategy that aims at integrating S1, …, Sn at once
• Binary strategies would integrate S1, …, Sn in a pairwise manner
schema comparison and schema conforming

• Schema mapping and matching are techniques that are applied during schema
comparison and schema conforming

• A matching takes “two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other”

• A mapping m relates schema S with attribute set AS to schema T with attribute set AT
• The idea is to build the cross product AS × AT between all attributes in AS and AT
• Open-source or research tools for schema mappings are COMA 3.0 and Protégé
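• A hedged sketch of the cross-product idea: every attribute pair in AS × AT is scored with a simple name similarity and the best pairs are proposed as candidate correspondences. The attribute names are assumptions, and real matchers such as COMA 3.0 use far richer techniques:

```python
# Hedged sketch: naive schema matching over the cross product AS x AT
# using string similarity of attribute names. Attribute names are assumed.
from difflib import SequenceMatcher
from itertools import product

AS = ["Name", "Age", "SocialInsuranceNumber"]
AT = ["PatientName", "BirthDate", "InsuranceNo"]

candidates = []
for a, t in product(AS, AT):                      # cross product AS x AT
    score = SequenceMatcher(None, a.lower(), t.lower()).ratio()
    candidates.append((a, t, round(score, 2)))

# Propose the highest-scoring correspondences first
for a, t, score in sorted(candidates, key=lambda c: -c[2])[:3]:
    print(f"{a} <-> {t} (similarity {score})")
```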
Heterogeneity conflicts

• Assume that the participating schemata S1, S2, …, Sn are of different formats
• In order to integrate two heterogeneous schemata, a common data model has to be chosen, mostly one of the two existing ones

• Example:
• Bringing together relational and XML-based schemata
challenge

• Integrating a hierarchical data model (XML) with a flat one (relational)


• The database table (b) is to be integrated with the XML document (a) reflected by its underlying
structure tree
• Note that the structures of both the XML document and the relational table are already aligned

• The challenge is to construct the hierarchical structure of the XML document based on the flat
relational data
• This task is referred to as structuring, i.e., it has to be decided which of the database attributes have to be converted into elements (or attributes) and in which hierarchical order

• The database (here AllParticipants) is mapped onto the root element;


• then the table constitutes the child element of the root, and the attributes are mapped to leaf
elements

• The SQL/XML standard offers functions that enable the extraction of relational data as XML documents.
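• A hedged Python sketch of the structuring step for the AllParticipants example (column names and values are assumptions; in practice the SQL/XML functions mentioned above can produce such output directly in the database):

```python
# Hedged sketch: structuring flat relational rows as a hierarchical XML
# document (root -> row elements -> leaf elements). Column names assumed.
import xml.etree.ElementTree as ET

rows = [{"Name": "Smith", "Fee": "100"},
        {"Name": "Jones", "Fee": "120"}]

root = ET.Element("AllParticipants")          # database mapped to the root
for row in rows:
    participant = ET.SubElement(root, "Participant")     # table row as child
    for column, value in row.items():
        ET.SubElement(participant, column).text = value  # attributes as leaves

print(ET.tostring(root, encoding="unicode"))
```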
3.5.2 Data Integration and Data Quality

• After schema integration, there might still be inconsistencies at the data level,
necessitating data integration actions
• For example
• Take the two XML documents
• Apparently, they adhere to the same XML schema but show conflicts at the data level
• for example,
• encoding names in a different manner or displaying fees in a different currency
• In order to integrate both XML files into one, the data conflicts have to be detected and
resolved accordingly
• This process is also referred to as data fusion,
• i.e., data from different sources referring to the same real-world object is integrated to represent that real-world object in a consistent and clean way
Cont.

• the following problems at data level might occur:


• data errors (e.g., typos), different formats, inconsistencies (e.g., the zip code does not
match the city), and duplicates
• One measure to improve data quality is data cleaning, which aims at “detecting and
removing errors and inconsistencies from data”
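• A hedged sketch of such data-fusion steps, normalizing name encodings and converting fees to a single currency before records are merged; the exchange rate and field names are assumptions:

```python
# Hedged sketch: resolving data-level conflicts before fusing two records
# that describe the same real-world participant. Values are illustrative.
EUR_PER_USD = 0.9   # assumed exchange rate for the example

def normalize(record):
    rec = dict(record)
    # unify name encoding: "Last, First" -> "First Last"
    if "," in rec["name"]:
        last, first = [p.strip() for p in rec["name"].split(",", 1)]
        rec["name"] = f"{first} {last}"
    # unify currency: convert USD fees to EUR
    if rec["currency"] == "USD":
        rec["fee"] = round(rec["fee"] * EUR_PER_USD, 2)
        rec["currency"] = "EUR"
    return rec

a = {"name": "Smith, John", "fee": 100.0, "currency": "USD"}
b = {"name": "John Smith",  "fee": 90.0,  "currency": "EUR"}

print(normalize(a))   # now directly comparable with record b
print(normalize(b))
```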
3.5.3 Linked Data and Data Mashups

• So far, we have discussed a process that


–extracts data from sources
–cleans them
–integrates them outside the sources
• A totally different idea is to use data sources that are
integrated already, making extraction, cleaning, and integration
tasks obsolete
• This is the idea of linked data
Cont.

• The principle here is to link data sources that are available on the Web and by doing so,
provide an integrated data source for later queries or analysis

• The difference between the classical ETL process and the idea of linked data is illustrated
in the following figure
Data Mashups
• Data mashup refers to combining data from different data sources into a single
application
• The integration of heterogeneous business data from different sources into one place
gives a unified overview of the business processes
• Data mashing is also referred to as enterprise or business mashing.
• Typical sources for data mashups are Web data and Web services that act as the
components of the mashup
[Note :
• Power users are popularly known for owning and using high-end computers with
sophisticated applications and service suites. For example, software developers,
graphic designers, animators and audio mixers require advanced computer hardware
and software applications for routine processes]
Summary

• Data provisioning constitutes an important prerequisite for any BI project


• Hence, it is recommended to plan for sufficient time and manpower
