
Business Intelligence

Unit – III

Data Provisioning
Data Provisioning

• Data Provisioning is the process of creating, preparing, and enabling a network to provide data from a source to the destination system of the user

• Or

• It is the process of making data available in an orderly and secure way to users, application
developers, and applications that need it

• Or
• It involves the procedures for making data and resources available to the system and users

• [Note : In general, Provisioning means making something available, or “providing”]


Data Provisioning

• Data provisioning constitutes the prerequisite for any Business Intelligence (BI) project

• Clearly, without any data basis there will be no analysis at all, and without data of good quality, the quality of the analysis will be low as well

• However, data collection, extraction, and integration are often the most complex and
expensive tasks in a BI project
Cont.

• Due to big data, more and more data is available that holds potential for valuable analysis

• Possible sources for large data volumes are e-business and social network data

• On top of data volume, data variety and data velocity pose additional challenges

• For example,
• Data velocity might demand data extraction in very short time frames or even in a continuous way
• Data variety addresses the fact that data from different sources might be structured,
semi-structured, or even unstructured while being available in different formats
Cont.

• Data volume, variety, and velocity are referred to as the three Vs in big data, where additional Vs, such as data veracity, i.e., the trustworthiness of the data, might also be an important issue

• Specifically, it has to be acknowledged that real-world data is dirty
• Therefore, data quality constitutes a crucial challenge as well
Note: The exclusive or-join (XOR) differs in that it has multiple predecessor worksteps and a single successor workstep.
Exclusive split gateways are used when a process splits into several paths and only one of them can be active.
Exclusive gateways

• Exclusive gateways (often called XOR) are used when a process splits into several paths and only one of them can be active

• i.e., for a given instance of the process, only one of the paths can be taken
Simple example 1

• A simple example is a sales process where the customer either accepts the offer as it is, rejects it fully, or asks for negotiations

• In this case, the name of the gateway could be “Customer decision?” and the conditions for the process paths outgoing from this gateway: “offer accepted”, “offer rejected”, “negotiations requested”.
Below you can see an example of a process with an XOR gateway in two variants: with a closing gateway and without
Example 2

• The gateway decision is based on the invoice amount

• Only two flows have conditions on them, going to CFO Approval and Finance Director Approval

• The last sequence flow has no condition and will be selected by default if the other conditional flows evaluate to false
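• As a hedged illustration, the XOR semantics of this gateway can be sketched in Python: exactly one outgoing path is chosen, and the unconditional flow serves as the default. The invoice-amount thresholds and the name of the default path are assumptions, not taken from the example.

```python
# Hedged sketch of XOR (exclusive) gateway semantics: exactly one outgoing
# path is taken; the flow without a condition acts as the default branch.
# Threshold values and the default path name are illustrative assumptions.

def route_invoice(invoice_amount: float) -> str:
    if invoice_amount > 100_000:          # conditional flow 1 (assumed threshold)
        return "CFO Approval"
    elif invoice_amount > 10_000:         # conditional flow 2 (assumed threshold)
        return "Finance Director Approval"
    else:                                 # default sequence flow (no condition)
        return "Standard Approval"

print(route_invoice(250_000))  # -> CFO Approval
print(route_invoice(5_000))    # -> Standard Approval
```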
• In this unit, we develop an understanding of how to
– collect and describe data
– extract data from various sources
– find an adequate target format for subsequent analysis, as well as clean and integrate data
• in such a way that the analysis goals can be achieved
• The corresponding data provisioning process has already been depicted
• Note that data cleaning and integration might be done in an iterative way depending on
whether the data quality has reached a sufficient level
Data Collection and Description

• What is often underestimated is the effort of collecting the data for the BI project, including the identification and selection of relevant data sources

• It also involves the clarification of issues such as data access (particularly if external data sources are to be accessed)
• As shown in the following Fig. , in some projects, the data sources might even become the driver
for later analysis, i.e., the analysis goals might partly depend on which data sources are available
Use Case 1: Patient treatment processes − EBMC2 project

• EBMC2 project co-funded by University of Vienna and Medical University of Vienna


• Formalizing medical guidelines for skin cancer treatment
• Mining and analysis of real-world treatment processes
• Selected Key Performance Indicators:

– Survival time
– Health status of a specific group of persons
– Cost effectiveness of certain health policies
• Balance between:
• What data sources do we need (to fulfil a certain analysis goal)?
• Which data sources are actually available and accessible (privacy, data ownership, data access costs, etc.)?

• Available data sources:

• the Stage IV Medical Database (S4MDB) that contains the treatment data of skin cancer patients of stage IV
• and
• the GAP-DRG database that stores medical billing data in Austria

• The S4MDB database was available in Excel format, whereas GAP-DRG is a relational database

• Since they were designed for different purposes, the sources differ in their conceptual schemas, possibly leading to integration challenges
Cont.
• An important task of the data collection phase is the description of the data sources
• The following figure illustrates the description of the EBMC2 data sources in a schematic way
• As the analysis goals primarily focus on patient treatment, the chosen target format should enable process-oriented analysis

• Additional questions, such as survival analysis, can be tackled by data mining techniques that demand associated target formats

• After selecting and/or collecting the data sources, the data has to be extracted


Data Extraction

• After selecting the relevant data sources and describing them, the next step is to extract the relevant data from these sources
• Typically, data extraction is part of the so-called extraction-transformation-load (ETL)
process
3.3.1 Extraction-Transformation-Load (ETL) Process

• Data are transformed and integrated from heterogeneous data sources for
analysis purposes

• The overall procedure is referred to as the ETL process


• At first, data stored within different source systems, such as databases, legacy systems, or XML
documents, is extracted into the so-called staging area that provides different services for
transforming and cleaning data

• From the staging area, the cleaned and integrated data can be loaded into a presentation area
where users can perform analyses

• In data extraction, one should first of all think about the question of when data is extracted

• In some settings, it might be sufficient to have a partial or complete snapshot of the source data

• But if data is extracted from operational systems, this is not sufficient, as the data might quickly become outdated

• In order to stay informed about updates within the sources, typically, the sources are monitored for
updates
• [Note: A database snapshot is a read-only, static, transactionally consistent view of the source database

• A delta load implies that the entire data of a relational database table is not repeatedly extracted, but only the new data that has been added to the table since the last load. With a delta load, you process only the data that needs to be processed, either new data or changed data]
• Connected with the question of when to extract is the question of what to extract

• It can be overly complex to extract all data within a snapshot each time the data source undergoes
an update

• Instead, it is often desired to only extract the delta compared to the last data snapshot within the
staging area
A common example

• A common example would be the list of participants from a conference registration system that was extracted at time T1 as a complete snapshot

• T1 = < Smith, Brown, Mayer, Jaeger>

• Assume that there is an update at time T2 and the new snapshot would be

• T2 = < Smith, Brown, Jaeger, Jones >

• Instead of the complete participant list at T2, only the new participant, i.e., Jones, is transferred, and Mayer is recorded as no longer attending the conference
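• A minimal Python sketch of this differential computation, using a set difference between the two snapshots:

```python
# Hedged sketch: computing a differential snapshot (delta) between two
# full extracts of the participant list.
t1 = {"Smith", "Brown", "Mayer", "Jaeger"}   # snapshot at time T1
t2 = {"Smith", "Brown", "Jaeger", "Jones"}   # snapshot at time T2

added   = t2 - t1   # new participants to transfer       -> {"Jones"}
removed = t1 - t2   # participants no longer attending   -> {"Mayer"}

print("delta to load:", added)
print("no longer attending:", removed)
```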
• Connected with the question of what, i.e., snapshot versus delta, is the question of how
to extract?

• If the data source is a database system, for example, it typically offers many ways for extracting both snapshots and deltas

• Snapshots can be extracted by SQL queries


• Deltas can be extracted by utilizing database logs

• However, not all sources offer such convenient support.
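• As a hedged sketch of snapshot versus delta extraction from a relational source (the table name, the last_modified column, and the use of sqlite3 are assumptions chosen for illustration; real sources may instead expose database logs or change-data-capture interfaces):

```python
# Hedged sketch of snapshot vs. delta extraction from a relational source.
# Table name "participants" and the "last_modified" column are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO participants VALUES (?, ?)",
                 [("Smith", "2024-01-01"), ("Jones", "2024-02-15")])

# Full snapshot extraction
snapshot = conn.execute("SELECT name FROM participants").fetchall()

# Delta extraction: only rows changed since the last load
last_load = "2024-02-01"
delta = conn.execute(
    "SELECT name FROM participants WHERE last_modified > ?",
    (last_load,)).fetchall()

print("snapshot:", snapshot)   # all rows
print("delta:", delta)         # only rows modified after last_load
```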


Cont.

• As legacy systems are still present in enterprises, and often do not offer any support for
data extraction, it is, in some cases, only possible to take snapshots of the data followed
by calculating differential snapshots between the last and the current version of the
data

• The efficient calculation of such differential snapshots has been tackled by different approaches, e.g., the window algorithm

• After extracting data updates from the sources, the data is to be transferred to the
staging area for data cleaning and integration purposes
Cont.

• For transferring massive data sets, load techniques such as bulk loaders are offered
• for example, the Oracle SQL*Loader
• Loading a large set of data should only be done for cleaned and trusted data

• Thus, loading is typically applied after cleaning and integrating the extracted data within the staging area
3.3.2 Big Data

• The extensive use of social media (e.g., half a billion tweets a day on Twitter, more than one billion active users on Facebook), sensor applications (e.g., for measuring health parameters or environmental conditions), as well as the immediate provision of possibly large result data by modern search engines, has led to a massive increase in produced and potentially interesting-to-analyze data

• Key challenges in the context of handling big data are data volume, data velocity, data variety, and data veracity
Cont.

• In short, data volume refers to processing huge amounts of data


• data velocity to the frequency with which new data enters the integration and analysis
process
• data variety to the diversity of data,
• data veracity to the trustworthiness of the data

• We will have to analyze the challenges that these four “Vs” pose for data extraction and integration
Challenge 1 : Data volume

• Currently, NoSQL databases are on the rise for tackling the challenge of data volume
• Such databases can be categorized into
• key value stores
• graph databases
• XML databases

• These NoSQL databases are expected to dissolve the potential restrictions imposed by
relational databases such as the demand for a schema, ACID transactions, and
consistency

• In turn, NoSQL databases are schema-free and eventually consistent


• Key-value storage systems are adopted by various enterprises
• The basic data model consists of key-value pairs

• The following statement

• store[:Key1] = “Some string”

• would create an object with key Key1 and value “Some string”
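• The statement above is written in a Ruby-like notation; an equivalent hedged sketch in Python, with a plain dictionary standing in for a key-value store, would be:

```python
# Hedged sketch: a plain dict mimics the put/get interface of a key-value store.
store = {}
store["Key1"] = "Some string"   # create an object with key Key1
print(store["Key1"])            # retrieve the value by its key
```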

• In graph databases, the data are represented as a graph structure

• The basic structure is a graph G := (V, E), where V is a set of vertices and E is a set of edges
• Queries are defined on the basis of a graph query language (GQL), for example, as in the sketch below:
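• A hedged sketch of such a graph query in Python, using networkx as an assumed helper library; the vertices and edges are illustrative only:

```python
# Hedged sketch: representing a graph G = (V, E) and running a simple query.
# networkx is an assumed library, chosen only for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Patient_X", "Hospital_A"),
                  ("Patient_Y", "Hospital_A"),
                  ("Hospital_A", "Vienna")])

# Query: which vertices are directly connected to Hospital_A?
print(list(G.neighbors("Hospital_A")))
```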
Challenge 2 : Data velocity

• Data velocity refers to the frequency of updates from the data sources

• Whereas some years ago updates in the data sources of a data warehouse could be treated in a periodic way
• nowadays, continuous data updates have become a frequent scenario, e.g., in the case of sensor data (also referred to as streaming data)
Challenge 3: Data variety

• Data variety reflects the increasing number of different data formats that might have to
be integrated or, put in a more specific way, structured, semi-structured, and
unstructured data

• The integration becomes particularly difficult if schema information is missing

• Approximately half of the XML documents available on the Web do not refer to a
schema

• But different techniques were introduced that derive schema information from a set of
XML documents in a tree-based or text-based manner
Cont.

• Tree-based approaches, for example, take the structure tree of each XML
document and aggregate them according to certain rules
• The underlying schema can be derived based on the aggregated tree
• Several XML tools and editors offer to automatically derive the underlying XML
schema from a set of XML documents
• e.g., Liquid Studio
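• As a hedged toy illustration of the tree-based idea (not the algorithm of any particular tool), the element structures of several XML documents can be aggregated into a parent-to-children map from which a schema could be derived:

```python
# Hedged sketch: aggregating the structure trees of several XML documents.
# The aggregated parent -> children map approximates the underlying schema.
import xml.etree.ElementTree as ET
from collections import defaultdict

docs = [
    "<patient><name>Smith</name><age>54</age></patient>",
    "<patient><name>Jones</name><insurance>1234</insurance></patient>",
]

structure = defaultdict(set)
for doc in docs:
    root = ET.fromstring(doc)
    for parent in root.iter():        # root and all descendant elements
        for child in parent:          # direct children of each element
            structure[parent.tag].add(child.tag)

print(dict(structure))   # e.g. {'patient': {'name', 'age', 'insurance'}}
```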

• Another possible solution to address a mix of structured and unstructured data is to use an XML database such as BaseX
• BaseX is able to store structured, semi-structured, and unstructured data, and hence it actually addresses the two challenges of data volume and variety
Challenge 4 : Data veracity

• Data veracity connects big data with the question where the data comes from, e.g., data
that is stored in a cloud

• In such settings, we have to think about how we can ensure trust in the data we collect
and want to analyze

• In this context, techniques for auditing data by, for example, a third-party auditor have been proposed
Summary on Data Extraction

• Data extraction is more than just grabbing data from a source


• Data availability, ownership:
• In order to obtain data from a certain domain, it is often indispensable to develop some understanding of the domain
• Soft skills help to communicate with the domain experts in order to overcome possible resistance
• Further, legal knowledge can be of advantage when it comes to data privacy questions.
Cont.

• Heterogeneous data sources: Many tools exist that offer a bunch of adapters and
extractors to facilitate data extraction
• However, the basic design of the data extraction process remains a manual task

• Big data: Volume, variety, velocity and veracity of the data are challenges
• However, it is most crucial to define what to analyze in a huge bulk of data, i.e., asking
the right questions
From Transactional Data Towards Analytical Data

• The extracted data is to be cleaned and integrated


• However, before data integration can take place, we have to decide in which format the
data should be integrated (integration format)
• Sometimes, it becomes necessary to provide additional analytical formats that are
based on the integration format
• Different analyses require different analytical formats
• In other words, the choice of the analytical format depends on the analysis questions
and the key performance indicators
• The choice of the integration format depends on the results of the data extraction step
• The following figure depicts a selection of basic integration and analysis formats that can be
transformed into different analytical models
• The columns are labeled with formats including a top-level distinction
between structured and unstructured formats
• Structured formats can be further distinguished into
–flat formats, such as relational tables
–hierarchical formats, such as XML
–hybrid formats, such as XES

• The rows are labeled with formats including a distinction into table and log
formats
• Table formats can be further distinguished into flat and multidimensional
formats
• Structured data formats can be divided into flat, hierarchical, and hybrid
structures
• Typical flat formats comprise
–Relational tables
–Comma-separated values (CSV)
–Excel files
• A prominent hierarchically structured format is XML, since the structure of XML documents can be mapped onto a tree structure
• XES (eXtensible Event Stream) is also an XML-based format but forms an extra column for log data
• The different analysis (and possibly integration) formats denoted by the row
labels range from flat table structures over multidimensional structures to log
structures
• Given two data sources A and B, the following transformations are possible:

• A contains B: The format of A equals the format of B, and B is a subset of A

• A generates B:
• By aggregation:
• A flat or multidimensional,
• B multidimensional
• aggregation function, e.g., SUM, AVG;
• aggregation refers to aggregating the data
• By mapping:
• dimensionality might be changed;
• describes a set of attribute correspondences between two schemas based on which one schema
can be mapped onto the other

• By transformation:
• A and B of any format;
• Transformation changes a schema (format) in order to obtain a desired target schema (format)
3.4.1 Table Formats and Online Analytical Processing (OLAP)

• Consider the prominent example of
– time describing the turnover, or
– time describing the temperature measured on a certain day

Example (Health Care)


The following figure is an excerpt of medical data structured in a multidimensional way

Interesting measurements corresponding to key performance indicators, such as cost-effectiveness of treatment or survival time, are reflected by the facts billing sum or number of patients

As for many applications, a time-related view on the facts is interesting for this example as well
• This results in the dimension time, which enables the aggregation along the dimension attributes
day, month, quarter, and year

• Figure 3.10 shows an example report based on the application of aggregation functions to the
multidimensional structure presented in Fig. 3.9

• In this report, the billing sums per patient cohort, therapy, and year have been aggregated (a sketch of such an aggregation follows below)

• Aggregation is one ingredient of online analytical processing (OLAP) operations that are typically
used to analyze multidimensional data

• [note: In medicine, a cohort is a group that is part of a clinical trial or study and is observed over a
period of time.]
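• A hedged sketch of such an aggregation with pandas; the column names and values mirror the example but are assumptions:

```python
# Hedged sketch: aggregating billing sums per cohort, therapy, and year,
# analogous to the report described above. Column names are assumptions.
import pandas as pd

facts = pd.DataFrame({
    "cohort":      ["A", "A", "B", "B"],
    "therapy":     ["Surgery", "Surgery", "Radiation", "Radiation"],
    "year":        [2006, 2006, 2006, 2007],
    "billing_sum": [1200.0, 800.0, 500.0, 700.0],
})

report = (facts.groupby(["cohort", "therapy", "year"], as_index=False)
               ["billing_sum"].sum())
print(report)
```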
• Other OLAP operations include:
• Roll up
• Drill down
• Slicing
• Dicing

• A set of OLAP techniques exists that change the granularity and/or dimensionality of the data cube in order to gain additional insights

• Roll up: the operation Roll up generates new information by aggregating data along a dimension; the number of dimensions is not changed

• Drill down: the navigation from aggregated data back to detailed data is achieved by the operation Drill down

• One example would be to roll up the dimension Time from Months to Quarters, or to drill down in the opposite direction.
• Operations that do not change the granularity of the data, but rather the dimensionality or the size of the cube, are Slice and Dice

• Slice generates individual views by cutting “slices” from the data cube

• In general, this operation reduces the number of dimensions

• Example:
• one could be interested in the average duration of all process instances for a certain patient X in
the current year

• To answer this question, a slice of the data would be generated by the following query:

• SELECT ... WHERE year = '2006' AND patient = 'X'


• Dicing means that the dimensionality of the cube is not reduced, but the cube itself is
reduced by cutting out a partial cube

• An example question answered by dicing could regard the number of patient treatments
within a certain time frame, and the solution can be accomplished by range queries
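• Continuing the hedged pandas sketch (the fact table, column names, and values are assumptions), slicing and dicing can be illustrated as follows:

```python
# Hedged sketch: slice vs. dice on an assumed fact table.
import pandas as pd

facts = pd.DataFrame({
    "patient":     ["X", "X", "Y"],
    "year":        [2006, 2007, 2006],
    "billing_sum": [1200.0, 300.0, 500.0],
})

# Slice: fix dimension values, as in the query above, reducing dimensionality
slice_ = facts[(facts["year"] == 2006) & (facts["patient"] == "X")]

# Dice: cut out a partial cube via a range query, keeping all dimensions
dice = facts[facts["year"].between(2006, 2007)]

print(slice_)
print(dice)
```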
3.4.2 Log Formats

• Consider the following question


• Was the diagnosis always followed by an excision?
• [Note :

• Excision :
• The removal of tissue from the body using a scalpel (a sharp knife), laser, or other
cutting tool
• A surgical excision is usually done to remove a lump or other suspicious growth.]
• As a consequence, data formats are required to store temporal
information on process executions in an explicit way

• This is achieved through log formats


• For process-oriented analysis, it is crucial that the log contains information on the
event order

• Either it is based on time stamps connected with the events or the assumption holds
that the order of the events within the log reflects the order in which they occurred
during process execution

• Further, the events of different process executions must be distinguishable, e.g., an event for executing activity PerformSurgery for patient X (executed within process instance X) must be distinguishable from the event for executing activity PerformSurgery for patient Y (executed within process instance Y)

• This requires some sort of instance ID within each event


3.4.2 Log Formats

• In general, a log can be defined as a collection of events recorded during runtime of an


information system

• Currently, there are two process-oriented log formats that are predominantly used for
process analysis, i.e., Mining XML (MXML) and eXtensible Event Stream (XES)

• For illustration, see the example process depicted in the following Fig. 3.14
• It expresses a parallel execution of the process activities PerformSurgery and ExaminePatient
• Consequently, two possible execution logs can be produced by this process schema
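• As a hedged sketch, one of these traces could be serialized as a minimal XES-style log in Python; only a few commonly used XES attributes (concept:name, time:timestamp) are shown, and the timestamps are illustrative:

```python
# Hedged sketch: emitting one possible execution trace of the example
# process as a minimal XES-style log (only a subset of XES is shown).
import xml.etree.ElementTree as ET

log = ET.Element("log")
trace = ET.SubElement(log, "trace")
ET.SubElement(trace, "string", key="concept:name", value="patient X")

events = [("PerformSurgery", "2013-05-02T10:00:00"),   # illustrative timestamps
          ("ExaminePatient", "2013-05-02T11:30:00")]
for activity, ts in events:
    event = ET.SubElement(trace, "event")
    ET.SubElement(event, "string", key="concept:name", value=activity)
    ET.SubElement(event, "date", key="time:timestamp", value=ts)

print(ET.tostring(log, encoding="unicode"))
```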
Import of Process-Oriented Data into Log Formats

• Process-oriented data might be logged by different information systems,


• e.g., ERP systems or workflow systems

• Several tools have been developed that support the import of this source data into
target formats such as MXML and XES

• There are also tools (Nitro and Disco) that enable the import of CSV data

• Nitro and Disco provide support to define mappings between the columns of the
source Excel or CSV file and the target log format such as MXML and XES
Summary: From Transactional Towards Analytical Data

• An important decision within the data provisioning process is to choose adequate integration and analysis formats

• In addition to more traditional integration and analysis formats, such as flat or multidimensional tables, process-oriented formats, i.e., log formats, are becoming increasingly important
Schema and Data Integration

• Once the appropriate integration format is chosen, the possibly heterogeneous data
sources must be integrated within this target format

• Hence, in addition to the mappings, aggregations, and transformations between different formats, we have to address the challenges of schema integration and data integration
• Schema integration means to unite the participating schemata S1, …, Sn into one integrated schema Sint
• where Sint should meet the following criteria

– Completeness
– Validity
– No contradictions
– Minimality
– Understandability
• Completeness: No information loss with respect to the entities contained within the schemata Si, i = 1, …, n

• Validity: Sint should reflect a real-world scenario that can be seen as a union of the real-world scenarios reflected by Si, i = 1, …, n

• No contradictions within Sint

• Minimality: no redundancies, i.e., every entity contained in Si, i = 1, …, n should occur just once in Sint

• Understandability: the transformation and integration steps should be documented in order to enable the traceability and reproducibility of the result
Why is building Sint often difficult?

• The schemata S1, …, Sn might stem from possibly heterogeneous data sources
• This often results in a number of conflicts among the participating schemata.

• Conflicts
– Semantic and Descriptive conflicts
– Heterogeneity conflicts
– Structural conflicts
Semantic conflicts

• Semantic and descriptive conflicts refer to the way people perceive the set of real-world objects to model
• More precisely, a semantic conflict arises if modelers choose different entities to describe the same real-world scenario
• Example
• Assume two schemata A and B describing patient administration
• In schema A, the entity Patient describes patients in a hospital scenario
• whereas in schema B, the entity StatPatient describes patients in a hospital scenario
Descriptive conflicts

• Descriptive conflicts happen if modelers use the same entities but different attribute
sets to describe these entities
• For example,
• in schema A, a patient is described by Name and Age, whereas
• in schema B, the patient is described by Social Insurance Number and BirthDate

• Note: The main difference between an entity and an attribute is that an entity is a real-world object, and attributes describe the properties of an entity
Heterogeneity conflicts

• Heterogeneity conflicts occur if the schemata are defined using different formats, e.g., Excel (structured) versus XML (semi-structured)
Structural conflicts

• Structural conflicts arise if different constructs are used despite choosing a common
format

• An example would be if in XML schema A, patient age is modeled as an attribute
• whereas in XML schema B, patient age is modeled as an element
• In particular, semantic and descriptive conflicts are hard to solve without any further
knowledge

• For semantic and descriptive conflicts, ontologies can be used to resolve conflicts
such as homonyms and synonyms (e.g., by using vehicle instead of car)

• [homonyms: each of two or more words having the same spelling or pronunciation but different meanings and origins]
schema integration process

• The schema integration process comprises the following phases:

• (a) pre-integration
• (b) schema comparison
• (c) schema conforming
• (d) schema merging and restructuring
(a) pre-integration

• Within the pre-integration phase, the participating schemata S1, …, Sn are analyzed for their format and structure as well as their metadata
• Another part of the pre-integration phase is to determine the integration strategy, particularly if more than two schemata are to be integrated
• Different strategies are available
• One example is the “one-shot” strategy that aims at integrating S1, …, Sn at once
• Binary strategies would integrate S1, …, Sn in a pairwise manner
schema comparison and schema conforming

• Schema mapping and matching are techniques that are applied during schema
comparison and schema conforming

• A matching takes “two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other”

• A mapping m relates schema S with attribute set AS to schema T with attribute set AT
• The idea is to build the cross product AS × AT between all attributes in AS and AT
• Open-source or research tools for schema mappings are COMA 3.0 and Protégé
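• A hedged sketch of the cross-product idea: every attribute pair in AS × AT is scored with a simple name similarity and the best pairs are proposed as candidate correspondences. The attribute names are assumptions, and real matchers such as COMA 3.0 use far richer techniques:

```python
# Hedged sketch: naive schema matching over the cross product AS x AT
# using string similarity of attribute names. Attribute names are assumed.
from difflib import SequenceMatcher
from itertools import product

AS = ["Name", "Age", "SocialInsuranceNumber"]
AT = ["PatientName", "BirthDate", "InsuranceNo"]

candidates = []
for a, t in product(AS, AT):                      # cross product AS x AT
    score = SequenceMatcher(None, a.lower(), t.lower()).ratio()
    candidates.append((a, t, round(score, 2)))

# Propose the highest-scoring correspondences first
for a, t, score in sorted(candidates, key=lambda c: -c[2])[:3]:
    print(f"{a} <-> {t} (similarity {score})")
```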
Heterogeneity conflicts

• Assume that the participating schemata S1, S2, …, Sn are of different formats
• In order to integrate two heterogeneous schemata, a common data model has to be chosen, mostly one of the two existing ones

• Example:
• Bringing together relational and XML-based schemata
challenge

• Integrating a hierarchical data model (XML) with a flat one (relational)


• The database table (b) is to be integrated with the XML document (a) reflected by its underlying
structure tree
• Note that the structures of both the XML document and the relational table are already aligned

• The challenge is to construct the hierarchical structure of the XML document based on the flat
relational data
• This task is referred to as structuring, i.e., it has to be decided which of the database attributes have to be converted into elements (or attributes) and in which hierarchical order

• The database (here AllParticipants) is mapped onto the root element;


• then the table constitutes the child element of the root, and the attributes are mapped to leaf
elements

• The SQL/XML standard offers functions that enable the extraction of relational data as XML documents.
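• A hedged Python sketch of the structuring step for the AllParticipants example (column names and values are assumptions; in practice the SQL/XML functions mentioned above can produce such output directly in the database):

```python
# Hedged sketch: structuring flat relational rows as a hierarchical XML
# document (root -> row elements -> leaf elements). Column names assumed.
import xml.etree.ElementTree as ET

rows = [{"Name": "Smith", "Fee": "100"},
        {"Name": "Jones", "Fee": "120"}]

root = ET.Element("AllParticipants")          # database mapped to the root
for row in rows:
    participant = ET.SubElement(root, "Participant")     # table row as child
    for column, value in row.items():
        ET.SubElement(participant, column).text = value  # attributes as leaves

print(ET.tostring(root, encoding="unicode"))
```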
3.5.2 Data Integration and Data Quality

• After schema integration, there might still be inconsistencies at the data level,
necessitating data integration actions
• For example
• Take the two XML documents
• Apparently, they adhere to the same XML schema but show conflicts at the data level
• for example,
• encoding names in a different manner or displaying fees in a different currency
• In order to integrate both XML files into one, the data conflicts have to be detected and
resolved accordingly
• This process is also referred to as data fusion,
• i.e., data from different sources referring to the same real-world object is integrated to represent that real-world object in a consistent and clean way
Cont.

• the following problems at data level might occur:


• data errors (e.g., typos), different formats, inconsistencies (e.g., the zip code does not
match the city), and duplicates
• One measure to improve data quality is data cleaning, which aims at “detecting and
removing errors and inconsistencies from data”
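• A hedged sketch of such data-fusion steps, normalizing name encodings and converting fees to a single currency before records are merged; the exchange rate and field names are assumptions:

```python
# Hedged sketch: resolving data-level conflicts before fusing two records
# that describe the same real-world participant. Values are illustrative.
EUR_PER_USD = 0.9   # assumed exchange rate for the example

def normalize(record):
    rec = dict(record)
    # unify name encoding: "Last, First" -> "First Last"
    if "," in rec["name"]:
        last, first = [p.strip() for p in rec["name"].split(",", 1)]
        rec["name"] = f"{first} {last}"
    # unify currency: convert USD fees to EUR
    if rec["currency"] == "USD":
        rec["fee"] = round(rec["fee"] * EUR_PER_USD, 2)
        rec["currency"] = "EUR"
    return rec

a = {"name": "Smith, John", "fee": 100.0, "currency": "USD"}
b = {"name": "John Smith",  "fee": 90.0,  "currency": "EUR"}

print(normalize(a))   # now directly comparable with record b
print(normalize(b))
```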
3.5.3 Linked Data and Data Mashups

• So far, we have discussed a process that


–extracts data from sources
–cleans them
–integrates them outside the sources
• A totally different idea is to use data sources that are
integrated already, making extraction, cleaning, and integration
tasks obsolete
• This is the idea of linked data
Cont.

• The principle here is to link data sources that are available on the Web and by doing so,
provide an integrated data source for later queries or analysis

• The difference between the classical ETL process and the idea of linked data is illustrated
in the following figure
Data Mashups
• Data mashup refers to combining data from different data sources into a single
application
• The integration of heterogeneous business data from different sources into one place
gives a unified overview of the business processes
• Data mashing is also referred to as enterprise or business mashing.
• Typical sources for data mashups are Web data and Web services that act as the
components of the mashup
[Note :
• Power users are popularly known for owning and using high-end computers with
sophisticated applications and service suites. For example, software developers,
graphic designers, animators and audio mixers require advanced computer hardware
and software applications for routine processes]
Summary

• Data provisioning constitutes an important prerequisite for any BI project


• Hence, it is recommended to plan for sufficient time and manpower
