BI Unit 3
Unit – III
Data Provisioning
• Data provisioning is the process of making data available in an orderly and secure way to users, application
developers, and applications that need it
• Or
• It involves the procedures of making data and resources available to the system and users
• Data provisioning constitutes the prerequisite for any Business Intelligence (BI) project
• Clearly, without any data basis there will be no analysis at all, and without data of
good quality the quality of the analysis will be low as well
• However, data collection, extraction, and integration are often the most complex and
expensive tasks in a BI project
Cont.
• Due to big data, more and more data is available holding potential for valuable analysis
• Possible sources for large data volumes are e-business and social network data
• On top of data volume, data variety and data velocity pose additional challenges
• For example,
• Data velocity might demand data extraction in very short time frames or even in a
continuous way
• Data variety addresses the fact that data from different sources might be structured,
semi-structured, or even unstructured while being available in different formats
Cont.
• Exclusive gateways (often called XOR gateways) are used when a process splits
into several paths and only one of them can be active
• i.e., for a given instance of the process, only one of the paths can be
taken
Simple example 1
• A simple example is a sales process where the customer either accepts the offer as it
is, rejects it fully, or asks for negotiations
• In this case, the name of the gateway could be “Customer decision?” and the conditions for the
process paths outgoing from this gateway: “offer accepted”, “offer rejected”,
“negotiations requested”
Below you can see an example of a process with an XOR gateway in two options: with a
closing gateway and without
Example 2
• The last sequence flow has no condition and will be selected by default if the other
conditional flows evaluate to false
• In this unit, we develop an understanding of how to
– collect and describe data
– extract data from various sources
– find an adequate target format for subsequent analysis as well as how to clean and integrate data
• in such a way that the analysis goals can be achieved
• The corresponding data provisioning process has already been depicted
• Note that data cleaning and integration might be done in an iterative way depending on
whether the data quality has reached a sufficient level
Data Collection and Description
• Possible analysis goals are, for example:
– Survival time
– Health status of a specific group of persons
– Cost effectiveness of certain health policies
• Balance between:
• What data sources do we need (to fulfil a certain analysis goal)?
• Which data sources are actually available and accessible (privacy, data ownership, data access costs, etc.)?
• Example (EBMC² use case): the selected data sources are
• the Stage IV Medical Database (S4MDB) that contains the treatment data of skin cancer patients of stage IV, and
• the GAP-DRG database that stores medical billing data in Austria
• The S4MDB database was available in Excel format, whereas GAP-DRG is a relational database
• Since they were designed for different purposes, the sources differ in their conceptual schemas, possibly
leading to integration challenges
Cont.
• An important task of the data collection phase is the description of the data sources
• The following figure illustrates the description of the EBMC² data sources in a schematic
way
• As the analysis goals primarily focus on patient treatment, the chosen target format
should enable process-oriented analysis
• After selecting the relevant data sources and describing them, the next step is to
extract relevant data from their sources
• Typically, data extraction is part of the so-called extraction-transformation-load (ETL)
process
3.3.1 Extraction-Transformation-Load (ETL) Process
• Data are transformed and integrated from heterogeneous data sources for
analysis purposes
• From the staging area, the cleaned and integrated data can be loaded into a presentation area
where users can perform analyses
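• A minimal sketch of this ETL idea in Python, assuming a hypothetical CSV export (patients.csv with name and age columns) as source and a local SQLite database standing in for the staging area:

import csv
import sqlite3

# Extract: read raw rows from a hypothetical operational export (file and columns are assumptions)
with open("patients.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: a simple cleaning step, e.g., normalize names and drop incomplete records
cleaned = [
    {"name": r["name"].strip().title(), "age": int(r["age"])}
    for r in rows
    if r.get("name") and r.get("age")
]

# Load: write the cleaned data into a staging table of a local SQLite database
con = sqlite3.connect("staging.db")
con.execute("CREATE TABLE IF NOT EXISTS patient_stage (name TEXT, age INTEGER)")
con.executemany("INSERT INTO patient_stage (name, age) VALUES (:name, :age)", cleaned)
con.commit()
con.close()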
• In data extraction, one should first of all think about the question of when data is extracted
• If data is extracted from operational systems only once, this is not sufficient, as the data might
quickly become outdated
• In order to stay informed about updates within the sources, typically, the sources are monitored for
updates
• [Note: A database snapshot is a read-only, static, transactionally consistent view of
the source database
• A delta load implies that the entire data of a relational database table is not
repeatedly extracted, but only the new data that has been added to a table since
the last load. With delta load, you can process only data that needs to be processed,
either new data or changed data ]
• Connected with the question of when to extract is the question of what to extract
• It can be overly complex to extract all data within a snapshot each time the data source undergoes
an update
• Instead, it is often desired to only extract the delta compared to the last data snapshot within the
staging area
A common example
• Assume a participant list that is extracted as a snapshot at time T1 and that there is an
update at time T2 resulting in a new snapshot
• Instead of the complete participant list at T2, only the new participant, i.e., Jones, is
transferred, and Mayer becomes the one no longer attending the conference
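• A minimal sketch of such a delta computation, assuming hypothetical snapshot contents that are consistent with the example (Jones added, Mayer removed):

# Hypothetical participant snapshots at times T1 and T2 (only Jones and Mayer are named in the example)
snapshot_t1 = {"Smith", "Miller", "Mayer"}
snapshot_t2 = {"Smith", "Miller", "Jones"}

# Delta load: transfer only the changes instead of the full snapshot at T2
inserted = snapshot_t2 - snapshot_t1   # new participants, here Jones
deleted = snapshot_t1 - snapshot_t2    # participants no longer attending, here Mayer

print("to insert:", inserted)
print("to delete:", deleted)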
• Connected with the question of what, i.e., snapshot versus delta, is the question of how
to extract?
• If the data source is a database system, for example, it typically offers many ways for
extracting both snapshots and deltas
• As legacy systems are still present in enterprises, and often do not offer any support for
data extraction, it is, in some cases, only possible to take snapshots of the data followed
by calculating differential snapshots between the last and the current version of the
data
• The efficient calculation of such snapshots has been tackled by different approaches,
e.g., the window algorithm
• After extracting data updates from the sources, the data is to be transferred to the
staging area for data cleaning and integration purposes
Cont.
• For transferring massive data sets, load techniques such as bulk loaders are
offered
• for example, the Oracle SQL*Loader
• Loading a large set of data should only be done for cleaned and trusted data
• The extensive use of social media (e.g., half a billion tweets a day on Twitter, more than
one billion active users on Facebook), sensor applications (e.g., for measuring health
parameters or environmental conditions), as well as the immediate provision of possibly
large result data by modern search engines has led to a massive increase in produced
and potentially interesting-to-analyze data
• Key challenges in the context of handling big data are data volume, data velocity, data
variety, and data veracity
Cont.
• We will have to analyze the challenges posed on data extraction and integration by
these four “Vs.”
Challenge 1: Data volume
• Currently, NoSQL databases are on the rise for tackling the challenge of data volume
• Such databases can be categorized into
• key value stores
• graph databases
• XML databases
• These NoSQL databases are expected to dissolve the potential restrictions imposed by
relational databases such as the demand for a schema, ACID transactions, and
consistency
• In a key-value store, for example, a simple put/set operation would create an object with key Key1 and value "Some string" (see the sketch below)
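• As a hedged illustration of such a put/set operation (assuming the redis-py client and a locally running Redis server; the slide does not prescribe a particular key-value store):

import redis  # assumes the redis-py package and a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Creates an object with key Key1 and value "Some string";
# no schema has to be declared beforehand
r.set("Key1", "Some string")
print(r.get("Key1"))  # b'Some string'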
Challenge 2: Data velocity
• Data velocity refers to the frequency of updates from the data sources
• Whereas for data warehouses some years ago, updates in the data sources could be
treated in a periodic way,
• nowadays continuous data updates have become a frequent scenario, e.g., in the case
of sensor data (also referred to as streaming data)
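• A small sketch of continuous extraction, assuming a hypothetical sensor source simulated by a Python generator:

import random
import time

def sensor_stream(n_readings=10):
    # Hypothetical sensor emitting one temperature reading per iteration
    for _ in range(n_readings):
        yield {"ts": time.time(), "value": random.gauss(37.0, 0.5)}
        time.sleep(0.1)

# Continuous extraction: each update is processed as it arrives,
# instead of waiting for a periodic batch extraction
for reading in sensor_stream():
    if reading["value"] > 38.0:
        print("alert:", reading)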
Challenge 3: Data variety
• Data variety reflects the increasing number of different data formats that might have to
be integrated or, put in a more specific way, structured, semi-structured, and
unstructured data
• Approximately half of the XML documents available on the Web do not refer to a
schema
• However, different techniques were introduced that derive schema information from a set of
XML documents in a tree-based or text-based manner
Cont.
• Tree-based approaches, for example, take the structure tree of each XML
document and aggregate them according to certain rules
• The underlying schema can be derived based on the aggregated tree
• Several XML tools and editors offer to automatically derive the underlying XML
schema from a set of XML documents
• e.g., Liquid Studio
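• A simplified sketch of the tree-based idea, assuming two small hypothetical XML documents: the element paths observed in the documents are aggregated, and a schema outline can be read off the aggregate (dedicated tools such as Liquid Studio do considerably more):

import xml.etree.ElementTree as ET
from collections import Counter

# Two hypothetical XML documents that do not refer to a schema
docs = [
    "<patient><name>Smith</name><age>54</age></patient>",
    "<patient><name>Jones</name><insurance>1234</insurance></patient>",
]

def element_paths(element, prefix=""):
    # Collect all element paths of one document's structure tree
    path = prefix + "/" + element.tag
    yield path
    for child in element:
        yield from element_paths(child, path)

# Aggregate the structure trees of all documents
paths = Counter()
for doc in docs:
    paths.update(element_paths(ET.fromstring(doc)))

# Elements that do not occur in every document would become optional in the derived schema
for path, count in paths.items():
    print(path, "required" if count == len(docs) else "optional")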
Challenge 4: Data veracity
• Data veracity connects big data with the question of where the data comes from, e.g., data
that is stored in a cloud
• In such settings, we have to think about how we can ensure trust in the data we collect
and want to analyze
• In this context, techniques for auditing data by, for example, a third-party auditor have been proposed
Summary on Data Extraction
• Heterogeneous data sources: Many tools exist that offer a range of adapters and
extractors to facilitate data extraction
• However, the basic design of the data extraction process remains a manual task
• Big data: Volume, variety, velocity and veracity of the data are challenges
• However, it is most crucial to define what to analyze in a huge bulk of data, i.e., asking
the right questions
From Transactional Data Towards Analytical Data
• The rows are labeled with formats including a distinction into table and log
formats
• Table formats can be further distinguished into flat and multidimensional
formats
• Structured data formats can be divided into flat, hierarchical, and hybrid
structures
• Typical flat formats comprise
–Relational tables
–Comma-separated values (CSV)
–Excel files
• A prominent hierarchically structured format is XML, since the structure of XML documents
can be mapped onto a tree structure
• XES (eXtensible Event Stream) is also an XML-based format but forms an extra
column for log data
• The different analysis (and possibly integration) formats denoted by the row
labels range from flat table structures over multidimensional structures to log
structures
• Given two data sources A and B, the following transformations are possible:
• A generates B by aggregation:
• A flat or multidimensional,
• B multidimensional
• aggregation function, e.g., SUM, AVG;
• aggregation refers to aggregating the data
• By mapping:
• dimensionality might be changed;
• a mapping describes a set of attribute correspondences between two schemas based on which one schema
can be mapped onto the other
• By transformation:
• A and B of any format;
• Transformation changes a schema (format) in order to obtain a desired target schema (format)
3.4.1 Table Formats and Online Analytical Processing (OLAP)
As for many applications, a time-related view on the facts is interesting for this example as well
• This results in the dimension time, which enables the aggregation along the dimension attributes
day, month, quarter, and year
• Figure 3.10 shows an example report based on the application of aggregation functions to the
multidimensional structure presented in Fig. 3.9
• In this report, the billing sums per patient cohorts, therapy, and year have been aggregated
• Aggregation is one ingredient of online analytical processing (OLAP) operations that are typically
used to analyze multidimensional data
• [note: In medicine, a cohort is a group that is part of a clinical trial or study and is observed over a
period of time.]
• Other OLAP operations include:
• Roll up
• Drill down
• Slicing
• Dicing
• a set of OLAP techniques exists that change the granularity and/or dimensionality of the data
cube in order to gain additional insights
• Roll up: the operation Roll up generates new information by aggregating data along a
dimension; the number of dimensions is not changed
• Drill Down : The navigation from aggregated data to detailed data is achieved by the operation
Drill down
• One example would be to roll up the dimension Time from Months to Quarters or vice versa.
• Operations that do not change the granularity of the data, but the dimensionality or the size of
the cube, respectively, are Slice (OLAP) and Dice
• Slice generates individual views by cutting “slices” from the data cube
• Example:
• one could be interested in the average duration of all process instances for a certain patient X in
the current year
• To answer this question, a slice of the data would be generated by a query like the one shown in the sketch after this list
• An example question answered by dicing could regard the number of patient treatments
within a certain time frame, and the solution can be accomplished by range queries
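• A hedged sketch of these operations on a small, hypothetical flat table using pandas as a stand-in for a real OLAP engine (all column names and values are assumptions):

import pandas as pd

# Hypothetical fact table: one row per treatment case
df = pd.DataFrame({
    "patient":  ["X", "X", "Y", "Y"],
    "therapy":  ["Excision", "Radiation", "Excision", "Radiation"],
    "quarter":  ["Q1", "Q2", "Q1", "Q3"],
    "year":     [2014, 2014, 2014, 2015],
    "billing":  [1200, 800, 950, 1100],
    "duration": [14, 30, 10, 21],
})

# Roll up: aggregate the billing sums along the time dimension up to years
rollup = df.groupby(["therapy", "year"])["billing"].sum()

# Slice: cut one "slice" from the cube, e.g., patient X in a given year,
# and compute the average duration of the matching process instances
slice_avg = df[(df["patient"] == "X") & (df["year"] == 2014)]["duration"].mean()

# Dice: a sub-cube selected by ranges, e.g., treatments between Q1 and Q2 of 2014
dice = df[(df["year"] == 2014) & (df["quarter"].isin(["Q1", "Q2"]))]

print(rollup, slice_avg, len(dice), sep="\n")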
3.4.2 Log Formats
• [Note: Excision:
• The removal of tissue from the body using a scalpel (a sharp knife), laser, or other
cutting tool
• A surgical excision is usually done to remove a lump or other suspicious growth.]
• As a consequence, data formats are required to store temporal
information on process executions in an explicit way
• Either it is based on time stamps connected with the events or the assumption holds
that the order of the events within the log reflects the order in which they occurred
during process execution
• Currently, there are two process-oriented log formats that are predominantly used for
process analysis, i.e., Mining XML (MXML) and eXtensible Event Stream (XES)
• For illustration, see the example process depicted in the following Fig. 3.14,
• It expresses a parallel execution of process activities PerformSurgery and
ExaminePatient
• Consequently, two possible execution logs can be produced by this process schema
Import of Process-Oriented Data into Log Formats
• Several tools have been developed that support the import of this source data into
target formats such as MXML and XES
• There are also tools (Nitro and Disco) that enable the import of CSV data
• Nitro and Disco provide support to define mappings between the columns of the
source Excel or CSV file and the target log format such as MXML and XES
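• A minimal sketch of such a column mapping, assuming a hypothetical list of event rows (as they might come from a CSV or Excel export) and producing a simplified XES-like structure; real converters such as Nitro or Disco offer much richer mappings:

import xml.etree.ElementTree as ET

# Hypothetical event rows from a CSV export
rows = [
    {"case": "1", "activity": "ExaminePatient", "timestamp": "2014-03-01T10:00:00"},
    {"case": "1", "activity": "PerformSurgery", "timestamp": "2014-03-01T11:30:00"},
]

# Column mapping: case -> trace, activity -> concept:name, timestamp -> time:timestamp
log = ET.Element("log")
traces = {}
for row in rows:
    if row["case"] not in traces:
        traces[row["case"]] = ET.SubElement(log, "trace")
    event = ET.SubElement(traces[row["case"]], "event")
    ET.SubElement(event, "string", key="concept:name", value=row["activity"])
    ET.SubElement(event, "date", key="time:timestamp", value=row["timestamp"])

print(ET.tostring(log, encoding="unicode"))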
Summary: From Transactional Towards Analytical Data
• Once the appropriate integration format is chosen, the possibly heterogeneous data
sources must be integrated within this target format
• The integrated schema Sint should fulfil the following criteria:
– Completeness
– Validity
– No contradictions
– Minimality
– Understandability
• Completeness: no information loss with respect to the entities contained within the schemata Si, i = 1…n
• Validity: Sint should reflect a real-world scenario that can be seen as a union of the real-world
scenarios reflected by Si, i = 1…n
• Minimality: no redundancies, i.e., every entity contained in Si, i = 1…n should occur just once in Sint
• The schemata S1…Sn might stem from possibly heterogeneous data sources
• This often results in a number of conflicts among the participating schemata
• Conflicts
– Semantic and Descriptive conflicts
– Heterogeneity conflicts
– Structural conflicts
Semantic conflicts
• Semantic and descriptive conflicts refer to the way people perceive the set of real-world
objects to model
• More precisely, a semantic conflict arises if modelers choose different entities to
describe the same real-world scenario
• Example
• Assume two schemata A and B describing patient administration
• In schema A, the entity Patient describes patients in a hospital scenario
• whereas in schema B, the entity StatPatient describes patients in a hospital scenario
Descriptive conflicts
• Descriptive conflicts happen if modelers use the same entities but different attribute
sets to describe these entities
• For example,
• in schema A, a patient is described by Name and Age, whereas
• in schema B, the patient is described by Social Insurance Number and BirthDate
• Note: The main difference between an entity and an attribute is that an entity is a real-
world object, and attributes describe the properties of an entity
Heterogeneity conflicts
• Heterogeneity conflicts occur if the schemata are defined using different formats, e.g.,
Excel (structured) versus XML (semi-structured)
Structural conflicts
• Structural conflicts arise if different constructs are used despite choosing a common
format
• For semantic and descriptive conflicts, ontologies can be used to resolve conflicts
such as homonyms and synonyms (e.g., by using vehicle instead of car)
• (a) pre-integration
• (b) schema comparison
• (c) schema conforming
• (d) schema merging and restructuring
(a) pre-integration
• Schema mapping and matching are techniques that are applied during schema
comparison and schema conforming
• a matching takes “two schemas as input and produces a mapping between elements of
the two schemas that correspond semantically to each other”
• A mapping m relates schema S with attribute set AS and schema T with attribute
set AT
• The idea is to build the cross product AS x AT between all attributes in AS and AT
• Open-source or research tools for schema mappings are COMA 3.0 and Protégé
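• A toy sketch of a matcher that works on the cross product AS x AT, using simple name similarity only (the attribute names are hypothetical; tools such as COMA 3.0 combine many more matchers):

from difflib import SequenceMatcher
from itertools import product

# Hypothetical attribute sets of schemas S and T
attrs_s = ["Name", "Age", "SocialInsuranceNumber"]
attrs_t = ["PatientName", "BirthDate", "SocInsNo"]

def similarity(a, b):
    # Crude name-based similarity between two attribute names
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Build the cross product AS x AT and keep the attribute correspondences above a threshold
matches = [
    (a, b, round(similarity(a, b), 2))
    for a, b in product(attrs_s, attrs_t)
    if similarity(a, b) > 0.4
]
print(matches)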
Heterogeneity conflicts
• Assume that the participating schemata S1, S2….Sn are of different formats
• In order to integrate two heterogeneous schemata, a common data model has to be
chosen, mostly one of the two formats involved
• Example:
• Bringing together relational and XML-based schemata
Challenge
• The challenge is to construct the hierarchical structure of the XML document based on the flat
relational data
• This task is referred to as structuring, i.e., it has to be decided which of the database attributes have
to be converted into elements (or attributes) and in which hierarchical order
• the SQL/XML standard offers functions that enable the extraction of relational data as XML
documents.
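• A hedged sketch of such an extraction using the standard XMLELEMENT/XMLFOREST functions (the patients table, its columns, and the PostgreSQL connection string are assumptions):

import psycopg2  # assumes a PostgreSQL database and the psycopg2 driver

# Structuring: decide which relational attributes become elements and in which hierarchy
query = """
    SELECT XMLELEMENT(NAME patient,
                      XMLFOREST(p.name AS name, p.age AS age))
    FROM patients p
"""

with psycopg2.connect("dbname=hospital") as conn, conn.cursor() as cur:
    cur.execute(query)
    for (xml_doc,) in cur.fetchall():
        print(xml_doc)  # e.g. <patient><name>Smith</name><age>54</age></patient>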
3.5.2 Data Integration and Data Quality
• After schema integration, there might still be inconsistencies at the data level,
necessitating data integration actions
• For example
• Take the two XML documents
• Apparently, they adhere to the same XML schema but show conflicts at the data level
• for example,
• encoding names in a different manner or displaying fees in a different currency
• In order to integrate both XML files into one, the data conflicts have to be detected and
resolved accordingly
• This process is also referred to as data fusion,
• i.e., data from different sources referring to the same real-world object is
integrated to represent the real-world object in a consistent and clean way
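• A small sketch of such a fusion step, assuming two hypothetical records that refer to the same real-world patient but encode the name differently and state the fee in different currencies (the conversion rate is a placeholder):

# Hypothetical records from two sources describing the same real-world object
record_a = {"name": "Smith, John", "fee": 100.0, "currency": "EUR"}
record_b = {"name": "John Smith",  "fee": 110.0, "currency": "USD"}

EUR_PER_USD = 0.9  # placeholder conversion rate

def normalize_name(name):
    # Bring "Last, First" and "First Last" into one canonical form
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        return first + " " + last
    return name.strip()

def to_eur(fee, currency):
    return fee if currency == "EUR" else fee * EUR_PER_USD

# Data fusion: one consistent, clean representation of the real-world object
fused = {
    "name": normalize_name(record_a["name"]),
    # resolve the fee conflict here by averaging the normalized values
    "fee_eur": round((to_eur(record_a["fee"], record_a["currency"]) +
                      to_eur(record_b["fee"], record_b["currency"])) / 2, 2),
}
print(fused)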
Cont.
• The principle here is to link data sources that are available on the Web and by doing so,
provide an integrated data source for later queries or analysis
• The difference between the classical ETL process and the idea of linked data is illustrated
in the following figure
Data Mashups
• Data mashup refers to combining data from different data sources into a single
application
• The integration of heterogeneous business data from different sources into one place
gives a unified overview of the business processes
• Data mashups are also referred to as enterprise mashups or business mashups
• Typical sources for data mashups are Web data and Web services that act as the
components of the mashup
[Note :
• Power users are popularly known for owning and using high-end computers with
sophisticated applications and service suites. For example, software developers,
graphic designers, animators and audio mixers require advanced computer hardware
and software applications for routine processes]
Summary