
IDS

Unit 2

Dr Setturu Bharath
SoE, Chanakya University
Unit 2:

Defining Data Team: Roles in a Data Science Team: Data Scientists, Data Engineers. Managing Data Team: Onboarding and evaluating the success of the team, Working with other teams, Common difficulties. Data and Data Models, Types of Data and Datasets, Data Quality, Epicycles of Data Analysis, Data Models, Model as expectation, Comparing models to reality, Reactions to Data, Refining our expectations, Six Types of Questions, Characteristics of a Good Question, Formal modelling, General Framework, Associational Analyses, Prediction Analyses, Introduction to OLTP, OLAP.
Defining Data Team
Building a common-sense baseline will force the team to get the end-to-end data and
evaluation pipeline working and uncover any issues, such as with data access,
cleanliness, and timeliness.

It will also surface any tactical obstacles with actually calculating the evaluation metric.

Knowing how well the baseline does on the evaluation metric will give a quick ballpark
idea of how much benefit to expect from the project.

Experienced practitioners know all too well that common-sense baselines are often
hard to beat. And even when data science models beat these baselines, they may do so
by slim margins.

Defining Data Team
Business leaders should take extraordinary care in choosing the team and defining the
problem they want their data science teams to solve.

Data scientists, especially new ones, often want to get going with preparing data and
building models.

At least initially, they may not have the confidence to question a senior business
executive, especially if that individual is the project sponsor.

It is up to leaders to make sure the team focuses on the right problem.

Pay more attention to what they are solving than to how they are solving it, since there are usually many different ways to solve any data science problem.
Roles in a Data Science Team
• The genesis of a project need not only come from the business, but it should be tailored to a specific business problem.
• Data is rarely collected with future modeling projects in mind.
• Understanding what data is available, where it is located, and the trade-offs between ease of availability and cost of acquisition can impact the overall design of solutions.
(Diagram: responsibilities of the three groups.)
• Business: initiates projects; provides key deliverables; invests in applying results.
• IT & Data Management: drives data access and discovery; is responsible for tools and infrastructure; operationalizes and maintains models; builds and supports end-use applications.
• Data Science Teams: understand business needs; actively engage in execution (results, reports) and verification; communicate effectively with all.
Roles in a Data Science Team
• Teams often need to loop back if they discover a new limitation in data availability.
• The team should be multidisciplinary and typically consists of:
  • Lead data scientist
  • Validator
  • Data engineer
  • Product manager
  • Application developers
  • Internal end users
  • External end users (customers)
  • DevOps engineer
  • Legal / compliance
• Build simple models first; communicate results and collect feedback.
• Establish standard hardware and software configurations, but balance them against the flexibility to experiment.
• Require monitoring plans with proactive alerting and acceptable notification thresholds.
Epicycle of Data Analysis
• Data analysis does not follow a linear process. It is a
very iterative process that is more accurately reflected
by cycles rather than a sequence.

• The five core pieces of data analysis include:


➢ State and refine the question
➢ Explore the data
➢ Build statistical models
➢ Interpret the results
➢ Communicate the results
• At each stage of the process, keep in mind that it is useful to set expectations, collect information, compare the data with your expectations, and then revise the expectations or the data so that the two coincide.
• Data may need to be cleaned, revised, augmented, or otherwise changed in order to meet your goals.
Epicycle of Data Analysis
For each core activity, the epicycle is: set expectations, collect information, revise expectations.
• Question. Set expectations: the question is of interest to the client. Collect information: literature review, ask questions. Revise expectations: refine the question.
• Exploratory Data Analysis. Set expectations: the data are appropriate. Collect information: make exploratory plots. Revise expectations: refine the question or collect more data.
• Formal Model. Set expectations: the primary model answers the question. Collect information: secondary models, sensitivity analysis. Revise expectations: revise the model.
• Interpretation. Set expectations: the interpretation of the analysis provides a specific and meaningful answer to the client. Collect information: interpret the entirety of the analysis with a focus on effect sizes and uncertainty. Revise expectations: revise the data analysis and/or provide a specific and interpretable answer.
• Communicate. Set expectations: the process and results of the analysis are understood by the client. Collect information: elicit feedback. Revise expectations: revise the analysis, approach, or presentation.

Leek, J. (2015). The Elements of Data Analytic Style: A guide for people who want to analyze data. Leanpub.
Data and modeling
• Data is a crucial part of any data science
project/analysis.
• A large volume of data poses new challenges, such as
overloaded memory and algorithms that never stop
running.
• It forces you to adapt and expand your repertoire of
techniques. But even when you can perform your
analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can
cause speed issues.
• Never-ending algorithms, out-of-memory errors, and
speed issues are the most common challenges you
face when working with large data.
• The solutions can be divided into three categories:
using the correct algorithms, choosing the right data
structure, and using the right tools.
Data and modeling Exploratory Data Analysis (EDA)
• Exploratory data analysis is the process of exploring your data, and it typically includes
examining the structure and components of your dataset, the distributions of individual
variables, and the relationships between two or more variables.

• Data visualization is arguably the most important tool for exploratory data analysis because the
information conveyed by graphical display can be very quickly absorbed and because it is
generally easy to recognize patterns in a graphical display.

• Goals of EDA are:

1. To determine if there are any problems with your dataset.

2. To determine whether the question you are asking can be answered by the data that you
have.

3. To develop a sketch of the answer to your question.


Data and modeling Exploratory Data Analysis (EDA)

• Exploratory Data Analysis Checklist:

1. Formulate your question
2. Read in your data
   • Handling missing data
   • Correcting data types
3. Check the packaging
   • Removing duplicates
4. Look at the top and the bottom of your data
   • Descriptive statistics (mean, median, mode, etc.)
   • Summarizing distributions
5. Check your "n"s
   • Normalization/scaling
   • Feature extraction
   • Encoding categorical data
   • Identifying key features for analysis or modeling
6. Validate with at least one external data source
7. Make a plot
   • Histograms, box plots, scatter plots, bar charts, etc.
   • Identifying outliers and trends
8. Try the easy solution first
   • Applying statistical tests to validate assumptions
9. Follow up
   • Summarizing key findings
   • Preparing data for predictive modeling
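A minimal sketch of how several checklist steps might look in Python with pandas. The file name "sales.csv" and the column "amount" are hypothetical, and the plot step assumes matplotlib is installed; this is only an illustration of the workflow, not the course's own code.

```python
# Minimal EDA sketch with pandas (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("sales.csv")              # 2. read in your data
print(df.shape, df.dtypes)                 # 3. check the packaging
print(df.head()); print(df.tail())         # 4. look at the top and the bottom
print(df.describe())                       #    descriptive statistics
df = df.drop_duplicates()                  #    remove duplicates
print(df["amount"].value_counts().head())  # 5. check your "n"s
df["amount"].plot(kind="hist")             # 7. make a plot (needs matplotlib)
```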
Data & Modeling

MODEL:
• Basic tool in science, economics, engineering, business, planning, etc.
• Representation of real-life scenarios (problems/situations).
• Copies attributes from the real world to a prototype.
• Applied to situations or systems (existing or non-existing).
• Used to visualize various scenarios and derive optimum solutions by varying input parameters.
• Can be logical, mathematical, or spatial, using assumptions, theories, or similar situations/operations.
Data and modeling

• Model provides a description of how the world works and how the data were
generated.

• The model is essentially an expectation of the relationships between various


factors in the real world and in your dataset.

• Finding the type of data distribution is a crucial step: you need a clear picture of whether the relationships are linear or non-linear.
Data and modeling
• Perhaps the most popular statistical model in the
world is the Normal model. This model says that
the randomness in a set of data can be explained
by the Normal distribution, or a bell-shaped curve.

• The Normal distribution is fully specified by two


parameters - the mean and the standard deviation.

• It’s common to look at data and try to understand


linear relationships between variables of interest.

• The most common statistical technique to help with this task is linear regression.
Data and modeling
• Linear regression models assume a linear relationship between the variables, meaning that the dependent variable can be expressed as a linear combination of the independent variables.
• They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data in fields such as statistics, economics, social sciences, and machine learning.
• A linear relationship is a statistical term used to describe a straight-line relationship between two variables.
• Linear relationships can be expressed either graphically or as a mathematical equation of the form
y = mx + b
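A small sketch of estimating the slope m and intercept b from data with a least-squares fit. The data are synthetic (generated in the code) purely to illustrate the idea.

```python
# Fitting y = m*x + b to synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=x.size)   # true m = 2.5, b = 1.0, plus noise

m, b = np.polyfit(x, y, deg=1)   # least-squares estimates of slope and intercept
print(f"estimated m = {m:.2f}, b = {b:.2f}")
```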
Data and modeling
• Non-linear models are statistical models that do NOT assume a linear relationship between the dependent variable and the independent variables. They can capture more complex relationships in relatively large datasets using curves, exponentials, logarithms, or interactions.
• Non-linear models are particularly useful in fields such as physics, biology, finance, image processing, and machine learning, where the relationships between variables are often more complex and not easily captured by linear models.
• Examples of non-linear models:
• Logistic Regression: used when the dependent variable is binary or categorical. It models the relationship between the independent variables and the probability of an event occurring using a logistic function, which produces an S-shaped curve.
• Neural Networks: a class of non-linear models inspired by the structure and function of biological neurons. They consist of interconnected layers of nodes (artificial neurons) and can model complex relationships by learning from large datasets.
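For contrast with the linear case, a sketch of fitting a non-linear (exponential) relationship with SciPy's curve_fit. The model form and data are assumptions chosen for illustration.

```python
# Fitting a non-linear model y = a * exp(b*x) to synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 2, 40)
y = model(x, 1.5, 1.2) + rng.normal(0, 0.1, size=x.size)   # true a = 1.5, b = 1.2

(a_hat, b_hat), _ = curve_fit(model, x, y, p0=[1.0, 1.0])  # non-linear least squares
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```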
Data and modeling Modeling Phases: General Framework

1. Setting expectations: Setting expectations comes in the form of developing a


primary model that represents your best sense of what provides the answer to your
question. This model is chosen based on whatever information you have currently
available.

2. Collecting Information: Once the primary model is set, we want to create a set of
secondary models that challenge the primary model in some way.

3. Revising expectations: If our secondary models are successful in challenging our


primary model and put the primary model’s conclusions in some doubt, then we may
need to adjust or modify the primary model to better reflect what we have learned
from the secondary models.
Data and modeling Modeling Phases: General Framework
• After setting the primary model and secondary models, the Associational analyses are carried
out to test and verify the best outcomes.
Associational analyses are ones where we are looking at an association between two or more
features in the presence of other potentially confounding factors. There are three classes of
variables that are important to think about in an associational analysis.
1. Outcome: The outcome is the feature of your dataset that is thought to change along with
your key predictor.
2. Key predictor: Often, for associational analyses, there is one key predictor of interest (there
may be a few of them). We want to know how the outcome changes with this key predictor.
3. Potential confounders: This is a large class of predictors that are both related to the key
predictor and the outcome. If a key confounder is not available in the dataset, sometimes
there will be a proxy that is related to that key confounder that can be substituted instead.
Data and modeling Modeling Phases: General Framework
• The basic form of a model in an associational analysis will be
y = α + βx + γz + ε
where
• y is the outcome
• x is the key predictor
• z is a potential confounder
• ε is independent random error
• α is the intercept, i.e. the value of y when x = 0 and z = 0
• β is the change in y associated with a 1-unit increase in x, adjusting for z
• γ is the change in y associated with a 1-unit increase in z, adjusting for x
Note: a confounder is an external variable that is associated with both the cause and the effect. It creates a spurious association between the independent and dependent variables, making it difficult to identify true causal effects.
This is a linear model, and our primary interest is in estimating the coefficient β, which quantifies the relationship between the key predictor x and the outcome y. Even though we will have to estimate α and γ as part of the process of estimating β, we do not really care about the values of α and γ.
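A sketch of how such an associational model could be fit in Python with statsmodels, using synthetic data in which the confounder z is related to both x and y. The true coefficients in the simulation are assumptions for illustration only.

```python
# Sketch of fitting y = α + βx + γz + ε with a confounder z (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)                              # potential confounder
x = 0.8 * z + rng.normal(size=n)                    # key predictor, related to z
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)    # outcome (true β = 2.0)

X = sm.add_constant(np.column_stack([x, z]))        # columns correspond to α, β, γ
fit = sm.OLS(y, X).fit()
print(fit.params)   # second coefficient is the estimate of β, adjusted for z
```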
Data and modeling Modeling Phases: Case Study

• Online advertising campaign


• Suppose we are selling a new product on the
web and we are interested in whether buying
advertisements on Facebook helps to increase
the sales of that product.

• To start, we might initiate a 1-week pilot


advertising campaign on Facebook and gauge
the success of that campaign.
(Figure: hypothetical advertising campaign.)
• If it were successful, we might continue to buy ads for the product.
Data and modeling Modeling Phases: Case Study
• Online advertising campaign
• The hypothetical data for the plot above might look as shown in the accompanying figure.
• Your primary model is
y = α + βx + ε
where y is total daily sales and x is an indicator of whether a given day fell during the ad campaign or not.
• Unfortunately, we rarely see data like the plot above. In reality, the effect sizes tend to be smaller, the noise tends to be higher, and there tend to be other factors at play.
Data and modeling Modeling Phases: Case Study
• Online advertising campaign. Typically, the data will look something like the accompanying figure.
• Instead of our primary model, we fit the following:
y = α + βx + γ₁t + γ₂t² + ε
• where t now indicates the day number (i.e., 1, 2, …, 21).
• What we have done is add a quadratic function of t to the model to allow for some curvature in the trend (as opposed to a linear function that would only allow for a strictly increasing or decreasing pattern).
• Using this model we estimate β to be $39.86, which is somewhat less than what the primary model estimated for β.
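A sketch of how this secondary model with a quadratic time trend could be fit. The daily sales numbers and the campaign week are made up here (the course's actual data are not reproduced); only the model form matches the slide.

```python
# Fitting y = α + βx + γ1*t + γ2*t^2 + ε for the ad-campaign example (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
t = np.arange(1, 22, dtype=float)           # day number 1..21
x = ((t >= 8) & (t <= 14)).astype(float)    # assume the campaign ran during week 2
y = 200 + 40 * x + 5 * t - 0.1 * t**2 + rng.normal(0, 10, size=t.size)

X = sm.add_constant(np.column_stack([x, t, t**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # the coefficient on x approximates β, the campaign effect
```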
Data and modeling Modeling Phases: Case Study
• Online advertising campaign
• We can fit one final model, which allows for an even more flexible background trend: we use a 4th-order polynomial to represent that trend.
• Although we might find our quadratic model to be sufficiently complex, the purpose of this last model is just to push the envelope a little bit, to see how things change in more extreme circumstances.
• Formal modeling is typically the most technical aspect of data analysis. Its purpose is to precisely lay out the goal of the analysis and to provide a rigorous framework for challenging your findings and for testing assumptions.
Data and modeling Associational Analyses

• Air Pollution and Mortality in New York City


• Air pollution and mortality data for New York City. The data were originally used as
part of the National Morbidity, Mortality, and Air Pollution Study (NMMAPS).
Data and modeling Associational Analyses

• Inferring Association
• The first approach we will take will be to ask “Is
there an association between daily 24-hour
average PM10 levels and daily mortality?”

• This is an inferential question and we are


attempting to estimate an association.

• In addition, we know there are a number of


potential confounders that we will have to deal
with regarding this question.

Data and modeling Associational Analyses

• Inferring Association
• The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive.
• The surprising result comes from the opposite ways in which season is related to mortality and PM10.
Model Types
• Iconic Models: share similar attributes with the prototype.
• Analogue Models: analogous to the prototype, but differ physically.
• Symbolic / Mathematical Models: diagrams, equations, calculations, computer programmes.
Mathematical Models
• Use relationships, rules, variables, equations, and conditions.
• Examples: cash flow, population, pollution, etc.
SIMPLE MODEL
• Input variables: A, B (limits: A < 100, B < 150)
• Model: C = 4·A + 3·B; D = X + C·0.5; E = Y·logₑ(C)
• Output variables: D, E
Model
• Simple to complex.
• Can be constructed using indicators.
• Use of proxies: e.g., cell phone data linked with Covid or with traffic.
• Built for a specific purpose: general or specific application.
• Can be computer based: Excel, C, Python, R, GIS, Power BI, Tableau.
• If not calibrated: RUBBISH IN = RUBBISH OUT.
Model Accuracy
Accuracy depends on:
• The data used to build and operate the model (type, quality, quantity, interlinkages).
• The experience of the analyst.
• The quality and type of model:
  • Local/global interpolations
  • Splines
  • Co-kriging
  • Agent-based models
  • Non-agent-based models
  • Linear functions, non-linear functions, polynomials, etc.
Forecasting Models
• Forecasting models are limited in scope.
• They can be invalidated by abrupt changes (natural disasters, rampant development).
• Data needed for forecasting:
  • Information on the situation being forecast
  • Current and past records
  • A model (simple systematic or cause-effect model)
• Use of mathematical models, with linear or non-linear relations relating past records.
• Expert judgement.
Growth Models
• Exponential: the increase in some quantity is directly proportional to the current quantity. Example: population growth.
• S curve: the rate of change is proportional to the availability of resources; the growth rate decreases as resources become limited. Example: microbial culture growth.
• Logistic: used to quantify the maximum sustainable yield, i.e. the maximum rate at which individuals can be extracted/harvested without reducing the population size; this occurs at N = K/2.
• Logistic growth equation: dN/dt = r·N·(1 − N/K)
(Figure: population vs. time; the population initially increases almost exponentially, then growth slows due to resistance by the limited resource as N approaches the carrying capacity K.)
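A small sketch simulating the logistic growth equation above with a simple Euler step. The parameter values (r, K, N0, step size) are illustrative assumptions.

```python
# Euler simulation of logistic growth dN/dt = r*N*(1 - N/K) (illustrative values).
import numpy as np

r, K, N0 = 0.3, 1000.0, 10.0   # growth rate, carrying capacity, initial population
dt, steps = 0.1, 500

N = np.empty(steps)
N[0] = N0
for i in range(1, steps):
    N[i] = N[i - 1] + dt * r * N[i - 1] * (1 - N[i - 1] / K)

print(N[-1])   # approaches the carrying capacity K
```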
Sensitivity Analysis
• What if?
• Relative response of the outputs to every change in the inputs.
• If there is no relative change in the output even for large changes in an input, the model is insensitive to that input.
• Test for sensitivity: percentage change.
• Determines the effect of individual factors, and of their variation, on the overall results.
• Example: flooding and landslides as a function of land use and rainfall intensity (Wayanad).
The Problem of Over-fitting
• Data includes both patterns (stable, underlying relationships) and noise (transient, random effects).
• Noise has no predictive value, so a model is over-fit when it incorporates noise.
• The figure shows results from two predictive models, polynomial and linear, applied to the same data set.
• The polynomial model predicts sales at almost 50 times the actual value, whereas the linear model is far more realistic (and accurate).
• Be skeptical of data and skeptical of results.
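A sketch of the same idea on synthetic data: a high-degree polynomial chases the noise and extrapolates wildly, while a straight line stays reasonable. The numbers are invented for illustration.

```python
# Over-fitting illustration: degree-9 polynomial vs. straight line on noisy data.
import numpy as np

rng = np.random.default_rng(3)
x = np.arange(1, 11, dtype=float)                  # e.g., months 1..10
y = 50 + 5 * x + rng.normal(0, 8, size=x.size)     # roughly linear "sales" with noise

lin = np.polyfit(x, y, deg=1)    # linear model captures the pattern
poly = np.polyfit(x, y, deg=9)   # degree-9 polynomial also fits the noise

x_new = 15.0                     # predict a few months ahead of the data
print("linear:  ", np.polyval(lin, x_new))
print("degree-9:", np.polyval(poly, x_new))   # typically wildly off
```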
Classification
• Given data set with a set of independent (input) and dependent variables
(outcome).
• Partition into training and evaluation data set
• Choose classification technique to build a model
• Test model on evaluation data set to test predictive accuracy

(Figure: example images — "This is a sparrow"; "There are sparrows & a bear".)
Classification
• Classification is the task of assigning labels to objects.
• Many evaluation criteria; the confusion matrix is commonly used.
• Lots of classification algorithms: rule based, instance based, ensembles, regressions, …
• Different algorithms may be best in different situations.
(Diagram: Machine Learning splits into Supervised and Unsupervised. Supervised covers Classification (SVM, Naive Bayes, Nearest Neighbour, Decision trees, Neural Networks) and Regression (Linear, Ensembles); Unsupervised covers Clustering (k-means, DBSCAN).)
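A sketch of the basic classification workflow described earlier (partition into training and evaluation sets, build a model, evaluate with a confusion matrix), using scikit-learn's built-in iris data as a stand-in dataset.

```python
# Train a classifier and evaluate it with a confusion matrix (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)           # partition into training/evaluation

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))             # rows: true class, cols: predicted
print(accuracy_score(y_test, y_pred))
```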
Classification
• Object space: O = {object₁, object₂, …}; often infinite.
• Representations of the objects in a feature space: ℱ = {φ(o) : o ∈ O}.
• Set of classes: C = {class₁, …, classₙ}.
• A target concept maps objects to classes: h*: O → C.
• A hypothesis maps features to classes: h: ℱ → C, i.e. h: φ(o) → C.
• Classification is finding an approximation of the target concept: h*(o) ≈ h(φ(o)).
• Hypothesis = Classifier = Classification Model.
Example of Clustering
Clustering: The General Problem
(Diagram: objects 1…n are mapped into groups such as Class 1 and Class 2, either using object labels for training (supervised) or without labels (unsupervised clustering).)
The Formal Problem
• Object space: O = {object₁, object₂, …}; often infinite.
• Representations of the objects in a (numeric) feature space: ℱ = {φ(o) : o ∈ O}.
• How do you measure similarity?

Clustering
• Grouping of the objects
• Objects in the same group 𝑔 ∈ 𝐺 should be similar
• 𝑐: ℱ → 𝐺
Measuring Similarity: Distances
Small distance = similar.
• Euclidean distance: based on the Euclidean norm ‖x‖₂
  d(x, y) = ‖y − x‖₂ = √((y₁ − x₁)² + ⋯ + (yₙ − xₙ)²)
• Manhattan distance: based on the Manhattan norm ‖x‖₁
  d(x, y) = ‖y − x‖₁ = |y₁ − x₁| + ⋯ + |yₙ − xₙ|
• Chebyshev distance: based on the maximum norm ‖x‖∞
  d(x, y) = ‖y − x‖∞ = max_{i=1..n} |yᵢ − xᵢ|
(Figure: the norms illustrated in 2D, and a grid of Chebyshev distances from a central cell.)
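A short sketch computing the three distances above for two example feature vectors with NumPy; the vectors are arbitrary.

```python
# Euclidean, Manhattan, and Chebyshev distances between two feature vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((y - x) ** 2))   # ||y - x||_2
manhattan = np.sum(np.abs(y - x))           # ||y - x||_1
chebyshev = np.max(np.abs(y - x))           # ||y - x||_inf

print(euclidean, manhattan, chebyshev)
```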
Idea Behind k-means Clustering
• Clusters are described by their center; the centers are called centroids (centroid-based clustering).
• Objects are assigned to the closest centroid.
• How do you get the centroids?
Simple Algorithm
• Select initial centroids C₁, …, C_k (randomized).
• Assign each object to the closest centroid: c(x) = argmin_{i=1..k} d(x, Cᵢ)
• Update each centroid to the arithmetic mean of its assigned objects:
  Cᵢ = (1 / |{x : c(x) = i}|) · Σ_{x : c(x) = i} x
• Repeat the update and assignment steps until convergence, or until a maximum number of iterations is reached.
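A minimal NumPy sketch of the algorithm just described (random initial centroids, assign, update, repeat). It assumes no cluster becomes empty and is meant to illustrate the steps, not to replace a library implementation.

```python
# Minimal k-means: random initial centroids, assign to closest, update, repeat.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(max_iter):
        # assign each object to the closest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each centroid to the mean of its assigned objects
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):               # convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```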
Visualization of the k-means Algorithm
(Figure: the k-means iterations on an example dataset.)
Selecting k:
• Intuition and knowledge about the data: based on looking at plots or on domain knowledge.
• Due to the goal: a fixed number of groups is desired.
• Based on best fit: the within-sum-of-squares
  WSS = Σᵢ₌₁ᵏ Σ_{x : c(x) = i} d(x, Cᵢ)²
Results for k = 2, …, 5:
• 2, 3, and 4 are all okay → use domain knowledge to decide.
• Big changes in slope ("elbows") in the WSS curve indicate potentially good values for k.
• Splits like these (in the figure) indicate too many clusters.
Problems of 𝑘-Means
Depends on initial clusters
• Results may be unstable
Wrong 𝑘 can lead to bad results
All features must have a similar scale
• Differences in scale introduce artificial weights between features
• Large scales dominate small scales
Only works well for “round“ clusters
Regression
Statistical approach:
• independent variables: problem characteristics
• dependent variables: decision
• the general form of the relationship has to be known in advance (e.g., linear, quadratic, etc.)
Basic idea:
• Build a regression model of the probability that an object belongs to a class.
• Combine the logit function with linear regression.
Linear Regression
• y as a linear combination of x₁, …, xₙ: y = b₀ + b₁x₁ + ⋯ + bₙxₙ
The logit function
• logit(P(y = c)) = ln( P(y = c) / (1 − P(y = c)) )
Logistic Regression
• logit(P(y = c)) = b₀ + b₁x₁ + ⋯ + bₙxₙ
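A sketch of logistic regression on synthetic data, checking that the fitted model's probabilities equal the inverse logit (sigmoid) of the linear combination b₀ + b₁x₁ + ⋯ + bₙxₙ. The data-generating process is an assumption for illustration.

```python
# Logistic regression sketch: logit(P(y=c)) = b0 + b1*x1 + ... + bn*xn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)          # b0 and (b1, b2)

# predicted probability = inverse logit (sigmoid) of the linear combination
z = clf.intercept_ + X @ clf.coef_.ravel()
p = 1 / (1 + np.exp(-z))
print(np.allclose(p, clf.predict_proba(X)[:, 1]))   # True
```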
Regression: Decision Trees
Basic Idea
• Make decisions based on logical rules about features.
• Organize the rules as a tree.
(Figure: example decision tree. © Dataaspirant)
Basic Decision Tree Algorithm
Recursive algorithm:
• Stop if
  • the data is "pure", i.e. mostly from one class, or
  • the amount of data is too small, i.e. only a few instances remain in the partition.
• Otherwise
  • determine the "most informative feature" X,
  • partition the training data using X,
  • recursively create a subtree for each partition.
Details may vary depending on the specific algorithm (for example, CART, ID3, C4.5), but the general concept is always the same.
Information theory based approach
• Entropy of the class label: H(C) = − Σ_{c∈C} p(c) log p(c)
  • Can be used as a measure of purity.
• Conditional entropy of the class label given feature X: H(C|X) = − Σ_{x∈X} p(x) Σ_{c∈C} p(c|x) log p(c|x)
• Mutual information: I(C, X) = H(C) − H(C|X)
  • Interpret each feature (dimension) as a random variable.
→ The feature with the highest mutual information is the most informative.

Data Warehousing OLAP and OLTP Technology
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”—W. H.
Inmon

• Data warehousing is the process of constructing and using data warehouses, following an update-driven approach.

• Constructed by integrating multiple, heterogeneous data sources, relational


databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied, ensuring consistency in
naming conventions, encoding structures, attribute measures, etc. among different
data sources.
Data Warehousing OLAP and OLTP Technology
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• Four views regarding the design of a data warehouse: top-down view, data source view, data warehouse view, and business query view.
• Example: sales volume as a function of product, month, and region. Dimensions: Product, Location, Time.
• Hierarchical summarization paths:
  • Product → Category → Industry
  • Office → City → Country → Region
  • Day → Week / Month → Quarter → Year
(Figure: multidimensional data cube of sales.)
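A loose illustration of cube-style summarization (not a real OLAP engine): a pandas pivot table rolling up made-up sales records along the product and month dimensions, with totals.

```python
# Rolling up sales by product and month with a pivot table (made-up data).
import pandas as pd

sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "PC"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "amount":  [400, 300, 250, 500],
})

cube = sales.pivot_table(values="amount", index="product",
                         columns="month", aggfunc="sum", margins=True)
print(cube)   # sales volume by product and month, with row/column totals
```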
Data Warehousing OLAP and OLTP Technology
OLTP (on-line transaction processing)
• Online transactional processing (OLTP) enables the real-time execution of large numbers of database
transactions by large numbers of people, typically over the Internet.

• OLTP systems are behind many of our everyday transactions, from ATMs to in-store purchases to
hotel reservations. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.

• OLTP can also drive non-financial transactions, including password changes and text messages.

• Major task of traditional relational DBMS

• Compared with OLAP, OLTP is optimized for processing a massive number of transactions.

• OLTP systems are designed for use by frontline workers (e.g., cashiers, bank tellers, hotel desk
clerks) or for customer self-service applications (e.g., online banking, e-commerce, travel
reservations).
Data Warehousing OLAP and OLTP Technology
OLAP (on-line analytical processing)
• Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high
speeds on large volumes of data.
• Major task of data warehouse system, data mart or some other centralized data store.
• Data analysis and decision making
• OLAP is ideal for data mining, business intelligence and complex analytical calculations, as well as
business reporting functions like financial analysis, budgeting and sales forecasting.
Data Warehousing OLAP and OLTP Technology
OLAP is optimized for conducting complex data analysis for smarter decision-making.
OLAP systems are designed for use by data scientists, business analysts and knowledge workers, and they support business intelligence (BI), data
mining and other decision support applications.
• Multidimensional OLAP (MOLAP)
• Array-based storage structures
• Direct access to array data structures
• Example: Essbase (Arbor)
• Relational OLAP (ROLAP)
• Relational and Specialized Relational DBMS to store and manage warehouse data
• OLAP middleware to support missing pieces
• Optimize for each DBMS backend
• Aggregation Navigation Logic
• Additional tools and services
• Example: Microstrategy, MetaCube (Informix)

Distinct features (OLTP vs. OLAP):


• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries
Data Warehousing OLAP and OLTP Technology
OLTP vs. OLAP:
• Users. OLTP: clerk, IT professional. OLAP: knowledge worker.
• Function. OLTP: day-to-day operations. OLAP: decision support.
• DB design. OLTP: application-oriented. OLAP: subject-oriented.
• Data. OLTP: current, up-to-date, detailed, flat relational, isolated. OLAP: historical, summarized, multidimensional, integrated, consolidated.
• Usage. OLTP: repetitive. OLAP: ad-hoc.
• Access. OLTP: read/write, index/hash on primary key. OLAP: lots of scans.
• Unit of work. OLTP: short, simple transaction. OLAP: complex query.
• # records accessed. OLTP: tens. OLAP: millions.
• # users. OLTP: thousands. OLAP: hundreds.
• DB size. OLTP: 100 MB to GB. OLAP: 100 GB to TB.
• Metric. OLTP: transaction throughput. OLAP: query throughput, response time.
Online Analytical Mining (OLAM)
Why online analytical mining?
• High quality of data in data warehouses: the DW contains integrated, consistent, cleaned data.
• Available information-processing infrastructure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools.
• OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
• On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks.