Unit 2
Dr Setturu Bharath
SoE, Chanakya University
It will also surface any practical obstacles in actually calculating the evaluation metric.
Knowing how well the baseline does on the evaluation metric gives a quick ballpark
idea of how much benefit to expect from the project.
Experienced practitioners know all too well that common-sense baselines are often
hard to beat. And even when data science models do beat these baselines, they may do so
only by slim margins.
Defining the Data Team
Business leaders should take extraordinary care in choosing the team and defining the
problem they want their data science teams to solve.
Data scientists, especially new ones, often want to get going with preparing data and
building models.
At least initially, they may not have the confidence to question a senior business
executive, especially if that individual is the project sponsor.
Pay more attention to what they are solving and less attention to how they are solving it,
as there are usually many different ways to solve any data science problem.
Roles in a Data Science Team
• The genesis of a project need not only come from the business, but it should be tailored to a specific business problem.
• Data is rarely collected with future modeling projects in mind.
• Understanding what data is available, where it's located, and the trade-offs between ease of availability and cost of acquisition can impact the overall design of solutions.
[Figure: roles of the three groups. Business Management initiates projects, understands business needs, invests in models, and communicates effectively with all; IT & Data Teams drive data access and discovery, provide key tools and infrastructure, operationalize and maintain, and build and support end-use applications; the Data Science team actively engages and is responsible for execution (results, reports, deliverables) and verification.]
Roles in a Data Science Team
• Teams often need to loop back if they discover a new issue in data availability.
• The team should be multidisciplinary, consisting of:
  • Lead data scientist
  • Validator
  • Data engineer
Leek, J. (2015). The Elements of Data Analytic Style: A Guide for People Who Want to Analyze Data. Leanpub.
Data and modeling
• Data is a crucial part of any data science
project/analysis.
• A large volume of data poses new challenges, such as
overloaded memory and algorithms that never stop
running.
• It forces you to adapt and expand your repertoire of
techniques. But even when you can perform your
analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can
cause speed issues.
• Never-ending algorithms, out-of-memory errors, and
speed issues are the most common challenges you
face when working with large data.
• The solutions can be divided into three categories:
using the correct algorithms, choosing the right data
structure, and using the right tools.
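As an illustration of "using the right tools", here is a minimal sketch of chunked processing with pandas, which avoids loading the whole file into memory at once. The file name large_file.csv and the column value are hypothetical placeholders.

```python
import pandas as pd

# Stream a file that does not fit in memory in fixed-size chunks.
# "large_file.csv" and the "value" column are hypothetical placeholders.
total, count = 0.0, 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean of 'value':", total / count)
```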
Data and modeling Exploratory Data Analysis (EDA)
• Exploratory data analysis is the process of exploring your data, and it typically includes
examining the structure and components of your dataset, the distributions of individual
variables, and the relationships between two or more variables.
• Data visualization is arguably the most important tool for exploratory data analysis because the
information conveyed by graphical display can be very quickly absorbed and because it is
generally easy to recognize patterns in a graphical display.
2. Determine whether the question you are asking can be answered by the data that you have.
3. Check the packaging
   • Correcting data types
   • Removing duplicates
   • Identifying outliers and trends
4. Look at the top and the bottom of your data
   • Descriptive statistics (mean, median, mode, etc.)
6. Validate with at least one external data source
7. Make a plot
   • Histograms, box plots, scatter plots, bar charts, etc.
8. Try the easy solution first
   • Applying statistical tests to validate assumptions
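A minimal sketch of a few of these checklist steps in pandas; data.csv and the value column are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")   # hypothetical dataset

# 3. Check the packaging: shape, dtypes, duplicates
print(df.shape)
print(df.dtypes)
df = df.drop_duplicates()

# 4. Look at the top and the bottom of your data
print(df.head())
print(df.tail())
print(df.describe())           # descriptive statistics (mean, 50% = median, etc.)

# 7. Make a plot (histogram of a hypothetical numeric column)
df["value"].hist(bins=30)
plt.xlabel("value")
plt.show()
```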
MODEL:
• Basic tool in science, economics, engineering, business, planning, etc.
• Representation of real-life scenarios (problems/situations).
• Copies attributes from the real world to a prototype.
• Applied to situations or systems (existing or non-existing).
• Enables visualization of various scenarios and derivation of optimum solutions (by varying input parameters).
• Can be logical, mathematical, or spatial, using assumptions, theories, or similar situations/operations.
Data and modeling
• A model provides a description of how the world works and how the data were generated.
• Identifying the type of data distribution is a crucial step; one needs a clear picture of linear vs. non-linear relationships.
Data and modeling
• Perhaps the most popular statistical model in the
world is the Normal model. This model says that
the randomness in a set of data can be explained
by the Normal distribution, or a bell-shaped curve.
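A minimal sketch of fitting the Normal model: with simulated measurements (an illustrative assumption), fitting amounts to estimating the distribution's two parameters, the mean and the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=1000)   # simulated measurements

# Fitting the Normal model = estimating its two parameters.
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)
print(f"fitted Normal: mean={mu_hat:.2f}, sd={sigma_hat:.2f}")
```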
2. Collecting Information: Once the primary model is set, we want to create a set of
secondary models that challenge the primary model in some way.
y = α + βx + γz + ε

where
• y is the outcome
• x is the key predictor
• z is a potential confounder
• ε is independent random error
• α is the intercept, i.e., the value of y when x = 0 and z = 0
• β is the change in y associated with a 1-unit increase in x, adjusting for z
• γ is the change in y associated with a 1-unit increase in z, adjusting for x

A confounder is an external variable that is associated with both the cause and the effect. It creates a spurious association between the independent and dependent variables, making it difficult to identify true causal effects.

This is a linear model, and our primary interest is in estimating the coefficient β, which quantifies the relationship between the key predictor x and the outcome y. Even though we will have to estimate α and γ as part of the process of estimating β, we do not really care about the values of α and γ.
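A minimal sketch of estimating β while adjusting for the confounder z, using ordinary least squares from statsmodels on simulated data; all the numeric values below are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)                  # potential confounder
x = 0.8 * z + rng.normal(size=n)        # key predictor, associated with z
y = 2.0 + 1.5 * x + 1.0 * z + rng.normal(size=n)   # true beta = 1.5

# Estimate beta while adjusting for the confounder z.
X = sm.add_constant(np.column_stack([x, z]))   # columns: alpha, beta, gamma
model = sm.OLS(y, X).fit()
print(model.params)   # approximately [2.0, 1.5, 1.0]
```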
Data and modeling: Modeling Phases, Case Study
• Inferring Association
• The first approach we will take will be to ask: "Is there an association between daily 24-hour average PM10 levels and daily mortality?"
Data and modeling: Associational Analyses
• Inferring Association
• The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive.
Mathematical Models
• Built using relationships, rules, variables, equations, and conditions.
• Examples: cash flow, population, pollution, etc.

SIMPLE MODEL
Input variables: A, B (limits: A < 100, B < 150)
MODEL:
C = 4A + 3B
D = X + 0.5C
E = Y · ln(C)
Output variables: D, E
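A minimal sketch of the simple model above in Python; X and Y are treated as additional inputs, as on the slide.

```python
import math

def simple_model(a, b, x, y):
    """Evaluate the slide's simple model for given inputs."""
    assert a < 100 and b < 150, "input limits: A < 100, B < 150"
    c = 4 * a + 3 * b
    d = x + 0.5 * c
    e = y * math.log(c)    # natural logarithm, log_e
    return d, e

print(simple_model(a=10, b=20, x=5, y=2))
```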
Model
• Simple to complex.
• Can be constructed using indicators.
• Use of proxies, e.g., cell-phone data linked with Covid spread or traffic.
• If not calibrated: RUBBISH IN = RUBBISH OUT.

Model Accuracy depends on:
• Data used to build and operate the model (type, quality, quantity, interlinkages)
• Experience of the analyst
Forecasting Models
Forecasting models are limited in scope.
Growth Models
• Exponential vs. S-curve (logistic) growth.
• Population increases exponentially as the population grows and resources are utilised, until growth is slowed by resistance from limited resources.
[Figure: population vs. time for exponential and logistic growth; the logistic curve inflects at N = K/2.]
• Logistic growth: dN/dt = r · N · (1 − N/K), where K is the carrying capacity.
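A minimal sketch of simulating logistic growth by Euler integration of dN/dt = r·N·(1 − N/K); the values of r, K, the initial population, and the step size are illustrative assumptions.

```python
# Euler integration of dN/dt = r * N * (1 - N/K)
r, K = 0.5, 1000.0      # growth rate and carrying capacity (assumed values)
N, dt = 10.0, 0.1       # initial population and time step (assumed values)
trajectory = []
for _ in range(200):
    N += r * N * (1 - N / K) * dt
    trajectory.append(N)

print(f"population after {200 * dt:.0f} time units: {trajectory[-1]:.0f}")
```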
Sensitivity Analysis
• What if? Vary the inputs and observe how the outputs respond (a sketch follows below).
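A minimal "what if" sweep, reusing the hypothetical simple_model sketch from the Mathematical Models slide: vary A while holding the other inputs fixed.

```python
# Vary one input of the simple model while holding the others fixed,
# and observe how the outputs respond ("what if A changes?").
for a in [10, 30, 50, 70, 90]:
    d, e = simple_model(a=a, b=20, x=5, y=2)
    print(f"A={a:2d} -> D={d:7.1f}, E={e:6.2f}")
```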
The Problem of Noise
• Data includes both patterns (stable, underlying relationships) and noise (transient, random effects).
Classification
• Given a data set with a set of independent (input) variables and a dependent variable (outcome).
• Partition into training and evaluation data sets.
• Choose a classification technique to build a model.
• Test the model on the evaluation data set to assess predictive accuracy.
• Different algorithms may be best in different situations: Naive Bayes, Nearest Neighbour, Decision Trees, Neural Networks, Ensembles, DBSCAN, ...
[Figure: example classification output, "This is a Sparrow".]
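A minimal end-to-end sketch of this workflow with scikit-learn, using the bundled iris dataset and a decision tree as one possible technique.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Partition into training and evaluation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Choose a classification technique and build a model.
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Test the model on the evaluation set.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```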
Classification
Object space
• O = {object₁, object₂, …}
• Often infinite
Representations of the objects in a feature space
• ℱ = {φ(o), o ∈ O}
Set of classes
• C = {class₁, …, classₙ}
A target concept that maps objects to classes
• h*: O → C
A hypothesis maps features to classes
• h: ℱ → C, i.e., h: φ(o) → C
Classification: finding an approximation of the target concept, h*(o) ≈ h(φ(o))
Hypothesis = Classifier = Classification Model
Clustering: The General Problem
[Figure: example of clustering. Objects 1…n on the left are grouped into Class 1 and Class 2 on the right. With object labels (training) the task is supervised; without labels it is unsupervised (clustering).]
The Formal Problem
Object space
• O = {object₁, object₂, …}
• Often infinite
Representations of the objects in a (numeric) feature space
• ℱ = {φ(o), o ∈ O}
Clustering
• Grouping of the objects
• Objects in the same group g ∈ G should be similar
• c: ℱ → G
How do you measure similarity?
Measuring Similarity: Distances
Small distance = similar

Euclidean Distance
• Based on the Euclidean norm ‖x‖₂
• d(x, y) = ‖y − x‖₂ = √((y₁ − x₁)² + ⋯ + (yₙ − xₙ)²)

Manhattan Distance
• Based on the Manhattan norm ‖x‖₁
• d(x, y) = ‖y − x‖₁ = |y₁ − x₁| + ⋯ + |yₙ − xₙ|

Chebyshev Distance
• Based on the maximum norm ‖x‖∞
• d(x, y) = ‖y − x‖∞ = max_{i=1..n} |yᵢ − xᵢ|

[Figure: grid of Chebyshev distances around a center cell; all cells in the same square ring have equal distance.]
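A minimal sketch computing the three distances with NumPy for two example vectors (the vector values are illustrative).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(y - x)       # L2 norm: sqrt(9 + 4 + 0)
manhattan = np.abs(y - x).sum()         # L1 norm: 3 + 2 + 0
chebyshev = np.abs(y - x).max()         # max norm: max(3, 2, 0)

print(euclidean, manhattan, chebyshev)  # 3.605..., 5.0, 3.0
```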
Idea Behind k-means Clustering
Clusters are described by their center
• The centers are called centroids
• Centroid-based clustering
How do you get the centroids?
Update centroid
• Arithmetic mean of the assigned objects
• Cᵢ = (1 / |{x : c(x) = i}|) · Σ_{x : c(x) = i} x

Selecting k:
Intuition and knowledge about the data
• Based on looking at plots
• Based on domain knowledge
Due to the goal
• A fixed number of groups is desired
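A minimal k-means sketch with scikit-learn on synthetic 2-D data; the fitted cluster_centers_ are the arithmetic means of the points assigned to each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic groups of 2-D points around (0, 0) and (5, 5)
points = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("centroids:\n", km.cluster_centers_)  # arithmetic means of the groups
```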
Linear Regression
• y as a linear combination of x₁, …, xₙ
• y = b₀ + b₁x₁ + ⋯ + bₙxₙ

The logit function
• logit(P(y = c)) = ln( P(y = c) / (1 − P(y = c)) )

Logistic Regression
• logit(P(y = c)) = b₀ + b₁x₁ + ⋯ + bₙxₙ
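A minimal logistic regression sketch with scikit-learn, using the bundled breast cancer dataset; the model fits b₀…bₙ on the logit scale, and predict_proba returns P(y = c).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Coefficients b0..bn live on the logit scale of P(y = 1).
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("P(y = 1) for first test case:", clf.predict_proba(X_test[:1])[0, 1])
```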
Regression: Decision Trees
Basic Idea
• Make decisions based on logical rules about features
• Organize rules as a tree
[Figure © Dataaspirant]

Basic Decision Tree Algorithm
Recursive algorithm
• Stop if
  • Data is "pure", i.e., mostly from one class
  • The amount of data is too small, i.e., only a few instances in the partition
• Otherwise
  • Determine the "most informative feature" X
  • Partition the training data using X
  • Recursively create a subtree for each partition
Mutual Information
• I(C, X) = H(C) − H(C | X)
• Interpret each dimension (feature) as a random variable
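A minimal from-scratch sketch computing I(C, X) = H(C) − H(C | X) for a categorical feature; the toy data at the bottom is an illustrative assumption.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(C) = -sum p * log2(p) over class frequencies."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(classes, feature):
    """I(C, X) = H(C) - H(C | X) for a categorical feature X."""
    classes = np.asarray(classes)
    feature = np.asarray(feature)
    h_c_given_x = 0.0
    for value in np.unique(feature):
        mask = feature == value
        h_c_given_x += mask.mean() * entropy(classes[mask])
    return entropy(classes) - h_c_given_x

# Toy data: the feature perfectly predicts the class, so I(C, X) = H(C) = 1 bit.
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```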
• Data warehousing is the process of constructing and using data warehouses, typically following an update-driven approach.
• Data cleaning and data integration techniques are applied, ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Data Warehousing, OLAP and OLTP Technology
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• Four views regarding the design of a data warehouse: top-down view, data source view, data warehouse view, and business query view.

Example: sales volume as a function of product, month, and region.
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  • Product: Industry → Category → Product
  • Location: Region → Country → City → Office
  • Time: Year → Quarter → Month/Week → Day
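A minimal sketch of cube-style aggregation with pandas on a toy fact table; all names and values below are illustrative assumptions.

```python
import pandas as pd

# Toy fact table: sales volume by product, city, and month (assumed data).
sales = pd.DataFrame({
    "product": ["TV", "TV", "Radio", "Radio"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "volume":  [100, 150, 80, 120],
})

# A small "data cube" view: sales volume along two dimensions.
cube = sales.pivot_table(values="volume", index="product",
                         columns="month", aggfunc="sum", fill_value=0)
print(cube)

# Roll-up: summarize away the Location and Time dimensions entirely.
print(sales.groupby("product")["volume"].sum())
```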
Data Warehousing, OLAP and OLTP Technology
OLTP (on-line transaction processing)
• Online transactional processing (OLTP) enables the real-time execution of large numbers of database
transactions by large numbers of people, typically over the Internet.
• OLTP systems are behind many of our everyday transactions, from ATMs to in-store purchases to
hotel reservations. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.
• OLTP can also drive non-financial transactions, including password changes and text messages.
• OLTP is optimized for processing a massive number of transactions rather than for complex analysis.
• OLTP systems are designed for use by frontline workers (e.g., cashiers, bank tellers, hotel desk
clerks) or for customer self-service applications (e.g., online banking, e-commerce, travel
reservations).
Data Warehousing, OLAP and OLTP Technology
OLAP (on-line analytical processing)
• Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high
speeds on large volumes of data.
• A major task of data warehouse systems, data marts, and other centralized data stores.
• Supports data analysis and decision making.
• OLAP is ideal for data mining, business intelligence and complex analytical calculations, as well as
business reporting functions like financial analysis, budgeting and sales forecasting.
Data Warehousing, OLAP and OLTP Technology
OLAP is optimized for conducting complex data analysis for smarter decision-making.
OLAP systems are designed for use by data scientists, business analysts and knowledge workers, and they support business intelligence (BI), data
mining and other decision support applications.
• Multidimensional OLAP (MOLAP)
• Array-based storage structures
• Direct access to array data structures
• Example: Essbase (Arbor)
• Relational OLAP (ROLAP)
• Relational and Specialized Relational DBMS to store and manage warehouse data
• OLAP middleware to support missing pieces
• Optimize for each DBMS backend
• Aggregation Navigation Logic
• Additional tools and services
• Example: Microstrategy, MetaCube (Informix)