Unit 2
Dr Setturu Bharath
SoE, Chanakya University
It will also surface any practical obstacles in actually calculating the evaluation metric.
Knowing how well the baseline does on the evaluation metric gives a quick ballpark
idea of how much benefit to expect from the project.
Experienced practitioners know all too well that common-sense baselines are often
hard to beat. And even when data science models do beat these baselines, they may do so
only by slim margins.
Defining the Data Team
Business leaders should take extraordinary care in choosing the team and defining the
problem they want their data science teams to solve.
Data scientists, especially new ones, often want to get going with preparing data and
building models.
At least initially, they may not have the confidence to question a senior business
executive, especially if that individual is the project sponsor.
Pay more attention to what they are solving and less attention to how they are solving it,
as there are usually many different ways to solve any data science problem.
Roles in a Data Science Team
• The genesis of a project need not only come from the business, but it should be tailored to a specific business problem.
• Data is rarely collected with future modeling projects in mind.
• Understanding what data is available, where it's located, and the trade-offs between ease of availability and cost of acquisition can impact the overall design of solutions.
[Figure: roles of the three groups. Business Management initiates projects, understands business needs, invests in models, and communicates effectively with all; IT & Data Teams drive data access and discovery, provide key tools and infrastructure, operationalize and maintain, and build and support end-use applications; the Data Science team actively engages and is responsible for execution (results, reports, deliverables) and verification.]
Roles in a Data Science Team
• Teams often need to loop back if they discover a new issue in data availability.
• The team should be multidisciplinary, consisting of:
  • Lead data scientist
  • Validator
  • Data engineer
Leek, J. (2015). The Elements of Data Analytic Style: A Guide for People Who Want to Analyze Data. Leanpub.
Data and modeling
• Data is a crucial part of any data science
project/analysis.
• A large volume of data poses new challenges, such as
overloaded memory and algorithms that never stop
running.
• It forces you to adapt and expand your repertoire of
techniques. But even when you can perform your
analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can
cause speed issues.
• Never-ending algorithms, out-of-memory errors, and
speed issues are the most common challenges you
face when working with large data.
• The solutions can be divided into three categories:
using the correct algorithms, choosing the right data
structure, and using the right tools.
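As an illustration of "using the right tools", here is a minimal sketch of chunked processing with pandas, which avoids loading the whole file into memory at once. The file name large_file.csv and the column value are hypothetical placeholders.

```python
import pandas as pd

# Stream a file that does not fit in memory in fixed-size chunks.
# "large_file.csv" and the "value" column are hypothetical placeholders.
total, count = 0.0, 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean of 'value':", total / count)
```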
Data and modeling Exploratory Data Analysis (EDA)
• Exploratory data analysis is the process of exploring your data, and it typically includes
examining the structure and components of your dataset, the distributions of individual
variables, and the relationships between two or more variables.
• Data visualization is arguably the most important tool for exploratory data analysis because the
information conveyed by graphical display can be very quickly absorbed and because it is
generally easy to recognize patterns in a graphical display.
2. Determine whether the question you are asking can be answered by the data that you have.
3. Check the packaging
   • Correcting data types
   • Removing duplicates
   • Identifying outliers and trends
4. Look at the top and the bottom of your data
   • Descriptive statistics (mean, median, mode, etc.)
6. Validate with at least one external data source
7. Make a plot
   • Histograms, box plots, scatter plots, bar charts, etc.
8. Try the easy solution first
   • Applying statistical tests to validate assumptions
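A minimal sketch of a few of these checklist steps in pandas; data.csv and the value column are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")   # hypothetical dataset

# 3. Check the packaging: shape, dtypes, duplicates
print(df.shape)
print(df.dtypes)
df = df.drop_duplicates()

# 4. Look at the top and the bottom of your data
print(df.head())
print(df.tail())
print(df.describe())           # descriptive statistics (mean, 50% = median, etc.)

# 7. Make a plot (histogram of a hypothetical numeric column)
df["value"].hist(bins=30)
plt.xlabel("value")
plt.show()
```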
MODEL:
• Basic tool in science, economics, engineering, business, planning, etc.
• Representation of real-life scenarios (problems/situations).
• Copies attributes from the real world to a prototype.
• Applied to situations or systems (existing or non-existing).
• Enables visualization of various scenarios and derivation of optimum solutions (by varying input parameters).
• Can be logical, mathematical, or spatial, using assumptions, theories, or similar situations/operations.
Data and modeling
• A model provides a description of how the world works and how the data were generated.
• Identifying the type of data distribution is a crucial step; one needs a clear picture of linear vs. non-linear relationships.
Data and modeling
• Perhaps the most popular statistical model in the
world is the Normal model. This model says that
the randomness in a set of data can be explained
by the Normal distribution, or a bell-shaped curve.
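A minimal sketch of fitting the Normal model: with simulated measurements (an illustrative assumption), fitting amounts to estimating the distribution's two parameters, the mean and the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=1000)   # simulated measurements

# Fitting the Normal model = estimating its two parameters.
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)
print(f"fitted Normal: mean={mu_hat:.2f}, sd={sigma_hat:.2f}")
```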
2. Collecting Information: Once the primary model is set, we want to create a set of
secondary models that challenge the primary model in some way.
y = α + βx + γz + ε

where
• y is the outcome
• x is the key predictor
• z is a potential confounder
• ε is independent random error
• α is the intercept, i.e., the value of y when x = 0 and z = 0
• β is the change in y associated with a 1-unit increase in x, adjusting for z
• γ is the change in y associated with a 1-unit increase in z, adjusting for x

A confounder is an external variable that is associated with both the cause and the effect. It creates a spurious association between the independent and dependent variables, making it difficult to identify true causal effects.

This is a linear model, and our primary interest is in estimating the coefficient β, which quantifies the relationship between the key predictor x and the outcome y. Even though we will have to estimate α and γ as part of the process of estimating β, we do not really care about the values of α and γ.
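A minimal sketch of estimating β while adjusting for the confounder z, using ordinary least squares from statsmodels on simulated data; all the numeric values below are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)                  # potential confounder
x = 0.8 * z + rng.normal(size=n)        # key predictor, associated with z
y = 2.0 + 1.5 * x + 1.0 * z + rng.normal(size=n)   # true beta = 1.5

# Estimate beta while adjusting for the confounder z.
X = sm.add_constant(np.column_stack([x, z]))   # columns: alpha, beta, gamma
model = sm.OLS(y, X).fit()
print(model.params)   # approximately [2.0, 1.5, 1.0]
```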
Data and modeling: Modeling Phases, Case Study
• Inferring Association
• The first approach we will take will be to ask: "Is there an association between daily 24-hour average PM10 levels and daily mortality?"
Data and modeling: Associational Analyses
• Inferring Association
• The overall relationship between PM10 and mortality is null, but when we account for the seasonal variation in both mortality and PM10, the association is positive.
Mathematical Models
• Built using relationships, rules, variables, equations, and conditions.
• Examples: cash flow, population, pollution, etc.

SIMPLE MODEL
Input variables: A, B (limits: A < 100, B < 150)
MODEL:
C = 4A + 3B
D = X + 0.5C
E = Y · ln(C)
Output variables: D, E
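A minimal sketch of the simple model above in Python; X and Y are treated as additional inputs, as on the slide.

```python
import math

def simple_model(a, b, x, y):
    """Evaluate the slide's simple model for given inputs."""
    assert a < 100 and b < 150, "input limits: A < 100, B < 150"
    c = 4 * a + 3 * b
    d = x + 0.5 * c
    e = y * math.log(c)    # natural logarithm, log_e
    return d, e

print(simple_model(a=10, b=20, x=5, y=2))
```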
Model
• Simple to complex.
• Can be constructed using indicators.
• Use of proxies, e.g., cell-phone data linked with Covid spread or traffic.
• If not calibrated: RUBBISH IN = RUBBISH OUT.

Model Accuracy depends on:
• Data used to build and operate the model (type, quality, quantity, interlinkages)
• Experience of the analyst
Forecasting Models
Forecasting models are limited in scope.
Growth Models
• Exponential vs. S-curve (logistic) growth.
• Population increases exponentially as the population grows and resources are utilised, until growth is slowed by resistance from limited resources.
[Figure: population vs. time for exponential and logistic growth; the logistic curve inflects at N = K/2.]
• Logistic growth: dN/dt = r · N · (1 − N/K), where K is the carrying capacity.
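A minimal sketch of simulating logistic growth by Euler integration of dN/dt = r·N·(1 − N/K); the values of r, K, the initial population, and the step size are illustrative assumptions.

```python
# Euler integration of dN/dt = r * N * (1 - N/K)
r, K = 0.5, 1000.0      # growth rate and carrying capacity (assumed values)
N, dt = 10.0, 0.1       # initial population and time step (assumed values)
trajectory = []
for _ in range(200):
    N += r * N * (1 - N / K) * dt
    trajectory.append(N)

print(f"population after {200 * dt:.0f} time units: {trajectory[-1]:.0f}")
```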
Sensitivity Analysis
• What if? Vary the inputs and observe how the outputs respond (a sketch follows below).
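A minimal "what if" sweep, reusing the hypothetical simple_model sketch from the Mathematical Models slide: vary A while holding the other inputs fixed.

```python
# Vary one input of the simple model while holding the others fixed,
# and observe how the outputs respond ("what if A changes?").
for a in [10, 30, 50, 70, 90]:
    d, e = simple_model(a=a, b=20, x=5, y=2)
    print(f"A={a:2d} -> D={d:7.1f}, E={e:6.2f}")
```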
The Problem of Noise
• Data includes both patterns (stable, underlying relationships) and noise (transient, random effects).
Classification
• Given a data set with a set of independent (input) variables and a dependent variable (outcome).
• Partition into training and evaluation data sets.
• Choose a classification technique to build a model.
• Test the model on the evaluation data set to assess predictive accuracy.
• Different algorithms may be best in different situations: Naive Bayes, Nearest Neighbour, Decision Trees, Neural Networks, Ensembles, DBSCAN, ...
[Figure: example classification output, "This is a Sparrow".]
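A minimal end-to-end sketch of this workflow with scikit-learn, using the bundled iris dataset and a decision tree as one possible technique.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Partition into training and evaluation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Choose a classification technique and build a model.
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Test the model on the evaluation set.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```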
Classification
Object space
• O = {object₁, object₂, …}
• Often infinite
Representations of the objects in a feature space
• ℱ = {φ(o), o ∈ O}
Set of classes
• C = {class₁, …, classₙ}
A target concept that maps objects to classes
• h*: O → C
A hypothesis maps features to classes
• h: ℱ → C, i.e., h: φ(o) → C
Classification: finding an approximation of the target concept, h*(o) ≈ h(φ(o))
Hypothesis = Classifier = Classification Model
Clustering: The General Problem
[Figure: example of clustering. Objects 1…n on the left are grouped into Class 1 and Class 2 on the right. With object labels (training) the task is supervised; without labels it is unsupervised (clustering).]
The Formal Problem
Object space
• O = {object₁, object₂, …}
• Often infinite
Representations of the objects in a (numeric) feature space
• ℱ = {φ(o), o ∈ O}
Clustering
• Grouping of the objects
• Objects in the same group g ∈ G should be similar
• c: ℱ → G
How do you measure similarity?
Measuring Similarity: Distances
Small distance = similar

Euclidean Distance
• Based on the Euclidean norm ‖x‖₂
• d(x, y) = ‖y − x‖₂ = √((y₁ − x₁)² + ⋯ + (yₙ − xₙ)²)

Manhattan Distance
• Based on the Manhattan norm ‖x‖₁
• d(x, y) = ‖y − x‖₁ = |y₁ − x₁| + ⋯ + |yₙ − xₙ|

Chebyshev Distance
• Based on the maximum norm ‖x‖∞
• d(x, y) = ‖y − x‖∞ = max_{i=1..n} |yᵢ − xᵢ|

[Figure: grid of Chebyshev distances around a center cell; all cells in the same square ring have equal distance.]
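A minimal sketch computing the three distances with NumPy for two example vectors (the vector values are illustrative).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(y - x)       # L2 norm: sqrt(9 + 4 + 0)
manhattan = np.abs(y - x).sum()         # L1 norm: 3 + 2 + 0
chebyshev = np.abs(y - x).max()         # max norm: max(3, 2, 0)

print(euclidean, manhattan, chebyshev)  # 3.605..., 5.0, 3.0
```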
Idea Behind k-means Clustering
Clusters are described by their center
• The centers are called centroids
• Centroid-based clustering
How do you get the centroids?
Update centroid
• Arithmetic mean of the assigned objects
• Cᵢ = (1 / |{x : c(x) = i}|) · Σ_{x : c(x) = i} x

Selecting k:
Intuition and knowledge about the data
• Based on looking at plots
• Based on domain knowledge
Due to the goal
• A fixed number of groups is desired
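A minimal k-means sketch with scikit-learn on synthetic 2-D data; the fitted cluster_centers_ are the arithmetic means of the points assigned to each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic groups of 2-D points around (0, 0) and (5, 5)
points = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("centroids:\n", km.cluster_centers_)  # arithmetic means of the groups
```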
Linear Regression
• y as a linear combination of x₁, …, xₙ
• y = b₀ + b₁x₁ + ⋯ + bₙxₙ

The logit function
• logit(P(y = c)) = ln( P(y = c) / (1 − P(y = c)) )

Logistic Regression
• logit(P(y = c)) = b₀ + b₁x₁ + ⋯ + bₙxₙ
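A minimal logistic regression sketch with scikit-learn, using the bundled breast cancer dataset; the model fits b₀…bₙ on the logit scale, and predict_proba returns P(y = c).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Coefficients b0..bn live on the logit scale of P(y = 1).
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("P(y = 1) for first test case:", clf.predict_proba(X_test[:1])[0, 1])
```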
Regression: Decision Trees
Basic Idea
• Make decisions based on logical rules about features
• Organize rules as a tree
[Figure © Dataaspirant]

Basic Decision Tree Algorithm
Recursive algorithm
• Stop if
  • Data is "pure", i.e., mostly from one class
  • The amount of data is too small, i.e., only a few instances in the partition
• Otherwise
  • Determine the "most informative feature" X
  • Partition the training data using X
  • Recursively create a subtree for each partition
Mutual Information
• I(C, X) = H(C) − H(C | X)
• Interpret each dimension (feature) as a random variable
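A minimal from-scratch sketch computing I(C, X) = H(C) − H(C | X) for a categorical feature; the toy data at the bottom is an illustrative assumption.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(C) = -sum p * log2(p) over class frequencies."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(classes, feature):
    """I(C, X) = H(C) - H(C | X) for a categorical feature X."""
    classes = np.asarray(classes)
    feature = np.asarray(feature)
    h_c_given_x = 0.0
    for value in np.unique(feature):
        mask = feature == value
        h_c_given_x += mask.mean() * entropy(classes[mask])
    return entropy(classes) - h_c_given_x

# Toy data: the feature perfectly predicts the class, so I(C, X) = H(C) = 1 bit.
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```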
• Data warehousing is the process of constructing and using data warehouses, typically following an update-driven approach.
• Data cleaning and data integration techniques are applied, ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Data Warehousing, OLAP and OLTP Technology
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• Four views regarding the design of a data warehouse: top-down view, data source view, data warehouse view, and business query view.

Example: sales volume as a function of product, month, and region.
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  • Product: Industry → Category → Product
  • Location: Region → Country → City → Office
  • Time: Year → Quarter → Month/Week → Day
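A minimal sketch of cube-style aggregation with pandas on a toy fact table; all names and values below are illustrative assumptions.

```python
import pandas as pd

# Toy fact table: sales volume by product, city, and month (assumed data).
sales = pd.DataFrame({
    "product": ["TV", "TV", "Radio", "Radio"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "volume":  [100, 150, 80, 120],
})

# A small "data cube" view: sales volume along two dimensions.
cube = sales.pivot_table(values="volume", index="product",
                         columns="month", aggfunc="sum", fill_value=0)
print(cube)

# Roll-up: summarize away the Location and Time dimensions entirely.
print(sales.groupby("product")["volume"].sum())
```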
Data Warehousing, OLAP and OLTP Technology
OLTP (on-line transaction processing)
• Online transactional processing (OLTP) enables the real-time execution of large numbers of database
transactions by large numbers of people, typically over the Internet.
• OLTP systems are behind many of our everyday transactions, from ATMs to in-store purchases to
hotel reservations. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.
• OLTP can also drive non-financial transactions, including password changes and text messages.
• OLTP is optimized for processing a massive number of transactions rather than for complex analysis.
• OLTP systems are designed for use by frontline workers (e.g., cashiers, bank tellers, hotel desk
clerks) or for customer self-service applications (e.g., online banking, e-commerce, travel
reservations).
Data Warehousing, OLAP and OLTP Technology
OLAP (on-line analytical processing)
• Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high
speeds on large volumes of data.
• A major task of data warehouse systems, data marts, and other centralized data stores.
• Supports data analysis and decision making.
• OLAP is ideal for data mining, business intelligence and complex analytical calculations, as well as
business reporting functions like financial analysis, budgeting and sales forecasting.
Data Warehousing, OLAP and OLTP Technology
OLAP is optimized for conducting complex data analysis for smarter decision-making.
OLAP systems are designed for use by data scientists, business analysts and knowledge workers, and they support business intelligence (BI), data
mining and other decision support applications.
• Multidimensional OLAP (MOLAP)
• Array-based storage structures
• Direct access to array data structures
• Example: Essbase (Arbor)
• Relational OLAP (ROLAP)
• Relational and Specialized Relational DBMS to store and manage warehouse data
• OLAP middleware to support missing pieces
• Optimize for each DBMS backend
• Aggregation Navigation Logic
• Additional tools and services
• Example: Microstrategy, MetaCube (Informix)