Data Analytics_Module-1.2
Module-1
Dr. Ramen Pal
Associate Professor
Department of CSE (AI & ML), UEMK
Contact: [email protected]
WhatsApp: 7501038078
01/24/2025 1
Course Details
• Subject Name: Professional Elective - III : Data Analytics
• Credit: 3
• Subject Code: PECCSE602A
• Lecture Hours: 36
Course Outcome
• On completion of the course students will be able to:
CO-1: Discuss with illustration the techniques and methods related to the
area of data collection, pre-processing, and exploratory data analytics.
CO-2: Discuss important terms and techniques in statistics to enable students
to understand the background of different tools and methods used in data
analytics.
CO-3: Use machine-learning tools at a beginning level of proficiency to ask
questions of, and explore patterns in, data.
CO-4: Demonstrate intermediate proficiency in the visualization of data to
communicate information and patterns that exist in the data.
Syllabus: Module-1
(Introduction to Data Analytics)
Data science workflow, Automated methods for data collection, Data
and Visualization Models, Data wrangling and cleaning, Exploratory
data analysis, Dimensionality Reduction. Building and evaluation of
models for: Association Analysis, Recommendation Systems, Time-
series data, Text Analysis, Data Mining.
Data Science – One Definition
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'Apple'
Number of Rows: 0
Problem:
Missing Data
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'IBM'
Number of Rows: 0
Problem:
Entity Resolution
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View $210
Intl. Business Machines Armonk, NY $200
Microsoft Redmond, WA $250
Sally’s Lemonade Stand Alameda,CA $260
SELECT MAX(Market_Cap)
FROM Companies
Result: $260 (Sally's Lemonade Stand)
Problem:
Unit Mismatch
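The unit-mismatch trap above can be sketched in Python. The values mirror the slide's table (the lemonade stand stores a raw dollar figure while the others store billions); the parsing helper `to_dollars` is illustrative, not part of any real schema:

```python
# Hypothetical sketch: why MAX over mixed-unit values misleads.
raw = {
    "Google": "$210Bn",
    "Intl. Business Machines": "$200Bn",
    "Microsoft": "$250Bn",
    "Sally's Lemonade Stand": "$260",   # raw dollars, not billions
}

def to_dollars(s):
    """Parse '$210Bn' or '$260' into a plain dollar amount."""
    s = s.lstrip("$")
    if s.endswith("Bn"):
        return float(s[:-2]) * 1e9
    return float(s)

# A naive max over the stored numbers picks the wrong company:
naive_winner = max(raw, key=lambda k: float(raw[k].lstrip("$").rstrip("Bn")))
# After unit normalization, the comparison is meaningful:
true_winner = max(raw, key=lambda k: to_dollars(raw[k]))
print(naive_winner)   # Sally's Lemonade Stand
print(true_winner)    # Microsoft
```

The database happily computes MAX either way; only the normalized version answers the question the analyst actually asked.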
Who's Calling Whose Data
Dirty?
Dirty Data
Dirty data, also called unclean or rogue data, is data that is in some way
faulty: it might contain duplicates, or be outdated, insecure, incomplete,
inaccurate, or inconsistent.
Dirty Data
The Domain Expert’s View:
"This data doesn't look right."
"This answer doesn't look right."
What happened?
Dirty Data
The Data Scientist’s View:
Some Combination of all of the above
Data Quality Problems
Data is dirty on its own
Integrate
Clean
Extract
Transform
Load
ETL
Example Data Quality Problems
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
• What do the fields mean?
• What is the key?
• Data glitches
• Typos, multiple formats, missing / default values
• Metadata and domain expertise
• Field three is Revenue. In dollars or rupees?
• Field four is Usage. Is it censored or uncensored?
• Field 4 is a censored flag. How to handle censored data?
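Some of these glitches can be caught mechanically. This sketch validates the two pipe-delimited records above, assuming (as the slide does) that field two is a phone number; it flags the letter 'o' masquerading as a zero in the first record:

```python
import re

# Hypothetical sketch: validating the two pipe-delimited records above.
records = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

def check_phone(field):
    # A US phone number should contain exactly 10 digits and no letters.
    digits = re.sub(r"\D", "", field)
    has_letters = bool(re.search(r"[a-zA-Z]", field))
    return len(digits) == 10 and not has_letters

for rec in records:
    fields = rec.split("|")
    print(fields[0], "phone ok:", check_phone(fields[1]))
```

Checks like this catch typos and format mismatches, but the semantic questions (dollars or rupees? censored or not?) still require metadata and domain expertise.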
Data Glitches
Systemic changes to data which are external to the recorded process.
Changes in data layout / data types
Integer becomes string, fields swap positions, etc.
Changes in scale / format
Dollars vs. euros
Temporary reversion to defaults
Failure of a processing step
Missing and default values
Application programs do not handle NULL values well …
Gaps in time series
Especially when records represent incremental changes.
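Gaps in a time series are straightforward to detect once the expected frequency is known. A minimal sketch for daily data (the dates and readings are made up):

```python
from datetime import date, timedelta

# Hypothetical sketch: find gaps in a daily time series.
readings = {
    date(2025, 1, 1): 10.0,
    date(2025, 1, 2): 10.5,
    # Jan 3-4 missing: dangerous if records represent incremental changes
    date(2025, 1, 5): 11.0,
}

def find_gaps(series):
    """Return (first_missing_day, last_missing_day) for each gap."""
    days = sorted(series)
    gaps = []
    for prev, cur in zip(days, days[1:]):
        if (cur - prev).days > 1:
            gaps.append((prev + timedelta(days=1), cur - timedelta(days=1)))
    return gaps

print(find_gaps(readings))
```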
Dirty Data Problems
From Stanford Data Integration Course:
Naming conventions (e.g., NYC vs. New York)
Missing required fields (e.g., key fields)
Different representations (e.g., 2 vs. Two)
Fields too long (get truncated)
Primary key violations (from unstructured to structured, or during integration)
Redundant records (exact match or otherwise)
Formatting issues (e.g., dates)
Licensing/privacy issues that keep you from using the data as you would like
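Several of these problems (naming conventions, redundant records) reduce to canonicalizing values before matching. A toy sketch, with an invented alias table:

```python
# Hypothetical sketch: canonicalize naming variants before deduplication.
ALIASES = {"nyc": "new york", "ny": "new york"}

def canon(city):
    c = city.strip().lower()
    return ALIASES.get(c, c)

rows = ["NYC", "New York", "new york ", "Boston"]
unique = {canon(r) for r in rows}
print(sorted(unique))   # only two distinct cities remain
```

Real entity resolution goes far beyond a lookup table (fuzzy matching, blocking, clerical review), but the canonicalize-then-compare pattern is the same.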
Numeric Outliers
Tracking Superman @ home?
Ubisense tracking data
Data Cleaning Makes Everything
Okay?
The appearance of a hole in the Earth's ozone layer over Antarctica, first
detected in 1976, was so unexpected that scientists didn't pay attention to
what their instruments were telling them; they thought their instruments
were malfunctioning.
— National Center for Atmospheric Research
Data Quality
(infographic from http://www-new.insightsquared.com/wp-content/uploads/2012/01/insightsquared_dq_infographic-2.png)
Meaning of Data Quality (1)
Generally, you have a problem if the data doesn't mean what you
think it does, or should
Data not up to spec : garbage in, glitches, etc.
You don’t understand the spec : complexity, lack of metadata.
Many sources and manifestations
As we have discussed
Data quality problems are expensive and pervasive
DQ problems cost hundreds of billions of dollars each year.
Resolving data quality problems is often the biggest effort in a data mining
study.
Conventional Definition of Data
Quality: Metrics
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Entities are recorded once.
Timeliness
The data is kept up to date.
Special problems in federated data: time consistency.
Consistency
The data agrees with itself.
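Uniqueness and completeness, at least, lend themselves to direct measurement. A sketch over a toy table (the field names and rows are invented):

```python
# Hypothetical sketch: score completeness and uniqueness on a toy table.
rows = [
    {"id": 1, "name": "Google", "cap": "$210Bn"},
    {"id": 2, "name": "IBM", "cap": None},            # incomplete record
    {"id": 3, "name": "Microsoft", "cap": "$250Bn"},
    {"id": 3, "name": "Microsoft", "cap": "$250Bn"},  # duplicate entity
]

def completeness(rows, field):
    """Fraction of rows with a non-missing value in `field`."""
    filled = sum(1 for r in rows if r[field] is not None)
    return filled / len(rows)

def uniqueness(rows, key):
    """Fraction of rows that carry a distinct value of `key`."""
    return len({r[key] for r in rows}) / len(rows)

print(completeness(rows, "cap"))   # 0.75
print(uniqueness(rows, "id"))      # 0.75
```

Accuracy and timeliness are harder: there is usually no ground truth inside the table to score against, which is exactly the criticism on the next slide.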
Problems
Unmeasurable
Accuracy and completeness are extremely difficult, perhaps impossible, to
measure.
Context independent
No accounting for what is important. E.g., if you are computing aggregates,
you can tolerate a lot of inaccuracy.
Incomplete
What about interpretability, accessibility, metadata, analysis, etc.?
Vague
The conventional definitions provide no guidance towards practical
improvements of the data.
Meaning of Data Quality (2)
There are many types of data, which have different
uses and typical quality problems
Federated data
High dimensional data
Descriptive data
Longitudinal data
Streaming data
Web (scraped) data
The Data Quality Continuum
Data Gathering
How does the data enter the system?
Sources of problems:
Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates – SW/HW constraints
Measurement errors.
Internet of Things has Special
Problems
RFID data has many dropped readings
Typically, use a smoothing filter to interpolate
SELECT DISTINCT tag_id
FROM RFID_stream [RANGE '7 sec']
GROUP BY tag_id
[Figure: raw readings pass through a smoothing filter to produce the smoothed output over time]
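The smoothing idea can be mimicked in plain Python: a tag counts as present at time t if it was read at least once within the preceding window of ticks, analogous to the RANGE '7 sec' window above. The window size and reading data here are illustrative:

```python
# Hypothetical sketch of a sliding-window smoothing filter for RFID.
def smooth(readings, window):
    """readings: list of sets of tag ids, one set per time tick.
    A tag is 'present' at tick t if seen in the last `window` ticks."""
    out = []
    for t in range(len(readings)):
        lo = max(0, t - window + 1)
        seen = set().union(*readings[lo:t + 1])
        out.append(seen)
    return out

# Dropped readings at ticks 1, 2, and 4; the filter interpolates over them.
raw = [{"A"}, set(), set(), {"A"}, set()]
print(smooth(raw, 3))
```

The trade-off is the usual one: a wider window hides more dropped readings but also delays detecting that a tag has genuinely left.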
Adding Quality Assessment
EDA
WHAT IS EDA (Exploratory Data
Analysis)?
The analysis of datasets based on various numerical methods and
graphical tools.
Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
It facilitates discovering the unexpected as well as confirming the
expected.
Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).
AIM OF THE EDA
Maximize insight into a dataset
Uncover underlying structure
Extract important variables
Detect outliers and anomalies
Test underlying assumptions
Develop valid models
Determine optimal factor settings
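For the "detect outliers and anomalies" aim, one standard starting point is the 1.5 × IQR rule used by boxplots. A minimal sketch (the data is invented, with one planted outlier):

```python
import statistics

# Hypothetical sketch: flag outliers with the 1.5 * IQR rule.
def iqr_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)   # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35, 120]   # 120 is suspicious
print(iqr_outliers(data))
```

Whether a flagged point is a glitch or a discovery is exactly the judgment call the ozone-hole anecdote earlier warns about.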
AIM OF THE EDA
The goal of EDA is to open-mindedly explore data.
Tukey: EDA is detective work… unless the detective finds the clues, the
judge or jury has nothing to consider.
Here, the judge or jury is confirmatory data analysis.
Tukey: Confirmatory data analysis goes further, assessing the strengths of
the evidence.
With EDA, we can examine the data and try to understand the meaning of the
variables: what the abbreviations stand for, and so on.
Exploratory vs Confirmatory Data
Analysis
EDA:
• No hypothesis at first
• Generates hypotheses
• Uses (mostly) graphical methods
CDA:
• Starts with a hypothesis
• Tests the null hypothesis
• Uses statistical models
STEPS OF EDA
Generate good research questions
Data restructuring: You may need to make new variables from the existing ones.
e.g., instead of using two variables separately, obtain a rate or percentage from them
Creating dummy variables for categorical variables
Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
unexpected behaviors.
Try to identify confounding variables, interaction relations and multicollinearity, if any.
Handle missing observations
Decide on the need of transformation (on response and/or explanatory variables).
Decide on the hypothesis based on your research questions
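The dummy-variable step in the restructuring bullet can be hand-rolled in a few lines; in practice pandas' get_dummies does the same job. The column name and categories here are illustrative:

```python
# Hypothetical sketch: indicator (dummy) variables for a categorical column.
def make_dummies(values):
    """One 0/1 indicator per distinct category, for each observation."""
    cats = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in cats} for v in values]

stage = ["I", "II", "I", "III"]   # invented cancer-stage column
print(make_dummies(stage)[0])
```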
AFTER EDA
Confirmatory Data Analysis: Verify the hypothesis by statistical
analysis
Get conclusions and present your results nicely.
Classification of EDA*
Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. And second, each method is either
univariate or multivariate (usually just bivariate).
Non-graphical methods generally involve calculation of summary statistics, while
graphical methods obviously summarize the data in a diagrammatic or pictorial
way.
Univariate methods look at one variable (data column) at a time
Bivariate EDA looks at exactly two variables.
Multivariate methods look at more than two variables at a time to explore
relationships.
It is almost always a good idea to perform univariate EDA on each of the
components before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
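The univariate-first discipline looks like this in code: summarize each column on its own, then move to one bivariate statistic. The sample data and the hand-rolled Pearson helper are illustrative:

```python
import statistics

# Hypothetical sketch: univariate summaries first, then one bivariate check.
age = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
bmi = [22.0, 24.5, 27.1, 21.3, 23.0, 25.2, 26.0, 21.9, 26.8, 23.5]

# Univariate EDA: location and spread, one column at a time.
for name, xs in [("age", age), ("bmi", bmi)]:
    print(name, statistics.mean(xs), statistics.stdev(xs))

# Bivariate EDA: Pearson correlation between the two columns.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

r = pearson(age, bmi)
print(round(r, 2))
```

Doing the univariate pass first means a typo-level outlier in one column is caught before it silently distorts the correlation.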
Graphical Methods
Univariate: Looking at one variable/column at a time
Bar-graph
Histograms
Boxplot
Multivariate: Looking at relationships between two or more variables
Scatter plots
Pie plots
Heatmaps (seaborn)
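Behind a boxplot sits the five-number summary: minimum, first quartile, median, third quartile, maximum. Computing it directly shows what the plot encodes (the sample ages are illustrative):

```python
import statistics

# Hypothetical sketch: the five-number summary behind a boxplot.
ages = sorted([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])
q1, med, q3 = statistics.quantiles(ages, n=4)   # quartile cut points
print(min(ages), q1, med, q3, max(ages))
```

Note that quartile conventions differ between tools (statistics.quantiles defaults to the "exclusive" method), so boxplots from different packages can disagree slightly on the same data.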
EXAMPLE 1
In breast cancer research, the main questions of interest might be:
Does any treatment method result in a higher survival rate? Can a
particular treatment be suggested to a woman with specific
characteristics?
Is there any difference between patients in terms of survival rates?
(e.g., Are white women more likely to survive than black women if
they are both at the same stage of disease?)
EXAMPLE 2*
New cancer cases in the U.S. based on a cancer registry
• The rows in the registry are called observations; they correspond to
individuals.
• The columns are called variables or data fields; they correspond to
attributes of the individuals.
https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
EDA Part 2: Summarizing Data With
Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable
Categorical Data Summaries
Tables
Frequency Table
Graphing a Frequency Table - Bar
Chart:
Plot the number of observations in each category:
Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35, 40, 52, 27, 31, 42, 43, 28, 50, 35
One option is to group these ages into decades and create a categorical
age variable:
Continuous Data - Tables
We can then create a frequency table for this new categorical age
variable.
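The decade-grouping and resulting frequency table can be reproduced in a few lines; the bin labels are our own choice:

```python
from collections import Counter

# Hypothetical sketch: group the ten ages into decades, then tabulate.
ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def decade(age):
    """Map an age like 35 to the label '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

freq = Counter(decade(a) for a in ages)
print(dict(sorted(freq.items())))   # the frequency table
```

In pandas the same result comes from pd.cut followed by value_counts; either way, the continuous variable becomes categorical and the table/bar-chart machinery from the previous slides applies.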
EDA