Data Analytics_Module-1.2
Module-1
Dr. Ramen Pal
Associate Professor
Department of CSE (AI & ML), UEMK
Contact: [email protected]
WhatsApp: 7501038078
01/24/2025 1
Course Details
• Subject Name: Professional Elective - III : Data Analytics
• Credit: 3
• Subject Code: PECCSE602A
• Lecture Hours: 36
Course Outcome
• On completion of the course students will be able to:
CO-1: Discuss with illustration the techniques and methods related to the
area of data collection, pre-processing, and exploratory data analytics.
CO-2: Discuss important terms and techniques in statistics to enable students
to understand the background of different tools and methods used in data
analytics.
CO-3: Use machine-learning tools at a beginning level of proficiency to ask
questions of, and explore patterns in, data.
CO-4: Demonstrate intermediate proficiency in the visualization of data to
communicate information and patterns that exist in the data.
Syllabus: Module-1
(Introduction to Data Analytics)
Data science workflow, Automated methods for data collection, Data
and Visualization Models, Data wrangling and cleaning, Exploratory
data analysis, Dimensionality Reduction. Building and evaluation of
models for: Association Analysis, Recommendation Systems, Time-
series data, Text Analysis, Data Mining.
Data Science – One Definition
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'Apple'
Number of Rows: 0
Problem:
Missing Data
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'IBM'
Number of Rows: 0
Problem:
Entity Resolution
DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View $210
Intl. Business Machines Armonk, NY $200
Microsoft Redmond, WA $250
Sally’s Lemonade Stand Alameda,CA $260
SELECT MAX(Market_Cap)
FROM Companies
Result: $260 (Sally's Lemonade Stand)
Problem:
Unit Mismatch
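The unit-mismatch trap above can be sketched in Python. The values mirror the slide's table (the lemonade stand stores a raw dollar figure while the others store billions); the parsing helper `to_dollars` is illustrative, not part of any real schema:

```python
# Hypothetical sketch: why MAX over mixed-unit values misleads.
raw = {
    "Google": "$210Bn",
    "Intl. Business Machines": "$200Bn",
    "Microsoft": "$250Bn",
    "Sally's Lemonade Stand": "$260",   # raw dollars, not billions
}

def to_dollars(s):
    """Parse '$210Bn' or '$260' into a plain dollar amount."""
    s = s.lstrip("$")
    if s.endswith("Bn"):
        return float(s[:-2]) * 1e9
    return float(s)

# A naive max over the stored numbers picks the wrong company:
naive_winner = max(raw, key=lambda k: float(raw[k].lstrip("$").rstrip("Bn")))
# After unit normalization, the comparison is meaningful:
true_winner = max(raw, key=lambda k: to_dollars(raw[k]))
print(naive_winner)   # Sally's Lemonade Stand
print(true_winner)    # Microsoft
```

The database happily computes MAX either way; only the normalized version answers the question the analyst actually asked.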
Who's Calling Whose Data
Dirty?
Dirty Data
Dirty data, also called unclean or rogue data, is data that is in some way
faulty: it might contain duplicates, or be outdated, insecure, incomplete,
inaccurate, or inconsistent.
Dirty Data
The Domain Expert’s View:
"This data doesn't look right."
"This answer doesn't look right."
What happened?
Dirty Data
The Data Scientist’s View:
Some Combination of all of the above
Data Quality Problems
Data is dirty on its own
Integrate
Clean
Extract
Transform
Load
ETL
Example Data Quality Problems
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
• What do the fields mean?
• What is the key?
• Data glitches
• Typos, multiple formats, missing / default values
• Metadata and domain expertise
• Field three is Revenue. In dollars or rupees?
• Field four is Usage. Is it censored or uncensored?
• Field 4 is a censored flag. How to handle censored data?
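Some of these glitches can be caught mechanically. This sketch validates the two pipe-delimited records above, assuming (as the slide does) that field two is a phone number; it flags the letter 'o' masquerading as a zero in the first record:

```python
import re

# Hypothetical sketch: validating the two pipe-delimited records above.
records = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

def check_phone(field):
    # A US phone number should contain exactly 10 digits and no letters.
    digits = re.sub(r"\D", "", field)
    has_letters = bool(re.search(r"[a-zA-Z]", field))
    return len(digits) == 10 and not has_letters

for rec in records:
    fields = rec.split("|")
    print(fields[0], "phone ok:", check_phone(fields[1]))
```

Checks like this catch typos and format mismatches, but the semantic questions (dollars or rupees? censored or not?) still require metadata and domain expertise.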
Data Glitches
Systemic changes to data which are external to the recorded process.
Changes in data layout / data types
Integer becomes string, fields swap positions, etc.
Changes in scale / format
Dollars vs. euros
Temporary reversion to defaults
Failure of a processing step
Missing and default values
Application programs do not handle NULL values well …
Gaps in time series
Especially when records represent incremental changes.
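Gaps in a time series are straightforward to detect once the expected frequency is known. A minimal sketch for daily data (the dates and readings are made up):

```python
from datetime import date, timedelta

# Hypothetical sketch: find gaps in a daily time series.
readings = {
    date(2025, 1, 1): 10.0,
    date(2025, 1, 2): 10.5,
    # Jan 3-4 missing: dangerous if records represent incremental changes
    date(2025, 1, 5): 11.0,
}

def find_gaps(series):
    """Return (first_missing_day, last_missing_day) for each gap."""
    days = sorted(series)
    gaps = []
    for prev, cur in zip(days, days[1:]):
        if (cur - prev).days > 1:
            gaps.append((prev + timedelta(days=1), cur - timedelta(days=1)))
    return gaps

print(find_gaps(readings))
```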
Dirty Data Problems
From Stanford Data Integration Course:
Naming conventions (e.g., NYC vs. New York)
Missing required fields (e.g., key fields)
Different representations (e.g., 2 vs. Two)
Fields too long (get truncated)
Primary key violations (from unstructured to structured, or during integration)
Redundant records (exact match or otherwise)
Formatting issues (e.g., dates)
Licensing/privacy issues that keep you from using the data as you would like
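Several of these problems (naming conventions, redundant records) reduce to canonicalizing values before matching. A toy sketch, with an invented alias table:

```python
# Hypothetical sketch: canonicalize naming variants before deduplication.
ALIASES = {"nyc": "new york", "ny": "new york"}

def canon(city):
    c = city.strip().lower()
    return ALIASES.get(c, c)

rows = ["NYC", "New York", "new york ", "Boston"]
unique = {canon(r) for r in rows}
print(sorted(unique))   # only two distinct cities remain
```

Real entity resolution goes far beyond a lookup table (fuzzy matching, blocking, clerical review), but the canonicalize-then-compare pattern is the same.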
Numeric Outliers
Tracking Superman @ home?
Ubisense tracking data
Data Cleaning Makes Everything
Okay?
The appearance of a hole in the Earth's ozone layer over Antarctica, first
detected in 1976, was so unexpected that scientists didn't pay attention to
what their instruments were telling them; they thought their instruments
were malfunctioning.
— National Center for Atmospheric Research
Data Quality
(infographic from http://www-new.insightsquared.com/wp-content/uploads/2012/01/insightsquared_dq_infographic-2.png)
Meaning of Data Quality (1)
Generally, you have a problem if the data doesn't mean what you
think it does, or should
Data not up to spec : garbage in, glitches, etc.
You don’t understand the spec : complexity, lack of metadata.
Many sources and manifestations
As we have discussed
Data quality problems are expensive and pervasive
DQ problems cost hundreds of billions of dollars each year.
Resolving data quality problems is often the biggest effort in a data mining
study.
Conventional Definition of Data
Quality: Metrics
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Entities are recorded once.
Timeliness
The data is kept up to date.
Special problems in federated data: time consistency.
Consistency
The data agrees with itself.
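Uniqueness and completeness, at least, lend themselves to direct measurement. A sketch over a toy table (the field names and rows are invented):

```python
# Hypothetical sketch: score completeness and uniqueness on a toy table.
rows = [
    {"id": 1, "name": "Google", "cap": "$210Bn"},
    {"id": 2, "name": "IBM", "cap": None},            # incomplete record
    {"id": 3, "name": "Microsoft", "cap": "$250Bn"},
    {"id": 3, "name": "Microsoft", "cap": "$250Bn"},  # duplicate entity
]

def completeness(rows, field):
    """Fraction of rows with a non-missing value in `field`."""
    filled = sum(1 for r in rows if r[field] is not None)
    return filled / len(rows)

def uniqueness(rows, key):
    """Fraction of rows that carry a distinct value of `key`."""
    return len({r[key] for r in rows}) / len(rows)

print(completeness(rows, "cap"))   # 0.75
print(uniqueness(rows, "id"))      # 0.75
```

Accuracy and timeliness are harder: there is usually no ground truth inside the table to score against, which is exactly the criticism on the next slide.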
Problems
Unmeasurable
Accuracy and completeness are extremely difficult, perhaps impossible, to
measure.
Context independent
No accounting for what is important. E.g., if you are computing aggregates,
you can tolerate a lot of inaccuracy.
Incomplete
What about interpretability, accessibility, metadata, analysis, etc.?
Vague
The conventional definitions provide no guidance towards practical
improvements of the data.
Meaning of Data Quality (2)
There are many types of data, which have different
uses and typical quality problems
Federated data
High dimensional data
Descriptive data
Longitudinal data
Streaming data
Web (scraped) data
The Data Quality Continuum
Data Gathering
How does the data enter the system?
Sources of problems:
Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates – SW/HW constraints
Measurement errors.
Internet of Things has Special
Problems
RFID data has many dropped readings
Typically, use a smoothing filter to interpolate
SELECT DISTINCT tag_id
FROM RFID_stream [RANGE '7 sec']
GROUP BY tag_id
[Figure: raw readings pass through a smoothing filter to produce the smoothed output over time]
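The smoothing idea can be mimicked in plain Python: a tag counts as present at time t if it was read at least once within the preceding window of ticks, analogous to the RANGE '7 sec' window above. The window size and reading data here are illustrative:

```python
# Hypothetical sketch of a sliding-window smoothing filter for RFID.
def smooth(readings, window):
    """readings: list of sets of tag ids, one set per time tick.
    A tag is 'present' at tick t if seen in the last `window` ticks."""
    out = []
    for t in range(len(readings)):
        lo = max(0, t - window + 1)
        seen = set().union(*readings[lo:t + 1])
        out.append(seen)
    return out

# Dropped readings at ticks 1, 2, and 4; the filter interpolates over them.
raw = [{"A"}, set(), set(), {"A"}, set()]
print(smooth(raw, 3))
```

The trade-off is the usual one: a wider window hides more dropped readings but also delays detecting that a tag has genuinely left.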
Adding Quality Assessment
EDA
WHAT IS EDA (Exploratory Data
Analysis)?
The analysis of datasets based on various numerical methods and
graphical tools.
Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
It facilitates discovering the unexpected as well as confirming the
expected.
Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).
AIM OF THE EDA
Maximize insight into a dataset
Uncover underlying structure
Extract important variables
Detect outliers and anomalies
Test underlying assumptions
Develop valid models
Determine optimal factor settings
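For the "detect outliers and anomalies" aim, one standard starting point is the 1.5 × IQR rule used by boxplots. A minimal sketch (the data is invented, with one planted outlier):

```python
import statistics

# Hypothetical sketch: flag outliers with the 1.5 * IQR rule.
def iqr_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)   # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35, 120]   # 120 is suspicious
print(iqr_outliers(data))
```

Whether a flagged point is a glitch or a discovery is exactly the judgment call the ozone-hole anecdote earlier warns about.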
AIM OF THE EDA
The goal of EDA is to open-mindedly explore data.
Tukey: EDA is detective work… unless the detective finds the clues, the
judge or jury has nothing to consider.
Here, the judge or jury is confirmatory data analysis.
Tukey: Confirmatory data analysis goes further, assessing the strengths of
the evidence.
With EDA, we can examine the data and try to understand the meaning of the
variables: what the abbreviations stand for, and so on.
Exploratory vs Confirmatory Data
Analysis
EDA:
• No hypothesis at first
• Generates hypotheses
• Uses (mostly) graphical methods
CDA:
• Starts with a hypothesis
• Tests the null hypothesis
• Uses statistical models
STEPS OF EDA
Generate good research questions
Data restructuring: You may need to make new variables from the existing ones.
e.g., instead of using two variables separately, obtain a rate or percentage from them
Creating dummy variables for categorical variables
Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
unexpected behaviors.
Try to identify confounding variables, interaction relations and multicollinearity, if any.
Handle missing observations
Decide on the need of transformation (on response and/or explanatory variables).
Decide on the hypothesis based on your research questions
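The dummy-variable step in the restructuring bullet can be hand-rolled in a few lines; in practice pandas' get_dummies does the same job. The column name and categories here are illustrative:

```python
# Hypothetical sketch: indicator (dummy) variables for a categorical column.
def make_dummies(values):
    """One 0/1 indicator per distinct category, for each observation."""
    cats = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in cats} for v in values]

stage = ["I", "II", "I", "III"]   # invented cancer-stage column
print(make_dummies(stage)[0])
```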
AFTER EDA
Confirmatory Data Analysis: Verify the hypothesis by statistical
analysis
Get conclusions and present your results nicely.
Classification of EDA*
Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. And second, each method is either
univariate or multivariate (usually just bivariate).
Non-graphical methods generally involve calculation of summary statistics, while
graphical methods obviously summarize the data in a diagrammatic or pictorial
way.
Univariate methods look at one variable (data column) at a time
Bivariate EDA looks at exactly two variables.
Multivariate methods look at more than two variables at a time to explore
relationships.
It is almost always a good idea to perform univariate EDA on each of the
components before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
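The univariate-first discipline looks like this in code: summarize each column on its own, then move to one bivariate statistic. The sample data and the hand-rolled Pearson helper are illustrative:

```python
import statistics

# Hypothetical sketch: univariate summaries first, then one bivariate check.
age = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
bmi = [22.0, 24.5, 27.1, 21.3, 23.0, 25.2, 26.0, 21.9, 26.8, 23.5]

# Univariate EDA: location and spread, one column at a time.
for name, xs in [("age", age), ("bmi", bmi)]:
    print(name, statistics.mean(xs), statistics.stdev(xs))

# Bivariate EDA: Pearson correlation between the two columns.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

r = pearson(age, bmi)
print(round(r, 2))
```

Doing the univariate pass first means a typo-level outlier in one column is caught before it silently distorts the correlation.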
Graphical Methods
Univariate: Looking at one variable/column at a time
Bar-graph
Histograms
Boxplot
Multivariate: Looking at relationships between two or more variables
Scatter plots
Pie plots
Heatmaps (seaborn)
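Behind a boxplot sits the five-number summary: minimum, first quartile, median, third quartile, maximum. Computing it directly shows what the plot encodes (the sample ages are illustrative):

```python
import statistics

# Hypothetical sketch: the five-number summary behind a boxplot.
ages = sorted([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])
q1, med, q3 = statistics.quantiles(ages, n=4)   # quartile cut points
print(min(ages), q1, med, q3, max(ages))
```

Note that quartile conventions differ between tools (statistics.quantiles defaults to the "exclusive" method), so boxplots from different packages can disagree slightly on the same data.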
EXAMPLE 1
In breast cancer research, the main questions of interest might be:
Does any treatment method result in a higher survival rate? Can a
particular treatment be suggested to a woman with specific
characteristics?
Is there any difference between patients in terms of survival rates?
(e.g., Are white women more likely to survive than black women if
they are both at the same stage of disease?)
EXAMPLE 2*
New cancer cases in the U.S. based on a cancer registry
• The rows in the registry are called observations; they correspond to
individuals.
• The columns are called variables or data fields; they correspond to
attributes of the individuals.
https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
EDA Part 2: Summarizing Data With
Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable
Categorical Data Summaries
Tables
Frequency Table
Graphing a Frequency Table - Bar
Chart:
Plot the number of observations in each category:
Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35, 40, 52, 27, 31, 42, 43, 28, 50, 35
One option is to group these ages into decades and create a categorical
age variable:
Continuous Data - Tables
We can then create a frequency table for this new categorical age
variable.
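The decade-grouping and resulting frequency table can be reproduced in a few lines; the bin labels are our own choice:

```python
from collections import Counter

# Hypothetical sketch: group the ten ages into decades, then tabulate.
ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def decade(age):
    """Map an age like 35 to the label '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

freq = Counter(decade(a) for a in ages)
print(dict(sorted(freq.items())))   # the frequency table
```

In pandas the same result comes from pd.cut followed by value_counts; either way, the continuous variable becomes categorical and the table/bar-chart machinery from the previous slides applies.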
EDA