
Data Analytics:

Module-1
Dr. Ramen Pal
Associate Professor
Department of CSE (AI & ML), UEMK
Contact: [email protected]
WhatsApp: 7501038078

01/24/2025 1
Course Details
• Subject Name: Professional Elective - III : Data Analytics
• Credit: 3
• Subject Code: PECCSE602A
• Lecture Hours: 36

Course Outcome
• On completion of the course students will be able to:
CO-1: Discuss with illustration the techniques and methods related to the
area of data collection, pre-processing, and exploratory data analytics.
CO-2: Discuss important statistical terms and techniques to enable students to
understand the background of the different tools and methods used in data
analytics.
CO-3: Use machine-learning tools at a beginning level of proficiency to ask
questions of and explore patterns in data.
CO-4: Demonstrate intermediate proficiency in the visualization of data to
communicate information and patterns that exist in the data.

Syllabus: Module-1
(Introduction to Data Analytics)
Data science workflow, Automated methods for data collection, Data
and Visualization Models, Data wrangling and cleaning, Exploratory
data analysis, Dimensionality Reduction. Building and evaluation of
models for: Association Analysis, Recommendation Systems, Time-
series data, Text Analysis, Data Mining.

Data Science – One Definition

DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn

SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'Apple'

Number of Rows: 0

Problem:
Missing Data

DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View, CA $210Bn
Intl. Business Machines Armonk, NY $200Bn
Microsoft Redmond, WA $250Bn

SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'IBM'

Number of Rows: 0

Problem:
Entity Resolution
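A lightweight way to handle this kind of mismatch is to resolve aliases to a canonical name before querying. The sketch below is illustrative only; the alias table and the `market_cap` helper are hypothetical, not part of any particular system:

```python
# Minimal entity-resolution sketch: canonicalize company names
# through an alias table before lookup. The aliases are illustrative.

ALIASES = {
    "ibm": "Intl. Business Machines",
    "intl. business machines": "Intl. Business Machines",
    "international business machines": "Intl. Business Machines",
    "google": "Google",
    "microsoft": "Microsoft",
}

COMPANIES = {
    "Google": "$210Bn",
    "Intl. Business Machines": "$200Bn",
    "Microsoft": "$250Bn",
}

def market_cap(name):
    """Resolve a name to its canonical form, then look it up."""
    canonical = ALIASES.get(name.strip().lower())
    if canonical is None:
        return None  # unresolved entity -> the "0 rows" case
    return COMPANIES.get(canonical)

print(market_cap("IBM"))  # now finds Intl. Business Machines
```

Real entity resolution goes much further (fuzzy matching, blocking, clustering), but even a hand-built alias table turns the "0 rows" query above into a hit.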

DB-hard Queries
Company_Name Address Market Cap
Google Googleplex, Mtn. View $210
Intl. Business Machines Armonk, NY $200
Microsoft Redmond, WA $250
Sally’s Lemonade Stand Alameda,CA $260

SELECT MAX(Market_Cap)
FROM Companies

Result: $260 (Sally's Lemonade Stand)

Problem:
Unit Mismatch
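One fix is to normalize every value to a common unit before aggregating. The sketch below assumes a "Bn" suffix convention for illustration; note that the table above carries no units at all, which is exactly the problem:

```python
# Sketch: convert market caps to plain dollars before taking MAX.
# The "$...Bn" suffix handling is an assumed convention.

def to_dollars(s):
    """Parse a string like '$210Bn' or '$260' into plain dollars."""
    s = s.strip().lstrip("$")
    if s.endswith("Bn"):
        return float(s[:-2]) * 1e9
    return float(s)

caps = {
    "Google": "$210Bn",
    "Intl. Business Machines": "$200Bn",
    "Microsoft": "$250Bn",
    "Sally's Lemonade Stand": "$260",   # literal dollars, not billions
}

normalized = {name: to_dollars(v) for name, v in caps.items()}
winner = max(normalized, key=normalized.get)
print(winner)  # Microsoft -- not the lemonade stand
```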

Who's Calling Whose Data
Dirty?

Dirty Data
Dirty data, or unclean data or rogue data, is data that is in some way
faulty: it might contain duplicates, or be outdated, insecure,
incomplete, inaccurate, or inconsistent.

The Statistics View:
• There is a process that produces data
• Any dataset is a sample of the output of that process
• Results are probabilistic
• You can correct bias in your sample
Dirty Data
The Database View:
• I got my hands on this data set
• Some of the values are missing, corrupted, wrong, duplicated
• You get a better answer by improving the quality of the values in your dataset

Dirty Data
The Domain Expert's View:
• This data doesn't look right
• This answer doesn't look right
• What happened?

Dirty Data
The Data Scientist's View:
• Some combination of all of the above

Data Quality Problems
• Data is dirty on its own
• Datasets are clean but suffer "bit rot"
  - Old data loses its value over time
• Datasets are clean but integration (i.e., combining them) screws them up
• Any combination of the above

Big Picture: Where can Dirty Data Arise?

[Figure: ETL pipeline (Extract, Transform, Load), with Clean and Integrate stages where dirty data can arise]

ETL

Example Data Quality Problems
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
• What do the fields mean?
• What is the key?
• Data glitches
• Typos, multiple formats, missing / default values
• Metadata and domain expertise
• Field three is Revenue. In dollars or rupees?
• Field four is Usage. Is it censored or uncensored?
• Field four is a censored flag. How to handle censored data?
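The glitches above can be surfaced mechanically. The sketch below parses the two pipe-delimited records and flags a typo'd phone number and a missing/default value; the field names are guesses, since the slide's point is that without metadata we cannot be sure what each field means:

```python
# Sketch: parse pipe-delimited records and flag likely glitches.
# Field meanings beyond name/phone are uncertain (hence f3..f6).

import re

FIELDS = ["name", "phone", "f3", "f4", "f5", "f6", "key"]

def parse_record(line):
    rec = dict(zip(FIELDS, line.split("|")))
    issues = []
    if not re.fullmatch(r"[\d-]+", rec["phone"]):
        issues.append("phone contains non-digits (typo?)")
    if rec["f5"] in ("-", ""):
        issues.append("missing/default value in field 5")
    return rec, issues

rec1, issues1 = parse_record("T.Das|97336o8327|24.95|Y|-|0.0|1000")
rec2, issues2 = parse_record("Ted J.|973-360-8779|2000|N|M|NY|1000")
print(issues1)  # two glitches flagged (the 'o' in the phone, the '-')
print(issues2)  # clean under these checks
```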
Data Glitches
Systemic changes to data which are external to the recorded process.
• Changes in data layout / data types
  - Integer becomes string, fields swap positions, etc.
• Changes in scale / format
  - Dollars vs. euros
• Temporary reversion to defaults
  - Failure of a processing step
• Missing and default values
  - Application programs do not handle NULL values well…
• Gaps in time series
  - Especially when records represent incremental changes.

Dirty Data Problems
From Stanford Data Integration Course:
• Naming conventions, e.g., NYC vs. New York
• Missing required field (e.g., a key field)
• Different representations (e.g., 2 vs. Two)
• Fields too long (get truncated)
• Primary key violations (from unstructured to structured, or during integration)
• Redundant records (exact match or otherwise)
• Formatting issues, e.g., dates
• Licensing/privacy issues: do they keep you from using the data as you would like?
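A few of these problems yield to simple normalization rules. The sketch below handles naming conventions, numeric words vs. digits, and mixed date formats; the lookup tables and format list are illustrative, not complete:

```python
# Sketch of normalizations for a few dirty-data problems above.
# The synonym tables and accepted date formats are assumptions.

from datetime import datetime

CITY_SYNONYMS = {"NYC": "New York"}
NUMBER_WORDS = {"Two": "2", "two": "2"}

def normalize_city(c):
    return CITY_SYNONYMS.get(c, c)

def normalize_number(token):
    return NUMBER_WORDS.get(token, token)

def parse_date(s):
    """Try a few common formats; real data may need many more."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None  # unparseable: flag for review rather than guess

print(normalize_city("NYC"))                                 # New York
print(parse_date("01/24/2025") == parse_date("2025-01-24"))  # True
```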

Numeric Outliers
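The slide's figure did not survive extraction. As a stand-in, one common rule of thumb for flagging numeric outliers is the 1.5×IQR fence (illustrative; not necessarily the method the figure showed):

```python
# Sketch: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Uses a simple linear-interpolation quantile for self-containment.

def iqr_outliers(xs):
    xs = sorted(xs)
    n = len(xs)
    def quantile(q):
        idx = q * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        frac = idx - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # [95]
```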

Tracking Superman @ home?
• Ubisense tracking data
• Glitches: he walks through walls; he flies across the room…
• But too much cleaning and you lose detail.

Data Cleaning Makes Everything
Okay?
The appearance of a hole in the earth's ozone layer over Antarctica, first
detected in 1976, was so unexpected that scientists didn't pay attention to
what their instruments were telling them; they thought their instruments
were malfunctioning.
(National Center for Atmospheric Research)

In fact, the data were rejected as unreasonable by data quality control
algorithms.
How Clean is “clean-enough”?

How much cleaning is too much?

Answers are likely to be:
• domain-specific
• data source-specific
• application-specific
• user-specific
• all of the above?

Data Quality
(infographic from http://www-new.insightsquared.com/wp-content/uploads/2012/01/insightsquared_dq_infographic-2.png)

Meaning of Data Quality (1)
Generally, you have a problem if the data doesn't mean what you
think it does, or should
Data not up to spec : garbage in, glitches, etc.
You don’t understand the spec : complexity, lack of metadata.
Many sources and manifestations
As we have discussed
Data quality problems are expensive and pervasive
DQ problems cost hundreds of billions of dollars each year.
Resolving data quality problems is often the biggest effort in a data mining
study.
Conventional Definition of Data
Quality: Metrics
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Entities are recorded once.
Timeliness
The data is kept up to date.
  - Special problems in federated data: time consistency.
Consistency
The data agrees with itself.
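Two of these metrics (completeness, uniqueness) are directly computable. A minimal sketch over a toy record set; the field names and records are illustrative:

```python
# Sketch: completeness and uniqueness over a toy record set.

records = [
    {"name": "Google",    "addr": "Mtn. View, CA"},
    {"name": "Microsoft", "addr": None},           # incomplete
    {"name": "Google",    "addr": "Mtn. View, CA"},  # duplicate entity
]

def completeness(recs, field):
    """Fraction of records with a non-missing value for `field`."""
    return sum(r[field] is not None for r in recs) / len(recs)

def uniqueness(recs, key):
    """Fraction of records that are distinct on `key`."""
    return len({r[key] for r in recs}) / len(recs)

print(completeness(records, "addr"))  # 2 of 3 records have an address
print(uniqueness(records, "name"))    # 2 distinct names among 3 records
```

Accuracy and timeliness, by contrast, usually need an external reference to measure, which is one of the "Problems" the next slide raises.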
Problems
Unmeasurable
• Accuracy and completeness are extremely difficult, perhaps impossible, to measure.
Context independent
• No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy.
Incomplete
• What about interpretability, accessibility, metadata, analysis, etc.?
Vague
• The conventional definitions provide no guidance towards practical improvements of the data.
Meaning of Data Quality (2)
There are many types of data, which have different
uses and typical quality problems
Federated data
High dimensional data
Descriptive data
Longitudinal data
Streaming data
Web (scraped) data

Adapted from Ted Johnson's SIGMOD 2003 Tutorial


Meaning of Data Quality (2)
There are many uses of data
Operations
Aggregate analysis
Customer relations …
Data Interpretation: the data is useless if we don’t know all of the
rules behind the data.
Data Suitability: Can you get the answer from the available data?
• Use of proxy data
• Relevant data is missing

The Data Quality Continuum

Data and information are not static; they flow through a data collection and
usage process:
• Data gathering
• Data storage
• Data integration
• Data retrieval
• Data delivery
• Data mining/analysis

Data Gathering
How does the data enter the system?
Sources of problems:
Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates – SW/HW constraints
Measurement errors.

Adapted from Ted Johnson's SIGMOD 2003 Tutorial


Solutions
Potential Solutions:
• Preemptive:
  - Process architecture (build in integrity checks)
  - Process management (reward accurate data entry, data sharing, data stewards)
• Retrospective:
  - Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
  - Diagnostic focus (automated detection of glitches).
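The retrospective "cleaning focus" can be sketched as merge/purge on a normalized name-and-address key. The normalization rules below are illustrative; production systems use far richer matching:

```python
# Sketch: duplicate removal by merge/purge on a normalized key.

def normalize(s):
    """Lowercase, drop periods, collapse whitespace."""
    return " ".join(s.lower().replace(".", "").split())

def merge_purge(records):
    """Keep one record per normalized (name, address) key."""
    seen = {}
    for name, addr in records:
        key = (normalize(name), normalize(addr))
        seen.setdefault(key, (name, addr))  # first record wins
    return list(seen.values())

records = [
    ("J. Smith", "12 Oak St."),
    ("J  Smith", "12 oak st"),   # same entity, different formatting
    ("A. Jones", "9 Elm Ave."),
]
print(merge_purge(records))  # two records remain
```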

Internet of Things has Special
Problems
• RFID data has many dropped readings
• Typically, use a smoothing filter to interpolate

SELECT DISTINCT tag_id
FROM RFID_stream [RANGE '7 sec']
GROUP BY tag_id

[Figure: raw readings pass through a smoothing filter to produce smoothed output over time]
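The windowed smoothing idea can be sketched as follows: report a tag as present if it was seen at least once within the trailing window (mirroring the `RANGE '7 sec'` above). This is illustrative and not the exact semantics of any particular stream system:

```python
# Sketch: interpolate over dropped RFID readings with a trailing
# window. `readings` is a list of sets of tag_ids, one set per tick.

def smooth(readings, window=7):
    """Per tick, report the tags seen within the trailing window."""
    out = []
    for t in range(len(readings)):
        seen = set()
        for s in readings[max(0, t - window + 1): t + 1]:
            seen |= s
        out.append(seen)
    return out

# Tag A is dropped at ticks 1, 2 and 4; a 3-tick window fills the gaps.
raw = [{"A"}, set(), set(), {"A"}, set()]
print(smooth(raw, window=3))
```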

Adding Quality Assessment

EDA

WHAT IS EDA (Exploratory Data
Analytics)?
The analysis of datasets based on various numerical methods and
graphical tools.
Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
It facilitates discovering the unexpected as well as confirming the
expected.
Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).

AIM OF THE EDA
Maximize insight into a dataset
Uncover underlying structure
Extract important variables
Detect outliers and anomalies
Test underlying assumptions
Develop valid models
Determine optimal factor settings

AIM OF THE EDA
The goal of EDA is to open-mindedly explore data.
Tukey: EDA is detective work… unless the detective finds the clues, the judge
or jury has nothing to consider.
Here, the judge or jury is confirmatory data analysis.
Tukey: Confirmatory data analysis goes further, assessing the strengths of
the evidence.
With EDA, we can examine data and try to understand the meaning of
variables, e.g., what the abbreviations stand for.

Exploratory vs Confirmatory Data
Analysis
EDA:
• No hypothesis at first
• Generates hypotheses
• Uses graphical methods (mostly)

CDA:
• Starts with a hypothesis
• Tests the null hypothesis
• Uses statistical models

STEPS OF EDA
• Generate good research questions
• Data restructuring: you may need to make new variables from the existing ones.
  - Instead of using two variables, obtain rates or percentages from them
  - Create dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships,
anomalies, unexpected behaviors.
• Try to identify confounding variables, interaction relations and multicollinearity, if any.
• Handle missing observations
• Decide on the need for transformation (on response and/or explanatory variables).
• Decide on the hypotheses based on your research questions
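The data-restructuring step above (deriving a rate from two variables, creating dummy variables for a categorical one) can be sketched as follows; the column names and values are made up for illustration:

```python
# Sketch: derive a rate variable and dummy-code a categorical column.

rows = [
    {"city": "NYC", "cases": 30, "population": 1000},
    {"city": "LA",  "cases": 12, "population": 600},
]

categories = ["NYC", "LA"]

def restructure(row):
    out = dict(row)
    out["case_rate"] = row["cases"] / row["population"]   # rate variable
    for c in categories:                                  # dummy variables
        out[f"city_{c}"] = 1 if row["city"] == c else 0
    return out

restructured = [restructure(r) for r in rows]
print(restructured[0]["case_rate"], restructured[0]["city_NYC"])
```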
AFTER EDA
Confirmatory Data Analysis: Verify the hypothesis by statistical
analysis
Get conclusions and present your results nicely.

Classification of EDA*
Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. And second, each method is either
univariate or multivariate (usually just bivariate).
Non-graphical methods generally involve calculation of summary statistics, while
graphical methods obviously summarize the data in a diagrammatic or pictorial
way.
Univariate methods look at one variable (data column) at a time
Bivariate EDA looks at exactly two variables.
Multivariate methods look at more than two variables at a time to explore
relationships.
It is almost always a good idea to perform univariate EDA on each of the
components before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
Graphical Methods
Univariate: Looking at one variable/column at a time
Bar-graph
Histograms
Boxplot
Multivariate : Looking at relationship between two or more variables
Scatter plots
Pie plots
Heatmaps (seaborn)

EXAMPLE 1
In breast cancer research, the main questions of interest might be:
• Does any treatment method result in a higher survival rate? Can a
particular treatment be suggested to a woman with specific
characteristics?
• Is there any difference between patients in terms of survival rates?
(E.g., are white women more likely to survive compared to black women
if they are both at the same stage of disease?)

EXAMPLE 2*
New cancer cases in the U.S. based on a cancer registry
• The rows in the registry are called observations; they correspond to
individuals.
• The columns are variables or data fields; they correspond to attributes
of the individuals.

https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
EDA Part 2: Summarizing Data With
Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.

Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable
Categorical Data Summaries
Tables

Cancer site is a variable taking 5 values:
• categorical or continuous?
• ordered or unordered?

Frequency Table

• Frequency Table: categories with counts
• Relative Frequency Table: percentage in each category
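Both tables can be built with `collections.Counter`. The category values below are illustrative:

```python
# Sketch: frequency and relative-frequency tables for a categorical
# variable (cancer site), using made-up category values.

from collections import Counter

sites = ["Lung", "Breast", "Colon", "Breast", "Lung", "Breast", "Prostate"]

freq = Counter(sites)                       # frequency table: counts
n = len(sites)
rel = {k: v / n for k, v in freq.items()}   # relative frequency: percentages

print(freq["Breast"], round(rel["Breast"], 3))
```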

Graphing a Frequency Table - Bar
Chart:
Plot the number of observations in each category:

Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35; 40; 52; 27; 31; 42; 43; 28; 50; 35
One option is to group these ages into decades and create a categorical
age variable:

Continuous Data - Tables
We can then create a frequency table for this new categorical age
variable.
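The two steps above (grouping the ten ages into decades, then tabulating the new categorical variable) can be sketched as:

```python
# Sketch: bin continuous ages into decade categories, then build a
# frequency table for the resulting categorical variable.

from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def decade(age):
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

age_group = [decade(a) for a in ages]   # new categorical variable
freq = Counter(age_group)               # frequency table

print(sorted(freq.items()))
# [('20-29', 2), ('30-39', 3), ('40-49', 3), ('50-59', 2)]
```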

EDA

