Chapter 02

Chapter 2 discusses the nature and characteristics of data, including structured, semi-structured, and unstructured data, as well as various data sources and storage types. It also covers data analysis frameworks, types of processing, and the importance of data preprocessing and normalization. Additionally, the chapter explores data visualization techniques and essential mathematics for multivariate data analysis.

Uploaded by

hajeera.nk
Copyright
© All Rights Reserved

Chapter 2

Understanding of Data
What is Data?
• DATA ARE FACTS
• FACTS ARE IN THE FORM OF NUMBERS, AUDIO, VIDEO, AND IMAGE
• NEED TO ANALYZE DATA FOR TAKING DECISIONS
Characteristics of Big Data
Characteristics of Data
Data Sources
A DATA SOURCE CAN HOLD ANY OF THE FOLLOWING:

• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
Structured Data
STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING:

• RECORD DATA
• GRAPHICS DATA
• DATA MATRIX
• ORDERED DATA – SEQUENCE DATA, TIME SERIES DATA, TEMPORAL DATA
Unstructured Data
UNSTRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING:

• VIDEO, IMAGE, PROGRAMS
• BLOG DATA

ABOUT 80% OF AN ORGANIZATION'S DATA IS UNSTRUCTURED
Semi-Structured Data
SEMI-STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING:

• XML/JSON OBJECTS
• RSS FEEDS
• HIERARCHICAL RECORDS
Data Storage
• DATABASE SYSTEMS
• TYPES ARE
1. TRANSACTIONAL DATABASE
2. TIME SERIES DATABASE
3. TEMPORAL DATABASE
Data Storage
• OTHER TYPES
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Data Analysis Framework
• FRAMEWORK
Types of Processing
• CLOUD COMPUTING
• GRID COMPUTING
• H-COMPUTING
Good Data Characteristics
• GOOD DATA SHOULD HAVE THESE CHARACTERISTICS
Open-Source Data
1. DIGITAL LIBRARIES
2. EXPERIMENTAL DATA LIKE GENOMIC AND BIOLOGICAL DATA
3. HEALTHCARE SYSTEMS LIKE PATIENT INSURANCE DATA
Social-Media Data
1. TWITTER DATA
2. FACEBOOK DATA
3. YOUTUBE VIDEOS
4. INSTAGRAM DATA
Multimodal Data
• IMAGE ARCHIVES WITH TEXT AND NUMERIC DATA
• WWW
Data Preprocessing
DATA THAT CAN CAUSE PROBLEMS
• INCOMPLETE DATA
• OUTLIER DATA
• INCONSISTENT DATA
• INACCURATE DATA
• MISSING VALUES
• DUPLICATE DATA
Missing Data
Noisy Data
BINNING TECHNIQUE
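The slide names binning as the technique for noisy data but does not show the procedure. A minimal sketch of one common variant, smoothing by bin means with equal-frequency bins, is given below. The function name, the bin size, and the sample values are illustrative assumptions, not taken from the slides.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative data (assumed, not from the slides)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

Other variants replace each value by the bin median or by the nearest bin boundary; the loop structure stays the same.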
Data Normalization
MIN-MAX PROCEDURE
TRANSFORMS DATA TO THE RANGE 0 TO 1
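As a rough illustration of the min-max procedure, the sketch below rescales each value with (v - min) / (max - min), which maps the smallest value to 0 and the largest to 1. The function name, the optional target range, and the sample marks are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min(values) -> new_min
    and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


# Illustrative data (assumed, not from the slides)
marks = [88, 90, 92, 94]
normalized_marks = min_max_normalize(marks)
```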
Data Normalization
Z-SCORE
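A minimal sketch of z-score normalization follows: each value is replaced by (v - mean) / standard deviation, so the result has mean 0 and standard deviation 1. The use of the population standard deviation here is an assumption; some texts divide by the sample standard deviation instead.

```python
import statistics


def z_score_normalize(values):
    """Return (v - mean) / std for each value, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mean) / std for v in values]


# Illustrative data (assumed, not from the slides)
scores = [10, 20, 30]
z_scores = z_score_normalize(scores)
```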
Types of Data
Nominal Data
Ordinal Data
Numerical Data
Types of Data
BASED ON VARIABLES
Data Visualization
Central Tendency
MEAN OF DATA
Central Tendency
MEDIAN OF DATA
Central Tendency
MODE OF DATA
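The three measures of central tendency can be computed with Python's standard `statistics` module. The sample values below are an assumption chosen for illustration.

```python
import statistics

# Illustrative data (assumed, not from the slides)
data = [2, 3, 3, 5, 7, 10]

data_mean = statistics.fmean(data)     # arithmetic average
data_median = statistics.median(data)  # middle value (average of the two middle values here)
data_mode = statistics.mode(data)      # most frequent value
```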
DISPERSION
RANGE AND STANDARD DEVIATION
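A short sketch of both dispersion measures is given below. It reports the population and the sample standard deviation, since the slide does not say which divisor (N or N - 1) is intended. The data values are assumed.

```python
import statistics

# Illustrative data (assumed, not from the slides)
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)           # largest minus smallest value
population_std = statistics.pstdev(data)     # divides by N
sample_std = statistics.stdev(data)          # divides by N - 1
```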
DISPERSION
QUARTILES AND IQR
Five-point summary
5-POINT SUMMARY
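The quartiles, the IQR, and the five-point summary (minimum, Q1, median, Q3, maximum) can be sketched as follows. Quartile conventions differ between textbooks; the `inclusive` method used here is an assumption and may not match the slides' worked values.

```python
import statistics

# Illustrative data (assumed, not from the slides)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_point_summary = (min(data), q1, q2, q3, max(data))
```

Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are commonly flagged as outliers.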
Shape of Data
SKEWNESS AND KURTOSIS
Shape of Data
KURTOSIS
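A minimal sketch of moment-based skewness and kurtosis is shown below. It assumes the population (divide by N) definitions, with skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m_k is the k-th central moment. Other texts apply sample corrections or report excess kurtosis (kurtosis minus 3).

```python
import statistics


def central_moment(values, order):
    """k-th central moment with divisor N."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Illustrative data (assumed, not from the slides)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
```

A symmetric data set has skewness 0; a long right tail gives a positive value.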
Shape of Data
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF VARIATION
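Both measures can be sketched in a few lines. The mean absolute deviation averages the absolute distances from the mean, and the coefficient of variation expresses the standard deviation as a percentage of the mean. Using the population standard deviation is an assumption here.

```python
import statistics


def mean_absolute_deviation(values):
    mean = statistics.fmean(values)
    return sum(abs(v - mean) for v in values) / len(values)


def coefficient_of_variation(values):
    """Standard deviation relative to the mean, as a percentage."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100


# Illustrative data (assumed, not from the slides)
data = [2, 4, 4, 4, 5, 5, 7, 9]
mad = mean_absolute_deviation(data)
cv = coefficient_of_variation(data)
```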
Stem-Leaf Plot
Q-Q Plot
A Q-Q PLOT IS A VISUAL CHECK FOR NORMALITY. IF THE POINTS LIE CLOSE TO A STRAIGHT LINE, THE DATA ARE APPROXIMATELY NORMALLY DISTRIBUTED.
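The points of a normal Q-Q plot can be computed without a plotting library, as in the sketch below. It pairs each sorted observation with the standard normal quantile at plotting position (i + 0.5) / n; that choice of plotting position is an assumption, and other conventions exist.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical normal quantile, observed value) pairs."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Illustrative data (assumed, not from the slides)
heights = [165, 150, 180, 160, 170]
points = qq_points(heights)
```

Plotting the pairs as a scatter plot and checking how closely they follow a straight line gives the visual normality check described above.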
Bivariate Data
INVOLVES TWO VARIABLES
Bivariate Data Visualization
Bivariate Data – Covariance
Bivariate Data – Correlation
Bivariate Data – Correlation
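Covariance and the Pearson correlation coefficient for two variables can be sketched as follows. The N - 1 divisor is an assumption; with divisor N the covariance changes but the correlation does not.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Illustrative data (assumed, not from the slides)
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 6, 8, 10]
cov_xy = covariance(hours, marks)
r_xy = correlation(hours, marks)
```

The correlation always lies between -1 and 1; a perfectly linear increasing relationship such as this one gives 1.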
Multivariate Data Visualization
Multivariate Data Visualization
Multivariate Essential Mathematics
1. GAUSSIAN ELIMINATION
Multivariate Essential Mathematics
1. GAUSSIAN ELIMINATION
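A minimal sketch of Gaussian elimination for solving a linear system Ax = b is shown below. It adds partial pivoting for numerical stability, which the slides may not cover, and the example system is an assumption.

```python
def gaussian_elimination(a, b):
    """Solve a x = b by forward elimination and back substitution."""
    n = len(b)
    # Build the augmented matrix [a | b] as floats, leaving the inputs untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


# Illustrative system (assumed, not from the slides)
coefficients = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
constants = [8, -11, -3]
solution = gaussian_elimination(coefficients, constants)
```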
Multivariate Essential Mathematics
1. MATRIX DECOMPOSITION
Multivariate Essential Mathematics
1. MATRIX DECOMPOSITION
Multivariate Essential Mathematics
1. DISTRIBUTIONS
Multivariate Essential Mathematics
EXPONENTIAL DISTRIBUTION
Multivariate Essential Mathematics
BINOMIAL DISTRIBUTION
Multivariate Essential Mathematics
POISSON AND BERNOULLI DISTRIBUTION
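The four distributions named on these slides have short closed-form density or mass functions, sketched below with the standard library. The parameter names (`rate`, `n`, `p`) are assumptions chosen for the example.

```python
from math import comb, exp, factorial


def exponential_pdf(x, rate):
    """Density of waiting time x for events arriving at the given rate."""
    return rate * exp(-rate * x) if x >= 0 else 0.0


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean count is rate."""
    return rate ** k * exp(-rate) / factorial(k)


def bernoulli_pmf(k, p):
    """Probability of outcome k (1 = success, 0 = failure) in one trial."""
    return p if k == 1 else 1 - p
```

The Bernoulli distribution is the binomial distribution with a single trial, so `binomial_pmf(k, 1, p)` gives the same values as `bernoulli_pmf(k, p)`.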
Density Estimation
Hypothesis Testing
Z-TEST
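A one-sample z-test can be sketched as below, assuming the population standard deviation is known. The statistic is z = (sample mean - hypothesized mean) / (sigma / sqrt(n)); the two-sided p-value and the numbers in the example are assumptions added for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, population_mean, population_std, n):
    """Return the z statistic and the two-sided p-value."""
    z = (sample_mean - population_mean) / (population_std / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative numbers (assumed, not from the slides)
z_stat, p_value = one_sample_z_test(
    sample_mean=52.0, population_mean=50.0, population_std=10.0, n=100
)
```

The null hypothesis is rejected at significance level alpha when the p-value falls below alpha.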
Hypothesis Testing
PAIRED T-TEST
Hypothesis Testing
PAIRED T-TEST
Hypothesis Testing
PAIRED T-TEST
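The paired t-test works on the differences between matched observations. A minimal sketch of the test statistic, t = mean(d) / (s_d / sqrt(n)) with n - 1 degrees of freedom, follows. The before and after values are assumed, and the critical value still has to be looked up in a t-table because the standard library has no t-distribution.

```python
from math import sqrt
import statistics


def paired_t_statistic(before, after):
    """Return the t statistic and degrees of freedom for paired samples."""
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    mean_diff = statistics.fmean(differences)
    std_diff = statistics.stdev(differences)  # sample std, divisor n - 1
    t = mean_diff / (std_diff / sqrt(n))
    return t, n - 1


# Illustrative data (assumed, not from the slides)
before = [10, 12, 14, 16]
after = [12, 15, 15, 20]
t_stat, dof = paired_t_statistic(before, after)
```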
Hypothesis Testing
CHI-SQUARE TEST
Hypothesis Testing
CHI-SQUARE TEST
Hypothesis Testing
CHI-SQUARE TEST
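The chi-square statistic for a contingency table can be sketched as below. The expected count of each cell is (row total * column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The table values are assumed.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Illustrative 2 x 2 table (assumed, not from the slides)
table = [[10, 20], [20, 10]]
chi2, dof = chi_square_statistic(table)
```

The statistic is then compared with the critical value of the chi-square distribution for the given degrees of freedom.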
Feature Engineering
Feature Engineering
• FEATURE TRANSFORMATION
• FEATURE SELECTIONS
Characteristics of Good Features
• FEATURES ARE REMOVED USING RELEVANCY
• FEATURES ARE REMOVED BASED ON REDUNDANCY
FEATURE SELECTION
FORWARD SELECTION
FEATURE SELECTION
BACKWARD SELECTION
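Forward and backward selection are both greedy searches over feature subsets. The sketch below assumes a caller-supplied `score` function that rates a subset (higher is better) and stops at a fixed subset size `k`. Both are assumptions; in practice the score usually comes from a model evaluated on validation data, and the search often stops when the score no longer improves.

```python
def forward_selection(features, score, k):
    """Start empty and repeatedly add the feature that helps most."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_selection(features, score, k):
    """Start with all features and repeatedly drop the least useful one."""
    selected = list(features)
    while len(selected) > k:
        # Drop the feature whose removal leaves the best-scoring subset.
        to_drop = max(
            selected,
            key=lambda f: score([g for g in selected if g != f]),
        )
        selected.remove(to_drop)
    return selected


# Hypothetical usefulness weights, used only to make the example runnable.
usefulness = {"age": 0.5, "income": 0.3, "zip": 0.05, "id": 0.0}


def toy_score(subset):
    return sum(usefulness[f] for f in subset)


forward_result = forward_selection(list(usefulness), toy_score, k=2)
backward_result = backward_selection(list(usefulness), toy_score, k=2)
```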
Principal Component Analysis
Principal Component Analysis
Compute the covariance matrix.

Compute the eigenvalues and eigenvectors, and form matrix A from the set of eigenvectors.
Principal Component Analysis
Compute the PCA transform of the data.

The original data can be recovered from the transformed data.
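The PCA steps can be sketched with NumPy as below. The formulas did not survive in the slide text, so the sketch assumes the usual formulation: with mean vector m and matrix A whose rows are the eigenvectors of the covariance matrix, the transform is y = A(x - m) and the data are recovered as x = A^T y + m. The function names and sample data are illustrative.

```python
import numpy as np


def pca_fit(X):
    """Return the mean, the eigenvalues (largest first), and matrix A
    whose rows are the matching eigenvectors of the covariance matrix."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    return mean, eigvals[order], eigvecs[:, order].T


def pca_transform(X, mean, A, n_components):
    """Project centered data onto the first n_components eigenvectors."""
    return (X - mean) @ A[:n_components].T


def pca_recover(Y, mean, A):
    """Map projected data back to the original space."""
    k = Y.shape[1]
    return Y @ A[:k] + mean


# Illustrative data (assumed, not from the slides): 6 samples, 2 features
X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
mean, eigvals, A = pca_fit(X)
Y_full = pca_transform(X, mean, A, n_components=2)
X_back = pca_recover(Y_full, mean, A)
```

Recovery is exact only when all components are kept; with fewer components the result is the closest approximation in the retained subspace.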
PCA Algorithm
PCA Example
Verification
LDA Algorithm
LDA Algorithm
SVD Algorithm
SVD Algorithm
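Singular value decomposition factors a matrix as A = U S V^T. The sketch below uses NumPy's `np.linalg.svd` rather than the hand computation the slides presumably walk through; the matrix and the helper name `low_rank_approximation` are assumptions for illustration.

```python
import numpy as np


def low_rank_approximation(matrix, rank):
    """Keep only the largest `rank` singular values and their vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]


# Illustrative matrix (assumed, not from the slides)
M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = U @ np.diag(s) @ Vt          # full reconstruction
M_rank1 = low_rank_approximation(M, 1)   # best rank-1 approximation
```

The singular values in `s` are returned in decreasing order, so truncating them keeps the directions that carry the most of the matrix's energy.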
SVD Example
