
MODULE 1

CHAPTER 2
UNDERSTANDING DATA – 1
Contents
• Introduction.
• Big Data Analysis Framework.
• Descriptive Statistics.
• Univariate Data Analysis and Visualization.
What is data?
• Data are facts
• Facts take the form of numbers, audio, video, and images
• Data must be analyzed to support decision making
• Organizations store vast amounts of data (GB, TB, PB, EB).
• Data can be human-interpretable or computer-readable.
• Operational and Non-Operational Data
• Operational Data: Encountered in daily business procedures.
• Non-Operational Data: Used for decision-making.
• Processed data is meaningful and used for analysis.
Elements of Big Data
• Big data is characterized by:
• Volume: Large amounts of data (PB, EB).
• Velocity: Fast data arrival speeds.
• Variety: Different forms, functions, and sources of data.
• Veracity: Truthfulness and accuracy of data.
• Validity: Correctness for decision-making.
• Value: Importance of extracted insights for business decisions.
Types of Data
• Structured Data
• Stored in an organized manner (e.g., databases, SQL tables).
• Types include:
• Record Data: Organized as tables with rows and columns.
• Data Matrix: Numeric attributes arranged in multidimensional space.
• Graph Data: Represents relationships between objects (e.g., web pages and hyperlinks).
• Ordered Data: Data organized in a sequence (e.g., time-series or sequential data).
• Unstructured Data
• Includes images, video, audio, blogs, and textual documents.
• Estimated that 80% of data is unstructured.
• Semi-Structured Data
• Combines elements of structured and unstructured data.
• Examples: XML, JSON, RSS feeds, hierarchical data.
Data Storage and Representation
• Data stored in structures for analysis.
• Types:
• Flat Files
• CSV (Comma-Separated Values): Values are separated by commas (","). Used in spreadsheets, databases, and data analysis tools.
• TSV (Tab-Separated Values): Values are separated by tabs (\t) instead of commas. Also used in spreadsheets, databases, and data exchange between applications.
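As a minimal sketch, pandas can read both formats; the file names data.csv and data.tsv here are hypothetical:

```python
import pandas as pd

# Read a comma-separated file (hypothetical path)
df_csv = pd.read_csv("data.csv")

# Read a tab-separated file by overriding the separator
df_tsv = pd.read_csv("data.tsv", sep="\t")

print(df_csv.head())  # inspect the first few rows
```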
Data Storage and Representation
• A DBMS (Database Management System) manages data efficiently.
• Types of databases:
• Transactional Database
• Time-Series Database
• Spatial Database
• World Wide Web (WWW)
• XML (eXtensible Markup Language)
• Data Stream
• RSS (Really Simple Syndication)
• JSON (JavaScript Object Notation)
Big Data Analytics and Types of Analytics
• Big data analytics helps businesses make decisions by analyzing data.
• It generates useful information and insights.
• Data analytics covers data collection, preprocessing, and analysis.
• It deals with the complete cycle of data management.
• Types of Data Analytics
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
Types of Analytics
• Descriptive Analytics
• Describes the main features of the data.
• Deals with collected data and quantifies it.
• Focuses on descriptive statistics rather than inference.
• Diagnostic Analytics
• Answers the question: 'Why did something happen?'
• Finds cause-and-effect relationships in data.
• Example: If a product is not selling, diagnostic analytics identifies reasons.
Types of Analytics
• Predictive Analytics
• Answers the question: 'What will happen in the future?'
• Uses algorithms to predict future trends.
• Predictive analytics relies heavily on machine learning algorithms.
• Prescriptive Analytics
• Recommends the best course of action.
• Goes beyond prediction and aids decision-making.
• Helps organizations plan for the future and mitigate risks.
Big Data Analysis Framework
• Big data frameworks use a layered architecture for flexibility and scalability.
• This architecture simplifies data processing and management.
• The framework consists of four primary layers:
1. Data Connection Layer
2. Data Management Layer
3. Data Analytics Layer
4. Presentation Layer
Big Data Analysis Framework
• Data Connection Layer
• Ingests raw data into appropriate structures.
• Supports Extract, Transform, and Load (ETL) operations.
• Connects data from various sources for analysis.
• Data Management Layer
• Preprocesses data for analysis.
• Executes read, write, and management tasks.
• Enables parallel query execution and data warehousing.
Big Data Analysis Framework
• Data Analytics Layer
• Performs statistical tests and machine learning model construction.
• Supports various analytical functions for insights.
• Validates models to ensure data integrity.
• Presentation Layer
• Displays results through dashboards and reports.
• Provides insights using machine learning models.
• Facilitates interpretation and visualization for better decision-making.
Types of Processing
• Cloud Computing
• Cloud computing provides shared resources over the internet.
• Services include:
• SaaS (Software as a Service) – Allows users to access software applications over the internet without needing to install them on their devices. Examples: Google Docs, Microsoft 365.
• PaaS (Platform as a Service) – Provides a platform for developers to build, test, and deploy applications. Examples: Google App Engine, Microsoft Azure.
• IaaS (Infrastructure as a Service) – Offers virtualized computing resources like servers, storage, and networking. Examples: Amazon Web Services (AWS), Google Cloud Platform.
Types of Processing
• Cloud Service Deployment Models
• Public Cloud – Managed by third-party providers and accessible to the general public. Examples: Google Cloud, AWS.
• Private Cloud – Used exclusively by a single organization, providing greater security and control.
• Community Cloud – Shared infrastructure owned and used by multiple organizations with common concerns (e.g., government institutions).
• Hybrid Cloud – A combination of two or more cloud models to balance security, performance, and cost.
Types of Processing
• Characteristics of Cloud Computing
• Shared Infrastructure – Computing resources are shared across multiple users.
• Dynamic Provisioning – Resources are allocated based on demand.
• Dynamic Scaling – Services can expand or shrink according to user needs.
• Network Access – Cloud resources are accessed over the internet.
• Utility-Based Metering – Users are charged based on resource consumption.
• Multitenancy – Multiple users share cloud resources securely.
• Reliability – Ensures continuous and reliable services.
Types of Processing
• Grid Computing:
• Uses distributed networks for complex tasks.
• Connects multiple computers to act as a single supercomputer.
• Distributes tasks across nodes for parallel processing.
• Ideal for high-performance, large-scale applications.
• HPC (High-Performance Computing):
• Aggregates resources to solve complex problems quickly.
• Utilizes parallel processing across compute, network, and storage components.
• Enhances performance for scientific and engineering tasks.
Data Collection
• Good Data Characteristics
• Timeliness: Relevant and up-to-date.
• Relevancy: Ready for machine learning tasks.
• Knowledge: Understandable and interpretable.
• Data Source Types:
• 1. Open/Public Data (e.g., digital libraries, healthcare databases)
• 2. Social Media Data (e.g., Twitter, YouTube)
• 3. Multimodal Data (e.g., text, audio, video)
Data preprocessing
• In the real world, data is often 'dirty'. Dirty data includes:
• Incomplete data: Missing values in the dataset.
• Outlier data: Values that deviate significantly from the rest of the data.
• Data with inconsistent values: Contradictory or logically incorrect data entries.
• Inaccurate data: Errors in the recorded data.
• Data with missing values: Attributes or records with missing information.
• Duplicate data: Repeated entries that can skew analysis.

• Data preprocessing improves the quality of data mining techniques. The raw data
must be preprocessed to provide accurate results. This process involves data
cleaning and wrangling to make data usable for machine learning.
Data preprocessing
• Examples of Bad Data
• Consider the following examples of bad data:
• Missing Salary values
• Age recorded as '5' when the Date of Birth indicates otherwise
• Age of '136', likely a typographical error
• Negative salary values, e.g., '-1500'
• Data Cleaning Process involves:
• Identifying and correcting errors
• Removing duplicate or irrelevant data
• Filling in missing values
• Correcting inconsistent data formats
Missing Data Analysis
• The primary data cleaning process is missing data analysis.
• Data cleaning routines attempt to fill in missing values, smooth out noise, identify outliers, and correct data inconsistencies.
• This helps data mining models avoid overfitting.
• Methods for Handling Missing Data
• Ignore the tuple
• Fill in values manually
• Use a global constant
• Attribute value substitution
• Class mean
• Predicted value
Missing Data Analysis
• Ignore the tuple:
• Ignore records with missing data, especially class labels.
• Effective only when missing data is minimal.

• Fill in values manually:
• Experts analyze and fill in values manually.
• Time-consuming and impractical for large datasets.

• Use a global constant:
• Fill missing values with a constant (e.g., 'Unknown').
• May cause spurious results.
Missing Data Analysis
• Attribute value substitution:
• Replace missing value with an attribute's value.
• Example: Use average income for missing income.
• Class mean:
• Use mean value for each class to fill missing values.
• Predicted value:
• Predict the missing value using classification or decision trees.
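A rough pandas sketch of several of these strategies; the column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with missing Income values
df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B", "B"],
    "Income": [30000, None, 45000, None, 50000],
})

# Ignore the tuple: drop records with missing values
dropped = df.dropna()

# Use a global constant such as 'Unknown'
constant_filled = df["Income"].fillna("Unknown")

# Attribute value substitution: overall mean income
mean_filled = df["Income"].fillna(df["Income"].mean())

# Class mean: mean income of each record's own class
class_mean_filled = df["Income"].fillna(
    df.groupby("Class")["Income"].transform("mean"))
```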
Removal of Noisy or Outlier Data
• Noise is random error or variance and can be removed using binning
techniques:
• - Smoothing by means: Replace with bin mean
• - Smoothing by medians: Replace with bin median
• - Smoothing by bin boundaries: Replace with boundary values

• Binning helps in discretizing the data and smoothing noisy data.


Example of Binning
• Example dataset: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
• Bins of size 3:
• - Bin 1: 12, 14, 19
• - Bin 2: 22, 24, 26
• - Bin 3: 28, 31, 34

• Smoothing by means: {15, 15, 15}, {24, 24, 24}, {31, 31, 31}
• Smoothing by boundaries: {12, 12, 19}, {22, 22, 26}, {28, 28, 34}
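A short Python sketch that reproduces this example (assuming NumPy):

```python
import numpy as np

S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]  # equal-frequency bins of size 3

# Smoothing by means: every value becomes its bin's mean
by_means = [[np.mean(b)] * len(b) for b in bins]
# -> [15.0 x3], [24.0 x3], [31.0 x3]

# Smoothing by boundaries: each value snaps to the nearer bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
# -> [12, 12, 19], [22, 22, 26], [28, 28, 34]
```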
Data Integration and Data transformation
• Data integration merges data from multiple sources into a single source, which
may lead to redundant data.
• Detect and remove redundancies arising from data integration.
• These operations (like normalization) enhance data mining algorithm performance
by transforming data into a processable format.
• Normalization:
• A preliminary stage of data conditioning.
• Scales attribute values to a range (e.g., 0 to 1) for better algorithm performance.
• Commonly used in neural networks.
• Normalization Procedures:
• Min-Max
• z-Score
Data Normalization
• Min-Max normalization transforms data to the range 0–1:
• v' = (v - min) / (max - min)
• z-Score normalization expresses each value as the number of standard deviations from the mean:
• z = (v - μ) / σ
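A minimal NumPy sketch of both procedures on a hypothetical attribute:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical attribute values

# Min-Max normalization: rescale to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization: standard deviations from the mean
z_score = (x - x.mean()) / x.std()
```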


Data Reduction
• Data reduction reduces dataset size while maintaining performance.
• Techniques include:
• Data Aggregation
• Feature Selection
• Dimensionality Reduction
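As one illustrative sketch of dimensionality reduction, assuming scikit-learn is available (the data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, highly correlated dataset: 10 features driven by 3 latent factors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))

# Keep the fewest principal components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # far fewer columns than the original X
```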
Descriptive Statistics
• Descriptive statistics summarize and describe data.
• It helps in understanding the nature of data.
• Includes techniques like Exploratory Data Analysis (EDA) and data
visualization.
• Dataset and Data Types
• A dataset is a collection of data objects.
• Attributes define the properties of objects.
• Data types are categorized as Categorical and Numerical.
Types of Data
• Categorical Data Types
• Nominal Data: Labels or symbols with no inherent order or numeric meaning.
• Ordinal Data: Data with a natural order (e.g., Low, Medium, High).
• Numerical Data Types
• Interval Data: Numeric data with meaningful differences.
• Ratio Data: Numeric data with a true zero point.
• Discrete vs Continuous Data
• Discrete Data: Integer-based, like survey responses.
• Continuous Data: Can have decimal points, like height and weight.
Types of Data
• Data Classification by Variables
• Univariate Data: Single variable.
• Bivariate Data: Two variables.
• Multivariate Data: Three or more variables.
Univariate Data Analysis and Visualization
• Univariate analysis is the simplest form of statistical analysis, involving only one
variable.
• It describes data, finds patterns, and explores frequency distributions, central
tendency measures, and variation.
• Univariate data analysis provides insights into data distribution, central tendency,
and variation.
• Data visualization helps to understand and present data effectively.
• Common techniques include bar charts, histograms, pie charts, frequency polygons, and dot plots, which make data interpretation easier (a combined sketch follows the dot plot description below).
Bar Chart
• Bar charts display frequency distributions for variables.
• They illustrate discrete data and help compare the frequency of different groups.
Pie Chart
• Pie charts represent frequency distributions as proportional sectors.
• They help visualize the relative sizes of different groups within a dataset.
Histogram
• Histograms show frequency distributions for grouped data.
• They can illustrate data distribution, mode, and skewness.
Dot Plot
• Dot plots represent data points with dots.
• They are less cluttered than bar charts and help identify individual values.
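A minimal matplotlib sketch of these four chart types, using hypothetical grade and score data:

```python
import matplotlib.pyplot as plt

grades = ["A", "B", "C", "D"]                  # hypothetical grade categories
counts = [5, 12, 8, 3]                         # frequency of each grade
scores = [45, 60, 60, 62, 71, 74, 80, 85, 90]  # hypothetical marks

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].bar(grades, counts)             # bar chart: compare group frequencies
ax[0, 1].pie(counts, labels=grades)      # pie chart: proportional sectors
ax[1, 0].hist(scores, bins=5)            # histogram: grouped frequency distribution
ax[1, 1].plot(scores, [1] * len(scores), "o")  # dot plot: individual values
plt.tight_layout()
plt.show()
```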
Central Tendency
• Central tendency is a summary statistic that represents the center point
of a dataset.
• It helps to simplify data analysis by focusing on key measures like the
mean, median, and mode.
• Mean
• Geometric Mean
• Median
• Mode
Mean
• Mean (Arithmetic Average) represents the center of the dataset.
• Calculated by summing all observations and dividing by the number of
observations.
• Formula: x̄ = (Σxᵢ)/N
• Example: Mean of 10, 20, and 30 is (10+20+30)/3 = 20.
• Weighted mean applies different weights to values based on their
importance.
Geometric Mean
• Geometric mean is the nth root of the product of n numbers.
• Formula: GM = (Πxᵢ)^(1/n)
• Example: GM of 6 and 8 is √(6×8) = √48.
• It can also be computed using logarithms.
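A small sketch of the logarithmic computation (pure standard library):

```python
import math

def geometric_mean(values):
    # GM = exp( (1/n) * Σ log(xᵢ) ), i.e. the nth root of the product
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(geometric_mean([6, 8]))  # ≈ 6.93, i.e. √48
```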
Median and Mode
• The median is the middle value in a distribution.
• For an odd number of items, it's the middle item; for an even number,
it's the average of the two middle items.
• Formula for continuous (grouped) data: Median = L₁ + [(N/2 - cf)/f] × i, where L₁ is the lower boundary of the median class, cf is the cumulative frequency below it, f is the frequency of the median class, and i is the class width.
• Mode is the most frequently occurring value in a dataset.
• Applicable mainly to discrete data.
• Datasets can be unimodal, bimodal, or trimodal, based on the number
of modes present.
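A quick sketch with Python's statistics module, using hypothetical marks:

```python
import statistics

marks = [45, 60, 60, 80, 85]  # hypothetical marks, already sorted

print(statistics.mean(marks))    # 66
print(statistics.median(marks))  # 60, the middle item (odd count)
print(statistics.mode(marks))    # 60, the most frequent value
```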
Dispersion
• Dispersion measures the spread of data around the central tendency.
• Range: Difference between the maximum and minimum values.
• Standard Deviation: The square root of the average squared deviation from the mean; for a sample:
• σ = sqrt( Σ(xᵢ - x̄)² / (N - 1) )
• Quartiles and Inter Quartile Range (IQR)
• Quartiles divide data into four parts:
• Q₁: 25th percentile
• Q₂: 50th percentile (Median)
• Q₃: 75th percentile
• IQR = Q₃ - Q₁
• Outliers fall more than 1.5 × IQR above Q₃ or below Q₁.
Example 2.4: IQR Calculation
• Given the dataset: {12, 14, 19, 22, 24, 26, 28, 31, 34}
• - Median (Q₂): 24
• - Q₁ (median of lower half): 16.5
• - Q₃ (median of upper half): 29.5
• IQR = Q₃ - Q₁ = 29.5 - 16.5 = 13
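The same result with the standard library; for this dataset, the default 'exclusive' method of statistics.quantiles matches the median-of-halves calculation above:

```python
import statistics

data = [12, 14, 19, 22, 24, 26, 28, 31, 34]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default method='exclusive'
print(q1, q2, q3)   # 16.5 24.0 29.5
print(q3 - q1)      # IQR = 13.0
```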
Five-point Summary and Box Plots
• The five-point summary includes:
• - Minimum
• - First Quartile (Q1)
• - Median (Q2)
• - Third Quartile (Q3)
• - Maximum

• Box plots visualize the distribution and spread of the data.
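A brief sketch computing the five-point summary and drawing the box plot, reusing the dataset from the IQR example:

```python
import statistics
import matplotlib.pyplot as plt

data = [12, 14, 19, 22, 24, 26, 28, 31, 34]

q1, q2, q3 = statistics.quantiles(data, n=4)
print(min(data), q1, q2, q3, max(data))  # 12 16.5 24.0 29.5 34

plt.boxplot(data)  # box plot visualizes the same five-point summary
plt.show()
```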


Shape

• Skewness: Measures the asymmetry of a distribution.
• - Positive skew: Tail to the right
• - Negative skew: Tail to the left
• Formula: (1/N) * Σ((xᵢ - μ)³ / σ³)
Shape
• Kurtosis: Measures the 'peakedness' of the data.
• - High kurtosis: Sharp peak
• - Low kurtosis: Flat peak
• Formula: (1/N) * Σ((xᵢ - μ)⁴ / σ⁴)
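Both formulas translate directly into NumPy; this sketch reuses the earlier example dataset:

```python
import numpy as np

x = np.array([12, 14, 19, 22, 24, 26, 28, 31, 34], dtype=float)
mu, sigma = x.mean(), x.std()  # population mean and standard deviation

skewness = np.mean(((x - mu) / sigma) ** 3)  # ~0 for a symmetric distribution
kurtosis = np.mean(((x - mu) / sigma) ** 4)  # ~3 for a normal distribution
print(skewness, kurtosis)
```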
Shape
• Mean Absolute Deviation (MAD):
• - Measures the average absolute deviation from the mean.
• Formula: (1/N) * Σ|xᵢ - μ|

• Coefficient of Variation (CV):
• - Compares the relative spread of datasets with different units.
• Formula: (σ / μ) * 100
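A matching NumPy sketch for MAD and CV on the same dataset:

```python
import numpy as np

x = np.array([12, 14, 19, 22, 24, 26, 28, 31, 34], dtype=float)
mu = x.mean()

mad = np.mean(np.abs(x - mu))  # Mean Absolute Deviation
cv = (x.std() / mu) * 100      # Coefficient of Variation, as a percentage
print(mad, cv)
```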
Special Univariate Plots
• Stem-Leaf Plot
• A stem and leaf plot displays data distribution and shape by splitting values into a
stem and a leaf.
• The stem is the left part (leading digits), and the leaf is the right part (last digit).
• For example, marks like 45, 60, 80, and 85 can be represented in this plot.
• In the stem and leaf plot, the first column represents the stem, and the second
column represents the leaf.
• For the English marks, two students with 60 marks are shown in the plot as stem 6
with leaves 0 and 0.
• Stem and leaf plots help visualize data distribution easily.
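A tiny pure-Python sketch of the stem-and-leaf construction for the marks mentioned above:

```python
from collections import defaultdict

def stem_leaf(values):
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)   # stem = tens digit, leaf = units digit
    for stem, leaves in sorted(plot.items()):
        print(stem, "|", *leaves)

stem_leaf([45, 60, 60, 80, 85])
# 4 | 5
# 6 | 0 0   <- two students with 60 marks
# 8 | 0 5
```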
• Q-Q Plot
• A Q-Q plot is a 2D scatter plot that compares the quantiles of a dataset with the theoretical quantiles of a normal distribution.
• It serves as a normality test: ideally the points lie along the 45-degree reference line, indicating a normal distribution.
• Significant deviations from the line indicate a non-normal distribution.
• Together with stem-and-leaf plots, these tools are essential for understanding univariate data distributions.
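A standard way to draw one, assuming SciPy and matplotlib are available (the sample here is synthetic):

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Synthetic sample; substitute real univariate data here
sample = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

# Compare sample quantiles with theoretical normal quantiles
stats.probplot(sample, dist="norm", plot=plt)
plt.show()  # points near the reference line suggest normality
```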
Thank you
