Module1 Understanding Data1

The document provides an overview of descriptive statistics, including dataset summarization and data types such as categorical, ordinal, and numerical data. It discusses various data analysis techniques, including univariate, bivariate, and multivariate analyses, as well as visualization methods like bar charts and histograms. Additionally, it covers central tendency measures, dispersion, skewness, kurtosis, and the coefficient of variation, emphasizing their importance in understanding data distributions.

Uploaded by

Lakshmi Hj

Module-1

Understanding of Data
2.4 Descriptive Statistics
• Descriptive statistics is the branch of statistics concerned with summarizing
datasets.
• It is used to summarize and describe data.
• Descriptive statistics is not concerned with ML algorithms or how they
function.
• Descriptive analytics and data visualization techniques help to understand
the nature of the data, which in turn helps to determine the kinds of
machine learning or data mining tasks that can be applied to the data.
• This step is often known as Exploratory Data Analysis (EDA).
Dataset and Data Types
• A dataset can be assumed to be a collection of data objects.
• The data objects may be records, points, vectors, patterns, events, cases, samples
or observations.
• Example: Sample Patient Table
Patient ID | Name  | Age | Blood Test | Fever | Disease
1          | John  | 21  | Negative   | Low   | No
2          | Andre | 36  | Positive   | High  | Yes
• Every attribute should be associated with a value. This process is called
measurement.
• The type of an attribute determines its data type.
Types of Data
Categorical or Qualitative Data
Nominal Data: Nominal data is a type of categorical data that represents labels,
names, or categories without any inherent order or ranking.
• Mathematical operations like addition or subtraction do not apply.
Example of Nominal Data:
• Gender: Male, Female
• Eye Color: Blue, Brown, Green
• Blood Group: A, B, AB, O
• Car Brands: Toyota, Ford, BMW
• Yes/No Responses: Yes, No
• Since nominal data is purely categorical, it is typically analyzed using frequency
counts or mode (most frequently occurring category).
• Only operations like (=,≠) are meaningful for these data.
• Example: The patient ID can be checked for equality and nothing else.
Ordinal Data: Ordinal data is a type of categorical data where the categories have a
meaningful order or ranking.
• But, the differences between them are not necessarily uniform or measurable.
• Arithmetic operations like addition or subtraction are not applicable.
Example of Ordinal Data:
• Customer Satisfaction Levels: Poor, Fair, Good, Very Good, Excellent
• Education Levels: High School, Bachelor's, Master's, PhD
• Movie Ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
• Ranks in a Competition: 1st place, 2nd place, 3rd place
• Fever: Low, Medium, High
• In addition to (=, ≠), ordering comparisons like (<, >) are meaningful for these data.
• Since ordinal data has a ranking, it is often analyzed using median or percentile-
based methods.
Numeric or Quantitative Data
Interval Data: Interval data is a type of numerical data where the differences
between values are meaningful and equal, but there is no true zero point.
• This means that while addition and subtraction are possible, multiplication and
division are not meaningful.
Example of Interval Data:
• Temperature (in Celsius or Fahrenheit): 10°C, 20°C, 30°C (The difference
between 10°C and 20°C is the same as between 20°C and 30°C, but 0°C does
not mean "no temperature.")
• Time of the Day (on a 12-hour clock): 3 AM, 6 AM, 9 AM (There is no absolute
zero time; the clock continues in cycles.)
• IQ Scores: 90, 100, 110 (The differences are measurable, but there is no true
zero IQ.)
• In addition to comparisons, only arithmetic operations like (+, -) are meaningful for these data.
Ratio Data: Ratio data is a type of numerical data that has all the properties of
interval data, but with a true zero point, meaning zero represents the total
absence of the measured variable.
• This allows for meaningful addition, subtraction, multiplication, and division
operations.
Examples of Ratio Data:
• Height & Weight: A person who is 100 kg weighs twice as much as someone
who is 50 kg.
• Age: A 20-year-old is twice as old as a 10-year-old.
• Kelvin Temperature: 0 K (absolute zero) means a complete absence of thermal
energy.
• Distance: 0 km means no distance, and 10 km is twice as far as 5 km.
Since ratio data has a true zero, it allows for meaningful ratio comparisons, unlike
interval data (e.g., 40°C is not twice as hot as 20°C, but 40 kg is twice as heavy as
20 kg).
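The interval-versus-ratio distinction can be checked numerically. The sketch below is only an illustration: the helper name `celsius_to_kelvin` is made up for this example, and the conversion constant 273.15 is the standard Celsius-to-Kelvin offset.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Dividing Celsius readings gives 2.0, but Celsius has no true zero,
# so this number does not mean "twice as hot".
naive_ratio = 40 / 20

# Kelvin has a true zero, so the ratio of Kelvin readings is meaningful.
true_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Differences are meaningful on an interval scale and are unchanged
# by the conversion (up to floating-point rounding).
diff_celsius = 40 - 20
diff_kelvin = celsius_to_kelvin(40) - celsius_to_kelvin(20)

print("Celsius ratio:", naive_ratio)
print("Kelvin ratio :", round(true_ratio, 3))
print("Differences  :", diff_celsius, round(diff_kelvin, 6))
```

The Kelvin ratio comes out at roughly 1.07 rather than 2, which is why ratio statements are reserved for data with a true zero.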
• Another way of classifying the data:
1. Discrete value data
2. Continuous data
Discrete data is a type of numerical data that consists of distinct, separate values.
It can only take specific, countable values and cannot be divided into smaller parts
meaningfully.
Examples of Discrete Data:
• Number of students in a class: 25, 30, 35 (cannot be 25.5 students).
• Number of goals scored in a match: 1, 2, 3 (not 1.5 goals).

Discrete data is typically represented using bar charts or count-based statistics like
mode and frequency.
Continuous data is a type of numerical data that can take any value within a given
range, including fractions and decimals.
It is measurable rather than countable.

Examples of Continuous Data:


• Height of a person: 170.2 cm, 170.25 cm, 170.257 cm.
• Weight of an object: 65.5 kg, 65.55 kg, 65.555 kg.
• Temperature: 22.3°C, 22.35°C, 22.357°C.

Continuous data is typically represented using histograms or line graphs and
analyzed using statistical measures like mean, standard deviation, and range.
Classification of data based on number of variables:
1. Univariate Data
2. Bivariate Data
3. Multivariate Data
Univariate Data: Univariate data refers to a dataset that contains only one variable.
• It focuses on analyzing a single characteristic or feature without considering
relationships with other variables.
• The analysis is simple and focuses on distribution, central tendency (mean,
median, mode), and spread (range, variance, standard deviation).
Examples of Univariate Data:
• Number of cars in different households: (1, 2, 3, 2, 4, etc.)
• Test scores of students: (85, 90, 78, 92, etc.)
• Colors of cars in a parking lot: (Red, Blue, Black, White, etc.)
Bivariate Data: Bivariate data refers to a dataset that involves two variables and
examines the relationship between them.
• Used to analyze correlations or possible cause-and-effect links between variables.
• It helps in identifying patterns, correlations, or dependencies between the two
variables.
Examples of Bivariate Data:
• Height vs. Weight: Analyzing how a person’s weight changes with height.
• Study Hours vs. Exam Score: Examining whether more study hours lead to higher
scores.
• Temperature vs. Ice Cream Sales: Studying how ice cream sales change with
temperature.
Multivariate Data: Multivariate data refers to a dataset that contains three or more
variables and examines relationships among them.
• It is used in complex analysis to understand how multiple factors interact with each
other.
Examples of Multivariate Data:
• Student Performance Analysis: Examining how study hours, attendance, and
sleep duration affect exam scores.
• Weather Prediction: Analyzing temperature, humidity, and wind speed to
forecast weather conditions.
• Sales Performance: Studying how price, advertising budget, and customer
reviews impact product sales.
• Health Analysis: Evaluating how age, blood pressure, cholesterol levels, and
exercise habits affect heart disease risk.
2.5 Univariate Data Analysis and Visualization
• Data visualization is the graphical representation of information and data.
• It helps people understand complex data patterns, trends, and insights by using
visual elements such as charts, graphs, maps, and dashboards.

Bar Chart: A bar chart is a graphical representation of data using rectangular bars,
where the length or height of each bar represents the value of a particular
category.
Bars can be displayed vertically (column chart) or horizontally.
• A bar chart is best suited for categorical data (data divided into groups or
categories).
• It can also be used for discrete numerical data.
• Not ideal for continuous data (e.g., temperature, speed); a line chart is usually
better for that.
Pie Chart: A pie chart is a circular graph divided into slices, where each slice
represents a proportion of the whole.
• The size of each slice corresponds to the percentage or fraction of a category
within the dataset.
• Like the bar chart, it is helpful for illustrating univariate data.
Histogram: A histogram is a graphical representation of the distribution of
numerical data.
• It looks like a bar chart, but instead of showing categories, it groups continuous
data into bins (intervals) and shows the frequency of data points within each bin.
When to Use a Histogram?
• When analyzing continuous numerical data
• To understand the frequency of data within specific ranges
• To observe distribution patterns (e.g., normal, skewed, uniform)
Problem 1
• There are 60 students in a class. Among them, 15 students were placed in a
company offering a 3.5 lakh package, 10 students in a 6.5 lakh package, 8
students in a 10 lakh package, and 5 students in a 12 lakh package.
Generate a bar chart and a pie chart.
• Solution:
3.5 lakh package: 15/60 x 100 = 25 %
6.5 lakh package: 10/60 x 100 = 16.67 %
10 lakh package: 8/60 x 100 = 13.33 %
12 lakh package: 5/60 x 100 = 8.33 %
Not placed = total students - placed students = 60 - 38 = 22 students
22/60 x 100 = 36.67 %
[Figure: Placement distribution of students]
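The percentages in Problem 1 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper `text_bar_chart` is a hypothetical stand-in for a real plotting call and simply draws the bar chart with text characters.

```python
total_students = 60
placed = {"3.5 lakh": 15, "6.5 lakh": 10, "10 lakh": 8, "12 lakh": 5}

# Students without an offer are the remainder of the class.
counts = dict(placed)
counts["Not placed"] = total_students - sum(placed.values())

# Share of the class in each category: these are the pie-chart slices.
percentages = {
    label: round(100 * n / total_students, 2) for label, n in counts.items()
}


def text_bar_chart(data: dict, symbol: str = "#") -> str:
    """Render one text bar per category; bar length equals the count."""
    width = max(len(label) for label in data)
    lines = [f"{label:<{width}} | {symbol * n} {n}" for label, n in data.items()]
    return "\n".join(lines)


print(text_bar_chart(counts))
for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

The same `counts` and `percentages` dictionaries could be passed to a plotting library to draw the actual bar chart and pie chart.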
Problem 2:
Total students = 60
Consider the ranges (in lakhs) 0-3, 3-6, 6-9, 9-12, 12-15.
No. of placed students:
Package (in lakhs) | College A | College B
3                  | 25        | 6
5                  | 12        | 15
7                  | 2         | 6
10                 | 1         | 14
11.5               | 0         | 9
15                 | 0         | 10
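Grouping the Problem 2 packages into the stated ranges is the step that turns the table into a histogram. The sketch below is one possible reading, not the slide's worked solution: it assumes each range includes its upper edge (so a 3 lakh package falls in 0-3 and a 15 lakh package in 12-15), since the source does not say how boundary values are assigned. The function name `bin_counts` is made up for this example.

```python
bin_edges = [0, 3, 6, 9, 12, 15]           # ranges 0-3, 3-6, 6-9, 9-12, 12-15
packages = [3, 5, 7, 10, 11.5, 15]         # package values in lakhs
college_a = [25, 12, 2, 1, 0, 0]           # placed students per package
college_b = [6, 15, 6, 14, 9, 10]


def bin_counts(values, counts, edges):
    """Sum the counts of all values falling in each (low, high] range.

    Assumption: ranges are closed on the right, so a value equal to an
    edge is counted in the lower range.
    """
    totals = [0] * (len(edges) - 1)
    for value, count in zip(values, counts):
        for i in range(len(edges) - 1):
            if edges[i] < value <= edges[i + 1]:
                totals[i] += count
                break
    return totals


hist_a = bin_counts(packages, college_a, bin_edges)
hist_b = bin_counts(packages, college_b, bin_edges)

for i in range(len(bin_edges) - 1):
    label = f"{bin_edges[i]}-{bin_edges[i + 1]}"
    print(f"{label:>5} | A: {hist_a[i]:>2} | B: {hist_b[i]:>2}")
```

With a different boundary convention (ranges closed on the left) the 3 lakh and 15 lakh rows would move, so the convention should be stated alongside any histogram drawn from this table.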
Central Tendency
• Central tendency refers to the measure that represents the center or typical value
of a dataset.
• It helps in understanding the overall trend of the data by identifying a single
value that best describes the distribution.
The three main measures of central tendency are:
1. Mean (Arithmetic Average)
2. Median (Middle Value)
3. Mode (Most Frequent Value)
1. Mean (Arithmetic Average)
• The sum of all values divided by the number of values.
• Formula:
Mean = ∑X / N, or equivalently $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

Example: If the numbers are 5, 10, 15, 20, 25, then
Mean = (5 + 10 + 15 + 20 + 25)/5 = 75/5 = 15

Best for: Numerical data without extreme values (outliers).

2. Median (Middle Value)
• The middle value when the data is arranged in ascending order.
• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two middle
numbers.
Example:

For 3, 7, 9 → Median = 7
For 3, 7, 9, 12 → Median = (7+9)/2 = 8

Best for: Skewed distributions or data with outliers.


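• Median Formula for Continuous Data
• When dealing with continuous data (grouped frequency distributions), the
median is estimated using the following formula:
Median = L + ((N/2 − cf) / f) × h
where L is the lower boundary of the median class, N is the total frequency,
cf is the cumulative frequency of the class before the median class, f is the
frequency of the median class, and h is the class width.
Steps to Find the Median in Continuous Data
• Calculate N/2 (where N is the total frequency).
• Identify the median class, which is the class where the cumulative frequency just
exceeds N/2.
• Use the median formula to compute the median value.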
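The three steps can be sketched in code using the standard grouped-median formula, Median = L + ((N/2 − cf) / f) × h, where L is the lower boundary of the median class, cf the cumulative frequency before it, f its frequency, and h its width. The class intervals and frequencies below are made-up illustration data, not taken from the slides, and `grouped_median` is a hypothetical helper name.

```python
def grouped_median(classes, frequencies):
    """Estimate the median of a grouped frequency distribution.

    classes     -- list of (lower, upper) class boundaries in ascending order
    frequencies -- frequency of each class, in the same order
    """
    n = sum(frequencies)
    half = n / 2                      # Step 1: N/2
    cumulative = 0                    # cumulative frequency before the class
    for (lower, upper), freq in zip(classes, frequencies):
        # Step 2: the median class is the first one whose cumulative
        # frequency exceeds N/2.
        if cumulative + freq > half:
            # Step 3: interpolate inside the median class.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("frequencies must contain at least one positive count")


# Hypothetical data: N = 40, so N/2 = 20 and the median class is 20-30.
example_classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
example_frequencies = [5, 8, 12, 10, 5]

median_estimate = grouped_median(example_classes, example_frequencies)
print(round(median_estimate, 2))
```

For this illustration the cumulative frequencies are 5, 13, 25, so the estimate is 20 + ((20 − 13) / 12) × 10, a little under 26.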
3. Mode (Most Frequent Value)

• The number that appears most frequently in the dataset.


• Example: 3, 5, 7, 7, 9, 9, 9 → Mode = 9 (since it appears the most).

Best for: Categorical or discrete data.


Choosing the Best Measure

• Use Mean if data is normally distributed (no extreme values).


• Use Median if data has outliers or skewness.
• Use Mode if dealing with categorical data or discrete data.
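The worked examples for mean, median, and mode can be verified with Python's standard `statistics` module. The salary list in the second half is hypothetical and is there only to show why the median is preferred when outliers are present.

```python
import statistics

# Examples from the text above.
mean_value = statistics.mean([5, 10, 15, 20, 25])
median_odd = statistics.median([3, 7, 9])
median_even = statistics.median([3, 7, 9, 12])
mode_value = statistics.mode([3, 5, 7, 7, 9, 9, 9])

print("Mean:", mean_value)
print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Mode:", mode_value)

# Hypothetical salaries: one extreme value pulls the mean far more
# than it moves the median.
salaries = [30, 32, 35, 38, 40]
salaries_with_outlier = salaries + [500]

mean_shift = statistics.mean(salaries_with_outlier) - statistics.mean(salaries)
median_shift = statistics.median(salaries_with_outlier) - statistics.median(salaries)

print("Mean shift caused by the outlier:", mean_shift)
print("Median shift caused by the outlier:", median_shift)
```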
Dispersion
• In statistics, dispersion refers to the extent to which a set of data points are
spread out or scattered around a central value (such as the mean or median).
• It helps measure the variability or consistency of data.

Ways of measuring dispersion:


1. Range
2. Variance
3. Standard deviation
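The three dispersion measures listed above can be computed with the standard `statistics` module. The sketch reuses the numbers 5, 10, 15, 20, 25 from the mean example. The slides do not say whether the population or the sample form of the variance is intended, so both are shown as an assumption.

```python
import statistics

data = [5, 10, 15, 20, 25]

# Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# Variance: average squared deviation from the mean.
# pvariance divides by N (population); variance divides by N - 1 (sample).
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)

# Standard deviation: square root of the variance, in the data's own units.
population_std = statistics.pstdev(data)
sample_std = statistics.stdev(data)

print("Range:", data_range)
print("Variance (population / sample):", population_variance, sample_variance)
print("Std dev  (population / sample):", round(population_std, 3), round(sample_std, 3))
```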
Problem: Given the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: To find the Interquartile Range (IQR) for the given patients' ages:
Step 1: Arrange the data in ascending order and find the median.
The data is already sorted:
12, 14, 19, 22, 24, 26, 28, 31, 34. In this case, 24 is the median.

Step 2: Find the quartiles (Q1 and Q3)
• Q1 or Q0.25 (First Quartile): The median of the lower half ({12, 14, 19, 22})
• Median of {12, 14, 19, 22} = (14 + 19) / 2 = 16.5
• Q3 or Q0.75 (Third Quartile): The median of the upper half ({26, 28, 31, 34})
• Median of {26, 28, 31, 34} = (28 + 31) / 2 = 29.5

Step 3: Compute the IQR
• IQR = Q3 − Q1 = 29.5 − 16.5 = 13
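The median-of-halves procedure used in this solution can be written out directly. This is a minimal sketch; the function name `quartiles_by_halves` is made up, and it follows the convention used above of leaving the median out of both halves when the number of values is odd. Other quartile conventions exist and can give slightly different results.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    For an odd number of values the middle value is excluded from both
    halves, matching the worked solution above.
    """
    data = sorted(values)                      # Step 1: sort
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (
        statistics.median(lower),              # Q1: median of lower half
        statistics.median(data),               # overall median
        statistics.median(upper),              # Q3: median of upper half
    )


ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
q1, q2, q3 = quartiles_by_halves(ages)
iqr = q3 - q1                                  # Step 3: IQR = Q3 - Q1

print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
```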
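Five-point summary
• The five-point summary of a dataset consists of the minimum, Q1, the median,
Q3, and the maximum; a box plot is drawn from these five values.
Summary of the box-plot

• Wide Box (High IQR): The data between the first quartile (Q1) and third
quartile (Q3) is more dispersed, so the box in the box plot appears
wider.
• Narrow Box (Low IQR): The data is more concentrated around the
median.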
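A box plot is drawn from five numbers: minimum, Q1, median, Q3, and maximum. The sketch below computes them for the patients' ages with `statistics.quantiles`; the `"exclusive"` method is chosen here because it reproduces the quartiles of the worked IQR problem for this dataset, which is an assumption about the intended convention. The helper name `five_number_summary` is made up.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return (data[0], q1, q2, q3, data[-1])


ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
summary = five_number_summary(ages)
box_width = summary[3] - summary[1]   # IQR: the width of the box

print("Five-number summary:", summary)
print("Box width (IQR):", box_width)
```

A wider box (larger IQR) means the middle half of the data is more spread out; a narrower box means it is concentrated around the median.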
Shape of Data
Skewness and kurtosis (measures based on the third and fourth moments of the data)
indicate the symmetry/asymmetry and the tail heaviness (peakedness) of the dataset.
Skewness: It is a measure of asymmetry in the distribution of data values. It tells us
whether the data is symmetrically distributed or leans more toward one side of
the mean.
Types of Skewness
• Positive Skewness (Right-Skewed Data)
• The tail on the right side (higher values) is longer.
• Most data points are concentrated towards the left.
• Mean > Median > Mode.
• Negative Skewness (Left-Skewed Data)
• The tail on the left side (lower values) is longer.
• Most data points are concentrated towards the right.
• Mean < Median < Mode.
• Zero Skewness (Symmetric Data)
• The left and right sides of the distribution are roughly mirror images.
• Mean = Median = Mode.
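The sign of the skewness can be checked with the moment-based coefficient m3 / m2^1.5, where m2 and m3 are the second and third central moments. This particular formula is one common definition and is an assumption here, since the slides do not give one; the three small datasets are made up to illustrate each case.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment over m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical datasets for the three cases described above.
right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]      # long tail of high values
left_skewed = [1, 7, 8, 9, 9, 10, 10, 10]     # long tail of low values
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(
        name,
        "skewness:", round(skewness(data), 3),
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

For the right-skewed list the mean, median, and mode come out in the order Mean > Median > Mode, and the order reverses for the left-skewed list. These orderings are typical rather than guaranteed for every dataset.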
Kurtosis is a statistical measure that describes the shape of a probability
distribution, specifically its "tailedness" or the extremity of outliers in the data.
• In simpler terms, it tells us whether the data has heavy or light tails compared
to a normal distribution.

Why do we use kurtosis?

Outlier Detection: Kurtosis can help identify whether a dataset has extreme outliers.

• High kurtosis indicates that the data may contain outliers, which can be important in
many applications like risk management, financial modeling, or quality control.
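How an outlier raises kurtosis can be seen with the moment-based excess kurtosis, m4 / m2^2 − 3, which is 0 for a normal distribution. The formula choice and both datasets are assumptions made for illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: fourth central moment over m2 squared, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical data: the second list adds one extreme value to the first.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 50]

kurt_plain = excess_kurtosis(without_outlier)
kurt_outlier = excess_kurtosis(with_outlier)

print("Without outlier:", round(kurt_plain, 3))
print("With outlier   :", round(kurt_outlier, 3))
```

The evenly spread list has negative excess kurtosis (light tails), while adding the single extreme value pushes it above zero (heavy tail).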
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF VARIATION
• The coefficient of variation (CV) is a statistical measure that describes the
relative variability of a dataset.
• It is the ratio of the standard deviation to the mean, often expressed as a
percentage.
Formula:
Coefficient of Variation (CV) = (Standard Deviation / Mean) × 100

• Higher CV: A higher CV indicates greater variability relative to the mean,
suggesting that the data points are more spread out compared to the average.
• Lower CV: A lower CV means less variability relative to the mean, implying that
the data points are more consistent around the average.
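The formula CV = (standard deviation / mean) × 100 is easy to apply in code. The sketch below uses the population standard deviation, which is an assumption because the slides do not specify which form is meant, and the two datasets are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: (population std dev / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


# Two hypothetical datasets with the same mean (15) but different spread.
spread_out = [5, 10, 15, 20, 25]
consistent = [14, 15, 15, 15, 16]

cv_spread = coefficient_of_variation(spread_out)
cv_consistent = coefficient_of_variation(consistent)

print("CV of spread-out data:", round(cv_spread, 2), "%")
print("CV of consistent data:", round(cv_consistent, 2), "%")
```

Because the CV is a ratio, it is only meaningful when the mean is not zero and the data is on a ratio scale.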
Special Univariate plots
• A convenient way to check the shape of a small dataset is a stem-and-leaf plot.
• A stem-and-leaf plot is a method of organizing numerical data to show its
distribution while maintaining the original values.
• It helps in quickly identifying patterns, such as the shape of the data, clusters,
and outliers.

Structure of a Stem-and-Leaf Plot:


• The stem represents the leading digits (e.g., tens place).
• The leaves represent the last digit (e.g., ones place).
Stem-Leaf Plot
Example:
• Consider the following set of numbers: 23, 25, 31, 32, 35, 41, 42, 43, 47. Construct
a stem-and-leaf plot.
Stem | Leaf
2 | 3 5
3 | 1 2 5
4 | 1 2 3 7
• Stem (left side): 2, 3, 4 (representing 20s, 30s, 40s).
• Leaves (right side): The last digit of each number.
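The stem-and-leaf construction for this example can be sketched in a few lines. The function name `stem_and_leaf` is made up, and the sketch assumes non-negative two-digit integers, so that the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integers by tens digit (stem) and collect ones digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


numbers = [23, 25, 31, 32, 35, 41, 42, 43, 47]
plot = stem_and_leaf(numbers)

print("Stem | Leaf")
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```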
Q-Q Plot
• A Q-Q (Quantile-Quantile) plot is a graphical tool used to compare the
distribution of a dataset with a theoretical distribution (such as a normal
distribution).
• It helps determine whether a dataset follows a specific distribution by plotting the
quantiles of the data against the quantiles of the chosen theoretical distribution.
Key Features of a Q-Q Plot:
• If the points lie close to a straight diagonal line, the data approximately follows
the given distribution.
• Deviations from the line indicate differences in shape, skewness, or outliers.
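The points of a normal Q-Q plot can be computed with the standard library alone. In this sketch the theoretical distribution is a normal fitted to the sample's own mean and standard deviation, and the plotting positions (i + 0.5) / n are one common convention; both are assumptions, as are the function name and the reuse of the patients' ages as sample data.

```python
from statistics import NormalDist, mean, pstdev


def normal_qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    data = sorted(values)                          # sample quantiles
    n = len(data)
    fitted = NormalDist(mean(data), pstdev(data))  # normal with the sample's mean and std
    # Plotting positions: probabilities strictly between 0 and 1.
    probabilities = [(i + 0.5) / n for i in range(n)]
    theoretical = [fitted.inv_cdf(p) for p in probabilities]
    return list(zip(theoretical, data))


ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
points = normal_qq_points(ages)

for theoretical, observed in points:
    print(f"theoretical {theoretical:6.2f}   observed {observed}")
```

Plotting these pairs as a scatter plot and comparing them with the line y = x gives the Q-Q plot: points close to the line suggest the data is approximately normal.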
END
Lab program 2
• Develop a program to compute the correlation matrix to understand the
relationships between pairs of features. Visualize the correlation matrix using a
heatmap to see which variables have strong positive or negative correlations.
Create a pair plot to visualize pairwise relationships between features. Use the
California Housing dataset.
Introduction
• In data analysis and machine learning, understanding the relationships
between features is crucial for feature selection, multicollinearity
detection, and data interpretation.
• Correlation and pair plots are two essential techniques to analyze these
relationships.
1. Correlation Matrix
• A correlation matrix is a table showing correlation coefficients between
variables.
• It helps in understanding how strongly features are related to each other.
Types of Correlation
• Positive Correlation (0 to +1): As one feature increases, the other also tends
to increase.
• Negative Correlation (-1 to 0): As one feature increases, the other tends to decrease.
• No Correlation (0): No linear relationship between the variables.
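Each entry of a correlation matrix is typically a Pearson correlation coefficient: the covariance of two variables divided by the product of their standard deviations. A minimal sketch of that calculation follows; the function name `pearson_r` and all three pairs of lists are made up to illustrate the three cases.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data for the three cases.
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 65, 70, 78]      # rises with study hours
price = [1, 2, 3, 4, 5]
demand = [10, 8, 6, 4, 2]               # falls as price rises
unrelated = [2, 1, 3, 1, 2]             # no linear trend against 1..5

r_positive = pearson_r(study_hours, exam_scores)
r_negative = pearson_r(price, demand)
r_none = pearson_r(study_hours, unrelated)

print("Positive:", round(r_positive, 3))
print("Negative:", round(r_negative, 3))
print("None    :", round(r_none, 3))
```

A value near 0 only rules out a linear relationship; two variables can still be related in a non-linear way.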
Why Should You Use a Correlation Matrix?

• Identifies relationships between features.


• Helps in detecting multicollinearity in machine learning models.
• Highlights redundant features that may not add value to the model.
2. Heatmap for Correlation Matrix
• A heatmap is a visual representation of the correlation matrix.
• It uses color coding to indicate the strength of relationships between variables.
Benefits of Using a Heatmap
• Easy to interpret relationships between features.
• Quickly identifies highly correlated variables.
• Helps in feature selection and data preprocessing.
3. Pair Plot
• A pair plot (also known as a scatter plot matrix) is a collection of scatter plots for
every pair of numerical variables in the dataset.
• It helps in visualizing relationships between variables.
Why Use a Pair Plot?
• Shows the distribution of individual features along the diagonal.
• Displays relationships between features using scatter plots.
• Helps in identifying clusters, trends, and potential outliers.
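One possible shape for the lab program is sketched below. It is not the official solution. It assumes scikit-learn, pandas, seaborn, and matplotlib are installed and that the dataset can be downloaded on first use. The names `strong_pairs` and `run_lab`, the 0.7 threshold, and the 500-row sample for the pair plot are choices made for this example.

```python
def strong_pairs(names, matrix, threshold=0.7):
    """List feature pairs whose |correlation| is at least `threshold`.

    names  -- feature names, in matrix order
    matrix -- square correlation matrix as nested lists
    The 0.7 cut-off is an arbitrary illustration value.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):     # upper triangle only
            r = matrix[i][j]
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


def run_lab(sample_size=500):
    """Correlation matrix, heatmap, and pair plot for California Housing."""
    # Third-party imports are local so strong_pairs works without them.
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import fetch_california_housing

    # as_frame=True returns a pandas DataFrame (features plus target).
    df = fetch_california_housing(as_frame=True).frame

    # 1. Correlation matrix of all numeric columns.
    corr = df.corr()

    # 2. Heatmap: color encodes the strength and sign of each correlation.
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation matrix: California Housing")
    plt.tight_layout()

    # 3. Pair plot on a random sample to keep plotting time reasonable.
    sns.pairplot(df.sample(n=sample_size, random_state=0))
    plt.show()

    return strong_pairs(list(corr.columns), corr.values.tolist())
```

Calling `run_lab()` displays both figures and returns the feature pairs above the threshold, which is a quick way to spot candidates for multicollinearity.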