Module 1: Understanding Data
Understanding of Data
2.4 Descriptive Statistics
• Descriptive statistics is the branch of statistics concerned with
summarizing datasets.
• It is used to summarize and describe data.
• Descriptive statistics is not concerned with ML algorithms or how they
work.
• Descriptive analytics and data visualization techniques help to understand
the nature of the data, which in turn helps to determine the kinds of
machine learning or data mining tasks that can be applied to the data.
• This step is often known as Exploratory Data Analysis (EDA).
Dataset and Data Types
• A dataset can be assumed to be a collection of data objects.
• The data objects may be records, points, vectors, patterns, events, cases,
samples, or observations.
• Example: Sample Patient Table
Patient ID | Name | Age | Blood Test | Fever | Disease
Discrete data is typically represented using bar charts or count-based statistics like
mode and frequency.
Continuous data is a type of numerical data that can take any value within a given
range, including fractions and decimals.
It is measurable rather than countable.
Bar Chart: A bar chart is a graphical representation of data using rectangular bars,
where the length or height of each bar represents the value of a particular
category.
Bars can be displayed vertically (column chart) or horizontally.
• A bar chart is best suited for categorical data (data divided into groups or
categories).
• It can also be used for discrete numerical data.
• It is not ideal for continuous data (e.g., temperature, speed); a line chart is
usually better for that.
• Pie Chart: A pie chart is a circular graph divided into slices, where each slice
represents a proportion of the whole.
• The size of each slice corresponds to the percentage or fraction of a category
within the dataset.
• Like a bar chart, a pie chart is helpful for illustrating univariate data.
Histogram: A histogram is a graphical representation of the distribution of
numerical data.
• It looks like a bar chart, but instead of showing categories, it groups continuous
data into bins (intervals) and shows the frequency of data points within each bin.
When to Use a Histogram?
• When analyzing continuous numerical data
• To understand the frequency of data within specific ranges
• To observe distribution patterns (e.g., normal, skewed, uniform)
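The binning step behind a histogram can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `histogram_counts`, the temperature readings, and the choice of equal-width bins that include their lower edge are all assumptions made for the example.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width); values outside
    the covered range are ignored. The left-closed convention is an
    assumption made for this sketch.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical temperature readings in degrees Celsius (continuous data).
temperatures = [21.5, 22.1, 22.8, 23.4, 23.9, 24.2, 24.6, 25.0, 26.3, 29.7]
counts = histogram_counts(temperatures, start=20, width=2, n_bins=5)

# Print a rough text histogram: one '*' per data point in each bin.
for i, count in enumerate(counts):
    low = 20 + 2 * i
    print(f"{low}-{low + 2}: {'*' * count}")
```

The row of stars per bin plays the role of the bar height; a longer run of stars in the middle bins than at the edges hints at a peaked, roughly bell-shaped distribution.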
Problem-1:
• There are 60 students in a class. Among them, 15 students were placed in a
company offering a 3.5 lakh package, 10 students in a 6.5 lakh package, 8
students in a 10 lakh package, and 5 students in a 12 lakh package.
Generate a bar chart and a pie chart.
• Solution:
3.5 lakh package: 15/60 × 100 = 25 %
6.5 lakh package: 10/60 × 100 = 16.67 %
10 lakh package: 8/60 × 100 = 13.33 %
12 lakh package: 5/60 × 100 = 8.33 %
Not placed = total students - placed students = 60 - 38 = 22 students
22/60 × 100 = 36.67 %
Figure: Placement distribution of students
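The arithmetic of Problem 1 can be checked with a short Python sketch. The counts come from the problem statement; the dictionary layout, the function name `placement_shares`, and the conversion of each share into a pie-slice angle (share of 360 degrees) are assumptions added for illustration.

```python
# Counts taken from Problem 1; the data layout is an assumption of this sketch.
TOTAL_STUDENTS = 60
placed = {"3.5 lakh": 15, "6.5 lakh": 10, "10 lakh": 8, "12 lakh": 5}


def placement_shares(placed, total):
    """Return {category: (count, percent, pie_angle_in_degrees)}.

    Adds a 'Not placed' category for the students not covered by any package.
    """
    counts = dict(placed)
    counts["Not placed"] = total - sum(placed.values())
    return {
        name: (count, 100 * count / total, 360 * count / total)
        for name, count in counts.items()
    }


shares = placement_shares(placed, TOTAL_STUDENTS)
for name, (count, percent, angle) in shares.items():
    # One '#' per student gives a rough text bar chart.
    print(f"{name:>10} | {'#' * count} {count} ({percent:.2f} %, {angle:.0f} deg)")
```

The bar lengths correspond to the bar chart and the angles to the slices of the pie chart; a plotting library could draw both from the same dictionary.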
Problem-2:
Total students = 60
Consider the ranges 0-3, 3-6, 6-9, 9-12, and 12-15 lakh.
No. of placed students:
Package (in lakhs) | College A | College B
3 | 25 | 6
5 | 12 | 15
7 | 2 | 6
10 | 1 | 14
11.5 | 0 | 9
15 | 0 | 10
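One way to group the Problem-2 table into the stated ranges is sketched below. The source does not say which range a boundary value such as 3 or 15 belongs to, so this sketch assumes each range includes its upper limit and excludes its lower limit; that convention, the function name `grouped_frequencies`, and the data layout are assumptions, not part of the original problem.

```python
# Rows from the Problem-2 table: package in lakhs -> (College A, College B).
table = {3: (25, 6), 5: (12, 15), 7: (2, 6), 10: (1, 14), 11.5: (0, 9), 15: (0, 10)}
edges = [0, 3, 6, 9, 12, 15]


def grouped_frequencies(table, edges):
    """Sum the student counts per range, treating each range as (low, high].

    ASSUMPTION: upper limits are inclusive, so a 3 lakh package falls in 0-3
    and a 15 lakh package falls in 12-15.
    """
    result = {}
    for low, high in zip(edges, edges[1:]):
        rows = [counts for package, counts in table.items() if low < package <= high]
        college_a = sum(row[0] for row in rows)
        college_b = sum(row[1] for row in rows)
        result[f"{low}-{high}"] = (college_a, college_b)
    return result


for label, (a, b) in grouped_frequencies(table, edges).items():
    print(f"{label:>5} lakh | College A: {a:2d} | College B: {b:2d}")
```

Under this convention the bars for College A are highest in the lowest range, while College B has its highest bar in the 9-12 range; a different boundary convention would shift the counts for 3 and 15.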
Central Tendency
• Central tendency refers to the measure that represents the center or typical value
of a dataset.
• It helps in understanding the overall trend of the data by identifying a single
value that best describes the distribution.
The three main measures of central tendency are:
1. Mean (Arithmetic Average)
2. Median (Middle Value)
3. Mode (Most Frequent Value)
1. Mean (Arithmetic Average)
• The sum of all values divided by the number of values.
• Formula:
Mean = ∑X / N, or equivalently Mean = (x₁ + x₂ + … + x_N) / N
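2. Median (Middle Value)
• The middle value when the data are sorted in order. For an even number of
values, it is the average of the two middle values.
For 3, 7, 9 → Median = 7
For 3, 7, 9, 12 → Median = (7+9)/2 = 8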
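The measures of central tendency can be verified with Python's standard `statistics` module. The two number lists are the ones used in the median examples; the list used for the mode is a hypothetical addition.

```python
import statistics

odd_count = [3, 7, 9]
even_count = [3, 7, 9, 12]

# Mean: sum of all values divided by the number of values.
mean_even = statistics.mean(even_count)

# Median: middle value, or the average of the two middle values.
median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)

# Mode: most frequent value (hypothetical data, since the examples
# above contain no repeated value).
mode_value = statistics.mode([2, 3, 3, 5])

print(mean_even, median_odd, median_even, mode_value)
```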
• Wide Box (High IQR): The data between the first quartile (Q1) and third
quartile (Q3) is more dispersed, so the box in the box plot appears
wider (IQR = Q3 - Q1).
• Narrow Box (Low IQR): The data is more concentrated around the
median, so the box appears narrower.
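The link between box width and IQR can be illustrated with a small sketch. Quartiles can be computed by several slightly different conventions; this example assumes the default method of `statistics.quantiles`, and both datasets are hypothetical.

```python
import statistics


def iqr(values):
    """Interquartile range Q3 - Q1.

    Uses the default ('exclusive') method of statistics.quantiles; other
    quartile conventions can give slightly different numbers.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical datasets with a similar center but different spread.
spread_out = [1, 5, 10, 15, 20, 25, 30]
concentrated = [13, 14, 15, 15, 15, 16, 17]

print("IQR (wide box):  ", iqr(spread_out))
print("IQR (narrow box):", iqr(concentrated))
```

The first dataset would produce a wide box in a box plot and the second a narrow one, even though both are centered on 15.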
Shape of Data
Skewness and kurtosis (measures based on the higher-order moments of the data)
indicate the symmetry or asymmetry of the dataset and the heaviness of its tails.
Skewness: It is a measure of asymmetry in the distribution of data values. It tells us
whether the data is symmetrically distributed or leans more toward one side of
the mean.
Types of Skewness
• Positive Skewness (Right-Skewed Data)
  • The tail on the right side (higher values) is longer.
  • Most data points are concentrated towards the left.
  • Typically, Mean > Median > Mode.
• Negative Skewness (Left-Skewed Data)
  • The tail on the left side (lower values) is longer.
  • Most data points are concentrated towards the right.
  • Typically, Mean < Median < Mode.
• Zero Skewness (Symmetric Data)
  • The left and right sides of the distribution are roughly mirror images.
  • For a symmetric distribution with a single peak, Mean = Median = Mode.
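The sign of the skewness can be checked numerically. The sketch below assumes the moment-based definition (third central moment divided by the standard deviation cubed, using population moments); other definitions exist, and the datasets are hypothetical.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments).

    ASSUMPTION: this is one common definition; sample-corrected variants
    give slightly different numbers but the same sign.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical datasets.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 10]   # long tail towards high values
left_skewed = [-x for x in right_skewed]      # mirror image: long tail towards low values
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data))
```

For the right-skewed list the mean exceeds the median, which exceeds the mode, matching the ordering stated for positive skewness; the mirrored list reverses the ordering.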
Kurtosis is a statistical measure that describes the shape of a probability
distribution, specifically its "tailedness" or the extremity of outliers in the data.
• In simpler terms, it tells us whether the data has heavy or light tails compared
to a normal distribution.
Outlier Detection: Kurtosis can help identify whether a dataset has extreme outliers.
• High kurtosis indicates that the data may contain outliers, which can be important in
many applications like risk management, financial modeling, or quality control.
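A minimal sketch of how kurtosis reacts to an extreme value follows. It assumes the excess-kurtosis convention (fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores 0); the datasets are hypothetical.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population moments).

    ASSUMPTION: subtracting 3 makes a normal distribution score 0, so
    positive values mean heavier tails and negative values lighter tails.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical datasets.
flat = list(range(1, 11))                       # evenly spread, no extreme values
with_outlier = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]  # one extreme value

print("flat:        ", round(excess_kurtosis(flat), 2))
print("with outlier:", round(excess_kurtosis(with_outlier), 2))
```

The evenly spread data gives a negative value (light tails), while the single extreme value pushes the result well above zero, which is why high kurtosis is treated as a hint that outliers may be present.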
Mean Absolute Deviation and Coefficient of Variation
• The coefficient of variation (CV) is a statistical measure that describes the
relative variability of a dataset.
• It is the ratio of the standard deviation to the mean, often expressed as a
percentage.
Formula:
Coefficient of Variation (CV) = (Standard Deviation / Mean) × 100
• Lower CV: A lower CV means less variability relative to the mean, implying that
the data points are more consistent around the average.
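The CV formula can be tried out with a short sketch. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would change the numbers slightly. The two datasets are hypothetical and share the same mean.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: (standard deviation / mean) * 100.

    ASSUMPTION: uses the population standard deviation; the mean must be
    non-zero for the ratio to be defined.
    """
    return statistics.pstdev(values) / statistics.mean(values) * 100


# Hypothetical datasets, both with mean 50.
consistent = [48, 50, 52]
variable = [10, 50, 90]

print(f"consistent: {coefficient_of_variation(consistent):.1f} %")
print(f"variable:   {coefficient_of_variation(variable):.1f} %")
```

Because the CV is a ratio, it allows datasets measured on different scales or in different units to be compared, which the standard deviation alone does not.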
Special Univariate plots
• A convenient way to check the shape of a small dataset is a stem-and-leaf plot.
• A stem-and-leaf plot is a method of organizing numerical data to show its
distribution while maintaining the original values.
• It helps in quickly identifying patterns, such as the shape of the data, clusters,
and outliers.
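A stem-and-leaf plot is simple enough to build by hand or in a few lines of code. The sketch below assumes non-negative whole numbers, with the units digit as the leaf and the remaining digits as the stem; the function name `stem_and_leaf` and the marks are hypothetical.

```python
def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted list of leaves (units digit).

    ASSUMPTION: values are non-negative integers.
    """
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot.setdefault(stem, []).append(leaf)
    return plot


# Hypothetical exam marks.
marks = [45, 52, 58, 61, 61, 67, 73, 78, 79, 94]

for stem, leaves in stem_and_leaf(marks).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row keeps the original values readable (stem 6 with leaves 1 1 7 means 61, 61, 67), while the row lengths show the shape of the distribution, much like a histogram turned on its side.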