DS Module 01
Introduction
Imagine you are a teacher and you’re collecting information about your students
in a class.
Elements:
These are the individual people or items you’re collecting data about.
In our case, each student in the class is an element.
Variables:
These are the types of information you're collecting about each element.
For example, you might record each student’s:
• Name
• Age
• Grade
• Favorite subject
Each of these is a variable.
Observations:
An observation is the complete set of information you record for one student.
So, for one student, an observation might look like:
• Name: Riya
• Age: 18
• Grade: A
• Favorite Subject: Math
This full set is one observation.
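The three terms can be pictured as a small table in code. The sketch below is illustrative only: the name `students` and the second student are hypothetical, added just to show that every element gets its own observation.

```python
# Each dictionary is one observation: the full set of values recorded
# for one element (one student). The dictionary keys are the variables.
students = [
    {"Name": "Riya", "Age": 18, "Grade": "A", "Favorite Subject": "Math"},
    # Hypothetical second student, not taken from the notes.
    {"Name": "Student B", "Age": 17, "Grade": "B", "Favorite Subject": "Science"},
]

variables = list(students[0].keys())

print("Number of elements:", len(students))
print("Variables:", variables)
print("One observation:", students[0])
```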
In short:
• Nominal = Name only
• Ordinal = Order, no math
• Interval = Order + equal steps, no true zero
• Ratio = Everything (order, math, true zero)
Feature    Categorical Data                 Quantitative Data
Example    Eye color: Blue, Green, Brown    Height: 150 cm, 165 cm, 180 cm
Difference Table:
Feature         Cross-Sectional Data              Time Series Data
Time element    Data collected at one time        Data collected over a period of time
Use case        Comparing groups or categories    Studying trends or patterns over time
Example:
Imagine these are the math test scores of 5 students:
Scores: 45, 50, 48, 92, 44
Without descriptive statistics, you’d have to look at all five numbers separately,
which can get confusing. But with it, we can summarize this data:
Apple 4
Banana 3
Orange 3
This table tells you that Apple is the most popular, mentioned 4 times, while
Banana and Orange were each mentioned 3 times.
Summary:
• Tabular summarizing gives a simple count of how many items fall into each
category (like fruits in our example).
• Graphical summarizing (like bar charts) helps you quickly compare the
categories visually.
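Both kinds of summary can be sketched in a few lines of Python. The list `responses` is a hypothetical set of survey answers, chosen only so that the counts match the fruit table above (Apple 4, Banana 3, Orange 3); the text bar chart stands in for a real graph.

```python
from collections import Counter

# Hypothetical survey answers, picked to match the counts in the table.
responses = ["Apple", "Banana", "Apple", "Orange", "Apple",
             "Banana", "Orange", "Apple", "Banana", "Orange"]

frequency = Counter(responses)

# Tabular summary: one row per category with its count.
for fruit, count in sorted(frequency.items()):
    print(f"{fruit:<8}{count}")

# Graphical summary as a text bar chart: one '#' per answer.
for fruit, count in sorted(frequency.items()):
    print(f"{fruit:<8}{'#' * count}")
```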
8.] Summarizing Quantitative Data
Ans: Summarizing Quantitative Data means taking numbers and finding ways to
describe what they tell you in a simple way. Instead of looking at every
individual number, we try to get a general idea of what the data looks like.
Let’s break it down with an easy example:
Example:
Suppose you have the following test scores of 5 students:
Scores: 55, 60, 75, 85, 90
Now, let’s see how we can summarize this data:
1. Average (Mean):
The average gives us a single number that represents the "typical" score. To
find the average, you add up all the numbers and divide by how many there
are.
So, for our data:
• Add up all the scores: 55 + 60 + 75 + 85 + 90 = 365
• Divide by the number of students: 365 ÷ 5 = 73
So, the average score is 73. This tells us that the typical score is around 73.
2. Range:
The range tells you how spread out the data is by finding the difference
between the highest and lowest values.
• The highest score is 90.
• The lowest score is 55.
So, the range is:
90 - 55 = 35
This means the scores are spread out by 35 points.
3. Median:
The median is the middle value when the data is arranged in order. It’s
useful when you don’t want extreme values (very high or very low scores)
to distort your measure of the center, as they can with the average.
• First, sort the scores: 55, 60, 75, 85, 90
• The middle score is 75 (since it's the third number).
So, the median score is 75.
4. Mode:
The mode is the number that appears the most. If no number repeats, we
don’t have a mode.
In our example, all scores are different, so there is no mode.
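The four summaries above can be checked with Python's standard library. This is a minimal sketch; the variable names are illustrative, and the "no mode" rule (an empty list when nothing repeats) follows the definition used in these notes rather than any library convention.

```python
from collections import Counter
from statistics import mean, median

scores = [55, 60, 75, 85, 90]

average = mean(scores)                    # sum of scores divided by their count
score_range = max(scores) - min(scores)   # highest minus lowest
middle = median(scores)                   # middle value of the sorted list

# Mode: the values with the highest count, but only if something repeats.
counts = Counter(scores)
highest = max(counts.values())
modes = [value for value, n in counts.items() if n == highest] if highest > 1 else []

print("Mean:", average)
print("Range:", score_range)
print("Median:", middle)
print("Mode:", modes if modes else "no mode")
```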
Person  Gender  Favorite Fruit
1       Male    Apple
2       Female  Banana
3       Male    Apple
4       Female  Orange
5       Male    Orange
6       Female  Apple
You can create a cross-tabulation (or a table) to show how many males and
females like each fruit:
Favorite Fruit Male Female
Apple 2 1
Banana 0 1
Orange 1 1
This tells us:
• 2 males like Apple
• 1 female likes Apple
• 1 male and 1 female like Orange
• 1 female likes Banana
It’s a quick way to compare two categories and see how they relate to each other.
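The same cross-tabulation can be built by counting each (fruit, gender) pair. This is a sketch using only the six people listed above; the names `people` and `cell_counts` are illustrative.

```python
from collections import Counter

# (gender, favorite fruit) for the six people in the list above.
people = [
    ("Male", "Apple"), ("Female", "Banana"), ("Male", "Apple"),
    ("Female", "Orange"), ("Male", "Orange"), ("Female", "Apple"),
]

# Count every (fruit, gender) combination; missing pairs count as 0.
cell_counts = Counter((fruit, gender) for gender, fruit in people)

fruits = sorted({fruit for _, fruit in people})
print(f"{'Favorite Fruit':<16}{'Male':<6}{'Female'}")
for fruit in fruits:
    male = cell_counts[(fruit, "Male")]
    female = cell_counts[(fruit, "Female")]
    print(f"{fruit:<16}{male:<6}{female}")
```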
Scatter Diagram:
A scatter diagram (or scatter plot) is a graph that shows the relationship between
two quantitative variables. You plot points on a graph where the x-axis
(horizontal) represents one variable and the y-axis (vertical) represents another.
For example, let’s say you want to see if there’s a relationship between the
number of hours studied and the test score. You can plot the data like this:
Hours Studied Test Score
1 50
2 60
3 70
4 80
5 90
Now, you would plot the points on a graph:
• On the x-axis, you plot hours studied (1, 2, 3, 4, 5).
• On the y-axis, you plot test scores (50, 60, 70, 80, 90).
If you plot these points, you might see a pattern where the points go up as the
hours studied increase. This tells you there is a positive relationship—the more
you study, the higher the score.
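Before drawing anything, the points and their upward pattern can be checked in code. This sketch only pairs the values and measures the rise between neighbouring points; the names `points` and `rises` are illustrative.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90]

# Each (x, y) pair is one point on the scatter diagram.
points = sorted(zip(hours, scores))

# Change in test score between consecutive points, ordered by hours studied.
rises = [y2 - y1 for (_, y1), (_, y2) in zip(points, points[1:])]
always_rising = all(rise > 0 for rise in rises)

print("Points:", points)
print("Rise per extra hour:", rises)
print("Positive relationship:", always_rising)
```

If a plotting library such as matplotlib is available, `plt.scatter(hours, scores)` would draw the same points as an actual graph.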
2. Measures of Variability:
These measures tell you how spread out or different the data is.
Examples:
• Range: The difference between the maximum and minimum values.
• Variance: The average of the squared differences from the mean (how
spread out the data is).
• Standard Deviation: The square root of the variance. It tells you how much
individual values deviate from the mean.
Example:
For the test scores: 40, 50, 60, 70, 80
• Range:
80 (highest) - 40 (lowest) = 40.
• Standard Deviation:
To calculate, you first find the differences from the mean (60), square them
(400, 100, 0, 100, 400; total 1000), average them, and finally take the
square root. Dividing by n - 1 = 4 (the sample formula) gives a variance of
250 and a standard deviation of about 15.81. Dividing by n = 5 (the
population formula) gives a variance of 200 and a standard deviation of
about 14.14.
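The calculation can be reproduced step by step with Python's `statistics` module. Note that there are two conventions: dividing the squared differences by n (population) gives about 14.14, while dividing by n - 1 (sample) gives about 15.81. The sketch below shows both.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

scores = [40, 50, 60, 70, 80]

center = mean(scores)
squared_diffs = [(x - center) ** 2 for x in scores]

print("Range:", max(scores) - min(scores))
print("Mean:", center)
print("Squared differences:", squared_diffs)

# Population formulas divide by n.
print("Population variance:", pvariance(scores))
print("Population standard deviation:", round(pstdev(scores), 2))

# Sample formulas divide by n - 1.
print("Sample variance:", variance(scores))
print("Sample standard deviation:", round(stdev(scores), 2))
```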
11.] Measures of Distribution Shape
Ans: The shape of data tells you how the data looks when you graph it—
especially as a histogram (a kind of bar graph that shows how often values
appear).
It's like looking at a "mountain" made from your data. Some shapes are smooth
and balanced, others lean to one side, and some have weird bumps.
1. Symmetrical (or Normal) Shape
• The left and right sides of the graph are even.
• The data is centered around the middle value (mean = median).
• Looks like a bell curve or a hill.
2. Skewed Right (Positively Skewed)
• Most values are low, but a few are very high.
• The tail (long end) is on the right.
• Mean > Median
3. Skewed Left (Negatively Skewed)
• Most values are high, but a few are very low.
• The tail is on the left.
• Mean < Median
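The mean-versus-median comparison in the three shapes above can be turned into a rough check. This is only a rule of thumb, not a formal skewness measure, and the three score lists are hypothetical examples made up for illustration.

```python
from statistics import mean, median


def describe_skew(values):
    """Rough label based only on comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "skewed right"
    if m < med:
        return "skewed left"
    return "symmetrical"


# Hypothetical score lists, one per shape.
balanced = [60, 70, 80, 90, 100]    # mean and median agree
few_high = [40, 45, 50, 55, 100]    # one very high score pulls the mean up
few_low = [10, 85, 90, 95, 100]     # one very low score pulls the mean down

for name, data in [("balanced", balanced), ("few_high", few_high), ("few_low", few_low)]:
    print(name, "->", describe_skew(data))
```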
Kurtosis tells you how tall and sharp the peak of a graph is, and how thick or thin
the tails are (the ends of the distribution).
It shows whether the data has more or fewer extreme values (outliers) compared
to normal.
Types of Kurtosis:
1. Mesokurtic (Normal Kurtosis):
• This is the standard bell-shaped curve.
• Data has a normal number of outliers.
• Not too sharp, not too flat.
2. Leptokurtic (High Kurtosis):
• The graph has a very sharp peak.
• Heavy tails → more extreme values (outliers).
• Looks like a narrow mountain.
Example: Most students score around 80, but a few score extremely low or high
(like 0 or 100).
3. Platykurtic (Low Kurtosis):
• The graph is flat and wide.
• Light tails → fewer extreme values.
• Looks like a low hill.
Example: Students are evenly spread in scores, no one scores extremely high or
low.
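Kurtosis can also be computed as a number. The sketch below uses one common definition, the population excess kurtosis (fourth central moment divided by the squared variance, minus 3), so that a normal curve scores about 0, positive values suggest heavy tails, and negative values suggest light tails. The two score lists are hypothetical, loosely modelled on the examples above.

```python
from statistics import mean


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (mk = k-th central moment)."""
    center = mean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n
    m4 = sum((x - center) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical data: most scores at 80, with one very low and one very high.
peaked = [0, 80, 80, 80, 80, 80, 80, 80, 80, 100]
# Hypothetical data: scores spread evenly, with no extreme values.
flat = [50, 60, 70, 80, 90]

print("Peaked data:", round(excess_kurtosis(peaked), 2))
print("Flat data:", round(excess_kurtosis(flat), 2))
```

Other software may apply small-sample corrections, so its values can differ slightly from this plain formula.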
2. Detecting Outliers:
Outliers are data points that are far away from most other values. They can affect
statistical results.
• Outlier rule: A common rule is that a data point lying more than 1.5 times
the interquartile range (IQR) above Q3 or below Q1 is an outlier.
Example:
Consider data: 1, 3, 5, 7, 9, 11, 13, 15, 100
• The IQR is Q3 - Q1 = 13 - 5 = 8.
• Any number above Q3 + 1.5 × IQR = 13 + 1.5 × 8 = 25 or below Q1 - 1.5 ×
IQR = 5 - 1.5 × 8 = -7 is an outlier.
• 100 is much higher than 25, so it’s an outlier.
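The 1.5 × IQR rule can be sketched with `statistics.quantiles` (Python 3.8 or later). One assumption to note: quartiles can be computed in several ways, and `method="inclusive"` is chosen here because it reproduces Q1 = 5 and Q3 = 13 for this data; other conventions give slightly different quartiles.

```python
from statistics import quantiles

data = [1, 3, 5, 7, 9, 11, 13, 15, 100]

# "inclusive" matches Q1 = 5 and Q3 = 13 for this data set.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Outliers:", outliers)
```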
3. Box Plot:
A box plot is a graphical way to show the distribution of data, including the
median, quartiles, and any outliers.
• Box: Represents the IQR (between Q1 and Q3).
• Line inside the box: Shows the median (Q2).
• Whiskers: Show the range (from Q1 to the lowest value and from Q3 to the
highest value that are not outliers).
• Outliers: Are marked separately (often as dots or stars).
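The pieces of a box plot can be computed before drawing one. The helper `box_plot_summary` below is a hypothetical name, and it assumes the "inclusive" quartile method and the 1.5 × IQR outlier rule; plotting libraries may use slightly different conventions.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the numbers a box plot displays, using the 1.5 x IQR rule."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are not outliers.
    inside = [x for x in ordered if low <= x <= high]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low or x > high],
    }


summary = box_plot_summary([1, 3, 5, 7, 9, 11, 13, 15, 100])
for part, value in summary.items():
    print(f"{part}: {value}")
```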
3. Measures of Association Between Two Variables:
These measures tell you how two variables are related to each other.
Examples:
• Correlation: Shows how strongly two variables are related. It ranges from
-1 (perfect negative relationship) to +1 (perfect positive relationship). A 0
means no linear relationship.
Example:
If hours studied increase and test scores also increase, there’s a positive
correlation.
• Covariance: Measures how two variables change together. It’s similar to
correlation but without the standardization.
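Both measures can be computed by hand for a small hours-studied and test-score data set. This sketch assumes the sample formulas (dividing by n - 1) and the Pearson correlation coefficient; the function names are illustrative.

```python
from math import sqrt
from statistics import mean

hours = [1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90]


def sample_covariance(xs, ys):
    """Average product of paired deviations from the means, dividing by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


print("Covariance:", sample_covariance(hours, scores))
print("Correlation:", correlation(hours, scores))
```

The covariance depends on the units of the data (hours times points), while the correlation is unit-free, which is what "standardization" refers to.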
Summary:
• Location: Tells where the data is centered (mean, median, mode).
• Variability: Shows how spread out or varied the data is (range, variance,
standard deviation).
• Shape: Describes the pattern of data (skewness, kurtosis).
• Relative Location: Shows where individual data points lie relative to others
(percentiles, quartiles).
• Outliers: Identifies data points that are unusually far from others.
• Box Plot: A graph that summarizes data distribution and shows outliers.
• Association: Describes the relationship between two variables (correlation,
covariance).