
DS Module 01

The document provides an introduction to data and statistics, covering key concepts such as elements, variables, observations, scales of measurement, and types of data. It explains descriptive statistics, statistical inference, and methods for summarizing both categorical and quantitative data, including cross-tabulations and scatter diagrams. The document emphasizes the importance of organizing and analyzing data to derive meaningful insights.

Uploaded by

Adiba Khan
Copyright
© All Rights Reserved

Module 01:

Introduction

1.] Data and Statistics: Elements, Variables, and Observations


Ans: Imagine you're collecting information about your friends—like their names,
ages, and favorite colors. All that collected information is called data.
Now, statistics is just a way of organizing, analyzing, and understanding that data
so you can learn something from it—like figuring out the most common favorite
color or the average age of your friends.
Let’s break it down even more:
When you're dealing with data, think of each friend as one record. That record
has different pieces of information about them. Each of those pieces is called a
variable—for example, age is one variable, and favorite color is another.
An element is just the thing you’re collecting data about. In this case, each friend
is an element. If you were collecting data about cars, then each car would be an
element instead.
An observation is the full set of answers or values you get from one element. So,
for one friend, if you know their name, age, and favorite color, all of that together
is one observation.
In short, you’re looking at a group of things (elements), collecting facts about
them (variables), and recording those facts (observations) for each one. That’s
how data and statistics help us make sense of the world around us.

Imagine you are a teacher and you’re collecting information about your students
in a class.
Elements:
These are the individual people or items you’re collecting data about.
In our case, each student in the class is an element.
Variables:
These are the types of information you're collecting about each element.
For example, you might record each student’s:
• Name
• Age
• Grade
• Favorite subject
Each of these is a variable.

Observations:
An observation is the complete set of information you record for one student.
So, for one student, an observation might look like:
• Name: Riya
• Age: 18
• Grade: A
• Favorite Subject: Math
This full set is one observation.
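
The same idea can be sketched in Python. This is only an illustration: Riya's record comes from the example above, while the second student and the field names are made up for the sketch.

```python
# Each dictionary is one observation: all the values recorded for one element (student).
# Only Riya's record comes from the notes; the second student is hypothetical.
students = [
    {"name": "Riya", "age": 18, "grade": "A", "favorite_subject": "Math"},
    {"name": "Arjun", "age": 19, "grade": "B", "favorite_subject": "Physics"},
]

variables = list(students[0].keys())  # the kinds of information collected
num_elements = len(students)          # one element per student
first_observation = students[0]       # everything recorded for Riya

print("Variables:", variables)
print("Number of elements:", num_elements)
print("One observation:", first_observation)
```

Here the list holds the elements, the dictionary keys are the variables, and each dictionary is one observation.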

2.] Explain Scales of measurement


Ans: Scales of measurement are ways to describe what kind of data you're
dealing with and what you can do with it. There are four main types, and each
one gives you different ways to analyze the data.
Let’s understand them in simple terms with examples:

1. Nominal Scale – Just Names or Labels


This is the simplest type. It names or labels things, but the categories have no
order, and any numbers used as codes have no numeric meaning.
Example:
• Colors: Red, Blue, Green
• Gender: Male, Female
• City names: Mumbai, Delhi, Kolkata
You can’t say one is more than the other; these are just categories.

2. Ordinal Scale – Order Matters, But Not the Difference


Here, the data has a meaningful order, but the difference between values is not
clear or equal.
Example:
• Movie ratings: Good, Better, Best
• Rank in class: 1st, 2nd, 3rd
• Customer satisfaction: Poor, Average, Excellent
You know the order, but you don’t know how much better 1st is compared to
2nd.

3. Interval Scale – Order + Equal Gaps, No True Zero


Here, the data has order and equal spacing between values, but there’s no true
zero point.
Example:
• Temperature in Celsius: 10°C, 20°C, 30°C
• Dates on a calendar: 2000, 2010, 2020
You can add and subtract values, but you can’t say 20°C is twice as hot as 10°C
because the zero is not meaningful.

4. Ratio Scale – Order + Equal Gaps + True Zero


This is the most detailed. It has everything: order, equal spacing, and a true zero.
You can do all math operations with it.
Example:
• Height, weight, age, money
• Distance in kilometers: 0 km, 5 km, 10 km
Here, you can say 10 km is twice as far as 5 km because zero really means
"nothing."

In short:
• Nominal = Name only
• Ordinal = Order, no math
• Interval = Order + equal steps, no true zero
• Ratio = Everything (order, math, true zero)
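
The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses the 10°C and 20°C values from above. The Kelvin conversion is not part of these notes; it is added here as an assumption, because Kelvin is a temperature scale with a true zero.

```python
# Interval scale: differences are meaningful, ratios are not.
celsius_low, celsius_high = 10.0, 20.0
difference = celsius_high - celsius_low   # a 10-degree gap is meaningful

# Assumption for illustration: convert to Kelvin, which has a true zero,
# before comparing the two temperatures as a ratio.
kelvin_low = celsius_low + 273.15
kelvin_high = celsius_high + 273.15
kelvin_ratio = kelvin_high / kelvin_low   # about 1.035, nowhere near 2

# Ratio scale: distance has a true zero, so the ratio can be used directly.
distance_ratio = 10 / 5                   # 10 km is twice as far as 5 km

print(difference, round(kelvin_ratio, 3), distance_ratio)
```

So 20°C is 10 degrees warmer than 10°C, but it is not "twice as hot", while 10 km really is twice 5 km.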

3.] Categorical and Quantitative Data


Ans: Qualitative data (also called categorical data) is all about descriptions or
labels. It tells you what kind of something you’re dealing with, not how much.
You can count how often each label appears, but arithmetic on the labels
themselves (like adding or averaging them) has no meaning.
Example: Colors of cars (Red, Blue, White), Names of cities, Types of animals.
Quantitative data is about numbers and amounts. It tells you how much, how
many, or how big something is. You can do math with this data like adding,
averaging, etc.
Example: Age (20 years), Weight (65 kg), Number of students (30).
Feature                   Categorical Data                Quantitative Data
What it describes         Qualities or groups             Amounts or measurements
Data type                 Words or labels                 Numbers
Can you do math with it?  No                              Yes
Example                   Eye color: Blue, Green, Brown   Height: 150 cm, 165 cm, 180 cm
Goal                      To group or classify            To measure or count

4.] Explain Cross-Sectional and Time Series Data


Ans: Cross-Sectional data is like taking a snapshot of many things at one point in
time. You're collecting data from different people, places, or items all at the same
time.
Example: A survey of 100 students about their favorite subject conducted on one
specific day. You're comparing many students, but only on that day.
Time Series data is like watching one thing over time. You're collecting data from
the same person or object again and again at different times.
Example: Tracking the daily temperature in Mumbai for one month. You're
watching how the temperature changes over time in one place.

Difference Table:
Feature       Cross-Sectional Data                Time Series Data
Focus         Many subjects at one point in time  One subject over different points in time
Time element  Data collected at one time          Data collected over a period of time
Example       Income of 50 people in 2025         Monthly income of one person from Jan to Dec 2025
Use case      Comparing groups or categories      Studying trends or patterns over time
Data type     Wide snapshot of data               Long view of data

5.] Explain Descriptive statistics


Ans: Descriptive statistics is a way of making raw numbers easy to understand
by organizing and summarizing them. Imagine you have test scores from a class
of students. Instead of looking at all the individual scores, you might want to
know what the average score was, how close the scores were to each other, or
what the highest and lowest marks were.
This process helps you quickly understand the overall picture of the data without
going through every single number. It’s like turning a messy pile of numbers into
a neat summary that tells you the most important things at a glance.
For example, if a class had scores like 45, 50, 48, 92, and 44, you wouldn’t need
to memorize all of them. You could describe the group by saying something like,
“Most students scored around 50, but one student did really well with a 92.”
That’s what this approach does—it helps you describe data in a clear and simple
way.

Example:
Imagine these are the math test scores of 5 students:
Scores: 45, 50, 48, 92, 44
Without descriptive statistics, you’d have to look at all five numbers separately,
which can get confusing. But with it, we can summarize this data:

1. Average score (mean):


Add all the scores and divide by how many there are:
(45 + 50 + 48 + 92 + 44) ÷ 5 = 55.8
So, the average score is about 56.

2. Lowest and highest scores:


The smallest number is 44 and the biggest is 92.
This gives us an idea of the range.
3. Spread of the scores:
Most scores are around 45–50, but 92 is much higher. This tells us that one
student performed much better than the others.

Summary using descriptive statistics:


“Most students scored around 45 to 50, the average was about 56, and one
student scored much higher than the rest with a 92.”
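
The summary above can be reproduced with Python's built-in statistics module. This is a minimal sketch; the median is not computed in the example above and is added here only to show why "around 45 to 50" describes most students better than the average does.

```python
import statistics

scores = [45, 50, 48, 92, 44]

mean_score = statistics.mean(scores)      # (45 + 50 + 48 + 92 + 44) / 5
lowest, highest = min(scores), max(scores)
median_score = statistics.median(scores)  # middle value of the sorted scores

print("Mean:", mean_score)
print("Lowest:", lowest, "Highest:", highest)
print("Median:", median_score)
```

The single score of 92 pulls the mean up to 55.8, while the median stays at 48, close to where most of the scores sit.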

6.] Statistical inference


Ans: Statistical inference is like making a smart guess about a big group of
things by looking at just a small part of it.
Let’s say you want to know the average height of all the students in your college,
but it’s not possible to measure everyone. So, you measure just 100 students.
Based on those 100 measurements, you make a guess—or inference—about the
average height of all the students.
Even though you didn't measure everyone, you’re using math and logic to make a
reasonable conclusion about the whole group. That’s what statistical inference
does—it helps us learn about a large group using a smaller sample, and it also
tells us how confident we can be in our guess.
It’s used when you can’t collect data from the entire population, but still want to
understand it.
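
The college height example can be imitated with made-up numbers. Everything in this sketch is an assumption for illustration: the population of 5000 heights is simulated, and the "mean ± 1.96 × standard error" range is one common rough way to express confidence, not something defined in these notes.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: heights (in cm) of 5000 students.
population = [rng.gauss(165, 10) for _ in range(5000)]

# Measure only 100 students instead of everyone.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

# Rough 95% range for the population mean (normal approximation, an assumption).
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
margin = 1.96 * standard_error
low, high = sample_mean - margin, sample_mean + margin

population_mean = statistics.mean(population)  # normally unknown in practice

print(f"Sample mean: {sample_mean:.1f} cm")
print(f"Estimated range: {low:.1f} to {high:.1f} cm")
print(f"Actual population mean: {population_mean:.1f} cm")
```

The sample mean is the "smart guess", and the range around it is the statement of how confident we can be in that guess.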

7.] Descriptive Statistics: Tabular and Graphical Summarizing of Categorical Data

Ans: Tabular Summarizing of Categorical Data:
When you collect data that can be divided into categories (like types of fruits,
colors, or favorite movies), you can organize that data into a table. This helps you
quickly see how many items fall into each category.
For example, let’s say you ask 10 people about their favorite fruit. The answers
are:
Apple, Banana, Apple, Orange, Banana, Orange, Apple, Apple, Banana, Orange.
To summarize this data in a table, you would list each fruit and how many times
it was mentioned:
Fruit   Count
Apple   4
Banana  3
Orange  3
This table tells you that Apple is the most popular, mentioned 4 times, while
Banana and Orange were each mentioned 3 times.

Graphical Summarizing of Categorical Data:


You can also use graphs to make the categorical data even easier to understand. A
bar chart is perfect for this.
If we take the fruit example again, here’s how a bar chart would look:
• A bar for Apple would reach 4 on the y-axis (showing how many people
chose Apple).
• A bar for Banana would reach 3.
• A bar for Orange would also reach 3.
The height of the bars shows how popular each fruit is. Bar charts are great
because they let you compare categories quickly.
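
The fruit counts and the bar chart idea can be sketched in a few lines of Python. This is only an illustration: it uses collections.Counter for the counting and prints rows of "#" characters as a simple text stand-in for the bars.

```python
from collections import Counter

answers = ["Apple", "Banana", "Apple", "Orange", "Banana",
           "Orange", "Apple", "Apple", "Banana", "Orange"]

counts = Counter(answers)    # how many times each fruit was mentioned
rows = counts.most_common()  # (fruit, count) pairs, most frequent first

for fruit, count in rows:
    # The count is the table entry; the '#' run plays the role of a bar.
    print(f"{fruit:<8}{count}  {'#' * count}")
```

The longest row of "#" belongs to Apple, which matches the tallest bar in the chart described above.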

Summary:
• Tabular summarizing gives a simple count of how many items fall into each
category (like fruits in our example).
• Graphical summarizing (like bar charts) helps you quickly compare the
categories visually.
8.] Summarizing Quantitative Data
Ans: Summarizing Quantitative Data means taking numbers and finding ways to
describe what they tell you in a simple way. Instead of looking at every
individual number, we try to get a general idea of what the data looks like.
Let’s break it down with an easy example:

Example:
Suppose you have the following test scores of 5 students:
Scores: 55, 60, 75, 85, 90
Now, let’s see how we can summarize this data:

1. Average (Mean):
The average gives us a single number that represents the "typical" score. To
find the average, you add up all the numbers and divide by how many there
are.
So, for our data:
• Add up all the scores: 55 + 60 + 75 + 85 + 90 = 365
• Divide by the number of students: 365 ÷ 5 = 73
So, the average score is 73. This tells us that the typical score is around 73.

2. Range:
The range tells you how spread out the data is by finding the difference
between the highest and lowest values.
• The highest score is 90.
• The lowest score is 55.
So, the range is:
90 - 55 = 35
This means the scores are spread out by 35 points.

3. Median:
The median is the middle value when the data is arranged in order. It’s
useful when you don’t want extreme values (like very high or very low
scores) to distort your idea of the typical score, as they can with the average.
• First, sort the scores: 55, 60, 75, 85, 90
• The middle score is 75 (since it's the third number).
So, the median score is 75.

4. Mode:
The mode is the number that appears the most. If no number repeats, we
don’t have a mode.
In our example, all scores are different, so there is no mode.

Summary of Quantitative Data:


• Mean (Average): Adds up all the numbers and divides by how many there
are (typical value).
• Range: The difference between the highest and lowest values (spread of
data).
• Median: The middle value when the data is ordered (useful when you want
to avoid extremes).
• Mode: The most frequent value (if there is one).
These methods help us understand the "big picture" of what the data is showing
without looking at every individual value.
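
The four summaries above can be checked with a short Python sketch. The mode is counted by hand with collections.Counter here, because in Python 3.8 and later statistics.mode returns the first value when nothing repeats instead of reporting "no mode".

```python
import statistics
from collections import Counter

scores = [55, 60, 75, 85, 90]

mean_score = statistics.mean(scores)      # 365 / 5
score_range = max(scores) - min(scores)   # 90 - 55
median_score = statistics.median(scores)  # third value of the sorted list

# A value counts as a mode only if it appears more than once.
counts = Counter(scores)
top_count = max(counts.values())
modes = [value for value, count in counts.items() if count == top_count]
if top_count == 1:
    modes = []  # every score is different, so there is no mode

print("Mean:", mean_score)
print("Range:", score_range)
print("Median:", median_score)
print("Mode:", modes if modes else "no mode")
```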

9.] Cross Tabulations and Scatter Diagram.


Ans: Cross Tabulations:
A cross tabulation (or cross-tab) is a way to look at the relationship between two
categorical variables by creating a table. It helps you see how categories of one
variable relate to categories of another variable.
Think of it like comparing two things at once. For example, you might want to
compare gender with favorite fruit to see if boys and girls like the same fruit.
Let’s say you have the following data from a group of 6 people:
Person  Gender  Favorite Fruit
1       Male    Apple
2       Female  Banana
3       Male    Apple
4       Female  Orange
5       Male    Orange
6       Female  Apple
You can create a cross-tabulation (or a table) to show how many males and
females like each fruit:
Favorite Fruit  Male  Female
Apple           2     1
Banana          0     1
Orange          1     1
This tells us:
• 2 males like Apple
• 1 female likes Apple
• 1 male and 1 female like Orange
• 1 female likes Banana
It’s a quick way to compare two categories and see how they relate to each other.
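
The cross-tabulation above can be built by counting (gender, fruit) pairs. This is a minimal sketch using only Python's standard library; the variable names are chosen for the illustration.

```python
from collections import Counter

# One (gender, favorite fruit) pair per person, in the order of the data table.
people = [
    ("Male", "Apple"),
    ("Female", "Banana"),
    ("Male", "Apple"),
    ("Female", "Orange"),
    ("Male", "Orange"),
    ("Female", "Apple"),
]

crosstab = Counter(people)  # counts each (gender, fruit) combination

genders = ["Male", "Female"]
fruits = ["Apple", "Banana", "Orange"]

print(f"{'Favorite Fruit':<16}{'Male':<6}{'Female':<6}")
for fruit in fruits:
    # A combination that never occurs (Male, Banana) is counted as 0.
    male, female = (crosstab[(gender, fruit)] for gender in genders)
    print(f"{fruit:<16}{male:<6}{female:<6}")
```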
Scatter Diagram:
A scatter diagram (or scatter plot) is a graph that shows the relationship between
two quantitative variables. You plot points on a graph where the x-axis
(horizontal) represents one variable and the y-axis (vertical) represents another.
For example, let’s say you want to see if there’s a relationship between the
number of hours studied and the test score. You can plot the data like this:
Hours Studied  Test Score
1              50
2              60
3              70
4              80
5              90
Now, you would plot the points on a graph:
• On the x-axis, you plot hours studied (1, 2, 3, 4, 5).
• On the y-axis, you plot test scores (50, 60, 70, 80, 90).
If you plot these points, you will see that they rise steadily from left to right
as the hours studied increase. This tells you there is a positive relationship: in
this data, more hours of study go with higher scores.

10.] Descriptive Statistics: Numerical Measures: Measures of Location, Measures of Variability

Ans: 1. Measures of Location:
Measures of location help you find the central position of your data. These
measures give you a sense of where most of the data is located.
Examples:
• Mean (Average): The sum of all values divided by the number of values.
• Median: The middle value when the data is arranged in order.
• Mode: The value that appears most often in the data.
Example:
Test scores of 5 students: 40, 50, 60, 70, 80
• Mean:
(40 + 50 + 60 + 70 + 80) ÷ 5 = 60
• Median:
The middle value in the ordered list (40, 50, 60, 70, 80) is 60.
• Mode:
All numbers are different, so there is no mode.

2. Measures of Variability:
These measures tell you how spread out or different the data is.
Examples:
• Range: The difference between the maximum and minimum values.
• Variance: The average of the squared differences from the mean (how
spread out the data is).
• Standard Deviation: The square root of the variance. It tells you how much
individual values deviate from the mean.
Example:
For the test scores: 40, 50, 60, 70, 80
• Range:
80 (highest) - 40 (lowest) = 40.
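• Standard Deviation:
To calculate, you first find the differences from the mean (60), square them
(400, 100, 0, 100, 400, which add up to 1000), average the squared differences
to get the variance, and finally take the square root. Dividing by n = 5 gives a
variance of 200 and a standard deviation of about 14.14. Dividing by n − 1 = 4,
which is usual when the scores are treated as a sample, gives a variance of 250
and a standard deviation of about 15.81.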
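
The variability measures for these scores can be checked with Python's statistics module. The sketch computes both versions of the variance and standard deviation: the "population" version divides by n, and the "sample" version divides by n − 1.

```python
import statistics

scores = [40, 50, 60, 70, 80]

score_range = max(scores) - min(scores)             # 80 - 40

# Population version: divide the squared differences by n = 5.
population_variance = statistics.pvariance(scores)  # 1000 / 5
population_sd = statistics.pstdev(scores)           # square root of 200

# Sample version: divide the squared differences by n - 1 = 4.
sample_variance = statistics.variance(scores)       # 1000 / 4
sample_sd = statistics.stdev(scores)                # square root of 250

print("Range:", score_range)
print("Population variance and SD:", population_variance, round(population_sd, 2))
print("Sample variance and SD:", sample_variance, round(sample_sd, 2))
```

Which version to use depends on whether the five scores are the whole group of interest or a sample from a larger group.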
11.] Measures of Distribution Shape
Ans: The shape of data tells you how the data looks when you graph it—
especially as a histogram (a kind of bar graph that shows how often values
appear).
It's like looking at a "mountain" made from your data. Some shapes are smooth
and balanced, others lean to one side, and some have weird bumps.
1. Symmetrical (or Normal) Shape
• The left and right sides of the graph are even.
• The data is centered around the middle value (mean = median).
• Looks like a bell curve or a hill.
2. Skewed Right (Positively Skewed)
• Most values are low, but a few are very high.
• The tail (long end) is on the right.
• Mean > Median
3. Skewed Left (Negatively Skewed)
• Most values are high, but a few are very low.
• The tail is on the left.
• Mean < Median
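
The mean and median comparisons for the three shapes can be checked on small data sets. All three lists below are made up for this sketch; they are not from the notes.

```python
import statistics

# Hypothetical data sets chosen to show each shape.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # a few very high values
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]   # a few very low values

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    mean_value = statistics.mean(data)
    median_value = statistics.median(data)
    print(f"{name:<13} mean = {mean_value:<7} median = {median_value}")
```

In the right-skewed list the value 20 drags the mean above the median, and in the left-skewed list the value 1 drags the mean below it.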
Kurtosis describes how heavy or light the tails (the ends) of a distribution are,
which is often pictured as how tall and sharp the peak of the graph is.
It shows whether the data has more or fewer extreme values (outliers) compared
to a normal distribution.

Think of kurtosis like hills:


Some hills are sharp and pointy, others are flat and wide. The same idea applies
to graphs of data.

Types of Kurtosis:
1. Mesokurtic (Normal Kurtosis):
• This is the standard bell-shaped curve.
• Data has a normal number of outliers.
• Not too sharp, not too flat.
2. Leptokurtic (High Kurtosis):
• The graph has a very sharp peak.
• Heavy tails → more extreme values (outliers).
• Looks like a narrow mountain.
Example: Most students score around 80, but a few score extremely low or high
(like 0 or 100).
3. Platykurtic (Low Kurtosis):
• The graph is flat and wide.
• Light tails → fewer extreme values.
• Looks like a low hill.
Example: Students are evenly spread in scores, no one scores extremely high or
low.

12.] Relative Location and Detecting Outliers, Box Plot, Measures of Association Between Two Variables

Ans: 1. Relative Location:
Relative location helps you compare a data point to the rest of the data using
percentages.
Example:
• Percentiles: Divide data into 100 equal parts.
• Quartiles: Divide data into 4 equal parts.
o Q1 (25th percentile): 25% of data is below this value.
o Q2 (50th percentile/Median): 50% of data is below this value.
o Q3 (75th percentile): 75% of data is below this value.
Example:
Consider these 9 numbers: 1, 3, 5, 7, 9, 11, 13, 15, 17
• Q1 (25th percentile): The 25% point is 5.
• Median (Q2): The middle value is 9.
• Q3 (75th percentile): The 75% point is 13.
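
The quartiles above can be reproduced with statistics.quantiles. One caution: there are several conventions for computing quartiles, and they can give slightly different answers. The "inclusive" method matches the values in this example, while Python's default "exclusive" method does not.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print("Inclusive method:", q1, q2, q3)

# The default ("exclusive") convention gives different Q1 and Q3 for the same data.
alt_q1, alt_q2, alt_q3 = statistics.quantiles(data, n=4)
print("Exclusive method:", alt_q1, alt_q2, alt_q3)
```

When comparing quartiles from different books or tools, it helps to check which convention each one uses.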

2. Detecting Outliers:
Outliers are data points that are far away from most other values. They can affect
statistical results.
• Outlier rule: A common rule is that a data point is an outlier if it lies more
than 1.5 times the interquartile range (IQR) above Q3 or below Q1.
Example:
Consider data: 1, 3, 5, 7, 9, 11, 13, 15, 100
• The IQR is Q3 - Q1 = 13 - 5 = 8.
• Any number above Q3 + 1.5 × IQR = 13 + 1.5 × 8 = 25 or below Q1 - 1.5 ×
IQR = 5 - 1.5 × 8 = -7 is an outlier.
• 100 is much higher than 25, so it’s an outlier.
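
The 1.5 × IQR rule can be written as a short Python check. This sketch assumes the "inclusive" quartile method of statistics.quantiles, which gives Q1 = 5 and Q3 = 13 for this data; other quartile conventions may place the limits slightly differently.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 100]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                  # 13 - 5 = 8

lower_limit = q1 - 1.5 * iqr   # 5 - 12 = -7
upper_limit = q3 + 1.5 * iqr   # 13 + 12 = 25

# Anything outside the two limits is flagged as an outlier.
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print("IQR:", iqr)
print("Limits:", lower_limit, "to", upper_limit)
print("Outliers:", outliers)
```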

3. Box Plot:
A box plot is a graphical way to show the distribution of data, including the
median, quartiles, and any outliers.
• Box: Represents the IQR (between Q1 and Q3).
• Line inside the box: Shows the median (Q2).
• Whiskers: Show the range (from Q1 to the lowest value and from Q3 to the
highest value that are not outliers).
• Outliers: Are marked separately (often as dots or stars).
4. Measures of Association Between Two Variables:
These measures tell you how two variables are related to each other.
Examples:
• Correlation: Shows how strongly two variables are related. It ranges from
-1 (perfect negative relationship) to +1 (perfect positive relationship). A 0
means no linear relationship.
Example:
If hours studied increase and test scores also increase, there’s a positive
correlation.
• Covariance: Measures how two variables change together. It’s similar to
correlation but without the standardization.
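
Both measures can be worked out by hand in Python. The sketch below uses a small made-up data set of hours studied (1 to 5) and test scores (50 to 90), and it assumes the sample version of covariance, which divides by n − 1.

```python
import math

# Hypothetical data: hours studied and the matching test scores.
hours = [1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90]

n = len(hours)
mean_hours = sum(hours) / n
mean_scores = sum(scores) / n

# Sum of the products of the deviations from each mean.
cross = sum((h - mean_hours) * (s - mean_scores) for h, s in zip(hours, scores))

# Sample covariance: depends on the units (hours times points).
covariance = cross / (n - 1)

# Correlation: the same quantity, standardized to lie between -1 and +1.
ss_hours = sum((h - mean_hours) ** 2 for h in hours)
ss_scores = sum((s - mean_scores) ** 2 for s in scores)
correlation = cross / math.sqrt(ss_hours * ss_scores)

print("Covariance:", covariance)
print("Correlation:", correlation)
```

The covariance of 25 only says the two variables move in the same direction, while the correlation of 1 says the points lie exactly on a rising straight line.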

Summary:
• Location: Tells where the data is centered (mean, median, mode).
• Variability: Shows how spread out or varied the data is (range, variance,
standard deviation).
• Shape: Describes the pattern of data (skewness, kurtosis).
• Relative Location: Shows where individual data points lie relative to others
(percentiles, quartiles).
• Outliers: Identifies data points that are unusually far from others.
• Box Plot: A graph that summarizes data distribution and shows outliers.
• Association: Describes the relationship between two variables (correlation,
covariance).
