Unit-2 Solution
1.Will treating categorical variables as continuous variables result in a better predictive model?
Justify your answer.(Apr-2024)
Treating categorical variables as continuous variables is generally not advisable and often leads to
misleading results, for the following reasons:
a.Nature of Categorical Variables
b.Misinterpretation of Relationships
c.Loss of Information
d.Model Performance
e.Statistical Assumptions
2.Issue: Feeding a model variables that are correlated with one another is not good statistical
practice, since it gives multiple weightage to the same type of data.
Solution: Correlation Analysis
Show how such issues are prevented by the correlation analysis technique. Justify with a small instance
dataset.(Apr-2024)(Nov-2023)
To prevent issues caused by multicollinearity, you might:
1. Remove Highly Correlated Variables:
• If two variables are highly correlated, you can remove one. For instance, you might
choose to keep only "Size" and exclude "Bedrooms" if they are providing redundant
information.
2. Combine Correlated Variables:
• Create a new variable that combines the information. For example, a "Size per
Bedroom" variable might provide a new perspective.
3. Principal Component Analysis (PCA):
• PCA can transform correlated variables into a set of linearly uncorrelated
components.
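The correlation check behind remedy 1 can be sketched as below. This is a minimal illustration, not a definitive method: the five-house dataset is hypothetical (invented only to show the idea), and the 0.8 cut-off for "highly correlated" is an assumed rule of thumb.

```python
from itertools import combinations
from math import sqrt

# Hypothetical housing data, invented purely for illustration.
columns = {
    "Size": [1000, 1500, 1800, 2400, 3000],  # square feet
    "Bedrooms": [2, 3, 3, 4, 5],
    "Age": [20, 5, 12, 8, 15],               # years
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


THRESHOLD = 0.8  # assumed cut-off for "highly correlated"

flagged = []
for a, b in combinations(columns, 2):
    r = pearson(columns[a], columns[b])
    print(f"{a} vs {b}: r = {r:.2f}")
    if abs(r) > THRESHOLD:
        flagged.append((a, b))

print("Highly correlated pairs:", flagged)
```

In this made-up data only the Size/Bedrooms pair exceeds the threshold, so one of the two would be dropped (or the two combined) before fitting a model.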
Example Solution
Let’s simplify and fit a linear regression model to predict Price using Size and Age, after deciding to
exclude Bedrooms due to its high correlation with Size:
Create the Regression Model:
3.Explain the types of data
a.Quantitative Data
b.Qualitative Data
c.Text Data
d.Time Series Data
e.Spatial Data
f.Binary Data
g.Structured Data
h.Unstructured Data
4.Define median with example
The median is a measure of central tendency that represents the middle value in a dataset when it is
ordered from smallest to largest. If the dataset has an odd number of observations, the median is the
middle one. If the dataset has an even number of observations, the median is the average of the two
middle values.
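Example: a short sketch of both cases using Python's standard `statistics` module. The two small datasets are made up for illustration.

```python
from statistics import median

odd_data = [7, 3, 9, 1, 5]        # sorted: 1, 3, 5, 7, 9
even_data = [7, 3, 9, 1, 5, 11]   # sorted: 1, 3, 5, 7, 9, 11

odd_median = median(odd_data)     # the single middle value
even_median = median(even_data)   # average of the two middle values (5 and 7)

print(odd_median, even_median)
```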
Quantitative Data
Nature:
• Numeric: Quantitative data is numeric and can be measured.
• Objective: It tends to be more objective and can be statistically analyzed.
• Hypothesis Testing: Often used to test hypotheses and make predictions.
Collection Methods:
• Surveys/Questionnaires
• Experiments
• Existing Data
6.List the differences between a discrete variable and a continuous variable with an
example.(Nov-2022)
Discrete Variables
Definition: Discrete variables are those that can only take on a countable number of distinct values.
They often represent counts or categories.
Characteristics:
1. Countable: You can list or count all possible values.
2. No Intermediate Values: There are no possible values between two adjacent values.
3. Examples: Number of students in a class, number of cars in a parking lot.
Example:
• Number of books
Continuous Variables
Definition: Continuous variables can take on an infinite number of values within a given range.
They are usually measurements and can be divided into smaller and smaller parts.
Characteristics:
1. Uncountable: There are infinitely many possible values within a given range.
2. Intermediate Values: There are possible values between any two adjacent values.
3. Examples: Height, weight, temperature.
Example:
• Height
7.Classify the below list of data into their types: a) ethnic group b) age c) family size
d) academic major e) sexual preference f) IQ score g) net worth (dollars) h) third-place finish
i) gender j) temperature
And write a brief note on them.(Nov-2022)
a) Ethnic group - Categorical (nominal)
b) Age - Quantitative (continuous)
c) Family size - Quantitative (discrete)
d) Academic major - Categorical (nominal)
e) Sexual preference - Categorical (nominal)
f) IQ score - Quantitative (continuous)
g) Net worth (dollars) - Quantitative (continuous)
h) Third-place finish - Categorical (ordinal)
i) Gender - Categorical (nominal)
j) Temperature - Quantitative (continuous)
8.What is a percentile rank?Give an example.
A percentile rank is a statistical measure used to understand and interpret a data point's position
within a dataset relative to other data points. Specifically, it tells you the percentage of scores in a
dataset that fall below a particular score.
E.g., if 1000 students took the test and your score is at the 75th percentile, then 750 students
scored below you, and 250 students scored above you.
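The definition above can be sketched in a few lines. The helper name `percentile_rank` and the eight scores are hypothetical, chosen only to illustrate the "percentage of scores below" idea.

```python
def percentile_rank(scores, score):
    """Percentage of values in `scores` that fall strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)


scores = [55, 60, 65, 70, 75, 80, 85, 90]  # made-up test scores
rank = percentile_rank(scores, 85)          # 6 of the 8 scores are below 85
print(rank)
```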
Part-B
1.(a)i)Indicate whether each of the following distributions is positively or negatively skewed. The
distribution of
(1)Incomes of taxpayers has a mean of $48,000 and a median of $43,000.
To determine the skewness of the distribution, compare the mean and median:
• Positively skewed (right-skewed): The mean is greater than the median.
• Negatively skewed (left-skewed): The mean is less than the median.
Given:
• Mean income = $48,000
• Median income = $43,000
Since the mean ($48,000) is greater than the median ($43,000), the distribution of incomes is
positively skewed (right-skewed). This means that there are some high-income outliers pulling the
mean to the right, creating a longer tail on the higher end of the distribution.
(2)GPAs for all students at some college have a mean of 3.01 and a median of 3.20
• Mean GPA = 3.01
• Median GPA = 3.20
Since the mean (3.01) is less than the median (3.20), the distribution of GPAs is negatively skewed
(left-skewed). This indicates that there are some lower GPAs pulling the mean down, creating a
longer tail on the lower end of the distribution.
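The mean-versus-median rule used in both parts can be written as a small helper. The function name `skew_direction` is hypothetical; the rule itself is the one stated above.

```python
def skew_direction(mean, median):
    """Classify skewness by comparing the mean with the median."""
    if mean > median:
        return "positively skewed"
    if mean < median:
        return "negatively skewed"
    return "approximately symmetrical"


income_shape = skew_direction(48_000, 43_000)  # taxpayer incomes
gpa_shape = skew_direction(3.01, 3.20)         # college GPAs
print(income_shape, "|", gpa_shape)
```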
ii)During their swim through a water maze, 15 laboratory rats made the following number of errors
(blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3.
(1)Find the mode, median, and mean for these data.
To find the mode, median, and mean for the given data set, follow these steps:
Data Set
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3
1. Mode
The mode is the number that appears most frequently in the data set.
• Frequency of each number:
• 2: 2 times
• 3: 2 times
• 4: 1 time
• 5: 3 times
• 6: 1 time
• 7: 1 time
• 8: 1 time
• 10: 1 time
• 12: 1 time
• 17: 1 time
• 28: 1 time
The number 5 appears most frequently (3 times), so the mode is 5.
2. Median
The median is the middle value when the data is ordered from smallest to largest. If there is an even
number of observations, the median is the average of the two middle numbers.
• First, sort the data: 2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
• There are 15 data points (an odd number), so the median is the 8th value in this sorted list.
The median is 5.
3. Mean
To find the mean, calculate the sum of all the values and then divide by the number of data points.
Sum of the data: 2 + 17 + 5 + 3 + 28 + 7 + 5 + 8 + 5 + 6 + 2 + 12 + 10 + 4 + 3 = 117
Number of data points = 15
Mean = Total sum / Number of data points = 117 / 15 = 7.8
Summary
• Mode: 5
• Median: 5
• Mean: 7.8
(2)Without constructing a frequency distribution or graph, would it be possible to characterize the
shape of this distribution as balanced, positively skewed, or negatively skewed?
To characterize the shape of the distribution without constructing a frequency distribution or graph,
you can use the relationship between the mean, median, and mode to determine the skewness.
Here’s how:
1. Calculate the Mode, Median, and Mean:
• Mode: 5
• Median: 5
• Mean: 7.8
2. Determine Skewness:
• Positively Skewed (Right-Skewed): Mean > Median
• Negatively Skewed (Left-Skewed): Mean < Median
• Balanced (Symmetrical): Mean ≈ Median
In this case:
• Mean = 7.8
• Median = 5
The mean (7.8) is greater than the median (5).
This indicates that the distribution is positively skewed (right-skewed). This is because the
mean is pulled in the direction of the higher values, suggesting that there are some high
outliers (such as the number 28) that are stretching the distribution to the right.
Summary
Based on the mean and median comparison, the distribution of errors is positively skewed (right-
skewed).
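As a cross-check, the three measures for the rat data can be recomputed with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

rat_mode = mode(errors)      # most frequent value
rat_median = median(errors)  # 8th of the 15 sorted values
rat_mean = mean(errors)      # sum of the values divided by 15

print(rat_mode, rat_median, rat_mean)
print("positively skewed" if rat_mean > rat_median else "not positively skewed")
```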
(B)i)Assume that SAT math scores approximate a normal curve with a mean of 500 and standard
deviation of 100.
Sketch a normal curve and shade in the target area(s) described by each of the following statements:
* More than 570
* Less than 515
* Between 520 and 540
* Convert to z scores and find the target areas specific to the above values.
To address the problem of shading areas under the normal curve based on SAT math scores, we’ll
start by sketching the normal distribution and then use z-scores to find the specific areas.
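The z-scores and target areas can be sketched with `statistics.NormalDist` from the Python standard library, used here in place of a printed normal table (so the last decimal may differ slightly from table-based answers). The helper `z_score` is a hypothetical name for the usual conversion z = (X - mean) / standard deviation.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)


def z_score(x, dist=sat):
    """Convert a raw score to a z-score: (x - mean) / standard deviation."""
    return (x - dist.mean) / dist.stdev


above_570 = 1 - sat.cdf(570)                    # area to the right of z = 0.70
below_515 = sat.cdf(515)                        # area to the left of z = 0.15
between_520_540 = sat.cdf(540) - sat.cdf(520)   # area between z = 0.20 and z = 0.40

for x in (570, 515, 520, 540):
    print(f"X = {x}: z = {z_score(x):.2f}")
print(f"{above_570:.4f} {below_515:.4f} {between_520_540:.4f}")
```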
ii)Assume that the burning times of electric light bulbs approximate a normal curve with a mean of
1200 hours and standard deviation of 120 hours. If a large number of new lights are installed at the
same time (possibly along a newly opened freeway), at what time will
1 percent fail?
50 percent fail?
95 percent fail?
To determine the time at which a certain percentage of light bulbs will fail, given that the burning
times follow a normal distribution with a mean (μ) of 1200 hours and a standard deviation (σ)
of 120 hours, we use the properties of the normal distribution. Here’s how we find the
times at which 1 percent, 50 percent, and 95 percent of the light bulbs will fail:
1. 1 Percent Failure Time: This is the time below which 1% of the bulbs will fail. In terms of
the normal distribution, this corresponds to the 1st percentile.
2. 50 Percent Failure Time: This is the median of the distribution, which for a normal
distribution is the mean. Thus, 50% of the bulbs will fail by this time.
3. 95 Percent Failure Time: This is the time below which 95% of the bulbs will fail. In terms
of the normal distribution, this corresponds to the 95th percentile.
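We’ll use the z-scores associated with these percentiles, together with X = μ + zσ, to find the actual times.
For the 1st percentile, z ≈ -2.33, so X = 1200 + (-2.33)(120) = 920.4.
So, approximately 1 percent of the light bulbs will fail by 920.4 hours.
For the 50th percentile, z = 0, so X = 1200 + (0)(120) = 1200.
For the 95th percentile, z ≈ 1.645, so X = 1200 + (1.645)(120) = 1397.4.
So, approximately 95 percent of the light bulbs will fail by 1397.4 hours.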
Summary
• 1 percent of bulbs fail by approximately 920.4 hours.
• 50 percent of bulbs fail by 1200 hours.
• 95 percent of bulbs fail by approximately 1397.4 hours.
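The same percentiles can be checked with `statistics.NormalDist.inv_cdf` from the Python standard library. This is a sketch rather than the table method: it uses unrounded z-scores (about -2.326 for the 1st percentile instead of the table value -2.33), so the 1 percent time comes out near 920.8 hours rather than 920.4.

```python
from statistics import NormalDist

bulbs = NormalDist(mu=1200, sigma=120)

# Time (in hours) by which each proportion of bulbs has failed.
failure_times = {p: bulbs.inv_cdf(p) for p in (0.01, 0.50, 0.95)}

for p, hours in failure_times.items():
    print(f"{p:.0%} fail by about {hours:.1f} hours")
```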
3.a)i)Explain normal curve and z-score.
ii)Using the standard normal curve table, find the proportion of the total area identified with the
following statements.
1)above a z score of 1.80
2)between the mean and a z score of 1.65
3)between z scores of 0 and -1.96
• A z-score of 0 corresponds to the mean, positive z-scores correspond to values above the
mean, and negative z-scores correspond to values below the mean.
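The three areas asked for in part ii) can be sketched with `statistics.NormalDist`, used here as a stand-in for the printed standard normal table. The variable names are illustrative only.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal curve: mean 0, standard deviation 1

above_1_80 = 1 - std.cdf(1.80)            # 1) area above z = 1.80
mean_to_1_65 = std.cdf(1.65) - 0.5        # 2) area between the mean and z = 1.65
zero_to_neg_1_96 = 0.5 - std.cdf(-1.96)   # 3) area between z = 0 and z = -1.96

print(f"{above_1_80:.4f} {mean_to_1_65:.4f} {zero_to_neg_1_96:.4f}")
```

By symmetry, the area between 0 and -1.96 equals the area between 0 and +1.96, which is why a table that lists only positive z-scores is sufficient.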
Types of Variables
In statistics, variables are typically categorized into different types based on their nature and the
kind of data they represent. The most common types are:
1. Qualitative (Categorical) Variables:
• Nominal: These variables represent categories without a specific order. Examples
include gender, color, or type of fruit.
• Ordinal: These variables represent categories with a meaningful order or ranking,
but the distances between categories are not necessarily equal. Examples include
education level (high school, bachelor's, master's, etc.) or satisfaction ratings (poor,
fair, good, excellent).
2. Quantitative (Numerical) Variables:
• Discrete: These variables represent countable quantities and often involve integers.
Examples include the number of children in a family or the number of cars in a
parking lot.
• Continuous: These variables represent measurable quantities and can take on an
infinite number of values within a range. Examples include height, weight, and age.
ii)Suppose a hospital tested the age and body fat data for randomly selected adults with the
following result:
Age    23    27    39    49    50    52    54    56    57    58    60
%Fat   9.5   17.8  31.4  27.2  31.2  34.6  42.5  33.4  30.2  34.1  41
Draw the boxplot for age.
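To draw the boxplot for the age data, follow these steps:
1. Sort the ages: 23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60.
2. Find the median: with 11 values, it is the 6th value, 52.
3. Find the quartiles: Q1 is the median of the five values below the median (39), and Q3 is the
median of the five values above it (57).
4. Note the minimum (23) and the maximum (60). Neither lies more than 1.5 × IQR beyond the
quartiles, so there are no outliers.
In this boxplot: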
• The box starts at 39 and ends at 57.
• The median (52) is marked inside the box.
• The whiskers extend from 23 to 39 on the lower side and from 57 to 60 on the upper side.
This boxplot provides a visual summary of the distribution of ages in the dataset.
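The five-number summary behind this boxplot can be reproduced as follows. This sketch assumes the "median of each half" convention for quartiles (other quartile conventions can give slightly different values), and the 1.5 × IQR rule for outliers.

```python
from statistics import median

ages = sorted([23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60])

n = len(ages)
half = n // 2
lower = ages[:half]                                # values below the median
upper = ages[half + 1:] if n % 2 else ages[half:]  # values above the median

q1, q2, q3 = median(lower), median(ages), median(upper)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box would be plotted as outliers.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

five_number = (min(ages), q1, q2, q3, max(ages))
print(five_number, outliers)
```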
4.a)i)What is a frequency distribution? Customers who have purchased a particular product
rated the usability of the product on a 10-point scale, ranging from 1 (poor) to 10
(excellent) as follows:
3 7 2 7 8
3 1 4 10 3
2 5 3 5 8
9 7 6 3 7
8 9 7 3 6
Construct a frequency distribution for the above data.
A frequency distribution is a way to organize and summarize a set of data by showing how
often each value or range of values occurs. It provides a clear picture of the data's
distribution, making it easier to analyze and interpret.
To construct a frequency distribution for the usability ratings provided, we need to tally how many
times each rating appears in the dataset. Here’s the step-by-step process:
1. List the Ratings: The ratings are:
3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 8, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6
2. Create a Table to Count Frequencies: We’ll count how many times each rating from 1 to
10 appears in the list.
Rating    Frequency
10        1
9         2
8         3
7         5
6         2
5         2
4         1
3         6
2         2
1         1
Total     25
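The tally can also be done programmatically. The sketch below uses `collections.Counter` from the Python standard library on the 25 ratings listed in the question.

```python
from collections import Counter

ratings = [3, 7, 2, 7, 8,
           3, 1, 4, 10, 3,
           2, 5, 3, 5, 8,
           9, 7, 6, 3, 7,
           8, 9, 7, 3, 6]

freq = Counter(ratings)

# Print the distribution from the highest rating to the lowest.
print("Rating  Frequency")
for rating in range(10, 0, -1):
    print(f"{rating:<7} {freq[rating]}")
print(f"Total   {sum(freq.values())}")
```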
To convert a frequency distribution into a relative frequency distribution, follow these steps:
1. Calculate the Total Number of Observations: This is the sum of all frequencies. In your
case, it's given as 200.
2. Calculate Relative Frequency for Each Interval: Divide the frequency of each interval by
the total number of observations.
3. Present the Results: Create a table where each frequency is replaced by its corresponding
relative frequency.
Here’s how you can convert the given frequency distribution into a relative frequency distribution:
For example, an interval with a frequency of 3 has a relative frequency of 3/200 = 0.015.
Relative Frequency Distribution Table
GRE Score Range    Frequency    Relative Frequency
725-749            1            0.005
700-724            3            0.015
675-699            14           0.07
650-674            30           0.15
625-649            34           0.17
600-624            42           0.21
575-599            30           0.15
550-574            27           0.135
525-549            13           0.065
500-524            4            0.02
475-499            3            0.015
By converting frequencies to relative frequencies, you get a better sense of how each category
compares proportionally to the whole dataset.
12.a)Demonstrate the different types of variables used in data analysis with an example for
each.
In data analysis, different types of variables are used to categorize and interpret data. Here’s an
overview of common variable types with examples for each:
1. Nominal Variables
Nominal variables are categorical variables with no inherent order or ranking. They simply
represent different categories.
Example:
• Variable: Favorite Color
• Categories: Red, Blue, Green, Yellow, etc.
• Usage: You might use this variable to analyze the most popular color among a group of
people.
2. Ordinal Variables
Ordinal variables are categorical variables with a meaningful order but no consistent difference
between categories.
Example:
• Variable: Customer Satisfaction Rating
• Categories: Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied
• Usage: You might use this variable to gauge overall customer satisfaction and identify
trends over time.
3. Interval Variables
Interval variables are numeric variables where the intervals between values are consistent, but there
is no true zero point.
Example:
• Variable: Temperature in Celsius
• Values: -5°C, 0°C, 25°C, 40°C, etc.
• Usage: You might use this variable to analyze temperature patterns and their effects on
various outcomes.
4. Ratio Variables
Ratio variables are numeric variables with a true zero point, which allows for meaningful
comparisons of ratios.
Example:
• Variable: Height
• Values: 150 cm, 170 cm, 180 cm, etc.
• Usage: You might use this variable to study the correlation between height and other factors,
such as weight.
5. Binary Variables
Binary variables are a special type of nominal variable with only two possible values.
Example:
• Variable: Has a Pet
• Values: Yes, No
• Usage: You might use this variable to analyze pet ownership trends or its effects on other
variables, like happiness.
6. Continuous Variables
Continuous variables can take on an infinite number of values within a given range and can be
measured with fine precision.
Example:
• Variable: Annual Income
• Values: $30,000, $45,678, $100,000, etc.
• Usage: You might use this variable to analyze income distribution and its impact on
spending behavior.
7. Discrete Variables
Discrete variables are numeric variables that can only take on specific, distinct values, often
integers.
Example:
• Variable: Number of Children
• Values: 0, 1, 2, 3, etc.
• Usage: You might use this variable to study family size and its effects on household
spending.
These different types of variables are fundamental in designing data analysis methods and
interpreting results, allowing analysts to make meaningful conclusions from data.
b)The number of friends reported by Facebook users is summarized in the following
frequency distribution.
FRIENDS f
400-above 2
350-399 5
300-349 12
250-299 17
200-249 23
150-199 49
100-149 27
50-99 29
0-49 36
Total 200