PRW Questions
VII Semester
Answer key
Part - A
1. What are the three common methods for performing bivariate analysis?
Bivariate analysis is when we look at two variables together to see how they are related.
The three common methods for performing bivariate analysis are:
1. Scatter Plots
2. Correlation Analysis
3. Regression Analysis
Other bivariate techniques include the Chi-Square Test and T-tests/ANOVA.
2. Name the two types of statistical testing in bivariate analysis.
In bivariate analysis, the two main types of statistical testing are:
Correlation Analysis: This tests the strength and direction of the relationship between two continuous
variables. The most common test is the Pearson correlation coefficient (for linear relationships).
Regression Analysis: This tests how one variable (dependent) changes in response to another variable
(independent). The most common type is linear regression.
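As a quick illustration, both tests are available in Python through scipy.stats; a minimal sketch (the small dataset is made up for demonstration):
from scipy import stats

hours = [1, 2, 3, 4, 5]        # independent variable (hypothetical)
scores = [52, 57, 61, 68, 74]  # dependent variable (hypothetical)

# Correlation analysis: Pearson correlation coefficient and p-value
r, p = stats.pearsonr(hours, scores)
print("Pearson r:", r, "p-value:", p)

# Regression analysis: simple linear regression
result = stats.linregress(hours, scores)
print("slope:", result.slope, "intercept:", result.intercept)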
3. What is the purpose of smoothing time series data?
Smoothing time series data serves several important purposes:
Noise Reduction: It helps to minimize random fluctuations or noise, making underlying trends and
patterns clearer.
Trend Identification: Smoothing facilitates the identification of long-term trends by filtering out short-
term variations.
Forecasting: A smoother dataset can improve the accuracy of forecasts by providing a clearer signal of
future behavior.
Seasonal Adjustment: Smoothing can assist in identifying and adjusting for seasonal patterns, helping to
isolate other factors affecting the data.
Visualization: It enhances the visual representation of the data, making it easier to analyze and interpret.
4. Outline the difference between univariate and bivariate data.
Univariate Data:
Definition: Involves a single variable or attribute.
Analysis Focus: Examines the distribution, central tendency, and dispersion of that one variable.
Examples:
o Heights of students in a class.
o Daily temperatures recorded in a month.
Visualization: Commonly represented using histograms, box plots, or bar charts.
Bivariate Data:
Definition: Involves two variables or attributes.
Analysis Focus: Explores the relationship or correlation between the two variables.
Examples:
o Height and weight of individuals.
o Time spent studying and exam scores.
Visualization: Often represented using scatter plots, line graphs, or contingency tables.
5. Is bivariate qualitative or quantitative?
Bivariate data can be either qualitative or quantitative, depending on the nature of the two variables
involved:
Bivariate Qualitative Data:
o Both variables are categorical.
o Example: Relationship between gender (male/female) and preference for a type of
movie (action/comedy/drama).
Bivariate Quantitative Data:
o Both variables are numerical.
o Example: Relationship between hours studied and exam scores.
Mixed Bivariate Data:
o One variable is qualitative and the other is quantitative.
o Example: Relationship between educational level (high school, college, etc.) and income.
So, bivariate data can encompass various combinations of qualitative and quantitative variables.
6. What are Numerical Summaries of Level and Spread?
Numerical summaries of level include measures like mean and median, while those of spread
involve measures like range, variance, and standard deviation.
7. Define Scaling and Standardizing?
Scaling adjusts the numerical range of a variable, while standardizing transforms it to have a
mean of 0 and a standard deviation of 1.
8. Define Skewness in a Distribution?
Skewness measures the asymmetry of a distribution; positive skewness indicates a tail to the
right, and negative skewness indicates a tail to the left.
9. What is the Purpose of Percentage Tables in Data Analysis?
Percentage tables express values as a percentage of the total, providing a clearer understanding of the
relative contribution of each category.
10. Explain the Significance of Handling Several Batches in Experimental Design?
Handling several batches is important in experimental design to account for variations introduced by different
conditions, ensuring robust and generalizable results.
Part – B
1. Mean
Definition:
The mean is the average of a dataset: the sum of all the values divided by the number of data points, i.e. mean (μ) = (Σ xi) / n.
Calculation Methods:
o Sum: Add up all the values in the dataset.
o Divide by the number of data points: Take the total sum and divide it by the count of values in the dataset.
Example:
For a dataset [1, 2, 3, 4, 5]:
o Sum = 1 + 2 + 3 + 4 + 5 = 15
o Number of data points (n) = 5
o Mean = 15 / 5 = 3
Implementation of Mean in Python:
Here's how to calculate the mean in Python using the NumPy library:
import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print("Mean:", mean)
Output:
Mean: 3.0
2. Median
Definition:
The median is another measure of central tendency that represents the middle value of a dataset when it
is ordered in ascending or descending order. If the dataset has an odd number of observations, the
median is the middle number. If the dataset has an even number of observations, the median is the
average of the two middle numbers.
Calculation Methods:
To calculate the median:
o Sort the data in ascending order.
o Determine the middle value:
o If the number of observations (n) is odd, the median is the middle value in the sorted data.
o If the number of observations is even, the median is the average of the two middle values.
Example:
For a dataset [45, 67, 23, 89, 90]:
o Sorted data: [23, 45, 67, 89, 90]
o Number of data points (n) = 5 (odd)
o Median = 67 (the middle value)
For a dataset [1, 2, 3, 4, 5, 6]:
o Sorted data: [1, 2, 3, 4, 5, 6]
o Number of data points (n) = 6 (even)
o Median = (3 + 4) / 2 = 3.5
Implementation of Median in Python:
Here's how to calculate the median in Python using the NumPy library:
import numpy as np

data = [1, 2, 3, 4, 5]
median = np.median(data)
print("Median:", median)
Output:
Median: 3.0
3. Mode
Definition:
The mode is a measure of central tendency that represents the most frequently occurring value in
a dataset. Unlike the mean and median, which locate the centre of the data, the mode focuses on the
frequency of values. A dataset can have one mode (unimodal), two modes (bimodal), or
more (multimodal). In some cases, especially with continuous data, there may be no mode at all if no
value repeats.
Calculation Methods:
To calculate the mode:
o Tally the frequencies: Count the number of occurrences of each value in the dataset.
o Identify the highest frequency: The value(s) with the highest count is the mode.
Example:
For a dataset [1, 2, 2, 3, 4]:
o Tally: 1 occurs once, 2 occurs twice, 3 occurs once, 4 occurs once.
o Highest frequency: 2 (it occurs twice).
o Mode = 2
Implementation of Mode in Python:
import pandas as pd

data = [1, 2, 2, 3, 4]
# .mode() returns a Series because a dataset can have more than one mode
mode = pd.Series(data).mode()
print("Mode:", mode[0])
Output:
Mode: 2
4. Standard Deviation
Definition:
Standard deviation is a measure of the amount of variation or dispersion in a set of values. It quantifies
how much the individual data points in a dataset differ from the dataset's mean. Mathematically, the
standard deviation is the square root of the variance. For a dataset with n observations, the formula for
the standard deviation (σ) is as follows:
σ = √( (1/n) Σ (xi − μ)² )
Where:
o xi represents each data point, and
o μ is the mean of the dataset.
For a sample from a population, the formula uses n − 1 in the denominator instead of n to provide an unbiased estimate.
Calculation Methods:
The standard deviation is calculated through the following steps:
o Calculate the mean (μ): Sum all data points and divide by the number of points.
o Calculate each point's deviation from the mean: Subtract the mean from each data point.
o Square each deviation: This eliminates negative values and emphasises larger deviations.
o Sum all squared deviations: Add up all the squared deviations.
o Divide by the number of data points (or n - 1 for a sample): This gives the variance.
o Take the square root of the variance: This is the standard deviation.
Implementation of Standard Deviation in Python:
import pandas as pd

data = [1, 2, 3, 4, 5]
df = pd.Series(data)
# pandas .std() uses the sample formula (n - 1 denominator) by default
std_dev = df.std()
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 1.5811388300841898
5. Variance
Definition:
Variance is a statistical measure that quantifies the dispersion of data points in a dataset relative to the
mean. It indicates how much the values in the dataset differ from the average value. Mathematically,
variance is the average of the squared differences from the mean. For a dataset with n observations, the
variance (σ²) is calculated as:
σ² = (1/n) Σ (xi − μ)²
Where:
o xi represents each data point, and
o μ is the mean of the dataset.
o For a sample from a population, the formula uses n − 1 in the denominator instead of n to provide an unbiased estimate.
Calculation Methods:
To calculate the variance, follow these steps:
o Calculate the mean: Sum all data points and divide by the number of points.
o Calculate each point's deviation from the mean: Subtract the mean from each data point.
o Square each deviation: This eliminates negative values and emphasises larger deviations.
o Sum all squared deviations: Add up all the squared deviations.
o Divide by the number of data points: This gives the variance.
Implementation of Variance in Python:
import pandas as pd

data = [1, 2, 3, 4, 5]
df = pd.Series(data)
# pandas .var() uses the sample formula (n - 1 denominator) by default
variance = df.var()
print("Variance:", variance)
Output:
Variance: 2.5
6. Range
Definition:
The range is a measure of statistical dispersion that represents the difference between the maximum and
minimum values in a dataset. It provides a simple way to understand the spread or variability of the data:
Range = Maximum value − Minimum value
9. Skewness
Definition:
Skewness measures the asymmetry of a distribution about its mean. For a dataset with n observations, it is calculated as:
Skewness = (1/n) Σ ((xi − x̄) / s)³
Where:
o n is the number of observations,
o xi is each individual observation,
o x̄ is the mean, and
o s is the standard deviation.
Calculation Methods:
To calculate skewness, follow these steps:
o Calculate the mean.
o Calculate the standard deviation.
o Calculate the skewness using the skewness formula.
Implementation of Skewness in Python:
Here's how to calculate skewness in Python using the scipy.stats library:
from scipy.stats import skew

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]

# Calculate skewness
skewness_value = skew(data)
print("Skewness:", skewness_value)
Output:
Skewness: 0.531
10. Kurtosis
Definition:
Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to its
overall shape. Specifically, it quantifies whether the data are heavy-tailed or light-tailed compared to
a normal distribution. There are three types of kurtosis:
1. Mesokurtic: Distributions with kurtosis similar to a normal distribution. Kurtosis value
is approximately zero.
2. Leptokurtic: Distributions with heavier tails and a sharper peak than a normal distribution.
Kurtosis value is greater than zero.
3. Platykurtic: Distributions with lighter tails and a flatter peak than a normal distribution.
Kurtosis value is less than zero.
Mathematically, kurtosis (as excess kurtosis, relative to the normal distribution) is calculated using the formula:
Kurtosis = (1/n) Σ ((xi − x̄) / s)⁴ − 3
Where:
o n is the number of observations,
o xi is each individual observation,
o x̄ is the mean, and
o s is the standard deviation.
Calculation Methods:
To calculate kurtosis, follow these steps:
o Calculate the mean.
o Calculate the standard deviation.
o Calculate each observation's deviation from the mean and raise it to the fourth power.
o Sum these values and apply the kurtosis formula.
Implementation of Kurtosis in Python:
Here's how to calculate kurtosis in Python using the scipy.stats library:
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]

# Calculate kurtosis (scipy reports excess kurtosis by default)
kurtosis_value = kurtosis(data)
print("Kurtosis:", kurtosis_value)
Output:
Kurtosis: -1.2685714285714287
2. How do you analyse a percentage and contingency table? Give examples
PERCENTAGE TABLES IN BIVARIATE ANALYSIS
In bivariate analysis, percentage tables (or contingency tables) are used to display the relationship between
two categorical variables. These tables provide a way to examine how the distribution of one variable
differs across the levels of another. Here's a breakdown of how they work and how to construct them:
Steps to Create Percentage Tables in Bivariate Analysis:
1. Identify Two Variables:
o Choose two categorical variables to analyze.
o For example, let's take Gender (Male/Female) and Smoking Status (Smoker/Non-Smoker).
2. Create a Contingency Table:
o A contingency table shows the frequency distribution of variables.
o For the example, the table could look like this:
          Smoker   Non-Smoker   Total
Male        40         60        100
Female      30         70        100
Total       70        130        200
3. Convert to Percentage Table: You can convert this table into a percentage table in several ways,
depending on the type of percentage you want:
o Row percentages: Calculate the percentage within each row, showing how each row's total
is divided across categories.
o Column percentages: Calculate the percentage within each column, showing how each
column's total is divided across categories.
o Overall percentages: Show what percentage of the grand total each cell represents.
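Such a percentage table can be produced in Python with pandas. A minimal sketch, assuming the raw data is available with one row per person (the column names are illustrative):
import pandas as pd

# Illustrative raw data matching the example table: one row per person
df = pd.DataFrame({
    "Gender": ["Male"] * 100 + ["Female"] * 100,
    "Smoking": ["Smoker"] * 40 + ["Non-Smoker"] * 60
             + ["Smoker"] * 30 + ["Non-Smoker"] * 70,
})

# Contingency table of raw counts, with totals added as margins
counts = pd.crosstab(df["Gender"], df["Smoking"], margins=True)
print(counts)

# Row percentages: each row sums to 100%
row_pct = pd.crosstab(df["Gender"], df["Smoking"], normalize="index") * 100
print(row_pct)
Using normalize="columns" gives column percentages, and normalize="all" gives overall percentages of the grand total.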
3. How, when and why should you normalize / standardize / rescale your data?
Scaling and Standardizing
Scaling
Scaling refers to the process of transforming or normalizing data to ensure that it falls within a specific range
or has consistent units. This is particularly important when working with machine learning models, as many
algorithms are sensitive to the range and magnitude of the input data. Scaling ensures that no single feature
dominates due to differences in scale.
Here are common types of scaling in data science:
1. Min-Max Scaling (Normalization):
Formula: x_scaled = (x − x_min) / (x_max − x_min)
Explanation: This technique scales data to a specific range, typically [0, 1]. It ensures that all features
have the same minimum and maximum values.
2. Standardization (Z-Score Normalization):
Formula: z = (x − μ) / σ
Explanation: This technique transforms the data to have a mean of 0 and a standard deviation of 1, where μ is the mean and σ is the standard deviation of the feature.
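Both transformations can be applied in Python; a minimal sketch using scikit-learn (one possible implementation, not prescribed by the formulas above):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature, five samples

# Min-max scaling to [0, 1] -> [0, 0.25, 0.5, 0.75, 1]
print(MinMaxScaler().fit_transform(data).ravel())

# Standardization: mean 0, standard deviation 1
print(StandardScaler().fit_transform(data).ravel())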
In machine learning, everything is measured in terms of numbers. When we want to identify nearest
neighbours or the similarity or dissimilarity of features, we calculate the distance between features, and based
on those distances we decide whether two features are similar or not. If we consider a feature with respect to the
target variable, similarity means how much the feature impacts the target variable.
Let's understand this distance idea with an example. There are many methods to calculate distance; here we
will use the Euclidean distance, as sketched below.
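A minimal sketch of this idea with NumPy (the two feature vectors are hypothetical):
import numpy as np

# Two observations described by (age in years, income in rupees)
a = np.array([25.0, 50000.0])
b = np.array([30.0, 48000.0])

# Euclidean distance: square root of the sum of squared differences
distance = np.linalg.norm(a - b)
print("Euclidean distance:", distance)
Because income is on a much larger scale than age, it dominates the distance; this is exactly why features are scaled or standardized before distance-based methods such as k-nearest neighbours.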
4. What is meant by time series data? Describe its four components.
Time Series
The arrangement of data in accordance with their time of occurrence is a time series. It is the chronological
arrangement of data. Here, time is just a way in which one can relate the entire phenomenon to suitable
reference points. Time can be hours, days, months or years.
A time series depicts the relationship between two variables: time is one of them, and the second is any
quantitative variable. The relationship need not show a steady increase of the variable with time, nor a
steady decrease; it may be increasing at some points in time and decreasing at others. The temperature of
a particular city in a particular week or month is one such example.
Uses of Time Series
The most important use of studying time series is that it helps us to predict the future behaviour of
the variable based on past experience
It is helpful for business planning as it helps in comparing the actual current performance with the
expected one
From time series, we get to study the past behaviour of the phenomenon or the variable under
consideration
We can compare the changes in the values of different variables at different times or places, etc.
Components for Time Series Analysis
The various reasons or the forces which affect the values of an observation in a time series are the
components of a time series. The four categories of the components of time series are
Trend
Seasonal Variations
Cyclic Variations
Random or Irregular movements
Seasonal and Cyclic Variations are the periodic changes or short-term fluctuations.
Trend
The trend shows the general tendency of the data to increase or decrease during a long period of time. A
trend is a smooth, general, long-term, average tendency. It is not always necessary that the increase or
decrease is in the same direction throughout the given period of time.
It is observable that the tendencies may increase, decrease or remain stable in different sections of time, but the
overall trend must be upward, downward or stable. Population, agricultural production, items
manufactured, the number of births and deaths, and the number of industries, factories, schools or colleges
are examples showing such tendencies of movement.
Linear and Non-Linear Trend
If we plot the time series values on a graph against time t, the pattern of the data clustering
shows the type of trend. If the set of data clusters more or less around a straight line, the trend is linear;
otherwise it is non-linear (curvilinear).
Periodic Fluctuations
There are some components in a time series which tend to repeat themselves over a certain period of time.
They act in a regular, periodic manner.
Seasonal Variations
These are the rhythmic forces which operate in a regular and periodic manner over a span of less than a year.
They have the same or almost the same pattern during a period of 12 months. This variation will be present
in a time series if the data are recorded hourly, daily, weekly, quarterly, or monthly.
These variations come into play either because of natural forces or man-made conventions. The various
seasons or climatic conditions play an important role in seasonal variations: the production of crops
depends on the seasons, the sale of umbrellas and raincoats rises in the rainy season, and the sale of electric
fans and air conditioners shoots up in summer.
The effect of man-made conventions such as festivals, customs, habits, fashions, and occasions
like marriages is easily noticeable. They recur year after year. An upswing in a season should not
be taken as an indicator of better business conditions.
Cyclic Variations
The variations in a time series which operate themselves over a span of more than one year are the cyclic
variations. This oscillatory movement has a period of oscillation of more than a year. One complete period is
a cycle. This cyclic movement is sometimes called the ‘Business Cycle’.
It is a four-phase cycle comprising the phases of prosperity, recession, depression, and recovery. The
cyclic variation may be regular but is not periodic. The upswings and the downswings in business depend upon
the joint nature of the economic forces and the interaction between them.
Random or Irregular Movements
There is another factor which causes variation in the variable under study: purely random or irregular
fluctuations that follow no regular pattern. These fluctuations are unforeseen, uncontrollable, unpredictable, and
erratic. Examples of such forces are earthquakes, wars, floods, famines, and other disasters.
Mathematical Model for Time Series Analysis
Mathematically, a time series is given as
yt = f(t)
Here, yt is the value of the variable under study at time t. If the population is the variable under study at the
various time periods t1, t2, t3, ..., tn, then the time series is
t:  t1, t2, t3, ..., tn
yt: yt1, yt2, yt3, ..., ytn
or, equivalently,
t:  t1, t2, t3, ..., tn
yt: y1, y2, y3, ..., yn
Additive Model for Time Series Analysis
If yt is the time series value at time t. Tt, St, Ct, and Rt are the trend value, seasonal, cyclic and random
fluctuations at time t respectively. According to the Additive Model, a time series can be expressed as
yt = Tt + St + Ct + Rt.
This model assumes that all four components of the time series act independently of each other.
Multiplicative Model for Time Series Analysis
The multiplicative model assumes that the various components in a time series operate proportionately to
each other. According to this model
yt = Tt × St × Ct × Rt
Mixed models
Different assumptions lead to different combinations of additive and multiplicative models, such as
yt = Tt + St + Ct × Rt
The time series analysis can also be done using the model yt = Tt + St × Ct × Rt or yt = Tt × Ct + St × Rt, etc.
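These decompositions can be carried out in Python; a minimal sketch using statsmodels on a synthetic monthly series (the data and the 12-month period are assumptions for illustration):
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + noise
np.random.seed(0)
t = np.arange(48)
y = pd.Series(100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.normal(0, 2, 48),
              index=pd.date_range("2018-01", periods=48, freq="MS"))

# Additive model: yt = Tt + St + Rt
# (statsmodels folds cyclic variation into the trend/residual terms)
additive = seasonal_decompose(y, model="additive", period=12)
print(additive.seasonal.head())

# Multiplicative model: yt = Tt x St x Rt (series must be positive)
multiplicative = seasonal_decompose(y, model="multiplicative", period=12)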
5. Explain smoothing techniques for time series data with suitable example.
Smoothing is a technique used to make a time series less "noisy" and more understandable. When data is
collected over time, it often has random fluctuations (noise), which can make it hard to see the real trend.
Smoothing helps by reducing the impact of those random spikes or dips and highlights the underlying
pattern or trend.
Why Do We Smooth Time Series Data?
To identify trends: Smoothing can help show if values are generally increasing, decreasing, or
staying constant over time.
To reduce noise: It minimizes the random fluctuations in the data, making the important patterns
easier to spot.
To forecast or predict: By smoothing, it becomes easier to make predictions about future data
points.
Common Smoothing Techniques
1. Moving Average
This is the simplest smoothing method. A Moving Average (MA) smooths a time series by averaging a fixed
number of past data points, called a "window", and is commonly used for data collected over time, such as
stock prices, weather records, or sales figures. The goal is to reduce short-term fluctuations (noise) and
reveal the overall trend, making it easier to see the underlying pattern.
How Does It Work?
o Window size: Choose a number of consecutive data points (e.g., 3, 5, 10). This is called the
"window."
o Calculate the average: Take the average of the data points in the window.
o Move the window forward: Slide the window by one data point and repeat the process
until you cover the entire data set.
Example:
Let’s say you have the following daily temperature data for 7 days:
Day 1: 30°C
Day 2: 32°C
Day 3: 35°C
Day 4: 31°C
Day 5: 28°C
Day 6: 26°C
Day 7: 29°C
Using a 3-day window, the first average appears on Day 3: (30 + 32 + 35) / 3 = 32.33°C; the window then slides forward one day at a time.
Results of the 3-Day Moving Average:
Day 3: 32.33°C
Day 4: 32.67°C
Day 5: 31.33°C
Day 6: 28.33°C
Day 7: 27.67°C
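The same 3-day moving average can be reproduced with pandas' rolling window; a minimal sketch:
import pandas as pd

temps = pd.Series([30, 32, 35, 31, 28, 26, 29],
                  index=[f"Day {i}" for i in range(1, 8)])

# 3-day simple moving average; Days 1-2 have no full window yet (NaN)
sma = temps.rolling(window=3).mean()
print(sma.round(2))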
2. Weighted Moving Average
A Weighted Moving Average (WMA) is a variation of the simple moving average, but it assigns more
importance (or weight) to recent data points. The more recent the data, the higher the weight, which
means that newer data has a greater impact on the average than older data.
Why Use a Weighted Moving Average?
While a Simple Moving Average (SMA) gives equal weight to all data points in the window, it may
not always reflect the most recent changes in the data accurately. In many cases, recent data is more
relevant, especially in scenarios like stock prices, weather forecasting, or economic indicators. A
weighted moving average is useful when you want the average to respond more quickly to changes.
How Does It Work?
Window size: Just like with a simple moving average, you choose a "window" of data points, but
instead of averaging them equally, you assign different weights.
Weights: More recent data points get larger weights. For example, in a 3-day window, day 3 (most
recent) might have a weight of 3, day 2 a weight of 2, and day 1 a weight of 1.
Formula for Weighted Moving Average:
WMA = (w1·x1 + w2·x2 + ... + wn·xn) / (w1 + w2 + ... + wn)
where the xi are the data points in the window and the wi are their weights.
Example:
Suppose you have daily sales data for the last 3 days:
Day 1: 50 units, Day 2: 60 units, Day 3: 70 units.
Instead of averaging them equally (which would give you 60), we can apply a weighted moving average
where the most recent day (Day 3) gets the highest weight.
Assigning weights:
Day 1: weight = 1
Day 2: weight = 2
Day 3: weight = 3
Now calculate the weighted moving average:
WMA = (1×50 + 2×60 + 3×70) / (1 + 2 + 3) = 380 / 6 ≈ 63.33
In this case, the weighted moving average gives us 63.33, which is closer to the most recent value (70)
than the simple average (60).
Key Points:
1. Emphasizes Recent Data: The WMA gives more weight to the most recent data, making it more
responsive to recent changes or trends.
2. Fast-Reactive: WMA reacts more quickly than a simple moving average to recent changes, which
makes it useful for detecting trends in volatile or fast-moving data.
3. Common Uses: WMA is often used in financial analysis (such as stock prices) because it
highlights recent price movements more effectively than SMA.
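A weighted moving average for one window can be computed directly with NumPy; a minimal sketch reproducing the sales example above:
import numpy as np

sales = np.array([50, 60, 70])   # Day 1, Day 2, Day 3
weights = np.array([1, 2, 3])    # most recent day gets the largest weight

# WMA = sum(weight * value) / sum(weights)
wma = np.dot(weights, sales) / weights.sum()
print("WMA:", round(wma, 2))  # 63.33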
3. Exponential Smoothing
This method also uses weights but applies them in a way that each data point's influence decreases
exponentially as you go back in time. The most recent data point has the highest weight, and older data
points contribute less. This is useful for data that changes rapidly.
How Does It Work?
It works by combining the most recent value with the previous smoothed value:
S_t = α·x_t + (1 − α)·S_(t−1), where the smoothing factor α (0 < α < 1) controls how quickly older observations lose influence.
New data gets more importance.
Older data is still used, but with less importance.
WMA:
You control exactly how much weight each data point receives. However, once you move beyond
the window size (e.g., if the window is 3 days, then data older than 3 days), the older data points are
completely ignored.
In other words, older data is completely discarded once it falls outside the window.
EMA:
Older data points are never fully discarded, even though their influence becomes very small over
time. The influence of older data decays exponentially.
EMA gives a smoother response to changes, as it keeps accounting for all past data to some degree,
though recent data has the most influence.
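In pandas, exponential smoothing is available through ewm; a minimal sketch on the temperature data used earlier:
import pandas as pd

temps = pd.Series([30, 32, 35, 31, 28, 26, 29])

# Exponential smoothing: S_t = alpha * x_t + (1 - alpha) * S_{t-1}
# adjust=False makes pandas use exactly this recursive form
ema = temps.ewm(alpha=0.5, adjust=False).mean()
print(ema.round(2))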
4. Loess Smoothing
Loess (Locally Estimated Scatterplot Smoothing) is a more advanced method that fits multiple small
models to different parts of the time series. It's flexible and can handle data that has complex patterns
(like curves or cycles).
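A minimal sketch using the lowess function from statsmodels (the noisy series is synthetic; frac controls what fraction of the data each local fit uses):
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

np.random.seed(0)
x = np.arange(50, dtype=float)
y = np.sin(x / 8) + np.random.normal(0, 0.2, 50)  # noisy non-linear pattern

# Each smoothed value comes from a weighted regression fitted locally
# to the nearest 30% of the data
smoothed = lowess(y, x, frac=0.3, return_sorted=False)
print(smoothed[:5])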