
CHENDHURAN COLLEGE OF ENGINEERING AND TECHNOLOGY

Pilivalam (Po), Thirumayam (Tk), Pudukkottai (Dt.) – 622 507

Department of Computer Science and Engineering

VII Semester

OHS352-PROJECT REPORT WRITING

Coaching Class Test- Date:


PART-A
1. List the constraints in any project.
2. What is a user guide?
3. List any 2 features of a good research question.
4. Define barometric method.
5. What is the Delphi Technique?
6. List the types of research recommendations.
7. What is analytical bibliography?
8. Define “Z” gaze.
9. Identify the two major categories of fonts.
10. What are visual aids?
PART-B
1. Elaborate on the ways by which you can avoid typographical errors with an example.
2. Elaborate on tables and illustrations used in project reports.
3. Describe the types of data collection methods.
4. Discuss the importance of data interpretation and the steps in interpreting data correctly.
5. Briefly explain the step-wise guide to create research recommendations.

Answer key
Part - A
1. What are the three common methods for performing bivariate analysis?
Bivariate analysis examines two variables together to see how they are related.
Types of Bivariate Analysis
The various types of bivariate analysis are:
 Scatter Plots
 Correlation Analysis
 Regression Analysis
 Chi-Square Test
 T-tests and ANOVA
2. Name the two types of statistical testing in bivariate analysis.
In bivariate analysis, the two main types of statistical testing are:
Correlation Analysis: This tests the strength and direction of the relationship between two continuous
variables. The most common test is the Pearson correlation coefficient (for linear relationships).
Regression Analysis: This tests how one variable (dependent) changes in response to another variable
(independent). The most common type is linear regression.
3. What is the purpose of smoothing time series data?
Smoothing time series data serves several important purposes:
Noise Reduction: It helps to minimize random fluctuations or noise, making underlying trends and
patterns clearer.
Trend Identification: Smoothing facilitates the identification of long-term trends by filtering out short-
term variations.
Forecasting: A smoother dataset can improve the accuracy of forecasts by providing a clearer signal of
future behavior.
Seasonal Adjustment: Smoothing can assist in identifying and adjusting for seasonal patterns, helping to
isolate other factors affecting the data.
Visualization: It enhances the visual representation of the data, making it easier to analyze and interpret.
4. Outline the difference between univariate and bivariate data.
Univariate Data:
 Definition: Involves a single variable or attribute.
 Analysis Focus: Examines the distribution, central tendency, and dispersion of that one variable.
 Examples:
o Heights of students in a class.
o Daily temperatures recorded in a month.
 Visualization: Commonly represented using histograms, box plots, or bar charts.
Bivariate Data:
 Definition: Involves two variables or attributes.
 Analysis Focus: Explores the relationship or correlation between the two variables.
 Examples:
o Height and weight of individuals.
o Time spent studying and exam scores.
 Visualization: Often represented using scatter plots, line graphs, or contingency tables.
5. Is bivariate qualitative or quantitative?
Bivariate data can be either qualitative or quantitative, depending on the nature of the two variables
involved:
Bivariate Qualitative Data:
o Both variables are categorical.
o Example: Relationship between gender (male/female) and preference for a type of
movie (action/comedy/drama).
Bivariate Quantitative Data:
o Both variables are numerical.
o Example: Relationship between hours studied and exam scores.
Mixed Bivariate Data:
o One variable is qualitative and the other is quantitative.
o Example: Relationship between educational level (high school, college, etc.) and income.
So, bivariate data can encompass various combinations of qualitative and quantitative variables.
6. What are Numerical Summaries of Level and Spread?
Numerical summaries of level include measures like mean and median, while those of spread
involve measures like range, variance, and standard deviation.
7. Define Scaling and Standardizing?
Scaling adjusts the numerical range of a variable, while standardizing transforms it to have a
mean of 0 and a standard deviation of 1.
8. Define Skewness in a Distribution?
Skewness measures the asymmetry of a distribution; positive skewness indicates a tail to the
right, and negative skewness indicates a tail to the left.
9. What is the Purpose of Percentage Tables in Data Analysis?
Percentage tables express values as a percentage of the total, providing a clearer understanding of the
relative contribution of each category.
10. Explain the Significance of Handling Several Batches in Experimental Design?
Handling several batches is important in experimental design to account for variations introduced by different
conditions, ensuring robust and generalizable results.

Part – B

1. Explain the 10 Essential Numerical Summaries in Statistics with example.

Importance of Numerical Summaries


Numerical summaries are important tools in data science, providing vital insights into the characteristics of
a dataset. These summaries help data scientists understand the data at a glance, facilitating
more informed decision-making.
1. Mean
Definition:
The mean, often known as the average, is a measure of central tendency that is
calculated by taking the sum of all data points and dividing by the number of data points in the
sample. Mathematically, it is expressed as:

Mean (x̄) = (x1 + x2 + ... + xn) / n
Calculation Methods:
The mean is calculated using the formula mentioned earlier.
o Sum: Add up all the values in the dataset.
o Divide by the number of data points: Take the total sum and divide it by the count of values in
the dataset.
Example:
For a dataset [1, 2, 3, 4, 5]:
o Sum = 1 + 2 + 3 + 4 + 5 = 15
o Number of data points (n) = 5
o Mean = 15 / 5 = 3
Implementation of Mean in Python:
Here's how to calculate the mean in Python using the NumPy library:
import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print("Mean:", mean)
Output:
Mean: 3.0

2. Median
Definition:
The median is another measure of central tendency that represents the middle value of a dataset when it
is ordered in ascending or descending order. If the dataset has an odd number of
observations, the median is the middle number. If the dataset has an even number of
observations, the median is the average of the two middle numbers.
Calculation Methods:
To calculate the median:
o Sort the data in ascending order.
o Determine the middle value:
o If the number of observations (n) is odd, the median of the distribution is the middle
value in the data.
o If the number of observations is even, the median value is the average of the two
middle values in the data.
Example:
For a dataset [45, 67, 23, 89, 90]:
o Sorted data: [23, 45, 67, 89, 90]
o Number of data points (n) = 5 (odd)
o Median = 67 (the middle value)
For a dataset [1, 2, 3, 4, 5, 6]:
o Sorted data: [1, 2, 3, 4, 5, 6]
o Number of data points (n) = 6 (even)
o Median = (3 + 4) / 2 = 3.5
Implementation of Median in Python:
Here's how to calculate the median in Python using the NumPy library:
import numpy as np

data = [1, 2, 3, 4, 5]
median = np.median(data)
print("Median:", median)
Output:
Median: 3.0
3. Mode
Definition:
The mode is a measure of central tendency that represents the most frequently occurring value in
a dataset. Unlike the mean and median, which are measures of central location, the mode focuses
on the frequency of values. A dataset can have one mode (unimodal), two modes (bimodal), or
more (multimodal). In some cases, especially with continuous data, there may be no mode at all if no
value repeats.
Calculation Methods:
To calculate the mode:
o Tally the frequencies: Count the number of occurrences of each value in the dataset.
o Identify the highest frequency: The value(s) with the highest count is the mode.
Example:
For a dataset [1, 2, 2, 3, 4]:
o Tally: 1 occurs once, 2 occurs twice, 3 occurs once, 4 occurs once.
o Highest frequency: 2 (it occurs twice).
o Mode = 2
Implementation of Mode in Python:
import pandas as pd

data = [1, 2, 2, 3, 4]
mode = pd.Series(data).mode()
print("Mode:", mode[0])
Output:
Mode: 2
4. Standard Deviation
Definition:
Standard deviation is a measure of the amount of variation or dispersion in a set of values. It quantifies
how much the individual data points in a dataset differ from the dataset's mean. Mathematically, the
standard deviation is the square root of the variance. For a dataset with n observations, the formula for
the standard deviation (σ) is as follows:

σ = √[ Σ(xi - μ)² / n ]

Where:
o xi represents each data point and
o μ is the mean of the dataset. For a sample from a population, the formula adjusts to use n - 1
in the denominator instead of n to provide an unbiased estimate.
Calculation Methods:
The standard deviation is calculated through the following steps:
o Calculate the mean (μ): Sum all data points and divide by the number of points.
o Calculate each point's deviation from the mean: Subtract the mean from each data point.
o Square each deviation: This eliminates negative values and emphasises larger deviations.
o Sum all squared deviations: Add up all the squared deviations.
o Divide by the number of data points (or n - 1 for a sample): This gives the variance.
o Take the square root of the variance: This is the standard deviation.
Implementation of Standard Deviation in Python:
import pandas as pd

data = [1, 2, 3, 4, 5]
df = pd.Series(data)
std_dev = df.std()  # pandas uses the sample formula (n - 1 in the denominator) by default
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 1.5811388300841898
5. Variance
Definition:
Variance is a statistical measure that quantifies the dispersion of data points in a dataset relative to the
mean. It indicates how much the values in the dataset differ from the average value. Mathematically,
variance is the average of the squared differences from the mean. For a dataset with n observations, the
variance (σ²) is calculated as:

σ² = Σ(xi - μ)² / n
Where:
o xi represents each data point, and
o μ is the mean of the dataset.
o For a sample from a population, the formula adjusts to use n -1 in the denominator instead of n
to provide an unbiased estimate.
Calculation Methods:
To calculate the variance, follow these steps:
o Calculate the mean: Sum all data points and divide by the number of points.
o Calculate each point's deviation from the mean: Subtract the mean from each data point.
o Square each deviation: This eliminates negative values and emphasises larger deviations.
o Sum all squared deviations: Add up all the squared deviations.
o Divide by the number of data points: This gives the variance.
Implementation of Variance in Python:
import pandas as pd

data = [1, 2, 3, 4, 5]
df = pd.Series(data)
variance = df.var()  # pandas uses the sample formula (n - 1 in the denominator) by default
print("Variance:", variance)
Output:
Variance: 2.5
6. Range
Definition:

The range is a measure of statistical dispersion that represents the difference between the maximum and
minimum values in a dataset. It provides a simple way to understand the spread or variability of the data.

The formula for calculating the range is:


Range = Maximum_Value - Minimum_Value
For example, in a dataset [3, 7, 8, 2, 5], the range is 8 - 2 = 6.
Calculation Methods:
To calculate the range:
o Identify the Maximum Value: Find the highest value in the dataset.
o Identify the Minimum Value: Find the lowest value in the dataset.
o Subtract the Minimum from the Maximum: The result is the range.
Example:
For the dataset [10, 15, 20, 2, 8]:
o Maximum value = 20
o Minimum value = 2
o Range = 20 - 2 = 18
Implementation of Range in Python:
Here's how to calculate the range in Python using basic built-in functions:
data = [10, 15, 20, 2, 8]

range_value = max(data) - min(data)
print("Range:", range_value)
Output:
Range: 18
7. Interquartile Range
Definition:
The Interquartile Range (IQR) is a measure of statistical dispersion, which indicates the spread of the
middle 50% of a dataset. It is calculated as the difference between the third quartile (Q3) and the first
quartile (Q1):
IQR = Q3 - Q1
Quartiles divide a ranked dataset into four equal parts. The first quartile (Q1) is the median of the lower
half of the data (25th percentile), and the third quartile (Q3) is the median of the upper half of the data
(75th percentile). The second quartile (Q2) is the median of the entire dataset.
Importance of Interquartile Range:
o Robust Measure of Dispersion: Unlike the range, which only considers the extreme values, the IQR
focuses on the central portion of the data, providing a more robust measure of variability that is less
sensitive to outliers.
o Identification of Outliers: The IQR is used to identify outliers. Values that fall below Q1 - 1.5 × IQR
or above Q3 + 1.5 × IQR are typically considered outliers.
o Comparison of Distributions: The IQR allows for comparison of the spread of different datasets. It
helps in understanding how the middle 50% of the data varies across different groups.
o Data Summarization: By summarising the spread of the middle 50% of the data, the IQR provides a
clear picture of the central tendency and dispersion without being affected by extreme values.
o Use in Box Plots: The IQR is a key component in creating box plots, which are graphical
representations of data distributions. Box plots visually show the median, quartiles, and potential
outliers.
Calculation Methods:
To calculate the IQR:
o Arrange Data: Sort the dataset in ascending order.
o Find Quartiles:
o Q1: The median of the lower half of the data.
o Q3: The median of the upper half of the data.
o Calculate IQR: Subtract Q1 from Q3.
Example:
For the dataset [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]:
o Arrange data: [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
o Find Q1 (median of the lower half) and Q3 (median of the upper half):
o Q1 = 36
o Q3 = 43
o Calculate IQR: IQR = 43 - 36 = 7
Implementation of Interquartile Range in Python:
Here's how to calculate the IQR in Python using the NumPy library:
import numpy as np

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)
Output:
Interquartile Range (IQR): 6.0
Note: NumPy's default linear interpolation gives Q1 = 36.75 and Q3 = 42.75, so its result (6.0) differs
slightly from the half-median hand calculation above.
8. Percentiles and Quartiles
Definitions:
Percentiles:
A percentile is a measure used in statistics that indicates the value below which a given percentage of
observations in a group of observations falls. For example, the 20th percentile is the value below which
20% of the observations may be found. Percentiles divide a dataset into 100 equal parts.
Quartiles:
Quartiles are a type of quantile, which divide a dataset into four equal parts. The three quartiles are:
1. First Quartile (Q1): The 25th percentile, below which 25% of the data falls.
2. Second Quartile (Q2 or Median): The 50th percentile, below which 50% of the data falls.
3. Third Quartile (Q3): The 75th percentile, below which 75% of the data falls.
The interquartile range (IQR) is the range between the first and third quartiles and is a measure of
statistical dispersion.
Calculation Methods:
Percentiles Calculation:
o Sort the data in ascending order.
o Use the formula P = (n + 1) × p / 100, where n is the number of observations and p is the
desired percentile.
o Find the value at the P-th position in the sorted list.
Quartiles Calculation:
o Sort the data in ascending order.
o Calculate Q1, Q2 (median), and Q3 using the 25th, 50th, and 75th percentiles,
respectively.
Example:
For the dataset [7, 15, 36, 39, 40, 41, 42, 43, 47, 49], using linear interpolation between data points
(the default in NumPy below):
1. Q1 (25th percentile) = 36.75
2. Q2 (50th percentile or median) = 40.5
3. Q3 (75th percentile) = 42.75
Implementation of Percentiles and Quartiles in Python:
Using NumPy:
import numpy as np

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_75 = np.percentile(data, 75)

print("25th Percentile (Q1):", percentile_25)
print("50th Percentile (Q2):", percentile_50)
print("75th Percentile (Q3):", percentile_75)
Output:
25th Percentile (Q1): 36.75
50th Percentile (Q2): 40.5
75th Percentile (Q3): 42.75
9. Skewness
Definition:
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean. It quantifies how much a distribution deviates from a normal distribution, which is
symmetrical. A distribution can be:
1. Positively Skewed (Right Skewed): The right tail (higher values) is longer or fatter than the left tail
(lower values). This indicates that the bulk of the values lie to the left of the mean.
2. Negatively Skewed (Left Skewed): The left tail (lower values) is longer or fatter than the right tail
(higher values). This indicates that the bulk of the values lie to the right of the mean.
3. Symmetrical: The values are evenly distributed on both sides of the mean, indicating no skewness.
Mathematically, skewness can be calculated using the formula:

Skewness = Σ(xi - x̄)³ / (n × s³)
Where:
o n is the number of observations,
o xi is each individual observation,
o x̄ is the mean, and
o s is the standard deviation.
Calculation Methods:
To calculate skewness, follow these steps:
o Calculate the mean.
o Calculate the standard deviation.
o Calculate the skewness using the skewness formula.
Implementation of Skewness in Python:
Here's how to calculate skewness in Python using the scipy.stats library:
from scipy.stats import skew

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]

# Calculate skewness (scipy's default uses the population/moment formula)
skewness_value = skew(data)
print("Skewness:", skewness_value)
Output:
Skewness: 0.1945 (approx.)
10. Kurtosis
Definition:
Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to its
overall shape. Specifically, it quantifies whether the data are heavy-tailed or light-tailed compared to
a normal distribution. There are three types of kurtosis:
1. Mesokurtic: Distributions with kurtosis similar to a normal distribution. Kurtosis value
is approximately zero.
2. Leptokurtic: Distributions with heavier tails and a sharper peak than a normal distribution.
Kurtosis value is greater than zero.
3. Platykurtic: Distributions with lighter tails and a flatter peak than a normal distribution.
Kurtosis value is less than zero.
Mathematically, excess kurtosis is calculated using the formula:

Kurtosis = Σ(xi - x̄)⁴ / (n × s⁴) - 3
Where:
o n is the number of observations,
o xi is each individual observation,
o x̄ is the mean, and
o s is the standard deviation.
Calculation Methods:
To calculate kurtosis, follow these steps:
o Calculate the mean.
o Calculate the standard deviation.
o Calculate each observation's deviation from the mean and raise it to the fourth power.
o Sum these values and apply the kurtosis formula.
Implementation of Kurtosis in Python:
Here's how to calculate kurtosis in Python using the scipy.stats library:
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]

# Calculate excess kurtosis (scipy's default is Fisher's definition, where a normal distribution gives 0)
kurtosis_value = kurtosis(data)
print("Kurtosis:", kurtosis_value)
Output:
Kurtosis: -1.2887 (approx.)

2. How do you analyse a percentage and contingency table? Give examples
PERCENTAGE TABLES IN BIVARIATE ANALYSIS
In bivariate analysis, percentage tables (or contingency tables) are used to display the relationship between
two categorical variables. These tables provide a way to examine how the distribution of one variable
differs across the levels of another. Here's a breakdown of how they work and how to construct them:
Steps to Create Percentage Tables in Bivariate Analysis:
1. Identify Two Variables:
o Choose two categorical variables to analyze.
o For example, let's take Gender (Male/Female) and Smoking Status (Smoker/Non-Smoker).
2. Create a Contingency Table:
o A contingency table shows the frequency distribution of variables.
o For the example, the table could look like this:
          Smoker   Non-Smoker   Total
Male      40       60           100
Female    30       70           100
Total     70       130          200
3. Convert to Percentage Table: You can convert this table into a percentage table in several ways,
depending on the type of percentage you want:
o Row percentages: Calculate the percentage within each row, showing how each row's total
is divided across categories.
o Column percentages: Calculate the percentage within each column, showing how each
column's total is divided across categories.
o Overall percentages: Show what percentage of the grand total each cell represents.

Implementation in Python:
import pandas as pd

# Create the contingency table as a pandas DataFrame
data = {'Smoker': [40, 30],
        'Non-Smoker': [60, 70]}
index = ['Male', 'Female']
df = pd.DataFrame(data, index=index)

# Row percentages
row_percentage = df.div(df.sum(axis=1), axis=0) * 100
# Column percentages
column_percentage = df.div(df.sum(axis=0), axis=1) * 100
# Overall percentages
overall_percentage = df / df.values.sum() * 100

# Display the tables
print("Row Percentage:\n", row_percentage)
print("\nColumn Percentage:\n", column_percentage)
print("\nOverall Percentage:\n", overall_percentage)
Analyzing Contingency Tables in Bivariate Analysis
A contingency table (also known as a cross-tabulation or crosstab) is a matrix that displays the frequency
distribution of two categorical variables. It is a core tool in bivariate analysis, which is used to study the
relationship between two variables. These tables help identify potential associations or dependencies
between variables.
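As an illustrative sketch (not part of the original worked pages), the association in the gender/smoking table above can be tested with a chi-square test of independence using scipy.stats.chi2_contingency:
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table from the earlier example (rows: Male/Female, columns: Smoker/Non-Smoker)
observed = pd.DataFrame({'Smoker': [40, 30], 'Non-Smoker': [60, 70]},
                        index=['Male', 'Female'])

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(observed)

print("Chi-square statistic:", round(chi2, 3))
print("p-value:", round(p_value, 3))
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
# A small p-value (for example, below 0.05) suggests an association between gender and smoking status.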

3. How, When and Why should you Normalize / Standardize / Rescale your data?
Scaling and Standardizing
Scaling
Scaling refers to the process of transforming or normalizing data to ensure that it falls within a specific range
or has consistent units. This is particularly important when working with machine learning models, as many
algorithms are sensitive to the range and magnitude of the input data. Scaling ensures that no single feature
dominates due to differences in scale.
Here are common types of scaling in data science:
1. Min-Max Scaling (Normalization):
 Formula:

X_scaled = (X - X_min) / (X_max - X_min)
 Explanation: This technique scales data to a specific range, typically [0, 1]. It ensures that all features
have the same minimum and maximum values.
2. Standardization (Z-Score Normalization):
 Formula:

Z = (X - μ) / σ
 where μ is the mean and σ is the standard deviation.


 Explanation: Standardization transforms data to have a mean of 0 and a standard deviation of 1,
effectively centering the data and scaling by variance.
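Here's a minimal sketch of both techniques in Python, using a small made-up list of values (the numbers and variable names are illustrative, not from the original):
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max scaling: maps the values to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1 (population std, ddof=0)
x_standardized = (x - x.mean()) / x.std()

print("Min-Max scaled:", x_minmax)        # [0.   0.25 0.5  0.75 1.  ]
print("Standardized:", x_standardized)
The same transforms are commonly performed with scikit-learn's MinMaxScaler and StandardScaler when working inside a machine learning pipeline.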
In machine learning, everything is measured in terms of numbers. When we want to identify nearest
neighbors or the similarity or dissimilarity of data points, we calculate the distance between them, and based
on those distances we say two points are similar or not. If we consider a feature with respect to the
target variable, similarity means how much that feature impacts the target variable.
Let's understand this idea of distance with an example. There are many methods to calculate distance; here we
will take Euclidean distance, as illustrated in the sketch below.
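The original worked example is not reproduced here; the following rough sketch (with made-up numbers) shows the Euclidean distance calculation and why unscaled features can dominate it:
import numpy as np

# Two hypothetical data points with features on very different scales:
# (age in years, annual income in rupees)
a = np.array([25.0, 300000.0])
b = np.array([40.0, 320000.0])

# Euclidean distance: square root of the sum of squared differences
dist_raw = np.sqrt(np.sum((a - b) ** 2))
print("Distance on raw features:", dist_raw)  # dominated by the income difference

# Min-Max scale each feature to [0, 1] using assumed feature ranges
mins = np.array([20.0, 250000.0])
maxs = np.array([60.0, 500000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)

dist_scaled = np.sqrt(np.sum((a_scaled - b_scaled) ** 2))
print("Distance on scaled features:", dist_scaled)  # both features now contribute comparably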

4. What is meant by time series data? Describe its four components.
Time Series
The arrangement of data in accordance with their time of occurrence is a time series. It is the chronological
arrangement of data. Here, time is just a way in which one can relate the entire phenomenon to suitable
reference points. Time can be hours, days, months or years.
A time series depicts the relationship between two variables. Time is one of those variables and the second is
any quantitative variable. The relationship need not always show an increase in the variable with reference to
time, nor is it always decreasing.
It may be increasing at some points in time and decreasing at others. Can you think of any such example?
The temperature of a particular city in a particular week or a month is one of those examples.
Uses of Time Series
 The most important use of studying time series is that it helps us to predict the future behaviour of
the variable based on past experience
 It is helpful for business planning as it helps in comparing the actual current performance with the
expected one
 From time series, we get to study the past behaviour of the phenomenon or the variable under
consideration
 We can compare the changes in the values of different variables at different times or places, etc.
Components for Time Series Analysis
The various reasons or the forces which affect the values of an observation in a time series are the
components of a time series. The four categories of the components of time series are
 Trend
 Seasonal Variations
 Cyclic Variations
 Random or Irregular movements
Seasonal and Cyclic Variations are the periodic changes or short-term fluctuations.
Trend
The trend shows the general tendency of the data to increase or decrease during a long period of time. A
trend is a smooth, general, long-term, average tendency. It is not always necessary that the increase or
decrease is in the same direction throughout the given period of time.
The tendency may increase, decrease or remain stable in different sections of time, but the
overall trend must be upward, downward or stable. Population, agricultural production, items
manufactured, numbers of births and deaths, numbers of industries or factories, and numbers of schools or
colleges are some examples showing such tendencies of movement.
Linear and Non-Linear Trend
If we plot the time series values on a graph against time t, the pattern of the data clustering
shows the type of trend. If the data cluster more or less around a straight line, the trend is linear;
otherwise it is non-linear (curvilinear).
Periodic Fluctuations
There are some components in a time series which tend to repeat themselves over a certain period of time.
They act in a regular spasmodic manner.
Seasonal Variations
These are the rhythmic forces which operate in a regular and periodic manner over a span of less than a year.
They have the same or almost the same pattern during a period of 12 months. This variation will be present
in a time series if the data are recorded hourly, daily, weekly, quarterly, or monthly.
These variations come into play either because of the natural forces or man-made conventions. The various
seasons or climatic conditions play an important role in seasonal variations. For example, the production of
crops depends on the season, the sale of umbrellas and raincoats rises in the rainy season, and the sale of
electric fans and air conditioners shoots up in the summer season.
The effect of man-made conventions such as some festivals, customs, habits, fashions, and some occasions
like marriage is easily noticeable. They recur themselves year after year. An upswing in a season should not
be taken as an indicator of better business conditions.
Cyclic Variations

The variations in a time series which operate themselves over a span of more than one year are the cyclic
variations. This oscillatory movement has a period of oscillation of more than a year. One complete period is
a cycle. This cyclic movement is sometimes called the ‘Business Cycle’.
It is a four-phase cycle comprising the phases of prosperity, recession, depression, and recovery. The
cyclic variation may be regular but not strictly periodic. The upswings and the downswings in business
depend upon the joint nature of the economic forces and the interaction between them.
Random or Irregular Movements
There is another factor which causes the variation in the variable under study. They are not regular variations
and are purely random or irregular. These fluctuations are unforeseen, uncontrollable, unpredictable, and are
erratic. These forces include earthquakes, wars, floods, famines, and other such disasters.
Mathematical Model for Time Series Analysis
Mathematically, a time series is given as
yt = f (t)
Here, yt is the value of the variable under study at time t. If, for example, population is the variable under
study at the time periods t1, t2, t3, … , tn, then the time series is
t: t1, t2, t3, … , tn
yt: yt1, yt2, yt3, …, ytn
or, t: t1, t2, t3, … , tn
yt: y1, y2, y3, … , yn
Additive Model for Time Series Analysis
If yt is the time series value at time t. Tt, St, Ct, and Rt are the trend value, seasonal, cyclic and random
fluctuations at time t respectively. According to the Additive Model, a time series can be expressed as
yt = Tt + St + Ct + Rt.
This model assumes that all four components of the time series act independently of each other.
Multiplicative Model for Time Series Analysis
The multiplicative model assumes that the various components in a time series operate proportionately to
each other. According to this model
yt = Tt × St × Ct × Rt
Mixed models
Different assumptions lead to different combinations of additive and multiplicative models as
yt = Tt + St + Ct × Rt.
The time series analysis can also be done using the model yt = Tt + St × Ct × Rt or yt = Tt × Ct + St × Rt etc.
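As a hedged sketch of how such a model is estimated in practice (the monthly series below is synthetic, and the statsmodels library is assumed to be available), an additive decomposition into trend, seasonal and residual components can be computed as follows; note that any cyclic variation is absorbed into the trend component here:
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + small random noise
rng = np.random.default_rng(0)
months = pd.date_range("2021-01-01", periods=36, freq="MS")
trend = np.linspace(100, 160, 36)
seasonal = 10 * np.sin(2 * np.pi * np.arange(36) / 12)
noise = rng.normal(0, 2, 36)
series = pd.Series(trend + seasonal + noise, index=months)

# Additive decomposition: yt = Tt + St + Rt
result = seasonal_decompose(series, model="additive", period=12)

print(result.trend.dropna().head())     # estimated trend component Tt
print(result.seasonal.head())           # estimated seasonal component St
print(result.resid.dropna().head())     # estimated irregular component Rt

# For a multiplicative model (yt = Tt × St × Rt), pass model="multiplicative" instead.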

5. Explain smoothing techniques for time series data with suitable example.
Smoothing is a technique used to make a time series less "noisy" and more understandable. When data is
collected over time, it often has random fluctuations (noise), which can make it hard to see the real trend.
Smoothing helps by reducing the impact of those random spikes or dips and highlights the underlying
pattern or trend.
Why Do We Smooth Time Series Data?
 To identify trends: Smoothing can help show if values are generally increasing, decreasing, or
staying constant over time.
 To reduce noise: It minimizes the random fluctuations in the data, making the important patterns
easier to spot.
 To forecast or predict: By smoothing, it becomes easier to make predictions about future data
points.
Common Smoothing Techniques
1. Moving Average
This is the simplest smoothing method. A Moving Average (MA) smooths time series data by averaging the
data points within a fixed number of consecutive observations, called a "window." It is commonly used for
data collected over time, such as stock prices, weather records, or sales figures. The goal is to reduce
short-term fluctuations (noise) and reveal the overall trend.
How Does It Work?
o Window size: Choose a number of consecutive data points (e.g., 3, 5, 10). This is called the
"window."
o Calculate the average: Take the average of the data points in the window.
o Move the window forward: Slide the window by one data point and repeat the process
until you cover the entire data set.
Example:
Simple Example:
Let’s say you have the following daily temperature data for 7 days:
 Day 1: 30°C
 Day 2: 32°C
 Day 3: 35°C
 Day 4: 31°C
 Day 5: 28°C
 Day 6: 26°C
 Day 7: 29°C

Results of the 3-Day Moving Average:
 Day 3: 32.33°C
 Day 4: 32.67°C
 Day 5: 31.33°C
 Day 6: 28.33°C
 Day 7: 27.67°C
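Here's a minimal sketch reproducing this 3-day moving average in Python with pandas (the temperatures are the ones from the example above):
import pandas as pd

# Daily temperatures from the example above (Day 1 to Day 7)
temps = pd.Series([30, 32, 35, 31, 28, 26, 29],
                  index=[f"Day {i}" for i in range(1, 8)])

# 3-day simple moving average: each value is the mean of the current and previous two days
sma_3 = temps.rolling(window=3).mean()

print(sma_3.round(2))
# Day 3: 32.33, Day 4: 32.67, Day 5: 31.33, Day 6: 28.33, Day 7: 27.67
# (Days 1 and 2 are NaN because a full 3-day window is not yet available)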
2. Weighted Moving Average
A Weighted Moving Average (WMA) is a variation of the simple moving average, but it assigns more
importance (or weight) to recent data points. The more recent the data, the higher the weight, which
means that newer data has a greater impact on the average than older data.
Why Use a Weighted Moving Average?
While a Simple Moving Average (SMA) gives equal weight to all data points in the window, it may
not always reflect the most recent changes in the data accurately. In many cases, recent data is more
relevant, especially in scenarios like stock prices, weather forecasting, or economic indicators. A
weighted moving average is useful when you want the average to respond more quickly to changes.
How Does It Work?
 Window size: Just like with a simple moving average, you choose a "window" of data points, but
instead of averaging them equally, you assign different weights.

 Weights: More recent data points get larger weights. For example, in a 3-day window, day 3 (most
recent) might have a weight of 3, day 2 a weight of 2, and day 1 a weight of 1.
Formula for Weighted Moving Average:

WMA = Σ(weight_i × value_i) / Σ(weight_i)
Example:
Suppose you have daily sales data for the last 3 days:
Day 1: 50 units, Day 2: 60 units, Day 3: 70 units.
Instead of averaging them equally (which would give you 60), we can apply a weighted moving average
where the most recent day (Day 3) gets the highest weight.
Assigning weights:
 Day 1: weight = 1
 Day 2: weight = 2
 Day 3: weight = 3
Now calculate the weighted moving average:

WMA = (1 × 50 + 2 × 60 + 3 × 70) / (1 + 2 + 3) = 380 / 6 ≈ 63.33
In this case, the weighted moving average gives us 63.33, which is closer to the most recent value (70)
than the simple average (60).
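Here's a short sketch of the same weighted moving average in Python, using the sales figures and weights from the example:
import numpy as np

# Sales for the last 3 days and their weights (most recent day weighted highest)
values = np.array([50, 60, 70])   # Day 1, Day 2, Day 3
weights = np.array([1, 2, 3])

wma = np.sum(weights * values) / np.sum(weights)
print("Weighted Moving Average:", round(wma, 2))  # 63.33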
Key Points:
1. Emphasizes Recent Data: The WMA gives more weight to the most recent data, making it more
responsive to recent changes or trends.
2. Fast-Reactive: WMA reacts more quickly than a simple moving average to recent changes, which
makes it useful for detecting trends in volatile or fast-moving data.
3. Common Uses: WMA is often used in financial analysis (such as stock prices) because it
highlights recent price movements more effectively than SMA.
3. Exponential Smoothing
This method also uses weights but applies them in a way that each data point's influence decreases
exponentially as you go back in time. The most recent data point has the highest weight, and older data
points contribute less. This is useful for data that changes rapidly.
How Does It Work?
It works by combining the most recent value with the previous smoothed value.
 New data gets more importance.
 Older data is still used but with less importance.

WMA:
 You control exactly how much weight each data point receives. However, once you reach beyond
the window size (e.g., if the window is 3 days, then data older than 3 days), the older data points are
completely ignored.
 In other words, the older data is completely discarded once you move past the window.
EMA:
 Older data points are never fully discarded, even though their influence becomes very small over
time. The influence of older data decays exponentially.
 EMA gives a smoother response to changes, as it keeps accounting for all past data to some degree,
though recent data has the most influence.
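Here's a brief sketch of exponential smoothing in Python using pandas' ewm, applied to the same 7-day temperature series; the smoothing factor alpha = 0.5 is just an illustrative choice:
import pandas as pd

temps = pd.Series([30, 32, 35, 31, 28, 26, 29],
                  index=[f"Day {i}" for i in range(1, 8)])

# Exponentially weighted moving average: recent days count most,
# but older days are never fully discarded
ema = temps.ewm(alpha=0.5, adjust=False).mean()

print(ema.round(2))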
4. Loess Smoothing
Loess (Locally Estimated Scatterplot Smoothing) is a more advanced method that fits multiple small
models to different parts of the time series. It's flexible and can handle data that has complex patterns
(like curves or cycles).

Here’s what’s happening step by step:


Original Data (Raw Temperatures):
These are the actual temperatures you recorded for each day:
 Day 1: 30°C
 Day 2: 32°C
 Day 3: 35°C
 Day 4: 31°C
 Day 5: 28°C
 Day 6: 26°C
 Day 7: 29°C
This data fluctuates a lot, so it’s hard to see if there's an overall warming or cooling trend.
LOESS Smoothed Data:
LOESS will look at small groups of days (e.g., 3 days at a time) and use them to compute a smoother
value for each day, based on the trend around that day. The result is a set of values that are less noisy and
show the overall trend.
For example, after applying LOESS, you might get these smoothed temperatures:
 Day 1: 30.5°C
 Day 2: 31.8°C
 Day 3: 33.0°C
 Day 4: 30.5°C
 Day 5: 28.5°C
 Day 6: 27.0°C
 Day 7: 28.0°C
What’s Happening Here?
LOESS has smoothed the original data. The new temperatures are close to the actual values, but they
don’t jump up and down as much as the raw data. For instance:
 On Day 4, the original temperature was 31°C, but LOESS has smoothed it to 30.5°C because it
looks at nearby days and sees that the general trend is slightly downward.
 Similarly, on Day 3, the actual temperature was 35°C, but LOESS has smoothed it to 33.0°C
because Days 2 and 4 have lower temperatures, so it balances the value.
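Here's a hedged sketch of LOESS smoothing on the same temperature series using the lowess function from statsmodels (the frac parameter, which sets the size of the local neighbourhood, is an illustrative choice, so the smoothed numbers will not exactly match the example values above):
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

days = np.arange(1, 8)
temps = np.array([30, 32, 35, 31, 28, 26, 29], dtype=float)

# LOESS / LOWESS: fit a local regression around each day using roughly 60% of the points
result = lowess(temps, days, frac=0.6)  # returns an array of [day, smoothed value] pairs

for d, s in result:
    print(f"Day {int(d)}: {s:.1f}°C")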

When to Use Smoothing?


 When you want to focus on long-term trends and ignore short-term fluctuations.
 When you need to make predictions and want your data to be more stable.
 When you need to analyze seasonal or cyclic behavior, such as sales data that varies by season.

