
UNIT 3 UNIVARIATE ANALYSIS

Univariate data:
Univariate data refers to a type of data in which each observation or data
point corresponds to a single variable. In other words, it involves the
measurement or observation of a single characteristic or attribute for each
individual or item in the dataset. Analyzing univariate data is the simplest
form of analysis in statistics.

Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186

Suppose that the heights of seven students in a class are recorded, as in the table above. There is only one variable, height, and the analysis does not deal with any cause or relationship.
Key points in Univariate analysis:
1. No Relationships: Univariate analysis focuses solely on describing and
summarizing the distribution of the single variable. It does not explore
relationships between variables or attempt to identify causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of central
tendency (mean, median, mode) and measures of dispersion (range,
standard deviation), are commonly used in the analysis of univariate data.
3. Visualization: Histograms, box plots, and other graphical representations are often used to visually represent the distribution of the single variable.
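For example, the heights recorded in the table above can be summarized in a few lines of Python (a minimal sketch using NumPy; the values come from the heights table):

import numpy as np

heights = [164, 167.3, 170, 174.2, 178, 180, 186]    # heights in cm from the table above

print("Mean  :", np.mean(heights))                    # arithmetic average, about 174.2
print("Median:", np.median(heights))                  # middle value, 174.2
print("Range :", np.max(heights) - np.min(heights))   # 186 - 164 = 22
print("Std   :", np.std(heights, ddof=1))             # sample standard deviation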

Exploring Data Distribution

Distribution describes the different ways in which the data can occur: which values appear, what percentage of the data takes specific values, and which observations are outliers. Exploring the data distribution therefore means using graphical methods to organize and display this information.
Terms related to Exploration of Data Distribution
-> Boxplot
-> Frequency Table
-> Histogram
-> Density Plot
• Boxplot: It is based on the percentiles of the data. The top and bottom of the box are the 75th and 25th percentiles of the data, and the extended lines, known as whiskers, cover the range of the rest of the data.

Code #1: Loading Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



Code #2: Loading Data

data = pd.read_csv("../data/state.csv")

# Adding a new column with derived data

data['PopulationInMillions'] = data['Population']/1000000

print (data.head(10))

Code #3 : BoxPlot

# BoxPlot Population In Millions

fig, ax1 = plt.subplots()

fig.set_size_inches(9, 15)

ax1 = sns.boxplot(x = data.PopulationInMillions, orient ="v")

ax1.set_ylabel("Population by State in Millions", fontsize = 15)

ax1.set_title("Population - BoxPlot", fontsize = 20)



Box plot

Frequency Table: It is a tool that distributes the data into equally spaced ranges (segments, or bins) and tells us how many values fall in each segment.

Code #1: Adding a column to perform crosstab and groupby functionality.

# Perform the binning action, the bins have been

# chosen to accentuate the output for the Frequency Table

data['PopulationInMillionsBins'] = pd.cut(

data.PopulationInMillions, bins =[0, 1, 2, 5, 8, 12, 15, 20, 50])

print (data.head(10))

Code #2: Cross Tab – a type of Frequency Table

# Cross Tab - a type of Frequency Table

pd.crosstab(data.PopulationInMillionsBins, data.Abbreviation, margins = True)

Code #3: GroupBy – a type of Frequency Table

# Groupby - a type of Frequency Table

data.groupby(data.PopulationInMillionsBins)['Abbreviation'].apply(', '.join)

Histogram: It is a way of visualizing the data distribution implied by the frequency table, with bins on the x-axis and the count of data points in each bin on the y-axis.
Code – Histogram

# Histogram Population In Millions
fig, ax2 = plt.subplots()
fig.set_size_inches(9, 15)

ax2 = sns.distplot(data.PopulationInMillions, kde = False)
ax2.set_ylabel("Frequency", fontsize = 15)
ax2.set_xlabel("Population by State in Millions", fontsize = 15)
ax2.set_title("Population - Histogram", fontsize = 20)



Density Plot:

It is related to the histogram, as it shows the distribution of the data values as a continuous line; in effect, it is a smoothed version of the histogram. The output below is the density plot superposed over the histogram.
Code – Density Plot for the data

# Density Plot - Population
fig, ax3 = plt.subplots()
fig.set_size_inches(7, 9)

ax3 = sns.distplot(data.Population, kde = True)
ax3.set_ylabel("Density", fontsize = 15)
ax3.set_xlabel("Population by State", fontsize = 15)
ax3.set_title("Density Plot - Population", fontsize = 20)



Density Plot - Population

Numerical Summaries in Statistics for Data Science

1. Mean

2. Median

3. Mode

4. Percentile

5. Quartiles (five-number summary)

6. Standard Deviation

7. Variance

8. Range

9. Proportion

10. Correlation

Mean

This is the point of balance, describing the most typical value for normally
distributed data. I say “normally distributed” data because the mean is highly
influenced by outliers.

The mean adds up all the data values and divides by the total number of
values, as follows:

The formula for the mean: x̄ = (x_1 + x_2 + … + x_n) / n = (1/n) ∑_{i=1}^{n} x_i

The 'x-bar' (x̄) is used to represent the sample mean (the mean of a sample of data). '∑' (sigma) denotes the addition of all values from 'i = 1' to 'i = n' ('n' is the number of data values). The result is then divided by 'n'.

Python: np.mean([1,2,3,4,5]) The result is 3.

R: mean(c(2,2,4,4)) The result is 3.

Effect of outliers:

Effect of an outlier in the mean

The first data set ranges from 1 to 10, and its mean is 5.5. When we replace 10 with 20, the mean increases to 6.5. In the next concept, we will go over the median, which is far less sensitive to outliers.

Median

This is the “middle data point”, where half of the data is below the median and
half is above the median. It’s the 50th percentile of the data (we will
cover percentile later in this article). It’s also mostly used with skewed data
because outliers won’t have a big effect on the median.

There are two formulas to compute the median. The choice of which formula
to use depends on n (number of data points in the sample, or sample size) if
it’s even or odd.

The formula for the median when n is even: median = (x_{n/2} + x_{n/2+1}) / 2, using the sorted values.

When n is even, there is no “middle” data point, so the middle two values are
averaged.

The formula for the median when n is odd: median = x_{(n+1)/2}, the middle of the sorted values.

When n is odd, the middle data point is the median.



Python: np.median([1,2,3,4,5,6]) (n is even). The result is 3.5, the average between 3 and 4 (the two middle points).

R: median(c(1,2,3,4,5,6,7)) (n is odd). The result is 4, the middle point.

Effect of outliers:

The effect of outliers on the median is low. None in this case.

In the graph above, we are using the same data used to calculate the mean. Notice how the median stays the same in the second graph when we replace 10 with 20. This doesn't mean the median will always ignore outliers: with a larger number of values and/or outliers the median can shift, but the influence of a single outlier is low.

Mode

The mode will return you the most commonly occurring data value.

Python: statistics.mode([1,2,2,2,3,3,4,5,6]) The result is 2.

R doesn't have a built-in function that returns the statistical mode directly, but you can do the following to get the frequency of each data value: R: table(c('apple','banana','banana','tomato','orange','orange','banana')) The result is apple: 1, banana: 3, orange: 2, tomato: 1. 'Banana' has the highest frequency with 3 occurrences. A histogram plot of this fruit vector follows below.

Example of a mode using a histogram.

Percentile

The percent of data that is equal to or less than a given data point. It's useful for describing where a data point stands within the data set. If the percentile is close to zero, then the observation is one of the smallest. If the percentile is close to 100, then the data point is one of the largest in the data set.

Python:
from scipy import stats

x = [10, 12, 15, 17, 20, 25, 30]

## In what percentile lies the number 25?
stats.percentileofscore(x, 25)

R:
x <- c(10, 12, 15, 17, 20, 25, 30)

## In what percentile lies the number 25?
ecdf(x)(25)
# result: 0.857 (the 85.7th percentile)

## In what percentile lies the number 12?
ecdf(x)(12)
# result: 0.286 (about the 29th percentile)

Quartiles (five-number summary)

Quartiles measure the center and are also great for describing the spread of the data. They are highly useful for skewed data. The quartiles, combined with the minimum and maximum, compose the five-number summary, which consists of:

1. Minimum

2. 25th percentile (lower quartile)

3. 50th percentile (median)

4. 75th percentile (upper quartile)

5. 100th percentile (maximum)

Python:
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]
minimum = np.min(x)
q1 = np.quantile(x, .25)
median = np.median(x)
q3 = np.quantile(x, .75)
maximum = np.max(x)

R:
x <- c(10, 12, 15, 17, 20, 25, 30)
minimum = min(x)
q1 = quantile(x, .25)
median = median(x)
q3 = quantile(x, .75)
maximum = max(x)
paste(minimum, q1, median, q3, maximum)

## You can also use the function favstats from the mosaic package.
## It will give you the five-number summary, mean, standard deviation,
## sample size and number of missing values.
library(mosaic)
favstats(x)

A boxplot is one good way to plot the five-number summary and explore the
data set.

A boxplot of the ‘mtcars’ data set (mpg x gear).

The bottom end of the boxplot represents the minimum; the first horizontal
line represents the lower quartile; the line inside the square is the median;
the next line is the upper quartile, and the top is the maximum.

Standard Deviation

Standard deviation is extensively used in statistics and data science. It


measures the amount of variation or dispersion of a data set, calculating how
spread out the data are from the mean. Small values mean the data is
consistent and close to the mean. Larger values indicate the data is highly
variable.

Deviation: The idea is to use the mean as a reference point from which everything varies. A deviation is defined as the distance an observation lies from the reference point. This distance is obtained by subtracting the mean (x̄) from the data point (x_i).

The formula to calculate the standard deviation: s = √( ∑_{i=1}^{n} (x_i - x̄)² / (n - 1) )



Calculating the standard deviation: The average of all the deviations will always turn out to be zero, so we square each deviation and sum up the results. Then, we divide by 'n - 1' (called the degrees of freedom). Finally, we take the square root of the result to undo the squaring of the deviations.

The standard deviation is a representation of all deviations in the data. It’s


never negative and it’s zero only if all the values are the same.

Density plot of the Sepal.Width from the Iris data set.

This graph shows the density of Sepal.Width from the Iris data set. The
standard deviation is 0.436. The blue line represents the mean, and the red
lines one and two standard deviations away from the mean. For example, a
Sepal.Width with a value of 3.5 lies 1 standard deviation from the mean.

Python: np.std(x, ddof=1) (for a list or NumPy array, ddof=1 gives the sample standard deviation with the n - 1 denominator used above; the default is the population formula)

R: sd(x)

Effect of outliers: The standard deviation, like the mean, is highly influenced by
outliers. The code below will use R to compare the standard deviation of two
vectors, one without outliers and a second with an outlier.
x <- c(1,2,3,4,5,6,7,8,9,10)
sd(x)
# result: 3.02765

# Replacing 10 by 20:
y <- c(1,2,3,4,5,6,7,8,9,20)

sd(y)
# result: 5.400617

Variance

Variance is almost the same calculation of the standard deviation, but it stays in
squared units. So, if you take the square root of the variance, you have the
standard deviation.

The formula for the variance: s² = ∑_{i=1}^{n} (x_i - x̄)² / (n - 1)

Note that it’s represented by ‘s-squared’, while the standard deviation is


represented by ‘s’.

Python: np.var(x, ddof=1) (ddof=1 matches the n - 1 denominator above)

R: var(x)

Range

The difference between the maximum and minimum values. Useful for some
basic exploratory analysis, but not as powerful as the standard deviation.

The formula for the range: range = max(x) - min(x)

Python: np.max(x) - np.min(x)

R: max(x) - min(x)

Proportion

It's often referred to as a "percentage". It is the fraction (or percent) of observations in the data set that satisfy some condition.
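For instance, with NumPy the proportion of observations above a threshold can be computed directly (a small sketch; the sample and the threshold are chosen only for illustration):

import numpy as np

x = np.array([10, 12, 15, 17, 20, 25, 30])
proportion_above_20 = np.mean(x > 20)    # fraction of observations greater than 20
print(proportion_above_20)               # 2 of 7 values, about 0.286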

Correlation

Defines the strength and direction of the association between two quantitative variables. It ranges between -1 and 1. Positive correlations mean that one variable increases as the other variable increases. Negative correlations mean that one variable decreases as the other increases. When the correlation is zero, there is no linear association at all. The closer the result is to one of the extremes (-1 or 1), the stronger the association between the two variables.

The formula to compute the correlation (Pearson's r): r = ∑ (x_i - x̄)(y_i - ȳ) / √( ∑ (x_i - x̄)² · ∑ (y_i - ȳ)² )

Python: stats.pearsonr(x,y)

R: cor(x,y)

Correlation between MPG and Weight.



The graph is showing the correlation in the mtcars data set between MPG and Weight (-0.87). This is a strong negative correlation, meaning that as the weight increases, the MPG decreases.

These basic summaries are essential as you explore and analyze your data.

Scaling, Normalization, and Standardization


• Scaling features ensures that each feature is given the same consideration during the learning process. Without scaling, features on larger scales could dominate the learning, producing skewed outcomes. Scaling removes this bias and guarantees that each feature contributes fairly to model predictions.

Absolute Maximum Scaling


This method of scaling requires two steps:
1. First, select the maximum absolute value out of all the entries of a particular column.
2. Then divide each entry of the column by this maximum absolute value.

After performing the above two steps, each entry of the column lies in the range -1 to 1. However, this method is not used very often because it is too sensitive to outliers, and outliers are very common when dealing with real-world data.
For the demonstration purpose, we will use the dataset which you can
download from here. This dataset is a simpler version of the original house
price prediction dataset having only two columns from the original dataset.
The first five rows of the original data are shown below:
Python:

import pandas as pd

df = pd.read_csv('SampleFile.csv')
print(df.head())
Now let's apply the first method, absolute maximum scaling. For this, we first evaluate the absolute maximum value of each column.
Python:
import numpy as np
max_vals = np.max(np.abs(df))
max_vals

Now we divide each entry of the data by the corresponding maximum absolute value, which brings every column into the range -1 to 1.
Python:

print(df / max_vals)



Min-Max Scaling
This method of scaling requires the following two steps:
1. First, find the minimum and the maximum value of the column.
2. Then subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value, i.e. x_scaled = (x - min) / (max - min).

As we are using the maximum and the minimum value, this method is also prone to outliers, but after performing the above two steps the data lie in the range 0 to 1.
Python:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

scaled_df.head()
Robust Scaling
In this method of scaling, we use two main statistical measures of the data:
• Median
• Inter-quartile range (IQR)
After calculating these two values, we subtract the median from each entry and then divide the result by the interquartile range: x_scaled = (x - median) / IQR.

Python:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print(scaled_df.head())

Inequalities:

Univariate analysis enables sociologists to explore social inequalities and the distribution of resources or experiences across different groups. For instance, examining the distribution of education levels or income across a population may reveal patterns of social stratification and inequality.

In univariate analysis, several summary measures can be used to describe how unequally the values of a single variable are distributed:

Variance and standard deviation:

These measures quantify the spread or dispersion of values around the mean.

A higher variance or standard deviation suggests greater inequality.

Interquartile range (IQR):

The IQR is the range of the middle 50% of the data. It is calculated as the
difference between the third quartile Q3 and the first quartile Q1. A larger IQR
can indicate more variability in the central part of the distribution.
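A quick way to compute it (a minimal sketch using NumPy on an arbitrary sample):

import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]
q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # spread of the middle 50% of the data
print(q1, q3, iqr)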

Coefficient of variation:

A coefficient of variation, often abbreviated as CV, is a way to measure how spread out values are in a dataset relative to the mean. It is calculated as:

CV = σ / μ

where:

• σ: the standard deviation of the dataset
• μ: the mean of the dataset

In plain English, the coefficient of variation is simply the ratio between the standard deviation and the mean.

The coefficient of variation is often used to compare the variation between two different datasets.

In the real world, it’s often used in finance to compare the mean expected
return of an investment relative to the expected standard deviation of the
investment. This allows investors to compare the risk-return trade-off between
investments.

For example, suppose an investor is considering investing in the following two


mutual funds:

Mutual Fund A: mean = 9%, standard deviation = 12.4%

Mutual Fund B: mean = 5%, standard deviation = 8.2%

Upon calculating the coefficient of variation for each fund, the investor finds:

CV for Mutual Fund A = 12.4% / 9% = 1.38

CV for Mutual Fund B = 8.2% / 5% = 1.64

Since Mutual Fund A has a lower coefficient of variation, it offers a better mean return relative to its standard deviation.

To calculate the coefficient of variation for a dataset in Python, you can use the following syntax:

import numpy as np

cv = lambda x: np.std(x, ddof=1) / np.mean(x) * 100


Note that whether missing values are ignored depends on the input type: for a pandas Series, np.std and np.mean skip NaN values by default, but for a plain list or NumPy array they return NaN (use np.nanstd and np.nanmean in that case).
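Applied to a small illustrative sample, the lambda above returns the CV as a percentage (the values here are invented):

import numpy as np

cv = lambda x: np.std(x, ddof=1) / np.mean(x) * 100

returns = [9, 7, 11, 12, 8, 10]    # illustrative values only
print(cv(returns))                 # standard deviation expressed as a percent of the mean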

Skewness:

Skewness measures the asymmetry of a distribution around its mean.
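A positive value indicates a longer right tail and a negative value a longer left tail. A minimal sketch using SciPy (the sample is invented to show a right-skewed case):

from scipy.stats import skew

x = [1, 2, 2, 3, 3, 3, 10]    # one large value creates a long right tail
print(skew(x))                # positive result -> right (positive) skew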

Lorenz curve:

A Lorenz curve is a graphical representation of income or wealth distribution in a population and is used to measure inequality. It is a fundamental tool for analyzing inequality in income and wealth distribution.
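A Lorenz curve can be sketched by plotting the cumulative share of income against the cumulative share of the population, as in the minimal example below (the income vector is invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

income = np.sort(np.array([2, 4, 6, 10, 30, 48]))                 # sorted incomes (illustrative)
cum_income = np.insert(np.cumsum(income) / income.sum(), 0, 0)    # cumulative income share
cum_people = np.linspace(0, 1, len(income) + 1)                   # cumulative population share

plt.plot(cum_people, cum_income, label="Lorenz curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="Line of perfect equality")
plt.xlabel("Cumulative share of population")
plt.ylabel("Cumulative share of income")
plt.legend()
plt.show()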

Gini coefficient:

The Gini coefficient measures inequality in the income or wealth distribution on a scale from zero to one. A value of zero reflects perfect equality, where all income or assets are equal, while a Gini coefficient of one (or 100%) reflects the greatest inequality.

It is calculated by dividing the area between the Lorenz curve and the line of perfect equality (where income is equally distributed) by the total area under the line of perfect equality.
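An equivalent computational formula, based on mean absolute differences rather than areas, is G = ∑_i ∑_j |x_i - x_j| / (2 n² μ). A minimal sketch (the income vector is the same illustrative one used for the Lorenz curve):

import numpy as np

def gini(values):
    x = np.asarray(values, dtype=float)
    # mean absolute difference over all pairs, normalised by twice the mean
    mean_abs_diff = np.abs(x[:, None] - x[None, :]).mean()
    return mean_abs_diff / (2 * x.mean())

print(gini([2, 4, 6, 10, 30, 48]))   # 0 = perfect equality, values near 1 = high inequality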

Time Series

What is a Time Series?


A time series is a sequence of data points collected, recorded, or measured at
successive, evenly-spaced time intervals.
Each data point represents observations or measurements taken over time, such as stock prices, temperature readings, or sales figures. Time series data is commonly represented graphically as a line plot, with time on the horizontal x-axis and the variable of interest on the vertical y-axis. This representation makes it easier to identify trends, patterns, and fluctuations in the variable over time, aiding in the analysis and interpretation of the data.

Components of Time Series Data


There are four main components of a time series:

1. Trend: Trend represents the long-term movement or directionality of the


data over time. It captures the overall tendency of the series to increase,
decrease, or remain stable. Trends can be linear, indicating a consistent
increase or decrease, or nonlinear, showing more complex patterns.

2. Seasonality: Seasonality refers to periodic fluctuations or patterns that


occur at regular intervals within the time series. These cycles often repeat
annually, quarterly, monthly, or weekly and are typically influenced by
factors such as seasons, holidays, or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the
time series that do not have a fixed period like seasonality. These
fluctuations represent economic or business cycles, which can extend over
multiple years and are often associated with expansions and contractions
in economic activity.
4. Irregularity (or Noise): Irregularity, also known as noise or randomness,
refers to the unpredictable or random fluctuations in the data that cannot
be attributed to the trend, seasonality, or cyclical variations. These
fluctuations may result from random events, measurement errors, or
other unforeseen factors. Irregularity makes it challenging to identify and
model the underlying patterns in the time series data.

Basic Time Series Concepts:


• Moving average: The moving average method is a common technique
used in time series analysis to smooth out short-term fluctuations and
highlight longer-term trends or patterns in the data. It involves calculating
the average of a set of consecutive data points, referred to as a “window”
or “rolling window,” as it moves through the time series
• Noise: Noise, or random fluctuations, represents the irregular and
unpredictable components in a time series that do not follow a discernible
pattern. It introduces variability that is not attributable to the underlying
trend or seasonality.
• Differencing: Differencing replaces each value with its difference from the value a specified interval earlier (by default, one step; other intervals can be specified). It is the most popular method to remove trends from the data.
• Stationarity: A stationary time series is one whose statistical properties,
such as mean, variance, and autocorrelation, remain constant over time.
• Order: The order of differencing refers to the number of times the time
series data needs to be differenced to achieve stationarity.
• Autocorrelation: Autocorrelation is a statistical method used in time series analysis to quantify the degree of similarity between a time series and a lagged version of itself.
• Resampling: Resampling is a technique in time series analysis that involves changing the frequency of the data observations. It's often used to transform the data to a different frequency (e.g., from daily to monthly) to reveal patterns or trends more clearly. A few of these concepts are illustrated in the short sketch below.
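A minimal pandas sketch of a moving average, differencing, resampling, and lag-1 autocorrelation (the index, values, and frequency are invented only for this illustration):

import numpy as np
import pandas as pd

# Illustrative daily series (the dates and values are invented)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
ts = pd.Series(np.random.randn(120).cumsum() + 50, index=idx)

rolling_mean = ts.rolling(window=7).mean()   # 7-day moving average smooths short-term noise
first_diff = ts.diff()                       # first-order differencing helps remove a trend
monthly = ts.resample("M").mean()            # resample from daily to monthly frequency
lag1_autocorr = ts.autocorr(lag=1)           # autocorrelation with a lag of one day

print(monthly.head())
print("Lag-1 autocorrelation:", lag1_autocorr)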

Time Series Visualization


Time series visualization is the graphical representation of data collected over
successive time intervals. It encompasses various techniques such as line
plots, seasonal subseries plots, autocorrelation plots, histograms, and
interactive visualizations. These methods help analysts identify trends,
patterns, and anomalies in time-dependent data for better understanding and
decision-making.
Different Time series visualization graphs
1. Line Plots: Line plots display data points over time, allowing easy
observation of trends, cycles, and fluctuations.
2. Seasonal Plots: These plots break down time series data into seasonal
components, helping to visualize patterns within specific time periods.
3. Histograms and Density Plots: Shows the distribution of data values over
time, providing insights into data characteristics such as skewness and
kurtosis.
4. Autocorrelation and Partial Autocorrelation Plots: These plots visualize
correlation between a time series and its lagged values, helping to identify
seasonality and lagged relationships.
5. Spectral Analysis: Spectral analysis techniques, such as periodograms and
spectrograms, visualize frequency components within time series data,
useful for identifying periodicity and cyclical patterns.
6. Decomposition Plots: Decomposition plots break down a time series into
its trend, seasonal, and residual components, aiding in understanding the
underlying patterns.
These visualization techniques allow analysts to explore, interpret, and
communicate insights from time series data effectively, supporting informed
decision-making and forecasting.

Plotting a Line Plot for Time Series data:

Since the 'High' price column is of a continuous data type, we will use a line graph to visualize it.

# Assuming df is your DataFrame
sns.set(style="whitegrid")      # Setting the style to whitegrid for a clean background

plt.figure(figsize=(12, 6))     # Setting the figure size

sns.lineplot(data=df, x='Date', y='High', label='High Price', color='blue')

# Adding labels and title
plt.xlabel('Date')
plt.ylabel('High')
plt.title('Share Highest Price Over Time')
plt.show()
Output:

Seasonal Decomposition
A statistical technique used in time series analysis to separate the constituent
parts of a dataset is called seasonal decomposition. Three fundamental
components of the time series are identified: trend, seasonality, and
residuals. The long-term movement or direction is represented by the trend,
repeating patterns at regular intervals are captured by seasonality, and
random fluctuations are captured by residuals. By separating the effects of
seasonality from broader trends and anomalies, decomposing a time series
helps to comprehend the specific contributions of various components,
enabling more accurate analysis and predictions.

# Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose

ts = df.set_index('Date')['Precipitation'] + 0.01    # Add a small constant

result = seasonal_decompose(ts, model='multiplicative', period=12)
result.plot()
plt.show()

Output:

For Seasonal Component:


• The upper part of the graph represents the seasonal component.
• The x-axis corresponds to the time, usually in months given the specified
period=12.
• The y-axis represents the magnitude of the seasonal variations.
For Trend Component:
• The middle part of the graph represents the trend component.
• The x-axis corresponds to the time, reflecting the overall trend across the
entire time series.
• The y-axis represents the magnitude of the trend.
For Residual Component:
• The bottom part of the graph represents the residual component (also
known as the remainder).
• The x-axis corresponds to the time.
• The y-axis represents the difference between the observed values and the
sum of the seasonal and trend components.
Using a time series (‘ts’) that represents precipitation data, the function
performs seasonal decomposition. A little constant is added to account for
possible problems with zero or negative values. The code consists of three
parts: trend, seasonality, and residuals, and it uses a multiplicative
model with a 12-month seasonal period. The resulting graphic helps identify
long-term trends and recurrent patterns in the precipitation data by providing
a visual representation of these elements.
Autocorrelation and Partial Autocorrelation Plots
Autocorrelation: Autocorrelation measures a time series' association with lagged versions of itself. Peaks in an autocorrelation plot indicate high correlation at particular lags. By revealing recurring patterns or seasonality in the time series data, this aids in understanding its temporal structure and supports the choice of suitable model parameters for time series analysis.
Partial Autocorrelation: When measuring a variable’s direct correlation with
its lags, partial autocorrelation eliminates the impact of intermediate delays.
Significant peaks in a Partial Autocorrelation Function (PACF) plot indicate
that a particular lag has a direct impact on the current observation. It helps to
capture the distinct contribution of each lag by assisting in the appropriate
ordering of autoregressive components in time series modeling.
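statsmodels provides ready-made plots for both; a minimal sketch, assuming ts is a time-ordered pandas Series such as the precipitation series used above:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# ts is assumed to be a time-ordered pandas Series
plot_acf(ts, lags=24)    # autocorrelation for lags up to 24
plot_pacf(ts, lags=24)   # partial autocorrelation for lags up to 24
plt.show()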

What is Exponential Smoothing?


• Exponential smoothing is a forecasting technique that applies exponentially decreasing weights to past data. It is therefore quite sensitive to the latest changes in the data, which makes the method useful when patterns change over time.
• Exponential smoothing is preferred because recent data receive more weight, which gives a more realistic picture of the current situation and reduces lag. This gives the method a clear advantage in most forecasting exercises.

Exponential Smoothing Forecasting


Time series methods follow the assumption that a forecast is a linear sum of
all past observations or delays. Exponential smoothing gives more weight to
the most recent observations and reduces exponentially as the distance from
the observations rises, with the premise that the future will be similar to the
recent past. The word “exponential smoothing” refers to the fact that each
demand observation is assigned an exponentially diminishing weight.
• This technique captures the general pattern and can be expanded to include trends and seasonal variations, allowing for precise time series forecasts using past data.
• Its long-term forecasts, however, tend to be less accurate.
• It works well when the parameters of the time series change gradually over time.

Types of Exponential Smoothing


There are three main types of Exponential Smoothing methods, each
designed to handle different data patterns:
1. Single Exponential Smoothing (SES): Suitable for data without trend or
seasonality.
2. Double Exponential Smoothing (DES): Suitable for data with a trend but
no seasonality.
3. Triple Exponential Smoothing (TES) or Holt-Winters Method: Suitable for
data with both trend and seasonality.

1. Simple or Single Exponential smoothing


Simple smoothing is a method of forecasting a univariate time series without a trend or seasonality. It uses a single parameter, referred to as alpha (α) or the smoothing factor, which controls how quickly the influence of past observations decays. The weight given to the current observation versus the previous smoothed estimate depends on α: a smaller value of α puts more weight on the past prediction, and vice versa. The range of this parameter is typically 0 to 1.
The formula for simple smoothing is as follows:
s_t = α x_t + (1 - α) s_{t-1} = s_{t-1} + α(x_t - s_{t-1})
where,
• s_t = smoothed statistic (a weighted average of the current observation x_t and the previous smoothed statistic)
• s_{t-1} = previous smoothed statistic
• α = smoothing factor of the data; 0 < α < 1
• t = time period
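As a quick worked example, take α = 0.5 and observations x = 10, 12, 14, initialising s_1 = x_1 (a common convention): s_1 = 10, s_2 = 0.5·12 + 0.5·10 = 11, and s_3 = 0.5·14 + 0.5·11 = 12.5, so each new smoothed value is pulled halfway toward the latest observation.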
2. Double Exponential Smoothing
Double exponential smoothing, also known as Holt's trend model, second-order smoothing, or Holt's linear smoothing, is a smoothing method used to forecast a time series when the data has a trend but no seasonal pattern. The fundamental idea behind double exponential smoothing is to add a term that takes into account the possibility that the series shows a trend.
Double exponential smoothing requires more than just an alpha parameter. It also requires a beta (β) factor to control the decay of the effect of change in the trend. The smoothing method supports both additive and multiplicative trends.
The formulas for double exponential smoothing are as follows:
s_t = α x_t + (1 - α)(s_{t-1} + b_{t-1})
b_t = β(s_t - s_{t-1}) + (1 - β) b_{t-1}
where,
• b_t = best estimate of the trend at time t
• β = trend smoothing factor; 0 < β < 1
3. Holt-Winters’ exponential smoothing
Triple exponential smoothing (also known as Holt-Winters smoothing) is a
smoothing method used to predict time series data with both a trend and
seasonal component.
This is the most advanced variation of smoothing. It is used for forecasting
time series when the data contains linear trends and seasonality.
The technique uses exponential smoothing applied three times:
• Level smoothing
• Trend smoothing

• Seasonal smoothing
A new smoothing parameter, gamma (γ), is used to control the effect of the seasonal component.
Exponential smoothing with seasonality can be divided into two categories, depending on the type of seasonality: the Holt-Winters additive method is used for additive seasonality, and the Holt-Winters multiplicative method is used for multiplicative seasonality.
The smoothing method uses three parameters:
• (α) the level (intercept),
• (β) the trend, and
• (γ ) the seasonal component.
The formulas for triple exponential smoothing are as follows:
s_0 = x_0
s_t = α (x_t / c_{t-L}) + (1 - α)(s_{t-1} + b_{t-1})
b_t = β(s_t - s_{t-1}) + (1 - β) b_{t-1}
c_t = γ (x_t / s_t) + (1 - γ) c_{t-L}
where:
• s_t = smoothed statistic; a weighted average of the current observation x_t and the previous smoothed statistic
• s_{t-1} = previous smoothed statistic
• α = smoothing factor of data (0 < α < 1)
• t = time period
• b_t = best estimate of a trend at time t
• β = trend smoothing factor (0 < β <1)
• c_t = seasonal component at time t
• γ = seasonal smoothing parameter (0 < γ < 1)
The Holt-Winters method is the most precise of the three, but it is also the
most complicated. It involves more data and more calculations than the
others.
Exponential smoothing in Python
Python has several exponential smoothing libraries, such as Pandas,
Statsmodels, Prophet, etc. These libraries offer different functions and
methods to implement different types of smoothing methods.
The dataset
For the sake of completeness, we are going to use a dataset called
AirPassengers. AirPassengers is a time-series dataset that includes the
monthly passenger numbers of airlines in the years 1949 to 1960.
Dataset link: AirPassengers.csv

Setting up the environment


First, let’s set up our environment. We’ll be using Python 3, so make sure
you’ve got it installed.
Next, install these libraries using pip:
pip install pandas matplotlib
Once you’ve installed the necessary libraries, you can import them into your
Python script:
Python

import pandas as pd
import matplotlib.pyplot as plt

Loading the data


After setting up the environment, we can load the AirPassengers dataset into a pandas DataFrame using the read_csv function. We can then inspect the first few rows of the DataFrame using the head function:
Python

data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],


index_col='Month')
print(data.head())
Output:
#Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
Visualizing the data
Before we apply simple exponential smoothing to the data, let’s visualize it to
get a better understanding of its properties. We can use the plot function of
pandas to create a line plot of the data:
Python

plt.plot(data)
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.show()

Output:

We can see that the number of passengers appears to be increasing over


time, with some seasonality as well.
Single Exponential smoothing
Now that we’ve loaded and visualized the data, we can perform simple
exponential smoothing using the SimpleExpSmoothing function from the
statsmodels library.
Then, we'll create an instance of the SimpleExpSmoothing class, passing in the data as an argument, and fit the model to the data using the fit method. This will calculate the smoothing parameters and fit the model to the data.
Python

from statsmodels.tsa.api import SimpleExpSmoothing


model = SimpleExpSmoothing(data)
model_single_fit = model.fit()
Making predictions
Finally, we can use the forecast method of the model to make predictions for
future values of the time series, where the argument represents the number
of periods to forecast. This will produce a forecast for the next six months:
Python

forecast_single = model_single_fit.forecast(6)
print(forecast_single)
Output:
1961-01-01 431.791781
1961-02-01 431.791781
1961-03-01 431.791781
1961-04-01 431.791781

1961-05-01 431.791781
1961-06-01 431.791781
Freq: MS, dtype: float64
Visualize Single Exponential Smoothing
Let’s set the forecast to 40, and check the trend for next 40 months.
Python

forecast_single = model_single_fit.forecast(40)
Now, let's visualize
Python

plt.plot(data, label='Original Data')


plt.plot(model_single_fit.fittedvalues, label='Fitted Values')
plt.plot(forecast_single, label='Forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.title('Single Exponential Smoothing')
plt.legend()
plt.show()
Output:

Double Exponential Smoothing


Double exponential smoothing, also known as Holt’s method, extends single
exponential smoothing to capture trends in the data. It involves forecasting
both the level and trend components of the time series.
Now, let’s write the code to perform double exponential smoothing (Holt’s
method) using the Holt function from the statsmodels library:
• Create an instance of Holt class
• Fit the model to the data
Python

from statsmodels.tsa.api import Holt

model_double = Holt(data)
model_double_fit = model_double.fit()
Making predictions
Python

forecast_double = model_double_fit.forecast(6)
print(forecast_double)
Output:
1961-01-01 436.196220
1961-02-01 440.578651
1961-03-01 444.961083
1961-04-01 449.343515
1961-05-01 453.725946
1961-06-01 458.108378
Freq: MS, dtype: float64
Visualize Double Exponential Smoothing
Let’s set the forecast to 40, and check the trend for next 40 months, same as
earlier.
Python

forecast_double = model_double_fit.forecast(40)
Now, let’s visualize
Python

plt.plot(data, label='Original Data')


plt.plot(model_double_fit.fittedvalues, label='Fitted Values')
plt.plot(forecast_double, label='Forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')

plt.title('Double Exponential Smoothing')


plt.legend()
plt.show()
Output:
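Triple Exponential Smoothing

Triple exponential smoothing (the Holt-Winters method) adds a seasonal component to Holt's method. A minimal fitting sketch using statsmodels' ExponentialSmoothing class, assuming an additive trend and a 12-month multiplicative season for the AirPassengers data:
Python

from statsmodels.tsa.api import ExponentialSmoothing

model_triple = ExponentialSmoothing(data, trend='add', seasonal='mul', seasonal_periods=12)
model_triple_fit = model_triple.fit()

forecast_triple = model_triple_fit.forecast(6)
print(forecast_triple)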

Visualize Triple Exponential Smoothing


Let’s set the forecast to 40, and check the trend for next 40 months, same as
earlier.
Python

forecast_triple = model_triple_fit.forecast(40)
Now, let’s visualise
Python

plt.plot(data, label='Original Data')


plt.plot(model_triple_fit.fittedvalues, label='Fitted Values')
plt.plot(forecast_triple, label='Forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.title('Triple Exponential Smoothing')
plt.legend()
plt.show()
Output:

When to use Exponential Smoothing


The selection of an exponential smoothing method is dependent on the
properties of the time series and the forecasting needs.
1. Simple Exponential Smoothing (SES):
SES best suits time series data with no trend and no seasonality. It is a basic method that can be applied when there are no systematic trends or anomalies and a straightforward forecast based on the last observation and the preceding forecast is enough. Because SES is computationally light and simple to set up, it is ideal for forecasting in real time or when data are scarce.
Benefits of Exponential Smoothing
Analysts can modify the rate at which older observations become less
significant in the computations by varying the values of these parameters. As
a result, analysts can adjust the weighting of recent observations in relation
to previous observations to suit the needs of their field.
On the other hand, the moving average approach assigns 0 weight to
observations outside of the moving average window and assigns equal weight
to all historical observations when they occur within its frame. Because
exponential smoothing models error, trend, and seasonality in time series
data, statisticians refer to it as an ETS model, just like they do with the Box-
Jenkins ARIMA methodology.
Limitations of Exponential Smoothing
However, there are some drawbacks to exponential smoothing.

For example, it may not work well for time series with complex patterns or
anomalies, such as sudden level or trend changes, outliers or sudden
seasonality.
In these cases, other sophisticated forecasting techniques may be more
suitable.
Also, the selection of smoothing parameters, such as α, β, and γ, can affect the precision of the forecasts. Finding the best values for these parameters might require some trial and error or model selection techniques.

ARIMA Model for Time Series Forecasting


ARIMA stands for autoregressive integrated moving average model and is
specified by three order parameters: (p, d, q).
• AR(p) Autoregression – a regression model that utilizes the dependent relationship between a current observation and observations over a previous period. An autoregressive (AR(p)) component refers to the use of past values in the regression equation for the time series.
• I(d) Integration – uses differencing of observations (subtracting an
observation from observation at the previous time step) in order to make
the time series stationary. Differencing involves the subtraction of the
current values of a series with its previous values d number of times.
• MA(q) Moving Average – a model that uses the dependency between an
observation and a residual error from a moving average model applied to
lagged observations. A moving average component depicts the error of
the model as a combination of previous error terms. The
order q represents the number of terms to be included in the model.
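A minimal fitting sketch with statsmodels, assuming the AirPassengers data loaded earlier and an illustrative (not tuned) order of (2, 1, 2):
Python

from statsmodels.tsa.arima.model import ARIMA

# 'data' is the AirPassengers DataFrame loaded earlier; the order (p, d, q) is illustrative
model_arima = ARIMA(data, order=(2, 1, 2))
model_arima_fit = model_arima.fit()

print(model_arima_fit.summary())
print(model_arima_fit.forecast(6))   # forecast the next six months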
