Unit 3
Univariate data:
Univariate data refers to a type of data in which each observation or data
point corresponds to a single variable. In other words, it involves the
measurement or observation of a single characteristic or attribute for each
individual or item in the dataset. Analyzing univariate data is the simplest
form of analysis in statistics.
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186
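As a quick illustration, here is a minimal sketch (assuming pandas) that summarizes this univariate sample:
Python:
import pandas as pd

heights = pd.Series([164, 167.3, 170, 174.2, 178, 180, 186])
print(heights.describe())   # count, mean, std, min, quartiles and max of the single variable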
A distribution describes the different ways the data values can occur: how often specific values appear, where the values are concentrated, and which values are outliers. Exploring the data distribution therefore means using graphical methods to organize and display this information.
Terms related to Exploration of Data Distribution
-> Boxplot
-> Frequency Table
-> Histogram
-> Density Plot
• Boxplot : It is based on the percentiles of the data, as shown in the figure below. The top and bottom of the box are the 75th and 25th percentiles of the data, and the extended lines, known as whiskers, cover the range of the rest of the data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt   # needed for the figure objects used below

data = pd.read_csv("../data/state.csv")
data['PopulationInMillions'] = data['Population'] / 1000000
print(data.head(10))
Code #3 : BoxPlot
fig, ax = plt.subplots()
fig.set_size_inches(9, 15)
sns.boxplot(y=data.PopulationInMillions, ax=ax)   # vertical box plot of population in millions

Code #4 : Frequency Table
data['PopulationInMillionsBins'] = pd.cut(data.PopulationInMillions, 10)   # bin the column into 10 intervals
print(data.head(10))
# List the state abbreviations that fall in each population bin
print(data.groupby(data.PopulationInMillionsBins)['Abbreviation'].apply(', '.join))
# Histogram of Population In Millions
fig, ax2 = plt.subplots()
fig.set_size_inches(9, 15)
sns.histplot(data.PopulationInMillions, ax=ax2)

Density Plot:
fig, ax3 = plt.subplots()
fig.set_size_inches(7, 9)
sns.kdeplot(data.PopulationInMillions, ax=ax3)   # column assumed; the original figure was labelled "Death rate"
The summary statistics commonly used to describe univariate data are:
1. Mean
2. Median
3. Mode
4. Percentile
5. Quartiles
6. Standard Deviation
7. Variance
8. Range
9. Proportion
10. Correlation
Mean
This is the point of balance, describing the most typical value for normally
distributed data. I say “normally distributed” data because the mean is highly
influenced by outliers.
The mean adds up all the data values and divides by the total number of values, as follows:
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
The 'x-bar' represents the sample mean (the mean of a sample of data). '∑' (sigma) denotes the addition of all values from 'i = 1' up to 'i = n' ('n' is the number of data values). The result is then divided by 'n'.
Effect of outliers:
The first plot ranges from 1 to 10, and its mean is 5.5. When we replace 10 with 20, the average increases to 6.5. In the next concept we will go over the median, which is a much better choice when outliers are present.
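A quick sketch (assuming numpy) that reproduces those numbers:
Python:
import numpy as np

x = np.arange(1, 11)        # the values 1 to 10
print(np.mean(x))           # 5.5
y = np.append(x[:-1], 20)   # replace 10 with 20
print(np.mean(y))           # 6.5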
Median
This is the “middle data point”, where half of the data is below the median and
half is above the median. It’s the 50th percentile of the data (we will
cover percentile later in this article). It’s also mostly used with skewed data
because outliers won’t have a big effect on the median.
There are two formulas to compute the median. Which one to use depends on whether n (the number of data points in the sample, i.e. the sample size) is even or odd.
When n is even, there is no “middle” data point, so the middle two values are
averaged.
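Written out for a sorted sample x_1 ≤ x_2 ≤ … ≤ x_n:
median = x_{(n+1)/2} when n is odd
median = (x_{n/2} + x_{n/2 + 1}) / 2 when n is even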
Effect of outliers:
In the graph above, we are using the same data used to calculate the mean.
Notice how the median stays the same in the second graph when we replace
10 with 20. This doesn't mean the median always ignores outliers: with more numerous or more extreme outliers, the median could shift, but the influence of a single outlier is small.
Mode
The mode returns the most commonly occurring data value.
R doesn't give you the mode directly, but you can get the frequency of each data value as follows:
R: table(c('apple', 'banana', 'banana', 'tomato', 'orange', 'orange', 'banana'))
The result is apple: 1, banana: 3, orange: 2, tomato: 1. 'Banana' has the highest frequency, with 3 occurrences. A histogram plot of this fruit vector follows below.
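For comparison, a minimal Python sketch (assuming pandas) that returns the frequencies and the mode directly:
Python:
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'banana', 'tomato', 'orange', 'orange', 'banana'])
print(fruits.value_counts())   # frequency of each value
print(fruits.mode())           # 'banana', the most commonly occurring value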
Percentile
The percentile of a data point is the percent of data that is equal to or less than that point. It is useful for describing where a data point stands within the data set. If the percentile is close to zero, the observation is one of the smallest; if the percentile is close to 100, the data point is one of the largest in the data set.
Python:
from scipy import stats

x = [10, 12, 15, 17, 20, 25, 30]
## In what percentile lies the number 25?
stats.percentileofscore(x, 25)
R:
library(stats)

x <- c(10, 12, 15, 17, 20, 25, 30)
## In what percentile lies the number 25?
ecdf(x)(25)
# result: 0.857 (the 85.7th percentile)
## In what percentile lies the number 12?
ecdf(x)(12)
# result: 0.29
Quartiles
Quartiles measure the center and are also great for describing the spread of the data; they are highly useful for skewed data. The quartiles, combined with the minimum and the maximum, compose the five-number summary:
1. Minimum
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Maximum
Python:
import numpy as np
x = [10, 12, 15, 17, 20, 25, 30]
minimum = np.min(x)
q1 = np.quantile(x, .25)
median = np.median(x)
q3 = np.quantile(x, .75)
maximum = np.max(x)
print(minimum, q1, median, q3, maximum)
R:
x <- c(10, 12, 15, 17, 20, 25, 30)
min <- min(x)
q1 <- quantile(x, .25)
median <- median(x)
q3 <- quantile(x, .75)
max <- max(x)
paste(min, q1, median, q3, max)
## You can also use the function favstats from the mosaic package.
## It will give you the five-number summary, mean, standard deviation, sample size and number of missing values.
library(mosaic)
favstats(x)
A boxplot is one good way to plot the five-number summary and explore the
data set.
The bottom end of the boxplot represents the minimum; the first horizontal line is the lower quartile; the line inside the box is the median; the next line is the upper quartile; and the top is the maximum.
Standard Deviation
Deviation: The idea is to use the mean as a reference point from which everything varies. A deviation is the distance an observation lies from that reference point, obtained by subtracting the mean (x-bar) from the data point (xi).
Calculating the standard deviation: The average of all the deviations always turns out to be zero, so we square each deviation and sum up the results. Then we divide by 'n - 1' (the degrees of freedom). Finally, we take the square root of the result to undo the squaring of the deviations.
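Written as a formula, the sample standard deviation is:
s = \sqrt{ \frac{ \sum_{i=1}^{n} (x_i - \bar{x})^2 }{ n - 1 } }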
This graph shows the density of Sepal.Width from the Iris data set. The
standard deviation is 0.436. The blue line represents the mean, and the red
lines one and two standard deviations away from the mean. For example, a
Sepal.Width with a value of 3.5 lies 1 standard deviation from the mean.
Python: np.std(x, ddof=1) (ddof=1 divides by n - 1, giving the sample standard deviation, matching R)
R: sd(x)
Effect of outliers: The standard deviation, like the mean, is highly influenced by
outliers. The code below will use R to compare the standard deviation of two
vectors, one without outliers and a second with an outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sd(x)
# result: 3.02765

# Replacing 10 by 20:
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20)
sd(y)
# result: 5.400617
Variance
Variance is almost the same calculation of the standard deviation, but it stays in
squared units. So, if you take the square root of the variance, you have the
standard deviation.
Python: np.var(x, ddof=1) (ddof=1 gives the sample variance, matching R)
R: var(x)
Range
The difference between the maximum and minimum values. Useful for some
basic exploratory analysis, but not as powerful as the standard deviation.
R: max(x) - min(x)
Proportion
The fraction of observations that fall into a given category, i.e. the count in that category divided by the total number of observations.
Correlation
Defines the strength and direction of the linear association between two quantitative variables. It ranges between -1 and 1. A positive correlation means that one variable increases as the other increases; a negative correlation means that one variable decreases as the other increases. When the correlation is zero, there is no linear association at all. The closer the result is to either extreme, the stronger the association between the two variables.
Python: stats.pearsonr(x,y)
R: cor(x,y)
These basic summaries are essential as you explore and analyze your data.
Feature Scaling
• Scaling features ensures that each feature is given the same consideration during the learning process. Without scaling, features on larger scales could dominate the learning and produce skewed outcomes. Scaling removes this bias and guarantees that each feature contributes fairly to model predictions.
Absolute Maximum Scaling
This method requires two steps: first, find the maximum absolute value of each column; then divide every entry of the column by that value. After performing these two steps, each entry of the column lies in the range -1 to 1. However, this method is not used that often because it is too sensitive to outliers, and outliers are very common in real-world data.
For demonstration purposes, we will use a dataset which you can download from here. This dataset is a simplified version of the original house price prediction dataset, containing only two of its columns.
The first five rows of the original data are shown below:
Python:
import pandas as pd

df = pd.read_csv('SampleFile.csv')
print(df.head())
Now let's apply the first method, absolute maximum scaling. For this, we first evaluate the maximum absolute value of each column.
Python:
max_vals = df.abs().max()   # maximum absolute value of each column
print(max_vals)
Now we divide every entry of each column by that column's maximum absolute value, which scales the data into the range -1 to 1, as in the sketch below (the original snippet is not shown).
Python:
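# A minimal sketch of absolute maximum scaling:
scaled_df = df / max_vals   # divide every column by its maximum absolute value
print(scaled_df.head())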
Min-Max Scaling
This method of scaling requires the two steps below:
1. First, we are supposed to find the minimum and the maximum value of
the column.
2. Then we will subtract the minimum value from the entry and divide the
result by the difference between the maximum and the minimum
value.
Because it uses the maximum and the minimum values, this method is also prone to outliers, but after performing the above two steps the data will lie in the range 0 to 1.
Python:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()
Robust Scaling
In this method of scaling, we use two main statistical measures of the data.
• Median
• Inter-Quartile Range
After calculating these two values, we subtract the median from each entry and then divide the result by the interquartile range.
Python:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Inequalities:
These measures quantify the spread or dispersion of values around the mean.
The IQR is the range of the middle 50% of the data. It is calculated as the
difference between the third quartile Q3 and the first quartile Q1. A larger IQR
can indicate more variability in the central part of the distribution.
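As a quick sketch (assuming numpy, and reusing the earlier sample), the IQR can be computed as:
Python:
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)   # the interquartile range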
Coefficient of variation:
CV = σ / μ
where:
• σ = the standard deviation of the dataset
• μ = the mean of the dataset
In plain English, the coefficient of variation is simply the ratio between the standard deviation and the mean.
In the real world, it’s often used in finance to compare the mean expected
return of an investment relative to the expected standard deviation of the
investment. This allows investors to compare the risk-return trade-off between
investments.
For example, suppose an investor is comparing two mutual funds, A and B. Upon calculating the coefficient of variation for each fund, the investor finds that Mutual Fund A has the lower coefficient of variation, so it offers a better mean return relative to its standard deviation.
To calculate the coefficient of variation for a dataset in Python, you can use the following syntax:
import numpy as np
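# completing the snippet (the rest of the original cell is not shown); x is an assumed sample
x = [10, 12, 15, 17, 20, 25, 30]
cv = np.std(x, ddof=1) / np.mean(x)   # CV = standard deviation / mean (multiply by 100 for a percentage)
print(cv)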
Skewness:
Skewness measures the asymmetry of a distribution around its mean: a positive skew has a longer right tail, a negative skew has a longer left tail, and a symmetric distribution has zero skewness.
Lorenz curve:
The Lorenz curve plots the cumulative share of total income (or wealth) held by the poorest x% of the population; under perfect equality it coincides with the 45-degree line of equality.
Gini coefficient:
It is calculated by dividing the area between the Lorenz curve and the line of perfect equality (where income is equally distributed) by the total area under the line of perfect equality.
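As an illustration of that definition, a minimal sketch (assuming numpy; the income vector is hypothetical) that approximates the Gini coefficient from the Lorenz curve:
Python:
import numpy as np

def gini(incomes):
    # Sort the incomes and build the Lorenz curve: cumulative share of total income
    x = np.sort(np.asarray(incomes, dtype=float))
    lorenz = np.insert(np.cumsum(x) / x.sum(), 0, 0)
    # Area under the Lorenz curve via the trapezoidal rule on an evenly spaced grid
    area_under_lorenz = np.trapz(lorenz, dx=1 / len(x))
    # Gini = (area between equality line and Lorenz curve) / (area under equality line, i.e. 0.5)
    return 1 - 2 * area_under_lorenz

print(gini([20, 30, 40, 110]))   # values closer to 1 indicate greater inequality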
Time Series
Seasonal Decomposition
A statistical technique used in time series analysis to separate the constituent
parts of a dataset is called seasonal decomposition. Three fundamental
components of the time series are identified: trend, seasonality, and
residuals. The long-term movement or direction is represented by the trend,
repeating patterns at regular intervals are captured by seasonality, and
random fluctuations are captured by residuals. By separating the effects of
seasonality from broader trends and anomalies, decomposing a time series
helps to comprehend the specific contributions of various components,
enabling more accurate analysis and predictions.
# Seasonal decomposition
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

ts = df.set_index('Date')['Precipitation'] + 0.01   # small offset avoids zeros for a multiplicative model
result = seasonal_decompose(ts, model='multiplicative', period=12)   # model and period are assumptions
result.plot()
plt.show()
Output:
• Seasonal smoothing
A new smoothing parameter, gamma (γ), is used to control the effect of the seasonal component.
Triple exponential smoothing can be divided into two categories, depending on the seasonality: the Holt-Winters additive method is used for additive seasonality, and the Holt-Winters multiplicative method is used for multiplicative seasonality.
The smoothing method uses three parameters:
• (α) the level (intercept),
• (β) the trend, and
• (γ ) the seasonal component.
The formulas for the triple exponential smoothing are as follows:
s_0 = x_0
s_t = \alpha (x_t / c_{t-L}) + (1 - \alpha)(s_{t-1} + b_{t-1})
b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}
c_t = \gamma (x_t / s_t) + (1 - \gamma) c_{t-L}
where:
• s_t = smoothed statistic; the weighted average of the current observation x_t
• s_{t-1} = previous smoothed statistic
• α = smoothing factor of data (0 < α < 1)
• t = time period
• b_t = best estimate of a trend at time t
• β = trend smoothing factor (0 < β <1)
• c_t = seasonal component at time t
• γ = seasonal smoothing parameter (0 < γ < 1)
• L = length of the seasonal cycle (the number of periods in one season)
The Holt-Winters method is the most precise of the three, but it is also the
most complicated. It involves more data and more calculations than the
others.
Exponential smoothing in Python
Python has several exponential smoothing libraries, such as Pandas,
Statsmodels, Prophet, etc. These libraries offer different functions and
methods to implement different types of smoothing methods.
The dataset
For the sake of completeness, we are going to use a dataset called
AirPassengers. AirPassengers is a time-series dataset that includes the
monthly passenger numbers of airlines in the years 1949 to 1960.
Dataset link: AirPassengers.csv
import pandas as pd
import matplotlib.pyplot as plt

# Load the data; the standard AirPassengers CSV layout (a 'Month' column plus one count column) is assumed
data = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True).squeeze('columns')
data.index.freq = 'MS'   # monthly start frequency, so forecast dates are labelled correctly

plt.plot(data)
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.show()
Output:
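Single Exponential Smoothing
The forecast below relies on a fitted single exponential smoothing model; a minimal sketch of that fit (assuming statsmodels' SimpleExpSmoothing):
Python
from statsmodels.tsa.api import SimpleExpSmoothing

model_single = SimpleExpSmoothing(data)
model_single_fit = model_single.fit()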
forecast_single = model_single_fit.forecast(6)
print(forecast_single)
Output:
1961-01-01 431.791781
1961-02-01 431.791781
1961-03-01 431.791781
1961-04-01 431.791781
1961-05-01 431.791781
1961-06-01 431.791781
Freq: MS, dtype: float64
Visualize Single Exponential Smoothing
Let's set the forecast to 40, and check the trend for the next 40 months.
Python
forecast_single = model_single_fit.forecast(40)
Now, let’s Visualize
Python
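# The original plotting cell is not shown; a minimal sketch of the comparison plot:
plt.figure(figsize=(10, 5))
plt.plot(data, label='Observed')
plt.plot(forecast_single, label='Single exponential smoothing forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

Double Exponential Smoothing
Next, a double exponential smoothing (Holt) model is fitted:
Python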
from statsmodels.tsa.api import Holt

model_double = Holt(data)
model_double_fit = model_double.fit()
Making predictions
Python
forecast_double = model_double_fit.forecast(6)
print(forecast_double)
Output:
1961-01-01 436.196220
1961-02-01 440.578651
1961-03-01 444.961083
1961-04-01 449.343515
1961-05-01 453.725946
1961-06-01 458.108378
Freq: MS, dtype: float64
Visualize Double Exponential Smoothing
Let's set the forecast to 40, and check the trend for the next 40 months, same as earlier.
Python
forecast_double = model_double_fit.forecast(40)
Now, let’s visualize
Python
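# A sketch of the plot (the original cell is not shown):
plt.figure(figsize=(10, 5))
plt.plot(data, label='Observed')
plt.plot(forecast_double, label='Double exponential smoothing forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

# The forecast below needs a fitted Holt-Winters (triple exponential smoothing) model, which is
# not shown above. A minimal sketch, with the trend and seasonal settings assumed:
from statsmodels.tsa.api import ExponentialSmoothing

model_triple = ExponentialSmoothing(data, trend='add', seasonal='mul', seasonal_periods=12)
model_triple_fit = model_triple.fit()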
forecast_triple = model_triple_fit.forecast(40)
Now, let’s visualise
Python
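# A sketch of the final plot (assumed to mirror the earlier ones):
plt.figure(figsize=(10, 5))
plt.plot(data, label='Observed')
plt.plot(forecast_triple, label='Triple exponential smoothing (Holt-Winters) forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()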
Exponential smoothing also has limitations. For example, it may not work well for time series with complex patterns or anomalies, such as sudden level or trend changes, outliers, or abrupt changes in seasonality.
In these cases, other sophisticated forecasting techniques may be more
suitable.
Also, the selection of the smoothing parameters alpha (α), beta (β) and gamma (γ) can affect the precision of the forecasts. Finding the best values for these parameters may require some trial and error or model selection techniques.