dsa unit 2
Time Series Analysis
• Time Series Analysis comprises
methods for analyzing time-series data
to extract meaningful
statistics, rules and patterns.
• These rules and patterns might be used
to build forecasting models that are
able to predict future developments.
• Does the database play a vital role in time series
mining?
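from statsmodels.tsa.stattools import adfuller

def check_stationarity(ts):
    # Augmented Dickey-Fuller test; the null hypothesis is that the series has a unit root (is non-stationary)
    dftest = adfuller(ts)
    adf = dftest[0]                    # test statistic
    pvalue = dftest[1]                 # p-value of the test
    critical_value = dftest[4]['5%']   # critical value at the 5% level
    # Reject the null hypothesis only if both checks agree
    if (pvalue < 0.05) and (adf < critical_value):
        print('The series is stationary')
    else:
        print('The series is NOT stationary')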
3. Moving Average
• Finally, the third component is the ‘moving
average’, which also uses past information but
in a different way.
• How is this different from autoregression?
• While autoregression uses past values of
the time series, the moving average uses the model's
past errors as information.
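To make the contrast concrete, the two components are commonly written as shown below. The notation is the usual textbook one and is an assumption here, not something defined on these slides: y_t is the series, ε_t is the model's error at time t, c and μ are constants, and φ_i and θ_j are the coefficients the model estimates.

```latex
% AR(p): the current value depends on the p previous VALUES of the series
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t

% MA(q): the current value depends on the q previous ERRORS of the model
y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```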
In summary, let’s recap with the following
visualization:
Model selection in ARIMA
The ARIMA model also has three components: p,
d, and q, which stand for "autoregressive",
"differencing", and "moving average",
respectively.
• The p-component (AR) measures the correlation
between the current value of a time series and
the values that came before it.
• The d-component (I) represents the number of
times the series needs to be differenced to
make it stationary.
• The q-component (MA) measures the correlation
between the current value of a time series and
the model's past errors.
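The role of d can be illustrated with a small sketch of differencing in plain Python. The helper name `difference` and the sample numbers are made up for illustration; in practice a library such as pandas or statsmodels would normally handle this step.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_(t-1)."""
    result = list(series)
    for _ in range(d):
        # Each pass shortens the series by one observation.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Hypothetical series with a quadratic trend (t squared), not real data.
series = [1, 4, 9, 16, 25, 36]

once = difference(series, d=1)   # a linear trend remains
twice = difference(series, d=2)  # constant values: the trend is gone

print(once)   # [3, 5, 7, 9, 11]
print(twice)  # [2, 2, 2, 2]
```

In this toy case one round of differencing still leaves a trend, while two rounds give a constant series, so d = 2 would be the natural choice.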
SARIMAX forecasters
SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous
Regressors) is a generalization of the ARIMA model that considers both seasonality and
exogenous variables. SARIMAX models are among the most widely used statistical models
for forecasting, with excellent forecasting performance.
In the SARIMAX model notation, the parameters p, d, and q represent the autoregressive,
differencing, and moving-average components, respectively. P, D, and Q denote the same
components for the seasonal part of the model, with m representing the number of periods in
each season.
• p is the order (number of time lags) of the autoregressive part of the model.
• d is the degree of differencing (the number of times the data have had past values
subtracted).
• q is the order of the moving-average part of the model.
• P is the order (number of time lags) of the seasonal autoregressive part of the model.
• D is the degree of differencing (the number of times the data have had past values
subtracted) of the seasonal part of the model.
• Q is the order of the moving-average part of the seasonal component of the model.
• m refers to the number of periods in each season.
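The seven parameters can be read off the usual compact form of the seasonal model. This is the standard textbook notation and is an assumption here, not something stated in the text above: B is the backshift operator (B y_t = y_{t-1}), ε_t is the error term, and the exogenous regressors of SARIMAX are left out to keep the expression short.

```latex
% SARIMA(p, d, q)(P, D, Q)_m
\phi_p(B)\,\Phi_P(B^m)\,(1 - B)^d\,(1 - B^m)^D\, y_t
  = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t

% \phi_p(B)   = 1 - \phi_1 B - \dots - \phi_p B^p          (non-seasonal AR, order p)
% \theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q      (non-seasonal MA, order q)
% \Phi_P(B^m), \Theta_Q(B^m): the same polynomials in B^m  (seasonal AR and MA, orders P and Q)
% (1 - B)^d and (1 - B^m)^D: ordinary and seasonal differencing
```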
Supply Chain Management: Real-world case study in logistics.
Supply chain management (SCM) is the optimization of a product's creation and flow
from raw material sourcing to production, logistics and delivery to the final customer.
Forecasting and Demand Planning:
• Data science is fundamental in forecasting and demand planning, the essential
components of effective supply chain management.
• Data scientists can develop accurate demand forecasts by analyzing historical
sales data, market trends, and external factors.
Inventory Optimization:
However, neither of the above seems like the best fit. Perhaps the optimal line is
the one that separates the two classes with the largest possible margin?
This is the line an SVM attempts to find: the
maximum-margin separating hyperplane between
the two classes.
However, we need to construct a decision rule to classify examples. To do this,
consider a vector w perpendicular to the margin. Further, consider some
unknown vector u representing some example we want to classify:
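A common way to finish this construction is to project u onto w and compare the result with a threshold, which gives the rule: classify u as positive when w · u + b ≥ 0. The sketch below assumes that rule. The function names and the values of w and b are hypothetical stand-ins for what training an SVM would produce.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify(w, b, u):
    """Return +1 if u lies on the positive side of the hyperplane w.x + b = 0, else -1."""
    return 1 if dot(w, u) + b >= 0 else -1


# Hypothetical parameters; a trained SVM would supply w and b.
w = [1.0, 1.0]
b = -3.0

print(classify(w, b, [3.0, 2.0]))  # score 3 + 2 - 3 = 2.0, positive side
print(classify(w, b, [0.5, 1.0]))  # score 0.5 + 1 - 3 = -1.5, negative side
```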
R1: IF (Age=Youth) AND (Student=Yes) THEN Buys_computer=Yes
R2: IF (Age=Youth) AND (Student=No) THEN Buys_computer=No
R3: IF (Age=middle_aged) THEN Buys_computer=Yes
R4: IF (Age=Senior) AND (Credit_rating=Fair) THEN Buys_computer=No
R5: IF (Age=Senior) AND (Credit_rating=Excellent) THEN Buys_computer=Yes
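Rules of this kind translate directly into code. The sketch below encodes R1 to R5 as a chain of conditions; the function name `buys_computer` and the choice to return None when no rule fires are assumptions made for illustration.

```python
def buys_computer(age, student=None, credit_rating=None):
    """Apply rules R1-R5 in order; return 'Yes', 'No', or None if no rule fires."""
    if age == "Youth" and student == "Yes":                # R1
        return "Yes"
    if age == "Youth" and student == "No":                 # R2
        return "No"
    if age == "middle_aged":                               # R3
        return "Yes"
    if age == "Senior" and credit_rating == "Fair":        # R4
        return "No"
    if age == "Senior" and credit_rating == "Excellent":   # R5
        return "Yes"
    return None  # the example is not covered by any rule


print(buys_computer("Youth", student="Yes"))
print(buys_computer("Senior", credit_rating="Fair"))
```

Each rule corresponds to one root-to-leaf path of a decision tree, which is why at most one rule can fire for a given example.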
What is node impurity?
• The node impurity is a measure of the
homogeneity of the labels at the node.
• The current implementation provides two impurity
measures for classification (Gini impurity and
entropy) and one impurity measure for regression
(variance).
• Gini Impurity is a measurement of the likelihood
of an incorrect classification of a new instance of
data, if that new instance were randomly
classified according to the distribution of class
labels at the node.
Formula for Gini Impurity
The formula for Gini Impurity is given below, where p_i is the
probability of a sample belonging to class i at a specific
node and C is the number of classes:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
The feature whose split gives the smallest Gini Impurity is
selected for splitting the node.
• For a two-class problem, Gini Impurity
ranges from 0 to 0.5.
• The lower the Gini Impurity, the
better the split.
• A Gini Impurity of 0 denotes a pure
node and 0.5 denotes the most
impure node (a 50–50 split).
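These values can be checked with a few lines of Python. The sketch assumes the usual definition of Gini impurity, one minus the sum of the squared class probabilities at the node; the function name `gini_impurity` is illustrative.

```python
def gini_impurity(probabilities):
    """Gini impurity, 1 - sum(p_i^2), for the class probabilities at a node."""
    return 1.0 - sum(p * p for p in probabilities)


print(gini_impurity([1.0, 0.0]))        # pure node
print(gini_impurity([0.5, 0.5]))        # 50-50 split between two classes
print(gini_impurity([1/3, 1/3, 1/3]))   # three equally likely classes
```

The 0.5 ceiling holds for two classes only. With C equally likely classes the impurity is 1 - 1/C, so the three-class case above comes out at about 0.667.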
Entropy and the Gini criterion
both measure node impurity
and tend to behave similarly.
• For a two-class problem, with logarithms
taken to base 2, Entropy ranges from 0 to 1.
• An Entropy of 0 denotes a pure
node and 1 denotes the most impure
node (where we have a 50–50 split
of ‘Yes’ and ‘No’).
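The same check can be done for entropy. The sketch assumes Shannon entropy with base-2 logarithms, written as the sum of p_i * log2(1 / p_i), which is equivalent to the more familiar form with a leading minus sign; the function name `entropy` is illustrative.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


pure = entropy([1.0, 0.0])    # all samples in one class
even = entropy([0.5, 0.5])    # 50-50 split of 'Yes' and 'No'
mixed = entropy([0.4, 0.6])   # somewhere in between

print(pure, even, mixed)
```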
Example: Splitting by Gini Index
We can split the data by the Gini Index too. Let’s compute
the required probabilities:
Out of the 14 days in the above example, Sunny, Overcast, and Rain occur 5,
4, and 5 times, respectively. Then, we compute the probabilities of a Sunny
day and playing tennis or not. Out of the 5 times when Outlook=Sunny, we
played tennis on 2 and didn’t play it on 3 days:
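Written out, the probabilities described in this paragraph are as follows. The symbols are an assumed shorthand: P(·) for a probability and "Play" for the decision to play tennis.

```latex
P(\text{Sunny}) = \tfrac{5}{14}, \qquad
P(\text{Overcast}) = \tfrac{4}{14}, \qquad
P(\text{Rain}) = \tfrac{5}{14}

P(\text{Play} = \text{Yes} \mid \text{Sunny}) = \tfrac{2}{5}, \qquad
P(\text{Play} = \text{No} \mid \text{Sunny}) = \tfrac{3}{5}
```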
Having calculated the required probabilities, we can
compute the Gini Index of Sunny:
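Assuming the usual definition of the Gini Index (one minus the sum of squared class probabilities), the 2 "play" days and 3 "no play" days among the 5 Sunny days give:

```latex
\text{Gini}(\text{Sunny})
  = 1 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{3}{5}\right)^2
  = 1 - \tfrac{4}{25} - \tfrac{9}{25}
  = \tfrac{12}{25}
  = 0.48
```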