
UNIT-2
Time Series Analysis
• Time Series Analysis comprises
methods for analyzing time-series data
to extract meaningful
statistics, rules and patterns.
• These rules and patterns might be used
to build forecasting models that are
able to predict future developments.
• Does a database play a vital role in time series
mining?

- A database is a collection of data gathered from different sources, in which the data are stored in structured or unstructured form in their respective fields.
- A time-series database consists of a sequence of values or events that change over time, with data recorded at regular intervals.
Applications of Time Series Mining:
1. Financial:
1.1 Stock price evaluation
1.2 Measurement of inflation
2. Industry:
2.1 Determining power consumption
3. Scientific:
3.1 Analyzing experimental results
4. Meteorological:
4.1 Studying atmospheric processes and phenomena, mainly for weather forecasting
Time Series Forecasting models ARIMA,
SARIMA, and SARIMAX Explained
• ARIMA and SARIMA are both algorithms for
forecasting. ARIMA takes into account the past values
(autoregressive, moving average) and predicts future
values based on that.
• SARIMA similarly uses past values but also takes into
account any seasonality patterns.
• Since SARIMA takes seasonality as a parameter, it is
significantly more powerful than ARIMA for forecasting
complex data containing cycles.
Data preprocessing for time series forecasting
• Time series data is often messy. Real-world projects involve a lot of cleaning and
preparation, so this preprocessing step deserves emphasis. Here are some techniques you
could use before moving on to forecasting.
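A minimal preprocessing sketch in pandas is shown below. It assumes a hypothetical file sales.csv with a 'date' column and a numeric 'value' column; the file name and column names are only placeholders.

import pandas as pd

# Hypothetical input: a CSV with a 'date' column and a numeric 'value' column
df = pd.read_csv("sales.csv", parse_dates=["date"])

# 1. Use a sorted DatetimeIndex so observations are in chronological order
ts = df.set_index("date").sort_index()["value"]

# 2. Resample to a regular frequency (daily here) so intervals are even
ts = ts.resample("D").mean()

# 3. Fill gaps, e.g. by interpolating between neighbouring observations
ts = ts.interpolate(method="time")

# 4. Optionally clip extreme outliers so they do not dominate the model
ts = ts.clip(lower=ts.quantile(0.01), upper=ts.quantile(0.99))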
ARIMA
The ARIMA model is a class of linear models
that uses historical values to forecast
future values. ARIMA stands
for Autoregressive Integrated Moving
Average; each of these techniques
contributes to the final forecast. Let's
understand them one by one.
The 3 core components of ARIMA
ARIMA has three components, which I'll
briefly introduce here:
1. Autoregression
• The first is ‘autoregression’, which refers to
the model's regression on its own lagged values.
• In simple terms, this means that we use past
data to predict future outcomes.
2. Integrated
• The second component is ‘integrated’, which deals
with stationarity.
• A stationary time series has a constant mean,
variance, and covariance over time, which makes it
easier to predict patterns.
• Integrated represents any
differencing that has to be
applied in order to make
the data stationary.
• A Dickey-Fuller test (code below) can be run on the data to check for stationarity, and we can then experiment with different differencing factors.
• A differencing factor of d = 1 means a lag-1 difference, i.e. m_t - m_(t-1). Let's look at a plot of the original vs. the differenced data.
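A short sketch of first-order differencing and the comparison plot, assuming ts is a pandas Series such as the one prepared above:

import matplotlib.pyplot as plt

# First-order differencing: m_t - m_(t-1); the first value is undefined and dropped
ts_diff = ts.diff().dropna()

# Plot the original and the differenced series for visual comparison
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
axes[0].plot(ts)
axes[0].set_title("Original series")
axes[1].plot(ts_diff)
axes[1].set_title("Differenced series (d = 1)")
plt.tight_layout()
plt.show()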
Augmented Dickey-Fuller (ADF) Test

from statsmodels.tsa.stattools import adfuller

def check_stationarity(ts):
    # Run the Augmented Dickey-Fuller test on the series
    dftest = adfuller(ts)
    adf = dftest[0]                    # ADF test statistic
    pvalue = dftest[1]                 # p-value
    critical_value = dftest[4]['5%']   # critical value at the 5% level
    if (pvalue < 0.05) and (adf < critical_value):
        print('The series is stationary')
    else:
        print('The series is NOT stationary')
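A hypothetical usage, assuming ts is the pandas Series being analysed:

check_stationarity(ts)                   # e.g. prints 'The series is NOT stationary'
check_stationarity(ts.diff().dropna())   # the differenced series is often stationary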
3. Average
• Finally, the third component is the moving
‘average’, which also uses past information but
in a different way.
• How is this different from autoregression?
• Well, while autoregression uses past values of
the time series, moving average uses the model's
errors as information.
In summary, let’s recap with the following
visualization:
Model selection in ARIMA
The ARIMA model is specified by three order parameters: p,
d, and q, which stand for "autoregressive",
"differencing", and "moving average",
respectively.
•The p-component (AR) measures the correlation
between the current value of a time series and
the values that came before it
•The d-component (I) represents the number of
times the series needs to be differenced to
make it stationary
•The q-component (MA) measures the correlation between the current value and the residual errors from previous forecasts
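As a minimal sketch, an ARIMA model can be fitted with statsmodels as below; the order (1, 1, 1) is only an illustrative choice, and p, d, q should be selected for the data at hand (for example via the ADF test and ACF/PACF plots).

from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA with p=1, d=1, q=1 on the preprocessed series ts
model = ARIMA(ts, order=(1, 1, 1))
result = model.fit()
print(result.summary())

# Forecast the next 12 periods
forecast = result.forecast(steps=12)
print(forecast)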
SARIMAX forecasters
SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous
Regressors) is a generalization of the ARIMA model that considers both seasonality and
exogenous variables. SARIMAX models are among the most widely used statistical models
for forecasting, with excellent forecasting performance.
In the SARIMAX model notation, the parameters p, d, and q represent the autoregressive,
differencing, and moving-average components, respectively. P, D, and Q denote the same
components for the seasonal part of the model, with m representing the number of periods in
each season.
•p is the order (number of time lags) of the autoregressive part of the model.
•d is the degree of differencing (the number of times the data have had past values
subtracted).
•q is the order of the moving-average part of the model.
•P is the order (number of seasonal lags) of the seasonal autoregressive part of the model.
•D is the degree of differencing (the number of times the data have had past values
subtracted) of the seasonal part of the model.
•Q is the order of the moving-average of the seasonal part of the model.
•m refers to the number of periods in each season.
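A minimal SARIMAX sketch with statsmodels is given below. Monthly seasonality (m = 12), the orders, and the exogenous regressor exog are all illustrative assumptions, not recommendations.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Seasonal model: order = (p, d, q), seasonal_order = (P, D, Q, m)
model = SARIMAX(ts,
                exog=exog,                       # optional external regressor(s)
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

# Forecasting also needs future values of the exogenous regressor(s)
forecast = result.forecast(steps=12, exog=future_exog)
print(forecast)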
Supply Chain Management: Real world case study in logistics.

Supply chain
management (SCM)
is the optimization
of a product's
creation and flow
from raw material
sourcing to
production, logistics
and delivery to the
final customer.
Forecasting and Demand Planning:
• Data science is fundamental in forecasting and demand planning, the essential
components of effective supply chain management.
• Data scientists can develop accurate demand forecasts by analyzing historical
sales data, market trends, and external factors.
Inventory Optimization:

• Data science enables organizations to optimize inventory levels by analyzing
historical demand patterns, lead times, supplier performance, and customer
behavior.
• Companies can identify optimal reorder points, safety stock levels, and
replenishment strategies by employing data-driven techniques like predictive
analytics.
Supplier Relationship Management:

• Data science enhances supplier relationship management by providing
insights into supplier performance, quality, and reliability.
• By analyzing supplier data and external information such as financial
indicators and market dynamics, organizations can identify the most
suitable suppliers for their needs.
Route Optimization and Logistics:

• Efficient transportation and logistics management are crucial for timely
order fulfillment and cost optimization.
• Data science is pivotal in route optimization, vehicle scheduling, and
fleet management.
• Data-driven logistics optimization helps minimize transportation costs,
improve customer satisfaction, and reduce the environmental impact of
supply chain operations.
Risk Management and
Resilience:
• Data science enables organizations to identify and
mitigate risks in their supply chains.
• Data scientists can develop risk models to identify
potential disruptions or vulnerabilities by analyzing
historical data, market trends, and external factors.
• Advanced analytics techniques like predictive modeling
and scenario analysis help businesses anticipate and
mitigate risks related to demand variability, supply
disruptions, natural disasters, and geopolitical events.
Key features of effective supply chain optimization include:
Supply Chain Analytics
• Supply chain analytics, in the simplest terms, is the bridge between
data and decision-making.
• It is the process of analyzing large amounts of data to identify
patterns and uncover insights for informed decision-making in
supply chain management.
Why Is Supply Chain Analytics Important for Companies?
The key reasons why companies are turning to supply chain
analytics:
• Supply chain analytics help
organizations make better, faster and
more informed decisions about their
supply chain operations.
• Supply chain analytics helps
with optimizing inventory management.
• Supply chain analytics can identify
inefficiencies, waste, and areas for
improvement.
• Supply chain analytics analyzes customer
data to better predict and meet customer demand.
Overview of Different Types of Data
Used
An important aspect of data management is understanding
the types of data required for supply chain analytics.
Some of the key types of data are:
•Customer data that provides valuable information about
consumer preferences, buying behavior, and demand
patterns. Analyzing customer data can shape supply chain
strategies to better meet customer needs.
•Product data includes information about the
characteristics, specifications, and attributes of
different products. Analyzing product data is important
for optimizing inventory, demand planning, and
forecasting.
•Demand data is historical and real-time information about
the demand for products or services. Analyzing demand
data assists in predicting future demand and adjusting
supply chain operations accordingly.
Supply Chain Management: Real-world case studies in logistics
Case Study: Inditex (Zara's parent company) and
their Agile Supply Chain
Company: Inditex (Zara, Massimo Dutti, Bershka, etc.)
Industry: Apparel Retail
Challenge: In the fast-changing world of fashion,
Inditex needed a way to quickly respond to trends and
customer demand without sacrificing cost or quality.
Traditional, long lead time supply chains were not
agile enough.
Solution: Inditex implemented an agile supply chain with several
key features:
•Vertical integration: They own a significant portion of their
manufacturing, allowing for closer control over production and faster
response times.
•Quick response (QR) manufacturing: They produce smaller
batches of clothing more frequently, allowing them to react to trends
and changes in demand quickly.
•Information technology: They use sophisticated software to track
inventory, production, and sales data in real-time, allowing for better
decision-making and faster response times.
•Flexible logistics: They have a network of strategically located
distribution centers and transportation partners to quickly move goods
around the world.
Results:
•Faster lead times: From design to store shelves takes as little as 2 weeks for some
items.
•Increased responsiveness to trends: Zara releases new collections several times
per week, keeping up with the latest fashion trends.
•Reduced inventory levels: By responding quickly to demand, Inditex can keep
lower inventory levels, reducing costs and improving cash flow.
•Strong brand reputation: Their ability to deliver fast fashion at affordable prices
has helped them build a strong brand reputation and customer loyalty.
Challenges:
•Higher cost: Their vertically integrated model and quick response manufacturing
can increase costs compared to traditional models.
•Labor concerns: There have been concerns about labor practices in some of their
manufacturing facilities.
Overall, Inditex's agile supply chain is a successful example of how companies
can use innovative logistics strategies to gain a competitive advantage in the
fast-paced world of fashion.
This case study highlights the importance of:
•Understanding customer needs and responding
quickly to changes in demand.
•Using technology to improve visibility and decision-
making.
•Building a strong network of partners and suppliers.
•Balancing cost, quality, and responsiveness.
It's important to note that this is just one example, and
the best supply chain strategy will vary depending on the
specific industry, company, and market conditions.
Case Study: Amazon and their Fulfilment
Network
Company: Amazon, a leading e-commerce
platform
Industry: E-commerce logistics
Challenge: To meet the ever-increasing
demand for speedy and efficient delivery,
especially during peak seasons, Amazon
needed a robust and scalable logistics
network. They faced issues like managing
large inventory volumes, fulfilling orders
quickly and accurately, and delivering them on time.
Solution: Amazon built a complex and sophisticated
fulfilment network with several key elements:
•Massive network of fulfillment centers: Strategically
located across the globe, these centers house a vast
inventory, allowing for faster delivery to customers.
•Advanced warehouse automation: Amazon employs
state-of-the-art robotics and automation technologies for
tasks like picking and packing, increasing efficiency and
accuracy.
•Delivery options: They offer a variety of delivery options,
including one-day, two-day, and even same-day delivery in
certain locations, catering to customer needs and
expectations.
•Transportation partnerships: Amazon collaborates with
various transportation companies, including their own
delivery fleet, to move packages quickly and reliably.
Results:
•Faster delivery times: Amazon boasts some of the fastest delivery
times in the e-commerce industry, exceeding customer expectations.
•Increased customer satisfaction: Fast and reliable delivery
contributes significantly to customer satisfaction, leading to higher loyalty
and repeat business.
•Increased market share: By offering a superior delivery experience,
Amazon has captured a significant share of the e-commerce market.
Challenges:
•High capital investment: Building and maintaining a vast network of
fulfillment centers and deploying automation technologies requires
substantial financial resources.
•Labor concerns: There have been concerns regarding employee
treatment and working conditions in Amazon warehouses.
•Environmental impact: The sheer volume of transportation required for
rapid delivery raises concerns about the environmental impact.
Overall, Amazon's fulfilment network is a
remarkable example of how innovation and
technology can be harnessed to revolutionize an
industry. Their focus on efficient and fast delivery
contributes significantly to their success.
This case study highlights the importance of:
•Building a scalable and adaptable logistics
network.
•Integrating technology to enhance efficiency and
accuracy.
•Balancing fulfillment speed with ethical business practices.
Support Vector Machines (SVM)
• Support Vector Machine (SVM) is a supervised machine learning algorithm.
• SVM's purpose is to predict the class of a query sample by relying on labeled input data that are separated into two classes by a margin.
• Specifically, the data is transformed into a higher dimension, and a support vector classifier is used as a threshold (or hyperplane) to separate the two classes with the maximum possible margin.
SVMs: A Geometric Interpretation
Consider a set of positive and negative samples from some dataset, as shown in the figure. How can we approach the problem of classifying these samples (and, more importantly, unseen ones) as either positive or negative examples? The most intuitive way is to draw a line or hyperplane between the positive and negative samples.
However, which line should we draw? Many different separating lines are possible, and none of them obviously looks like the best fit. Perhaps a line such that the boundary between the two classes is maximal is the optimal line?
This line is such that the margin is maximized. This is the
line an SVM attempts to find - an SVM attempts to find
the maximum-margin separating hyperplane between
the two classes.
However, we need to construct a decision rule to classify examples. To do this,
consider a vector w perpendicular to the margin. Further, consider some
unknown vector u representing some example we want to classify:

We want to know which side of the decision boundary u lies on in order to classify it. To do this, we project it onto w by computing w ⋅ u. This gives us a value proportional to how far u extends in the direction of w. We can then use this value to determine which side of the boundary u lies on, via the following decision rule:
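The standard form of this rule, with a bias term b, is: classify u as a positive example if w ⋅ u + b ≥ 0, and as a negative example otherwise.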
Problem of Non-linear decision boundary
(hyperplane)
• It is possible that in many classification problems we do not have a linear decision boundary or hyperplane.
• In such cases we use support vector machines to do the classification by producing non-linear boundaries, which result from constructing a linear boundary in a higher-dimensional, transformed version of the feature space.
Working of a Support Vector Machine
• A support vector machine (SVM) constructs a hyperplane or set
of hyperplanes in a higher dimensional space.
• A "good" separation is achieved by the hyperplane that has the
largest distance to the nearest training data point of any class,
that is, we try to find a decision boundary that maximizes the
margin.
• We do this in order to minimize the generalization error as
much as possible, since the larger the margin, the lower
the generalization error of the classifier.
• The problem of non-linear decision boundary is solved by using
the kernel tricks.
The Kernel Trick
• In ideal cases the data is linearly separable, and we can find a separating hyperplane that divides the data into two classes.
• In many practical situations, however, the data is not linearly separable. In such cases we use the kernel trick, which maps or transforms the input data non-linearly into a higher-dimensional space.
• The transformed data can then be separated linearly. In layman's terms, the kernel trick allows the SVM to form non-linear boundaries.
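A minimal scikit-learn sketch of this idea is shown below; the make_circles data is just a stand-in for any dataset that is not linearly separable.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes in 2-D
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))  # poor
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))        # close to 1.0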
Polynomial kernel
Radial Basis Function
(RBF) Kernel
The RBF kernel SVM decision region is, in the transformed feature space, still a linear decision region. What the RBF kernel SVM actually does is create non-linear combinations of features to lift the samples into a higher-dimensional feature space where a linear decision boundary can be used to separate the classes.
What is a Radial Function?
• A radial function is a function whose value depends only on the distance from the origin (x = 0 and y = 0).
• For example, the set of points at a constant distance from the origin forms a circle when plotted on the X and Y axes. In the figure below we can see a circle with r = 2, where r is the distance from the origin; in this case r is the radius of the circle and point O is the origin. (Figure: Radial Function)
What is a Radial Basis
Function (RBF)?
• A Radial Basis Function (RBF) is a radial function whose reference point need not be the origin. For example, the set of points at a distance of 3 from the point (5, 5) looks like this:
• We can sum multiple RBFs to get shapes with
multiple centres like this:

(Figure: Radial Basis Function)
So what is an SVM with the RBF Kernel?
An SVM with the RBF kernel is a machine learning algorithm capable of classifying data points separated by radially shaped boundaries like this:
•Radial Basis Function (RBF) Kernel: The RBF kernel is very popular. It implicitly considers polynomial features of all degrees, corresponding to an infinite-dimensional feature space. It is suitable for cases where the decision boundary is not easily represented by a straight line or a plane.
Domain Specific Kernel
• Sometimes, custom kernels tailored to specific domain
knowledge or problem characteristics can outperform
standard kernels.
• For example, designing kernels based on domain-specific
similarity measures can improve performance.
• In this type of kernel, we don’t have a generic form of kernel
because this kernel is tailored according to a specific
domain.
• Nowadays, grid search and cross-validation techniques
can also help determine the best-performing kernel for
a given dataset.
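A hedged sketch of such a search with scikit-learn, reusing X_train and y_train from the earlier sketch; the parameter grid is illustrative only.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1, 1],
}

# 5-fold cross-validation over every kernel/parameter combination
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best kernel and parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)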
DECISION TREE
CLASSIFICATION
• Decision tree builds
classification or regression
models in the form of a tree
structure.
• It breaks down a dataset into
smaller and smaller subsets
while at the same time an
associated decision tree is
incrementally developed.
• The final result is a tree
with decision nodes and leaf
nodes. The Iterative
Dichotomiser 3 (ID3)
algorithm is used to
determine the splits.
• Entropy and information gain are used to construct a decision tree.
Entropy
Entropy is the degree or amount of uncertainty in the randomness of elements.
In other words, it is a measure of impurity.
• Intuitively, it tells us about
the predictability of a
certain event. Entropy
calculates the homogeneity
of a sample.
• If the sample is completely
homogeneous the entropy is
zero, and if the sample is
equally divided it has an
entropy of one.
Information Gain
• Information gain measures the relative change in entropy with
respect to the independent attribute. It tries to estimate the
information contained by each attribute.
• Constructing a decision tree is all about finding the attribute that
returns the highest information gain (i.e., the most homogeneous
branches).

Gain(T, X) = Entropy(T) - Σ_v (|T_v| / |T|) * Entropy(T_v)

• Where Gain(T, X) is the information gain from splitting on feature X, Entropy(T) is the entropy of the entire set, and the second term is the weighted entropy after splitting on feature X.
• Information gain ranks attributes for splitting at a given node in the tree. The ranking is based on the highest information gain in each split.
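A minimal sketch of both calculations in Python is given below; it assumes the data is in a pandas DataFrame, and the column names are placeholders.

import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy(T) = -sum over classes of p_i * log2(p_i)
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, feature, target):
    # Gain(T, X) = Entropy(T) - sum over values v of X of (|T_v|/|T|) * Entropy(T_v)
    total_entropy = entropy(df[target])
    weighted_entropy = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return total_entropy - weighted_entropy

For the computer-shop example, information_gain(data, 'age', 'buys_computer') would return the gain from splitting on age (assuming a DataFrame named data with those columns).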
• To get a clear understanding
of calculating information
gain & entropy, we will try
to implement it on sample
data.
• Consider a piece of data
collected from a computer
shop where the features are
age, income, student, credit
rating and the outcome
variable is whether the
customer buys a computer or
not.
• Now, our job is to build a
predictive model which takes
in the above 4 parameters and
predicts whether the customer will buy a computer or not.
ID3 Algorithm will perform following tasks recursively
1. Create a root node for the tree
2. If all examples are positive, return leaf node ‘positive’
3. Else if all examples are negative, return leaf node ‘negative’
4. Calculate the entropy of current state H(S)
5. For each attribute, calculate the entropy with respect to the attribute ‘x’
denoted by H(S, x)
6. Select the attribute which has the maximum value of IG(S, x)
7. Remove the attribute that offers highest IG from the set of attributes
8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
Step 1: The initial step is to calculate H(S), the Entropy of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.
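Plugging these counts into the entropy formula: H(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940 bits.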
Step 2 : The next step is to calculate H(S,x), the entropy with respect to the
attribute ‘x’ for each attribute. In the above example, The expected information
needed to classify a tuple in ‘S’ if the tuples are partitioned according to age
is,

Hence, the gain in information from such partitioning would be,


Similarly,
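the information gains for the remaining attributes (income, student, credit_rating) can be computed. For the classic version of this computer-shop dataset (the one in Han and Kamber's textbook, which these slides appear to follow), the values work out approximately as Info_age(S) ≈ 0.694, so Gain(age) ≈ 0.940 - 0.694 = 0.246, while Gain(income) ≈ 0.029, Gain(student) ≈ 0.151, and Gain(credit_rating) ≈ 0.048.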
Step 3: Choose attribute with the largest information gain, IG(S,x) as the
decision node, divide the dataset by its branches and repeat the same process on
every branch. Age has the highest information gain among the attributes, so Age
is selected as the splitting attribute.
Step 4a: A branch with an entropy of 0 is a leaf node.
Step 4b : A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches until all
data is classified.
Decision Tree to Decision Rules
• A decision tree can easily be transformed into a set of rules by mapping
from the root node to the leaf nodes one by one.

R1 : If (Age=Youth) AND
(Student=Yes) THEN
Buys_computer=Yes
R2 : If (Age=Youth) AND
(Student=No) THEN
Buys_computer=No
R3 : If (Age=middle_aged) THEN
Buys_computer=Yes
R4 : If (Age=Senior) AND
(Credit_rating=Fair) THEN
Buys_computer=No
R5 : If (Age=Senior) AND
(Credit_rating =Excellent)
THEN Buys_computer=Yes
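As a hedged sketch, a similar set of rules can be extracted automatically with scikit-learn; note that scikit-learn implements CART with binary splits rather than ID3, and the tiny table below is only an illustrative stand-in for the real data.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative stand-in for the computer-shop table (not the full dataset)
data = pd.DataFrame({
    "age":           ["youth", "youth", "middle_aged", "senior", "senior", "middle_aged"],
    "student":       ["no", "yes", "no", "yes", "no", "yes"],
    "credit_rating": ["fair", "fair", "excellent", "fair", "excellent", "excellent"],
    "buys_computer": ["no", "yes", "yes", "no", "no", "yes"],
})

# One-hot encode the categorical features so the tree can split on them
X = pd.get_dummies(data.drop(columns="buys_computer"))
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned tree as nested if/else rules
print(export_text(tree, feature_names=list(X.columns)))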
What is node impurity?
• The node impurity is a measure of the
homogeneity of the labels at the node.
• The current implementation provides two impurity
measures for classification (Gini impurity and
entropy) and one impurity measure for regression
(variance).
• Gini Impurity is a measure of the likelihood
of an incorrect classification of a new instance of
data, if that new instance were randomly classified
according to the distribution of class labels at the node.
Formula for Gini Impurity
Below is the formula for Gini Impurity, where p_i is the
probability of samples belonging to class i at a specific
node:

Gini = 1 - Σ_i (p_i)^2

The feature with the smallest Gini Impurity is
selected for splitting the node.
•The range of value Gini Impurity
can have is between 0 to 0.5
•The lesser the Gini Impurity, the
better the split is.
•A Gini Impurity of 0 denotes a pure
node and 0.5 denotes a most
impure node
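A minimal Python sketch of this calculation is given below; counts is any sequence of class counts at a node.

import numpy as np

def gini_impurity(counts):
    # Gini = 1 - sum(p_i^2), with p_i the proportion of class i at the node
    probs = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - float(np.sum(probs ** 2))

print(gini_impurity([5, 5]))    # 0.5 -> most impure two-class node
print(gini_impurity([10, 0]))   # 0.0 -> pure node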
Entropy and Gini criterion
measure similar
performance metrics.
•The range of values Entropy can
have is between 0 to 1
•Entropy of 0 denotes a pure
node and 1 denotes most impure
node (where we have 50–50 split
of ‘Yes’ and ‘No’)
Example: Splitting by Gini Index
We can split the data by the Gini Index too. Let’s compute
the required probabilities:

Out of the 14 days in the above example, Sunny, Overcast, and Rain occur 5,
4, and 5 times, respectively. Then, we compute the probabilities of a Sunny
day and playing tennis or not. Out of the 5 times when Outlook=Sunny, we
played tennis on 2 and didn’t play it on 3 days:
Having calculated the required probabilities, we can
compute the Gini Index of Sunny:
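Using these probabilities: Gini(Sunny) = 1 - (2/5)^2 - (3/5)^2 = 1 - 0.16 - 0.36 = 0.48.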

We follow the same steps for Overcast and Rain:
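In the standard play-tennis table that this example appears to follow, Overcast has 4 'play' and 0 'don't play' days, giving Gini(Overcast) = 1 - (4/4)^2 - (0/4)^2 = 0, and Rain has 3 and 2, giving Gini(Rain) = 1 - (3/5)^2 - (2/5)^2 = 0.48. The weighted Gini Index for Outlook is then (5/14)(0.48) + (4/14)(0) + (5/14)(0.48) ≈ 0.343.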
