Analytics Final Exam Review
Topic #1
1. Describe the concept of a recurrent neural network and the meaning of “unfolding”
such a network.
2. Describe and characterize the neural network that results from the following Python code.
Comment/explain all aspects of the network.
m = models.Sequential()
m.add(layers.Input(batch_input_shape=(16, 5, 20)))
m.add(layers.LSTM(units=32, stateful=False, return_sequences=True))
m.add(layers.LSTM(units=16, stateful=False, return_sequences=False))
m.add(layers.Dense(5))
m.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_6 (LSTM)                (16, 5, 32)               6784
lstm_7 (LSTM)                (16, 16)                  3136
dense_5 (Dense)              (16, 5)                   85
=================================================================
Total params: 10005 (39.08 KB)
Trainable params: 10005 (39.08 KB)
Non-trainable params: 0 (0.00 Byte)
The network has an input layer with a fixed batch size of 16, 5 time steps, and 20 input
features per time step.
It has two LSTM layers with 32 and 16 units respectively; neither is stateful (no state is
carried across batches), and only the first returns the full output sequence.
The output layer is a Dense layer with 5 units.
The total trainable parameters in the network are 10,005.
In Simple Terms
o Input Layer
It takes in batches of data, where each batch has 16 samples.
Each sample has 5 time steps, representing different moments or
events.
There are 20 features at each time step, like different types of
information.
o LSTM Layers
LSTM (Long Short-Term Memory) Layer: is a type of recurrent
neural network (RNN) layer that is designed to address the vanishing
gradient problem, which is common in traditional RNNs
The network has two LSTM layers that help it understand sequences
of data.
The first layer reads the sequence and returns a sequence of outputs (one per time step).
The second layer reads that sequence but returns only a single output (its final state).
o Output Layer
Finally, there's an output layer with 5 units, which means it produces 5
different outputs.
o Total Parameters
The network has about 10,005 settings that it can adjust to learn from
the data during training.
o In simple terms, this network is good at handling data that comes in
sequences, like a series of events over time. It can learn patterns and
relationships in this data and give back useful predictions or information
based on what it learns.
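The 10,005 parameters reported in the summary can be reproduced by hand. A minimal check, assuming the standard Keras LSTM parameterization (4 gates, each with an input kernel, a recurrent kernel, and a bias):
def lstm_params(input_dim, units):
    # 4 gates, each with kernel (input_dim x units), recurrent kernel (units x units), and bias (units)
    return 4 * (units * (input_dim + units) + units)

print(lstm_params(20, 32))    # 6784  -> first LSTM layer (20 input features, 32 units)
print(lstm_params(32, 16))    # 3136  -> second LSTM layer (32 inputs, 16 units)
print(16 * 5 + 5)             # 85    -> Dense(5) applied to the 16 LSTM outputs (weights + biases)
print(6784 + 3136 + 85)       # 10005 -> total trainable parameters, matching the summary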
3. Consider the following time series dataset with inputs and corresponding targets:
pd.DataFrame([inputs, targets])
     0    1    2    3    4    5    6    7    8    9
0  0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  NaN
1  NaN  NaN  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
Next, assume that a Keras Dataset object for training a recurrent neural network is
created from this pandas DataFrame in the following way:
Write the first few elements of the output of the following code:
First batch:
o Input sequence: [[0.0, 1.0], [2.0, 3.0]]
o Target sequence: [[None, None], [2.0, 3.0]]
Second batch:
o Input sequence: [[4.0, 5.0], [6.0, 7.0]]
o Target sequence: [[4.0, 5.0], [6.0, 7.0]]
Third batch:
o Input sequence: [[8.0, None]]
o Target sequence: [[8.0, 9.0]]
This represents how the dataset is organized into batches and sequences based on the
specified parameters. Each batch contains a pair of input and target sequences, with
the target sequence shifted by one time step compared to the input sequence.
4. Embedding layers are often used for categorical inputs, e.g. String character values.
Describe the idea of word embeddings. Given a “vocabulary” (number of categories) of
25 and an embedding size of 5, how many trainable parameters will the embedding
layer have?
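Word embeddings map each discrete category (e.g., a word or character) to a dense, trainable vector in a low-dimensional space, so that categories that behave similarly end up with similar vectors. The layer stores one vector of length 5 for each of the 25 categories and has no bias term, giving 25 * 5 = 125 trainable parameters. A minimal check in Keras, assuming the same layers import as in the code above:
emb = layers.Embedding(input_dim=25, output_dim=5)   # vocabulary of 25, embedding size 5
emb.build(input_shape=(None,))                       # creates the 25 x 5 weight matrix
print(emb.count_params())                            # 125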
Topic #2
Simple answer:
o The multi-armed bandit problem is like choosing the best slot machine
to play. Here's the simple version:
Problem Setting: You have several slot machines (arms) to choose
from, but you don't know which one pays out the most.
What is Learned: You learn which slot machine gives the highest
rewards (payouts).
How Learning Takes Place: You try different machines to see
how much they pay out. Over time, you figure out which machine
gives the best rewards.
Use Case in Business: In business, this could be like testing
different ads to see which one gets the most clicks or trying
different prices to see which one sells the most products. It helps
businesses make better decisions based on real-time feedback.
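A minimal epsilon-greedy sketch of this idea (illustrative only; the three "ads" and their click probabilities are made-up values):
import random

true_click_rates = [0.05, 0.12, 0.08]            # unknown to the learner
counts = [0, 0, 0]                               # how often each arm was played
values = [0.0, 0.0, 0.0]                         # sample-average reward estimates
epsilon = 0.1                                    # fraction of plays spent exploring

for t in range(10000):
    if random.random() < epsilon:                # explore: try a random arm
        a = random.randrange(3)
    else:                                        # exploit: play the best estimate so far
        a = values.index(max(values))
    reward = 1.0 if random.random() < true_click_rates[a] else 0.0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]   # incremental average update

print(values)    # the estimates roughly approach the true rates; arm 1 is played most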
4. What is meant by “Exploring Starts” in the context of Monte Carlo control? Why
are exploring starts necessary?
"Exploring Starts" is a strategy used in Monte Carlo control algorithms for
reinforcement learning. In this context:
o Exploring Starts: every episode begins from a randomly chosen initial state
and a randomly chosen first action (so every state-action pair has a nonzero
chance of starting an episode); after that first step, the current policy is
followed. The idea is to make sure all parts of the environment are visited,
allowing the learning algorithm to gather diverse experience and evaluate
every action in every state.
o Necessity: Monte Carlo control estimates action values from sampled
returns, but a greedy (deterministic) policy would only ever select one
action in each state, so the returns of the other actions would never be
observed. Exploring starts guarantee that every state-action pair is
visited, which is required for the action-value estimates, and hence the
policy improvement step, to cover all actions.
o Monte Carlo Control: a type of reinforcement learning algorithm that
learns to make decisions by simulating many episodes of interaction with
the environment. Here's a simple breakdown:
Episodes: Each interaction with the environment is called an
episode. For example, playing a game from start to finish is an
episode.
Simulation: The algorithm simulates many episodes, following its
current policy (with some exploration) and observing the rewards obtained.
Learning from Experience: Based on the rewards received in
each episode, the algorithm learns which actions lead to better
outcomes and adjusts its decision-making strategy (policy)
accordingly.
Improvement: Over time, Monte Carlo control improves its
policy by learning from the simulated experiences, aiming to
maximize long-term rewards.
o In essence, Monte Carlo control is like learning to play a game by playing
it many times, trying different strategies, and figuring out which ones
work best based on the outcomes.
In simpler terms, exploring starts help the learning algorithm to explore a wide
range of possibilities, avoid biases, and learn robust and flexible policies that
perform well in various situations within the environment.
Simple answer:
o "Exploring Starts" means starting from random situations and trying
random actions in Monte Carlo control.
o This is important because it helps the learning process explore different
possibilities and avoid getting stuck in one path, leading to better overall
learning.
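A compact sketch of Monte Carlo control with exploring starts on a made-up corridor environment (states 0-3, action 0 = left, 1 = right, reward -1 per step, goal at state 3); the environment and constants are illustrative assumptions:
import random
from collections import defaultdict

def step(s, a):
    # toy corridor: action 0 moves left, action 1 moves right; the episode ends at state 3
    s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s2, -1.0, s2 == 3                             # next state, reward, terminal flag

Q = defaultdict(float)
n_visits = defaultdict(int)
policy = {s: random.choice([0, 1]) for s in range(3)}    # arbitrary initial greedy policy

for episode in range(5000):
    # exploring start: a random non-terminal state AND a random first action
    s, a = random.randrange(3), random.choice([0, 1])
    trajectory, terminal = [], False
    while not terminal and len(trajectory) < 100:        # step cap protects against bad early policies
        s2, r, terminal = step(s, a)
        trajectory.append((s, a, r))
        s, a = s2, policy.get(s2, 1)                     # after the first step, follow the current policy
    # every-visit Monte Carlo update of Q, then greedy policy improvement (gamma = 1)
    G = 0.0
    for s_t, a_t, r_t in reversed(trajectory):
        G = r_t + G
        n_visits[(s_t, a_t)] += 1
        Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / n_visits[(s_t, a_t)]
        policy[s_t] = max((0, 1), key=lambda act: Q[(s_t, act)])

print(policy)   # converges to {0: 1, 1: 1, 2: 1}: always move right toward the goal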
5. Examine the following Python code of a reinforcement learning agent (the function
maxQ(Sprime) returns the maximum Q-value in state Sprime over all actions). Is
this a SARSA or Q-learning agent? Why?
while terminal is False:
    A = pi(S)
    Sprime, R, terminal = windy.step(A)
    Q[(S, A)] = Q[(S, A)] + alpha * (R + gamma * maxQ(Sprime) - Q[(S, A)])
    S = Sprime
    step += 1
This is a Q-learning agent, not a SARSA (State-Action-Reward-State-Action)
agent. Here's why:
o Update Rule:
In Q-learning, the Q-value update rule considers the maximum Q-
value of the next state (Sprime) regardless of the action taken in
that state.
This is evident in the line `Q[(S,A)] = Q[(S,A)] + alpha*(R +
gamma * maxQ(Sprime) - Q[(S, A)])`, where `maxQ(Sprime)`
represents the maximum Q-value over all actions in the next state
Sprime.
o On-Policy vs. Off-Policy:
Q-learning is an off-policy method: the update target uses the value of
the greedy action in Sprime, even though the behavior policy pi (e.g.,
epsilon-greedy) may actually choose a different action there.
A SARSA agent would instead select the next action Aprime = pi(Sprime)
first and use Q[(Sprime, Aprime)] in the update, making it on-policy
(see the sketch after this answer).
o Exploration-Exploitation Trade-off:
Q-learning agents often use an epsilon-greedy policy (represented
here as `pi(S)`) to balance exploration (trying new actions) and
exploitation (selecting actions with the highest Q-values) during
learning.
In summary, the code snippet follows the Q-learning algorithm's update rule and
characteristics, making it a Q-learning agent rather than a SARSA agent.
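For contrast, a SARSA version of the same loop might look like the sketch below; it reuses the assumed objects from the question (windy, pi, Q, alpha, gamma, step) and is illustrative rather than runnable on its own:
A = pi(S)                                   # choose the first action before the loop
while terminal is False:
    Sprime, R, terminal = windy.step(A)
    Aprime = pi(Sprime)                     # the action that will actually be taken next
    # on-policy target: uses Q of the action actually selected, not the max over actions
    Q[(S, A)] = Q[(S, A)] + alpha * (R + gamma * Q[(Sprime, Aprime)] - Q[(S, A)])
    S, A = Sprime, Aprime
    step += 1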
6. Describe the advantages of function approximation reinforcement learning over
tabular methods.
Function approximation in reinforcement learning refers to using parameterized
functions, such as neural networks, to approximate value functions or policies
instead of storing values in a table (tabular methods). Here are the advantages of
function approximation over tabular methods:
o Generalization: Function approximation allows the learning algorithm to
generalize from seen states to unseen states. This means the agent can
make reasonable predictions and decisions even in states it has not
encountered before, based on the patterns it has learned.
o Efficiency: With function approximation, the agent can handle large state
or action spaces more efficiently. Tabular methods become impractical
when the state space is large or continuous, as they require storing values
for every possible state-action pair.
o Feature Representation: Function approximation allows for more
flexible feature representation. Instead of using raw state or action values,
the agent can extract meaningful features that capture important
information about the environment, leading to more effective learning.
o Memory Usage: Function approximation typically requires less memory
compared to tabular methods, especially in complex environments with
many states or actions. This makes it more scalable and applicable to real-
world problems.
o Continuous State and Action Spaces: Function approximation naturally
handles continuous state and action spaces, which are common in many
real-world applications. Tabular methods struggle with continuous spaces
due to the sheer number of possible states or actions.
o Transfer Learning: Function approximation facilitates transfer learning,
where knowledge gained from learning one task can be transferred to
related tasks. This is because the learned functions capture general patterns
and can be adapted to new situations more easily than tabular methods.
o Non-linear Relationships: Function approximation can capture non-
linear relationships between states, actions, and rewards. This allows the
agent to learn more complex strategies and policies that may not be
expressible using tabular methods.
In summary, function approximation offers advantages such as generalization,
efficiency, flexible feature representation, scalability, handling of continuous
spaces, transfer learning capabilities, capturing non-linear relationships, and
reduced memory usage compared to tabular methods in reinforcement learning.
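A minimal sketch of the idea in the same Keras style as the earlier questions (the state and action sizes are made-up assumptions): instead of a table indexed by every (state, action) pair, a small network maps a continuous state vector to one Q-value per action, so similar states share parameters and generalize.
from tensorflow.keras import layers, models
import numpy as np

n_state_features, n_actions = 4, 2               # assumed sizes (e.g., a CartPole-like task)
q_net = models.Sequential([
    layers.Input(shape=(n_state_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_actions)                      # one Q(s, a) estimate per action
])
q_net.compile(optimizer="adam", loss="mse")      # trained on TD targets instead of table lookups

state = np.array([[0.1, -0.2, 0.05, 0.0]])       # a continuous state a table could not index
print(q_net.predict(state, verbose=0))           # Q-value estimates for both actions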
7. Describe the concept of a target network in DQN. What is it used for?
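In short, the target network is a delayed copy of the online Q-network: the TD targets r + gamma * max_a Q_target(s', a) are computed from this copy, which is only synchronized with the online network every C steps (or updated slowly), so the regression targets stay stable while the online network is trained. A minimal sketch, assuming a Keras-style online network q_net as in the sketch above; the sync interval and helper are illustrative:
import numpy as np
from tensorflow.keras import models

target_net = models.clone_model(q_net)            # same architecture as the online network
target_net.set_weights(q_net.get_weights())       # start from identical weights

C, gamma = 1000, 0.99                             # assumed sync interval and discount factor

def td_targets(rewards, next_states, dones):
    # r + gamma * max_a Q_target(s', a), with no bootstrap on terminal transitions
    next_q = target_net.predict(next_states, verbose=0)
    return rewards + gamma * np.max(next_q, axis=1) * (1.0 - dones)

# inside the training loop, after every gradient step on q_net:
# if step % C == 0:
#     target_net.set_weights(q_net.get_weights())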
Topic #3
1. Describe in general terms local and global interpretation methods. What are their
main differences?
Local and global interpretation methods are techniques used in machine
learning and data analysis to understand and interpret the behavior and decisions
of machine learning models. Here's an overview of their main differences:
Local Interpretation Methods
o Focus: These methods focus on understanding the predictions or decisions
of a machine learning model for a specific instance or data point.
o Purpose: They aim to explain why a model made a particular prediction
or decision for a single input or a small subset of inputs.
o Examples: Local interpretation methods include techniques like LIME
(Local Interpretable Model-agnostic Explanations), SHAP (SHapley
Additive exPlanations), and feature importance based on local gradients.
o Use Case: Useful for understanding model behavior at the individual
level, such as explaining why a certain image was classified as a cat or
why a specific loan application was approved.
Global Interpretation Methods
o Focus: These methods focus on understanding the overall behavior and
workings of a machine learning model across the entire dataset or a large
portion of it.
o Purpose: They aim to provide insights into how the model as a whole
makes predictions, identifies patterns, and generalizes to new data.
o Examples: Global interpretation methods include impurity-based feature
importance for tree models (e.g., Gini importance), model-agnostic
permutation importance computed over the whole dataset, partial
dependence plots, and visualizations of decision boundaries.
o Use Case: Useful for gaining a broader understanding of the model's
strengths, weaknesses, and generalization capabilities, such as identifying
which features are most influential in predicting a target variable across
the dataset.
In summary, the main differences between local and global interpretation methods
lie in their focus (specific instances vs. overall model behavior) and purpose
(explaining individual predictions vs. understanding model behavior at scale).
Both types of methods are valuable for different aspects of model interpretation
and can be used together to gain a comprehensive understanding of machine
learning models.
Simple Answer
o Local interpretation methods focus on explaining individual predictions of
a model, while global interpretation methods aim to understand the overall
behavior and patterns of a model across the entire dataset or a large
portion of it.
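A minimal sketch of one common global method, permutation feature importance with scikit-learn (the dataset and model are illustrative choices); a local method such as LIME or SHAP would instead explain a single prediction:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# permutation importance: how much the test score drops when each feature is shuffled,
# summarizing the model's behavior over the whole test set (a global view)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(i, round(result.importances_mean[i], 4))   # the five globally most influential features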
2. To ensure interpretability and also to avoid overfitting, decision trees can be kept
“shallow” by stopping further divisions either based on tree depth or number of
observations in each leaf node. Alternatively, a fully developed tree may be pruned
through cost-complexity-pruning. Describe in general principles what this is and
how it works.
Cost-complexity pruning is a technique used in decision tree algorithms, like
CART (Classification and Regression Trees), to prevent overfitting and improve
model interpretability by simplifying the tree structure. Here's how it works in
general principles:
o Building the Full Tree
Initially, a decision tree algorithm builds a full and complex tree by
recursively splitting nodes based on features to minimize impurity
or maximize information gain.
This process continues until each leaf node is pure or meets a
predefined stopping criterion.
o Calculate the Cost-Complexity Measure
After building the full tree, a cost-complexity measure is computed for
each candidate subtree T: R_alpha(T) = R(T) + alpha * |T|, where R(T) is
the subtree's total training error (or impurity), |T| is its number of
leaf nodes, and alpha >= 0 is a penalty on tree size.
o Cost-Complexity (Weakest-Link) Pruning
Starting from the full tree, the algorithm repeatedly collapses the
internal node whose removal increases the training error the least per
leaf removed (the "weakest link").
Pruning a node converts it into a leaf and assigns it the majority class
(for classification) or the average target value (for regression) of the
samples in that node.
This yields a nested sequence of progressively smaller trees, each of
which is optimal for some range of alpha.
o Determine Optimal Tree Complexity
During pruning, a tuning parameter (usually denoted as alpha or
ccp_alpha) controls the level of pruning.
Alpha = 0 keeps the fully grown tree; small values of alpha retain
more nodes, resulting in more complex trees.
Larger values of alpha penalize tree size more strongly and thus
prune more aggressively, leading to simpler trees with fewer nodes.
Cross-validation or validation set performance is often used to
determine the optimal value of alpha that balances model
complexity and predictive accuracy.
o Final Pruned Tree
After pruning according to the optimal alpha value, the final
pruned decision tree is obtained. This tree is typically simpler than
the full tree and contains fewer nodes, making it easier to interpret
and less prone to overfitting.
In summary, cost-complexity pruning works by systematically removing nodes
from a fully developed decision tree based on a cost-complexity measure, leading
to a simpler and more interpretable tree structure while maintaining good
predictive performance.
Simple Answer
o Cost-complexity pruning simplifies a fully grown decision tree by
repeatedly collapsing the branches that add the least accuracy relative to
their size, using a penalty parameter alpha (usually chosen by
cross-validation) to balance interpretability and predictive accuracy.
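A minimal scikit-learn sketch of this procedure (the dataset is an illustrative choice): grow a full tree, compute the candidate alpha values with cost_complexity_pruning_path, pick the alpha with the best cross-validated score, and refit a pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# grow the full tree, then get the candidate alpha values from weakest-link pruning
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)                   # guard against tiny negative round-off values
    score = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                            X_train, y_train, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

# refit the pruned tree with the chosen alpha; it has far fewer leaves than the full tree
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))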