MIT 6.0002 Introduction To Computational Thinking and Data Science Notes

The document covers multiple lectures on optimization problems, algorithms, data visualization, and experimental data analysis. Key topics include the knapsack problem, greedy algorithms, dynamic programming, graph algorithms, Monte Carlo simulations, and confidence intervals. The lectures emphasize the importance of understanding data variability and the use of simulations and visualizations in analyzing complex data sets.

Lecture 1: Optimization Problems and Algorithms

Introduction

The lecture focuses on optimization problems and algorithms, specifically the knapsack
problem.

Optimization problems are about finding the best solution from a set of solutions.

The knapsack problem is a classic example of an optimization problem.

The Knapsack Problem

Definition

You have a knapsack with a weight limit (W) and a set of items, each with a weight and a value.
The goal is to maximize the value of the items you can carry without exceeding the weight limit.

Example

You have a calorie limit of 1,500 and a list of foods with their respective calorie values.
The challenge is to pick the foods that maximize your happiness (value) without exceeding the
calorie limit.

Formalization

Each item is represented by a pair (value, weight).

The knapsack can accommodate items with a total weight of no more than W.

A vector L of length n represents the set of available items.

Another vector V of length n indicates, for each item, whether it was taken (V[i] = 1) or not (V[i] = 0).

Algorithms to Solve the Knapsack Problem

Brute Force

Enumerate all possible combinations of items.

Remove combinations that exceed the weight limit.

Choose the combination with the highest value.


Time complexity: exponential, since there are 2^n possible subsets to enumerate (not practical for large n).
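
For concreteness, a minimal brute-force sketch (not the lecture's code); it assumes items are (value, weight) pairs, matching the formalization above, and enumerates every subset:

python
from itertools import combinations

def brute_force_knapsack(items, max_weight):
    best_value, best_combo = 0, ()
    for r in range(len(items) + 1):
        for combo in combinations(items, r):      # enumerate every subset of size r
            weight = sum(w for _, w in combo)
            if weight <= max_weight:              # discard over-weight subsets
                value = sum(v for v, _ in combo)
                if value > best_value:
                    best_value, best_combo = value, combo
    return best_combo, best_value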

Greedy Algorithm

While the knapsack is not full, put the best available item into it.

"Best" can be defined in various ways: highest value, lowest weight, or highest value-to-weight
ratio.

Time complexity: O(n log n) (practical for large n).

Greedy Algorithm Implementation

The implementation uses a key function to define what "best" means.

The items are sorted based on the key function, and then the algorithm iterates through the
sorted list to fill the knapsack.

python
def greedy(items, max_cost, key_function):
    items_copy = sorted(items, key=key_function, reverse=True)
    result = []
    total_value, total_cost = 0.0, 0.0
    for i in range(len(items_copy)):
        if (total_cost + items_copy[i].get_cost()) <= max_cost:
            result.append(items_copy[i])
            total_cost += items_copy[i].get_cost()
            total_value += items_copy[i].get_value()
    return (result, total_value)

Testing the Greedy Algorithm

The algorithm was tested with different key functions: by value, by cost, and by density
(value-to-weight ratio).

The results varied depending on the key function used, demonstrating that the greedy algorithm
can get stuck in local optima.

Limitations of Greedy Algorithms

Greedy algorithms make a series of local optimizations, which may not lead to a global
optimum.
Different definitions of "best" can lead to different solutions, and there's no guarantee that any
will be optimal.

Conclusion

While greedy algorithms are efficient, they do not guarantee finding the optimal solution.
For the knapsack problem, and many other optimization problems, no known algorithm provides an exact solution with a non-exponential worst-case running time.
These notes should provide a comprehensive understanding of the lecture's content, preparing
you for any tests on the topic.

Lecture 2: Dynamic Programming

Introduction

The lecture focuses on the concept of Dynamic Programming, a technique used to solve
optimization problems.

Dynamic Programming is particularly useful for problems that exhibit optimal substructure and
overlapping subproblems.

Optimal Substructure and Overlapping Subproblems

Optimal Substructure: A globally optimal solution can be found by combining optimal solutions
to local subproblems.

Overlapping Subproblems: Finding an optimal solution involves solving the same subproblems
multiple times.

Fibonacci Numbers: A Case Study

Fibonacci numbers are used to demonstrate the concept of Dynamic Programming.


The naive recursive implementation of Fibonacci numbers is inefficient because it recomputes
the same subproblems multiple times.

Memoization

Memoization is the technique of storing the results of expensive function calls and returning the
cached result when the same inputs occur again.

The Fibonacci function is optimized using memoization, resulting in a significant speedup.

python
def fastFib(n, memo={}):
    # memo caches previously computed Fibonacci values
    if n in memo:
        return memo[n]
    if n <= 1:
        return 1
    memo[n] = fastFib(n-1, memo) + fastFib(n-2, memo)
    return memo[n]

Knapsack Problem Revisited

The knapsack problem also exhibits optimal substructure and overlapping subproblems.
The naive recursive solution is inefficient for the same reasons as the Fibonacci function.
Dynamic Programming Solution

A dynamic programming solution to the knapsack problem is introduced, which also uses
memoization.

The key to the memo is a tuple containing the number of items left to be considered and the
available weight.

python
def fastMaxVal(toConsider, avail, memo={}):
    if (len(toConsider), avail) in memo:
        return memo[(len(toConsider), avail)]
    # ... (rest of the code is similar to the naive solution)
    memo[(len(toConsider), avail)] = result
    return result
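
For reference, here is one way the elided body might look, following the branch-and-memoize structure the lecture describes; it assumes items expose get_cost() and get_value() methods, as in the greedy example above, and returns a (total value, tuple of items taken) pair:

python
def fastMaxVal(toConsider, avail, memo={}):
    # Memo key: (number of items left to consider, available weight)
    if (len(toConsider), avail) in memo:
        result = memo[(len(toConsider), avail)]
    elif toConsider == [] or avail == 0:
        result = (0, ())
    elif toConsider[0].get_cost() > avail:
        # First item doesn't fit; explore the right branch only
        result = fastMaxVal(toConsider[1:], avail, memo)
    else:
        next_item = toConsider[0]
        # Left branch: take next_item
        with_val, with_to_take = fastMaxVal(toConsider[1:],
                                            avail - next_item.get_cost(), memo)
        with_val += next_item.get_value()
        # Right branch: skip next_item
        without_val, without_to_take = fastMaxVal(toConsider[1:], avail, memo)
        if with_val > without_val:
            result = (with_val, with_to_take + (next_item,))
        else:
            result = (without_val, without_to_take)
    memo[(len(toConsider), avail)] = result
    return result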

Performance Comparison

The number of recursive calls in the dynamic programming solution grows much more slowly
compared to the naive solution.

For example, for a problem size of 40, the naive solution requires over 1 trillion calls, while the
dynamic programming solution requires only 43,000 calls.

Summary

Many practical problems can be formulated as optimization problems.

Greedy algorithms often provide adequate but not optimal solutions.


Dynamic programming can yield optimal solutions and is often much faster than naive methods.

The key to dynamic programming is to break the problem down into smaller subproblems, solve
each subproblem only once, and store its answer in a memo.

Additional Notes

The lecture also touches upon the history of dynamic programming and its inventor, Richard
Bellman.

The term "dynamic programming" was chosen to conceal the mathematical nature of the work,
as it was funded by a part of the Defense Department that did not approve of mathematics.

Homework/Exercise

An optimization problem related to rolling over problem set grades into a quiz is provided in the
PowerPoint slides for further exploration.

These notes should provide a comprehensive overview of the lecture and serve as a useful
study guide for tests or further understanding of dynamic programming.

Lecture 3: Graphs and Graph Algorithms


Introduction
Graphs are a powerful way to represent networks.
Nodes represent entities, and edges represent relationships between them.
Graphs can be directed or undirected.
Graphs can be weighted or unweighted.
Graph Representation
Adjacency List: A list where each node is associated with a list of its neighbors.
Adjacency Matrix: A 2D array where the cell at (i, j) is 1 if there is an edge from node i to node j;
otherwise, 0.
Python Code for Graph Representation
python
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []

    def addEdge(self, node):
        self.edges.append(node)

class Graph:
    def __init__(self):
        self.nodes = []

    def addNode(self, node):
        self.nodes.append(node)
Depth-First Search (DFS)
Start at the source node.
Follow the first edge until you reach the destination or run out of options.
If stuck, backtrack to the previous node and try another edge.
Avoid loops by keeping track of visited nodes.
Python Code for DFS
python
def DFS(graph, start, end, path, shortest):
    path = path + [start]
    if start == end:
        return path
    for node in start.edges:
        if node not in path:
            newPath = DFS(graph, node, end, path, shortest)
            if newPath:
                if not shortest or len(newPath) < len(shortest):
                    shortest = newPath
    return shortest
Breadth-First Search (BFS)
Start at the source node.
Explore all nodes at the current depth before moving on to the next level.
Use a queue to keep track of nodes to be explored.
Python Code for BFS
python
def BFS(graph, start, end):
    initPath = [start]
    pathQueue = [initPath]
    while len(pathQueue) != 0:
        tmpPath = pathQueue.pop(0)
        lastNode = tmpPath[-1]
        if lastNode == end:
            return tmpPath
        for linkNode in lastNode.edges:
            if linkNode not in tmpPath:
                newPath = tmpPath + [linkNode]
                pathQueue.append(newPath)
Weighted Shortest Path
Depth-First Search can easily be modified to minimize the total edge weight of a path instead of the number of edges (see the sketch below).
Breadth-First Search is not easily adaptable for weighted graphs, because the first path it finds is shortest only when every edge counts equally.
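A minimal sketch of that modification, assuming each node stores its outgoing edges as (neighbor, weight) pairs rather than the bare neighbor list used in the Node class above:
python
def weightedDFS(start, end, path=(), pathWeight=0, best=None):
    # path is a tuple of nodes; best is a (total weight, path) pair or None
    path = path + (start,)
    if start == end:
        return (pathWeight, path)
    for node, weight in start.edges:          # assumes edges are (node, weight) pairs
        if node not in path:                  # avoid cycles
            candidate = weightedDFS(node, end, path, pathWeight + weight, best)
            if candidate and (best is None or candidate[0] < best[0]):
                best = candidate
    return best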
Summary
Graphs are versatile data structures for representing networks.
Depth-First and Breadth-First are two fundamental algorithms for graph traversal.
These algorithms can be adapted for various optimization problems.
These notes cover the key points from the lecture and include Python code snippets for
implementing graph algorithms. They should be useful for preparing for a test on this topic.

Lecture 4: Understanding Experimental Data - Simulations and Probability


Introduction
The lecture focuses on understanding experimental data through simulations and probability.
It covers the basics of probability, Monte Carlo simulations, and real-world applications like the
birthday problem.
Probability Basics
Probability is a measure of the likelihood of an event occurring.
It is expressed as a number between 0 and 1.
The probability of an event A happening is denoted as P(A).
Counting Argument
The counting argument is a way to calculate probabilities.
It involves counting the number of successful outcomes and dividing it by the total number of
possible outcomes.
Example: Rolling a die to get a 1. P(1) = 1/6
Monte Carlo Simulations
Monte Carlo simulations are used to estimate probabilities by running experiments multiple
times.
They are particularly useful when it is difficult to calculate probabilities analytically.
Example: Estimating the probability of a football team losing.
Code Structure for Simulations
Initialize some variables.
Run a certain number of trials.
Each trial involves some operation (e.g., rolling a die).
Sum up the results of the trials.
Divide the sum by the number of trials to get an estimate.
python
# Example code for a simple simulation (made runnable here: roll_die is a
# hypothetical helper that rolls a fair die n times and returns the results as a string)
import random

def roll_die(n):
    return ''.join(random.choice('123456') for _ in range(n))

goal = '11111'            # the lecture's example goal: five ones in a row
num_trials = 1000
total = 0
for _ in range(num_trials):
    result = roll_die(len(goal))
    if result == goal:
        total += 1
estimated_prob = total / num_trials
Birthday Problem
The birthday problem asks what the probability is of at least two people in a group having the
same birthday.
The probability can be calculated analytically, but it is easier to use a simulation.
Code for Birthday Problem
The code involves running a simulation where birthdays are randomly assigned to people in a
group.
The number of trials is specified, and the results are summed up to estimate the probability.
python
# Example code for the birthday problem (made runnable here: has_shared_birthday
# is a hypothetical helper that assigns uniformly random birthdays)
import random

def has_shared_birthday(num_people):
    birthdays = [random.randint(1, 366) for _ in range(num_people)]
    return len(set(birthdays)) < num_people

num_people = 100
num_trials = 1000
total = 0
for _ in range(num_trials):
    if has_shared_birthday(num_people):
        total += 1
estimated_prob = total / num_trials
Adjusting for Real-world Data
Birthdays are not uniformly distributed in reality.
The simulation can be adjusted to account for this by changing the probability of each date
being chosen.
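One hedged way to do that: draw birthdays with random.choices and a non-uniform weight per day (the weights below are invented purely for illustration):
python
import random

# Hypothetical, non-uniform weights: make some days slightly more likely than others
weights = [1.0] * 366
for day in range(200, 250):           # illustrative only: weight these days 10% higher
    weights[day] = 1.1

def has_shared_birthday_weighted(num_people):
    birthdays = random.choices(range(366), weights=weights, k=num_people)
    return len(set(birthdays)) < num_people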
When to Use Simulations
When the system is mathematically intractable.
To extract intermediate results.
To play "what-if" games by refining the model.
Conclusion
Simulations are a powerful tool for understanding experimental data.
They are particularly useful for estimating probabilities in complex systems.
However, it's important to remember that all models are approximations of reality.
These notes should provide a comprehensive overview of the lecture and serve as a useful
study guide for tests.

Lecture 5: Random Walks and Data Visualization in Python


Introduction
The lecture focuses on the use of Python libraries for data visualization, particularly Matplotlib
and PyLab.
The aim is to understand how to visualize data to gain insights, especially in the context of
simulating random walks.
Matplotlib and PyLab
Matplotlib is a plotting library for Python that allows you to create a wide variety of plots and
figures.
PyLab combines the functionalities of Matplotlib, NumPy, and other libraries to provide a
MATLAB-like interface in Python.
The plot function is used to create plots. It takes two sequences as arguments: x-coordinates
and y-coordinates.
The function has numerous optional arguments to customize the plot, such as color, line style,
and labels.
python
import pylab
xVals = [1, 2, 3]
yVals1 = [1, 2, 3]
yVals2 = [1, 4, 9]
pylab.plot(xVals, yVals1, 'b-', label='first')
pylab.plot(xVals, yVals2, 'r:', label='second')
pylab.legend()
Importance of Plotting
Plotting is a powerful tool for understanding data.
It allows you to visualize trends, compare different data sets, and gain insights that might not be
apparent from raw data alone.
Simulating Random Walks with Wormholes
The lecture introduces a new type of field called OddField, which has wormholes that teleport
the drunk to a different location.
The OddField class is a subclass of the Field class and overrides the moveDrunk method to
include wormhole functionality.
python
# Field and Location are classes defined earlier in the lecture's random walk code;
# random is the standard library module
class OddField(Field):
    def __init__(self, numWormholes=1000, xRange=100, yRange=100):
        Field.__init__(self)
        self.wormholes = {}
        for w in range(numWormholes):
            x, y = random.randint(-xRange, xRange), random.randint(-yRange, yRange)
            newX, newY = random.randint(-xRange, xRange), random.randint(-yRange, yRange)
            newLoc = Location(newX, newY)
            self.wormholes[(x, y)] = newLoc

    def moveDrunk(self, drunk):
        Field.moveDrunk(self, drunk)
        x = self.drunks[drunk].getX()
        y = self.drunks[drunk].getY()
        if (x, y) in self.wormholes:
            self.drunks[drunk] = self.wormholes[(x, y)]
Plotting in Different Styles
The lecture shows how to use different styles for plotting, such as markers, line styles, and
colors.
A styleIterator is used to cycle through a predefined set of styles for different plots.
python
import itertools

styles = ['m-', 'b--', 'g-.']
styleIterator = itertools.cycle(styles)   # next(styleIterator) returns the next style, cycling repeatedly
Customizing Plots
You can customize various aspects of the plot, such as line width, title size, axis labels, and
legend position.
PyLab's rcParams can be used to set these parameters globally.
python
pylab.rcParams['lines.linewidth'] = 4
pylab.rcParams['axes.titlesize'] = 20
Summary
The lecture emphasizes the importance of data visualization for understanding complex data
sets.
It also demonstrates how to build simulations incrementally, starting with simple models and
adding complexity step by step.
The use of sanity checks to validate the simulation is highlighted.
Finally, the lecture provides a comprehensive overview of plotting in Python, encouraging
students to explore and customize plots to suit their needs.
Additional Resources
Two recommended online sites for learning more about plotting are mentioned, but not specified
in the transcript.
A 50-minute video by Professor Grimson on plotting in PyLab is also recommended for further
learning.
These notes should provide a comprehensive overview of the lecture and serve as a useful
study guide for tests.

Lecture 6: Understanding Experimental Data - Confidence Intervals and Distributions


Introduction
The lecture focuses on understanding experimental data, specifically dealing with confidence
intervals and distributions.
The lecture is part of the MIT OpenCourseWare and is presented by Professor John Guttag.
Variability in Data
When dealing with experimental data, it's important to consider the variability in the underlying
possibilities.
Variability can be understood through two key concepts: Variance and Standard Deviation.
Variance
Variance (σ^2) is a measure of how far each number in the data set is from the mean (μ).
Formula:
Variance = Σ((x - μ)^2) / N
Σ = Summation
x = Each data point
μ = Mean of the data set
N = Number of data points
Variance is useful for understanding the distribution of data points around the mean.
Standard Deviation
Standard Deviation (σ) is the square root of the variance.
Formula:
Standard Deviation = sqrt(Variance)
Standard Deviation is more interpretable as it is in the same unit as the data.
Why Square the Differences?
Squaring eliminates the sign, making it irrelevant whether the data point is above or below the
mean.
Squaring emphasizes outliers.
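As a small illustration (not from the lecture), the two formulas translate directly into code:
python
def variance(data):
    mean = sum(data) / len(data)
    return sum((x - mean)**2 for x in data) / len(data)

def std_dev(data):
    return variance(data)**0.5

print(std_dev([1, 2, 3, 4, 5]))   # about 1.414 (population standard deviation)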
Confidence Intervals
Instead of estimating an unknown parameter by a single value (e.g., mean), it's better to provide
a confidence interval.
A confidence interval is a range that is likely to contain the unknown value.
Confidence intervals are usually calculated using the empirical rule.
Empirical Rule
Under certain assumptions, the empirical rule states:
68% of the data will be within one standard deviation of the mean.
95% will be within 1.96 standard deviations.
99.7% will be within three standard deviations.
The empirical rule is often used to calculate 95% confidence intervals.
Assumptions for Empirical Rule
Mean estimation error is 0.
Distribution of errors is normal.
Probability Distributions
Captures the notion of the relative frequency with which a random variable takes on different
values.
Two types: Discrete and Continuous.
Discrete Distributions
Values are drawn from a finite set.
Example: Coin flip (Heads or Tails).
Continuous Distributions
Values are drawn from a set of reals between two numbers.
Probability Density Function (PDF) is used to describe the distribution.
Normal Distribution
Also known as Gaussian distribution.
Symmetric around the mean.
Described by a specific function involving the number e.
Standard normal distribution has a mean of 0 and a standard deviation of 1.
Conclusion
Understanding the variability in data is crucial for interpreting experimental results.
Confidence intervals provide a range in which an unknown parameter is likely to lie, giving us a
measure of the reliability of our estimates.
The empirical rule and probability distributions are key concepts for understanding and
interpreting data.
Questions and Discussion
The lecture also involved questions from the audience, clarifying doubts about the gambler's
fallacy, the empirical rule, and the assumptions behind it.
These notes should provide a comprehensive overview of the lecture and serve as a useful
study guide for understanding experimental data, confidence intervals, and distributions.

Lecture 7: Monte Carlo Simulation, Central Limit Theorem, and Confidence Intervals
Introduction
The lecture covers the concept of Monte Carlo Simulation, Central Limit Theorem, and
Confidence Intervals.
Monte Carlo Simulation is a statistical technique that allows you to account for risk in
quantitative analysis and decision making.
The Central Limit Theorem (CLT) states that the distribution of sample means approximates a
normal distribution as the sample size gets larger.
Confidence Intervals give an estimated range of values which is likely to include an unknown
population parameter.
Central Limit Theorem (CLT)
The CLT is a statistical theory that states that given a sufficiently large sample size from a
population with a finite level of variance, the mean of all samples from the same population will
be approximately equal to the mean of the population.
The CLT allows us to use the empirical rule when computing confidence intervals.
Three Key Points of CLT
The means of the samples in a set of samples (the sample means) will be approximately
normally distributed.
This distribution will have a mean that is close to the mean of the population.
The variance of the sample means will be close to the variance of the population divided by the
sample size.
Verifying CLT with Simulations
The lecture demonstrates the CLT through simulations involving rolling dice and spinning a
roulette wheel.
The simulations show that as the sample size increases, the distribution of sample means
becomes more normal, even if the original distribution is not normal.
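A minimal sketch of such a check (not the lecture's code): compare the spread of individual die rolls with the spread of means of samples of rolls:
python
import random, statistics

def sample_means(sample_size, num_samples=10000):
    # Each sample is `sample_size` die rolls; record the mean of each sample
    return [statistics.mean(random.randint(1, 6) for _ in range(sample_size))
            for _ in range(num_samples)]

for size in (1, 10, 100):
    means = sample_means(size)
    print(size, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
# The mean stays near 3.5, while the spread of the sample means shrinks as sample size grows.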
Monte Carlo Simulation
Monte Carlo Simulation is a mathematical technique that allows you to account for risk in
quantitative analysis and decision making.
It is used in various fields like finance, engineering, supply chain, and science.
The lecture shows how Monte Carlo Simulation can be used to estimate the value of Pi.
Estimating Pi using Monte Carlo
A circle is inscribed in a square. Random points are generated within the square.
The ratio of the number of points that fall within the circle to the total number of points should
approximate the ratio of the two areas, which can be used to estimate Pi.
The formula used is Pi = 4 * (points in circle / total points)
Code for Monte Carlo Simulation to Estimate Pi
python
import random

def throwNeedles(numNeedles):
    # Drop random points in the unit square; count those inside the quarter circle
    inCircle = 0
    for Needles in range(1, numNeedles + 1):
        x = random.random()
        y = random.random()
        if (x*x + y*y)**0.5 <= 1.0:
            inCircle += 1
    return 4 * (inCircle / float(numNeedles))
Confidence Intervals
Confidence intervals give an estimated range of values likely to include an unknown population
parameter.
The lecture shows that while the simulation can give us a confidence interval, it does not
guarantee that the simulation model is correct.
It's essential to perform sanity checks to ensure the simulation model's validity.
Common Pitfalls
The lecture warns against the mistake of assuming that a statistically valid result is the same as
a true result.
A simulation can be statistically valid but still produce incorrect results due to bugs or incorrect
assumptions in the model.
Summary
The Central Limit Theorem is a powerful tool that allows us to make statistical inferences about
the population mean.
Monte Carlo Simulation is a versatile technique for approximating solutions to problems that
may not be easy to solve analytically.
Confidence intervals provide a range within which we can expect the true population parameter
to lie, but they do not validate the model itself.
Important Takeaways
Always validate your simulation model.
Understand the limitations of statistical methods.
Use the Central Limit Theorem and Monte Carlo Simulation in conjunction to make more
accurate predictions and decisions.
These notes should provide a comprehensive overview of the lecture and serve as a useful
study guide for understanding Monte Carlo Simulation, Central Limit Theorem, and Confidence
Intervals.

Lecture 8: Understanding Standard Error and Confidence Intervals


Introduction
The lecture focuses on understanding the concept of standard error and how it is used to
generate confidence intervals.
It also delves into the importance of sample size and skew in the population for accurate
estimations.
Confidence Intervals
Confidence intervals are used to estimate the range within which a population parameter lies.
If confidence intervals don't overlap, we can conclude that the means are statistically
significantly different.
If they do overlap, further investigation is needed.
Error Bars
Error bars can be plotted using pylab.errorbar.
As the sample size increases, the error bars get smaller, providing more confidence in the
estimate.
Standard Error of the Mean
Standard error is used to estimate the standard deviation of the sample means.
Formula: Standard Error (SE) = σ / sqrt(n), where σ is the population standard deviation and n is the sample size.
Standard error provides a good approximation of the standard deviation of the sample means.
Testing Standard Error
The standard error of the mean closely tracks the standard deviation of the sample means.
This allows us to estimate the standard deviation using a single sample, thereby saving
computational resources.
Standard Deviation vs. Standard Error
Standard deviation requires many samples to compute the variation.
Standard error can be computed using a single sample and provides an approximation close to
the standard deviation.
The Catch: Population Standard Deviation
The formula for standard error includes the population standard deviation, which is usually
unknown.
The sample standard deviation can be used as an approximation for the population standard
deviation.
Skew and Sample Size
Skew is a measure of the asymmetry of a probability distribution.
The more skew you have, the more samples you'll need for a good approximation.
Sample size should be chosen based on an estimate of the skew in the population.
Does Population Size Matter?
Surprisingly, the size of the population does not significantly affect the sample size needed for a
good approximation.
Practical Application
Choose a sample size based on some estimate of skew in the population.
Choose a random sample from the population.
Compute the mean and standard deviation of that sample.
Use the standard deviation of that sample to estimate the standard error.
Use the estimated standard error to generate confidence intervals around the sample mean.
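A hedged sketch of those steps, assuming the population is simply a list of numbers:
python
import random, statistics

def estimate_mean(population, sample_size):
    sample = random.sample(population, sample_size)        # one random sample
    sample_mean = statistics.mean(sample)
    sample_std = statistics.stdev(sample)                  # stand-in for the population sigma
    std_error = sample_std / sample_size**0.5
    # 95% confidence interval via the empirical rule
    return sample_mean, (sample_mean - 1.96*std_error, sample_mean + 1.96*std_error)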
Summary
Standard error is a crucial concept for generating confidence intervals.
It allows for efficient computation and provides a good approximation for the standard deviation.
The choice of sample size is influenced by the skew in the population.
Key Takeaways
Standard error provides a way to estimate the standard deviation of a population using a single
sample.
The choice of sample size should be influenced by the skew in the population.
Standard error allows for the generation of confidence intervals, providing a range within which
the population parameter is likely to lie.
These notes should provide a comprehensive understanding of the lecture, aiding in preparation
for any tests on the topic.

Lecture 9: Understanding Experimental Data - Part 2


Topics Covered:
Linear Regression
Goodness of Fit
Coefficient of Determination (R-Squared)
Polynomial Fits
Linear Regression Continued
Spring Constant Example
In the spring example, distance is plotted against force, so the slope of the fitted line is the difference
in distance over the difference in force.
The spring constant is the reciprocal of that slope.
Python code (a degree-1 polynomial fit) is used to fit the line and print out the value of the slope and intercept.
Goodness of Fit
How to Measure?
The first question is how to measure the goodness of fit.
Two ways to ask this question:
Relative to each other, which model is better?
In an absolute sense, how do we know where the best solution is?
Average Squared Error
One way to measure goodness of fit is by looking at the average squared error.
However, this measure is not scale-independent.
Coefficient of Determination (R-Squared)
What is R-Squared?
R-Squared is intended to capture what portion of the variability in the data is accounted for by
the model.
It is a scale-independent measure.
R-Squared is always between 0 and 1.
Interpretation
R-Squared = 1: Model explains all of the variability in the data.
R-Squared = 0: No relationship between the values predicted by the model and the actual data.
R-Squared = 0.5: Capturing about half the variability.
Python Code for R-Squared
Python code can be used to calculate R-Squared.
The code takes in a set of observed values and a set of predicted values.
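One way such a function might look, as a sketch (observed and predicted are assumed to be sequences of numbers):
python
import numpy as np

def r_squared(observed, predicted):
    # 1 - (mean squared error / variance of the observed data)
    observed, predicted = np.array(observed), np.array(predicted)
    mean_error = ((predicted - observed)**2).sum() / len(observed)
    return 1 - mean_error / np.var(observed)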
Polynomial Fits
Generating Fits
A function genFits can take a set of x values, y values, and a list of degrees of models to fit.
It returns a set of models for each degree.
Testing Fits
Another function testFits takes the models from genFits and plots them.
It also computes the R-Squared value for each fit.
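Hedged sketches of genFits and testFits as described above (using pylab.polyfit and pylab.polyval, plotting omitted, and reusing the r_squared sketch from the previous snippet):
python
import pylab

def genFits(xVals, yVals, degrees):
    # Fit one polynomial model per requested degree
    return [pylab.polyfit(xVals, yVals, d) for d in degrees]

def testFits(models, degrees, xVals, yVals):
    for model, d in zip(models, degrees):
        predicted = pylab.polyval(model, xVals)
        print('Degree', d, 'R-squared =', round(r_squared(yVals, predicted), 4))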
Example with Different Degrees
The lecturer shows an example where he fits the data with polynomials of different degrees.
R-Squared values for quadratic, quartic, and higher-degree polynomials are compared.
A 16th-degree polynomial has a very high R-Squared value, but that doesn't necessarily mean
it's the best fit.
Key Takeaways
Linear regression can be used to fit data with different types of models.
R-Squared is a useful measure for comparing the goodness of fit between different models.
Just because a model has a high R-Squared value doesn't mean it's the best fit for the data.
Note: The lecture ends with a cliffhanger, leaving the question of whether a high-degree
polynomial is always the best fit for the data to be answered in the next lecture.

Lecture 10: Cross-Validation and Model Complexity


Introduction
The lecture focuses on the concept of model complexity and the importance of cross-validation.
The main question: How do we know which model to use for a given dataset?
The lecture covers the dangers of overfitting and underfitting.
Overfitting and Underfitting
Overfitting: When the model fits the noise in the data rather than the underlying process.
Underfitting: When the model is too simple to capture the underlying process.
The goal is to find a balance between complexity and simplicity.
Dangers of Overfitting
Example: fitting a quadratic model to data generated by a linear function, with a small amount of noise added to one point.
A small variation in one data point can cause a large variation in the quadratic model's predictions on new data.
In the presence of noise, the predictive ability of the first-order model is better than that of the second-order model.
Finding the Right Model
Start with a low-order model.
Look at the R-squared value and how well it accounts for new data.
Increase the order of the model and repeat the process.
Stop when the R-squared value starts to fall off.
Cross-Validation
Used to guide the choice of model complexity.
Two types:
Leave-One-Out Cross-Validation: For small datasets.
K-Fold Cross-Validation: For larger datasets.
Leave-One-Out Cross-Validation
For each trial, drop out one sample from the dataset.
Build the model on the remaining data.
Test the model on the dropped sample.
Average the results.
K-Fold Cross-Validation
Divide the dataset into K equal-sized chunks.
Leave one chunk out and build the model on the remaining data.
Test the model on the left-out chunk.
Average the results.
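A hedged sketch of k-fold cross-validation for a polynomial fit of a given degree, assuming pylab-style fitting as in the earlier lectures:
python
import pylab

def k_fold_r_squared(xVals, yVals, degree, k):
    # Split the data into k chunks; each chunk takes one turn as the test set
    xVals, yVals = pylab.array(xVals), pylab.array(yVals)
    fold_size = len(xVals) // k
    scores = []
    for i in range(k):
        test = slice(i*fold_size, (i+1)*fold_size)
        train_x = pylab.concatenate((xVals[:test.start], xVals[test.stop:]))
        train_y = pylab.concatenate((yVals[:test.start], yVals[test.stop:]))
        model = pylab.polyfit(train_x, train_y, degree)
        predicted = pylab.polyval(model, xVals[test])
        mean_error = ((predicted - yVals[test])**2).sum() / len(predicted)
        scores.append(1 - mean_error / pylab.var(yVals[test]))   # R-squared on this fold
    return sum(scores) / k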
Repeated Random Sampling
Randomly select a subset of the data for the test set.
Use the remainder as the training set.
Build the model on the training set.
Apply the model to the test set.
Example: Modeling Mean Daily High Temperature
Data from 1961 to 2015.
Four different models: Linear, Quadratic, Cubic, Quartic.
Linear fit had the highest R-squared value and smallest deviation, making it the best fit.
Importance of Multiple Trials
Running multiple trials gives statistics on those trials as well as statistics within each trial.
Helps in avoiding misleading conclusions based on a single trial.
Summary
Linear regression can be used to fit a curve to 2D, 3D, or higher-dimensional data.
R-squared value is used to measure the goodness of fit.
Cross-validation helps in selecting the simplest model that accounts for the data and predicts
new data effectively.
Key Takeaways
Be cautious of overfitting and underfitting.
Use cross-validation to determine the best model.
Run multiple trials to get a more accurate measure of the model's performance.
These notes should provide a comprehensive overview of the lecture and prepare you for any
tests on the topic.

Lecture 11: Machine Learning - Feature Engineering, Metrics, and Classifiers


Introduction
The lecture focuses on feature engineering, metrics, and classifiers in machine learning.
The goal is to understand how to measure distances between examples, what features to use,
and what constraints to put on the model.
Feature Engineering
Features are the attributes that describe the examples.
The choice of features matters a lot in machine learning.
Too many features can lead to overfitting.
The scale of the dimensions is important.
Example: Classifying animals into reptiles and non-reptiles based on features like egg-laying,
cold-blooded, scales, and number of legs.
Minkowski Metric
A way to measure distance between vectors.
Formula:
dist(X, Y) = (Σ_i |x_i - y_i|^p)^(1/p)

p=1 : Manhattan distance

p=2 : Euclidean distance
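
The formula translates directly into code; a small sketch along the lines of the lecture's distance helper:

python
def minkowskiDist(v1, v2, p):
    # (sum over i of |v1[i] - v2[i]|^p) ** (1/p)
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)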


Example: Classifying Animals
Using the Minkowski metric, the distance between a rattlesnake, boa constrictor, and dart frog
was calculated.
The two snakes were closer to each other, while the dart frog was farther away.
Adding an alligator to the mix complicated things because the number of legs became a
significant factor in the distance calculation.
Solutions
Use Manhattan distance: It treats each feature equally.
Convert the number of legs to a binary feature: Either it has legs or it doesn't.
Types of Learning
Labeled and Unlabeled
Clustering and Classifying
Clustering
Grouping examples into clusters.
Algorithm:
Decide the number of clusters.
Pick an example as the initial representation of each cluster.
For each example, assign it to the closest cluster.
Find the median of each cluster.
Repeat the process.
Classifying
Assigning labels to examples based on features.
The goal is to find the best surface that separates the examples.
Methods:
Find the simplest surface (e.g., a line) that separates the examples.
Use a more complicated surface (e.g., sequence of line segments).
K-Nearest Neighbors: For each new example, find the K closest labeled examples and take a
vote.
Metrics for Evaluation
Confusion Matrix: A table that shows the number of true positives, true negatives, false
positives, and false negatives.

Accuracy = (True Positives + True Negatives) / Total

Positive Predictive Value (PPV) = True Positives / (True Positives + False Positives)

Sensitivity = True Positives / (True Positives + False Negatives)

Specificity = True Negatives / (True Negatives + False Positives)
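
A small sketch computing these metrics from raw counts (illustrative only; assumes non-zero denominators):

python
def metrics(tp, fp, tn, fn):
    return {'accuracy':    (tp + tn) / (tp + fp + tn + fn),
            'ppv':         tp / (tp + fp),
            'sensitivity': tp / (tp + fn),
            'specificity': tn / (tn + fp)}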


Trade-offs
Increasing specificity can decrease sensitivity and vice versa.
Conclusion
Feature engineering is crucial in machine learning.
Different metrics can be used to evaluate the performance of a model.
There's often a trade-off between different metrics like sensitivity and specificity.
Next Lecture
Professor Guttag will show examples of machine learning algorithms in action.
These notes should provide a comprehensive overview of the lecture and prepare you for any
test on this topic.

Lecture 12: K-Means Clustering and Data Scaling in Python


Introduction
The lecture focuses on K-Means Clustering, a popular machine learning algorithm used for
unsupervised learning.
It also discusses the importance of data scaling and how it impacts the clustering results.
K-Means Algorithm
K-Means is a greedy algorithm that aims to partition a set of points into K clusters, where each
point belongs to the cluster with the nearest mean.
The algorithm starts by initializing K centroids randomly.
Each point is then assigned to the nearest centroid, and the centroid is recalculated.
This process is repeated until the centroids no longer change significantly.
Steps:
Initialize K centroids randomly.
Assign each point to the nearest centroid.
Recalculate centroids.
Repeat steps 2 and 3 until convergence.
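A compact sketch of those steps on plain 2D points (illustrative only; the lecture's own implementation uses Example and Cluster classes instead):
python
import random

def kmeans(points, k, num_iterations=100):
    # points: list of (x, y) tuples; a fixed iteration count stands in for a convergence test
    centroids = random.sample(points, k)                       # step 1: random initial centroids
    for _ in range(num_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                       # step 2: assign to nearest centroid
            distances = [(p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):                 # step 3: recalculate centroids
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster)/len(cluster),
                                sum(p[1] for p in cluster)/len(cluster))
    return centroids, clusters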
Downsides:
Choosing a bad K can lead to poor results.
The algorithm is non-deterministic; different initial centroids can lead to different results.
Choosing K
Often, domain knowledge can help in choosing an appropriate K.
Another approach is to try different values of K and evaluate the quality of the result.
One can also run hierarchical clustering on a subset of data to get a sense of the structure and
then decide on K.
Dealing with Unlucky Centroids
One approach is to select good initial centroids that are distributed over the space.
Another approach is to try multiple sets of randomly chosen centroids and select the best result.
Real-world Example: Medical Data
The lecture discusses clustering medical patients based on four features: heart rate, number of
previous heart attacks, age, and ST elevation.
The goal is to see if the clusters can predict the likelihood of a patient dying from a heart attack.
Code Overview
The code is written in Python and makes use of classes and functions to implement the
K-Means algorithm.
It also includes a function for scaling the features, which is crucial for the algorithm's
performance.
Scaling Features: Z-Scaling
Z-Scaling is used to scale the features so that they have a mean of 0 and a standard deviation
of 1.
This is important because features with larger ranges can dominate the clustering process.
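A minimal sketch of z-scaling one feature column (along the lines of the lecture's scaling helper):
python
import pylab

def z_scale(values):
    values = pylab.array(values, dtype=float)
    mean = values.sum() / len(values)
    values = values - mean                  # shift to mean 0
    return values / pylab.std(values)       # divide by the standard deviation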
Results
Without scaling, the clustering did not provide any meaningful separation of the data.
With scaling, the clusters showed a significant difference in the fraction of positive outcomes
(i.e., patients who died), indicating that scaling is crucial for the algorithm's performance.
Summary
K-Means is a powerful but simple clustering algorithm with various applications.
Choosing the right K and initial centroids is crucial for the algorithm's success.
Feature scaling is often necessary for the algorithm to perform well.
Real-world applications require careful consideration of these factors for meaningful results.
Key Takeaways for the Test
Understand the K-Means algorithm, its steps, and its limitations.
Know how to choose an appropriate K and deal with unlucky initial centroids.
Understand the importance of feature scaling, specifically Z-Scaling.
Be able to interpret the results of a K-Means clustering operation, especially in a real-world
context like medical data.
These notes should provide a comprehensive overview of the lecture and prepare you well for
any upcoming test on the topic.

Lecture 13: Machine Learning - K-Nearest Neighbors and Logistic Regression


Introduction
The lecture focuses on two machine learning algorithms: K-Nearest Neighbors (KNN) and
Logistic Regression.
These algorithms are used for classification problems, such as predicting survival on the Titanic.
K-Nearest Neighbors (KNN)
KNN is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure.
It works by finding the k nearest data points to a given data point and classifying it based on the
majority class among those k points.
How KNN Works
Training Phase: Store all the data points.
Testing Phase:
Calculate the distance from the test point to all stored data points.
Sort the distances and consider the k smallest distances.
Return the most frequent class among these k points.
KNN in Python
Python's scikit-learn library provides a simple way to implement KNN.
The KNeighborsClassifier class is used for implementation.
python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Evaluating KNN
Evaluation metrics include True Positives, False Positives, True Negatives, and False
Negatives.
These metrics can be used to calculate Precision, Recall, and F1-score.
Logistic Regression
Logistic Regression is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a
set of independent variables.
It is similar to linear regression but predicts the probability of the outcome.
How Logistic Regression Works
Training Phase: The algorithm finds weights for each feature using optimization techniques.
Testing Phase: Given a feature vector, it returns the probabilities of different labels.
Logistic Regression in Python
Python's scikit-learn library provides a simple way to implement Logistic Regression.
The LogisticRegression class is used for implementation.
python
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
Evaluating Logistic Regression
Similar to KNN, evaluation metrics include True Positives, False Positives, True Negatives, and
False Negatives.
The algorithm returns probabilities, and a threshold (usually 0.5) is set for classification.
Feature Weights
Logistic Regression provides insights about the variables by giving them weights.
Positive weight implies that the variable is positively correlated with the outcome.
Negative weight implies that the variable is negatively correlated with the outcome.
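Continuing the scikit-learn snippet above (X_test is assumed to be a held-out feature matrix, not shown earlier), probabilities, thresholding, and weights look roughly like this:
python
probs = logisticRegr.predict_proba(X_test)[:, 1]   # probability of the positive class
predictions = probs > 0.5                          # apply the usual 0.5 threshold
print(logisticRegr.coef_)                          # one weight per feature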
Comparison: KNN vs Logistic Regression
KNN is computationally expensive as it has to calculate distances for each test point, whereas
Logistic Regression is faster in the testing phase.
Logistic Regression often outperforms KNN and provides insights into feature importance.
Conclusion
Both KNN and Logistic Regression have their pros and cons.
The choice of algorithm depends on the specific requirements of the problem at hand.
Note: The lecture ends with a caution that one should be wary of reading too much into the
weights in Logistic Regression, especially when features are correlated. This topic will be
covered in the next lecture.

Lecture 14: Common Statistical Sins and Wrap Up


Introduction
The lecture focuses on common statistical sins that people often commit while analyzing data.
The aim is to make students aware of these pitfalls so they can avoid them in their own work.
Anscombe's Quartet
Four different datasets with the same mean, variance, and correlation.
However, when plotted, they look entirely different.
Moral: Statistics about data is not the same as the data itself.
Always plot your data first to understand its nature.
Lying with Pictures
Graphs can be manipulated to show what you want.
Always check the axis labels and scales.
Example: Fox News graph comparing people on welfare vs. people with full-time jobs.
The graph was misleading due to the scale and the definitions used for the categories.
GIGO (Garbage In, Garbage Out)
Analysis of bad data is worse than no analysis at all.
Example from the 1840s U.S. Census used to argue that slavery was good for slaves.
The data was flawed and biased.
Always question the quality of the data before analyzing it.
Survivor Bias
Example: WWII planes that returned from missions were analyzed to see where they were most
often hit by enemy fire.
The flaw: The planes that were analyzed were the ones that survived.
Survivor bias can lead to incorrect conclusions.
Always consider what is missing from your data.
Convenience Sampling
Not all samples are created equal.
Example: Course evaluations suffer from survivor bias.
The students who hated the course most have probably already dropped it.
Opinion polls often suffer from non-response bias.
The people who respond to surveys are not necessarily representative of the entire population.
Non-Representative Sampling
When samples are not random and independent, the basic assumptions underlying statistical
conclusions break down.
Example: Political polls relying on landlines miss out on younger demographics who don't use
landlines.
Always understand how the data was collected and whether the assumptions for your analysis
are satisfied.
Conclusion
Always be critical of the data and the methods used for collecting it.
Understand the assumptions behind statistical methods and ensure they are met before drawing
conclusions.
The lecture ends with a note that the next class will wrap up the course.
Key Takeaways
Always visualize your data.
Be critical of the scales and labels in graphs.
Question the quality of your data.
Be aware of biases like survivor bias and non-response bias.
Understand the assumptions behind the statistical methods you are using.
These notes should help you understand the common pitfalls in statistical analysis and how to
avoid them. Always be critical and cautious while dealing with data and statistics.

Lecture 15: Course Summary and Future Directions


Introduction
The lecture aims to summarize the key points covered in the course and provide some insights
into what might be next for students.
The instructor emphasizes the importance of skepticism and differentiates it from denial, quoting
Ambrose Bierce's "The Devil's Dictionary."
Course Summary
Technical Topics
Optimization Problems: Objective functions, constraints, greedy algorithms, dynamic
programming, memoization, and various examples like knapsack problems, graph problems,
curve fitting, clustering, and logistic regression.
Stochastic Thinking: Importance of probabilistic thinking, randomness as a computational
technique, and random algorithms.
Modeling Aspects of the World: Deterministic models like graph theory and statistical models
like Monte Carlo simulation, sampling, and machine learning.

Subtext: Programming

The course aimed to make students better programmers.

Importance of using libraries for real-world problems.

The course introduced a few extra features of Python and emphasized the use of libraries like
plotting libraries, machine learning libraries, and numeric libraries.

Key Takeaways

Many important problems can be formulated in terms of an objective function and a set of
constraints.

Randomness is a powerful tool for building computations that model the world.

Models are always inaccurate but provide some abstraction of reality.

Confidence intervals and levels are essential for characterizing the believability of results.

Memoization is a generally useful technique that trades time for space.

Future Directions

Courses to Take: 6.009, 6.005, 6.006, 6.034, and 6.036.

UROP (Undergraduate Research Opportunities Program): Students are encouraged to look for
interesting UROPs where they can use what they've learned.

Minor/Major in Computer Science: The instructor encourages students to consider minoring or majoring in computer science.

Famous Predictions About Computing

Thomas Watson, chairman of IBM, predicted a world market for maybe five computers.
An article in Popular Mechanics predicted computers might someday weigh no more than 1.5
tons.

Ken Olsen, founder of Digital Equipment Corporation, said there's no reason anyone would want
a computer in their home.

Conclusion

The lecture ends with some humorous and cautionary tales about famous last words,
emphasizing the unpredictability of predictions.
