MIT 6.0002 Introduction To Computational Thinking and Data Science Notes
Introduction
The lecture focuses on optimization problems and algorithms, specifically the knapsack
problem.
Optimization problems are about finding the best solution from a set of solutions.
Definition
You have a knapsack with a weight limit (W) and a set of items, each with a weight and a value.
The goal is to maximize the value of the items you can carry without exceeding the weight limit.
Example
You have a calorie limit of 1,500 and a list of foods with their respective calorie values.
The challenge is to pick the foods that maximize your happiness (value) without exceeding the
calorie limit.
Formalization
The knapsack can accommodate items with a total weight of no more than W.
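In symbols (a standard 0/1 knapsack formulation, not quoted verbatim from the lecture): each item i has a value v_i and a weight w_i, and a decision variable x_i that is 1 if the item is taken and 0 otherwise. The goal is to maximize the sum of v_i * x_i subject to the constraint that the sum of w_i * x_i is at most W.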
Brute Force
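The brute-force approach enumerates every subset of the items (the power set), discards the subsets that exceed the weight limit, and keeps the most valuable of the rest; its running time is exponential in the number of items. A minimal sketch (not the lecture's exact code; it assumes items expose get_cost and get_value, as in the greedy code below):
python
from itertools import combinations

def brute_force(items, max_cost):
    best_value, best_subset = 0.0, []
    # Try every subset size, from the empty set up to all items.
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            cost = sum(item.get_cost() for item in subset)
            value = sum(item.get_value() for item in subset)
            if cost <= max_cost and value > best_value:
                best_value, best_subset = value, list(subset)
    return (best_subset, best_value)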
Greedy Algorithm
While the knapsack is not full, put the best available item into it.
"Best" can be defined in various ways: highest value, lowest weight, or highest value-to-weight
ratio.
The items are sorted based on the key function, and then the algorithm iterates through the
sorted list to fill the knapsack.
python
def greedy(items, max_cost, key_function):
    """Greedily fill the knapsack, taking items in decreasing order of key_function."""
    items_copy = sorted(items, key=key_function, reverse=True)
    result = []
    total_value, total_cost = 0.0, 0.0
    for i in range(len(items_copy)):
        # Take the item only if it still fits within the remaining budget.
        if (total_cost + items_copy[i].get_cost()) <= max_cost:
            result.append(items_copy[i])
            total_cost += items_copy[i].get_cost()
            total_value += items_copy[i].get_value()
    return (result, total_value)
The algorithm was tested with different key functions: by value (highest value first), by cost (lowest cost first, i.e. key = 1/cost), and by density (value-to-weight ratio).
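For illustration (these calls are assumptions, not verbatim from the lecture, and assume a list foods of items with the same get_value/get_cost interface), the three orderings correspond to three key functions passed to greedy:
python
# Highest value first.
greedy(foods, 1500, lambda item: item.get_value())
# Lowest cost (fewest calories) first.
greedy(foods, 1500, lambda item: 1 / item.get_cost())
# Highest value-to-cost density first.
greedy(foods, 1500, lambda item: item.get_value() / item.get_cost())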
The results varied depending on the key function used, demonstrating that the greedy algorithm
can get stuck in local optima.
Greedy algorithms make a series of local optimizations, which may not lead to a global
optimum.
Different definitions of "best" can lead to different solutions, and there's no guarantee that any
will be optimal.
Conclusion
While greedy algorithms are efficient, they do not guarantee finding the optimal solution.
For the knapsack problem, and many other optimization problems, no known algorithm guarantees an exact solution with a better-than-exponential worst-case running time.
Introduction
The lecture focuses on the concept of Dynamic Programming, a technique used to solve
optimization problems.
Dynamic Programming is particularly useful for problems that exhibit optimal substructure and
overlapping subproblems.
Optimal Substructure: A globally optimal solution can be found by combining optimal solutions
to local subproblems.
Overlapping Subproblems: Finding an optimal solution involves solving the same problem
multiple times.
Memoization
Memoization is the technique of storing the results of expensive function calls and returning the
cached result when the same inputs occur again.
python
def fastFib(n, memo={}):
    """Return the nth Fibonacci number, caching results in memo."""
    if n in memo:
        return memo[n]
    if n <= 1:
        return 1
    memo[n] = fastFib(n - 1, memo) + fastFib(n - 2, memo)
    return memo[n]
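For contrast, the unmemoized Fibonacci (reconstructed here; equivalent to the lecture's naive version) recomputes the same subproblems over and over, which is why it is exponentially slow:
python
def fib(n):
    # Naive recursion: the number of calls grows exponentially with n.
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)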
The knapsack problem also exhibits optimal substructure and overlapping subproblems.
The naive recursive solution is inefficient for the same reasons as the Fibonacci function.
Dynamic Programming Solution
A dynamic programming solution to the knapsack problem is introduced, which also uses
memoization.
The memo key is a tuple of the number of items still to be considered (len(toConsider)) and the available weight; because items are always removed from the front of the list, the length uniquely identifies which items remain.
python
def fastMaxVal(toConsider, avail, memo={}):
    if (len(toConsider), avail) in memo:
        return memo[(len(toConsider), avail)]
    # ... (rest of the code is similar to the naive solution)
    memo[(len(toConsider), avail)] = result
    return result
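For completeness, a full version along the lines of the lecture's code (a reconstruction, not verbatim; the get_cost and get_value accessors follow the greedy example above and are assumptions):
python
def fastMaxVal(toConsider, avail, memo={}):
    """Return a (total value, items taken) tuple for the 0/1 knapsack."""
    if (len(toConsider), avail) in memo:
        result = memo[(len(toConsider), avail)]
    elif toConsider == [] or avail == 0:
        # No items left or no capacity left.
        result = (0, ())
    elif toConsider[0].get_cost() > avail:
        # The first item cannot fit; explore only the branch without it.
        result = fastMaxVal(toConsider[1:], avail, memo)
    else:
        next_item = toConsider[0]
        # Left branch: take the item.
        with_val, with_to_take = fastMaxVal(toConsider[1:], avail - next_item.get_cost(), memo)
        with_val += next_item.get_value()
        # Right branch: skip the item.
        without_val, without_to_take = fastMaxVal(toConsider[1:], avail, memo)
        if with_val > without_val:
            result = (with_val, with_to_take + (next_item,))
        else:
            result = (without_val, without_to_take)
    memo[(len(toConsider), avail)] = result
    return result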
Performance Comparison
The number of recursive calls in the dynamic programming solution grows much more slowly
compared to the naive solution.
For example, for a problem size of 40, the naive solution requires over 1 trillion calls, while the
dynamic programming solution requires only 43,000 calls.
Summary
The key to dynamic programming is to break the problem down into smaller subproblems, solve
each subproblem only once, and store its answer in a memo.
Additional Notes
The lecture also touches upon the history of dynamic programming and its inventor, Richard
Bellman.
The term "dynamic programming" was chosen to conceal the mathematical nature of the work,
as it was funded by a part of the Defense Department that did not approve of mathematics.
Homework/Exercise
An optimization problem related to rolling over problem set grades into a quiz is provided in the
PowerPoint slides for further exploration.
python
class Graph:
    def __init__(self):
        self.nodes = []
Lecture 7: Monte Carlo Simulation, Central Limit Theorem, and Confidence Intervals
Introduction
The lecture covers the concept of Monte Carlo Simulation, Central Limit Theorem, and
Confidence Intervals.
Monte Carlo Simulation is a statistical technique that allows you to account for risk in
quantitative analysis and decision making.
The Central Limit Theorem (CLT) states that the distribution of sample means approximates a
normal distribution as the sample size gets larger.
Confidence Intervals give an estimated range of values which is likely to include an unknown
population parameter.
Central Limit Theorem (CLT)
The CLT is a statistical theory that states that given a sufficiently large sample size from a
population with a finite level of variance, the mean of all samples from the same population will
be approximately equal to the mean of the population.
The CLT allows us to use the empirical rule (about 68% of values within 1 standard deviation of the mean, about 95% within 1.96, and about 99.7% within 3) when computing confidence intervals, even when the underlying distribution is not normal.
Three Key Points of CLT
The means of the samples in a set of samples (the sample means) will be approximately
normally distributed.
This distribution will have a mean that is close to the mean of the population.
The variance of the sample means will be close to the variance of the population divided by the
sample size.
Verifying CLT with Simulations
The lecture demonstrates the CLT through simulations involving rolling dice and spinning a
roulette wheel.
The simulations show that as the sample size increases, the distribution of sample means
becomes more normal, even if the original distribution is not normal.
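A minimal sketch of such a check (illustrative code, not the lecture's): average samples of fair-die rolls and watch how the spread of the sample means shrinks as the sample size grows.
python
import random
import statistics

def sample_means(num_samples, sample_size):
    """Return the means of num_samples samples of fair-die rolls."""
    means = []
    for _ in range(num_samples):
        rolls = [random.randint(1, 6) for _ in range(sample_size)]
        means.append(sum(rolls) / sample_size)
    return means

# As sample_size grows, the standard deviation of the sample means shrinks
# (variance of the means ~ population variance / sample size).
for size in (1, 10, 100):
    means = sample_means(1000, size)
    print(size, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))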
Monte Carlo Simulation
Monte Carlo Simulation is a mathematical technique that allows you to account for risk in
quantitative analysis and decision making.
It is used in various fields like finance, engineering, supply chain, and science.
The lecture shows how Monte Carlo Simulation can be used to estimate the value of Pi.
Estimating Pi using Monte Carlo
A circle is inscribed in a square, and random points are generated uniformly inside the square.
The fraction of points that land inside the circle approximates the ratio of the two areas, which is pi/4.
Rearranging gives the estimator: Pi ≈ 4 * (points in circle / total points).
(The code below samples only the unit square [0,1] x [0,1], so it effectively uses a quarter circle of radius 1; the area ratio, and hence the formula, is the same.)
Code for Monte Carlo Simulation to Estimate Pi
python
import random

def throwNeedles(numNeedles):
    """Estimate pi by sampling numNeedles random points in the unit square."""
    inCircle = 0
    for _ in range(numNeedles):
        x = random.random()
        y = random.random()
        # Point lies inside the quarter circle of radius 1.
        if (x*x + y*y)**0.5 <= 1.0:
            inCircle += 1
    return 4 * (inCircle / float(numNeedles))
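In the lecture, an estimator like this is wrapped in a routine that keeps increasing the number of points until the estimates are tight enough; a simplified sketch (the names and exact structure here are assumptions, not the lecture's code) might look like:
python
import statistics

def get_est(num_needles, num_trials):
    """Run num_trials independent estimates and summarize them."""
    estimates = [throwNeedles(num_needles) for _ in range(num_trials)]
    s_dev = statistics.stdev(estimates)
    cur_est = sum(estimates) / len(estimates)
    print('Est. =', round(cur_est, 5), 'Std. dev. =', round(s_dev, 5), 'Needles =', num_needles)
    return (cur_est, s_dev)

def est_pi(precision, num_trials):
    """Double the number of needles until ~95% of estimates fall within
    precision of the mean (i.e. 1.96 standard deviations <= precision)."""
    num_needles = 1000
    s_dev = precision
    while s_dev > precision / 1.96:
        cur_est, s_dev = get_est(num_needles, num_trials)
        num_needles *= 2
    return cur_est

# Example: est_pi(0.01, 100)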
Confidence Intervals
Confidence intervals give an estimated range of values likely to include an unknown population
parameter.
The lecture shows that while the simulation can give us a confidence interval, it does not
guarantee that the simulation model is correct.
It's essential to perform sanity checks to ensure the simulation model's validity.
Common Pitfalls
The lecture warns against the mistake of assuming that a statistically valid result is the same as
a true result.
A simulation can be statistically valid but still produce incorrect results due to bugs or incorrect
assumptions in the model.
Summary
The Central Limit Theorem is a powerful tool that allows us to make statistical inferences about
the population mean.
Monte Carlo Simulation is a versatile technique for approximating solutions to problems that
may not be easy to solve analytically.
Confidence intervals provide a range within which we can expect the true population parameter
to lie, but they do not validate the model itself.
Important Takeaways
Always validate your simulation model.
Understand the limitations of statistical methods.
Use the Central Limit Theorem and Monte Carlo Simulation in conjunction to make more
accurate predictions and decisions.
Accuracy:
Accuracy = (true positives + true negatives) / total number of examples
Trade-offs
Increasing specificity can decrease sensitivity and vice versa.
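For reference, the standard definitions behind these metrics, written as a minimal sketch (the function names are mine; the formulas are the usual ones):
python
def accuracy(tp, fp, tn, fn):
    # Fraction of all examples classified correctly.
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    # Also called recall: fraction of actual positives that are found.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of actual negatives that are correctly rejected.
    return tn / (tn + fp)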
Conclusion
Feature engineering is crucial in machine learning.
Different metrics can be used to evaluate the performance of a model.
There's often a trade-off between different metrics like sensitivity and specificity.
Next Lecture
Professor Guttag will show examples of machine learning algorithms in action.
Subtext: Programming
The course introduced a few extra features of Python and emphasized the use of libraries like
plotting libraries, machine learning libraries, and numeric libraries.
Key Takeaways
Many important problems can be formulated in terms of an objective function and a set of
constraints.
Randomness is a powerful tool for building computations that model the world.
Confidence intervals and levels are essential for characterizing the believability of results.
Future Directions
UROP (Undergraduate Research Opportunities Program): Students are encouraged to look for
interesting UROPs where they can use what they've learned.
Thomas Watson, chairman of IBM, predicted a world market for maybe five computers.
An article in Popular Mechanics predicted computers might someday weigh no more than 1.5
tons.
Ken Olsen, founder of Digital Equipment Corporation, said there's no reason anyone would want
a computer in their home.
Conclusion
The lecture ends with some humorous and cautionary tales about famous last words,
emphasizing the unpredictability of predictions.