Summary AI
Lecture 6: local search
In local search we do not know the solution in advance; the algorithms have to work with very little memory.
Hill climbing:
‣ Keeps track of one current state (no backtracking)
‣ Does not look ahead beyond the immediate neighbors of the current state (greedy)
‣ On each iteration moves to the neighboring state with the highest value (steepest ascent)
‣ Terminates when a peak is reached (no neighbor has a higher value).
Hill climbing starts with an arbitrary solution to a problem and iteratively moves towards a better
solution in its neighborhood. It makes small changes to the current solution and moves towards
higher elevations (better solutions).
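The core loop can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's code: neighbors(state) and value(state) are hypothetical problem-specific helpers (the successor states and the objective value).

```python
def hill_climbing(initial_state, neighbors, value):
    """Steepest-ascent hill climbing: keep one current state, no backtracking."""
    current = initial_state
    while True:
        # Look only at the immediate neighbors (greedy).
        best = max(neighbors(current), key=value, default=None)
        if best is None or value(best) <= value(current):
            return current  # peak reached: no neighbor has a higher value
        current = best  # move uphill (steepest ascent)
```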
Problems: when hill climbing doesn't find any higher utility around the current state, it can get stuck at:
- Local maxima
- Plateaus (flat local maximum or shoulder)
- Ridges
‣ A sequence of local maxima that are not directly connected
‣ Each local maximum only has worse connecting states
‣ Common in low-dimensional state spaces
Improvements:
‣ Allow a limited number of sideways moves (in case a plateau is really a shoulder)
- Higher success rate, but a higher number of moves
‣ Stochastic hill climbing: random selection between the uphill moves, with probability related to steepness.
‣ First-choice hill climbing: random testing of successors until one is found that is better than the current state (both variants are stochastic; the difference is that first-choice samples successors one at a time instead of weighing all uphill moves).
- Good strategy when testing all successors is costly.
‣ Random-restart hill climbing: do a number of hill-climbing searches from randomly selected initial states.
- If each hill-climbing search has probability of success p, then a solution will be found on average after 1/p restarts.
- Will eventually find a solution, because at some point a goal state will be generated as the initial state.
If two states have the same utility (same level), choose randomly between them.
- If elevation = objective function → find the global maximum or highest peak → hill climbing.
- If elevation = cost → find the global minimum or lowest valley → steepest descent
Simulated annealing:
- Problem with hill climbing: efficient, but it will get stuck in a local maximum.
- Problem with a random walk: extremely inefficient, but it will eventually find the global maximum (complete).
- Combining both gives simulated annealing (more complete and efficient).
How does it work:
- Move to a randomly chosen neighbor state.
- If its utility is higher, always move to that state.
- If its utility is lower, move to that state with probability p < 1.
- The probability of a move to a worse state:
  • becomes less likely the worse the move makes the situation
  • becomes less likely as the temperature decreases
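As a minimal sketch (not the lecture's code): the standard acceptance probability for a worsening move is p = e^(ΔE/T), which matches both properties above. Here neighbors and value are the same hypothetical helpers as in the hill-climbing sketch, and the geometric cooling schedule is an assumption.

```python
import math
import random

def simulated_annealing(initial_state, neighbors, value,
                        t_start=1.0, cooling=0.995, t_min=1e-4):
    current = initial_state
    t = t_start
    while t > t_min:
        nxt = random.choice(list(neighbors(current)))
        delta = value(nxt) - value(current)  # ΔE > 0 means the move is uphill
        # Always accept improvements; accept worse moves with p = e^(ΔE/T),
        # which shrinks as the move gets worse and as T decreases.
        if delta > 0 or random.random() < math.exp(delta / t):
            current = nxt
        t *= cooling  # lower the temperature each iteration
    return current
```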
Genetic algorithms:
- Starts with k randomly selected states (population)
- Each state (or individual) is encoded as a string
- Each state is rated by the objective function (the fitness function)
Genetic algorithms are inspired by natural selection. They are used for finding appropriate
solutions to optimization and search problems by mimicking the process of evolution.
Summary:
- For many search problems, we do not need the best possible solution, or the best solution is not achievable at all.
- Local search methods are a useful tool because they operate on complete-state formulations without keeping track of all the states.
- Simulated annealing adds a stochastic element to hill climbing and can give optimal solutions in some circumstances.
- Stochastic local beam search provides a first approach to the generation and selection of states.
- Genetic algorithms maintain a large population of states and use operations such as mutation and crossover to explore the search space.
Practical 3 – local search & genetic algorithms
Problem Representation
- Use the problem representation from Russell & Norvig: cost is the number of pairs of queens attacking each other
- Move one queen per action, queens should stay in their column and only move up or
down
1. Starting position
2. Start checking the different states
3. Find the best possible state
4. Continue checking, but if there are no better states, keep the previous best
5. Continue from the previous best
6. Restart checking the best states
7. Also check the previous step
8. Continue until the best state is found
Example: cost
When hill climbing is used in the n-queens problem, the variable 'non-attacking pairs of queens' can be maximized, which is equivalent to minimizing the cost (attacking pairs).
Start with population 21325, 35415, 14255, 15233.
Crossover points, mutations, and which sample pairs to combine are given.
In the case of cost (counting attacking pairs), the formula for the sample probability is:
1 − (cost of sample / total cost of all samples)
In the case of utility (counting non-attacking pairs), the formula for the sample probability is:
utility of sample / total utility of all samples
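A minimal sketch of one step of this genetic algorithm on the 5-queens strings above, under stated assumptions: fitness is the number of non-attacking pairs, selection uses the utility-based probability above, and (unlike in the practical, where they are given) the crossover point and mutation are drawn at random here.

```python
import random

def fitness(state):
    """Number of non-attacking pairs for a state like '21325' (row per column)."""
    rows = [int(c) for c in state]
    n = len(rows)
    attacking = sum(1 for i in range(n) for j in range(i + 1, n)
                    if rows[i] == rows[j] or abs(rows[i] - rows[j]) == j - i)
    return n * (n - 1) // 2 - attacking

def select(population):
    # Sample probability: utility of sample / total utility of all samples.
    return random.choices(population, weights=[fitness(s) for s in population])[0]

def crossover(a, b, point):
    return a[:point] + b[point:]

def mutate(state, pos, new_row):
    return state[:pos] + str(new_row) + state[pos + 1:]

population = ["21325", "35415", "14255", "15233"]
child = crossover(select(population), select(population), point=2)
child = mutate(child, pos=random.randrange(5), new_row=random.randint(1, 5))
print(child, fitness(child))
```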
Lecture 7: adversarial search
Games studied in AI:
- Deterministic
- Two-player
- Turn taking
- Perfect information (we know everything)
- Zero-sum (one player's win is the other's loss): win (+1) + loss (−1) = 0
Minimax(s):
- If s is a terminal node → utility of the terminal node
- If it is MAX's turn to move → maximum of the successors' minimax values
- If it is MIN's turn to move → minimum of the successors' minimax values
Description of minimax:
- Depth-first exploration of the tree
- Recursively descends each branch of the tree
- Computes the utility of terminal nodes
- Goes back up, assigning a minimax value to each node.
Minimax with alpha-beta pruning is more efficient, as it has to check fewer nodes.
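A minimal sketch of minimax with alpha-beta pruning (not the practical's code); is_terminal, utility, and successors are hypothetical game-specific functions.

```python
import math

def alphabeta(state, is_terminal, utility, successors,
              maximizing=True, alpha=-math.inf, beta=math.inf):
    """Minimax value of `state`, skipping provably irrelevant subtrees."""
    if is_terminal(state):
        return utility(state)
    if maximizing:  # MAX's turn: maximum of the successors' values
        value = -math.inf
        for child in successors(state):
            value = max(value, alphabeta(child, is_terminal, utility,
                                         successors, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # MIN already has a better option elsewhere: prune
        return value
    else:  # MIN's turn: minimum of the successors' values
        value = math.inf
        for child in successors(state):
            value = min(value, alphabeta(child, is_terminal, utility,
                                         successors, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break  # MAX already has a better option elsewhere: prune
        return value
```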
Move ordering:
- If we have information on which moves are generally better than others, we can improve
alpha-beta pruning by first evaluating the utility of nodes which are considered good moves.
- For instance, in chess: capture > threat > forward move > backward move
Transposition tables: in games like chess, the same position can occur as a result of different move sequences; this is called a transposition.
- Exploring the search tree again for such a position is double work.
- The results of searched positions can be stored in a transposition table: you look the position up in the table instead of searching the subtree again.
Summary
- Games can be formalized by their initial state, the legal actions, the result of each action, a
terminal test, and a utility function.
- The MINIMAX algorithm can determine the optimal moves for two-player, discrete, deterministic, turn-taking, zero-sum games with perfect information.
- Alpha-beta pruning can remove subtrees that are provably irrelevant.
- Heuristic evaluation functions must be used when the entire game-tree cannot be explored (i.e.,
when the utility of the terminal nodes can’t be computed).
- Monte-Carlo tree search is an alternative which plays-out entire games repeatedly and chooses
the next move based on the proportion of winning play-outs.
Practical 4 - Adversarial Search
α-β pruning
Lecture 8: learning from examples
We don't know how many blocks are behind the tree, but we can make very good guesses, called predictions.
Intelligent systems also need to be able to make inferences under uncertainty. This requires a
different way of thinking about problems.
Deductive reasoning: going from true statements to other true statements using the rules of logic. This works in certain worlds. Example: all Dutch cities have a train station; therefore, Tilburg has a train station.
Faced with uncertainty, we need to make inductive inferences. If some are better than others,
questions emerge:
- How should we choose between competing explanations?
- What’s a rational solution to the problem of inductive inference?
Observations:
- A series of k observations (examples, instances, cases)
- An observation describes a set of inputs x = (x1, x2, …, xn) and an output y
- Each xi is called a feature, attribute, or input variable
- y is typically called the output variable
Why is predicting the future hard? “Prediction is very difficult, especially if it’s about the future.” Niels
Bohr.
- There is likely to be noise in the data.
- There are natural variations.
- We want to capture what's systematic, not what's accidental.
Why:
- What is systematic is likely to be observed again. Our goal is to make accurate predictions,
not describe the data.
Training a model usually means finding the parameter values that minimize the error.
- Plotting the error as a function of the parameters, we get an error surface.
- We want to find the lowest point on this surface.
- Most learning algorithms attempt to minimize the error, one step at a time.
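A minimal sketch of "one step at a time", using a made-up one-parameter error surface E(w) = (w − 3)²:

```python
# Gradient descent on E(w) = (w - 3)^2, whose lowest point is at w = 3.
w, lr = 0.0, 0.1  # starting point and learning rate are arbitrary choices
for _ in range(100):
    grad = 2 * (w - 3)   # slope of the error surface at w
    w -= lr * grad       # step downhill, one small step at a time
print(round(w, 4))       # ≈ 3.0, the bottom of the error surface
```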
Both models in the figure minimize the MSE, and the error is lower for the degree-25 polynomial, but that model has more variation. The blue (low-degree) model is more likely to catch the systematic trend, while the red (degree-25) model also catches the accidental variance in the 2000 data.
Fitting vs prediction
The model fit refers to how well the trained model describes the observations it was trained on.
We are interested in how well the trained model predicts new observations.
Overfitting: the model has too many parameters and also captures the noise.
That's not what you want for inductive inference.
Summary of the example
1. We know the daily temperature in London for the year 2000.
2. We want to predict London's temperature in the future, let's say 2001.
3. Simply predicting the 2000 temperatures for 2001 is a bad idea.
4. We need a model that captures what is systematic and ignores what is accidental.
5. We considered polynomial models of different degrees.
6. Models make errors, and we minimize these errors.
7. When fitting, the higher the degree of the polynomial, the lower the error.
8. When predicting, there is a U-shaped relationship, a trade-off:
a. Too little complexity (underfitting) vs. too much complexity (overfitting).
b. Or: too few parameters (underfitting) vs. too many parameters (overfitting).
c. We need to find a sweet spot; in this example a degree-4 or degree-5 polynomial.
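A minimal sketch of this trade-off with numpy; the "temperatures" here are synthetic stand-ins (a seasonal curve plus noise), not the actual London data from the lecture, so the exact numbers will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(365) / 365.0  # day of year, scaled for numerical stability

def fake_year():
    # Synthetic daily temperatures: seasonal cycle plus random noise.
    return 10 - 8 * np.cos(2 * np.pi * x) + rng.normal(0, 2, 365)

train, test = fake_year(), fake_year()  # "2000" to fit on, "2001" to predict
for degree in (1, 4, 25):
    coeffs = np.polyfit(x, train, degree)
    pred = np.polyval(coeffs, x)
    print(f"degree {degree:2d}: "
          f"fit MSE {np.mean((pred - train) ** 2):.2f}, "
          f"prediction MSE {np.mean((pred - test) ** 2):.2f}")
```

The fitting error keeps dropping as the degree grows, while the prediction error on the unseen year typically shows the U-shape described above.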
David Harding, 2018: "I think the public debate about AI and machine learning is nine parts hype to one part substance."
Lecture 9: discussion lecture John Searle
Syntax: the rules and regulations that define how statements in a programming language may be written.
Semantics: the meaning of the associated line of code in a programming language.
Connectionism: a movement in AI that said we should be using neural networks. Creatures can create connections between stimuli and responses through learning.
Behaviorism: tries to understand behavior by giving a stimulus and watching what happens, without studying the inner mechanisms of the mind.
Counterarguments:
1. The systems reply: the person in the room doesn’t understand Chinese, but the systems as a
whole does understand Chinese. Searle is playing the role of a CPU, but the system has other
components like a memory.
Searle's response: the person in the room could internalize the whole system and would still not understand Chinese.
2. The robot reply: the person in the room doesn’t understand Chinese, but if the system were
connected to the world like a robot, with sensors, then it would understand Chinese. This
would establish a causal connection between the world and the structures being
manipulated.
Searle’s response: all these sensors provide is information. There is no difference between this
information and information passed into the room in the form of questions.
3. The brain simulator reply: what if the program precisely simulated the brain of a Chinese
speaker, including the neural architecture and the state of every neuron. Then the system
would understand Chinese.
Searle’s response: whatever system the person in the room is simulating, it will still only be a
simulation.
4. The other minds reply: the only way we attribute understanding to other people is through
their behavior. There is no other way. Therefore, we must decide if we attribute
understanding to machines in the same way, only through their behavior.
Searle’s response: the problem in this discussion is not about how I know that other people
have cognitive states, but rather what it is that I am attributing to them when I attribute
cognitive states to them.
There is a difference: we know machines are just manipulating symbols without knowing
what they mean, but we are not sure about people.
Paul M. Churchland and Patricia Smith Churchland of the University of California at San Diego
claim that circuits modelled on the brain might well achieve intelligence. On the opposing
side, John R. Searle of the University of California at Berkeley maintains that computer
programs can never give rise to minds.
Strong AI claims that thinking is merely the manipulation of formal symbols, and that is exactly what the computer does: manipulate formal symbols. This view is often summarized by saying, "The mind is to the brain as the program is to the hardware."
Searle continues by giving an example: now, the rule book (syntax and no semantics) is the "computer program." The people who wrote it are "programmers," and I am the "computer." The baskets full of symbols are the "data base," the small bunches that are handed in to me are "questions," and the bunches I then hand out are "answers." Like a computer, I manipulate symbols, but I attach no meaning to the symbols. But from the outside it does look like I can speak Chinese.
You can't get semantically loaded thought contents from formal computations alone,
whether they are done in serial or in parallel; that is why the Chinese room argument refutes
strong AI in any form.
Axiom 4. Brains cause minds.
The causation is from the "bottom up" in the sense that lower level neuronal processes cause
higher-level mental phenomena. The answer is that the brain does not merely instantiate a
formal pattern or program (it does that, too), but it also causes mental events by virtue of
specific neurobiological processes. It seems obvious that a simulation of cognition will
similarly not produce the effects of the neurobiology of cognition.
Conclusion 2. Any other system capable of causing minds would have to have causal powers
(at least) equivalent to those of brains.
This is like saying that if an electrical engine is to be able to run a car as fast as a gas engine, it
must have (at least) an equivalent power output.
Conclusion 3. Any artifact that produced mental phenomena, any artificial brain, would have to be able to duplicate the specific causal powers of brains, and it could not do that just by running a formal program.
Conclusion 4. The way that human brains actually produce mental phenomena cannot be
solely by virtue of running a computer program.
a. In the Chinese room you really do understand Chinese, even though you don't know it.
It is, after all, possible to understand something without knowing that one understands
it.
b. You don't understand Chinese, but there is an (unconscious) subsystem in you that
does. It is, after all, possible to have unconscious mental states, and there is no reason
why your understanding of Chinese should not be wholly unconscious.
Searle's response, as described by Searle: Chinese characters are just a form of symbols, or syntax so to say. My note: I don't understand how your unconscious would be able to understand the meaning without prior knowledge, since the meaning of the symbols is open to interpretation, as language in general also is.
c. You don't understand Chinese, but the whole room does. You are like a single neuron
in the brain, and just as such a single neuron by itself cannot understand but only
contributes to the understanding of the whole system, you don't understand, but the
whole system does.
Searle's argument against this is his description of the multiple men in a room.
f. Computers would have semantics and not just syntax if their inputs and outputs were
put in appropriate causal relation to the rest of the world. Imagine that we put the
computer into a robot, attached television cameras to the robot's head, installed
transducers connecting the television messages to the computer and had the computer
output operate the robot's arms and legs. Then the whole system would have a
semantics.
Neither: I don't agree that attaching arms, legs, etc. would give the computer semantics. Nevertheless, I think semantics could be some sort of higher-level syntax if the symbols had meaning, thereby creating a causal relationship.
g. If the program simulated the operation of the brain of a Chinese speaker, then it would
understand Chinese. Suppose that we simulated the brain of a Chinese person at the
level of neurons. Then surely such a system would understand Chinese as well as any
Chinese person's brain.
Searle/my own: the argument against this one is that simulations are not the real thing, according to Searle. I would like to argue that each brain also processes and stores information in a different manner. Therefore, the simulation of one brain would not necessarily match the level of Chinese speakers in general. Furthermore, the simulation might not be able to produce any further actions, like speaking, in a person.
Lecture 10: quantifying uncertainty
Rational agents with perfect knowledge of the environment (though they rarely explore the entire environment):
- Can find an optimal solution by exploring the complete environment.
- Can find a good, but maybe suboptimal, solution by exploring part of the environment using
heuristics.
What should rational agents do if they don’t have perfect information? Maximize performance by
keeping track of the relative importance of different outcomes and the likelihood that these
outcomes will be achieved.
Logic is insufficient: only an exhaustive list of possibilities on the right-hand side will make the rule
true.
- Laziness: it’s too much work to make and use the rules
- Theoretical ignorance: we don’t know everything there is to know.
- Practical ignorance: we don’t have access to all the info.
So, replace certainty (logic) with degrees of belief (probability).
Possible worlds:
- The term possible worlds originates in philosophy, in reference to ways in which the actual world could have been different.
- In statistics and AI, we use it to refer to the possible states of whatever we are trying to represent, for example the possible configurations of a chess board or the possible outcomes of throwing a die.
- The term world is limited to the problem we are trying to represent.
Possible worlds:
- A possible world (ω, lowercase omega) is a state that the world could be in.
- The set of possible worlds ( Ω, capital omega ) includes all the states that the world could be in.
In other words, Ω must be exhaustive.
- Each possible world must be different from all the other possible worlds. In other words,
possible worlds must be mutually exclusive.
P(doubles ∣ Dice1 = 5) = P(doubles ∧ Dice1 = 5) / P(Dice1 = 5) = (1/36) / (1/6) = 1/6
Product Rule:
P(a ∣ b) = P(a ∧ b) / P(b)
Implies: P(a ∧ b) = P(a ∣ b)P(b)
Random variable: a function that maps from a set of possible worlds to a domain or range; its name always starts with an uppercase letter.
Example: the random variable Total is defined as the sum of throwing two dice.
- Possible worlds: (1,1), (1,2), …, (6,6)
- Domain or range: {2, 3, 4, …, 12}
Example domains of a random variable A
Boolean: {True, False}
- A = true, written as a
- A = false, written as ¬a
Arbitrary: {blonde, brown, black, red}
- A = blonde, written as blonde
Infinite and discrete: A = ℤ (set of integers)
Infinite and continuous: A = ℝ (set of real numbers)
We sum up the probabilities for each possible value of the other variable, taking it out of the equation:
P(¬cavity ∣ toothache) = P(¬cavity ∧ toothache) / P(toothache)
P(¬cavity ∧ toothache) = 0.016 + 0.064 = 0.08
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
→ P(¬cavity ∣ toothache) = 0.08 / 0.2 = 0.4
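A minimal sketch of this computation in Python, using the full joint distribution P(Toothache, Catch, Cavity) from Russell & Norvig that these numbers come from:

```python
# Full joint distribution; keys are (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(toothache=None, catch=None, cavity=None):
    """Sum out (marginalize) every variable that is left as None."""
    return sum(p for (t, c, cav), p in joint.items()
               if (toothache is None or t == toothache)
               and (catch is None or c == catch)
               and (cavity is None or cav == cavity))

# P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
print(prob(toothache=True, cavity=False) / prob(toothache=True))  # 0.4
```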
Conditioning
Marginalization: P(Y) = ∑z P(Y, Z = z)
Here ∑z means the sum over all the possible values of the set of variables Z.
→ Via the product rule: P(Y) = ∑z P(Y ∣ z)P(z)
Independence
‣ Assumptions about independence are usually based on domain knowledge.
‣ Independence drastically reduces the amount of information needed to specify the full joint
distribution
For instance: rolling 5 dice
‣ Full joint distribution: 6^5 = 7776 entries
‣ Five single-variable distributions: 6 × 5 = 30 entries
Conditional Independence
P(X, Y | Z ) = P(X | Z ) P(Y | Z )
Example:
Catch and toothache are not (absolutely) independent: if the probe catches, then it is likely that the tooth has a cavity and that this cavity causes a toothache.
However, toothache and catch are conditionally independent, given the presence or absence of a cavity.
• If a cavity is present, then whether there is a toothache is not dependent on whether the probe
catches, and vice versa.
• If a cavity is not present, then whether there is a toothache is not dependent on whether the probe
catches, and vice versa.
→ P(toothache , catch | cavity ) = P( toothache | cavity )P (catch | cavity)
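This identity can be checked numerically with the joint table and prob helper from the sketch above:

```python
# P(toothache, catch | cavity) == P(toothache | cavity) * P(catch | cavity)?
p_cav = prob(cavity=True)                                      # 0.2
lhs = prob(toothache=True, catch=True, cavity=True) / p_cav    # 0.108 / 0.2 = 0.54
rhs = (prob(toothache=True, cavity=True) / p_cav) * \
      (prob(catch=True, cavity=True) / p_cav)                  # 0.6 * 0.9 = 0.54
print(abs(lhs - rhs) < 1e-9)  # True: conditionally independent given cavity
```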
Summary
‣ Logic is insufficient to act rationally under uncertainty.
‣ Decision theory states that under uncertainty, the best action is the one that maximizes the
expected utility of the outcomes.
‣ Probability theory formalizes the notions we require to infer the expected utility of actions under
uncertainty.
‣ Given a full joint probability distribution, we can formalize a general inference procedure.
‣ Bayes’ rule allows for inferences about unknown probabilities from conditional probabilities.
‣ Neither the general inference procedure nor Bayes' rule scales up well.
‣ Assuming conditional independence allows for the full joint probability distribution to be factored
into smaller conditional distributions → Naive Bayes
Discussion ethics of AI Bostrom and Yudkowsky
Qualia: Sometimes termed “aboutness”. The conscious experience of something, like
the taste of chocolate.
Substrate: there are different meanings; here, the foundation on which something is based or which it presupposes.
Sentience: The capacity for phenomenal experience or qualia, such as the capacity to feel pain and
suffer.
Sapience: A set of capacities associated with higher intelligence, such as self-awareness and being a reason-responsive agent.
Principles:
Non-discrimination principles when attributing moral status:
- Principle of substrate non-discrimination: all else being equal, an agent's sentience or sapience should be judged independently of the physical substrate on which it is implemented.
- Principle of ontogeny non-discrimination: all else being equal, an agent's sentience or sapience should be judged independently of the process that created the agent.
Exercise 2
Exercise 3
Lecture 11: probabilistic reasoning
Networks: mathematical structures used to model pairwise relations between objects. Distinguish between connected graphs and disconnected graphs. A part is called a vertex, node, or point; the lines between nodes are called edges or connections.
Use an adjacency matrix to represent a graph in a computer. A 1 means there is a connection between the corresponding nodes.
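A minimal sketch with a made-up four-node undirected graph:

```python
import numpy as np

# Adjacency matrix for nodes 0..3 with edges 0-1, 0-2, 2-3.
# A[i, j] = 1 means nodes i and j are connected.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])
assert (A == A.T).all()  # undirected graph: the matrix is symmetric
print(A.sum(axis=1))     # degree of each node: [2 1 2 1]
```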
Bayesian network
A data structure that represents the dependencies among variables.
Other names: belief network, decision network, and causal graph.
Generalisation: if the variables are ordered so that for each variable the parent nodes are a subset of the earlier variables, then the chain rule can be used to construct a Bayesian network: P(x1, …, xn) = ∏i P(xi ∣ parents(Xi)).
Deterministic nodes: the value of the node is specified exactly by the values of its parents, with no uncertainty. Distinguish between logical and numerical deterministic relationships.
Inference by enumeration
Query: P(X ∣ e)
‣ We already know that this can be answered by summing up the relevant probabilities from the full joint distribution.
‣ A Bayesian network gives a complete representation of the full joint distribution.
→ The query can be answered by computing sums of products of conditional probabilities from the network.
→ P(X ∣ e) = α P(X, e) = α ∑y P(X, e, y)
Summary
‣ Bayesian networks are Directed Acyclic Graphs
- Each node corresponds to a random variable
- Each node has a conditional distribution, given its parents
‣ Bayesian networks can represent full joint probability distributions, but can be exponentially
smaller by exploiting conditional independence relations.
‣ Constructing Bayesian Networks takes advantage of the chain rule, which is a generalization of the
product rule.
‣ Conditional distributions can be represented more compactly, e.g., by making use of deterministic functions, probability density functions, or discretization.
‣ Specific inference procedures allow us to use Bayesian networks to answer questions that involve
joint probabilities and conditional probabilities.
Practical 6: probabilistic reasoning
Alarm example:
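The practical presumably uses the standard burglary alarm network from Russell & Norvig; a minimal sketch of inference by enumeration on it, with the book's CPT values, computing P(Burglary ∣ john_calls, mary_calls):

```python
from itertools import product

# Conditional probability tables of the alarm network (Russell & Norvig).
P_B = {True: 0.001, False: 0.999}                 # P(Burglary)
P_E = {True: 0.002, False: 0.998}                 # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                   # P(john_calls | A)
P_M = {True: 0.70, False: 0.01}                   # P(mary_calls | A)

def joint(b, e, a, j, m):
    """Product of the conditional probabilities along the network."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e] * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

# P(B | j, m) = α Σ_{e,a} P(B, e, a, j, m): sum out the hidden variables E and A.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product([True, False], repeat=2))
          for b in (True, False)}
print(unnorm[True] / sum(unnorm.values()))  # ≈ 0.284
```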
Handedness example:
First test