
Summary AI final

Table of Contents

Lecture 6: local search
Practical 3 – local search & genetic algorithms
Lecture 7: adversarial search
Practical 4 – Adversarial Search
α-β pruning
Lecture 8: problem solving under uncertainty
Lecture 9: discussion lecture John Searle
Lecture 10: quantifying uncertainty
Discussion ethics of AI Bostrom and Yudkowsky
Practical 5: Quantifying uncertainty
Lecture 11: probabilistic reasoning
Practical 6: probabilistic reasoning

Lecture 6: local search


Local search algorithms
- look at neighboring states to decide what to do next;
- do not keep track of the states that have been reached;
- don’t care about the fact that there may be a better solution
somewhere else.

In local search you do not know the solution in advance; the algorithms also need only very little memory.

Hill climbing:
‣ Keeps track of one current state (no backtracking)
‣ Does not look ahead beyond the immediate neighbors of the
current state (greedy)
‣ On each iteration moves to the neighboring state with highest value
(steepest ascent)
‣ Terminates when a peak is reached (no neighbor has a higher value).

Stochastic hill climbing: random selection between the uphill moves, with probability related to
steepness.

First-choice hill climbing: random generation of successors until one is found that is better than the
current state. The key difference is that stochastic hill climbing weights its choice among all uphill
moves by steepness, while first-choice simply takes the first randomly generated successor that improves on the current state.

Hill climbing starts with an arbitrary solution to a problem and iteratively moves towards a better
solution in its neighborhood. It makes small changes to the current solution and moves towards
higher elevations (better solutions).

Problems: when it doesn’t find any higher utility around its own state, it can get stuck at:
- Local Maxima
- Plateaus (flat local maximum or shoulder)
- Ridges
‣ Sequence of local maxima that are not directly connected
‣ Each local maximum only has worse connecting states.
‣ Common in low-dimensional state spaces

Improvements:
‣ Allow for a limited number of sideways moves (if on plateau that is really a shoulder)
- Higher success rate + Higher number of moves
Stochastic hill climbing random selection between the uphill moves, with probability related to
steepness.

First-choice hill climbing random testing of successors until one is found that is better than the
current state.
- Good strategy when testing all successors is costly.
Random-restart hill climbing: do a number of hill-climbing searches from randomly selected initial states (see the sketch below).
- If each hill-climbing search has probability of success p, then a solution will be found on
average after 1/p restarts.
- It will eventually find the correct solution, because at some point the goal state itself will be chosen as the initial state.

If two successor states have the same utility (same level), choose between them at random.

- If elevation = objective function → find the global maximum or highest peak → hill climbing.
- If elevation = cost → find the global minimum or lowest valley → steepest descent
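A minimal sketch of steepest-ascent hill climbing with random restarts, assuming problem-specific helpers named neighbors, value, and random_state (illustrative names, not from the lecture):

def hill_climb(start, neighbors, value):
    # neighbors(state) -> list of successor states; value(state) -> objective to maximize
    current = start
    while True:
        successors = neighbors(current)
        if not successors:
            return current
        best = max(successors, key=value)
        if value(best) <= value(current):   # no uphill move: peak, plateau or ridge
            return current
        current = best                      # steepest ascent: take the best neighbor

def random_restart(random_state, neighbors, value, restarts=50):
    # run several hill climbs from random initial states and keep the best result found
    return max((hill_climb(random_state(), neighbors, value) for _ in range(restarts)),
               key=value)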

Simulated annealing:
- Problem with hill climbing: efficient, but it will get stuck in a local maximum.
- Problem with a random walk: extremely inefficient, but it will eventually find the global maximum.
- Combination of both → simulated annealing (more complete and still efficient)
How does it work:
- Move to randomly chosen neighbor state
- If utility is higher, always move to that state.
- If utility is lower, move to that state with probability p < 1.
- Probability of a move to a worse state
•Becomes less likely the worse the move makes the situation
•Becomes less likely as temperature decreases
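A sketch of this acceptance rule, assuming a problem-specific random_neighbor and value function; the geometric cooling schedule is an illustrative choice, not from the lecture:

import math, random

def simulated_annealing(start, random_neighbor, value, t0=1.0, cooling=0.995, t_min=1e-3):
    current, t = start, t0
    while t > t_min:
        candidate = random_neighbor(current)
        delta = value(candidate) - value(current)
        # always accept uphill moves; accept downhill moves with probability exp(delta / t),
        # which becomes smaller the worse the move is and as the temperature t decreases
        if delta > 0 or random.random() < math.exp(delta / t):
            current = candidate
        t *= cooling
    return current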

Local beam search


- Selects the k best successors at every step
- if k=1 → hill climbing
- if k ≥ 2 → parallel hill-climbing processes
Stochastic local beam search
- Selects k successors at random every step
- Probability of selection is a function of utility (aka fitness)

Genetic algorithms:
- Starts with k randomly selected states (population)
- Each state (or individual) is encoded as a string
- Each state is rated by the objective function (the fitness function)

Genetic algorithms are inspired by natural selection. They are used for finding appropriate
solutions to optimization and search problems by mimicking the process of evolution.

Fitness is based on the ability to camouflage. The best possible fitness = 0.


Fitness formula: compare the individual's code with the target code (the 'ultimate fitness', shown in blue) block by block, and take the negative sum of the absolute differences:

Fitness = −(|000 − 000| + |168 − 118| + |156 − 186|) = −80

Genetic Algorithms
- Starts with k randomly selected states (population)
- Each state (or individual) is encoded as a string
- Each state is rated by the objective function (aka fitness function)

- Two pairs are selected at random, with the probability of selection increasing with fitness.
- Crossover point for each pair is chosen at random
- Offspring are created by combining strings at crossover point
- Small probability of random mutation

Genetic algorithms combine:


- Uphill tendency
- Random exploration
- Exchange of info between search threads
Biggest advantage comes from crossover operations.

Summary:
- For many search problems, we do not need the best possible solution, or the best solution is
not achievable at all.
- Local search methods are a useful tool because they operate on complete-state formulations
without keeping track of the states that have been reached.
- Simulated annealing adds a stochastic element to hill climbing and can give
optimal solutions in some circumstances.
- Stochastic local beam search provides a first approach to the generation and
selection of states.
- Genetic algorithms maintain a large population of states and use operations
such as mutation and crossover to explore the search space.
Practical 3 – local search & genetic algorithms
Problem Representation
- Use the problem representation from Russell & Norvig: cost is the number of pairs of
queens attacking each other
- Move one queen per action, queens should stay in their column and only move up or
down

- What is the cost in the initial state?


o In the initial state, 6 pairs of queens are (in)directly attacking each other, so the
cost in the initial state is 6
- From the initial state, how many moves are possible?
o From the initial state, each of the 4 queens has 3 available moves, leading to a
total of 12 available moves from the initial state
- What is the cost of each state resulting from a first move?
In practice:
1. Starting position.
2. Start checking the different states.
3. Find the best possible state.
4. Continue checking, but if there are no better states, keep the previous best.
5. Continue from the previous best.
6. Restart checking the best states.
7. Also check the previous step.
8. Continue until the best state is found.

Genetic Algorithms (5-Queens Problem)


- Start with k randomly selected states (population)
- Each state is encoded as a string
- Each state is rated by an objective function
- Sample two pairs with probability proportional to fitness/cost
- Randomly determine crossover points
- Combine strings at crossover points
- With a small chance, mutate an element of the offspring
- Repeat for the next generation
State representation
- We can represent a state as the row numbers of the queens
o Previous board: "11111"
- cost(state) = pairs of attacking queens in state
- Sample probability: 1 − cost / (total cost of generation)

Example: cost

When hill climbing is used in the n-queens problem, the variable 'non-attacking pairs of queens' is
maximized.
Start with population 21325, 35415, 14255, 15233

Only one member of the population is shown here, but this must be done for all of them.


Rate each state with the objective function:
- cost(21325)=4
- cost(35415)=2
- cost(14255)=2
- cost(15233)=3
Sample two pairs, using the cost-based sample probability:
Sample probability: 1 − cost / (total cost of generation)
Total cost of generation: 11
- probability 21325: 1 − 4/11 ≈ 0.636
- probability 35415: 1 − 2/11 ≈ 0.818
- probability 14255: 1 − 2/11 ≈ 0.818
- probability 15233: 1 − 3/11 ≈ 0.727
Cross over
- Sample two pairs with the cost-based weights
- For example 35415 - 15233 and 35415 - 14255
- Randomly determine crossover points, for example 2 and 3
- Combine strings at crossover points:
o 35|415 - 15|233 becomes 35233 - 15415
o 354|15 - 142|55 becomes 35455 - 14215
- Determine which states in the offspring get a mutation
o For example, 35455 is mutated to 35415
- New population is 35233, 15415, 14255, 35415
- Repeat
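A compact sketch of one generation of this loop for the 5-queens representation above (states are tuples of row numbers 1–5; the function names are illustrative):

import random

def cost(state):
    # number of attacking pairs: same row or same diagonal
    n = len(state)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def sample_pair(population):
    # weight each state by 1 - cost/total, as in the lecture (lower cost -> higher weight)
    total = sum(cost(s) for s in population)          # assumes total > 0 (no solution yet)
    weights = [1 - cost(s) / total for s in population]
    return random.choices(population, weights=weights, k=2)

def crossover(a, b):
    point = random.randint(1, len(a) - 1)             # random crossover point
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(state, rate=0.1):
    state = list(state)
    if random.random() < rate:                        # small probability of random mutation
        state[random.randrange(len(state))] = random.randint(1, len(state))
    return tuple(state)

def next_generation(population):
    offspring = []
    for _ in range(len(population) // 2):
        a, b = sample_pair(population)
        c, d = crossover(a, b)
        offspring += [mutate(c), mutate(d)]
    return offspring

population = [(2, 1, 3, 2, 5), (3, 5, 4, 1, 5), (1, 4, 2, 5, 5), (1, 5, 2, 3, 3)]
print([cost(s) for s in population])   # [4, 2, 2, 3], matching the example above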
Example utility
Another example, making a change:
- Instead of attacking pairs of queens, we count non-attacking pairs of queens
o This gives utility of a state instead of cost of a state
- 32431, 42524, 53554, 14515, 44244, 45431, 14344, 51253, 55512, 23121
- U(32431)=7, relative probability: 7/54 ≈ 0.130
- U(42524)=8, relative probability: 8/54 ≈ 0.148
- U(53554)=5, relative probability: 5/54 ≈ 0.093
- U(14515)=6, relative probability: 6/54 ≈ 0.111
- U(44244)=2, relative probability: 2/54 ≈ 0.037
- U(45431)=5, relative probability: 5/54 ≈ 0.093
- U(14344)=3, relative probability: 3/54 ≈ 0.056
- U(51253)=8, relative probability: 8/54 ≈ 0.148
- U(55512)=5, relative probability: 5/54 ≈ 0.093
- U(23121)=5, relative probability: 5/54 ≈ 0.093
In the case of utility we do not subtract from 1 when calculating the sample probability.
Our starting population was 10, so sample 5 pairs and determine a crossover point for each:
- 324|31 - 512|53, crossover at 3 gives 32453 – 51231
- 425|24 - 425|24, crossover at 2 gives 42524 - 42524
- 14|344 - 51|253, crossover at 3 gives 14253 - 51344
- 3243|1 - 4252|4, crossover at 1 gives 32434 - 42521
- 4424|4 - 4252|4, crossover at 4 gives 42524 - 44244
Let’s mutate 51231, 14253, 42521 to 51431, 24253, and 42521
New population:
32453, 51431, 42524, 42524, 24253, 51344, 32434, 42521, 42524, 44244

Crossover points, mutations, and which sample pairs to combine are given.

In the case of cost (when counting the attacking pairs of queens) the sample probability is:
1 − (cost of sample / total cost of all samples)

In the case of utility (when counting the non-attacking pairs of queens) the sample probability is:
utility of sample / total utility of all samples
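The two weighting schemes side by side, reproducing the numbers from the examples above:

def cost_weights(costs):
    total = sum(costs)
    return [1 - c / total for c in costs]

def utility_weights(utilities):
    total = sum(utilities)
    return [u / total for u in utilities]

print(cost_weights([4, 2, 2, 3]))                       # ≈ [0.636, 0.818, 0.818, 0.727]
print(utility_weights([7, 8, 5, 6, 2, 5, 3, 8, 5, 5]))  # ≈ [0.130, 0.148, 0.093, ...]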
Lecture 7: adversarial search
Games studied in AI:
- Deterministic
- Two- player
- Turn taking
- Perfect information (we know everything)
- Zero sum (you win or lose) win (1) + loss (-1) = 0 (zero sum)

Agents together form an economy.

Approaches to modelling adversarial games:

1. Consider the agents together as an economy. There is no need to predict the actions of individual
agents, and this can capture large-scale characteristics of the system, such as the laws of supply and demand.
2. Consider adversaries as part of the environment. This models the probabilistic behaviour of
agents as a dynamic system and does not explicitly take into account that agents may have
conflicting goals.
3. Model agents using adversarial game-tree search
- explicitly models the other players as adversaries in the search
- only suitable for some games

Two player zero sum games:


Formalization:
1. S0: the initial state
2. To-move(s): the player whose turn it is to move in state s
3. Actions(s): the set of legal moves in state s
4. Result(s, a): the transition model, defining the state resulting from taking action a in state s
5. Is-terminal(s): true when the game is over, false otherwise
6. Utility(s, p): numeric value to player p when the game ends in state s

State space graph:


- Initial state + actions + result = state space graph
- Vertices are states, edges are moves
- Some states may be reached by multiple paths

Game tree: full representation of state space graph


- Tree that follows every sequence of moves to terminal state
- May be infinite (in case of repeatable states)
Search tree:
- Partial representation
- Used to determine what move to make
- Why: too many combinations

Ordinary search vs adversarial search:


- In a normal search such as for the 8-puzzle, we could end the game
by finding a path to a good end position.
- However, in adversarial search, the other player co-determines the
path.
Minimax search:
- Two players, MIN and MAX, take turns.
- MAX must plan ahead against each of MIN's possible moves.
- One move by one player is called a ply.

Triangle pointing downwards: MIN's turn, minimizing the outcome.
Triangle pointing upwards: MAX's turn, maximizing the outcome.

Minimax(s):
- If we are at a terminal node → utility of the terminal node
- If it is MAX's turn to move → maximum of the descendants' minimax values
- If it is MIN's turn to move → minimum of the descendants' minimax values

Description of minimax:
- Depth-first exploration of the tree
- Recursively descends each branch of the tree
- Computes the utility for terminal nodes
- Goes back up, assigning a minimax value to each node.

The time complexity of minimax is exponential: O(b^m)
- b = average branching factor
- m = maximum depth
The deeper the tree, the greater the time complexity, so a full representation cannot be used. A more
efficient way to search the game tree is alpha-beta pruning.

Minimax with alpha-beta pruning is more efficient because it has to examine fewer nodes.

Alpha beta pruning: (all chess algorithms do this)


- α (alpha) = the best value we have found so far for MAX; think α = "at least" (a lower bound)
- β (beta) = the value of the best choice we have found so far for MIN; think β = "at most" (an upper bound)
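A sketch of minimax with alpha-beta pruning, assuming a hypothetical game object that exposes is_terminal, utility, and successors (illustrative names, not from the lecture):

import math

def alphabeta(state, game, alpha=-math.inf, beta=math.inf, maximizing=True):
    if game.is_terminal(state):
        return game.utility(state)
    if maximizing:
        value = -math.inf
        for child in game.successors(state):
            value = max(value, alphabeta(child, game, alpha, beta, False))
            alpha = max(alpha, value)      # alpha = best value found so far for MAX
            if alpha >= beta:              # MIN will never let the game reach this branch
                break                      # prune the remaining children
        return value
    else:
        value = math.inf
        for child in game.successors(state):
            value = min(value, alphabeta(child, game, alpha, beta, True))
            beta = min(beta, value)        # beta = best value found so far for MIN
            if beta <= alpha:              # MAX will never let the game reach this branch
                break                      # prune the remaining children
        return value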

Move ordering:
- If we have information on which moves are generally better than others, we can improve
alpha-beta pruning by first evaluating the utility of nodes which are considered good moves.
- For instance, in chess: capture > threat > forward move > backward move

Transposition tables: in games like chess, the same position can occur as a result of different move sequences;
this is called a transposition.
- Exploring the search tree again in this case is duplicated work.
- The results of searched positions can be stored in a transposition table: you look the position up in the table
instead of searching the subtree again.

Heuristic strategies, Shannon (1950)


- Type a: (historically used for chess) consider wide but shallow part of the tree and estimate
the utility at that point.
- Type b: (historically used for Go) consider promising parts of the tree deeply and ignore
unpromising paths.

Heuristic Alpha-Beta Tree Search


Can treat non-terminal nodes as if they were terminal
Utility function, which is certain, is replaced by an evaluation function, which provides an estimate.
- e.g. queen=9, knight=3, bishop=3, rook=5, pawn=1, ...
- Typically a weighted linear function of the values
- ... but can be any function of the features
For H-MINIMAX(s, d):
- If the cut-off is reached → compute the estimated utility of the node (this equals the true utility for terminal nodes)
- If it's MAX's turn to move → maximum of the descendants' estimated utilities
- If it's MIN's turn to move → minimum of the descendants' estimated utilities

Forward pruning: prune moves that appear to be bad (based on experience)


- Type b strategy
- PROBCUT: a forward-pruning version of alpha-beta search that prunes nodes that are probably
outside of the current α-β window.
- Late move reduction: reduces the depth to which later (lower-ranked) ordered moves are searched,
and falls back to a full-depth search if the reduced search returns a value above α.

Monte Carlo Tree Search


Complexity of games like Go is far greater than that of chess:
- Go starts with a branching factor of 361 and continues with an average branching factor of 150.
- Alpha-beta search is useless because it would not be possible to see far ahead.
- Instead, multiple simulations of complete games are played-out, starting from a given position.
- Expected utility of a move is percentage of play-outs with a win given that move.
- Usually combined with exploitation of past experiences (lookup).
- Combined with reinforcement learning → neural-network-based game programs that learn by
playing against themselves (e.g., AlphaZero)

Summary
- Games can be formalized by their initial state, the legal actions, the result of each action, a
terminal test, and a utility function.
- The MINIMAX algorithm can determine the optimal moves for two-player, discrete,
deterministic, turn-taking, zero-sum games, with perfect information.
- Alpha-beta pruning can remove subtrees that are provably irrelevant.
- Heuristic evaluation functions must be used when the entire game-tree cannot be explored (i.e.,
when the utility of the terminal nodes can’t be computed).
- Monte-Carlo tree search is an alternative which plays-out entire games repeatedly and chooses
the next move based on the proportion of winning play-outs.
Practical 4 - Adversarial Search
α-β pruning

- E=2, F=3, and G=3, so B=2 and A is at least 2


- H=2, so C is at most 2
- A is at least 2 already
- We don’t need to look further below C; I and J are pruned
- K=8, so D is at most 8
- A is at least 2 and D is at most 8, so D might still improve the utility; keep searching
- L=2, so D is at most 2
- A is at least 2 already
- We don’t need to look further below D; M is pruned
Lecture 8: problem solving under uncertainty.
How can we build machines that can handle the uncertainty of the natural world? (“nothing is certain
except death and taxes” Benjamin Franklin)
To determine a location, we can use Dead-reckoning which is the process of using information about
direction, speed, and elapsed time to calculate the new location.
Odometry refers to the use of motion sensors, like measuring wheel rotation.
Where am I, the uncertainty
- The robot starts off at a known location (x1, y1).
- It makes series of movements calculating each new position using
dead-reckoning.
- The further it moves, the greater the uncertainty.
- Wheels may slip, or skid. Motor temperature rises compromising
the measures, etc.

The uncertainty is about where we are in a space, the more we move,


the higher the uncertainty.

We don't know how many blocks are behind the tree, but we can make very good guesses, called
predictions.

Intelligent systems also need to be able to make inferences under uncertainty. This requires a
different way of thinking about problems.

Deductive reasoning: going from true statements to other true statements using rules of logic. In
certain worlds. Example: all Dutch cities have a train station; therefore, Tilburg has a train station.

Inductive reasoning: going from specific observations to informed guesses, or conjectures. In
uncertain worlds. Example: you visit 10 Dutch cities, which all have train stations; therefore, you
conclude that all Dutch cities have train stations.

Faced with uncertainty, we need to make inductive inferences. If some are better than others,
questions emerge:
- How should we choose between competing explanations?
- What’s a rational solution to the problem of inductive inference?

An observation:
- Series of k observations (examples, instances, cases)
- An observation describes a set of inputs x = (x1, x2, … xn) and an output y
- Each xi is called a feature, attribute or input variable.
- Y is typically called the output variable.
Why is predicting the future hard? “Prediction is very difficult, especially if it’s about the future.” Niels
Bohr.
- Likely to be noise in the data.
- There are variations.
- We want to capture what’s systematic, not what’s accidental.
Why:
- What is systematic is likely to be observed again. Our goal is to make accurate predictions,
not describe the data.

Remember there is always an infinite number of potential models.

What makes a good model?


Good models make errors. There will be noise in the data, and some points need to be ignored; think
of the least-squares method, for instance. The model represents a guess about what systematic
pattern governs the observations.
Measuring the error: on day i, the error is the difference between the model and the observed
value: |f(xi) − yi|
Often we square this error, penalizing larger errors: E(xi) = (f(xi) − yi)²
And we often consider the mean squared error:
MSE = (1/k) Σ i=1..k (f(xi) − yi)²
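In code, the mean squared error of any model f over k observations looks like this (a minimal sketch; model is any function mapping an input to a prediction):

def mse(model, xs, ys):
    # mean of the squared differences between predictions and observed values
    k = len(xs)
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / k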

Training a model usually means finding the parameter values that minimize the error.
- Plotting the error as a function of the parameters, we get an error surface.
- We want to find the lowest point on this surface.
- Most learning algorithms attempt to minimize the error, one step at a time.
Both models on the right minimize the MSE, but it is lower for the degree-25 polynomial, which also
has more variation. So the blue (low-degree) model might capture the trend better: blue captures more of
the systematic weather pattern, while the red (high-degree) model captures more of the variance in the year 2000 itself.

Fitting vs prediction
The model fit refers to how well the trained model describes the observations.
We are interested in how well this trained model predicts new observations.
Overfitting: the model has too many parameters, so it also captures the noise.
That is not what you want for inductive inference.
Summary of the example
1. We know the daily temperature in London for the year 2000.
2. We want to predict London's temperature in the future, let's say
2001.
3. Simply predicting the 2000 temperatures for 2001 is a bad idea.
4. We need a model to capture what is systematic, and ignore what is accidental.
5. We considered polynomial models of different degrees.
6. Models make errors, and we minimize these errors.
7. When fitting, the higher the degree of the polynomial, the lower the error.
8. When predicting, there is a U-shaped relationship, a trade-off (see the sketch below).
a. Too little complexity (underfitting), too much complexity (overfitting).
b. Or too few parameters (underfitting) or too many parameters (overfitting).
c. We need to find a sweet spot, in this example a degree-4 or degree-5 polynomial.
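A rough sketch of this fit-versus-predict trade-off using synthetic stand-in data (the real London temperatures are not reproduced here; the seasonal curve and noise level are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 365)                       # day of year, rescaled for numerical stability
seasonal = 10 + 8 * np.sin(np.pi * (x + 0.5))     # invented systematic seasonal pattern
train = seasonal + rng.normal(0, 2, 365)          # stand-in for "year 2000" observations
test = seasonal + rng.normal(0, 2, 365)           # stand-in for "year 2001" observations

for degree in (1, 4, 25):
    coeffs = np.polyfit(x, train, degree)
    fit = np.polyval(coeffs, x)
    train_mse = np.mean((fit - train) ** 2)       # keeps dropping as the degree grows
    test_mse = np.mean((fit - test) ** 2)         # typically bottoms out at a moderate degree
    print(degree, round(train_mse, 2), round(test_mse, 2))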

Key point: reasoning about the uncertain world is difficult.

David Harding, 2018: “I think the public debate about AI and machine learning is nine parts
hype to one part substance.”.
Lecture 9: discussion lecture John Searle
Syntax: the rules that define how statements in a programming language may be written.

Semantics: the meaning associated with the statements of a programming language.

Connectionism: a movement in AI which held that we should be using neural networks. Creatures can create
connections between stimuli and responses through learning.

Behaviorism: tries to understand behavior by giving someone a stimulus and watching what
happens, i.e., giving an input and observing the output, without studying the inner mechanisms of the
mind.

Searle’s main points


- Some AI researchers believe that by finding the right program they will create a thinking,
conscious machine.
- Searle's Chinese room argument: running the right program is not sufficient for a thinking
machine.
- The Chinese room runs the right program but has no understanding of Chinese.
- Searle is not arguing against the possibility of creating a thinking machine; he is arguing against
the idea that doing this is merely a matter of coming up with the right program.
- If we are to construct thinking machines with consciousness, we also need to consider the
nature of the machinery that runs the program.

Counterarguments:
1. The systems reply: the person in the room doesn't understand Chinese, but the system as a
whole does understand Chinese. Searle is playing the role of a CPU, but the system has other
components, like a memory.

Searle's response: the person in the room could internalize the whole system and would still
not understand Chinese.

2. The robot reply: the person in the room doesn’t understand Chinese, but if the system were
connected to the world like a robot, with sensors, then it would understand Chinese. This
would establish a causal connection between the world and the structures being
manipulated.

Searle’s response: all these sensors provide is information. There is no difference between this
information and information passed into the room in the form of questions.

3. The brain simulator reply: what if the program precisely simulated the brain of a Chinese
speaker, including the neural architecture and the state of every neuron. Then the system
would understand Chinese.

Searle’s response: whatever system the person in the room is simulating, it will still only be a
simulation.
4. The other minds reply: the only way we attribute understanding to other people is through
their behavior. There is no other way. Therefore, we must decide if we attribute
understanding to machines in the same way, only through their behavior.

Searle’s response: the problem in this discussion is not about how I know that other people
have cognitive states, but rather what it is that I am attributing to them when I attribute
cognitive states to them.

There is a difference: we know machines are just manipulating symbols without knowing
what they mean, but we are not sure about people.

Paul M. Churchland and Patricia Smith Churchland of the University of California at San Diego
claim that circuits modelled on the brain might well achieve intelligence. On the opposing
side, John R. Searle of the University of California at Berkeley maintains that computer
programs can never give rise to minds.

Strong AI claims that thinking is merely the manipulation of formal symbols, and that is
exactly what the computer does: manipulate formal symbols. This view is often summarized
by saying, "The mind is to the brain as the program is to the hardware."

He continues by giving an example: Now, the rule book (syntax and no semantics) is the
"computer program." The people who wrote it are "programmers," and I am the "computer."
The baskets full of symbols are the "data base," the small bunches that are handed in to me
are "questions" and the bunches I then hand out are "answers." Like a computer, I
manipulate symbols, but I attach no meaning to the symbols. But from the outside it does
look like I can speak Chinese.

Axiom 1. Computer programs are formal (syntactic).


This describes the fact that computers encode information symbolically; it is also what gives the
computer the power of a universal machine.
First, symbols and programs are purely abstract notions: they have no essential physical
properties to define them and can be implemented in any physical medium whatsoever. The
second point is that symbols are manipulated without reference to any meanings.

Axiom 2. Human minds have mental contents (semantics).


I attach specific meanings to these words, in accordance with my knowledge of English. In
this respect they are unlike Chinese symbols for me.

Axiom 3. Syntax by itself is neither constitutive of nor sufficient for semantics.


Merely manipulating symbols is not enough to guarantee knowledge of what they mean.

Conclusion 1. Programs are neither constitutive of nor sufficient for minds.


The point is that there is a distinction between formal elements, which have no intrinsic
meaning or content, and those phenomena that have intrinsic content. Strong AI is false.

You can't get semantically loaded thought contents from formal computations alone,
whether they are done in serial or in parallel; that is why the Chinese room argument refutes
strong AI in any form.
Axiom 4. Brains cause minds.
The causation is from the "bottom up" in the sense that lower level neuronal processes cause
higher-level mental phenomena. The answer is that the brain does not merely instantiate a
formal pattern or program (it does that, too), but it also causes mental events by virtue of
specific neurobiological processes. It seems obvious that a simulation of cognition will
similarly not produce the effects of the neurobiology of cognition.

Conclusion 2. Any other system capable of causing minds would have to have causal powers
(at least) equivalent to those of brains.
This is like saying that if an electrical engine is to be able to run a car as fast as a gas engine, it
must have (at least) an equivalent power output.

Conclusion 3. Any artifact that produced mental phenomena, any artificial brain, would have
to be able to duplicate the specific causal powers of brains, and it could not do that just
by running a formal program.

Conclusion 4. The way that human brains actually produce mental phenomena cannot be
solely by virtue of running a computer program.
a. In the Chinese room you really do understand Chinese, even though you don't know it.
It is, after all, possible to understand something without knowing that one understands
it.
b. You don't understand Chinese, but there is an (unconscious) subsystem in you that
does. It is, after all, possible to have unconscious mental states, and there is no reason
why your understanding of Chinese should not be wholly unconscious.
Searle: as described by Searle, Chinese characters are just symbols, i.e., syntax. I don't understand
how your unconscious would be able to understand the meaning without prior knowledge, since the
meaning of a symbol is open to interpretation, as language in general also is.

c. You don't understand Chinese, but the whole room does. You are like a single neuron
in the brain, and just as such a single neuron by itself cannot understand but only
contributes to the understanding of the whole system, you don't understand, but the
whole system does.
Searle argues against this with his description of the multiple men in a room.

d. Semantics doesn't exist anyway; there is only syntax. It is a kind of prescientific


illusion to suppose that there exist in the brain some mysterious "mental contents,"
"thought processes" or "semantics." All that exists in the brain is the same sort of
syntactic symbol manipulation that goes on in computers. Nothing more.
Both: I believe that semantics could be a more complex and advanced form of syntax, and in that way
symbol manipulation. Studies have shown that different cultures and languages have different
perceptions of the world around them. It could be that, due to (too little) complexity in language,
we have not yet been able to understand mental content in this way.
e. You are not really running the computer program-you only think you are. Once you
have a conscious agent going through the steps of the program, it ceases to be a case
of implementing a program at all.
Searle: if it were the case that it is a conscious agent, it would have the ability and the choice not to
follow any of the steps, whereas a computer always follows the steps and rules set in the program.

f. Computers would have semantics and not just syntax if their inputs and outputs were
put in appropriate causal relation to the rest of the world. Imagine that we put the
computer into a robot, attached television cameras to the robot's head, installed
transducers connecting the television messages to the computer and had the computer
output operate the robot's arms and legs. Then the whole system would have a
semantics.
Neither: I don't agree that attaching arms, legs, etc. would give the computer semantics.
Nevertheless, I think semantics could be some sort of higher-level syntax if the symbols had meaning,
thereby creating a causal relationship.

g. If the program simulated the operation of the brain of a Chinese speaker, then it would
understand Chinese. Suppose that we simulated the brain of a Chinese person at the
level of neurons. Then surely such a system would understand Chinese as well as any
Chinese person's brain.

Searle/my own: the argument against this one is that, according to Searle, simulations are not the
real thing. I would like to argue that each brain also processes and stores information in a different
manner. Therefore, the simulation of a brain would not necessarily match the level of Chinese
speakers in general. Furthermore, the simulation might not be able to produce any further actions,
like speaking, in a person.
Lecture 10: quantifying uncertainty
Rational agents with perfect knowledge of the environment (but rarely the entire environment):
- Can find an optimal solution by exploring the complete environment.
- Can find a good, but maybe suboptimal, solution by exploring part of the environment using
heuristics.

Probability theory: possible worlds / sample space

What should rational agents do if they don’t have perfect information? Maximize performance by
keeping track of the relative importance of different outcomes and the likelihood that these
outcomes will be achieved.

Logic is insufficient: only an exhaustive list of possibilities on the right-hand side will make the rule
true.
- Laziness: it’s too much work to make and use the rules
- Theoretical ignorance: we don’t know everything there is to know.
- Practical ignorance: we don’t have access to all the info.
So, replace certainty (logic) with degrees of belief (probability).

Probability statements are usually made with regard to a knowledge state.


- Actual state: patient has a cavity or patient does not have a cavity.
- Knowledge state: probability that the patient has a cavity if we
haven’t observed her yet.

Decision theory = probability theory + utility theory

Principle of maximum expected utility (MEU)


- An agent is rational if and only if it chooses the action that yields
the highest expected utility.
- Expected = average of outcome utilities, weighted by probability
of the outcome

Possible worlds:
- The term possible worlds originates in philosophy, in reference to ways in which the actual world
could have been different.
- In statistics and AI, we use it to refer to the possible states of whatever we are trying to
represent, for example the possible configurations of a chess board or the possible outcomes of
throwing a die.
- The term world is limited to the problem we are trying to represent.
Possible worlds:
- A possible world (ω, lowercase omega) is a state that the world could be in.
- The set of possible worlds ( Ω, capital omega ) includes all the states that the world could be in.
In other words, Ω must be exhaustive.
- Each possible world must be different from all the other possible worlds. In other words,
possible worlds must be mutually exclusive.

Set of all possible worlds = sample space = Ω


- Ω = {(1,1), (1,2), ... (6,5), (6,6)}

Possible world = element of the sample space = ω


- ω₁ = (1,1)
- ω₃₆ = (6,6)
0 ≤ P(ω) ≤ 1 for every ω and ∑ω∈Ω P(ω) = 1

Event: set of worlds in which a proposition holds


Probability of an event: sum of probabilities of the worlds in which a proposition holds.
For example, two dice, rolling 11 in total. Proposition: Total = 11.
Event: the set of worlds in which the proposition holds, {(5, 6), (6, 5)}.
The probability of (5, 6) is 1/36 and the probability of (6, 5) is 1/36, so the probability of the event is 2/36 = 1/18.

Conditional and Unconditional Probabilities:


Unconditional probabilities: Degree of belief in propositions in the absence of other information.
- Also known as prior probabilities or priors.
Conditional probabilities: Degree of belief given other information.
- Example: rolling doubles given that the first die is a 5
- P(doubles | Die1 = 5)

Computing conditional probabilities

P(a | b) = P(a ∧ b) / P(b)

P(doubles | Die1 = 5) = P(doubles ∧ Die1 = 5) / P(Die1 = 5) = (1/36) / (1/6) = 1/6

Product Rule:
P(a ∣ b) = P(a ∧ b) / P(b)
Implies: P(a ∧ b) = P(a ∣ b)P(b)
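A quick enumeration check of the dice example, treating the sample space as 36 equally likely worlds:

from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))    # the 36 possible worlds for two dice

def p(event):
    # probability of an event = fraction of the equally likely worlds where it holds
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

doubles = lambda w: w[0] == w[1]
die1_is_5 = lambda w: w[0] == 5

p_joint = p(lambda w: doubles(w) and die1_is_5(w))   # P(doubles ∧ Die1=5) = 1/36
p_cond = p_joint / p(die1_is_5)                      # (1/36) / (1/6) = 1/6
print(p_cond)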

Random variable: Function that maps from a set of possible worlds to a domain or range, always
starts with an uppercase letter.
Example: random variable total is defined as the sum of throwing two dice.
- Possible worlds: (1,1), (1,2), …. (6,6).
- Domain or range: (2, 3, 4, … 12)
Example domains of a random variable A
Boolean: {True, False}
- A = true, written as a
- A = false, written as ¬a
Arbitrary: {blonde, brown, black, red}
- A = blonde, written as blonde
Infinite and discrete: A = ℤ (set of integers)
Infinite and continuous: A = ℝ (set of real numbers)

Joint probability distribution:


P( Toothache, Cavity) = P(Toothache ∣ Cavity)P(Cavity)
- Boldface P means “for all possible values of the random variable”.
- A probability model is completely determined by the joint distribution for all of the random
variables.
E.g., P(Cavity, Toothache, Catch) = 2x2x2 table
Probability Axioms
We already saw:
‣ 0 ≤ P(ω) ≤ 1 for every ω
‣ ∑ω∈Ω P(ω) = 1
From this we can derive:
Complement of a proposition and its negation
- P( ¬a) = 1 − P(a)
Inclusion - Exclusion principle
- P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

P(cavity ∨ toothache) = everything except the worlds with no cavity and no toothache, which are the
entries 0.144 and 0.576, so 1 − 0.144 − 0.576 = 0.28.

We sum up the probabilities for each possible value of the other variable,
taking them out of the equation

P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6

P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
P(¬cavity ∧ toothache) = 0.016 + 0.064
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

Conditioning
Marginalization P(Y) = ∑z P(Y, Z = z)
Here ∑z means the sum over all the possible values of the set of variables Z
→ Via product rule P(Y) = ∑Z P(Y ∣ Z)P(Z)

P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache) = (0.108 + 0.012) / 0.2 = 0.6
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = (0.016 + 0.064) / 0.2 = 0.4
In both cases the denominator P(toothache) = 0.2 is the same constant.
Normalization constant α for the distribution P(Cavity | toothache): α = 1 / 0.2 = 5
Note the uppercase variable name, which means "for all values of Cavity", so both cavity and ¬cavity.

Full joint distribution


P( Cavity ∣ toothache ) = αP(Cavity, toothache)
= α[P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α[⟨0.108,0.016⟩ + ⟨0.012,0.064⟩] = α⟨0.12,0.08⟩ = ⟨0.6,0.4⟩
Add up the probabilities for each of the values of the variable

α is used to scale the probabilities; since the probabilities must add up to 1, you can
derive α from the relative proportions: 0.12 + 0.08 = 0.2 ⟹ α = 1 / 0.2 = 5
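A small sketch of this inference from the full joint distribution. The two entries not quoted in these notes (0.072 and 0.008, for cavity without toothache) are taken from the standard textbook table, so treat them as assumed values:

# full joint distribution over (cavity, toothache, catch)
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,   # assumed textbook values
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
NAMES = ("cavity", "toothache", "catch")

def P(query, evidence):
    # P(query | evidence): sum the matching worlds, then normalize with alpha
    def matches(world, assignment):
        return all(world[NAMES.index(var)] == val for var, val in assignment.items())
    unnormalized = {val: sum(p for w, p in joint.items()
                             if matches(w, {**evidence, query: val}))
                    for val in (True, False)}
    alpha = 1 / sum(unnormalized.values())
    return {val: alpha * p for val, p in unnormalized.items()}

print(P("cavity", {"toothache": True}))   # {True: 0.6, False: 0.4}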

General inference procedure


General form of the procedure described in previous slide:
P(X ∣ e) = αP(X, e) = α ∑y P(X, e, y)
→ Can answer questions about the probability distribution of a discrete random variable X, given
evidence variables E, and unobserved variables Y.
Does not scale well. For n variables with two values each:
‣ space complexity = O(2^n)
‣ time complexity = O(2^n)
‣ (i.e., the complexity doubles with every additional variable)

Independence
‣ Assumptions about independence are usually based on domain knowledge.
‣ Independence drastically reduces the amount of information needed to specify the full joint
distribution
For instance: rolling 5 dice
‣ Full joint distribution: 6^5 = 7776 entries
‣ Five single-variable distributions: 6 × 5 = 30 entries


Conditional Independence
P(X, Y | Z ) = P(X | Z ) P(Y | Z )
Example:
Catch and toothache are not independent: if the probe catches, then it is likely that the tooth has a
cavity and that this cavity causes a toothache.
However, toothache and catch are independent, given the presence or absence of a cavity.
• If a cavity is present, then whether there is a toothache is not dependent on whether the probe
catches, and vice versa.
• If a cavity is not present, then whether there is a toothache is not dependent on whether the probe
catches, and vice versa.
→ P(toothache , catch | cavity ) = P( toothache | cavity )P (catch | cavity)

Bayes’ Rule - Derivation


Product Rule
‣ P(a ∧ b) = P(a ∣ b)P(b)
‣ P(a ∧ b) = P(b ∣ a)P(a)
Bayes Rule
‣ Since P(b ∣ a)P(a) = P(a ∣ b)P(b)
‣ P(b ∣ a) = P(a ∣ b)P(b) / P(a)
Useful when you have estimates for three of the four terms and you need to compute the fourth.

For multivalued variables


‣ P(Y ∣ X) = P(X ∣ Y )P(Y ) / P(X)
With normalization
‣ P(Y ∣ X) = αP(X ∣ Y )P(Y )
‣ α is the normalization constant needed to make the entries in P(Y|X) sum to 1
Example: determining the probability of a cause given a certain effect (diagnosis).
‣ Example: what is the probability that you ate a magic mushroom if you are hallucinating?
‣ P(magic mushroom | hallucination) = P(hallucination | magic mushroom) P(magic mushroom) / P(hallucination)

A patient comes into the hospital with hallucinations after lunch.


‣ Magic mushrooms cause hallucinations 70% of the time.
‣ The prior probability that someone ate magic mushrooms for lunch is 1/50,000.
‣ The prior probability that someone who comes into the hospital has hallucinations is 1%.

P(magic mushroom | hallucination) = P(hallucination | magic mushroom) P(magic mushroom) / P(hallucination)
P(magic mushroom | hallucination) = (0.7 × 0.00002) / 0.01 = 0.0014
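The same calculation as a short check:

p_h_given_m = 0.7        # P(hallucination | magic mushroom)
p_m = 1 / 50000          # prior P(magic mushroom) = 0.00002
p_h = 0.01               # prior P(hallucination)

print(p_h_given_m * p_m / p_h)   # 0.0014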

Summary
‣ Logic is insufficient to act rationally under uncertainty.
‣ Decision theory states that under uncertainty, the best action is the one that maximizes the
expected utility of the outcomes.
‣ Probability theory formalizes the notions we require to infer the expected utility of actions under
uncertainty.
‣ Given a full joint probability distribution, we can formalize a general inference procedure.
‣ Bayes’ rule allows for inferences about unknown probabilities from conditional probabilities.
‣ Neither the general inference procedure, nor Bayes’ rule scale up well.
‣ Assuming conditional independence allows for the full joint probability distribution to be factored
into smaller conditional distributions → Naive Bayes
Discussion ethics of AI Bostrom and Yudkowsky
Qualia: the subjective, conscious experience of something, like the taste of chocolate (what it is like
to have the experience).

Substrate: the underlying foundation or medium on which something is based or implemented; here, the
physical medium in which a mind is implemented.

Sentience: The capacity for phenomenal experience or qualia, such as the capacity to feel pain and
suffer.

Sapience: A set of capacities associated with higher intelligence, such as self- awareness and being a
reason-responsive agent.

The article addresses two main concerns:


- Protecting us from AI: As AI penetrates deeper into society, what ethical and moral issues does
this pose?
- Protecting the rights of AI systems: Should AI systems have moral status, and if so, when and
why? What are the implications?

Three scenarios are considered in the article:


1. The current scenario: Approaching Artificial General Intelligence.
2. The scenario when we attribute moral status to machines.
3. The scenario when minds with “exotic properties” exist.

Scenario 1: Approaching Artificial General Intelligence


- Because AGI aims at general abilities, AI systems of the future are likely to carry out
tasks that we didn’t design them for. Will they behave ethically when carrying out these tasks?
- The moral/ethical implications of AGI systems need to be verified before they are
deployed. How can we do this? The systems must somehow think in the same way that a
trustworthy designer would.
- Ethical cognitive considerations need to be made part of the engineering problem, rather than
being considered as an afterthought.

Scenario 2: machines with moral status


2 properties seem relevant when attributing moral status:
- Sentience: the ability to feel
- Sapience: the ability to think, reason, and be self-aware.
Do both need to be established for moral status? This is a subjective question.

Principles:
Non-discrimination principles when attributing moral status:
- Principle of substrate non-discrimination: all else being equal, an agent's sentience or sapience
should be judged independently of the physical substrate on which it is implemented.
- Principle of ontogeny non-discrimination: all else being equal, an agent's sentience or sapience
should be judged independently of the process that created the agent.

Scenario 3: minds with exotic properties


We need to be open minded about what kind of systems might possess sentience and
sapience. The notions of morality and ethics have always evolved, fitting the concerns of the
time. This is likely to continue, and AI may play a significant role in shaping future notions
of ethics and morality.

We need to think beyond the human condition.

Principle of Subjective Rate of Time:


Here the writers argue that moral treatment should take the subjective perception of time into account,
because computers/software/AI process information at a pace unmatched by human capabilities.
Therefore, the subjective perception of time needs to be considered.

Two exotic properties:


- Objective vs subjective time: should machines that think faster than us go to prison for a shorter
period of time?
- Accelerated reproduction: Should machines that reproduce faster than others be subjected to
different moral codes?
Practical 5: Quantifying uncertainty
Exercise 1

Exercise 2

Exercise 3
Lecture 11: probabilistic reasoning:
Networks: mathematical structures used to model pairwise relations between objects. There is a
difference between connected graphs and disconnected graphs. A point is called a vertex or node;
the lines between nodes are called edges (connections).

A directed graph can also be cyclic.

Acyclic – not possible to go back to a certain node.


Cyclic – able to reach to the same node again by following the cycle.
Trees
Structure or network that can be drawn as a tree.
Path – cannot revisit the same node again
Trail – cannot pass through the same edge again
Walk – unrestricted

Use an adjacency matrix to represent a graph in a computer. A 1 means there is a connection.

Bayesian network
Data structure that represents dependencies among variables.
Other names: belief network, decision network, and causal graph.

Directed Acyclic Graph (DAG)


- Arrow from X to Y means X is a parent of Y
- Each node corresponds to a random variable (continuous or discrete)
- Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the
effect of the parents on the node.

Constructing Bayesian networks:


- It’s a representation of a joint probability distribution.
- How to construct a network:
- By ordering the nodes so that causes precede effects
- And using chain rule to rewrite joint probabilities as conditional probabilities.

Generalisation: If the variables are ordered, so that for each variable the parent nodes are a subset of
the earlier variables, then the chain rule can be used to construct a Bayesian network.

Constructing Bayesian Networks


1. Determine the nodes that are required to model the domain.
- Order the nodes so that causes precedes effects.
- Result: {X1, ..., Xn}
2. For each node Xi, from X1 to Xn, choose a minimal set of parents from the nodes that precede it.
- For each parent, insert a link from the parent to Xi.
- Write down the conditional probability table P(Xi | Parents(Xi)).

Constructing Bayesian Networks


A Bayesian network is a correct representation of the domain only if each
node is conditionally independent of its other predecessors, given its
parents.
According to the chain rule:
P(B, E, A, J, M) = P(M | B, E, A, J) P(J | B, E, A) P(A | B, E) P(E | B) P(B)
Reduced by conditional independence:
P(B, E, A, J, M) = P(M | A) P(J | A) P(A | B, E) P(B) P(E)
The space complexity of the CPT (conditional probability table) for a Boolean node with k parents is
O(2^k): the 2 is there because each variable has two values, and k is the number of parents. But most
relationships between parents and their descendants are not completely arbitrary.
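A small sketch of how such a network represents the full joint distribution via the factorization above. The notes give only the structure of the burglary/earthquake/alarm network, so the CPT numbers below are the usual illustrative textbook values, not values from the lecture:

P_B = {True: 0.001, False: 0.999}                 # P(Burglary)
P_E = {True: 0.002, False: 0.998}                 # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(Alarm=true | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                   # P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}                   # P(MaryCalls=true | Alarm)

def joint(b, e, a, j, m):
    # chain rule reduced by conditional independence:
    # P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# probability that both neighbours call and the alarm rings, with no burglary and no earthquake
print(joint(False, False, True, True, True))      # ≈ 0.000628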

Deterministic nodes: the value of the node is specified exactly by the values of its parents, with no
uncertainty. The relationship can be logical or numerical.

Dealing with continuous variables:


‣ Discretization: split up temperature in low, medium, and high.
‣ Model as probability density function with given parameters
‣ e.g., temperature is normally distributed with mean 18 and standard deviation 3 → high probability
of temperature 17, but low probability of temperature 11 or 25
‣ You don’t need a conditional probability table because the probability of
an event can be computed given the density function (available in most
statistics software libraries)
‣ P(t=18) = 0.133
‣ P(t=11) = 0.009
‣ P(t=25) = 0.009
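Reproducing those density values with the normal probability density function (mean 18, standard deviation 3):

import math

def normal_pdf(x, mean=18.0, sd=3.0):
    # density of the normal distribution at x
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

print(round(normal_pdf(18), 3))   # 0.133
print(round(normal_pdf(11), 3))   # 0.009
print(round(normal_pdf(25), 3))   # 0.009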

Exact inference in Bayesian networks


Goal: compute the posterior probability distribution for a set of query
variables given an observed event.
‣ X : query variable
‣ {E1, E2, … En} : evidence variables
‣ {Y1, Y2, … Yn} : non-evidence variables or hidden variables.
‣ Query: P (X | e)
‣ What is the posterior probability of X given the event e ?

Inference by enumeration
Query: P(X | e)
‣ We already know that this can be answered by summing up the
relevant probabilities from the full joint distribution.
‣ A Bayesian network gives a complete representation of the full joint
distribution.
→ The query can be answered by computing sums of products of conditional probabilities from the network.
→ P(X | e) = αP(X, e) = α ∑y P(X, e, y)
Summary
‣ Bayesian networks are Directed Acyclic Graphs
- Each node corresponds to a random variable
- Each node has a conditional distribution, given its parents
‣ Bayesian networks can represent full joint probability distributions, but can be exponentially
smaller by exploiting conditional independence relations.
‣ Constructing Bayesian Networks takes advantage of the chain rule, which is a generalization of the
product rule.
‣ Conditional distributions can be represented more compactly, e.g., by making use of deterministic
functions, probability density functions, or discretization.
‣ Specific inference procedures allow us to use Bayesian networks to answer questions that involve
joint probabilities and conditional probabilities.
Practical 6: probabilistic reasoning
Alarm example:

Handedness example:
First test

Second test with updated value
