Artificial Intelligence
Definition of AI

What is AI? Artificial Intelligence is concerned with the design of intelligence in an artificial device. The term was coined by McCarthy in 1956. There are two ideas in the definition: (1) intelligence and (2) an artificial device.

What is intelligence? Is it that which characterizes humans, or is there an absolute standard of judgment? Accordingly there are two possibilities:
A system with intelligence is expected to behave as intelligently as a human.
A system with intelligence is expected to behave in the best possible manner.
Secondly, what type of behavior are we talking about? Are we looking at the thought process or reasoning ability of the system, or are we only interested in the final manifestations of the system in terms of its actions? Given this scenario, different interpretations have been used by different researchers in defining the scope and view of Artificial Intelligence.
1. One view is that artificial intelligence is about designing systems that are as intelligent as humans. This view involves trying to understand human thought and an effort to build machines that emulate the human thought process. This is the cognitive science approach to AI.

2. The second approach is best embodied by the concept of the Turing Test. Turing held that in the future computers could be programmed to acquire abilities rivaling human intelligence. As part of his argument Turing put forward the idea of an 'imitation game', in which a human being and a computer would be interrogated under conditions where the interrogator would not know which was which, the communication being entirely by textual messages. Turing argued that if the interrogator could not distinguish them by questioning, then it would be unreasonable not to call the computer intelligent. Turing's 'imitation game' is now usually called 'the Turing test' for intelligence.

3. Logic and laws of thought deals with studies of ideal or rational thought process and inference. The emphasis in this case is on the inference mechanism and its properties. That is, how the system arrives at a conclusion, and the reasoning behind its selection of actions, is very important in this point of view. The soundness and completeness of the inference mechanisms are important here.

4. The fourth view of AI is that it is the study of rational agents. This view deals with building machines that act rationally. The focus is on how the system acts and performs, and not so much on the reasoning process. A rational agent is one that acts rationally, that is, in the best possible manner.
Problem Solving:
Strong AI aims to build machines that can truly reason and solve problems. These machines should be self-aware, and their overall intellectual ability should be indistinguishable from that of a human being. Excessive optimism in the 1950s and 1960s concerning strong AI has given way to an appreciation of the extreme difficulty of the problem. Strong AI maintains that suitably programmed machines are capable of cognitive mental states.

Weak AI deals with the creation of some form of computer-based artificial intelligence that cannot truly reason and solve problems, but can act as if it were intelligent. Weak AI holds that suitably programmed machines can simulate human cognition.

Applied AI aims to produce commercially viable "smart" systems such as, for example, a security system that is able to recognize the faces of people who are permitted to enter a particular building. Applied AI has already enjoyed considerable success.

Cognitive AI: computers are used to test theories about how the human mind works -- for example, theories about how we recognize faces and other objects, or about how we solve abstract problems.

Best First Search: Best-first search is a way of combining the advantages of both depth-first and breadth-first search into a single method. One way of combining the two is to follow a single path at a time, but switch paths whenever some competing path looks more promising than the current one does.
At each step of the best-first search process, we select the most promising of the nodes we have generated so far. This is done by applying an appropriate heuristic to each of them. We then expand the chosen node by using the rules to generate its successors. If one of them is a solution, we can quit. If not, all those new nodes are added to the set of nodes generated so far. Again the most promising node is selected and the process continues.

The figure shows the beginning of a best-first search procedure. Initially, there is only one node, so it will be expanded. Doing so generates three new nodes. The heuristic function, which, in this example, is an estimate of the cost of getting to a solution from a given node, is applied to each of these new nodes. Since node D is the most promising, it is expanded next, producing two successor nodes, E and F. The heuristic function is then applied to them. Now another path, the one going through node B, looks more promising, so it is pursued, generating nodes G and H. But when these new nodes are evaluated they look less promising than another path, so attention is returned to the path through D to E. E is then expanded, yielding nodes I and J. At the next step, J will be expanded, since it is the most promising. This process can continue until a solution is found.
Fig: A best-first search.

The actual operation of the algorithm is very simple. It proceeds in steps, expanding one node at each step, until it
generates a node that corresponds to a goal state. At each step, it picks the most promising of the nodes that have so far been generated but not expanded. It generates the successors of the chosen node, applies the heuristic function to them, and adds them to the list of open nodes, after checking to see if any of them have been generated before. By doing this check, we can guarantee that each node appears only once in the graph, although many nodes may point to it as a successor. Then the next step begins. The process can be summarized as follows.

Algorithm: Best-First Search
1. Start with OPEN containing just the initial state.
2. Until a goal is found or there are no nodes left on OPEN do:
(a) Pick the best node on OPEN.
(b) Generate its successors.
(c) For each successor do:
i. If it has not been generated before, evaluate it, add it to OPEN, and record its parent.
ii. If it has been generated before, change the parent if this new path is better than the previous one. In that case,
update the cost of getting to this node and to any successors that this node may already have.
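To make the bookkeeping concrete, here is a minimal Python sketch of best-first search using a priority queue for OPEN. The toy graph and heuristic values are hypothetical, and the re-parenting step (2(c)ii) is omitted for brevity.

import heapq

def best_first_search(start, goal, successors, h):
    """Greedy best-first search: always expand the OPEN node
    with the lowest heuristic value h(n)."""
    open_list = [(h(start), start)]          # priority queue keyed on h
    parents = {start: None}                  # also serves as the 'generated' set
    while open_list:
        _, node = heapq.heappop(open_list)   # pick the best node on OPEN
        if node == goal:
            path = []
            while node is not None:          # walk back-pointers to recover path
                path.append(node)
                node = parents[node]
            return path[::-1]
        for succ in successors(node):
            if succ not in parents:          # evaluate only newly generated nodes
                parents[succ] = node
                heapq.heappush(open_list, (h(succ), succ))
    return None                              # OPEN exhausted: no solution

# Hypothetical toy graph and heuristic values, for illustration only.
graph = {'A': ['B', 'C', 'D'], 'B': ['G', 'H'], 'D': ['E', 'F'], 'E': ['I', 'J'],
         'C': [], 'F': [], 'G': [], 'H': [], 'I': [], 'J': []}
h_values = {'A': 6, 'B': 5, 'C': 8, 'D': 3, 'E': 4, 'F': 6, 'G': 6, 'H': 5, 'I': 2, 'J': 0}
print(best_first_search('A', 'J', lambda n: graph[n], lambda n: h_values[n]))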
3. Remove from OPEN the node n that has the smallest value of f*(n). If this node is a goal node, return success and stop. Otherwise,
4. Expand n, generating all of its successors, and place n on CLOSED. For every successor n', if n' is not already on OPEN or CLOSED, attach a back-pointer to n, compute f*(n'), and place it on OPEN.
5. Each n' that is already on OPEN or CLOSED should have its back-pointer redirected to reflect the lowest-cost g*(n') path. If such an n' was on CLOSED and its pointer was changed, remove it and place it on OPEN.
6. Return to step 2.

It has been shown that the A* algorithm is both complete and admissible. Thus, A* will always find an optimal path if one exists. The efficiency of an A* algorithm depends on how closely h* approximates h and on the cost of computing f*.
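A compact Python rendering of these steps follows, under the usual reading f*(n) = g*(n) + h*(n); the weighted graph and heuristic estimates are illustrative, not from the original text.

import heapq

def a_star(start, goal, successors, h):
    """A* search: expand the OPEN node with the smallest
    f*(n) = g*(n) + h*(n), re-pointing when a cheaper path appears."""
    g = {start: 0}
    parents = {start: None}
    open_heap = [(h(start), start)]
    closed = set()
    while open_heap:
        _, n = heapq.heappop(open_heap)          # node with smallest f*(n)
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = parents[n]
            return g[goal], path[::-1]
        if n in closed:
            continue                             # stale queue entry
        closed.add(n)
        for succ, cost in successors(n):
            tentative = g[n] + cost
            if tentative < g.get(succ, float('inf')):
                g[succ] = tentative              # lower-cost path found: re-point
                parents[succ] = n
                heapq.heappush(open_heap, (tentative + h(succ), succ))
                closed.discard(succ)             # reopen if it was on CLOSED

# Hypothetical weighted graph, for illustration only.
edges = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('G', 12)],
         'B': [('G', 5)], 'G': []}
h_est = {'S': 7, 'A': 6, 'B': 2, 'G': 0}
print(a_star('S', 'G', lambda n: edges[n], lambda n: h_est[n]))   # (8, ['S','A','B','G'])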
AO* algorithm:
1. Place the starting node s on open.
2. Using the search tree constructed thus far, compute the most promising solution tree To. Select a node n that is both on open and a part of To. Remove n from open and place it on closed.
3. If n is a terminal goal node, label n as solved. If the solution of n results in any of n's ancestors being solved, label all such ancestors as solved. If the start node s is solved, exit with success, where To is the solution tree. Remove from open all nodes with a solved ancestor.
4. If n is not a solvable node (operators cannot be applied), label n as unsolvable. If the start node is labeled as unsolvable, exit with failure. If any of n's ancestors become unsolvable because n is, label them unsolvable as well. Remove from open all nodes with unsolvable ancestors.
5. Otherwise, expand node n, generating all of its successors. For each successor that represents more than one sub-problem, generate successors corresponding to the individual sub-problems. Attach to each newly generated node a back-pointer to its predecessor. Compute the cost estimate h* for each newly generated node, and place all such nodes that do not yet have descendants on open. Finally, recompute the values of h* at n and at each ancestor of n.
6. Return to step 2.
Hill Climbing: This method requires that some information be available with which to evaluate and order the most promising choices. Hill climbing is like depth-first searching where the most promising child is selected for expansion. When the children have been generated, alternative choices are evaluated using some type of heuristic function. The path that appears most promising is then chosen, and no further reference to the parent or other children is retained. This process continues from node to node, with previously expanded nodes being discarded.

Hill climbing can produce substantial savings over blind searches when an informative, reliable function is available to guide the search to a global goal. It suffers from some serious drawbacks when this is not the case. Potential problem types, named after certain terrestrial anomalies, are the foothill, ridge, and plateau traps.

The foothill trap results when local maxima or peaks are found. In this case the children all have less promising goal distances than the parent node. The search is essentially trapped at the local node with no indication of goal direction. The only remedies are to try moving in some arbitrary direction for a few generations in the hope that the
real goal direction will become evident, or to backtrack to an ancestor node and try a secondary path choice. A second potential problem occurs when several adjoining nodes have higher values than surrounding nodes. This is the equivalent of a ridge. The search may also encounter a plateau type of structure, that is, an area in which all neighboring nodes have the same values. Once again, one of the methods noted above must be tried to escape the trap.
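A minimal Python sketch of this greedy climb follows; the one-dimensional landscape at the end is a made-up illustration with a local maximum at x = 2 (a foothill trap) and the global maximum at x = 8.

def hill_climb(start, neighbours, value):
    """Simple hill climbing: move to the best-valued neighbour,
    discarding everything else; stops at a peak (or plateau)."""
    current = start
    while True:
        candidates = neighbours(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) > value(current):
            current = best                 # keep climbing
        else:
            return current                 # foothill/plateau/ridge: trapped

# Hypothetical landscape: local maximum at x=2, global maximum at x=8.
def f(x):
    return -(x - 2) ** 2 + 3 if x < 5 else -(x - 8) ** 2 + 10

print(hill_climb(0, lambda x: [x - 1, x + 1], f))   # -> 2, a foothill trap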
3. If the first element on the queue is a goal node g, return success and stop. Otherwise,
4. Remove and expand the first element from the queue and place all the children at the end of the queue in any order.
5. Return to step 2.
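These steps are the core of a breadth-first, queue-based search; a minimal Python sketch under that reading follows. The visited set is an addition, not in the steps above, used to keep already-generated nodes from being re-queued.

from collections import deque

def breadth_first_search(start, is_goal, successors):
    """Queue-based search matching the steps above: test the front
    of the queue, expand it, and append its children at the end."""
    queue = deque([start])
    visited = {start}
    while queue:
        node = queue[0]
        if is_goal(node):              # step 3: goal test on the first element
            return node
        queue.popleft()                # step 4: remove and expand
        for child in successors(node):
            if child not in visited:   # avoid re-queueing duplicates
                visited.add(child)
                queue.append(child)
    return None                        # queue exhausted: failure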
Mini-max search: The minimax search procedure is a depth-limited search procedure. The idea is to start at the current position and use the plausible-move generator to generate the set of possible successor positions. We can then apply the static evaluation function to those positions and simply choose the best one.
The starting position is exactly as good for us as the position generated by the best move we can make next. Here we assume that the static evaluation function returns large values to indicate good situations for us, so our goal is to maximize the value of the static evaluation function of the next board position. An example of this operation is shown in fig 1. It assumes a static evaluation function that returns values ranging from -10 to 10, with 10 indicating a win for us, -10 a win for the opponent, and 0 an even match. Since our goal is to maximize the value of the heuristic function, we choose to move to B. Backing B's value up to A, we can conclude that A's value is 8, since we know we can move to a position with a value of 8.
Fig 1: One-ply search and two-ply search.

But since we know that the static evaluation function is not completely accurate, we would like to carry the search farther
ahead than one ply. This could be very important, for example, in a chess game in which we are in the middle of a piece exchange. After our move, the situation would appear to be very good, but if we look one move ahead, we will see that one of our pieces also gets captured, and so the situation is not as good as it seemed.

Once the values from the second ply are backed up, it becomes clear that the correct move for us to make at the first level, given the information we have available, is C, since there is nothing the opponent can do from there to produce a value worse than -2. This process can be repeated for as many ply as time allows, and the more accurate evaluations that are produced can be used to choose the correct move at the top level. The alternation of maximizing and minimizing at alternate ply as evaluations are backed up corresponds to the opposing strategies of the two players and gives this method the name minimax.

Having described informally the operation of the minimax procedure, we now describe it precisely. It is a straightforward recursive procedure that relies on two auxiliary procedures that are specific to the game being played:
1. MOVEGEN(position, player) - The plausible-move generator, which returns a list of nodes representing the moves that can be made by player in position. We call the two players PLAYER-ONE and PLAYER-TWO; in a chess program, we might use the names BLACK and WHITE instead.
2. STATIC(position, player) - The static evaluation function, which returns a number representing the goodness of position from the standpoint of player.
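A small Python sketch of depth-limited minimax in terms of these two auxiliary procedures follows. It uses the negamax simplification, which assumes STATIC scores a position from the standpoint of the player passed in (here encoded as +1 and -1); the tiny game at the end is purely illustrative.

def minimax(position, depth, player, movegen, static):
    """Depth-limited minimax built on the two game-specific
    procedures described above: MOVEGEN and STATIC."""
    moves = movegen(position, player)
    if depth == 0 or not moves:
        return static(position, player), None
    best_value, best_move = None, None
    for move in moves:
        # Score the move from the opponent's standpoint, then negate,
        # so each level simply maximizes (the negamax formulation).
        value, _ = minimax(move, depth - 1, -player, movegen, static)
        value = -value
        if best_value is None or value > best_value:
            best_value, best_move = value, move
    return best_value, best_move

# Tiny hypothetical game: positions are numbers, a move adds 1 or 2,
# and STATIC just likes even numbers. For illustration only.
value, move = minimax(0, 2, +1,
                      lambda pos, p: [pos + 1, pos + 2],
                      lambda pos, p: p * (1 if pos % 2 == 0 else -1))
print(value, move)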
Heuristic function
A heuristic function, or simply a heuristic, is a function that ranks alternatives in various search algorithms at each branching step, based on the available information, in order to decide which branch to follow during a search.

Shortest paths: For example, for shortest-path problems, a heuristic is a function h(n), defined on the nodes of a search tree, which serves as an estimate of the cost of the cheapest path from that node to the goal node. Heuristics are used by informed search algorithms such as greedy best-first search and A* to choose the best node to explore. Greedy best-first search will choose the node that has the lowest value of the heuristic function. A* search will expand nodes that have the lowest value of g(n) + h(n), where g(n) is the (exact) cost of the path from the initial state to the current node. If h(n) is admissible, that is, if h(n) never overestimates the cost of reaching the goal, then A* will always find an optimal solution.

The classical problem involving heuristics is the n-puzzle. Commonly used heuristics for this problem include counting the number of misplaced tiles and finding the sum of the Manhattan distances between each block and its position in the goal configuration. Note that both are admissible.
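Both heuristics are easy to state in code. The sketch below assumes the standard 3x3 (8-puzzle) goal layout with 0 standing for the blank; the names are illustrative.

# Two admissible 8-puzzle heuristics, as described above; a state is a
# tuple of 9 entries, 0 standing for the blank.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def misplaced_tiles(state):
    """h1: count the tiles (not the blank) that are out of place."""
    return sum(1 for s, g in zip(state, GOAL) if s != 0 and s != g)

def manhattan(state):
    """h2: sum of horizontal plus vertical distances of each tile
    from its goal square."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = GOAL.index(tile)
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total

state = (1, 2, 3, 4, 0, 6, 7, 5, 8)
print(misplaced_tiles(state), manhattan(state))   # in general h2 >= h1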
Effect of heuristics on computational performance: In any searching problem where there are b choices at each node and the goal lies at depth d, a naive searching algorithm would potentially have to search around b^d nodes before finding a solution. Heuristics improve the efficiency of search algorithms by reducing the effective branching factor from b to a lower constant b', using a cutoff mechanism. The branching factor can be used for defining a partial order on heuristics, such that h1(n) < h2(n) if h1(n) yields a lower branching factor than h2(n) for a given node n of the search tree. Heuristics giving lower branching factors at every node in the search tree are preferred for the resolution of a particular problem, as they are more computationally efficient.

Finding heuristics: The problem of finding an admissible heuristic with a low branching factor for common search tasks has been extensively researched in the artificial intelligence community. Several common techniques are used:
Solution costs of sub-problems often serve as useful estimates of the overall solution cost. These are always admissible. For example, a heuristic for a 10-puzzle might be the cost of moving tiles 1-5 into their correct places. A common idea is to use a pattern database that stores the exact solution cost of every sub-problem instance.
The solution of a relaxed problem often serves as a useful admissible estimate of the original. For example, Manhattan distance solves a relaxed version of the n-puzzle problem, in which we assume we can move each tile to its position independently of moving the other tiles. Given a set of admissible heuristic functions h1(n), h2(n), ..., hi(n), the function h(n) = max{h1(n), h2(n), ..., hi(n)} is an admissible heuristic that dominates all of them.
Using these techniques, a program called ABSOLVER was written (1993) by A. E. Prieditis for automatically generating heuristics for a given problem. ABSOLVER generated a new heuristic for the 8-puzzle better than any pre-existing heuristic and found the first useful heuristic for solving the Rubik's Cube.

Consistency and Admissibility: If a heuristic function never overestimates the cost of reaching the goal, it is called an admissible heuristic function. If h(n) is consistent, then the values of f(n) = g(n) + h(n) are non-decreasing along any path to the goal node.
Alpha-beta pruning: Alpha-beta pruning stops evaluating a move as soon as one possibility has been found that proves the move to be worse than a previously examined move. Such moves need not be evaluated further. Alpha-beta pruning is a sound optimization in that it does not change the score of the result of the algorithm it optimizes.

History: Allen Newell and Herbert Simon, who used what John McCarthy calls an "approximation"[1] in 1958, wrote that alpha-beta "appears to have been reinvented a number of times".[2] Arthur Samuel had an early version, and Richards, Hart, Levine and/or Edwards found alpha-beta independently in the United States.[3] McCarthy proposed similar ideas during the Dartmouth Conference in 1956 and suggested them to a group of his students including Alan Kotok at MIT in 1961.[4] Alexander Brudno independently discovered the alpha-beta algorithm, publishing his results in 1963.[5] Donald Knuth and Ronald W. Moore refined the algorithm in 1975,[6][7] and it has continued to be advanced.

Improvements over naive minimax: In an illustration of alpha-beta pruning, the grayed-out sub-trees need not be explored (when moves are evaluated from left to right), since we know the group of sub-trees as a whole yields the value of an equivalent sub-tree or worse, and as such cannot influence the final result. The max and min levels represent the turn of the player and the adversary, respectively.
The benefit of alpha-beta pruning lies in the fact that branches of the search tree can be eliminated. The search time can in this way be limited to the 'more promising' sub-tree, and a deeper search can be performed in the same time. Like its predecessor, it belongs to the branch-and-bound class of algorithms. The optimization reduces the effective depth to slightly more than half that of simple minimax if the nodes are evaluated in an optimal or near-optimal order (best choice for the side on move ordered first at each node).

With an (average or constant) branching factor of b and a search depth of d plies, the maximum number of leaf node positions evaluated (when the move ordering is pessimal) is O(b*b*...*b) = O(b^d), the same as a simple minimax search. If the move ordering for the search is optimal (meaning the best moves are always searched first), the number of leaf node positions evaluated is about O(b*1*b*1*...*b) for odd depth and
O(b*1*b*1*...*1) for even depth, that is, O(b^(d/2)) = O(sqrt(b^d)). In the latter case, where the ply of a search is even, the effective branching factor is reduced to its square root, or, equivalently, the search can go twice as deep with the same amount of computation. The explanation of b*1*b*1*... is that all the first player's moves must be studied to find the best one, but for each, only the best second player's move is needed to refute all but the first (and best) first player move: alpha-beta ensures no other second player moves need be considered. If b = 40 (as in chess), and the search depth is 12 plies, the ratio between optimal and pessimal sorting is a factor of nearly 40^6, or about 4 billion times.

Normally during alpha-beta, the sub-trees are temporarily dominated by either a first player advantage (when many first player moves are good, and at each search depth the first move checked by the first player is adequate, but all second player responses are required to try to find a refutation), or vice versa. This advantage can switch sides many times during the search if the move ordering is incorrect, each time leading to inefficiency. As the number of positions searched decreases exponentially with each move nearer the current position, it is worth spending considerable effort on sorting early moves. An improved sort at any depth will exponentially reduce the total number of positions searched, and sorting all positions at depths near the root node is relatively cheap, as there are so few of them. In
practice, the move ordering is often determined by the results of earlier, smaller searches, such as through iterative deepening.

The algorithm maintains two values, alpha and beta, which represent the minimum score that the maximizing player is assured of and the maximum score that the minimizing player is assured of, respectively. Initially alpha is negative infinity and beta is positive infinity. As the recursion progresses, the "window" becomes smaller. When beta becomes less than alpha, it means that the current position cannot be the result of best play by both players and hence need not be explored further. Additionally, this algorithm can be trivially modified to return an entire principal variation in addition to the score. Some more aggressive algorithms such as MTD(f) do not easily permit such a modification.

Pseudo code:

function alphabeta(node, depth, alpha, beta, Player)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if Player = MaxPlayer
        for each child of node
            alpha := max(alpha, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break               (* Beta cut-off *)
        return alpha
    else
        for each child of node
            beta := min(beta, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break               (* Alpha cut-off *)
        return beta

(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)

Heuristic improvements: Alpha-beta search can be made even faster by considering only a narrow search window (generally determined by guesswork based on experience). This is known as aspiration search. In the extreme case, the search is performed with alpha and beta equal, a technique known as zero-window search, null-window search, or scout search. This is particularly useful for win/loss searches near the end of a game, where the extra depth gained from the narrow window and a simple win/loss evaluation function may lead to a conclusive result. If an aspiration search fails, it is straightforward to detect whether it failed high (the high edge of the window was too low) or low (the low edge of the window was too high). This gives information about what window values might be useful in a re-search of the position.
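For comparison with the pseudocode, here is a runnable Python version on a hypothetical two-ply tree stored as nested lists; the children and heuristic callbacks stand in for the game-specific parts.

import math

def alphabeta(node, depth, alpha, beta, max_player, children, heuristic):
    """Python rendering of the pseudocode above."""
    kids = children(node)
    if depth == 0 or not kids:
        return heuristic(node)
    if max_player:
        for child in kids:
            alpha = max(alpha, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, heuristic))
            if beta <= alpha:
                break                     # beta cut-off
        return alpha
    else:
        for child in kids:
            beta = min(beta, alphabeta(child, depth - 1, alpha, beta,
                                       True, children, heuristic))
            if beta <= alpha:
                break                     # alpha cut-off
        return beta

# Hypothetical tree as nested lists: inner lists are internal (min) nodes,
# numbers are leaf evaluations.
tree = [[3, 5], [6, 9], [1, 2]]
value = alphabeta(tree, 2, -math.inf, math.inf, True,
                  lambda n: n if isinstance(n, list) else [],
                  lambda n: n)
print(value)   # 6: the leaf 2 is never evaluated thanks to an alpha cut-off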
Constraint Satisfaction: A constraint satisfaction problem requires the values of a set of variables to satisfy constraints on a finite domain. Such problems are usually solved via search, in particular a form of backtracking or local search. Constraint propagation methods are also used on such problems; most of them are incomplete in general, that is, they may solve the problem or prove it unsatisfiable, but not always. Constraint propagation methods are also used in conjunction with search to make a given problem simpler to solve. Other kinds of constraints considered are on real or rational numbers; solving problems on these constraints is done via variable elimination or the simplex algorithm. Constraint satisfaction originated in the field of artificial intelligence in the 1970s (see for example Laurière 1978). During the 1980s and 1990s, embeddings of constraints into programming languages were developed. Languages often used for constraint programming are Prolog and C++.
Constraint satisfaction problem: Constraints enumerate the possible values a set of variables may take. Informally, a finite domain is a finite set of arbitrary elements. A constraint satisfaction problem on such a domain contains a set of variables whose values can only be taken from the domain, and a set of constraints, each constraint specifying the allowed values for a group of variables. A solution to this problem is an evaluation of the variables that satisfies all constraints. In other words, a solution is a way of assigning a value to each variable such that all constraints are satisfied by these values.

In practice, constraints are often expressed in compact form, rather than by enumerating all values of the variables that would satisfy the constraint. One of the most used constraints is the one establishing that the values of the affected variables must be all different. Problems that can be expressed as constraint satisfaction problems are the eight queens puzzle, the Sudoku solving problem, the Boolean satisfiability problem, scheduling problems, and various problems on graphs such as the graph coloring problem.

While usually not included in the above definition of a constraint satisfaction problem, arithmetic equations and inequalities bound the values of the variables they contain and can therefore be considered a form of constraints. Their domain
is the set of numbers (either integer, rational, or real), which is infinite; therefore, the relations of these constraints may be infinite as well. For example, X = Y + 1 has an infinite number of pairs of satisfying values. Arithmetic equations and inequalities are often not considered within the definition of a "constraint satisfaction problem", which is limited to finite domains. They are, however, often used in constraint programming.

Solving: Constraint satisfaction problems on finite domains are typically solved using a form of search. The most used techniques are variants of backtracking, constraint propagation, and local search. These techniques are also used on problems with nonlinear constraints. Variable elimination and the simplex algorithm are used for solving linear and polynomial equations and inequalities, and problems containing variables with infinite domains. These are typically solved as optimization problems in which the optimized function is the number of violated constraints.

Complexity: Solving a constraint satisfaction problem on a finite domain is an NP-complete problem. Research has shown a number of tractable sub-cases, some limiting the allowed constraint relations, some requiring the scopes of constraints to form a tree, possibly in a reformulated version of the problem. Research has also established relationships of the constraint satisfaction problem with problems in other areas such as finite model theory.
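As a concrete instance of backtracking on a finite domain, here is a minimal Python sketch; the three-variable graph-coloring problem and the consistency check are illustrative stand-ins, not from the original text.

def backtrack(assignment, variables, domains, consistent):
    """Plain backtracking search for a finite-domain CSP; `consistent`
    is a function deciding whether a partial assignment is acceptable."""
    if len(assignment) == len(variables):
        return assignment                      # every variable assigned
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):             # constraint check prunes early
            result = backtrack(assignment, variables, domains, consistent)
            if result is not None:
                return result
        del assignment[var]                    # undo and try the next value
    return None

# Hypothetical 3-node graph-coloring instance, for illustration only.
variables = ['WA', 'NT', 'SA']
domains = {v: ['red', 'green', 'blue'] for v in variables}
edges = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA')]

def all_different_neighbours(assignment):
    return all(assignment[a] != assignment[b]
               for a, b in edges if a in assignment and b in assignment)

print(backtrack({}, variables, domains, all_different_neighbours))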
Constraint programming: Constraint programming is the use of constraints as a programming language to encode and solve problems. This is often done by embedding constraints into a programming language, which is called the host language. Constraint programming originated from a formalization of equalities of terms in Prolog II, leading to a general framework for embedding constraints into a logic programming language. The most common host languages are Prolog, C++, and Java, but other languages have been used as well.

Constraint logic programming: A constraint logic program is a logic program that contains constraints in the bodies of clauses. As an example, the clause A(X) :- X>0, B(X) is a clause containing the constraint X>0 in the body. Constraints can also be present in the goal. The constraints in the goal and in the clauses used to prove the goal are accumulated into a set called the constraint store. This set contains the constraints the interpreter has assumed satisfiable in order to proceed in the evaluation. As a result, if this set is detected to be unsatisfiable, the interpreter backtracks. Equations of terms, as used in logic programming, are considered a particular form of
constraints which can be simplified using unification. As a result, the constraint store can be considered an extension of the concept of substitution that is used in regular logic programming. The most common kinds of constraints used in constraint logic programming are constraints over integers/rationals/reals and constraints over finite domains.

Concurrent constraint logic programming languages have also been developed. They significantly differ from non-concurrent constraint logic programming in that they are aimed at programming concurrent processes that may not terminate. Constraint handling rules can be seen as a form of concurrent constraint logic programming, but are also sometimes used within a non-concurrent constraint logic programming language. They allow for rewriting constraints or inferring new ones based on the truth of conditions.

Constraint satisfaction toolkits: Constraint satisfaction toolkits are software libraries for imperative programming languages that are used to encode and solve a constraint satisfaction problem.
Cassowary constraint solver, an open source project for constraint satisfaction (accessible from C, Java, Python and other languages).
Comet, a commercial programming language and toolkit.
Gecode, an open source portable toolkit written in C++, developed as a production-quality and highly efficient implementation of a complete theoretical background.
JaCoP, an open source Java constraint solver.
Koalog, a commercial Java-based constraint solver.
logilab-constraint, an open source constraint solver written in pure Python with constraint propagation algorithms.
MINION, an open-source constraint solver written in C++, with a small language for the purpose of specifying models/problems.
ZDC, an open source program developed in the Computer-Aided Constraint Satisfaction Project for modeling and solving constraint satisfaction problems.
Other constraint programming languages: Constraint toolkits are a way of embedding constraints into an imperative programming language. However, they are only used as external libraries for encoding and solving problems. An approach in which constraints are integrated into an imperative programming language is taken in the Kaleidoscope programming language. Constraints have also been embedded into functional programming languages.
Evaluation functions: An evaluation function is used by game-playing programs based on minimax and related algorithms. The evaluation function is typically designed to prioritize speed over accuracy; the function looks only at the current position and does not explore possible moves.

In chess: One popular strategy for constructing evaluation functions is as a weighted sum of various factors that are thought to influence the value of a position. For instance, an evaluation function for chess might take the form

c1 * material + c2 * mobility + c3 * king safety + c4 * center control + ...

Chess beginners, as well as the simplest of chess programs, evaluate the position taking only "material" into account, i.e. they assign a numerical score to each piece (with pieces of opposite color having scores of opposite sign) and sum up the scores over all the pieces on the board. On the whole, computer evaluation functions of even advanced programs tend to be more materialistic than human evaluations. This is compensated for by the increased speed of evaluation, which allows more plies to be examined. As a result, some chess programs may rely too much on tactics at the expense of strategy.
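A minimal sketch of such a weighted-sum evaluator in Python follows; the piece values, weights, and position encoding are illustrative guesses, not a prescribed design.

# A material-plus-mobility evaluator in the weighted-sum style above.
PIECE_VALUES = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}   # king excluded

def evaluate(position, c1=1.0, c2=0.1):
    """position: dict with 'pieces' (list of (piece, colour) tuples,
    colour +1 for us, -1 for the opponent) and 'mobility' (our legal
    move count minus the opponent's)."""
    material = sum(colour * PIECE_VALUES[piece]
                   for piece, colour in position['pieces'])
    return c1 * material + c2 * position['mobility']

position = {'pieces': [('Q', +1), ('R', -1), ('P', -1)], 'mobility': -4}
print(evaluate(position))   # 9 - 5 - 1 = 3 material, minus 0.4 for mobility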
Fig: The first two ply of the game tree for tic-tac-toe.

The diagram shows the first two levels, or ply, in the game tree for tic-tac-toe. We consider all rotations and reflections of positions as being equivalent, so the first player has three choices of move: in the center, at the edge, or in the corner. The second player has two choices for the reply if the first player played in the center, and otherwise five choices. And so on.

The number of leaf nodes in the complete game tree is the number of possible different ways the game can be played. For example, the game tree for tic-tac-toe has 26,830 leaf nodes.

Game trees are important in artificial intelligence because one way to pick the best move in a game is to search the game tree using the minimax algorithm or its variants. The game tree for tic-tac-toe is easily searchable, but the complete game trees for larger games like chess are much too large to search. Instead, a chess-playing program searches a partial game tree: typically as many ply from the current position as it can search in the time
available. Except for the case of "pathological" game trees[1] (which seem to be quite rare in practice), increasing the search depth (i.e., the number of ply searched) generally improves the chance of picking the best move.

Two-person games can also be represented as and-or trees. For the first player to win a game, there must exist a winning move for all moves of the second player. This is represented in the and-or tree by using disjunction to represent the first player's alternative moves and using conjunction to represent all of the second player's moves.

Solving Game Trees
Fig: An arbitrary game tree that has been fully colored.

With a complete game tree, it is possible to "solve" the game, that is to say, find a sequence of moves that either the first or
second player can follow that will guarantee either a win or a tie. The algorithm can be described recursively as follows.
1. Color the final ply of the game tree so that all wins for player 1 are colored one way, all wins for player 2 are colored another way, and all ties are colored a third way.
2. Look at the next ply up. If the player to move has a child node colored for that player, color this node for that player as well. If all immediately lower nodes are colored for the other player, color this node for the other player. Otherwise, color this node a tie.
3. Repeat for each ply, moving upwards, until all nodes are colored. The color of the root node will determine the nature of the game.

The diagram shows a game tree for an arbitrary game, colored using the above algorithm. It is usually possible to solve a game (in this technical sense of "solve") using only a subset of the game tree, since in many games a move need not be analyzed if there is another move that is better for the same player (for example, alpha-beta pruning can be used in many deterministic games). Any sub-tree that can be used to solve the game is known as a decision tree, and the sizes of decision trees of various shapes are used as measures of game complexity.
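The coloring algorithm can be phrased as a short recursive function. The sketch below uses the standard reading of step 2 (a node is colored for the player to move if some child is colored for that player); the tree encoding is hypothetical.

def solve(node):
    """Color a game tree bottom-up, per the algorithm above. A leaf is
    '1', '2', or 'tie'; an internal node is (player_to_move, children)."""
    if isinstance(node, str):
        return node                       # final ply: already colored
    to_move, children = node
    colours = [solve(child) for child in children]
    if str(to_move) in colours:
        return str(to_move)               # a winning reply exists for the mover
    if all(c == colours[0] for c in colours):
        return colours[0]                 # all lower nodes agree
    return 'tie'

# Hypothetical tiny tree: player 1 moves first, then player 2.
tree = (1, [(2, ['1', 'tie']), (2, ['2', '2'])])
print(solve(tree))   # 'tie': player 2 avoids the '1' win on the left branch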
Games of chance: A game of chance is a game whose outcome is strongly influenced by some randomizing device, and upon which contestants may or may not wager money or anything of monetary value. Common devices used include dice, spinning tops, playing cards, roulette wheels, and numbered balls drawn from a container. Any game of chance that involves anything of monetary value is gambling. Gambling is known in nearly all human societies, even though many have passed laws restricting it. Early people used the knucklebones of sheep as dice. Some people develop a psychological addiction to gambling, and will risk even food and shelter to continue.

Some games of chance may also involve a certain degree of skill. This is especially true where the player or players have decisions to make based upon previous or incomplete knowledge, such as poker and blackjack. In other games, like roulette and baccarat, the player may only choose the amount of the bet and the thing he/she wants to bet on; the rest is up to chance. These games are therefore still considered games of chance, with a small amount of skill required.[1] The distinction between 'chance' and 'skill' is relevant, as in some countries chance games are illegal, or at least regulated, where skill games are not.
Knowledge representation: A good knowledge representation supports efficient computation; this efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences. It is also a medium of human expression, i.e., a language in which we say things about the world.

Knowledge representation is needed for library classification and for processing concepts in an information system. In the field of artificial intelligence, problem solving can be simplified by an appropriate choice of knowledge representation. Representing the knowledge in one way may make the solution simple, while an unfortunate choice of representation may make the solution difficult or obscure; the analogy is to making computations in Hindu-Arabic numerals or in Roman numerals: long division is simpler in one and harder in the other. Likewise, there is no representation that can serve all purposes or make every problem equally approachable.

Properties for Knowledge Representation Systems: The following properties should be possessed by a knowledge representation system.
Representational Adequacy - the ability to represent the required knowledge;
Inferential Adequacy - the ability to manipulate the knowledge represented to produce new knowledge corresponding to that inferred from the original;
Inferential Efficiency - the ability to direct the inferential mechanisms into the most productive directions by storing appropriate guides;
Acquisition Efficiency - the ability to acquire new knowledge using automatic methods wherever possible rather than reliance on human intervention.
Predicate logic: Predicates allow us to talk about properties, e.g. is_wet(today), and relations, e.g. likes(john, apples); applied to its arguments, a predicate is either true or false. Predicate logics include first-order logic and higher-order logic. In first-order logic, variables represent arbitrary objects, e.g. likes(X, apples), and quantifiers qualify the values of variables: the universal quantifier ("for all X": ∀X. likes(X, apples)) and the existential quantifier ("there exists at least one X": ∃X. likes(X, apples)). Higher-order logic is more expressive than first-order logic: functions and predicates can themselves be described by predicates, e.g. binary(addition), or transformed by functions, e.g. differentiate(square), and one can quantify over both functions and predicates, e.g. a function having a zero at 17.
Forward reasoning: A rule can be fired only when the values of all its condition variables are known; if any of them have unknown values, the rule condition is unknown. When a rule condition is unknown, the rule is not fired and the next rule is examined.

The iterative reasoning process: The process of examining one rule after the other continues until a complete pass has been made through the entire rule set. More than one pass usually is necessary in order to assign a value to the goal variable. Perhaps the information needed to evaluate one rule is produced by another rule that is examined subsequently. After the second rule is fired, the first rule can be evaluated on the next pass. The passes continue as long as it is possible to fire rules. When no more rules can be fired, the reasoning process ceases.

Example of forward reasoning: Letters are used for the conditions and actions to keep the illustration simple. In rule 1, for example, if condition A exists then action B is taken. Condition A might be THIS.YEAR.SALES > LAST.YEAR.SALES.
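A minimal Python sketch of this pass-based forward reasoning follows, using letter rules in the style of the example; the rule encoding is illustrative.

def forward_chain(rules, facts):
    """Repeated passes over the rule set, as described above: fire any
    rule whose conditions all hold, until a pass fires nothing new."""
    facts = set(facts)
    fired = True
    while fired:                          # keep making passes
        fired = False
        for conditions, action in rules:
            if action not in facts and all(c in facts for c in conditions):
                facts.add(action)         # rule fires, producing a new fact
                fired = True
    return facts

# Hypothetical letter rules in the style of the example above.
rules = [(['A'], 'B'),        # Rule 1: if A then B
         (['B', 'C'], 'D'),   # Rule 2: if B and C then D
         (['D'], 'E')]        # Rule 3: if D then E
print(forward_chain(rules, ['A', 'C']))   # {'A','B','C','D','E'}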
Backward chaining: Backward chaining is implemented in logic programming by SLD resolution. Both forward and backward chaining are based on the modus ponens inference rule. Backward chaining starts with a list of goals (or a hypothesis) and works backwards from the consequent to the antecedent to see if there is data available that will support any of these consequents. An inference engine using backward chaining would search the inference rules until it finds one which has a consequent (Then clause) that matches a desired goal. If the antecedent (If clause) of that rule is not known to be true, then it is added to the list of goals (in order for one's goal to be confirmed, one must also provide data that confirms this new rule). For example, suppose that the goal is to conclude the color of my pet Fritz, given that he croaks and eats flies, and that the rule base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow
This rule base would be searched and the third and fourth rules would be selected, because their consequents (Then Fritz is green, Then Fritz is yellow) match the goal (to determine Fritz's color). It is not yet known that Fritz is a frog, so both the antecedents (If Fritz is a frog, If Fritz is a canary) are added to the goal list. The rule base is again searched and this time the
first two rules are selected, because their consequents (Then X is a frog, Then X is a canary) match the new goals that were just added to the list. The antecedent (If Fritz croaks and eats flies) is known to be true and therefore it can be concluded that Fritz is a frog, and not a canary. The goal of determining Fritz's color is now achieved (Fritz is green if he is a frog, and yellow if he is a canary, but he is a frog since he croaks and eats flies; therefore, Fritz is green). Note that the goals always match the affirmed versions of the consequents of implications (and not the negated versions as in modus tollens) and even then, their antecedents are then considered as the new goals (and not the conclusions as in affirming the consequent) which ultimately must match known facts (usually defined as consequents whose antecedents are always true); thus, the inference rule which is used is modus ponens. Because the list of goals determines which rules are selected and used, this method is called goal-driven, in contrast to data-driven forward-chaining inference. The backward chaining approach is often employed by expert systems.
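The Fritz example can be traced with a few lines of Python. The sketch below pre-instantiates the rules for X = Fritz rather than implementing full unification, so it is a simplification of real backward chaining.

def backward_chain(goal, rules, facts):
    """Goal-driven search in the spirit of the Fritz example: a goal
    holds if it is a known fact, or if some rule concludes it and all
    of that rule's antecedents can themselves be proved."""
    if goal in facts:
        return True
    for antecedents, consequent in rules:
        if consequent == goal:
            # The antecedents become new sub-goals.
            if all(backward_chain(a, rules, facts) for a in antecedents):
                return True
    return False

# The four rules, instantiated for X = Fritz.
rules = [(['Fritz croaks', 'Fritz eats flies'], 'Fritz is a frog'),
         (['Fritz chirps', 'Fritz sings'], 'Fritz is a canary'),
         (['Fritz is a frog'], 'Fritz is green'),
         (['Fritz is a canary'], 'Fritz is yellow')]
facts = {'Fritz croaks', 'Fritz eats flies'}
print(backward_chain('Fritz is green', rules, facts))    # True
print(backward_chain('Fritz is yellow', rules, facts))   # False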
Conceptual Dependency (CD): The goals of CD are:
To construct computer programs that can understand natural language.
To make inferences from the statements and also to identify conditions in which two sentences can have similar meaning.
To provide facilities for the system to take part in dialogues and answer questions.
To provide a means of representation which is language independent.

Knowledge is represented in CD by elements called conceptual structures. The basis of CD representation is that for two sentences which have identical meaning there must be only one representation, and implicitly packed information must be explicitly stated. In order that knowledge be represented in CD form, certain primitive actions have been developed. The table below gives the primitive CD actions. Apart from the primitive CD actions, one has to make use of the following six categories of types of objects.
1. PPs (picture producers): Only physical objects are picture producers.
2. ACTs: Actions are done by an actor to an object. The table gives the major ACTs.

Table: Primitive CD forms
1. ATRANS - Transfer of an abstract relationship (e.g. give)
2. PTRANS - Transfer of the physical location of an object (e.g. go)
3. PROPEL - Application of physical force to an object (e.g. push)
4. MOVE - Movement of a body part of an animal by that animal (e.g. kick)
5. GRASP - Grasping of an object by an actor (e.g. clutch)
6. INGEST - Taking of an object by an animal into the inside of that animal (e.g. drink, eat)
7. EXPEL - Expulsion of an object from inside the body of an animal into the world (e.g. spit)
8. MTRANS - Transfer of mental information between animals or within an animal (e.g. tell)
9. MBUILD - Construction of new information from old information (e.g. decide)
10. SPEAK - Production of sounds (e.g. say)
3. LOCs (locations): Every action takes place at some location, which serves as source and destination.
4. Ts (times): An action can take place at a particular location at a given specified time. The time can be represented on an absolute scale or a relative scale.
5. AAs (action aiders): These serve as modifiers of actions; for example, the ACT PROPEL has a speed factor associated with it, which is an action aider.
6. PAs (picture aiders): These serve as aiders of picture producers. Every object that serves as a PP has certain characteristics by which it is defined; PAs serve PPs by defining these characteristics.

There are certain rules by which the conceptual categories of types of objects discussed can be combined. CD models provide the following advantages for representing knowledge. The ACT primitives help in representing wide knowledge in a succinct way. To
illustrate this, consider the following verbs, all of which correspond to a transfer of mental information:
- see
- learn
- hear
- inform
- remember
In CD representation all of these are represented using the single ACT primitive MTRANS; they are not represented individually. Similarly, different verbs that indicate various activities are clubbed under unique ACT primitives, thereby reducing the number of inference rules.
The main goal of CD representation is to make explicit what is implicit. That is why every statement that is made has not only the actors and objects but also time and location, source and destination.
The following set of conceptual tenses makes usage of CD more precise:
O - object case relationship
R - recipient case relationship
P - past
F - future
T - transition
Ts - start transition
Tf - finished transition
K - continuing
? - interrogative
/ - negative
Nil - present
Delta - timeless
C - conditional
CD brought forward the notion of language independence, because all ACTs are language-independent primitives.
Semantic Nets: The main idea behind semantic nets is that the meaning of a concept comes from the ways in which it is connected to other concepts. In a semantic net, information is represented as a set of nodes connected to each other by a set of labeled arcs, which represent relationship among the nodes. A fragment of a typical semantic net is shown in fig.
Fig: A semantic network.

This network contains an example of both the isa and instance relations, as well as some other, more domain-specific relations like team and uniform-color. In this network, we could use inheritance to derive the additional relation has-part(Pee-Wee-Reese, Nose).

1. Intersection search. One of the early ways that semantic nets were used was to find relationships among objects by spreading activation out from each of two nodes and seeing where the activation met. This process is called intersection search. Using this process, it is possible to use the network of the figure to answer questions such as: what is the connection between the Brooklyn Dodgers and blue?

2. Representing non-binary predicates. Semantic nets are a natural way to represent relationships that would appear as ground instances of binary predicates in predicate logic. For
example, some of the arcs from the figure could be represented in logic as
isa(Person, Mammal)
instance(Pee-Wee-Reese, Person)
team(Pee-Wee-Reese, Brooklyn-Dodgers)
uniform-color(Pee-Wee-Reese, Blue)
But knowledge expressed by predicates of other arities can also be expressed in semantic nets. We have already seen that many unary predicates, such as Man(Marcus), can be rewritten as binary ones using isa and instance; for example, Man(Marcus) could be rewritten as instance(Marcus, Man), thereby making it easy to represent in a semantic net.

3. Partitioned semantic nets. Suppose we want to represent a simple quantified expression in semantic nets. One way to do this is to partition the semantic net into a hierarchical set of spaces, each of which corresponds to the scope of one or more variables. To see how this works, consider first the simple net shown in the figure; this net corresponds to the statement: The dog bit the mail carrier.
The nodes Dogs, Bite, and Mail-Carrier represent the classes of dogs, bitings, and mail carriers, respectively, while the nodes d, b, and m represent a particular dog, a particular biting, and a particular mail carrier. This fact can easily be represented by a single net with no partitioning. But now suppose that we want to represent the fact: Every dog has bitten a mail carrier. Or, in logic: ∀x: dog(x) → ∃y: Mail-carrier(y) ∧ bite(x, y)
To represent this fact, it is necessary to encode the scope of the universally quantified variable x.
Fig: Using partitioned semantic nets.

Frame: A frame is a collection of attributes (usually called slots) and associated values (and possibly constraints on values) that describes some entity in the world. Sometimes a frame
describes an entity in some absolute sense; sometimes it represents the entity from a particular point of view. A single frame taken alone is rarely useful. Instead, we build frame systems out of collections of frames that are connected to each other. Set theory provides a good basis for understanding frame systems. Although not all frame systems are defined this way, we do so here. In this view, each frame represents either a class (a set) or an instance (an element of a class). To see how this works, consider the frame system shown in the figure. In this example, the frames Person, Adult-Male, ML-Baseball-Player, Pitcher, and ML-Baseball-Team are all classes. The frames Pee-Wee-Reese and Brooklyn-Dodgers are instances.
Person
  Isa: Mammal
  Cardinality: 6,000,000,000
  *handed: Right

Adult-Male
  Isa: Person
  Cardinality: 2,000,000,000
  *height: 5-10

ML-Baseball-Player
  Isa: Adult-Male

ML-Baseball-Team
  Isa: Team
  Cardinality: 26
  *team-size: 24
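A frame system of this shape is easy to prototype as dictionaries with inheritance along instance and isa links. The sketch below is illustrative; slot names and values follow the example above, and the lookup ignores the *-prefix convention for inheritable slots.

# A minimal frame system with isa/instance inheritance.
frames = {
    'Mammal':             {},
    'Person':             {'isa': 'Mammal', 'handed': 'Right'},
    'Adult-Male':         {'isa': 'Person', 'height': '5-10'},
    'ML-Baseball-Player': {'isa': 'Adult-Male', 'height': '6-1'},
    'Pee-Wee-Reese':      {'instance': 'ML-Baseball-Player',
                           'team': 'Brooklyn-Dodgers'},
}

def get_slot(frame, slot):
    """Look for the slot locally, then climb instance/isa links."""
    while frame is not None:
        if slot in frames[frame]:
            return frames[frame][slot]
        frame = frames[frame].get('instance') or frames[frame].get('isa')
    return None

print(get_slot('Pee-Wee-Reese', 'height'))   # '6-1', from ML-Baseball-Player
print(get_slot('Pee-Wee-Reese', 'handed'))   # 'Right', inherited from Person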
2. Each propositional constant (i.e. specific proposition) and each propositional variable (i.e. a variable representing propositions) are wffs.
3. Each atomic formula (i.e. a specific predicate with variables) is a wff.
4. If A and B are wffs, then so are ¬A, (A ∧ B), (A ∨ B), (A → B), and (A ↔ B).
5. If x is a variable (representing objects of the universe of discourse) and A is a wff, then so are ∀x A and ∃x A.

For example, "The capital of Virginia is Richmond." is a specific proposition; hence it is a wff by Rule 2. Let B be a predicate name representing "being blue" and let x be a variable. Then B(x) is an atomic formula meaning "x is blue"; thus it is a wff by Rule 3 above. By applying Rule 5 to B(x), ∀x B(x) is a wff, and so is ∃x B(x). Then by applying Rule 4 to them, ∀x B(x) ∧ ∃x B(x) is seen to be a wff. Similarly, if R is a predicate name representing "being round", then R(x) is an atomic formula; hence it is a wff. By applying Rule 4 to B(x) and R(x), a wff B(x) ∧ R(x) is obtained.

In this manner, larger and more complex wffs can be constructed following the rules given above. Note, however, that strings that cannot be constructed by using those rules are not wffs. For example, ∀x B(x)R(x) and B(∀x) are NOT wffs, nor are B(R(x)) and B(∃x R(x)). One way to check whether or not an expression is a wff is to try to state it in English. If you can translate it into a
correct English sentence, then it is a wff.

More examples: To express the fact that Tom is taller than John, we can use the atomic formula taller(Tom, John), which is a wff. This wff can also be part of some compound statement such as taller(Tom, John) ∨ taller(John, Tom), which is also a wff. If x is a variable representing people in the world, then taller(x, Tom), ∀x taller(x, Tom), ∃x taller(x, Tom), and ∃x ∀y taller(x, y) are all wffs, among others. However, taller(∀x, John) and taller(Tom Mary, Jim), for example, are NOT wffs.
Unit 3: Handling Uncertainty and Learning

Fuzzy Logic: In the techniques discussed so far, we have not modified the mathematical underpinnings provided by set theory and logic. We have instead augmented those ideas with additional constructs provided by probability theory. We now take a different approach and briefly consider what happens if we make
fundamental changes to our idea of set membership and corresponding changes to our definitions of logical operations. The motivation for fuzzy sets is provided by the need to represent such propositions as:
John is very tall.
Mary is slightly ill.
Sue and Linda are close friends.
Exceptions to the rule are nearly impossible.
Most Frenchmen are not very tall.
While traditional set theory defines set membership as a Boolean predicate, fuzzy set theory allows us to represent set membership as a possibility distribution. Once set membership has been redefined in this way, it is possible to define a reasoning system based on techniques for combining distributions. Such reasoners have been applied in control systems for devices as diverse as trains and washing machines.
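As a tiny illustration of membership as a possibility distribution, here is a sketch of a 'tall' membership function in Python; the breakpoints and the squaring hedge for 'very' are conventional choices, not from the original text.

def tall(height_cm):
    """Membership in the fuzzy set 'tall': 0 below 160 cm, 1 above
    190 cm, linear in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0

def very(mu):
    return mu ** 2        # a common hedge: 'very' squares the membership

john = 185
print(tall(john))         # ~0.83: John is tall to degree 0.83
print(very(tall(john)))   # ~0.69: 'very tall' to a lesser degree
# Fuzzy logic operators over memberships a and b are typically
# AND = min(a, b), OR = max(a, b), NOT = 1 - a.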
Dempster-Shafer theory: The theory attaches to each proposition an interval [Bel, Pl] in which the degree of belief must lie. Belief (usually denoted Bel) measures the strength of the evidence in favor of a set of propositions. It ranges from 0 (indicating no evidence) to 1 (denoting certainty).

A belief function, Bel, corresponding to a specific mass assignment m for the set A, is defined as the sum of the beliefs committed to every subset of A by m. That is, Bel(A) is a measure of the total support or belief committed to the set A and sets a minimum value for its likelihood. It is defined in terms of all the belief assigned to A as well as to all proper subsets of A. Thus,

Bel(A) = the sum of m(B) over all subsets B of A

For example, if U contains the mutually exclusive subsets A, B, C, and D, then

Bel({A,C,D}) = m({A,C,D}) + m({A,C}) + m({A,D}) + m({C,D}) + m({A}) + m({C}) + m({D}).

In Dempster-Shafer theory, a belief interval can also be defined for a subset A. It is represented as the subinterval [Bel(A), Pl(A)] of [0, 1]. Bel(A) is also called the support of A, and Pl(A) = 1 - Bel(¬A) the plausibility of A. We define Bel(∅) = 0 to signify that no belief should be assigned to the empty set, and Bel(U) = 1 to show that the truth is contained within U. The subsets A of U are called the focal elements of the support function Bel when m(A) > 0.
Since Bel(A) only partially describes the beliefs about proposition A, it is useful to also have a measure of the extent to which one believes in ¬A, that is, the doubt regarding A. For this, we define the doubt of A as D(A) = Bel(¬A). From this definition it will be seen that the upper bound of the belief interval noted above, Pl(A), can be expressed as Pl(A) = 1 - D(A) = 1 - Bel(¬A). Pl(A) represents an upper belief limit on the proposition A. The belief interval [Bel(A), Pl(A)] is also sometimes referred to as the confidence in A, while the quantity Pl(A) - Bel(A) is referred to as the uncertainty in A. It can be shown that
Pl(∅) = 0, Pl(U) = 1
For all A: Pl(A) ≥ Bel(A)
Bel(A) + Bel(¬A) ≤ 1, Pl(A) + Pl(¬A) ≥ 1
For A ⊆ B: Bel(A) ≤ Bel(B), Pl(A) ≤ Pl(B)

As an example of the above concepts, recall once again the problem of identifying which of the terrorist organizations A, B, C, and D could have been responsible for the attack. The possible subsets of U in this case form a lattice of sixteen subsets (see figure).
Fig: The lattice of the sixteen subsets of U = {A, B, C, D}, from the empty set up through the four singletons {A}, {B}, {C}, {D}, the six pairs, the four triples, and U itself.
Assume one piece of evidence supports the belief that groups A and C were responsible to a degree of m1({A, C}) = 0.6, and another source of evidence disproves the belief that C was involved (and therefore supports the belief that the three organizations A, B, and D were responsible), that is, m2({A, B, D}) = 0.7. To obtain the pooled evidence, we compute the following quantities:
m1 ⊕ m2({A}) = (0.6)*(0.7) = 0.42
m1 ⊕ m2({A, C}) = (0.6)*(0.3) = 0.18
m1 ⊕ m2({A, B, D}) = (0.4)*(0.7) = 0.28
m1 ⊕ m2(U) = (0.4)*(0.3) = 0.12
m1 ⊕ m2 = 0 for all other subsets of U
Bel1({A, C}) = m({A, C}) + m({A}) + m({C})
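The pooled values above follow from Dempster's rule of combination, which the following Python sketch reproduces; the set names mirror the example, and normalization by the conflicting mass is included even though no conflict arises here.

from itertools import product

def combine(m1, m2):
    """Dempster's rule: the combined mass of A is the sum of
    m1(B) * m2(C) over all B, C with B intersect C = A, normalized
    by the non-conflicting mass."""
    combined, conflict = {}, 0.0
    for (b, w1), (c, w2) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            combined[a] = combined.get(a, 0.0) + w1 * w2
        else:
            conflict += w1 * w2            # mass committed to the empty set
    return {a: w / (1 - conflict) for a, w in combined.items()}

U = frozenset('ABCD')
m1 = {frozenset('AC'): 0.6, U: 0.4}        # evidence for {A, C}
m2 = {frozenset('ABD'): 0.7, U: 0.3}       # evidence against C
for a, w in combine(m1, m2).items():
    print(sorted(a), round(w, 2))          # {A}:0.42 {A,C}:0.18 {A,B,D}:0.28 U:0.12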
Bayes Theorem: An important goal for many problem-solving systems is to collect evidence as the system goes along and to modify its behavior on the basis of the evidence. To model this behavior, we need a statistical theory of evidence. Bayesian statistics is such a theory. The fundamental notion of Bayesian statistics is that of conditional probability:

P(H/E)

Read this expression as the probability of hypothesis H given that we have observed evidence E. To compute this, we need to take into account the prior probability of H and the extent to which E provides evidence of H. To do this, we need to define a universe that contains an exhaustive, mutually exclusive set of Hi's, among which we are trying to discriminate. Then let
P(Hi/E) = the probability that hypothesis Hi is true given evidence E
P(E/Hi) = the probability that we will observe evidence E given that hypothesis Hi is true
P(Hi) = the a priori probability that hypothesis Hi is true in the absence of any specific evidence. These probabilities are called prior probabilities, or priors.
k = the number of possible hypotheses
Bayes' theorem then states that

P(Hi/E) = P(E/Hi) * P(Hi) / [ the sum over n = 1..k of P(E/Hn) * P(Hn) ]

Specifically, when we say P(A/B), we are describing the conditional probability of A given that the only evidence we have is B. If there is also other relevant evidence, then it too must be considered. Suppose, for example, that we are solving a medical diagnosis problem. Consider the following assertions:
S: patient has spots
M: patient has measles
F: patient has high fever
Without any additional evidence, the presence of spots serves as evidence in favor of measles. It also serves as evidence of fever, since measles would cause fever. But since spots and fever are not independent events, we cannot just sum their effects; instead, we need to represent explicitly the conditional probability that arises from their conjunction. In general, given a
In general, given a prior body of evidence e and some new observation E, we need to compute

P(H/E, e) = P(H/E) · P(e/E, H) / P(e/E)

Unfortunately, in an arbitrarily complex world, the size of the set of joint probabilities that we require in order to compute this function grows as 2^n if there are n different propositions being considered. This makes using Bayes' theorem intractable for several reasons:
The knowledge acquisition problem is insurmountable: too many probabilities have to be provided.
The space that would be required to store all the probabilities is too large.
The time required to compute the probabilities is too large.
Despite these problems, though, Bayesian statistics provides an attractive basis for an uncertain reasoning system. As a result, several mechanisms for exploiting its power while at the same time making it tractable have been developed.
that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. A learning system must be able to interpret its inputs in such a way that its performance gradually improves. Learning denotes changes in a system that are adaptive in the sense that they enable the system to do the same task, or tasks drawn from the same population, more efficiently and more effectively the next time. Learning covers a wide range of phenomena:
1. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get.
2. At the other end of this spectrum lies knowledge acquisition. Knowledge is generally acquired through experience.
3. Many AI programs are able to improve their performance substantially through rote-learning techniques.
4. Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving.
5. People also learn through their own problem-solving experience. After solving a complex problem, we
remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. A program can similarly remember its experiences and generalize from them. In large problem spaces, however, efficiency gains are critical. Learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.
6. Another form of learning that does involve stimuli from the outside is learning from examples. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher. Learning is itself a problem-solving process.
The learner must be given some task from which meaningful feedback can be obtained, where the feedback provides some measure of the accuracy and usefulness of the newly acquired knowledge. The learning model is depicted in the figure, where the environment has been included as part of the overall learner system. The environment may be regarded either as a form of nature which produces random stimuli, or as a more organized training source, such as a teacher, which provides carefully selected training examples for the learner component.
[Fig. General learning model: the environment or teacher supplies stimuli, training examples, and feedback to the learner component, which creates and modifies the knowledge base; the performance component uses that knowledge to produce responses to tasks, and a critic/evaluator assesses the performance and feeds the result back to the learner.]
The actual form of environment used will depend on the particular learning paradigm. In any case, some representation language must be assumed for communication between the environment and the learner. The language may be the same representation scheme as that used in the knowledge base (such as a form of predicate calculus). When they are chosen to be the same, we say the single representation trick is being used. This usually results in a simpler implementation, since it is not necessary to transform between two or more different representations. Inputs to the learner component may be physical stimuli of some type or descriptive, symbolic training examples. The information conveyed to the learner component is used to create and modify knowledge structures in the knowledge base. When given a task, the performance component produces a response describing its actions in performing the task. The critic module then evaluates this response relative to an optimal response. The cycle described above may be repeated a number of times until the performance of the system has reached some acceptable level, until a known learning goal has been reached, or until changes cease to occur in the knowledge base after some chosen number of training examples have been observed.
There are several important factors which influence a system's ability to learn, in addition to the form of representation used. They include the types of training provided, the form and extent of any initial background knowledge, the type of feedback provided, and the learning algorithms used (fig).
[Fig. Factors affecting learning performance: the training provided, background knowledge, representation scheme, feedback, and learning algorithms together determine the resultant performance.]
Finally, the learning algorithms themselves determine to a large extent how successful a learning system will be. The algorithms control the search to find and build the knowledge structures. We then expect that the algorithms that extract much of the
useful information from training examples and take advantage of any background knowledge will outperform those that do not.
2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements. 3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output. 4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees. 5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation. 6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
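A minimal sketch of steps 2 through 6 in Python, using scikit-learn and a synthetic dataset (the dataset, the choice of a decision tree, and the depth values tried are illustrative assumptions, not prescribed by the text):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 2-3: gather a training set and represent each input as a feature vector.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Step 4: choose a structure for the learned function (here, a decision tree).
# Step 5: adjust a control parameter by optimizing performance on the validation set.
best_depth, best_acc = None, 0.0
for depth in (2, 4, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Step 6: measure accuracy on a test set kept separate from all training and tuning.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))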
Factors to consider: When choosing and applying a learning algorithm, consider the following:
1. Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including support vector machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1, 1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.
2. Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
3. Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, support vector machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels)
generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
How supervised learning algorithms work: Given a set of training examples of the form {(x1, y1), ..., (xN, yN)}, a learning algorithm seeks a function g: X → Y, where X is the input space and Y is the output space. The function g is an element of some space of possible functions G, usually called the hypothesis space. It is sometimes convenient to represent g using a scoring function f: X × Y → R, such that g is defined as returning the y value that gives the highest score: g(x) = arg max_y f(x, y). Let F denote the space of scoring functions. Although G and F can be any space of functions, many learning algorithms are probabilistic models where g takes the form of a conditional probability model g(x) = P(y | x), or f takes the form of a joint probability model f(x, y) = P(x, y). For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model. There are two basic approaches to choosing f or g: empirical risk minimization and structural risk minimization.[3] Empirical risk minimization seeks the function that best fits the training data.
Structural risk minimization includes a penalty function that controls the bias/variance tradeoff. In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs (xi, yi). In order to measure how well a function fits the training data, a loss function L is defined. For training example (xi, yi), the loss of predicting the value ŷ is L(yi, ŷ). The risk R(g) of function g is defined as the expected loss of g. This can be estimated from the training data as the empirical risk R_emp(g) = (1/N) Σi L(yi, g(xi)).
Generalizations of supervised learning: There are several ways in which the standard supervised learning problem can be generalized:
Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.
Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.
Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.
Fig. Data for unsupervised learning (a table of binary feature vectors describing ten animals). This form of learning is called unsupervised learning because no teacher is required. Given a set of input data, the network is allowed to play with it to try to discover regularities and relationships between the different parts of the input. Learning is often made possible through some notion of which features in the input sets are important. But often we do not know in advance which features are important, and asking a learning system to deal with raw input data can be computationally expensive. Unsupervised learning can be used as a feature discovery module that precedes supervised learning. Consider the data in the figure: the group of ten animals, each described by its own set of features, breaks down naturally into three groups: mammals, reptiles, and birds. We would like to build a network that can learn which group a particular animal belongs to, and to generalize so that it can identify animals it has not yet seen. We could easily accomplish this with a six-input, three-output back-propagation network: we simply present the network with an input, observe its output, and update its weights based on the errors it makes. Without a teacher, however, the error cannot be computed, so we must seek other methods. Our first problem is to ensure that only one of the three output units becomes active for any given input. One solution to this problem is to let the network settle, find the output unit with the
highest level of activation, and set that unit to 1 and all other output units to 0. In other words, the output unit with the highest activation is the only one we consider to be active. A more neural-like solution is to have the output units fight among themselves for control of an input vector.
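A minimal Python sketch of this winner-take-all step (the activation values are invented for illustration):

import numpy as np

def winner_take_all(activations):
    # Set the most active output unit to 1 and all the others to 0.
    out = np.zeros_like(activations)
    out[np.argmax(activations)] = 1.0
    return out

print(winner_take_all(np.array([0.2, 0.9, 0.4])))  # [0. 1. 0.]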
developing some general conclusions or theories. (Thanks to William M.K. Trochim for these definitions.) To translate this into an approach to learning a skill: deductive learning is someone TELLING you what to do, while inductive learning is someone SHOWING you what to do. Remember the saying "a picture is worth a thousand words"? That means, in a given amount of time, a person can be SHOWN a thousand times more information than they could be TOLD in the same amount of time. I can access a picture or pattern much more quickly than the equivalent description of that picture or pattern in words. Athletes often practice "visualization" before they undertake an action. But in order to visualize something, you need to have a picture in your head to visualize. How do you get those pictures in your head? By WATCHING. Who do you watch? Professionals. This is the key. Pay attention here. When you want to learn a skill:
WATCH PROFESSIONALS DO IT BEFORE YOU DO IT. DO NOT DO IT YOURSELF FIRST. Going out and doing a sport without having seen AND STUDIED professionals doing that sport is THE NUMBER ONE MISTAKE people make. They force themselves to play, their brain says "what do we do now?", another part of the brain looks for examples (pictures) of what to do, and, finding none, says "just do anything". So they try to generate behavior to
accomplish something within the rules of the sport. If they "keep score" and try to "win" and avoid "losing", the negative impact is multiplied tenfold. Yet this is EXACTLY what most people do and what most ARE TOLD to do! "Interested in tennis? Grab a racquet, join a league, get out there and have fun!" Then what happens? They have no training, they try to do what it takes to "win", and to do so, they manufacture awful strokes just TO BE ABLE to play (remember, they joined a league, so they have to keep score and win!). These awful strokes get ingrained by repetition, they produce terrible results, and they are very difficult to unlearn, so progress, despite lessons (mostly in the useless form of words), is slow or nonexistent. Then they quit. When you finally pick up a racquet and go out to play, and your brain says "what do we do now?", your head will be filled with pictures of professionals perfectly doing what you are trying to do. You will not know how to do it incorrectly, because you have never seen it done incorrectly. You will try to do what they do, and you will almost immediately proceed to an advanced intermediate level. You will be a beginner for a short period of time, if at all, and improvement will be a matter of adding to and refining what you are doing, not stripping down and unlearning bad patterns. And since you are not keeping score, you focus purely on technique. If you hit one into the net, just pull another ball out of your pocket and do it again. No big deal, no drama, no guilt. Just hit another. When you feel you can hit all of your shots somewhat professionally, maybe you can actually play someone and keep score. You will love the
positive feedback of beating players who have been playing much longer than you have. You will wonder how they could have played for so long and still "play like that". Don't they know it's done "this way"? What professional does it "that way"? Don't they watch tennis on TV? Who does that? I just started and I know that's wrong. All these thoughts will make you feel like a genius. So how does all of this relate to chess? Simply put, play over the games of professional players and see how they play before you play anybody. Try to imitate them instead of trying to reinvent the wheel. Play over the games of lots of different players and then decide which one or two you like. The ones you like are the ones where you say, after playing over one of their games, "I would love to play a game like that!" Then just concentrate on those one or two players. Study and play the openings they play. Get books where they comment on their own games. Maybe they will say what they were thinking during the game. Try to play like them. During your games, think "What would he do in this position?" Personally, I like Morphy for his rapid development and attacks, Alekhine for his creativeness in all positions, and Spassky for his ability to play all types of positions and create attacks in calm positions.
value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making.
General: Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions. Data comes in records of the form (x, Y) = (x1, x2, x3, ..., xk, Y).
The dependent variable, Y, is the target variable that we are trying to understand, classify, or generalize. The vector x is composed of the input variables x1, x2, x3, etc., that are used for that task.
Types: Classification tree analysis is when the predicted outcome is the class to which the data belongs. Regression tree analysis is when the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital). Classification And Regression Tree (CART) analysis is used to refer to both of the above procedures, first introduced by Breiman et al. Chi-squared Automatic Interaction Detector (CHAID) performs multi-level splits when computing classification trees.[2] A Random Forest classifier uses a number of decision trees in order to improve the classification rate. Boosted Trees can be used for regression-type and classification-type problems.
Advantages of decision trees include the following:
Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Requires little data preparation. Other techniques often require data normalization, dummy variables to be created, and blank values to be removed.
Able to handle both numerical and categorical data. Other techniques are usually specialized for datasets that have only one type of variable. For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.
Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily given by Boolean logic (see the sketch following this list). An example of a black box model is an artificial neural network, since the explanation for the results is difficult to understand.
Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
Performs well with large data in a short time. Large amounts of data can be analyzed using personal computers in a time short enough to enable stakeholders to make decisions based on the analysis.
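As a sketch of the white-box property mentioned above, the following Python fragment trains a tiny tree with scikit-learn and prints it as human-readable rules (the toy animals and feature names are invented for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [1, 1], [0, 0], [0, 1]]   # features: [has_feathers, lays_eggs]
y = ["bird", "bird", "mammal", "reptile"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["has_feathers", "lays_eggs"]))

The printed output is a nested set of if/else tests on the two features, which a person can read and check directly, unlike the weight matrix of a neural network.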
Each node of the network represents an entry in the KB (it may be a premise, an antecedent, an inference rule, etc.), and each arc of the network represents the inference steps from which the node was derived.
Premise: A premise is a fundamental belief which is assumed to be always true. Premises do not need justifications; they form the base from which justifications for all other nodes are constructed. There are two types of justification for each node:
1. Support List (SL)
2. Conditional Proof (CP)
Many kinds of truth maintenance systems exist. Two major types are single-context and multi-context truth maintenance. In single-context systems, consistency is maintained among all facts in memory (the database). Multi-context systems allow consistency to be relevant to a subset of facts in memory (a context) according to the history of logical inference. This is achieved by tagging each fact or deduction with its logical history. Multi-agent truth maintenance systems perform truth maintenance across multiple memories, often located on different machines. De Kleer's ATMS (1986) was utilized in systems based upon KEE on the Lisp Machine. The first multi-agent TMS was created by Mason and Johnson; it was a multi-context system. Bridgeland and Huhns created the first single-context multi-agent system.
The discovery that state 3 is inconsistent causes immediate backtracking to state 5. To be able to determine which choices underlie the contradiction requires that the problem solver store dependency records with every datum that it infers.
Avoiding rediscovering contradictions. The second deficiency of chronological backtracking is illustrated by unnecessary state 13. The contradiction discovered in state 10 depends on B and E. As E is the most recent choice, chronological and dependency-directed backtracking are indistinguishable here, both backtracking to state 11. However, as B and E are known to be inconsistent with each other, there is no point in rediscovering this contradiction by working in state 13.
Disadvantages:
1. Dependency-directed backtracking incurs a significant time and space overhead, as it requires the maintenance of dependency records and an additional nogood database. Thus the effort required to maintain the dependencies may be more than the problem-solving effort saved.
2. If the problem solver is logically complete and finishes all work on a state before considering the next, the problem of backtracking to an inappropriate choice cannot occur.
3. In such cases much of the advantage of dependency-directed backtracking is irrelevant. However, most practical problem solvers are neither logically complete nor finish all possible work on one state before considering another.
Fuzzy function:
The membership function is one of the basic fuzzy functions; it is used to assign the degrees of membership that define a fuzzy set's values. Fuzzy logic depends on membership functions.
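A common concrete choice is the triangular membership function. A minimal Python sketch (the fuzzy set "warm" and its breakpoints are invented for illustration):

def triangular(x, a, b, c):
    # Membership rises linearly from a to the peak at b, then falls to zero at c.
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Fuzzy set "warm", peaking at 25 degrees.
for t in (10, 20, 25, 30, 40):
    print(t, triangular(t, 15, 25, 35))
# 10 -> 0.0, 20 -> 0.5, 25 -> 1.0, 30 -> 0.5, 40 -> 0.0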
If X croaks and eats flies Then X is a frog
If X chirps and sings Then X is a canary
If X is a frog Then X is green
If X is a canary Then X is yellow
This rule base would be searched, and the third and fourth rules would be selected, because their consequents (Then Fritz is green, Then Fritz is yellow) match the goal (to determine Fritz's color). It is not yet known that Fritz is a frog, so both the antecedents (If Fritz is a frog, If Fritz is a canary) are added to the goal list. The rule base is again searched, and this time the first two rules are selected, because their consequents (Then X is a frog, Then X is a canary) match the new goals that were just added to the list. The antecedent (If Fritz croaks and eats flies) is known to be true, and therefore it can be concluded that Fritz is a frog, and not a canary. The goal of determining Fritz's color is now achieved (Fritz is green if he is a frog, and yellow if he is a canary; since he croaks and eats flies, he is a frog, and therefore Fritz is green). Note that the goals always match the affirmed versions of the consequents of implications (and not the negated versions as in modus tollens), and even then, their antecedents are then considered as the new goals (and not the conclusions as in affirming the consequent), which ultimately must match known facts (usually defined as consequents whose antecedents are always true); thus, the inference rule used is modus ponens. Because the list of goals determines which rules are selected and used, this method is called goal-driven, in contrast to data-driven forward-chaining inference. The backward chaining approach is often employed by expert systems. Programming languages such as Prolog, Knowledge Machine and ECLiPSe support backward chaining within their inference engines.
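The goal-driven search just described can be captured in a few lines. A minimal Python sketch of backward chaining over the Fritz rule base (the rule encoding and function name are illustrative assumptions):

# Rules map a consequent to one or more lists of antecedents.
rules = {
    "frog":   [["croaks", "eats flies"]],
    "canary": [["chirps", "sings"]],
    "green":  [["frog"]],
    "yellow": [["canary"]],
}
facts = {"croaks", "eats flies"}   # what is known about Fritz

def prove(goal):
    # A goal holds if it is a known fact, or if all antecedents of
    # some rule concluding it can themselves be proved.
    if goal in facts:
        return True
    return any(all(prove(a) for a in ants) for ants in rules.get(goal, []))

print(prove("green"))   # True: Fritz croaks and eats flies, so he is a frog
print(prove("yellow"))  # False: nothing establishes that Fritz is a canary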
T4. Induce an initial PMTG from the relative frequencies of productions in the multitreebank.
T5. Re-estimate the PMTG parameters, using a synchronous parser with the expectation semiring.
A1. Use the PMTG to infer the most probable multitree covering new input text.
A2. Linearize the output dimensions of the multitree.
Steps T2, T4 and A2 are trivial. Steps T1, T3, T5, and A1 are instances of the generalized parsers. Figure 2 is only an architecture: computational complexity and generalization error stand in the way of its practical implementation. Nevertheless, it is satisfying to note that all the non-trivial algorithms in Figure 2 are special cases of Translator CT. It is therefore possible to implement an MTSMT system using just one inference algorithm, parameterized by a grammar, a semiring, and a search strategy. An advantage of building an MT system in this manner is that improvements invented for ordinary parsing algorithms can often be applied to all the main components of the system. For example, Melamed (2003) showed how to reduce the computational complexity of a synchronous parser just by changing the logic. The same optimization can be applied to the inference algorithms. With proper software design, such optimizations need never be implemented more than once. For simplicity, the algorithms in this section are based on CKY logic. However, the architecture in Figure 2 can also be implemented using
generalizations of more sophisticated parsing logics, such as those inherent in Earley or head-driven parsers.
Benefits of Machine Translation: There are three research benefits of using generalized parsers to build MT systems:
1. We can take advantage of past and future research on making parsers more accurate and more efficient.
2. We can concentrate our efforts on better models, without worrying about MT-specific search algorithms.
3. More generally and most importantly, this approach encourages MT research to be less specialized and more transparently related to the rest of computational linguistics.
PUTDOWN(A): put block A down on the table. The arm must have been holding block A.
Notice that in the world we have described, the robot arm can hold only one block at a time. Also, since all blocks are the same size, each block can have at most one other block directly on top of it. In order to specify both the conditions under which an operation may be performed and the results of performing it, we need to use the following predicates:
ON(A, B): block A is on block B.
ONTABLE(A): block A is on the table.
CLEAR(A): there is nothing on top of block A.
HOLDING(A): the arm is holding block A.
ARMEMPTY: the arm is holding nothing.
Various logical statements are true in this blocks world. For example:
[∃x: HOLDING(x)] → ¬ARMEMPTY
∀x: ONTABLE(x) → ¬∃y: ON(x, y)
∀x: [¬∃y: ON(y, x)] → CLEAR(x)
The first of these statements says simply that if the arm is holding anything, then it is not empty. The second says that if a block is on the table, then it is not also on another block. The third says that any block with no blocks on it is clear.
Components of a planning system: The components of planning systems are as follows:
1. Choose the best rule to apply next, based on the best available information.
2. Apply the chosen rule to compute the new problem state that arises from the application.
3. Detect when a solution has been found.
4. Detect dead ends so that they can be abandoned and the system's effort directed in more fruitful directions.
5. Detect when an almost correct solution has been found and employ special techniques to make it totally correct.
Choosing rules to apply. The most widely used technique for selecting appropriate rules to apply is first to isolate a set of differences between the desired goal state and the current state, and then to identify those rules that are relevant to reducing those differences. If several rules are found, a variety of other heuristic information can be exploited to choose among them.
Applying rules. In simple systems, each rule specified the complete problem state that would result from its application. Now, however, we must be able to deal with rules that specify only a small part of the complete problem state. One way is to describe, for each action, each of the changes it makes to the state description. In addition, some statement that everything else remains unchanged is also necessary. The figure shows how a state, called S0, of a simple blocks world problem could be represented.
[Fig. The initial state S0: ON(A, B) ^ ONTABLE(B) ^ CLEAR(A) ^ ARMEMPTY, i.e. block A sits on block B, which is on the table.]
After applying the operator UNSTACK (A, B), our description of the world would be
ONTABLE (B) ^CLEAR (A) ^CLEAR (B) ^HOLDING (A)
STRIPS-style operators that correspond to the blocks world operations we have discussed are shown in fig 2. Notice that for simple rules such as these, the PRECONDITION list is often identical to the DELETE list. In order to pick up a block, the robot arm must be empty; as soon as it picks up a block, it is no longer empty. But preconditions are not always deleted. For example, in order for the arm to pick up a block, the block must have no other blocks on top of it; after it is picked up, it still has no blocks on top of it. This is the reason that the PRECONDITION and DELETE lists must be specified separately.
STACK(X, Y)
P: CLEAR(Y) ^ HOLDING(X)
D: CLEAR(Y) ^ HOLDING(X)
A: ARMEMPTY ^ ON(X, Y)

UNSTACK(X, Y)
P: ON(X, Y) ^ CLEAR(X) ^ ARMEMPTY
D: ON(X, Y) ^ ARMEMPTY
A: HOLDING(X) ^ CLEAR(Y)

PICKUP(X)
P: CLEAR(X) ^ ONTABLE(X) ^ ARMEMPTY
D: ONTABLE(X) ^ ARMEMPTY
A: HOLDING(X)

PUTDOWN(X)
P: HOLDING(X)
D: HOLDING(X)
A: ONTABLE(X) ^ ARMEMPTY
Fig: STRIPS-style operators for the blocks world
Detecting a solution. A planning system has succeeded in finding a solution to a problem when it has found a sequence of operators that transforms the initial problem state into the goal state.
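Applying such an operator is mechanical: check the PRECONDITION list against the state, then remove the DELETE list and add the ADD list. A minimal Python sketch (representing states as sets of literal strings is an illustrative assumption):

def apply_op(state, pre, delete, add):
    # Returns the successor state, or None if the preconditions fail.
    if not pre <= state:
        return None
    return (state - delete) | add

s0 = {"ON(A,B)", "ONTABLE(B)", "CLEAR(A)", "ARMEMPTY"}
# UNSTACK(A, B), using the P, D, and A lists given above:
s1 = apply_op(s0,
              pre={"ON(A,B)", "CLEAR(A)", "ARMEMPTY"},
              delete={"ON(A,B)", "ARMEMPTY"},
              add={"HOLDING(A)", "CLEAR(B)"})
print(s1)  # {'ONTABLE(B)', 'CLEAR(A)', 'CLEAR(B)', 'HOLDING(A)'}

The result matches the hand-derived description of the world after UNSTACK(A, B).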
Detecting dead ends. As a planning system is searching for a sequence of operators to solve a particular problem, it must be able to detect when it is exploring a path that can never lead to a solution. The same reasoning mechanisms that can be used to detect a solution can often be used for detecting a dead end.
Goal stack planning: The technique to be developed for solving compound goals that may interact is the use of a goal stack. This was the approach used by STRIPS. In this method, the problem solver makes use of a single stack that contains both goals and operators that have been proposed to satisfy those goals. The problem solver also relies on a database that describes the current situation and a set of operators described by PRECONDITION, ADD, and DELETE lists. To see how this method works, let us carry it through for the simple example shown in the figure.
[Fig. A simple blocks-world problem whose goal is ON(C, A) ^ ON(B, D) ^ ONTABLE(A) ^ ONTABLE(D).]
But we want to separate this problem into four subproblems, one for each component of the original goal. Two of the subproblems, ONTABLE(A) and ONTABLE(D), are already true in the initial state. Depending on the order in which we want to tackle the subproblems, there are two goal stacks that could be created as our first step, where each line represents one goal on the stack and OTAD is an abbreviation for ONTABLE(A) ^ ONTABLE(D):
To continue with the example we started above, let us assume that we choose first to explore alternative 1. Alternative 2 will also lead to a solution; in fact, it finds one so trivially that it is not very interesting. Exploring alternative 1, we first check to see whether ON(C, A) is true in the current state.
If we only represent partial-order constraints on steps, then we have a partial-order planner, which is also called a non-linear planner. In this case, we specify a set of temporal constraints between pairs of steps of the form S1 < S2, meaning that step S1 comes before, but not necessarily immediately before, step S2. We also show this temporal constraint in graph form as S1 +++++++++> S2. STRIPS is a total-order planner, as are situation-space progression and regression planners.
Principle of Least Commitment: The principle of least commitment is the idea of never making a choice unless required to do so. In other words, only do something if it is necessary. The advantage of using this principle is that we try to avoid doing work that might have to be undone later, hence avoiding wasted work. In planning, one application of this principle is never to order plan steps unless it is necessary for some reason. Partial-order planners exhibit this property because constraints ordering steps are inserted only when necessary. On the other hand, situation-space progression planners make commitments about the order of steps as they try to find a solution, and therefore may make mistakes from poor guesses about the right order of steps.
Representing a Partial-Order Plan: A partial-order plan will be represented as a graph that describes the temporal constraints between the plan steps selected so far. That is, each node will represent a single step in the plan (i.e., an
instance of one of the operators), and an arc will designate a temporal constraint between the two steps it connects. For example, a graph with arcs S1 ++++> S2, S1 ++++> S3, S1 ++++> S4, S2 ++++> S5, S3 ++++> S4, and S4 ++++> S5 graphically represents the temporal constraints S1 < S2, S1 < S3, S1 < S4, S2 < S5, S3 < S4, and S4 < S5. This partial-order plan implicitly represents the following three total-order plans, each of which is consistent with all of the given constraints: [S1,S2,S3,S4,S5], [S1,S3,S2,S4,S5], and [S1,S3,S4,S2,S5].
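The claim that exactly these three total orders are consistent can be checked by brute force. A minimal Python sketch:

from itertools import permutations

steps = ["S1", "S2", "S3", "S4", "S5"]
constraints = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
               ("S2", "S5"), ("S3", "S4"), ("S4", "S5")]

def consistent(order):
    pos = {s: i for i, s in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in constraints)

for order in permutations(steps):
    if consistent(order):
        print(order)
# Prints exactly the three linearizations listed above.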
Partial-Order Planner (POP) Algorithm

function pop(initial-state, conjunctive-goal, operators)
    // non-deterministic algorithm
    plan = make-initial-plan(initial-state, conjunctive-goal);
    loop:
        if solution?(plan) then return plan;
        (S-need, c) = select-subgoal(plan);          // choose an unsolved goal
        choose-operator(plan, operators, S-need, c); // select an operator to solve that goal and revise plan
        resolve-threats(plan);                       // fix any threats created
    end
end

function solution?(plan)
    if causal-links-establishing-all-preconditions-of-all-steps(plan)
       and all-threats-resolved(plan)
       and all-temporal-ordering-constraints-consistent(plan)
       and all-variable-bindings-consistent(plan)
    then return true;
    else return false;
end

function select-subgoal(plan)
    pick a plan step S-need from steps(plan) with a precondition c
        that has not been achieved;
    return (S-need, c);
end

procedure choose-operator(plan, operators, S-need, c)
    // solve the "open precondition" of some step
    choose a step S-add by either
        Step Addition: adding a new step from operators that has c in its Add-list, or
        Simple Establishment: picking an existing step in Steps(plan) that has c in its Add-list;
    if no such step then return fail;
    add causal link "S-add --->c S-need" to Links(plan);
    add temporal ordering constraint "S-add < S-need" to Orderings(plan);
    if S-add is a newly added step then
        add S-add to Steps(plan);
        add "Start < S-add" and "S-add < Finish" to Orderings(plan);
end

procedure resolve-threats(plan)
    foreach S-threat that threatens link "Si --->c Sj" in Links(plan)
        // "declobber" the threat
        choose either
            Demotion: add "S-threat < Si" to Orderings(plan), or
            Promotion: add "Sj < S-threat" to Orderings(plan);
        if not(consistent(plan)) then return fail;
    end
end
Unit-5 Expert System and AI Languages
Introduction: An expert system is a set of programs that
manipulate encoded knowledge to solve problems in a specialized domain that normally requires human expertise. An expert system's knowledge is obtained from expert sources and
coded in a form suitable for the system to use in its inference or reasoning processes. The expert knowledge must be obtained from specialists or other sources of expertise, such as texts, journal articles, and databases. This type of knowledge usually requires much training and experience in some specialized field such as medicine, geology, system configuration, or engineering design.
Expert systems differ from conventional programs in several ways:
1. Expert systems use knowledge rather than data to control the solution process. Much of the knowledge used is heuristic in nature rather than algorithmic.
2. The knowledge is encoded and maintained as an entity separate from the control program. As such, it is not compiled together with the control program itself. This permits the incremental addition and modification (refinement) of the knowledge base without recompilation of the control programs. Furthermore, it is possible in some cases to use different knowledge bases with the same control programs to produce different types of expert systems.
3. Expert systems are capable of explaining how a particular conclusion was reached, and why requested information is needed during a consultation. This is important as it gives the user a chance to assess and understand the system's reasoning ability, thereby improving the user's confidence in the system.
4. Expert systems use symbolic representations for knowledge (rules, networks, or frames) and perform their inference through symbolic computations that closely resemble manipulations of natural language.
5. Expert systems often reason with meta-knowledge; that is, they reason with knowledge about themselves and their own knowledge limits and capabilities.
Applications:
Different types of medical diagnoses (internal medicine, pulmonary diseases, infectious blood diseases, and so on)
Diagnosis of complex electronic and electrochemical systems
Diagnosis of diesel-electric locomotive systems
Diagnosis of software development projects
Forecasting crop damage
Location of faults in computer and communication systems
[Fig. Expert system structure: a development engine is used to build the knowledge base, an inference engine applies it, and a user interface connects the system to the user.]
IF: the temperature is greater than 200 degrees, and the water level is low
THEN: open the safety valve.
Abstractly, such a rule has the form A & B & C & D → E & F, a conjunction of antecedent conditions implying one or more conclusions.
Each rule represents a small chunk of knowledge relating to the given domain of expertise, leading from some initially known facts to some useful conclusions. When the conditions of a rule are satisfied, the conclusion or action part of the rule is then accepted as known (or at least known with some degree of certainty). Inference in production systems is accomplished by a process of chaining through the rules recursively, either in a forward or backward direction, until a conclusion is reached or until failure occurs. The selection of rules used in the chaining process is determined by matching current facts against the domain knowledge or variables in rules, and choosing among a candidate set of rules the ones that meet some given criteria, such as specificity. The inference process is typically carried out in an interactive mode, with the user providing input parameters needed to complete the chaining process.
[Fig. Components of a typical expert system: the user interacts through input/output; the system comprises an explanation module, an editor, a knowledge base, a working memory, and a learning module.]
The Knowledge Base: The knowledge base contains facts and rules about some specialized knowledge domain.
The Inference Process: The inference engine accepts user input queries and responses to questions through the I/O interface, and uses this dynamic information together with the static knowledge (the rules and facts) stored in the knowledge base. The knowledge in the knowledge base is used to derive conclusions about the current case or situation as presented by the user's input. During the match stage, the contents of working memory are compared to facts and rules contained in the knowledge base. When consistent matches are found, the corresponding rules are placed in a conflict set. To find an appropriate and consistent match, substitutions (instantiations) may be required. Once all the matched rules have been added to the conflict set during a given cycle, one of the rules is selected for execution. When the left side of a sequence of rules is instantiated first and the rules are executed from left to right, the process is called forward chaining. This is also known as data-driven inference, since input data are used to guide the direction of the inference process. For example, we can chain forward to show that when a student is encouraged, is healthy, and has goals, the student will succeed:
ENCOURAGED(student) → MOTIVATED(student)
MOTIVATED(student) & HEALTHY(student) → WORKHARD(student)
WORKHARD(student) & HASGOALS(student) → EXCELL(student)
EXCELL(student) → SUCCEED(student)
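A minimal Python sketch of this data-driven chaining over the student rules above (the encoding is an illustrative assumption): starting from the known facts, rules are fired repeatedly until nothing new can be concluded.

rules = [
    ({"ENCOURAGED"}, "MOTIVATED"),
    ({"MOTIVATED", "HEALTHY"}, "WORKHARD"),
    ({"WORKHARD", "HASGOALS"}, "EXCELL"),
    ({"EXCELL"}, "SUCCEED"),
]
facts = {"ENCOURAGED", "HEALTHY", "HASGOALS"}

changed = True
while changed:
    changed = False
    for antecedents, conclusion in rules:
        if antecedents <= facts and conclusion not in facts:
            facts.add(conclusion)      # fire the rule
            changed = True
print("SUCCEED" in facts)  # True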
On the other hand, when the right side of the rules is instantiated first, the left-hand conditions become subgoals. These subgoals may in turn cause sub-subgoals to be established, and so on, until facts are found to match the lowest subgoals' conditions. When this form of inference takes place, we say that backward chaining is performed. This form of inference is also known as goal-driven inference, since an initial goal establishes the backward direction of the inferring.
Explanation Module: The explanation module provides the user with an explanation of the reasoning process when requested. This is done in response to a how query or a why query. To respond to a how query, the explanation module traces the chain of rules fired during a consultation with the user. The sequence of rules that led to the conclusion is then printed for the user in an easy-to-understand, human-language style. This permits the user to actually see the reasoning process followed by the system in arriving at the conclusion. If the user does not agree with the reasoning steps presented, they may be changed using the editor.
To respond to a why query, the explanation module must be able to explain why certain information is needed by the inference engine to complete a step in the reasoning process before it can proceed. For example, in diagnosing a car that will not start, a system might be asked why it needs to know the status of the distributor spark.
Building a Knowledge Base: The editor is used by developers to create new rules for addition to the knowledge base, to delete outmoded rules, or to modify existing rules in some way. The editor may also apply consistency tests to newly created rules, and such systems often prompt the user for missing information.
The I/O Interface: The input-output interface permits the user to communicate with the system in a more natural way, by permitting the use of simple selection menus or the use of a restricted language which is close to a natural language. This means that the system must have special prompts or a specialized vocabulary which encompasses the terminology of the given domain of expertise.
Associative or Semantic Network Architectures: We know that an associative network is a network made up of nodes connected by directed arcs. The nodes represent objects, attributes, concepts, or other basic entities, and the arcs, which are labeled, describe the relationship between the two nodes they connect. Special network links include the ISA and HASPART links, which designate an object as being a certain type of object (belonging to a class of objects) or as being a subpart of another object, respectively. Associative network representations are especially useful in depicting hierarchical knowledge structures, where property inheritance is common. More often, these network representations are used in natural language or computer vision systems, or in conjunction with some other form of representation.
Frame Architectures: Frames are structured sets of closely related knowledge, such as an object or concept name, the object's main attributes and their corresponding values, and possibly some attached procedures (if-needed, if-added, if-removed procedures). The attributes, values, and procedures are stored in specified slots and facets of the frame. Individual frames are usually linked together much like the nodes in an associative network.
Decision Tree Architectures: Knowledge for expert systems may be stored in the form of a decision tree when the knowledge can be structured in a top-to-bottom manner. For example, the identification of objects (equipment, faults, physical objects, diseases) can be made through a decision tree structure. Initial and intermediate nodes in the tree correspond to object attributes, and terminal nodes correspond to the identities of objects. Attribute values for an object determine a path to a leaf node in the tree which contains the object's identification. Each object attribute corresponds to a nonterminal node in the tree, and each branch of the decision tree corresponds to an attribute value or set of values.
Blackboard System Architectures: Blackboard architectures refer to a special type of knowledge-based system which uses a form of opportunistic reasoning. This differs from pure forward or pure backward chaining in production systems in that either direction may be chosen dynamically at each stage in the problem solution process. Blackboard systems are composed of three functional components, as depicted in the figure:
1. There are a number of knowledge sources, which are separate and independent sets of coded knowledge. Each knowledge source may be thought of as a specialist in some limited area needed to solve a given subset of
problems; the sources may contain knowledge in the form of procedures, rules, or other schemes.
2. A globally accessible database structure, called a blackboard, contains the current problem state and information needed by the knowledge sources (input data, partial solutions, control data, alternatives, and final solutions). The knowledge sources make changes to the blackboard data that incrementally lead to a solution. Communication and interaction between the knowledge sources take place solely through the blackboard.
3. Control information may be contained within the sources, on the blackboard, or possibly in a separate module. The control knowledge monitors the changes to the blackboard and determines what the immediate focus of attention should be in solving the problem.
[Fig. Components of blackboard systems: the blackboard, the knowledge sources, and the control information.]
Analogical Reasoning Architectures: Expert systems based on analogical architectures solve new problems as humans do, by finding a similar problem solution that is known and applying the known solution to the new problem, possibly with some modifications. For example, if we know a method of proving that the product of two even integers is even, we can successfully prove that the product of two odd integers is odd through much the same proof steps. Expert systems using analogical architectures will require a large knowledge base having
numerous problem solutions and other previously encountered situations or episodes.
Neural Network Architectures: Neural networks are large networks of simple processing elements or nodes which process information dynamically in response to external inputs. The nodes are simplified models of neurons. The knowledge in a neural network is distributed throughout the network in the form of internode connections and weighted links which form the inputs to the nodes. The link weights serve to enhance or inhibit the input stimuli values, which are then added together at the nodes. If the sum of all the inputs to a node exceeds some threshold value T, the node fires and produces an output
which is passed on to other nodes or is used to produce some output response. Neural networks were originally inspired as being models of the human nervous system. They are greatly simplified models, to be sure.
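A minimal Python sketch of the node model just described (the weights and threshold are invented for illustration):

def neuron(inputs, weights, threshold):
    # Fire (output 1) if the weighted sum of the inputs exceeds the threshold T.
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

print(neuron([1, 0, 1], [0.5, -0.3, 0.8], threshold=1.0))  # 0.5 + 0.8 = 1.3 > 1.0 -> 1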
capture the low-level knowledge used and the inferring processes applied. This difficulty has led to the use of AI experts (called knowledge engineers) who serve as intermediaries between the domain expert and the system. The knowledge engineer elicits information from the experts and codes this knowledge into a form suitable for use in the expert system. The knowledge elicitation process is depicted in the figure. To elicit the requisite knowledge, a knowledge engineer conducts extensive interviews with domain experts. During the interviews, the expert is asked to solve typical problems in the domain of interest and to explain his or her solutions.
[Fig. The knowledge acquisition process: the domain expert is interviewed by the knowledge engineer, who encodes the elicited knowledge through a system editor into the knowledge base.]
Using the knowledge gained from the experts and other sources, the knowledge engineer codes the knowledge in the form of rules or some other representation scheme. This
knowledge is then used to solve sample problems for review. Errors and omissions are uncovered and corrected, and additional knowledge is added as needed. The process is repeated until a sufficient body of knowledge has been collected to solve a large class of problems in the chosen domain. The whole process may take as many as tens of person-years.
parentheses. A string is a group of characters enclosed in double quotation marks. Lisp programs run either on an interpreter or as compiled code. The interpreter examines source programs in a repeated loop, called the read-evaluate-print loop. This loop reads the program code, evaluates it, and prints the values returned by the program. The interpreter signals its readiness to accept code for execution by printing a prompt such as the -> symbol.
Examples
Here are examples of Common Lisp code. The basic "Hello world" program:
(print "Hello world")
As the reader may have noticed from the above discussion, Lisp syntax lends itself naturally to recursion. Mathematical problems such as the enumeration of recursively defined sets are simple to express in this notation. Evaluate a number's factorial:
(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))
An alternative implementation, often faster than the previous version if the Lisp system has tail recursion optimization:
(defun factorial (n &optional (acc 1))
  (if (<= n 1)
      acc
      (factorial (- n 1) (* acc n))))
Contrast with an iterative version which uses Common Lisp's loop macro:
(defun factorial (n)
  (loop for i from 1 to n
        for fac = 1 then (* fac i)
        finally (return fac)))
The following function reverses a list. (Lisp's built-in reverse function does the same thing.)
(defun -reverse (list)
  (let ((return-value '()))
    (dolist (e list)
      (push e return-value))
    return-value))
logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations.[4] The language was first conceived by a group around Alain Colmerauer in Marseille, France, in the early 1970s and the first Prolog system was developed in 1972 by Colmerauer with Philippe Roussel.[5][6] Prolog was one of the first logic programming languages,[7] and remains among the most popular such languages today, with many free and commercial implementations available. While initially aimed at natural language processing, the language has since then stretched far into other areas like theorem proving,[8] expert systems,[9] games, automated answering systems, ontologies and sophisticated control systems. Modern Prolog environments support creating graphical user interfaces, as well as administrative and networked applications. Syntax and semantics. In Prolog, program logic is expressed in terms of relations, and a computation is initiated by running a query over these relations. Relations and queries are constructed using Prolog's single data type, the term. Relations are defined by clauses. Given a query, the Prolog engine attempts to find a resolution refutation of the negated query. If the negated query can be refuted, i.e., an instantiation for all free variables is found that makes the union of clauses and the singleton set consisting of the negated query false, it follows that the original query, with the found instantiation applied, is a logical consequence of the program. This makes Prolog (and other logic programming
Page 121
languages) particularly useful for database, symbolic mathematics, and language parsing applications. Because Prolog allows impure predicates, checking the truth value of certain special predicates may have some deliberate side effect, such as printing a value to the screen. Because of this, the programmer is permitted to use some amount of conventional imperative programming when the logical paradigm is inconvenient. Prolog has a purely logical subset, called "pure Prolog", as well as a number of extra-logical features.
Data types
Prolog's single data type is the term. Terms are either atoms, numbers, variables or compound terms.
An atom is a general-purpose name with no inherent meaning. Examples of atoms include x, blue, 'Taco', and 'some atom'. Numbers can be floats or integers. Variables are denoted by a string consisting of letters, numbers and underscore characters, and beginning with an upper-case letter or underscore. Variables closely resemble variables in logic in that they are placeholders for arbitrary terms. A compound term is composed of an atom called a "functor" and a number of "arguments", which are again terms. Compound terms are ordinarily written as a functor followed by a comma-separated list of argument terms, which is contained in parentheses. The number of arguments is called the term's arity. An atom can be
regarded as a compound term with arity zero. Examples of compound terms are truck_year('Mazda', 1986) and 'Person_Friends'(zelda,[tom,jim]). Special cases of compound terms:
A List is an ordered collection of terms. It is denoted by square brackets with the terms separated by commas or in the case of the empty list, []. For example [1,2,3] or [red,green,blue]. Strings: A sequence of characters surrounded by quotes is equivalent to a list of (numeric) character codes, generally in the local character encoding, or Unicode if the system supports Unicode. For example, "to be, or not to be".
Examples
Here follow some example programs written in Prolog.
Hello world
An example of a query:
?- write('Hello world!'), nl.
Hello world!
true.
?-
Compiler optimization
Any computation can be expressed declaratively as a sequence of state transitions. As an example, an optimizing compiler with
three optimization passes could be implemented as a relation between an initial program and its optimized form:

program_optimized(Prog0, Prog) :-
    optimization_pass_1(Prog0, Prog1),
    optimization_pass_2(Prog1, Prog2),
    optimization_pass_3(Prog2, Prog).

or equivalently using DCG notation:

program_optimized --> optimization_pass_1, optimization_pass_2, optimization_pass_3.

Quicksort
The quicksort sorting algorithm, relating a list to its sorted version:

partition([], _, [], []).
partition([X|Xs], Pivot, Smalls, Bigs) :-
    (   X @< Pivot ->
        Smalls = [X|Rest],
        partition(Xs, Pivot, Rest, Bigs)
    ;   Bigs = [X|Rest],
        partition(Xs, Pivot, Smalls, Rest)
    ).

quicksort([]) --> [].
quicksort([X|Xs]) -->
    { partition(Xs, X, Smaller, Bigger) },
    quicksort(Smaller), [X], quicksort(Bigger).
Dynamic programming
The following Prolog program uses dynamic programming to find the longest common subsequence of two lists in polynomial time. The clause database is used for memoization:

:- dynamic(stored/1).

memo(Goal) :-
    (   stored(Goal) -> true
    ;   Goal, assertz(stored(Goal))
    ).

lcs([], _, []) :- !.
lcs(_, [], []) :- !.
lcs([X|Xs], [X|Ys], [X|Ls]) :- !,
    memo(lcs(Xs, Ys, Ls)).
lcs([X|Xs], [Y|Ys], Ls) :-
    memo(lcs([X|Xs], Ys, Ls1)),
    memo(lcs(Xs, [Y|Ys], Ls2)),
    length(Ls1, L1), length(Ls2, L2),
    (   L1 >= L2 -> Ls = Ls1
    ;   Ls = Ls2
    ).

Example query:
?- lcs([x,m,j,y,a,u,z], [m,z,j,a,w,x,u], Ls).
Ls = [m, j, a, u]

Design patterns
A design pattern is a general reusable solution to a commonly occurring problem in software design. In Prolog, design patterns go under various names: skeletons and techniques, clichés, program schemata, and logic description schemata. An alternative to design patterns is higher-order programming.
Higher-order programming
Main articles: Higher-order logic and Higher-order programming
By definition, first-order logic does not allow quantification over predicates. A higher-order predicate is a predicate that takes one or more other predicates as arguments. Prolog already has some built-in higher-order predicates such as call/1, findall/3, setof/3, and bagof/3.[16] Furthermore, since arbitrary Prolog goals can be constructed and evaluated at run-time, it is easy to write higher-order predicates like maplist/2, which applies an arbitrary predicate to each member of a given list, and sublist/3, which filters elements that satisfy a given predicate, also allowing for currying.[15]
To convert solutions from temporal representation (answer substitutions on backtracking) to spatial representation (terms), Prolog has various all-solutions predicates that collect all answer substitutions of a given query in a list. This can be used for list comprehension. For example, perfect numbers equal the sum of their proper divisors:
perfect(N) :-
    between(1, inf, N),
    U is N // 2,
    findall(D, (between(1,U,D), N mod D =:= 0), Ds),
    sumlist(Ds, N).
This can be used to enumerate perfect numbers, and also to check whether a given number is perfect.
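A sample enumeration query (note that sumlist/2 is named sum_list/2 in some systems, such as recent SWI-Prolog):
?- perfect(N).
N = 6 ;
N = 28 ;
N = 496 .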
MYCIN
MYCIN was an early expert system, developed at Stanford University in the 1970s, that used a rule base with certainty factors to identify bacteria causing severe infections and to recommend antibiotic therapy. Research showed that MYCIN's performance was minimally affected by perturbations in the uncertainty metrics associated with individual rules, suggesting that the power of the system was related more to its knowledge representation and reasoning scheme than to the details of its numerical uncertainty model. Some observers felt that it should have been possible to use classical Bayesian statistics. MYCIN's developers argued that this would either require unrealistic assumptions of probabilistic independence, or require the experts to provide estimates for an unfeasibly large number of conditional probabilities.[1][2] Subsequent studies showed that the certainty-factor model could indeed be interpreted in a probabilistic sense, and highlighted problems with the implied assumptions of such a model. However, the modular structure of the system proved very successful, leading to the development of graphical models such as Bayesian networks.
Results
Research conducted at the Stanford Medical School found MYCIN to propose an acceptable therapy in about 69% of cases, which was better than the performance of infectious disease experts who were judged using the same criteria. This study is often cited as showing the potential for disagreement about therapeutic decisions, even among experts, when there is no "gold standard" for correct treatment.
Practical use
MYCIN was never actually used in practice. This was not because of any weakness in its performance; as mentioned, in tests it outperformed members of the Stanford medical school faculty. Some observers raised ethical and legal issues related to the use of computers in medicine: if a program gives the wrong diagnosis or recommends the wrong therapy, who should be held responsible? However, the greatest problem, and the reason that MYCIN was not used in routine practice, was the state of technologies for system integration at the time it was developed. MYCIN was a stand-alone system that required a user to enter all relevant information about a patient by typing responses to questions that MYCIN would pose. The program ran on a large time-shared system, available over the early Internet (ARPANET), before personal computers were developed. In the modern era, such a system would be integrated with medical record systems, would extract answers to questions from patient databases, and would be much less dependent on physician entry of information. In the 1970s, a session with MYCIN could easily consume 30 minutes or more, an unrealistic time commitment for a busy clinician.
A difficulty that rose to prominence during the development of MYCIN and subsequent complex expert systems has been the extraction of the necessary knowledge from human experts in the relevant fields into the rule base for the inference engine to use (so-called knowledge engineering).