LEARNING %---------------------------------------------------------------------------- Learning is, after all, the quintessential AI issue. I will now give a definition of AI that will disqualify most of its practitioners. AI is the science of endowing programs with the ability to change themselves for the better as a result of their experiences. -- Schank %---------------------------------------------------------------------------- Why study learning? Because people do it. Learning is an important and fundamental problem in cognition if we are to understand how people think. The central issue is that of change. People change as a result of their experiences. They adapt to new situations, and learn from their experiences. If machines are to be intelligent, they must be able to do the same. Functionally, this is a key requirement because: (a) it is impossible to build in the large amount of knowledge required for any realistic domain by hand (b) dealing with novel input inherently requires adaptation and learning (otherwise the system will only be able to deal with situations for which it was designed) (c) dealing with changing circumstances requires learning (since the knowledge base may become obsolete) (d) to make AI systems more natural to people %---------------------------------------------------------------------------- What is learning? What does it mean to learn? The Analytical Engine has no pretensions whatever to ORIGINATE anything. It can do whatever we KNOW HOW TO ORDER IT to perform. -- Ada Augusta [Lovelace 1961] However, this does not mean that computers can't learn. Learning denotes ... changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time. Learning covers a wide range of phenomena, from SKILL REFINEMENT (e.g., learning to ride a bicycle) to KNOWLEDGE ACQUISITION. Knowledge acquisition, in turn, can be of several types: 1. Rote learning - storage of computed information. 2. Taking advice from others. (Advice may need to be operationalized.) 3. Learning from problem solving experiences - remembering experiences and generalizing from them. (May add efficiency but not new knowledge.) 4. Learning from examples. (May or may not involve a teacher.) 5. Learning by experimentation and discovery. (Decreasing burden on teacher, increasing burden on learner.) %---------------------------------------------------------------------------- Learning: The ability to change through experience to better one's behavior. 1. Acquisition of new knowledge that enhances reasoning capabilities by allowing it to do something it couldn't do before ("knowledge level learning"). 2. Reorganization of existing knowledge (e.g., compilation) that, although it doesn't enhance the reasoning capabilities in the above sense, makes the system able to do its task better or more efficiently ("speedup learning"). 3. Acquisition or modification of reasoning methods ("meta-learning"). %---------------------------------------------------------------------------- Learning task: A learning task can be characterized by the (input) SOURCE and the (output) TARGET structures. Possible target structures include whatever would aid a reasoner to do its job (better). For example, - a new association, a modified association - a new plan, a modified plan - a new criterion for recognizing an object - a new word definition - a new value for a slot in a schema, a new slot, a new schema - a new case, a modified case - a new index - a new model, a modified model --- --- --- %---------------------------------------------------------------------------- Basic phases in learning: 1. Tracking down what needs to be learned (credit and blame assignment). 2. Proposing the new knowledge using some learning algorithm or method. 3. Debugging the proposed knowledge and integrating it with the old knowledge. The "credit (or blame) assignment" problem: Identifying the faulty knowledge structures when an error is made. The "deciding what to learn" problem: Identifying what needs to be learned ("formulating learning goals"). The "deciding how to learn" problem: Identifying appropriate learning strategies (in a multistrategy learning system). Most current systems use only one learning strategy. The "indexing" problem: Indexing the new knowledge back into memory such that it is retrieved in appropriate situations in the figure. %---------------------------------------------------------------------------- Issues: 1. How much do you have to know already before you can learn a little more? 2. How does the system assign responsibility for successes and failures? (This is called "credit/blame assignment".) 3. Should the system learn new knowledge that it didn't know, or should it learn to use existing knowledge more efficiently and effectively? (This is called "knowledge level learning" vs. "speedup (or symbol level) learning".) Are the methods different? 4. Should the system learn from its failures, or from its successes? Are the methods different? 5. Should the system learn new knowledge from scratch, or should it learn by incrementally refining its existing knowledge? Are the methods different? 6. Which examples should we present to the learning system, and in which order? 7. What happens if the input is noisy or incorrect? 8. Should the system try to learn "correct" knowledge only? 9. Can the system get overloaded with too much knowledge? (This is called the "utility problem".) 10. Can the system decide for itself what it should learn? Can it formulate its own learning goals? (This is called the "deciding what to learn" problem. Systems that actively pursue learning goals are called "active learning" systems.) 10. How can the system figure out which features of the input are relevant? (This is called the "situation identification" problem or the "feature selection" problem.) 11. Can a system create new representational constructs on its own? (This is called the "new term" or the "constructive induction" problem.) 12. How do people learn, anyway? 13. Could a system use multiple different learning methods? (This is called "multistrategy learning".) %---------------------------------------------------------------------------- Types of learning (from Michalski & Kodratoff): LEARNING PROCESSES / \ Primary Synthetic Analytic purpose / \ / \ / \ / \ Type of L. from L. from Example- Specification- input examples observation guided guided \ / \ / \ / \ / Type of Inductive Deductive primary inference / | \ \ / / | \ / | \ \-- Analogy --/ / | \ / | \ | / | \ / | \ | / | \ Role of Empirical | Multi-strategy | Axiomatic prior knowledge . | . | . . Constructive . Constructive . . induction . induction . . . . . . . . . . . Methods Empirical Abduction Integrated Abstraction Pure generalization empirical & explanation- Constructive explanation- Deductive based L. Qualitative generali- based L. generali- discovery zation zation Automatic Multi- program Conceptual strategy synthesis clustering constructive L. Neural nets Genetic algorithms %---------------------------------------------------------------------------- Learning by being told (and debugging) Example: TEIRESIAS (Davis 1975) helps a medical expert track down a bug in the knowledge base of an expert system (MYCIN, a backward chaining rule based system). Capabilities: - Can determine what new rule needs to be added. - Can determine what existing rule needs to be modified. - Tracks down bugs in CONTEXT (the features and reasoning from a particular case are used). What TEIRESIAS asks: - What went wrong? - Why did I infer something I shouldn't have? - How could I have inferred something I should have but didn't? In a rule based system, these questions translate to: - Why did I choose a rule I shouldn't have chosen? - What other rule caused me to do that? - Should it have fired? - If so, is there something wrong with its premises? - Is there another rule that caused it to fire that shouldn't have? - Is there a rule missing that would have kept the rule form firing? - Was there a rule that fired thaat shouldn't have? - Why didn't it? What TEIRESIAS needs to answer these questions: - DATA DEPENDENCIES: - Knowledge: What made it decide to use this rule? - Used to find rules that generate bad conclusions. - INDEXING - Knowledge: Find all rules that conclude something about x. - Used to find rules that would have given the right answer. What TEIRESIAS finds, and how it fixes the problems: - INCORRECT RULE - Ask expert to use the rule editor to alter it. - IMPROPERLY EVALUATED PREMISE - Recursively find out what rule was wrong or missing. - MISSING RULE - Ask expert to use rule editor to add new rule. - Use rule assimilator to "second guess" what the rule might be. Basic learning strategy: - Find problem (credit assignment) - Wait for expert to input correction - "Learn" by incorporating what it's told into its database %---------------------------------------------------------------------------- Inductive learning: Learning from examples Similarity-based learning, Inductive learning, Induction, Concept learning, Category formation, Learning class descriptions from examples. Given: Observational statements (facts) A tentative inductive assertion Background knowledge Find: An inductive assertion (hypothesis) that satisfies the background knowledge and tautologically implies the observational statements. The set of facts generalize to the inductive hypothesis. The inductive hypothesis should be: - consistent - refutable - simple, elegant Two classes of inductive problems: Learning concepts from examples (successive refinement based on a series of examples presented one at a time) - Building characteristic descriptions (building long and complete descriptions that describe the concepts being learned) - Building discriminant descriptions (building short discriminatory descriptions that help discriminate concepts from each other) - Discovering sequence generating rules Descriptive generalization (generalization of sets of observations presented all at once rather than as single examples) - Theory formation - Pattern discrovery - Building conceptual hierarchies Learning may be viewed as a heuristic state space search: States = symbolic descriptions of concepts Operators = inference rules (rules for generalization, specialization and reformulation) Goal state = inductive assertion satisfying desired criteria Issues in choosing instances: Noise in data. Order of presentation. Represent more than one concept? %---------------------------------------------------------------------------- Basic performance task: CLASSIFICATION Given a rule: In SITUATION do ACTION Need to be able to recognize SITUATION, i.e., classify input into the categories that define SITUATION. How are classes or categories defined? Weighted sum of features (statistical) Structural composition of features (logical) How are class descriptions learned? Basic idea: Build an initial hypothesis If a rule applied but it shouldn't have: specialize it If a rule did not apply but it should have: generalize it e.g.: Variabilization: replacing a variable with a constant ISA Generalization: climbing the ISA hierarchy Disjunct Addition: adding a new disjunct (Conjunct Addition -> specialization) Univeralization: introducing universal quantifiers Roots in early empiricist philosophy (Locke, James) Issues: What to apply the specialization/generalization rules to? SITUATION IDENTIFICATION PROBLEM: figuring out which features of the situation to ignore in revising rules e.g., how does Winston's program know not to generalize the time of day? how does it know to represent the predicate "touch"? What representations to use? DESCRIPTION PROBLEM: figuring out the representations that make learning easier e.g., (samples/positive instances, negative instances, queries) magic numbers: 7, 60, 70, 546, 627 base 9: 7 66 77 666 766 not magic: 364, 9, 626, 708 base 9: 444 10 765 866 is this magic: 628? 69? 71? base 9: 767? 76? 78? can't represent the right generalization until the right representation is used Which rules to revise? CREDIT ASSIGNMENT PROBLEM: figuring out what the program needs to know or why it made a mistake, and which rules are responsible What operators to allow? DISJUNCTIVE CONCEPT PROBLEM: If you allow disjunctions, you get overfitting. If you don't, you can't learn disjunctive concepts (at best, you can learn a generalization that includes other things as well). (LEX's solution: Multiple version spaces) How to match the new example to the current hypothesis? MATCHING PROBLEM: Which parts in the input description match which parts of the hypothesis? Even worse for specialization: algorithm must determine there is no way to do the match. e.g., match: (large red cube) with: (small red sphere), (large green cube), (small blue cube) What is there are more than one ways to match subparts? e.g., posts of an arch Which examples? Noise in data. Order of presentation. What if an example represents more than one concept? Learning from examples vs. Learning from sets (clustering) Supervised learning vs. Unsupervised learning %---------------------------------------------------------------------------- Winston's learning program (Winston 1972) Task: To derive descriptions of complex objects (e.g., house, arch) based on a series of examples presented by a teacher. Main features: - Learns a "target concept" (the concept that the teacher wants it to learn) by induction. - Proposes and then refines new knowledge (a series of "models" of the target concept). - Attached to a vision program that recognizes primitive objects (e.g., blocks, wedges, spheres). - Teacher presents a series of examples that can be: - SAMPLES: An instance of the target concept. - NEAR MISSES: An example that is not an instance of the target concept, but is "nearly so"; it differs only in one "feature" or "aspect". Example: /\ Sample: (BRICK A) / \ (WEDGE B) / B \ (SUPPORTS A B) ------ ------ Model: (BRICK ?x) | A | (WEDGE ?y) | | (SUPPORTS ?x ?y) ------ /\ ------ Near miss: (BRICK AA) / \ | | (WEDGE BB) / BB \ | AA | ------ ------ Refined model: (BRICK ?x) (WEDGE ?y) (MUST (SUPPORTS ?x ?y)) ------ Sample: (BRICK AAA) | | (BRICK BBB) | BBB| (SUPPORTS AAA BBB) ------ ------ Refined model: (BRICK ?x) | AAA| (PRISM ?y) | | (MUST (SUPPORTS ?x ?y)) ------ /\ Near miss: (WEDGE AAAA) / \ (WEDGE BBBB) /BBBB\ (SUPPORTS AAAA BBBB) ------ /\ Refined model: (MUST-BE-A (BRICK ?x)) / \ (PRISM ?y) /AAAA\ (MUST (SUPPORTS ?x ?y)) ------ Algorithm: - Start with initial model. Now, for each subsequent example: - Match example to model to find differences. - If example is a near miss, then specialize the model. Specialization heuristics: - REQUIRE-LINK: If the evolving model has a link in a place where a near miss does not, convert the model's link to a MUST form. - FORBID-LINK: If the near miss has a link in a place where the evolving model does not, add a MUST-NOT link to the model. - If example is a sample, then generalize the model. Generalization heuristics: - CLIMB-TREE: If something in the evolving model corresponds to something different in the sample, replace the model's clause with a MUST-BE-A link pointing to the most specific shared class connecting the things in the model and sample. - ENLARGE-SET: If something in the evolving model corresponds to a different thing in the sample, replace the model's clause with a MUST-BE-A link pointing to a new class composed of the union of the classes of the two things. - DROP-LINK: If the things that are different in the model and the sample form an exhaustive set, or when the model has a link that is not in the sample, delete the link from the model. - CLOSE-INTERVAL: When the number or interval in the model corresponds to a number in a sample, replace the number (or interval) by an interval spanning the model's number and the sample's number (or an enlarged interval that includes the sample's number, respectively). Underlying principles: NO-GUESSING PRINCIPLE: when in doubt about what to learn, learn nothing. FELICITY CONDITIONS: assume teacher presents reasonable examples. NO-ALTERING PRINCIPLE: if an example does not match the model, create a special case exception. MARTIN'S LAW: You can't learn anything until you almost know it already. Note: - The "scene" (example) must always be less general than the model. - The scene will always have particular objects in it, whereas the model may refer to abstract classes of objects. Contributions: - Showed incremental learning (learning in small steps). - Pointed out the importance of training sequences. - Pointed out the importance of good, consistent representations, without which the difference matching could not be done. Drawbacks: - Learning method requires one-to-one correspondence. - Requires that examples be presented in the correct order. - Disregards the fact that in the real world there are usually many features of objects and scenes. A FOCUS MECHANISM is necessary to distinguish features that are SALIENT. - Negative examples must be "near misses" (differ from the current model in only one way). %---------------------------------------------------------------------------- Extension to learning from sets (INDUCE/AQ11, Michalski 1983) Basic idea: Use similar operators for generalization and specialization Do a heuristic search (beam search) to search for a description. At each step in the search, specialize the description by adding requirements that exclude non-members, and generalize the description as to include as many members as possible while not including any non-members. Algorithm: Start with a description consisting of only nodes for the parts, but no links (this is an empty description that matches everything). Repeat until all class members are described: Select a class member that is not yet described ("current class member"). Until the current description describes none of the non-members: Select a non-member. Find difference between this non-member and the current class member. Use SPECIALIZE to add a MUST or MUST-NOT to the description. For each class member not yet described: Try to generalize the current description using GENERALIZE. %---------------------------------------------------------------------------- Learning by experimentation and problem-solving Mitchell's version spaces/Candidate elimination algorithm: Generalize the ideas in Winston's arch program Least commitment - version space pruned as little as possible Exhaustive breadth-first search throgh the version space (Winston is depth-first) Example: LEX (Mitchell 1982) acquires and refines problem-solving heuristics in the domain of symbolic integration. A form of learning from examples, but less directed than Winston's program. Goal: To devise methods by which heuristic problem-solvers improve their expertise through experience, by: - generating selected problems in the domain - solving them using available heuristics - analyzing their solutions to learn better heuristics Given: - Initial state involving integral - Goal state without integrals - Operators and preconditions (about 40 operators) - Initially, LEX has no heuristics for applicability of operators, but it has a language for representing heuristics. Heuristics take the following form: IF current problem state S matches applicability condition P THEN apply operator O with variable binding B. Language for describing applicability conditions are based on a grammar for algebraic expressions. This grammar is fixed, which restricts the range of heuristics that can be learned. Heuristics for operators represent subclasses of problem states to which the operator SHOULD be applied. Compare these with preconditions which represent when the operator CAN be applied. Example: /-\ | \ x transc(x) dx --> OP2 (integration by parts) \ with u = x | and dv = transc(x) dx \-/ Modules in LEX: - PROBLEM SOLVER - Outputs a detailed trace of the search performed - CRITIC - Analyzes the trace and outputs a set of positive and negative training instances - POSITIVE INSTANCES: Desirable search steps - NEGATIVE INSTANCES: Undesirable search steps - GENERALIZER - Proposes and refines heuristics to improve performance - PROBLEM GENERATOR - Attempts to generate practice problems for the problem solver to work on that will be informative, i.e., that will lead to proposing or refining heuristics, but yet are solvable using current heuristics PROBLEM GENERATOR / ^ practice / \ partially learned problem / \ heuristics v \ PROBLEM SOLVER GENERALIZER \ ^ problem \ / training solving \ / instances trace v / CRITIC Example: Problem generator proposes int(3x cos(x) dx) as an interesting problem to solve. (This was done by hand in the actual implementation of LEX.) Problem solver solves this problem using search. Existing heuristics are used to constrain the search. int(3x cos(x) dx) / \ / \ OP2 with u=3x, dv=cos(x)dx / \ ... 3x sin(x) - int(3 sin(x) dx) / \ OP1 / \ / \ 3x sin(x) - 3 int(sin(x) dx) ... | | OP5 | 3x sin(x) - +3 cos(x) + C Problem solving trace (the above search tree) is passed on to the critic. Critic notices which decisions led to the correct result and which ones didn't pan out, and suggests possible training instances to try out. For example: IF int(3x cos(x) dx) THEN apply OP2 with u = 3x, dv = cos(x) dx Generalizer creates a proposed heuristic or refines an old one. Version spaces: Tentative heuristics are represented using VERSION SPACES (most general and most specific forms). For example: S: IF int(3x cos(x) dx) THEN apply OP2 with u = 3x G: IF int(f1(x) f2(x) dx) THEN apply OP2 with u = f1(x) These represent the most specific heuristic and the most general heuristic one might learn. The actual heuristic it should learn is probably in between somewhere (it lies in the "version space"). Refinement occurs by generalizing S or specializing G, until the heuristic converges (hopefully) to one that works well. Thus the VERSION SPACE represents all the alternative PLAUSIBLE DESCRIPTIONS of the heuristic. A plausible description is one that is applicable to all known positive examples and no known negative example. Algorithm: CANDIDATE ELIMINATION Given: - A representation language - A set of positive and negative examples expressed in that language Compute: A concept description that is consistent with all the positive examples and none of the negative examples. Method: - Initialize G to contain one element: the null description (all features are variables). - Initialize S to contain one element: the first positive example. - Accept a new training example. - If it is a positive example: - Remove from G any descriptions that do not cover the example. - Update the S set to contain the most specific set of descriptions in the version space that cover the example and the current elements of the S set (i.e., generalize the elements of S as little as possible so that they cover the new training example). - If it is a negative example: - Remove from S any descriptions that cover the example. - Update the G set to contain the most general set of descriptions in the version space that do not cover the example (i.e., specialize the elements of G as little as possible so that the negative example is no longer covered by any of the elements of G). - If S and G are both singleton sets, then: - if they are identical, output their value and halt. - if they are different, the training cases were inconsistent. Output this result and halt. - else continue accepting new training examples. Still a trial and error method. The program does not base its choice of examples, or its learned heuristics, on an ANALYSIS of what works or why it works, but rather on the simple assumption that what works will probably work again. Evaluation of LEX: 12 training problems 5 test problems tried after every second training problem 14 heuristics formed, covering 13 of its 32 operators 12 heuristics remained partially learned Performance (number of search steps) improved two orders of magnitude Exercise: Try this algorithm on Winston's house example, above. Example (from Rich & Knight): Learning the concept of "Japanese economy car" Features: Origin, Manufacturer, Color, Decade, Type POSITIVE EXAMPLE: (Japan, Honda, Blue, 1980, Economy) Initialize G to singleton set that includes everything Initialize S to singleton set that includes first positive example G = {(?, ?, ?, ?, ?)} S = {(Japan, Honda, Blue, 1980, Economy)} NEGATIVE EXAMPLE: (Japan, Toyota, Green, 1970, Sports) Specialize G to exclude negative example G = {(?, Honda, ?, ?, ?), (?, ?, Blue, ?, ?) (?, ?, ?, 1980, ?) (?, ?, ?, ?, Economy)} S = {(Japan, Honda, Blue, 1980, Economy)} POSITIVE EXAMPLE: (Japan, Toyota, Blue, 1990, Economy) Prune G to exclude descriptions inconsistent with positive example Generalize S to include positive example G = {(?, ?, Blue, ?, ?) (?, ?, ?, ?, Economy)} S = {(Japan, ?, Blue, ?, Economy)} NEGATIVE EXAMPLE: (USA, Chrysler, Red, 1980, Economy) Specialize G to exclude negative example (but staying within version space, i.e., staying consistent with S) G = {(Japan, ?, Blue, ?, ?) (Japan, ?, ?, ?, Economy)} S = {(Japan, ?, Blue, ?, Economy)} POSITIVE EXAMPLE: (Japan, Honda, White, 1980, Economy) Prune G to exclude descriptions inconsistent with positive example Generalize S to include positive example G = {(Japan, ?, ?, ?, Economy)} S = {(Japan, ?, ?, ?, Economy)} S = G, both singleton => done! Extensions: Continuously valued features Hierarchical knowledge Limitations: Exhaustive breadth-first search Inconsistent data (noise) may cause target concept to be pruned Solution: multiple version spaces (costly) Learning disjunctive concepts is hard %---------------------------------------------------------------------------- Similarity based learning -- generalization based on correlations (induction). Dietterich and Michalski 81; Winston 72; Lebowitz 80. Examples: Winston's cup example Learning "ARCH" from examples and counter-examples "Near misses" IPP's generalizations: Urban terrorists in Italy frequently use silencers. Bombings in El Salvador cause damage but do not hurt anyone. Properties: 1. Pragmatic (as opposed to "correct") 2. Incremental 3. Large numbers of examples Problems: 1. Stupid generalizations 2. No theory of explanation %---------------------------------------------------------------------------- Issues in inductive learning - How can biases be defined (independent of particular learning algorithms and implementations)? Restricted hypothesis space biases e.g. Restricted language Conjunctive descriptions One disjunct per lesson Preference biases e.g. Occam's razor Maximally specific descriptions Maximally general descriptions Least disjunction - How can biases be implemented and feasibly computed? Often, as search procedures or heuristic constraints on search procedures. - How can we decide if the output from a learning program (with a given bias) is correct? PAC analysis for worst case performance Empirical testing - How can we determine that a bias is correct? Empirical comparison of different biases - Unsupervised concept learning and discovery Conceptual clustering Discovery %---------------------------------------------------------------------------- Explanation based learning: Using a causal analysis of the example to constrain the generalization process. History First International Machine Learning Workshop, University of Illinois, 1983 DeJong Silver Mostow Mitchell & Keller Winston Similarity-based learning needs many examples Similarity-based learning can make dumb generalizations Consider a "fork" in tic-tac-toe. From one experience, you can learn that if your opponent has two ways to win and you can only block one of them, then the game is lost. What makes this powerful generalization possible from only one experience? KNOWLEDGE. Similarity-based learning is empirical and data-intensive. Explanation-based learning is analytical and knowledge-intensive. Explanation-based learning: Input: a training example -- what the learning program sees a goal concept -- a high level description of what the program is supposed to learn an operationality criterion -- a description of which concepts are usable a domain theory -- a set of rules that describe relationships between objects and actions in a domain Output: a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion Algorithm: step 1: explain prunes away all unimportant aspects of the training example w.r.t. the goal concept, leaving an explanation of why the training example is an instance of the goal concept, expressed in terms that satisfy the operationality criterion step 2: generalize generalize the explanation as far as possible while still describing the goal concept Problems: depends strongly on a domain theory domain theory must be complete and correct why are examples needed if such a domain theory exists? (because don't want all possible characterizations, only those that actually appear in the world) provably correct deductive explanations Extensions: incomplete and inconsistent domain theories explanation-based learning with plausible explanations explanation-based learning at the knowledge level multistrategy explanation-based and similarity-based learning multistrategy explanation-based and case-based learning explanation-based index learning learning from negative examples analogical reasoning based on explanation-based learning learning form from functional descriptions %---------------------------------------------------------------------------- [DeJong 83] Two phases: EXPLANATION: Connect known concepts or operators together during understanding or problem solving using whatever method you want (e.g., LEX's search tree; PAM's goal and plan analysis; etc.) GENERALIZATION: Generalize a new concept or operator based on this explanation, based on an analysis of how and why all the pieces fit together. Example: Explanation-based generalization [Mooney and DeJong] EGGS/GENESIS's kidnap story: Fred is the father of Mary and is a millionaire. John approached Mary. She was wearing blue jeans. John pointed a gun at her and told her we wanted her to get into his car. He drove her to his hotel and locked her in his room. John called Fred and told him John was holding Mary captive. John told Fred that he would release Mary if Fred gave him $250,000 at Treno's restaurant. Fred gave him the money. John released Mary. R|- poss(F,250k) poss(J,250k) <-|r | <--- millionaire(F) ** |-- bargain <--|- goal-ord(F,free(M) > poss(F,250k)) free(M) <-|r | <-I- father(F,M) R|- captive(J,M) | <-r- capture(J,M) <------------- | <-R- free(M) | R|- belief(F,captive(J,M)) |I | <-r- phone(J,F,captive(J,M)) | | <-R- belief(J,captive(J,M)) R|- belief(J,poss(F,250k)) | <-I- belief(J,millionaire(F)) ** R|- belief(J,goal-ord(F,free(M) > poss(F,250k))) | <-I- belief(J,father(F,M)) M|- goal-ord(J,poss(J,250k) > captive(J,M)) <-I- goal(J,poss(J,250k)) Generalization: names -> variables father -> pos-ipt phone -> mtrans delete lines marked "*" Generalization methods: 1. Replace constants that nothing depends on by variables. 2. Replace constants that have dependencies with variables with attached constraints. 2. Delete "dead-end" chains that nothing depends on. 3. Generalize predicates and relationships. Features: 1. Generalization from one instance 2. Using causality to constrain feature selection and generalization. 3. Knowlege intensive. 4. Increased efficiency rather than increased knowledge. 5. A form of operationalization (making existing knowledge operational (effective, efficient to use)). 6. Similar to STRIPS/PLANEX. %---------------------------------------------------------------------------- Mitchell: LEX2 Similar to LEX, but explain WHY the particular search path worked by PROPAGATING CONSTRAINTS BACKWARDS THROUGH SEARCH TREE. Start with last operator, see why it was applicable (look at its preconditions), then see what it was about the previous state that allowed the previous operator to set up these preconditions, and so on all the way back. Ends up with a generalized description of the types of problems for which that sequence of transformations will work. LEX2 algorithm: Explanation-Based Generalization (EBG) Given: - Goal concept: A concept definition describing the concept to be learned. (The goal concept fails the operationality criterion.) - Training example: An example of the goal concept. - Domain theory: A set of rules and facts to be used in explaining how the training example is an example of the goal concept. - Operationality criterion: A predicate over concept definitions, specifying the form in which the learned concept definition must be expressed. Determine: a generalization of the training example that is a sufficient concept definition for the goal concept and that satisfies the operationality criterion. Method: - EXPLAIN: Construct an explanation in terms of the domain theory that proves how the training example satisfies the goal concept definition. The explanation must be constructed so that each branch of the explanation structure terminates in an expression that satisfies the operationality criterion. - GENERALIZE: Determine a set of sufficient conditions under which the explanation structure holds, stated in terms that satisfy the operationality criterion. This is accomplished by regressing the goal concept through the explanation structure. The conjunction of the resulting regressed expressions constitutes the desired concept definition. Example: Goal concept: CUP x is a CUP if x is liftable, stable, and open-vessel Operationality criterion: Concept description must be expressed in purely structural terms, such as light, flat, etc. Domain theory: if x is light and has a handle y, then x is liftable if x has a bottom y which is flat, then x is stable if x has a concavity y which is upward-pointing, then x is open-vessel etc. Training example: foo23 is a CUP foo23 is owned by Ralph foo23 has a concavity foo30 foo30 is upward-pointing foo23 is light foo23 has color brown etc. Explain step: Build the following explanation: foo23 is a CUP, because: foo23 is liftable, because: foo23 is light and foo23 has a part foo44 and foo44 is a handle and foo23 is stable, because: foo23 has a part foo87 and foo87 is a bottom and foo87 is flat and foo23 is an open-vessel, because: foo23 has a part foo30 and foo30 is a concavity and foo30 is upward-pointing Generalize step: generalize the above explanation by regressing the goal concept (CUP) through the explanation structure. Note that Ralph, brown, etc. do not appear in the explanation, so the learned concept of CUP does not mention owner, color, etc. as being important for cupness. The learned definition for cup(x) is: x has a part y and y is a concavity and y is upward-pointing and x has a part z and z ia a bottom and z is flat and x has a part w and w is a handle and x is light Mooney and DeJong's EGGS/GENESIS system has a slightly more general algorithm but it does essentially the same thing. %---------------------------------------------------------------------------- Issues in analytical learning Speedup learning Macro operators Learning control knowledge - What is learned? Evaluation functions Operator strengths Operator selection and rejection rules - What needs to be explained? Target concept - How are explanations constructed? Trace of problem solving Analysis (usually abduction) Case-based explanation - How should a specific explanation be generalized? How to generalize? Goal regression Weakest preconditions Constraint posting Unification How far to generalize? - What if the domain theory is imperfect? - Psychological plausibility? - Goals - what to learn? %---------------------------------------------------------------------------- Analogical Learning Metaphor (Examples from Rich and Knight) Last month, the stock market was a roller coaster. Bill is like a fire engine. Problems in electromagnetism are just like problems in fluid flow. Example: the atom is like the solar system Two main issues: - how to retrieve the ``right'' analogy? - how to make the ``right'' transfer? What can be transferred? Anything: fillers, slots, schemas, etc. How to transfer: - based on causal connections, or explanations, of how the transferred knowledge holds together Several methods for analogical transfer: - transfer everything, then evaluate the new schema in some context (variation of generate and test); - ask a teacher what to transfer; - use some other knowledge as a filter to decide what to transfer. %---------------------------------------------------------------------------- Issues in analogical and case-based learning Using pre-existing knowledge inductively Access Indexing Mapping Adaptation Evaluation Evaluation Learning Learning Access/Retrieval Issues: Similarity assessment/Matching Indexing What is an index? Choosing indices without knowing future use Predictive Observable or Easy to compute (Operational) Memory organization Content vs. process Spreading activation Partial match retrieval Case selection: Choosing the appropriate analogy or case Mapping/Adaptation Issues: Structural/syntactic, structure-mapping, systematicity Semantic, importance-guided, pragmatic, purpose-directed Creativity Reasoning from multiple analogs or cases Evaluation Issues: Plausibility checking, non-deductive Repairing or refining the analogy Learning Issues: Storing transferred knowledge Generalizing transferred knowledge Learning from failures Index learning %---------------------------------------------------------------------------- Analogy between concrete situations (Example from Anderson 1983) R O N Y A B C D ____________________ ____________________ Given: RO = NY Given: AB > CD Prove: RN = OY Prove: AC > BD Proof: RO = NY ON = ON So RO+ON = ON+NY Mapping: A --> R Proof: AB > CD B --> O BC > BC C --> N (absurd) D --> Y > --> = Problem: what to map? Analogy between abstract situations (Example from Burstein 1983) A variable is like a box You can put the number 5 in the variable X by saying X = 5 Problem: What does X = Y mean? Tranformational analogy (Carbonell 1983) Find a previous problem similar to current problem Copy or "tranform" its solution Solutions are viewed as states in a problem space called the T-space Solution tranformation operators are called T-operators Reasoning by analogy = search in T-space New problem ---------------------> Old problem || || vv New solution <--------------------- Old solution TRANSFORM E.g., above geometry problem Derivational analogy (Carbonell 1986) Find a previous problem similar to current problem Copy the derivation of the solution to the previous problem New problem ---------------------> Old problem | || | New derivation |<=========================================| Old derivation v v New solution <--------------------- Old solution TRANSFORM E.g., writing a sort program in Lisp after having written it in Pascal Case-based reasoning Algorithm Use problem description to get reminded of old case Retrieve one or more cases Choose best one Retrieve results of processing old case (lessons, XPs, plans) Adapt results/solutions from old case to specifics of new situation Apply adapted results to new situation Evaluate new solution Learn (Update memory) "Anticipate and avoid" vs. "Create and debug" Architecture of a CBR system -- See notes on "Case-Based Planning" Assumptions behind CBR (from Ram 1992): EFFICIENCY ASSUMPTION: It is more efficient to retrieve and apply larger knowledge structures than to construct them from scratch each time out of smaller knowledge structures or inference rules. CONTENT ASSUMPTION: There are too many possible ways in which inference rules can be connected together, many of which will be irrelevant or meaningless. The content of the explanations produced from cases is likely to be better than those produced through exhaustive search through inference rules, because cases contain experiential knowledge about the ways in which the rules are actually connected together in real situations. TYPICALITY ASSUMPTION: Situations encountered in the real world are typical of the kinds of situations that are likely to be encountered in the future. Thus it is worthwhile creating a new case to represent novel experiences because remembering this case will make it easier (by virtue of the efficiency assumption and the content assumption) to process similar situations in the future. %---------------------------------------------------------------------------- Summary of main issues: Inductive or similarity-based learning Inductive bias Making inductive leaps Overfitting Analytical or explanation-based learning Need strong domain theories Operationality criterion Utility problem Analogical or case-based learning Indexing problem Adaptation Initial knowledge, bootstrapping %---------------------------------------------------------------------------- We've talked about: Generalization Induction Incremental learning Explanation Tracking down failures But not about putting learning into context: the issue of CONTROL: When should generalization happen? What should be generalized? What should be learned? How to control and focus the process? Three types of answers: 0. No control; we're interested in the learning process, not when it's triggered or how it's controlled. 1. Learning is triggered whenever there's a failure (e.g., Schank) or whenever there's a success (e.g., DeJong) or whenever there is a need or goal to learn (e.g., Hunter, Ram) or focussed by the problem solver (e.g., Carbonell, Mitchell, Schank, Sussman). 2. No overarching control; not because we're not interested in control, but because we're interested in learning through exploration, creativity, experimentation (e.g., Mitchell, Lenat). %---------------------------------------------------------------------------- Learning by exploration, learning by discovery, creativity, theory creation Lenat 1982, 1983 - Two programs: AM, EURISKO Programs that learn by exploring a domain, looking for interesting patterns and generalizations. AM's domain: elementary mathematics. EURISKO: several domains, including 3-d VLSI circuits, design of battle fleets for a space warfare game. Maintain a database of concepts (e.g., "set" and "function" in the math domain; "spaceship" in the space game domain). Concepts represented in a frame-like slot and filler notation. E.g., Name: Set Specializations: Empty-set, Nonempty-set, ... Generalizations: Unordered-structure, ... Examples: Typical: {{}}, {A, B} Boundary: {} Worth: 600 Name: Primes Statement: English: Numbers with two divisors Specializations: Odd-primes, Small-primes, Pair-primes Generalizations: Positive-natural-numbers Is-a: Set Examples: Typical: 5, 7, 11, 13, 17, 19 Typical-non-examples: 34, 100 Boundary: 2, 3 Boundary-non-examples: 0, 1 Conjectures: Good: Unique-factorization, formula-for-d(n) Good-units: Times, Divisions-of, Exponentiate, Numbers-with-three-divisors, Squaring Analogies: Simple-groups Origin: Divisions-of-1 (doubletons) Defined-using: Divisions-of Creation-date: 3/19/76 18:45 History: N-good-examples: 840 N-bad-examples: 5000 N-good-conjectures: 3 N-bad-conjectures: 7 Worth: 800 Works by creating (inventing) new concepts and filling in their slots, in the hope that some of these new concepts will turn out to be interesting. Works by adaptation/mutation (similar to SWALE). AM started with the concepts of "set" and "bag", and discovered numbers, addition, multiplication, prime numbers, etc. It conjectured that all numbers have unique prime factorizations. EURISKO designed battle fleets that won a national tournament two years in a row. Underlying idea: EXPLORATION instead of SEARCH. Many operators, but no goal state; just keeps trying to make more and more interesting discoveries. Can't search the space exhaustively; instead, maintains an AGENDA of tasks to do, ranked by how interesting they appear to be. A TASK is a program that tries to fill in a slot in a concept, usually creating more concepts, slots and tasks along the way. Example of a task: Activity: Fill in some entries Facet: Generalizations Concept: Primes Reasons: 1. Only one generalization of primes known so far. 2. Interestingness rating of primes is very high. 3. Focus of attention: AM has just been working on primes. 4. Very few examples of primes are known; if more were known, the concept would be more interesting. Priority: 350 (scale 0-1000) No task network, because no planning of tasks, because no protection violations. The main issue is what to spend time on. If the system has a choice between thinking about prime numbers, and thinking about sets with even numbers of elements, then it should spend time on the former. The trick is in computing INTERESTINGNESS in such a way that useful tasks usually don't get neglected. Uses a set of HEURISTICS to work on a task. Heuristics can: Suggest new tasks Create new concepts Fill in slot entries for specific facets Example of a heuristic: If the current task is to Check Examples of X and for some Y, Y is a generalization of X and Y has at least 10 examples and all examples of Y, ignoring boundary cases, are also examples of X Then conjecture that X is really no more specialized than Y and add it to the Examples slot of Conjectures and check the truth of this conjecture on the boundary examples of Y and add X to the Generalizations slot of Y and add Y to the Specializations slot of X and add the following task to the agenda: Check Examples of Y, with interestingness = 0.4 (Number of Examples of Generalizations of Y) + 0.3 (Number of Examples of Y) + 0.3 (Rating of current task) [Why this task? Because if Y is really the same concept as Y, some generalization Z of Y may turn out to be basically the same as X. At least, it's worth thinking about. If there are a lot of examples, this is more likely and more worth thinking about.] Example: AM used this heuristic while checking examples of the concept of "odd prime numbers" (which it had earlier created). The heuristic was retrieved with X = odd prime numbers. To evaluate the IF part of the rule, it bound Y = prime numbers because, barring the boundary case "2", all known examples of Y were also examples of X. Thus it concluded that all prime numbers, barring "2", were odd. Note that AM never proved its conjectures, it only discovered them. Some heuristics create new concepts by mutating old concepts. E.g., consider the concept of list equality: Name: equal? Definition: (lambda (x y) (cond ((and (null? x) (null? y)) 't) ((eq? (car x) (car y)) (equal? (cdr x) (cdr y))) (t nil))) This can be mutated in several ways. E.g., by dropping a test in the conditional, AM gets: Name: shmequal? Definition: (lambda (x y) (cond ((and (null? x) (null? y)) 't) ( <<< DROPPED CLAUSE (shmequal? (cdr x) (cdr y))) (t nil))) The next heuristic is to try and generate examples of the proposed concept. AM does not know about list length as yet, but it does pairs of lists that satisfy the definition. Ultimately, from length it finds numbers, and so on. AM is a collection of heuristics. In EURISKO, heuristics themselves were concepts, and heuristics could work on heuristics. EURISKO mutated and invented, not only concepts in the domain, but also heuristics themselves. It ultimately evolved a set of heuristics that worked well. AM's heuristics were not inspectable because they were just Lisp code, but in EURISKO heuristics were represented in the frame notation and the system could reason about them (an advantage of declarative representations). Note that AM and EURISKO have no way to control their search, because there is no good way to judge interestingness in general. In reality, much of the control was done manually by Lenat. AM and EURISKO don't have a way of evaluating the truth or utility of what they invent. AM invented several good concepts, but many many of its conjectures were boring (even if true). Lenat sifted through them to pick out the interesting ones (like the unique prime factorization conjecture). Why does AM appear to work? Because there is a close mapping between the structure of the concepts (Lisp code) and the semantics of the concepts. Deleting conjuncts from Lisp code is actually a semantically meaningful operation, and often yields interesting results. The trick is in getting the representation right; if you can represent the domain in a way such that syntactic manipulations of the representations are actually semantically meaningful and interesting, then this kind of creativity through mutation will work. Creative case-based reasoning (e.g., SWALE) relies on the same constraints. Lenat talks about the "sources of power" of the programs -- important to think about when doing AI research. %---------------------------------------------------------------------------- Copyright (c) Ashwin Ram, 1990-93 Assistant Professor, College of Computing Georgia Institute of Technology, Atlanta, Georgia 30332-0280 E-mail: