# PhD Qualifier Chris Simpkins December 3, 2008 # $A^2BL$: Advanced Agent Programming for All - By embedding reinforcement learning into an agent programming language, we can enable non-programming experts to write adaptive agents - Lower the threshold for agent programming - Maintain, even extend the ceiling for agent programming # Motivation - Many people use agent-oriented modeling - Current tools, like NetLogo, are fine for simple problems - Java and ABL handle more complex problems, but are hard to use for non-programmers - In each case, very hard to scale to more complex agents - Intelligence analysts modeling terrorists - Game designers writing interactive games and narratives - War game modelers writing adaptive training simulations - A basic problem: real agents are adaptive by nature, but encoding adaptive behavior in virtual agents is very hard - lots of code to write, high cognitive burden # Adaptive Software, Norvig and Cohn, 1998 Some problems not well suited to traditional programming. Another way to think of programming: adaptive agents. ![](norvig-cohn-table.png "") # Adaptive Software via Partial Programming - Partial programming: code known parts of solution, leave run-time system to learn the rest - Many ways to achieve adaptive software - Partial programming provides a way to mix human-authored and machine-authored (learned) behavior - Partial programming exploits domain knowledge when available # Partial Programming via Reinforcement Learning - Reinforcement learning a natural fit - Think of agent in terms of RL problem - states, actions, rewards - *Code known parts of solution...* - System learns action selection policy - *run-time system learns the rest.* - Partial programming reduces cognitive load on programmer - We hope that RL framework will be easy for non-programmers to grasp - certainly easier than a whole general-purpose programming language # From ABL to $A^2BL$ - Many languages "include" RL - Soar - ALisp - Icarus - ... - Our language based on ABL. Why? - ABL already includes much of what we want in an agent PL - ABL focused on production systems - Key differentiators: - $A^2BL$ aimed at non-programmers - $A^2BL$ intended for large-scale agent systems # Why a new language? Why a new language, rather than a library? - Expressiveness - More natural way to write adaptive agents. Important for non-programmers - Optimization - Opportunity for compiler optimizations - Safety - Like typing, compiler can provide many more constraints on programs than an API could An API that achieved our goals would end up looking an awful lot like a language anyway. # How Does $A^2BL$ Improve on ABL - Provides a framework for thinking explicitly about adaptivity - Reduces cognitive load on programmer - Metric for reducing cognitive load: amount of code required ... > Towards Adaptive Programming: Integrating Reinforcement > Learning into a Programming Language. Christopher Simpkins, > Sooraj Bhat, Charles Isbell, Jr., and Michael Mateas. *OOPSLA > '08: ACM SIGPLAN Conference on Object-Oriented Programming, > Systems, Languages, and Applications, Onward! Track*, > Nashville, TN USA, October 2008 # Example: The Furry Creature World ![](oopsla-slides.004.png "") # A Furry Creature Agent in ABL ![](oopsla-slides.005.png "") # A Furry Creature Agent in ABL ![](oopsla-slides.006.png "") # A Furry Creature Agent in ABL ![](oopsla-slides.007.png "") # A Furry Creature Agent in ABL ![](oopsla-slides.008.png "") # Solution: Adaptive Programming - Adaptive software uses available information about changes in its environment to improve its behavior (Norvig and Cohn 1998). Adapts at run-time - Our way: partial programming, enabled by integrating reinforcement learning into a programming language - Partial programming: code known parts of solutions, run-time system learns remainder at run-time - Separates the what of agent behavior from the how # Reinforcement Learning - This is all you have to know about RL to code adaptive agents in $A^2BL$: - States - agent can be in one of a finite number of states - Actions - agent has a set of actions available in each state that affects the state of the world, i.e., moves the agent to a different state - Rewards - each state gives the agent feedback in the form of a scalar reward signal - RL algorithm learns control policy - given a state, what action leads to best long-term cumulative reward, i.e., more rewarding actions are reinforced # The Furry Creature World as an RL Problem ![](oopsla-slides.011.png "") # A Furry Creature in $A^2BL$ ![](oopsla-slides.014.png "") # A Furry Creature in $A^2BL$ ![](oopsla-slides.015.png "") # A Furry Creature in $A^2BL$ ![](oopsla-slides.016.png "") # A Furry Creature in $A^2BL$ ![](oopsla-slides.017.png "") # A Furry Creature in $A^2BL$ ![](oopsla-slides.018.png "") # How Our $A^2BL$ Agent Would Behave ![](oopsla-slides.013.png "") # Advantages of $A^2BL$'s Built-in Adaptivity - World state is learned by the agent, but specified separately, possibly by a different programmer - a way for subject matter expert to inject domain knowledge - Conflicts between subgoals resolved by $A^2BL$’s arbitrator - Probabilistic reasoning without the programmer having to think explicitly about stochasticity - Decouples what from how - add a new subgoal and you only have to code it once # Summary of $A^2BL$ as a PL/SE Issue - RL + PL = Partial Programming - Partial Programming enables adaptive programming - Partial programming facilitates a discipline of *agent software engineering* # What have I done? - $A^2BL$ based on work of Charles, Michael Mateas, and Sooraj - My contribution: provide an intellectual framework for thinking about and developing $A^2BL$ - it's a language design and software engineering problem Why is this research? - No other advanced language currently aimed at non-programmers - No other language with embedded RL has practical focus - No current RL language explicitly enlisting participation of PL/SE communities - Intellectual framework: $A^2BL$ a new programming paradigm # $A2BL$ Future Directions and Research Issues - PL/SE Issues - Usability - visual programming, high- vs. low-level - Structuring mechanisms for large systems - Run-time efficiency - Debugging - OOP - Tie-in with Model-based Meta-Reasoning and TKML (Task-Method-Knowledge Language) - Using Model-Based Reflection to Guide Reinforcement Learning (Ulam, Goel, Jones, Murdock) - Similar to our partial programming idea - knowledge representation specifying domain knowledge, learning component to adapt at run-time # ICML 200?: Hydra: Arbitration for Modular Reinforcement Learning - Why Modular Reinforcement Learning (MRL)? To facilitate software engineering using partial programming, we will need true sub-agent modularity, which is a key feature of MRL - MRL is a framework in which a complex, multiple-goal agent is decomposed into several single-goal sub-agents that run concurrently. More later ... - Defined a reformulation of the arbitration problem that relaxes one of the properties that makes ideal arbitration impossible in full generality - Designed a first-cut algorithm to solve our reformulated MRL problem: Hydra-Q - Built a first-cut implementation Hydra-Q. Didn't get publishable results but have a draft paper # Other Stuff: Text Mining for Scientometrics - Collaborated with Italian economist Chiara Franzoni. Applied text mining to a corpus of physics abstracts to discover patterns of specialization among physics researchers > Using Content Analysis to Investigate The Research Paths Chosen by > Scientists Over Time. Chiara Franzoni, Christopher Simpkins, Li > Baoli and Ashwin Ram. *11th International Conference of the > International Society for Scientometrics and Informetrics (ISSI > 2007)* # Other Stuff: MIMIC for Antenna Design - GTRI-STL used genetic algorithms to design antennas. Had become frustrated by its stalling in local optima - Implemented a MIMIC-based antenna designer - Using STL researchers' pre-built database of 18-bit planar-array antennas, repeated GA validation experiments using MIMIC implementation. Found MIMIC to be far superior to GA due to the expense of the antenna evaluation function - Paper in preparation # Thanks! It's been an honor. Thank you for your time and support. # OOP in $A^2BL$ - Agent’s are kinda like objects, but different - What is a subtype in the agent world? - What does “is-a” mean? - Inherit learning algorithm, learned model? - Contracts in stochastic realm? - What does “has-a” mean? - Intellectually fun - we get to explore the very notion of agency # ICML 200?: Hydra: Arbitration for Modular Reinforcement Learning - Why Modular Reinforcement Learning (MRL)? To facilitate software engineering using partial programming, we will need true sub-agent modularity, which is a key feature of MRL - MRL is a framework in which a complex, multiple-goal agent is decomposed into several single-goal sub-agents that run concurrently. More later ... - Defined a reformulation of the arbitration problem that relaxes one of the properties that makes ideal arbitration impossible in full generality - Designed a first-cut algorithm to solve our reformulated MRL problem: Hydra-Q - Built a first-cut implementation Hydra-Q. Didn't get publishable results but have a draft paper # Multiple-Goal Reinforcement Learning (1 of 2) - Succinctly summarized in Sprague and Ballard, 2003 - General problem: multiple MDPs with distinct state spaces and a shared action set - Greatest Mass Q-Learning (GM-Q): sum Q-values over sub-agent Q-learners for each action. Action with highest summed Q-value ("greatest mass") is selected for execution - Top-Q: Winner take all - Action with highest Q-value is selected - Negotiated-W: Select action from sub-agent with most to lose # Multiple-Goal Reinforcement Learning (2 of 2) - Sprague and Ballard: Q-Learning approaches suffer from off-policy nature of Q-Learning. Convergence characteristics of Q-learning occur under assumption that optimal actions will be taken in future, but this doesn't happen when multiple agents are "compromising" in their action selection - Sprague and Ballard's solution: GM-Sarsa(0) - like GM-Q but with on-policy Sarsa(0) algorithm # Modular Reinforcement Learning, Bhat, et. al., 2006 - Multiple-Goal reinforcement learning approaches are not truly modular because sub-agent rewards must be comparable, thereby coupling all sub-agents because sub-agent rewards cannot be independently authored - A possible solution is arbitration: have an arbitrator sub-agent that selects actions based on other sub-agent preferences - An ideal arbitrator is defined as having the following properties (1) Universality, (2) Unanimity, (3)Independence of Irrelevant Attributes, (4) Scale Invariance, and (5) Non-Dictatorship - But ideal arbitration is impossible in full generality because it reduces to Arrows Impossibility Theorem. What to do ... # CHARLES - Concurrently Hierarchical Adaptive Reinforcement LEarning Setup - Name suggestions welcome! - A pragmatic reformulation of arbitration - A "greater good" sub-agent with its own reward signal - Relaxes the non-dictatorship property of ideal arbitration # Hydra-Q: Learning to Combine Sub-Agent Preferences - Basic idea: each sub-agent has an associated weight. Hydra-Q learns these weights, uses these weights to select a sub-agent action preference for execution - Hydra-Q is a Q-learning algorithm where the state space is a cross product of sub-agent state spaces (yeah, I know, that's huge) and the action set is a set of sub-agent weights (action selection is then trivially accomplished using these weights) - By combining sub-agent state spaces, we get a kind of contextual weighting, e.g., Hydra-Q learns to weight AvoidPredator when the predator is closer - If we can come up with an acceptable first-cut Hydra-Q, then we can work on state abstraction # Research Philosophy - Use research to solve practical engineering problems - Want to build working systems - Want to improve the engineering of working systems in fundamental ways, for example, with new paradigms - Cross-specialization - apply techniques from one area to solve problems in another # Career Plan/Grand Vision - Develop thesis to the point where IARPA/DARPA/DIA will fund it - Return to GTRI with PhD and a body of work to build on - Obtain funding to continue $A^2BL$ work via joint GTRI/CoC effort, employing GTRI researchers, CoC faculty, and CoC students # Written Exam, Machine Learning #2 RL Decomposition - Advantages - Factor out subproblem - don't have to repeat solution - May be abstractions available in subproblems - Disadvantages - How do you know how to decompose - How do you re-combine subproblems - If no abstractions or reuse, does not apply # Written Exam, Machine Learning #2 Function Approximation - Advantages - Efficient representation - Generalization - Ability to bias the representation - Ability to represent continuous state spaces - Disadvantages - Have to come up witha function model and features. E.g., linear combination of ... - Tables converge, function approximators don't necessarily - Don't work well with "rough" functions # Written Exam, Machine Learning #2 Can function approximation work well with problem decomposition? - Yes, in theory, but hasn't ben done yet. Peng is working on it. # Written Exam, Planning and Search #1 Compare Classical Planning with Markov Decision Process (MDP) Planning a) Is there any advantage to Classical Planning (logic-based) as compared to MDPs? List any advantages you find. - Classical planners can exploit domain knowledge better, becuase it can be encoded in forst-order represenataions, for example, that constrain the state transitions that need to be considered. b) Describe a planning domain where the advantage of classical planning is apparent. - Route planning in a public transit system. Here, you have deterministic state transitions, and a small number of simple actions that have straightforward preconditions and effects. # Written Exam, Planning and Search #1 c) Show this advantage by representing your domain as both a logical domain and an MDP. - Classical planner uses a simple determnitic domain theory. MDP only has states, and transitions between states labelled with probabilities. MDP starts with assumption that any state can result in any other state for some action with some probability. MDP would end up looking like d) Prove that the classical planner has better performance. # Written Exam, Planning and Search #1 e) When do MDPs have an advantage (in general, not just this domain)? List at least two advantages and explain. - MDPs have an advantage in stochastic environments. - MDPs handle exogenous events, that is, state transitions not under the control of the agent. f) Can we get the best of both worlds? How? # Written Exam, Planning and Search #3 a) Describe the circumstances and domains under which partial-order planners will dominate other approaches. - Graphplan and SAT approaches use conjuctions of boolean literals representing the entire problem state at every step. This limits the domains they can express and can require massive amounts of storage when there are several such literals. For such domains, like logistics problems with dozens of locations and objects, partial-order planners using first-order representations do much better. # Written Exam, Planning and Search #3 b) Some believe that traditional metrics such as speed, ratio of problems solved, and plan length ignore critical properties of planners. What other criteria could be considered? What features of partial-order planning algorithms, and partial-order plan representations might be considered advantageous over other approaches when considering these new criteria? - Decomposability and paralellizability. Partial-order plan representations exploit the inherent parallelism in problems.