Chris Simpkins
December 3, 2008
By embedding reinforcement learning into an agent programming language, we can enable non-programming experts to write adaptive agents
Current tools, like NetLogo, are fine for simple problems
Java and ABL handle more complex problems, but are hard to use for non-programmers
In each case, very hard to scale to more complex agents
A basic problem: real agents are adaptive by nature, but encoding adaptive behavior in virtual agents is very hard - lots of code to write, high cognitive burden
Some problems not well suited to traditional programming. Another way to think of programming: adaptive agents.

Partial programming: code known parts of solution, leave run-time system to learn the rest
Many ways to achieve adaptive software
Partial programming provides a way to mix human-authored and machine-authored (learned) behavior
Partial programming exploits domain knowledge when available
Reinforcement learning a natural fit
Think of agent in terms of RL problem - states, actions, rewards
System learns action selection policy
Partial programming reduces cognitive load on programmer
Many languages "include" RL
Our language based on ABL. Why?
Key differentiators:
Why a new language, rather than a library?
Expressiveness
Optimization
Safety
An API that achieved our goals would end up looking an awful lot like a language anyway.
Towards Adaptive Programming: Integrating Reinforcement Learning into a Programming Language. Christopher Simpkins, Sooraj Bhat, Charles Isbell, Jr., and Michael Mateas. OOPSLA '08: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, Onward! Track, Nashville, TN USA, October 2008





Adaptive software uses available information about changes in its environment to improve its behavior (Norvig and Cohn 1998). Adapts at run-time
Our way: partial programming, enabled by integrating reinforcement learning into a programming language
Partial programming: code known parts of solutions, run-time system learns remainder at run-time
Separates the what of agent behavior from the how
This is all you have to know about RL to code adaptive agents in A2BL:
States - agent can be in one of a finite number of states
Actions - agent has a set of actions available in each state that affects the state of the world, i.e., moves the agent to a different state
Rewards - each state gives the agent feedback in the form of a scalar reward signal
RL algorithm learns control policy - given a state, what action leads to best long-term cumulative reward, i.e., more rewarding actions are reinforced







World state is learned by the agent, but specified separately, possibly by a different programmer - a way for subject matter expert to inject domain knowledge
Conflicts between subgoals resolved by A2BL’s arbitrator
Probabilistic reasoning without the programmer having to think explicitly about stochasticity
Decouples what from how - add a new subgoal and you only have to code it once
RL + PL = Partial Programming
Partial Programming enables adaptive programming
Partial programming facilitates a discipline of agent software engineering
A2BL based on work of Charles, Michael Mateas, and Sooraj
My contribution: provide an intellectual framework for thinking about and developing A2BL - it's a language design and software engineering problem
Why is this research?
No other advanced language currently aimed at non-programmers
No other language with embedded RL has practical focus
No current RL language explicitly enlisting participation of PL/SE communities
Intellectual framework: A2BL a new programming paradigm
PL/SE Issues
Usability - visual programming, high- vs. low-level
Structuring mechanisms for large systems
Run-time efficiency
Debugging
OOP
Tie-in with Model-based Meta-Reasoning and TKML (Task-Method-Knowledge Language)
Using Model-Based Reflection to Guide Reinforcement Learning (Ulam, Goel, Jones, Murdock)
Similar to our partial programming idea - knowledge representation specifying domain knowledge, learning component to adapt at run-time
Why Modular Reinforcement Learning (MRL)? To facilitate software engineering using partial programming, we will need true sub-agent modularity, which is a key feature of MRL
MRL is a framework in which a complex, multiple-goal agent is decomposed into several single-goal sub-agents that run concurrently. More later ...
Defined a reformulation of the arbitration problem that relaxes one of the properties that makes ideal arbitration impossible in full generality
Designed a first-cut algorithm to solve our reformulated MRL problem: Hydra-Q
Built a first-cut implementation Hydra-Q. Didn't get publishable results but have a draft paper
Using Content Analysis to Investigate The Research Paths Chosen by Scientists Over Time. Chiara Franzoni, Christopher Simpkins, Li Baoli and Ashwin Ram. 11th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2007)
GTRI-STL used genetic algorithms to design antennas. Had become frustrated by its stalling in local optima
Implemented a MIMIC-based antenna designer
Using STL researchers' pre-built database of 18-bit planar-array antennas, repeated GA validation experiments using MIMIC implementation. Found MIMIC to be far superior to GA due to the expense of the antenna evaluation function
Paper in preparation
It's been an honor. Thank you for your time and support.
Why Modular Reinforcement Learning (MRL)? To facilitate software engineering using partial programming, we will need true sub-agent modularity, which is a key feature of MRL
MRL is a framework in which a complex, multiple-goal agent is decomposed into several single-goal sub-agents that run concurrently. More later ...
Defined a reformulation of the arbitration problem that relaxes one of the properties that makes ideal arbitration impossible in full generality
Designed a first-cut algorithm to solve our reformulated MRL problem: Hydra-Q
Built a first-cut implementation Hydra-Q. Didn't get publishable results but have a draft paper
Succinctly summarized in Sprague and Ballard, 2003
General problem: multiple MDPs with distinct state spaces and a shared action set
Greatest Mass Q-Learning (GM-Q): sum Q-values over sub-agent Q-learners for each action. Action with highest summed Q-value ("greatest mass") is selected for execution
Top-Q: Winner take all - Action with highest Q-value is selected
Negotiated-W: Select action from sub-agent with most to lose
Sprague and Ballard: Q-Learning approaches suffer from off-policy nature of Q-Learning. Convergence characteristics of Q-learning occur under assumption that optimal actions will be taken in future, but this doesn't happen when multiple agents are "compromising" in their action selection
Sprague and Ballard's solution: GM-Sarsa(0) - like GM-Q but with on-policy Sarsa(0) algorithm
Multiple-Goal reinforcement learning approaches are not truly modular because sub-agent rewards must be comparable, thereby coupling all sub-agents because sub-agent rewards cannot be independently authored
A possible solution is arbitration: have an arbitrator sub-agent that selects actions based on other sub-agent preferences
But ideal arbitration is impossible in full generality because it reduces to Arrows Impossibility Theorem. What to do ...
Concurrently Hierarchical Adaptive Reinforcement LEarning Setup
A pragmatic reformulation of arbitration
A "greater good" sub-agent with its own reward signal
Relaxes the non-dictatorship property of ideal arbitration
Basic idea: each sub-agent has an associated weight. Hydra-Q learns these weights, uses these weights to select a sub-agent action preference for execution
Hydra-Q is a Q-learning algorithm where the state space is a cross product of sub-agent state spaces (yeah, I know, that's huge) and the action set is a set of sub-agent weights (action selection is then trivially accomplished using these weights)
If we can come up with an acceptable first-cut Hydra-Q, then we can work on state abstraction
Use research to solve practical engineering problems
Want to build working systems
Want to improve the engineering of working systems in fundamental ways, for example, with new paradigms
Cross-specialization - apply techniques from one area to solve problems in another
Develop thesis to the point where IARPA/DARPA/DIA will fund it
Return to GTRI with PhD and a body of work to build on
Obtain funding to continue A2BL work via joint GTRI/CoC effort, employing GTRI researchers, CoC faculty, and CoC students
RL Decomposition
Advantages
Factor out subproblem - don't have to repeat solution
May be abstractions available in subproblems
Disadvantages
How do you know how to decompose
How do you re-combine subproblems
If no abstractions or reuse, does not apply
Function Approximation
Advantages
Efficient representation
Generalization
Ability to bias the representation
Ability to represent continuous state spaces
Disadvantages
Have to come up witha function model and features. E.g., linear combination of ...
Tables converge, function approximators don't necessarily
Don't work well with "rough" functions
Can function approximation work well with problem decomposition?
Compare Classical Planning with Markov Decision Process (MDP) Planning
Is there any advantage to Classical Planning (logic-based) as compared to MDPs? List any advantages you find.
Describe a planning domain where the advantage of classical planning is apparent.
Show this advantage by representing your domain as both a logical domain and an MDP.
When do MDPs have an advantage (in general, not just this domain)? List at least two advantages and explain.
MDPs have an advantage in stochastic environments.
MDPs handle exogenous events, that is, state transitions not under the control of the agent.
Can we get the best of both worlds? How?
Describe the circumstances and domains under which partial-order planners will dominate other approaches.
Some believe that traditional metrics such as speed, ratio of problems solved, and plan length ignore critical properties of planners. What other criteria could be considered? What features of partial-order planning algorithms, and partial-order plan representations might be considered advantageous over other approaches when considering these new criteria?