The learning process of the Geo can be broadly described as follows: 1. The Geo transitions into a state (at the beginning, and immediately after a reset, this is defined as the first step of the dance) and begins performing random actions
2. After each random action, the Geo waits briefly for user feedback
3. The user interacts with the Geo by providing rewards, punishments, and guidance
4. Depending on the user's interactions, the Geo's confidence in an action increases or decreases, and its selection of which action to perform changes
5. Depending on its confidence, the Geo shortens both the wait time between actions and the number of feedback-free actions to perform before transitioning
6. When the number of consecutive actions with no feedback reaches a threshold c (which depends on the confidence level), the Geo transitions to the next state (the next step of the dance)

The three core components of this learning process are Guided Epsilon Greedy exploration, the idea of Confidence, and Direct Policy Learning for mapping between steps of the dance and the dance moves to perform at those steps.

Guided Epsilon Greedy: We define epsilon (between 0 and 1) as the probability that the Geo performs a random action from the set of actions instead of the action with the highest value, as indicated by Direct Policy Learning. Epsilon decreases with increasing Confidence until it reaches 0. If guidance is provided, the Geo restricts its action selection to the subset of actions associated with the guidance signal, while retaining all the other properties of Guided Epsilon Greedy.

Confidence: Confidence is defined as the magnitude of the difference between the Direct Policy values of the two highest-rated actions at the current state (step of the dance). Its value, discretized into three categories (low, medium, and high confidence), influences behavior at many levels, including action selection, transitions between states, and the wait for feedback.

Direct Policy Learning: Direct Policy Learning is a simple mechanism that records the count of positive and negative rewards given to each action in a state. Thus, if action A was rewarded twice but action B only once, A will have twice the Direct Policy value of B.
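The three components above can be sketched in code. The following is a minimal illustration, not the actual implementation: the class and function names, the confidence thresholds, and the epsilon schedule are all assumptions introduced here for clarity; only the net reward counting, the confidence-as-gap definition, and the guided epsilon-greedy selection rule come from the description above.

```python
import random

LOW, MEDIUM, HIGH = "low", "medium", "high"

class DirectPolicy:
    """Direct Policy Learning: record net reward counts per (state, action)."""
    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = {}  # (state, action) -> rewards minus punishments

    def value(self, state, action):
        return self.counts.get((state, action), 0)

    def update(self, state, action, reward):
        # reward is +1 for a user reward, -1 for a punishment
        self.counts[(state, action)] = self.value(state, action) + reward

    def confidence(self, state):
        # Confidence: gap between the two highest Direct Policy values,
        # discretized into three categories (thresholds are assumptions).
        ranked = sorted((self.value(state, a) for a in self.actions),
                        reverse=True)
        gap = ranked[0] - ranked[1]
        if gap >= 4:
            return HIGH
        if gap >= 2:
            return MEDIUM
        return LOW

def epsilon_for(confidence):
    # Epsilon decreases with confidence and reaches 0 at high confidence;
    # the exact schedule here is an assumption.
    return {LOW: 0.8, MEDIUM: 0.3, HIGH: 0.0}[confidence]

def select_action(policy, state, guidance=None):
    """Guided Epsilon Greedy: explore with probability epsilon; if guidance
    is given, restrict choices to the guided subset of actions."""
    candidates = guidance if guidance else policy.actions
    if random.random() < epsilon_for(policy.confidence(state)):
        return random.choice(candidates)                           # explore
    return max(candidates, key=lambda a: policy.value(state, a))   # exploit
```

For example, if the user rewards "spin" twice and "wave" once in state 0, "spin" ends up with twice the Direct Policy value of "wave", and once the gap between the top two actions is large enough, epsilon drops to 0 and selection becomes purely greedy.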