The learning process of the Geo can be broadly described as follows: 1. The Geo transitions into a state (at the beginning, and immediately after a reset, this is defined as the first step of the dance) and begins performing random actions
2. After each random action, the Geo waits briefly for user feedback
3. The user interacts with the Geo by providing rewards, punishments, and guidance
4. Depending on the user's interactions, the Geo's confidence in an action increases or decreases, and its selection of which action to perform changes
5. Depending on its confidence, the Geo shortens both the wait time between actions and the number of feedback-free actions to perform before transitioning
6. When the number of consecutive actions with no feedback reaches a threshold c (which depends on the confidence level), the Geo transitions to the next state (the next step of the dance)

The three core components of this learning process are Guided Epsilon Greedy exploration, the idea of Confidence, and Direct Policy Learning for mapping between steps of the dance and the dance moves to perform at those steps.

Guided Epsilon Greedy: We define epsilon (between 0 and 1) as the probability that the Geo performs a random action from the set of actions instead of the action with the highest value, as indicated by Direct Policy Learning. Epsilon decreases with increasing Confidence until it reaches 0. If guidance is provided, the Geo restricts its action selection to the subset of actions associated with the guidance signal, while retaining all the other properties of Guided Epsilon Greedy.

Confidence: Confidence is defined as the magnitude of the difference between the Direct Policy values of the two highest-rated actions at the current state (step of the dance). Its value, discretized into three categories (low, medium, and high confidence), influences behavior at many levels, including action selection, transitions between states, and the wait for feedback.

Direct Policy Learning: Direct Policy Learning is a simple mechanism that records the count of positive and negative rewards given to each action in a state. Thus, if action A was rewarded twice but action B only once, A will have twice the Direct Policy value of B.
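The three components above can be sketched in code. The following is a minimal illustration, not the actual implementation: the class and function names, the confidence thresholds, and the epsilon schedule are all assumptions introduced here for clarity; only the net reward counting, the confidence-as-gap definition, and the guided epsilon-greedy selection rule come from the description above.

```python
import random

LOW, MEDIUM, HIGH = "low", "medium", "high"

class DirectPolicy:
    """Direct Policy Learning: record net reward counts per (state, action)."""
    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = {}  # (state, action) -> rewards minus punishments

    def value(self, state, action):
        return self.counts.get((state, action), 0)

    def update(self, state, action, reward):
        # reward is +1 for a user reward, -1 for a punishment
        self.counts[(state, action)] = self.value(state, action) + reward

    def confidence(self, state):
        # Confidence: gap between the two highest Direct Policy values,
        # discretized into three categories (thresholds are assumptions).
        ranked = sorted((self.value(state, a) for a in self.actions),
                        reverse=True)
        gap = ranked[0] - ranked[1]
        if gap >= 4:
            return HIGH
        if gap >= 2:
            return MEDIUM
        return LOW

def epsilon_for(confidence):
    # Epsilon decreases with confidence and reaches 0 at high confidence;
    # the exact schedule here is an assumption.
    return {LOW: 0.8, MEDIUM: 0.3, HIGH: 0.0}[confidence]

def select_action(policy, state, guidance=None):
    """Guided Epsilon Greedy: explore with probability epsilon; if guidance
    is given, restrict choices to the guided subset of actions."""
    candidates = guidance if guidance else policy.actions
    if random.random() < epsilon_for(policy.confidence(state)):
        return random.choice(candidates)                           # explore
    return max(candidates, key=lambda a: policy.value(state, a))   # exploit
```

For example, if the user rewards "spin" twice and "wave" once in state 0, "spin" ends up with twice the Direct Policy value of "wave", and once the gap between the top two actions is large enough, epsilon drops to 0 and selection becomes purely greedy.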