Reinforcement Learning
The purpose of this exercise is to implement an agent using reinforcement learning that can play a simple game to rescue humans from a building that has been flooded with radiation.
Reinforcement learning is a technique that uses trial-and-error episodes to learn to play a game as close to optimally as possible. By applying reinforcement learning to a game, we aim to play as well as, or better than, humans. Variants of reinforcement learning have been used to play Backgammon, Go, and other board games at skill levels superior to humans, and to train agents that play Atari games at, and sometimes above, human skill levels.
Our agent will learn to play a simple game called "radiation world" which was originally designed for rescue scenarios. In radiation world, humans have been trapped in a building because of a radiation leak. There are irradiated areas that humans cannot safely move through. However, a robot can move through them safely. Your robot must navigate the building to reach each human.
The following is an example game map. Black indicates walls. Red indicates areas with radiation. Blue indicates grid cells containing humans. Green is the base station for the robot.
The agent can move left, right, up, or down. If it enters a grid cell with a human, then that human is saved.
One last thing: there is an enemy robot roaming around the building that will try to hurt the agent if they are ever in the same location for more than one turn. Our agent has one additional action, "smash," which can destroy the enemy robot.
The goal of a reinforcement learning agent is to get as many points as it can during a game. Thus, the definition of optimal behavior depends on how score is computed. The default way score is computed is as follows:
- 20 points per turn for each human rescued
- 10 points for being at the base station
- -50 points for being hurt by the enemy robot
- -20 points for being in radiation
- -10 points per turn that the enemy robot is no longer functional
- -1 point per turn
(Yes, you lose points if you kill the enemy robot, so with this scoring function the bot will need to learn to avoid it instead.)
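As a rough illustration, the per-turn scoring above could be computed as in the sketch below. This is hypothetical: the function name and flag arguments are invented for illustration, and the real logic lives in Environment.calculateReward(), which may be structured quite differently.

```python
def score_turn(humans_rescued, at_base, hurt_by_enemy, in_radiation, enemy_dead):
    """Hypothetical per-turn score, following the default scoring rules above."""
    points = -1                      # -1 point per turn
    points += 20 * humans_rescued    # +20 per turn for each human rescued so far
    if at_base:
        points += 10                 # +10 for being at the base station
    if hurt_by_enemy:
        points -= 50                 # -50 for being hurt by the enemy robot
    if in_radiation:
        points -= 20                 # -20 for being in radiation
    if enemy_dead:
        points -= 10                 # -10 per turn the enemy is not functional
    return points
```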
You will implement the Q-learning algorithm and test your agent in the radiation world.
What you need to know
Below are the important bits of information about objects that you will be working with or need to know about for this assignment.
Environment
The Environment describes how the simulated world works and also how reward (score) is defined.
Member variables:
- map: Gives the physical environment as a 2D array. map[y][x] gives the (x, y) position.
- 1 = walls
- 2 = radiation
- 4 = goal (non-terminal), a place that reward can be gained over and over again (not used)
- 10 = goal (terminal), a place that reward can be gained once and the game ends (not used)
- 5-9 = unique identifiers for people to be rescued
- width: the width of the map
- height: the height of the map
- influenceMap: tells the enemy agent how to move around the environment so that it doesn't run into walls. Each cell in the 2D array indicates the direction the enemy should move.
- currentState: the state of the world. The state is an array where each index indicates a different feature of the world
- index 0: the bot's x location
- index 1: the bot's y location
- index 2: a boolean indicating whether the enemy is alive or not
- index 3: the enemy's x location
- index 4: the enemy's y location
- index 5: a boolean indicating whether the enemy is torturing the bot
- indices 6-9: a boolean indicating whether the person with that unique identifier has been rescued (True = rescued)
- previousState: the previous state of the world
- startState: the default start state (unless a random start state is generated)
- reward: the amount of points obtained at a goal.
- rescue: the amount of points obtained for each survivor rescued.
- penalty: the amount of points for not being in a goal position.
- pain: the amount of points for being hurt by the enemy.
- radiation: the amount of points for being in a radiation cell.
- dead: the amount of points if the enemy is dead.
- counter: keeps track of the number of turns.
- moveTimer: how often the enemy can move.
- enemyMode: Indicates how the enemy moves:
- 0 = enemy doesn't move
- 1 = enemy moves according to the influenceMap
- 2 = enemy moves randomly
- 3 = enemy chases the bot
- 4 = enemy gets instructions from an external source
- nextEnemyMove: indicates which direction the enemy should move if enemyMode = 4.
- randomStart: should a random start state be used?
- lastActionValue: the last action the bot performed.
- verbose: How much logging information to print. Don't refer to verbose directly, but use isVerbose()
- 0 or False = none
- 1 = print state tuple
- 2 = print map
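For concreteness, a state tuple with the index layout of currentState described above might look like the following (all values are invented for illustration):

```python
# Hypothetical example of the state tuple layout used by currentState:
state = (
    3, 5,    # indices 0-1: the bot's (x, y) location
    True,    # index 2: is the enemy alive?
    7, 2,    # indices 3-4: the enemy's (x, y) location
    False,   # index 5: is the enemy torturing the bot?
    False, False, True, False,  # indices 6-9: rescued flags for each person
)
```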
Member functions:
- validActions(): Return a list of valid actions: [0, 1, 2, 3, 4]
- 0 = up
- 1 = down
- 2 = left
- 3 = right
- 4 = smash (bot only)
- actionToString(act): return the string name for an action.
- env_start(): Starts the simulation
- env_step(thisAction): Update the world based on the given action to be performed by the agent. This function also computes and executes the action the enemy agent performs (if any). The current state is updated by side effect. The function returns the next observation (state) and the immediate reward obtained.
- env_reset(): start the simulation over again.
- checkTerminal(): determines whether the current state is a terminal state.
- executeAction(theAction): updates the current state based on the effects of the given action. You probably shouldn't call this directly; use env_step() instead.
- calculateReward(theAction): looks at the current state and the action passed in (which should be the action just executed by the bot) and returns the amount of immediate reward.
- printEnvironment(): print the world state in human-readable form.
- isVerbose(): use this to check the verbose variable because it handles False and numbers.
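A typical interaction loop with this API might look like the sketch below. The Action stub here stands in for the container class provided with the assignment, and the exact signatures and return types may differ from the real code.

```python
import random

class Action:
    """Stand-in for the provided Action container class."""
    def __init__(self):
        self.actionValue = None

def run_episode(env, max_steps=100):
    """Drive the environment with random actions and accumulate reward."""
    obs = env.env_start()                    # begin the simulation
    total = 0.0
    for _ in range(max_steps):
        act = Action()
        act.actionValue = random.choice(env.validActions())
        obs, reward = env.env_step(act)      # next observation and reward
        total += reward.rewardValue
        if env.checkTerminal():              # stop at a terminal state
            break
    return total
```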
Action
A container class for passing action information around. The member, actionValue, contains a numerical value referring to the action.
Observation
A container class for passing observation (state) information around. The member, worldState, contains the state tuple. Other member variables are not used.
Reward
A container class for passing reward information around. The member, rewardValue, contains the reward information as a floating-point value.
Agent
The agent class controls the bot and implements the Q-learning algorithm.
Member variables:
- lastAction: the last action performed.
- lastObservation: the last observation (state) received.
- epsilon: the probability of taking a random action rather than following the policy.
- gamma: the discount factor
- learningRate: the learning rate
- v_table: the value table, a 2D array where each row is a possible state and each column is a possible action. Implemented as a dictionary where key is state and values are arrays of values where the kth value in the array is for the kth action in Environment.validActions()
- gridEnvironment: a pointer to an Environment object
- initialObs: the initial state
- currentObs: the current state
- numSteps: the maximum number of actions the agent takes before quitting.
- totalReward: total reward received.
- verbose: amount of logging to print. Use isVerbose() to query the variable.
- 0 or False = none
- 1 or True = do logging
- numActions: the number of possible actions in the environment
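To illustrate the v_table layout described above, the snippet below shows how such a dictionary behaves (the state tuple and values are invented):

```python
# v_table maps a state tuple to one value per action, in the order of
# Environment.validActions(): [up, down, left, right, smash].
v_table = {}
state = (3, 5, True, 7, 2, False, False, False, False, False)
v_table.setdefault(state, [0.0] * 5)  # the spirit of initializeVtableStateEntry()
v_table[state][3] = 1.5               # learned value of action 3 (right) here
```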
Member functions:
- initializeVtableStateEntry(state): for a state, make sure there is a corresponding row in the v_table, initialized to 0.0 for all actions.
- executePolicy(observation): Given an initial observation (state), execute the policy (v_table).
- qLearn(observation): Given an initial observation (state), run the agent for numSteps time steps and update the v_table according to the Q-learning algorithm.
- updateVtable(newState, lastState, newAction, lastAction, reward, terminal, availableActions): Used by qLearn() to update the v_table. Everything needed to perform that operation is provided as arguments. The v_table is updated as a side-effect. No return value. Must be completed by the student.
- newState: the state encountered after executing newAction
- lastState: the state the agent just left when executing newAction
- newAction: the action the agent just performed
- lastAction: the last action the agent performed
- reward: the amount of immediate reward received for performing newAction
- terminal: boolean, did the agent just hit a terminal state?
- availableActions: what actions can be performed in this new state?
- eGreedy(observation): Pick a random action epsilon percent of the time, or pick the greedy action otherwise. Return the action picked. To be completed.
- greedy(observation): Pick and return the best action according to the v_table. To be completed.
- agent_reset(): restart the agent and the environment.
- copyObservation(obs): make a new data structure in memory with the same information as the old observation.
- calculateFlatState(theState): make sure the state is a tuple (so it can be used as a v_table key).
- isVerbose(): determine how much logging to print. Use this instead of querying verbose directly.
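To make the three functions you must complete concrete, here is a minimal sketch of the standard Q-learning update and epsilon-greedy selection. The function and parameter names loosely mirror the signatures above, but this is an independent illustration of the algorithm, not the required code structure; your implementations must fit the Agent class as specified.

```python
import random

GAMMA = 0.9          # discount factor (Agent.gamma)
LEARNING_RATE = 0.1  # step size (Agent.learningRate)
EPSILON = 0.1        # exploration probability (Agent.epsilon)

def update_q(v_table, last_state, action, reward, new_state, terminal, available_actions):
    """Q-learning update for the transition (last_state, action, reward, new_state):
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    At a terminal state there is no future value, so the max term is dropped."""
    best_next = 0.0 if terminal else max(v_table[new_state][a] for a in available_actions)
    td_error = reward + GAMMA * best_next - v_table[last_state][action]
    v_table[last_state][action] += LEARNING_RATE * td_error

def greedy(v_table, state, actions):
    """Return the action with the highest Q value in this state."""
    return max(actions, key=lambda a: v_table[state][a])

def e_greedy(v_table, state, actions):
    """With probability EPSILON explore randomly; otherwise act greedily."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return greedy(v_table, state, actions)
```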
Controller
This is a file that contains code to launch the game, train the agent, and test the agent. The code first trains the agent over a number of episodes. After each episode it tests the policy and reports the progress. After training is complete, the environment is reset and the agent is run for a number of steps, printing out the results of each step. Finally, optionally, a human can play the game as the bot or as the enemy robot.
Parameters:
- episodes: number of episodes to train the agent.
- trainingReportRate: how many episodes pass before printing out a report during training
- play: Set to run in interactive mode:
- 0 = no interactive mode
- 1 = play as the bot
- 2 = play as the enemy
Other things you may want to experiment with:
- gridEnvironment.enemyMode
- gridEnvironment.verbose
- gridEnvironment.randomStart
- gridAgent.verbose
- gridAgent.numSteps
Instructions
The executePolicy() function and part of qLearn() have already been provided. You must complete the implementation of the Q-learning algorithm.
Step 1: Implement greedy(). Test it by using testgreedy.py.
(testgreedy.py doesn't verify the correctness of your implementation but gives a framework for easily inspecting a value table and visually verifying that greedy() is returning the correct result).
Step 2: Implement eGreedy(). Test it by using testegreedy.py.
(testegreedy.py works similarly to testgreedy.py except one must vary the epsilon value and verify the results by hand).
Step 3: Implement updateVtable(). Test it using testvtable.py.
(testvtable.py doesn't verify the correctness of your implementation but gives a framework for easily inspecting a value table after executing a sequence of actions.)
Step 4: Test your complete implementation.
Additional testing can be done by making your own maps.
Step 5: Write a report answering the following questions.
- Why doesn't the bot avoid the radiation in the default map? What would have to be different for the bot to avoid as much of it as possible?
- Under the default reward, the bot runs away from the enemy. What is the smallest value for enemyDead that would make it so that the bot is willing to kill the enemy if they cross paths? Explain why. What is the smallest value for enemyDead that would induce the bot to seek out the enemy and kill it? Explain why.
- What effect does switching enemyMode from 1 (follow the influence map) to 2 during training have on the behavior of the bot, if any? How does more or fewer training episodes help or hurt? Hint: experiment with play = 2.
Grading
This homework assignment is worth 10 points. Your solution will be graded by an autograder. The autograder will independently test your greedy(), eGreedy(), and updateVtable() implementations.
- 2 points: correct implementation of greedy()
- 5 points: correct implementation of updateVtable() and eGreedy().
- 2 points for updating the Q values for an arbitrary execution sequence;
- 1 point for a Q-table that produces the correct execution sequence;
- 2 points for computing correct Q values after training.
- 3 points: report
Submission
To submit your solution, upload your modified Agent.py. All work should be done within this file.
You may modify Controller.py and Environment.py for testing, but do not submit your changes to these files. The autograder will use its own versions of Controller and Environment.
DO NOT upload the entire directory.