Reinforcement Learning

The purpose of this exercise is to implement an agent using reinforcement learning that can play a simple game to rescue humans from a building that has been flooded with radiation.

Reinforcement learning is a technique that uses trial-and-error episodes to learn to play a game as close to optimally as possible. By applying reinforcement learning to a game, we are trying to play as well as, or better than, humans. Variants of reinforcement learning have been used to play Backgammon, Go, and other board games at skill levels superior to those of humans, and to train agents that play Atari games at, and sometimes above, human skill level.

Our agent will learn to play a simple game called "radiation world", which was originally designed for rescue scenarios. In radiation world, humans have been trapped in a building because of a radiation leak. There are irradiated areas that humans cannot safely move through; a robot, however, can. Your robot must navigate the building to reach each human.

The following is an example game map. Black indicates walls. Red indicates areas with radiation. Blue indicates grid cells containing humans. Green is the base station for the robot.

    [example game map image]

The agent can move left, right, up, or down. If it enters a grid cell with a human, then that human is saved.

One last thing: there is an enemy robot roaming around the building that will try to hurt the agent if they are ever in the same location for more than one turn. Our agent has one additional action, "smash," which can destroy the enemy robot.

The goal of a reinforcement learning agent is to get as many points as it can during a game. Thus, the definition of optimal behavior depends on how score is computed. By default, score is computed as follows:

(Yes, you lose points if you kill the enemy robot, so with this scoring function the bot will need to learn to avoid it instead.)

You will implement the Q-learning algorithm and test your agent in the radiation world.
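
At its core, tabular Q-learning maintains a table of values Q(s, a), one entry per state-action pair, and after every step it nudges the entry for the pair just taken toward the observed reward plus the discounted value of the best action available in the resulting state:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

where alpha is the learning rate and gamma is the discount factor. The variable names in the provided code may differ, but this is the standard update that the steps below build toward.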


What you need to know

Below are the important bits of information about objects that you will be working with or need to know about for this assignment.

Environment

The Environment describes how the simulated world works and also how reward (score) is defined.

Member variables:

Member functions:

Action

A container class for passing action information around. The member, actionValue, contains a numerical value referring to the action.

Observation

A container class for passing observation (state) information around. The member, worldState, contains the state tuple. Other member variables are not used.

Reward

A container class for passing reward information around. The member, rewardValue, contains the reward information as a floating-point value.
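
Taken together, the three containers are very thin wrappers around single values. The sketch below is illustrative only; the class bodies and the specific numeric action encoding shown here are assumptions, not the actual assignment code.

    # Illustrative sketch of the container classes; the real ones are provided.
    class Action:
        def __init__(self, actionValue=0):
            # Hypothetical encoding, e.g. 0=left, 1=right, 2=up, 3=down, 4=smash.
            self.actionValue = actionValue

    class Observation:
        def __init__(self, worldState=None):
            self.worldState = worldState    # the state tuple

    class Reward:
        def __init__(self, rewardValue=0.0):
            self.rewardValue = rewardValue  # reward as a floating-point value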

Agent

The agent class controls the bot and implements the Q-learning algorithm.

Member variables:

Member functions:

Controller

This is a file that contains code to launch the game, train the agent, and test the agent. The code first trains the agent over a number of episodes. After each episode it tests the policy and reports the progress. After training is complete, the environment is reset and the agent is run for a number of steps, printing out the results of each step. Finally, a human can optionally play the game as the bot or as the enemy robot.
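
In outline, the Controller's flow is roughly the sketch below. Aside from qLearn() and executePolicy(), every name in it (the imports, constructors, numEpisodes, and method signatures) is a stand-in for illustration, not the file's actual API.

    # Rough sketch of the Controller flow described above; not the actual code.
    from Environment import Environment      # assumed module layout
    from Agent import Agent

    numEpisodes = 100                        # stand-in value for illustration

    env = Environment()                      # constructor arguments may differ
    agent = Agent()

    for episode in range(numEpisodes):
        agent.qLearn(env)                    # one trial-and-error training episode
        score = agent.executePolicy(env)     # test the current policy greedily
        print("episode", episode, "score", score)

    # After training: reset the environment, run the learned policy step by
    # step, printing each result, and optionally let a human play as the bot
    # or as the enemy robot.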

Parameters:

Other things you may want to experiment with:


Instructions

ExecutePolicy() and part of qLearn() have already been provided. You must complete the implementation of the Q-learning algorithm.

Step 1: Implement greedy(). Test it by using testgreedy.py.

(testgreedy.py doesn't verify the correctness of your implementation but gives a framework for easily inspecting a value table and visually verifying that greedy() is returning the correct result).
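
A typical greedy() simply returns the action whose entry in the value table is largest for the current state. The sketch below assumes the table is indexed first by state and then by action; adapt it to the actual data structure in Agent.py.

    # Sketch of greedy selection over a tabular value function (illustrative).
    def greedy(vTable, state):
        values = vTable[state]               # assumed: a sequence of action values
        best = 0
        for a in range(1, len(values)):
            if values[a] > values[best]:
                best = a
        return best                          # ties go to the lowest-numbered action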

Step 2: Implement egreedy(). Test it by using testegreedy.py.

(testegreedy.py works similarly to testgreedy.py, except that one must vary the epsilon value and verify the results by hand.)
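
egreedy() adds exploration: with probability epsilon it picks a uniformly random action, otherwise it falls back to greedy(). Again, the argument names and table layout below are assumptions.

    import random

    # Sketch of epsilon-greedy selection (illustrative, not the actual signature).
    # Uses greedy() from the previous sketch.
    def egreedy(vTable, state, epsilon):
        numActions = len(vTable[state])
        if random.random() < epsilon:
            return random.randrange(numActions)   # explore: random action
        return greedy(vTable, state)              # exploit: best-known action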

Step 3: Implement updateVtable(). Test it using testvtable.py.

(testvtable.py doesn't verify the correctness of your implementation but gives a framework for easily inspecting a value table after executing a sequence of actions.)
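
updateVtable() is where the Q-learning update from the introduction is applied to a single (state, action, reward, next state) step. The sketch below uses alpha for the learning rate and gamma for the discount factor; the real method in Agent.py may take its parameters from the agent's member variables instead, and its signature may differ.

    # Sketch of the tabular Q-learning update (illustrative parameter names).
    def updateVtable(vTable, state, action, reward, nextState, alpha, gamma,
                     terminal=False):
        target = reward
        if not terminal:
            target += gamma * max(vTable[nextState])   # value of best next action
        vTable[state][action] += alpha * (target - vTable[state][action])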

Step 4: Test your complete implementation:

Additional testing can be done by making your own maps.

Step 5: Write a report answering the following questions.

  1. Why doesn't the bot avoid the radiation in the default map? What would have to be different for the bot to avoid as much of it as possible?
  2. Under the default reward, the bot runs away from the enemy. What is the smallest value for enemyDead that would make it so that the bot is willing to kill the enemy if they cross paths? Explain why. What is the smallest value for enemyDead that would induce the bot to seek out the enemy and kill it? Explain why.
  3. What effect does switching enemyMode from 1 (follow the influence map) to 2 during training have on the behavior of the bot, if any? How does more or fewer training episodes help or hurt? Hint: experiment with play = 2.

Grading

This homework assignment is worth 10 points. Your solution will be graded by an autograder. The autograder will independently test your greedy(), egreedy(), and updateVtable() implementations.


Submission

To submit your solution, upload your modified Agent.py. All work should be done within this file.

You may modify Controller.py and Environment.py for testing, but do not submit your changes to these files. The autograder will use its own versions of Controller and Environment.

DO NOT upload the entire directory.