Abstract
Conducting hardware experiments is often expensive in several respects, such as potential damage to the robot and the number of people required to operate it safely. Computer simulation is used in place of hardware in such cases, but it suffers from so-called simulation bias, in which policies tuned in simulation do not work on hardware due to differences between the two systems. Model-free methods such as Q-learning, on the other hand, do not require a model and therefore avoid this issue. However, these methods
typically require a large number of experiments, which may not
be realistic for some tasks such as humanoid robot balancing
and locomotion. This paper presents an iterative approach
for learning hardware models and optimizing policies with
as few hardware experiments as possible. Instead of learning
the model from scratch, our method learns the difference
between a simulation model and the hardware. We then optimize
the policy based on the learned model in simulation. The
iterative approach allows us to collect a wider range of data for
model refinement while improving the policy.