Learning to play Ice Hockey in SuperTuxKart through Reinforcement Learning
machine-learning

Introduction to the project

This post is about the final project I did for the Deep Learning course of the Data Science master's program at UT Austin.


The aim of the project was to demonstrate that Reinforcement Learning (RL) is a viable framework for an agent to attain simple goals in the SuperTuxKart video game.


Overview of the project


The objective is to train an agent to play Ice Hockey and win by scoring goals. The agent controls two actions: the acceleration and the steering of the kart.


The project is set up in Python using the pystk module, which loads the SuperTuxKart engine. The game is played by 2 teams composed of 2 karts each. PyTorch is used as the machine learning framework for the Artificial Neural Network (ANN) architecture and the optimization.


The training framework is based on Reinforcement Learning with a Policy Gradient: the agent acts on the environment and records a reward for the decision made at each step. As the agent plays the game and stores the rewards for the actions taken, the Policy Gradient algorithm updates a baseline network that estimates the expected reward for each action in each state. Sample gradients are then estimated, and the agent network weights are updated. Repeating this process trains the agent iteratively, and the baseline reduces the variance typical of reinforcement learning training.
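The update described above can be sketched as follows. This is a minimal, hypothetical illustration of a policy-gradient step with a learned baseline; the network sizes, learning rates, and discrete action head are assumptions for illustration, not the project's actual code.

```python
import torch

# Illustrative sizes, not the project's actual state/action spaces.
STATE_DIM, N_ACTIONS = 5, 3

policy = torch.nn.Sequential(torch.nn.Linear(STATE_DIM, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, N_ACTIONS))
baseline = torch.nn.Sequential(torch.nn.Linear(STATE_DIM, 32), torch.nn.ReLU(),
                               torch.nn.Linear(32, 1))
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
b_opt = torch.optim.Adam(baseline.parameters(), lr=1e-3)

def update(states, actions, returns):
    """One policy-gradient step: fit the baseline to the observed returns,
    then push up log-probabilities of actions weighted by the advantage."""
    # 1) Fit the baseline (expected return for each state).
    b_opt.zero_grad()
    values = baseline(states).squeeze(-1)
    torch.nn.functional.mse_loss(values, returns).backward()
    b_opt.step()

    # 2) Policy step: advantage = return - baseline reduces gradient variance.
    p_opt.zero_grad()
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = (returns - baseline(states).squeeze(-1)).detach()
    policy_loss = -(chosen * advantage).mean()
    policy_loss.backward()
    p_opt.step()
    return policy_loss.item()

# Example update on a batch of fake transitions.
states = torch.randn(8, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (8,))
returns = torch.randn(8)
loss = update(states, actions, returns)
```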


Over numerous repetitions, the agent will learn to play the game by following the puck and hitting it towards the opponent’s goal.


  • The agent trained in this project controls the blue karts in the video.

Deep Learning approach


The building block of the ANN architecture is composed of fully connected linear layers and non-linear activations. Two such blocks are employed in the main architecture, as depicted in the figure below. Each neural network focuses on one of the two main actions: the acceleration and the steering of the kart.

Agent Deep Learning Artificial Neural Network Architecture
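A two-headed architecture of this kind could be sketched as below. The layer sizes, depth, and output squashing (sigmoid for acceleration, tanh for steering) are assumptions for illustration; the post's figure defines the actual design.

```python
import torch

class Block(torch.nn.Module):
    """A building block of fully connected layers with non-linear activations."""
    def __init__(self, state_dim, hidden, out_dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

class Agent(torch.nn.Module):
    """One block per action: acceleration and steering."""
    def __init__(self, state_dim=5, hidden=64):
        super().__init__()
        self.accel = Block(state_dim, hidden, 1)
        self.steer = Block(state_dim, hidden, 1)

    def forward(self, state):
        # Squash outputs to plausible control ranges (an assumption here).
        return torch.sigmoid(self.accel(state)), torch.tanh(self.steer(state))

agent = Agent()
accel, steer = agent(torch.randn(4, 5))
```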


As mentioned in the overview, a policy gradient was employed to train the agent. This means that the objective of the training is to optimize an ANN that devises the best course of action (a policy π) given the presented state of the game.

Policy gradient algorithm procedure


The agent must take an action (a) given a state (s) using its current policy, which is determined by the ANN.

π‘Žπ‘π‘‘π‘–π‘œπ‘› = 𝐴𝑁𝑁(π‘ π‘‘π‘Žπ‘‘π‘’)

With this action, the agent acts upon the environment which determines the new state and a reward for the action taken. The whole cycle begins again, collecting even more rewards as the agent acts upon the environment.
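The act-observe-reward cycle can be sketched as a rollout loop. The environment interface below is a hypothetical stand-in, not the actual pystk API, and the toy environment only exists to exercise the loop.

```python
import torch

def rollout(env, policy, n_steps=600):
    """Collect (state, action, reward) tuples by acting with the current policy."""
    trajectory = []
    state = env.reset()
    for _ in range(n_steps):
        with torch.no_grad():
            action = policy(state)        # action = ANN(state)
        state, reward = env.step(action)  # environment yields new state + reward
        trajectory.append((state, action, reward))
    return trajectory

# Toy environment (stand-in for the pystk-backed game).
class ToyEnv:
    def reset(self):
        return torch.zeros(3)
    def step(self, action):
        return torch.randn(3), float(action.sum())

traj = rollout(ToyEnv(), lambda s: torch.ones(2), n_steps=5)
```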


During this phase, off-policy sampling is introduced. The agent receives the REINFORCE reward and additionally looks 60 steps ahead, measuring the reward or penalty if a goal was scored or conceded, as well as how much closer the puck got to the goalpost compared with the current step. This adjustment is then modulated with an importance-sampling approach to steer the agent towards the actions that yield the best match outcome: scoring goals while preventing the opponent from scoring.
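The shaped return could look roughly like this. The horizon value of 60 comes from the text; the reward encoding (+1/-1 goal events, puck-to-goal distances) and the ratio clipping are assumptions made to keep the sketch self-contained.

```python
import torch

def shaped_returns(rewards, goal_events, puck_to_goal,
                   new_log_probs, old_log_probs, horizon=60):
    """For each step t, add the goal reward/penalty and puck progress observed
    within the next `horizon` steps, weighted by the importance ratio
    pi_new(a|s) / pi_old(a|s) of the action actually taken."""
    T = len(rewards)
    out = torch.zeros(T)
    for t in range(T):
        end = min(t + horizon, T)
        look_ahead = sum(goal_events[t:end])                   # +1 scored, -1 conceded
        progress = puck_to_goal[t] - min(puck_to_goal[t:end])  # puck got closer
        # Importance ratio, clipped for stability (clipping is an assumption).
        ratio = torch.exp(new_log_probs[t] - old_log_probs[t]).clamp(0.0, 2.0)
        out[t] = rewards[t] + ratio * (look_ahead + progress)
    return out

# Example: a goal is scored at step 2 and the puck advances from 5.0 to 3.0.
shaped = shaped_returns([0.0] * 5, [0, 0, 1, 0, 0],
                        [5.0, 4.0, 3.0, 3.0, 3.0],
                        torch.zeros(5), torch.zeros(5))
```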


Results


The agent was first trained on a pure RL basis, playing against itself or the basic AI.


Pretrained AI Goals: 87
ANN Agent Goals: 2

REINFORCE takes the agent as far as learning to navigate the field, but it does not score as many goals as one would expect when playing against the AI. After REINFORCE, the policy-gradient phase begins, and the agent faces the more advanced pretrained AIs.


Other DL AI Goals: 61
ANN Agent Goals: 15

The policy gradient improved the goal-scoring metric: 15 goals scored in 41 games played (38%), versus 2 goals scored in 75 games (3%) with REINFORCE alone.