Learning to play Ice Hockey in SuperTuxKart through Reinforcement Learning

Introduction to the project

This post is about the final project I did for the Deep Learning course in the Data Science master's program at UT Austin.


The aim of the project was to demonstrate that Reinforcement Learning (RL) is a viable framework for an agent to attain simple goals in the SuperTuxKart video game.


Overview of the project


The objective is to train an agent to play Ice Hockey and win by scoring goals. The agent controls two actions: acceleration and steering.


The project is implemented in Python using the pystk module, which loads the SuperTuxKart engine. The game is played by two teams of two karts each. PyTorch is used as the machine learning framework for the artificial neural network (ANN) architecture and the optimization.
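For context, a minimal sketch of how a soccer match can be set up through pystk is shown below; the configuration values are illustrative assumptions based on the module's public API, not necessarily the project's exact setup.

```python
import pystk

# Minimal sketch: configure a soccer ("ice hockey") match in pystk.
# Values here are illustrative assumptions, not the project's exact setup.
pystk.init(pystk.GraphicsConfig.none())  # headless rendering for training

config = pystk.RaceConfig(num_kart=4, track="icy_soccer_field",
                          mode=pystk.RaceConfig.RaceMode.SOCCER)
config.players[0].controller = pystk.PlayerConfig.Controller.PLAYER_CONTROL

race = pystk.Race(config)
race.start()
race.step(pystk.Action(acceleration=1.0, steer=0.0))  # one simulation tick
```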


The training framework is based on Reinforcement Learning using a policy gradient: the agent acts on the environment and records a reward for the decision made at each step. As the agent plays the game and stores rewards for its actions, the policy gradient algorithm updates a baseline network that estimates the expected reward for each action in a given state. Sample gradients are then estimated relative to this baseline, and the agent's network weights are updated. Subtracting the baseline from the observed rewards reduces the variance of the gradient estimates during training.
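As a rough illustration, one such update could look like the sketch below. It assumes `policy_net` returns a torch.distributions object and `baseline_net` returns a scalar value per state; these are assumptions for illustration, not the project's actual code.

```python
import torch
import torch.nn.functional as F

def policy_gradient_step(policy_net, baseline_net, policy_opt, baseline_opt,
                         states, actions, returns):
    """One REINFORCE-with-baseline update (illustrative sketch).

    states: (T, state_dim), actions: (T, act_dim), returns: (T,) tensors
    collected from one or more played matches.
    """
    # Fit the baseline to the observed returns: it estimates the expected
    # reward of acting in a given state.
    baseline_loss = F.mse_loss(baseline_net(states).squeeze(-1), returns)
    baseline_opt.zero_grad()
    baseline_loss.backward()
    baseline_opt.step()

    # Weight each log-probability by (return - baseline); subtracting the
    # baseline lowers the variance of the gradient estimate.
    with torch.no_grad():
        advantage = returns - baseline_net(states).squeeze(-1)
    log_prob = policy_net(states).log_prob(actions).sum(-1)
    policy_loss = -(log_prob * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```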


Through numerous repetitions, the agent learns to play the game by following the puck and hitting it toward the opponent’s goal.


  • The agent trained in this project controls the blue karts in the video.

Deep Learning approach


The ANN architecture's building block is composed of fully connected linear layers and non-linear activations. Two such blocks are employed in the main architecture, as depicted in the figure below; each one focuses on one of the two main actions: acceleration or steering.

Agent Deep Learning Artificial Neural Network Architecture
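A plausible PyTorch rendering of that two-block design is sketched below; the layer widths and activations are illustrative assumptions, not the trained model's exact dimensions.

```python
import torch.nn as nn

class ActionBlock(nn.Module):
    """One building block: fully connected layers with non-linear activations."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state)

class Agent(nn.Module):
    """Two parallel blocks, one per action: acceleration and steering."""
    def __init__(self, state_dim):
        super().__init__()
        self.acceleration = ActionBlock(state_dim)
        self.steering = ActionBlock(state_dim)

    def forward(self, state):
        return self.acceleration(state), self.steering(state)
```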


As mentioned in the overview, a policy gradient was used to train the agent. This means the training objective is to optimize an ANN that determines the best course of action (a policy π) given the current game state.

Policy gradient algorithm procedure


The agent takes an action (a) given a state (s) using its current policy, which is determined by the ANN.

π‘Žπ‘π‘‘π‘–π‘œπ‘› = 𝐴𝑁𝑁(π‘ π‘‘π‘Žπ‘‘π‘’)

With this action, the agent interacts with the environment, which determines the new state and a reward. The cycle repeats, accumulating rewards as the agent continues to interact with the environment.
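Schematically, each episode collects one (state, action, reward) triple per step, as in the sketch below; `environment` is a hypothetical wrapper around the pystk match and `compute_reward` stands in for the project's reward function.

```python
def play_episode(environment, agent, compute_reward, max_steps=1200):
    """Collect one trajectory of (state, action, reward) triples.

    `environment` and `compute_reward` are hypothetical names standing in
    for the pystk match wrapper and the project's reward function.
    """
    trajectory = []
    state = environment.reset()
    for _ in range(max_steps):
        action = agent(state)                  # action = ANN(state)
        next_state = environment.step(action)  # environment yields the new state
        reward = compute_reward(next_state)    # ...and a reward for the decision
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```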


During this phase, off-policy sampling is introduced. The agent receives the REINFORCE reward and also considers the outcome 60 steps ahead, measuring the reward or penalty based on whether a goal was scored or conceded, and how close the puck came to the goalpost compared to the current step. This adjustment is then modulated using an importance sampling approach to guide the agent toward actions that maximize the match outcome: scoring goals while preventing the opponent from scoring.
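In code, this modulation amounts to weighting each log-probability by the ratio between the current policy and the behavior policy that generated the action. The sketch below illustrates the idea under the same assumed `policy_net` interface as before; the shaped returns and stored log-probabilities are hypothetical names.

```python
import torch

def off_policy_loss(policy_net, states, actions, shaped_returns, behavior_log_prob):
    """Importance-sampled policy-gradient loss (illustrative sketch).

    `shaped_returns` combine the REINFORCE reward with the 60-step lookahead
    bonus or penalty; `behavior_log_prob` holds log-probabilities of the
    recorded actions under the (older) policy that generated them.
    """
    current_log_prob = policy_net(states).log_prob(actions).sum(-1)
    # Importance weight pi_current / pi_behavior, detached so gradients flow
    # only through the current policy's log-probabilities.
    weight = torch.exp(current_log_prob - behavior_log_prob).detach()
    return -(weight * current_log_prob * shaped_returns).mean()
```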


Results


The agent was first trained with the REINFORCE algorithm, playing against itself or the basic AI.


Pretrained AI Goals | ANN Agent Goals
87 | 2

REINFORCE enables the agent to learn how to navigate the field but does not yield as many goals as expected against the AI. After REINFORCE, the full policy-gradient training (with the baseline and importance sampling) begins, and the agent faces the more advanced pretrained AIs.


Other DL AI Goals | ANN Agent Goals
61 | 15

Policy-gradient training improved the goal-scoring metric: 15 goals in 41 games played (38%) versus 2 goals in 75 games (3%) under REINFORCE alone.