This script shows an implementation of Deep Q-Learning on the BreakoutNoFrameskip-v4 environment.
As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to an action. The agent chooses an action in a given state based on a "Q-value", an estimate of the expected long-term, discounted reward for taking that action in that state. A Q-Learning agent learns to perform its task such that the recommended action maximizes the potential future rewards. This method is considered an "Off-Policy" method, meaning its Q-values are updated assuming that the best action will be chosen in the next state, even if the agent actually takes a different action.
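To make the off-policy update concrete, here is a minimal tabular sketch. It is an illustration only: a Deep Q-Learning script replaces the table with a neural network, and the function name `q_learning_update`, the learning rate `alpha`, and the discount factor `gamma` are assumptions chosen for this example.

```python
def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q-value.

    q_table is a list of lists indexed as q_table[state][action].
    alpha (learning rate) and gamma (discount) are illustrative defaults.
    """
    # Off-policy target: bootstrap from the best action in the next state,
    # regardless of which action the agent actually takes there.
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Toy example: two states, two actions.
q = [[0.0, 0.0], [1.0, 5.0]]
new_value = q_learning_update(q, state=0, action=1, reward=1.0,
                              next_state=1, done=False)
```

The target uses `max(q_table[next_state])` rather than the value of the action the agent goes on to take, which is what makes the update off-policy.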
In this environment, a paddle moves along the bottom of the screen returning a ball that will destroy blocks at the top of the screen. The aim of the game is to remove all blocks and break out of the level. The agent must learn to control the paddle by moving left and right, returning the ball and removing all the blocks without the ball passing the paddle.
The DeepMind paper trained for "a total of 50 million frames (that is, around 38 days of game experience in total)". However, this script gives good results at around 10 million frames, which are processed in less than 24 hours on a modern machine.
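As a rough sanity check on the quoted "38 days" figure, the conversion from frames to game time can be sketched as follows. The frame skip of 4 and the 60 frames-per-second emulator rate are assumptions made for this calculation, not values stated in the text above, and `game_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def game_days(agent_frames, frame_skip=4, emulator_fps=60):
    """Convert frames seen by the agent into days of real-time game play.

    frame_skip and emulator_fps are assumed values for illustration.
    """
    emulator_frames = agent_frames * frame_skip
    return emulator_frames / emulator_fps / SECONDS_PER_DAY


paper_days = game_days(50_000_000)   # training length quoted from the paper
script_days = game_days(10_000_000)  # training length suggested for this script
print(f"{paper_days:.1f} days vs {script_days:.1f} days of game experience")
```

Under these assumptions, 50 million frames comes to a little over 38 days of game experience, which matches the quoted figure, and 10 million frames is one fifth of that.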