The most fascinating domain in the already fascinating field of machine learning is, in my opinion, Reinforcement Learning (RL). It became famous thanks to two organizations: DeepMind and OpenAI. The first helped launch the current AI wave and made it popular with AlphaGo, the first computer program to defeat a Go world champion, Lee Sedol. The second is a lab that has made enormous progress too and regularly publishes about the subject.
To understand RL, a good example helps, and a frequently used one is Ms. Pac-Man. The principle behind RL is actually quite simple: you tell your computer how it can move, you define a rule of rewards and punishments, and then you let it play for hours until it finds the way to get the best score.
It's easier to understand with the example. For Ms. Pac-Man, we tell our computer that it can move in 9 different ways (left, right, up, down, the four diagonals, and standing still). The rule of punishments and rewards here is simply the score given by the game (by "the game" I mean Ms. Pac-Man itself, not our machine). Then we need to give an input to our computer, and here it will be a simplified version of the screen you would see if you played the arcade game. At each instant (roughly a quarter of a second at human speed), we send our machine the current frame and the score shown at the bottom. And with that, it's done: you just need to let your computer play millions of episodes (an episode is like a game played by a human, but often much shorter; it ends either after a certain amount of time or with Ms. Pac-Man's death).
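To make that loop concrete, here is a minimal sketch of how one episode could be driven in code. It assumes OpenAI Gym's Atari environment "MsPacman-v0" is installed and uses the older Gym step API that returns four values; the "agent" here is just a random placeholder, not a trained policy.

```python
# Minimal sketch of the frame -> action -> score loop described above.
# Assumes: pip install gym[atari], classic Gym API (obs, reward, done, info).
import gym

env = gym.make("MsPacman-v0")    # 9 discrete actions, screen frames as observations

for episode in range(3):         # a real training run would use millions of episodes
    frame = env.reset()          # the screen image the agent sees
    done = False
    total_score = 0
    while not done:
        action = env.action_space.sample()             # placeholder: pick a random move
        frame, reward, done, info = env.step(action)   # next frame + score gained this step
        total_score += reward
    print(f"Episode {episode}: score = {total_score}")
```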
But we skipped a major part! How does our computer learn? How does it know how to optimize its score? This is the mathematical part (since we don't have to build the reward rule ourselves): it is called the policy, and it is really essential. The policy is what decides which move to make. It is essential because we need our machine to try a lot of different paths to understand which is best. Thus, the policy needs to change along with the experience of our computer: it should not decide the same way at the first episode as at the millionth one. Of course, randomness is absent from the testing phase, where the computer always moves the way it deems best, based on its experience and the scores it got during the learning phase.
The first policy that proved to work really well is called epsilon-greedy. It makes a random decision with a probability (epsilon) that decreases with the number of steps taken (a step is a single move; an episode is a whole game made of many steps). When the decision is not random, the computer moves in the way it deems best. This means that at the start essentially every move is random, and after a few million steps random moves become very rare, so that our machine can exploit the best paths it has found.
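As a rough sketch (the schedule values below are illustrative, not taken from any particular paper), an epsilon-greedy choice could look like this, where q_values are the machine's current estimates of how good each of the 9 moves is:

```python
import random

def epsilon_greedy_action(q_values, step, n_actions=9,
                          eps_start=1.0, eps_end=0.05, decay_steps=1_000_000):
    """Pick a move: random with probability epsilon, best-known otherwise.

    Epsilon decays linearly from eps_start to eps_end over decay_steps, so
    early training is almost pure exploration and late training is almost
    pure exploitation.
    """
    # linearly decayed exploration rate
    epsilon = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
    if random.random() < epsilon:
        return random.randrange(n_actions)                      # explore: random move
    return max(range(n_actions), key=lambda a: q_values[a])     # exploit: best estimated move
```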
Of course, after these first results, many better, more efficient policies were developed and helped make the learning time much shorter. And that matters, because one of the major problems of Deep Q-Networks (the name given to the networks used to solve this kind of Markov Decision Process) is that they need a huge amount of time to learn. The basic policy I explained above was tested with 200 million training frames, which is roughly equivalent to about 20 days of playing at human speed. More details at https://towardsdatascience.com/advanced-dqns-playing-pac-man-with-deep-reinforcement-learning-3ffbd99e0814
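For context, a Deep Q-Network is essentially a convolutional neural network that maps a stack of recent frames to one estimated score (a Q-value) per possible move; the policy then chooses among those estimates. Below is a hedged sketch in PyTorch using the layer sizes commonly cited for DeepMind's Atari agent; the linked article may use a different architecture.

```python
# Sketch of a DQN-style network: input is a stack of 4 preprocessed 84x84
# grayscale frames, output is one Q-value for each of Ms. Pac-Man's 9 moves.
# Layer sizes follow the commonly cited DeepMind Atari setup (an assumption here).
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per possible move
        )

    def forward(self, frames):           # frames: tensor of shape (batch, 4, 84, 84)
        return self.net(frames)

# Example: q = DQN()(torch.zeros(1, 4, 84, 84)) gives 9 Q-values to pick a move from.
```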
If this subject interests you, there is a lot of literature written about it, often much more technical than this post, for example on https://towardsdatascience.com/ or on OpenAI's blog https://blog.openai.com/. I hope this was a clear introduction to reinforcement learning, and see you next time for another peek into machine learning!