Before finishing the set of articles about Improving Language Understanding with Unsupervised Learning, I wanted to talk about what I have been interested in lately. I have been studying the major advances made in the field of Reinforcement Learning (RL) over the last few years. In my opinion, the first significant one is the article I am presenting here, DeepMind's Playing Atari with Deep Reinforcement Learning.
It was the first time that deep neural networks gave better results in RL than more classical methods on more than just one game. To put this in context: an earlier RL algorithm, TD-Gammon, had succeeded in attaining a super-human level of play in backgammon, but when researchers tried to extend the same approach to other games, such as chess or Go, it did not work at all.
Now let's really get into it, and look at what DeepMind did to get their results, and at what started the exponential growth of this field over the last four years. The team decided to work on games from the Atari 2600 console and to aim for good results on seven different games, using the same network architecture and learning algorithm for all of them. The games were: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders. The objective is to make the network learn to play using only the raw pixel data as input (as explained in this blog's first article on Ms. Pacman).
Their approach is to combine an RL algorithm with a deep neural network, which works directly on the images of the screen. The network is trained using stochastic gradient descent (which is quite standard; I will not explain it here, you can find a lot of articles explaining it clearly, including on towardsdatascience.com).
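To make this more concrete, here is a minimal sketch of such a Q-network in PyTorch, assuming the layer sizes reported in the paper (two convolutional layers followed by two fully connected layers). The optimizer choice and hyperparameters below are illustrative, not the exact ones used by DeepMind.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network, roughly following the layer sizes
    described in the 2013 DQN paper (assumption: exact details may differ)."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 stacked 84x84 frames -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one Q-value per possible action
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork(n_actions=4)
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)  # RMSProp variant of SGD
```

The output layer has one value per action, so a single forward pass gives the estimated value of every possible move for the current screen.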
As said before (and in the first article), the network takes images of the screen as input, but downsampled to 84x84 grayscale images (with the last few frames stacked together) to reduce the dimension of the input data.
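As an illustration, here is one possible way to do this preprocessing in Python with OpenCV. The exact cropping and resizing used in the paper may differ slightly, and the frame-stacking helper is only a hypothetical sketch.

```python
import numpy as np
import cv2  # OpenCV, used here only for color conversion and resizing

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Turn a raw RGB Atari frame into an 84x84 grayscale image in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0

def stack_frames(last_frames: list) -> np.ndarray:
    """Stack the 4 most recent preprocessed frames into one network input."""
    return np.stack(last_frames[-4:], axis=0)  # shape (4, 84, 84)
```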
In the approach taken by Mnih et al., the real big innovation is the Deep Q-Network (DQN), a deep-learning version of Q-learning. To explain it, we need to explain what Q-learning is first.
Basically, our network is an agent that has a choice to make at every instant (recall that with Ms. Pacman, it had to choose between moving in any direction or staying put). The decision making is modelled as a Markov decision process (MDP), which means that in each state the agent has probabilities for each action and an estimate of the reward each action could give it. The principle of Q-learning is to estimate, for each state and action, the total future reward the agent can expect, and then to pick the action with the best estimate. It is often combined at the start of training with an epsilon-greedy policy (as in Ms. Pacman), which makes a fraction epsilon of the actions random, with epsilon decreasing over time. To see a concrete example of a Markov decision process, look at the image below.
The student in this MDP has a choice to make in each state (or here, in each place they are in), with the probability of each action and the reward obtained if that action is chosen. For more details on MDPs I suggest reading https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690.
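To make the Q-learning idea concrete, here is a minimal tabular sketch of the update rule and the epsilon-greedy choice, for a hypothetical small environment. The DQN replaces the table Q[state, action] with the neural network above, but the update target is the same idea.

```python
import numpy as np

# Minimal tabular Q-learning sketch; n_states and n_actions are illustrative.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99   # learning rate and discount factor

def choose_action(state: int, epsilon: float) -> int:
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """Move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```

In the DQN, the same quantity reward + gamma * max Q(next state) is used, but as the regression target for the network's output instead of a table update.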
The other big change made in the DQN is that the team introduces experience replay. Every transition the agent experiences (state, action, reward, next state) is stored in a memory, and at training time random past transitions are sampled from this memory and used to update the Q-network again. Since the network has learned new things since a transition was stored, it will evaluate the possible actions differently when it sees it again. Sampling at random breaks the correlation between consecutive states N and N+1, which is a really important source of bias (if the agent goes left multiple times in a row, it will tend to favor going left on the next steps), and it makes training a lot more efficient, since each piece of experience can be reused for several updates.
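Here is a minimal sketch of what such a replay memory could look like in Python; the capacity and batch size are illustrative values, not the ones from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled at random for training."""
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done) -> None:
        """Store one transition observed by the agent."""
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        """Draw a random minibatch of past transitions to train the Q-network on."""
        return random.sample(self.memory, batch_size)

    def __len__(self) -> int:
        return len(self.memory)
```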
The result is that they managed to get above expert human level on three of the games, and close to it on another. The three remaining games were too complex for this algorithm, and it was shown later that there are Atari games that are even harder, which DQN could not handle at all.
I hope this peek into the DQN was clear enough, and I'll see you soon for another post about exciting advances in RL! As always, go look at towardsdatascience.com, or for more details read the original article: https://arxiv.org/pdf/1312.5602v1.pdf