Today, we'll continue our talk about policy gradients. I want to stress that an enormous number of research articles, and by extension of different models, have been published in the last five years. Therefore, when I pick my subjects, it is only what I read and personally found interesting and innovative. With that said, in this article we will explain what an Actor-Critic (AC) method is, and we will then study the model that combines everything we talked about in the last two articles.
The principle behind every AC method is that we split our agent into two parts, named (this may come as a shock) the Actor and the Critic. This idea came from the fact that, at some point, there were two main ways to treat RL problems: value-based methods and policy-based ones. Q-learning and DQN are perfect examples of the former: the Q function is optimized to map state-action pairs to values. Policy-based methods try to find the optimal policy without going through a Q function or anything alike. A good example would be the REINFORCE algorithm (which has not yet been covered on this blog). When researchers tried to merge those two approaches, AC was born. The Critic estimates the value function (as in value-based methods) and the Actor updates the policy in the direction advised by the Critic (which corresponds to policy-based methods). The AC method was then further improved with A2C and A3C, which we may cover at some point in the future.
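To make that split more concrete, here is a minimal one-step Actor-Critic update sketch in PyTorch. The small discrete-action setting, the network sizes and the learning rates are my own illustrative assumptions, not taken from any specific paper; the point is simply that the Critic learns a value function from a TD target, and the Actor is pushed in the direction suggested by the Critic's TD error.

```python
# Minimal one-step Actor-Critic sketch (PyTorch); dimensions and
# hyperparameters are illustrative assumptions, not from a paper.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # e.g. a CartPole-like task (assumption)
gamma = 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(state, action, reward, next_state, done):
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # Critic: regress V(s) toward the one-step TD target r + gamma * V(s')
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - done) * critic(next_state)
    value = critic(state)
    td_error = td_target - value          # also serves as the advantage signal
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: increase log pi(a|s) proportionally to the Critic's TD error
    log_probs = torch.log_softmax(actor(state), dim=-1)
    actor_loss = -log_probs[action] * td_error.detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```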
Now, let's talk about Deep Deterministic Policy Gradient (or DDPG for short). It is, as said before, a combination of everything we talked about in the last two articles, plus the AC method. Concretely, the DeepMind team takes bits from DPG, but also from DQN, while also using the AC split.
One of the main problems of DQN is that, even if it proved to work in higher dimensions, the action space was still discrete (in Pac-Man you have 9 possible moves at all times, no more, no less). If you wanted to use it on a continuous action space, an option would be to discretize it, but doing so in higher dimensions yields an enormous number of possible actions, which makes it near impossible to converge.
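To give a sense of how fast this blows up, here is a quick back-of-the-envelope computation (the 7-joint arm with 3 values per joint mirrors the example given in the introduction of the DDPG paper; the 10-bin variant is just my own extrapolation):

```python
# Why naive discretization of a continuous action space explodes:
# with the coarsest discretization of 3 values per joint, a 7-joint arm
# already has 3**7 distinct actions (example from the DDPG paper's intro).
n_joints = 7
print(3 ** n_joints)   # 2187 discrete actions

# With a still modest 10 values per joint (illustrative assumption):
print(10 ** n_joints)  # 10,000,000 discrete actions
```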
In DDPG, the Actor is a deterministic policy that decides the best action for each state, which is the part it takes from DPG. The experience replay buffer, on the other hand, comes from DQN, and so does the way the Critic works: maybe you guessed it, but the Critic acts in a similar way to the DQN model (its evaluation of state-action pairs is an estimate of the Q function).
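To tie the two halves together, here is a condensed sketch of one DDPG update step in PyTorch. The network sizes, hyperparameters and the shape of the replay-buffer batch are illustrative assumptions; only the overall structure (a DQN-style critic target with target networks, a DPG-style deterministic actor update, and soft target updates) follows the algorithm described in the paper.

```python
# Condensed DDPG update-step sketch (PyTorch). Sizes, learning rates and the
# batch format are assumptions for illustration, not the paper's exact setup.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1          # e.g. a Pendulum-like task (assumption)
gamma, tau = 0.99, 0.005         # discount and soft-update rate (illustrative)

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    # batch = (s, a, r, s2, done) tensors sampled from a replay buffer;
    # r and done are column tensors of shape [batch_size, 1] (assumption).
    s, a, r, s2, done = batch

    # Critic (DQN-like): regress Q(s, a) toward r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(
            torch.cat([s2, target_actor(s2)], dim=-1))
    critic_loss = (critic(torch.cat([s, a], dim=-1)) - target_q).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (DPG-like): move the deterministic policy mu(s) uphill on Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Softly track the learned networks with the target networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```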
The different training environments were simulated physical environments of varying difficulty. In the examples above you can see the cartpole (first one on the left), which is a classic RL benchmark, but there are also tasks with an articulated arm (with 6 joints) trying to grab an object in a 2D space (third from the left). The results were impressive: it worked really well, and the model learned good policies in less time than state-of-the-art models at the time. On low-dimensional problems, it proved to converge with 10 times fewer steps than DPG.
Like the AC methods with A3C, DDPG was later improved by OpenAI into Multi-Agent DDPG (MADDPG). Maybe those newer models will be the subjects of future articles. I hope you liked it, and if you want to read more about DDPG, here is the link: https://arxiv.org/pdf/1509.02971.pdf. If you want to better understand AC methods, I recommend (as always) having a look at https://towardsdatascience.com/the-idea-behind-actor-critics-and-how-a2c-and-a3c-improve-them-6dd7dfd0acb8 . See you next time!