
Policy Gradients (PG, DPG) - The Recent Evolution of Reinforcement Learning p2

  • Writer: Vladimir Steiner
  • Feb 27, 2019
  • 3 min read

Updated: Oct 12, 2022

We are going to talk about one of the essential parts of reinforcement learning: policy gradients. First we need to make explicit what a policy is, and for that, we need a quick recap of how a Reinforcement Learning (RL) problem works.


The basis of most RL problems

In an RL problem, we train an agent that lives in an environment. At each instant there are three quantities: the action A taken by the agent X, the state S this action puts it in, and the reward R it gets for this action A. X decides which action to take, but it does not control the environment. Concretely, this means that X chooses the action but does not choose the reward it gets from that action. It can only try different actions and figure out the reward each of them yields, and then adjust its weights to make sure it picks the right one. This model is interesting because the future state only depends on the current one; how we got to this state does not matter. We also like it because it is quite human-like: you need to pass your finger over a candle to understand that you will get burnt, and the agent reasons the same way, except that it evaluates the reward over the N next moves, not just one.
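To make this loop concrete, here is a minimal sketch in Python of the agent-environment interaction described above. The environment (a made-up ToyEnv, not something from the post) decides the next state and the reward; the agent only picks the action, here at random.

```python
import random

class ToyEnv:
    """A made-up toy environment: the agent moves left or right on a line
    and gets a reward of 1 only if it ends up at position +3."""

    def reset(self):
        self.pos = 0
        return self.pos                              # initial state S

    def step(self, action):                          # action A in {-1, +1}
        self.pos += action                           # the next state only depends on the current one
        reward = 1.0 if self.pos == 3 else 0.0       # the environment, not the agent, sets R
        done = abs(self.pos) >= 3
        return self.pos, reward, done

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([-1, +1])                 # placeholder agent X: random choices for now
    state, reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```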


And a policy is exactly the set of decisions the agent makes to get the best reward in the end. The important point is that the policy incorporates the consequences of the actions it recommends. In practice it is the probability distribution of the actions given a state, meaning that the agent's decisions assign a probability to each action. If an action is given a high probability in a state, it is the action most likely to be taken in that state.
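As an illustration of "a probability distribution of the actions given a state", here is a small sketch of a parametrized softmax policy. The function name, sizes and feature vector are assumptions made for the example, not something defined in the post.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """pi(a | s): one probability per action, parametrized by theta."""
    preferences = theta @ state_features              # one score per action
    exp_prefs = np.exp(preferences - preferences.max())
    return exp_prefs / exp_prefs.sum()

theta = 0.01 * np.random.randn(2, 4)                  # 2 actions, 4 state features (made-up sizes)
state_features = np.array([0.1, -0.3, 0.7, 0.2])      # an illustrative state
probs = softmax_policy(theta, state_features)
action = np.random.choice(len(probs), p=probs)        # the most probable action is sampled most often
print(probs, "chosen action:", action)
```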


In our RL problems, the policy is parametrized and we try to maximize the total reward with it, modifying our parameters until it looks like we have reached a maximum. One simple approach is therefore the same as in machine learning (ML): gradient ascent (because we are trying to maximize the function, as opposed to the cost function in ML that we want to minimize). That is the principle behind policy gradient (PG). Of course, in reality it is a bit more complicated; for starters, we need to approximate our equation with the Markov Chain Monte Carlo technique (if interested, see https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
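As a sketch of what "gradient ascent on the total reward" can look like in practice, here is one update step in the spirit of the plain Monte Carlo policy gradient (REINFORCE), applied to a linear softmax policy like the one above. This is only an illustration of the principle, not the exact derivation from the literature.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def grad_log_pi(theta, phi, action):
    """Gradient of log pi(action | s) for a linear softmax policy."""
    probs = softmax(theta @ phi)
    indicator = np.zeros_like(probs)
    indicator[action] = 1.0
    return np.outer(indicator - probs, phi)           # same shape as theta

def reinforce_step(theta, episode, alpha=0.01, gamma=0.99):
    """One gradient-ascent step on a sampled episode,
    given as a list of (state_features, action, reward) tuples."""
    G = 0.0
    for phi, action, reward in reversed(episode):
        G = reward + gamma * G                        # Monte Carlo return from this step onward
        theta = theta + alpha * G * grad_log_pi(theta, phi, action)
    return theta                                      # ascent: we move *up* the gradient
```

Note the sign of the update: the gradient is added rather than subtracted, because we want to climb the reward instead of descending a cost.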


In 2014, D. Silver et al. from DeepMind developed this policy gradient further and created the Deterministic Policy Gradient (DPG). Contrary to the PG, which is stochastic, the idea here is to make a deterministic PG or, as S. Kapoor put it in https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d, "instead of learning a large number of probability distributions, let us directly learn a deterministic action for a given state". The obvious way to do this would be to find the maximum of the possible reward for each state in order to identify the action to choose, but that requires searching the entire action space, which is not feasible in high dimensions. That is why DPG consists of approximating this maximum. Again, this is only the principle, and a few more steps are needed in practice.
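To give an idea of what "approximating this maximum" means, here is a toy numeric sketch of the deterministic policy gradient idea: the policy mu(s) outputs a single action, and we improve its parameter theta by following the chain rule through the action-gradient of a critic Q, rather than searching the whole action space. The critic used here is a made-up known function, only for illustration.

```python
import numpy as np

def mu(theta, s):
    return theta * s                                  # deterministic action for state s

def dQ_da(s, a):
    """Action-gradient of a pretend critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)

theta, alpha = 0.0, 0.1
for step in range(100):
    s = np.random.uniform(-1.0, 1.0)                  # sample a state
    a = mu(theta, s)
    # chain rule: grad_theta Q(s, mu(s)) = dmu/dtheta * dQ/da, and dmu/dtheta = s here
    theta += alpha * s * dQ_da(s, a)

print("learned theta:", theta)                        # drifts toward 2.0, the optimum of this toy Q
```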


That is it for today, but this subject is not over yet! Next time we will look at actor-critic methods and DDPG, a further development of DPG. Thanks again to towardsdatascience.com, and if you want to read the full DPG paper, follow this link: http://proceedings.mlr.press/v32/silver14.pdf
