Hello guys.
I'm fairly new to reinforcement learning. I've implemented DQN in the past, and now I'm working on A3C for a custom environment. In DQN I used an epsilon-greedy policy to force exploration, something like this:
# take a random action with probability eps, otherwise act greedily on the Q-values
if random.random() < eps:
    return random.randint(0, num_actions - 1)
else:
    return np.argmax(model.predict(state))
But in A3C I am using this instead:
policy = model.predict(state)  # action probabilities from the softmax policy head
return np.random.choice(num_actions, p=policy)  # sample an action instead of taking the argmax
As far as I know, this is used to make the model conservative about its actions: we try to encourage it to assign a much higher probability (close to 1) to good actions and so reduce unpredictability.
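If my understanding is right, the exploration is baked into the sampling itself: early on the policy is close to uniform, so sampling explores a lot, and as it sharpens the behaviour becomes almost greedy on its own. Here is a toy numpy sketch of that idea (the probability values are made up just to illustrate):

import numpy as np

num_actions = 4

# made-up policies: roughly uniform early in training, sharp later on
early_policy = np.array([0.26, 0.24, 0.25, 0.25])
late_policy = np.array([0.94, 0.02, 0.02, 0.02])

def entropy(p):
    # Shannon entropy in nats; higher means more unpredictable actions
    return -np.sum(p * np.log(p + 1e-12))

for name, p in [("early", early_policy), ("late", late_policy)]:
    action = np.random.choice(num_actions, p=p)
    print(name, "entropy:", round(entropy(p), 3), "sampled action:", action)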
In A3C we use a critic model to predict the value, which is basically an n-step return (the expected reward over the next n steps), right?
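For reference, this is how I currently understand the critic target: the worker collects n rewards and bootstraps with the critic's value of the last state. A rough sketch (function and variable names are just my own):

import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    # rewards: the n rewards r_t ... r_{t+n-1} collected by the worker
    # bootstrap_value: the critic's estimate V(s_{t+n}) for the last state
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# example: 3-step return with the critic predicting 5.0 for the final state
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=5.0))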
But my question is: why do we use different approaches? Can I use an epsilon-greedy policy in A3C, or vice versa (I've sketched below what I have in mind)? Which one is better, and when? Is there a certain type of environment that requires one of them? And what if the future reward in my environment is impossible to predict, but it is possible to develop a strategy that can beat the game? Say it's a game where you start from a random point and never know which obstacle will come out next, but you know for sure that you have to avoid them. Do I still have to predict the value then?
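To make that first question concrete, this is the kind of hybrid I'm imagining (just a rough sketch, not something I've tested; model and num_actions are the same as in my snippets above):

import random
import numpy as np

def choose_action(state, eps):
    policy = model.predict(state)  # softmax probabilities from the actor
    if random.random() < eps:
        # epsilon-greedy style: occasionally ignore the policy entirely
        return random.randint(0, num_actions - 1)
    # otherwise sample from the policy as usual in A3C
    return np.random.choice(num_actions, p=policy)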