The scalarized multi-objective multi-armed bandit problem: An empirical study of its exploration vs. exploitation tradeoff
Vrije Universiteit Brussel
The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards, in which
an agent repeatedly chooses among different arms to identify the optimal arm, i.e. the one with the highest mean reward, as soon as possible.
Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB problem because of
a good balance between exploitation and ...
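To make the exploitation and exploration terms concrete, here is a minimal sketch of a UCB-style policy on a single-objective bandit. It assumes the standard UCB1 index (empirical mean plus a sqrt(2 ln t / n) bonus) and Bernoulli-distributed rewards; the excerpt above does not say which UCB variant or reward distribution the study uses, and the name `ucb1` and its parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run a UCB1-style policy on Bernoulli arms (an assumed reward model).

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: pull every arm once so each index is defined.
            arm = t - 1
        else:
            # Index = empirical mean (exploitation)
            #       + confidence bonus that shrinks with more pulls (exploration).
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Simulated Bernoulli reward; the true means are unknown to the policy.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    return counts


if __name__ == "__main__":
    # Hypothetical two-armed example: the second arm has the higher mean.
    print(ucb1([0.2, 0.8], horizon=1000))
```

In this sketch the bonus term forces occasional pulls of rarely tried arms, while the empirical mean steers most pulls toward the arm that currently looks best.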