Decision making using Thompson Sampling

  • Joseph Mellor

Student thesis: Phd


The ability to make decisions is a crucial ability of many autonomous systems. In many scenarios the consequence of a decision is unknown and often stochastic. The same decision may lead to a different outcome every time it is taken. An agent that can learn to make decisions based purely on its past experience needs less tuning and is likely more robust. An agent must often balance between learning the payoff of actions by exploring, and exploiting the knowledge they currently have. The multi-armed bandit problem exhibits such an exploration-exploitation dilemma. Thompson Sampling is a strategy for the problem, first proposed in 1933. In the last several years there has been renewed interest in it, with the emergence of strong empirical and theoretical justification for its use.This thesis seeks to take advantage of the benefits of Thompson Sampling while applying it to other decision-making models. In doing so we propose different algorithms for these scenarios. Firstly we explore a switching multi-armed bandit problem. In real applications the most appropriate decision to take often changes over time. We show that an agent assuming switching is often robust to many types of changing environment. Secondly we consider the best arm identification problem. Unlike the multi-armed bandit problem, where an agent wants to increase reward over the entire period of decision making, the best arm identification is concerned in increasing the reward gained by a final decision. This thesis argues that both problems can be tackled effectively using Thompson Sampling based approaches and provides empirical evidence to support this claim.
Date of Award1 Aug 2014
Original languageEnglish
Awarding Institution
  • The University of Manchester
SupervisorJonathan Shapiro (Supervisor) & Mark Muldoon (Supervisor)


  • changepoint detection
  • thompson sampling
  • Bayesian
  • multi-armed bandit
  • best-arm identification

Cite this