Abstract
Online learning has traditionally focused on expected rewards. In this paper, a risk-averse online learning problem is studied under the mean-variance of the rewards as the performance measure. Both the bandit and full-information settings are considered. The performance of several existing policies is analyzed, and new fundamental limits on risk-averse learning are established. In particular, it is shown that although a logarithmic distribution-dependent regret in time T is achievable (as in the risk-neutral problem), the worst-case (i.e., minimax) regret is lower bounded by Ω(T) (in contrast to the Ω(√T) lower bound of the risk-neutral problem). This sharp difference from the risk-neutral counterpart is caused by the variance in the player's decisions, which, while absent from the regret under the expected-reward criterion, contributes excess mean-variance due to the non-linearity of this risk measure. The role of the decision variance in regret performance reflects a risk-averse player's desire for robust decisions and outcomes.
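To fix ideas, here is a minimal sketch of the mean-variance measure and the associated regret, using one common convention from the risk-averse bandit literature; the sign convention and the risk-tolerance coefficient ρ are assumptions, and the paper's exact definitions may differ.

```latex
% Mean-variance of arm $i$ (assumed convention: smaller is better), where
% $\mu_i$ is the mean reward, $\sigma_i^2$ the reward variance, and
% $\rho > 0$ a risk-tolerance coefficient (both conventions are assumptions):
\[
  \xi_i \;=\; \sigma_i^2 - \rho\,\mu_i ,
  \qquad
  i^\ast \;=\; \arg\min_i \, \xi_i .
\]
% Regret over $T$ rounds: the empirical mean-variance of the collected
% rewards $X_1,\dots,X_T$ compared against $T$ plays of the best arm $i^\ast$.
\[
  R(T) \;=\; T\,\widehat{\xi}_T \;-\; T\,\xi_{i^\ast},
  \qquad
  \widehat{\xi}_T \;=\; \widehat{\sigma}_T^2 - \rho\,\widehat{\mu}_T .
\]
```

Under this convention, the decision-variance effect described in the abstract is visible directly: the empirical variance is non-linear in the sample, so switching among arms with different means inflates it beyond any single arm's own variance, even when each arm is played with low per-arm variance.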
Original language | English
---|---
Title of host publication | 2019 IEEE 58th Conference on Decision and Control, CDC 2019
Subtitle of host publication | 11-13 December 2019, Nice, France
Place of Publication | New York
Publisher | IEEE
Pages | 2738-2744
Number of pages | 7
ISBN (Electronic) | 9781728113982
ISBN (Print) | 9781728113999
DOIs |
Publication status | Published - Dec 2019
Event | 58th IEEE Conference on Decision and Control, CDC 2019, Nice, France; duration: 11 Dec 2019 → 13 Dec 2019
Publication series
Name | Proceedings of the IEEE Conference on Decision and Control
---|---
Volume | 2019-December
ISSN (Print) | 0743-1546
ISSN (Electronic) | 2576-2370
Conference
Conference | 58th IEEE Conference on Decision and Control, CDC 2019
---|---
Country/Territory | France
City | Nice
Period | 11/12/19 → 13/12/19