Convergence analysis of temporal difference learning algorithms based on a test problem

Onder Tutsoy, Martin Brown, Hong Wang

    Research output: Chapter in Book/Conference proceeding › Conference contribution

    Abstract

    Reinforcement learning is a method that learns appropriate control actions by maximizing a numerical reward. Since it can solve complex control/learning problems without knowledge of the system dynamics, it has been applied to a variety of systems, such as humanoid robots and autonomous helicopters. However, the properties of the learnt value function, which encodes the long-term performance of learning, have not been examined on a simple system. In this paper, a simple first-order unstable plant with a piecewise linear control is introduced to enable an explicit parameter convergence analysis of the value function. A specific closed-form solution for the value function is determined, consisting of optimal parameters and an optimal polynomial basis. It is shown that a number of parameters occur that are functions of the plant parameters and the value function discount factor. It is also proved that the temporal difference error introduces an almost null space up to the cut-off point of the piecewise linear control. Moreover, it is shown that the residual gradient algorithm converges faster than TD(0).
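
    As a rough illustration of the setup described in the abstract, the following sketch contrasts TD(0) and residual gradient updates for a polynomial value function approximator on a first-order unstable plant under saturated (piecewise linear) state feedback. The plant gains (a, b), controller gain k, saturation limit, quadratic stage cost, step size, and discount factor are illustrative assumptions, not the paper's exact test problem or results.

    # Minimal sketch: TD(0) vs. residual gradient with a polynomial basis
    # on a first-order unstable plant with saturated linear state feedback.
    # All numerical values below are assumed for illustration only.
    import numpy as np

    a, b = 1.2, 1.0          # assumed unstable plant: x' = a*x + b*u (a > 1)
    k, u_max = 0.8, 0.5      # piecewise linear control: u = -k*x, saturated at +/- u_max
    gamma = 0.9              # value function discount factor
    alpha = 0.01             # learning rate
    n_basis = 4              # polynomial basis [1, x, x^2, x^3]

    def control(x):
        # Saturated linear feedback, i.e. a piecewise linear control law
        return np.clip(-k * x, -u_max, u_max)

    def step(x):
        u = control(x)
        x_next = a * x + b * u
        cost = x**2 + u**2   # assumed quadratic stage cost
        return x_next, cost

    def phi(x):
        # Polynomial basis vector for the value function approximation
        return np.array([x**i for i in range(n_basis)])

    def run(method, episodes=200, horizon=50, x0=0.3, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(n_basis)
        for _ in range(episodes):
            x = x0 * rng.uniform(-1.0, 1.0)
            for _ in range(horizon):
                x_next, cost = step(x)
                # Temporal difference (Bellman) error
                delta = cost + gamma * w @ phi(x_next) - w @ phi(x)
                if method == "td0":
                    # TD(0): semi-gradient step along phi(x) only
                    w += alpha * delta * phi(x)
                else:
                    # Residual gradient: true gradient of 0.5*delta^2
                    w -= alpha * delta * (gamma * phi(x_next) - phi(x))
                x = x_next
        return w

    print("TD(0) weights:            ", run("td0"))
    print("Residual gradient weights:", run("rg"))

    For the deterministic transitions used here, the residual gradient step is the exact negative gradient of the squared TD error, whereas TD(0) follows only the phi(x) part of that gradient; the abstract's claim is that, on this class of test problem, the residual gradient variant converges faster.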
    Original language: English
    Title of host publication: Proceedings of the IASTED International Conference on Control and Applications, CA 2012
    Pages: 315-322
    Number of pages: 7
    DOIs
    Publication status: Published - 2012
    Event: IASTED International Conference on Control and Applications, CA 2012 - Crete
    Duration: 1 Jul 2012 → …

    Conference

    Conference: IASTED International Conference on Control and Applications, CA 2012
    City: Crete
    Period: 1/07/12 → …

    Keywords

    • Polynomial basis function
    • Rate of convergence
    • Temporal difference learning
    • Value function approximation
