Adaptive critic for sigma-pi networks

Richard Stuart Neville, Thomas John Stonham

    Research output: Contribution to journal › Article › peer-review

    Abstract

    This article presents an investigation of how training of sigma-pi networks under the associative reward-penalty (A(R-P)) regime may be enhanced by using two networks in parallel. The technique uses what has been termed an unsupervised 'adaptive critic element' (ACE) to give critical advice to the supervised sigma-pi network. We utilise the conventions of the sigma-pi neuron model (i.e., quantisation of variables) to obtain an implementation, termed the 'quantised adaptive critic', which is hardware realisable. The associative reward-penalty training regime either rewards the neural network, r = 1, by incrementing the weights of the net by a delta term times a learning rate, α, or penalises it, r = 0, by decrementing the weights by an inverse delta term times the product of the learning rate and a penalty coefficient, α × λ(rp). Our initial research, utilising a 'bounded' reward signal, r* ∈ {0,...,1}, found that the critic provides advisory information to the sigma-pi net which augments its training efficiency. This led us to develop an extension of the adaptive critic and associative reward-penalty methodologies, utilising an 'unbounded' reward signal, r* ∈ {-1,...,2}, which permits penalisation of a net even when the penalty coefficient is set to zero, λ(rp) = 0. Note that with the standard associative reward-penalty methodology the net is normally penalised only if the penalty coefficient is non-zero (i.e., 0 < λ(rp) ≤ 1). One of the enigmas of associative reward-penalty (A(R-P)) training is that it broadcasts sparse information, in the form of an instantaneous binary reward signal, which depends only on the present output error. Here we put forward ACE and A(R-P) methodologies for sigma-pi nets that are based on tracing the frequency of 'stimuli' occurrence and then using this to derive a prediction of the reinforcement. The predictions are in turn used to derive a reinforcement signal that incorporates temporal information. Hence one may use more precise information to enable more efficient training.
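
    For orientation, the sketch below pairs a standard Barto-style A(R-P) weight update with a simple adaptive-critic predictor of reinforcement. It is a minimal illustration only: it assumes a single stochastic sigmoid unit rather than the paper's quantised sigma-pi net, and all names and parameters (arp_update, AdaptiveCritic, alpha, lam, beta, gamma) are illustrative assumptions, not the authors' implementation.

        import numpy as np

        rng = np.random.default_rng(0)

        def arp_update(w, x, r, alpha=0.1, lam=0.01):
            # One associative reward-penalty (A(R-P)) step for a single stochastic
            # binary unit (Barto-style form; a stand-in for a quantised sigma-pi net).
            p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # probability of firing
            y = float(rng.random() < p)               # stochastic binary output
            if r == 1:
                delta = alpha * (y - p) * x                 # reward: reinforce the emitted output
            else:
                delta = alpha * lam * ((1.0 - y) - p) * x   # penalty: push towards the opposite output
            return w + delta, y

        class AdaptiveCritic:
            # Minimal ACE-style critic: learns a prediction of reinforcement and
            # emits a temporally informed internal signal for the learning unit.
            def __init__(self, n_inputs, beta=0.05, gamma=0.9):
                self.v = np.zeros(n_inputs)   # prediction weights
                self.beta = beta              # critic learning rate
                self.gamma = gamma            # discount on the next prediction
                self.prev_pred = 0.0

            def internal_reinforcement(self, x, r):
                pred = float(np.dot(self.v, x))                  # predicted reinforcement
                r_hat = r + self.gamma * pred - self.prev_pred   # TD-style critique signal
                self.v += self.beta * r_hat * x                  # refine the prediction
                self.prev_pred = pred
                return r_hat

    In the spirit of the pairing described in the abstract, the critic's internal signal r_hat would then be rescaled into the bounded or unbounded reward signal r* that drives the A(R-P) update, rather than using the raw instantaneous reward alone.
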
    Original language: English
    Pages (from-to): 603-625
    Number of pages: 22
    Journal: Neural Networks
    Volume: 9
    Issue number: 4
    DOIs
    Publication status: Published - Jun 1996

    Keywords

    • Adaptive critic
    • Associative reward-penalty
    • Dynamic programming
    • Multi-cube
    • Reinforcement
    • Sigma-pi
