Abstract
This article investigates how training of sigma-pi networks under the associative reward-penalty (A(R-P)) regime may be enhanced by using two networks in parallel. The technique uses what has been termed an unsupervised 'adaptive critic element' (ACE) to give critical advice to the supervised sigma-pi network. We exploit the conventions of the sigma-pi neuron model (i.e., quantisation of variables) to obtain an implementation, termed the 'quantised adaptive critic', which is hardware-realisable. The associative reward-penalty training regime either rewards (r = 1) the neural network by incrementing its weights by a delta term times a learning rate, α, or penalises it (r = 0) by decrementing the weights by an inverse delta term times the product of the learning rate and a penalty coefficient, α × λ_rp. Our initial research, using a 'bounded' reward signal, r* ∈ {0,...,1}, found that the critic provides advisory information to the sigma-pi net which augments its training efficiency. This led us to develop an extension to the adaptive critic and associative reward-penalty methodologies, using an 'unbounded' reward signal, r* ∈ {-1,...,2}, which permits penalisation of a net even when the penalty coefficient is set to zero, λ_rp = 0. Note that with the standard associative reward-penalty methodology the net is normally penalised only if the penalty coefficient is non-zero (i.e., 0 < λ_rp ≤ 1). One limitation of associative reward-penalty (A(R-P)) training is that it broadcasts sparse information, in the form of an instantaneous binary reward signal, that depends only on the present output error. Here we put forward ACE and A(R-P) methodologies for sigma-pi nets which are based on tracing the frequency of occurrence of 'stimuli' and using this to derive a prediction of the reinforcement. The predictions are then used to derive a reinforcement signal that incorporates temporal information, so that more precise information is available for more efficient training.
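As a rough illustration of the update rule described above (reward increments the weights by a delta term times α; penalty decrements them by an inverse delta term times α × λ_rp), here is a minimal sketch for a single stochastic binary unit, assuming the standard Barto-Anandan form of A(R-P). The function and parameter names are our own, not the paper's:

```python
import numpy as np

def arp_update(w, x, y, r, alpha=0.1, lambda_rp=0.01):
    """One associative reward-penalty (A(R-P)) weight update.

    w         : weight vector
    x         : input (stimulus) vector
    y         : stochastic binary output actually emitted (0 or 1)
    r         : binary reinforcement signal (1 = reward, 0 = penalty)
    alpha     : learning rate
    lambda_rp : penalty coefficient (0 < lambda_rp <= 1 for standard A(R-P))
    """
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # Pr(y = 1 | x)
    if r == 1:
        # Reward: delta term moves p towards the emitted output y.
        delta = alpha * (y - p)
    else:
        # Penalty: inverse delta term moves p towards the opposite
        # output, scaled by the penalty coefficient lambda_rp.
        delta = alpha * lambda_rp * ((1 - y) - p)
    return w + delta * x
```

Note that with λ_rp = 0 this reduces to the associative reward-inaction (A(R-I)) variant, in which penalty trials leave the weights unchanged; this is the case the unbounded reward signal is designed to address.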
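The paper gives the exact quantised-critic construction; as a loose sketch of the idea (stimulus-frequency counts feeding a reinforcement prediction, and a temporally informed signal derived from successive predictions), assuming a discrete quantised-state index and a discount factor gamma of our own choosing:

```python
import numpy as np

class QuantisedCritic:
    """Sketch of an ACE-style critic over quantised stimuli.

    Keeps a visit count and a reinforcement prediction per quantised
    stimulus, and emits an internal reinforcement signal that compares
    successive predictions (temporal-difference style).
    """

    def __init__(self, n_states, gamma=0.9):
        self.counts = np.zeros(n_states)        # frequency of each stimulus
        self.predictions = np.zeros(n_states)   # predicted reinforcement
        self.gamma = gamma

    def internal_reinforcement(self, prev_state, state, r):
        # Temporal signal: actual reward plus the change in prediction.
        return (r + self.gamma * self.predictions[state]
                - self.predictions[prev_state])

    def update(self, state, r):
        # Frequency-traced running average of reinforcement per stimulus.
        self.counts[state] += 1
        step = 1.0 / self.counts[state]
        self.predictions[state] += step * (r - self.predictions[state])
```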
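One way an 'unbounded' signal spanning {-1,...,2} could arise (our illustration only, not necessarily the paper's construction) is by adding such a temporal-difference term, which lies in [-1, 1] when predictions are bounded in [0, 1], to the binary reward r ∈ {0, 1}; the sum can then go negative, penalising the net even when λ_rp = 0.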
| Original language | English |
| --- | --- |
| Pages (from-to) | 603-625 |
| Number of pages | 22 |
| Journal | Neural Networks |
| Volume | 9 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - Jun 1996 |
Keywords
- Adaptive critic
- Associative reward-penalty
- Dynamic programming
- Multi-cube
- Reinforcement
- Sigma-pi