**ADALINE (ADAptive LInear NEuron) Learning Rule**
This can be viewed as a slightly modified instance of the perceptron learning rule. In the perceptron rule, the thresholded result of the weighted sum of inputs is used to update the weights in each iteration. In ADALINE, the weighted sum itself (before thresholding) is used to update the weights during training.
- They are both classifiers for binary classification.
- Both have a linear decision boundary.
- Both can learn iteratively, sample by sample (the Perceptron naturally, and ADALINE via stochastic gradient descent).
- Both use a threshold function to produce the binary prediction (in ADALINE the threshold is applied only for prediction, not in the weight update).
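The difference in the update signal can be sketched in a few lines of NumPy (the function names, signs of the labels, and learning rate below are illustrative choices, not taken from the lecture):

```python
import numpy as np

def step(z):
    """Heaviside-style threshold used by both models for prediction."""
    return np.where(z >= 0.0, 1.0, -1.0)

def perceptron_update(w, x, t, lr=0.1):
    """Perceptron rule: the *thresholded* output drives the update."""
    y = step(np.dot(w, x))
    return w + lr * (t - y) * x

def adaline_update(w, x, t, lr=0.1):
    """ADALINE rule: the raw weighted sum (net input) drives the update."""
    net = np.dot(w, x)
    return w + lr * (t - net) * x
```

Note the consequence: the perceptron leaves a correctly classified sample untouched, while ADALINE still nudges the weights whenever the raw net input differs from the target.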
|![[ETH/ETH - Deep Learning in Artificial & Biological Neuronal Networks/Images - ETH Deep Learning in Artificial & Biological Neuronal Networks/image70.png]] | ![[ETH/ETH - Deep Learning in Artificial & Biological Neuronal Networks/Images - ETH Deep Learning in Artificial & Biological Neuronal Networks/image69.png]] |
|---|---|
**DELTA Learning Rule**
The Delta Rule uses the difference between the target activation (i.e., the target output value) and the obtained activation to drive learning. The weight update aims to directly minimize a neuron's output error for a target value $t_{j}$ and output value $y_{j}$, and can be derived via gradient descent on the squared error between these values. This error can be formulated as:
$E = \sum_{j}{\frac{1}{2}\left( t_{j} - y_{j} \right)^{2}}$
Finding the appropriate weight updates according to the gradient descent optimization method requires calculating the change in error with respect to each weight that we wish to update. This can be expressed as:
$\frac{\partial E}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}{\frac{1}{2}\left( t_{j} - y_{j} \right)^{2}}$
Assuming a model structured similarly to the perceptron and ADALINE learning rules, with a single layer of weights between the inputs and the output and a differentiable activation function $g$ (note that the Heaviside step function $\Theta(x)$ used by the perceptron has derivative zero almost everywhere, so in practice a linear or sigmoid $g$ is used), applying the chain rule yields:
$\mathrm{\Delta}w_{ji} = \alpha(t_{j} - y_{j})g'(h_{j})x_{i}$
where $h_{j} = \sum_{i}{x_{i}w_{ji}}$ is the net input, $y_{j} = g(h_{j})$ the output, $t_{j}$ the target value, $\alpha$ the learning rate and $g'$ the derivative of the activation function $g$. The weight update equation for the DELTA learning rule clearly shares some similarities with that of the perceptron learning rule: both contain an error term, the difference between target and output values $\left( t_{j} - y_{j} \right)$, multiplied by the input $x_{i}$. However, the DELTA rule adds some complexity, as it incorporates a learning rate $\alpha$ to modulate learning as well as the derivative of the activation function applied to the net input, $g'(h_{j})$. While the perceptron learning rule, particularly in light of its well-defined algorithm, frames the problem as shifting hyperplanes to adapt a decision boundary, the DELTA rule minimizes the sum of squared errors for a model with an activation function applied to a linear output. As previously mentioned in the contrast between the perceptron and ADALINE rules, the perceptron rule will either reach a stable zero-error solution (for linearly separable data) or oscillate indefinitely (otherwise). In contrast, the DELTA rule will, for a sufficiently small learning rate, converge to a minimum-error solution.
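A minimal sketch of one delta-rule step, choosing a sigmoid for $g$ so that the $g'(h_j)$ term is non-trivial (the function names, the sigmoid, and the learning rate are assumptions for illustration):

```python
import numpy as np

def sigmoid(h):
    # Differentiable activation g(h); its derivative is s * (1 - s).
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_prime(h):
    s = sigmoid(h)
    return s * (1.0 - s)

def squared_error(w, x, t):
    # E = 1/2 * (t - y)^2 with y = g(w . x)
    return 0.5 * (t - sigmoid(np.dot(w, x))) ** 2

def delta_update(w, x, t, alpha=0.1):
    # dw_ji = alpha * (t_j - y_j) * g'(h_j) * x_i
    h = np.dot(w, x)   # net input h_j = sum_i x_i w_ji
    y = sigmoid(h)     # output y_j = g(h_j)
    return w + alpha * (t - y) * sigmoid_prime(h) * x
```

Each step moves $w$ along the negative gradient of $E$, so the squared error decreases for a small enough $\alpha$, and the analytic update can be checked against a finite-difference estimate of the gradient.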
**DELTA Rule vs Perceptron Learning Rule**
We have seen that the DELTA rule and the perceptron learning rule for training single-layer perceptrons have similar weight update equations. However, the two algorithms were obtained from very different theoretical starting points. The perceptron learning rule was derived from a consideration of how we should shift around the decision hyperplanes for step-function outputs, while the DELTA rule emerged from a gradient descent minimization of the Sum Squared Error for a differentiable output activation function. The perceptron learning rule will converge to zero error, with no further weight changes, in a finite number of steps if the problem is linearly separable, but otherwise the weights will keep oscillating. On the other hand, the DELTA rule will (for a sufficiently small learning rate) always converge to a set of weights for which the error is a minimum, though the convergence to the precise target values will generally proceed at an ever-decreasing rate proportional to the output discrepancies.
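The convergence contrast can be illustrated on a toy non-separable dataset (the data and learning rate below are made up for illustration): the perceptron can never reach zero error and keeps updating, while the delta rule with a linear activation, i.e. the ADALINE/LMS case, settles near the least-squares minimum.

```python
import numpy as np

# Toy 1-D dataset with a bias feature; the labels are not linearly separable.
X = np.array([[-1.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
T = np.array([1.0, -1.0, 1.0])

def perceptron_epoch(w, lr=0.1):
    # Update from the thresholded output: zero change on correct samples.
    for x, t in zip(X, T):
        y = 1.0 if np.dot(w, x) >= 0.0 else -1.0
        w = w + lr * (t - y) * x
    return w

def delta_epoch(w, lr=0.1):
    # Linear activation g(h) = h, so g'(h) = 1 (the ADALINE/LMS case).
    for x, t in zip(X, T):
        w = w + lr * (t - np.dot(w, x)) * x
    return w

def sse(w):
    # Sum squared error 1/2 * sum_j (t_j - w . x_j)^2
    return 0.5 * np.sum((T - X @ w) ** 2)
```

For this dataset the least-squares optimum is $w^{*} = (0, 1/3)$ with $E = 4/3$: repeated delta-rule epochs hover near it, while the perceptron necessarily keeps misclassifying at least one sample.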
**Biologically Plausible Rules**
The second category of learning rules we will discuss are those that draw inspiration from Neuroscience and Biology.