Introduction
The Perceptron Learning Rule was one of the first approaches to modeling a neuron for learning purposes. It is based on the MCP (McCulloch-Pitts) neuron model.
This article tries to explain the underlying concept in a more theoretical and mathematical way.
The whole idea behind the MCP neuron model and the perceptron model is to minimally mimic how a single neuron in the brain behaves: it either fires or it doesn't.
The perceptron rule is thus fairly simple and can be summarized in the following steps:
- Initialize the weights to 0 or small random numbers.
- For each training sample \(x^{(i)}\):
  - Compute the output value \(\hat y\)
  - Update the weights based on the learning rule
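A minimal sketch of these steps in Python/NumPy, just to show the flow. The toy data, the learning rate `eta`, and the zero initialization are placeholders I made up; the decision function, bias unit, and update rule used here are the ones introduced in the sections below:

```python
import numpy as np

# Toy data: two samples with two features each, class labels in {1, -1}.
X = np.array([[2.0, 1.0],
              [0.5, -1.5]])
y = np.array([1, -1])

eta = 0.1                      # learning rate (introduced later in the article)
w = np.zeros(1 + X.shape[1])   # weights initialized to 0; w[0] plays the role of the bias unit

for xi, target in zip(X, y):
    # compute the output value y-hat with the step decision function (defined below)
    y_hat = 1 if np.dot(xi, w[1:]) + w[0] >= 0.0 else -1
    # update the weights based on the learning rule (derived below)
    update = eta * (target - y_hat)
    w[1:] += update * xi
    w[0] += update
```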
Terminology and components of the Perceptron
The perceptron learning rule involves the following components:
- Input vector / input matrix / input values: the data values, as a vector (a single sample), a matrix (the whole dataset), or rows.
- Weight vector: a column vector containing a weight for each dimension of the input.
- Net input: the linear combination of the input values (\(x\)) and the weight vector (\(w\)).
the net input \(z = w_{1}x_{1} + w_{2}x_{2} + … + w_{m}x_{m}\)
\[ w = \begin{bmatrix} w_{1} \\ \vdots \\ w_{m} \end{bmatrix}, \quad x = \begin{bmatrix} x_{1} \\ \vdots \\ x_{m} \end{bmatrix}, \quad \text{then} \quad z = w_{1}x_{1} + w_{2}x_{2} + \dots + w_{m}x_{m} \]
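In code, the net input is just a dot product between the two vectors; a tiny illustration with made-up numbers:

```python
import numpy as np

w = np.array([0.2, -0.5, 0.1])   # weight vector [w1, w2, w3]
x = np.array([1.0, 2.0, 3.0])    # one input sample [x1, x2, x3]

z = np.dot(w, x)                 # z = w1*x1 + w2*x2 + w3*x3
print(z)                         # 0.2 - 1.0 + 0.3 = -0.5
```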
Now, in the context of a binary classification task, if the net input of a particular sample \(x^{(i)}\) is greater than a defined threshold \(\theta\), we predict class 1, otherwise class -1.
In the case of the perceptron, the decision function is a variant of the unit step function:
\[ \phi(z) = \begin{cases} 1 & \text{if } z \ge \theta \\ -1 & \text{otherwise} \end{cases} \]
For simplicity, we can bring the threshold \(\theta\) to the left side of the equation and define a zero-weight \(w_{0} = -\theta\) and \(x_{0} = 1\), so that we can write it in a more compact form:
\(z = w_{0}x_{0} + w_{1}x_{1} + \dots + w_{m}x_{m} = x^{T}w\)
and
\[ \phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases} \]
In ML, this negative threshold, or weight, \(w_{0} = -\theta\), is usually called the bias unit.
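Folding the bias unit into the weight vector, a hypothetical `net_input`/`predict` pair could look like the following sketch (the helper names and numbers are illustrative, not from the original):

```python
import numpy as np

def net_input(x, w):
    # z = w0*1 + w1*x1 + ... + wm*xm, with w[0] acting as the bias unit (-theta)
    return np.dot(x, w[1:]) + w[0]

def predict(x, w):
    # decision function phi(z): 1 if z >= 0, -1 otherwise
    return np.where(net_input(x, w) >= 0.0, 1, -1)

w = np.array([-0.5, 0.2, 0.4])   # w[0] = -theta (bias), then w1, w2
x = np.array([1.0, 1.5])
print(predict(x, w))             # z = -0.5 + 0.2 + 0.6 = 0.3 >= 0, so class 1
```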
After each sample, all the weights in the weight vector are updated according to the rule:
\(w_{j} := w_{j} + \Delta w_{j}\) where,
\(\Delta w_{j} = \eta (y^{(i)} - \hat y^{(i)})x^{(i)}_{j}\)
where,
\(y^{(i)}\) = actual output
\(\hat y^{(i)}\) = predicted output
\(\eta\) = learning rate
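As a concrete (made-up) example of a single update, suppose a sample with actual label 1 was predicted as -1:

```python
import numpy as np

eta = 0.1                            # learning rate
x_i = np.array([1.0, 2.0])           # one training sample x^(i)
y_i, y_hat = 1, -1                   # actual output vs. predicted output (a misclassification)

delta_w = eta * (y_i - y_hat) * x_i  # eta * (y - y_hat) * x  ->  [0.2, 0.4]
w = np.zeros(2)
w += delta_w                         # w_j := w_j + delta_w_j
print(w)                             # [0.2 0.4]; delta_w is 0 whenever the prediction is correct
```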
Catch
For all the simplicity the perceptron rule offers, there's a catch when applying it to binary classification:
The convergence of the perceptron is only guaranteed if the two classes are linearly separable, i.e. they can be separated by a linear decision boundary.
If they are not, we can:
- Set a maximum number of passes over the dataset (epochs)
- Set a threshold for the maximum number of misclassifications

otherwise, the perceptron will never stop updating its weights.
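A sketch of how these safeguards can be wired into the training loop; the function and parameter names are mine, not from the original:

```python
import numpy as np

def fit_perceptron(X, y, eta=0.1, n_epochs=10):
    """Perceptron rule with a cap on the number of epochs, so training
    stops even when the classes are not linearly separable."""
    w = np.zeros(1 + X.shape[1])           # w[0] is the bias unit
    errors_per_epoch = []
    for _ in range(n_epochs):              # maximum number of passes over the dataset
        errors = 0
        for xi, target in zip(X, y):
            y_hat = 1 if np.dot(xi, w[1:]) + w[0] >= 0.0 else -1
            update = eta * (target - y_hat)
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)   # count misclassifications in this pass
        errors_per_epoch.append(errors)
        if errors == 0:                    # every sample classified correctly: converged
            break
    return w, errors_per_epoch
```

Tracking `errors_per_epoch` also makes it easy to see whether the misclassification count is shrinking or oscillating, which is a quick practical check for non-separable data.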