Lets say you have a dataset for binary classification. Each data point x has n-dimensions, and the target value is y = 1 or -1 for the two classes.
The perceptron algorithm can be visualized as follows.
For a given data point x, the prediction is simply based on whether the linear function f(x) = b + w1x1 + w2x2 + … + wn*xn is greater than or lesser than 0 (in the above diagram, w0 is b).
define predict(x): let f(x) := dot_product(x, w) + b prediction = 1 if f(x) > 0 and -1 if f(x) < 0
The learning algorithm is iterative. We start by initializing w with random values, and iteratively update its value.
define perceptron(): A: initialize weights w randomly (n dimensions) B: for each data point (x, y) in the dataset make prediction p given input x if p is not equal to y (i.e. prediction is wrong) w = w + y*x (update the weights) C: repeat step B until no points are misclassified
The algorithm works because if a point is misclassified, then the new weights are better for the point than the old weights. 
For example, suppose for the data point (x, y) that f(x) < 0 and y = 1. Then, after the update, we would want f(x) to be higher than before the update (so that after many updates, it becomes > 0).
Old weights = w, and new weights = w + x. It is easy to verify that
dot-product(w + x, x) + b > dot-product(w, x) + b
Here is an animation of the algorithm during training. Notice how the line moves only when a point is misclassified.
- Sometimes people initialize the weights to 0 instead of with random values (algorithm step A).
- Theoretical guarantee: If the dataset is linearly separable, then the perceptron algorithm will always find a function that separates the data.
- If data is not linearly separable, you can repeat step B of algorithm a fixed number of times.
This is my first time writing a tutorial. Hope you enjoyed, and I’d love some feedback. I tried to make the tutorial as clear and to the point as possible.