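I think Tyler's answer explains the reasons for squared error and mean very well. You can use other cost functions too; the choice comes down to how you want to measure prediction error. Squaring the error ensures that negative and positive errors relative to the regression line (or plane) don't cancel each other out, and it also penalizes large errors more heavily than small ones. Sometimes you want that extra weight on large errors. Other times (for example, when the data has outliers) you might not, and something like absolute error may suit you better.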
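To make the cancellation point concrete, here is a minimal sketch in plain Python. The residuals are made-up numbers and the function names are mine, not from the article; it just compares a signed average with the squared and absolute versions.

```python
# Residuals (prediction minus target) for a hypothetical fit.
# These values are invented purely for illustration.
residuals = [2.0, -2.0, 1.0, -1.0]


def mean_signed_error(errors):
    """Plain average: positive and negative errors cancel."""
    return sum(errors) / len(errors)


def mean_squared_error(errors):
    """Squaring removes the sign and weights large errors more heavily."""
    return sum(e * e for e in errors) / len(errors)


def mean_absolute_error(errors):
    """Absolute value removes the sign but weights all errors linearly."""
    return sum(abs(e) for e in errors) / len(errors)


print(mean_signed_error(residuals))    # 0.0, looks like a perfect fit but isn't
print(mean_squared_error(residuals))   # 2.5
print(mean_absolute_error(residuals))  # 1.5
```

The signed average reports zero error even though every prediction is off, which is why it is rarely used as a cost on its own.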

For 1), that's just a simplified representation. Think of each point W on the x-axis as one combination of weights. Among those there will be a combination W* that minimizes J (for squared-error linear regression it is unique as long as the features aren't redundant).
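As a rough illustration of "one combination of weights minimizes J", the sketch below fits a line to four made-up points using the standard closed-form least-squares solution for a single feature, then checks that nudging the weights in any direction makes J larger. The data and the names `cost` and `least_squares_fit` are my own assumptions, not taken from the article.

```python
# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]


def cost(w0, w1):
    """Mean squared error J for the line y = w0 + w1 * x."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def least_squares_fit():
    """Closed-form minimizer of J for one feature plus an intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w1 = s_xy / s_xx
    w0 = y_mean - w1 * x_mean
    return w0, w1


w_star = least_squares_fit()
j_star = cost(*w_star)

# Every nearby combination of weights gives a strictly larger J.
for dw0 in (-0.5, 0.0, 0.5):
    for dw1 in (-0.5, 0.0, 0.5):
        if (dw0, dw1) != (0.0, 0.0):
            assert cost(w_star[0] + dw0, w_star[1] + dw1) > j_star

print(w_star, j_star)
```

Here a "combination of weights" is the pair (w0, w1); the article's plot squashes that pair onto a single axis so the bowl shape of J is easy to draw.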

For 3), your representation can be rewritten to take the same form as the multiple linear regression expression in the article: set k0 = w0+w2+w4+..., so the separate constant terms collapse into a single intercept.
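Here is a small sketch of that rewrite. I'm assuming your representation is a sum of per-feature lines, (w0 + w1*x1) + (w2 + w3*x2) + (w4 + w5*x3), which is my guess at its shape; the weights and inputs are made up, and both function names are hypothetical.

```python
def per_feature_form(ws, xs):
    """Assumed form: each feature x_i gets its own constant ws[2i] and slope ws[2i+1]."""
    return sum(ws[2 * i] + ws[2 * i + 1] * x for i, x in enumerate(xs))


def standard_form(ws, xs):
    """Same model with all constants folded into one intercept k0."""
    k0 = sum(ws[0::2])   # k0 = w0 + w2 + w4 + ...
    slopes = ws[1::2]    # the remaining weights multiply the features
    return k0 + sum(k * x for k, x in zip(slopes, xs))


# Made-up weights and inputs, for illustration only.
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = [0.5, -1.0, 2.0]

print(per_feature_form(weights, features))  # 18.0
print(standard_form(weights, features))     # 18.0
```

Both forms give the same prediction for any input, so the extra constants add no modeling power; only their sum matters.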

For 4), I'd say that's just notation and convention. Polynomials are typically written with their terms in increasing order of power, and this is a case where the highest power is 1.