I have four unresolved questions after reading the above article. If anyone knows the answer to any of these, I'd be grateful for a reply!
2) For gradient descent, since W is a vector, I don't understand what it means to plot W along the X axis: the X axis is a single dimension, while W is multidimensional. What, specifically, happens to each element of the vector as we move to the right along the X axis?
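To make my confusion concrete, here's a minimal sketch of my mental model (the learning rate and gradient values are made up): every element of W gets nudged on each step, so I can't picture a single "W axis".

```python
import numpy as np

# Hypothetical weight vector and a made-up gradient of the cost w.r.t. W.
W = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, -0.4, 0.2])
lr = 0.01  # learning rate (arbitrary)

# One gradient-descent step: every component of W moves at once,
# so which single number is "moving right along the X axis"?
W_next = W - lr * grad
print(W_next)  # [ 0.499 -1.196  2.998]
```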
3) Mean squared error surprised me. My (I suppose naïve) expectation was that the cost function would simply sum the residuals, with the fit that produced the lowest total considered best. Why squared? Why mean?
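To show the kind of cost I had in mind, here's a toy comparison (assuming a plain signed sum of residuals, which may be part of my confusion): the raw residuals can cancel each other out even when the fit is clearly imperfect.

```python
# Toy data: one prediction is a unit too high, one exact, one a unit too low.
y_true = [2.0, 4.0, 6.0]
y_pred = [3.0, 4.0, 5.0]

# The cost I naively expected: a plain sum of residuals.
residual_sum = sum(t - p for t, p in zip(y_true, y_pred))

# Mean squared error, as the article uses.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(residual_sum)  # 0.0 — positive and negative errors cancel
print(mse)           # 0.666... — squaring keeps each error positive
```

Is this cancellation the whole story behind the squaring, or is there more to it?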
3) For multiple linear regression, why is the intercept term only considered for the "first" feature? I'd expected to see y = w0 + w1 * x1 + w2 + w3 * x2 + ... + w(2n) + w(2n+1) * x(n), i.e., a separate intercept per feature. I'm sure this doesn't even make sense mathematically, but from a conceptual perspective, can you explain why this isn't the case?
4) Is there a specific reason why ML turns y = mx + ...