Speaking purely from intuition, I would say for (2) that the **square** serves two purposes: (a) it makes every residual non-negative, much as an absolute value would, and (b) it weights residuals quadratically (not exponentially, strictly speaking). Quadratic weighting makes any "large" residual hurt the cost function more than it would under a linear weight (the unsquared absolute residual). The result should be a regression with more uniform residuals and fewer extreme ones.
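To make the weighting point concrete, here is a small sketch with made-up residuals (the numbers are purely illustrative, not from any real fit). It compares how much of the total cost comes from the one large residual under an absolute-value cost versus a squared cost.

```python
# Hypothetical residuals: three small ones and one large one.
residuals = [1.0, 1.0, 1.0, 10.0]

abs_costs = [abs(r) for r in residuals]   # linear weighting
sq_costs = [r ** 2 for r in residuals]    # squared weighting

# Fraction of the total cost contributed by the large residual.
abs_share = abs_costs[-1] / sum(abs_costs)   # 10 / 13
sq_share = sq_costs[-1] / sum(sq_costs)      # 100 / 103

print(f"share of linear cost from the large residual:  {abs_share:.2f}")  # 0.77
print(f"share of squared cost from the large residual: {sq_share:.2f}")   # 0.97
```

With squaring, the single large residual accounts for nearly all of the cost, so a minimizer has a much stronger incentive to shrink it.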

As for why a **mean**, I would say it is so that the result is independent of the number of data points used. A sum grows in proportion to the number of data points, while a mean does not. That probably makes comparison between data sets easier and the results more meaningful to people who do regressions in different problem spaces.
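A quick sketch of the sum-versus-mean point, again with invented residuals: two data sets whose residuals have the same typical size but very different point counts. The function names are just for illustration.

```python
def sum_squared_error(residuals):
    """Total of the squared residuals; grows with the number of points."""
    return sum(r ** 2 for r in residuals)

def mean_squared_error(residuals):
    """Average squared residual; does not depend on the number of points."""
    return sum_squared_error(residuals) / len(residuals)

# Hypothetical residuals of the same magnitude, different sample sizes.
small = [0.5, -0.5] * 5      # 10 points
large = [0.5, -0.5] * 500    # 1000 points

print(sum_squared_error(small), sum_squared_error(large))    # 2.5 250.0
print(mean_squared_error(small), mean_squared_error(large))  # 0.25 0.25
```

The sums differ by a factor of 100 only because one set has 100 times as many points, while the means agree, which is what you want when comparing fits across data sets.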