Chapter 7, page 234, says:

The new matrix is the same as the original one, but with the addition of alpha to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L^2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.

I can see the effect of L2 regularization on the "covariance-like" matrix in equation 7.17, but I'm not sure I follow the second sentence here. How does the higher perceived variance lead to smaller weights on features that have low covariance with the output?
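
To make the question concrete, here is a minimal numerical sketch of the closed-form solution the passage refers to, w = (X^T X + alpha I)^{-1} X^T y. The toy data, the function name `ridge_weights`, and the choice alpha = 1 are my own assumptions, not from the book. I picked X so that X^T X is diagonal, in which case each weight reduces to (X^T y)_i / ((X^T X)_ii + alpha).

```python
import numpy as np


def ridge_weights(X, y, alpha):
    """Closed-form L2-regularized least squares: (X^T X + alpha I)^{-1} X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)  # alpha is added to the diagonal
    return np.linalg.solve(gram, X.T @ y)


# Hypothetical toy data with orthogonal columns, so X^T X = diag(4, 1).
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
y = np.array([2.0, 1.0, 0.0])  # gives X^T y = [4, 1]

w_ols = ridge_weights(X, y, alpha=0.0)    # no regularization: [4/4, 1/1]
w_ridge = ridge_weights(X, y, alpha=1.0)  # regularized: [4/5, 1/2]

print("alpha = 0:", w_ols)
print("alpha = 1:", w_ridge)
```

In this toy case both unregularized weights are 1. With alpha = 1 the feature whose diagonal entry is 4 keeps a weight of 0.8, while the feature whose diagonal entry is 1 drops to 0.5, so the amount of shrinkage depends on how large the added alpha is relative to each diagonal entry.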
