I have a question regarding chapter 5.5, Maximum Likelihood Estimation. How does *argmax{sum(log(p_model(x;theta)))}* transform into the *argmax* of an expectation over the empirical distribution *p^hat_data*? The text says that "... we can divide by m to obtain a version of the criterion that is expressed as an expectation w.r.t. the empirical distribution *p^hat_data* defined by the training data". So my question is: **what is the form of (expression for) p^hat_data?**
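
For reference, here is the step I am asking about written out in LaTeX, as I read it from the text. The symbols m (number of training examples) and x^{(i)} (the i-th training example) are my own notation for the sum and may not match the book exactly.

$$
\theta_{ML}
= \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}(x^{(i)}; \theta)
= \arg\max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log p_{model}(x^{(i)}; \theta)
$$

$$
\theta_{ML} = \arg\max_{\theta} \; \mathbb{E}_{x \sim \hat{p}_{data}} \log p_{model}(x; \theta)
$$

The first line is clear to me, since scaling by 1/m does not change the argmax. It is the passage from the averaged sum to the expectation in the second line that I cannot match to a concrete expression.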

If it's just a uniform distribution, *p^hat_data = 1/m*, then minimizing the KL divergence w.r.t. *p^hat_data* seems wrong, since in most cases you wouldn't expect the natural data distribution *p_data* to be uniform.
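
To make the quantity concrete, here is a minimal Python sketch of my own (not from the book). It computes the averaged log-likelihood two ways: as a plain mean over the m samples, and as a sum grouped by distinct value. The toy data, the categorical table `p_model`, and all variable names are assumptions made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: m = 8 observations from some non-uniform source.
samples = ["a", "a", "a", "a", "b", "b", "c", "a"]
m = len(samples)

# Hypothetical categorical model p_model(x; theta); theta is just this table.
p_model = {"a": 0.6, "b": 0.3, "c": 0.1}

# (1/m) * sum_i log p_model(x_i; theta): every sample gets weight 1/m.
mean_log_lik = sum(math.log(p_model[x]) for x in samples) / m

# The same sum grouped by distinct value: each value gets weight count(x) / m.
counts = Counter(samples)
weighted_log_lik = sum((c / m) * math.log(p_model[x]) for x, c in counts.items())

print(mean_log_lik, weighted_log_lik)
```

The two numbers agree up to floating-point rounding. Whether the per-value weights count(x)/m are what the text means by *p^hat_data* is exactly the part I am unsure about.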