This is a really cool and clear introduction to MAP/MLE, especially since you take great pains to explain what all of the notation means. I'll definitely be pointing some people I know to this blog.

OT on technical blogs:
Experts are often unable to put themselves in the shoes of someone with no experience, which really harms the pedagogy. When one practices a technical topic for a long time, concepts that were once foreign and difficult become instinctual. This makes it very hard to understand in what ways a beginner could be tripped up. It takes a great deal of thought to avoid this problem, which I think is why so much introductory material (blog posts, books, etc.) is really sub-par.
Could someone explain in a bit more detail the move from 26 to 27? I don't get the significance of being "worried about optimization" or why/how we cancel p(x). I do get the later point about integration and the convenience of the reformulation; I just don't get why or how it is "allowed".

Sorry if this is obvious, but I have been doing a lot of reading on this and have come across this step a few times before... I am just missing some part of every explanation.
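For reference, here is my current (possibly wrong) reading of that step, assuming eqs. 26 and 27 are the usual Bayes-rule rewrite of the MAP objective:

    \hat{\theta}_{MAP} = \arg\max_\theta \; p(\theta \mid x)
                       = \arg\max_\theta \; \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
                       = \arg\max_\theta \; p(x \mid \theta)\, p(\theta)

Since p(x) does not depend on \theta, dividing by it scales every candidate \theta's score by the same positive constant, so it cannot change where the maximum sits. Is that the whole justification, or is there something subtler I'm missing?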
Nice, clear explanation. Looking forward to the Bayesian inference one!

One note though: I think on equation 25 you are missing a log on the left hand side.
Nice write-up! Minor nitpick: ML/MAP estimators don't _require_ observations to be independent. At least, in my field we're looking at a single observation of a multivariate distribution, and we don't need to assume the elements are independent (i.e., we permit a non-diagonal covariance matrix). My intuition says this is equivalent to assuming multiple correlated scalar observations, but I'd have to sit down with some paper. Also, you use "trough" where I think you mean "through."
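To make that concrete, here is a minimal sketch (my own toy numbers, using scipy; nothing from the post) of evaluating the likelihood of a single observation of a correlated multivariate Gaussian. At no point does it need independence or a diagonal covariance:

    import numpy as np
    from scipy.stats import multivariate_normal

    # One observation of a 2-D Gaussian whose components are correlated
    # (non-diagonal covariance). The likelihood of a candidate (mu, cov)
    # is still perfectly well defined, so ML/MAP estimation doesn't need
    # an independence assumption across the elements.
    x = np.array([1.2, -0.7])            # single multivariate observation
    mu = np.array([1.0, 0.0])            # candidate mean parameter
    cov = np.array([[1.0, 0.8],          # off-diagonal 0.8 encodes correlation
                    [0.8, 2.0]])

    print(multivariate_normal.logpdf(x, mean=mu, cov=cov))

An optimizer would just maximize that log-likelihood (plus a log-prior for MAP) over the parameters; whether it factorizes into per-element terms only affects how cheap the computation is, not whether the estimator is defined.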