Why Minimize Negative Log Likelihood?
One of the wonders of machine learning is the diversity of traditions from which it originates: classical statistics (both frequentist and Bayesian), information theory and control theory, plus a significant dose of pragmatism from computer science. For those interested in the historical relationship between statistics and machine learning, see Breiman's Two Cultures.
This diversity is reflected in the surprising complexity of answering simple-sounding questions, which often speak to the heart of trading with computational machine learning models, ranging from estimating HMM models via MLE (e.g. vol / correlation regime models) to non-convex optimization with non-standard likelihood or loss functions (e.g. portfolio optimization via omega):
Why is minimizing the negative log likelihood equivalent to maximum likelihood estimation (MLE)?
Or, equivalently, in Bayesian-speak:
Why is minimizing the negative log likelihood equivalent to maximum a posteriori probability (MAP), given a uniform prior?
Answering this question provides insight into the foundations of machine learning, as well as connections to several branches of mathematics.
Classical statistics opens the answer, beginning with the definition of a likelihood function for i.i.d. observations $x_1, \dots, x_n$ drawn from a density $f$ with parameter $\theta$:

$$\mathcal{L}(\theta \,|\, x_1, \dots, x_n) = f(x_1, \dots, x_n \,|\, \theta) = \prod_{i=1}^{n} f(x_i \,|\, \theta)$$
Applying the natural log function in this context is handy, for several reasons. First, numerical analysis reminds us that logs reduce the potential for underflow due to very small likelihoods. Second, algebra reminds us that logs permit the addition trick: converting a product of factors into a sum of terms (as seen before in Why Log Returns?). Finally, calculus reminds us that the natural log function is a monotone transformation. Taking the log of the likelihood thus yields the log-likelihood:

$$\ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n) = \sum_{i=1}^{n} \ln f(x_i \,|\, \theta)$$
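To make the underflow point concrete, here is a minimal sketch (using hypothetical numbers) in which a raw product of a thousand small likelihoods underflows to zero in double precision, while the equivalent sum of logs remains well-behaved:

```python
import numpy as np

# Hypothetical data: 1,000 i.i.d. observations, each with likelihood 1e-3.
probs = np.full(1000, 1e-3)

# The raw product is ~1e-3000, far below the smallest float64 (~1e-308),
# so it underflows to exactly 0.0 and all information is lost.
raw_product = np.prod(probs)

# The sum of logs is 1000 * ln(1e-3) ~ -6907.76: perfectly representable.
log_sum = np.sum(np.log(probs))

print(raw_product)  # 0.0
print(log_sum)
```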
Thus, the extrema of $\mathcal{L}$ are equivalent to the extrema of $\ln \mathcal{L}$:

$$\underset{\theta}{\operatorname{argmax}} \; \mathcal{L}(\theta \,|\, x_1, \dots, x_n) = \underset{\theta}{\operatorname{argmax}} \; \ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n)$$
From which the maximum likelihood estimator is defined as:

$$\hat{\theta}_{\mathrm{MLE}} = \underset{\theta}{\operatorname{argmax}} \; \ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n)$$
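As a sketch of the estimator in practice (with hypothetical Bernoulli data and a simple grid search, rather than a closed-form solution), minimizing the negative log-likelihood recovers the familiar sample-mean MLE for a coin's bias:

```python
import numpy as np

# Hypothetical data: 10 Bernoulli trials with 7 successes.
x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Candidate parameters on a grid, avoiding the log singularities at 0 and 1.
thetas = np.linspace(0.01, 0.99, 99)

# Negative log-likelihood: -sum_i ln f(x_i | theta) for the Bernoulli pmf.
nll = np.array([-np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

theta_hat = thetas[np.argmin(nll)]
print(theta_hat)  # matches the closed-form MLE, mean(x) = 0.7
```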
As an aside, Bayesians will remind us this can be generalized into a MAP estimator, given a uniform prior $g(\theta)$:

$$\hat{\theta}_{\mathrm{MAP}} = \underset{\theta}{\operatorname{argmax}} \; \ln \big[ \mathcal{L}(\theta \,|\, x_1, \dots, x_n) \, g(\theta) \big] = \underset{\theta}{\operatorname{argmax}} \; \ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n)$$

where the second equality holds because a uniform $g(\theta)$ contributes only an additive constant under the log.
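To illustrate the aside (a sketch using the same hypothetical grid-search setup), adding the log of a uniform prior shifts every candidate by the same constant, so the MAP and MLE estimates coincide:

```python
import numpy as np

# Hypothetical Bernoulli data: 6 trials with 4 successes.
x = np.array([1, 0, 1, 1, 0, 1])
thetas = np.linspace(0.01, 0.99, 99)

# Log-likelihood for each candidate theta.
log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

# Uniform prior g(theta) = 1 on (0, 1): its log is identically zero.
log_prior = np.zeros_like(thetas)

mle = thetas[np.argmax(log_lik)]
map_est = thetas[np.argmax(log_lik + log_prior)]  # constant shift changes nothing
print(mle, map_est)  # identical estimates
```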
From which optimization and real analysis remind us of the following equivalence, for any function $f$:

$$\underset{x}{\operatorname{argmax}} \; f(x) = \underset{x}{\operatorname{argmin}} \; \big[ -f(x) \big]$$
Thus, the following are equivalent:

$$\hat{\theta}_{\mathrm{MLE}} = \underset{\theta}{\operatorname{argmax}} \; \ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n) = \underset{\theta}{\operatorname{argmin}} \; \big[ -\ln \mathcal{L}(\theta \,|\, x_1, \dots, x_n) \big]$$
From this, we technically have an answer to the above two questions on equivalence. Yet from here lies the opportunity to continue and uncover the relationship between MLE/MAP and both entropy and loss, via the Kullback-Leibler divergence (KL). To get there, consider the statistical average of the above:

$$\frac{1}{n} \sum_{i=1}^{n} \ln f(x_i \,|\, \theta)$$
Which converges, by the strong law of large numbers, to the expectation under the true data-generating distribution:

$$\mathbb{E} \big[ \ln f(x \,|\, \theta) \big]$$
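A quick simulation sketch (standard normal data, with parameters assumed known) shows the sample average of the log-likelihood approaching its expectation, which for $N(0,1)$ is $-\tfrac{1}{2}\ln(2\pi) - \tfrac{1}{2} \approx -1.4189$:

```python
import numpy as np

# One million draws from the true distribution N(0, 1).
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# Log-density of N(0, 1) evaluated at each sample.
log_f = -0.5 * np.log(2 * np.pi) - 0.5 * x**2

# Sample average vs. the analytic expectation E[ln f(x)].
avg = log_f.mean()
expected = -0.5 * np.log(2 * np.pi) - 0.5
print(avg, expected)
```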
Which is interesting when considering the difference between the expected log-likelihood under the true parameter $\theta_0$ and under a candidate parameter $\theta$:

$$\mathbb{E}_{\theta_0} \big[ \ln f(x \,|\, \theta_0) \big] - \mathbb{E}_{\theta_0} \big[ \ln f(x \,|\, \theta) \big] = \mathbb{E}_{\theta_0} \left[ \ln \frac{f(x \,|\, \theta_0)}{f(x \,|\, \theta)} \right]$$
Which is indeed equal to none other than the KL divergence, $D_{\mathrm{KL}}\big( f(\cdot \,|\, \theta_0) \,\|\, f(\cdot \,|\, \theta) \big)$, between $f(\cdot \,|\, \theta_0)$ and $f(\cdot \,|\, \theta)$:

$$D_{\mathrm{KL}}\big( f(\cdot \,|\, \theta_0) \,\|\, f(\cdot \,|\, \theta) \big) = \int f(x \,|\, \theta_0) \, \ln \frac{f(x \,|\, \theta_0)}{f(x \,|\, \theta)} \, dx$$
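As a numerical check (a sketch with two unit-variance Gaussians, for which the KL divergence has the closed form $(\mu_0 - \mu_1)^2 / 2$), the divergence between the two densities can be approximated on a grid:

```python
import numpy as np

# Densities under theta_0 = N(0, 1) and theta = N(1, 1) on a fine grid.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
f0 = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
f1 = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of the integral of f0 * ln(f0 / f1).
kl = np.sum(f0 * np.log(f0 / f1)) * dx
print(kl)  # ~ 0.5, the closed form (0 - 1)^2 / 2
```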
Which information theory reminds us is relative entropy, and thus is also equal to the excess risk for the loss function defined by the negative log-likelihood. Finally, connecting Bayesian statistics to the foundations of information theory: the gain in Shannon entropy in going from prior to posterior is indeed the KL divergence.
Thus, maximum likelihood and maximum a posteriori estimation correspond to minimizing special-case loss functions (see Loss Function Semantics for more on loss semantics in ML).