The blog Computational Information Geometry Wonderland pointed me toward the article “k-MLE: A fast algorithm for learning statistical mixture models” by Frank Nielsen (2012). $k$-means can be viewed as alternating between 1) assigning points to clusters and 2) performing a maximum likelihood estimation (MLE) of the means of spherical Gaussian clusters (all of which are forced to have the same covariance matrix equal to a scalar multiple of the identity). If we replace the spherical Gaussians with another family of distributions, we get $k$-MLE. Nielsen does a remarkably good job of introducing the reader to some complex concepts without requiring anything other than a background in probability and advanced calculus. He explores the relationships between $k$-MLE, exponential families, and information geometry. Along the way he exposes the reader to Bregman divergences, cross-entropy, Legendre duality, the Itakura-Saito divergence, and the Burg matrix divergence.
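To make the analogy concrete, here is a minimal sketch of that alternating scheme for the spherical-Gaussian case (this is my own illustration, not Nielsen's code; the function name and parameters are made up). Because the covariance is a fixed scalar multiple of the identity, the assignment step reduces to nearest-center under squared Euclidean distance and the MLE step reduces to averaging each cluster, i.e. the familiar Lloyd's $k$-means. Swapping in another exponential family would change the divergence used in step 1 and the MLE formula used in step 2, which is the $k$-MLE idea.

```python
import numpy as np

def k_mle_spherical_gaussian(X, k, n_iters=20, seed=None):
    """Sketch of k-MLE specialized to spherical Gaussians with a shared
    isotropic covariance: alternate (1) hard assignment of points to
    clusters and (2) per-cluster maximum likelihood estimation of the mean.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # 1) Assignment step: nearest center under squared Euclidean
        #    distance (the Bregman divergence of a spherical Gaussian).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2) MLE step: for a spherical Gaussian, the maximum-likelihood
        #    estimate of the mean is the cluster's empirical average.
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    return centers, labels
```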