The blog Computational Information Geometry Wonderland pointed me toward the article “k-MLE: A fast algorithm for learning statistical mixture models” by Frank Nielsen (2012).  $k$-means can be viewed as alternating between 1) assigning points to clusters and 2) performing a maximum likelihood estimation (MLE) of the mean of spherical Gaussians clusters (all of which are forced to have the same covariance matrix equal to a scalar multiple of the identity).  If we replace the spherical Gaussian with another set of distributions, we get $k$-MLE.  Nielsen does a remarkably good job of introducing the reader to some complex concepts without requiring anything other than a background in probability and advance calculus.  He explores the relationships between $k$-MLE with exponential families and information geometry.  Along the way he exposes the reader to Bregman divergences,  cross-entropy,  Legendre dualityItakura-Saito divergence, and Burg matrix divergence.