You are currently browsing the archive for the Math category.

I’ve been reading “Category Theory for Programmers” which was suggested to me by Mark Ettinger.  This book presents many examples in C++ and Haskell.  It teaches you some Haskell as you read the book.  It uses almost zero upper level mathematics and it skips almost all of the mathematical formalities.  If you decide that you want to read it, then you might want to read the first six chapters of “Learn You a Haskell for Great Good!” and write a few small Haskell programs first.  (I also would suggest trying to solve the first three problems in Project Euler  using Haskell.)

I find the book to be easy to read and informative.  When the author makes a statement like   A*(B+C) = A*B + A*C where * means categorical product, + means coproduct, and = means isomorphic, I find myself trying to figure out the categories where the statement is true and the categories for which it is false.  (It is true for the Category of Set and the Category Hask.  The book is mostly about those categories.) That type of thinking improves my understanding of category theory.  The book is also reawakening the parts of my brain that had forgotten parts of category theory and Haskell.

Interestingly, in category theory, $A*(B+C) = A*B + A*C$ can be translated into the following theorems :

  1.  A*(B+C) = A*B + A*C  is true for all positive integers A,B, and C,
  2. max(A, min(B,C)) = min( max(A,B), max(A,C))  for all real numbers A, B, and C,
  3. lcm(A, gcd(B,C)) = gcd( lcm(A,B), lcm(A,C) )   where  lcm means least common multiple and gcd means greatest common denominator, and
  4. intersection(A, union(B,C)) = union( intersection(A,B), intersection(A, C)).
If you don’t believe the four theorems, here is some Mathematica Code which tests each theorem:
test[ funcRandChoose_, prod_, sum_, i_] := Tally[ Table[ 
     A = funcRandChoose[];
     B = funcRandChoose[];
     C = funcRandChoose[];
      prod[ A, sum[B, C]] == sum[ prod[A, B] , prod[A, C]], 

test[ RandomInteger[{1, 1000}] &, Times, Plus, 100]
test[ RandomInteger[{-1000, 1000}] &, Max, Min, 100]
test[ RandomInteger[{-1000, 1000}] &, LCM, GCD, 100]
test[ RandomSample[ Subsets[ Range[5]]] &, Intersection, Union, 100]

=\sum_{n=0}^\infty \frac{x^n}{n!}\cdot\sum_{n=0}^\infty \frac{(-x)^n}{n!}
=\sum_{n=0}^\infty \frac{x^n}{n!}\cdot\sum_{n=0}^\infty(-1)^n \frac{x^n}{n!}
=\sum_{n=0}^\infty\sum_{i=0}^n (-1)^i \frac1{i!}\frac1{(n-i)!}x^n \quad\mathrm{collecting\ coef\ of\ } x^n
= 1+ \sum_{n=1}^\infty \frac{ (1-1)^n}{ n!} x^n = 1
using the binomial theorem on the second to last equality.

So, I noticed Brian Rushton’s question on He asked,

“What should be included in a freshman ‘Mathematics for computer programmers’ course?”

User dtldarek had a pretty cool reply broken up into three parts:

  • Math that is actually useful.
  • Math that you can run into, and is generally good to know.
  • Math that lets you build more awesome math.

His answer got me to think about and revise my “100 Most useful Theorems and Ideas in Mathematics” post.  I wondered which of the 100 I used every week. Rather than trying to think about every week, I just thought it would be interesting to identify which ideas and theorems I used this week excluding the class I am teaching.

The first several ideas on the top 100 list, you can’t seem to avoid:  counting, zero, decimal notation, addition, subtraction, fractions, and negative numbers. Everyone uses these ideas when they buy anything or check their change.  (I wonder how many people rarely count the change from the cashier.)

The next group was mostly about algebra:  equality, substitution, variables, equations, distributive property, polynomials, and associativity.  Every week I attend two math seminars (Tensor Networks and Approximation Theory), so I use these ideas every week, just to add to the discussion during the seminar.

Strangely, I can’t remember using scientific notation this week except while teaching.  During class, I briefly mentioned the dinosaur killing asteroid (see the Chicxulub crater).  It’s hard to calculate any quantity about the asteroid (energy = 4 ×1023 Joules, mass = 2 ×1015 kg ?,  velocity $\approx$ 20000 meters/sec) without using scientific notation.  But, other than during class, I don’t remember using scientific notation.

During the Approximation Theory Seminar, we discussed polynomial approximations to the absolute value function so we used polynomials, infinity, limits, irrational numbers (square roots), the idea of a function, inequalities, basic set theory, vector spaces esp. Hilbert/Banach Spaces, continuity, Karush–Kuhn–Tucker conditions, linear regression, the triangle inequality, linearity, and symmetry during our discussion.  That seems like a lot, but when you have been using these ideas for decades, they are just part of your vocabulary.

So between the approximation theory seminar and daily restaurant mathematics, I used most of the first 40 items from the top 100 list.   Earlier in the week, I wrote up a simple category theory proof about isomorphisms and pull-backs.  Turns out, if you pullback an isomorphism and another arrow, then you get another isomorphism.  For example, in the commutative diagram below the top arrow and the bottom arrow are both isomorphisms (isometries even) and the entire diagram is a pullback.  If the bottom arrow is an isomorphism and the entire diagram is a pullback, then the top arrow is an isomorphism.  (The reverse is not true.)




Working on the proof got me thinking about trig functions, modular arithmetic, injective/surjective functions, Fourier transforms, duality, eigenvalues, metric spaces, projections, the Reiz Representation Theorem, Plancherel’s theorem, and numerical integration.

Probability and information theory were largely missing from my week until I stopped by Carl’s house yesterday. He and I got into a discussion about category theory, information theory, and neural networks.  One thing we noticed was that 1-1 functions are the only functions that conserve the entropy of categorical variables. For example, if $X\in\{1,2,3,4,5,6\}$ is a die roll, then $Y=f(X)$ has the same entropy as $X$ only if $f$ is 1-1.  Turns out you can characterize many concepts from Category theory using entropy and information theory.  So we used random variables, probability distributions, histograms, the mean (entropy is the mean value of the log of the probability), standard deviations, the Pythagorean theorem, statistical independence, real numbers, correlation, the central limit theorem, Gaussian Distributions (the number e), integrals, chain rule, spheres, logarithms, matrices, conic sections (the log of the Gaussian is a paraboloid), Jacobians, Bayes Theorem, and compactness.

I almost avoided using the Pigeon Hole Principle, but then Carl argued that we could use the definition of average value to prove Pigeon Hole Principle.  I still don’t feel like I used it, but it did come up in a discussion.

Notably missing from my week were (surprisingly): derivatives, volume formulas (volume of a sphere or cube), Taylor’s theorem, the fundamental theorem of calculus, and about half of the ideas between #63 Boolean algebra and #99 uncountable infinity.

So, it looks like I typically use around 70 ideas from the list each week—more than I thought.


I will probably get back to 2048 stuff later this month and maybe a post on Fourier Transforms.


Have a great day!    – Hein



Equation from Teacher’s Choice

So I have been wanting to understand Category Theory (see [1], [2], [3]) mainly because I thought it would help me understand advanced functional programming in Haskell and because of the book “Physics, topology, logic and computation: a Rosetta Stone” by Baez and Stay (2011).  A number of the constructs in Haskell were derived from Category Theory so that was enough motivation for me to learn it.  But also, monoidal categories are useful in several fields: physics, computer science, pure mathematics, and statistics. The biggest application for me is the idea of generalizing probabilistic graphical models through either Ologs or monoidal categories / tensor networks (see Brendan Fong’s thesis “Causal Theories: A Categorical Perspective on Bayesian Networks“).

To start my investigation of Category Theory, I began with the $20 thin book “Basic Category Theory for Computer Scientists” by Benjamin Pierce (see also the free online version “A taste of category theory for computer scientists” (1988)).

I really enjoyed the book.  Dr. Pierce’s style is a little informal compared to pure math books like Mac Lane’s “Categories for the Working Mathematician” (pdf / Amazon), but I enjoy that more relaxed style of writing when I am first learning a field.

The biggest obstacle for learning category theory is the fact that category theory generalizes a lot of areas of pure mathematics like topology, abstract algebra, and geometry.  It’s hard to generalize before you have examples to generalize, but the examples being generalized in category theory are mostly from higher level mathematics found in senior level undergraduate and graduate level courses. Pierce ameliorates this problem by introducing some of the most basic categories first: sets, ordered sets, partially ordered sets, groups, monoids, vector spaces, measure spaces, topological spaces, proofs, and a simple functional computer language.  He takes the time to explicitly define most of these ideas, so, in theory, you could read this book without a background in theoretical mathematics, but it would be hard.

After defining categories and introducing the most basic categories, Pierce describes and defines the most basic ideas in category theory: subcategories, commutative diagrams, monomorphisms, epimorphisms, isomorphisms, initial/terminal objects, products, coproducts, universal constructions, equalizers, pullbacks, pushouts, limits, cones, colimits, cocones, exponentiation, and closed Cartesian categories. These ideas are spelled out over the thirty pages of chapter one including illuminating homework exercises. The homework exercises varied significantly in difficulty.  Many of the exercises were trivial and there are two or three that I am still working on despite investing several hours of thought.  Generally, I found the exercises to be a bit harder than those in Mac Lane’s book, but Pierce’s book required less of a background in mathematics.  A couple of the exercises were incorrectly stated or impossible.

Chapter two introduced functors, natural transformations, adjoints, and F-algebras. After reading this chapter, I was finally able to understand the definition of monads which are an important part of the computer language Haskell!  Pierce provides many examples of each of these ideas and enjoyable homework exercises to increase understanding.  Pierce’s definition of adjoints is much easier to understand than the standard definitions using counit adjunction or Hom Sets.

The last major chapter concerns applications of category theory to computer science–specifically lambda-calculus and programming language design.  

The first two chapters of the book give a reasonable condensed introduction to category theory for students that have taken a course in abstract algebra.  A course in topology or linear algebra would be another useful prerequisite.  I carried around the light 100 page book for a few months so that I could learn something whenever I had some extra time.  I had hoped that when I had proven that several functors where monads, I would then really understand monads, but a full understanding still eludes me.  Similarly, I have proven that several functor pairs are adjoint, but I still don’t feel as though I understand adjoint functors.  I guess more work and learning are needed.

Have a great year.  –  Cheers, Hein


Orthogonal functions are very useful and rather easy to use.  The most famous ones are sin(n x) and cos(n x) on the interval [$-\pi$, $\pi$].  It is extremely easy to do linear regression with orthogonal functions.

Linear regression involves approximating a known function $f(x)$ as the weighted sum of other functions.  This shows up all the time in various disguises in machine learning:

  • When using AdaBoosting we perform a weighted sum of weak classifiers. 
  • Support vector machines find a linear combination of the features that separates the data into two classes.  
  • Each node in a neural net typically computes a weighted sum of the nodes in the prior layer.

For orthogonal functions, computing the linear regression is simple.  Suppose that the features are $f_1, f_2, \ldots, f_n$ and the functions $f_i$ are orthonormal, then the best approximation of the function $g(x)$ is the same as the solution to the linear regression

$$ g(x) \approx \sum_{i=1}^n \alpha_i f_i(x)$$

where $\alpha_i$ is the “inner product” of $f_i$ and $g$ denoted

$\alpha_i= \langle\;g, f_i\rangle = \int g(x) f(x) w(x) dx.$  (Two function are orthogonal if their inner product (or dot product) is zero.  The set of orthogonal functions $\{f_i\}$ is called orthonormal if $<f_i, f_i>=1$ for all $i$.)  The function $w(x)$ is a positive weight function which is typically just $w(x)=1$.

For example, if we try to approximate the function $g(x)=x^2$ on the interval [-1,1] by combining the features $f_1(x) = 1/\sqrt(2)$ and $f_2(x) = x \sqrt{3 \over 2} $, then

$$ g(x) \approx \sum_{i=1}^n \alpha_i f_i(x)$$


$$\alpha_1 = \langle g, f_1\rangle = \int_{-1}^1 g(x) f_1(x) dx =  \int_{-1}^1 x^2 1/\sqrt2 dx = \sqrt{2}/3,$$

$$\alpha_2 = \langle g, f_2\rangle = \int_{-1}^1 g(x) f_2(x)  dx=  \int_{-1}^1 x^2  \sqrt{3/2}\; x dx =0,$$

and the best approximation is then $\sum_{i=1}^n \alpha_i f_i(x) = \alpha_1 f_1(x) = 1/3$.


The other day I needed to do a linear regression and I wanted to use orthogonal functions, so I started by using the trigonometric functions to approximate my function $g(x)$.  Using trig functions to approximate another function on a closed interval like [-1,1] is the same as computing the Fourier Series.  The resulting approximations were not good enough at x=-1 and x=1, so I wanted to change the weight function $w(x)$.  Now assume that $u_1(x), u_2(x), \ldots, u_n(x)$ are orthonormal with the weight function w(x) on [-1,1].  Then

$$ \int_{-1}^1 u_i(x) u_j(x) w(x) dx = \delta_{i,j}$$

where $\delta_{i,j}$ is one if $i=j$ and zero otherwise.  If we make a change of variable $x=y(t)$, then

$$ \int_{y^{-1}(-1)}^{y^{-1}(1)} u_i(y(t)) u_j(y(t)) w(y(t)) y'(t) dx = \delta_{i,j}.$$

It would be nice if

$$(1)\quad w(y(t)) y'(t)=1,$$

because then I could just set $u_i(y(t))$ equal to a trig function and the resulting $u_i(x)$ would be orthonormal with respect to the weight function $w(x)$. Condition (1) on $y$ is really an ordinary differential equation

$$y’ = {1\over{w(y)}}.$$

In order to get a better fit at $x=\pm 1$, I tried two weight functions $w(x) = {1\over{(1-x)(1+x)}}$ and $w(x) = {1\over{\sqrt{(1-x)(1+x)}}}$.  For the former, the differential equation has solutions that look similar to the hyperbolic tangent function. The latter $w(x)$ yields the nice solution

$$y(t) = \sin( c + t).$$

So now if I choose $c=\pi/2$, then $y(t) = \cos(t)$, $y^{-1}(-1) = \pi$, $y^{-1}(1)=0$, so I need functions that are orthonormal on the interval $[-\pi, 0]$.  The functions $\cos(n x)$ are orthogonal on that interval, but I need to divide by $\sqrt{\pi/2}$ to make them orthornormal.  So, I set $u_n(y(t)) = \sqrt{2/\pi}\cos(n\;t)$ and now

$$ \int_{-1}^1 {{u_i(x) u_j(x)}\over {\sqrt{(1-x)(1+x)}}}  dx = \delta_{i,j}.$$

The $u_n(x)$ functions are orthonormal with respect to the weight $w(x)$ which is exactly what I wanted.

Solving for $u(x)$ yields

$$u_n(x) = \sqrt{2/\pi} \cos( n \;\arccos(x)),$$

$$u_0(x) = \sqrt{2/\pi}, u_1(x) = \sqrt{2/\pi} x,$$

$$u_2(x) = \sqrt{2/\pi}(2x^2-1), u_3(x) = \sqrt{2/\pi}(4x^3-3x),\ldots.$$


If you happen to be an approximation theorist, you would probably notice that these are the Chebyshev polynomials multiplied by $\sqrt{2/\pi}$.   I was rather surprised when I stumbled across them, but when I showed this to my friend Ludmil, he assured me that this accidental derivation was obvious.  (But still fun :) )

You can find anything on the web.  Earlier today, I noticed that the Sterling approximation and the Gaussian distribution both had a $\sqrt{2 \pi}$ in them, so I started thinking that maybe you could apply Sterling’s approximation to the binomial distribution with $p=1/2$ and large $n$ and the central limit theorem to derive the Gaussian distribution.  After making a few calculations it seemed to be working and I thought, “Hey, maybe someone else has done this.”  Sure enough, Jake Hoffman did it and posted it here.  The internet is pretty cool.

I was reading MarkCC’s gripe about the misuse of the word dimension and it reminded me about the Hausdorff dimension.  Generally speaking, the Hausdorff dimension matches our normal intuitive definition of dimension, i.e the H-dimension of a smooth curve or line is 1, the H-dimension of a smooth surface is 2, the H-dimension of the interior of a cube is 3. But for fractals, the H-dimension can be a non-integer. For the Cantor set, the H-dimension is 0.63. For more information, check out the Wikipedia article


Check out Terence Tao‘s wonderful post “Matrix identities as derivatives of determinant identities“.

« Older entries