You can find anything on the web. Earlier today, I noticed that Stirling’s approximation and the Gaussian distribution both have a $\sqrt{2 \pi}$ in them, so I started thinking that maybe you could apply Stirling’s approximation to the binomial distribution with $p=1/2$ and large $n$ to derive the Gaussian distribution of the central limit theorem. After making a few calculations it seemed to be working, and I thought, “Hey, maybe someone else has done this.” Sure enough, Jake Hofman did it and posted it here. The internet is pretty cool.
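Here is a quick numeric check of the idea (my own sketch, not the linked derivation): the Binomial$(n, 1/2)$ pmf should be close to the Gaussian density with matching mean $n/2$ and variance $n/4$ for large $n$.

```python
# Sketch: compare Binomial(n, 1/2) against the Gaussian with the
# same mean and variance, near the peak, for large n.
import math

n = 1000
mu, sigma = n / 2, math.sqrt(n / 4)

for k in range(480, 521, 10):
    binom = math.comb(n, k) * 0.5**n
    gauss = math.exp(-((k - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
    print(f"k={k}: binomial={binom:.6f}, gaussian={gauss:.6f}")
```

The two columns agree to several decimal places, which is the de Moivre–Laplace special case of the central limit theorem.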
I was reading MarkCC’s gripe about the misuse of the word dimension and it reminded me about the Hausdorff dimension. Generally speaking, the Hausdorff dimension matches our normal intuitive definition of dimension, i.e., the H-dimension of a smooth curve or line is 1, the H-dimension of a smooth surface is 2, and the H-dimension of the interior of a cube is 3. But for fractals, the H-dimension can be a non-integer. For the Cantor set, the H-dimension is $\log 2 / \log 3 \approx 0.63$. For more information, check out the Wikipedia article:
http://en.wikipedia.org/wiki/Hausdorff_dimension
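The Cantor set value comes from the standard self-similarity computation (not spelled out above, but worth recording): the Cantor set consists of two copies of itself, each scaled by a factor of $1/3$, so its dimension $d$ must satisfy

$$2 \left( \tfrac{1}{3} \right)^d = 1 \quad \Longrightarrow \quad d = \frac{\log 2}{\log 3} \approx 0.6309.$$

For self-similar sets like this one (which satisfy the open set condition), this similarity dimension agrees with the Hausdorff dimension.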
Check out Terence Tao’s wonderful post “Matrix identities as derivatives of determinant identities”.
Thank you to Freakonometrics for pointing me toward the book “Proofs without Words” by Roger Nelsen. It might be a nice Christmas present.
Matroids are an abstraction of vector spaces. Vector spaces have addition and scalar multiplication, and the axioms of vector spaces lead to the concepts of subspaces, rank, and independent sets. All vector spaces are matroids, but matroids retain only the properties of independence, rank, and subspaces. The matroid closure of a set plays the same role as the span of a set in a vector space. (The standard axioms are sketched below.)
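For concreteness, here is the usual axiomatization by independent sets (a standard definition, not something from the post itself): a matroid is a finite ground set $E$ together with a family $\mathcal{I} \subseteq 2^E$ of “independent” sets such that

$$\emptyset \in \mathcal{I}, \qquad B \subseteq A \in \mathcal{I} \ \Rightarrow\ B \in \mathcal{I},$$

$$A, B \in \mathcal{I},\ |A| > |B| \ \Rightarrow\ \exists\, x \in A \setminus B \ \text{ with } \ B \cup \{x\} \in \mathcal{I}.$$

For a vector space, take $E$ to be a finite set of vectors and $\mathcal{I}$ the linearly independent subsets; the third (exchange) axiom is exactly the Steinitz exchange lemma.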
- counting
- zero
- integer decimal positional notation 100, 1000, …
- the four arithmetic operations + – * /
- fractions
- decimal notation 0.1, 0.01, …
- basic propositional logic (Modus ponens, contrapositive, If-then, and, or, nand, …)
- negative numbers
- equivalence classes
- equality & substitution
- basic algebra – idea of variables, equations, …
- the idea of probability
- commutative and associative properties
- distributive property
- powers (squared, cubed, …)
- compound interest (the miracle of compounding)
- scientific notation 1.3e6 = 1,300,000
- polynomials
- first order predicate logic
- infinity
- irrational numbers
- De Morgan’s laws
- statistical independence
- the notion of a function
- square root (cube root, …)
- inequalities (list of inequalities)
- laws of exponents (e.g. $a^b a^c = a^{b+c}$)
- Cartesian coordinate plane
- basic set theory
- random variable
- probability distribution
- histogram
- the mean, expected value & strong law of large numbers
- the graph of a function
- standard deviation
- Pythagorean theorem
- vectors and vector spaces
- limits
- real numbers as limits of fractions, the least upper bound
- continuity
- $R^n$, Euclidean Space, and Hilbert spaces (inner or dot product)
- derivative
- correlation
- central limit theorem, Gaussian distribution, properties of Gaussians
- integrals
- chain rule
- modular arithmetic
- sine cosine tangent
- $\pi$, circumference, area, and volume formulas for circles, rectangles, parallelograms, triangles, spheres, cones,…
- linear regression
- Taylor’s theorem
- the number e and the exponential function
- Rolle’s theorem, Karush–Kuhn–Tucker conditions, the fact that the derivative is zero at an interior maximum
- the notion of linearity
- Big O notation
- injective (one-to-one) / surjective (onto) functions
- imaginary numbers
- symmetry
- Euler’s Formula $e^{i \pi} + 1 = 0$
- Fourier transform, convolution in time domain is the product in the frequency domain (& vice versa), the FFT
- fundamental theorem of calculus
- logarithms
- matrices
- conic sections
- Boolean algebra
- Cauchy–Schwarz inequality
- binomial theorem – Pascal’s triangle
- the determinant
- ordinary differential equation (ODE)
- mode (maximum likelihood estimator)
- cosine law
- prime numbers
- linear independence
- Jacobian
- fundamental theorem of arithmetic
- duality – (polyhedron faces & points, geometry lines and points, Dual Linear Program, dual space, …)
- intermediate value theorem
- eigenvalues
- median
- entropy
- KL distance
- binomial distribution
- Bayes’ theorem
- $2^{10} \approx 1000$
- compactness, Heine–Borel theorem
- metric space, Triangle Inequality
- Projections, Best Approximation
- the geometric series $1/(1-X) = 1 + X + X^2 + \ldots$ (for $|X| < 1$)
- partial differential equations
- quadratic formula
- Riesz representation theorem
- Fubini’s theorem
- the ideas of groups, semigroups, monoids, rings, …
- Singular Value Decomposition
- numeric integration – trapezoidal rule, Simpson’s rule, …
- mutual information
- Plancherel’s theorem
- matrix condition number
- integration by parts
- Euler’s method for numerical integration of ODEs (and improved Euler & Runge–Kutta)
- pigeonhole principle
Gauss did a lot of math, but sometimes I am surprised when I find something new to me that was done by Gauss:
$\tan(x) = \cfrac{1}{1/x - \cfrac{1}{3/x - \cfrac{1}{5/x - \cdots}}}$
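A quick way to convince yourself (my own sketch): evaluate a truncation of the continued fraction from the inside out and compare with `math.tan`.

```python
# Evaluate the truncated continued fraction
#   1 / (1/x - 1/(3/x - 1/(5/x - ... - 1/((2*depth-1)/x))))
# from the innermost term outward and compare with math.tan.
import math

def tan_cf(x, depth=20):
    value = (2 * depth - 1) / x          # innermost term (2k-1)/x
    for k in range(depth - 1, 0, -1):
        value = (2 * k - 1) / x - 1 / value
    return 1 / value

for x in (0.1, 0.5, 1.0, 1.5):
    print(x, tan_cf(x), math.tan(x))     # the two columns agree closely
```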
Given a directed acyclic graph (DAG), find an ordering of the nodes that is consistent with the directed edges. This is called topological sorting, and several algorithms exist; a sketch of one of them appears below.
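One standard choice is Kahn’s algorithm (this is just one of the several algorithms, sketched here in Python): repeatedly emit a node with no remaining incoming edges.

```python
# Kahn's algorithm: repeatedly take a node with indegree 0,
# emit it, and delete its outgoing edges.
from collections import deque

def topological_sort(nodes, edges):
    """Order `nodes` so that for every edge (u, v), u comes before v."""
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; no topological order exists")
    return order

# Example: a must precede b and c; b and c must precede d.
print(topological_sort(["a", "b", "c", "d"],
                       [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```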
Gaussian distributions are the most “natural” distributions. They show up everywhere. Here is a list of the properties that make me think that Gaussians are the most natural distributions:
- The sum of several independent random variables (like dice) tends to be Gaussian. (Central Limit Theorem; a small simulation sketch appears after this list.)
- There are two natural ideas that appear in statistics: the standard deviation and the maximum entropy principle. If you ask, “Among all distributions with standard deviation 1 and mean 0, which has the maximum entropy?”, the answer is the Gaussian.
- Randomly select a point inside a high dimensional hypersphere. The distribution of any particular coordinate is approximately Gaussian. The same is true for a random point on the surface of the hypersphere.
- Take several samples from a Gaussian Distribution. Compute the Discrete Fourier Transform of the samples. The results have a Gaussian Distribution. I am pretty sure that the Gaussian is the only distribution with this property.
- The eigenfunctions of the Fourier transform are products of polynomials and Gaussians (the Hermite functions).
- The solution to the differential equation $y' = -x y$ is a Gaussian. This fact makes computations with Gaussians easier. (Higher derivatives involve Hermite polynomials.)
- I think Gaussians are the only distributions closed under multiplication, convolution, and linear transformations.
- Maximum likelihood estimators for problems involving Gaussians tend to also be the least squares solutions.
- I think all solutions to stochastic differential equations involve Gaussians. (This is mainly a consequence of the Central Limit Theorem.)
- “The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero.” – Wikipedia.
- For even $n$, the $n$th central moment of the Gaussian is $(n-1)!! = 1 \cdot 3 \cdots (n-1)$ times the standard deviation to the $n$th power.
- Many of the other standard distributions are strongly related to the Gaussian (e.g. binomial, Poisson, chi-squared, Student t, Rayleigh, logistic, log-normal, hypergeometric, …).
- “If X1 and X2 are independent and their sum X1 + X2 is distributed normally, then both X1 and X2 must also be normal.” — Wikipedia
- “The conjugate prior of the mean of a normal distribution is another normal distribution.” — Wikipedia
- When using Gaussians, the math is easier.
- The Erdős–Kac theorem implies that the distribution of the number of distinct prime factors of a “random” large integer is approximately Gaussian.
- The velocities of random molecules in a gas are distributed as a Gaussian (with standard deviation $z \sqrt{k T / m}$, where $z$ is a “nice” constant, $m$ is the mass of the particle, and $k$ is Boltzmann’s constant).
- “A Gaussian function is the wave function of the ground state of the quantum harmonic oscillator.” — From Wikipedia
- Kalman filters, which depend on the fact that Gaussians stay Gaussian under linear dynamics with Gaussian noise.
- The Gauss–Markov theorem.
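As a quick illustration of the first property above (my own sketch, not from the original list): sum a few dice many times, and the sample mean and standard deviation match what the central limit theorem’s Gaussian predicts, while a crude text histogram shows the bell shape.

```python
# Small simulation: sums of 30 dice look Gaussian.
import random
import statistics
from collections import Counter

random.seed(0)
n_dice, trials = 30, 100_000
sums = [sum(random.randint(1, 6) for _ in range(n_dice)) for _ in range(trials)]

# Theory: one die has mean 3.5 and variance 35/12, so the sum has
# mean 30 * 3.5 = 105 and standard deviation sqrt(30 * 35/12) ~ 9.35.
print("sample mean:", statistics.mean(sums))   # ~105
print("sample std: ", statistics.stdev(sums))  # ~9.35

# Crude text histogram of the bell shape.
counts = Counter(sums)
for s in range(80, 131, 5):
    print(f"{s:3d} {'#' * (counts[s] // 200)}")
```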