As I look at the hit statistics for the last quarter, I cannot help but wonder how well they fit Zipf’s law (a.k.a. Power Laws, the Zipf–Mandelbrot law, the discrete Pareto distribution).  Zipf’s law states that the distributions of many ranked things (city populations, country populations, blog hits, word frequencies, the probability distribution of questions for Alicebot, Wikipedia hits, terrorist attacks, the response times of famous scientists, …) look like a line when plotted on a log-log diagram.  So here are the numbers for my blog hits and, below that, a plot of log(blog hits) vs log(rank):

 

“Deep Support Vector Machines for Regression Problems” 400
Simpson’s paradox and Judea Pearl’s Causal Calculus 223
Standard Deviation of Sample Median 220
100 Most useful Theorems and Ideas in Mathematics 204
Computer Evaluation of the best Historical Chess Players 181
Notes on “A Few Useful Things to Know about Machine Learning” 178
Comet ISON, Perihelion, Mars, and the rule of 13.3 167
Dropout – What happens when you randomly drop half the features? 139
The Exact Standard Deviation of the Sample Median 101
Bengio LeCun Deep Learning Video 99
Category Theory ? 92
“Machine Learning Techniques for Stock Prediction” 89
Approximation of KL distance between mixtures of Gaussians 75
“A Neuro-evolution Approach to General Atari Game Playing” 74
The 20 most striking papers, workshops, and presentations from NIPS 2012 65
Matlab code and a Tutorial on DIRECT Optimization 61
About 51

 

 

Plot of log(blog hits) vs. log(rank)

 

Not too linear.  Hmmm.
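In case you want to reproduce the plot, here is a minimal sketch (assuming numpy and matplotlib, and using the hit counts listed above) that plots log(blog hits) against log(rank) and fits a line; Zipf’s law would predict a slope of roughly −1.

```python
# Minimal sketch: log-log plot of the hit counts listed above vs. rank,
# with a least-squares line fit (assumes numpy and matplotlib).
import numpy as np
import matplotlib.pyplot as plt

hits = np.array([400, 223, 220, 204, 181, 178, 167, 139,
                 101, 99, 92, 89, 75, 74, 65, 61, 51])
rank = np.arange(1, len(hits) + 1)

log_rank, log_hits = np.log(rank), np.log(hits)
slope, intercept = np.polyfit(log_rank, log_hits, 1)  # fit log(hits) ~ slope*log(rank) + intercept
print(f"fitted exponent: {slope:.2f}")

plt.scatter(log_rank, log_hits, label="blog hits")
plt.plot(log_rank, slope * log_rank + intercept, label=f"slope {slope:.2f}")
plt.xlabel("log(rank)")
plt.ylabel("log(blog hits)")
plt.legend()
plt.show()
```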

(Though Zipf’s “law” has been known for a long time, this post is at least partly inspired by Terence Tao’s wonderful post “Benford’s law, Zipf’s law, and the Pareto distribution”.)

So I have been reading a number of texts in my quest to learn category theory and apply it to AI.  Two of the most interesting introductions to category theory are “Category Theory for Scientists” by Spivak (2013) and “Ologs: A Categorical Framework for Knowledge Representation” by Spivak and Kent (2012).

Ologs are a slightly more sophisticated, mathematical form of semantic networks.

Semantic networks are a graphical method for representing knowledge. Concepts are drawn as words (usually nouns) in boxes or ovals.  Relationships between concepts are depicted as lines or arrows between concepts.  (In graph theory, these would be called edges.)  Usually, the lines connecting concepts are labeled with words that describe the relationship.  My favorite semantic network is the one describing the book “Gödel, Escher, Bach”, but here is a simpler example.

A Semantic Network

 

Mathematically, a semantic network is a graph whose nodes are concepts and whose edges are relationships.  There are some cool algorithms for constructing semantic networks from text (see e.g. “Extracting Semantic Networks from Text via Relational Clustering” by Kok and Domingos (2008)).
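To make the graph picture concrete, here is a tiny sketch of a semantic network as an edge-labeled directed graph (my own toy example; it assumes the networkx package, and the concepts and labels are made up for illustration).

```python
# Sketch: a tiny semantic network as a directed graph with labeled edges.
import networkx as nx

G = nx.DiGraph()
G.add_edge("cat", "mammal", label="is a")
G.add_edge("mammal", "animal", label="is a")
G.add_edge("cat", "fur", label="has")
G.add_edge("cat", "milk", label="drinks")

# Each edge is (tail concept, head concept, relationship label).
for tail, head, data in G.edges(data=True):
    print(f"{tail} --{data['label']}--> {head}")
```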

Though semantic networks look like probabilistic graphical models (PGMs, a.k.a. Bayesian networks), they are different.  In PGMs, the nodes are random variables, not general classes, and if five edges point to a node, then there is a five-parameter, real-valued function associated with those five edges indicating the likelihood that the target variable has a particular value.

As I said before, Ologs are a more sophisticated kind of directed graph.  In Ologs, the nodes are typically nouns that can be assigned one of many values.  In a directed graph, each edge starts at one node, called the tail, and ends at another node, called the head.  The edges in Ologs represent functions that, given a value of the tail node’s type, return one value of the head node’s type.  In the example shown below, the box “a date” could be assigned the values “Oct 12, 2010” or “January 5, 2011”.  The corresponding values of “a year” would be 2010 and 2011.

Example Olog from "Ologs: A categorical framework for knowledge representation"

Example Olog from “Ologs: A categorical framework for knowledge representation”

In category theory, categories are directed graphs whose edges, called arrows, satisfy some specific properties.  Ologs are part of the category Set.  One of the important properties of arrows is that if the head of one arrow is the tail of another arrow, then a third arrow is implied, called the composition of the original two arrows.  In the diagram above, the arrows “has, as birthday” and “includes” can be combined to create a new arrow from “a person” to “a year” which assigns to each person his or her birth year.
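A rough way to see the arrows-as-functions idea in code (a sketch of my own, not taken from the papers): each arrow is an ordinary function from the tail’s type to the head’s type, and the composite arrow is just function composition.  The names and birthdays below are hypothetical.

```python
# Sketch: olog arrows as plain functions; composition gives the arrow
# "a person" -> "a year" that assigns to each person his or her birth year.
from datetime import date

def has_as_birthday(person: str) -> date:
    # toy lookup standing in for the arrow "a person --has, as birthday--> a date"
    birthdays = {"Alice": date(2010, 10, 12), "Bob": date(2011, 1, 5)}
    return birthdays[person]

def includes(d: date) -> int:
    # the arrow "a date --includes--> a year"
    return d.year

def compose(g, f):
    # the composite arrow: apply f, then g
    return lambda x: g(f(x))

birth_year = compose(includes, has_as_birthday)
print(birth_year("Alice"))  # 2010
print(birth_year("Bob"))    # 2011
```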

I just wanted to introduce categories and Ologs for later articles.

Cheers – Hein

 

Orthogonal functions are very useful and rather easy to use.  The most famous ones are sin(n x) and cos(n x) on the interval [$-\pi$, $\pi$].  It is extremely easy to do linear regression with orthogonal functions.

Linear regression involves approximating a known function $f(x)$ as the weighted sum of other functions.  This shows up all the time in various disguises in machine learning:

  • When using AdaBoost, we compute a weighted sum of weak classifiers. 
  • Support vector machines find a linear combination of the features that separates the data into two classes.  
  • Each node in a neural net typically computes a weighted sum of the nodes in the prior layer.

For orthogonal functions, computing the linear regression is simple.  Suppose that the features $f_1, f_2, \ldots, f_n$ are orthonormal.  Then the best approximation of the function $g(x)$, which is also the solution to the linear regression, is

$$ g(x) \approx \sum_{i=1}^n \alpha_i f_i(x)$$

where $\alpha_i$ is the “inner product” of $f_i$ and $g$ denoted

$\alpha_i = \langle g, f_i\rangle = \int g(x) f_i(x) w(x)\, dx.$  (Two functions are orthogonal if their inner product (or dot product) is zero.  The set of orthogonal functions $\{f_i\}$ is called orthonormal if $\langle f_i, f_i\rangle = 1$ for all $i$.)  The function $w(x)$ is a positive weight function, which is typically just $w(x)=1$.

For example, if we try to approximate the function $g(x)=x^2$ on the interval $[-1,1]$ by combining the features $f_1(x) = 1/\sqrt{2}$ and $f_2(x) = \sqrt{3 \over 2}\, x$, then

$$ g(x) \approx \sum_{i=1}^n \alpha_i f_i(x)$$

where

$$\alpha_1 = \langle g, f_1\rangle = \int_{-1}^1 g(x) f_1(x) dx =  \int_{-1}^1 x^2 1/\sqrt2 dx = \sqrt{2}/3,$$

$$\alpha_2 = \langle g, f_2\rangle = \int_{-1}^1 g(x) f_2(x)  dx=  \int_{-1}^1 x^2  \sqrt{3/2}\; x dx =0,$$

and the best approximation is then $\sum_{i=1}^n \alpha_i f_i(x) = \alpha_1 f_1(x) = 1/3$.
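Here is a quick numerical check of that example (a sketch assuming numpy and scipy, not code from the original post): compute the coefficients by quadrature and confirm that the projection is the constant 1/3.

```python
# Numerical check: project g(x) = x^2 onto the orthonormal pair
# f1(x) = 1/sqrt(2) and f2(x) = sqrt(3/2) * x on [-1, 1] with weight w(x) = 1.
import numpy as np
from scipy.integrate import quad

g  = lambda x: x**2
f1 = lambda x: 1 / np.sqrt(2)
f2 = lambda x: np.sqrt(3 / 2) * x

inner = lambda u, v: quad(lambda x: u(x) * v(x), -1, 1)[0]  # <u, v>

a1 = inner(g, f1)   # sqrt(2)/3 ~ 0.4714
a2 = inner(g, f2)   # ~0 by symmetry
approx = lambda x: a1 * f1(x) + a2 * f2(x)
print(a1, a2, approx(0.7))  # the approximation is the constant 1/3
```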

 

The other day I needed to do a linear regression and I wanted to use orthogonal functions, so I started by using the trigonometric functions to approximate my function $g(x)$.  Using trig functions to approximate another function on a closed interval like $[-1,1]$ is the same as computing the Fourier series.  The resulting approximations were not good enough at $x=-1$ and $x=1$, so I wanted to change the weight function $w(x)$.  Now assume that $u_1(x), u_2(x), \ldots, u_n(x)$ are orthonormal with the weight function $w(x)$ on $[-1,1]$.  Then

$$ \int_{-1}^1 u_i(x) u_j(x) w(x) dx = \delta_{i,j}$$

where $\delta_{i,j}$ is one if $i=j$ and zero otherwise.  If we make a change of variable $x=y(t)$, then

$$ \int_{y^{-1}(-1)}^{y^{-1}(1)} u_i(y(t))\, u_j(y(t))\, w(y(t))\, y'(t)\, dt = \delta_{i,j}.$$

It would be nice if

$$(1)\quad w(y(t)) y'(t)=1,$$

because then I could just set $u_i(y(t))$ equal to a trig function and the resulting $u_i(x)$ would be orthonormal with respect to the weight function $w(x)$. Condition (1) on $y$ is really an ordinary differential equation

$$y' = {1\over{w(y)}}.$$

In order to get a better fit at $x=\pm 1$, I tried two weight functions: $w(x) = {1\over{(1-x)(1+x)}}$ and $w(x) = {1\over{\sqrt{(1-x)(1+x)}}}$.  For the former, the differential equation becomes $y' = 1 - y^2$, whose solutions are shifted hyperbolic tangent functions.  The latter $w(x)$ yields the nice solution

$$y(t) = \sin( c + t).$$
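(As a quick sanity check, here is a numerical sketch, assuming scipy, that integrates $y' = \sqrt{(1-y)(1+y)}$ from $y(0)=0$ and confirms the solution is a shifted sine.)

```python
# Integrate y' = sqrt(1 - y^2) with y(0) = 0 and compare to sin(t),
# staying on a range where y stays below 1 (assumes scipy).
import numpy as np
from scipy.integrate import solve_ivp

rhs = lambda t, y: np.sqrt(np.maximum(1 - y**2, 0))
sol = solve_ivp(rhs, (0, 1.4), [0.0], t_eval=np.linspace(0, 1.4, 8),
                rtol=1e-8, atol=1e-10)
print(np.max(np.abs(sol.y[0] - np.sin(sol.t))))  # tiny: y(t) = sin(t), i.e. sin(c + t) with c = 0
```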

So now if I choose $c=\pi/2$, then $y(t) = \cos(t)$, $y^{-1}(-1) = -\pi$, and $y^{-1}(1)=0$, so I need functions that are orthonormal on the interval $[-\pi, 0]$.  The functions $\cos(n\,t)$ are orthogonal on that interval, but I need to divide by $\sqrt{\pi/2}$ to make them orthonormal.  So, I set $u_n(y(t)) = \sqrt{2/\pi}\cos(n\,t)$ and now

$$ \int_{-1}^1 {{u_i(x) u_j(x)}\over {\sqrt{(1-x)(1+x)}}}  dx = \delta_{i,j}.$$

The $u_n(x)$ functions are orthonormal with respect to the weight $w(x)$ which is exactly what I wanted.

Solving for $u(x)$ yields

$$u_n(x) = \sqrt{2/\pi} \cos( n \;\arccos(x)),$$

$$u_0(x) = \sqrt{2/\pi}, u_1(x) = \sqrt{2/\pi} x,$$

$$u_2(x) = \sqrt{2/\pi}(2x^2-1), u_3(x) = \sqrt{2/\pi}(4x^3-3x),\ldots.$$

 

If you happen to be an approximation theorist, you would probably notice that these are the Chebyshev polynomials multiplied by $\sqrt{2/\pi}$.   I was rather surprised when I stumbled across them, but when I showed this to my friend Ludmil, he assured me that this accidental derivation was obvious.  (But still fun :) )
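If you want to check the orthonormality numerically, here is a small sketch (assuming numpy and scipy) that tests the pairs $u_1, u_2, u_3$ against the weight ${1\over{\sqrt{(1-x)(1+x)}}}$.

```python
# Check that u_n(x) = sqrt(2/pi) * cos(n * arccos(x)), n = 1, 2, 3, are
# orthonormal on [-1, 1] with respect to the weight 1/sqrt((1-x)(1+x)).
import numpy as np
from scipy.integrate import quad

u = lambda n, x: np.sqrt(2 / np.pi) * np.cos(n * np.arccos(x))
w = lambda x: 1 / np.sqrt((1 - x) * (1 + x))

for i in range(1, 4):
    for j in range(1, 4):
        val, _ = quad(lambda x: u(i, x) * u(j, x) * w(x), -1, 1, limit=200)
        print(i, j, round(val, 6))  # ~1 when i == j, ~0 otherwise
```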

Comet ISON passing Mars (Wolfram Alpha)

 

I just got back from the Black Forest Star Party where Dr. Carey Lisse, head of NASA’s ISON Observing Campaign, gave a speech on comet ISON.

Comet ISON (see [1], [2]) is passing by Mars soon (Oct 1) and it will be “grazing the sun” before the end of the year, so I wondered if there was some relationship between the orbital period of a planet and the time it takes a passing comet to go from the planet to the Sun. Turns out there is a relationship.  Here’s the approximate rule:

 

Time to sun $\approx$ Orbital Period / $ 3 \pi \sqrt{2} \approx $  Orbital Period / 13.3 !

 

In the case of ISON and Mars,

 

Time to sun $\approx$ 687 days / 13.3 $\approx$ 52 days.

 

But Oct 1 + 52 days is Nov 22, and perihelion is on Nov 28.  Why so far off?

Well, turns out that Mars is farther from the sun than usual.  If we correct for that, then the formula estimates perihelion to within 1 day—much better.

 

For those that like derivations, here is the derivation for the 13.3 rule.

The orbital period of a planet is $T_p = 2 \pi \sqrt{ r^3 \over {G m_S} }$ where $m_S$ is the mass of the Sun, $r$ is the radius of the planet’s orbit (or, more precisely, the semi-major axis of its orbit), and $G \approx 6.67\times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ is the gravitational constant.

The speed of a comet from the Oort cloud is derived from its energy.

Kinetic Energy = -Potential Energy

$ \frac12 m v^2 = G m m_S / r$

$v = \sqrt{ {2 G m_S}\over{r}} $

where $r$ is the distance from the comet to the sun.

So the time for a Sun grazer to go from distance $r_0$ to the sun is about

$$T_S = \int_0^{r_0} {1\over v} dr $$

$$ = \int_0^{r_0} \sqrt{ r\over{2 G m_S}} dr $$

$$ = \frac13 \sqrt{ 2 r_0^3 \over{G m_S} }.$$

Finally, setting $r_0 = r$ (the comet starts out at the planet’s distance from the Sun),

$$T_p/T_S = {{2 \pi \sqrt{ r^3 \over {G m_S} }}\over{ \frac13 \sqrt{ 2 r^3 \over{G m_S} }}} =  3 \pi \sqrt{2} \approx 13.3.$$
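For the skeptical, a few lines of arithmetic (a sketch using only Python’s standard library, with Mars’ 687-day orbital period from above) confirm both the constant and the Mars estimate:

```python
# Check the constant 3*pi*sqrt(2) and apply the rule of 13.3 to Mars.
from math import pi, sqrt

ratio = 3 * pi * sqrt(2)
print(round(ratio, 2))                      # 13.33

mars_period_days = 687
print(round(mars_period_days / ratio, 1))   # ~51.5 days from Mars' orbit to the Sun
```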

I really enjoyed Yann Esposito’s slides “Category Theory Presentation” which give a relatively simple and artistic introduction to category theory for Haskell programmers.  (I like Haskell too.)  His slides present the basic definitions of categories, functors, natural transformations, and monads along with working Haskell code.

Here is a fragment of a typical slide.

 

Slide fragment: composition (from “Category Theory Presentation”)

Artificial intelligence and machine learning are linked with a huge list of other fields of study: statistics, probability, information theory, computational complexity theory, algorithmic complexity theory, linear regression, linear programming, approximation theory, neuroscience, automata theory, the geometry of high-dimensional spaces, reproducing kernel Hilbert spaces, optimization, formal languages, and many other areas (see e.g. the graphic below by Swami Chandrasekaran).

My focus has always been on applied mathematics rather than “pure” mathematics.  But, by studying AI, I was almost forced into learning about areas I had avoided in graduate school, like decidability (logic) and proof theory, which did not seem practical at the time.  I felt that most abstract mathematics and some abstract computer science, though beautiful, were not very useful.  So I am surprised that I am now studying the most abstract and possibly most useless area of mathematics, Category Theory.

Why category theory (aka “Abstract Nonsense”)?  Mathematicians often wonder “What is the simplest example of this idea?”, “Where do idea A and idea B intersect?”, and “How does this idea generalize?”.  Category theory seems to be about these questions, which I hope will be useful for AI research.  I was also inspired to read about category theory by Ireson-Paine’s article “What Might Category Theory do for Artificial Intelligence and Cognitive Science?” and “Ologs: A Categorical Framework for Knowledge Representation” by Spivak and Kent (2012). Paine describes relationships between category theory, logical unification, holographic reduced representation, neural nets, the Physical Symbol System Hypothesis, analogical reasoning, and logic. The article includes many authoritative, interesting references and links.  Ologs appear to be a category-theoretic version of semantic networks.

Since I am studying category theory in my off time, I will probably blog about it. I am an enthusiastic beginner when it comes to category theory, so hopefully I can communicate that enthusiasm and avoid posting the common conceptual blunders that are part of learning a new field.

Cheers, Hein

PS:  I loved this graphic by Swami Chandrasekaran from the article “Becoming a Data Scientist“.

Road to Data Science by Swami Chandrasekaran

In a previous post, I gave the well-known approximation to the standard deviation of the sample median

$$\sigma \approx {1 \over 2\sqrt{n}\,f(x_m)}$$

where $f(x)$ is the probability density function and $x_m$ is the median (see Laplace and Kenney and Keeping).  Here are some examples.

  • Standard Gaussian (mean 0, std 1): median 0, approximate StD of the median $\sqrt{\pi \over{2 n}}$
  • Uniform on [0, 1]: median 1/2, approximate StD of the median ${1\over{2\sqrt{n}}}$
  • Logistic with mean 0 and shape $\beta$: median 0, approximate StD of the median ${2\beta \over \sqrt{n}}$
  • Student t with mean 0 and $\nu$ degrees of freedom: median 0, approximate StD of the median $\frac{\sqrt{\nu}\, B\left(\frac{\nu}{2},\frac{1}{2}\right)}{2 \sqrt{n}}$
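A quick way to sanity-check any row of this list (a simulation sketch of mine, assuming numpy) is to simulate many sample medians and compare their standard deviation to the formula; here is the standard Gaussian case with $n=101$.

```python
# Monte Carlo check of sigma ~ 1/(2*sqrt(n)*f(x_m)) for the standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 101, 100_000
medians = np.median(rng.standard_normal((trials, n)), axis=1)
print(medians.std())               # simulated std of the sample median
print(np.sqrt(np.pi / (2 * n)))    # Laplace approximation, ~0.1247
```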


Computing the exact standard deviation of the sample median is more difficult. You first need to find the probability density function of the sample median which is

$$f_m(x) = g(c(x)) f(x)$$

where

$$g(x) = \frac{(1-x)^{\frac{n-1}{2}}\, x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$$
$B$ is the beta function, $c(x)$ is the cumulative distribution function of the sample distribution, and $f(x)$ is the probability density function of the sample distribution.

Now the expected value of the sample median is

$$\mu_m = \int x f_m(x) dx$$

and the standard deviation of the sample median is

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}. $$

 

Generally speaking, these integrals are hard, but they are fairly simple for the uniform distribution.  If the sample distribution is uniform between 0 and 1, then

$f(x) = 1,$

$c(x) = x,$

$g(x) = \frac{(1-x)^{\frac{n-1}{2}}\, x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$

$f_m(x) = g(x),$

$\mu_m = \int x g(x) dx \ =\ 1/2,$ and

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}\  = \ {1\over{2\sqrt{n+2}}} $$

which is close to the approximation given in the table.

 

(Technical note:  The formulas above only apply for odd values of $n$ and continuous sample probability distributions.)

If you want the standard deviation of the sample median for a particular distribution and a particular $n$, then you can use numerical integration to get the answer. If you like, I could compute it for you.  Just leave a comment indicating the distribution and $n$.
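In case it saves someone a comment, here is roughly how I would set up that numerical integration (a sketch assuming scipy, using the density $f_m(x) = g(c(x)) f(x)$ from above; as in the technical note, it applies to odd $n$ and continuous distributions).

```python
# Sketch: exact standard deviation of the sample median by numerical integration.
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import beta

def median_std(dist, n):
    """Exact std of the sample median for odd n and a frozen scipy.stats distribution."""
    assert n % 2 == 1, "the formula above applies to odd sample sizes"
    k = (n - 1) / 2
    B = beta(k + 1, k + 1)  # B((n+1)/2, (n+1)/2)

    def f_m(x):  # density of the sample median: g(c(x)) * f(x)
        c = dist.cdf(x)
        return (1 - c) ** k * c ** k / B * dist.pdf(x)

    lo, hi = dist.support()
    mu = quad(lambda x: x * f_m(x), lo, hi)[0]
    var = quad(lambda x: (x - mu) ** 2 * f_m(x), lo, hi)[0]
    return np.sqrt(var)

print(median_std(stats.uniform(0, 1), 11))  # exact: 1/(2*sqrt(13)) ~ 0.1387
print(median_std(stats.norm(0, 1), 11))     # compare with sqrt(pi/22) ~ 0.3780
```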

I wonder if there is some way to identify which internet queries are the most frequently asked and least frequently successfully answered.

I bring this up because my three most popular blog posts are:

  1. 100 Most useful Theorems and Ideas in Mathematics
  2. Standard Deviation of Sample Median
  3. “Deep Support Vector Machines for Regression Problems”

The second one, about the median, was originally posted to answer the question “What has the most variability: the sample mean or the sample median?”  But I think that most of the people who are referred to this post from Google are trying to find the answer to “What is the standard deviation of the sample median?”  The second question is quite practical for people doing statistics, and it is difficult to find an answer to it on the internet.  An approximation to the answer is given in the post, but the post really was not designed to answer this question, so I imagine that many people read the article and find the approximation insufficient, or they don’t even understand the approximation given.  So I wonder if there are many questions posed to Google that remain unanswered after Googling. It would be a great service if such questions could be answered by an expert and indexed by Google.

The 100 theorems post and the deep support post are similar.  Both are fairly short simple answers to potential internet queries. The 100 theorems post is an answer to “What are the most useful theorems in mathematics?” and the “deep support” article answers the very technical AI question “What is a deep support vector machine?”

Maybe I should write more posts like those.

 

Cheers, Hein

One of my friends showed me his new gaming computer and said that the GPU could do 1.3 teraflops (1.3 trillion floating point operations per second) which is about 500 times faster than my home computer, so I thought, “Imagine how quickly we could search a game tree.”  So I started looking around the internet for a super-great GPU chess engine and found basically nothing!!  Turns out that the amount of memory per thread is too low, the size of the L1 cache is too small, and the alpha-beta pruning algorithm is not quite parallel enough for GPUs to play chess well.  Here is a nice graphic of the L1 access time for a few CPUs and GPUs.

 

In the paper “Parallel Game Tree Search Using GPU” (2011), L’ubomír Lackovic improved the tree-search speed by a factor of two to three by using a GPU instead of the more traditional CPU-based tree search for Czech draughts (similar to American checkers).  His tests were based on the ATI Radeon 4890 GPU, the Nvidia GTX460 GPU, and the quad-core Intel i5 750 CPU.  I had hoped for more speed.

In “Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks” (2011), Damian Sulewski invented and tested several algorithms for search, reviewed game theory algorithms, and applied GPU processing to several games including “Nine Men’s Morris”.  Sulewski used an Intel Core i7 920 CPU with an NVIDIA GeForce 285 GTX GPU to run his tests. He reported that the GPU was faster by a factor of three to twelve as long as sufficient RAM was available.  If the computer ran out of RAM and had to use disk storage, then the GPU performance degraded significantly.  He states,

 

“The observed speed-ups of over one order of magnitude have been obtained (plotted in bold font), exceeding the number of cores on most current PCs. Note that this assertion is true for the dual 6-core CPUs available from Intel, but not on a dual Xeon machine with two quad-core CPUs creating 16 logical cores due to multi-threading. Nonetheless, better speed-ups are possible since NVIDIA GPUs can be used in parallel and the Fermi architecture (e.g. located on the GeForce GTX 480 graphics card) is coming out which will go far beyond the 240 GPU cores we had access to.

For larger levels, however, we observe that the GPU performance degrades. When profiling the code, we identified I/O access as one limiting factor. For example, reading S8,8 from one HDD required 100 seconds, while the expansion of 8 million states, including ranking and unranking required only about 1 second on the GPU.”

 

So GPUs are not the silver bullet for games yet.

 

The linear-nonlinear-Poisson (LNP) cascade model is a standard model of neuron responses.  Louis Shao has recently shown that an artificial neural net consisting of LNP neurons can simulate any Boltzmann machine and perform “a semi-stochastic Bayesian inference algorithm lying between Gibbs sampling and variational inference.”  In his paper he notes that the “properties of visual area V2 are found to be comparable to those on the sparse autoencoder networks [3]; the sparse coding learning algorithm [4] is originated directly from neuroscience observations; also psychological phenomenon such as end-stopping is observed in sparse coding experiments [5].”
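For readers who have not seen the model, here is a minimal toy sketch of a single LNP neuron (my own illustration, assuming numpy; it is not code from Shao’s paper): a linear filter applied to the stimulus, a pointwise nonlinearity giving a firing rate, and Poisson spike generation.

```python
# Toy linear-nonlinear-Poisson (LNP) neuron: filter the stimulus, pass the
# result through a nonlinearity to get a firing rate, then draw Poisson spikes.
import numpy as np

rng = np.random.default_rng(1)
dt = 0.001                                 # 1 ms time bins
stimulus = rng.standard_normal(5000)       # white-noise stimulus, 5 seconds
kernel = np.exp(-np.arange(50) / 10.0)     # linear stage: exponential filter
drive = np.convolve(stimulus, kernel, mode="same")

rate = np.exp(0.5 * drive)                 # nonlinear stage (spikes per second)
spikes = rng.poisson(rate * dt)            # Poisson stage: spike counts per bin
print(spikes.sum(), "spikes in", len(spikes) * dt, "seconds")
```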

 
