In “Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity“, Ailon (2011) presents a method for ordering a set based on noisy comparisons of elements. He proves that his adaptive method approximates the NP Hard optimal solution (Alon 2006) to within $(1+\epsilon)$ times the minimum error after $f(\epsilon^{-1}, n)$ comparisons where $f$ is a polynomial. Ailon builds on the work of Kenyon-Mathieu and Schudy (2007) who also developed a polynomial time algorithm for approximate ranking. The Kenyon-Mathieu and Schudy algorithm was not query efficient.
In the seminal paper “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data“, Lafferty, McCallum, Pereira (2001) introduce a very popular type of Markov random field for segmentation. Conditional Random Fields (CRFs) are used in many fields including machine translation, parsing, genetics, and transmission codes. They are a non-directed version of Hidden Markov Networks. The paper describes conditional random fields, provides an iterative method to estimate the parameters of the CRF, and reports experimental comparisons between CRFs, hidden Markov models, and maximum entropy Markov models.
In “Back Propagation in a Clifford Algebra“, Pearson and Bisset (1992) discuss the interesting problem of replacing real numbers in a neural net by elements from a Clifford Algebra. They replace the sigmoid activation function with
$$f(x) = x / ( c + |x|/r)$$
where $c$ and $r$ are real positive constants and $|x|$ is the norm of the element in the Clifford algebra. Deriving the back propagation algorithm is straight forward otherwise.
In a later article “Neural Networks in the Clifford Domain“, the same authors explain how complex numbers, quaternions, or Clifford algebras can convey electrical phase information between neurons which might be necessary for more accurate representation of how the brain actually works. Also, it is possible that signal processing and image processing applications may benefit. They write,
“It is conjectured that complex valued feed-forward networks will be able to achieve better representations of problems that map into the complex domain naturally (such as phase and frequency information) than if the components of the signal were split up and presented to a real valued feed forward network.”
I was reading “Around the Blogs in 80 Summer Hours” at Nuit Blanche and these two links caught my eye:
Implementation: BiLinear Modelling Via Augmented Lagrange Multipliers (BLAM)
In “Algorithms for Infinitely Many-Armed Bandits”, Wang, Audibert, and Munos (2008) describe some algorithms for the multi-armed bandit problem when a large number or infinitely many arms are available. Their algorithms are designed for the situation where all rewards are contained in $[0,1]$ and “the probability that a new arm is $\epsilon$-optimal is of order $\epsilon^\beta$”. More precisely, there exist real numbers $c, \mu^*,$ and $\beta$ such that the expected value of an unexplored arm $\mu$ obeys
$$P(\mu^* – \mu < \epsilon) < c \epsilon^\beta.$$
They prove that the total regret is at most of order $n^{\beta/(\beta+1)}\log^2(n)$ if $\beta > 1$ and $\log^2(n)\sqrt{n}$ otherwise. Additionally, they prove a lower bound of order $n^{\beta / (\beta + 1)}$ for any algorithm. Their algorithm applies UCB to the first $n^{\beta/(\beta+1)}$ arms. (The case where $\beta = 1$ was explored in “Bandit problems with infinitely many arms” by Berry, Chen, Zame, Heath, and Shepp (1997).)
In “Active Learning Literature Survey“, Burr Settles (2010) reviews uncertainty sampling (Lewis and Gale, 1994), margin sampling (Scheffer et al., 2001), entropy sampling, optimal experimental design, query-by-committee (Seung et al., 1992), query-by-boosting, query-by-bagging, expected model change, expected error reduction, expected information gain, variance reduction, and density weighted methods. He then comments on theoretical and empirical performance of these methods, practical considerations, and related areas of machine learning including: semi-supervised learning, reinforcement learning, and compression.
The Jensen-Shannon divergence is a symmetric version of the Kullback-Leibler divergence. Unlike the KL-divergence, the square root of the JS-divergence is a true metric obeying the triangle inequality. Interestingly, if $X$ is a random variable chosen from the average of two distributions $Q$ and $R$, then the J-S divergence between $Q$ and $R$ is equal to the mutual information between $X$ and the indicator function $I$, where $P(I=1) = q(X)/( q(X) + r(X) )$, $q(x)$ is the density of $Q$, and $r(x)$ is the density of $R$. (One immediate consequence is the JS divergence is always less than 1.)
In “Hashing Algorithms for Large-Scale Learning” Li, Shrivastava, Moore, and Konig (2011) modify Minwise hashing by storing only the $b$ least significant bits. They use the $b$ bits from $k$ hashing functions as features for training a support vector machine and logistic regression classifiers. They compare their results against Count-Min, Vowpal Wabbit, and Random hashes on a large spam database. Their algorithm compares favorably against the others.