In “Analysis of Thompson Sampling for the multi-armed bandit problem”, Agrawal and Goyal show that Thompson Sampling (“The basic idea is to choose an arm to play according to its probability of being the best arm.”) has logarithmic regret. More specifically, if there are $n$ arms and the regret $R$ at time $T$ is defined by
$$R(T) := \sum_{t=1}^T (\mu^* - \mu_{i(t)})$$
where $\mu_i$ is the expected return of the $i$th arm and $\mu^* = \max_{i = 1, \ldots, n} \mu_i$, then
$$E[R(T)] \in O\left(\left(\sum_{i\neq i^*} 1/\Delta_i^2\right)^2 \log T \right)$$
where $\mu_{i^*} = \mu^*$ and $\Delta_i = \mu^* - \mu_i$.
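Here is a minimal sketch of the idea for Bernoulli arms with Beta(1, 1) priors, tracking the regret $R(T)$ defined above. The arm means and horizon below are made up for the example and this is not the exact setting analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Bernoulli bandit; the true means and horizon are made up for this example.
mu = np.array([0.4, 0.5, 0.6])       # expected return of each arm
mu_star = mu.max()
T = 10_000

# Beta(1, 1) priors on each arm's mean.
successes = np.ones(len(mu))
failures = np.ones(len(mu))

regret = 0.0
for t in range(T):
    # Thompson Sampling: sample a mean from each posterior and play the arm with
    # the largest sample, i.e. play each arm with its posterior probability of
    # being the best arm.
    theta = rng.beta(successes, failures)
    i = int(np.argmax(theta))
    reward = rng.random() < mu[i]
    successes[i] += reward
    failures[i] += 1 - reward
    regret += mu_star - mu[i]        # contribution to R(T)

print(f"R(T) after {T} rounds: {regret:.1f}")
```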
Cotner’s Observation:
- If $A$ and $B$ have Gaussian distributions with equal standard deviations, then $p(x)$ is decreasing.
- If $A$ and $B$ have Laplace distributions with equal standard deviations, then $p(x)$ becomes constant for all $x > E[B]$, with $p(x) < 0.5$ in that region.
- If $A$ and $B$ have Student t-distributions with equal standard deviations and equal degrees of freedom, then $p(x)$ will actually increase if $x$ is large enough, and $p(x)$ approaches 0.5 as $x$ goes to infinity.
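The definition of $p(x)$ is not repeated above, but all three bullets are consistent with reading $p(x)$ as the posterior probability that an observation $x$ came from $A$ rather than $B$ under equal priors, with $E[A] < E[B]$, i.e. $p(x) = f_A(x)/(f_A(x)+f_B(x))$. Assuming that reading, a quick numerical check reproduces the three behaviors:

```python
import numpy as np
from scipy import stats

# ASSUMPTION: p(x) = f_A(x) / (f_A(x) + f_B(x)), the posterior probability that
# x was drawn from A rather than B given equal priors, with E[A] = 0 < E[B] = 1.
def p(x, dist):
    fA, fB = dist(loc=0.0).pdf(x), dist(loc=1.0).pdf(x)
    return fA / (fA + fB)

x = np.array([2.0, 5.0, 20.0])
print("Gaussian: ", p(x, stats.norm))                            # decreasing in x
print("Laplace:  ", p(x, stats.laplace))                         # constant (< 0.5) once x > E[B]
print("Student t:", p(x, lambda loc: stats.t(df=3, loc=loc)))    # climbs back toward 0.5
```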
In “BOA: The Bayesian Optimization Algorithm”, Pelikan, Goldberg, and Cantú-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]). They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.” and
“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem. It discovers the structure of a problem on the fly.”
where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998). The algorithm consists of the following steps:
The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0; randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t); set t ← t + 1
(6) if the termination criteria are not met, go to (2)
where B is a Bayesian network.
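As a rough illustration of the loop, here is a short sketch in Python. To keep it small, step (3) fits only independent per-bit marginals instead of a full Bayesian network, so it degenerates into a simpler estimation-of-distribution algorithm rather than real BOA; the fitness function (onemax) and the population sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def onemax(s):
    # Toy fitness function (assumed for illustration): count the ones in the string.
    return int(s.sum())

def fit_model(S):
    # Stand-in for step (3): BOA would learn a Bayesian network over the bits;
    # here we only fit independent per-bit marginal probabilities.
    return S.mean(axis=0)

def sample_model(model, m, rng):
    # Step (4): generate new strings from the estimated distribution.
    return (rng.random((m, model.size)) < model).astype(int)

n_bits, pop_size, n_select, n_offspring = 20, 100, 50, 50
P = rng.integers(0, 2, size=(pop_size, n_bits))             # (1) initial population P(0)
for t in range(50):
    fitness = np.array([onemax(s) for s in P])
    S = P[np.argsort(fitness)[-n_select:]]                  # (2) promising strings S(t)
    model = fit_model(S)                                    # (3) build the model
    O = sample_model(model, n_offspring, rng)               # (4) new strings O(t)
    P[np.argsort(fitness)[:n_offspring]] = O                # (5) replace the worst strings
    if fitness.max() == n_bits:                             # (6) termination criterion
        break

print("best fitness:", max(onemax(s) for s in P))
```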
Check out the NIPS 2011 workshop.
Check out the new class (all lectures on-line) advertised on the Machine Learning (Theory) blog.
Check out the Nuit Blanche posts “Predicting the Future: The Steamrollers” and “Predicting the Future: Randomness and Parsimony”, where Igor Carron repeats the well-known mantra of Moore’s law, which always seems to catch us by surprise. Carron’s remarks on medicine surprised me, but while reading the articles I also thought, “I should have guessed that would happen.”
Garvesh Raskutti has some nice slides on Probabilistic Graphical Models and Markov Logic Networks (Richardson & Domingos 2006). Markov Logic Networks encode first-order predicate logic into a Markov Random Field. The resulting networks can be quite large because statements like “for all x, y, and z, x is y’s parent and z is x’s parent imply z is y’s grandparent” require the existence of $2n^2$ nodes in the graph (one node for each of the $n^2$ possible parent facts and each of the $n^2$ possible grandparent facts), where $n$ is the number of people considered. The resulting networks are frequently solved using Gibbs sampling. For even more information, Pedro Domingos has put an entire course online at
www.cs.washington.edu/homes/pedrod/803
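To see where the $2n^2$ count comes from: the rule mentions only the binary predicates parent and grandparent, and grounding each of them over $n$ constants produces $n^2$ Boolean nodes apiece (plus $n^3$ ground clauses). A tiny sketch, with the people’s names made up for illustration:

```python
from itertools import product

# Ground the rule  parent(x, y) AND parent(z, x) => grandparent(z, y)
# over a small set of people (the names are made up for the example).
people = ["ann", "bob", "cal", "dot"]
n = len(people)

# Nodes of the ground Markov network: one Boolean variable per ground atom.
ground_atoms = [f"parent({a},{b})" for a, b in product(people, people)]
ground_atoms += [f"grandparent({a},{b})" for a, b in product(people, people)]
print(len(ground_atoms), "==", 2 * n ** 2)     # 2n^2 nodes

# Each grounding of the rule becomes a factor over three of those nodes.
ground_clauses = [
    (f"parent({x},{y})", f"parent({z},{x})", f"grandparent({z},{y})")
    for x, y, z in product(people, repeat=3)
]
print(len(ground_clauses), "==", n ** 3)       # n^3 ground clauses
```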
“Ecological Fallacy” is a phenomenon in statistics where correlations computed for groups of data points differ from the correlations for the ungrouped data. Here is a quote from Wikipedia:
“Another example is a 1950 paper by William S. Robinson that coined the term.[5] For each of the 48 states + District of Columbia in the US as of the 1930 census, he computed the illiteracy rate and the proportion of the population born outside the US. He showed that these two figures were associated with a negative correlation of −0.53 — in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy. However, when individuals are considered, the correlation was +0.12 — immigrants were on average more illiterate than native citizens. Robinson showed that the negative correlation at the level of state populations was because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or “ecological” data. In 2011, it was found that Robinson’s calculations of the ecological correlations are based on the wrong state level data. The correlation of −0.53 mentioned above is in fact −0.46.[6]”
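The sign flip is easy to reproduce on synthetic data: make the within-group relationship positive while arranging the group means along a negative slope, then compare the individual-level correlation with the correlation of the group means. The numbers below are made up purely to illustrate the effect and have nothing to do with Robinson’s data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic illustration of the ecological fallacy (not Robinson's 1950 data):
# within every group x and y are positively related, but the group means are
# arranged along a negative slope.
group_means = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
xs, ys, gx, gy = [], [], [], []
for mx, my in group_means:
    x = mx + 2.0 * rng.normal(size=500)
    y = my + 0.8 * (x - mx) + 0.5 * rng.normal(size=500)    # positive within-group slope
    xs.append(x); ys.append(y)
    gx.append(x.mean()); gy.append(y.mean())

individual_r = np.corrcoef(np.concatenate(xs), np.concatenate(ys))[0, 1]
group_r = np.corrcoef(gx, gy)[0, 1]
print(f"individual-level correlation: {individual_r:+.2f}")   # comes out positive
print(f"group-level correlation:      {group_r:+.2f}")        # comes out negative
```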
Check out Bookworm9765’s review of Kurzweil’s book “How to Create a Mind” on Amazon.com. Here is a snippet:
The core of Kurzweil’s theory is that the brain is made up of pattern processing units comprised of around 100 neurons, and he suggests that the brain can be understood and simulated primarily by looking at how these lego-like building blocks are interconnected.
In “Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems”, Ibrahimi, Javanmard, and Van Roy (2012) present a fast reinforcement learning solution for linear control problems with quadratic costs. The problem is to choose the controls to minimize the total cost $\sum_{i=1}^T \left(x_i^\top Q x_i + u_i^\top R u_i\right)$ for the control system $x_{i+1} = A x_i + B u_i + w_i$, where $x_i \in R^j$ is the state of the system; $u_i \in R^k$ is the control; $Q, R, A,$ and $B$ are matrices; and $w_i \in R^j$ is a multivariate normally distributed noise vector.
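The cost and dynamics are easy to simulate. The sketch below rolls out $x_{i+1} = A x_i + B u_i + w_i$ under a fixed linear feedback $u_i = K x_i$ and accumulates the quadratic cost; the particular matrices and the gain $K$ are made up for illustration and are not the adaptive controller from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system matrices and feedback gain (made up for illustration; not from the paper).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[-1.0, -2.0]])      # fixed stabilizing feedback u_i = K x_i
T = 1000

x = np.array([1.0, 0.0])
total_cost = 0.0
for i in range(T):
    u = K @ x                                  # control input
    total_cost += x @ Q @ x + u @ R @ u        # x_i' Q x_i + u_i' R u_i
    w = 0.1 * rng.normal(size=2)               # Gaussian noise w_i
    x = A @ x + B @ u + w                      # x_{i+1} = A x_i + B u_i + w_i

print(f"average cost per step: {total_cost / T:.3f}")
```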