Articles by hundalhh

Mathematician and Father. Into games, astronomy, psychology and philosophy.

Currently, neural networks are the most ubiquitous artificial intelligence method. Artificial intelligence (AI) has had massive breakthroughs over the last 15 years. From the year 2000 until 2008, computers could not get near human-level performance in most pattern recognition tasks. For example, they had about 75% accuracy when converting spoken words to text, maybe a 30% chance of recognizing things in images (like people, planes, or mountains), and maybe a 90% accuracy rate on recognizing handwritten text. Big data and fast computers helped, but they still were not getting human-level performance, and a different method was used for each pattern recognition system. Now in 2023, neural nets are about as good as humans in many pattern recognition tasks and, in almost all cases, the computers are using neural nets to achieve human-level performance.

Neural nets were first discovered in the 1960s, but they were not very useful. They were used for some tasks in AI, but they were not the best tool for anything except maybe handwritten digit recognition until about 2008. In about 2007 (or maybe 2006?), Geoffrey Hinton discovered how to train “deep” neural nets. Neural nets have layers, and it was very difficult to train any neural net with more than three layers. Hinton found a way to train neural nets with up to 15 layers. (I attended the lecture at the NIPS conference in 2007 when Hinton introduced this new method–it was very exciting.) This method massively improved the pattern recognition abilities of neural nets, and by 2010 neural nets were one of the best pattern recognition systems. In 2012, Hinton introduced the “dropout” training method. After dropout was introduced, neural nets became better than humans in many pattern recognition tasks like image recognition.

Over the last 10 years, three other big breakthroughs occurred: GANs (generative adversarial networks), transformers, and diffusion models. These breakthroughs have created neural nets that can create photorealistic faces and digital image art, and intelligently answer almost any question (though the answer is correct only around 70% of the time). The biggest recent breakthrough is the GPT-3 neural net, which can pass final exams in several college subjects, write a good essay on just about any topic, write short computer programs, and carry on a short conversation.

For further information, I am adding links for an introduction to neural nets that actually describes what they are, and a 17 page history of AI.

https://www.analyticsvidhya.com/blog/2022/01/introduction-to-neural-networks/

https://www.dropbox.com/s/3bnucak3fbn8jgq/HistAI17.pdf?dl=0

Many games have an investment phase followed by an exploitation phase, including the card game Dominion, Terraforming Mars, Slay the Spire (during a combat), Saint Petersburg, Roll for the Galaxy, Master of Orion, Hansa Teutonica, Century Spice Road, Cash Flow, and many more.

One of the basic ideas is that you should not invest in something if there is not enough time for the investment to pay for itself.

A VERY SIMPLE HYPOTHETICAL GAME

Suppose that there are $T$ turns left in a game excluding the current turn, your current income level is $I$ dollars per turn, and you have two choices:

  1. (invest) Invest this turn to increase your income from $I$ dollars per turn to $I +i$ dollars per turn. You would then receive $I+i$ dollars per turn for the remaining $T$ turns for a total of $(I+i)\cdot T$ dollars.
  2. (don’t invest) Receive $I$ dollars this turn and the remaining $T$ turns for a total of $(T+1)\cdot I$ dollars.

 

EXAMPLE

For example, assume that

  • it is currently turn 5,

  • at the start of turn 5 you have an income of $I$ = \$2 per turn,

  • the game ends on turn 9, and

  • you have the choice between two options:

  1. Invest this turn to increase your income from \$2 per turn to \$3 per turn. You would then receive \$3 per turn for turns 6, 7, 8, and 9 for a total of $\$3\cdot 4 = \$12$.

  2. Do not invest and receive \$2 on this turn, turn 5, and on the remaining turns 6, 7, 8, and 9, for a total of $\$2\cdot 5 = \$10$.

For this example, $T=4$ because there are 4 remaining turns after the current turn. You have the option of increasing your income from \$2 to \$3, so the investment increases your income by $i$ = \$1 per turn.

The Correct Strategy

If you choose option 1, then your total earnings will be

invest_option_returns = $T\cdot (I+i)$ dollars.

If you choose option 2, then your total earnings will be

no_investment_option_returns = $(T+1)\cdot I$ dollars.

So, you should invest if

$$\begin{aligned}T\cdot (I+i) &> (T+1)\cdot I \\T\cdot I+T\cdot i &> T\cdot I + I \\T\cdot i &> I \\T&>\frac{I}{i}. \\\end{aligned}$$

So, according to the math above, you should not invest in something unless there are more than $$\frac{I}{i}= \frac1{r}$$ turns left in the game after the current turn where $$r=\frac{i}{I}$$ is sometimes called the return on investment (ROI).

When you invest $I$ dollars to get an increase of $i$ dollars per turn, it takes $I/i$ turns for the investment to “pay for itself”.
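The comparison between the two options can be sketched in a few lines of code (the function name and interface here are my own, not from the post):

```python
def should_invest(T, I, i):
    """True if investing this turn beats standing pat.

    T: turns remaining AFTER the current turn
    I: current income per turn
    i: extra income per turn bought by investing this turn
    """
    invest_total = T * (I + i)      # option 1: nothing now, then I+i for T turns
    no_invest_total = (T + 1) * I   # option 2: I this turn and for T more turns
    return invest_total > no_invest_total  # algebraically the same as T > I / i

# The example above: T = 4 turns remain, income $2, investing adds $1 per turn.
print(should_invest(4, 2, 1))  # True, since 12 > 10
```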

 

The Gain

If there are more than $\frac1{r}=\frac{I}{i}$ turns left after the current turn, then you will gain

gain = invest_option_returns – no_investment_option_returns

$$\begin{aligned}&=T\cdot (I+i) - (T+1)\cdot I\\&=T I+T i - T I - I\\&= T i - I \\&= T r I - I \\&= (Tr -1) I \\&= I r  \left(T -\frac1{r}\right) \end{aligned}$$

 

dollars by choosing to invest.  The investment pays for itself after $1/r$ turns and every turn after that gives you $I r=i$ dollars.

 

APPLYING THE CORRECT STRATEGY TO THE EXAMPLE

For the example, $T=4, I=2,$ and $i=1$, so $r = i/I =0.5$. The correct strategy is to invest if

$$T > I/i,$$

but $T=4$, which is greater than $I/i = 2$, so the best option is the investment option.


I am hoping to write about applying this idea to several games over the next few weeks.

Morgan Housel posted an article titled “Ideas that Changed My Life”.  See

https://collabfund.com/blog/ideas-that-changed-my-life/

and

https://news.ycombinator.com/item?id=33907245

 

I really like the idea that there are these great ideas that change your life.  I decided to sit down and try to write up my own list of the ideas that influenced me.  Here is my first draft (ordered from most influential to least):

 

1) Practice makes you better.

2) You can make decisions that affect the rest of your life.

3) Programming (the idea that many procedures can be broken down into precisely defined steps which can be done by a computer.)

4) People are mostly kind.  Women are somewhat more likely to be kind than men.

5) Your conscious mind can make changes to your brain and your body.  You can exercise.  You can teach yourself to like or dislike things more.  You can change your habits.

6) You can learn skills from a book or other written materials.

7) The idea and laws of probability (including Bayesian probability)

8) The ability to estimate (Fermi Questions).

9) Information Theory and  Kolmogorov Complexity (Algorithmic Complexity)

10) Polynomial and Exponential Growth  (related to Black Swans)

11) You can use algebra to solve for one variable given a few equations

12) Calculus (Esp Fundamental Theorem of Calculus)

13) Object-oriented Programming (Inheritance, Polymorphism)

14) Mathematical proof – the idea that you can take a bunch of axioms and use logic to reach many amazing and useful conclusions.  Furthermore, you can have a high degree of certainty that the result (theorem) is true especially if it is verified by another mathematician or a computer.

15) Game Theory (The Nash Equilibrium often exists and it (or they) can often be found.)

16) Energy and Momentum are both conserved.  That fact arises from the symmetries of the universe (invariance under translation by space time, and rotation (see Emmy Noether)).

17) Optimization ( many problems have an optimal solution and often calculus or math can be used to find it.)

18) You can guess many of the formulas of physics

19) Thermodynamics (Laws of Thermodynamics)

20) Recursion

21) Extending the idea of the Pythagorean theorem   (Application of Hilbert Spaces)

22) Category Theory

23) Declarative Programming

Yesterday, I wrote about how the fractions

$$\begin{aligned}
\frac{1000}{998999} &= 0.00100100200300500801302103405509… \\
\frac{10000}{99989999} &= 0.000100010002000300050008001300210… \\
\frac{100000}{9999899999} &= 0.0000100001000020000300005000080001… \\
\frac{1000000}{999998999999} &= 1.000001000002000003000005000008000013…\cdot 10^{-6} \\
\frac{10000000}{99999989999999} &= 1.000000100000020000003000000500000080000013\cdot 10^{-7} \\
\end{aligned}$$

display the first several values of the Fibonacci sequence.  The general formula for these fractions is $$f_k = \frac{10^k}{10^{2 k} - 10^k - 1}.$$

Today, I will provide a fairly simple derivation for this formula without summing an infinite series.

Let $a_1=1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8,\ldots$ be the Fibonacci sequence.  Let $a_i=0$ when $i\leq 0$.  Now, we know that by definition

$$a_n = a_{n-1} + a_{n-2}$$

for all $n>1$.  The formula above also holds for $n<1$, but fails for $n=1$.  We can fix that by adding in the Kronecker delta function $\delta_{i,j}$ which is 0 if $i\neq j$ and 1 when $i=j$.  The amended formula valid for all $n$ is

$$a_n = a_{n-1} + a_{n-2} +\delta_{n,1}.$$

In the prior article, we defined the Fibonacci fractions to be

$$f_k = \sum_{i=1}^\infty \frac{a_i}{10^{\ ki}}.$$

So,

$$\begin{aligned} f_k &= \sum_{i=1}^\infty \frac{a_i}{10^{\ ki}}  \\ &= \sum_{i=1}^\infty \frac{a_{i-1} + a_{i-2} + \delta_{i,1}}{10^{\ ki}}\\&= \sum_{i=1}^\infty \frac{a_{i-1}}{10^{\ ki}}  +\frac{a_{i-2}}{10^{\ ki}} + \frac{\delta_{i,1}}{10^{\ ki}}\\&= \frac{1}{10^{\ k}}  + \sum_{i=1}^\infty \frac{a_{i-1}}{10^k10^{\ k(i-1)}}  +\frac{a_{i-2}}{10^{2k}10^{\ k(i-2)}} \\&= \frac{1}{10^{\ k}}  + \frac{f_k}{10^k}  +\frac{f_k}{10^{2k}}. \end{aligned}$$

Thus,

$$f_k= \frac{1}{10^{\ k}}  + \frac{f_k}{10^k}  +\frac{f_k}{10^{2k}}$$

$$f_k - \frac{f_k}{10^k} - \frac{f_k}{10^{2k}} = \frac{1}{10^{\ k}}$$

$$f_k \left(1 - \frac{1}{10^k} - \frac{1}{10^{2k}}\right) = \frac{1}{10^{\ k}}$$

$$f_k = \frac{\frac{1}{10^{\ k}}}{1 - \frac{1}{10^k} - \frac{1}{10^{2k}}}$$

$$f_k = \frac{10^{\ k}}{10^{2k} - 10^k - 1}$$

which completes the derivation.
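The derivation can also be checked numerically. The sketch below is my own code (not from the post): it builds $f_k$ exactly with Python's `Fraction` and reads off the decimal expansion in blocks of $k$ digits.

```python
from fractions import Fraction

def fibonacci_fraction(k):
    """f_k = 10^k / (10^(2k) - 10^k - 1)."""
    return Fraction(10**k, 10**(2 * k) - 10**k - 1)

def decimal_blocks(frac, k, count):
    """The first `count` blocks of k decimal digits of frac, where 0 < frac < 1."""
    digits_needed = k * count
    scaled = frac * 10**digits_needed
    whole = scaled.numerator // scaled.denominator   # floor of the scaled value
    digits = str(whole).zfill(digits_needed)
    return [int(digits[j:j + k]) for j in range(0, digits_needed, k)]

# With k = 6 the early blocks are wide enough that no carries interfere:
print(decimal_blocks(fibonacci_fraction(6), 6, 8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```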

This technique is often used for Z-transforms and generating functions.

 

I think that it is really cool that some fractions have interesting patterns in their decimal expansion.  For example,

$$1/499 = 0.0020040080160320641282565130260….$$

You can see the first 8 values of the doubling sequence (a.k.a. the powers of 2): 2, 4, 8, 16, 32, 64, 128, and 256 in the decimal expansion.  If you want more values, you can just add more 9’s to the denominator.

$$\begin{aligned} 1/49999 = &0.0000200004000080001600032000640012800256005120\\&10240204804096081921638432768655373107462….\end{aligned}$$

Other fractions can encode other sequences.  The first several square numbers 1, 4, 9, 16, … are encoded by the fractions

$$\small\frac{10100}{970299},\frac{1001000}{997002999},\frac{100010000}{999700029999},\frac{10000100000}{999970000299999},\frac{1000001000000}{999997000002999999},….$$

For example,

$$\frac{1001000}{997002999} = 0.001004009016025036049064081100121144169196225….$$

It turns out that it is not too hard to find sequences of fractions which encode any sequence of positive integers that is a sum of

  • polynomials  (e.g.  cubes         1, 8, 27, 64, 125, …),
  • exponentials (e.g. powers of 3  1, 3, 9, 27, 81, 243, …), or
  • products of polynomials and exponentials.  (e.g.  cubes times power of 3     1, 24, 243, 1728, 10125, 52488, 250047,…).

The Fibonacci sequence

The Fibonacci sequence is

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ….

Each value after the initial ones is the sum of the previous two values (e.g. 2+3=5, 3+5=8, …).  It turns out that the Fibonacci sequence is a sum of exponentials, so it also can be encoded in a sequence of fractions.

$$\begin{aligned}
\frac{1000}{998999} &= 0.00100100200300500801302103405509… \\
\frac{10000}{99989999} &= 0.000100010002000300050008001300210… \\
\frac{100000}{9999899999} &= 0.0000100001000020000300005000080001… \\
\frac{1000000}{999998999999} &= 1.000001000002000003000005000008000013…\cdot 10^{-6} \\
\frac{10000000}{99999989999999} &= 1.000000100000020000003000000500000080000013\cdot 10^{-7} \\
\end{aligned}$$.

All of these fractions $f_k$ can be computed with the formula

$$f_k=\sum_{i=1}^\infty \frac{a_i}{10^{\ k i}}$$

where the $a_i$ are the values of the sequence and $k$ is a positive integer specifying the spacing between the values of $a$ in the decimal expansion.   This sum is fairly easy to compute when $a$ is a polynomial, an exponential, or a combination of polynomials and exponentials.

For example, to get the doubling numbers $a_i = 2^i$, the fractions would be

$$\begin{aligned}\sum_{i=1}^\infty \frac{a_i}{10^{\ k i}}&= \sum_{i=1}^\infty \frac{2^i}{10^{\ k i}}\\&= \sum_{i=1}^\infty \left(\frac{2}{10^k}\right)^i\\&=\frac{\frac{2}{10^k}}{1-\frac{2}{10^k}}\quad\text{using }\sum_{i=1}^\infty x^i= \frac{x}{1-x}\\&= \frac{2}{10^k - 2}.\end{aligned}$$
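As a quick sanity check of the sum above (my own sketch, not from the post), `Fraction` confirms that $2/(10^k-2)$ reduces to the fractions shown at the top:

```python
from fractions import Fraction

def doubling_fraction(k):
    """Encodes the powers of 2, spaced k digits apart: 2 / (10^k - 2)."""
    return Fraction(2, 10**k - 2)

print(doubling_fraction(3))  # 1/499, the fraction from the start of the post
print(doubling_fraction(5))  # 1/49999
```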

For the Fibonacci sequence, $$a_i = \frac{\varphi^i-\psi^i}{\varphi-\psi}= \frac{\varphi^i-\psi^i}{\sqrt 5}$$ where $$\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.61803$$ and $$\psi = \frac{1 - \sqrt{5}}{2} = 1 - \varphi = - {1 \over \varphi} \approx -0.61803.$$ (Copied verbatim with glee from the Wikipedia.)

The math required to compute the Fibonacci fractions is almost the same as the powers of two math using the same $$\sum_{i=1}^\infty x^i= \frac{x}{1-x}$$ trick.  :)

After you go through the computations, the Fibonacci fractions can be found with the formula

$$f_k = \frac{10^k}{10^{2 k} - 10^k - 1}.$$


Most people look at the first four digits of 1/7 ~= .1428 and think that the fact that 7*2=14 and 7*4=28 is a mere coincidence. It’s not. Let’s compute the decimal expansion of 1/7 without doing long division, using instead $$x/(1-x) = x + x^2 + x^3 + \ldots.$$
1/7
= 7/49
= 7*1/(50-1)
= 7*1/50/(1-1/50)
= 7*.02/(1-.02)
= 7*(.02 + .02^2 + .02^3 + …)
= 7*(.02 + .0004 + .000008 + .00000016 + .0000000032 + …)
= 7*(.020408163264…)
= 7*(.020408) + 7*(.000000163264…)
= .142856 + 7*(.00000016) + 7*(.00000000326…)
= .142856 + .00000112 + 7*(.00000000326…)
Notice that the first six digits must be .142857. The decimal expansion of 1/j either terminates or repeats with at most (j-1) repeating digits, and we have 6 digits, so those six digits must repeat, thus
1/7 = .142857142857142857142857142857142857….
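The same computation can be done exactly in code. This sketch is mine (not from the post); it uses Python's `Fraction` with $x = 1/50$:

```python
from fractions import Fraction

x = Fraction(1, 50)     # x = .02, so 1/7 = 7 * x / (1 - x)
print(7 * x / (1 - x))  # 1/7, confirming the geometric-series identity

# Six terms of the series already pin down the repeating block .142857:
partial = 7 * sum(x**n for n in range(1, 7))
print(float(partial))   # approximately 0.142857142848
```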

About a year ago, I posted a short article about optimal putting in disc golf.  I gave an approximate rule of thumb at the end of the article, but it turns out that there is a better rule of thumb, so I’m revising the post accordingly below.

I often have to make a decision when putting at the edge of the green during disc golf.  Should I try to put the disc in the basket (choice 1) or should I just try to put the disc near the basket (choice 2)? If I try to put it in the basket and I miss, sometimes the disc will fly or roll so far away that I miss the next shot.

In order to answer this question, I created a simplified model.  Assume:

  • If I do try to put it in the basket (choice 1), I will succeed with probability $p_1$.
  • If I do try to put it in the basket (choice 1), fail to get it in the basket, and then fail again on my second try, then I will always succeed on the third try.
  • If I don’t try to put the disc in the basket (choice 2), I will land near the basket and get it in the basket on the next throw.

Using these assumptions, I can compute the average number of throws for each choice.

For choice 2, I will always use two throws to get the disc in the basket.

For choice 1, there are three possible outcomes:

  • outcome 1.1: With probability $p_1$, I get it in the basket on the first throw!
  • outcome 1.2: With probability $p_2$, I miss the basket, but get the disc in the basket on the second throw.
  • outcome 1.3: With probability $p_3$, I miss twice, and get it on the third throw.

I am assuming that if I miss twice, I will always get it on the third try, so

$$p_1 + p_2 + p_3=1.$$

Let $a$ be the average number of throws for choice 1. Then $$a = p_1\cdot 1 +p_2\cdot 2 +p_3\cdot 3.$$

I should choose choice 1 (go for it) if I average fewer throws than choice 2, i.e. if $a<2$.  This occurs when

$$\begin{aligned}2 >& p_1\cdot 1 +p_2\cdot 2 +p_3\cdot 3 \\2 (p_1+p_2+p_3) >&  p_1\cdot 1 +p_2\cdot 2 +p_3\cdot 3\\p_1\cdot 2+p_2\cdot 2+p_3\cdot 2 >&  p_1\cdot 1 +p_2\cdot 2 +p_3\cdot 3\\ p_1 >& p_3.\end{aligned}$$

So, I should choose choice 1 if $p_1> p_3$.  In words,

 

“Go for it if the probability of getting it on the first try is greater than the probability of missing twice”.
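The expected-throws comparison above is easy to check numerically. Below is a minimal sketch (the function name and sample probabilities are my own, not from the post):

```python
def average_throws_go_for_it(p1, p2, p3):
    """Average number of throws for choice 1, given the three outcome probabilities."""
    assert abs(p1 + p2 + p3 - 1) < 1e-9  # the three outcomes cover all cases
    return p1 * 1 + p2 * 2 + p3 * 3

# Made-up example: 40% make it, 45% make the second throw, 15% miss twice.
a = average_throws_go_for_it(0.40, 0.45, 0.15)
print(a < 2, 0.40 > 0.15)  # prints: True True -- both tests say "go for it"
```

Note that `a < 2` and `p1 > p3` always agree, which is exactly what the algebra above shows.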


I often have to make a decision when putting at the edge of the green during disc golf.  Should I try to put the disc in the basket (choice 1) or should I just try to put the disc near the basket (choice 2)? If I try to put it in the basket and I miss, sometimes the disc will fly or roll so far away that I miss the next shot.

In order to answer this question, I created a simplified model.  Assume:

  • If I don’t try to put the disc in the basket (choice 2), I will land near the basket and get it in the basket on the next throw.
  • If I do try to put it in the basket (choice 1), I will succeed with probability $p$, where $p$ depends only on distance to the basket.
  • If I do try to put it in the basket and fail to get it in the basket (choice 1), the probability that I will succeed on the second try is $q$, where $q$ is a constant that does not depend on distance.
  • If I do try to put it in the basket (choice 1), fail to get it in the basket, and then fail again on my second try, then I will always succeed on the third try.

Using these assumptions, I can compute the average number of throws for each choice.

For choice 2, I will always use two throws to get the disc in the basket.

For choice 1, there are three possible outcomes:

  • outcome 1.1: I get it in the basket on the first throw!
  • outcome 1.2: I miss the basket, but get the disc in the basket on the second throw.
  • outcome 1.3: I miss twice, and get it on the third throw.

 

The probabilities for each of those outcomes are:  $p$, $(1-p) q$, and $(1-p)(1-q)$ respectively.

Let $a$ be the average number of throws for choice 1. Then $$\begin{aligned}a &= p\cdot 1 +(1-p)q\cdot 2 + (1-p)(1-q)\cdot 3 \\&= p + 2 q – 2 p q + 3 – 3 p – 3 q + 3 p q\\&=3 -2 p – q + p q.\end{aligned}$$

 

I should choose choice 1 if $ 2>a$.  This occurs when

$$\begin{aligned} 2 &> 3 -2 p – q + p q\\-1 &> -2 p – q + p q \\-1 &> (q-2) p – q  \\q -1 &> (q-2) p\\ \frac{q -1}{q-2} &< p   \\\frac{1-q}{2-q} &< p. \\ \end{aligned}$$

Now you can plug in various values for $q$ to find the lowest value for $p$ needed to go for it.

Probability of Success     Required Probability of
   After Missing          Success on the First Try
       100%                         0%
        99%                         1%
        95%                         5%
        90%                         9%
        85%                        13%
        80%                        17%
        75%                        20%
        50%                        33%
         0%                        50%

So, if you are 100% certain that you will put it in the basket on the second try, then you should use choice 1 (going for it) if $p>0$ (i.e. always).

If you are 90% certain that you will put it in the basket on the second try, then you should use choice 1 (going for it) if $p>0.09=9\%$.
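The table above comes straight from the inequality $p > (1-q)/(2-q)$. Here is a quick sketch of mine that regenerates it:

```python
def required_p(q):
    """Lowest make probability p on the first throw that justifies going for it."""
    return (1 - q) / (2 - q)

for q in [1.00, 0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.50, 0.00]:
    print(f"q = {q:4.0%}   ->   p > {required_p(q):3.0%}")
```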


A RULE OF THUMB

A nice approximate rule of thumb is to go for it if the sum of $p$ and $q$ is more than 100%.

When I am 6 yards from the basket, I will get it in about 75% of the time (p=0.75), and if I miss, I will get it in on the next throw about 90% of the time (q=0.90).  The sum of 75% and 90% is 165%, so obviously, I should go for it.

When I am 9 yards from the basket, I will get it in about 20% of the time (p=0.20), and if I miss, I will get it in on the next throw about 85% of the time (q=0.85).  The sum of 20% and 85% is 105%, so it seems like I should go for it.

If the basket is on a hill or if it is windy, then the disc will sometimes roll a fair distance if I miss.  In that case, $q$ might be only 75%. The rule of thumb is that $p+q$ should be at least 100% to go for it, so according to the rule of thumb, I would need $p$ to be at least 25% to go for it.  On a windy day, that’s like 6 yards for me.

 

GPT-3 is a rather amazing neural net. If you give it any snippet of English text, it will try its best to continue the text.  Stephen Marche at The New Yorker gave GPT-3 the introductory text of Kafka’s “The Metamorphosis,” which begins with,

“One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.”,

and continues with Gregor’s struggles with his parents and boss over the next few hours. At the end of this introduction, Gregor is in his bed. GPT-3 continues the story with

“As soon as Gregor was alone, he began to feel ill. Turning around was an effort. Even breathing was an effort. A thin stream of blood trickled from his flank down his fuzzy belly. He wanted to crawl away from it, but there was no place to go. He lay still on the spot where he had come to rest just in order to get his breath back and to stop the bleeding. “I’m in a bad way,” said Gregor. It had never occurred to him before that he could really become ill. He had seen sick animals—a dove once in a while, which had fallen out of the nestling into the gutter and could not fly any more, or the weak infants of the woman next door who had to be picked up with the tongs and thrown into the dustbin, or the bugs his father used to bring to him when he was still a young boy and which he had liked so much.”

The point is that GPT-3 can generate very realistic completions of any input text. When Kevin Lacker gave GPT-3 the following input,


Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

Q: What are two reasons that a dog might be in a bad mood?

GPT-3 gave the response “Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.”
Basically, it is possible to carry on a conversation with GPT-3 by initializing it with the beginning of a conversation.

So, here is how you achieve a kind of immortality. You record all of your conversations for a year or two and use a speech-to-text converter to convert all of that speech into text while labeling the speakers. For example, if you have a conversation with your friend Mary, then you would enter in text similar to


START CONVERSATION

Mary: Hey Joe, how are you?
Me: I’m doing all right.
Mary: What have you been up to?
Me: I just got back from Denver yesterday. Tim and I were working on the robotic arm all week, and it seems to be getting better at folding the clothing.
Mary: ….

END CONVERSATION
”.

You could do this for every conversation that you had.

So now if Elon Musk (one of the founders of OpenAI, which developed GPT-3) wants to simulate a conversation with you, he could take that long transcript of your conversations and append


START CONVERSATION
Elon: Hey Joe, I haven’t seen you for a while, where have you been?
Me:
”.

Then GPT-3 would type out your most likely response to that question. Once GPT-3 reaches the end of your simulated response, Elon could append his next contribution to the conversation, and once again GPT-3 would generate your most likely response to Elon. In this way, GPT-3 could create a simulation of a conversation with you.

This simulator could be improved by tweaking the parameters of the neural net to better fit your conversational style. You could also feed it many conversations between all kinds of people, each prefaced with the bios of the speakers. By giving GPT-3 more conversational text data and associated bios, the neural net could become significantly better at simulating people.

So, if you have a large collection of your own written text and you record yourself for a year, GPT-3 can create a simulation of you. This is a kind of immortality because the GPT-3 program and your conversational text can produce simulated conversations with you even after your death. These simulated conversations could become even more accurate if GPT-3 is improved further.

(One of my friends informed me that this idea has been discussed on Reddit and in Black Mirror.)

Professor Peter Griffin discovered a nice theorem about the best way to count cards in blackjack. (See the appendix in his book “The Theory of Blackjack”.) In this note, we review the theorem and reproduce his proof with more detail.

Suppose that you have a wagering game like blackjack which is played with a deck of cards. If you remove some cards from the deck, then the expected value of the game changes. Griffin found an easy formula for estimating the change in expectation caused by the removal of the cards. His formula depends on $E_j$ which is the change in expectation of the game when removing only the $j$th card from the deck.

Assume that the game never requires more than $k$ cards and the deck has $n$ cards in it. Obviously $n\geq k>0$. There are $m = {n \choose k}$ subsets of the deck $I_1, I_2, \ldots, I_m$ that contain $k$ cards each. Let $y_i$ be the expected value of the game played with the card subset $I_i$. We would like to estimate $y_i$ based on the cards in $I_i$.

We can create a linear estimate of $y_i$ by creating a vector $b=(b_1, b_2, \ldots, b_n)$ where $b_j$ is the “value” of the $j$th card. More specifically,
$$
y_i \approx \sum_{j\in I_i} b_j.
$$
Griffin found that the correct formula for $b_j$ is simply $$b_j = (\mu - (n-1) E_j)/k$$ where $\mu$ is the expected value of the game with a fresh deck. Using this value vector $b$ minimizes the squared error of the estimator.  This formula is remarkable both for its simplicity and for the fact that $k$ plays only a small role in the calculation.

Griffin’s Theorem Let the error function $e:R^n \rightarrow R$ be defined as

$$
e(b) = \sum_{i=1}^m \left( y_i - \sum_{j\in I_i} b_j \right)^2 .
$$
Then $e(b^*) = \min_{b\in R^n} e(b)$ is the unique global minimum of $e$ where
$$
b^*_j = (\mu - (n-1) E_j)/k
$$
and $\mu$ is the expected value of the game with a fresh deck.

In the theorem above, $e$ is the sum of the squared errors in the linear estimate of the expected value of the game. In order to prove the theorem, we need two lemmas.

 

Lemma 1 If $\tilde{y}_j$ is the average expectation of the $k$ card subsets that do not contain card $j$, $\bar{y}_j$ is the average expectation of the $k$ card subsets that contain card $j$, and $\mu$ is the expectation of the game with a fresh deck, then
$$\mu = \frac{k}{n}\; \bar{y}_j + \left(1- \frac{k}{n}\right)\tilde{y}_j$$
and
$$\bar{y}_j = \frac{n \mu - (n-k) \tilde{y}_j}{k}.$$

The short proof of this lemma is left to the reader.

Lemma 2
Suppose for $j = 1,\ldots, n$,
$$(2.1)\quad b_j + \alpha \sum_{i=1}^n b_i = \gamma_j.$$  Then

$$ b_j = \gamma_j - \frac{\alpha\; n\; \bar\gamma}{1 + n \alpha}$$
where $\bar\gamma = \frac1n \sum_{j=1}^n \gamma_j$.

Proof: Sum both sides of equation (2.1) to get
$$
n \bar{b} + \alpha n^2 \bar{b} = n \bar{\gamma}
$$
where $\bar{b} = \frac1n \sum_{j=1}^n b_j$. Then,
$$
\begin{aligned}
\bar{b} + \alpha n \bar{b} &= \bar{\gamma} \\
\bar{b} &= \frac{\bar{\gamma}}{1+ n \alpha}.
\end{aligned}
$$
Applying that to equation (2.1) yields
$$
\begin{aligned}
b_j + \alpha \sum_{j=1}^n b_j &= \gamma_j \\
b_j + \alpha n \bar{b} &= \gamma_j\\
b_j + \alpha n \frac{\bar{\gamma}}{1+n \alpha} &= \gamma_j\\
b_j &= \gamma_j - \frac{\alpha\; n\; \bar{\gamma}}{1+n \alpha} \\
\end{aligned}
$$
which completes the proof of the Lemma.

Now we can prove Griffin’s Theorem.

Proof:  Let the matrix $X$ be defined by $X_{ij} = 1$ if card $j$ is in set $I_i$ and $X_{ij}=0$ otherwise. We wish to minimize the sum of the squared errors. If we assume that the value of the $j$th card is $b_j$, then we can estimate the expectation of the game played using only card subset $I_i$ with
$$
\sum_{j\in I_i} b_j.
$$
The error of this estimate is $(y_i – \sum_{j\in I_i} b_j)$. The sum of the squared error is
$$
e(b) = \sum_{i=1}^m \left( y_i - \sum_{j\in I_i} b_j \right)^2.
$$
Notice that $\sum_{j\in I_i} b_j = x_i \cdot b$ where $x_i$ is the $i$th row of $X$. So,
$$
e(b) = \sum_{i=1}^m \left( y_i - \sum_{j=1}^n X_{ij} b_j \right)^2 = \| X b - y \|^2.
$$
In other words, $e(b)$ is the square of the distance between $y$ and $Xb$. Least squares (as in the Gauss-Markov setting) provides a nice formula for the $b$ which minimizes this distance,
$$
b^* = \left(X^T X\right)^{-1} X^T y
$$
where $X^T$ indicates the transpose of $X$. Hence,
$$
(1)\quad X^T X b^* = X^T y.
$$
Let $C=X^T X$. It turns out that $C$ has a very simple structure.

$C_{ij} = x_i^T x_j$ where $x_i$ and $x_j$ are the $i$th and $j$th columns of $X$, so
$$
(2) \quad C_{ii} = x_i^T x_i = { n-1 \choose k-1},
$$
and if $i \neq j$,
$$
(3)\quad C_{ij} = x_i^T x_j = { n-2 \choose k-2}.
$$

So, applying equations (2) and (3) to the $i$th row of matrix equation (1) gives

$${n-1 \choose k-1} b_i^* + {n-2 \choose k-2} \sum_{j\neq i} b_j^* = {n-1 \choose k-1} {\bar{y}_i}$$
for $i=1, 2, \ldots, n$ where $\bar{y}_i$ is the average expectation of the ${n-1\choose k-1}$ subsets with $k$ cards that contain card $i$.

Note that $$ {n-2 \choose k-2} / {n-1 \choose k-1} = \frac{k-1}{n-1} ,$$ so
$$
\begin{aligned}
b_i^* + \frac{k-1}{n-1} \sum_{j\neq i}^n b_j^* &= \bar{y}_i \\
b_i^* - \frac{k-1}{n-1} b_i^* + \frac{k-1}{n-1} \sum_{j=1}^n b_j^* &=\bar{y}_i \\
\frac{n-k}{n-1} b_i^* + \frac{k-1}{n-1} \sum_{j=1}^n b_j^* &= \bar{y}_i \\
(n-k) b_i^* + (k-1) \sum_{j=1}^n b_j^* &= (n-1) \bar{y}_i\\
b_i^* + \frac{k-1}{n-k} \sum_{j=1}^n b_j^* &= \frac{n-1}{n-k}\bar{y}_i.
\end{aligned}
$$
We apply Lemma 2 with $\alpha = \frac{k-1}{n-k}$ and $\gamma_j = \frac{n-1}{n-k} \bar{y}_j$ (so $\bar\gamma = \frac{n-1}{n-k}\mu$, since the average of the $\bar{y}_j$ over $j$ is $\mu$) to get
$$
\begin{aligned}
b_j^* &= \gamma_j - \frac{\alpha\; n\; \bar\gamma}{1 + n \alpha} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{\frac{k-1}{n-k}\; n\; \bar\gamma}{1 + n\; \frac{k-1}{n-k}} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \bar\gamma}{n-k + n(k-1)} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \bar\gamma}{n-k + n k-n} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \bar\gamma}{-k + n k} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \bar\gamma}{k (n-1)} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \frac{n-1}{n-k} \mu }{k (n-1)} \\
&= \frac{n-1}{n-k} \bar{y}_j\; - \frac{(k-1) \; n\; \mu }{(n-k)k}. \\
\end{aligned}
$$
By Lemma 1,
$$\bar{y}_j = \frac{n \mu - (n-k) \tilde{y}_j}{k},$$
so
$$
\begin{aligned}
b_j^*
&= \frac{n \mu - (n-k) \tilde{y}_j}{k} \frac{n-1}{n-k} - \frac{ n (k-1)\mu}{k (n-k)} \\
&= \frac{ n (n-1) \mu }{(n-k)k} - \frac{ (n-k) \tilde{y}_j (n-1)}{k(n-k)} - \frac{ n (k-1)\mu}{k (n-k)} \\
&= \frac{ n (n-1) \mu }{(n-k)k} - \frac{ n (k-1)\mu}{k (n-k)} - \frac{ (n-k) \tilde{y}_j (n-1)}{k(n-k)} \\
&= \frac{n \mu}{k} \left[ \frac{n-1}{n-k} - \frac{k-1}{n-k} \right] - \frac{ \tilde{y}_j (n-1)}{k} \\
&= \frac{n \mu}{k} - \frac{ \tilde{y}_j (n-1)}{k} \\
&= \frac{n \mu - (n-1) \tilde{y}_j}{k} \\
&= \frac{\mu - (n-1) (\tilde{y}_j- \mu)}{k} \\
&= \frac{\mu - (n-1) E_j}{ k }
\end{aligned}
$$
which completes the proof.
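For concreteness, here is a small numerical check (my own sketch, not from Griffin's book) that the $b^*_j$ from the theorem satisfies the normal equations $X^T X b = X^T y$ exactly. It uses exact rational arithmetic, a made-up toy "game", and lets the mean over all $k$-card subsets stand in for the fresh-deck expectation $\mu$ (which is what Lemma 1 says it equals for a real game):

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# n-card deck, games use k cards; check Griffin's formula on a made-up game.
n, k = 6, 3
cards = range(n)
subsets = list(combinations(cards, k))

# Arbitrary toy expectations: y for a subset is a quadratic in its card labels.
y = {s: Fraction(sum(c * c + 1 for c in s), 7) for s in subsets}

mu = sum(y.values()) / len(subsets)      # stands in for the fresh-deck value
E = {}
for j in cards:
    tilde = [y[s] for s in subsets if j not in s]
    E[j] = sum(tilde) / len(tilde) - mu  # change from removing card j alone

b = {j: (mu - (n - 1) * E[j]) / k for j in cards}  # Griffin's b*_j

# Row i of the normal equations X^T X b = X^T y, using equations (2) and (3):
for i in cards:
    lhs = comb(n - 1, k - 1) * b[i] + comb(n - 2, k - 2) * sum(
        b[j] for j in cards if j != i)
    rhs = sum(y[s] for s in subsets if i in s)
    assert lhs == rhs
print("Griffin's b* solves the normal equations exactly.")
```

Since $X^T X$ is invertible for these values of $n$ and $k$, passing this check means the formula really is the unique least-squares minimizer for this toy game.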

In the future, I hope to write some articles about how Griffin’s Theorem can be applied to other games.

I should mention that it is not too hard to extend Griffin’s Theorem to get a more accurate quadratic approximation of the expectation of the game with cards removed (or higher-degree polynomial approximations).
