## A Review of “Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion”

The KDD 2014 article, written by Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, Strohmann, Sun, and Zhang, describes in detail Google’s knowledge database Knowledge Vault. Knowledge Vault is a probabilistic triple store. Each entry in the database has the form subject-predicate-object-probability, restricted to about 4,500 predicates such as “born in”, “married to”, “held at”, or “authored by”. The database was built by combining the knowledge base Freebase with Wikipedia and approximately one billion web pages. Knowledge Vault appears to be the largest knowledge base ever created. Dong et al. compared Knowledge Vault to NELL, YAGO2, Freebase, and the related project Knowledge Graph. (Knowledge Vault is probabilistic and contains many facts with less than 50% certainty; Knowledge Graph consists of high-confidence knowledge.)

The information from Wikipedia and the Web was extracted using standard natural language processing (NLP) tools including: “named entity recognition, part of speech tagging, dependency parsing, co-reference resolution (within each document), and entity linkage (which maps mentions of proper nouns and their co-references to the corresponding entities in the KB).” The text in these sources is mined using “distant supervision” (see Mintz, Bills, Snow, and Jurafsky, “Distant supervision for relation extraction without labeled data”, 2009). Probabilities for each triple are calculated using logistic regression (via MapReduce). Further information is extracted from internet tables (over 570 million tables) using the techniques in “Recovering semantics of tables on the web” by Venetis, Halevy, Madhavan, Pasca, Shen, Wu, Miao, and Wu (2012).
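To make the scoring step concrete, here is a minimal sketch (in Python with scikit-learn) of how a per-predicate logistic regression might turn extraction features into a triple probability. The feature names and the tiny training set are invented for illustration; they are not the paper’s actual features or data.

```python
# Hypothetical sketch: scoring candidate triples with logistic regression.
# In the paper, one such classifier is trained per predicate over real
# extraction features; everything below is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one candidate (subject, predicate, object) triple:
# [number of sources that extracted it, mean extractor confidence,
#  1 if the extraction pattern also fires on known Freebase pairs else 0]
X_train = np.array([
    [12, 0.9, 1],
    [ 1, 0.3, 0],
    [ 5, 0.7, 1],
    [ 2, 0.2, 0],
])
y_train = np.array([1, 0, 1, 0])  # distant-supervision labels from Freebase

model = LogisticRegression()
model.fit(X_train, y_train)

candidate = np.array([[3, 0.8, 1]])
print("P(triple is true) =", model.predict_proba(candidate)[0, 1])
```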

The facts obtained by the various extraction techniques are fused using logistic regression and boosted decision stumps (see “How boosting the margin can also boost classifier complexity” by Reyzin and Schapire, 2006). Additional facts are then inferred from the extracted knowledge using two techniques: the path ranking algorithm and a modified tensor decomposition.
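As a rough illustration of the fusion step, the sketch below combines per-extractor confidence scores for the same triple with boosted decision stumps. It uses scikit-learn’s AdaBoost over depth-1 trees as a stand-in for the paper’s booster; the extractor columns and data are hypothetical.

```python
# Hypothetical sketch of knowledge fusion with boosted decision stumps.
# Columns are confidences from different extractors for the same triple
# (0.0 = that system did not extract it); the data is invented.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Columns: [text extractor, HTML-table extractor, DOM extractor, annotation extractor]
X_train = np.array([
    [0.9, 0.0, 0.8, 0.7],
    [0.2, 0.0, 0.0, 0.1],
    [0.0, 0.9, 0.6, 0.0],
    [0.1, 0.2, 0.0, 0.0],
])
y_train = np.array([1, 0, 1, 0])  # distant-supervision labels from Freebase

# Boosted decision stumps = an ensemble of depth-1 trees.
# (On scikit-learn < 1.2 the keyword is base_estimator instead of estimator.)
fusion = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
)
fusion.fit(X_train, y_train)
print(fusion.predict_proba(np.array([[0.8, 0.0, 0.7, 0.0]]))[0, 1])
```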

The path ranking algorithm (see “Random walk inference and learning in a large scale knowledge base” by Lao, Mitchell, and Cohen, 2011) can infer, for example, that if two people are parents of the same child, then they are likely married. Several other examples of inferences derived from path ranking are given in Table 3 of the paper.
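Here is a toy sketch of the path-feature idea behind path ranking: count (or random-walk-weight) the relation paths connecting two entities and treat those counts as evidence for a new edge. The tiny family graph below is invented; the real algorithm learns weights for many such paths per predicate and feeds them to a classifier.

```python
# Toy illustration of a path-ranking-style feature (invented data, not the
# paper's implementation): does the path  parent_of -> parent_of^-1  connect
# two people?  If so, they share a child, which is evidence for "married_to".
from collections import defaultdict

edges = [
    ("alice", "parent_of", "carol"),
    ("bob",   "parent_of", "carol"),
    ("alice", "parent_of", "dave"),
    ("erin",  "parent_of", "frank"),
]

children = defaultdict(set)          # person -> set of their children
for subj, pred, obj in edges:
    if pred == "parent_of":
        children[subj].add(obj)

def shared_child_count(x, y):
    """Number of paths  x -parent_of-> z <-parent_of- y  (a path feature)."""
    return len(children[x] & children[y])

# In the path ranking algorithm this count (or a random-walk probability)
# would be one feature in a per-predicate logistic regression.
print(shared_child_count("alice", "bob"))   # 1 -> evidence that they are married
print(shared_child_count("alice", "erin"))  # 0 -> no evidence
```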

Tensor decomposition generalizes singular value decomposition, a well-known machine learning technique, to higher-order arrays. The authors used a “more powerful” modified version of tensor decomposition to derive additional facts. (See “Reasoning with Neural Tensor Networks for Knowledge Base Completion” by Socher, Chen, Manning, and Ng, 2013.)
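For a sense of how a relation-specific tensor can score a candidate triple, here is a small numpy sketch in the spirit of Socher et al.’s neural tensor network. The entity vectors and relation parameters below are random placeholders rather than learned values, and the entity names are just examples.

```python
# Minimal numpy sketch of a neural-tensor-network-style score for a triple
# (subject, relation, object), after Socher et al. 2013.  All parameters are
# random here; in practice they are learned from the knowledge base.
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2                      # embedding size, number of tensor slices

entity_vecs = {"paris": rng.normal(size=d), "france": rng.normal(size=d)}
W = rng.normal(size=(k, d, d))   # relation-specific tensor (k bilinear slices)
V = rng.normal(size=(k, 2 * d))  # standard feed-forward part
b = rng.normal(size=k)
u = rng.normal(size=k)           # output layer

def ntn_score(subj, obj):
    e1, e2 = entity_vecs[subj], entity_vecs[obj]
    bilinear = np.array([e1 @ W[i] @ e2 for i in range(k)])
    hidden = np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b)
    return u @ hidden            # higher score = triple more plausible

print(ntn_score("paris", "france"))  # e.g. plausibility of (Paris, capital_of, France)
```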

The article is very detailed and provides extensive references to knowledge base construction techniques. It, along with the references, can serve as a great introduction to modern knowledge engineering.

## “Grandmaster Clash”

Even if you are not into chess, I think you will enjoy reading about Caruana’s amazing performance at Sinquefield.

(The cast comes off my hand today.  I will be able to type with both hands soon!)

## Rules for CS Research and Seven Principles of Learning

I read two nice articles this week: “Ten Simple Rules for Effective Computational Research” and “Seven Principles of Learning Better From Cognitive Science”.

In “Ten Simple Rules for Effective Computational Research”, Osborne, Bernabeu, Bruna, Calderhead, Cooper, Dalchau, Dunn, Fletcher, Freeman, Groen, Knapp, McInerny, Mirams, Pitt-Francis, Sengupta, Wright, Yates, Gavaghan, Emmott and Deane wrote up these guidelines for algorithm research:

1. Look Before You Leap
2. Develop a Prototype First
3. Make Your Code Understandable to Others (and Yourself)
4. Don’t Underestimate the Complexity of Your Task
5. Understand the Mathematical, Numerical, and Computational Methods Underpinning Your Work
6. Use Pictures: They Really Are Worth a Thousand Words
7. Version Control Everything
8. Test Everything
9. Share Everything
10. Keep Going!

Read the full five-page PLOS article by clicking here!

Scott Young recently returned from a year abroad and wrote up “Seven Principles of Learning Better From Cognitive Science”, which is a review/summary of the book “Why Don’t Students Like School?” by Daniel Willingham. Here are the seven principles:

1. Factual knowledge precedes skill.
2. Memory is the residue of thought.
3. We understand new things in the context of what we already know.
4. Proficiency requires practice.
5. Cognition is fundamentally different early and late in training.
6. People are more alike than different in how we learn.
7. Intelligence can be changed through sustained hard work.

Read the full five-page article by clicking here! Get the great $10 book here.

## Computers are getting better at Go

Monte Carlo Tree Search (survey) is working rather well for Go.

## “Humans Need Not Apply”

I broke my hand on Thursday, so typing is difficult. Check out this video: https://www.youtube.com/watch?v=7Pq-S557XQU

## DeepLearning.net’s Reading List

Topics covered include:

- 5 General Overview papers (Bengio, Schmidhuber, Graves, …)
- 12 Foundation Theory and Motivation papers (Bengio, Hinton, LeCun, …)
- 7 Computer Vision papers (Deep Convolutional Networks, Hierarchical Features, Practical Applications)
- 5 papers on Natural Language Processing and Speech (Recursive Auto-encoders, Bidirectional Long Short-Term Memory)
- 15 papers on Unsupervised Feature Learning (Deep Boltzmann Machines, Autoencoders, …)
- About 40 papers under the headings: Disentangling Factors and Variations with Depth, Transfer Learning and Domain Adaptation, Practical Tricks and Guides, Sparse Coding, Classification, Large Scale Deep Learning, Recurrent Networks, Hyperparameters, Miscellaneous, and Optimization.

They conclude with links to three other machine learning reading lists and three deep learning tutorials.

## Gunnar Carlsson on the Shape of Data

Carl sent me a YouTube video by Dr. Gunnar Carlsson on the application of topology to data mining (topological data analysis). Dr. Carlsson created a short five-minute introduction and a longer video of one of his lectures. For more information, check out “Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival” by Nicolau, Levine, and Carlsson. Also, Bansal and Choudhary put together a nice set of slides on the subject with applications to clustering and visualization.

## The Best Deep Learning Blog Post Ever (Christopher Olah)

Christopher Olah wrote an incredibly insightful post on deep neural nets (DNNs) titled “Deep Learning, NLP, and Representations”. In his post, Chris looks at deep learning from a natural language processing (NLP) point of view. He discusses how many different deep neural nets designed for different NLP tasks learn the same things. According to Chris and the many papers he cites, these DNNs will automatically learn to intelligently embed words into a vector space. Words with related meanings will often be clustered together. More surprisingly, analogies such as “France is to Paris as Italy is to Rome” or “Einstein is to scientist as Picasso is to painter” are also learned by many DNNs when applied to NLP tasks. Chris reproduced the chart of analogies below from “Efficient Estimation of Word Representations in Vector Space” by Mikolov, Chen, Corrado, and Dean (2013).

*Relationship pairs in a word embedding. From Mikolov et al. (2013).*

Additionally, the post details the implementation of recurrent deep neural nets for NLP. Numerous papers are cited, but the writing is non-technical enough that anyone can gain insights into how DNNs work by reading Chris’s post. So why don’t you just read it like NOW: CLICK HERE.
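To see what that analogy arithmetic looks like in code, here is a toy sketch with hand-picked two-dimensional vectors. Real embeddings (word2vec and friends) are learned from large corpora and live in hundreds of dimensions; the vectors and the `nearest` helper below are made up for illustration.

```python
# Toy illustration of analogy arithmetic in a word embedding.
# These 2-D vectors are invented; learned embeddings behave similarly
# but are trained from text and are much higher-dimensional.
import numpy as np

emb = {
    "france": np.array([1.0, 3.0]),
    "paris":  np.array([1.2, 4.0]),
    "italy":  np.array([3.0, 3.1]),
    "rome":   np.array([3.2, 4.1]),
    "berlin": np.array([5.0, 6.0]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

# "France is to Paris as Italy is to ?"  ->  paris - france + italy
query = emb["paris"] - emb["france"] + emb["italy"]
print(nearest(query, exclude={"paris", "france", "italy"}))  # prints "rome"
```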
## “Multi-stage Markov Chain Monte Carlo Methods for Porous Media Flows”

I was doing a little bit of research on multi-stage Markov chains for a friend when I ran across this nice set of slides on multi-stage Markov chains for diffusion through porous media (i.e., oil or water flowing through the ground) by Pereira and Rahunanthan.

The authors first review the multi-scale equations for diffusion under pressure (see Darcy’s law) and mixed finite element analysis. Then, interestingly, they introduce a Bayesian Markov chain Monte Carlo method (MCMC) to get approximate solutions to the differential equation. They use multi-stage Metropolis–Hastings and prefetching (to parallelize the computation) to improve the speed of MCMC. The slides also display the results of their algorithm applied to aquifer contamination and oil recovery. It’s really cool how methods used for probabilistic graphical models can be used to approximate the solutions to partial differential equations.
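For readers who have not seen multi-stage (delayed-acceptance) MCMC before, here is a minimal Python sketch of the two-stage Metropolis–Hastings idea: screen each proposal with a cheap surrogate density before paying for the expensive one. The two densities below are toy stand-ins; in the slides the expensive evaluation would involve solving the flow equations.

```python
# Hedged sketch of two-stage ("delayed acceptance") Metropolis-Hastings.
# Both log-densities are toy stand-ins chosen only to make the code run.
import math
import random

def log_expensive(x):            # stand-in for the true (costly) log-posterior
    return -0.5 * x * x

def log_cheap(x):                # stand-in for a coarse-grid / surrogate model
    return -0.5 * (x / 1.1) ** 2

def two_stage_mh(n_steps, step=1.0, x=0.0):
    samples = []
    for _ in range(n_steps):
        y = x + random.gauss(0.0, step)        # symmetric random-walk proposal
        # Stage 1: accept/reject using only the cheap surrogate.
        if random.random() < math.exp(min(0.0, log_cheap(y) - log_cheap(x))):
            # Stage 2: correct with the expensive density so the chain
            # still targets the true posterior.
            log_alpha = (log_expensive(y) - log_expensive(x)) \
                        + (log_cheap(x) - log_cheap(y))
            if random.random() < math.exp(min(0.0, log_alpha)):
                x = y
        samples.append(x)
    return samples

draws = two_stage_mh(10_000)
print(sum(draws) / len(draws))   # should be near 0, the mean of the target
```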
I’m hoping to get back to 2048 in a few weeks. Turns out that it takes a long time to write code, make the code look nice, run the code, analyze the results, and then put together a blog post. It’s much easier and quicker to read papers and summarize them or to try to explain things you already know.

Have a great weekend. – Hein

## Mathematical Concepts I use (almost) every week

So, I noticed Brian Rushton’s question on matheducators.stackexchange.com. He asked,

##### “What should be included in a freshman ‘Mathematics for computer programmers’ course?”

User dtldarek had a pretty cool reply broken up into three parts:

- Math that is actually useful.
- Math that you can run into, and is generally good to know.
- Math that lets you build more awesome math.

His answer got me to think about and revise my “100 Most useful Theorems and Ideas in Mathematics” post. I wondered which of the 100 I used every week. Rather than trying to think about every week, I just thought it would be interesting to identify which ideas and theorems I used this week, excluding the class I am teaching.

You can’t seem to avoid the first several ideas on the top 100 list: counting, zero, decimal notation, addition, subtraction, fractions, and negative numbers. Everyone uses these ideas when they buy anything or check their change. (I wonder how many people rarely count the change from the cashier.)

The next group was mostly about algebra: equality, substitution, variables, equations, the distributive property, polynomials, and associativity. Every week I attend two math seminars (Tensor Networks and Approximation Theory), so I use these ideas every week, just to add to the discussion during the seminar.

Strangely, I can’t remember using scientific notation this week except while teaching. During class, I briefly mentioned the dinosaur-killing asteroid (see the Chicxulub crater). It’s hard to calculate any quantity about the asteroid (energy $= 4 \times 10^{23}$ Joules, mass $= 2 \times 10^{15}$ kg?, velocity $\approx$ 20,000 meters/sec) without using scientific notation. But, other than during class, I don’t remember using scientific notation.

During the Approximation Theory seminar, we discussed polynomial approximations to the absolute value function, so we used polynomials, infinity, limits, irrational numbers (square roots), the idea of a function, inequalities, basic set theory, vector spaces (esp. Hilbert/Banach spaces), continuity, the Karush–Kuhn–Tucker conditions, linear regression, the triangle inequality, linearity, and symmetry during our discussion. That seems like a lot, but when you have been using these ideas for decades, they are just part of your vocabulary.

So between the approximation theory seminar and daily restaurant mathematics, I used most of the first 40 items from the top 100 list.

Earlier in the week, I wrote up a simple category theory proof about isomorphisms and pull-backs. Turns out, if you pull back an isomorphism along another arrow, then you get another isomorphism. For example, in the commutative diagram below, the top arrow and the bottom arrow are both isomorphisms (isometries even), and the entire diagram is a pullback. If the bottom arrow is an isomorphism and the entire diagram is a pullback, then the top arrow is an isomorphism. (The reverse is not true.) Working on the proof got me thinking about trig functions, modular arithmetic, injective/surjective functions, Fourier transforms, duality, eigenvalues, metric spaces, projections, the Riesz Representation Theorem, Plancherel’s theorem, and numerical integration.

Probability and information theory were largely missing from my week until I stopped by Carl’s house yesterday. He and I got into a discussion about category theory, information theory, and neural networks. One thing we noticed was that 1-1 functions are the only functions that conserve the entropy of categorical variables. For example, if $X \in \{1,2,3,4,5,6\}$ is a die roll, then $Y = f(X)$ has the same entropy as $X$ only if $f$ is 1-1. Turns out you can characterize many concepts from category theory using entropy and information theory. So we used random variables, probability distributions, histograms, the mean (entropy is the mean value of the negative log of the probability), standard deviations, the Pythagorean theorem, statistical independence, real numbers, correlation, the central limit theorem, Gaussian distributions (the number e), integrals, the chain rule, spheres, logarithms, matrices, conic sections (the log of the Gaussian is a paraboloid), Jacobians, Bayes’ Theorem, and compactness.
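As a quick sanity check of that entropy claim, here is a short Python snippet comparing the entropy of a fair die roll with the entropy of its image under a 1-1 function and under a non-injective one. The particular functions are arbitrary choices made for the example.

```python
# Quick check: applying a function to a categorical random variable preserves
# entropy exactly when the function is 1-1 on the support of the variable.
from collections import Counter
from math import log2

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def push_forward(dist, f):
    """Distribution of Y = f(X) when X has distribution `dist`."""
    out = Counter()
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

die = {x: 1 / 6 for x in range(1, 7)}                 # fair die: H = log2(6)

print(entropy(die))                                   # ~2.585 bits
print(entropy(push_forward(die, lambda x: 7 - x)))    # 1-1: same entropy
print(entropy(push_forward(die, lambda x: x % 2)))    # not 1-1: drops to 1 bit
```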

I almost avoided using the Pigeon Hole Principle, but then Carl argued that we could use the definition of average value to prove the Pigeon Hole Principle.  I still don’t feel like I used it, but it did come up in a discussion.

Notably (and surprisingly) missing from my week were: derivatives, volume formulas (volume of a sphere or cube), Taylor’s theorem, the fundamental theorem of calculus, and about half of the ideas between #63 Boolean algebra and #99 uncountable infinity.

So, it looks like I typically use around 70 ideas from the list each week—more than I thought.

I will probably get back to 2048 stuff later this month and maybe a post on Fourier Transforms.

Have a great day!    – Hein