Since the 1990s, advancement in technology has drastically changed the way we go about our lives. Before then, there was no (notable) internet, and you had to press the “Turbo” button on your PC to get to 100 MHz calculation speed. I remember having to go to the payphone on the street if I wanted to call my friends.

Now most people carry around a pocket device that would have been considered a super-computer not too long ago – and you can call anyone across the world or download information on any topic in an instant.

This fundamental change in how we communicate and interact with the world enables companies such as internet service providers (ISPs), search engines (e.g. Google or Baidu), and social networks (e.g. Facebook) to collect huge amounts of data (hence “**big data**”).

Here, “data” refers to everything from emails and photos, to the way you move your mouse on the screen.

These data, together with improved hardware, are what enabled the ongoing machine learning revolution. **Machine learning** (ML) refers to a set of methods that enable computers to analyze data – even if the data are “fuzzy”, e.g. images, text, or audio recordings.

Such fuzzy data were originally thought to be impossible for computers to understand. A computer was a calculation machine that was good at adding and multiplying numbers – not at recognizing a human face. But it turned out that the problem of recognizing a human face in an image can be formulated as a mathematical problem. Nowadays, ML techniques enable computers to solve it.

Many of these techniques (such as artificial **neural networks**) have been known for some time, but since they require strong computers and large datasets to train on, it was not feasible to use them until about 2006.

Presently, I suspect that the uses of ML with the greatest social impact are (i) captivating your attention, (ii) automating some tasks that would be simple but slow for humans to do (like cataloging images, improving search results, suggesting restaurants, news feeds and movies that you might be interested in), and (iii) financial and insurance-related applications.

To do these things, ML algorithms essentially learn to model and extract useful information from the data that they are given. The “quality” of these data is essential, however. For example, state-of-the-art image classifiers (a particular kind of ML algorithm) can distinguish a thousand different objects in images when trained on the ImageNet dataset. But developing a new, specialized image classifier from scratch still takes months, because the existing classifiers do not generalize well and few training datasets are as well organized as ImageNet.

The best results are obtained when the training data can be generated. For example, in board games like Chess and Go, the state-of-the-art ML algorithm AlphaZero can beat the previous world champions, even though it learned these games just by playing against itself and learning from the moves it made (so the data were generated by the algorithm itself). But today’s ML techniques also have many problems.

For example, if you train an ML algorithm to recognize faces, but your data set only contains white people, then it will not recognize black people, or vice versa. The good news is that many people are working hard on correcting such biases, and progress is being made.

It is also very difficult to specify what we actually want, because ML systems lack common sense. In a set of ML techniques called **reinforcement learning** (RL), for example, you train an “agent” by rewarding it for good outcomes. So you may reward an RL agent for collecting points along the track in a boat racing video game, and of course you expect it to finish the game. Instead, it intentionally causes an accident that puts it into an infinite loop where the agent can collect rewards forever – without finishing the race.

For Facebook and similar platforms, ML algorithms create a model of the social group that you belong to. If you provide enough data (“likes”, comments, or just the information about whom you are communicating with), these algorithms will be able to predict your personal preferences with high accuracy.

If the ML algorithm learns how likely you are going to react to a certain **meme** (statement or image), it also knows how to capture your attention. Hence, it can fill your news feed entirely with things that will make you stay there for longer. It is usually programmed to do so, because a longer engagement time makes you more likely to be influenced by the advertisements that appear in between the news.

Unfortunately, we are more inclined to engage with memes that make us angry than with memes that trigger other emotions or are emotionally neutral. More generally, ML algorithms learn how to exploit our emotional weak points, even though they were not programmed to do so – they simply *find* this strategy as a solution to their goal of maximizing human engagement with a website.

As the video by CGP Grey illustrates, this can be hugely detrimental for human relationships. However, for things like movie or restaurant suggestions, it is great if my phone can suggest things to me that I will actually enjoy.

Other common applications of ML are, for example, face recognition and automatic image enhancement of your camera, email spam filtering, automatic recognition of facial features used by Snapchat, voice and handwriting recognition, and forecasts of the stock market.

A recent trend is the automatic generation of text, images, and even videos by ML algorithms. Once more, this has some good applications, but it also comes with issues that may soon become disruptive to our society. For example, it is already possible to automatically superimpose a person’s face onto a video recording, so that it looks like that person did things they never did. Imagine it became cheap to generate a million photo-realistic videos of public figures doing things they have never actually done. This is already possible with voice generation; with video we are not quite there yet, though I think it should be possible within the next decade.

On the positive side, ML algorithms can help reduce power consumption for server farms and other facilities, as well as help doctors with medical diagnoses and treatment selection.

As a side effect, yet of no less importance, we learn a great deal more about ourselves and how our brains work, as these questions overlap with AI research.

So machine learning is everywhere, and there are good, bad, and risky applications of one and the same technology.

The most pressing near term issue is probably that of **lethal autonomous weapons** (LAWs).

All parts of this technology already exist today, but as far as is publicly known, it has not been militarized yet. If we manage to make LAWs as internationally stigmatized as chemical or biological weapons, we may have a chance to prevent fatal scenarios such as the one presented in the video above.

You can find more information about what you can do personally to steer our civilization away from LAWs on this website: autonomousweapons.org, which is supported by thousands of AI researchers (including myself), and backed by the Future of Life Institute and other organizations.

Since ML techniques can only become better over time, we should also expect increasingly sophisticated scams and targeted attacks to make money or gain influence. Platforms like YouTube, Facebook, and Twitter are already in an ongoing arms race with bad actors that try to do just that.

It is a very hard problem, and ML will play a role on both sides of the battle.

The most important component in this battle is you. And thus, you are in the best place to do something about it. For example: (i) always think twice before you share a meme – even if you like it, it may not be true; (ii) socialize with people in the real world and practice critical thinking in day-to-day situations; (iii) always remember that when you write a comment, there is a person on the other end reading it.

A more positive development is the current progress in **autonomous driving**. This technology – even if only applied to trucks and trains – has the potential to greatly reduce the number of traffic accidents. If you are a young truck driver, however, you may want to educate yourself about alternative jobs.

The displacement and disappearance of jobs in general may be a central issue in the mid-term future.

It is unclear how long it will take until self-driving cars are available to the public. As I wrote earlier, ML algorithms have no common sense. This makes the autonomous driving problem so hard. But it is certainly solvable.

Thoroughly beneficial new technologies that may soon come out of machine learning include, for example, (i) improved accuracy in medical diagnoses, and (ii) automatic optimization of energy distribution in computer clusters and wind farms.

We can also expect advances in medicine through indirect contributions of ML. For example, recently DeepMind created a novel algorithm that can solve the protein folding problem with high accuracy, which is essential for medical research.

You may have noticed that so far I have used the term machine learning (ML) exclusively, and did not write anything about **artificial intelligence** (AI). This is because these terms are not clearly defined, and AI is especially hard to pin down. Specifically, it seems like whenever a new AI algorithm solves a previously unsolvable problem (like face recognition), it is not considered AI anymore.

Today’s ML algorithms are good at specific tasks, and it is debatable whether they can become generally intelligent without a major paradigm shift. Nevertheless, for **artificial general intelligence** (AGI) the question is not “if”, but “when” it will be created (barring prior destruction of our global society). Predictions about AGI timelines vary a lot, but mostly focus on the present century.

When talking about AGI, an algorithm’s competence is typically compared to that of a human. If we manage to write a computer program that is able to solve every cognitive task an average human can solve (from arithmetic to composing music and consoling a heartbroken friend), then we’d say that we have created AGI.

But there is no good reason to assume that it stops at the human level. There may be levels of intelligence that are vastly superior to ours. This is known as **superintelligence**. You could imagine a single superintelligent AGI as a whole civilization of Einsteins and Mozarts who all work together in perfect harmony and think 100 times faster than their human equivalents since they are not bound by the limitations of actual biological brains.

Before we program the first AGI, however, there is (at least) one critical issue to solve: the **alignment problem**.

It roughly goes like this: How do I ensure that a being that is vastly superior to me keeps doing what I would approve of? Even if it knows that what I want is stupid and I am incapable of seeing this?

Personally, I concur with Eliezer Yudkowsky’s perspective that we can be certain that if we build a superintelligence without solving the alignment problem first, then it will change the world in a way we do not approve of, and (by definition) it will be damned good at doing this.

The alignment problem is extremely hard, and **we will likely need at least as long to solve it as we need to develop AGI**. Therefore, we have to start working on it right now. Research institutes like MIRI and the Future of Humanity Institute are committed to solving this problem, but considering its scope, there should be many more people working on it.

If you are not a mathematician or AI researcher, there is still something that you should think about: What society do we actually want our descendants to live in? What do we value? If you had the power of a superintelligence, what *should* you try to do? This is something that has to be decided by all of us, AI researcher or not. The Future of Life Institute is in fact running a survey about this.

If you have never thought deeply about AI, I imagine that you probably hold one of two viewpoints.

One viewpoint is that, when you weigh the benefits against the problems, you decide that the benefits are not worth the risk. So why do all that? Can’t we just stop here, or even downgrade to an earlier level of technology?

The answer is no, we cannot. First of all, the economy requires us to never stop innovating – it won’t work otherwise. And innovation also means developing ever smarter ML/AI algorithms. We might want to change how the economy works, of course, but this is beyond the scope of this article.

Second, the world has already changed and continues to do so. Things that worked in the past (e.g. combustion engines) often don’t work as a long-term solution (they cause climate change and depend on limited resources). Today’s problems (climate change, organizing a global society, etc.) are more complicated than anything we have dealt with in the past, and solving them requires novel ideas and technologies.

The other viewpoint is that “everything will be alright” – the gut feeling that no matter what we do, no matter how bad things get, we’ll somehow get out of it and move on.

But the most important lesson that I have learned in my study of physics is that **nature does not care about us. That is why it is so important that we care about each other.**

Nature does not care if we live or die, if we are happy or if we suffer. It does not even care if we understand nature, or if we have a sense of purpose. Because nature is not a person. These things matter to *us*, however, and we need to get our act together to make this right.

This insight is particularly important when we consider the big picture. There is no grand plan. No guarantee that we are heading in a good direction, no safety-net that ensures that life continues irrespective of what we do. But there are also no constraints on how good the world could become, apart from the laws of physics.

It is up to us to figure out what direction is good for us, where we want to go, and how we go there.

Personally, I think we should aim to develop AI that helps us become the best people that we can be. Instead of suggesting enraging memes to us, it should help us understand each other better. Instead of learning to manipulate us for the sake of power and influence, it should learn to motivate us to reach excellence at the things we love to do. Once we become the best version of ourselves, we can decide where we want to go from there.

The algorithm is very simple. Say you have a set of high-dimensional vectors and you want to represent them in an image, such that each vector is associated with a pixel of that image, and the similarity between vectors should correspond to the distance between pixels.

As a first step, we associate a random vector with each pixel in our map. Then we go through the input vectors and compute the “best matching unit” (BMU), which is the vector in our map that is closest to the input.

Now we update the BMU, and vectors that are close to it on the map, to be more like the input vector. How strongly we shift the vectors may depend on the distance from the BMU, and with each new input, the sphere of influence should decrease.

For example, we can take the RGB vectors of 1000 random colors as inputs, and create a 2D color map:

Simple (non-optimized) Python code can generate such a map.
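For example, a minimal self-organizing map over RGB colors might look like this (the 40 × 40 map size and the linearly shrinking neighborhood radius and learning rate are my own choices, not necessarily those used for the image):

```python
import numpy as np

def train_som(inputs, map_size=40, seed=0):
    """Train a simple self-organizing map (SOM) on a set of vectors."""
    rng = np.random.default_rng(seed)
    # Step 1: associate a random vector with each pixel of the map.
    som = rng.random((map_size, map_size, inputs.shape[1]))
    yy, xx = np.mgrid[0:map_size, 0:map_size]  # grid coordinates of each pixel
    for t, x in enumerate(inputs):
        # Step 2: find the best matching unit (BMU), i.e. the map
        # vector that is closest to the current input.
        dist = np.linalg.norm(som - x, axis=2)
        by, bx = np.unravel_index(np.argmin(dist), dist.shape)
        # The sphere of influence and the learning rate shrink with each input.
        frac = t / len(inputs)
        sigma = max(map_size / 2 * (1 - frac), 1.0)
        lr = 0.01 + 0.5 * (1 - frac)
        # Step 3: pull the BMU and its neighbors on the map toward the input.
        g = np.exp(-((yy - by) ** 2 + (xx - bx) ** 2) / (2 * sigma ** 2))
        som += lr * g[..., None] * (x - som)
    return som

# 1000 random RGB colors, as in the color map example above.
colors = np.random.default_rng(1).random((1000, 3))
color_map = train_som(colors)  # shape (40, 40, 3), viewable as an image
```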

The space of all concepts is enormously large. Much larger than the space of all possible things. But somehow our brains can navigate this space and find meaningful relations between concepts. How does this work, and how is this related to natural language?

In natural language, we don’t give one word to one concept. Instead, the same word may describe different concepts, depending on the context. For example, the word “organ” may refer to a musical instrument, or to an assemblage of tissues.

Intuitively, we can “add” or “subtract” words to refine concepts. For example,

\text{organ} - \text{tissue} = \text{instrument}\;,

or

\text{car} + \text{fast} = \text{sports car}\;.

But until Webber’s work (which I am describing here), it was very difficult to make a computer handle such relations. So how does it work?

The key insight comes from neuroscience. Specifically, neuroscientists have hypothesized that the outermost layer of the brain, called the neocortex, is essentially made up of a large number of physical, two-dimensional maps of concept space, known as **cortical modules**. Crucially, although every point on such a map corresponds to a concept, not every concept corresponds to a point on that map. Instead, some concepts are *combinations* of points on the map.

For example, there may be a point for “car”, and a point for “fast”. If both are active, then this represents “sports car”.

We can abstract this idea to computer science, and mimic a cortical module as a two-dimensional binary array. Say it consists of 128 \times 128 bits. To assign meaning to these bits, we take a large body of text, e.g. Wikipedia, slice the raw text into snippets, and then assign one bit in this 128 \times 128 matrix to each snippet in such a way that snippets with similar content point to bits that are close to one another. This process is known as **semantic folding**.

The folding can be done with respect to two different distance measures: associative and synonymous.

The clustering of alike concepts in the semantic folding process could be done, for example, using self-organizing maps, although Webber never seems to specify the exact algorithm that he uses in his work.

Now that the semantic map is created, we can create **semantic fingerprints** of words. To this end, we would activate all points in our semantic map that correspond to snippets in which the word appeared. This creates a **sparse distributed representation** of the word, using one 128\times128 binary matrix.

For a whole sentence or document, we would simply add up the maps of all the words within the document.

Words that are unspecific (a.k.a. **stop-words**), such as “with” or “it”, will activate points all over the map, whereas very specific words, such as “cake” or “molecule”, will only activate a few points on the map.

We eliminate the stop-words, and thereby create a sparse representation of the document, by deactivating all but the most active 2% of the points on the map. In this way, only the most important semantic points remain, and we are left with a sparse distributed representation of the entire document.
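As a sketch, this stop-word elimination step might be implemented as follows (`document_fingerprint` is my own hypothetical helper; for simplicity, ties at the cutoff activation are all kept):

```python
import numpy as np

def document_fingerprint(word_maps, keep=0.02):
    """Sum binary 128 x 128 word maps, then keep only the most active bits."""
    total = word_maps.sum(axis=0)                # per-bit activation counts
    k = max(int(keep * total.size), 1)           # number of bits to keep (top 2%)
    threshold = np.sort(total, axis=None)[-k]    # activation of the k-th strongest bit
    return (total >= threshold).astype(np.uint8) # sparse binary representation

# Toy maps: a "stop-word" activates bits all over the map,
# while a specific word activates only a small cluster.
rng = np.random.default_rng(0)
stop_word = (rng.random((128, 128)) < 0.3).astype(np.uint8)
specific = np.zeros((128, 128), dtype=np.uint8)
specific[10:12, 10:12] = 1
doc = document_fingerprint(np.stack([stop_word, specific, specific]))
```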

The possible applications of these fingerprints are plenty. Essentially, everything that natural language processing is trying to do might be made possible with semantic fingerprints. Check out these demos at cortical.io to see a few applications in action.

This is another one of those ideas that blew my mind. I am baffled that semantic fingerprinting does not even appear on the Wikipedia page about natural language processing.

Sparse distributed representations could just as well be used to encode visual or audio data, but to my knowledge this has not yet been explored.

Since I have just spent time studying neural processes, it seems clear to me that there is a close relation between NPs and the theory presented here. I wonder if the performance of NPs can be improved by re-designing the encoder(s) such that two-dimensional sparse distributed representations are generated.

Let’s say you want to train a reinforcement learning (RL) agent to solve some task in a complicated environment. All that is given to the agent is the raw image data of its field of view, a set of actions it can take at any time step, and a reward signal.

Ha and Schmidhuber’s approach to this problem is to program separate components for (i) translating the visual input into an abstract representation of the situation, (ii) predicting the future state in terms of that representation, and (iii) choosing actions to take. The third component is intentionally made very simple, so that most of the “work” goes into understanding the environment and the consequences of the agent’s actions.

Incidentally, this setup very much reminds me of my own Rubik’s Cube project, which is why I now blog about this article.

The vision component is a *variational autoencoder*, which, simply put, is a neural network that takes an image as input and tries to reproduce the same image as output, but is constrained by a hidden layer that has much smaller dimension than the image. Thus, training this autoencoder produces an abstract representation z (known as the “latent representation”) of the image in the activation of that hidden layer. It is a little bit more complicated than that, but let’s discuss this in another blog post.

The prediction component is a recurrent neural network (RNN). As external inputs, it takes the representation vector z and an action a, and it tries to predict a distribution over possible z vectors in the next time step.

As a recurrent input, the prediction component feeds the state of one of its hidden layers into its next iteration. Thus, it resembles a specific kind of RNN, known as a long short-term memory (LSTM) network, but which variant of LSTM it is is not stated explicitly in the paper (the appendix contains a reference to another paper, however).

It makes sense to implement the prediction component as an LSTM. This allows the network to keep track of information that is relevant for the prediction task over longer periods of time (e.g. “What level of the game am I in?”), and at the same time to selectively pick the information that is relevant at the present moment.

Finally, the policy component is a simple linear unit that takes the state z and the hidden layer from the prediction unit as inputs, and outputs the action to take.

Note that the policy component does not take the predicted future state distribution as input! Not even a sample of it. Instead, it takes the hidden layer of the LSTM. This makes sense, as this hidden layer should convey the relevant information about the present situation. This approach is very different from, say, Monte Carlo tree search over possible futures, and it seems to be more efficient.
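In code, such a linear policy component is essentially one matrix multiplication over the concatenated latent state and LSTM hidden state (the layer sizes and the tanh squashing below are my own assumptions, not taken from the paper):

```python
import numpy as np

def policy(z, h, W, b):
    """Linear policy: map the concatenated [z, h] to an action vector."""
    return np.tanh(np.concatenate([z, h]) @ W + b)

rng = np.random.default_rng(0)
z = rng.normal(size=32)    # latent representation from the vision component
h = rng.normal(size=256)   # hidden state of the prediction LSTM
W = rng.normal(size=(288, 3)) * 0.1  # 288 = 32 + 256 inputs, 3 action dims
b = np.zeros(3)
action = policy(z, h, W, b)  # e.g. (steer, gas, brake) in a racing game
```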

Ha and Schmidhuber demonstrate several variations of their setup, as it performs on a simple car racing game and the VizDoom environment, in which the agent has to step left or right to evade oncoming fireballs and stay alive as long as possible.

The fun part is where they let the agent train in a “dream world”. That is, once the agent has trained its vision and prediction components, it can generate scenarios of sequences of future states and even visualize these states using the decoder-part of its vision component (the variational auto encoder).

Thus, the agent can train its policy component without taking actions in the “real” world. This seems particularly useful for training autonomous robots in the actual real world.

As one would expect, the “dreams” are not entirely realistic. Thus, the agent sometimes learns to exploit certain “bugs” in the dream. To counter this, Ha and Schmidhuber introduce a “temperature parameter” which controls the overall uncertainty over the predicted future states. Turning up the temperature makes the dream less predictable, which seems to remedy the exploitation issue.

This article by Ha and Schmidhuber is an interesting summary of a large body of work. The interactive version is especially fun to play with.

Most interesting to me was the idea of feeding the hidden state of the predictor LSTM into the policy component. This might be much more efficient than searching through the space of possible futures, and I can probably use this technique in my Rubik’s Cube project.

It would be interesting to see if one can improve the performance of Ha and Schmidhuber’s framework by replacing the variational auto encoder with a neural process .

We wanted to establish whether Snickers bars from different countries taste different or not. To this end, we collected three Snickers bars, one from England (GB), one from Germany (DE), and one from Vietnam (VN). There are five plausible hypotheses:

- All Snickers bars taste the same: \mathcal{H}_{=}
- All Snickers bars taste different: \mathcal{H}_{\neq}
- The German and English bars are identical, but the Vietnamese is different: \mathcal{H}_{VN}
- The German and Vietnamese bars are identical, but the English is different: \mathcal{H}_{GB}
- The English and Vietnamese bars are identical, but the German is different: \mathcal{H}_{DE}

During the experiment we sliced the three bars into 12 slices each. Then we randomly paired slices and assessed whether they tasted the same or not. For example, one measurement may result in the observation \rm{DE} = \rm{GB} and another measurement may result in \rm{GB} \neq \rm{VN}. In addition, we also compared Snickers bars to themselves, so we might obtain \rm{VN} = \rm{VN}. We can perform a Bayesian update (explained below) on the above hypotheses after each measurement.

The measurement is subjective and may be affected by substantial noise. In particular, given two samples the experimenter may decide they taste different, even though they are equal, or vice versa. For simplicity, we assume a single failure rate \epsilon \in [0,1] for all experiments: given one sample pair, the probability that any experimenter misjudges the equality of the samples is \epsilon.

To make the five hypotheses that we have introduced above more explicit, we write down the outcome probabilities of an experiment (likelihoods), given that the hypothesis k is true and \epsilon is known.

For the all-equal hypothesis \mathcal{H}_{=} we have:

P(\rm{DE} = \rm{GB}\,|\,\mathcal{H}_{=}, \epsilon) = P(\rm{DE} = \rm{VN}\,|\,\mathcal{H}_{=}, \epsilon) = P(\rm{GB} = \rm{VN}\,|\,\mathcal{H}_{=}, \epsilon) = 1 - \epsilon\;,

P(\rm{DE} \neq \rm{GB}\,|\,\mathcal{H}_{=}, \epsilon) = P(\rm{DE} \neq \rm{VN}\,|\,\mathcal{H}_{=}, \epsilon) = P(\rm{GB} \neq \rm{VN}\,|\,\mathcal{H}_{=}, \epsilon) = \phantom{1 - } \epsilon\;.

For the \mathcal{H}_{VN} hypothesis we have instead:

P(\rm{DE} = \rm{GB}\,|\,\mathcal{H}_{VN}, \epsilon) = P(\rm{DE} \neq \rm{VN}\,|\,\mathcal{H}_{VN}, \epsilon) = P(\rm{GB} \neq \rm{VN}\,|\,\mathcal{H}_{VN}, \epsilon) = 1 - \epsilon\;,

P(\rm{DE} \neq \rm{GB}\,|\,\mathcal{H}_{VN}, \epsilon) = P(\rm{DE} = \rm{VN}\,|\,\mathcal{H}_{VN}, \epsilon) = P(\rm{GB} = \rm{VN}\,|\,\mathcal{H}_{VN}, \epsilon) = \phantom{1 - } \epsilon\;,

and so on.

Before we performed the experiment, we formulated the following prior beliefs. We assigned equal probabilities to all five hypotheses, i.e.

P(\mathcal{H}_{=}) = P(\mathcal{H}_{\neq}) = P(\mathcal{H}_{\rm{GB}}) = P(\mathcal{H}_{\rm{DE}}) = P(\mathcal{H}_{\rm{VN}}) = 1/5\;.

In addition, we were uncertain about the failure rate \epsilon. Therefore, we split each of the five hypotheses into sub-hypotheses with different values for \epsilon. Specifically, we define the cases \{\epsilon \in [0.0,0.1], \epsilon \in [0.1,0.2], \dots, \epsilon \in [0.9,1.0]\} and assign the following prior probabilities to the different epsilons:

P(\epsilon \in [0.0, 0.1]) = 10/81
P(\epsilon \in [0.1, 0.2]) = 13/81
P(\epsilon \in [0.2, 0.3]) = 16/81
P(\epsilon \in [0.3, 0.4]) = 14/81
P(\epsilon \in [0.4, 0.5]) = 12/81
P(\epsilon \in [0.5, 0.6]) = 8/81
P(\epsilon \in [0.6, 0.7]) = 4/81
P(\epsilon \in [0.7, 0.8]) = 2/81
P(\epsilon \in [0.8, 1.0]) = 1/81

For simplicity we assumed that Nellissa and I both have the same failure rate, and we also assumed that \epsilon is independent of the hypothesis, i.e.

P(\mathcal{H}_k, \epsilon) = P(\mathcal{H}_k)\,P(\epsilon)\;.

Thus, we can display the whole probability space in a contour plot, as shown in Figure 3.

When we collect a datum such as \rm{DE} = \rm{VN}, we update our beliefs (probabilities) according to Bayes’ theorem:

P(\mathcal{H}_k, \epsilon\,|\,x) = P(x\,|\,\mathcal{H}_k, \epsilon)\,P(\mathcal{H}_k, \epsilon) / P(x)\;,

where x is the datum and we can expand the probability to observe x into

P(x) = \sum_{k,i} P(x\,|\,\mathcal{H}_k, \epsilon_i)\,P(\mathcal{H}_k)\,P(\epsilon_i)\;.

With a little bit of code, we can now explore how each datum changes our belief map, according to Bayes’ theorem above:
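As an illustration, here is a minimal version of this update in Python, with the failure rate \epsilon held fixed instead of being marginalized over, and self-comparisons (such as \rm{VN} = \rm{VN}) omitted:

```python
# Which of the pairs (DE,GB), (DE,VN), (GB,VN) taste equal under each hypothesis.
HYPOTHESES = {
    "H=":  (True,  True,  True),    # all bars taste the same
    "H!=": (False, False, False),   # all bars taste different
    "HVN": (True,  False, False),   # DE = GB, VN different
    "HGB": (False, True,  False),   # DE = VN, GB different
    "HDE": (False, False, True),    # GB = VN, DE different
}
PAIRS = [("DE", "GB"), ("DE", "VN"), ("GB", "VN")]

def update(prior, datum, eps=0.3):
    """One Bayesian update of P(H) given a datum (bar_a, bar_b, tasted_equal)."""
    a, b, equal = datum
    idx = PAIRS.index(tuple(sorted((a, b))))
    # Likelihood: the taste judgment is correct with probability 1 - eps.
    unnormalized = {
        h: ((1 - eps) if truth[idx] == equal else eps) * prior[h]
        for h, truth in HYPOTHESES.items()
    }
    z = sum(unnormalized.values())  # P(x), the normalization constant
    return {h: p / z for h, p in unnormalized.items()}

prior = {h: 1 / 5 for h in HYPOTHESES}      # equal priors, as above
belief = update(prior, ("DE", "GB", True))  # observe DE = GB
```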

The most probable hypothesis is thus that the German and English Snickers bars are the same, but the Vietnamese Snickers bar is different (\mathcal{H}_{\rm{VN}}), and that our failure rate \epsilon is between 30% and 40%.

The conclusion changes, however, when only my measurements or only Nellissa’s are taken into account. Here is a little video, where I explore the data with a small app I wrote.

Much more insight can be gained from the belief maps shown in the video above, but I leave this to you, dear reader, to think about the results. I, for my part, have had enough Snickers bars for a lifetime.

West then applies the same model to answer similar questions about cities. Do they grow forever? How is the number of gas stations related to the crime rate and resource requirements? Again, the theory that West and his colleagues have developed makes impressively accurate predictions. Less accurate than with the biological questions, but in a sense it even predicts that it should be less accurate for cities!

This is one of those books I could not put down. I recommend you read this book, too, if:

- You are sensitive to the sensations of beauty and awe that some mathematical theories convey, or
- you want to see an aspect of biology that is as “hard” a science as physics is, or
- you think about studying economics (or biology or city planning), or
- you just want to learn something new that’s interesting.

The final application West writes about, economics, I find quite worrying. West’s theory predicts that to sustain itself, the (global) economy needs to grow faster than exponentially, leading to a singularity after a finite amount of time. Fortunately, this part is less well explained – perhaps because it is a new and active area of research; but if any of you is an expert on this, please let me know!

A neural process (NP) is a novel framework for regression and classification tasks that combines the strengths of neural networks (NNs) and Gaussian processes (GPs). In particular, similar to GPs, NPs learn distributions over functions and predict their uncertainty about the predicted function values. But in contrast to GPs, NPs scale linearly with the number of data points (GPs typically scale cubically). A well-known special case of an NP is the generative query network (GQN), which was invented to predict 3D scenes from unobserved viewpoints.

Neural processes should come in handy for several parts of my Rubik’s Cube project. Thus, I aim to build a Python package that lets the user implement NPs and all their variations with a minimal amount of code. As a first step, here I reproduce some of Garnelo et al.’s work on *conditional* neural processes (CNPs), which are the precursors of NPs.

If you just want to know what you can do with CNPs, feel free to skip ahead to the next section, but a little bit of mathematical background can’t hurt.

Consider the following scenario. We want to predict the values \boldsymbol{y}^{(t)} = f(\boldsymbol{x}^{(t)}) of an (unknown) function f at a given set of target coordinates \boldsymbol{x}^{(t)}. We are provided with a set of context points \{\boldsymbol{x}^{(c)}, \boldsymbol{y}^{(c)}\} at which the function values are known, i.e. \boldsymbol{y}^{(c)} = f(\boldsymbol{x}^{(c)}). In addition, we can look at an arbitrarily large set of graphs of other functions that are members of the same class as f, i.e. they have been generated by the same stochastic process. A CNP solves this prediction problem by training on these other functions, thereby parametrizing the stochastic process with an NN.

Specifically, the CNP consists of three components: an **encoder**, an **aggregator**, and a **decoder**. The encoder h is applied to each context point (x_i^{(c)}, y_i^{(c)}) and yields a representation vector \boldsymbol{r}_i of that point. The aggregator is a commutative operation \oplus that takes all the representation vectors \{\boldsymbol{r}_i\} and combines them into a single representation vector \boldsymbol{r} = \boldsymbol{r}_1 \oplus \dots \oplus \boldsymbol{r}_n. In this work, the aggregator simply computes the mean. Finally, the decoder g takes a target coordinate x_i^{(t)} and the representation vector \boldsymbol{r}, and (for regression tasks) predicts the mean and variance for each function value that is to be estimated.
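As a rough illustration, here is a minimal NumPy sketch of this three-part architecture, with random, untrained weights. The layer sizes and the log-σ output parametrization are my own assumptions, not taken from the paper:

```python
import numpy as np

def mlp(x, weights):
    """Apply a small MLP with ReLU hidden activations and a linear output layer."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b

def init_mlp(sizes, rng):
    """Randomly initialize MLP weights for the given layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
r_dim = 8                              # size of the representation vector
h = init_mlp([2, 32, r_dim], rng)      # encoder: (x_c, y_c) -> r_i
g = init_mlp([1 + r_dim, 32, 2], rng)  # decoder: (x_t, r) -> (mu, log sigma)

def cnp_forward(x_c, y_c, x_t):
    # Encoder: one representation vector per context point.
    r_i = mlp(np.stack([x_c, y_c], axis=-1), h)
    # Aggregator: combine by taking the mean (a commutative operation).
    r = r_i.mean(axis=0)
    # Decoder: predict mean and std for each target coordinate.
    inp = np.concatenate([x_t[:, None], np.tile(r, (len(x_t), 1))], axis=-1)
    out = mlp(inp, g)
    mu, sigma = out[:, 0], np.exp(out[:, 1])
    return mu, sigma

mu, sigma = cnp_forward(np.array([-0.5, 0.0, 0.5]),
                        np.array([0.2, 1.0, 0.3]),
                        np.linspace(-1, 1, 100))
```

Because the aggregator is a mean, the prediction is invariant under permutations of the context points – one of the defining properties of the architecture.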

Here, both h and g are multi-layer perceptrons (MLPs) that learn to parametrize the stochastic process by minimizing the negative conditional log-probability of \boldsymbol{y}^{(t)}, given the context points and \boldsymbol{x}^{(t)}.
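In code, this objective is just the negative log-density of a factorized Gaussian with the predicted means and variances. A minimal sketch (my own formulation, summing over the target points):

```python
import numpy as np

def gaussian_nll(y_t, mu, sigma):
    """Negative log-probability of the targets y_t under the predicted
    factorized Gaussian N(mu, sigma^2), summed over all target points."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + 0.5 * ((y_t - mu) / sigma)**2)
```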

Ok, now to applications. I reproduced two of the application examples that Garnelo et al. demonstrate. I plan to add more results and a generalization to NPs at a later stage. Please refer to my GitHub repository for updates.

As a first example, we generate functions from a GP with a squared-exponential kernel and train a CNP to predict these functions from a set of context points. After only 10^5 episodes of training, the CNP already performs quite well:

In the plot above, the gray line is the mean function that the CNP predicts, and the blue band is the predicted variance. For this example, the CNP is provided with the context points indicated by red crosses, as well as 100 target points on the interval [-1,1] that constitute the graph.

Notice that the CNP is less certain in regions far away from the given context points (see left panel around x \approx 0.75). When more points are given, the prediction improves and the uncertainty decreases.

In contrast to a GP, however, the CNP does not reproduce the context points exactly, even though they are given.

Of course, a GP with the same kernel as the one from which the ground-truth function was sampled performs better:

but this is kind of an unfair comparison, since the CNP had to “learn the kernel function” and we did not spend much time on training.

Now comes the really cool thing about CNPs. Since they scale linearly with the number of sample points, and since they can learn to parametrize *any* stochastic process, we can also conceive the set of all possible handwritten digit images as samples from a stochastic process and use a CNP to learn them.

After just 4.8\times10^5 training episodes, the same CNP that I used for the 1-D regression above has learned to predict the shapes of handwritten digits, given a few context pixels:

Garnelo et al.’s results look much nicer than mine, but my representation vector was only half the size of the one they used and they probably also spent more resources on training the CNP.

CNPs and their generalizations hold great promise, as they alleviate the curse of dimensionality of Gaussian processes and have already been shown to be powerful tools in the domain of computer vision.

Garnelo et al. provide enough details about the implementation, so it was straightforward to reproduce their work. I only encountered one minor issue: When training the CNP, I sometimes find that it outputs NaN values. This problem disappears if we enforce a positive lower bound on the output variance.
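A minimal sketch of such a bound (the softplus parametrization and the floor value 0.01 are my own arbitrary choices, not necessarily what Garnelo et al. use):

```python
import numpy as np

def bounded_sigma(raw, min_sigma=0.01):
    """Map the decoder's raw variance output to a std bounded away from zero.
    The softplus keeps sigma positive; the floor min_sigma prevents the
    log-likelihood from diverging (and producing NaNs) when the network
    becomes overconfident."""
    return min_sigma + np.logaddexp(0.0, raw)  # softplus(raw) = log(1 + e^raw)
```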

Implementing CNPs was a good exercise for me to learn more about Tensorflow. Since the results are very rewarding and implementation is not too difficult, I recommend you try this yourself!

There are quite a few things that can be improved upon CNPs, which leads us to NPs and their extensions. But this is material for a later post.

I dread reading my postal mail. Bills here, adverts there, and worst of all: forms to fill out. It feels like such a waste of time! Which is why I sometimes let letters stay in my inbox for several months. Reading “Algorithms to live by” changed that.

Of course, I still have to deal with my mail, but scheduling one particular date every month to do it all at once at least minimizes the impact and duration of this distraction. And knowing that I have put thought into optimizing this helps, somehow.

Christian and Griffiths manage to distill the essence of well-known computer science problems into a form that can help with anyone’s daily decision making.

Even for someone with a background in computer science and mathematics, “Algorithms to live by” is a delight to read. Making the connection between thrashing of a CPU and thrashing of my personal agenda is just fun to think about.

All in all, I recommend you read this book if:

- you have time management problems, or
- you want to know how computer science can solve problems for you – even if you don’t have a computer, or
- you are planning to set up or improve any kind of system (e.g. a book shelf, a dresser, a research group, etc.), or
- you just want to read a good book.

A Gaussian process (GP) is a mathematical tool that, just like neural networks (NNs), can be used to learn a probability distribution from data, i.e. to do regression, classification, and inference.

GPs are a generalization of multivariate Gaussian (a.k.a. Normal) distributions. The (multivariate) Gaussian distribution is fully characterized by a mean vector \pmb{\mu} and a positive definite, symmetric covariance matrix \mathbf{\Sigma}. Its probability density function is

P(\mathbf{y}|\pmb{\mu}, \mathbf{\Sigma}) = \frac{1}{Z}\,\exp\biggl(-\frac{1}{2}\,(\mathbf{y} - \pmb{\mu})^{\rm{T}}\cdot\mathbf{\Sigma}^{-1}\cdot(\mathbf{y} - \pmb{\mu})\biggr)\;, where Z is a normalization factor that only depends on \mathbf{\Sigma}. Notice that the right-hand side contains the *inverse* of \mathbf{\Sigma}.

MacKay explains very clearly how these “boring old Gaussian distributions” can be used for things like inference. First, we consider the two-dimensional case, where the above probability density may be written as

P(y_1, y_2|\mathbf{\Sigma}) = \frac{1}{Z}\,\exp\biggl(-\frac{1}{2}\,\left(\begin{array}{cc} y_1 & y_2 \end{array}\right)\cdot\mathbf{\Sigma}^{-1}\cdot\left(\begin{array}{c} y_1 \\ y_2\end{array}\right)\biggr)\;. Here, we assume \pmb{\mu} = (0, 0) for simplicity.

A sample from that distribution is sometimes displayed as a dot in a contour plot of the probability density function (see left panel of Figure 1, below). But we can also represent it by displaying the two coordinates of the sample point separately, as in the right panel of Figure 1.

On the right panel of Figure 1 I also display the confidence intervals, given by the mean \mu_i = 0, and the standard deviations \sigma_1 and \sigma_2. The latter are the square roots of the diagonal elements of \mathbf{\Sigma}:

\mathbf{\Sigma} = \left(\begin{array}{cc}\sigma_1^2 & \rho\,\sigma_1\,\sigma_2 \\ \rho\,\sigma_1\,\sigma_2 & \sigma_2^2\end{array}\right)\;. Note that in the right panel of Figure 1 there is some correlation between the height of the left dot and the height of the right dot. The strength of that correlation is controlled by the factor \rho in the equation above. If \rho = 0, there would be no correlation, the left and right points’ positions would be unrelated, and the contour plot would show perfect circles.

What if we take a “measurement”, and the y_1 coordinate is known? What do we learn about the distribution of y_2? In other words, we would like to know the probability to get y_2, given y_1, which is

P(y_2 | y_1, \mathbf{\Sigma}) = \frac{P(y_1, y_2|\mathbf{\Sigma})}{P(y_1|\mathbf{\Sigma})}\;. But this is again a Gaussian distribution! Specifically, in this 2-dimensional case we find

P(y_2 | y_1, \mathbf{\Sigma}) \propto \exp\biggl(-\frac{1}{2}\,\frac{(y_2 - \mu_{y_2})^2}{\Sigma_{y_2}}\biggr)\;, where

\mu_{y_2} = \frac{\rho\,\sigma_1\,\sigma_2}{\sigma_1^2}\,y_1 \;\;\;,\;\;\;\Sigma_{y_2} = \sigma_2^2\,(1 - \rho^2)\;. In Figure 2 I recreate Figure 1 with y_1 fixed, using the formulas above.
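These 2-D formulas can be checked numerically against the general Gaussian conditioning rule \mu_{2|1} = \Sigma_{21}\,\Sigma_{11}^{-1}\,y_1 and \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}. A small NumPy sketch (the parameter values are chosen arbitrarily):

```python
import numpy as np

sigma1, sigma2, rho, y1 = 1.0, 0.5, 0.8, 1.3
Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

# Conditional mean and variance from the 2-D formulas above:
mu_y2 = rho * sigma1 * sigma2 / sigma1**2 * y1
var_y2 = sigma2**2 * (1 - rho**2)

# The same quantities from the general Gaussian conditioning rule:
mu_gen = Sigma[1, 0] / Sigma[0, 0] * y1
var_gen = Sigma[1, 1] - Sigma[1, 0]**2 / Sigma[0, 0]
```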

The advantage of the representation depicted in the right panels of Figures 1 and 2 is that it can easily be generalized to higher dimensions. Can you see how this might lead us to a method for function approximation?

The idea is that we can send the number of dimensions to infinity. We then have an infinite collection of correlated Gaussian random variables, each representing the function value at one input. This can be conceived as a distribution over *functions*, and the graphical representation in the right panels of Figures 1 and 2 is related to the possible graphs of these functions.

To make this more concrete, here are some samples from a 40-dimensional Gaussian distribution:

If we have infinitely many dimensions, then the Gaussian distribution has to be described in terms of a mean *function*, and a covariance *function*. We can then sample from the distribution in a finite dimensional subspace because of the property of a multivariate Gaussian that also defines a GP:

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution .

In other words, if we want to sample one function from a GP, we may begin by sampling one value y_1. Having drawn that sample, the GP is now conditioned on that sample, so that the value of y_2 will depend on it through the covariance function and y_1 in a similar way as we have seen in the two-dimensional example.

For the example shown above, I chose the common squared-exponential covariance function

k(x_1, x_2) = \sigma^2\,\exp\bigl(-(x_1 - x_2)^2 / (2\,l^2)\bigr)\;. The covariance matrix \mathbf{\Sigma} can now be generated by evaluating k(x_1, x_2) at all pairs of the x-coordinates that we are interested in.

In the squared-exponential covariance function above, the correlation length l determines how far the effect of fixing y_1 reaches. If l is sent to 0, there is no correlation between any two points of the function, and we will not obtain a smooth graph. The larger l becomes, the smoother the sampled functions tend to be.

The parameter \sigma^2 is the signal variance, which determines how much the function values vary around their mean. It is also common to add a noise variance \delta_{ij}\,\sigma_{\nu}^2 to the kernel k(x_1, x_2), where \sigma_{\nu} describes how much noise is expected to be present in the data.
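Putting the last few paragraphs together, here is a short NumPy sketch that builds \mathbf{\Sigma} from the kernel (with a small noise variance on the diagonal) and draws samples like the 40-dimensional ones shown above. The parameter values are my own arbitrary choices:

```python
import numpy as np

def k(x1, x2, sigma=1.0, l=0.2):
    """Squared-exponential kernel evaluated at a pair of inputs."""
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * l**2))

x = np.linspace(0, 1, 40)
# Covariance matrix: the kernel evaluated at all pairs of x-coordinates,
# plus a small noise variance on the diagonal (which also keeps the
# matrix numerically well-conditioned).
sigma_nu = 1e-3
Sigma = k(x[:, None], x[None, :]) + sigma_nu**2 * np.eye(len(x))

# Draw three sample functions from the 40-dimensional Gaussian N(0, Sigma).
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=3)
```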

MacKay also considers other covariance functions, and discusses why we write the distribution in terms of \mathbf{\Sigma} instead of \mathbf{\Sigma}^{-1}. Have a look at his lecture for those details.

Just like his book , MacKay’s lecture is exceptionally insightful. In particular, he provides several enlightening exercises (sometimes with solutions) in both his book and during his lecture. Therefore, I highly recommend both to anyone who is mathematically inclined.

From this lecture, I have not only gained a rough understanding of what a GP is; it is now also clearer to me what people mean by “non-parametric models”, of which the GP is an example.

This is only an introduction to GPs, of course, and for more details, MacKay recommends the book by Rasmussen and Williams , as well as his own book . I, however, will not spend more time on GPs in the near future, and instead explore the newly invented framework of *neural processes* in my next Article of the Week.

The Rubik’s Cube is a 3-dimensional combination puzzle, invented by Ernő Rubik in 1974. The Cube has six faces that can each be turned by 90° in either direction. Turning a face moves the colored stickers attached to the “cubelets” that the face is made of. The goal of the game is to bring the Cube back into the sorted state (one color per face), starting from a scrambled state.

For a full description of the problem, have a look at McAleer et al.’s article or check out this cool animation of how the mechanics of the Cube works:

This puzzle is very hard to solve with common reinforcement learning (RL) techniques, because there are \approx 4 \times 10^{19} possible configurations of the Rubik’s Cube. In particular, it is very unlikely that you stumble over the solution by taking random actions, once you start from a scrambled state. Taking random actions, however, is exactly what a typical untrained RL agent does.
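For the curious, the \approx 4 \times 10^{19} figure follows from standard counting: corner and edge permutations (halved, because their parities must match) times corner and edge orientations (where the last cubelet’s orientation is determined by the rest):

```python
from math import factorial

# 8 corners and 12 edges can be permuted, but only with matching
# parity (hence the division by 2); 7 corner twists and 11 edge flips
# are free, the last one of each being determined by the others.
configs = factorial(8) * factorial(12) // 2 * 3**7 * 2**11
print(configs)  # 43252003274489856000, i.e. about 4.3e19
```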

McAleer et al. circumvent this reward signal sparsity by starting the exploration of the state space with the *solved* configuration, to learn the value- and policy functions of the agent. They call their learning algorithm Autodidactic Iteration (ADI).

ADI learns the value and policy functions in three steps. First, it generates training inputs by modifying a solved Cube turn by turn – each turn yields a new state that is another training input. Second, for each training input state, it evaluates the value function for every child state that can be reached from the input state with one step. Finally, it sets the value and policy targets based on the maximal estimated value of the children.
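To make the three steps concrete, here is a heavily simplified toy sketch of the target generation. The Cube is replaced by a mock “distance from solved” state, and `value_net` is a stand-in for the NN’s value head – all names and simplifications are mine, not McAleer et al.’s:

```python
import random

ACTIONS = range(12)                # 12 face turns (6 faces x 2 directions)

# Toy stand-ins: a "state" is just its distance from the solved Cube,
# and action 0 always moves one step closer to solved.
def apply_move(state, a):
    return max(state - 1, 0) if a == 0 else state + 1

def is_solved(state):
    return state == 0

def value_net(state):              # stand-in for the NN's value head
    return -float(state)

def adi_targets(n_scrambles, rng):
    """One round of Autodidactic Iteration target generation (sketch)."""
    targets, state = [], 0         # step 1: start from the *solved* state...
    for _ in range(n_scrambles):   # ...and scramble it turn by turn
        state = apply_move(state, rng.choice(list(ACTIONS)))
        # Step 2: evaluate reward + value for every child one move away.
        vals = [(1.0 if is_solved(apply_move(state, a)) else -1.0)
                + value_net(apply_move(state, a)) for a in ACTIONS]
        # Step 3: value target = best child value; policy target = best move.
        best = max(ACTIONS, key=lambda a: vals[a])
        targets.append((state, vals[best], best))
    return targets

targets = adi_targets(5, random.Random(0))
```

In the real algorithm, these (state, value, policy) triples are then used to update the network weights, as described next.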

Once the value and policy targets are known, McAleer et al. use the RMSProp optimization algorithm to update the weights of the NN that constitutes the value and policy functions. To stabilize the algorithm, they define a learning rate that is inversely proportional to the distance of the sample from the solved Cube.

A Monte Carlo tree search (MCTS) can now use the learned value and policy functions to solve the Cube.

McAleer et al. compare their algorithm (called DeepCube) with other algorithms that either use human knowledge (“Kociemba”) or are extremely slow (“Korf”). In addition, they also compare DeepCube to two simplified versions of itself.

They find that DeepCube is able to solve 100% of randomly scrambled Cubes within one hour, achieving a median solve length of 13 moves (same as Korf). Moreover, DeepCube matches the optimal path (which Korf is guaranteed to find) in 74% of the cases, while evaluating much faster than Korf.

Interestingly, DeepCube learns to use certain sequences of moves that are also used by humans, e.g. a\,b\,a^{-1}, where a and b are arbitrary actions, and a^{-1} is the inverse action of a. Moreover, DeepCube seems to follow a particular strategy when solving Cubes.

The Rubik’s Cube is indeed a hard and interesting problem to solve, and McAleer et al.’s DeepCube algorithm manages to solve this problem without human knowledge that is specific to it. This by itself is quite impressive.

I also find that their article is very well written and illustrated. It thus serves as a good introduction to the Rubik’s Cube machine learning problem.

Most interesting to me was the fact that DeepCube often seems to follow a particular strategy.

The critical trick of allowing the algorithm to always start from the solved state, however, does not seem like a revolutionary idea. In real-world problems you cannot just reset the state of the world, and when somebody hands me a scrambled Rubik’s Cube, I cannot just solve it in order to learn how to solve it.

Furthermore, there is *some* human domain knowledge included in DeepCube, namely the representation of the Cube. In my own Rubik’s Cube project I aim for my algorithm to come up with a representation on its own and solve the Cube without direct access to the solution. Follow my Rubik’s Cube blog to track my progress on this project.