The post Why Rasa uses Sparse Layers in Transformers appeared first on Johannes E. M. Mosig.

Feed forward neural network layers are typically fully connected, or *dense*. But do we actually need to connect every input with every output? And if not, which inputs should we connect to which outputs? It turns out that in some of Rasa’s machine learning models we can *randomly* drop as much as 80% of all connections in feed forward layers *throughout training* and see their performance unaffected! Here we explore this in more detail.

In our transformer-based models (DIET and TED), we replace most Dense layers with our own `RandomlyConnectedDense` layers. The latter are identical to Dense layers, except that they connect only some of the inputs with any given output.

You can use the `connection_density` parameter to control how many inputs are connected to each output. If `connection_density` is 1.0, the `RandomlyConnectedDense` layers are identical to Dense layers. As you reduce the `connection_density`, fewer inputs are connected to any given output. During initialization the layer chooses randomly which connections are removed. But it guarantees that, even at `connection_density = 0.0`:

- every output is connected to at least one input, so the output is dense, and
- every input is connected to at least one output, so we don’t ignore any of the inputs.

On the implementation level, we achieve this by setting a fraction of the kernel weights to zero. For example, a `RandomlyConnectedDense` layer with 8 inputs and 8 outputs, no bias, no activation function, and a `connection_density` of 0.25, might implement the following matrix multiplication of the kernel matrix with the input vector:

The 8×8 kernel matrix on the left contains 0.25 * 8² = 16 non-zero random entries. Each output on the right-hand side is therefore a linear combination of some of the input vector entries a, b, c, and so on. Which entries enter this linear combination is random, because we choose the positions of the zeros in each row of the kernel matrix at random during initialization (except on the diagonal, which is always kept and thus guarantees that at least one input is connected to each output and vice versa).
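Rasa implements these layers in TensorFlow; purely as an illustration, here is a minimal NumPy sketch of how such a connectivity mask could be built. The function name `random_connectivity_mask` is ours, not Rasa’s API:

```python
import numpy as np

def random_connectivity_mask(units: int, density: float, seed: int = 0) -> np.ndarray:
    """Binary mask for a square kernel matrix: 1 = connection kept, 0 = dropped.

    The diagonal is always kept, so every input reaches at least one output
    and every output receives at least one input, even at density 0.
    """
    rng = np.random.default_rng(seed)
    n_keep = round(density * units * units)
    mask = np.zeros((units, units))
    # Keep the diagonal unconditionally.
    np.fill_diagonal(mask, 1.0)
    # Randomly keep additional off-diagonal connections up to the target density.
    off_diag = [(i, j) for i in range(units) for j in range(units) if i != j]
    extra = max(n_keep - units, 0)
    for idx in rng.choice(len(off_diag), size=extra, replace=False):
        mask[off_diag[idx]] = 1.0
    return mask

# An 8x8 kernel at 25% density keeps 0.25 * 8**2 = 16 of the 64 weights.
mask = random_connectivity_mask(8, 0.25)
kernel = mask * np.random.default_rng(1).normal(size=(8, 8))
x = np.arange(8.0)
y = kernel @ x  # each output is a linear combination of a few inputs
```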

Note that this is different from weight pruning, which is done *after* training. It is also different from neural networks that learn a sparse topology during training. In our case the topology (“which inputs are connected to which outputs”) is *random* and doesn’t change during training.

Why do we do this, and why is this not detrimental to performance? Read on to find out.

Rasa’s Dual Intent and Entity Transformer (DIET) classifier is a transformer-based model. The transformer in DIET attends over tokens in a user utterance to help with intent classification and entity extraction. The following figure shows an overview of the most important aspects of a layer in DIET’s transformer.

All of the W_{…} layers inside the transformer that are usually Dense layers in other applications are `RandomlyConnectedDense` in Rasa. Specifically, the key, query, value, attention-output, and feed-forward layers in each transformer encoder layer, as well as the embedding layer just before the encoder, are `RandomlyConnectedDense` in Rasa.

By default, each output of a `RandomlyConnectedDense` layer in DIET’s transformer is connected to only 20% of the inputs, i.e. `connection_density = 0.20`, and we might call these layers sparse.

Why do we do this? Let’s compare DIET’s performance on intent classification with different densities and transformer sizes.

The figure above shows how the weighted average F1 test score for intent classification on the Sara dataset depends on the number of trainable weights in the sparse layers. The green graph corresponds to density = 1.0, i.e. all layers are Dense layers, and we change the layer size to alter the number of trainable weights. We see that in this fully dense case, the performance first increases rapidly between 1,000 and 5,000 weights, and then stays nearly constant all the way up to 1,000,000 weights. Note that the horizontal axis is scaled logarithmically.

Now consider the yellow, blue, and red graphs in the figure. Here we use sparse layers with densities of 20%, 5%, and 1%, respectively. We see that the lower the density, the fewer trainable weights are needed to reach the same level of performance. When we use only 1% of the connections (red graph), we need only 5,000 trainable weights to achieve the same level of performance as with 400,000 weights in a dense architecture!

This picture changes a bit for the entity extraction task on the same dataset, as we show in the figure below.

DIET reaches peak performance at about 100,000 trainable weights with 100% dense layers, which is on par with 20% dense layers that have far fewer trainable weights (right end of yellow graph). But if we decrease the density to 1%, performance can drop significantly, even when we make layers very large (right end of the red graph). So we cannot reduce density arbitrarily, but it is still true that sparse layers (at 20% density) perform nearly as well as dense layers with many more trainable weights.

Due to the way TensorFlow is implemented and hardware is built, the reduced number of trainable weights does not, unfortunately, mean that we can save time during training or inference by making layers less dense. It might be that convergence during training is more robust, but we are still investigating this hypothesis. For now, our primary reason to use sparse layers is Occam’s Razor: when two models do the same thing equally well, you should choose the one with fewer parameters.

Do our findings of the previous section also hold for other models? Let’s have a look at our favorite machine learning model on the dialogue management side: the Transformer Embedding Dialogue (TED) policy. After each step in a conversation, TED decides what action Rasa should take next. We think of this as a classification problem: It should choose the right action (the class) for any given dialogue state (the input).

TED contains the same transformer component that we discussed in the DIET Section, and therefore TED also contains many `RandomlyConnectedDense` layers. We evaluate TED in the same way we evaluate DIET, but here we use the Conversational AI Workshop dataset, which contains more interesting stories.

The above figure shows that the dense architecture (green graph) reaches peak performance at about 21,000 trainable weights. But as in DIET, we achieve the same performance with a 20% dense architecture that contains only 5,000 trainable weights in its `RandomlyConnectedDense` layers. If we reduce the density further, the limiting performance declines, so we shouldn’t go much below 20% density in real applications.

Why do models not immediately lose performance as we reduce the connection density? In particular, why is a 20% dense transformer as good as a fully dense transformer? And why does it not matter *which* inputs we connect to any given output?

To understand this, let’s first realize that much of the art of engineering neural networks lies in figuring out what *not* to connect. In principle, an infinite fully connected neural network can learn *anything*. But it is neither possible to implement, nor would it be efficient. So we need to impose some structure on the network (throw away some connections) such that it still learns what it needs to learn. A common technique is to prune weights away after training of a dense network. And in image processing, it is common to throw away a lot of connections by using convolutional layers instead of dense layers. We can guess beforehand that this particular structure works, because neighboring pixels in an image are more related to each other than pixels that are far apart.

In contrast to image processing, the neighboring dimensions (“pixels”) of a natural language embedding have no guaranteed relation to each other (unless we use semantic map embeddings). But since pruning is known to work in language models, it makes sense that we can still drop most of the connections throughout training, much like a convolutional layer does.

So we can drop connections, but why can we choose *randomly* which ones to drop? This is because we always have at least one dense layer before all the sparse architecture, and a single dense layer can learn to *permute* the inputs! So during training, the first dense layer learns to feed just the right information to the remaining sparse layers. And apparently this is enough for the models to learn as if they were fully dense all the way.
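To see why a single preceding dense layer suffices in principle, note that a dense layer’s kernel can be exactly a permutation matrix, so it can route any input dimension to any position before the sparse layers see it. A tiny NumPy illustration (the ordering here is made up):

```python
import numpy as np

perm = np.array([2, 0, 3, 1])   # some hypothetical reordering of the inputs
kernel = np.eye(4)[perm]        # a permutation matrix is a valid dense kernel
x = np.array([10.0, 20.0, 30.0, 40.0])
y = kernel @ x                  # the dense layer delivers the inputs reordered
```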


The post Semantic Map Embeddings – Part II appeared first on Johannes E. M. Mosig.

In Part I we introduced semantic map embeddings and their properties. Now it’s time to see how we create those embeddings in an unsupervised way and how they might improve your NLU pipeline.

At the heart of our training procedure is a batch self-organizing map (BSOM) algorithm. The BSOM takes vectors as training inputs and essentially arranges them in a grid such that similar vectors end up close to each other.

To visualize the BSOM process better, let’s forget about text and natural language processing for the moment, and pretend that we want to arrange colors in a two dimensional grid such that similar colors are neighbouring. Our input vectors are therefore three-dimensional RGB color vectors with entries between 0 and 1.

Let’s say we want to arrange the 10 colors shown above into a 12 ⨉ 12 grid. The BSOM algorithm first creates a 12 ⨉ 12 matrix where the entries are random vectors of the size of the input, i.e. 3. The resulting (12, 12, 3) tensor is often called the “codebook”. Since our inputs are RGB colors, this tensor is also a 12 ⨉ 12 pixel image, specifically, the image shown in the top left corner of the figure below.

The BSOM algorithm looks at each color in the training data and decides which unit in the codebook (the “pixels”) has a color that is most similar to that input. Once it has identified all these best matching pixels, it goes through each of the 144 pixels in the codebook and assigns a weighted average of training colors to that pixel, where the weights depend on how far the best matching pixel of the input is from the current pixel. The closer the best matching pixel of an input is to the current pixel, the more weight is given to it.

This process is repeated many times, and each time the radius of influence decreases. You can see how the codebook changes in each step in the image above.
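The loop described above can be sketched compactly in NumPy. This is an illustrative toy implementation, not our production code: it uses a square grid with a Gaussian neighbourhood and a linearly shrinking radius, and omits details such as the hexagonal topology:

```python
import numpy as np

def train_bsom(colors: np.ndarray, size: int = 12, epochs: int = 30, seed: int = 0):
    """Minimal batch self-organizing map for RGB vectors with entries in [0, 1]."""
    rng = np.random.default_rng(seed)
    codebook = rng.random((size, size, 3))  # random initial codebook ("image")
    grid = np.stack(np.meshgrid(np.arange(size), np.arange(size), indexing="ij"), axis=-1)
    for epoch in range(epochs):
        radius = size / 2 * (1 - epoch / epochs) + 0.5  # radius of influence shrinks
        # 1. Find the best matching unit (pixel) for every training color.
        dists = ((codebook[None] - colors[:, None, None]) ** 2).sum(-1)
        flat = dists.reshape(len(colors), -1).argmin(axis=1)
        bmu = np.stack(np.unravel_index(flat, (size, size)), axis=-1)
        # 2. Each input influences every pixel, weighted by grid distance to its BMU.
        d2 = ((grid[None] - bmu[:, None, None]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * radius**2))
        # 3. Assign weighted averages of the training colors; pixels out of reach
        #    of every input keep their previous color.
        num = (w[..., None] * colors[:, None, None]).sum(0)
        den = w.sum(0)[..., None]
        codebook = np.where(den > 1e-12, num / np.maximum(den, 1e-12), codebook)
    return codebook

codebook = train_bsom(np.random.default_rng(1).random((10, 3)))
```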

Since we had only 10 training colors and the map has 12 ⨉ 12 = 144 pixels, every input is nicely separated from every other input after training is complete. Note that every color in the training data appears in the final image, and it is close to similar colors (e.g. pink and purple are in the top right corner and the greens are in the lower left corner).

The shape of the resulting color clusters depends on how we define the distance between pixels, as this distance enters the weights of the weighted sum. (For illustration purposes we set pixels that are not influenced by any input to black, so the shapes are clearly visible. Normally, pixels would retain the color that was last assigned to them, and thus the colored clusters would be connected by interpolated colors.) In the instance above, we use the Chessboard distance, and as a result we see squares of colors in the final image. Alternatively we can use a hexagonal distance function that also wraps around at the edges (the left edge is connected to the right, and the top edge to the bottom):

This can help with convergence when inputs have more than three dimensions and leads to the hexagonal grid look that you saw throughout Part I of this post.

All of this also works when we have more inputs than pixels on the map.

So, a BSOM has only a few hyperparameters to tune: the map width and height, the number of training epochs, the choice of pixel-distance function, and perhaps some parameters that determine how quickly the radius of influence decreases in each epoch.

To adapt the BSOM algorithm to natural language processing, we use high-dimensional binary vectors as inputs instead of RGB colors. Each input vector now represents one snippet of text from some text corpus, such as Wikipedia or this very blog post. Each binary entry in the vector is associated with one word in a fixed vocabulary, and it is 1 if the word appears in the snippet and otherwise 0. In essence, each input is now a binarized bag-of-words vector.
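In code, such a binarized bag-of-words vector might be produced like this (a toy sketch: we lowercase everything here, whereas the post keeps distinct capitalizations like “To” and “to” as separate vocabulary entries):

```python
import re

def binary_bow(snippet: str, vocab: list) -> list:
    """1 for every vocabulary word that occurs in the snippet, else 0."""
    tokens = set(re.findall(r"\w+", snippet.lower()))
    return [int(word in tokens) for word in vocab]

vocab = ["transformer", "color", "map", "embedding"]
vec = binary_bow("Each input is a binarized bag-of-words vector for the map.", vocab)
# → [0, 0, 1, 0]
```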

If we use a structured corpus such as this blog post as training data, we use paragraphs as snippets and also prepend all the respective title words to each snippet. So the previous paragraph becomes a vector where the entries for “To”, “adapt”, “the”, “BSOM”, … are 1, but also “Adapting”, “to”, “Text” and “Training”, “Semantic”, “Maps”, as well as “Exploring”, “Map”, “Embeddings” from the title of this blog post.

We then use the binary bag-of-words vectors of the snippets to train the BSOM algorithm. We have implemented a fast parallelized version that is specialized to binary input vectors, so we can train on all of Wikipedia in a day or two for a 128 ⨉ 128 map and 79649 words in the vocabulary.

The BSOM algorithm results in a dense codebook tensor. We can think of it as an M ⨉ N matrix where each entry is a real vector whose size is the vocabulary size N_{vocab}. What do these vocabulary-sized vectors represent? At the end of the BSOM training, each component of these vectors is the weighted average of 1s and 0s of inputs that are associated with that pixel. In other words, the *k*th entry in each codebook vector represents the probability that the *k*th word in the vocabulary appears in the context that is represented by that codebook vector!

To create the semantic map embedding for the *k*th word, we consider the *k*th slice of our trained M ⨉ N ⨉ N_{vocab} codebook. This slice is a M ⨉ N matrix where each pixel in that matrix represents a context class, and the value of each pixel represents the probability that the *k*th word appears in that context class. We illustrate this matrix on the left side of the figure below. To create our sparse embedding we set the 98% of all pixels with the lowest values to zero, as shown in the center of the figure. The 98% value is arbitrary, but it should be close to 100%, as we wish to generate sparse matrices. Finally, we set all the remaining non-zero values to 1. The resulting sparse binary matrix is a simplified form of our semantic map embedding for the *k*th word.
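Sketching that recipe in NumPy (with a random codebook standing in for a trained one; `semantic_map` is our name for the helper, not part of any published API):

```python
import numpy as np

def semantic_map(codebook: np.ndarray, k: int, keep: float = 0.02) -> np.ndarray:
    """Sparse binary embedding of word k: keep only the top `keep` fraction of pixels."""
    word_slice = codebook[:, :, k]                # M x N word-probability matrix
    cutoff = np.quantile(word_slice, 1.0 - keep)  # threshold below which pixels are zeroed
    return (word_slice > cutoff).astype(np.int8)  # remaining pixels become 1

rng = np.random.default_rng(0)
codebook = rng.random((64, 64, 5))                # stand-in for a trained codebook
emb = semantic_map(codebook, k=0)
```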

All pixels that are 1 signify that this word appears particularly often in the context classes associated with those pixels. Alternatively, we can also normalize the codebook and divide the pixel values in the *k*th codebook slice by the probability of any word appearing in a context class. As a result, we get 1s for context classes where the word appears particularly often *compared to other words*. This, in fact, is our default.

We can repeat that process for each word in our vocabulary to generate embeddings for all words in the vocabulary.

Semantic map embeddings have some very interesting properties, as we’ve seen in Part I of this post. But do they help with NLU tasks such as intent or entity recognition? Our preliminary tests suggest that they can help with entity recognition, but often not with intent classification.

For our tests we feed our semantic map embedding features into our DIET classifier and run intent and entity classification experiments on the ATIS, Hermit (KFold-1), SNIPS, and Sara datasets. The figure below compares the weighted average F1 scores for both tasks on all four datasets and compares them with the scores reached using BERT or Count-Vectorizer features:

We observe that our Wikipedia-trained (size 128 ⨉ 128, no fine-tuning) semantic map embedding reaches entity F1 scores about half way between those of the Count-Vectorizer and BERT. On some datasets, the semantic map that is trained on the NLU dataset itself (no pre-training) can also give a performance boost. For intents, however, semantic map embeddings seem to confuse DIET and are outperformed even by the Count-Vectorizer, except on the Hermit dataset.

Having said this, when we feed the semantic map features into DIET, *we do not make use of the fact that similar context classes are close to each other on the map* and that might be the very key to make this embedding useful. We’ve got plenty of ideas of where to go from here, but we thought it is a good point to share what we’re doing with this project. Here are some of the ideas:

- Instead of DIET, use an architecture that can make use of the semantic map embedding properties to do intent and entity classification. For example, a Hierarchical Temporal Memory (HTM) algorithm or the architectures explored here.
- When training on Rasa NLU examples, let the BSOM algorithm only train on prepended intent labels during the first epoch, and afterwards on the intent labels and the text.
- Use a max-pooling layer to reduce the representation’s size.
- Explore text filtering / bulk-text-labelling based on the binary features.
- Create semantic maps for other languages.

If you are curious and want to try out the Semantic Map embedding for yourself, we’ve created a `SemanticMapFeaturizer` component for the Rasa NLU pipeline! You can find the featurizer on the NLU examples repo. It can load any pre-trained map that you find here. Let us know how this works for you (e.g. you can post in the forum and tag `@j.mosig`)!


The post Semantic Map Embeddings – Part I appeared first on Johannes E. M. Mosig.

How do you convey the “meaning” of a word to a computer? Nowadays, the default answer to this question is “use a word embedding”. A typical word embedding, such as GloVe or Word2Vec, represents a given word as a real vector of a few hundred dimensions. But vectors are not the only form of representation. Here we explore semantic map embeddings as an alternative that has some interesting properties. Semantic map embeddings are easy to visualize, allow you to semantically compare single words with entire documents, and they are sparse and therefore might yield some performance boost.

Semantic map embeddings are inspired by Francisco Webber’s fascinating work on semantic folding. Our approach is a bit different, but as with Webber’s embeddings, our semantic map embeddings are sparse binary matrices with some interesting properties. In this post, we’ll explore those interesting properties. Then, in Part II of this series, we’ll see how they are made.

A semantic map embedding of a word is an M ⨉ N sparse binary matrix. We can think of it as a black-and-white image. Each pixel in that image corresponds to a class of contexts in which the word could appear. If the pixel value is 1 (“active”), then the word is common in its associated contexts, and if it is 0 (“inactive”), it is not. Importantly, neighboring pixels correspond to *similar* context classes! That is, the context of the pixel at position (3,3) is similar to the context of the pixel at (3,4).

Let’s look at a concrete example and choose M = N = 64. A semantic map embedding of the word “family” might look like this:

You may wonder why our image is made up of hexagons instead of squares. This is because we trained our semantic map in a way that each pixel has 6 neighbours instead of 4 or 8. We will describe in Part II how that training works.

Notice that the semantic map embedding of the word “family” is not a random distribution of active pixels. Instead, the active pixels cluster together in certain places. Each cluster of active pixels is also a cluster of similar context classes in which the word “family” appears. Larger clusters contain more context classes and might correspond to a whole topic.

We can see this if we compare the semantic map embedding of “family” with that of the word “children”:

The middle image shows the active pixels that both embeddings share. We can do this with a few other words and figure out what the different clusters stand for:

So each cell in the semantic map corresponds to a class of contexts, and neighbouring pixels stand for contexts that are similar. When a word’s particular meaning ranges across multiple very similar contexts, then you find clusters of active pixels on the map.

In the previous section we have already seen that we can count the number of active pixels that two word embeddings share to compute a similarity score. How good is that score?
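The overlap score can be computed directly from the two binary matrices. A sketch (the post doesn’t spell out the normalization, so we normalize by the smaller number of active pixels here, which makes a word overlap 100% with itself):

```python
import numpy as np

def overlap_score(a: np.ndarray, b: np.ndarray) -> float:
    """Share of active pixels that two binary semantic maps have in common."""
    shared = np.logical_and(a, b).sum()
    return shared / min(a.sum(), b.sum())

family = np.zeros((8, 8), dtype=np.int8)
children = np.zeros((8, 8), dtype=np.int8)
family[0, :4] = 1                        # toy embeddings sharing two pixels
children[0, 2:6] = 1
score = overlap_score(family, children)  # 2 shared pixels / min(4, 4) = 0.5
```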

A simple thing we can try first is to pick a word and then find all the words in the vocabulary that have the most overlap with it. For example:

The size of the words in the word clouds corresponds to their overlap score with the target word in the middle (which is biggest because it has 100% overlap with itself). The higher the score, the larger the word. Looks pretty good!

We can attempt a more rigorous analysis and use the BLESS dataset, which is a list of word pairs and relation labels. For example, one element of the list is (“alligator”, “green”, attribute), i.e. “green” is a typical attribute of “alligator”. The different relations are:

| Relation | Description | Example with “alligator” |
| --- | --- | --- |
| Typical attribute | | green |
| Typical related event | | swim |
| Meronym | “part of” relation | leg |
| Hypernym | “is a” relation | creature |
| Co-hyponym | Both terms have a common generalization | frog |
| Unrelated | | electronic |

We compute the overlap between all the word pairs in the BLESS dataset. Ideally, all related words get a high score (the maximum is 1) and all unrelated words (in the “Unrelated” category) get a low score (close to 0). Note that a high score is only possible if the two words almost always appear together in any text. Thus, scores will mostly be below 0.5.

Here is how our 128×128 English Wikipedia embedding performs on the BLESS dataset:

This shows that our embedding captures all kinds of relations between words – especially co-hyponym relations – and less often associates supposedly unrelated words. This works because the matrices are sparse: the probability that you find a random overlap between two sparse binary matrices drops rapidly as you make the matrices bigger.

Still, many supposedly unrelated words do show high overlap scores. This seems less than ideal, except that many of the word pairs with the highest scores in the *Unrelated* category are arguably quite related:

| Term 1 | Term 2 | Overlap Score (%) | Note |
| --- | --- | --- | --- |
| saxophone | lead | 56.1 | |
| guitar | arrangement | 49.4 | |
| trumpet | additional | 48.6 | |
| saxophone | vibes | 46.0 | |
| violin | op | 45.4 | “op” often stands for “opus” |
| robin | curling | 33.8 | “Robin Welsh” is a famous curling player |
| guitar | adam | 32.9 | “Adam Jones” is a famous guitar player |
| dress | uniform | 25.3 | |
| phone | server | 25.0 | |
| clarinet | mark | 22.6 | |
| cabbage | barbecue | 22.3 | |
| cello | taylor | 22.0 | |
| butterfly | found | 21.6 | |
| saw | first | 21.0 | |
| table | qualification | 19.2 | table tennis qualification |
| oven | liquid | 18.9 | see “polymer clay” |
| vulture | red | 18.6 | |
| ambulance | guard | 18.3 | |
| chair | commission | 17.4 | |
| musket | calibre | 17.1 | |

We have introduced semantic map embeddings as sparse binary matrices that we assign to each word in a fixed vocabulary. We can interpret the pixels in these matrices and compare words to each other via the overlap score. But can we go beyond individual words and also embed sentences or entire documents?

Indeed we can! Semantic map embeddings offer a natural operation to “combine and compress” word embeddings such that you can create sparse binary matrices for any length of text. We call this the (symmetric) merge operation.

The merge operation does not take the order of the words into account. In that sense it is similar to taking the mean of the word vectors generated by Word2Vec embeddings. But in contrast to those traditional embeddings, our merge operation preserves the most “relevant” meanings of each word.

Let’s go through our derivation of the merge operation step by step. Our goal is to define a function MERGE that takes a list of sparse binary matrices as input and gives a sparse binary matrix as output. If the inputs are semantic map embeddings, then the output should somehow represent the shared meaning of all the inputs.

A naive way to combine the semantic map embeddings of words could be a pixel-wise OR operation. This does create a new binary matrix from any number of sparse binary matrices.

However, the combined matrix would be slightly more dense than the individual semantic map embedding matrices. If we were to combine a whole text corpus, most of the pixels would be active, and thus the OR-combined matrix would barely contain any information.

The next idea is then to *add* all the sparse binary matrices of the input words. As a result, we would get an integer matrix, and we could throw away all but the top, say, 2% of all active pixels, as those represent the context classes that are most shared by all the words in the text!

Now only one problem remains: if you want to merge just two words, then the top 2% are hard to determine, because all pixels will have values of either 0, 1, or 2. To remedy this final issue, we give extra weight to those pixels that have neighbours (more neighbours, more weight). We call the matrices with the weighted pixel values boosted embeddings. By doing this boosting, we not only get a more fine-grained weighting between the combined pixels, but we also emphasize those pixels of each word that carry the most likely meanings!
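The boosting and thresholding steps can be sketched as follows. This is an illustrative toy version: it uses a square 4-neighbourhood, whereas our actual maps are hexagonal, and the boost weight of one per active neighbour is made up:

```python
import numpy as np

def boost(emb: np.ndarray) -> np.ndarray:
    """Weight each active pixel by 1 + its number of active 4-neighbours."""
    p = np.pad(emb, 1)
    neighbours = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    return emb * (1 + neighbours)

def merge(embeddings: list, keep: float = 0.02) -> np.ndarray:
    """Sum the boosted embeddings, then keep only the top `keep` fraction of pixels."""
    total = sum(boost(e.astype(np.int64)) for e in embeddings)
    cutoff = np.quantile(total, 1.0 - keep)
    return (total > cutoff).astype(np.int8)

rng = np.random.default_rng(0)
words = [(rng.random((32, 32)) < 0.02).astype(np.int8) for _ in range(20)]
merged = merge(words)  # a sparse binary map for the "text" of 20 words
```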

In summary, our merge operation takes the sum of all the boosted semantic map embeddings of the words of a text, then sets the lowest 98% (this number is arbitrary, but should be high) of all pixels to zero and the remaining 2% to 1. The resulting sparse binary matrix is a compressed representation of the meaning of the input text. For example, here is a semantic map embedding of *Alice in Wonderland* (the whole book):

We can now directly compare this semantic map embedding of the book with that of any word in the vocabulary. Here is a cloud with the words that most overlap with Alice in Wonderland:

This doesn’t look too bad, given that our embedding was trained on Wikipedia, which is arguably quite a different text corpus than *Alice in Wonderland*. What if we merge the text contents of the Wikipedia article about airplanes?

This word cloud also makes sense, though it seems to focus more on what airplanes are made of, or on technical things like “supersonic” or “compressed”. It doesn’t contain “flying”, but only “flew”. We think that this is an artifact of what Wikipedia articles are about. It’s an encyclopedia, not a novel. Also note that the word “components” does not appear anywhere in the article about airplanes, yet it is strongly associated with them.

That’s it for the interesting properties of this embedding. You can try our pre-trained semantic maps yourself, using our SemanticMapFeaturizer on the rasa-nlu-examples repo.

But how do we create these embeddings? In Part II we explore our unsupervised training procedure and compare our embeddings to BERT and Count-Vector featurizers.


The post Corona Tests and the Bayes Factor appeared first on Johannes E. M. Mosig.

For example, when you know that in your city 3,000 people are infected and 7,000 people are not infected, then you may estimate your prior odds as 3 : 7. If you take a Corona test, and it comes out positive, then your odds change by a certain factor, known as the Bayes factor. For a test with sensitivity of 80% and specificity of 98% (this is the information you typically find on the web or on the test package), the Bayes factor of a positive test result is 40, so your odds would change from 3 : 7 to 120 : 7.
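Concretely, the update is just “posterior odds = prior odds × Bayes factor”. A small sketch (the function name is ours):

```python
def posterior_odds(prior_for: float, prior_against: float,
                   sensitivity: float, specificity: float,
                   positive: bool = True) -> tuple:
    """Update the odds of being infected given a test result."""
    if positive:
        bayes_factor = sensitivity / (1 - specificity)  # P(+ | infected) / P(+ | healthy)
    else:
        bayes_factor = (1 - sensitivity) / specificity  # P(- | infected) / P(- | healthy)
    return prior_for * bayes_factor, prior_against

# Positive test, sensitivity 80%, specificity 98%: Bayes factor 40,
# so prior odds of 3 : 7 become posterior odds of 120 : 7.
odds = posterior_odds(3, 7, sensitivity=0.80, specificity=0.98)
```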

I created this little web app that lets you easily compute how your odds change:

Remember that your prior odds are higher if you had recently been exposed to an infected person, or if you show symptoms. Also, remember that the test’s Bayes factor does not take into account that you (or your doctor) might make a mistake in taking the test.


The post Seven Scientists appeared first on Johannes E. M. Mosig.

The problem comes in two parts. In the first part, we use the maximum likelihood heuristic to solve the following problem:

N datapoints \{x_n\} are drawn from N distributions, all of which are Gaussian with mean \mu, but with different unknown standard deviations \sigma_n. What are the maximum likelihood parameters \mu, \{\sigma_n\}, given the data?

For example, seven scientists (A, B, C, D, E, F, G) with widely-differing experimental skills measure \mu. You expect some of them to do accurate work (i.e. to have small \sigma_n), and some of them to turn in wildly inaccurate answers (i.e. to have enormous \sigma_n). […] What is \mu and how reliable is each scientist?

MacKay, exercise 22.15.

In the book, the seven results by the seven scientists A to G are: -27.02, 3.57, 8.191, 9.898, 9.603, 9.945, and 10.056, respectively, as shown in this Figure:

To find the maximum likelihood values, we find the fixed point of the following rule:

\sigma_i \leftarrow \left|x_i - \frac{1}{N}\,\sum_{k = 1}^N \frac{x_k}{\sigma_k^2}\right|\;,

resulting in \mu = 0.1.
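As a quick numerical sanity check, we can iterate this update rule directly (a sketch; we start from unit noise levels, an arbitrary choice):

```python
# The seven measurements by scientists A through G.
data = [-27.02, 3.57, 8.191, 9.898, 9.603, 9.945, 10.056]
N = len(data)

sigmas = [1.0] * N
for _ in range(200):
    # The bracketed term plays the role of mu in the update rule.
    mu = sum(x / s**2 for x, s in zip(data, sigmas)) / N
    sigmas = [abs(x - mu) for x in data]
# mu settles near the reported value of roughly 0.1
```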

The dashed red line in Figure 2 indicates the maximum likelihood estimate of \mu, and error bars indicate the estimated standard deviation of each scientist.

The second part of the problem assesses the same situation with the full power of Bayesian statistics:

Give a Bayesian solution to exercise 22.15, where seven scientists of

varying capabilities have measured \mu with personal noise levels \sigma_n, and we are interested in inferring \mu. Let the prior on each \sigma_n be a broad prior, for example a gamma distribution with parameters (s, c) = (10, 0.1). Find the posterior distribution of \mu. Plot it, and explore its properties for a variety of data sets such as the one given, and the data set \{x_n\} = \{13.01, 7.39\}.

MacKay, exercise 24.3.

So, we assume that the \sigma_n come from a gamma distribution, i.e.

P(\sigma) = \frac{1}{\Gamma(c)\,s}\,\biggl(\frac{\sigma}{s}\biggr)^{c - 1}\,\exp\biggl(-\frac{\sigma}{s}\biggr)\;,

where \Gamma is the gamma function. Furthermore, if we knew \mu and \sigma, then the probability to measure x would be given by a Gaussian distribution

P(x|\mu, \sigma) = \frac{1}{\sqrt{2\,\pi}\,\sigma}\,\exp\biggl(-\frac{(x - \mu)^2}{2\,\sigma^2}\biggr)\;,

and we want to know P(\mu | \{x_n\}).

As a first step, we seek the *n*th scientist’s noise level P(\sigma_n | \mu, x_n), which is given by Bayes’ theorem as

P(\sigma_n | \mu, x_n) = \frac{P(x_n | \mu, \sigma_n)\,P(\sigma_n)}{P(x_n | \mu)}\;,

where we’ve used that P(\sigma_n | \mu) = P(\sigma_n).

We already know the numerator of the right hand side. So we only need to find the normalizing constant P(x_n | \mu) by marginalizing P(x_n|\mu, \sigma_n) over \sigma_n, i.e. we integrate P(x|\mu, \sigma)\,P(\sigma) over \sigma from 0 to infinity:

P(x_n | \mu) = \int_0^\infty\;P(x_n | \mu, \sigma_n)\,P(\sigma_n)\,\rm{d}\sigma_n\;.

We solve this integral analytically and find

\begin{aligned}P(x_n | \mu) =& \frac{1}{2\,\sqrt{\pi}\,s^2\,\Gamma(c)}\Biggl(\sqrt{2}\,\Gamma(c - 2)\,{}_0F_2\biggl(;\frac{3-c}{2}, \frac{4 - c}{2}; -\frac{(x_n - \mu)^2}{8\,s^2}\biggr) \\ &+ 2^{(1-c)/2}\,s^{2-c}\,|x_n - \mu|^{c-2}\,\Gamma\biggl(\frac{2 - c}{2}\biggr)\,{}_0F_2\biggl(;\frac{1}{2}, \frac{c}{2}; -\frac{(x_n - \mu)^2}{8\,s^2}\biggr) \\ &- 2^{-c/2}\,s^{1-c}\,|x_n - \mu|^{c-1}\,\Gamma\biggl(\frac{1-c}{2}\biggr)\,{}_0F_2\biggl(;\frac{3}{2}, \frac{c + 1}{2}; -\frac{(x_n - \mu)^2}{8\,s^2}\biggr)\Biggr)\;,\end{aligned}

where {}_0F_2 is the generalized hypergeometric function.

We now have everything we need to plot P(\sigma_n | x_n, \mu). If we assumed that \mu = 10, we’d find that we are quite uncertain about the noise level of scientist A, though it is probably large, while we are very certain about the noise levels of scientists D, F, and G, which should be small, as you can see in Figure 3.

If we instead assumed the maximum likelihood estimate that we found earlier, \mu = 0.1, then we’d be relatively sure that B’s noise level is lower than all the other noise levels (see Figure 4), which is consistent with Figure 2.

In the previous section we found an expression for P(x_n | \mu). Since all measurements are independent of one another, the probability to find all the x_n is

P(\{x_n\} | \mu) = \prod_{n = 1}^N\,P(x_n | \mu)\;,

where N = 7 is the number of scientists.

Now we use Bayes’ theorem again to find

P(\mu | \{x_n\}) = \frac{P(\{x_n\} | \mu)\;P(\mu)}{P(\{x_n\})} \propto P(\{x_n\} | \mu)\;.

We don’t know anything about the prior P(\mu), but we could assume that it’s a uniform distribution between some large negative and positive values. In this case, P(\mu | \{x_n\}) is proportional to P(\{x_n\} | \mu), as indicated above.

Inserting the given values for the parameters c and s, we can now plot P(\mu | \{x_n\}) in Figure 5 (vertical scales are not given, since we have not computed the normalization).

The graph in Figure 5 diverges to +\infty at every \mu = x_n. The probability density around these singularities accumulates when multiple data points are close to one another, resulting in an increased probability to find \mu around \mu = 10.

MacKay also asks what happens if we had just two scientists, measuring \{x_n\} = \{13.01,7.39\}. As we would expect, the probability distribution over \mu is exactly symmetric:

This symmetry reflects the fact that with just two measurements, we cannot decide which scientist is the more accurate one, and therefore, we cannot know if 13.01 or 7.39 is closer to the true \mu.
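We can check this symmetry numerically by evaluating the unnormalized posterior at points mirrored about the midpoint (13.01 + 7.39)/2 = 10.2. A small sketch assuming SciPy and a flat prior on \mu (the helper names are illustrative, not from the original post):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

S, C = 10.0, 0.1          # gamma prior parameters (s, c)
x_data = [13.01, 7.39]    # the two-scientist data set

def marginal(x, mu):
    # P(x | mu): Gaussian likelihood marginalized over the gamma noise prior
    integrand = lambda sig: (
        np.exp(-((x - mu) ** 2) / (2.0 * sig**2)) / (np.sqrt(2.0 * np.pi) * sig)
        * (sig / S) ** (C - 1) * np.exp(-sig / S) / (Gamma(C) * S)
    )
    return quad(integrand, 0.0, np.inf)[0]

def posterior(mu):
    # Unnormalized P(mu | {x_n}) under a flat prior on mu
    return np.prod([marginal(x, mu) for x in x_data])

mid = sum(x_data) / 2.0  # midpoint 10.2
# The posterior is symmetric about the midpoint of the two measurements
print(posterior(mid - 2.0), posterior(mid + 2.0))
```

Swapping \mu \to 2\,\mathrm{mid} - \mu simply exchanges the two factors |x_1 - \mu| and |x_2 - \mu|, so the product is unchanged, which is exactly the symmetry visible in the plot.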

1. MacKay, D. J. C. *Information Theory, Inference, and Learning Algorithms*. (Cambridge University Press, 2003).

The post Seven Scientists appeared first on Johannes E. M. Mosig.

The post Working at Rasa appeared first on Johannes E. M. Mosig.

https://blog.rasa.com/the-humans-behind-the-bots-johannes-mosig/

The post B. de Mesquita and Smith: The Dictator’s Handbook appeared first on Johannes E. M. Mosig.

Central to the authors’ thesis is the realization that leaders (benevolent or not) must first and foremost ensure that they get into (and stay in) power, because without power they cannot affect anything. They also introduce a more refined view of political systems than “democracy” vs. “dictatorship”, discuss how seemingly poor decisions of various leaders throughout history make perfect sense when viewed through the lens of their political theory, and explain under what circumstances authoritarian regimes can transition into democracies and vice versa. If you ever want to understand politics, this is a must-read (or at least watch the summary video above)!

The post Douglas: The Reader’s Brain appeared first on Johannes E. M. Mosig.

*The Reader’s Brain* has been invaluable to me, as it greatly improved the quality of my PhD thesis. It is doubly interesting, because it is not only useful but also gives insights into how our minds work.

I strongly recommend you read this book if:

- You write things (theses, blog posts, emails, etc.), or
- you are interested in neuroscience and natural language understanding, and
- you prefer science-based advice over authority-based advice on your writing style.

The post How AI may impact your life and how you can help making this impact a positive one appeared first on Johannes E. M. Mosig.

Since the 1990s, advances in technology have drastically changed the way we go about our lives. Before then, there was no (notable) internet, and you had to press the “Turbo” button on your PC to get to 100 MHz calculation speed. I remember having to go to the payphone on the street if I wanted to call my friends.

Now most people carry around a pocket device that would have been considered a super-computer not too long ago – and you can call anyone across the world or download information on any topic in an instant.

This fundamental change in how we communicate and interact with the world enables companies such as internet service providers (ISPs), search engines (e.g. Google or Baidu), and social networks (e.g. Facebook) to collect huge amounts of data (hence “**big data**“).

Here, “data” refers to everything from emails and photos, to the way you move your mouse on the screen.

These data, together with improved hardware, are what enabled the ongoing machine learning revolution. **Machine learning** (ML) refers to a set of methods that enable computers to analyze data – even if the data are “fuzzy”, i.e. images, text, or audio recordings.

Such fuzzy data were originally thought to be impossible for computers to understand. A computer was a calculation machine that was good at adding and multiplying numbers – not at recognizing a human face. But it turned out that the problem of recognizing a human face in an image can be formulated as a mathematical problem. Nowadays, ML techniques enable computers to solve it.

Many of these techniques (such as artificial **neural networks**) have been known for some time, but since they require strong computers and large datasets to train on, it was not feasible to use them until about 2006.

Presently, I suspect that the uses of ML with the greatest social impact are (i) captivating your attention, (ii) automating some tasks that would be simple but slow for humans to do (like cataloging images, improving search results, suggesting restaurants, news feeds and movies that you might be interested in), and (iii) financial and insurance-related applications.

To do these things, ML algorithms essentially learn to model and extract useful information from the data that they are given. The “quality” of these data is essential, however. For example, state-of-the-art image classifiers (a particular kind of ML algorithm) can distinguish a thousand different objects from images when trained on the ImageNet dataset. But developing a new, specific image classifier from scratch still takes months to do, because the existing classifiers do not generalize well and few training data sets are as well organized as ImageNet.

The best results are obtained when the training data can be generated. For example, in board games like Chess and Go, the state-of-the-art ML algorithm AlphaZero can beat the previous world champions, even though it learned these games just by playing against itself and learning from the moves it took (so the data were generated by the algorithm itself). But today’s ML techniques also have many problems.

For example, if you train an ML algorithm to recognize faces, but your data set only contains white people, then it will not recognize black people, or vice versa. The good news is that many people are working hard on correcting such biases, and progress is being made.

It is also very difficult to specify what we actually want, because ML systems lack common sense. In a set of ML techniques called **reinforcement learning** (RL), for example, you train an “agent” by rewarding it for good outcomes. So you may reward an RL agent for collecting points along the track in a boat racing video game, and of course you expect it to finish the game. Instead, it intentionally causes an accident that puts it into an infinite loop where it can collect rewards forever – without finishing the race.

For Facebook and similar platforms, ML algorithms create a model of the social group that you belong to. If you provide enough data (“likes”, comments, or just the information whom you are communicating with), these algorithms will be able to predict your personal preferences with high accuracy.

If the ML algorithm learns how likely you are going to react to a certain **meme** (statement or image), it also knows how to capture your attention. Hence, it can fill your news feed entirely with things that will make you stay there for longer. It is usually programmed to do so, because a longer engagement time makes you more likely to be influenced by the advertisements that appear in between the news.

Unfortunately, we are more inclined to engage with memes that make us angry than with memes that trigger other emotions or are emotionally neutral. More generally, the ML algorithms learn how to exploit our emotional weak points, even though they were not programmed to do so – they just *find* this strategy as a solution to their goal of maximizing human engagement with a website.

As the video by CGP Grey illustrates, this can be hugely detrimental for human relationships. However, for things like movie or restaurant suggestions, it is great if my phone can suggest things to me that I will actually enjoy.

Other common applications of ML are, for example, face recognition and automatic image enhancement of your camera, email spam filtering, automatic recognition of facial features used by Snapchat, voice and handwriting recognition, and forecasts of the stock market.

A recent trend is the automatic generation of text, images, and even videos by ML algorithms. Once more, this has some good applications, but it also comes with issues that may soon become disruptive to our society. For example, it is already possible to automatically superimpose a person’s face onto a video recording, so it looks like that person did things they never did. Imagine it became cheap to generate a million photo-realistic videos of public figures doing things they have never actually done. This is already possible with voice generation, but with video we are not quite there yet, though I think it should be possible within the next decade.

On the positive side, ML algorithms can help reduce power consumption for server farms and other facilities, as well as help doctors with medical diagnoses and treatment selection.

As a side effect, yet of no less importance, we learn a great deal more about ourselves and how our brains work, as these questions overlap with AI research.

So machine learning is everywhere, and there are good, bad, and risky applications of one and the same technology.

The most pressing near term issue is probably that of **lethal autonomous weapons** (LAWs).

All parts of this technology already exist today, but as far as is publicly known, it has not been militarized yet. If we manage to make LAWs as internationally stigmatized as chemical or biological weapons, we may have a chance to prevent fatal scenarios such as the one presented in the video above.

You can find more information about what you can do personally to steer our civilization away from LAWs on this website: autonomousweapons.org, which is supported by thousands of AI researchers (including myself), and backed by the Future of Life Institute and other organizations.

Since ML techniques can only become better over time, we should also expect increasingly sophisticated scams and targeted attacks to make money or gain influence. Platforms like YouTube, Facebook, and Twitter are already in an ongoing arms race with bad actors that try to do just that.

It is a very hard problem, and ML will play a role on both sides of the battle.

The most important component in this battle is you. And thus, you are in the best place to do something about it. For example: (i) always think twice before you share a meme – even if you like it, it may not be true; (ii) socialize with people in the real world and practice critical thinking in day-to-day situations; (iii) always remember that when you write a comment, there is a person on the other end reading it.

A more positive development is the current progress in **autonomous driving**. This technology – even if only applied to trucks and trains – has the potential to greatly reduce the number of traffic jams and accidents. This could save millions of lives, since more than a million people die in car accidents each year.

If you are a young truck driver, however, you may want to educate yourself about alternative jobs. The displacement and disappearance of jobs in general may be a central issue in the mid-term future.

It is unclear how long it will take until self-driving cars are available to the public. As I wrote earlier, ML algorithms have no common sense. This makes the autonomous driving problem so hard. But it is certainly solvable, and I think it would be a net positive development.

Through and through beneficial new technologies that may soon come out of machine learning are, for example, (i) improved accuracy in medical diagnoses, (ii) automatic optimization of energy distribution in computer clusters and wind farms, and (iii) more helpful digital personal assistants as well as speech or text based user interfaces.

We can also expect advances in medicine through indirect contributions of ML. For example, recently DeepMind created a novel algorithm that can solve the protein folding problem with high accuracy, which is essential for medical research.

You may have noticed that so far I have used the term machine learning (ML) exclusively, and did not write anything about **artificial intelligence** (AI). This is because these terms are not clearly defined, and AI is especially hard to pin down. Specifically, it seems like whenever a new AI algorithm solves a previously unsolvable problem (like face recognition), it is not considered AI anymore.

Today’s ML algorithms are good at specific tasks, and it is debatable if we can have them become generally intelligent without a major paradigm shift. Nevertheless, for **artificial general intelligence** (AGI) the question is not “if”, but “when” it will be created (barring prior destruction of our global society). Predictions about AGI timelines vary a lot, but mostly focus on the present century.

When talking about AGI, an algorithm’s competence is typically compared to that of a human. If we manage to write a computer program that is able to solve every cognitive task an average human can solve (from arithmetic to composing music and consoling a heartbroken friend), then we’d say that we have created AGI.

But there is no good reason to assume that it stops at the human level. There may be levels of intelligence that are vastly superior to ours. This is known as **superintelligence**. You could imagine a single superintelligent AGI as a whole civilization of Einsteins and Mozarts who all work together in perfect harmony and think 100 times faster than their human equivalents since they are not bound by the limitations of actual biological brains.

Before we program the first AGI, however, there is (at least) one critical issue to solve: the **alignment problem**.

It roughly goes like this: How do I ensure that a being that is vastly superior to me keeps doing what I would approve of? Even if it knows that what I want is stupid and I am incapable of seeing it?

Personally, I concur with Eliezer Yudkowsky’s perspective that we can be certain that if we build a superintelligence without solving the alignment problem first, then it will change the world in a way we do not approve of, and (by definition) it will be damned good at doing this.

The alignment problem is extremely hard, and **we will likely need at least as long to solve it as we need to develop AGI**. Therefore, we have to start working on it right now. Research institutes like MIRI and the Future of Humanity Institute are committed to solving this problem, but considering its scope, there should be many more people working on it.

If you are not a mathematician or AI researcher, there is still something that you should think about: What society do we actually want our descendants to live in? What do we value? If you had the power of a superintelligence, what *should* you try to do? This is something that has to be decided by all of us, AI researcher or not. The Future of Life Institute is in fact running a survey about this.

I imagine that you probably have one of two viewpoints, if you have never thought deeply about AI.

One viewpoint is that when you list the benefits and problems, you decide that the benefits are not worth the risk. So why do all that? Can’t we just stop here, or even downgrade to an earlier level of technology?

The answer is no, we cannot. First of all, the economy requires us to never stop innovating – it won’t work otherwise. And innovation also means developing ever smarter ML/AI algorithms. We might want to change how the economy works, of course, but that is beyond the scope of this article.

Second, the world has already changed and continues to do so. Things that worked in the past (e.g. combustion engines) often don’t work as a long-term solution (causes climate change and is dependent on limited resources). Today’s problems (climate change, organizing a global society, etc.) are more complicated than anything we have dealt with in the past, and solving them requires novel ideas and technologies.

The other viewpoint is that “everything will be alright”. This gut feeling that no matter what we do, no matter how bad things get, we’ll somehow get out of it and move on.

But the most important lesson that I have learned in my study of physics is that **nature does not care about us. That is why it is so important that we care about each other.**

Nature does not care if we live or die, if we are happy or if we suffer. It does not even care if we understand nature, or if we have a sense of purpose. Because nature is not a person. These things matter to *us*, however, and we need to get our act together to make this right.

This insight is particularly important when we consider the big picture. There is no grand plan. No guarantee that we are heading in a good direction, no safety-net that ensures that life continues irrespective of what we do. But there are also no constraints on how good the world could become, apart from the laws of physics.

It is up to us to figure out what direction is good for us, where we want to go, and how we go there.

Personally, I think we should aim to develop AI that helps us become the best people that we can be. Instead of suggesting enraging memes to us, it should help us understand each other better. Instead of learning to manipulate us for the sake of power and influence, it should learn to motivate us to reach excellence at the things we love to do. Once we become the best version of ourselves, we can decide where we want to go from there.

1. West, G. B. *Scale: the universal laws of growth, innovation, sustainability, and the pace of life in organisms, cities, economies, and companies*. (Penguin Press, 2017).
2. Lee, K.-F. *AI superpowers: China, Silicon Valley, and the new world order*. (Houghton Mifflin Harcourt, 2018).
3. Berger, J. A. & Milkman, K. L. What Makes Online Content Viral? *SSRN Electronic Journal* (2009). http://doi.org/10.2139/ssrn.1528077.
4. Kosinski, M., Stillwell, D. & Graepel, T. Private traits and attributes are predictable from digital records of human behavior. *Proceedings of the National Academy of Sciences* **110**, 5802–5805 (2013).
5. Bostrom, N. *Superintelligence: paths, dangers, strategies*. (Oxford University Press, 2014).
6. Goodfellow, I., Bengio, Y. & Courville, A. *Deep Learning*. (2016).

The post Self-Organizing Maps appeared first on Johannes E. M. Mosig.

The algorithm is very simple. Say you have a set of high-dimensional vectors and you want to represent them in an image, such that each vector is associated with a pixel of that image, and the similarity between vectors should correspond to the distance between pixels.

As a first step, we associate a random vector with each pixel in our map. Then we go through the input vectors and compute the “best matching unit” (BMU), which is the vector in our map that is closest to the input.

Now we update the BMU, and vectors that are close to it on the map, to be more like the input vector. How strongly we shift the vectors may depend on the distance from the BMU, and with each new input, the sphere of influence should decrease.

For example, we can take the RGB vectors of 1000 random colors as inputs, and create a 2D color map:

Here is the simple (non-optimized) Python source code that generated the image above:

```python
import numpy as np


class SelfOrganizingMap(object):

    def __init__(self, map_dims, input_dim):
        self.map_dims = map_dims
        self.input_dim = input_dim
        self.weights = np.random.uniform(size=(map_dims + [input_dim]), low=0.0, high=1.0)

    def __call__(self, input_batch, training=False):
        # Compute the squared distance of each vector in the map to each input vector
        activations = np.array([np.sum(np.square(self.weights - b), axis=-1) for b in input_batch])

        if training:
            num_batch = np.shape(input_batch)[0]
            num_indices = np.prod(self.map_dims)

            # Loop through all vectors in the input batch
            for b in range(num_batch):
                # Find which point on the map holds the vector that is closest to the
                # input, i.e. the one with the *smallest* squared distance
                best_matching_unit = np.argmin(activations[b], axis=None)
                x0 = np.unravel_index(best_matching_unit, self.map_dims)

                # Determine a radius of influence that shrinks as training progresses
                r = 5.0 * np.exp(-1.0 * b / num_batch)

                # Update all vectors associated with points in the vicinity of the BMU
                for i in range(num_indices):
                    # Compute the squared Euclidean distance between points on the map
                    x = np.unravel_index(i, self.map_dims)
                    dist2 = np.sum(np.square(np.array(x) - np.array(x0)))

                    # Set the update strength to a Gaussian, centered at the BMU
                    w = np.exp(-dist2 / (2 * r**2))

                    # Shift the map vector towards the input vector
                    self.weights[x] = (1.0 - w) * self.weights[x] + w * input_batch[b]

        return activations


if __name__ == '__main__':
    # Example: Self-organizing map for 1000 colors (reducing 3 to 2 dimensions)
    import matplotlib.pyplot as plt

    som = SelfOrganizingMap([12, 12], 3)
    colors = np.random.uniform(0.0, 1.0, size=[1000, 3])
    som(colors, training=True)

    plt.imshow(som.weights)
    plt.show()
```
