The Rubik’s Cube is a three-dimensional combination puzzle invented by Ernő Rubik in 1974. The Cube has six faces, each of which can be turned by 90° in either direction. Turning a face moves the colored stickers attached to the “cubelets” that make up that face. The goal of the game is to bring the Cube back into the sorted state (one color on each side), starting from a scrambled state.

For a full description of the problem, have a look at McAleer et al.’s article.

This puzzle is very hard to solve with common reinforcement learning (RL) techniques, because there are $\approx 4.3 \times 10^{19}$ possible configurations of the Rubik’s Cube. In particular, once you start from a scrambled state, it is very unlikely that you stumble upon the solution by taking random actions. Taking random actions, however, is exactly what a typical untrained RL agent does.

McAleer et al. circumvent this sparsity of the reward signal by starting the exploration of the state space from the *solved* configuration in order to learn the value and policy functions of the agent. They call their learning algorithm Autodidactic Iteration (ADI).

ADI learns the value and policy functions in three steps. First, it generates training inputs by modifying a solved Cube turn by turn – each turn yields a new state that is another training input. Second, for each training input state, it evaluates the value function for every child state that can be reached from the input state with one step. Finally, it sets the value and policy targets based on the maximal estimated value of the children.
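The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not McAleer et al.’s implementation: the ring puzzle, the names `generate_inputs`, `adi_targets`, `ring_step`, and `untrained_value`, and the reward convention (+1 for reaching the solved state, -1 otherwise) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def generate_inputs(solved: int, actions: Sequence[int],
                    step: Callable[[int, int], int],
                    depth: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Step 1: modify the solved state turn by turn.

    Returns (state, scramble_depth) pairs; every intermediate state
    is another training input.
    """
    inputs = []
    state = solved
    for d in range(1, depth + 1):
        state = step(state, rng.choice(actions))
        inputs.append((state, d))
    return inputs


def adi_targets(state: int, actions: Sequence[int],
                step: Callable[[int, int], int],
                value_fn: Callable[[int], float],
                is_solved: Callable[[int], bool]) -> Tuple[float, int]:
    """Steps 2 and 3: evaluate every child, then take the best one.

    Returns (value_target, policy_target), where policy_target is the
    index of the action that leads to the best-scoring child.
    """
    scores = []
    for action in actions:
        child = step(state, action)
        # Assumed reward convention for this sketch.
        reward = 1.0 if is_solved(child) else -1.0
        scores.append(reward + value_fn(child))
    best = max(range(len(scores)), key=lambda i: scores[i])
    return scores[best], best


# Hypothetical toy stand-in for the Cube: positions on a ring, solved at 0.
RING_SIZE = 6
ACTIONS = (+1, -1)


def ring_step(state: int, action: int) -> int:
    return (state + action) % RING_SIZE


def ring_is_solved(state: int) -> bool:
    return state == 0


def untrained_value(state: int) -> float:
    # An untrained value function that knows nothing yet.
    return 0.0
```

For the state one step away from the solution, `adi_targets(1, ACTIONS, ring_step, untrained_value, ring_is_solved)` picks the action `-1` (index 1), because only that child earns the positive reward. This is how information about the goal propagates outward from the solved state.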

Once the value and policy targets are known, McAleer et al. use the RMSProp optimization algorithm to update the weights of the neural network that represents the value and policy functions. To stabilize training, they weight each training sample in inverse proportion to its distance (in scramble moves) from the solved Cube, so that states close to the solution count more.
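One way to read the distance-dependent scaling is as a per-sample weight of 1/D, where D is the number of scramble moves that produced the sample. The sketch below applies such a weight to a squared-error value loss. The function names and the choice of squared error are assumptions for illustration; the policy loss and the RMSProp update itself are left out.

```python
from typing import Sequence


def sample_weight(scramble_depth: int) -> float:
    """Weight a sample by the inverse of its distance from the solved state."""
    if scramble_depth < 1:
        raise ValueError("scramble_depth must be at least 1")
    return 1.0 / scramble_depth


def weighted_value_loss(predictions: Sequence[float],
                        targets: Sequence[float],
                        scramble_depths: Sequence[int]) -> float:
    """Weighted mean of squared errors; shallow scrambles dominate."""
    weights = [sample_weight(d) for d in scramble_depths]
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    weighted_sum = sum(w * e for w, e in zip(weights, squared_errors))
    return weighted_sum / sum(weights)
```

Samples far from the solved state have targets that depend on many still-inaccurate child estimates, so down-weighting them keeps early training from chasing noise.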

A Monte Carlo tree search (MCTS) can now use the learned value and policy functions to solve the Cube.
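To give a feeling for how a tree search can combine the two learned functions, here is a sketch of a generic PUCT-style action selection rule: the policy output acts as a prior that steers exploration, and the value estimates rank the actions that have already been tried. This is a common textbook form, not necessarily the exact rule used in DeepCube; `NodeStats`, `select_action`, and the `exploration` constant are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    """Per-action statistics stored at one node of the search tree."""
    prior: List[float]   # policy output for each action
    visits: List[int]    # how often each action was taken from this node
    value: List[float]   # current value estimate for each action


def select_action(node: NodeStats, exploration: float = 1.0) -> int:
    """Pick the action that maximizes value plus an exploration bonus."""
    total_visits = sum(node.visits)

    def score(action: int) -> float:
        # The bonus is large for actions with a high prior and few visits.
        bonus = (exploration * node.prior[action]
                 * math.sqrt(total_visits) / (1 + node.visits[action]))
        return node.value[action] + bonus

    return max(range(len(node.prior)), key=score)
```

With equal visit counts and equal values the rule follows the policy prior. As visits accumulate, the bonus shrinks and the value estimates take over.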

McAleer et al. compare their algorithm (called DeepCube) with other algorithms that either use human knowledge (“Kociemba”) or are extremely slow (“Korf”). In addition, they also compare DeepCube to two simplified versions of itself.

They find that DeepCube is able to solve 100% of randomly scrambled Cubes within one hour while achieving a median solve length of 13 moves (same as Korf). Moreover, DeepCube matches the optimal path (that Korf is guaranteed to find) in 74% of the cases, while running much faster than Korf.

Interestingly, DeepCube learns to use certain sequences of moves that are also used by humans, e.g. $a\,b\,a^{-1}$, where $a$ and $b$ are arbitrary actions and $a^{-1}$ is the inverse action of $a$. Moreover, DeepCube seems to follow a particular strategy when solving Cubes.

The Rubik’s Cube is indeed a hard and interesting problem to solve, and McAleer et al.’s DeepCube algorithm manages to solve this problem without human knowledge that is specific to it. This by itself is quite impressive.

I also find that their article is very well written and illustrated. It thus serves as a good introduction to the Rubik’s Cube machine learning problem.

Most interesting to me was the fact that DeepCube often seems to follow a particular strategy.

The critical trick of allowing the algorithm to always start from the solved state, however, does not seem like a revolutionary idea. In real-world problems you cannot just reset the state of the world, and when somebody hands me a scrambled Rubik’s Cube, I cannot just solve it in order to learn how to solve it.

Furthermore, there is *some* human domain knowledge included in DeepCube, namely the representation of the Cube. In my own Rubik’s Cube project I aim for my algorithm to come up with a representation on its own and solve the Cube without direct access to the solution. Follow my Rubik’s Cube blog to track my progress on this project.

The Rubik’s Cube is a popular combination puzzle. Like any other cube it has six sides, but each side consists of nine panels. In the sorted (solved) state, there is exactly one color on each side of the cube. You can turn each side independently and thereby re-order the colored panels. The problem is to sort the cube again after you’ve scrambled it.

There are 43,252,003,274,489,856,000 possible configurations of the Rubik’s Cube. This makes the puzzle very hard to solve with pure reinforcement learning (RL) techniques, because once you are in a scrambled state, it is exceedingly unlikely that you stumble upon the solution by taking random actions. Taking random actions, however, is exactly what a typical untrained RL agent does.
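The number of configurations follows from a standard counting argument, which is short enough to check in code. The 8 corner pieces can be permuted and twisted, the 12 edge pieces can be permuted and flipped, and three constraints remove the unreachable arrangements: the last corner twist and the last edge flip are fixed by the others, and the corner and edge permutations must have the same parity. The function name below is made up for this example.

```python
from math import factorial


def count_cube_states() -> int:
    """Count the reachable configurations of a 3x3x3 Rubik's Cube."""
    corner_permutations = factorial(8)
    corner_orientations = 3 ** 7    # the eighth twist is determined by the rest
    edge_permutations = factorial(12)
    edge_orientations = 2 ** 11     # the twelfth flip is determined by the rest
    parity_constraint = 2           # corner and edge parity must agree
    return (corner_permutations * corner_orientations
            * edge_permutations * edge_orientations) // parity_constraint
```

Python integers have arbitrary precision, so the result is exact rather than a floating-point approximation.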

Recently, McAleer et al. managed to write a program (“DeepCube”) that learns to solve the Rubik’s Cube without prior human knowledge, except for an abstract representation of the cube. Check out my other blog post for a summary.

In the present project, my goal is to go one step further and solve a somewhat more difficult problem than that of McAleer et al., as I describe in the next paragraph. For instance, in my problem the algorithm is only provided with a description of the goal, and cannot learn by starting from the goal state, as DeepCube does.

The idea is this: I want to create an algorithm that is given only *images* of the cube, as well as a set of actions that turn its six sides. After each turn it can observe a new set of images of the (now) modified cube.

In addition to the set of images of the present state of the cube, the algorithm is also given a set of images of the solved cube. Thus, there is no external reward signal, but only a description (image) of what it is supposed to achieve.

To solve the objective above, I plan to construct an algorithm with three components.

A **representation component** that generates an abstract representation of the environment / Rubik’s Cube. This might be achieved with a generative query network, or some variation of it.

An **action manager component** that learns to reliably modify specific parts of that representation by taking actions in the environment and updating the representation. (It could also learn using an internal action-consequence predictor.) Once trained, this component should be able to return a set of useful algorithms (action-sequences) when it is given a set of positions in the representation that are to be manipulated. An algorithm is useful if:

- it reliably changes a specific subset of the representation with little to no effect on the other parts and
- the algorithms are “almost orthogonal” in the sense that linear combinations of algorithms approximately affect the union of the subsets of the representation.
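The two criteria above can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the representation is a tuple of integers and that an algorithm is a function from one representation to another. The names `footprint`, `is_local`, `is_almost_orthogonal`, and `make_swap` are hypothetical and not part of any existing library.

```python
from typing import Callable, Set, Tuple

Representation = Tuple[int, ...]
Algorithm = Callable[[Representation], Representation]


def footprint(before: Representation, after: Representation) -> Set[int]:
    """Positions of the representation that differ between two states."""
    return {i for i, (x, y) in enumerate(zip(before, after)) if x != y}


def is_local(state: Representation, algorithm: Algorithm, max_size: int) -> bool:
    """First criterion: the algorithm changes only a small subset."""
    return len(footprint(state, algorithm(state))) <= max_size


def is_almost_orthogonal(state: Representation, first: Algorithm,
                         second: Algorithm, tolerance: int = 0) -> bool:
    """Second criterion: applying both affects (about) the union of subsets."""
    union = footprint(state, first(state)) | footprint(state, second(state))
    combined = footprint(state, second(first(state)))
    # The symmetric difference counts positions where the two sets disagree.
    return len(combined ^ union) <= tolerance


def make_swap(i: int, j: int) -> Algorithm:
    """Toy algorithm that exchanges two positions of the representation."""
    def swap(state: Representation) -> Representation:
        items = list(state)
        items[i], items[j] = items[j], items[i]
        return tuple(items)
    return swap
```

On the state `(0, 1, 2, 3, 4, 5)`, the swaps of positions (0, 1) and (2, 3) pass the orthogonality check, while a swap combined with itself fails it, because the two applications cancel and the combined footprint is empty.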

The **actor component** uses algorithms generated by the manager component to move the current state of the environment towards the goal state (e.g. using Monte Carlo tree search and some distance heuristic). The goal state is given as a list of images or a textual description of the solved cube, as mentioned above. This description is encoded using the representation component.
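As a rough illustration of the actor, the sketch below uses a plain greedy search in place of Monte Carlo tree search, with the number of mismatched positions as the distance heuristic. Everything here is an assumption for the example: the tuple representation, the dictionary of named algorithms, and the names `mismatch_distance`, `greedy_solve`, and `make_swap`.

```python
from typing import Callable, Dict, List, Tuple

Representation = Tuple[int, ...]
Algorithm = Callable[[Representation], Representation]


def mismatch_distance(state: Representation, goal: Representation) -> int:
    """Distance heuristic: number of positions that differ from the goal."""
    return sum(1 for s, g in zip(state, goal) if s != g)


def greedy_solve(state: Representation, goal: Representation,
                 algorithms: Dict[str, Algorithm],
                 max_steps: int = 20) -> Tuple[Representation, List[str]]:
    """Repeatedly apply the algorithm that brings the state closest to the goal."""
    path: List[str] = []
    for _ in range(max_steps):
        if state == goal:
            break
        best = min(algorithms,
                   key=lambda name: mismatch_distance(algorithms[name](state), goal))
        candidate = algorithms[best](state)
        if mismatch_distance(candidate, goal) >= mismatch_distance(state, goal):
            break  # no algorithm improves the heuristic; greedy search is stuck
        state = candidate
        path.append(best)
    return state, path


def make_swap(i: int, j: int) -> Algorithm:
    """Toy algorithm that exchanges two positions of the representation."""
    def swap(state: Representation) -> Representation:
        items = list(state)
        items[i], items[j] = items[j], items[i]
        return tuple(items)
    return swap
```

Greedy search gets stuck whenever every algorithm temporarily increases the distance, which is exactly where a tree search that looks several steps ahead should do better.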

My Rubik’s Cube project is a playground for trying out different machine learning techniques. But the specific long-term goal outlined in this post should act as a guide, as well as a source of new problems to solve.

This is my first blog post (ever!), and also my first large machine learning project. I hope you enjoy reading about my journey into the exciting field of artificial intelligence, and I very much appreciate your constructive feedback.
