Reinforcement learning and Rubik's Cube

This week’s article is “Solving the Rubik’s Cube Without Human Knowledge” by McAleer, Agostinelli, Shmakov, and Baldi , which was submitted to NeurIPS 2018. To understand this article, you need to have a basic understanding of neural networks and be familiar with reinforcement learning.

In my own words

The Rubik’s Cube is a 3-dimensional combination puzzle, invented by Erno Rubik in 1974. The Cube has six faces that can each be turned by 90° in either direction. Turning the sides moves the colored stickers that are attached to the “cubelets” that this side is made of. The goal of the game is to bring the Cube back into the sorted state (one color on each side), starting from a scrambled state.

For a full description of the problem, have a look at McAleer et al.’s article or check out this cool animation of how the mechanics of the Cube works:

What’s inside of a Rubik’s Cube? By Jared Owen.

This puzzle is very hard to solve with common reinforcement learning (RL) techniques, because there are $\approx 4 \times 10^{19}$ possible configurations of the Rubik’s Cube. In particular, it is very unlikely that you stumble over the solution by taking random actions, once you start from a scrambled state. Taking random actions, however, is exactly what a typical untrained RL agent does.

McAleer et al. circumvent this reward signal sparsity by starting the exploration of the state space with the solved configuration, to learn the value- and policy functions of the agent. They call their learning algorithm Autodidactic Iteration (ADI).

ADI learns the value and policy functions in three steps. First, it generates training inputs by modifying a solved Cube turn by turn – each turn yields a new state that is another training input. Second, for each training input state, it evaluates the value function for every child state that can be reached from the input state with one step. Finally, it sets the value and policy targets based on the maximal estimated value of the children.

Once the value and policy targets are known, McAleer et al. use the RMSProp optimization algorithm to update the weights of the NN that constitutes the value and policy functions. To stabilize the algorithms, they have to define a learning rate that is inversely proportional to the distance of the sample to the solved Cube.

A Monte Carlo tree search (MCTS) can now use the learned value and policy functions to solve the Cube.

McAleer et al. compare their algorithm (called DeepCube) with other algorithms that either use human knowledge (“Kociemba”) or are extremely slow (“Korf”). In addition, they also compare DeepCube to two simplified versions of itself.

They find that DeepCube is able to solve 100% of randomly scrambled Cubes within one hour while achieving a median solve length of 13 moves (same as Korf). Moreover, DeepCube matches the optimal path (that Korf is guaraneed to find) in 74% of the cases, while evaluating much faster than Korf.

Interestingly, DeepCube learns to use certain sequences of moves that are also used by humans, e.g. $a\,b\,a^{-1}$ , where $a$ and $b$ are arbitrary actions, and $a^{-1}$ is the inverse action of $a$ . Moreover, DeepCube seems to follow a particular strategy when solving Cubes.

Opinion, and what I have learned

The Rubik’s Cube is indeed a hard and interesting problem to solve, and McAleer et al.’s DeepCube algorithm manages to solve this problem without human knowledge that is specific to it. This by itself is quite impressive.

I also find that their article is very well written and illustrated. It thus serves as a good introduction to the Rubik’s Cube machine learning problem.

Most interesting to me was the fact that DeepCube often seems to follow a particular strategy.

The critical trick of allowing the algorithm to always start from the solved state, however, does not seem like a revolutionary idea. In real-world problems you cannot just reset the state of the world, and when somebody hands me a scrambled Rubik’s Cube, I cannot just solve it in order to learn how to solve it.

Furthermore, there is some human domain knowledge included in DeepCube, namely the representation of the Cube. In my own Rubik’s Cube project I aim for my algorithm to come up with a representation on its own and solve the Cube without direct access to the solution. Follow my Rubik’s Cube blog to track my progress on this project.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Johannes E. M. Mosig

Reinforcement learning and Rubik’s Cube

In my own words

Opinion, and what I have learned

References

Leave a ReplyCancel reply