We first trained the policy network on 30 million moves from games played by human experts, until it could predict the human move 57% of the time (the previous record before AlphaGo was 44%). But our goal is to beat the best human players, not just mimic them. To do this, AlphaGo learned to discover new strategies for itself, by playing thousands of games between its neural networks, and gradually improving them using a trial-and-error process known as reinforcement learning. This approach led to much better policy networks, so strong in fact that the raw neural network (immediately, without any tree search at all) can defeat state-of-the-art Go programs that build enormous search trees.
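To make the "trial-and-error" idea in that passage concrete, here is a minimal sketch of self-play reinforcement learning. It is a toy and not AlphaGo's method: the game (single-pile Nim, take 1 to 3 stones, last stone wins), the tabular softmax policy, the REINFORCE-style update, and every function name are assumptions chosen for illustration. There is no neural network and no supervised pre-training on human games.

```python
import math
import random

MAX_TAKE = 3  # toy rule (assumed): remove 1..MAX_TAKE stones; taking the last stone wins


def legal_moves(pile):
    return list(range(1, min(MAX_TAKE, pile) + 1))


def move_probs(logits, pile):
    """Softmax over the legal moves at this pile size."""
    moves = legal_moves(pile)
    top = max(logits[pile][m] for m in moves)  # subtract max for numerical stability
    weights = [math.exp(logits[pile][m] - top) for m in moves]
    total = sum(weights)
    return {m: w / total for m, w in zip(moves, weights)}


def self_play_game(logits, start_pile, rng):
    """Both sides sample moves from the same policy. Returns (history, winner)."""
    pile, player, history = start_pile, 0, []
    while True:
        probs = move_probs(logits, pile)
        move = rng.choices(list(probs), weights=list(probs.values()))[0]
        history.append((player, pile, move))
        pile -= move
        if pile == 0:
            return history, player  # whoever took the last stone wins
        player = 1 - player


def train(start_pile=10, games=5000, lr=0.1, seed=0):
    """Improve the policy by playing it against itself (REINFORCE-style update)."""
    rng = random.Random(seed)
    logits = {p: {m: 0.0 for m in legal_moves(p)} for p in range(1, start_pile + 1)}
    for _ in range(games):
        history, winner = self_play_game(logits, start_pile, rng)
        for player, pile, move in history:
            # Reward +1 for every move made by the winner, -1 for the loser's moves.
            reward = 1.0 if player == winner else -1.0
            probs = move_probs(logits, pile)
            for m, p in probs.items():
                # Gradient of log-softmax: 1 - p for the chosen move, -p otherwise.
                grad = (1.0 if m == move else 0.0) - p
                logits[pile][m] += lr * reward * grad
    return logits
```

Calling `train()` and then `move_probs(logits, 5)` shows how the learned preferences shift away from the uniform starting point. In this version of Nim, the strong move is the one that leaves the opponent a multiple of four stones, and the policy tends to drift toward it, though the sketch makes no guarantee about how quickly.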
This bit from Elizabeth Gibney, reporting for Nature, also touched on a difficulty in AI that I hadn’t thought much about:
[DeepMind co-founder Demis Hassabis] says that many challenges remain in DeepMind’s goal of developing a generalized AI system. In particular, its programs cannot yet usefully transfer their learning about one system — such as Go — to new tasks; a feat that humans perform seamlessly. “We’ve no idea how to do that. Not yet,” Hassabis says.