Yeah Buddy!!! Happy Thanksgiving!!!

Let’s find the repeated sequences of moves in chess matches between chess masters.

The source of the chess moves is here: https://github.com/rozim/ChessData/tree/master/ChessOk.com

| Stat Type | Number |
| --- | --- |
| # of matches | 469,825 |
| # of moves | 19,031,726 |
| runtime of analysis | 62 minutes 23 seconds |

Portable Game Notation (PGN) is a plain text format for recording chess games. I’m running the analysis on a 2013 MacBook with a 2.6 GHz Intel i5 CPU. I plan to re-run it on a more current Lenovo X1 with a 1.8 GHz Intel i7 CPU.

Here is an example game: 1.d4 d5 2.Nf3 Nf6 3.c4 e6 4.e3 c6 5.Nbd2 g6 6.b3 Qa5 7.Qc2 Bg7 8.Bd3 O-O 9.O-O c5 10.dxc5 Nc6 11.a3 dxc4 12.Qxc4 Nd7 13.Rb1 Nde5 14.Qc2 Nxd3 15.Qxd3 Qxc5 16.b4 Qe7 17.Bb2 a6 18.Bxg7 Kxg7 19.Nc4 f6 20.Nb6 Rb8 21.Qc3 e5 22.Nd2 Be6 23.Ne4 Bf5 24.Rbd1 Rbd8 25.Nc5 Qc7 26.Nca4 Bg4 27.f3 Be6 28.Nc5 Rfe8 29.Rxd8 Nxd8 30.Nca4 Qxc3 31.Nxc3 Nc6 32.Ne4 Re7 33.Nc5 Kf7 34.Rc1 Nb8 35.Rd1 Nc6 36.Rd6 Na7 37.a4 Nc8 38.Rxe6 Rxe6 39.Nxc8 Rc6 40.Nxb7 Rc1+ 41.Kf2 Rc2+ 42.Kg3 Ke8 43.Nb6 f5 44.a5 Rb2 45.Nd5 Kd7 46.Nc5+ Kc6 47.e4 fxe4 48.fxe4 Kb5 49.h4 Rc2 50.Nd7 h6 51.Nxe5 g5 52.hxg5 hxg5 53.Nf3 Rc8 54.Nxg5 Rf8 55.Ne6 Rg8+ 56.Kf3 Kc4 57.g4 Rg6 58.Nec7 Kd4 59.Kf4 Rg7 60.Ne6+ 1-0

To let the moves reveal more patterns, I replace each move number with a generic # symbol. This way we can see patterns shared by move sequences with different numbering. The first move number, 1, is replaced with the word start, which marks the sequence as part of a chess opening. I may reconsider this and replace it with # as well.

start d4 d5 # Nf3 Nf6 # c4 e6 # e3 c6 # Nbd2 g6 # b3 Qa5 # Qc2 Bg7 # Bd3 O-O # O-O c5 # dxc5 Nc6 # a3 dxc4 # Qxc4 Nd7 # Rb1 Nde5 # Qc2 Nxd3 # Qxd3 Qxc5 # b4 Qe7 # Bb2 a6 # Bxg7 Kxg7 # Nc4 f6 # Nb6 Rb8 # Qc3 e5 # Nd2 Be6 # Ne4 Bf5 # Rbd1 Rbd8 # Nc5 Qc7 # Nca4 Bg4 # f3 Be6 # Nc5 Rfe8 # Rxd8 Nxd8 # Nca4 Qxc3 # Nxc3 Nc6 # Ne4 Re7 # Nc5 Kf7 # Rc1 Nb8 # Rd1 Nc6 # Rd6 Na7 # a4 Nc8 # Rxe6 Rxe6 # Nxc8 Rc6 # Nxb7 Rc1+ # Kf2 Rc2+ # Kg3 Ke8 # Nb6 f5 # a5 Rb2 # Nd5 Kd7 # Nc5+ Kc6 # e4 fxe4 # fxe4 Kb5 # h4 Rc2 # Nd7 h6 # Nxe5 g5 # hxg5 hxg5 # Nf3 Rc8 # Nxg5 Rf8 # Ne6 Rg8+ # Kf3 Kc4 # g4 Rg6 # Nec7 Kd4 # Kf4 Rg7 # Ne6+ 1-0
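Here is a minimal Python sketch of that renumbering step. It is not the exact script used for this post. The function name `normalize_moves` and the regular expression are my own assumptions about how the replacement could be done.

```python
import re

# A move-number token is a run of digits followed by one or more dots,
# e.g. "1." or "23.". The lookbehind makes sure the digits start a token,
# so results such as "1-0" or "1/2-1/2" are left alone.
MOVE_NUMBER = re.compile(r"(?<!\S)(\d+)\.+\s*")


def normalize_moves(pgn_moves: str) -> str:
    """Replace move number 1 with 'start' and every other move number with '#'."""

    def replace(match: re.Match) -> str:
        return "start " if match.group(1) == "1" else "# "

    return MOVE_NUMBER.sub(replace, pgn_moves).strip()


print(normalize_moves("1.d4 d5 2.Nf3 Nf6 3.c4 e6"))
# start d4 d5 # Nf3 Nf6 # c4 e6
```

The game result at the end of the line (for example 1-0) is kept, because it is not followed by a dot.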

Now I’m ready to combine all the games into a text file and run analysis tools on it. The resulting file is 342 MB. I removed anything that was not a chess move, which reduced the file to 192 MB and 4,499,168 lines of text. Here’s a sample of the input file.


start d4 d5 # Nf3 Nf6 # c4 e6 # e3 c6 # Nbd2 g6 # b3 Qa5 # Qc2 Bg7
# Bd3 O-O # O-O c5 # dxc5 Nc6 # a3 dxc4 # Qxc4 Nd7 # Rb1
Nde5 # Qc2 Nxd3 # Qxd3 Qxc5 # b4 Qe7 # Bb2 a6 # Bxg7 Kxg7
# Nc4 f6 # Nb6 Rb8 # Qc3 e5 # Nd2 Be6 # Ne4 Bf5 # Rbd1
Rbd8 # Nc5 Qc7 # Nca4 Bg4 # f3 Be6 # Nc5 Rfe8 # Rxd8 Nxd8
# Nca4 Qxc3 # Nxc3 Nc6 # Ne4 Re7 # Nc5 Kf7 # Rc1 Nb8 # Rd1
Nc6 # Rd6 Na7 # a4 Nc8 # Rxe6 Rxe6 # Nxc8 Rc6 # Nxb7 Rc1+
# Kf2 Rc2+ # Kg3 Ke8 # Nb6 f5 # a5 Rb2 # Nd5 Kd7 # Nc5+
Kc6 # e4 fxe4 # fxe4 Kb5 # h4 Rc2 # Nd7 h6 # Nxe5 g5 # hxg5
hxg5 # Nf3 Rc8 # Nxg5 Rf8 # Ne6 Rg8+ # Kf3 Kc4 # g4 Rg6
# Nec7 Kd4 # Kf4 Rg7 # Ne6+ 1-0


start d4 d5 # c4 e6 # Nf3 Nf6 # Bg5 h6 # Bxf6 Qxf6 # Nc3 c6 # e3
Nd7 # Bd3 g6 # O-O dxc4 # Bxc4 Bg7 # Rc1 O-O # Ne4 Qe7 # Bb3
Rd8 # Qc2 Rb8 # Rfd1 Nf8 # Ne5 Bd7 # f4 Be8 # Qc5 Qxc5 # dxc5
Bxe5 # fxe5 Nd7 # Nf6+ Kf8 # Nxd7+ Rxd7 # Rd6 Rc7 # Bc4
b6 # b3 Ke7 # Kf2 bxc5 # a4 Rd8 # Rcd1 Rdd7 # Rxd7+ Rxd7
# Rxd7+ Bxd7 # h4 f6 # Kf3 fxe5 # e4 a5 # g4 Be8 # g5 hxg5
....
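The cleanup that shrinks the file from 342 MB to 192 MB is not shown in this post, so the following is only a sketch of one way to strip the non-move content from a single PGN game. It assumes the extra content is tag-pair lines in square brackets, comments in braces, and numeric annotation glyphs such as $1. The name `strip_non_moves` is hypothetical, and nested variations in parentheses are not handled.

```python
import re

COMMENT = re.compile(r"\{[^}]*\}")  # brace comments such as {best by test}
NAG = re.compile(r"\$\d+")          # numeric annotation glyphs such as $1


def strip_non_moves(pgn_game: str) -> str:
    """Return only the movetext of one PGN game, on a single line."""
    kept = []
    for line in pgn_game.splitlines():
        line = line.strip()
        # Skip blank lines and tag pairs like [Event "..."].
        if not line or line.startswith("["):
            continue
        kept.append(line)
    movetext = " ".join(kept)
    movetext = COMMENT.sub(" ", movetext)
    movetext = NAG.sub(" ", movetext)
    # Collapse the whitespace left behind by the removals.
    return " ".join(movetext.split())


sample = '[Event "Test"]\n[Result "1-0"]\n\n1.e4 {best by test} e5 2.Nf3 $1 Nc6 1-0\n'
print(strip_non_moves(sample))
# 1.e4 e5 2.Nf3 Nc6 1-0
```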

From here on, we focus on the output of the pattern-finding programs.

The most popular single move is O-O, kingside castling. That move is made 752,998 times. Because each side can castle at most once per game, the maximum possible count is 469,825 x 2 = 939,650. That means the probability that a given side plays O-O in a game is 752,998 / (469,825 x 2) = 0.80136008088.
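The arithmetic is easy to check in a few lines of Python. The variable names are my own. The calculation assumes that the 752,998 count covers only the exact token O-O (not O-O-O), and it relies on the rule that each side can castle at most once per game.

```python
matches = 469_825         # number of games in the data set
castle_count = 752_998    # times the token O-O appears

# Each game has two sides, and each side can castle at most once.
opportunities = matches * 2

share = castle_count / opportunities
print(f"{opportunities:,} opportunities, O-O share = {share:.4f}")
# 939,650 opportunities, O-O share = 0.8014
```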

Let’s look for the most common quick O-O, meaning a castle that happens early in the move sequence. This sequence is played 8,431 times in the data.

1. e4 e5 
2. Nf3 Nc6 
3. Bb5 a6 
4. Ba4 Nf6 
5. O-O

That sequence is a line of the Ruy Lopez. The Ruy Lopez, also called the Spanish Opening or Spanish Game, is a chess opening characterised by the moves:

1. e4 e5
2. Nf3 Nc6
3. Bb5

What other openings are common in the data besides the Spanish Game? One is the Sicilian Defence with the Najdorf Variation. It begins 1.e4 c5 rather than 1.e4 e5, so it is a separate opening and not a variation of the Spanish Game. For 100 years, the Sicilian was criticized as weak. In the last 25 years, it has been favored by Kasparov and has become more popular.

Later, Garry Kasparov also adopted the 5…a6 move order, but with the idea of playing …e6 rather than …e5. Kasparov’s point is that the immediate 5…e6 allows 6.g4, which is White’s most dangerous line against the Scheveningen. By playing 5…a6 first, Black temporarily prevents White’s g4 thrust and waits to see what White plays instead. Source: https://en.wikipedia.org/wiki/Sicilian_Defence#Najdorf_Variation

In the analysis, we see this pretty pattern.

12444    start e4 c5# Nf3 d6# d4 cxd4# Nxd4 Nf6# Nc3 a6#

That means that this opening is played 12,444 times. We can translate that to this more standard PGN notation.

1.e4 c5 
2.Nf3 d6 
3.d4 cxd4 
4.Nxd4 Nf6 
5.Nc3 a6
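That translation can also be done mechanically. Below is a small sketch; the function name `to_pgn` is hypothetical. It assumes # is only ever a move separator, so a checkmate written as Qh7# in the original PGN would be misread. For patterns that do not begin with start, the numbers it produces are relative to the pattern and are not the real move numbers.

```python
def to_pgn(pattern: str) -> list[str]:
    """Turn 'start e4 c5# Nf3 d6# ...' back into numbered PGN move pairs."""
    # The output glues '#' to the previous move ("c5#"), so split it off first.
    tokens = pattern.replace("#", " # ").split()
    lines, current, number = [], [], 1
    for token in tokens:
        if token in ("start", "#"):
            # A separator closes the move pair collected so far.
            if current:
                lines.append(f"{number}.{' '.join(current)}")
                current = []
                number += 1
        else:
            current.append(token)
    if current:  # trailing moves with no closing separator
        lines.append(f"{number}.{' '.join(current)}")
    return lines


# Split the count from the pattern, then renumber the pattern.
count, pattern = "12444    start e4 c5# Nf3 d6# d4 cxd4# Nxd4 Nf6# Nc3 a6#".split(maxsplit=1)
print(count, to_pgn(pattern))
```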

Below is a visualization of the Sicilian Defence - Najdorf Variation:

There is a gold mine in the patterns uncovered. Here is a link to the raw output file of the analysis, which lists the top 10 patterns for each length of repeated phrase, longest phrases first. It looks like there is some dirty data with repeated games, and I did not clean up all of it. The pattern results are still robust: we are only considering patterns that happen thousands of times or more, and the duplicated games appear at most 2 or 3 times.
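The pattern-finding programs themselves are not shown here. As a rough illustration of the idea only, the sketch below counts every run of a given number of tokens across games and returns the most frequent ones. The function name `top_patterns`, the one-game-per-string input, and the use of `collections.Counter` are my assumptions and not the tool that produced the output file. Counting every phrase of 19 million moves this way would need a lot of memory, so treat it as a toy.

```python
from collections import Counter


def top_patterns(games: list[str], length: int, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` most frequent phrases of `length` tokens across all games."""
    counts: Counter[str] = Counter()
    for game in games:
        tokens = game.split()
        # Slide a window of `length` tokens over the game.
        for i in range(len(tokens) - length + 1):
            counts[" ".join(tokens[i:i + length])] += 1
    return counts.most_common(top)


games = [
    "start e4 e5 # Nf3 Nc6",
    "start e4 e5 # Nf3 Nf6",
    "start d4 d5 # c4 e6",
]
print(top_patterns(games, length=5, top=1))
# [('start e4 e5 # Nf3', 2)]
```

Running it once per phrase length, longest first, would give a listing shaped like the one described above.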

Summarizing the games of chess masters into something easy and quick to remember is fun. As amateur chess players, we need to be efficient learners because we don’t have as much time to play as masters, who commit major amounts of time to the game.