Annotation-vs-Simulation Testing Framework

There are two independently developed components in this project: one is the annotation program (annotator) and the other is the simulation program (simulator).

Both are essential for training a Mahjong AI, and their correctness is a fundamental requirement for successful learning. If the annotator is buggy, learning will be based on incorrect features or aimed at a wrong objective. The simulator is used to compare the performance of differently trained models, and may also be used for reinforcement learning. If the simulator is buggy, the performance comparison between trained models may be wrong, or reinforcement learning may be driven by incorrect simulation results.

To ensure the correctness of these two components, one would normally employ a conventional testing approach such as unit testing. However, both components hold mahjong game states internally, and these states are extremely complex, so preparing such an internal state for each unit test would be extremely costly. In addition, the rules of mahjong contain a large number of unstated corner cases. Identifying these corner cases by hand and writing a unit test for each would inevitably leave some of them uncovered.

For these reasons, this project adopts a testing framework called the Annotation-vs-Simulation Testing Framework. In this framework, an extremely large number of game records crawled from Mahjong Soul are used as test cases, and each test case is checked for any discrepancy between the annotator implementation and the simulator implementation.

The following figure shows the outline of the Annotation-vs-Simulation Testing Framework.

[Figure: Diagram of the Annotation-vs-Simulation Testing Framework]

In the testing framework, the first step is to generate the test data. Test data is generated by running test/annotation_vs_simulation/generate.py with the following data as the input:

The test data consists of JSON files, each of which contains complete information about the progress and results of one game. Each JSON file has the following format:

{
  "uuid": "YYMMDD-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "room": 3,
  "style": 1,
  "players": [
    {
      "grade": 6,
      "final_ranking": 0,
      "final_score": 30000
    },
    .....
  ],
  "rounds": [
    {
      "paishan": [36, .....],
      "decisions": [
        {
          "sparse": [2, .....],
          "numeric": [0, .....],
          "progression": [0, .....],
          "candidates": [144, .....],
          "index": 0
        },
        .....
      ],
      "delta_scores": [0, 0, 0, 0]
    },
    .....
  ]
}

For a more precise specification of the format of test data with JSON Schema, please refer to the "Appendix: JSON Schema for Test Data" section.
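As a rough illustration of how such a file can be consumed, the following sketch walks the rounds and decisions of one record using only the Python standard library. The helper name `summarize_game` and the truncated sample record are made up for this example and are not part of the project. It also assumes that `index` selects one entry of `candidates`, which the format suggests but does not state.

```python
import json


def summarize_game(record):
    """Hypothetical helper: collect basic statistics from one game record."""
    chosen = []
    for game_round in record["rounds"]:
        for decision in game_round["decisions"]:
            # Assumption: "index" is a position within "candidates".
            chosen.append(decision["candidates"][decision["index"]])
    return {
        "uuid": record["uuid"],
        "num_rounds": len(record["rounds"]),
        "num_decisions": len(chosen),
        "chosen_actions": chosen,
    }


# A made-up, heavily truncated record. It follows the key layout shown above
# but is far too short to satisfy the full schema (e.g. an empty "paishan").
sample_text = """
{
  "uuid": "example",
  "room": 3,
  "style": 1,
  "players": [],
  "rounds": [
    {
      "paishan": [],
      "decisions": [
        {"sparse": [2], "numeric": [0], "progression": [0],
         "candidates": [144, 7], "index": 1},
        {"sparse": [2], "numeric": [0], "progression": [0, 9],
         "candidates": [12], "index": 0}
      ],
      "delta_scores": [0, 0, 0, 0]
    }
  ]
}
"""

summary = summarize_game(json.loads(sample_text))
print(summary)
```

In practice the record would be read from one of the generated JSON files instead of an inline string.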

Once the test data is ready, the testing framework runs test/annotation_vs_simulation/run.py on it. The program first initializes an instance of the TestModel Python class. This class is a dummy model that decides actions based on the sequence of feature-decision pairs recorded at each decision-making point of the actual game record saved in the test data. After that, test/annotation_vs_simulation/run.py starts a simulation with the tile walls restored from the game record and with the TestModel dummy model as each player's decision model. If both annotation and simulation are correctly implemented, the simulation passes the features at each decision-making point with the content and in the order expected by the dummy model, and the dummy model reproduces each player's decisions exactly as recorded in the game record, returning the correct decisions to the simulation in the correct order.

If there is a bug in either annotation or simulation, wrong features are passed to the dummy model, features are passed in a wrong order, or the simulation returns a wrong result. The testing framework checks for these signs of discrepancy between annotation and simulation and fails the test when one is detected.
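The replay-and-compare idea can be sketched in a few lines of Python. This is not the project's TestModel; `ReplayModel`, `ReplayMismatch`, `decide`, `finish`, and `features_of` are hypothetical names, and the sketch assumes that the model returns the recorded `index` and that the round result is compared through `delta_scores`.

```python
class ReplayMismatch(Exception):
    """Signals a discrepancy between the recorded game and the simulation."""


class ReplayModel:
    """Hypothetical stand-in for a TestModel-like dummy model."""

    FEATURE_KEYS = ("sparse", "numeric", "progression", "candidates")

    def __init__(self, recorded_decisions):
        self._recorded = list(recorded_decisions)
        self._cursor = 0  # position of the next expected decision

    def decide(self, features):
        """Check the queried features against the record, then replay."""
        if self._cursor >= len(self._recorded):
            raise ReplayMismatch("more decisions requested than recorded")
        expected = self._recorded[self._cursor]
        for key in self.FEATURE_KEYS:
            if features[key] != expected[key]:
                raise ReplayMismatch(f"decision {self._cursor}: {key} differs")
        self._cursor += 1
        # Assumption: the recorded "index" is what the simulator expects back.
        return expected["index"]

    def finish(self, simulated_delta_scores, recorded_delta_scores):
        """Check that the round ended where and how the record says."""
        if self._cursor != len(self._recorded):
            raise ReplayMismatch("simulation ended before the record did")
        if list(simulated_delta_scores) != list(recorded_delta_scores):
            raise ReplayMismatch("delta_scores differ")


def features_of(decision):
    """Extract the feature part of a recorded decision (made-up helper)."""
    return {key: decision[key] for key in ReplayModel.FEATURE_KEYS}


# Made-up, shortened decisions; real feature vectors are much longer.
recorded = [
    {"sparse": [2, 5], "numeric": [0], "progression": [0],
     "candidates": [144, 7], "index": 1},
    {"sparse": [2, 6], "numeric": [0], "progression": [0, 9],
     "candidates": [12], "index": 0},
]

model = ReplayModel(recorded)
# A correct simulator would present exactly the recorded features, in order.
replayed = [model.decide(features_of(d)) for d in recorded]
model.finish([0, 0, 0, 0], [0, 0, 0, 0])
print(replayed)
```

Each of the three failure modes named above maps to one check: a feature comparison catches wrong features and wrong ordering, the cursor check catches a simulation that asks for too many or too few decisions, and the `delta_scores` comparison catches a wrong result.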

This testing framework is very powerful and has allowed me to find many bugs in annotation and simulation so far. In addition, because I have been running it on a set of test data consisting of more than 100 million rounds, I have even uncovered a large number of unstated corner cases in the standard rules of Mahjong Soul.

Appendix: JSON Schema for Test Data

{
  "type": "object",
  "required": [
    "uuid",
    "room",
    "style",
    "players",
    "rounds"
  ],
  "properties": {
    "uuid": {
      "type": "string"
    },
    "room": {
      "type": "integer",
      "minimum": 0,
      "maximum": 4
    },
    "style": {
      "type": "integer",
      "minimum": 0,
      "maximum": 1
    },
    "players": {
      "type": "array",
      "minItems": 4,
      "maxItems": 4,
      "items": {
        "type": "object",
        "required": [
          "grade",
          "final_ranking",
          "final_score"
        ],
        "properties": {
          "grade": {
            "type": "integer",
            "minimum": 0,
            "maximum": 15
          },
          "final_ranking": {
            "type": "integer",
            "minimum": 0,
            "maximum": 3
          },
          "final_score": {
            "type": "integer"
          }
        }
      }
    },
    "rounds": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": [
          "paishan",
          "decisions",
          "delta_scores"
        ],
        "properties": {
          "paishan": {
            "type": "array",
            "minItems": 136,
            "maxItems": 136,
            "items": {
              "type": "integer",
              "minimum": 0,
              "maximum": 36
            }
          },
          "decisions": {
            "type": "array",
            "minItems": 1,
            "items": {
              "type": "object",
              "required": [
                "sparse",
                "numeric",
                "progression",
                "candidates",
                "index"
              ],
              "properties": {
                "sparse": {
                  "type": "array",
                  "minItems": 16,
                  "maxItems": 21,
                  "items": {
                    "type": "integer",
                    "minimum": 0
                  }
                },
                "numeric": {
                  "type": "array",
                  "minItems": 6,
                  "maxItems": 6,
                  "items": {
                    "type": "integer"
                  }
                },
                "progression": {
                  "type": "array",
                  "minItems": 1,
                  "items": {
                    "type": "integer",
                    "minimum": 0
                  }
                },
                "candidates": {
                  "type": "array",
                  "minItems": 1,
                  "items": {
                    "type": "integer",
                    "minimum": 0
                  }
                },
                "index": {
                  "type": "integer",
                  "minimum": 0
                }
              }
            }
          },
          "delta_scores": {
            "type": "array",
            "minItems": 4,
            "maxItems": 4,
            "items": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}
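A few of the size and range constraints in this schema can be spot-checked with the standard library alone, as in the sketch below. The function `check_record` is a hypothetical helper that covers only a subset of the schema. A general-purpose JSON Schema validator, such as the third-party `jsonschema` package, would be the more thorough choice.

```python
def check_record(record):
    """Hypothetical helper: return a list of schema violations (subset only)."""
    problems = []
    for key in ("uuid", "room", "style", "players", "rounds"):
        if key not in record:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    if len(record["players"]) != 4:
        problems.append("players must have exactly 4 entries")
    if not record["rounds"]:
        problems.append("rounds must not be empty")

    for r, game_round in enumerate(record["rounds"]):
        paishan = game_round.get("paishan", [])
        if len(paishan) != 136:
            problems.append(f"round {r}: paishan must have exactly 136 tiles")
        if any(not 0 <= tile <= 36 for tile in paishan):
            problems.append(f"round {r}: paishan tile outside 0..36")
        if len(game_round.get("delta_scores", [])) != 4:
            problems.append(f"round {r}: delta_scores must have 4 entries")
        decisions = game_round.get("decisions", [])
        if not decisions:
            problems.append(f"round {r}: decisions must not be empty")
        for d, decision in enumerate(decisions):
            if not 16 <= len(decision.get("sparse", [])) <= 21:
                problems.append(f"round {r}, decision {d}: sparse needs 16..21 items")
            if len(decision.get("numeric", [])) != 6:
                problems.append(f"round {r}, decision {d}: numeric needs 6 items")
    return problems


# A made-up record that satisfies the constraints checked above.
decision = {"sparse": list(range(16)), "numeric": [0] * 6,
            "progression": [0], "candidates": [144], "index": 0}
good = {
    "uuid": "example",
    "room": 3,
    "style": 1,
    "players": [{"grade": 6, "final_ranking": i, "final_score": 25000}
                for i in range(4)],
    "rounds": [{"paishan": [0] * 136, "decisions": [decision],
                "delta_scores": [0, 0, 0, 0]}],
}
print(check_record(good))
```

Returning a list of messages instead of raising on the first violation makes it easier to see everything that is wrong with a generated file in one pass.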