The rules seem weird - Martin Gardner had the same matchbox self-learning robot for playing hexapawn
<a href="https://en.wikipedia.org/wiki/Hexapawn" rel="nofollow">https://en.wikipedia.org/wiki/Hexapawn</a>
It had fewer states, so it fit in 20+ matchboxes filled with candy, and the rules are:
1. If it wins, nothing is changed
2. If it loses, you take the last move it made (the one that resulted in the loss) and eat the candy, thus cutting this move from the possible move graph<p>This way every lost game is guaranteed to improve the engine, while in this MENACE example a draw introduces unnecessary noise by bringing back moves, and the punishment for a loss seems unnecessarily harsh: it removes EVERY MOVE played, which may cut out the best strategy
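<p>For what it's worth, the two hexapawn rules above could be sketched roughly like this. The names (`boxes`, `history`, `update_after_game`) and the string labels for states and moves are made up for illustration, and it doesn't cover what happens when a box runs out of candy:

```python
from typing import Dict, List, Tuple

Boxes = Dict[str, List[str]]     # state -> moves still allowed (the candies in that box)
History = List[Tuple[str, str]]  # (state, move) pairs the machine played, in order


def update_after_game(boxes: Boxes, history: History, machine_won: bool) -> None:
    """Apply the two rules: a win changes nothing, a loss removes only the last move."""
    if machine_won or not history:
        return  # rule 1: nothing is changed
    state, move = history[-1]  # rule 2: only the final move is punished
    if move in boxes.get(state, []):
        boxes[state].remove(move)  # "eat the candy"


# Hypothetical two-box example: the machine played "a" then "c" and lost.
boxes = {"s0": ["a", "b"], "s1": ["c", "d"]}
history = [("s0", "a"), ("s1", "c")]
update_after_game(boxes, history, machine_won=False)
```

After the loss only the box for `s1` shrinks; the earlier move `a` in `s0` stays available, which is the difference from removing a bead for every move played.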