Improved chess game compression (2018)

109 points by psuter about 5 years ago

11 comments

dzdt about 5 years ago
This reminds me of how, many years ago, I learned about the world record holder for computer optical character recognition (OCR) accuracy.

The computer scientists took as a target an eastern European chess journal which printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine in the recognition step, estimating the likelihood of next moves and combining that with the OCR engine's likelihood estimate that the printed characters were particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the sensible game moves that were allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
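
(A minimal sketch of the idea in the comment above: weight each OCR reading by a chess engine's prior over legal moves and take the most plausible combination. All names and the toy probability tables here are invented for illustration, not taken from the OCR project described.)

```python
def best_reading(ocr_candidates, move_prior):
    """Pick the reading maximizing P(glyphs | move) * P(move).

    ocr_candidates: dict mapping a candidate move string to the OCR
        engine's probability that the printed glyphs spell that move.
    move_prior: dict mapping each *legal* move to the chess engine's
        estimate of how likely a player is to choose it here.
    """
    scored = {
        move: ocr_p * move_prior.get(move, 0.0)  # illegal moves get prior 0
        for move, ocr_p in ocr_candidates.items()
    }
    return max(scored, key=scored.get)

# A smudged print: OCR slightly prefers "Ne5" over "Nf5", but only one
# of the candidates is a legal, sensible move in the position.
ocr_candidates = {"Ne5": 0.40, "Nf5": 0.35, "Nc5": 0.25}
move_prior = {"Nf5": 0.30, "Nc5": 0.05}   # "Ne5" is illegal here
print(best_reading(ocr_candidates, move_prior))  # -> "Nf5"
```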
willvarfar about 5 years ago
Hmm, a lot of effort to make documents small, but then storing it in mongodb?!?!

If size and performance are a focus, just store them in a normal sorted table with compression (e.g. leveldb, or mysql using rocksdb).

This means all these small documents can be compressed with repetition between games and not just within each game.

And probably much, much faster and simpler etc.

Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.
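
(A minimal sketch of the point above: compressing many similar games as one stream lets the compressor exploit repetition *between* games, which per-document compression cannot. The PGN snippet is made up; the exact ratio will vary.)

```python
import gzip

game = b"1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7\n"
games = [game] * 1000  # many games sharing the same popular opening

per_doc = sum(len(gzip.compress(g)) for g in games)  # compress each alone
whole = len(gzip.compress(b"".join(games)))          # compress one stream

print(f"per-document: {per_doc} bytes, whole-stream: {whole} bytes")
# Expect whole-stream to be dramatically smaller: every game after the
# first is nearly free once the opening is in the compressor's window.
```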
thom about 5 years ago
I have always loved just how much of computer science you could learn by doing nothing but working on chess your entire life.
SethTro about 5 years ago
The article doesn't say it anywhere, but I sure hope all the games are backed up to disk/tape in txt format and that this is just used to keep in-memory DB/memcache size down.

Otherwise an interesting article about injecting domain knowledge to improve beyond what's possible with a normal compression algorithm.
jameshart about 5 years ago
The basis of this algorithm is to rank the possible moves from the current position, then use that to choose a Huffman encoding. In essence, they use a very naive single-move-lookahead chess AI to quickly rank moves, giving them a crude measure of how ‘surprising’ a particular move would be at that point in the game.

Interesting question: if you just generated the bit string that corresponded to taking the ‘most obvious move’ according to their heuristics, what game would that play out? In a way, that would be the ‘most obvious chess game’, or perhaps the ‘least creative chess game’...

In theory a more optimal version would be to use a more sophisticated AI to rank the next moves, and even to choose how aggressively to Huffman-code the possible options.

In a sense this would measure the information content of a chess game (at least relative to the ‘knowledge’ of that AI). I wonder what the variance in the number of bits needed to encode various games would tell you about the play style of different players, or the nature of the game, or even the relationship between information theory and creativity....
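
(A rough sketch of that thought experiment, assuming the python-chess library. lichess's real encoder uses its own move-ordering heuristics; here a toy scorer, captures first and then alphabetical UCI, stands in for it. Decoding the all-zeros bit string means always taking the rank-0 move, which plays out the ‘most obvious game’ under this stand-in heuristic.)

```python
import chess

def ranked_moves(board):
    # Toy stand-in for the real move-ordering heuristic:
    # prefer captures, break ties alphabetically by UCI string.
    def score(move):
        return (not board.is_capture(move), move.uci())
    return sorted(board.legal_moves, key=score)

board = chess.Board()
moves = []
while not board.is_game_over() and len(moves) < 60:  # cap for safety
    obvious = ranked_moves(board)[0]  # rank 0 == shortest Huffman code
    moves.append(board.san(obvious))
    board.push(obvious)

print(" ".join(moves))  # the "least creative" game under this heuristic
```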
jhrmnn about 5 years ago
This nicely illustrates the point of the Hutter Prize, that efficient compression comes from understanding the content on a nontrivial level.

https://en.wikipedia.org/wiki/Hutter_Prize
6510 about 5 years ago
The too-slow approach is fun. If you freeze and use a reasonable engine, a bit string could represent which moves match the engine move (let's call it block 1). The missing moves could be 3-bit numbers like 000 for the second-best engine move, 001 for 3rd, 010 for 4th, 011 for 5th, 100 for 6th best move, 101 for 7th, 110 for 8th and 111 for other moves (block 2). The other moves are simply put like E4, E24 or NF3G5 (block 3).
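
(A minimal sketch of the three-block scheme described above. The `rank_moves(position)` oracle is hypothetical, standing in for the frozen engine's best-first move list; block 1 flags engine-matching moves, block 2 holds 3-bit ranks for near-misses, block 3 spells out everything else as text.)

```python
def encode_game(played_moves, positions, rank_moves):
    block1, block2, block3 = [], [], []
    for move, pos in zip(played_moves, positions):
        ranking = rank_moves(pos)           # best-first list of moves
        if move == ranking[0]:
            block1.append("1")              # matched the engine's top move
        else:
            block1.append("0")
            rank = ranking.index(move) if move in ranking else -1
            if 1 <= rank <= 7:
                block2.append(format(rank - 1, "03b"))  # 000..110
            else:
                block2.append("111")        # escape marker: see block 3
                block3.append(move)         # stored as plain text
    return "".join(block1), "".join(block2), block3
```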
steerablesafe about 5 years ago
As I understand it, this uses the Huffman code for all moves, including the opening moves. Alternatively, statistics could be gathered for all first moves, all second moves, etc., and then a different Huffman code could be applied for opening moves.

I wouldn't be surprised if the statistics for the first few moves were significantly different from the moves deep into the game.
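
(A small sketch of that suggestion: gather rank statistics separately per ply, so openings, where a handful of moves dominate, can get their own Huffman table. `games_as_ranks` is a hypothetical corpus where each game is already reduced to its per-move rank indices.)

```python
from collections import Counter

def per_ply_histograms(games_as_ranks, max_ply=10):
    hists = [Counter() for _ in range(max_ply)]
    for game in games_as_ranks:
        for ply, rank in enumerate(game[:max_ply]):
            hists[ply][rank] += 1
    return hists  # feed each Counter to a Huffman builder, one per ply

# e.g. ply 0 might be almost all ranks 0-2 (e4, d4, Nf3), while ply 20
# is far flatter -- which would justify different code tables.
```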
gowld about 5 years ago
> Our database runs on 3 servers (1 primary, 2 secondaries), each with two 480GB SSDs in a RAID 1 configuration

Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for a handful of 1TB SSDs.

Converting the data to the new format cost more than upgrading disks.
SamReidHughes about 5 years ago
Nice. A touchy balance between overengineering and too much overengineering. Maybe you could better compress combinations of moves, like exchanges.<p>If you also store move times, there’s not much of a win to this.
bumbledraven about 5 years ago
The article includes a brief but clear explanation of Huffman coding. I didn't understand how it works until now!