> In addition, the agent learns to exploit a flaw in the emulator to make a key re-appear at minute 4:25 of the video

After a bit of debugging, this appears to be a very intentional feature in the game rather than a flaw. That key appears after a while if you're not in the room (and don't have one).

Based on this disassembly: http://www.bjars.com/source/Montezuma.asm

Here's the relevant code with some annotations added:

https://goo.gl/VUDr9F

I'm not sure if this is a previously known feature in the game (a quick Google search does not reveal much). It would be quite interesting if the RL agent was the first to find it!

PS: If you launch MAME with the "-debug" option and press CTRL+M you can see the whole memory (the Atari 2600 only has 128 bytes!!) while playing the game. If you keep an eye on the byte at 0xEA you will know when the key is about to pop up.
Alternatively you can speed things along by changing it yourself to a value just below 0x3F.
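For anyone who would rather watch that byte programmatically than through MAME's debugger, here is a minimal sketch using the classic Gym/ALE Atari bindings. The env id, the old 4-tuple step API, and the reading of 0x3F as the respawn threshold are my assumptions for illustration, not something taken from the disassembly.

```python
# Minimal sketch: poll the Atari 2600 RAM byte at console address 0xEA while
# an agent (here: random actions) plays Montezuma's Revenge. The 2600's 128
# bytes of RAM are mapped at addresses 0x80-0xFF, so address 0xEA is index
# 0x6A in the 128-byte array that ALE exposes.
import gym

env = gym.make("MontezumaRevenge-v4")
env.reset()

done = False
while not done:
    _, _, done, _ = env.step(env.action_space.sample())  # random play, purely for illustration
    ram = env.unwrapped.ale.getRAM()        # numpy array of all 128 RAM bytes
    key_timer = ram[0xEA - 0x80]            # the counter mentioned above
    if key_timer >= 0x3E:                   # assuming the key pops up around 0x3F
        print("key about to re-appear, timer =", hex(key_timer))
env.close()
```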
One thing that strikes me in almost all of these otherwise impressive demonstrations is the apparently bizarre "jitter" movement while waiting for a door to open or a path to clear in the game. Clearly there is no fitness in quietly waiting.

It is darkly humorous to contrast Hollywood's or sci-fi's "killer AI robots" that methodically hunt you down with these real-world demonstrations of emerging AI. Maybe the first "killer AI robots" would exhibit similarly bizarre behaviors while they methodically hunt down the unlikely hero. :-)
"By multiplying N of these probabilities together, we end up with the resulting probability p(get key) that is exponentially smaller than any of the individual input probabilities."<p>So they solved this by feeding the AI with a human demonstration, but have there been any attempts at giving the AI an explicit reward for maximising the "novelty" of the input state (i.e. the image on the screen)?<p>The game does not give the player points for reaching new rooms, but if the AI was rewarded for producing the "novel" state of a new room, then that would give it a drive to explore. Similarly, there would be an implicit penalty to the AI for repeatedly falling off a ledge or returning back to a room it had already visited (although some amount of back-tracking would no doubt be useful), whereas reaching a new part of the screen (by climbing a ladder, say) would be rewarded.<p>There are times where the AI would have to be patient and wait, but the window could be learned or set as a hyper-parameter. This might be enough to stop the unproductive behaviour of it jittering left and right continuously, since doing so does not produce a new state, relative to just standing still at least.
Montezuma's Revenge is one of the more impressive "3D conversions" done by our Apple II emulator [1].

[1] https://paleotronic.com/wp-content/uploads/2018/05/5.png
It's an impressive achievement, but it does seem to get stuck at times, like from around 1:35 to 2:10 and from 3:45 to 4:30 (irritatingly on the edge of two screens). That second time actually resulted in a new key showing up, which the article says was a flaw in the emulation that the agent was exploiting.

Interesting that their approach didn't work for Pitfall (never played Gravitar).
If I'm understanding this right, the AI wasn't given a "full demonstration" of the game, but specific frame snapshots at goal-completion points. So it basically learned how to get from goal A to goal B, but it had to be given examples of what goal A and goal B looked like visually.

In other words, it was shown what beating the game would *look like* at some level of granularity. I guess the next obvious question is how far you could dial up the granularity and still have the AI learn to beat the game.
>> The exploration problem can largely be bypassed in Montezuma’s Revenge by starting each RL episode by resetting from a state in a demonstration. By starting from demonstration states, the agent needs to perform much less exploration to learn to play the game compared to when it starts from the beginning of the game at every episode. Doing so enables us to disentangle exploration and learning.

Or in other words: use the Domain Knowledge, Luke. Quit trying to learn everything from scratch. Because that's just dumb.
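For what it's worth, the quoted trick is a bit more than just injecting domain knowledge: it's a curriculum over demonstration states. A rough sketch of what such a reset-from-demo loop could look like (the data class, callbacks, and thresholds below are placeholders of mine, not OpenAI's actual code):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DemoPoint:
    emulator_state: bytes       # saved emulator snapshot at this point of the demo
    remaining_return: float     # score the demonstrator collected from here to the end

def demo_reset_curriculum(
    demo: List[DemoPoint],
    restore: Callable[[bytes], None],   # restores the emulator to a saved state
    rollout: Callable[[], float],       # runs one RL episode (with learning), returns its score
    episodes_per_stage: int = 1000,
    required_success_rate: float = 0.2,
) -> None:
    """Start episodes late in the demo; move the starting point earlier once the
    agent roughly matches the demonstrator's return from the current state."""
    start_idx = len(demo) - 1
    while start_idx >= 0:
        successes = 0
        for _ in range(episodes_per_stage):
            restore(demo[start_idx].emulator_state)
            if rollout() >= demo[start_idx].remaining_return:
                successes += 1
        if successes / episodes_per_stage >= required_success_rate:
            start_idx -= 1   # the agent can handle this tail of the game; back up one step
```

The nice property is that the agent only ever has to improve locally: from any starting point in the curriculum it has already learned (in earlier stages) how to finish the rest of the game.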
> Our agent playing Montezuma’s Revenge. The agent achieves a final score of 74,500 over approximately 12 minutes of play (video is double speed). Although much of the agent’s game mirrors our demonstration, the agent surpasses the demonstration score of 71,500 by picking up more diamonds along the way.

How well would this adapt if the map/layout changed, then?
Please call me again when they actually solve the exploration problem instead of falling back on a good example.

People who beat this game do not do it based on a Let's Play video.
Not being familiar with the game, I thought Montezuma's Revenge was something entirely different.

https://en.wikipedia.org/wiki/Traveler%27s_diarrhea