OpenAI Five

646 points by gdb · almost 7 years ago

28 comments

boulos · almost 7 years ago
Disclosure: I work on Google Cloud (and vaguely helped with this).

For me, one of the most amazing things about this work is that a small group of people (admittedly well funded) can show up and do what used to be the purview of only giant corporations.

The 256 P100 optimizers are less than $400/hr. You can rent 128,000 preemptible vCPUs for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.

In days past, researchers would never have had access to this kind of computing unless they worked for a national lab. Now it's just a budgetary decision. We're getting closer to a (more) level playing field, and this is a wonderful example.
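For concreteness, a quick sketch of how those figures add up, using the commenter's own numbers rather than current GCP list prices (the "support hardware" remainder is assumed to round out the estimate):

    # Back-of-the-envelope cluster cost, using the figures quoted above
    p100_optimizers = 400      # 256 preemptible P100s, ~$400/hr
    preemptible_vcpus = 1280   # 128,000 preemptible vCPUs, ~$1280/hr
    support_hardware = 820     # assumed remainder: extra GPUs, networking, etc.

    total_per_hr = p100_optimizers + preemptible_vcpus + support_hardware
    weekend = 48               # hours, "ran for just a weekend"
    print(f"~${total_per_hr}/hr, ~${total_per_hr * weekend:,} for a 48-hour run")
    # ~$2500/hr, ~$120,000 for a 48-hour run
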
naturalgradient · almost 7 years ago
So as someone working in reinforcement learning who has used PPO a fair bit, I find this quite disappointing from an algorithmic perspective.

The resources used for this are almost absurd, and my suspicion is, especially considering [0], that this comes down to an incredibly expensive random search in the policy space. Or rather, I would want to see a fair bit of analysis to be shown otherwise.

Especially given all the work in intrinsic motivation, hierarchical learning, subtask learning, etc., the sort of intermediate summary of most of these papers from 2015-2018 is that so many of these newer heuristics are too brittle/difficult to make work, so we resort to slightly-better-than brute force.

[0] https://arxiv.org/abs/1803.07055
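For readers who haven't worked with it, the core of PPO is the clipped surrogate objective; a minimal NumPy sketch for illustration (not OpenAI's implementation):

    import numpy as np

    def ppo_clip_loss(ratio, advantage, eps=0.2):
        # ratio = pi_new(a|s) / pi_old(a|s) for the sampled actions,
        # advantage = estimated advantage of those actions
        unclipped = ratio * advantage
        clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
        # PPO maximizes the elementwise minimum; return its negative as a loss
        return -np.mean(np.minimum(unclipped, clipped))
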
gakos · almost 7 years ago
This article (like pretty much all from OpenAI) is really well done. I love the format and supporting material - it makes it way more digestible and fun to read compared to something from arXiv. The video breakdowns really drive the results home.
ufo · almost 7 years ago
This is a really interesting writeup, especially if you know a bit more about how Dota works.

That it managed to learn creep blocking from scratch was really surprising to me. To creep block you need to go out of your way to stand in front of the creeps and consciously keep doing so until they reach their destination. Creep blocking just a bit is almost imperceptible; you need to do it all the way to get a big reward out of it.

I also wonder if their reward function directly rewarded good lane equilibrium or if that came indirectly from the other reward functions.
minimaxir · almost 7 years ago
They are using preemptible CPUs/GPUs on Google Compute Engine for model training? Interesting. The big pro of that is cost efficiency, which isn't something I expected OpenAI to be optimizing. :P

How does training RL with preemptible VMs work when they can shut down at any time with no warning? A PM of that project asked me the same question a while ago (https://news.ycombinator.com/item?id=14728476) and I'm not sure model checkpointing works as well for RL. (Maybe after each episode?)
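One plausible answer (an assumption about how preemption could be handled, not a description of OpenAI's actual setup) is ordinary periodic checkpointing of model and optimizer state, so a preempted worker only loses the steps since its last save:

    import torch

    def save_checkpoint(path, model, optimizer, step):
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, path)

    def load_checkpoint(path, model, optimizer):
        state = torch.load(path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]

    # A training loop would call save_checkpoint() every N updates (or each
    # episode); on restart after preemption it resumes from the latest file.
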
bobcostas55 · almost 7 years ago
> OpenAI Five does not contain an explicit communication channel between the heroes' neural networks. Teamwork is controlled by a hyperparameter we dubbed "team spirit". Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five's heroes should care about its individual reward function versus the average of the team's reward functions. We anneal its value from 0 to 1 over training.

A bit disappointing, it would be very cool to see what kind of communication they'd develop.
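Reading the quoted description literally, the mixing can be written as a one-liner (a sketch of what the paragraph describes, not OpenAI's actual code):

    import numpy as np

    def mixed_rewards(individual_rewards, team_spirit):
        # team_spirit = 0: purely selfish; team_spirit = 1: every hero
        # optimizes the team-average reward
        r = np.asarray(individual_rewards, dtype=float)
        return (1.0 - team_spirit) * r + team_spirit * r.mean()

    # e.g. mixed_rewards([1, 0, 0, 0, 0], team_spirit=0.5) -> [0.6, 0.1, 0.1, 0.1, 0.1]
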
hsrada · almost 7 years ago
I wanted to add the observation that all the restricted heroes are ranged. Necrophos, Sniper, Viper, Crystal Maiden, and Lich.<p>Since playing a lane as a ranged hero is very different from playing the same lane as a melee hero, I wonder whether the AI has learned to play melee heroes yet.
foobaw · almost 7 years ago
I've played DotA for over 10 years, so this development is quite relevant to me. So excited to see this next month!

Although it's extremely impressive, all the restrictions will definitely make this less appealing to the audience (shown in the Reddit thread comments).
eslaught · almost 7 years ago
> Partially-observed state. Units and buildings can only see the area around them. The rest of the map is covered in a fog...

Actually, this is true on multiple levels. There is fog of war, but then there is the fact that a human player can only look at a given window of the game at a time, and has to pan the window to see the area away from their character. (The mini-map shows some level of detail for the rest of the map, but isn't high resolution and doesn't show everything that might be of interest.) Also, you can only issue orders on what is directly visible to you, so if you pan away from your character that restricts what you can do.

Is OpenAI Five modeling this aspect of the game? Otherwise it's still "cheating" in some sense vs. how a human would be forced to play.
jakecrouch · almost 7 years ago
While this is a cool result, I wonder if the focus on games rather than real-world tasks is a mistake. It was a sign of past AI hype cycles when researchers focused their attention on artificial worlds - SHRDLU in 1970, Deep Blue for chess in the late 1990s. We may look back in retrospect and say that the attention DeepMind got for winning Go signaled a similar peak. The problem is that it's too hard to measure progress when your results don't have economic importance. It's clearer that the progress in image processing was important because it resulted in self-driving cars.
d0m · almost 7 years ago
Will one agent control all 5 players, or will each agent control a single player?

One of the hard challenges of DOTA is whether or not to "trust" your teammate to do the right action. I.e., one can aggressively go for a kill knowing that their support will back them up.. but one can also aggressively go for a kill while their support lets them die, and then the whole team starts blaming and tilting because the dps "threw". It's a fine balance.. From personal experience, it seems like in lower leagues it's better to always assume that you're by yourself, whereas in higher leagues you can start expecting more team plays.

Another example: often many players will use their ultimate ability at the same time, "wasting" it. It would be easy for an agent controlling all 5 players to avoid this.. but how would an individual agent know whether or not to use its ult? Are the agents able to communicate with each other? If so, is there a cap on how fast they can do it? I.e., on voice, it takes a few seconds to give orders.
obastani · almost 7 years ago
I think this is quite impressive. I'm a bit confused about the section saying that "binary rewards can give good performance". Is it saying that binary rewards (instead of continuous rewards) work fine, but end-of-rollout rewards (instead of intermediate rewards such as kills) work poorly?
mooneater · almost 7 years ago
I want to see this datapoint on their AI and Compute chart: https://blog.openai.com/ai-and-compute/
loser777 · almost 7 years ago
> Each of OpenAI Five's networks contain a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve's Bot API)

This will likely dramatically simplify the problem vs. what the DeepMind/Blizzard framework does for StarCraft II, which provides a game state representation closer to what a human player would actually see. I would guess that the action API is also much more "bot-friendly" in this case, i.e., it does not need to do low-level actions such as boxing to select.
KPLauritzen · almost 7 years ago
Wow, very excited about this. I don't know too much about RL, but for me the "170,000 possible actions per hero" seems far too large an output space to be feasible. What happens if the bot wants to do an invalid action? Nothing, or some penalty for selecting something invalid?
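A third option, common in RL on games with large discrete action spaces (only an assumption here, since the post doesn't spell out how invalid actions are handled), is to mask invalid actions out of the policy's distribution so they can never be sampled at all:

    import numpy as np

    def masked_action_sample(logits, valid_mask, rng=np.random.default_rng()):
        # Set logits of invalid actions to -inf, then sample from the softmax
        # over the remaining (valid) actions.
        masked = np.where(valid_mask, logits, -np.inf)
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        return rng.choice(len(logits), p=probs)
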
KillcodeX · almost 7 years ago
OpenAI is cover-up AI research for the CIA. The main goal will be to kill innocent folks with this type of AI research. These folks are working for the CIA without noticing the involvement of The Spy Agency. They are ostensibly private institutions and businesses which are in fact financed and controlled by the CIA. From behind their commercial and sometimes non-profit covers, the agency is able to carry out a multitude of clandestine activities, usually covert-action operations. Many of the firms are legally incorporated in Delaware because of that state's lenient regulation of corporations, but the CIA has not hesitated to use other states when it found them more convenient. The NSA/CIA's best-known proprietaries are Amazon, Facebook, Microsoft, Palantir, OpenAI (cover-up AI research via a non-profit) and Google... Good luck with working inside military research without decoding the source of funding.
nerdponx · almost 7 years ago
Are those 180 years of games "seeded" by real games, or was it entirely self-play?

Also, how does this system cope with gameplay changes that arise when the game is patched? It's no news to any experienced Dota player that even small changes can have a major impact on the metagame and on winning strategy. Would it need to be re-trained every patch?
ericsoderstrom · almost 7 years ago
What are the 170,000 discrete actions?

Rough guesses for available actions:

      32 (directions for movement)
    + 10 (spell/item activations)
    * 20 (potential targets: heroes + nearby creeps)
    + 15 (attack commands: 5 enemy heroes and ~10 nearby creeps)

Which still leaves... approximately 170,000 actions unaccounted for
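Taking the guess at face value (10 activations times ~20 targets, plus movement and attack commands), the tally comes to only a few hundred actions, which is the commenter's point:

    movement_directions = 32
    targeted_abilities = 10 * 20   # 10 spell/item activations x ~20 potential targets
    attack_commands = 15           # 5 enemy heroes + ~10 nearby creeps

    print(movement_directions + targeted_abilities + attack_commands)  # 247, vs ~170,000 quoted
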
formalsystem · almost 7 years ago
Any thoughts from the Dota team on how drafting heroes will work by the time we get to TI? Am also curious if you've seen more experimental drafts in early results that aren't as popular in the pro scene.
yazr · almost 7 years ago
Any thoughts from the DOTA team on handling a world map which is not bounded in size?

In my projects, the "world" size can change (unlike Go or Chess, where the board size is fixed).

Is the DotA board size fixed?

I guess the LSTM encodes the board history as seen by the agent. But this probably slows the learning.

Some people have suggested an auto-encoder to compress the world, and then feed it to a regular CNN.

Any comments would be welcome.
inverse_pi · almost 7 years ago
I'm a Legend dota2 player and also a Machine Learning researcher, and I'm *fascinated* by this result. The main message I take away is that we might already have powerful enough methods (in terms of learning capabilities), and we're limited by hardware (this also makes me a little sad). My thoughts:

1) "At the beginning of each training game, we randomly 'assign' each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game..." Combining this with "team spirit" (weighted combined reward - net worth, k/d/a), they were able to learn early game movement for position 4 (farming priority position). For a roaming position, identifying which lane to start out with, what timing to leave the lane for the biggest impact, and how to gank other lanes are very difficult. I'm very surprised that very complex reasoning can be learned from this simple setup.

2) Sacrificing the safe lane to control the enemy's jungle requires overcoming a local minimum (considering the rewards), and successfully assigning credit over a very, very long horizon. I'm very surprised they were able to achieve this with PPO + LSTM. However, one asterisk here: if we look at the draft - Sniper, Lich, CM, Viper, Necro - it is very versatile, with Viper and Necro able to play any lane. This draft is also very strong in the laning phase and mid game. Whoever wins Sniper's lane, and the laning phase in general, is probably going to win. So this makes it a little bit less of a local optimum. (In contrast to having some safe lane heroes that require a lot of farm.)

3) "Deviated from current playstyle in a few areas, such as giving support heroes (which usually do not take priority for resources) lots of early experience and gold." Support heroes are strong early game and don't require a lot of items to be useful in combat. Especially with this draft, CM with enough exp (or a blink, or good positioning) can solo kill almost any hero. So it's not too surprising if CM takes some farm early game, especially when Viper and Necro are naturally strong and don't need too much farm (they still do, but not as much as Sniper). This observation is quite interesting, but maybe not as completely new as it might sound.

4) "Pushed the transitions from early- to mid-game faster than its opponents. It did this by: (1) setting up successful ganks (when players move around the map to ambush an enemy hero - see animation) when players overextended in their lane, and (2) by grouping up to take towers before the opponents could organize a counterplay." I'm a little bit skeptical of this observation. I think with this draft, whoever wins the laning phase will be able to take the next objectives much faster. And winning the laning phase is really 1v1 skill, since both Lich and CM are not really roaming heroes. If you just look at their winning games and draw conclusions, it will be biased.

5) This draft also has very low mobility. All 5 heroes - Sniper, Lich, CM, Necro, Viper - share the weakness of low movement speed (except for maybe Lich). Also, none of these heroes can go at Sniper in the mid/late game, so if you have better positioning + reaction time, you'll probably win.

Overall, I think this is a great step and a great achievement (with some caveats, as noted above).
As far as next steps, I would love to see if they can try a meta-learned agent that doesn't have to train from scratch for a new draft. I would love to see them learn item building and courier usage instead of using scripts. I would also love to see them learn drafting (which can be simply phrased as a supervised problem). I'm pretty excited about this project; hopefully they release a white paper with some more details so we can try to replicate.
akeck · almost 7 years ago
This feels like Ender's Game without Ender.
andreyk · almost 7 years ago
Quite a good read! Impressive results, it seems. I still think it's much more useful to research learning complex things without absurd compute/sample inefficiency and various hacks, e.g. reward shaping (which, let's be honest, this seems to have a lot of), but these are still interesting results.
matachuan · almost 7 years ago
What are other killer applications of deep learning besides CV and game playing?
zawerf · almost 7 years ago
What's the estimated cost of a project like this?
wnevets · almost 7 years ago
The live 5v5 match at TI should be great to watch.
lawlessone · almost 7 years ago
> OpenAI Five plays 180 years worth of games against itself every day.

Human players do it in a fraction of their much smaller lifespans.