
[–]NikEy 168 points169 points  (16 children)

Hi guys, really fantastic work, extremely impressive!

I'm an admin at the SC2 AI discord and we had a few questions in our #research channel that you may hopefully be able to shed light on:

  1. From the earlier versions (and in fact, the current master version) of pysc2 it appeared that the DM development approach was based on mimicking human gameplay to the fullest extent, e.g. the bot was not even able to get info on anything outside of the screen-view. With this version you seemed to have relaxed these constraints, since feature layers are now "full map size" and new features have been added. Is that correct? If so, then how does this really differ from taking the raw data from the API and simply abstracting them into structured data as inputs for the NNs? The blog even suggests that you take raw data and properties directly as data in list form and feed it into the NNs - which seems to suggest that you're not really using feature layers anymore at all?
  2. When I was working with pysc2 it turned out to be an incredibly difficult problem to maintain knowledge of what has been built, is in-progress, has completed, and so on, since I had to pan the camera view all the time to get that information. How is that info kept within the camera_interface approach? Presumably a lot of data must still be available in full via raw data access (e.g. counts of unitTypeID, buildings, etc) even in camera_interface mode?
  3. How many games needed to be played out in order to get to the current level? Or in other words: how many games is 200 years of learning in your case?
  4. How well does the learned knowledge transfer to other maps? Oriol mentioned on discord that it "worked" on other maps, and that we should guess which one it worked best on, so I guess it's a good time for the reveal ;) In my personal observations AlphaStar did seem to rely quite a bit on memorized map knowledge. Is it likely that it could execute good wall-offs or proxy cheeses on maps that it has never seen before? What would be the estimated difference in MMR when playing on a completely new map?
  5. How well does it learn the concept of "save money for X", e.g. Nexus first. It is not a trivial problem, since if you learn from replays and take the non-actions (NOOPs) from the players into account, the RL algo will more often than not think that NOOP is the best decision at non-ideal points in the game. So how do you handle "save money for X" and do you exclude NOOPs in the learning stage?
  6. What step size did you end up using? In the blog you write that each frame of StarCraft is used as one step of input. However, you also mention an average processing time of 50ms, which would exceed real time (which requires < 46ms given 22.4fps). So do you request every step, or every 2nd, 3rd, maybe dynamic?

I have lots more questions, but I guess I'd better ask those in person next time ;)

Thanks!

[–]OriolVinyals[S] 67 points68 points  (1 child)

Re. 1: Yes, we did relax the view of the agent a bit, mostly due to computational reasons -- games without camera moves last for about 1000 moves, whereas games with camera moves (humans do spam a lot!) can be 2 to 3 times longer. We do use feature layers for the minimap, but for the screen you can think of the list of features as "transposing" that information. In fact, it turns out that even for processing images, treating each pixel independently as a list works quite well! See https://arxiv.org/abs/1711.07971

[–]OriolVinyals[S] 60 points61 points  (0 children)

Re. 2: Indeed, with the camera (and non-camera) interface, the agent has knowledge of what has been built, as we input this as a list (which is further processed by a Transformer neural network). In general, even if you don't keep such a list, the agent will know what has been built, as the agent's memory (the LSTM) keeps track of all previously issued actions and all the camera locations visited in the past.
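As a rough illustration of what "processing the unit list with a Transformer" can look like (this is a generic sketch with made-up sizes, not the actual AlphaStar architecture):

    import torch
    import torch.nn as nn

    # Made-up sizes: 64 units, 16 raw properties each, 128-dim embeddings.
    num_units, unit_feats, d_model = 64, 16, 128

    embed = nn.Linear(unit_feats, d_model)  # per-unit property embedding
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    units = torch.randn(1, num_units, unit_feats)   # one observation: a list of unit property vectors
    unit_embeddings = encoder(embed(units))          # self-attention over the set of units
    summary = unit_embeddings.mean(dim=1)            # pooled vector that could feed the LSTM memory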

[–]OriolVinyals[S] 57 points58 points  (0 children)

Re. 3: At an average duration of 10 minutes per game, this amounts to about 10 million games. Note, however, that not all agents were trained for as long as 200 years, that was the maximum amongst all the agents in the league.
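For reference, the back-of-the-envelope arithmetic behind that estimate:

    # 200 years of game time at ~10 minutes per game
    minutes = 200 * 365.25 * 24 * 60
    print(minutes / 10)   # ~10.5 million games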

[–]David_SilverDeepMind 57 points58 points  (0 children)

Re: 5

AlphaStar actually chooses in advance how many NOOPs to execute, as part of its action. This is learned first from supervised data, so as to mirror human play, and means that AlphaStar typically “clicks” at a similar rate to human players. This is then refined by reinforcement learning, which may choose to reduce or increase the number of NOOPs. So, “save money for X” can be easily implemented by deciding in advance to commit to several NOOPs.
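A minimal sketch of what "the delay is chosen as part of the action" can look like (the names and structure here are hypothetical, not the actual AlphaStar action space):

    from dataclasses import dataclass

    @dataclass
    class AgentAction:
        function_id: int   # which game action to issue (build, move, ...)
        arguments: tuple   # the action's arguments (targets, positions, ...)
        delay: int         # how many game steps to wait (NOOPs) before acting again

    # "Save money for X" then falls out naturally: the policy can commit in advance
    # to a long delay during which nothing is spent.
    save_up = AgentAction(function_id=0, arguments=(), delay=120)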

[–]OriolVinyals[S] 42 points43 points  (0 children)

Re. 6: We request every step, but the action, due to latency and several delays as you note, will only be processed after that step concludes (i.e., we play asynchronously). The other option would have been to lock the step, which makes the playing experience for the player not great : )

[–]OriolVinyals[S] 13 points14 points  (0 children)

Re. 4: See above for an answer.

[–]gwern 326 points327 points  (161 children)

  1. what was going on with APM? I was under the impression it was hard-limited to 180 APM by the SC2 LE, but watching, the average APM for AS seemed to go far above that for long periods of time, and the DM blog post reproduces the graphs & numbers mentioned without explaining why the APMs were so high.
  2. how many distinct agents does it take in the PBT to maintain adequate diversity to prevent catastrophic forgetting? How does this scale with agent count, or does it only take a few to keep the agents robust? Is there any comparison with the efficiency of the usual strategy of keeping historical checkpoints?
  3. what does total compute-time in terms of TPU & CPU look like?
  4. the stream was inconsistent. Does the NN run in 50ms or 350ms on a GPU, or were those referring to different things (forward pass vs action restrictions)?
  5. have any tests of generalizations been done? Presumably none of the agents can play different races (as the available units/actions are totally different & don't work even architecture-wise), but there should be at least some generalization to other maps, right?
  6. what other approaches were tried? I know people were quite curious about whether any tree searches, deep environment models, or hierarchical RL techniques would be involved, and it appears none of them were; did any of them make respectable progress if tried?

    Sub-question: do you have any thoughts about pure self-play ever being possible for SC2 given its extreme sparsity? OA5 did manage to get off the ground for DoTA2 without any imitation learning or much domain knowledge, so just being long games with enormous action-spaces doesn't guarantee self-play can't work...

  7. speaking of OA5, given the way it seemed to fall apart in slow turtling DoTA2 games or whenever it fell behind, were any checks done to see if the AS self-play led to similar problems, given the fairly similar overall tendencies of applying constant pressure early on and gradually picking up advantages?

  8. At the November Blizzcon talk, IIRC Vinyals said he'd love to open up their SC2 bot to general play. Any plans for that?

  9. First you do Go dirty, now you do Starcraft. Question: what do you guys have against South Korea?

[–]OriolVinyals[S] 133 points134 points  (83 children)

Re. 1: I think this is a great point and something that we would like to clarify. We consulted with TLO and Blizzard about APMs, and also added a hard limit on APM. In particular, we set a maximum of 600 APM over 5-second periods, 400 over 15-second periods, 320 over 30-second periods, and 300 over 60-second periods. If the agent issues more actions in such periods, we drop / ignore the actions. These values were taken from human statistics. It is also important to note that Blizzard counts certain actions multiple times in their APM computation (the numbers above refer to "agent actions" from pysc2, see https://github.com/deepmind/pysc2/blob/master/docs/environment.md#apm-calculation). At the same time, our agents do use imitation learning, which means we often see very "spammy" behavior. That is, not all actions are effective actions, as agents tend to spam "move" commands, for instance, to move units around. Someone already pointed this out in the reddit thread -- AlphaStar's effective APM (or EPM) was substantially lower. It is great to hear the community's feedback, as we have only consulted with a few people, and we will take all the feedback into account.
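A toy sketch of how such windowed caps could be enforced, using the thresholds quoted above (purely illustrative, not AlphaStar's actual code):

    from collections import deque

    CAPS = [(5, 600), (15, 400), (30, 320), (60, 300)]   # (window seconds, max APM over window)

    class ApmLimiter:
        def __init__(self):
            self.times = deque()   # timestamps (seconds) of accepted agent actions

        def allow(self, now):
            # forget anything older than the longest window
            while self.times and now - self.times[0] > 60:
                self.times.popleft()
            for window, apm in CAPS:
                in_window = sum(1 for t in self.times if now - t <= window)
                if in_window + 1 > apm * window / 60.0:   # APM cap converted to an action count
                    return False                           # action dropped / ignored
            self.times.append(now)
            return True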

Re. 5: We actually (unintentionally) tested this. We have an internal leaderboard for the AlphaStar, and instead of setting the map for that leaderboard to Catalyst, we left the field blank -- which meant that it was running on all Ladder maps. Surprisingly, agents were still quite strong and played decently, though not at the same level we saw yesterday.

[–]Mangalaiii 54 points55 points  (13 children)

  1. Dr. Vinyals, I would suggest that AlphaStar might still be able to exploit computer action speed over strategy there. 5 seconds in StarCraft can still be a long time, especially for a program with no explicit "spot" APM limit (during battles AlphaStar's APM regularly reached >1000). As an extreme example, under a 600 APM cap over 5-second windows, AS could still take all 50 of its allowed actions within a single second and none for the other 4 seconds -- a 3000 APM burst that still averages out to 600 APM over the window. Also, TLO may have been using a repeater keyboard, popular with the pros, which could throw off realistic measurements.

Btw, fantastic work.

[–]ILikeBigButtss 40 points41 points  (1 child)

The numbers for the TLO games and the Mana games need to be looked at separately. TLO's numbers are pretty funky, and it's pretty clear that he was constantly and consistently producing high amounts of garbage APM. He normally plays Zerg and is a significantly weaker Protoss player than Mana. TLO's high APM is quite clearly artificially inflated and much more indicative of the behavior of his equipment than of his actual play and intentional actions. Based on DeepMind's graphic, TLO's average APM almost surpasses Mana's peak APM.

The numbers when only MaNa and AlphaStar are considered are pretty indicative of the issue. The average APM numbers are much closer. AlphaStar was able to achieve much higher peak APM than Mana, presumably during combat. These high peak APM numbers are offset by lower numbers during macro stretches. It should also be noted that, due to the nature of its interface, AlphaStar had no need to perform many actions that are routine and common for human players.

The choice to combine TLO and Mana's numbers for the graph shown during the stream was misleading. The combined numbers look ok only because TLO's artificially high APM numbers hide Mana's numbers which paint a much more accurate picture of the APM disadvantage.

[–]AjarKeen 12 points13 points  (4 children)

Agreed. I think it would be worth taking a look at EAPM / APM ratios for human players and AlphaStar agents in order to better calibrate these limitations.

[–]Rocketshipz 20 points21 points  (3 children)

And even here, there's the problem that AlphaStar is potentially still far more precise.

The problem is that this encourages "cheesy" behaviors rather than longer-term strategies. I'm basically afraid that the agent will get stuck in strategies that rely on its superhuman micro, which makes it much less impressive, because a human couldn't do this even if he thought of it.

Note that this wasn't the case with the other game agents such as AlphaGo and AlphaZero, which didn't play in real time, or even OpenAI's Dota bot, which is actually correctly capped iirc.

[–]EvgeniyZh 12 points13 points  (1 child)

AS could theoretically take 50 actions in 1 second, resulting in an average of 50/5*60 = 600 APM over that 5-second period.

[–]Ape3000 19 points20 points  (1 child)

  1. I would be very interested to see if the AI would still be good even if the APM was hard limited to something like 50, which is clearly worse than human level. Would it still beat humans with superior strategy and decision making?

Also, I would like to see how two unlimited AlphaStars would play against each other. Superhuman >2000 APM micro would probably be insane and very cool looking.

[–]LH_Hyjal 12 points13 points  (0 children)

Hello! Thank you for the great work.

I wonder if you considered the inaccuracy of human inputs. We saw AlphaStar do some crazily precise macro because it never misclicks, whereas human players are unlikely to precisely select every unit they want to control.

[–]starcraftdeepmind 113 points114 points  (54 children)

In particular, we set a maximum of 600 APMs over 5 second periods, 400 over 15 second periods, 320 over 30 second periods, and 300 over 60 second period.

Statistics aside, it was clear from the gamers', presenters', and audience's shocked reaction to the Stalker micro -- all saying that no human player in the world could do what AlphaStar was doing. Leaning on beside-the-point statistics is obfuscation and avoids acknowledging this.

AlphaStar wasn't outsmarting the humans—it's not like TLO and MaNa slapped their foreheads and said, "I wish I'd thought of microing Stalkers that fast! Genius!"

Postscript Edit: Aleksi Pietikäinen has written an excellent blog post on this topic. I highly recommend it. A quote from it:

Oriol Vinyals, the Lead Designer of AlphaStar: It is important that we play the games that we created and collectively agreed on by the community as “grand challenges” . We are trying to build intelligent systems that develop the amazing learning capabilities that we possess, so it is indeed desirable to make our systems learn in a way that’s as “human-like” as possible. As cool as it may sound to push a game to its limits by, for example, playing at very high APMs, that doesn’t really help us measure our agents’ capabilities and progress, making the benchmark useless.

Deepmind is not necessarily interested in creating an AI that can simply beat Starcraft pros, rather they want to use this project as a stepping stone in advancing AI research as a whole. It is deeply unsatisfying to have prominent members of this research project make claims of human-like mechanical limitations when the agent is very obviously breaking them and winning its games specifically because it is demonstrating superhuman execution.

[–]Prae_ 20 points21 points  (0 children)

It wasn't really about speed, to be honest. It was more about the 'width' of control and the number of fronts precisely coordinated. AlphaStar wasn't inhumanly fast, but it managed to out-manoeuvre MaNa by being everywhere at the same time.

All throughout the matches, AlphaStar demonstrated more than just fast execution. It knew which units to target first, how to exploit (or prevent MaNa from exploiting) the immortal ability. So it's not just going fast, it's doing a lot of good things fast. Overall, as a fairly good player of SC2, I have to say it was really impressive (the blink stalker one was controversial, but still interesting) and a substantial improvement compared to other AI.

And even if it's not "really" outsmarting humans, it's still interesting to see. Seems like it favors constant aggression, probably because it's a way to dictate the pace of the game and keep the possible reactions within a certain range. I'd say that's still useful results for people interested in strategy (in general, or in starcraft). It seems like a solid base, if you have the execution capabilities of AlphaStar.

[–]super_aardvark 47 points48 points  (14 children)

It wasn't so much about the speed as it was about the precision, and in the one case about the attention-splitting (microing them on three different fronts at the same time). I'm sure Mana could blink 10 groups of stalkers just as quickly, but would never be able to pick those groups out of a large clump with such precision. Also, "actions" like selecting some of the units take longer than others -- a human has to drag the mouse, which takes longer than just clicking. I don't know if the AI interface is simulating that cost in any way.

[–]starcraftdeepmind 54 points55 points  (9 children)

It's about the accuracy of clicks multiplied by the number of clicks (or actions, if one prefers; I know the A.I. doesn't use a mouse and keyboard).

If the human player (and not AlphaStar) could slow the game down 5-fold at a crucial time (and had lots of experience operating at that speed), both his number of clicks and his accuracy would go up. He would be able to click on individual Stalkers etc. in a way he can't at higher speeds of play. I argue that this is a good metaphor for the unfair advantage AlphaStar has.

There are two obvious ways of reducing this advantage:

  1. Reduce the accuracy of AlphaStar's 'clicks' by making click placement probabilistic. The probabilities could be fixed or vary with context. (I don't like this option.) As an aside, there was some obfuscation on this point too. It is claimed that the agents are 'spammy' and redundantly issue the same action twice, etc. That's a form of inefficiency, but it's not the same as intending to click on a target and sometimes missing it -- AlphaStar has none of this latter inefficiency.
  2. Reduce the rate of clicks AlphaStar can make. This reduction could be constant or vary with context. This is the route the AlphaStar researchers went, and I agree it's the right one. Again, I'll emphasise that this variable multiplies with the one above to produce the insane micro we saw. Insisting it's one and not the other misses the point. Why didn't they reduce the rate of clicks further? Based on the clever obfuscation of this issue in the blog post and the YouTube presentation, I believe they did in their tests, but the performance of the agents was so poor that they were forced to increase it.

[–]monsieurpooh 38 points39 points  (1 child)

Thank you, I too have always been a HUGE advocate of probabilistic click or mouse-movement accuracy as a handicap to put the AI on the same footing as humans. It becomes even more important if we ever want DeepMind to compete in FPS games such as Counter-Strike. We want to see it outsmart, out-predict, and surprise humans, not out-aim them.

[–]starcraftdeepmind 12 points13 points  (0 children)

Thanks for the thanks. Yes, it's as essential for FPS, if not more so.

The clue is in the name artificial intelligence—not artificial aiming. 😁

[–]6f937f00-3166-11e4-8 11 points12 points  (1 child)

On point 1) I think a simple model would be to make quicker clicks less accurate. So if it clicks only 100ms after the last click, it gets placed randomly over a wide area. If it clicks, say, 10 seconds after the last click, it has perfect placement. This somewhat models a human "taking time to think about it" vs "panicked flailing around".
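A toy version of that model, with all constants invented for illustration:

    import random

    def noisy_click(target_xy, ms_since_last_click, max_noise_px=40.0):
        """Clicks fired soon after the previous one land less accurately."""
        calm = min(ms_since_last_click / 10_000.0, 1.0)   # 0 = instant re-click, 1 = 10s pause
        noise = max_noise_px * (1.0 - calm)
        x, y = target_xy
        return (x + random.gauss(0, noise), y + random.gauss(0, noise))

    print(noisy_click((100, 200), ms_since_last_click=100))     # wide scatter
    print(noisy_click((100, 200), ms_since_last_click=10_000))  # essentially perfect placement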

[–]Neoncow 8 points9 points  (0 children)

For 1), for the purpose of finding "more human" strategies, have you considered working with some of the UX teams at your parent company to model the major characteristics of human input and output?

Like mouse movement that follows Fitts's law (or other UX "laws"). Or visualization that models eyeball movement or peripheral-vision limitations. Or modelling finger fatigue and mouse clicks. Or wrist movement speed. Or adding in minor RSI pain.

I know it's not directly AI related, but if the goal is to produce human usable knowledge, you'll probably have to model human bodies sometime in the future for AI models that interact with the real world.
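As a concrete example of one such UX "law": Fitts's law predicts pointing time from target distance and width. A sketch with hypothetical coefficients (a and b would normally be fit to human data):

    import math

    def fitts_movement_time_ms(distance_px, target_width_px, a=50.0, b=150.0):
        """Fitts's law: MT = a + b * log2(D / W + 1)."""
        return a + b * math.log2(distance_px / target_width_px + 1)

    print(fitts_movement_time_ms(800, 10))   # long move to a tiny target: ~1000 ms
    print(fitts_movement_time_ms(100, 50))   # short move to a big target: ~290 ms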

[–]PM_ME_STEAM 9 points10 points  (4 children)

https://youtu.be/cUTMhmVh1qs?t=7901 It looks like the AI definitely goes way over 600 APM in the 5 second period here. Are you capping the APM or EPM?

[–]OriolVinyals[S] 17 points18 points  (3 children)

We are capping APM. Blizzard's in-game APM applies multipliers to some actions, which is why you are seeing a higher number. https://github.com/deepmind/pysc2/blob/master/docs/environment.md#apm-calculation

[–]David_SilverDeepMind 73 points74 points  (2 children)

Re: 2

We keep old versions of each agent as competitors in the AlphaStar League. The current agents typically play against these competitors in proportion to the opponents' win-rate. This is very successful at preventing catastrophic forgetting, since the agent must continue to be able to beat all previous versions of itself. We did try a number of other multi-agent learning strategies and found this approach to work particularly robustly. In addition, it was important to increase the diversity of the AlphaStar League, although this is really a separate point to catastrophic forgetting. It’s hard to put exact numbers on scaling, but our experience was that enriching the space of strategies in the League helped to make the final agents more robust.
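A toy sketch of that kind of win-rate-weighted opponent sampling (illustrative only; the real league machinery is more involved):

    import random

    def sample_opponent(past_agents, winrate_vs_current):
        """Pick an old league member, favouring those that still beat the current agent."""
        weights = [max(winrate_vs_current[a], 1e-3) for a in past_agents]   # never fully ignore anyone
        return random.choices(past_agents, weights=weights, k=1)[0]

    league = ["agent_v1", "agent_v2", "agent_v3"]
    opponent = sample_opponent(league, {"agent_v1": 0.05, "agent_v2": 0.30, "agent_v3": 0.65})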

[–]David_SilverDeepMind 61 points62 points  (2 children)

Re: 6 (sub-question on self-play)

We did have some preliminary positive results for self-play, in fact an early version of our agent defeated the built-in bots, using basic strategies, entirely by self-play. But supervised human data is very helpful to bootstrap the exploration process, and helps to give much broader coverage of advanced strategies. In particular, we included a policy distillation cost to ensure that the agent continues to try human-like behaviours with some probability throughout training, and this makes it much easier to discover unlikely strategies than when starting from self-play.
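A minimal sketch of such a distillation cost (illustrative only; the actual loss and its weighting are not specified here):

    import torch.nn.functional as F

    def distillation_cost(agent_logits, supervised_logits, weight=0.1):
        """KL(human-like policy || agent policy): penalize drifting far from supervised behaviour."""
        log_p_agent = F.log_softmax(agent_logits, dim=-1)
        p_human = F.softmax(supervised_logits, dim=-1)
        return weight * F.kl_div(log_p_agent, p_human, reduction="batchmean")

    # total_loss = rl_loss + distillation_cost(agent_logits, frozen_supervised_logits)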

[–]David_SilverDeepMind 51 points52 points  (5 children)

Re: 4

The neural network itself takes around 50ms to compute an action, but this is only one part of the processing that takes place between a game event occurring and AlphaStar reacting to that event. First, AlphaStar only observes the game every 250ms on average, this is because the neural network actually picks a number of game ticks to wait, in addition to its action (sometimes known as temporally abstract actions). The observation must then be communicated from the Starcraft binary to AlphaStar, and AlphaStar’s action communicated back to the Starcraft binary, which adds another 50ms of latency, in addition to the time for the neural network to select its action. So in total that results in an average reaction time of 350ms.
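The budget described above, spelled out:

    observe_gap_ms = 250   # average gap the network itself chooses between observations
    game_io_ms = 50        # observation out to AlphaStar + action back to the StarCraft binary
    forward_pass_ms = 50   # neural network computing the action
    print(observe_gap_ms + game_io_ms + forward_pass_ms)   # ~350 ms average reaction time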

[–]pataoAoC 11 points12 points  (3 children)

First, AlphaStar only observes the game every 250ms on average, this is because the neural network actually picks a number of game ticks to wait

How and why does it pick the number of game ticks to get the average of 250ms? I'm only digging into this because the "mean average APM" on the chart struck me as deceptive; the agent used <30 APM on a regular basis while macro'ing to bring down the burst combat micro APM of 1000+, and the mean APM was highlighted on the chart.

[–]nombinoms 24 points25 points  (1 child)

There was a chart somewhere that also showed a pretty messed-up reaction-time graph. It had a few long reaction times (around a second) and probably almost a third of them under 100ms. I have a feeling that if we watched the games from AlphaStar's point of view, it would basically look like it is holding back for a while, followed by superhuman mouse and camera movement whenever there was a critical skirmish.

Anyone who plays games of this genre could tell you that APM and reaction-time averages are meaningless. You would only need maybe a few seconds of superhuman mechanics to win, and strategy wouldn't matter at all. In my opinion, all this shows is that we can make AIs that learn to play StarCraft provided they only go superhuman at limited times. That's a far cry from conquering StarCraft 2. It's literally the same tactic hackers use to not get banned.

The most annoying part is that they have a ton of supervised data and could easily look at the actual probability distributions of meaningful clicks in a game and build additional constraints directly into the model that account for many variables and simulate real mouse movement. But instead they use a misleading "hand-crafted" constraint. It's ironic how machine learning practitioners advocate making all models end-to-end, except when it comes to modelling the handicaps humans have, where they fall back on their own preconceived biases about what's a suitable handicap for their models.

[–]OriolVinyals[S] 44 points45 points  (3 children)

Re. 8: Glad to see the excitement! We're really grateful for the community's support and we want to include them in our work, which is why we are releasing the 11 game replays for the community to review and enjoy. We’ll keep you posted as our plans on this evolve!

[–]David_SilverDeepMind 40 points41 points  (0 children)

Re: 6

The most effective approach so far did not use tree search, environment models, or explicit HRL. But of course these are huge open areas of research and it was not possible to systematically try every possible research direction - and these may well prove fruitful areas for future research. Also it should be mentioned that there are elements of our research (for example temporally abstract actions that choose how many ticks to delay, or the adaptive selection of incentives for agents) that might be considered “hierarchical”.

[–]David_SilverDeepMind 41 points42 points  (0 children)

Re: 7

There are actually many different approaches to learning by self-play. We found that naive implementations of self-play often tended to get stuck in specific strategies or forget how to defeat previous strategies. The AlphaStar League is also based on agents playing against themselves, but its multi-agent learning dynamic encourages strong play against a diverse set of opponent strategies, and in practice seemed to lead to more robust behaviour against unusual patterns of play.

[–]Prae_ 29 points30 points  (8 children)

I'm very interested in the generalization over the three races. The league model for learning seems to work very well for mirror match-ups, but it seems to me that it would take a significantly greater time if it had to train 3 races in 9 total match-ups. There are large overlaps between the different match-ups, so it would be interesting to see how well it can make use of these overlaps.

[–]Paladia 9 points10 points  (5 children)

but it seems to me that it would take a significantly greater time if it had to train 3 races in 9 total match-ups.

Doesn't matter much when you have a hyperbolic time chamber where the agents get 1,753,162 hours of training in one week. It's all about how much compute they want to dedicate to training at that point.

[–]David_SilverDeepMind 37 points38 points  (12 children)

Re: 3

In order to train AlphaStar, we built a highly scalable distributed training setup using Google's v3 TPUs (https://cloud.google.com/tpu/) that supports a population of agents learning from many thousands of parallel instances of StarCraft II. The AlphaStar league was run for 14 days, using 16 TPUs for each agent. The final AlphaStar agent consists of the most effective mixture of strategies that have been discovered, and runs on a single desktop GPU.

[–]dojoteef 61 points62 points  (11 children)

1) Are there any plans to train an agent using only pixel inputs and physical mouse/keyboard actions now that you have demonstrated AlphaStar's current ability against professional players? 2) Have you gained any insights from this experience that you believe translates to other reinforcement learning problems where humans interact with AI-controlled agents?

[–]David_SilverDeepMind 42 points43 points  (0 children)

Re: 2

Like Starcraft, most real-world applications of human-AI interaction have an element of imperfect information. That also typically means that there is no absolute optimal way to behave and agents must be robust to a wide variety of unpredictable things that people might do. Perhaps the biggest take away from Starcraft is that we have to be very careful to ensure that our learning algorithms get adequate coverage over the space of all these possible situations.

In addition, I think we’ve also learnt a lot about how to scale up RL to really large problems with huge action spaces and long time horizons.

[–]OriolVinyals[S] 37 points38 points  (0 children)

Re. 2: When we see things like high APMs, or misclicks, it may be from imitation indeed. In fact, we often see very spammy behavior of certain actions for the agents (spamming move commands, microing probes to mine unnecessarily, or flickering the camera during early game).

[–]OriolVinyals[S] 32 points33 points  (3 children)

Re. 1: No current plans as of yet, but you’ll be the first to know if there are any further announcements : )

[–]AesotericNevermind 13 points14 points  (1 child)

I either want the computer to use a mouse, or the game to read my thoughts.

Your move.

[–]heyandy889 6 points7 points  (2 children)

this would be awesome - that is the way they trained an agent to beat Atari games I believe

[–]4567890 60 points61 points  (22 children)

For the pro players, say you are coaching AlphaStar. What would you say are the best and worst aspects of its game? Do you think its victories were more from decision making or mechanics?

[–]SC-MaNa 103 points104 points  (6 children)

I would say that clearly the best aspect of its game is the unit control. In all of the games where we had a similar unit count, AlphaStar came out victorious. The worst aspect, from the few games we were able to play, was its stubbornness about teching up. It was so convinced it could win with basic units that it barely made anything else, and eventually, in the exhibition match, that did not work out. There weren't many crucial decision-making moments, so I would say its mechanics were the reason for victory.

[–]hunavle 10 points11 points  (0 children)

In your live game, did you think you would have lost if you had stopped harassing with the prism/immortal? You were, I believe, 1-1 in upgrades vs. its 0-0.

[–]ILikeBigButtss 13 points14 points  (4 children)

AlphaStar displayed a level of competence in decision making and strategy that hasn't been seen from an AI. However, it had a huge advantage in mechanics due to its interface. It didn't have the limits of human imprecision and reaction time. The decision not to tech up could have been influenced in part by its mechanical ability. Its micro abilities certainly had an impact on its unit composition decisions.

[–]NewFolgers 16 points17 points  (3 children)

You're right about the precision, but the DeepMind team keeps saying that the agent is only able to sample the game state once every 250ms, and overall takes 350ms to react. Watching the games, I sometimes even felt it looked like an awesome player who was lagging a bit, since it sometimes failed to move units away just in time when there was ample opportunity for a save.

I agree with your last point too. It knew it could beat MaNa's immortal army with its group of stalkers (whereas the numbers looked pretty hopeless to a human), and that's because it was able to split into three groups around the map and micro them all simultaneously -- something a human couldn't do. If it couldn't do those things, it wouldn't have gotten into a situation where it only had a bunch of stalkers to counter immortals.

Anyway, it has too much of an advantage in quickly and precisely orchestrating its own actions -- but from what we've been told, reaction time does not seem to be a primary cause of any advantage it has.

[–]ILikeBigButtss 10 points11 points  (1 child)

I hadn't seen the 250ms sampling interval. I had thought it was receiving updated data on every frame (1/24 of a second). DeepMind's blog shows that the reaction time was as low as 67ms and averaged 350ms. If observations come in at 0.25-second intervals, that 67ms could be anywhere between 67ms and 317ms after the actual event. Sampling at quarter-second intervals is a pretty odd design choice. It limits reaction time to events that happen early in the interval, but not events at the end of the interval. AlphaStar can still respond faster than humanly possible to some events, but it's effectively random which events those are. A lag on when AlphaStar receives information, combined with a more regular sampling interval, would seem to make more sense if the goal was to limit reaction time to human levels. This seems to be just as much a decision to limit the volume of information AlphaStar needs to process as it is an attempt to limit reaction time.

Hopefully we get a more detailed technical description of AlphaStar and its interface with the game. The stream and DeepMind's blog post cover a bit, but they aren't always completely clear, nor are they comprehensive. AlphaStar was impressive, but until it has a more human-like interface and interaction with the game, it's hard to draw too much meaning from its performance against humans.

I'd also like to see an unrestrained version of AlphaStar (no APM limits, no lag or delay on information) demolish everyone. I want 10k-APM stalkers on 3 different fronts across the map, tearing everyone to shreds.

[–]AxeLond 19 points20 points  (1 child)

The balls it had at times were crazy. It would run straight into MaNa's army, snipe one high-priority unit, and then back off without a second of doubt. That would be so hard for a human to do, since the decision to commit or back off can be very tough in the moment.

At least in the exhibition match, the first 8 or so minutes looked great: it had perfect strategy and build order, and while the execution wasn't the best, it was still at a pro level. However, it looked like it just ran out of ideas past 9 minutes and mostly ran around doing random stuff, as if it wasn't expecting the match to go on for this long and had no plan whatsoever -- just completely clueless about what to do next.

It should have started building forges for more upgrades, and a robo and templar archives for end-game units to push its advantage. Instead it kind of sat back and did nothing until MaNa attacked with his deathball and +2 weapon upgrades versus no weapon upgrades and just basic units. AlphaStar was up 3 bases to 2 and was so far ahead that if a pro had taken over from AlphaStar at around 8:00, he could easily have won the game against MaNa just by doing some kind of late-game strategy.

[–]althaz 32 points33 points  (12 children)

I can answer part of this. Alpha's micro was inhumanly good in the matches we saw against Mana.

In game 1 vs Mana, Mana simply made a mistake; he probably would have won that match if he had played correctly. I say probably because of how insane Alpha's stalker micro was -- maybe it would have hung on and won anyway.

After that though, the micro was insane. The casters kept talking about Alpha not being afraid to go up ramps and into chokes. That's because it could predict and see exactly how far away enemy units were and was ridiculously good at not getting caught out. Couple that with how good its stalker micro was both with and without blink and it made engagements that would be extremely one-sided in a human vs human match go the opposite way.

Alpha's mechanics were perfect, but that wouldn't have mattered vs a pro player like Mana if its decision making wasn't also superb.

One thing worth talking about with its mechanics is the sheer precision - there are no misclicks, so despite the limited speed, the precision was more than enough for Alpha to destroy in battles where it had equal or even slightly worse armies.

Now, on the bigger strategic decisions I don't know - was building more probes like Alpha did the right way to go, or did it win despite that, for example? I'm not at TLO or especially Mana's level, but I actually always over build probes. It's worked out fairly well for me.

[–]starcraftdeepmind 22 points23 points  (5 children)

To mention the precision (effective APM) without mentioning the extremely high burst APM during battles (often in the range of 600-900, sometimes over 1000 APM) is to leave out one of the variables in the equation.

[–]celeritasCelery 155 points156 points  (54 children)

How does it handle invisible units? Human players can see the shimmer if they are looking really closely. But if the AI could see that, invisibility would be almost useless. However, if it can't see them at all, that would seem to give a big advantage to mass cloaked-unit strategies, since an observer would have to be present to notice anything.

[–]OriolVinyals[S] 73 points74 points  (28 children)

Funnily enough, at first we ignored the "shimmer" of invisible units. Agents were still able to play, as you can still build a detector, in which case units are revealed as usual. However, we later added a "shimmer" feature, which activates if a position has a cloaked unit.
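A toy illustration of what a "shimmer" feature plane could look like (shapes and names invented for the example):

    import numpy as np

    def shimmer_layer(map_size, cloaked_unit_positions):
        """Binary feature plane: 1 wherever a cloaked (but undetected) unit sits."""
        layer = np.zeros(map_size, dtype=np.float32)
        for x, y in cloaked_unit_positions:
            layer[y, x] = 1.0   # the 'shimmer' is visible here, but the unit remains untargetable
        return layer

    print(shimmer_layer((8, 8), [(3, 2), (5, 5)]))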

[–]Catch-22 36 points37 points  (22 children)

Waaait, how is that behavior/flag any different from actually detecting the unit?

(and thank you for being here!)

[–]SuperTable 34 points35 points  (9 children)

It's just like in the game: one can spot an enemy unit because the terrain underneath is blurred. So you can force-field it out or build detection before it actually attacks you. However, you still can't target it or attack it.

[–]celeritasCelery 46 points47 points  (8 children)

While true, no human player is capable of seeing all "invisible" units. You can only see the ones on screen, and only if you are paying really close attention. For the AI, invisible units are not really invisible, they are just "untargetable". Seems a little one-sided.

[–]VoodooSteve 27 points28 points  (1 child)

My feeling is this is fine since a "perfect human" would notice all shimmers and this is what AI is going for (provided it's using the camera mode and not detecting all shimmers all over the map at once).

[–]why_rob_y 22 points23 points  (0 children)

It was not using camera mode for Games 1-10, so I'd say the shimmer visibility was an unfair advantage there. However, Game 11 had it use a camera, and you're right, I think it's more fair if it needs to see the shimmer on screen.

[–]Mangalaiii 13 points14 points  (0 children)

How exactly does the "shimmer" appear to the program?

[–]rip_BattleForge 12 points13 points  (0 children)

Please elaborate!

[–]Nevermore60 24 points25 points  (8 children)

Seeing the shimmer of a cloaked unit is certainly an advantage, but you have to remember that that still doesn't allow you to target the unit without detection.

That said, I think you're right that AlphaStar's "perception" of the subtle shimmer, as well as all other subtle visual information on the screen (e.g., the exact health and position of 45 different enemy units all on the screen at once) is far too precise.

To level the playing field and truly pit the strategic abilities of AlphaStar against human players while controlling for all other advantages, AlphaStar would have to rely on optical perception -- i.e., looking at a screen of the game and visually processing the information on the screen -- rather than instantaneously digitally perceiving all information available in a window.

[–]ThinkExist 10 points11 points  (6 children)

I think they only need to restrict AS to the camera interface. It would still need to be looking at the right place to see it.

[–]Mefaso 31 points32 points  (6 children)

I think they pointed out that some agents went for all-dark-templar strategies, which would be pointless if they could be seen, so I'd assume they can't see them.

Also, in one of the TLO games they built tons of observers.

[–]hyperforce 42 points43 points  (1 child)

Cloaked units are also untargetable. So even if you can see them, you cannot damage them with targeted attacks; you would need splash.

[–]Krexington_III 11 points12 points  (0 children)

Yeah but catching a DT walking over the map or taking part in base defense potentially gives information about the state of the game.

[–]2Punx2Furious 7 points8 points  (1 child)

invisibility would be almost useless

You still can't target invisible units if you don't have a detector, unless you use AOE. Anyway, you can see in the first game that it built a ton of observers, probably for that reason.

[–]DreamhackSucks123 94 points95 points  (20 children)

Many people are attributing AlphaStar's single loss to the fact that the algorithm had restricted vision in the final match. I personally don't find this a convincing explanation, because the warp prism was moving in and out of the fog of war and the AI was moving its entire army back and forth in response. This definitely seemed like a gap in understanding rather than a mechanical limitation. What are your thoughts on why AlphaStar lost in this way?

[–]David_SilverDeepMind 73 points74 points  (2 children)

It’s hard to say why we lose (or indeed win) any individual game, as AlphaStar’s decisions are complex and result from a dynamic multi-agent training process. MaNa played an amazing game, and seemed to find and exploit a weakness in AlphaStar - but it’s hard to say for sure whether this weakness was due to camera, less training time, different opponents, etc. compared to the other agents.

[–]SnowAndTrees 32 points33 points  (7 children)

Its micro (unit control in fights) also seemed noticeably worse in that game than in the previous one, where it beat Mana's immortal-heavy army with stalkers, which generally wouldn't be possible in a normal game, as immortals hard-counter stalkers.

[–]althaz 25 points26 points  (3 children)

I think this was because the AI had to do more work to manage its attention, but maybe it's just that this agent wasn't as godly at Stalker micro.

It's also worth mentioning that Alpha didn't have a massive group of blink stalkers in this match - no amount of micro can save non-blink stalkers vs 4-6 immortals, because the Stalkers get basically one-shot.

[–]Inori 44 points45 points  (3 children)

  1. Your agent contains quite a number of advanced approaches, including some very unconventional such as the transformer body. What was the process of building it like, e.g. was every part of the agent added incrementally, improving the overall performance at each step? Were there parts that initially degraded the performance, and if yes then how were you able to convince others (yourself?) to stick with it?

  2. Speaking of the transformer body, I'm really surprised that essentially throwing away the full spatial information worked so well. Have you given any thought as to why it worked so well, relative to something like the promising DRC / Conv LSTM?

  3. What is the reward function like? Specifically, I'm assuming it would be impossible to train with pure win/loss, but have you applied any special reward shaping?

Very impressive work either way! GGWP!

[–]OriolVinyals[S] 15 points16 points  (2 children)

  1. Others on the team and I developed the architecture. Much like one tunes performance on ImageNet, we tried several things. Supervised learning was useful here -- improvements to the architecture were mostly developed this way.
  2. I am not surprised at all. The Transformer is, IMHO, a step up from CNNs/RNNs. It is showing SOTA performance everywhere.
  3. Most agents get rewarded only for win/loss, without discounting (i.e., they don't mind playing long games). Some, however, use extra rewards, such as the agent that "liked" to build disruptors.

GG.

[–]NikEy 7 points8 points  (1 child)

Hey Oriol,

(1) Can you clarify your answer on the reward shaping? Are you saying that for most agents you're ONLY looking at the win/loss and not "learning along the way"? So if an agent wins, you weight all the actions in the game positive, and if it loses, you weight them all negative?

(2) How was the disruptor reward-shaping introduced? Does a random percentage of agents get higher rewards for certain unit types?

[–]OriolVinyals[S] 14 points15 points  (0 children)

  1. Yes. Supervised learning makes agents play more or less reasonably. RL can then figure out what it means to win / be good at the game.
  2. If you win, you get a reward of 1. If you win and build at least 1 disruptor, you get a reward of 2.

[–]Bleyddyn 33 points34 points  (8 children)

Are there any areas you would recommend for ML/RL hobbyists to focus on? Areas where it might be possible to make useful contributions without having more than desktop level compute resources?

[–]OriolVinyals[S] 49 points50 points  (1 child)

There are always things you'll be able to do to advance ML even without large amounts of compute. My favorite example is from when we were working on machine translation. We developed something called seq2seq, which used a big LSTM to achieve state-of-the-art performance, trained on 8 GPUs. At the same time, U of Montreal developed "attention", a fundamental advance in ML, which allowed the models to be much smaller (as they weren't running on such big hardware).

[–]upboat_allgoals 9 points10 points  (0 children)

To build on this, Google released their transformer architecture, examples on one GPU, which is pretty great: https://github.com/tensorflow/models/tree/master/official/transformer

[–]_MandelBrot 7 points8 points  (0 children)

Most theoretical contributions don't require a lot of compute

[–]lagerbaer 7 points8 points  (4 children)

Check out fast.ai and their online courses. Also paperspace e.g. for relatively cheap cloud compute resources.

[–]TovarishGaming 38 points39 points  (10 children)

First of all, thank you for your hard work and for being a part of today's awesome event!

@Deepmind team: We saw AlphaStar do some Blink Stalker micro today that everyone seemed to agree was simply above-human possibility. Do you expect to see this with other races? I imagine Zerg spreading creep tumors exactly every 4 seconds will lead to insane creep spread or things like that. What are you most excited to see?

@TLO: you said initially that you were still confident you would win while playing as Zerg. After seeing Mana's match today, and knowing that Deepmind will continue to learn exponentially, do you still feel confident in your rematch?

[–]LiquidTLO1 51 points52 points  (0 children)

When I said that, I was definitely referring to the level of the agent I played back then. I still believe I would be able to beat a Zerg agent in a ZvZ playing at a similar level to the one MaNa faced.

However, it's very hard to tell how much stronger AlphaStar will become in the future. I don't think it's as simple as saying it'll become exponentially better. All that aside, I'm extremely eager for a Zerg rematch, that's for sure.

[–]OriolVinyals[S] 53 points54 points  (2 children)

I would be quite excited to see how self play would pan out if agents played all three races. Will the asymmetries help agents, as they’ll encounter more situations than in a mirror match?

[–]shadiakiki1986 10 points11 points  (5 children)

Deepmind will continue to learn exponentially,

Why do you say "exponentially"?

[–]EnderSword 37 points38 points  (9 children)

I was wondering if you've seen the AI Agents show any signs of 'deceiving' its opponent, through hiding buildings, cancelling something after it was scouted, giving a false impression of an attack in one area or mimicking a build order only to change it etc...?

[–]OriolVinyals[S] 55 points56 points  (0 children)

We have indeed seen agents building and canceling buildings when scouted. It would be hard to know why AlphaStar did it, but it does happen sometimes.

[–]koglerjs 15 points16 points  (7 children)

Watch game 5 against Mana. Cyber core and gateway on low ground indicate, to my eye, a fast expansion, which masks the aggressive play.

It's hard to say what 'intent' is when it comes to deception, but I believe Mana was surprised by the proxy.

[–]EnderSword 10 points11 points  (1 child)

Yeah, he definitely was. Also, the pylon in the main is clearly intended to distract attention.

[–]kroken81 34 points35 points  (12 children)

How large is the "memory" of AlphaStar? How much data does it have to draw from while playing?

[–]OriolVinyals[S] 54 points55 points  (3 children)

Each agent uses a deep LSTM, with 3 layers and 384 units each. This memory is updated every time AlphaStar acts in the game, and an average game takes about 1000 actions.
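For a sense of scale, an LSTM core of that shape is easy to write down (the input size here is made up):

    import torch
    import torch.nn as nn

    core = nn.LSTM(input_size=512, hidden_size=384, num_layers=3, batch_first=True)

    obs_embedding = torch.randn(1, 1, 512)        # one processed observation per agent step
    hidden = None                                  # carried across the ~1000 actions of a game
    output, hidden = core(obs_embedding, hidden)   # all of the "memory" lives in `hidden`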

[–]BornDNN 19 points20 points  (1 child)

How many parameters does a single agent have, given the approximate figure of 175 petaflop training-days?

[–]OriolVinyals[S] 46 points47 points  (0 children)

Our network has about 70M parameters.

[–]ReasonablyBadass 6 points7 points  (0 children)

It's amazing a common LSTM is enough for this game. Would something with long term memory like a DNC perform better or would the extra memory be superfluous?

[–]Alpha_sc2 25 points26 points  (1 child)

I think they mentioned they used LSTMs, which means the "memories" are encoded implicitly in a fixed-size hidden state.

[–]i_know_about_things 46 points47 points  (0 children)

Nice username and a 5 year old account too.

[–]Hinrek 31 points32 points  (2 children)

Hey guys! I was in awe while watching the stream, amazing stuff! Are you considering to release any AlphaStar vs. AlphaStar games?

[–]CrisprCookie 10 points11 points  (1 child)

I for one would be specifically interested in the superhuman version, AlphaStar vs. AlphaStar. Seems like those could be some interesting matches.

If this will not happen, I would be interested in why. Are the AS vs. AS games simply not recordable in a way that allows releasing the replays? Or would it give potential future human opponents a way to prepare for the match -- and wouldn't that be a more realistic approach, since humans can prepare for human opponents?

[–]denestra 84 points85 points  (41 children)

@Deepmind team: Will we be able to play against AlphaStar at some point in the future?

[–]David_SilverDeepMind 52 points53 points  (0 children)

This is not something we’re able to do at the moment. But we're really grateful for the community's support and have tried to include them in our work, which is why we had the livestream event and released the 11 game replays to review and enjoy :) We’ll keep you posted as our plans on this evolve!

[–]jinnyjuice 30 points31 points  (20 children)

Unlikely any time soon -- we still can't play against AlphaZero at go (baduk/weiqi), shogi, or chess. It will probably happen in the more distant future.

[–]ThinkExist 45 points46 points  (12 children)

The team said at Blizzcon they wanted to put it on ladder. I hope that happens.

[–]jinnyjuice 43 points44 points  (10 children)

It will most likely happen. It happened in go/baduk/weiqi and shogi. For go, top pros quickly noticed an unbeatable online presence called "Master", and hence that iteration was called AlphaGo Master, which was stronger than AlphaGo. The Master version won 60-0 against top players online.

The crazier thing is that the later iterations, AlphaGo Zero and AlphaZero, were even better than AlphaGo Master.

[–]ThinkExist 21 points22 points  (5 children)

Assuming Deepmind fixes the APM issues, I would love to see any replays by AS. Even if AS becomes an untouchable god.

[–]ChaosGandalf 8 points9 points  (0 children)

I hope it has its own Twitch stream.

[–]hexyrobot 7 points8 points  (18 children)

They would need to have a gaming rig with a nice GPU for every virtual agent, so my guess is no. That said I wouldn't be surprised if they did some more show matches against pro players.

[–]keepthepace 35 points36 points  (13 children)

It could be a single instance acting like an individual player on ladder. You would meet AlphaStar in the same way you can match up against TLO.

Bots are forbidden in the ladder right now, but I can see Blizzard adding an option for vetted bots.

[–]hexyrobot 12 points13 points  (10 children)

This would be awesome and I would be totally behind it, as long as it was labeled and you knew it was a bot.

[–]keepthepace 13 points14 points  (8 children)

I think we should get an option to accept this kind of matchups.

[–][deleted] 28 points29 points  (5 children)

“Are you alright with losing this game, guaranteed?”

[–]harmonic- 69 points70 points  (15 children)

Agents like AlphaGo and AlphaZero were trained on games with perfect information. How does a game of imperfect information like Starcraft affect the design of the agent? Does AlphaStar have a "memory" of its prior observations similar to humans?

p.s. Huge fan of DeepMind! thanks for doing this.

[–]David_SilverDeepMind 64 points65 points  (6 children)

Interestingly, search-based approaches like AlphaGo and AlphaZero may actually be harder to adapt to imperfect information. For example, search-based algorithms for poker (such as DeepStack or Libratus) explicitly reason about the opponent’s cards via belief states.

AlphaStar, on the other hand, is a model-free reinforcement learning algorithm that reasons about the opponent implicitly, i.e. by learning a behaviour that’s most effective against its opponent, without ever trying to build a model of what the opponent is actually seeing - which is, arguably, a more tractable approach to imperfect information.

In addition, imperfect information games do not have an absolute optimal way to play the game - it really depends upon what the opponent does. This is what gives rise to the “rock-paper-scissors” dynamics that are so interesting in Starcraft. This was the motivation behind the approach we used in the AlphaStar League, and why it was so important to cover all the corners of the strategy space - something that wouldn’t be required in games like Go where there is a minimax optimal strategy that can defeat all opponents, regardless of how they play.

[–]keepthepace 18 points19 points  (6 children)

Does AlphaStar have a "memory" of its prior observations similar to humans?

Not from the team, but I am pretty sure the answer is yes; in the Dota architecture they use a simple LSTM to keep track of the game state over time.

[–]Zanion 14 points15 points  (2 children)

At one point during the cast, when showing a very high-level architecture diagram, they stated they were using LSTMs as well.

[–]mlearner13 64 points65 points  (21 children)

Will you cap the next iterations to more human like capabilities?

[–]upboat_allgoals 34 points35 points  (0 children)

The game that MaNa won was a significant step towards human-like capabilities.

On the other hand, as a demonstration of asymmetric-information games, we've seen conclusive operational effectiveness, but perhaps more in the vein of perfect "missile command" play than perfect strategic play.

How will the team address state-space coverage, and what are the advanced techniques there?

[–]J0rdian 24 points25 points  (8 children)

They fixed the camera issue, so the only thing that remains unrealistic is the perfect APM. The fact that it can have perfect 1500+ APM is insane. That would probably be the equivalent of 3000+ APM for a normal person.

In my opinion they need to hard-cap its peak APM, not just its average. It shouldn't be going above 600-700 APM, and even that might be way too much, given how inefficient humans are with APM and how efficient an AI can be.

[–]HerrVigg 7 points8 points  (3 children)

First of all, I don't understand their APM graph, where TLO has an average of 678 APM and a max of 2000. These numbers are ridiculous; no human can reach that unless they spam useless actions. Where do these numbers come from?

https://storage.googleapis.com/deepmind-live-cms/images/SCII-BlogPost-Fig09.width-1500.png

[–]PM_ME_STEAM 15 points16 points  (1 child)

TLO has said in his stream that managing his control groups is bugged and inflates APM

[–]OriolVinyals[S] 9 points10 points  (0 children)

See other answers in the AMA regarding APMs.

[–]4567890 70 points71 points  (7 children)

Several times you equate human APM with AlphaStar's APM. Are you sure this is fair? Isn't human APM inflated with warm-up click rates, double-entering commands, imperfect clicks, and other meaningless inputs? Meanwhile aren't all of AlphaStar's inputs meaningful and super accurate? Are the two really comparable?

The presentation and blog post references "average APM," but isn't burst APM something worth containing too? I would argue Human burst APM is from meaningless input, while I suspect AlphaStar's burst APM is from micro during the heavy battle periods. You want a level playing field and a focus on decision making, but are you sure AlphaStar wasn't using its burst APM and full map access to reach superhuman levels of unit control for short periods when it mattered most?

[–]Undisabled 21 points22 points  (1 child)

The only problematic use of burst APM that was noticeable to me came in Game 4 of the MaNa replays. That Stalker fight that spanned 3 screens and caused AlphaStar to reach 1500 APM certainly was superhuman and likely will not be a problem now that they have implemented the camera usage.

In regards to the "meaningful and super accurate" inputs from AlphaStar, that is certainly an advantage that has to be taken into account; however, others have noted that the EAPM of AlphaStar wasn't too outrageous. I'm not sure of their source, but if the EAPM is <200, I think it may be a non-issue. There were several spots in many of the games where AlphaStar can be seen misplaying the micro, similar to how a human might.

I'm not saying there isn't work to be done, but I do think the live game was more balanced than people give it credit for.

[–]gnramires 8 points9 points  (0 children)

The issue is that the quoted EAPM figure is an average across the whole game. Since the bot doesn't need warm-up like humans do, and saves APM at the start of the game when there's not much to do, the figure seems misleading.

[–]ILikeBigButtss 9 points10 points  (1 child)

AlphaStar's interface also granted it 1000's of APM worth of free information that would normally require actions to acquire. Most of this information was probably unneeded, but AlphaStar got it all for free. In general, there are quite a few routine actions that humans perform that are completely useless for AlphaStar. It's hard to compare the numbers as a result. Humans also lose precision and accuracy when moving quickly, but AlphaStar always executes the exact command that it intended.

[–]SwordShieldMouse 81 points82 points  (21 children)

What is the next milestone after Starcraft II?

[–]OriolVinyals[S] 73 points74 points  (3 children)

There are quite a few big and exciting challenges in AI research. The one that I’ve been most interested in is along the lines of “meta-learning”, which is about learning more quickly from fewer data points. This, of course, translates very naturally to StarCraft II -- it would be great both to reduce the experience required to play the game, and to be able to learn and adapt to new opponents rather than “freezing” AlphaStar’s weights.

[–]Prae_ 11 points12 points  (1 child)

Having an AI constantly on the ladder would be awesome. Seeing how it adapts to new patches, new maps and shifts in the metagame.

[–]ILikeBigButtss 22 points23 points  (3 children)

It's got a ways to go on SC2. AlphaStar needs harder limits on its peak APM and reaction time, as well as changes to the interface, to put it on equal footing with humans. TLO doesn't play Protoss at an exceptionally high level. The APM numbers for MaNa vs AlphaStar are much more indicative of the advantage AlphaStar had in terms of APM in high-micro situations.

[–]ZephyrBluu 26 points27 points  (0 children)

I mean, they haven't really hit the SC2 milestone yet. It would be like saying OpenAI hit the Dota 2 milestone when their bot was winning 1v1s.

[–]upboat_allgoals 47 points48 points  (5 children)

Will you actually attempt the true SC2 milestone, the full version with all three races and a vastly larger state space, which it seems the agent already has trouble with?

[–]Dreadnought7410 34 points35 points  (2 children)

Not to mention maps it hasn't seen before, and whether it will be able to adapt on the fly rather than brute-force the issue.

[–][deleted] 15 points16 points  (0 children)

Also more generally, what do you think are some of the important upcoming areas of research in the field?

[–]p2deeee 18 points19 points  (3 children)

Could we see the visualization for the game MaNa won? It would be interesting to see its win probability evaluation.

Did you ever consider a "gg" functionality for when the win probability drops below 1%?

[–]OriolVinyals[S] 10 points11 points  (0 children)

This is indeed a possible way to “GG” but we have not implemented it yet. (we haven't looked into that visualization yet)

[–]weiqiplayer 57 points58 points  (5 children)

How long until AlphaStarZero (training from scratch without imitation learning) comes out?

[–]David_SilverDeepMind 40 points41 points  (3 children)

This is an open research question and it would be great to see progress in this direction. But always hard to say how long any particular research will take!

[–]Firefritos 40 points41 points  (12 children)

Was there any particular reason Protoss was chosen to be the race for the AI to learn?

[–]NikEy 41 points42 points  (8 children)

according to DeepMind in our discord: "because it was the least buggy" - referring to various issues with other races in the Blizz API in the early days

(ironically the merging of Archons wasn't possible until a few months ago, and that was a huge problem for protoss bots)

[–]_HaasGaming 58 points59 points  (4 children)

"because it was the least buggy"

As a former Zerg main, this feels like an attack.

[–]OriolVinyals[S] 20 points21 points  (0 children)

Also Protoss made sense as it is the most technologically advanced race. And we love technology at DeepMind!

[–]heyandy889 52 points53 points  (1 child)

in the stream Oriol said choosing Protoss limited variables

so basically the same reason why my friends taught me Protoss to learn the game xD

[–]Prae_ 34 points35 points  (0 children)

It has been scientifically confirmed guys !

[–]SeriousGains 18 points19 points  (4 children)

@Deepmind team: We didn't see much of AlphaStar's ability to use AoE caster spells like the oracle's stasis ward, high templar's storm or sentry's force field. Is this a greater challenge to teach the AI, or have these strategies been deemed inefficient through AlphaStar League play tests? Also, will we ever get to see an unlimited APM version of AlphaStar?

@TLO: Do you think caster units have the potential to give AIs a greater advantage over human players? For example, would AlphaStar playing as Zerg be even more difficult to beat with ravagers, infestors and vipers?

[–]OriolVinyals[S] 29 points30 points  (0 children)

A few agents use stasis wards and force fields. High Templar storm is far rarer in high-level PvP. We are not interested in unlimited-APM StarCraft.

[–]LiquidTLO1 23 points24 points  (0 children)

Possibly, the control it exhibited so far could make caster units stronger.

However I also believe the decision making when to use spells effectively might be more difficult for AlphaStar to master than we might think. The action space for SC2 is so vast, it’s hard to tell what is easy for the agents and what is difficult without seeing it experimentally, I believe.

[–]Alpha_sc2 23 points24 points  (1 child)

In one of the games AlphaStar got 18 disruptors, is that not enough aoe?

[–]SoulDrivenOlives 65 points66 points  (27 children)

I got the impression that different agents trained in the Alphastar league develop strategies that are quite inflexible. It seemed that the agents were unable to switch to a different composition in the middle of a game. I also noticed that the agents did not play in a reactionary style but rather wanted to be the aggressor and dictate the pace of the game. The progress of Deepmind's Starcraft team left me quite speechless but also somewhat disappointed in the seeming brittleness and hard-codedness of the agents.

To Oriol Vinyals and David Silver: Are my impressions wrong? Are the AlphaStar agents able to change unit compositions mid-game as a reaction to new information they scout during the game? If not, what are your plans to allow that kind of long-term strategic thinking to develop? You mentioned during the stream that the different agents learn independently of one another and therefore don't share knowledge. This seems quite inelegant. How satisfied are you with the current training methods of AlphaStar? I can't quite put my finger on it but something about the league seems clumsy.

To Mana and TLO: Do you think that Alphastar's unwillingness to switch compositions was sub-optimal play? How do you think the agent would deal with a late game zerg tech switch after a big fight?

Bonus question to Oriol Vinyals and David Silver: Any plans on implementing a handicap on AlphaStar's mouse accuracy? Perfect army control, both in intent and execution, will win most games even if the agent's play is not smart. Perhaps DeepMind should differentiate between AlphaStar's intent and actual execution in some way? Maybe it could try to do the correct thing but be unable to execute with 100% precision? It would act much more human that way.

[–]David_SilverDeepMind 65 points66 points  (16 children)

First, the agents in the AlphaStar League are all quite different from each other. Many of them are highly reactive to the opponent and switch their unit composition significantly depending on what they observe. Second, I’m surprised by the comment about brittleness and hard-codedness, as my feeling is that the training algorithm is remarkably robust (at least enough to successfully counter 10 different strategies from pro players) with remarkably little hard-coding (I’m actually not even sure what you’re referring to here). Regarding the elegance or otherwise of the AlphaStar League, of course this is subjective - but perhaps it would help you to think of the league as a single agent that happens to be made up of a mixture distribution over different strategies, that is playing against itself using a particular form of self-play. But of course, there are always better algorithms and we’ll continue to search for improvements.

[–]PM_ME_UR_LIDAR 12 points13 points  (0 children)

Could we perhaps train a "meta-agent" that, given a game state, predicts which agent would do the best in the current scenario? We can run several agents in parallel and let the meta-agent choose which agent's actions to use. This would result in an ensemble algorithm that should allow much more flexible composition shifts and may be easier than trying to train a single agent that is good at reacting to the opponent.
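
A rough sketch of what such a meta-agent could look like, purely to illustrate the commenter's idea. The scoring network, the state feature vector and the agent interface are all invented here; nothing of the sort is described by the AlphaStar team.

```python
import torch
import torch.nn as nn

class MetaAgentSelector(nn.Module):
    """Toy ensemble: score each candidate agent for the current state and delegate to the best."""
    def __init__(self, state_dim, num_agents):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_agents)
        )

    def forward(self, state_features):
        return self.scorer(state_features)               # one score per candidate agent

def ensemble_act(selector, agents, state_features, observation):
    best = int(torch.argmax(selector(state_features)))   # agent predicted to do best right now
    return agents[best](observation)                      # let that agent choose the actual action

# Toy usage: three "agents" that are just placeholder policies.
agents = [lambda obs, i=i: f"action-from-agent-{i}" for i in range(3)]
selector = MetaAgentSelector(state_dim=32, num_agents=3)
print(ensemble_act(selector, agents, torch.randn(32), observation=None))
```

The selector itself would have to be trained (e.g. to predict each agent's win rate from the current state), which is the hard part the comment glosses over.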

[–]AndDontCallMePammy 22 points23 points  (0 children)

Arguably the strongest agents are resorting to massing the most micro-able units (stalkers and phoenix) and brute-forcing their way to victory, peaking at 1500+ APM across multiple 'screens' in game-deciding army engagements. Humans can't execute 34 useful actions (EAPM) in one second, but the AI can if it decides to (while still avoiding an APM cap such as 50 actions over 5 seconds). At the very least this APM burst 'feature' fundamentally separates the human vs human and the human vs AI metagames into two distinct strategy spaces (e.g. stalker vs stalker is perfectly viable on an even playing field but not as a human vs AI, and it has little to do with the AI being more intelligent, just faster)

Of course it is the fault of the pro players for not cheesing enough (understandably, because being forced to abandon standard play is considered shameful among pros, but it's necessary in the face of 1000+ peak EAPM).

[–]LiquidTLO1 46 points47 points  (3 children)

From the games we have experienced it definitely seemed like a weakness. After MaNa and I saw all 10 of the replays, we noticed unit composition still seemed to be a vulnerability.

It’s very hard to tell how it would deal with a Zerg tech switch. I assume if it was training against Zerg it would learn to adapt to it, as it’s such a crucial part of Zerg matchups. Maybe better behaviour would emerge. But we can only speculate.

[–]ZephyrBluu 5 points6 points  (0 children)

Assuming it even gets to the late game. I'd like to see how it responds to Zerg aggression off 2 or 3 bases because those sorts of attacks are insanely hard for Protoss to hold and require very good scouting.

[–]ogs_kfp_t 12 points13 points  (1 child)

During the Go match, commentators said "it's pretty close" while AlphaGo thought it had a 70% chance of winning.

I had a deja vu.

Was that indeed repeated today?

[–]Mogarfnsmash 13 points14 points  (0 children)

For Blizzcon 2019, is it possible that we will have Pro players vs Alphastar show matches on the main stage? Or will the technology not be there yet?

[–]sluggathugga 13 points14 points  (2 children)

For MaNa,

You adopted the mass probe strategy for the final game. Did you practice it against humans first, or was it a decision just for this game? Do you think it worked in your favour (I can't help but imagine yes, given the level of Oracle harass)? And will you be using it in PvP?

[–]SC-MaNa 24 points25 points  (1 child)

I have tried a few games of the constant probe production but not against very good Protoss players yet. I can not tell yet if this is the right approach, because the after-Blizzcon patch has changed the way you can scout in PvP. However, in the exhibition game itself it worked out well. I will surely test it out, and hopefully that will be the first thing that I have learnt from AlphaStar.

[–]LolThisGuyAgain 31 points32 points  (2 children)

Loved the games tonight!

Was just wondering how the AI uses vision. For example, if it sees a glimmer moving across, is it able to recognise that it is a cloaked unit, or if it sees a widow mine pothole, can it recognise that there's a widow mine there without detection (given enough training)? I guess I wanna know what input it receives.

[–]upboat_allgoals 19 points20 points  (1 child)

This is described in the blog post a bit, but it seems it's low res such that it hasn't detected the glimmer. MaNa had pretty great observer use in the game he won.

[–]JesusK 27 points28 points  (4 children)

AlphaStar seemed to end up going for the same group of units that it could abuse with perfect micro. While its average APM was in the 300-400 range, during some micro-intensive moments it would spike heavily and control units in inhuman ways. Also, when talking about APM you have to remember most actions a player makes are spam to check for information rather than micro.

While it was still making decisions on how to proceed based on partial information, it was clear that it relied heavily on these micro-heavy units, and this seemed to be the norm, so it was a lot less about adapting to what the opponent was doing or countering unit compositions, and more about checking whether it could win with the Stalker army it had.

Thus, I was wondering if you have considered heavily limiting the APM, in an attempt to push the AI toward more tactical maneuvers and builds instead.

Even more, if you could train an AI to play at, let's say, only 100 APM, and then drop it into the league with the AlphaStar we saw, it would need to come up with different approaches to win games, given that it cannot just compete in a Stalker-vs-Stalker fight, thus promoting more tactical adaptation from AlphaStar.

Is this even possible, or under consideration, as a way to push the AI onto paths that do not allow it to abuse micro?

[–]OriolVinyals[S] 23 points24 points  (3 children)

Training an AI to play with low APM is quite interesting. In the early days, we had agents trained with very low APMs, but they did not micro at all.

[–]JesusK 9 points10 points  (1 child)

More than playing at low APM, I would like to see a point where the AI's micro is only as good as a top player's, and thus it cannot depend on APM; especially when put against AIs that have better micro, if it wants to win it will have to compensate with another approach.

Would the AI even be able to attempt this compensation or will it just resign to losing?

[–]PEEFsmash 6 points7 points  (0 children)

I think it will be necessary at some point to beat pro players with APM/control that is objectively weaker than human pros' to be totally certain (and convince the SC2 and AI communities) that you've beaten the enemy on -intelligence-. The biggest criticism you've gotten is that the non-intelligence-related abilities of AlphaStar are carrying it. I believe you are able, with time and good work, to beat top players with diamond-level micro, which would only mean one thing...AlphaStar is smarter. Good luck finding that middle ground!

[–]asgardian28 42 points43 points  (2 children)

Thanks for combining my 2 favorite hobbies Sc2 and machine learning, awesome and congrats!

With AlphaZero you demonstrated on chess and go that the computer playing itself only, without being biased by human games, was yielding a superior agent.

Still, with AlphaStar you started with imitation learning (presumably to get some baseline agent that doesn't pull all its workers off mining, for instance).

Do you intend to develop an AlphaStarZero, removing the imitation learning phase, or is it a necessary phase in learning SC2?

[–]OriolVinyals[S] 14 points15 points  (0 children)

Thank you! See other answers here, but in short so far it's been the only way to not get attracted to early game strategies (such as worker rush, all ins, etc.).

[–]SuccessfulPackage[🍰] 12 points13 points  (4 children)

  1. What would be your estimate of the time for the AI to learn the other match-ups and other maps?

  2. Would it be interesting to put the AI in a situation it has never faced before (a new map with a weird layout, etc.) to see if its previous experience would be of any use?

  3. Do you think one day we will be able to create a model that enables the AI to conceptualize the game based on its experience? (It would be cool to see if AIs develop biases too.)

  4. If we were to flood the AI with stupid strategies (drone rushing all the time), would it mess up the weights and make the AI dumber?

Good luck for the rest! May the peak of human evolution, Serral, guard our race!

TY for your work :3

[–]iMPoopi 12 points13 points  (0 children)

Hi, congrats for the amazing work so far!

I have a few questions regarding the latest game (vs MaNa, exhibition game on the camera interface):

It seems that the AI was ahead after the initial harass, and after dealing with MaNa's two-Zealot counter-harass (a lot more income, superior army value, albeit maybe not as strong a composition).

Do you think that if you were to replay the game again and again from that point in time (in human time, since MaNa would have to play again and again as well), the agent in its current iteration would be able to win at least one game? The majority of the games? Or would MaNa find "holes" at least once, or even again and again?

Do you think your current approach would be able to decide that making a Phoenix to defend against the Warp Prism harass would be better than continuing to make Oracles?

Does the agent only try to maximize the global probability of winning, and only make decisions based on that, or can it isolate certain situations (for example the Warp Prism harass) as critical situations that need to be handled in an entirely different way than the game overall?

For example, are you able to know if the AI kept queuing Oracles because overall it would still help win the game, or if it made Oracles because that was the "plan" for this game?

How high level can the explainability of your agent go?

Thank you for your time, and congratulations again for the amazing work in our beloved game and in machine learning.

[–]QuestionMonkey 11 points12 points  (1 child)

  1. Would you ever consider open-sourcing your code?
  2. In the visualization you showed, the times when the white arrow appeared (and the agent was "looking at" the game) didn't follow a consistent rhythm. As I understand LSTM networks, each set of layers in the network corresponds to a given "time step". What defines these time steps in AlphaStar? Are there certain things that define a new time step?
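
Not from the team, but one plausible scheme for a real-time agent, consistent with the irregular rhythm in the visualization, is that each recurrent time step corresponds to one agent decision rather than one game frame, with the action itself including how long to wait before the next decision. Whether AlphaStar does exactly this, the team would have to confirm; the sketch below is a toy illustration of that scheme only, with all names invented.

```python
import random

class ToyAgent:
    """Stand-in for a recurrent policy: returns an action, a delay, and its updated state."""
    def initial_state(self):
        return 0

    def step(self, obs, state):
        action = f"command-at-frame-{obs}"
        wait_frames = random.randint(1, 10)     # the agent itself decides when to "look" next
        return action, wait_frames, state + 1

def run_episode(agent, total_frames=100):
    """One recurrent time step per decision; decisions are spaced irregularly in game frames."""
    state = agent.initial_state()
    frame, steps = 0, []
    while frame < total_frames:
        obs = frame                              # fake observation: just the frame index
        action, wait_frames, state = agent.step(obs, state)
        steps.append((frame, action))
        frame += max(1, wait_frames)             # irregular gaps between the agent's "looks"
    return steps

print(run_episode(ToyAgent())[:5])
```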

[–]jinnyjuice 16 points17 points  (0 children)

Would you ever consider open-sourcing your code?

Knowing all the iterations for AlphaGo to AlphaZero, very unlikely any time soon.

[–]emanuelpalomar 13 points14 points  (1 child)

What happened in the live game? When MaNa harassed AlphaStar's base, it just walked back and forth aimlessly with its Stalkers. It looks like it's very brittle when pushed even a little outside its training distribution and doesn't really understand the game?

Since adversarial attacks (pun intended) like this are a common problem with neural networks, how do you intend to address this weakness?

[–]koglerjs 9 points10 points  (8 children)

Sorry for the wall of text. It starts with a sincere, very simple question:

Why didn't you cast game 5 vs MaNa?

It was an extraordinary demonstration of AlphaStar's capabilities. It:

  • tried to gas steal
  • multiple times!
  • essentially ran a distraction, successfully, with the single pylon
  • feigned a two-base build back home (or just looked like he wanted to get rushed)
  • all while displaying a very weird very aggressive totally new proxy build, because I defy you to find a pro game where someone built such an aggressive stargate and it is arguably the perfect place to put the stargate if you think it through long enough.

You can't ask AlphaStar if that's what it intended to do, and it could just be random noise, but every other game AlphaStar built its base high ground. Low ground cyber core and gateway says second base or "rush me" but it also helps the proxy because you're skipping warpgates because you reeeaally want that robo up ASAP and then you need gas for immos, so every bit of walking you can save is a bonus. MaNa doesn't know that this build does not allow him the time for those low ground buildings to be rushed, so he just sees something weird, twice. Pylon, low ground. What could it be? He gets a full scout and should suspect a proxy, but you have to kill the pylon. That's just a standard response. I am nobody and know nothing about 1s but I think that 99% of pros would have killed the pylon with both stalkers, and the bar for tricks is: do they work once?

I think Tasteless and Rotterdam would have flipped their shit if they'd seen this game on air. The stargate??? AlphaStar put down a stargate?? There? If they haven't seen it yet, tell them I want them to cast their first time watching it. Please. This stargate... it's creative. A person who executed this build in a high-profile match would be rightly termed imaginative.

Try not to be sad, but the game is going to change. Computer provided builds for human players will change this game. I thought I was prepared for DeepMind to be good at StarCraft, but I didn't realize what it would mean. I don't think AI is going to take over the world exactly, but it might be a bit like a cheat code for knowing things when we no longer discover things ourselves.

Heck, if you're any sort of streamer, go watch it and stream it now, and don't read anymore.

Today was historic for many reasons, but I wouldn't be surprised if Game 5 got overlooked at first. "Micro was like we expected it to go! Build orders are working! Wait you didn't restrict the map access?" Really cool stuff, but Game 5 is a first.

Watch that game and tell me if I'm wrong, I'm an idiot barely plat because I need something in my life I'm actively against being a tryhard at. Because the Robo Stargate 1-2 Punch (AlphaStar can't name it, so I can try) has some amount of nuance to it. The thing about these kinds of builds is they only work reliably once. They're like 0-day exploits, sorta. They get known in competitive play and people learn how to play against it.

Your machine came up with a new build. And it surprised MaNa with it... and it tricked him. It just did what was optimal: in games where it built that pylon, other AIs fought that pylon first, and won the time for the pylon to be up and the zealot and stalker to arrive. It doesn't know that AIs are known for doing dumb things like building pylons where they shouldn't, it couldn't possibly try to take advantage of that fact.

A possible counter to this build is to never use two units to kill a pylon built right in your face, otherwise the 3rd proxy pylon gets up in your natural and shield batteries are keeping a small handful of units alive while they're fighting you on your doorstep. MaNa even forces a cancel on some batteries in his natural. But if you scout it sooner, you're still going to have to contend with a zealot and a stalker covered by batteries, they'll just be further away. Those batteries don't get much use in Game 5, which means they're arguably waste, which means AlphaStar was prepared to fight there. This build as countered by MaNa has unused cushion.

I don't know if I have a question for you, but I have a question for MaNa: What did you think when that phoenix came out on top of your prism?

To do a Tastosis thing, "cover the name up and tell me who's playing," and I would have said "Perhaps not even sOs is bold enough to build a stargate in the face of his opponent like that." The only play I've seen with this degree of killer instinct was when Maru decided you were dead, as he would if you got really lucky and beat him in the first match of a best of 3, and then you faced his proxy twice, and the best of 3 was over. This is like an sOs build that Maru decided to use. It's too weird for sOs--too weird for sOs!--and so he'd say "Maru, I can't use this, people would laugh at me" and Maru would play one game as Protoss.

I've rewatched the replay a number of times. Everyone go watch it. Tell me I'm wrong. I don't know that much, I like 4s and I like area effect damage and getting good at 1s would just take too long. The proxy could have been scouted. Trick builds don't work if they're scouted, that's why they're tricks, but we haven't seen this build tested yet.

But I tell you that stargate was as deliberate as AlphaStar can plan to be. The phoenix prevented warp prism drop play from keeping the assault from ending. Void ray is a nice touch for any stalkers which come out.

People save builds like this to pull out in the high pride tournaments where more than just money is on the line. GSL quals are supposedly secret to keep a lid on builds like these. They're like 0-day exploits. Maybe this isn't as astoundingly good a build as I think it is, maybe now that it's an option, people can handle it, because that's the thing about these builds.

Once you see the trick, it doesn't work anymore. But now the work of figuring out how to beat this build begins. For at least a few days, the Masters league should be a bloodbath of people attempting the trick. "Why are you building a pylon in my face?" It's almost rude. But it has a purpose, and that purpose is literally to occupy your units while the build encroaches closer to your throat.

It might not adjust the meta. Probably won't. Meta's still settling a bit anyway from the patch.

But here's my point, and here's why I wrote all of this.

AlphaStar can write and demonstrate a fantastically complicated trick build. This build might not enter into commonly faced threats, but some build from one of your machines will soon do so, and it will adjust the meta.

And after that DeepMind will simply define the meta.

And it's just not going to be the same.

P.S. It's more fitting to call it the AlphaStargate. Congratulations on your success. This was cool to see.

[–]Cock-tail 60 points61 points  (26 children)

  1. So there was an obvious difference between the live version of AlphaStar and the recordings. The new version didn't seem to care when its base was being attacked. How did the limited vision influence that?

  2. The APM of AlphaStar seems to go as high as 1500. Do you think that is fair, considering that those actions are very precise when compared to those performed by a human player?

  3. How well would AlphaStar perform if you changed the map?

  4. An idea: what if you increase the average APM but hard cap the maximum achievable APM at, say, 600?

  5. How come AlphaStar requires less compute power than AlphaZero at runtime?

[–]njc2o 54 points55 points  (18 children)

Vis a vis #2, that's my huge problem with pitting it against humans. SC2 is inherently a physical game. Your mouse can only be at one place at a time. Physically pressing keys and clicking mouse buttons is a huge layer between the brain and the actual units. Your eyes can only focus on one point on the screen, and your minimap awareness either requires eye movement or peripheral vision.

That AlphaStar could see the whole map (minus fog of war) is a huuuge advantage. 1500 APM is crazy, while keeping up perfect blink micro on three fronts and not having to manage control groups or move a mouse or camera. I'd love to see an actual physical bot be the interface between the software and the game. Have it interpret screen data as we see it. Force it to click on a unit to see its upgrades, and not just "know" it. Force it to drag its mouse from boxing a group of units to casting a spell. THAT would be a true competition with human opponents.

The obvious value of this is developing a unique understanding of the game completely independent from the meta or traditional understanding of the game (more or less). Utterly fascinating, and it'd be so cool to see AI ideas impacting the pro scene.

Really exciting times, and I'm amazed by the progress made. Just disappointed by the imbalance in the competitive aspects.

[–]ddssassdd 15 points16 points  (1 child)

Vis a vis #2, that's my huge problem with pitting it against humans. SC2 is inherently a physical game. Your mouse can only be at one place at a time. Physically pressing keys and clicking mouse buttons is a huge layer between the brain and the actual units. Your eyes can only focus on one point on the screen, and your minimap awareness either requires eye movement or peripheral vision.

Without pitting it against humans, how do we arrive at something that we believe is "fair"? We cannot see how much of an advantage something is until it is tested. Maybe some things we think of as advantages won't be, and some things we don't even think of turn out to be advantages.

[–]ZephyrBluu 13 points14 points  (0 children)

Also, pitting it against humans without levelling the playing field means that humans will simply get outdone by perfect mechanics, rather than Alpha* leveraging superior decision making, which is what I thought the whole point of this exercise was.

[–]TheOsuConspiracy 13 points14 points  (12 children)

I'd love to see an actual physical bot be the interface between the software and the game. Have it interpret screen data as we see it. Force it to click on a unit to see its upgrades, and not just "know" it. Force it to drag its mouse from boxing a group of units to casting a spell. THAT would be a true competition with human opponents.

This would make the problem basically untenable.

[–]Pangurbon 19 points20 points  (10 children)

You could simulate it. A physical bot is unnecessary and prohibitive, but forcing it to drag, use hotkeys realistically seems doable.
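
As a sketch of what "simulating" the motor layer could mean, one standard approximation is a Fitts's-law-style movement delay plus click scatter. Every parameter below is made up for illustration; this is one possible way to handicap an agent's interface, not anything DeepMind has described.

```python
import math, random

def fitts_time(distance_px, target_px, a=0.1, b=0.15):
    """Fitts's-law-style estimate (seconds) of how long a human-like cursor move would take."""
    return a + b * math.log2(distance_px / max(target_px, 1) + 1)

def humanized_click(x, y, cursor, target_radius=8, jitter_px=5):
    """Delay the click by a simulated move time and scatter where it actually lands."""
    dist = math.hypot(x - cursor[0], y - cursor[1])
    delay = fitts_time(dist, 2 * target_radius)
    landed = (x + random.gauss(0, jitter_px), y + random.gauss(0, jitter_px))
    return delay, landed       # the agent's click arrives late and slightly off-target

print(humanized_click(400, 300, cursor=(10, 10)))
```

The point is that far-away or small targets become slower and less reliable to hit, which is exactly the constraint a human's mouse hand imposes.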

[–]hexyrobot 9 points10 points  (0 children)

I disagree with your first point. MaNa was able to win in large part due to the fact that the AI would overreact and move its whole army back to its base every time he moved in with his Warp Prism and Immortals. That overreaction meant it didn't move across the map when it had a larger army and gave him time to build the perfect counter composition.

[–]Nevermore60 9 points10 points  (0 children)

As you said, the all-seeing AlphaStar that swept MaNa 5-0 was just....too good. And ultimately I think that probably had a lot to do with the fact that it wasn't limited by a camera view. The way that it was able to micro in the all-stalker game was just god like and terrifying.

As to the new version, it seems a bit more fair, but I have some questions about how the "camera" limitation works. My guess is that in the new implementation, the agent is limited to perceiving certain kinds of specific visual information (e.g., enemy unit movement, friendly units' specific health) only when that information is within the designated camera view. /u/OriolVinyals, /u/David_Silver, is that correct?

As a follow-up question, does the new, camera-limited AlphaStar automatically perceive every bit of information within the camera view instantaneously (or within one processing time unit, e.g. .375 seconds)? That is, if AlphaStar moves the camera to see an army of 24 friendly stalkers, does it instantaneously perceive and process the precise health stats of each one of the stalkers? If this is the case, I still think this is an unnatural advantage over human players -- AlphaStar still seems to be tapped into the raw data information feed of the game, rather than perceiving the information visually. Is that correct? If so, the "imperfect information" that AlphaStar is perceiving is not nearly as imperfect as that that a human player perceives.

I guess I am suggesting that a truly fair StarCraft AI would have to perceive information about the game optically, by looking at a visual display of the ongoing game, rather than being tapped into the raw data of the game and perceiving that information digitally. If you can divorce the AI processor from the processor that's running the game, such that information only passes from the game to the AI processor optically, that'd be the ultimate StarCraft AI, I think.

/u/OriolVinyals, /u/David_Silver, if either of you read this, would love your thoughts. Excellent work on this, I thought the video today was amazing.

[–]Felewin 26 points27 points  (4 children)

Is it possible that, by learning via mimicry of human replays, the AI is gimped – biased by learning bad habits? Or will it ultimately overcome them, given enough experience, meaning that the human replays are more useful than starting from scratch as a way of jumpstarting the learning process?

[–]Kaixhin 17 points18 points  (2 children)

Given what happened with previous Alpha* iterations, seems like imitating humans to start with is easier but suboptimal, possibly but not necessarily in the long run too. With StarCraft they don't even have the benefit of MCTS, so it's much more difficult to get reasonable strategies purely from self-play from scratch. That said, it's presumably what they'd like to achieve in the near future.
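
For context, the "imitating humans to start with" phase is mechanically just supervised learning (behaviour cloning) on observation/action pairs extracted from replays, which then seeds the self-play phase. A minimal sketch, with toy shapes and random data standing in for the real pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a tiny policy and a batch of (observation, human action) pairs from "replays".
policy = nn.Linear(32, 10)                               # 32 observation features -> 10 possible actions
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)
obs_batch = torch.randn(64, 32)
human_actions = torch.randint(0, 10, (64,))

def behavior_cloning_step(policy, optimizer, obs, actions):
    """One supervised step: push the policy toward the action the human took in the replay."""
    loss = F.cross_entropy(policy(obs), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(behavior_cloning_step(policy, optimizer, obs_batch, human_actions))
# After enough of this, the same network seeds the self-play / RL phase,
# so training starts from human-like behaviour instead of random actions.
```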

[–]Dinok410 8 points9 points  (2 children)

While most of the focus seems to be on the visual aspects of the game, I'm curious if Alpha uses any form of sound recognition to help assess the situation. At least for me as a player I've encountered several situations in which sounds actually let you react faster than sight, be it the sound of units firing or dying, a medivac dropping units on the corner of your screen, or even the basic announcer stuff. So, does it utilise sound in any way or, if not, are there plans to implement it in the future?

[–]TheSkunk_2 8 points9 points  (3 children)

Incredibly impressive and entertaining showcase. Congratulations!

Question 1: Distinct Agents

Is there a plan to move away from distinct, separate agents that the team curates (or randomly chooses) to play in a specific order for a series? In my opinion it detracts from the accomplishment. I think the final live match against MaNa is a good example of a flaw of this approach: other agents frequently used Phoenix, but because this particular agent is separate and distinct, it only built Stalkers and Oracles and never built a Phoenix to handle the Warp Prism.

Part of being a professional SC2 pro-gamer is mind-gaming your opponent and deciding which builds to play on which maps in a series. Some of the ballsiest SC2 pros have had to make the incredibly difficult decision to go for a risky cheese in the deciding match of a series. The AI consciously deciding what builds to use on what maps of the series would be truly impressive.

Similarly, it seems to have less ability to switch up builds this way. In a real match, a player might initially have one plan, but decide to be cheesy if they scout their opponent being greedy, or decide to stop cheesing if it's scouted. However, by hand-picking agents that were incentivized to develop specific builds, the developers are essentially hand-picking the AI's play-style before the match. The ramifications of this can be seen in many of the show-matches where the AI stuck with its play-style even when it was not a good idea to do so.

Question 2: Short-Term Memory

In one of the matches, the AI can be seen using a Phoenix to lift up a Stalker from MaNa's advancing army, and then dropping it when it realized the rest of the army was coming. It did this repeatedly, wasting tons of valuable Phoenix energy. This made me wonder if the AI has any kind of short-term memory -- does it literally forget the army exists as soon as it goes out of vision? Do you have any other comment on this particular mishap?

Question 3: Future Improvement

What are the DeepMind team's goals for fixing the AI's current weaknesses? You could simply use the current model, but train it longer - months instead of weeks - and beat stronger players, but this (to me) would be a disappointing approach. Are future goals to learn new maps, new races, new match-ups, and simply brute-force train the agent(s) until they can beat the reigning world champion, or are there plans to adjust the agent on a systematic level to shore up its weaknesses in scouting, map vision, and adaptability, and make it less reliant on individual separate agents and superhuman micro/multitasking? Is a version that starts from ground zero instead of using imitation learning planned?

Question 4: I Want to Play It

You'll hear this a lot -- the SC2 community wants to play DeepMind themselves. I believe you expressed a desire to make this happen, so I wanted to frame this more specifically: what hurdles do you see preventing this from happening? For us laymen, what technical challenges are involved in, say, publishing a separate ladder or game client where players can play against a changing rotation of DeepMind agents? If the main obstacle is the added development cost and time needed to make this happen, have DeepMind and Blizzard considered something like a WarChest to fund the inclusion of a DeepMind client/ladder?

Question 5: Mana and TLO's Thoughts

Congratulations on the show! I would love for either of you to write a more detailed blog about your experience and post it to teamliquid.

My question is what you think of AlphaStar's play in retrospect. Do you see abusable facets of its play now (in hindsight) that you didn't originally see during the match? Do you agree with some fans that its micro/multitasking (particularly the 1,500 APM 3-pronged Stalker surround micro) is unfair and needs to be limited more? How much of the AI's success would you attribute to the sheer unexpectedness of its play-style and the general unfamiliarity of the play environment (TLO not being aware he was facing distinct agents each match, the fact that you both had to play on an old patch), and how much of it is the inherent strategy/play-style of the agent?

[–]SC-MaNa 19 points20 points  (0 children)

It’s hard to say if there is an abusable strategy, because AlphaStar uses different agents every game. However, the approach to the game seems to be a little similar in all of the matches. I definitely did not realise in the first 5 matches that AlphaStar never fully commits to an attack. It always has a ready back-up economy to continue the game. While playing human players, most of the time the attack or defense is dedicated and that is the plan. So when I saw a lot of gateways early on and little to no tech in sight, I was very afraid of losing in the next minute. That led to me being over-defensive and not managing my economy properly. I think in the few games that I have played AlphaStar, its biggest advantage was my lack of information about it. Because I did not know what to expect and how to predict its moves, I was not playing what I feel comfortable with.

[–]LiquidTLO1 16 points17 points  (1 child)

We can only speak about the agents we saw play so far, but from what we experienced there are definitely a lot of things you can do to the agents to throw them off. They seemed weak vs force fields in particular, didn’t fully respect choke points and ramps, and also surprisingly had a harder time with multi-tasking than I expected. It would often pull back a large amount of its units to deal with a small amount of harass.

I partially agree that APM spikes might still be problematic. However, in defense of AlphaStar, there is a hard cap on how many actions it can take; it can decide how to assign them, though. So while it exhibits incredibly fast micro, it might make itself vulnerable by using up all its actions on a specific task like that. In the end I’m sure the team at DeepMind will address the way they go about APM if it really turns out to be an issue. Right now it’s probably too early to tell if it’s a problem considering how few matches we have seen so far. It’ll require longer-term testing from professional SC2 players to find out.

Playing against a completely unknown opponent that we knew nothing about, not even the approximate skill level, was a factor in our matches. I was training PvP for my benchmark matches, however most of the matches I played I faced relatively standard build orders. The way AlphaStar played I had never encountered before, and that’s where my inexperience in PvP showed.

[–]legoboomette 7 points8 points  (4 children)

Why don't you continue training the agents after these exhibitions? Like with AlphaZero as well, it would be interesting to know just how good they would get after more than just a few hours or a week of training.

[–]OriolVinyals[S] 19 points20 points  (0 children)

We prefer to move the research forward, rather than just running the league forever : )

[–]NoseKnowsAll 12 points13 points  (2 children)

One concern with these networks is that extra time does not dramatically improve their performance. If you look at their AlphaZero paper, you'll see that performance improved from complete novice to average master in chess extremely quickly. Getting to GM skill took a bit longer. Becoming a challenger to the Stockfish engine took a lot longer. It's unlikely it improves much more with more time.

[–]Imnimo 6 points7 points  (0 children)

Many of DeepMind's recent high-profile successes have demonstrated the power of self-play to drive continuous improvement in agent strength. In competitive games, the intuitive value of self-play is clear - it provides an opponent of an appropriate difficulty which never gets too far ahead or falls too far behind. I'm curious about your thoughts on applying the self-play dynamic to cooperative games such as communication learning and multi-agent coordination tasks. In these settings, is there an additional risk of self-play leading to convergence to trivial or mediocre strategies, due to the lack of a drive to exploit an opponent and avoid being exploitable? Or could a self-play system like AlphaZero be slotted into a cooperative setting pretty much as-is?

[–]LetoAtreides82 6 points7 points  (0 children)

Thank you for demonstrating your project to us, I very much enjoyed it. I'm a former StarCraft 2 player, and I often dreamed of what it would be like to have an AI master the game. I have a few questions:

  1. Can we expect another demonstration soon, perhaps another 5 games against Mana with a much improved version of the camera interface agent?

  2. Are there plans to challenge one of the current top three players in the world in the near future?

  3. Would it be possible to make a version of AlphaStar that can play on the StarCraft 2 ladder against actual players? It'd be interesting to see how high of an MMR it can achieve on the ladder.

[–]DeadWombats 8 points9 points  (1 child)

One moment that particularly impressed me was when AlphaStar decided it was losing a fight and recalled its army to prevent more losses.

How does AlphaStar gauge confidence during a battle? What circumstances does it consider before deciding to pursue or retreat?

[–]OldConfusedMan 6 points7 points  (1 child)

Fantastic accomplishment today; have to admit I was cheering for AlphaStar! I work in machine translation; can you explain a bit about the natural language techniques you borrowed for AlphaStar, and what innovations you believe could flow back to machine translation?

[–]fhuszar 23 points24 points  (0 children)

Isn't the unit-level micro-management aspect inherently unfair in favour of computers in StarCraft?

In Go, any sequence of moves AlphaGo makes, Lee Sedol can easily imitate, and vice versa. This is because there is no critical sensorimotor control element there.

In StarCraft, when you play with a mouse and keyboard, there is a motor component. Any sequence of moves that a human player makes, AlphaStar can "effortlessly" imitate, because from its perspective it's just a sequence of symbols. But a human player might struggle to imitate an action sequence of AlphaStar, because a particular sequence of symbols might require an unreasonable or very difficult motor sequence.

The metaphor I have in mind here is playing a piano: keystrokes-per-minute is not the only metric that describes the difficulty of playing a particular piece. For a human, hitting the same key 1000 times is a lot easier than playing a random sequence of 1000 notes. From a computer's perspective, hitting the same key 1000 times and playing a random sequence of 1000 notes are equally difficult from an execution standpoint (whether you can learn the sequence or not is beside my point now).

[–]Nostrademous 6 points7 points  (0 children)

  1. Any plans to release a full write-up of the algorithms and their inter-connections and inter-relations?
  2. Seems 20-year-old research is new again all over the place (e.g., your work on Relational Deep RL); what do you think will be the next old-becomes-new approach?
  3. What are your thoughts on hierarchical RL approaches?
  4. In your write-up (https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/) you seem to characterise Dota 2 as an "easier" game than SC2. Is that your stance or just bad wording, and how come?

Thanks

[–]AxeLond 7 points8 points  (1 child)

I looked over one of the replays, game 4 vs MaNa, and I could spend the whole game looking at just how AlphaStar handles its workers. 2/3 and 16/16 workers on gas/minerals is 100% efficient, and 3/3 and up to 24/16 has diminishing returns. It seems to have very hard priorities: on 2 bases it maintains exactly 48 workers, with 17 on minerals in the main base and 19 in the natural. After not building probes for almost 2 minutes, when the third base finishes it builds 1 extra probe, going from 48 to 49, and doesn't build any more for the rest of the game.

If you see an agent doing something specific like this, is it possible to dissect its brain to find out if there's a specific rule it follows to decide whether it should build a 49th worker or not, or is it just impossible to understand what kinds of rules the neural network follows to decide what it should do in specific situations?

[–]OriolVinyals[S] 8 points9 points  (0 children)

This is very interesting analysis, thanks! As you say, it is very hard to know why AlphaStar is doing what it's doing -- understanding neural networks is an exciting and incredibly active research topic.
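
As one concrete, if very limited, example of what "dissecting its brain" can look like in practice: gradient-based saliency asks which input features most affect a particular output. The sketch below assumes a generic policy network and a made-up action index; it illustrates the technique in general, not how AlphaStar is actually analysed.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))  # toy policy network
obs = torch.randn(1, 16)                     # fake observation features
ACTION_BUILD_PROBE = 3                       # made-up action index, purely for illustration

def input_saliency(policy, obs, action_index):
    """Gradient of one action's score w.r.t. the inputs: which features that decision is sensitive to."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy(obs)
    logits[0, action_index].backward()
    return obs.grad.abs().squeeze(0)

print(input_saliency(policy, obs, ACTION_BUILD_PROBE))
# Large entries point at the observation features (worker count, minerals, ...) that most
# influence this particular decision -- a very partial answer to "why did it do that?".
```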

[–]donkeykong1774 29 points30 points  (5 children)

Firstly, amazing job! Congrats to everyone on the team. This is an incredible feat, and it's a joy to watch the decision making and especially the reactions of those playing and commenting.

  1. Could you go into some more detail on the networks used (especially the LSTMs), and what the visualization with the 3 regions with the pink colormaps meant? How does the network compare to the DQN networks used for playing Atari, and the MCTS network used in AlphaGo Zero?
  2. How did you evaluate which 5 versions of AlphaStar were the least likely to be exploited? Were they simply the 5 strongest players?
  3. I seem to recall someone mentioned briefly that there were reaction times of 50ms from AlphaStar? That seems faster than human capabilities.
  4. Is there a version of AlphaStar trained purely using self-play, like AlphaGo Zero?
  5. What did the likelihood of winning plot look like for the last live game? Did the game realize it had lost at the same time as the commentators? How did this compare for the other games?

[–]Arkitas 10 points11 points  (3 children)

To your 3rd question, their blog contains a comment about reaction time:

" Additionally, AlphaStar reacts with a delay between observation and action of 350ms on average. "

[–]Felewin 5 points6 points  (1 child)

Do you expect the behavior to eventually condense toward a particular strategy, or to further diverge into counter-strategies? What trend do you see for amount of strategic diversity as experience increases?

[–]AspiringInsomniac 6 points7 points  (0 children)

- How hard is it to generalize the algorithms used to be able to train an NN to perform on a generic map? I.e., have an NN that can play on a map that it hasn't trained on (but with all other mechanics otherwise the same).

- You mentioned that there are a variety of agents and you handpicked them to play against the human player for some interesting play. Is there any plan to create a "superagent" which would be an AI to pick strategies based on trying to win series of games? One could imagine an AlphaStar League where matches are best-of-N's.

[–]Voltz_sc2 6 points7 points  (1 child)

Do you have plans to continue training AlphaStar? Because in its current state, while it is impressive, it has no chance of taking on the world champion Joona Sotala.