Hide-n-seek environment. Source: OpenAI

AI is having all the fun: developing human-relevant skills playing hide-n-seek

What took humans millions of years of evolution took AI multi-agents only days

Ksenia Se
4 min read · Sep 18, 2019


It’s good to be AI. People create special worlds for you: colorful playgrounds where you play millions and millions of game rounds, learning new things and recreating simple human skills that we developed only through cruel natural selection.

In its new paper, OpenAI reveals some surprising results on multi-agent dynamics and the emergence of intelligent behavior.

The blue and red cartoon figures played hide-n-seek and evolved a series of six distinct strategies and counter-strategies. The surprising thing is that the game’s creators didn’t know some of those strategies were even supported by their environment.

True intelligence?

Not yet, but the promise is big. OpenAI researchers showed that multi-agent competition, supported by standard reinforcement learning, can lead to increasingly sophisticated, complex behavior.

Multi-agent competition is one method for learning skills without supervision. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. It was inspired by how real animals learn: an agent takes actions in an environment, receives rewards, and adjusts its behavior to maximize the cumulative reward.
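To make that loop concrete, here is a minimal tabular Q-learning sketch on a toy one-dimensional world. Everything in it (the LineWorld environment, the hyperparameters) is an illustrative assumption of mine, not OpenAI’s code; the actual agents were trained at a vastly larger scale with self-play and policy optimization.

```python
import random

# Toy environment to make the reward loop concrete: the agent walks
# along a line and earns a reward of 1 for reaching the right end.
# Purely illustrative; this is NOT the hide-and-seek environment.
class LineWorld:
    def __init__(self, size=5):
        self.size = size

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = step left, 1 = step right
        self.pos = max(0, min(self.size - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.size - 1
        return self.pos, (1.0 if done else 0.0), done

# Tabular Q-learning: the textbook version of "a reward system that
# reinforces useful actions" — act, observe the reward, nudge the
# value estimate of that action toward the observed return.
env = LineWorld()
q = [[0.0, 0.0] for _ in range(env.size)]  # q[state][action]
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

for _ in range(200):
    state, done = env.reset(), False
    while not done:
        if random.random() < eps or q[state][0] == q[state][1]:
            action = random.choice((0, 1))  # explore / break ties randomly
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = env.step(action)
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

print(q)  # values now point toward the rewarding right-hand end
```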

Hide-n-seek environment and the rules

The red seekers were tasked with keeping the blue hiders in sight; the hiders, for their part, had to stay out of the seekers’ line of sight. Hiders could take cover behind immobile walls, movable barricades, boxes, and ramps scattered across randomly generated rooms. Agents could lock objects in place to prevent the other team from using them. The seekers were immobilized during a preparation phase, giving the hiders time to run, hide, and/or rearrange the environment.
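For readers who think in code, the rules can be summarized roughly like this. The field names below are hypothetical, chosen for illustration; they do not mirror OpenAI’s implementation.

```python
from dataclasses import dataclass

# A hypothetical summary of the rules described above. All names are
# illustrative; they do not mirror OpenAI's actual environment code.
@dataclass
class HideAndSeekRules:
    n_hiders: int = 2
    n_seekers: int = 2
    movable_objects: tuple = ("barricade", "box", "ramp")  # walls stay fixed
    random_layout: bool = True      # rooms and object positions are randomized
    objects_lockable: bool = True   # either team can lock objects in place
    preparation_phase: bool = True  # seekers are frozen at the start, giving
                                    # hiders time to run, hide, or rearrange
                                    # the environment
```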

Source: OpenAI.com

The team of hiders was given a reward of +1 if all hiders were hidden and a penalty of -1 if any hider was seen by a seeker; the seekers received the opposite. There was no explicit incentive to interact with objects or to learn to navigate, no reward during the preparation phase, and no individual rewards either.
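As a sketch, the reward scheme reduces to a few lines. Again, this is my paraphrase of the rules above, not the paper’s implementation:

```python
def team_rewards(any_hider_seen: bool, in_preparation: bool) -> dict:
    """Team-level rewards as described above: zero-sum between teams,
    shared within each team, and switched off during preparation."""
    if in_preparation:
        return {"hiders": 0.0, "seekers": 0.0}  # no reward while seekers are frozen
    if any_hider_seen:
        return {"hiders": -1.0, "seekers": +1.0}
    return {"hiders": +1.0, "seekers": -1.0}
```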

481 million game rounds later…

The agents’ behavior developed through six distinct phases.

Chasing (episodes 0–22 million)

During the first few million rounds, the players moved randomly. The seekers were learning how to chase, and the hiders how to evade them. There was no object manipulation at this point.

Shelter construction (episodes 22–88 million)

Source: OpenAI.com

After 20+ million rounds of chasing and hiding, the game became a bit more sophisticated. The blue hiders learned not only to move the boxes and walls to shelter themselves, but also to lock them in place so their opponents could not move them. They also learned to coordinate and block the entrances as quickly as possible.

Ramp use (episodes 88–115 million)

Closer to the 100-million-round mark, the seekers came up with a counter-strategy: they used ramps to climb over the walls and the locked boxes.

Ramp defense (episodes 115–388 million)

Around round 115 million, the hiders learned to lock the ramps in place during the preparation phase so the seekers couldn’t use them.

Box surfing (episodes 388–458 million)

Quite unexpectedly, the seekers learned the next counter-strategy: they jumped on top of a box and “surfed” it to the shelter where the blue figures were hiding; from atop the boxes, the red seekers could get over the obstacles.

Surf defense (episodes 458–481 million)

Source: OpenAI.com

Finally, the hiders learned to lock the boxes during the preparation phase to prevent box surfing.

Conclusion

The researchers found that intrinsic motivation methods are not as effective as multi-agent competition at enhancing learning. As environments grow larger and more sophisticated, it is multi-agent competition, not intrinsic motivation, that promises to generate human-relevant skills in an unsupervised manner.

What also fascinates me is the complexity of building environments. The tricky part is that when we humans build an environment or a physics engine, we cannot foresee how it will be used, or what opportunities may be hidden in it for an incentivized agent.

How agents understand us, and what could be the aftermath of the finite human mind, is the topic for my next article.


Ksenia Se

I build Turing Post, equipping you with in-depth knowledge and analysis to make smarter decisions about AI & ML -> https://www.turingpost.com/subscribe