Reinforcement learning is challenging in environments with large state-action spaces, as exploration can be highly inefficient. Even when the dynamics are simple, the optimal policy can be combinatorially hard to discover. In this work, we propose a hierarchical approach to structured exploration that improves the sample efficiency of on-policy exploration in large state-action spaces. The key idea is to model a stochastic policy as a hierarchical latent variable model, which can learn low-dimensional structure in the state-action space, and to define exploration by sampling from the low-dimensional latent space. This approach enables lower sample complexity while preserving the expressiveness of the policy class. To make learning tractable, we derive a joint learning and exploration strategy by combining hierarchical variational inference with actor-critic learning. Our learning approach is principled, simple to implement, scalable to settings with many actions, and composable with existing deep learning methods. We evaluate our approach by learning a deep centralized multi-agent policy, since multi-agent environments naturally have an exponentially large state-action space. We demonstrate that, compared to conventional baselines, our approach learns optimal policies more efficiently in challenging multi-agent games with a large number ($\sim 20$) of agents.
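As a minimal sketch of how such a combination can be instantiated (assuming, for illustration, a single latent variable $\lambda$ with prior $p_\theta(\lambda \mid s)$, action decoder $\pi_\theta(a \mid \lambda, s)$, and variational posterior $q_\phi(\lambda \mid a, s)$; these symbols are illustrative placeholders rather than the exact hierarchical parameterization used here), the intractable marginal log-likelihood of the latent variable policy admits a standard evidence lower bound,
\[
\log \pi_\theta(a \mid s) \;\geq\; \mathbb{E}_{q_\phi(\lambda \mid a, s)}\!\left[\log \pi_\theta(a \mid \lambda, s)\right] \;-\; \mathrm{KL}\!\left(q_\phi(\lambda \mid a, s)\,\|\,p_\theta(\lambda \mid s)\right),
\]
which can stand in for $\log \pi_\theta(a \mid s)$ in the policy-gradient term of an actor-critic update, while exploration proceeds by first sampling $\lambda \sim p_\theta(\lambda \mid s)$ from the low-dimensional latent space and then sampling actions $a \sim \pi_\theta(a \mid \lambda, s)$.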