In this post, we are going to discuss one good idea from 2017: the deep variational information bottleneck [2]. Then we will discuss how the idea can be applied to exploration in meta-RL [1].
Mutual Information
We will start warming up by revisiting a classic concept in information theory: mutual information [3]. Mutual information measures the amount of information obtained about one random variable by observing the other random variable:

$$I(X; Y) = D_{KL}\big(P_{(X,Y)} \,\|\, P_X \otimes P_Y\big) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{(X,Y)}(x, y) \log \frac{p_{(X,Y)}(x, y)}{p_X(x)\, p_Y(y)}.$$

From [3], we can also see how the equivalent entropy-based forms are derived:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y).$$
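To make the definition concrete, here is a minimal numpy sketch that computes $I(X; Y)$ for two discrete random variables (the joint distribution values are made up for illustration):

```python
import numpy as np

# A made-up joint distribution p(x, y) of two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)  # ~0.193 nats
```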
Stochastic Variational Inference and VAE
Stochastic variational inference (SVI) is a useful technique to approximate intractable posterior distributions. One good example is to use SVI for VAE. We have introduced SVI [4] and VAE [5] separately. In this post, we are going to explain both concepts again, this time unifying them. A stackexchange post [7] also helped shape my writing.
Suppose we have data $x$ and hypothesize the data is generated by a latent process $p_\theta(x|z)$, starting from a latent code $z$.

Then what we want to do is to maximize $\log p_\theta(x)$, sometimes called the “log evidence”. Let $p_\theta(z|x)$ describe the posterior probability of $z$ given observing $x$, and $p_\theta(x, z)$ describe the joint probability of $x$ and $z$. Note that $p_\theta(z|x)$ is infeasible to compute in general. Therefore, we introduce $q_\phi(z|x)$ to approximate $p_\theta(z|x)$, with $\phi$ being learnable parameters. $q_\phi(z|x)$ is tractable (for example, a neural network whose outputs represent a Gaussian distribution’s mean and variance) but can only approximate the true posterior distribution $p_\theta(z|x)$. It turns out that $\log p_\theta(x)$ can be rewritten into [see 6, page 22-23 for derivation]:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x, z) - \log q_\phi(z|x)\big].$$
We call the second term, $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x, z) - \log q_\phi(z|x)\big]$, the evidence lower bound (ELBO). We have $\log p_\theta(x) \geq \text{ELBO}$ because $D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) \geq 0$. Therefore, we can maximize the ELBO w.r.t. $\theta$ and $\phi$ in order to maximize $\log p_\theta(x)$.
The ELBO can be further derived into [see derivation in 5]:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big),$$

where $p(z)$ is the prior for the latent code (e.g., a standard normal distribution). In VAE, we also use a deterministic neural network, the decoder, to parameterize $p_\theta(x|z)$. Overall, $\theta$ and $\phi$ can be learned from minibatch samples, and when the ELBO is maximized, $q_\phi(z|x)$ closely approximates $p_\theta(z|x)$.
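To make the ELBO concrete, here is a minimal PyTorch sketch of a VAE loss, assuming a Gaussian $q_\phi(z|x)$, a standard normal prior $p(z)$, and a Bernoulli decoder; the architecture and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_logits = self.dec(z)
        # One-sample estimate of E_q[log p(x|z)] under a Bernoulli decoder
        recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # Analytic KL( q_phi(z|x) || N(0, I) )
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        elbo = recon - kl
        return -elbo  # minimize the negative ELBO
```

The KL term has a closed form because both $q_\phi(z|x)$ and $p(z)$ are Gaussians; the reconstruction term is a one-sample Monte Carlo estimate made differentiable by the reparameterization trick.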
Deep Variational Information Bottleneck
If you view VAE as a clever idea to encode information of $X$ (data) in an unsupervised learning setting, Deep Variational Information Bottleneck [2] is an extended idea to encode latent information from $X$ (data) to $Y$ (label) in a supervised learning setting. The objective is to encode $X$ into a latent code $Z$ with as little mutual information $I(Z; X)$ as possible, while making $Z$ preserve as much mutual information with $Y$ as possible:

$$\max \; I(Z; Y) - \beta\, I(Z; X).$$
After some derivation shown in [2], we can show that we can instead maximize a lower bound (notations are slightly different from [2] because I want them to be consistent within this post):

$$I(Z; Y) - \beta\, I(Z; X) \;\geq\; \mathbb{E}_{(x, y)}\,\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(y|z)\big] \;-\; \beta\, \mathbb{E}_{x}\Big[D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)\Big],$$

where, again, $q_\phi(z|x)$ is the variational distribution approximating the true posterior distribution $p(z|x)$, and $p_\theta(y|z)$ is the decoder network.
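Here is a minimal sketch of this lower bound as a PyTorch training loss, assuming a Gaussian encoder, a standard normal prior, and a classification decoder; all architecture details are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepVIB(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, n_classes=10, beta=1e-3):
        super().__init__()
        self.beta = beta
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, n_classes)  # decoder p(y|z)

    def forward(self, x, y):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # One-sample estimate of E_{q(z|x)}[log p(y|z)]
        log_lik = -F.cross_entropy(self.dec(z), y, reduction="mean")
        # Analytic KL( q(z|x) || N(0, I) ), averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return -(log_lik - self.beta * kl)  # minimize the negative lower bound
```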
Meta-RL Exploration By a Deep Variational Information Bottleneck Method
With all necessary ingredients introduced, we now introduce how meta-RL exploration can benefit from the information bottleneck [1]. The basic meta-RL setup is that we have diverse environments. Each environment is represented with a tensor (could be a one-hot encoding) $\mu$, which is known at training time but unknown at testing time. The authors of [1] propose to learn two policies: $\pi^{exp}$ for exploring environments with the goal to collect as much information about the environment as possible, and $\pi^{task}$ for exploiting an environment with a known encoded tensor $z$. At training time, we also learn $F_\psi(z|\mu)$, an encoder that encodes the environment tensor $\mu$ (available at training time), and $q_\omega(z|\tau)$, a variational encoder which converts a trajectory $\tau$ generated by $\pi^{exp}$ into an encoded tensor. The variational encoder $q_\omega$ is learned to match $F_\psi$ at training time. Once $\pi^{exp}$, $\pi^{task}$, $F_\psi$, and $q_\omega$ are learned, at testing time we can run $\pi^{exp}$ to collect trajectories $\tau$, use $q_\omega$ to determine the environment’s encoded tensor $z$, and run $\pi^{task}$ on top of $z$ to maximize rewards.
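To summarize the test-time procedure, here is a schematic Python sketch; every helper here (`run_policy`, the env interface, `q_omega` as a callable) is a hypothetical placeholder, not the paper's actual API:

```python
def run_policy(env, policy, horizon=50):
    """Roll out `policy` in `env` and return the trajectory (hypothetical helper)."""
    state, trajectory = env.reset(), []
    for _ in range(horizon):
        action = policy(state)
        state, reward, done = env.step(action)  # assumed 3-tuple interface
        trajectory.append((state, action, reward))
        if done:
            break
    return trajectory

def meta_test(env, pi_exp, pi_task, q_omega):
    # 1. Run the exploration policy to collect a trajectory tau.
    tau = run_policy(env, pi_exp)
    # 2. Infer the environment's encoded tensor z from the trajectory.
    z = q_omega(tau)
    # 3. Run the exploitation policy conditioned on z to maximize rewards.
    return run_policy(env, lambda state: pi_task(state, z))
```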
The paper uses the mutual information / deep variational information bottleneck ideas in two places. First, when we learn $F_\psi$ and $\pi^{task}$, we use the following loss function to encourage $z$ to encode minimal information from $\mu$:

$$\max_{\psi,\, \pi^{task}} \; \mathbb{E}_{\mu,\, z \sim F_\psi(z|\mu)}\big[V^{task}(\mu, z)\big] \quad \text{s.t.} \quad I(z; \mu) \leq I_c,$$

where $V^{task}(\mu, z)$ denotes the expected return of $\pi^{task}$ conditioned on $z$ in environment $\mu$, and $I_c$ is a constant. The constrained optimization loss function can be converted to an unconstrained loss function by the Lagrangian method, with $\beta$ set as a hyperparameter [8]:

$$\max_{\psi,\, \pi^{task}} \; \mathbb{E}_{\mu,\, z \sim F_\psi(z|\mu)}\big[V^{task}(\mu, z)\big] - \beta\, I(z; \mu).$$
Using the same derivation from [2] (Eqn. 13 & 14), we know that a lower bound of $-I(z; \mu)$ is $-\mathbb{E}_{\mu}\big[D_{KL}\big(F_\psi(z|\mu) \,\|\, r(z)\big)\big]$, which has an analytic form when the prior $r(z)$ is chosen properly (e.g., Gaussian). Thus the unconstrained loss function can be maximized through a lower bound:

$$\max_{\psi,\, \pi^{task}} \; \mathbb{E}_{\mu,\, z \sim F_\psi(z|\mu)}\big[V^{task}(\mu, z)\big] - \beta\, \mathbb{E}_{\mu}\Big[D_{KL}\big(F_\psi(z|\mu) \,\|\, r(z)\big)\Big].$$
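For the analytic form: if $F_\psi(z|\mu) = \mathcal{N}(m, \operatorname{diag}(\sigma^2))$ and the prior is $r(z) = \mathcal{N}(0, I)$, the KL term is available in closed form. A quick sketch:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over z dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```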
Second, we encourage $\pi^{exp}$ to maximize the mutual information between the trajectories $\tau$ explored by $\pi^{exp}$ and the encoded tensor $z$:

$$I(\tau; z) = H(z) - H(z|\tau) \;\geq\; H(z) + \mathbb{E}_{\tau, z}\big[\log q_\omega(z|\tau)\big].$$

(The inequality uses the fact that the KL divergence between the true posterior distribution $p(z|\tau)$ and the variational distribution, $q_\omega(z|\tau)$, is greater than or equal to 0.)
As we see in the paper, $q_\omega$ is learned to match $F_\psi$, while, with some trick to rearrange $\mathbb{E}_{\tau, z}\big[\log q_\omega(z|\tau)\big]$ into a telescoping sum over timesteps, we can optimize $\pi^{exp}$ in an MDP with the per-step reward set as the information gain of that step, $\log q_\omega(z \,|\, \tau_{:t+1}) - \log q_\omega(z \,|\, \tau_{:t})$.
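A schematic sketch of this per-step information-gain reward, where `q_omega.log_prob` is a hypothetical interface returning $\log q_\omega(z \,|\, \tau_{:t})$ given the trajectory prefix up to step $t$:

```python
def exploration_rewards(q_omega, z, trajectory):
    """Per-step rewards: how much each new transition increases the
    variational encoder's log-probability of the true encoding z."""
    rewards = []
    for t in range(len(trajectory)):
        log_q_before = q_omega.log_prob(z, trajectory[:t])      # log q(z | tau_{:t})
        log_q_after = q_omega.log_prob(z, trajectory[:t + 1])   # log q(z | tau_{:t+1})
        rewards.append(log_q_after - log_q_before)
    return rewards
```

Summing these rewards over a trajectory telescopes back to $\log q_\omega(z|\tau)$ (up to the empty-prefix term), which is why maximizing them optimizes the mutual information lower bound above.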
References
[1] Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices: https://arxiv.org/pdf/2008.02790
[2] Deep Variational Information Bottleneck: https://arxiv.org/pdf/1612.00410
[3] https://en.wikipedia.org/wiki/Mutual_information
[4] https://czxttkl.com/2019/05/04/stochastic-variational-inference/
[5] https://czxttkl.com/2020/04/06/optimization-with-discrete-random-variables/
[6] https://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Fall.2015/slides/lec12.vae.pdf