Which Mutual Information Representation Learning Objectives are Sufficient for Control? – The Berkeley Artificial Intelligence Research Blog



Processing raw sensory inputs is crucial for applying deep RL algorithms to real-world problems.
For example, autonomous vehicles must make decisions about how to drive safely given information flowing from cameras, radar, and microphones about the conditions of the road, traffic signals, and other cars and pedestrians.
However, direct “end-to-end” RL that maps sensor data to actions (Figure 1, left) can be very difficult because the inputs are high-dimensional, noisy, and contain redundant information.
Instead, the challenge is often broken down into two problems (Figure 1, right): (1) extract a representation of the sensory inputs that retains only the relevant information, and (2) perform RL with these representations of the inputs as the system state.



Figure 1. Representation learning can extract compact representations of states for RL.

A wide variety of algorithms have been proposed to learn lossy state representations in an unsupervised fashion (see this recent tutorial for an overview).
Recently, contrastive learning methods have proven effective on RL benchmarks such as Atari and DMControl (Oord et al. 2018, Stooke et al. 2020, Schwarzer et al. 2021), as well as for real-world robotic learning (Zhan et al.).
While we could ask which objectives are better in which circumstances, there is a more basic question at hand: are the representations learned via these methods guaranteed to be sufficient for control?
In other words, do they suffice to learn the optimal policy, or might they discard some important information, making it impossible to solve the control problem?
For example, in the self-driving car scenario, if the representation discards the state of traffic lights, the car would be unable to drive safely.
Surprisingly, we find that some widely used objectives are not sufficient, and in fact do discard information that may be needed for downstream tasks.

Defining the Sufficiency of a State Representation

As introduced above, a state representation is a function of the raw sensory inputs that discards irrelevant and redundant information.
Formally, we define a state representation $\phi_Z$ as a stochastic mapping from the original state space $\mathcal{S}$ (the raw inputs from all the car’s sensors) to a representation space $\mathcal{Z}$: $p(Z | S=s)$.
In our analysis, we assume that the original state $\mathcal{S}$ is Markovian, so each state representation is a function of only the current state.
We depict the representation learning problem as a graphical model in Figure 2.



Figure 2. The representation learning problem in RL as a graphical model.

We will say that a representation is sufficient if it is guaranteed that an RL algorithm using that representation can learn the optimal policy.
We leverage a result from Li et al. 2006, which proves that if a state representation is capable of representing the optimal $Q$-function, then $Q$-learning run with that representation as input is guaranteed to converge to the same solution as in the original MDP (if you’re interested, see Theorem 4 in that paper).
So to test whether a representation is sufficient, we can check whether it is able to represent the optimal $Q$-function.
Since we assume we don’t have access to a task reward during representation learning, to call a representation sufficient we require that it can represent the optimal $Q$-functions for all possible reward functions in the given MDP.
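
As a rough formalization of this criterion (our paraphrase; see the paper for the precise statement), a representation is sufficient if the optimal $Q$-function for every reward function can be expressed through the representation alone:

$$
\phi_Z \text{ is sufficient} \iff \forall\, r,\ \exists\, \tilde{Q} : \mathcal{Z} \times \mathcal{A} \to \mathbb{R} \ \text{ s.t. }\ \tilde{Q}(z, a) = Q^*_r(s, a) \ \ \forall\, s \in \mathcal{S},\ a \in \mathcal{A},\ z \in \operatorname{supp}\, p(\cdot \mid s).
$$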

Analyzing Representations Learned via MI Maximization

Now that we’ve established how we will evaluate representations, let’s turn to the methods for learning them.
As mentioned above, we aim to study the popular class of contrastive learning methods.
These methods can largely be understood as maximizing a mutual information (MI) objective involving states and actions.
To simplify the analysis, we analyze representation learning in isolation from the other aspects of RL by assuming the existence of an offline dataset on which to perform representation learning.
This paradigm of offline representation learning followed by online RL is becoming increasingly popular, particularly in applications such as robotics where collecting data is onerous (Zhan et al. 2020, Kipf et al. 2020).
Our question is therefore whether the objective is sufficient on its own, not as an auxiliary objective for RL.
We assume the dataset has full support on the state space, which can be guaranteed by an epsilon-greedy exploration policy, for example.
An objective may have more than one maximizing representation, so we call a representation learning objective sufficient if all the representations that maximize that objective are sufficient.
We will analyze three representative objectives from the literature in terms of sufficiency.

Representations Learned by Maximizing “Forward Information”

We begin with an objective that looks likely to retain a great deal of state information in the representation.
It is closely related to learning a forward dynamics model in latent representation space, and to methods proposed in prior works (Nachum et al. 2018, Shu et al. 2020, Schwarzer et al. 2021): $J_{fwd} = I(Z_{t+1}; Z_t, A_t)$.
Intuitively, this objective seeks a representation in which the current state and action are maximally informative of the representation of the next state.
Therefore, everything predictable in the original state $\mathcal{S}$ should be preserved in $\mathcal{Z}$, since this would maximize the MI.
Formalizing this intuition, we are able to prove that all representations learned via this objective are guaranteed to be sufficient (see the proof of Proposition 1 in the paper).
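
For concreteness, here is a minimal sketch (not the paper’s exact implementation) of how $J_{fwd} = I(Z_{t+1}; Z_t, A_t)$ can be lower-bounded with an InfoNCE-style contrastive loss, treating the other transitions in a batch as negatives. The encoder and projection architectures are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveForwardInfo(nn.Module):
    """InfoNCE-style lower bound on I(Z_{t+1}; Z_t, A_t); a sketch, not the paper's exact code."""

    def __init__(self, obs_dim, action_dim, z_dim=64):
        super().__init__()
        # Placeholder encoder for flat observations; for images this would be a CNN.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        # Projects (z_t, a_t) into representation space to score candidate z_{t+1}.
        self.projection = nn.Sequential(
            nn.Linear(z_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )

    def loss(self, s_t, a_t, s_tp1):
        z_t = self.encoder(s_t)                               # (B, z_dim)
        z_tp1 = self.encoder(s_tp1)                           # (B, z_dim)
        query = self.projection(torch.cat([z_t, a_t], -1))    # (B, z_dim)
        logits = query @ z_tp1.t()                            # (B, B) similarity scores
        # Diagonal entries pair each (z_t, a_t) with its true z_{t+1};
        # the other batch elements serve as negatives.
        labels = torch.arange(s_t.shape[0], device=s_t.device)
        return F.cross_entropy(logits, labels)
```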

While it is reassuring that $J_{fwd}$ is sufficient, it’s worth noting that any state information that is temporally correlated will be retained in representations learned via this objective, no matter how irrelevant to the task.
For example, in the driving scenario, objects in the agent’s field of view that are not on the road or sidewalk would still be represented, even though they are irrelevant to driving.
Is there another objective that can learn sufficient but lossier representations?

Representations Learned by Maximizing “Inverse Information”

Next, we consider what we term an “inverse information” objective: $J_{inv} = I(Z_{t+k}; A_t | Z_t)$.
One way to maximize this objective is by learning an inverse dynamics model – predicting the action given the current and next state – and many prior works have employed a version of this objective (Agrawal et al. 2016, Gregor et al. 2016, Zhang et al. 2018, among others).
Intuitively, this objective is appealing because it preserves all the state information that the agent can influence with its actions.
It therefore may seem like a good candidate for a sufficient objective that discards more information than $J_{fwd}$.
However, we can in fact construct a realistic scenario, shown below, in which a representation that maximizes this objective is not sufficient.
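
Before turning to that counterexample, here is a minimal sketch of how $J_{inv}$ is typically maximized in practice: train an inverse model by maximum likelihood to predict the action from the representations of the current and $k$-step future states. Discrete actions and the placeholder architectures are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamics(nn.Module):
    """Maximum-likelihood inverse model p(a_t | z_t, z_{t+k}); one common way to (approximately) maximize J_inv."""

    def __init__(self, obs_dim, num_actions, z_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        self.action_head = nn.Sequential(
            nn.Linear(2 * z_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def loss(self, s_t, a_t, s_tpk):
        # a_t: integer action indices (discrete actions assumed for simplicity).
        z_t, z_tpk = self.encoder(s_t), self.encoder(s_tpk)
        logits = self.action_head(torch.cat([z_t, z_tpk], dim=-1))
        # Cross-entropy is the negative log-likelihood of the observed action.
        return F.cross_entropy(logits, a_t)
```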

For example, consider the MDP shown on the left side of Figure 4 in which an autonomous car is approaching a traffic light.
The agent has two actions available, stop or go.
The reward for following traffic rules depends on the color of the traffic light, and is denoted by a red X (low reward) and green check mark (high reward).
On the right side of the figure, we show a state representation in which the color of the traffic light is not represented in the two states on the left; they are aliased and represented as a single state.
This representation is not sufficient, since from the aliased state it is not clear whether the agent should “stop” or “go” to receive the reward.
However, $J_{inv}$ is maximized because the action taken is still exactly predictable given each pair of states.
In other words, the agent has no control over the traffic light, so representing it does not increase the MI.
Since $J_{inv}$ is maximized by this insufficient representation, we can conclude that the objective is not sufficient.



Figure 4. Counterexample demonstrating the insufficiency of $J_{inv}$.

Since the reward depends on the traffic light, perhaps we can remedy the problem by additionally requiring the representation to be capable of predicting the immediate reward at each state.
However, this is still not enough to guarantee sufficiency – the representation on the right side of Figure 4 remains a counterexample since the aliased states have the same reward.
The crux of the problem is that representing the action that connects two states is not enough to be able to choose the best action.
Still, while $J_{inv}$ is insufficient in the general case, it would be revealing to characterize the set of MDPs for which $J_{inv}$ can be proven to be sufficient.
We see this as an interesting future direction.

Representations Learned by Maximizing “State Information”

The final objective we consider resembles $J_{fwd}$ but omits the action: $J_{state} = I(Z_t; Z_{t+1})$ (see Oord et al. 2018, Anand et al. 2019, Stooke et al. 2020).
Does omitting the action from the MI objective affect its sufficiency?
It turns out the answer is yes.
The intuition is that maximizing this objective can yield insufficient representations that alias states whose transition distributions differ only with respect to the action.
For example, consider the scenario of a car navigating to a city, depicted below in Figure 5.
There are four states from which the car can take the action “turn right” or “turn left.”
The optimal policy takes first a left turn, then a right turn, or vice versa.
Now consider the state representation shown on the right that aliases $s_2$ and $s_3$ into a single state we’ll call $z$.
If we assume the policy distribution is uniform over left and right turns (a reasonable scenario for a driving dataset collected with an exploration policy), then this representation maximizes $J_{state}$.
However, it cannot represent the optimal policy because the agent does not know whether to turn right or left from $z$.



Figure 5. Counterexample demonstrating the insufficiency of $J_{state}$.

Can Sufficiency Matter in Deep RL?

To understand whether the sufficiency of state representations can matter in practice, we perform simple proof-of-concept experiments with deep RL agents and image observations. To separate representation learning from RL, we first optimize each representation learning objective on a dataset of offline data (similar to the protocol in Stooke et al. 2020). We collect the fixed datasets using a random policy, which is sufficient to cover the state space in our environments. We then freeze the weights of the state encoder learned in the first phase and train RL agents with the representation as state input (see Figure 6).



Figure 6. Experimental setup for evaluating learned representations.
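
Schematically, this two-phase protocol looks like the following sketch. The `offline_dataset`, `repr_loss_fn`, and the `agent` interface (`act` / `observe` / `update`) are hypothetical placeholders, not the actual experiment code.

```python
def pretrain_encoder(encoder, repr_loss_fn, offline_dataset, optimizer, num_steps):
    # Phase 1: optimize a representation learning objective on fixed offline data; no reward is used.
    for _ in range(num_steps):
        s_t, a_t, s_tp1 = offline_dataset.sample_batch()
        loss = repr_loss_fn(encoder, s_t, a_t, s_tp1)     # e.g., the J_fwd or J_inv losses sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_rl_with_frozen_encoder(encoder, env, agent, num_env_steps):
    # Phase 2: freeze the encoder and run RL with its output as the state.
    for p in encoder.parameters():
        p.requires_grad_(False)
    obs = env.reset()
    for _ in range(num_env_steps):
        z = encoder(obs)                                  # representation replaces the raw observation
        action = agent.act(z)
        next_obs, reward, done, _ = env.step(action)      # gym-style environment step
        agent.observe(z, action, reward, encoder(next_obs), done)
        agent.update()                                    # e.g., a Soft Actor-Critic gradient step
        obs = env.reset() if done else next_obs
```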

We experiment with a simple video game MDP that shares a key characteristic with the self-driving car example described earlier. In this game, called catcher, from the PyGame suite, the agent controls a paddle that it can move back and forth to catch fruit that falls from the top of the screen (see Figure 7). A positive reward is given when the fruit is caught and a negative reward when the fruit is not caught. The episode terminates after one piece of fruit falls. Analogous to the self-driving example, the agent does not control the position of the fruit, so a representation that maximizes $J_{inv}$ might discard that information. However, representing the fruit is essential to obtaining reward, since the agent must move the paddle beneath the fruit to catch it. We learn representations with $J_{inv}$ and $J_{fwd}$, optimizing $J_{fwd}$ with noise contrastive estimation (NCE), and $J_{inv}$ by training an inverse model via maximum likelihood. (For brevity, we omit experiments with $J_{state}$ in this post – please see the paper!) To select the most compressed representation from among those that maximize each objective, we apply an information bottleneck of the form $\min I(Z; S)$. We also compare to running RL from scratch with the image inputs, which we call “end-to-end.” For the RL algorithm, we use the Soft Actor-Critic algorithm.
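
One standard way to implement a compression penalty of the form $\min I(Z; S)$ (we are not claiming this is exactly the paper’s implementation) is a variational upper bound: make the encoder stochastic and penalize the KL divergence from $p(z \mid s)$ to a fixed prior, as in the variational information bottleneck. A sketch under those assumptions:

```python
import torch
import torch.nn as nn

class StochasticEncoder(nn.Module):
    """Gaussian encoder p(z|s); E[KL(p(z|s) || N(0, I))] upper-bounds I(Z; S)."""

    def __init__(self, obs_dim, z_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.log_std = nn.Linear(256, z_dim)

    def forward(self, s):
        h = self.trunk(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)   # reparameterized sample from p(z|s)
        # Closed-form KL divergence from N(mu, std^2) to the standard normal prior.
        kl = 0.5 * (mu.pow(2) + std.pow(2) - 2.0 * log_std - 1.0).sum(dim=-1)
        return z, kl

# Total training loss: (MI objective, e.g. the NCE loss above) + beta * kl.mean(),
# where beta trades off compression against how much state information is retained.
```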





Figure 7. (left) Depiction of the catcher game. (middle) Performance of RL agents trained with different state representations. (right) Accuracy of reconstructing ground-truth state elements from learned representations.

We observe in Figure 7 (middle) that indeed the representation trained to maximize $J_{inv}$ results in RL agents that converge slower and to a lower asymptotic expected return. To better understand what information the representation contains, we then attempt to learn a neural network decoder from the learned representation to the position of the falling fruit. We report the mean error achieved by each representation in Figure 7 (right). The representation learned by $J_{inv}$ incurs a high error, indicating that the fruit position is not precisely captured by the representation, while the representation learned by $J_{fwd}$ incurs low error.
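
The decoding probe is straightforward: regress the ground-truth fruit position from the frozen representation and report the error. A sketch with illustrative names (the actual probe architecture and training details may differ):

```python
import torch
import torch.nn as nn

def probe_fruit_position(encoder, observations, fruit_positions, num_steps=2000):
    """Train a small decoder from frozen representations to the ground-truth fruit position."""
    with torch.no_grad():
        z = encoder(observations)               # encoder is frozen; representations are fixed
    decoder = nn.Sequential(nn.Linear(z.shape[-1], 128), nn.ReLU(), nn.Linear(128, 1))
    optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
    for _ in range(num_steps):
        loss = nn.functional.mse_loss(decoder(z), fruit_positions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()                          # low error suggests the fruit position is captured in z
```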

Increasing observation complexity with visual distractors

To make the representation learning problem more challenging, we repeat this experiment with visual distractors added to the agent’s observations. We randomly generate images of 10 circles of different colors and replace the background of the game with these images (see Figure 8, left, for example observations). As in the previous experiment, we plot the performance of an RL agent trained with the frozen representation as input (Figure 8, middle), as well as the error of decoding true state elements from the representation (Figure 8, right). The difference in performance between sufficient ($J_{fwd}$) and insufficient ($J_{inv}$) objectives is even more pronounced in this setting than in the plain background setting. With more information present in the observation in the form of the distractors, insufficient objectives that do not optimize for representing all the required state information may be “distracted” by representing the background objects instead, resulting in low performance. In this more challenging case, end-to-end RL from images fails to make any progress on the task, demonstrating the difficulty of end-to-end RL.





Figure 8. (left) Example agent observations with distractors. (middle) Performance of RL agents trained with different state representations. (right) Accuracy of reconstructing ground-truth state elements from state representations.

Conclusion

These results highlight an important open problem: how can we design representation learning objectives that yield representations that are both as lossy as possible and still sufficient for the tasks at hand?
Without further assumptions on the MDP structure or knowledge of the reward function, is it possible to design an objective that yields sufficient representations that are lossier than those learned by $J_{fwd}$?
Can we characterize the set of MDPs for which the insufficient objectives $J_{inv}$ and $J_{state}$ would be sufficient?
Further, extending the proposed framework to partially observed problems would be more reflective of realistic applications. In this setting, analyzing generative models such as VAEs in terms of sufficiency is an interesting problem. Prior work has shown that maximizing the ELBO alone cannot control the content of the learned representation (e.g., Alemi et al. 2018). We conjecture that the zero-distortion maximizer of the ELBO would be sufficient, while other solutions need not be. Overall, we hope that our proposed framework can drive research in designing better algorithms for unsupervised representation learning for RL.


This post is based on the paper Which Mutual Information Representation Learning Objectives are Sufficient for Control?, to be presented at NeurIPS 2021. Thank you to Sergey Levine and Abhishek Gupta for their valuable feedback on this post.
