Emergence of exploratory look-around behaviors through active observation completion

Visual recognition has witnessed dramatic successes in recent years. Fueled by benchmarks composed of carefully curated web photos and videos, the focus has been on inferring semantic labels from human-captured images, whether classifying scenes, detecting objects, or recognizing activities (1–3). However, visual perception requires making not only inferences from observations but also decisions about what to observe. Methods that use human-captured images implicitly assume properties in their inputs, such as canonical poses of objects, no motion blur, or ideal lighting conditions. As a result, they gloss over important hurdles for robotic agents acting in the real world.

For an agent, individual views of an environment offer only a small fraction of all relevant information. For instance, an agent with a view of a television screen in front of it may not know whether it is in a living room or a bedroom. An agent observing a mug from the side may have to move to see it from above to know what is inside.

An agent ought to be able to enter a new environment or pick up a new object and intelligently (non-exhaustively) “look around.” The ability to actively explore would be valuable in both task-driven scenarios (e.g., a drone searches for signs of a particular activity) and scenarios where the task itself unfolds simultaneously with the agent’s exploratory actions (e.g., a search-and-rescue robot enters a burning building and dynamically decides its mission). For example, consider a service robot that is moving around in an open environment without specific goals, waiting for future tasks like delivering a package from one person to another or picking up coffee from the kitchen. It needs to efficiently and constantly gather information so that it is well prepared to perform future tasks with minimal delays. Similarly, consider a search-and-rescue scenario, where a robot is deployed in a hostile environment, such as a burning building or earthquake collapse, where time is of the essence. The robot has to adapt to such new unseen environments and rapidly gather information that other robots and humans can use to effectively respond to situations that dynamically unfold over time (humans caught under debris, locations of fires, and presence of hazardous materials). Having a robot that knows how to explore intelligently can be critical in such scenarios, reducing risks for people while providing an effective response.

Any such scenario brings forth the question of how to collect visual information to benefit perception. A naïve strategy would be to gain full information by making every possible observation, that is, looking around in all directions or systematically examining all sides of an object. However, observing all aspects is often inconvenient if not intractable. Fortunately, in practice, not all views are equally informative. The natural visual world contains regularities, suggesting that not every view needs to be sampled for accurate perception. For instance, humans rarely need to fully observe an object to understand its three-dimensional (3D) shape (4–6), and one can often understand the primary contents of a room without literally scanning it (7). In short, given a set of past observations, some new views are more informative than others (Fig. 1).

Fig. 1 Looking around efficiently is a complex task requiring the ability to reason about regularities in the visual world using cues like context and geometry.

Top: An agent that has observed limited portions of its environment can reasonably predict some unobserved portions (e.g., water near the ship) but is much more uncertain about other portions. Where should it look next? Bottom: An agent inspecting a 3D object. Having seen a top view and a side view, how must it rotate the mug now to get maximum new information? Critically, we aim to learn policies that are not specific to a given object or scene, nor to a specific perception task. Instead, the look-around policies ought to benefit the agent exploring new, unseen environments and performing tasks unspecified when learning the look-around behavior.

This fact leads us to investigate the question of how to effectively look around: How can a learning system make intelligent decisions about how to acquire new exploratory visual observations? We propose a solution based on “active observation completion”: An agent must actively observe a small fraction of its environment so that it can predict the pixelwise appearances of unseen portions of the environment.

Our problem setting relates to but is distinct from previous work in active perception, intrinsic motivation, and view synthesis. Although there is interesting recent headway in active object recognition (8–11) and intelligent search mechanisms for detection (12–14), such systems are supervised and task specific, limited to accelerating a predefined recognition task. In reinforcement learning (RL), intrinsic motivation methods define generic rewards, such as novelty or coverage (15–17), that encourage exploration for navigation agents, but they do not self-supervise policy learning in an observed visual environment, nor do they examine transfer beyond navigation tasks. View synthesis approaches use limited views of the environment along with geometric properties to generate unseen views (18–22). Whereas these methods assume individual human-captured images, our problem requires actively selecting the input views themselves. Our primary goal is not to synthesize unseen views but rather to use novel view inference as a means to elicit intelligent exploration policies that transfer well to other tasks.

In the following, we first formally define the learning task, overview our approach, and present results. Then, after the results, we discuss limitations of the current approach and key future directions, followed by Materials and Methods—an overview of the specific deep networks and policy learning approaches we developed. This article expands upon our two previous conference papers (23, 24).

Active observation completion

Our goal is to learn a policy for controlling an agent’s camera motions such that it can explore novel environments and objects efficiently. To this end, we formulate an unsupervised learning objective based on active observation completion. The main idea is to favor sequences of camera motions that make the unseen parts of the agent’s surroundings easier to predict. The output is a look-around policy equipped to gather new images in new environments. As we demonstrate in Results, it prepares the agent to perform intelligent exploration for a wide range of perception tasks, such as recognition, light source localization, and pose estimation.

Problem formulation

The problem setting is formally stated as follows. The agent starts by looking at a novel environment (or object) X from some unknown viewpoint (25). It has a budget T of time to explore the environment. The learning objective is to minimize the error in the agent’s pixelwise reconstruction of the full—mostly unobserved—environment using only the sequence of views selected within that budget. To do this, the agent must maintain an internal representation of how the environment would look conditioned on the views it has seen so far.

We represent the entire environment as a “viewgrid” containing views from a discrete set of viewpoints. To do this, we evenly sample N elevations from −90° to 90° and M azimuths from 0° to 360° and form all MN possible (elevation, azimuth) pairings. The viewgrid is then denoted by V(X) = {x(X, θ(i)) ∣ 1 ≤ i ≤ MN}, where x(X, θ(i)) is the 2D view of X from viewpoint θ(i), which is the ith pairing. More generally, θ(i) could capture both camera angles and position; however, to best exploit existing datasets, we limited our experiments to camera rotations alone with no translation movements.
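The viewpoint enumeration can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `viewgrid_angles`, the placement of elevations at bin centers, and the `azimuth_offset` parameter are assumptions, chosen because they reproduce the angle lists reported for the two datasets in the article.

```python
from itertools import product

def viewgrid_angles(n_elev, m_azim, azimuth_offset=0.0):
    """Return the (elevation, azimuth) viewpoints of a viewgrid, in degrees.

    Assumption: elevations sit at the centers of n_elev equal bins over
    [-90, 90]; azimuths are m_azim equal steps around the circle, shifted
    by azimuth_offset.
    """
    elev_step = 180.0 / n_elev
    azim_step = 360.0 / m_azim
    elevations = [-90.0 + elev_step * (i + 0.5) for i in range(n_elev)]
    azimuths = [azimuth_offset + azim_step * j for j in range(m_azim)]
    # All elevation/azimuth pairings, one per viewgrid cell.
    return list(product(elevations, azimuths))

# SUN360-style grid: 4 elevations x 8 azimuths = 32 viewpoints.
sun360 = viewgrid_angles(4, 8, azimuth_offset=45.0)
# ModelNet-style grid: 6 elevations x 10 azimuths = 60 viewpoints.
modelnet = viewgrid_angles(6, 10, azimuth_offset=20.0)
```

With these settings the first and last SUN360 viewpoints are (−67.5°, 45°) and (67.5°, 360°), and the ModelNet elevations run from −75° to 75° in 30° steps.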

The agent expends its time budget T in discrete increments by selecting T − 1 camera motions in sequence. Each camera motion comprises an actively chosen “glimpse.” At each time step t, the agent gets an image observation x_t from the current viewpoint θ_t. It then makes an exploratory motion a_t based on its policy π. When the agent executes action a_t ∈ A, the viewpoint changes according to θ_{t+1} = θ_t + a_t. For each camera motion a_t executed by the agent, a reward r_t is provided by the environment. Using the view x_t, the agent updates its internal representation of the environment, denoted V̂(X). Because camera motions are restricted to have proximity to the current camera angle and candidate viewpoints partially overlap, the discrete action space promotes efficiency without neglecting the physical realities of the problem [following (8, 9, 23, 26)]. During training, the full viewgrids of the environments are available to the agent as supervision. During testing, the system must predict the complete viewgrid, having seen only a few views within it.

We explored our idea in two settings (Fig. 1). In the first, the agent scans a scene through its limited field-of-view camera; the goal is to select efficient camera motions so that after a few glimpses, it can model unobserved portions of the scene well. In the second, the agent manipulates a 3D object to inspect it; the goal is to select efficient manipulations so that after only a small number of actions, it has a full model of the object’s 3D shape. In both cases, the system must learn to leverage visual regularities (shape primitives, context, etc.) that suggest the likely contents of unseen views, focusing on portions that are hard to “hallucinate” (i.e., predict pixelwise).

Posing the active view acquisition problem in terms of observation completion has two key advantages: generality and low-cost (label-free) training data. The objective is general in the sense that pixelwise reconstruction places no assumptions about the future task for which the glimpses will be used. The training data are low cost, because no manual annotations are required; the agent learns its look-around policy by exploring any visual scene or object. This assumes that capturing images is much more cost-effective than manually annotating images.

Approach overview

The active observation completion task poses three major challenges. First, to predict unobserved views well, the agent must learn to understand 3D relationships from very few views. Classic geometric solutions struggle under these conditions. Instead, our reconstruction must draw on semantic and contextual cues. Second, intelligent action selection is essential to this task. Given a set of past observations, the system must act based on which new views are likely to be most informative, i.e., determine which views would most improve its model of the full viewgrid. We stress that the system will be faced with objects and scenes it never encountered during training, yet still must intelligently choose where it would be valuable to look next.

As a core solution to these challenges, we present an RL approach for active observation completion (Fig. 2) (23). Our RL approach uses a recurrent neural network to aggregate information over a sequence of views; a stochastic neural network uses that aggregated state and current observation to select a sequence of useful camera motions. The agent is rewarded on the basis of its predictions of unobserved views. It therefore learns a policy to intelligently select actions (camera motions) to maximize the quality of its predictions. During training, the complete viewgrid is known, thereby allowing the agent to “self-supervise” its policy learning, meaning it learns without any human-provided labels. See Materials and Methods below for the details of our approach.

Fig. 2 Approach overview.

The agent (actor) encodes individual views from the environment and aggregates them into a belief state vector. This belief is used by the decoder to get the reconstructed viewgrid. The agent’s incomplete belief about the environment leads to uncertainty over some viewpoints (red question marks). To reduce this uncertainty, the agent intelligently samples more views based on its current belief within a fixed time budget T. The agent is penalized on the basis of the reconstruction error at the end of T steps (completion loss). In addition, we provide guidance through sidekicks (sidekick loss), which exploit the full viewgrid—only at training time—to alleviate uncertainty in training due to partial observability. The learned exploratory policy is then transferred to other tasks (top row shows four tasks we consider).

Our model judges the quality of viewgrid reconstruction in the pixel space so as to maintain generality: All pixels for the full scene (or 3D object) would encompass all potentially useful visual information for any task. Hence, our approach avoids committing to any intermediate semantic representation, in favor of learning policies that seek generic information useful to many tasks. That said, our formulation is easily adaptable to more specialized settings. For example, if the target tasks only require semantic segmentation labels, then the predictions could be in the space of object labels instead.

RL approaches often suffer from costly exploration stages and partial state observability. In particular, an active visual agent has to take a long series of actions purely based on the limited information available from its first-person view (23, 27–29). The most effective viewpoint trajectories are buried among many mediocre ones, impeding the agent’s exploration in complex state-action spaces.

To address this challenge, as the second main technical contribution of this work, we introduce “sidekick policy learning.” In the active observation completion task, there is a natural asymmetry in observability: Once deployed, an active exploration agent can only move the camera to look around nearby, yet during training, it can access omnidirectional viewpoints. Existing methods facing this asymmetry simply restrict the agent to the same partial observability during training (8, 10, 23, 26, 27, 30). In contrast, our sidekick approach introduces reward shaping and demonstrations that leverage full observability during training to precompute the information content of each candidate glimpse. The sidekicks then guide the agent to visit information hot spots in the environment or sample information-rich trajectories while accounting for the fact that observability is only partial during testing (24). By doing so, sidekicks accelerate the training of the actual agent and improve the overall performance. We use the name “sidekick” to signify how a sidekick to a hero (e.g., in a comic or movie) provides alternate points of view, knowledge, and skills that the hero does not have. In contrast to an “expert” (31, 32), a sidekick complements the hero (agent), yet cannot solve the main task at hand by itself. See Materials and Methods below for more details.

We show that the active observation completion policies learned by our approach serve as exploratory policies that are transferable to entirely new tasks and environments. Given a new task, rather than train a policy with task-specific rewards to direct the camera, we drop in the pretrained look-around policy. We demonstrate that policies learned via active observation completion transfer well to several semantic and geometric estimation tasks, and they even perform competitively with supervised task-specific policies (please see the look-around policy transfer section in Results).


Results

We next present experiments to evaluate the behaviors learned by the proposed look-around agents.


For benchmarking and reproducibility, we evaluated our approach on two widely used datasets.

SUN360 dataset for scenes

For this dataset, our limited field-of-view (60°) agent attempts to complete an omnidirectional scene. SUN360 (33) has spherical panoramas of 26 diverse categories. The dataset consists of 6174 training, 1013 validation, and 1805 testing examples. The viewgrid has 32 × 32 pixel 2D images sampled from N = 4 camera elevations (−67.5°, −22.5°, 22.5°, 67.5°) and M = 8 azimuths (45°, 90°, …, 360°).

ModelNet dataset for objects

For this dataset, our agent manipulates a 3D object to complete its viewgrid of the object seen from all viewing directions. The viewgrid constitutes an implicit image-based 3D shape model. ModelNet (10) has two subsets of computer-aided design (CAD) models: ModelNet-40 (40 categories) and ModelNet-10 (a 10-category subset of ModelNet-40). Excluding the ModelNet-10 classes, ModelNet-40 consists of 6085 training, 327 validation, and 1310 testing examples. ModelNet-10 consists of 3991 training, 181 validation, and 727 testing examples. The viewgrid has 32 × 32 pixel 2D images sampled from N = 6 camera elevations (−75°, −45°, …, 45°, 75°) and M = 10 azimuths (20°, 56°, 92°, …, 344°) (34). We rendered the objects using substantial lighting variations to increase difficulty in perception. To test the agent’s ability to generalize to previously unseen categories, we always tested on object categories in ModelNet-10, which are unseen during training.

For both datasets, at each time step, the agent moved within a 5 elevations–by–5 azimuths neighborhood from the current position. Requiring nearby motions reflects that the agent cannot teleport, and it ensures that the actions have approximately uniform real-world cost. Balancing task difficulty (harder tasks require more views) and training speed (fewer views are faster) considerations, we set the training episode length T = 4 a priori. By training for a target budget T, the agent has to learn nonmyopic behaviors to best use the expected exploration time. Note that although further increasing T during training increased training costs considerably, doing so naturally led to better reconstructions (please see the Supplementary Materials for longer episode results).
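The local action space and the viewpoint update (new viewpoint = current viewpoint + action) can be sketched on grid indices. This is a minimal sketch under stated assumptions: the names `ACTIONS` and `step` are hypothetical, the 5-by-5 neighborhood is read as offsets of −2 to +2 cells in each direction, azimuth is assumed to wrap around, and elevation is assumed to saturate at the top and bottom rows of the viewgrid.

```python
from itertools import product

# All 25 actions: offsets of -2..+2 grid cells in (elevation, azimuth).
ACTIONS = list(product(range(-2, 3), repeat=2))

def step(viewpoint, action, n_elev, m_azim):
    """Apply one camera motion to a viewpoint given as grid indices."""
    elev, azim = viewpoint
    d_elev, d_azim = action
    # Assumption: elevation is clamped at the viewgrid boundaries.
    new_elev = min(max(elev + d_elev, 0), n_elev - 1)
    # Assumption: azimuth is circular, so indices wrap around.
    new_azim = (azim + d_azim) % m_azim
    return new_elev, new_azim

# On a 4 x 8 grid, moving down and right from the bottom row, last column:
next_view = step((0, 7), (-2, 2), n_elev=4, m_azim=8)  # (0, 1)
```

Keeping every action inside a small neighborhood is what gives each step a roughly uniform cost; the clamping and wrapping rules only decide what happens at the edges of the grid.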


We tested our active completion approach without and with sidekick policy learning (35), denoted lookaround and lookaround+spl, respectively, and compared them with a variety of baselines:

1. one-view is our method trained with T = 1. No information aggregation or action selection was performed by this baseline.

2. rnd-actions is identical to our approach, except that the action selection module was replaced by randomly selected actions from the pool of all possible actions.

3. large-actions chooses the largest allowable action repeatedly. This tested whether far-apart views were sufficiently informative.

4. peek-saliency moves to the most salient view within reach at each time step, using a popular saliency metric (36). To avoid getting stuck in a local saliency maximum, it does not revisit seen views. Note that this baseline peeks at neighboring views before action selection to measure saliency, giving it an unfair and impossible advantage over our methods and the other baselines.

These baselines all used the same network architecture as our methods, differing only in the exploration policy that we sought to evaluate. In the interest of evaluating on a wide range of starting positions, we evaluated each method MN times on each test viewgrid, starting from all possible viewpoints.
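As an illustration of the handcrafted baselines, the peek-saliency rule can be sketched as a greedy choice over reachable, unvisited views. This is a hedged sketch, not the authors' implementation: `peek_saliency_action` is a hypothetical name, the per-view `saliency` scores are assumed to be precomputed by some saliency metric, and `step` is a caller-supplied function that maps a viewpoint and an action to the resulting viewpoint.

```python
def peek_saliency_action(viewpoint, saliency, visited, actions, step):
    """Pick the action leading to the most salient unvisited view within reach.

    saliency[e][a] holds a precomputed score for grid cell (e, a);
    visited is a set of already-seen grid cells.
    Returns None if every reachable view has been visited.
    """
    best_action, best_score = None, float("-inf")
    for action in actions:
        target = step(viewpoint, action)
        if target in visited:
            continue  # never revisit a seen view
        score = saliency[target[0]][target[1]]
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

The rnd-actions baseline corresponds to replacing this rule with a uniform random draw from the same pool of actions.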

Active observation completion results

We show the results of scene and object completion on SUN360 and ModelNet (unseen classes) in Fig. 3B. The metrics “average” and “adversarial” measure the expected value of the average and maximum pixelwise mean squared errors (MSEs) over all starting points for a single sample, respectively. Whereas the former measures the average expected performance, the latter measures the worst-case performance when starting from the hardest place in each sample (averaged over examples). We additionally report the relative improvement of each model over one-view to isolate the gains obtained due to action selection over a pretrained T = 1 model. Because all methods shared the same pretraining stage of one-view, this metric provides an apples-to-apples measure of how well the different strategies for moving performed. All methods were evaluated over T = 4 time steps in accordance with the training budget unless stated otherwise.
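The two metrics and the relative improvement can be sketched with NumPy. This is a minimal sketch: the function names and the layout of the `errors` array (one row per test sample, one column per starting viewpoint) are assumptions, and the percentage formula is one natural reading of "relative improvement over one-view" rather than a definition given in the article.

```python
import numpy as np

def completion_metrics(errors):
    """errors[s, v]: final pixelwise MSE for sample s when starting at view v."""
    errors = np.asarray(errors, dtype=float)
    # "average": mean over starting points, then mean over samples.
    average = errors.mean(axis=1).mean()
    # "adversarial": worst starting point per sample, then mean over samples.
    adversarial = errors.max(axis=1).mean()
    return average, adversarial

def relative_improvement(error, one_view_error):
    """Percentage reduction in error relative to the one-view baseline (assumed form)."""
    return 100.0 * (one_view_error - error) / one_view_error

# Two samples, two starting viewpoints each.
avg, adv = completion_metrics([[1.0, 3.0], [2.0, 6.0]])  # avg = 3.0, adv = 4.5
```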

Fig. 3 Scene and object completion accuracy under different agent behaviors.

(A) Pixelwise MSE errors versus time on both datasets as more glimpses are acquired. (B) Average/adversarial MSE error ×1000 (↓ lower is better) and corresponding improvements (%) over the one-view model (↑ higher is better) on both datasets after all T glimpses are acquired.

As expected, all methods that acquired multiple glimpses outperformed one-view by taking advantage of the extra information that was available from additional views. Both the lookaround and lookaround+spl approaches substantially outperformed the others on all settings. The peek-saliency agent hovered near the most salient views in the neighborhood of the starting view because nearby views tended to have similar saliency scores. The large-actions agent’s accuracy often tended to saturate near the top or bottom of the viewgrid after reaching the environment boundaries. Compared with these behaviors, intelligent sampling of actions using our learned policy led to substantial improvements. Using sidekicks in lookaround+spl improved performance and convergence speed. This is consistent with our results reported in (24) and demonstrates the advantage of using sidekicks. The faster convergence of lookaround+spl is shown in the Supplementary Materials.

Whereas Fig. 3B shows the agents’ ultimate ability to infer the entire scene (object), Fig. 3A shows the reconstruction errors as a function of time. As we can see, the error reduced consistently over time for all methods, but it dropped most sharply for lookaround and lookaround+spl. Faster reduction in the reconstruction error indicates more efficient information aggregation.

Visualizations of the agent’s evolving internal belief state echo this quantitative trend. Figure 4 shows observation completion episodes from the lookaround agent along with the ground truth viewgrid, viewing angles selected by the agent, and reconstruction errors over time. We show the SUN360 viewgrids in equirectangular projection for better visualization. Initially, the agent exhibited considerable uncertainty in its belief, as seen in the poorly decoded reconstructions and large MSE values. However, over time, it actively sampled views that quickly improved the reconstruction quality.

Fig. 4 Episodes of active observation completion for SUN360 (left) and ModelNet (right).

For each example, the first row on the left shows the ground-truth viewgrid; the subsequent rows on the left show the reconstructions at times t = 0,1, T − 1 = 3 along with the pixelwise MSE error (×1000) and the agent’s current glimpse (marked in red). On the right, the sampled viewing angles of the agent at each time step are shown on the viewing sphere (marking the agent’s viewpoint and field of view using a red arrow and outline on the sphere). The reconstruction quality improves over time as it quickly refines the scene structure and object shape.

Figures 5 and 6 visualize the ultimate reconstructions after all T glimpses were acquired (37). For contrast, we also display the results for rnd-actions in Fig. 5. The policies learned by our agent led to more realistic and accurate reconstructions. Although the agent only saw about 15% of all the pixels, its choice of informative glimpses allowed it to anticipate the remainder of the novel scene or object. Movie S1 in the Supplementary Materials shows walkthroughs of the reconstructed environments from the agent’s egocentric point of view.

Fig. 5 Three examples of reconstructions after T = 6 glimpses.

The first column shows the ground-truth viewgrids (equirectangular projections for SUN), the second column shows the corresponding generative adversarial network (GAN)–refined reconstructions of the lookaround and rnd-actions agents, and the third column shows handpicked unseen views (marked on the ground-truth) and the corresponding angles. We chose T = 6 to generate more complete images. Please see the Supplementary Materials for more GAN refinement details. Best viewed on PDF with zoom. Using an intelligent policy, lookaround captures more information from the scene, leading to more realistic reconstructions (examples 1 and 3). Although rnd-actions leads to realistic reconstructions on example 2, its textures and content differ from the ground truth, especially on the ground. Note that the bounding boxes over views are warped to curves on the equirectangular projection for SUN360.

Fig. 6 The ground-truth 360 panorama or viewgrid, agent glimpse inputs, and final GAN-refined reconstructions for multiple environments from SUN360 and ModelNet.

See also movie S1 provided in the Supplementary Materials.

Look-around policy transfer

Having shown that our unsupervised approach successfully trained policies to acquire visual observations useful for completion, we next tested how well the policies transfer to new tasks. Recall, our hypothesis is that the glimpses acquired to maximize completion accuracy will transfer well to solve perception tasks efficiently, because they were chosen to reveal maximal information about the full environment or object.

To demonstrate transfer, we first trained a rnd-actions model for each of the target tasks (“model A”) and a lookaround model for the active observation completion task (“model B”). The policy from model B was then used to select actions for the target task using model A’s task head (see details in the unsupervised policy transfer section in Materials and Methods). In this way, the agent learned to solve the task given arbitrary observations, then inherited our intelligent look-around policy to (potentially) solve the task more quickly—with fewer glimpses. The transfer is considered a success if the look-around agent can solve the task with similar efficiency as a supervised task-specific policy, despite being unsupervised and task agnostic. We tested policy transferability for the following four tasks.
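The transfer protocol can be sketched as a rollout in which model B chooses where to look and model A interprets what is seen. This is a schematic sketch only: every callable below is a hypothetical stand-in for a trained network or an environment interface, and the real models keep recurrent state rather than re-reading the list of views.

```python
def run_transfer_episode(first_view, observe, policy_b, aggregate_a, task_head_a, budget):
    """Roll out model B's look-around policy while model A solves the task.

    Hypothetical callables:
      observe(action)      -> view obtained after executing the camera motion
      policy_b(views)      -> next action from the completion-trained policy
      aggregate_a(views)   -> model A's aggregated belief over the views
      task_head_a(belief)  -> task prediction (e.g., a class label)
    """
    views = [first_view]
    for _ in range(budget - 1):  # T views require T - 1 camera motions
        action = policy_b(views)
        views.append(observe(action))
    return task_head_a(aggregate_a(views))
```

The task head never influences action selection in this loop, which is what makes the look-around policy task agnostic at transfer time.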

Task 1: Active categorization

The first task is category recognition: The agent must produce the category name of the object or scene it is exploring. We plugged look-around policies into the active categorization system from (8) and followed a similar setup. For ModelNet, we trained model A on ModelNet-10 training objects and the active observation completion model (model B) on ModelNet-40 training objects, which were disjoint classes from those in the target ModelNet-10 dataset. For SUN360, both models were trained on SUN360 training data. We replicated the results from (8) and used the corresponding architecture and training strategies. In particular, the classification head was trained with a cross-entropy loss over the set of classes, and the supervised reward function for policy learning was the negative of the classification loss at the end of the episode. We refer the readers to (8) for the full details. Performance was measured using classification accuracy on the test set.

Task 2: Active surface area estimation

The second task is surface area estimation. The agent starts by looking at some view of the object and must intelligently select subsequent viewing angles to estimate the 3D object’s surface area. The task is relevant for a robot that needs to interact with an unfamiliar object. The 3D models from ModelNet-10 were converted into 50 × 50 × 50 voxel occupancy grids. The true surface area was the number of unoccupied voxels that were adjacent to occupied voxels. Estimation was posed as a regression task where the agent predicted a normalized metric value between 0 and 1. Performance was measured using the relative MSE between predicted and ground truth areas on the test set; i.e., if the ground truth and predicted areas are m_g and m_p, respectively, then the error for one example is ((m_g − m_p)/m_g)^2. This normalized the error so that it remained comparable across objects of different sizes.
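The surface area definition and the per-example error can be sketched with NumPy. This is a minimal sketch under stated assumptions: "adjacent" is read as sharing a face (6-connectivity), the grid is padded with one layer of empty voxels so that occupied voxels on the boundary still contribute, and the function names are hypothetical.

```python
import numpy as np

def surface_area(occupancy):
    """Count empty voxels that share a face with at least one occupied voxel."""
    # Assumption: pad with an empty border so boundary faces are counted.
    occ = np.pad(np.asarray(occupancy, dtype=bool), 1)
    has_occupied_neighbor = np.zeros_like(occ)
    for axis in range(occ.ndim):
        for shift in (-1, 1):
            # np.roll wraps around, but the wrapped-in layer is empty padding.
            has_occupied_neighbor |= np.roll(occ, shift, axis=axis)
    return int(np.count_nonzero(~occ & has_occupied_neighbor))

def relative_squared_error(m_g, m_p):
    """Per-example error ((m_g - m_p) / m_g)^2 for ground truth m_g and prediction m_p."""
    return ((m_g - m_p) / m_g) ** 2

# A single occupied voxel has six empty face neighbors.
grid = np.zeros((3, 3, 3), dtype=bool)
grid[1, 1, 1] = True
area = surface_area(grid)  # 6
```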

Task 3: Active light source localization

In the third task, the agent is required to localize the sources of light surrounding the 3D object. To design a controlled experimental setting when rendering the ModelNet objects, we placed a single light source randomly at any one of two possible azimuths and four possible elevations relative to the object (see fig. S2). The task was posed as a four-way classification problem where the agent was required to identify the correct elevation (irrespective of the azimuth such that there can be no unfair orientation bias). Performance was measured using localization accuracy on the test set.

Task 4: Active pose estimation

The fourth task is camera pose estimation. Having explored the environment, the agent is required to identify the elevation and relative azimuth of a given reference view. We used a simple solution to this problem. By using the agent’s reconstruction after T time steps, we measured the ℓ2 distance between the given view and each of the reconstructed views. The elevation and azimuth of the reconstructed view leading to the smallest ℓ2 distance was predicted as the pose. The agent used its own decoder as opposed to the decoder from rnd-actions as done in previous tasks. We did not evaluate pose estimation on ModelNet due to the ambiguity arising from symmetric objects. The models were evaluated using the absolute angular error (AE) in (i) elevation and (ii) azimuth predictions, denoted by “AE elev.” and “AE azim.” in Table 1. During evaluation, the starting positions of the agent were selected uniformly over the grid of views. The reference view was sampled randomly from the viewgrid for each episode.
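The nearest-view lookup can be sketched with NumPy. This is a minimal sketch: the function names and the array layout (elevation index first, azimuth index second) are assumptions, the result is returned as grid indices rather than angles, and treating azimuth error as circular is an assumption not stated in the article.

```python
import numpy as np

def estimate_pose(reconstructed, reference):
    """Return (elevation index, azimuth index) of the closest reconstructed view.

    reconstructed: viewgrid array with the two grid axes first,
    e.g. shape (N, M, H, W); reference: a single view of shape (H, W).
    """
    recon = np.asarray(reconstructed, dtype=float)
    ref = np.asarray(reference, dtype=float)
    n, m = recon.shape[:2]
    diffs = recon.reshape(n, m, -1) - ref.reshape(1, 1, -1)
    dists = np.linalg.norm(diffs, axis=-1)  # l2 distance per viewpoint
    idx = np.unravel_index(np.argmin(dists), dists.shape)
    return tuple(int(i) for i in idx)

def angular_error(pred_deg, true_deg, circular=False):
    """Absolute angular error; set circular=True for azimuth (assumed convention)."""
    err = abs(pred_deg - true_deg) % 360.0
    return min(err, 360.0 - err) if circular else err
```

Because the lookup compares against the agent's own reconstruction, its accuracy is bounded by the quality of that reconstruction.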

Table 1 Transfer results.

lookaround and lookaround+spl are transferred to the rnd-actions task-heads from each task. The same unsupervised look-around policy successfully accelerates a variety of tasks—even competing well with the fully supervised task-specific policy (supervised). Note that RMSE here denotes the root mean squared error in the surface area prediction.

For baselines, we used one-view, rnd-actions, large-actions, peek-saliency (defined in the previous section), and supervised. supervised is a policy that was trained specifically on the training objective for each task, i.e., with task-specific rewards.

We compared the transfer of lookaround and lookaround+spl with these baselines in Table 1. The transfer performance of our policies was better than that of rnd-actions on all tasks. This shows that intelligent sequential camera control has scope for improving these perception tasks’ efficiency. Overall, our look-around policy transferred well across tasks, competing with or even outperforming the supervised task-specific policies. Furthermore, our look-around policies consistently performed the tasks better than the baseline policies for glimpse selection based on saliency or large actions.

For active recognition on ModelNet, most of the methods performed similarly. On that dataset, recognition accuracy with a single view was already fairly high, leaving limited headroom for improving with additional views, intelligently selected or otherwise. On pose estimation, our learned policies outperformed the baselines as expected, because the reconstructions generated by our agents were more accurate. On light source localization, our policies showed competitive results and came close to the performance of supervised. They also substantially outperformed the remaining baselines, demonstrating successful transfer. For surface area estimation, we observed that all methods, including the supervised policies, managed only marginal gains over one-view. We believe that this is an indication of the difficulty of this task, as well as the necessity for more 3D-specific architectures such as those that produce voxel grids, point clouds, or surface meshes as output (38–40).

These results demonstrate the effectiveness of learning active look-around policies via observation completion on unlabeled datasets—without task-specific rewards. As we see in Table 1, such policies could successfully transfer to a wide range of perception tasks and often performed on par with supervised task-specific policies.


We propose the task of active observation completion to facilitate learning look-around behaviors in a task-independent way. Our proposed approach outperformed several baselines and effectively anticipated the high-level properties of the environment, having observed only a small fraction of the scene or 3D object. We further showed that adding the proposed RL sidekicks led to faster training and convergence to better policies (Fig. 3 and fig. S3). Once look-around behaviors were learned, we showed that they could be effectively transferred to a wide range of semantic and geometric tasks where they at times outperformed supervised policies trained in a traditional task-specific manner (Table 1).

Although we are motivated to devise sidekick policy learning for active visual exploration, it is more generally applicable whenever an RL agent can access greater observability during training than during deployment. For example, agents may operate on first-person observations during test time, yet have access to multiple sensors during training in simulation environments (41–43). Similarly, an active object recognition system (8, 10, 11, 26, 30) can only see its previously selected views of the object; yet, if trained with CAD models, it could observe all possible views while learning. Future work can explore sidekicks in such scenarios.

Despite the promising results, our approach does have several shortcomings, and our work points to several interesting directions for future work. First, although the agent is moving from one view to another, it does not use the information available during this motion. This is reasonable, because allowable actions are confined to a neighborhood of the current observation and hence relatively close in 3D world space. Still, an interesting setting would be to use the sequence of views obtained while the action is being executed.

Second, our current action space was discretized to promote training efficiency, and we assumed that each action had unit cost and optimized the agent to perform well for a fixed cost budget. The unit cost was approximately correct given the locality of the action space. Nonetheless, it could be interesting to adapt to free-range actions with action-specific costs by allowing the agent to sample any action (continuous or discrete) and penalizing it based on the cost of that action. Such costs could be embodiment specific. For example, humanoid robots may find it easier to move forward when compared with turning and walking, whereas wheeled robots can perform both motions equally well. Such a formulation would also naturally account for the sequence of views seen during action execution. Furthermore, as an alternative to training the agent to make nonmyopic camera motions to best reduce reconstruction error in a fixed budget of glimpses, one could instead formulate the objective in terms of a fixed threshold on reconstruction error and allow the agent to move until that threshold is reached. The former (our formulation) is valuable for scenarios with hard resource constraints; the latter is valuable for scenarios with hard accuracy constraints.

A third limitation of the current approach is that in practice we found that the diversity of actions selected by our learned policies was sometimes limited. The agent often tended to prefer a reduced action space of two or three actions depending on the starting point and the environment, despite using a loss term explicitly encouraging high entropy of selected actions. We believe that this could be related to optimization difficulties commonly associated with policy gradient–based RL, and improvements on this front would also improve the performance of our approach.

Our approach was also affected by a well-known limitation associated with rectangular representations of spherical environments (44), where information at the poles is oversampled compared with the central elevations, resulting in redundant information across different azimuths at the poles. This is further exacerbated in realistic scenes where the poles often represent the sky, floor, and ceiling, which tend to have limited diversity. Because of this issue, we observed that heuristic policies that sample constant actions while avoiding the poles competed strongly with learned approaches and even outperformed supervised policies in some cases. We found that incorporating priors that encourage the agent to move away from the poles resulted in consistent performance gains for our method as well. One future direction to avoid the issue would be to design environments that have varying azimuths across elevations.

Another drawback is that our current testbeds handle only camera rotations, not translations. In future work, we will extend our approach to 3D environments that also permit camera translations (45, 46). In such scenarios, intelligent look-around behavior becomes even more essential, because no matter what visual sensors it has, an agent must move its camera to observe another room. We also plan to consider other tasks for transfer such as target-driven navigation (47) and model-based RL (48, 49), where a preliminary exploratory stage is crucial for performing well on downstream tasks.

Last, it will be interesting to explore how multiple sensing modalities could work together to learn look-around behavior. For example, an agent that hears a sudden noise from one direction might learn to look there to gain new information about dynamic objects in the scene, or an agent that sees an unfamiliar texture might reach out to touch the object surface to better anticipate its shape.


In this final section, we summarize the implementation of our approach. Complete implementation details are provided in the Supplementary Materials.

Recurrent observation completion network

We now discuss the recurrent neural network used for active observation completion. The architecture naturally splits into five modules with distinct functions: SENSE, FUSE, AGGREGATE, DECODE, and ACT. Architecture details for all modules are given in Fig. 7.

Fig. 7 Architecture of our active observation completion system.

Although the input-output pair shown here is for the case of 360° scenes, we used the same architecture for the case of 3D objects. In the output viewgrid, solid black portions denote observed views, question marks denote unobserved views, and transparent black portions denote the system’s uncertain contextual guesses.

Encoding to an internal model of the target

First, we define the core modules with which the agent encodes its internal model of the current environment. At each step $t$, the agent is presented with a 2D view $x_t$ captured from a new viewpoint $\theta_t$. We stress that absolute viewpoint coordinates $\theta_t$ are not fully known, and objects/scenes are not presented in any canonical orientation. All viewgrids inferred by our approach treat the first view’s azimuth as the origin. We assume only that the absolute elevation can be sensed using gravity and that the agent is aware of the relative motion from the previous view. Let $p_t$ denote this proprioceptive metadata (elevation, relative motion).

The SENSE module processes these inputs in separate neural network stacks to produce two vector outputs, which we jointly denote as $o_t = \text{SENSE}(x_t, p_t)$ (see Fig. 7, top left). FUSE combines information from both input streams and embeds it into $f_t = \text{FUSE}(o_t)$ (Fig. 7, top center). Then, this combined sensory information $f_t$ from the current observation is fed into AGGREGATE, which is a long short-term memory module (50). AGGREGATE maintains an encoded internal model $s_t$ of the object/scene under observation to “remember” all relevant information from past observations. At each time step, it updates this code, combining it with the current observation to produce $s_t = \text{AGGREGATE}(f_1, \cdots, f_t)$ (Fig. 7, top right).

SENSE, FUSE, and AGGREGATE together encode observations into an internal state $s_t$, which is used both to produce the output viewgrid and to select the next action, as we detail next.

Decoding to the inferred viewgrid

DECODE translates the aggregated code into the predicted viewgrid $\hat{V}_t(x_1, \cdots, x_t) = \text{DECODE}(s_t)$. To do this, it first reshapes $s_t$ into a sequence of small 2D feature maps (Fig. 7, bottom right) before upsampling to the target dimensions using a series of learned up-convolutions. The final up-convolution produces $MN$ maps, one for each of the $MN$ views in the viewgrid. For color images, we produce $3MN$ maps, one for each color channel of each view. This is then reshaped into the target viewgrid (Fig. 7, bottom center). Seen views are pasted directly from memory.

Acting to select the next viewpoint to observe

Last, ACT processes the aggregate code $s_t$ to issue a motor command $a_t = \text{ACT}(s_t)$ (Fig. 7, middle right). For objects, the motor commands rotate the object (i.e., agent manipulates the object or peers around it); for scenes, the motor commands move the camera (i.e., agent turns in the 3D environment). Upon execution, the observation’s pose updates for the next time step to $\theta_{t+1} = \theta_t + a_t$. For $t = 1$, $\theta_1$ is randomly sampled, corresponding to the agent initially encountering the new environment or object from an arbitrary pose.

Internally, ACT first produces a distribution over all possible actions and then samples $a_t$ from this distribution. We restrict ACT to select small discrete actions at each time step to approximately simulate continuous motion. Once the new viewpoint $\theta_{t+1}$ is set, a new view is captured and the whole process is repeated. This happens until $T$ time steps have passed, involving $T - 1$ actions. These modules are learned end to end in a policy learning framework as described in the section below on the policy learning formulation.
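The sense-act cycle described above can be illustrated with a small sketch. It is not the authors' implementation: the `uniform_policy` stand-in replaces the learned ACT module, and the pose handling (azimuth wraps around, elevation is clamped at the poles) and the 3 x 3 action neighborhood are assumptions made only for illustration.

```python
import random

def step_pose(pose, action, n_elev, n_azim):
    """Apply a relative motion: theta_{t+1} = theta_t + a_t on an M x N viewgrid."""
    elev = min(max(pose[0] + action[0], 0), n_elev - 1)  # assumption: clamp at the poles
    azim = (pose[1] + action[1]) % n_azim                # assumption: azimuth wraps around
    return (elev, azim)

def uniform_policy(visited):
    """Hypothetical stand-in for ACT: a uniform distribution over small discrete motions."""
    actions = [(de, da) for de in (-1, 0, 1) for da in (-1, 0, 1)]
    return actions, [1.0 / len(actions)] * len(actions)

def look_around(policy, n_elev, n_azim, T, rng):
    """Run one episode of T views, which involves T - 1 actions."""
    pose = (rng.randrange(n_elev), rng.randrange(n_azim))  # theta_1 is sampled at random
    visited = [pose]
    for _ in range(T - 1):
        actions, probs = policy(visited)
        action = rng.choices(actions, weights=probs, k=1)[0]  # sample a_t from the distribution
        pose = step_pose(pose, action, n_elev, n_azim)
        visited.append(pose)
    return visited

trajectory = look_around(uniform_policy, n_elev=4, n_azim=8, T=6, rng=random.Random(0))
```

A learned policy would replace `uniform_policy` with a distribution computed from the aggregated state rather than from the list of visited poses.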

Sidekick policy learning

We now describe the sidekicks used to learn faster and converge to better policies under partial observability. To effectively learn to perform the task, the agent has to use the limited information available from its egocentric view to (i) aggregate information, (ii) select intelligent actions to improve its training, and (iii) decode the entire viewgrid. This poses considerable hurdles for policy learning under partial observability, that is, making decisions while lacking full state knowledge. In particular, our agent does not know the entire 360° environment before it has to decide where to look next.

To address these issues, we propose sidekicks that exploit full observability available exclusively during training to aid policy learning of the ultimate agent. The key idea is to solve a simpler problem with relevance to the actual look-around task using full observability and then transfer the knowledge to the main agent. We define two types of sidekicks, reward-based and demonstration-based.

Reward-based sidekick

The reward-based sidekick aims to identify a set $\{x(X, \theta_i)\}_{i=1}^{K}$ of $K$ highly informative views in the environment $X$ by exploiting full observability during training. It considers a simplified completion problem where the goal is to evaluate the information content of individual views themselves, i.e., to identify information hot spots in the environment that strongly suggest other parts of the environment. For example, it might learn that facing the blank ceiling of a kitchen is less informative than looking at the contents of the refrigerator or stove.

To evaluate the informativeness of a candidate view, the sidekick sees how well the entire environment can be reconstructed given only that view. We train a completion model that can reconstruct $\hat{V}(X)$ from any single view (i.e., we set $T = 1$). The score assigned to a candidate view is inversely proportional to the reconstruction error of the entire environment given only that view. The sidekick conveys the results to the agent during policy learning in the form of an augmented reward $r_t^s$ at each time step. Please see the section on sidekick policy learning in the Supplementary Materials for more details.
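As a rough illustration of this scoring rule, the sketch below turns per-view reconstruction errors into scores that are inversely proportional to the error and then picks the K highest-scoring views. The normalization to a unit sum, the `eps` guard, and the function names are assumptions; the exact form of the reward is given only in the Supplementary Materials.

```python
def view_scores(errors, eps=1e-8):
    """Score each candidate view inversely to its single-view reconstruction error.

    Assumption: scores are normalized to sum to 1; eps avoids division by zero.
    """
    inverse = [1.0 / (e + eps) for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]

def top_k_views(errors, k):
    """Return the indices of the k most informative views (lowest error first)."""
    scores = view_scores(errors)
    ranked = sorted(range(len(errors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy errors for four candidate views; views 1 and 3 reconstruct the scene best.
errors = [0.5, 0.1, 0.9, 0.2]
best_views = top_k_views(errors, k=2)
```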

Demonstration-based sidekick

Our second sidekick generates trajectories of informative views through a sidekick policy $\pi_s$. In a trajectory, the informativeness of the current view is conditioned on the past views selected, as opposed to sampling individually informative views. To condition the informativeness on past views, we use a cumulative coverage score (see eqs. S9 and S10) that measures the amount of information gathered about different parts of the environment until time $t$. The goodness of a view is measured by the increase in cumulative coverage obtained upon selecting that view, i.e., how well it complements the previously selected views. Please see the section on sidekick policy learning in the Supplementary Materials for full details.

The demonstration sidekick uses this coverage score to sample informative trajectories. Given a starting view in X, the demonstration sidekick selects a trajectory of T views that jointly maximize the coverage of X. At each time step, the demonstration sidekick evaluates the gain in cumulative coverage obtained by sampling each view in its neighborhood and then greedily samples the best view (see eq. S11).
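The greedy selection can be sketched as follows. The coverage measure used here is a deliberately simple placeholder (a view "covers" the grid cells within one step of it), not the coverage score of eqs. S9 and S10, and the neighborhood definition and function names are likewise assumptions.

```python
def covered_cells(view, n_elev, n_azim, radius=1):
    """Placeholder coverage: the grid cells within `radius` of a view (azimuth wraps)."""
    e0, a0 = view
    cells = set()
    for de in range(-radius, radius + 1):
        for da in range(-radius, radius + 1):
            if 0 <= e0 + de < n_elev:
                cells.add((e0 + de, (a0 + da) % n_azim))
    return cells

def neighbors(view, n_elev, n_azim):
    """Views reachable with one small action, including staying in place."""
    return sorted(covered_cells(view, n_elev, n_azim, radius=1))

def greedy_trajectory(start, T, n_elev, n_azim):
    """Select T views; each step takes the neighbor with the largest coverage gain."""
    path = [start]
    covered = covered_cells(start, n_elev, n_azim)
    for _ in range(T - 1):
        best = max(
            neighbors(path[-1], n_elev, n_azim),
            key=lambda v: len(covered_cells(v, n_elev, n_azim) - covered),
        )
        covered |= covered_cells(best, n_elev, n_azim)
        path.append(best)
    return path, len(covered)

path, n_covered = greedy_trajectory(start=(1, 0), T=4, n_elev=4, n_azim=8)
```

Because each step is chosen greedily, the trajectory is not guaranteed to maximize joint coverage; it only maximizes the gain at each step given the views already selected.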

We use sidekick-generated trajectories as supervision to the agent for a short preparatory period. The goal is to initialize the agent with useful insights learned by the sidekick to accelerate training of better policies. We achieve this through a hybrid training procedure that combines imitation and reinforcement, as described in the demonstration-based sidekick section in the Supplementary Materials.

Policy learning formulation

Having defined the recurrent network model and the sidekick policy preparation, we now describe the policy learning framework used to train our agent as well as the mechanisms used to incorporate sidekick rewards ($r_t^s$) and demonstrations (obtained from $\pi_s$). All modules are jointly optimized end to end to improve the final reconstructed viewgrid $\hat{V}_T$, which contains predicted views $\hat{x}_T(X, \theta_j)$ for all viewpoints $\theta_j$, $1 \le j \le MN$. The agent learns a policy $\pi(a \mid s_t)$ that returns a distribution over actions for the aggregated internal representation $s_t$ at time $t$. Let $A = \{a_i\}$ denote the set of camera motions available to the agent. Our agent seeks the policy that minimizes reconstruction error for the environment given a budget of $T$ camera motions (views). Let $W_s$, $W_f$, $W_r$, $W_d$, $W_a$ represent the weights of the SENSE, FUSE, AGGREGATE, DECODE, and ACT modules. If we denote the set of weights of the network $[W_s, W_f, W_r, W_d, W_a]$ by $W$ and $W$ excluding $W_a$ by $W_{/a}$ and $W$ excluding $W_d$ by $W_{/d}$, then the overall weight update is

$$\Delta W = \frac{1}{n} \sum_{j=1}^{n} \lambda_r \Delta W_{/a}^{rec} + \lambda_a \Delta W_{/d}^{act} \tag{1}$$

where $n$ is the number of training samples, $j$ indexes over the training samples, $\lambda_r$ and $\lambda_a$ are constants, and $\Delta W_{/a}^{rec}$ and $\Delta W_{/d}^{act}$ update all parameters except $W_a$ and $W_d$, respectively.

The pixelwise MSE reconstruction loss ($L_t^{rec}$) and corresponding weight update at time $t$ are as follows

$$L_t^{rec}(X) = \sum_{i=1}^{MN} d\left(\hat{x}_t(X, \theta^{(i)} + \Delta_0),\; x(X, \theta^{(i)})\right), \qquad \Delta W_{/a}^{rec} = -\sum_{t=1}^{T} \nabla_{W_{/a}} L_t^{rec}(X) \tag{2}$$

where $\hat{x}_t(X, \theta^{(i)})$ denotes the reconstructed view at viewpoint $\theta^{(i)}$ and time $t$, $d$ denotes the pixelwise reconstruction MSE, and $\Delta_0$ denotes the offset to account for the unknown starting azimuth (23).
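To make the role of the azimuth offset concrete, here is a minimal sketch of the loss in Eq. 2 on a toy viewgrid stored as nested lists. Representing the offset as an integer number of azimuth steps, the sign convention used for it, and the flat pixel lists are all assumptions for illustration.

```python
def pixel_mse(a, b):
    """Pixelwise mean squared error d(., .) between two views of equal size."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def reconstruction_loss(pred, true, delta0):
    """Sum of per-view MSEs, reading the prediction at azimuth index a + delta0.

    pred[e][a] and true[e][a] are flat pixel lists for elevation e and azimuth a.
    Assumption: delta0 is an integer number of azimuth steps and azimuth wraps.
    """
    n_azim = len(true[0])
    loss = 0.0
    for e, row in enumerate(true):
        for a, view in enumerate(row):
            loss += pixel_mse(pred[e][(a + delta0) % n_azim], view)
    return loss

# Toy 2 x 4 viewgrid with two "pixels" per view.
true_grid = [[[float(10 * e + a)] * 2 for a in range(4)] for e in range(2)]
# A perfect prediction whose azimuth origin is rotated by one step.
pred_grid = [[true_grid[e][(a - 1) % 4] for a in range(4)] for e in range(2)]

aligned_loss = reconstruction_loss(pred_grid, true_grid, delta0=1)
misaligned_loss = reconstruction_loss(pred_grid, true_grid, delta0=0)
```

With the correct offset the rotated prediction incurs zero loss, whereas ignoring the offset penalizes a reconstruction that is correct up to the unknown starting azimuth.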

The agent’s reward at time $t$ consists of the intrinsic reward from the sidekick $r_t^s = \text{Info}(x(X, \theta_t), X)$ and the negated final reconstruction loss, $-L_T^{rec}(X)$:

$$r_t = \begin{cases} r_t^s & 1 \le t \le T-2 \\ -L_T^{rec}(X) + r_t^s & t = T-1 \end{cases} \tag{3}$$
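The reward schedule in Eq. 3 amounts to adding the negated final reconstruction loss to the last sidekick reward. A small sketch, with hypothetical function and argument names:

```python
def episode_rewards(sidekick_rewards, final_rec_loss):
    """Build r_1..r_{T-1} following Eq. 3.

    sidekick_rewards holds r_t^s for t = 1..T-1 (one entry per action);
    final_rec_loss is the reconstruction loss after the last view.
    """
    rewards = list(sidekick_rewards)
    rewards[-1] = -final_rec_loss + rewards[-1]  # only the final action sees the task loss
    return rewards

rewards = episode_rewards([0.1, 0.2, 0.3], final_rec_loss=1.0)
```

Without the sidekick terms, every reward except the last would be zero, which is the sparsity that the sidekick reward is meant to relieve.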

The sidekick reward $r_t^s$ serves to densify the rewards by exploiting full observability, thereby reducing uncertainty during policy learning. Please see the Supplementary Materials for the exact form of $r_t^s$. The update from the policy consists of an actor-critic update, with a baseline $b$ to reduce variance, and supervision from the demonstration sidekick:

$$\Delta W_{/d}^{act} = \sum_{t=1}^{T-1} \nabla_{W_{/d}} \log \pi(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right) + \Delta W_{/d}^{demo} \tag{4}$$
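The bracketed term in Eq. 4 is the return-to-go from step t minus the baseline. The sketch below computes that weight for every step of an episode; the function name is hypothetical, and the baseline values are supplied by the caller rather than predicted by a learned value network.

```python
def policy_gradient_weights(rewards, baselines):
    """For each t, compute sum_{t'=t}^{T-1} r_{t'} - b(s_t).

    rewards and baselines both have T - 1 entries, one per action.
    """
    weights = []
    running = 0.0
    for r, b in zip(reversed(rewards), reversed(baselines)):
        running += r                 # accumulate the undiscounted return-to-go
        weights.append(running - b)
    return weights[::-1]

weights = policy_gradient_weights([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Each weight scales the gradient of the log-probability of the action taken at that step, so actions followed by higher-than-expected returns are made more likely.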

We adapt the baseline b as the value function from an actor-critic (51) method to update the ACT module. The demonstration sidekick’s supervision is defined below in Eq. 5. The ACT term additionally includes a loss to update the learned value network and entropy regularization to promote diversity in action selection (please see additional loss functions in the Supplementary Materials).

Whereas the reward sidekick augments rewards, the demonstration sidekick instead influences policy learning by directly supervising the early rounds of action selection. This is achieved through a cross-entropy loss between the sidekick’s policy $\pi_s$ and the agent’s policy $\pi$:

$$\Delta W_{/d}^{demo} = \sum_{t=1}^{T-1} \sum_{a \in A} \nabla_{W_{/d}} \left( \pi_s(a \mid s_t) \log \pi(a \mid s_t) \right). \tag{5}$$
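For a single time step, the quantity differentiated in Eq. 5 is the negated cross-entropy between the two policies. The sketch below assumes, purely for illustration, that the agent's policy is a softmax over action logits; in that case the gradient with respect to the logits has the closed form "sidekick probabilities minus agent probabilities".

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def demo_objective(sidekick_probs, logits):
    """sum_a pi_s(a | s_t) * log pi(a | s_t), i.e., the negated cross-entropy."""
    pi = softmax(logits)
    return sum(ps * math.log(p) for ps, p in zip(sidekick_probs, pi))

def demo_gradient(sidekick_probs, logits):
    """Gradient of the objective w.r.t. the logits (assumes sidekick_probs sums to 1)."""
    pi = softmax(logits)
    return [ps - p for ps, p in zip(sidekick_probs, pi)]

# The sidekick demonstrates action 0; the agent currently prefers action 1.
sidekick = [1.0, 0.0, 0.0]
logits = [0.0, 1.0, -1.0]
grad = demo_gradient(sidekick, logits)
stepped = [z + 0.5 * g for z, g in zip(logits, grad)]
```

Taking a step along this gradient raises the probability the agent assigns to the demonstrated action, which is how the sidekick's trajectories steer early action selection.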

Please see the sidekick policy learning section in the Supplementary Materials for the exact form of $\pi_s$.

We pretrain the SENSE, FUSE, and DECODE modules with T = 1. The full network is then trained end to end (with SENSE and FUSE frozen). For training with sidekicks, the agent is augmented either with additional rewards from the reward sidekick (Eq. 3) or an additional supervised loss from the demonstration sidekick (Eq. 5).

Unsupervised policy transfer to unseen tasks

We now describe the mechanism used to transfer policies learned in an unsupervised fashion via active observation completion to new perception tasks requiring sequential observations. This section details the process overviewed above in the look-around policy transfer section. The main idea is to inject our generic look-around policy into new unseen tasks in unseen environments. In particular, we consider transferring our policy—trained with neither manual supervision nor task-specific reward—into various semantic and geometric recognition tasks for which the agent was not specifically trained. Recall, we considered four different tasks: recognition, surface area estimation, light source localization, and camera pose estimation.

At training time, we train an end-to-end task-specific model (model A) with a random policy (rnd-actions) and an active observation completion model (model B). Note that our completion model is trained without supervision to look around environments that have zero overlap with model A’s test set. Furthermore, even the categories seen during training may differ from those during testing. For example, the agent might see various furniture categories during training (bookcase, bed, etc.) but never a chair, yet it must generalize well to look around a chair.

At test time, both the task-specific model A and the active observation model B receive and process the same inputs at each time step. The task-specific model does not have a learned policy of its own, because it is trained with a policy that samples random actions. At each time step, model B selects actions to complete its internal model of the new environment based on its look-around policy. This action is then communicated to model A in place of the random actions with which it was trained. Therefore, model A gathers its information based on the actions provided by model B. Model A then makes a prediction for the target task. If the policy learned in model B is truly generic, then it will intelligently explore to solve the new (unseen) tasks despite never receiving task-specific reward for any one of them during training.
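The test-time interplay between the two models can be sketched as a simple loop. Both classes below are toy stand-ins rather than the trained networks: `TaskModel` (model A) reports the brightest value it has seen, `LookAroundModel` (model B) always moves by a fixed stride, and the environment is a one-dimensional ring of scalar "views". All of these are assumptions chosen to keep the example small.

```python
class TaskModel:
    """Model A stand-in: a task head with no policy of its own."""
    def __init__(self):
        self.views = []
    def observe(self, view):
        self.views.append(view)
    def predict(self):
        return max(self.views)  # toy task: report the brightest view seen

class LookAroundModel:
    """Model B stand-in: supplies the actions for both models."""
    def __init__(self, stride=2):
        self.views = []
        self.stride = stride
    def observe(self, view):
        self.views.append(view)
    def select_action(self):
        return self.stride  # placeholder for sampling from the learned look-around policy

def run_transfer(env_views, start, T, task_model, look_model):
    """Both models see the same views; only model B decides where to look next."""
    pose = start
    for t in range(T):
        view = env_views[pose]
        task_model.observe(view)
        look_model.observe(view)
        if t < T - 1:  # T views involve T - 1 actions
            pose = (pose + look_model.select_action()) % len(env_views)
    return task_model.predict()

model_a, model_b = TaskModel(), LookAroundModel()
prediction = run_transfer([0, 5, 1, 9, 2, 3], start=1, T=3,
                          task_model=model_a, look_model=model_b)
```

Swapping `LookAroundModel` for a random-action source would recover the setting in which model A was trained, which makes the two policies directly comparable on the same task head.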

Science Robotics
