[blog] Inspired by progress in large-scale language modeling [Decision Transformer], we apply a similar approach towards building a single generalist agent beyond the realm of text outputs.
The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
Figure 1: A generalist agent. Gato can sense and act with different embodiments across a wide range of environments using a single neural network with the same set of weights. Gato was trained on 604 distinct tasks with varying modalities, observations and action specifications.
In this report we describe the model and the data, and document the current capabilities of Gato [at 0.08b, 0.36b, & 1.2b parameters].
…Given scaling law trends, the performance across all tasks, including dialogue, will increase with scale in parameters, data, and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.
…We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL)…Training of the model is performed on a 16×16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1,024, which takes about 4 days.
Figure 2: Training phase of Gato. Data from different tasks and modalities is serialized into a flat sequence of tokens, batched, and processed by a transformer neural network akin to a large language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions.
…Scaling Laws Analysis: In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the 3 model architectures. Here, for all 3 model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a substantial performance improvement with increased scale.
Figure 8: Model size scaling laws results. In-distribution performance as a function of tokens processed for 3 model scales. Performance is first mean-aggregated within each separate control domain, and then mean-aggregated across all domains. We can see a consistent improvement as model capacity is increased for a fixed number of tokens.
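To make the aggregation concrete, here is a minimal Python sketch of the normalization and two-level averaging described above; the task records and scores are made up for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical records: (domain, task, raw_score, random_score, expert_score)
records = [
    ("atari",  "pong",     18.0, -21.0,  21.0),
    ("atari",  "breakout", 250.0,  1.7, 400.0),
    ("dm_lab", "explore",   40.0,  5.0,  60.0),
]

def normalized(raw, random, expert):
    """Score as a fraction of expert performance, with 0 = random agent."""
    return (raw - random) / (expert - random)

domains = {}
for domain, _, raw, rand, expert in records:
    domains.setdefault(domain, []).append(normalized(raw, rand, expert))

# Mean within each domain, then mean across domains.
per_domain = {d: float(np.mean(s)) for d, s in domains.items()}
overall = float(np.mean(list(per_domain.values())))
print(per_domain, overall)
```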
Figure 10: Robotics fine-tuning results. Left: Comparison of real robot Skill Generalization success rate averaged across test triplets for Gato, expert, and CRR trained on 35k expert episodes (upper bound). Right: Comparison of simulated robot Skill Generalization success rate averaged across test triplets for a series of ablations on the number of parameters, including scores for expert and a BC baseline trained on 5k episodes.
Fine-tuning and Model Size: To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. These results suggest that the model’s greater capacity allows it to use representations learned from the diverse training data at test time.
…As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out…Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing eg. one-hot task identifiers, we instead take inspiration from (Brown et al 2020; Sanh et al 2022; Wei et al 2021) and use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.
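The two training details above, loss masking and prompt conditioning, are straightforward to sketch. The following is an illustrative Python sketch with assumed tensor shapes and token representations, not the Gato implementation:

```python
import random
import torch.nn.functional as F

def masked_loss(logits, targets, target_mask):
    """Cross-entropy over predicted tokens; positions that are not targets
    (eg. image tokens, agent observations) are masked out of the loss."""
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    mask = target_mask.reshape(-1).float()  # 1 for text/action tokens, 0 otherwise
    return (loss * mask).sum() / mask.sum()

def maybe_prepend_prompt(tokens, episode, max_len=1024):
    """With probability 0.25, prepend a prompt drawn from the same source
    episode: half the time from the episode's end (goal conditioning),
    half uniformly sampled. Token sequences here are plain Python lists."""
    if random.random() < 0.25:
        k = max_len // 4  # assumed prompt length
        if random.random() < 0.5:
            prompt = episode[-k:]                      # end of episode
        else:
            start = random.randrange(max(1, len(episode) - k))
            prompt = episode[start:start + k]          # uniform sample
        tokens = prompt + tokens
    return tokens[:max_len]
```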
Table 1: Datasets used to train Gato. Left: control datasets. Right: vision & language datasets. Sample weight means the proportion of each dataset, on average, in the training sequence batches.
Figure 5: Gato’s performance on simulated control tasks. Number of tasks where the performance of the pretrained model is above a percentage of expert score, grouped by domain. Here values on the x-axis represent a specific percentage of expert score, where 0 corresponds to random agent performance. The y-axis is the number of tasks where the pretrained model’s mean performance is equal to or above that percentage. That is, the width of each colour band indicates the number of tasks where Gato’s mean performance is above a percentage of the maximum score obtained by a task-specific expert.
…In ALE Atari (Bellemare et al 2013), Gato achieves average human score or better on 23 Atari games, and over twice the human score on 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or by using offline RL training rather than purely supervised learning (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better-than-human scores on 44 games).
…As mentioned earlier, transfer in Atari is challenging. Rusu et al 2016 researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al 2020.
Agile maneuvers such as sprinting and high-speed turning in the wild are challenging for legged robots.
We present an end-to-end learned controller that achieves record agility for the MIT Mini Cheetah, sustaining speeds up to 3.9 m/s. This system runs and turns fast on natural terrains like grass, ice, and gravel, and responds robustly to disturbances.
Our controller is a neural network trained in simulation via reinforcement learning and transferred to the real world. The two key components are (1) an adaptive curriculum on velocity commands and (2) an online system identification strategy for sim-to-real transfer leveraged from prior work.
Figure 1b: Success on the OBJECTNAV MP3D-VAL split vs. no. of human demonstrations for training
[cf. DD-PPO] We present a large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments—(1) ObjectGoal Navigation (eg. ‘find & go to a chair’) and (2) Pick&Place (eg. ‘find mug, pick mug, find counter, place mug on counter’). First, we develop a virtual teleoperation data-collection infrastructure—connecting the Habitat simulator running in a web browser to Amazon Mechanical Turk, allowing remote users to teleoperate virtual robots safely and at scale. We collect 80k demonstrations for ObjectNav and 12k demonstrations for Pick&Place, which is an order of magnitude larger than existing human demonstration datasets in simulation or on real robots.
Second, we attempt to answer the question—how does large-scale imitation learning (IL) (which has hitherto not been possible) compare to reinforcement learning (RL) (which is the status quo)? On ObjectNav, we find that IL (with no bells or whistles) using 70k human demonstrations outperforms RL using 240k agent-gathered trajectories. The IL-trained agent demonstrates efficient object-search behavior—it peeks into rooms, checks corners for small objects, turns in place to get a panoramic view—none of these are exhibited as prominently by the RL agent, and inducing these behaviors via RL would require tedious reward engineering. Finally, accuracy vs. training data size plots show promising scaling behavior, suggesting that simply collecting more demonstrations is likely to advance the state of the art further. On Pick&Place, the comparison is starker—IL agents achieve ~18% success on episodes with new object-receptacle locations when trained with 9.5k human demonstrations, while RL agents fail to get beyond 0%.
Overall, our work provides compelling evidence for investing in large-scale imitation learning.
Figure 7: Dataset size vs performance plot for Pick&Place.
…On both tasks, we find that demonstrations from humans are essential; imitating shortest paths from an oracle produces neither high accuracy nor the strategic search behavior the task requires. In hindsight, this is perfectly understandable—shortest paths (eg. Figure 1(a3)) do not contain any exploration, but the task requires the agent to explore. Essentially, a shortest path is inimitable, but imitation learning is invaluable.
Figure 5: Comparing RL and IL on (a) VAL success vs. no. of training steps, and (b) VAL success vs. no. of unique training steps. This distinguishes between an IL agent that learns from a static dataset vs. an RL agent that gathers unique trajectories on-the-fly.
Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform target tasks without a costly, pre-defined reward function, by learning the reward from a supervisor’s preferences between two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply this approach to various applications. This data-efficiency problem, on the other hand, has typically been addressed by using unlabeled samples or data augmentation techniques in the context of supervised learning.
Motivated by the recent success of these approaches, we present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. To further improve the label-efficiency of reward learning, we introduce a new data augmentation that temporally crops consecutive subsequences from the original behaviors.
Our experiments demonstrate that our approach substantially improves the feedback-efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks.
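A rough sketch of SURF’s two ingredients, confidence-based pseudo-labeling and temporal cropping, might look as follows (assumed data layout; not the authors’ code):

```python
import numpy as np

def pseudo_label(pref_probs, threshold=0.9):
    """Keep an unlabeled segment pair only if the current preference predictor
    is confident; pref_probs[i] = predicted probability that segment A is
    preferred over segment B."""
    labels = {}
    for i, p in enumerate(pref_probs):
        if p > threshold:
            labels[i] = 1          # pseudo-label: A preferred
        elif p < 1 - threshold:
            labels[i] = 0          # pseudo-label: B preferred
    return labels                  # low-confidence pairs are discarded

def temporal_crop(segment, min_len=40):
    """Augment a behavior segment of shape (T, obs_dim) by cropping a random
    consecutive subsequence."""
    T = len(segment)
    length = np.random.randint(min_len, T + 1)
    start = np.random.randint(0, T - length + 1)
    return segment[start:start + length]
```

The pseudo-labeled pairs are then added to the labeled preference dataset when training the reward model, which is where the feedback-efficiency gain comes from.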
A long-horizon dexterous robot manipulation task involving deformable objects, such as banana peeling, is problematic because of difficulties in object modeling and a lack of knowledge about stable and dexterous manipulation skills. This paper presents a goal-conditioned dual-action deep imitation learning (DIL) approach which can learn dexterous manipulation skills using human demonstration data. Previous DIL methods map the current sensory input to a reactive action, which easily fails because of compounding errors in imitation learning caused by the recurrent computation of actions. The proposed method predicts a reactive action when precise manipulation of the target object is required (local action) and generates the entire trajectory when precise manipulation is not required. This dual-action formulation effectively prevents compounding error with the trajectory-based global action while responding to unexpected changes in the target object with the reactive local action. Furthermore, in this formulation, both global and local actions are conditioned on a goal state, which is defined as the last step of each subtask, for robust policy prediction. The proposed method was tested on a real dual-arm robot and successfully accomplished the banana-peeling task.
Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal.
To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment.
We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), with both diverse simulation manipulation tasks and real world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies.
We find that our policies train 4× faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.
In this paper, we propose a locomotion training framework where a control policy and a state estimator are trained concurrently. The framework consists of a policy network which outputs the desired joint positions and a state estimation network which outputs estimates of the robot’s states such as the base linear velocity, foot height, and contact probability. We exploit a fast simulation environment to train the networks and the trained networks are transferred to the real robot.
The trained policy and state estimator are capable of traversing diverse terrains such as a hill, slippery plate, and bumpy road. We also demonstrate that the learned policy can run at up to 3.75 m/s on normal flat ground and 3.54 m/s on a slippery plate with a coefficient of friction of 0.22.
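The concurrent training scheme can be sketched compactly: the estimator is supervised with privileged simulator state, while the policy consumes the estimator’s output instead of the ground truth. Dimensions and network sizes below are assumptions for illustration, not the paper’s architecture:

```python
import torch
import torch.nn as nn

obs_dim, est_dim, act_dim = 48, 11, 12   # assumed sizes

estimator = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, est_dim))
policy = nn.Sequential(
    nn.Linear(obs_dim + est_dim, 128), nn.ELU(), nn.Linear(128, act_dim))

def training_step(obs, true_state):
    """obs: proprioceptive observation; true_state: privileged sim quantities
    (base linear velocity, foot height, contact probability)."""
    est = estimator(obs)
    # Estimator is supervised directly from the simulator...
    est_loss = nn.functional.mse_loss(est, true_state)
    # ...while the policy only ever sees the estimate, so nothing changes at
    # deployment time, where the true state is unavailable.
    action = policy(torch.cat([obs, est.detach()], dim=-1))
    return action, est_loss  # est_loss is added to the RL objective
```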
Quality-Diversity (QD) algorithms are a well-known approach to generating large collections of diverse and high-quality policies. However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources, and are slow when used in practice for robotics tasks. Policy evaluations are already commonly performed in parallel to speed up QD algorithms, but have limited capabilities on a single machine as most physics simulators run on CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We first demonstrate the improvements in the number of evaluations per second that parallelism using accelerated simulators can offer. More importantly, we show that QD algorithms are ideal candidates for massive parallelism and can be run at interactive timescales. The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes. These results show that QD can now benefit from hardware acceleration, which contributed significantly to the bloom of deep learning.
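For intuition, here is a simplified numpy sketch of the MAP-Elites loop with batched evaluation; QDax itself is written in JAX and vectorizes the evaluations on accelerators, and `evaluate_batch` is an assumed interface standing in for the simulator:

```python
import numpy as np

def map_elites(evaluate_batch, n_cells=1024, genome_dim=16, iters=100, batch=4096):
    """evaluate_batch: genomes (B, D) -> (fitness (B,), descriptor cell ids (B,))."""
    archive_fit = np.full(n_cells, -np.inf)
    archive_gen = np.zeros((n_cells, genome_dim))
    genomes = np.random.randn(batch, genome_dim)  # random initialization
    for _ in range(iters):
        fitness, cells = evaluate_batch(genomes)  # all B rollouts run in parallel
        for g, f, c in zip(genomes, fitness, cells):
            if f > archive_fit[c]:                # keep the best genome per cell
                archive_fit[c], archive_gen[c] = f, g
        # Mutate parents sampled from filled archive cells; it is this large
        # batch of evaluations that accelerated simulators make cheap.
        filled = np.flatnonzero(np.isfinite(archive_fit))
        parents = archive_gen[np.random.choice(filled, batch)]
        genomes = parents + 0.1 * np.random.randn(batch, genome_dim)
    return archive_gen, archive_fit
```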
We present in-hand manipulation skills on a dexterous, compliant, anthropomorphic hand. Even though these skills were derived in a simplistic manner, they exhibit surprising robustness to variations in shape, size, weight, and placement of the manipulated object. They are also very insensitive to variation of execution speeds, ranging from highly dynamic to quasi-static. The robustness of the skills leads to compositional properties that enable extended and robust manipulation programs.
To explain the surprising robustness of the in-hand manipulation skills, we performed a detailed, empirical analysis of the skills’ performance. From this analysis, we identify three principles for skill design: (1) Exploiting the hardware’s innate ability to drive hard-to-model contact dynamics. (2) Taking actions to constrain these interactions, funneling the system into a narrow set of possibilities. (3) Composing such action sequences into complex manipulation programs. We believe that these principles constitute an important foundation for robust robotic in-hand manipulation, and possibly for manipulation in general.
Both the design and control of a robot play equally important roles in its task performance. However, while optimal control is well studied in the machine learning and robotics community, less attention is placed on finding the optimal robot design. This is mainly because co-optimizing design and control in robotics is characterized as a challenging problem, and more importantly, a comprehensive evaluation benchmark for co-optimization does not exist. In this paper, we propose Evolution Gym, the first large-scale benchmark for co-optimizing the design and control of soft robots. In our benchmark, each robot is composed of different types of voxels (eg. soft, rigid, actuators), resulting in a modular and expressive robot design space. Our benchmark environments span a wide range of tasks, including locomotion on various types of terrains and manipulation. Furthermore, we develop several robot co-evolution algorithms by combining state-of-the-art design optimization methods and deep reinforcement learning techniques. Evaluating the algorithms on our benchmark platform, we observe robots exhibiting increasingly complex behaviors as evolution progresses, with the best evolved designs solving many of our proposed tasks. Additionally, even though robot designs are evolved autonomously from scratch without prior knowledge, they often grow to resemble existing natural creatures while outperforming hand-designed robots. Nevertheless, all tested algorithms fail to find robots that succeed in our hardest environments. This suggests that more advanced algorithms are required to explore the high-dimensional design space and evolve increasingly intelligent robots—an area of research in which we hope Evolution Gym will accelerate progress. Our website with code, environments, documentation, and tutorials is available at http://evogym.csail.mit.edu.
[homepage; video; cf. Dactyl, Hwangbo et al 2019/Rudin et al 2021] Legged robots that can operate autonomously in remote and hazardous environments will greatly increase opportunities for exploration into underexplored areas.
Exteroceptive perception is crucial for fast and energy-efficient locomotion: Perceiving the terrain before making contact with it enables planning and adaptation of the gait ahead of time to maintain speed and stability. However, using exteroceptive perception robustly for locomotion has remained a grand challenge in robotics. Snow, vegetation, and water visually appear as obstacles on which the robot cannot step or are missing altogether due to high reflectance. In addition, depth perception can degrade due to difficult lighting, dust, fog, reflective or transparent surfaces, sensor occlusion, and more. For this reason, the most robust and general solutions to legged locomotion to date rely solely on proprioception. This severely limits locomotion speed because the robot has to physically feel out the terrain before adapting its gait accordingly.
Here, we present a robust and general solution to integrating exteroceptive and proprioceptive perception for legged locomotion [in ANYmal robots]. We leverage an attention-based recurrent encoder that integrates proprioceptive and exteroceptive input [using privileged learning, the simulator as oracle, then training the RNN to infer the POMDP & meta-learn at runtime to adapt to changing environments]. The encoder is trained end to end and learns to seamlessly combine the different perception modalities without resorting to heuristics. The result is a legged locomotion controller with high robustness and speed.
The controller was tested in a variety of challenging natural and urban environments over multiple seasons and completed an hour-long hike in the Alps in the time recommended for human hikers.
…DARPA Subterranean Challenge: [video 1, 2] Our controller was used as the default controller in the DARPA Subterranean Challenge missions of team CERBERUS, which won first prize in the finals (Results). In this challenge, our controller drove ANYmals to operate autonomously over extended periods of time in underground environments with rough terrain, obstructions, and degraded sensing in the presence of dust, fog, water, and smoke. Our controller played a crucial role, as it enabled 4 ANYmals to explore over 1,700m in all 3 types of courses—tunnel, urban, and cave—without a single fall.
Automation technologies, and robots in particular, are thought to be massively displacing workers and transforming the future of work.
We study firm investment in automation using cross-country data on robotization as well as administrative data from Germany with information on firm-level automation decisions. Our findings suggest that the impact of robots on firms has been limited:
(1) investment in robots is small and highly concentrated in a few industries, accounting for less than 0.30% of aggregate expenditures on equipment;
(2) recent increases in robotization do not resemble the explosive growth observed for IT technologies in the past, and are driven mostly by catching-up of developing countries;
(3) robot adoption by firms endogenously responds to labor scarcity, alleviating potential displacement of existing workers;
(4) firms that invest in robots increase employment, while the total employment effect in exposed industries and regions is negative, but modest in magnitude.
We contrast robots with other digital technologies that are more widespread. Their importance in firms’ investment is substantially higher, and their link with labor markets, while sharing some similarities with robots, appears markedly different.
Robotic skills can be learned via imitation learning (IL) using user-provided demonstrations, or via reinforcement learning (RL) using large amounts of autonomously collected experience. Both methods have complementary strengths and weaknesses: RL can reach a high level of performance, but requires exploration, which can be very time consuming and unsafe; IL does not require exploration, but only learns skills that are as good as the provided demonstrations. Can a single method combine the strengths of both approaches? A number of prior methods have aimed to address this question, proposing a variety of techniques that integrate elements of IL and RL. However, scaling up such methods to complex robotic skills that integrate diverse offline data and generalize meaningfully to real-world scenarios still presents a major challenge. In this paper, our aim is to test the scalability of prior IL + RL algorithms and devise a system based on detailed empirical experimentation that combines existing components in the most effective and scalable way. To that end, we present a series of experiments aimed at understanding the implications of each design decision, so as to develop a combined approach that can utilize demonstrations and heterogeneous prior data to attain the best performance on a range of real-world and realistic simulated robotic problems. Our complete method, which we call AW-Opt, combines elements of advantage-weighted regression [1, 2] and QT-Opt [3], providing a unified approach for integrating demonstrations and offline data for robotic manipulation. Please see https://awopt.github.io for more details.
In this paper, we study the problem of enabling a vision-based robotic manipulation system to generalize to novel tasks, a long-standing challenge in robot learning.
We approach the challenge from an imitation learning perspective, aiming to study how scaling and broadening the data collected can facilitate such generalization. To that end, we develop an interactive and flexible imitation learning system that can learn from both demonstrations and interventions and can be conditioned on different forms of information that convey the task, including pre-trained embeddings of natural language or videos of humans performing the task.
When scaling data collection on a real robot to more than 100 distinct tasks, we find that this system can perform 24 unseen manipulation tasks with an average success rate of 44%, without any robot demonstrations for those tasks.
How can artificial agents learn to solve many diverse tasks in complex visual environments in the absence of any supervision? We decompose this question into two problems: discovering new goals and learning to reliably achieve them. We introduce Latent Explorer Achiever (LEXA), a unified solution to both that learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts. Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever to practice. After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning. LEXA substantially outperforms previous approaches to unsupervised goal-reaching, both on prior benchmarks and on a new challenging benchmark with a total of 40 test tasks spanning four standard robotic manipulation and locomotion domains. LEXA further achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of LEXA, we train a single general agent across four distinct environments. Code and videos at https://orybkin.github.io/lexa/
We study the problem of robotic stacking with objects of complex geometry. We propose a challenging and diverse set of such objects that was carefully designed to require strategies beyond a simple “pick-and-place” solution. Our method is a reinforcement learning (RL) approach combined with vision-based interactive policy distillation and simulation-to-reality transfer. Our learned policies can efficiently handle multiple object combinations in the real world and exhibit a large variety of stacking skills. In a large experimental study, we investigate what choices matter for learning such general vision-based agents in simulation, and what affects optimal transfer to the real robot. We then leverage data collected by such policies and improve upon them with offline RL. A video and a blog post of our work are provided as supplementary material.
Legged robots are physically capable of traversing a wide range of challenging environments, but designing controllers that are sufficiently robust to handle this diversity has been a long-standing challenge in robotics. Reinforcement learning presents an appealing approach for automating the controller design process and has been able to produce remarkably robust controllers when trained in a suitable range of environments. However, it is difficult to predict all likely conditions the robot will encounter during deployment and enumerate them at training-time. What if instead of training controllers that are robust enough to handle any eventuality, we enable the robot to continually learn in any setting it finds itself in? This kind of real-world reinforcement learning poses a number of challenges, including efficiency, safety, and autonomy.
To address these challenges, we propose a practical robot reinforcement learning system for fine-tuning locomotion policies in the real world. We demonstrate that a modest amount of real-world training can substantially improve performance during deployment, and this enables a real A1 quadrupedal robot to autonomously fine-tune multiple locomotion skills in a range of environments, including an outdoor lawn and a variety of indoor terrains.
Robot learning holds the promise of learning policies that generalize broadly. However, such generalization requires sufficiently diverse datasets of the task of interest, which can be prohibitively expensive to collect. In other fields, such as computer vision, it is common to utilize shared, reusable datasets, such as ImageNet, to overcome this challenge, but this has proven difficult in robotics. In this paper, we ask: what would it take to enable practical data reuse in robotics for end-to-end skill learning? We hypothesize that the key is to use datasets with multiple tasks and multiple domains, such that a new user that wants to train their robot to perform a new task in a new domain can include this dataset in their training process and benefit from cross-task and cross-domain generalization. To evaluate this hypothesis, we collect a large multi-domain and multi-task dataset, with 7,200 demonstrations constituting 71 tasks across 10 environments, and empirically study how this data can improve the learning of new tasks in new environments. We find that jointly training with the proposed dataset and 50 demonstrations of a never-before-seen task in a new domain on average leads to a 2× improvement in success rate compared to using target domain data alone. We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains. These results suggest that reusing diverse multi-task and multi-domain datasets, including our open-source dataset, may pave the way for broader robot generalization, eliminating the need to re-collect data for each new robot learning project.
How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data; however, these representations lack the spatial understanding necessary for fine-grained manipulation.
To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.
In this work, we present and study a training set-up that achieves fast policy generation for real-world robotic tasks by using massive parallelism on a single workstation GPU. We analyze and discuss the impact of different training algorithm components in the massively parallel regime on the final policy performance and training times. In addition, we present a novel game-inspired curriculum that is well suited for training with thousands of simulated robots in parallel. We evaluate the approach by training the quadrupedal robot ANYmal to walk on challenging terrain. The parallel approach allows training policies for flat terrain in under four minutes, and in twenty minutes for uneven terrain. This represents a speedup of multiple orders of magnitude compared to previous work. Finally, we transfer the policies to the real robot to validate the approach. We open-source our training code to help accelerate further research in the field of learned legged locomotion.
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot’s observation space. However, goal images also have a number of drawbacks: they are inconvenient for humans to provide, they can over-specify the desired behavior leading to a sparse reward signal, or under-specify task information in the case of non-goal reaching tasks. Natural language provides a convenient and flexible alternative for task specification, but comes with the challenge of grounding language in the robot’s observation space. To scalably learn this grounding we propose to leverage offline robot datasets (including highly sub-optimal, autonomously collected data) with crowd-sourced natural language labels. With this data, we learn a simple classifier which predicts if a change in state completes a language instruction. This provides a language-conditioned reward function that can then be used for offline multi-task RL. In our experiments, we find that on language-conditioned manipulation tasks our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%, and is able to perform visuomotor tasks from natural language, such as “open the right drawer” and “move the stapler”, on a Franka Emika Panda robot.
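The reward-from-classifier idea can be sketched in a few lines; the classifier interface and transition layout below are assumptions for illustration, not the paper’s code:

```python
def language_reward(classifier, s0, s, instruction):
    """Reward = predicted probability that moving from initial state s0 to
    state s completes the natural-language instruction (scalar in [0, 1])."""
    return classifier(s0, s, instruction)

def relabel_offline_batch(classifier, batch, instruction):
    """Attach language-conditioned rewards to an offline batch of transitions,
    each stored as (episode_initial_state, s, a, s_next). The relabeled batch
    can then be fed to any multi-task offline RL algorithm."""
    return [
        (s, a, s_next, language_reward(classifier, s_init, s_next, instruction))
        for (s_init, s, a, s_next) in batch
    ]
```

The appeal of this design is that the same sub-optimal, autonomously collected data can be relabeled under many different instructions, so each crowd-sourced language label effectively multiplies the usable training signal.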
Isaac Gym offers a high-performance learning platform to train policies for a wide variety of robotics tasks directly on GPU. Both physics simulation and neural network policy training reside on the GPU and communicate by directly passing data from physics buffers to PyTorch tensors without ever going through any CPU bottlenecks. This leads to blazing-fast training times for complex robotics tasks on a single GPU, with 1–2 orders of magnitude improvements compared to conventional RL training that uses a CPU-based simulator and GPU for neural networks. We host the results and videos at https://sites.google.com/view/isaacgym-nvidia and Isaac Gym can be downloaded at https://developer.nvidia.com/isaac-gym.
AI is undergoing a paradigm shift with the rise of models (eg. BERT, DALL·E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.
This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (eg. language, vision, robotics, reasoning, human interaction) and technical principles (eg. model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (eg. law, healthcare, education) and societal impact (eg. inequity, misuse, economic and environmental impact, legal and ethical considerations).
Though foundation models are based on conventional deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.
To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
The history of AI is one of increasing emergence and homogenization. With the introduction of machine learning, we moved from a large proliferation of specialized algorithms that specified how to compute answers to a small number of general algorithms that learned how to compute answers (ie. the algorithm for computing answers emerged from the learning algorithm). With the introduction of deep learning, we moved from a large proliferation of hand-engineered features for learning algorithms to a small number of architectures that could be pointed at a new domain and discover good features for that domain. Recently, the trend has continued: we have moved from a large proliferation of trained models for different tasks to a few large “foundation models” which learn general algorithms useful for solving specific tasks. BERT and GPT-3 are central examples of foundation models in language; many NLP tasks that previously required different models are now solved using finetuned or prompted versions of BERT and/or GPT-3.
Note that, while language is the main example of a domain with foundation models today, we should expect foundation models to be developed in an increasing number of domains over time. The authors call these “foundation” models to emphasize that (1) they form a fundamental building block for applications and (2) they are not themselves ready for deployment; they are simply a foundation on which applications can be built. Foundation models have been enabled only recently because they depend on large scale, using self-supervised learning on large unlabeled datasets to enable effective transfer to new tasks. It is particularly challenging to understand and predict the capabilities exhibited by foundation models because their multitask nature emerges from the large-scale training rather than being designed in from the start, making the capabilities hard to anticipate. This is particularly unsettling because foundation models also lead to substantially increased homogenization, where everyone is using the same few models, and so any new emergent capability (or risk) is quickly distributed to everyone.
The authors argue that academia is uniquely suited to study and understand the risks of foundation models. Foundation models are going to interact with society, both in terms of the data used to create them and the effects on people who use applications built upon them. Thus, analysis of them will need to be interdisciplinary; this is best achieved in academia due to the concentration of people working in the various relevant areas. In addition, market-driven incentives need not align well with societal benefit, whereas the research mission of universities is the production and dissemination of knowledge and creation of global public goods, allowing academia to study directions that would have large societal benefit that might not be prioritized by industry.
All of this is just a summary of parts of the introduction to the report. The full report is over 150 pages and goes into detail on capabilities, applications, technologies (including technical risks), and societal implications. I’m not going to summarize it here, because it is long and a lot of it isn’t that relevant to alignment; I’ll instead note down particular points that I found interesting.
(pg. 26) Some studies have suggested that foundation models in language don’t learn linguistic constructions robustly; even if a model uses a construction well once, it may not do so again, especially under distribution shift. In contrast, humans can easily “slot in” new knowledge into existing linguistic constructions.
(pg. 34) This isn’t surprising but is worth repeating: many of the capabilities highlighted in the robotics section are very similar to the ones that we focus on in alignment (task specification, robustness, safety, sample efficiency).
(pg. 42) For tasks involving reasoning (eg. mathematical proofs, program synthesis, drug discovery, computer-aided design), neural nets can be used to guide a search through a large space of possibilities. Foundation models could be helpful because (1) since they are very good at generating sequences, you can encode arbitrary actions (eg. in theorem proving, they can use arbitrary instructions in the proof assistant language rather than being restricted to an existing database of theorems), (2) the heuristics for effective search learned in one domain could transfer well to other domains where data is scarce, and (3) they could accept multimodal input: for example, in theorem proving for geometry, a multimodal foundation model could also incorporate information from geometric diagrams.
(Section 3) A substantial portion of the report is spent discussing potential applications of foundation models. This is the most in-depth version of this I have seen; anyone aiming to forecast the impacts of AI on the real world in the next 5–10 years should likely read this section. It’s notable to me how nearly all of the applications have an emphasis on robustness and reliability, particularly in truth-telling and logical reasoning.
(Section 4.3) We’ve seen a few (AN #152) ways (AN #155) in which foundation models can be adapted. This section provides a good overview of the various methods that have been proposed in the literature. Note that adaptation is useful not just for specializing to a particular task like summarization, but also for enforcing constraints, handling distributional shifts, and more.
(pg. 92) Foundation models are commonly evaluated by their performance on downstream tasks. One limitation of this evaluation paradigm is that it makes it hard to distinguish between the benefits provided by better training, data, adaptation techniques, architectures, etc. (The authors propose a bunch of other evaluation methodologies we could use.)
(Section 4.9) There is a review of AI safety and AI alignment as it relates to foundation models, if you’re interested. (I suspect there won’t be much new for readers of this newsletter.)
(Section 4.10) The section on theory emphasizes studying the pretraining-adaptation interface, which seems quite good to me. I especially liked the emphasis on the fact that pretraining and adaptation work on different distributions, and so it will be important to make good modeling assumptions about how these distributions are related.
Seemingly simple natural language requests to a robot are generally underspecified, for example “Can you bring me the wireless mouse?” Flat images of candidate mice may not provide the discriminative information needed for “wireless.” The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these image-based models are weaker at understanding the 3D nature of objects—properties which play a key role in manipulation. We find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform, but note that a large gap remains between these models and human performance.
Quadrotors are among the most agile flying robots. However, planning time-optimal trajectories at the actuation limit through multiple waypoints remains an open problem. This is crucial for applications such as inspection, delivery, search and rescue, and drone racing.
Early works used polynomial trajectory formulations, which do not exploit the full actuator potential because of their inherent smoothness. Recent works resorted to numerical optimization but require waypoints to be allocated as costs or constraints at specific discrete times. However, this time allocation is a priori unknown and renders previous works incapable of producing truly time-optimal trajectories.
To generate truly time-optimal trajectories, we propose a solution to the time allocation problem while exploiting the quadrotor’s full actuator potential. We achieve this by introducing a formulation of progress along the trajectory, which enables the simultaneous optimization of the time allocation and the trajectory itself.
We compare our method against related approaches and validate it in real-world flights in one of the world’s largest motion-capture systems, where we outperform human expert drone pilots in a drone-racing task.
Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, and wear and tear. This paper presents the Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second.
RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning. We train RMA on a varied terrain generator using bioenergetics-inspired rewards and deploy it on a variety of difficult terrains including rocky, slippery, deformable surfaces in environments with grass, long vegetation, concrete, pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across diverse real-world as well as simulation experiments. Video results at https://ashish-kmr.github.io/rma-legged-robots/
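A rough sketch of RMA’s two components, with assumed dimensions and layer sizes (not the released implementation):

```python
import torch
import torch.nn as nn

state_dim, act_dim, latent_dim, hist_len = 30, 12, 8, 50  # assumed sizes

# Base policy is conditioned on a latent "extrinsics" vector z.
base_policy = nn.Sequential(
    nn.Linear(state_dim + latent_dim, 128), nn.ELU(), nn.Linear(128, act_dim))

# In simulation, z comes from an encoder over privileged environment parameters
# (mass, friction, terrain). The adaptation module is trained to reproduce z
# from observable state-action history alone, so nothing needs fine-tuning on
# the real robot.
adaptation_module = nn.Sequential(
    nn.Flatten(),
    nn.Linear(hist_len * (state_dim + act_dim), 256), nn.ELU(),
    nn.Linear(256, latent_dim))

def act(state, history):
    """state: (1, state_dim); history: (1, hist_len, state_dim + act_dim)."""
    z = adaptation_module(history)                     # estimated extrinsics
    return base_policy(torch.cat([state, z], dim=-1))  # runs in real time
```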
Does having visual priors (eg. the ability to detect objects) facilitate learning to perform vision-based manipulation (eg. picking up objects)?
We study this problem under the framework of transfer learning, where the model is first trained on a passive vision task, and adapted to perform an active manipulation task. We find that pre-training on vision tasks significantly improves generalization and sample efficiency for learning to manipulate objects. However, realizing these gains requires careful selection of which parts of the model to transfer.
Our key insight is that outputs of standard vision models highly correlate with affordance maps commonly used in manipulation. Therefore, we explore directly transferring model parameters from vision networks to affordance prediction networks, and show that this can result in successful zero-shot adaptation, where a robot can pick up certain objects with zero robotic experience. With just a small amount of robotic experience, we can further fine-tune the affordance model to achieve better results.
With just 10 minutes of suction experience or 1 hour of grasping experience, our method achieves ~80% success rate at picking up novel objects.
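A minimal sketch of this kind of parameter transfer, using a standard torchvision ResNet backbone and a hypothetical one-channel affordance head (the paper’s exact architectures may differ):

```python
import torch.nn as nn
import torchvision.models as models

# Pre-trained passive-vision model; drop the pooling and classification head.
vision_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(vision_model.children())[:-2])

# Affordance network: transferred backbone + per-location pick-success logit,
# upsampled back to image resolution.
affordance_net = nn.Sequential(
    backbone,                                   # transferred parameters
    nn.Conv2d(512, 1, kernel_size=1),           # hypothetical affordance head
    nn.Upsample(scale_factor=32, mode="bilinear"),
)
# Zero-shot: use affordance_net as-is to rank pick locations; with a small
# amount of robot experience, fine-tune the head (and optionally the backbone)
# on observed grasp outcomes.
```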
We present a coarse-to-fine discretisation method that enables the use of discrete reinforcement learning approaches in place of unstable and data-inefficient actor-critic methods in continuous robotics domains. This approach builds on the recently released ARM algorithm, which replaces the continuous next-best pose agent with a discrete one, with coarse-to-fine Q-attention. Given a voxelised scene, coarse-to-fine Q-attention learns what part of the scene to ‘zoom’ into. When this ‘zooming’ behaviour is applied iteratively, it results in a near-lossless discretisation of the translation space, and allows the use of a discrete-action, deep Q-learning method. We show that our new coarse-to-fine algorithm achieves state-of-the-art performance on several difficult sparsely rewarded RLBench vision-based robotics tasks, and can train real-world policies, tabula rasa, in a matter of minutes, with as few as 3 demonstrations.
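The coarse-to-fine ‘zooming’ can be sketched as follows, with `voxelise` and `q_net` as assumed helpers standing in for the voxelisation step and the discrete Q-network:

```python
import numpy as np

def coarse_to_fine_translation(voxelise, q_net, bounds, levels=3, grid=16):
    """bounds: ((x0,y0,z0), (x1,y1,z1)) workspace box; returns a translation.
    At each level, pick the best voxel and re-voxelise inside it at higher
    resolution, so the discrete action space stays small while the effective
    resolution grows as grid**levels."""
    lo, hi = np.array(bounds[0], float), np.array(bounds[1], float)
    for _ in range(levels):
        voxels = voxelise(lo, hi, grid)          # (grid, grid, grid, features)
        q = q_net(voxels)                        # (grid, grid, grid) Q-values
        idx = np.unravel_index(np.argmax(q), q.shape)
        cell = (hi - lo) / grid
        lo = lo + np.array(idx) * cell           # 'zoom' into the best voxel
        hi = lo + cell
    return (lo + hi) / 2                         # near-lossless discretisation
```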
Artificial intelligence (AI) is accelerating the development of unconventional computing paradigms inspired by the abilities and energy efficiency of the brain. The human brain excels especially in computationally intensive cognitive tasks, such as pattern recognition and classification.
A long-term goal is decentralized neuromorphic computing, relying on a network of distributed cores to mimic the massive parallelism of the brain, thus rigorously following a nature-inspired approach for information processing. Through the gradual transformation of interconnected computing blocks into continuous computing tissue, the development of advanced forms of matter exhibiting basic features of intelligence can be envisioned, able to learn and process information in a delocalized manner. Such intelligent matter would interact with the environment by receiving and responding to external stimuli, while internally adapting its structure to enable the distribution and storage (as memory) of information.
We review progress towards implementations of intelligent matter using molecular systems, soft materials or solid-state materials, with respect to applications in soft robotics, the development of adaptive artificial skins and distributed neuromorphic computing.
We present a self-improving, Neural Tree Expansion (NTE) method for multi-robot online planning in non-cooperative environments, where each robot attempts to maximize its cumulative reward while interacting with other self-interested robots. Our algorithm adapts the centralized, perfect information, discrete-action space method from AlphaZero to a decentralized, partial information, continuous action space setting for multi-robot applications. Our method has three interacting components: (1) a centralized, perfect-information “expert” Monte Carlo Tree Search (MCTS) with large computation resources that provides expert demonstrations, (2) a decentralized, partial-information “learner” MCTS with small computation resources that runs in real-time and provides self-play examples, and (3) policy & value neural networks that are trained with the expert demonstrations and bias both the expert and the learner tree growth. Our numerical experiments demonstrate Neural Tree Expansion’s computational advantage by finding better solutions than an MCTS with 20× more resources. The resulting policies are dynamically sophisticated, demonstrate coordination between robots, and play the Reach-Target-Avoid differential game significantly better than the state-of-the-art control-theoretic baseline for multi-robot, double-integrator systems. Our hardware experiments on an aerial swarm demonstrate the computational advantage of Neural Tree Expansion, enabling online planning at 20Hz with effective policies in complex scenarios.
We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. To do so, we rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method is able to discover highly diverse and complex goals without any human priors. We further show that Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice’s trajectory when relabeled as a goal-conditioned demonstration. Finally, we show that our method scales, resulting in a single policy that can transfer to many unseen hold-out tasks such as setting a table, stacking blocks, and solving simple puzzles. Videos of a learned policy are available at https://robotics-self-play.github.io.
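A high-level sketch of the asymmetric self-play loop, with all interfaces (`rollout`, `reset_to`, buffer methods) assumed for illustration; this is not the authors’ implementation:

```python
def self_play_episode(env, alice, bob, buffer):
    # Alice acts for a while; her final state becomes the proposed goal.
    alice_traj = alice.rollout(env)
    goal = alice_traj.final_state

    # Bob attempts the goal from the same initial state, with sparse reward.
    bob_traj = bob.rollout(env.reset_to(alice_traj.initial_state), goal=goal)
    success = bob_traj.reached(goal)

    # Alice is rewarded when Bob fails, which pushes her to propose goals just
    # beyond Bob's current ability (a natural curriculum)...
    buffer.add_alice(alice_traj, reward=0.0 if success else 1.0)
    buffer.add_bob(bob_traj, success=success)

    # ...and her own trajectory doubles as a goal-conditioned demonstration
    # for Bob on the goals he could not solve.
    if not success:
        buffer.add_bob_demo(alice_traj, goal=goal)
```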
Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field.
Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly ‘remembering’ promising states and returning to such states before intentionally exploring.
Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders-of-magnitude improvements on the grand challenges of Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training.
The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration—an insight that may prove critical to the creation of truly intelligent learning agents.
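The remember-return-explore loop is simple enough to sketch directly; this assumes a simulator exposing snapshot/restore (the ‘return’ step) and a user-supplied state-downscaling function, with random actions standing in for the exploration policy:

```python
import random

def go_explore(env, downscale, steps=100_000, explore_len=100):
    archive = {}                                  # cell -> (snapshot, score, traj)
    obs, snap = env.reset(), env.snapshot()       # snapshot/restore assumed
    archive[downscale(obs)] = (snap, 0.0, [])
    for _ in range(steps):
        # Remember & return: pick an archived cell (uniform here for
        # simplicity; Go-Explore weights selection toward promising cells)
        # and restore the simulator to it, avoiding detachment/derailment.
        cell = random.choice(list(archive))
        snap, score, traj = archive[cell]
        env.restore(snap)
        # Explore from it.
        for _ in range(explore_len):
            a = env.action_space.sample()
            obs, r, done, _ = env.step(a)
            score, traj = score + r, traj + [a]
            c = downscale(obs)
            if c not in archive or score > archive[c][1]:
                archive[c] = (env.snapshot(), score, traj)
            if done:
                break
    return archive
```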
Accurately predicting the dynamics of robotic systems is crucial for model-based control and reinforcement learning. The most common way to estimate dynamics is by fitting a one-step-ahead prediction model and using it to recursively propagate the predicted state distribution over long horizons. Unfortunately, this approach is known to compound even small prediction errors, making long-term predictions inaccurate. In this paper, we propose a new parametrization for supervised learning on state-action data that stably predicts at longer horizons, which we call a trajectory-based model. This trajectory-based model takes an initial state, a future time index, and control parameters as inputs, and directly predicts the state at the future time index. Experimental results in simulated and real-world robotic tasks show that trajectory-based models yield statistically-significantly more accurate long-term predictions, improved sample efficiency, and the ability to predict task reward. With these improved prediction properties, we conclude with a demonstration of methods for using the trajectory-based model for control.
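To make the contrast concrete, here is a minimal sketch of the two parametrizations with assumed dimensions (linear layers stand in for whatever function approximator is used):

```python
import torch
import torch.nn as nn

state_dim, act_dim, ctrl_dim = 10, 3, 6   # assumed sizes

# One-step model: s_{t+1} = f(s_t, a_t), applied recursively.
one_step = nn.Linear(state_dim + act_dim, state_dim)

# Trajectory-based model: s_t = f(s_0, t, controller parameters), no recursion.
trajectory_model = nn.Linear(state_dim + 1 + ctrl_dim, state_dim)

def rollout_one_step(s0, actions):
    s = s0
    for a in actions:                      # recursive: errors compound
        s = one_step(torch.cat([s, a], -1))
    return s

def rollout_trajectory(s0, t, ctrl_params):
    t_feat = torch.full(s0.shape[:-1] + (1,), float(t))
    return trajectory_model(torch.cat([s0, t_feat, ctrl_params], -1))  # direct
```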
Data-efficient learning of manipulation policies from visual observations is an outstanding challenge for real-robot learning. While deep reinforcement learning (RL) algorithms have shown success learning policies from visual observations, they still require an impractical number of real-world data samples to learn effective policies. However, recent advances in unsupervised representation learning and data augmentation significantly improved the sample efficiency of training RL policies on common simulated benchmarks. Building on these advances, we present a Framework for Efficient Robotic Manipulation (FERM) that utilizes data augmentation and unsupervised learning to achieve extremely sample-efficient training of robotic manipulation policies with sparse rewards. We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels, such as reaching, picking, moving, pulling a large object, flipping a switch, and opening a drawer in just 15–50 minutes of real-world training time. We include videos, code, and additional information on the project website—https://sites.google.com/view/efficient-robotic-manipulation.
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalize beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount. See videos for an overview of the manuscript, training time-lapse, and human-agent interactions.
…Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data…The scripted probe tasks are imperfect measures of model performance, but as we have shown above, they tend to be well correlated with model performance under human evaluation. With each doubling of the dataset size, performance grew by roughly the same increment. The rate of improvement, in particular on instruction-following tasks, was larger for the BG·A model than for the B·A model. Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.
Figure 15: Scaling & Transfer. A. Scaling properties for 2 of our agents. The agent’s performance on the scripted probe tasks increased as we trained on more data. In instruction-following tasks in particular, the rate of this increase was higher for BC+GAIL compared to BC (scatter points indicate seeds). B. Transfer learning across different language game prompts. Training on multiple language games simultaneously led to higher performance than training on each single prompt independently. C. Multitask training improved data efficiency. We held out episodes with instructions that contain the words “put”, “position” or “place” and studied how much of this data was required to learn to position objects in the room. When simultaneously trained on all language game prompts, using 1⁄8 of the Position data led to 60% of the performance with all data, compared to 7% if we used the positional data alone. D. Object-colour generalisation. We removed all instances of orange ducks from the data and environment, but we left all other orange objects and all non-orange ducks. The performance at scripted tasks testing for this particular object-colour combination was similar to baseline.
…After training, we asked the models to “Lift an orange duck” or “What colour is the duck?”…Figure 15D shows that the agent trained without orange ducks performed almost as well on these restricted Lift and Color probe tasks as an agent trained with all of the data. These results demonstrate explicitly what our results elsewhere suggest: that agents trained to imitate human action and language demonstrate powerful combinatorial generalisation capabilities. While they have never encountered the entity, they know what an “orange duck” is and how to interact with one when asked to do so for the first time. This particular example was chosen at random; we have every reason to believe that similar effects would be observed for other compound concepts.
As robots become more present in open human environments, it will become crucial for robotic systems to understand and predict human motion. Such capabilities depend heavily on the quality and availability of motion capture data. However, existing datasets of full-body motion rarely include (1) long sequences of manipulation tasks, (2) the 3D model of the workspace geometry, and (3) eye-gaze, which are all important when a robot needs to predict the movements of humans in close proximity.
Hence, in this paper, we present a novel dataset of full-body motion for everyday manipulation tasks, which includes the above. The motion data was captured using a traditional motion capture system based on reflective markers. We additionally captured eye-gaze using a wearable pupil-tracking device. As we show in experiments, the dataset can be used for the design and evaluation of full-body motion prediction algorithms. Furthermore, our experiments show eye-gaze as a powerful predictor of human intent.
The dataset includes 180 min of motion capture data with 1627 pick and place actions being performed. It is available at https://humans-to-robots-motion.github.io/mogaze and is planned to be extended to collaborative tasks with two humans in the near future.
In this paper, we present an experiment designed to investigate and evaluate the scalability and robustness aspects of mobile manipulation. The experiment involves performing variations of mobile pick-and-place actions and opening/closing environment containers in a human household. The robot is expected to act completely autonomously for extended periods of time. We discuss the scientific challenges raised by the experiment, present our robotic system that can address these challenges and successfully perform all the tasks of the experiment, and report empirical results, lessons learned, and where we hit limitations.
Meta-reinforcement learning algorithms can enable autonomous agents, such as robots, to quickly acquire new behaviors by leveraging prior experience in a set of related training tasks. However, the onerous data requirements of meta-training compounded with the challenge of learning from sensory inputs such as images have made meta-RL challenging to apply to real robotic systems. Latent state models, which learn compact state representations from a sequence of observations, can accelerate representation learning from visual inputs. In this paper, we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space. Building on this insight, we develop meta-RL with latent dynamics (MELD), an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills given observations and rewards. MELD outperforms prior meta-RL methods on several simulated image-based robotic control problems, and enables a real WidowX robotic arm to insert an Ethernet cable into new locations given a sparse task completion signal after only 8 hours of real world meta-training. To our knowledge, MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.
This chapter explores the creators and potential consumers of sex robots.
With Realbotix as our case study, we take a closer look at the language and sentiments of those developing the technology and those who are testing, consuming, or showing an interest in it. We do this by means of website and chat forum analysis, and via interviews with those involved.
From this, we can see that the motivation for developing a sexual companion robot places the emphasis firmly on the companionship aspect, and that those involved in creating and consuming the products share an ideology of intimacy and affection, with sexual gratification playing only a minor role.
Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years.
We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model. The world model uses discrete representations and is trained separately from the policy. DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model. With the same computational budget and wall-clock time, DreamerV2 reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow.
DreamerV2 is also applicable to tasks with continuous actions, where it learns an accurate world model of a complex humanoid robot and solves stand-up and walking from only pixel inputs.
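The key architectural choice is the discrete latent space. A minimal sketch of categorical latents with straight-through gradients, assuming PyTorch; the 32×32 factorization matches the paper, but the surrounding recurrent world model and KL balancing are omitted:

```python
# 32 categorical latent variables with 32 classes each, sampled with
# straight-through gradient estimation.
import torch
import torch.nn.functional as F

def sample_discrete_latent(logits):
    # logits: (B, 32, 32) -- batch of 32 categoricals with 32 classes each
    probs = F.softmax(logits, dim=-1)
    idx = torch.multinomial(probs.flatten(0, 1), 1).view(logits.shape[:-1])
    one_hot = F.one_hot(idx, logits.shape[-1]).float()
    # straight-through: the forward pass uses the hard one-hot sample, the
    # backward pass uses the gradient of the probabilities
    return one_hot + probs - probs.detach()

z = sample_discrete_latent(torch.randn(16, 32, 32))  # (16, 32, 32) one-hot
```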
The game of curling can be considered a good test bed for studying the interaction between artificial intelligence systems and the real world. In curling, the environmental characteristics change at every moment, and every throw has an impact on the outcome of the match. Furthermore, there is no time for relearning during a curling match due to the timing rules of the game.
Here, we report a curling robot that can achieve human-level performance in the game of curling using an adaptive deep reinforcement learning framework. Our proposed adaptation framework extends standard deep reinforcement learning using temporal features, which learn to compensate for the uncertainties and nonstationarities that are an unavoidable part of curling. Our curling robot, Curly, was able to win three of four official matches against expert human teams [top-ranked women’s curling teams and Korea national wheelchair curling team (reserve team)]. These results indicate that the gap between physics-based simulators and the real world can be narrowed.
Lifelong learning and adaptability are two defining aspects of biological agents. Modern reinforcement learning (RL) approaches have shown significant progress in solving complex tasks, however once training is concluded, the found solutions are typically static and incapable of adapting to new information or perturbations. While it is still not completely understood how biological brains learn and adapt so efficiently from experience, it is believed that synaptic plasticity plays a prominent role in this process.
Inspired by this biological mechanism, we propose a search method that, instead of optimizing the weight parameters of neural networks directly, only searches for synapse-specific Hebbian learning rules that allow the network to continuously self-organize its weights during the lifetime of the agent.
We demonstrate our approach on several reinforcement learning tasks with different sensory modalities and more than 450K trainable plasticity parameters. We find that starting from completely random weights, the discovered Hebbian rules enable an agent to navigate a dynamical 2D-pixel environment; likewise, they allow a simulated 3D quadrupedal robot to learn how to walk while adapting to morphological damage not seen during training, in the absence of any explicit reward or error signal, in fewer than 100 timesteps.
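As a concrete illustration of what is being searched over, here is a minimal numpy sketch of a per-synapse Hebbian update of this kind; the coefficient names are conventional, and the real networks and environments are far larger:

```python
# Per-synapse Hebbian plasticity: the evolved parameters are the per-synapse
# coefficients A..D and learning rate eta, not the weights themselves.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 4
W = rng.standard_normal((n_in, n_out)) * 0.1   # weights start random each episode
A, B, C, D = (rng.standard_normal((n_in, n_out)) for _ in range(4))
eta = 0.01

def step(x):
    y = np.tanh(x @ W)
    # Hebbian rule: dw = eta * (A*pre*post + B*pre + C*post + D)
    pre, post = x[:, None], y[None, :]
    W[...] += eta * (A * pre * post + B * pre + C * post + D)
    return y

for _ in range(10):            # weights self-organize during the agent's lifetime
    y = step(rng.standard_normal(n_in))
```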
The dm_control software package is a collection of Python libraries and task suites for reinforcement learning agents in an articulated-body simulation.
A MuJoCo wrapper provides convenient bindings to functions and data structures. The PyMJCF and Composer libraries enable procedural model manipulation and task authoring. The Control Suite is a fixed set of tasks with standardised structure, intended to serve as performance benchmarks. The Locomotion framework provides high-level abstractions and examples of locomotion tasks. A set of configurable manipulation tasks with a robot arm and snap-together bricks is also included.
dm_control is publicly available at this URL. A video summary of all tasks is available at this URL.
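For reference, the standard Control Suite interaction loop looks like this (typical dm_control usage; requires MuJoCo and the dm_control package installed):

```python
# Load a Control Suite task and run a random policy for one episode.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    time_step = env.step(action)   # TimeStep of (step_type, reward,
                                   # discount, observation)
```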
Increasing the scale of reinforcement learning experiments has allowed researchers to achieve unprecedented results both in training sophisticated agents for video games and in sim-to-real transfer for robotics. Typically such experiments rely on large distributed systems and require expensive hardware setups, limiting wider access to this exciting area of research. In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. We present the “Sample Factory”, a high-throughput training system optimized for a single-machine setting. Our architecture combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques, allowing us to achieve throughput higher than 10⁵ environment frames/second on non-trivial control problems in 3D without sacrificing sample efficiency. We extend Sample Factory to support self-play and population-based training and apply these techniques to train highly capable agents for a multiplayer first-person shooter game. The source code is available at https://github.com/alex-petrenko/sample-factory
Designing reward functions is a challenging problem in AI and robotics. Humans usually have a difficult time directly specifying all the desirable behaviors that a robot needs to optimize. One common approach is to learn reward functions from collected expert demonstrations. However, learning reward functions from demonstrations introduces many challenges: some methods require highly structured models, eg. reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that, on the other hand, require a tremendous amount of data. In addition, humans tend to have a difficult time providing demonstrations on robots with high degrees of freedom, or even quantifying reward values for given demonstrations.
To address these challenges, we present a preference-based learning approach where, as an alternative, human feedback is only in the form of comparisons between trajectories. Furthermore, we do not assume highly constrained structures on the reward function. Instead, we model the reward function using a Gaussian Process (GP) and propose a mathematical formulation to actively fit a GP using only human preferences. Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework. Our results in simulations and a user study suggest that our approach can efficiently learn expressive reward functions for robotics tasks.
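A minimal numpy sketch of the preference likelihood at the core of such methods, using a Bradley-Terry/logistic model; a linear reward stands in here for the paper's Gaussian Process, and the trajectory features and "human" answers are synthetic:

```python
# Learn reward parameters from pairwise trajectory comparisons:
# P(A preferred over B) = sigmoid(R(A) - R(B)).
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                                  # reward parameters to learn
for _ in range(200):
    fa, fb = rng.standard_normal(5), rng.standard_normal(5)  # trajectory features
    label = 1.0 if fa.sum() > fb.sum() else 0.0  # synthetic human preference
    p = 1.0 / (1.0 + np.exp(-(fa @ w - fb @ w))) # P(A preferred over B)
    w += 0.1 * (label - p) * (fa - fb)           # log-likelihood gradient step
print(w)   # weights align with the (hidden) preference criterion
```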
The promise of reinforcement learning is to solve complex sequential decision problems by specifying a high-level reward function only. However, RL algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires thoroughly exploring the environment, but despite substantial investments by the community, creating algorithms that can do so remains one of the central challenges of the field. We hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (“detachment”) and from failing to first return to a state before exploring from it (“derailment”). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly remembering promising states and first returning to such states before exploring. Go-Explore solves all heretofore unsolved Atari games (those for which algorithms could not previously outperform humans when evaluated following current community standards) and surpasses the state of the art on all hard-exploration games, with orders of magnitude improvements on the grand challenges Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a challenging and extremely sparse-reward robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training. The striking contrast between the substantial performance gains from Go-Explore and the simplicity of its mechanisms suggests that remembering promising states, returning to them, and exploring from them is a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
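A conceptual sketch of the resulting loop, assuming an emulator-style environment with a gym-like API plus get_state/set_state for deterministic resets; cell hashing and the robustification phase are elided:

```python
# Go-Explore outer loop: remember promising cells, return to them exactly,
# then explore from them.
import random

def go_explore(env, cell_fn, iterations=1000, explore_steps=100):
    obs = env.reset()
    archive = {cell_fn(obs): (env.get_state(), 0.0)}   # cell -> (state, score)
    for _ in range(iterations):
        # select a cell (uniformly here; the paper weights by visit counts)
        cell = random.choice(list(archive))
        state, score = archive[cell]
        env.set_state(state)                           # GO: first return to it
        for _ in range(explore_steps):                 # THEN explore from it
            obs, reward, done, _ = env.step(env.action_space.sample())
            score += reward
            c = cell_fn(obs)
            if c not in archive or score > archive[c][1]:
                archive[c] = (env.get_state(), score)  # remember new/better cells
            if done:
                break
    return archive
```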
One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most of the robot learning systems today are deployed as a fixed policy and they are not being adapted after their deployment. Can we efficiently adapt previously learned behaviors to new environments, objects and percepts in the real world?
In this paper, we present a method and empirical evidence towards a robot learning framework that facilitates continuous adaptation. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations (including changes in background, object shape and appearance, lighting conditions, and robot morphology) by fine-tuning via off-policy reinforcement learning. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that our approach of adapting pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks.
Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.
Reliable and stable locomotion has been one of the most fundamental challenges for legged robots. Deep reinforcement learning (deep RL) has emerged as a promising method for developing such control policies autonomously. In this paper, we develop a system for learning legged locomotion policies with deep RL in the real world with minimal human effort. The key difficulties for on-robot learning systems are automatic data collection and safety. We overcome these two challenges by developing a multi-task learning procedure and a safety-constrained RL framework. We tested our system on the task of learning to walk on three different terrains: flat ground, a soft mattress, and a doormat with crevices. Our system can automatically and efficiently learn locomotion skills on a Minitaur robot with little human intervention. The supplemental video can be found at: https://www.youtube.com/watch?v=cwyiq6dCgOc
…So the agency turned to retroreflectors, tiny glass beads that reflect light (in this case, a laser beam) back at its source…In addition to being incredibly maneuverable, dragonflies are exceptionally good gliders compared to other insects, which helps them conserve energy on long flights. The scientist brought in some specimens, and when Adkins pressed him on the issue, “the old fellow plucked the insect from its perch and tossed it into the air”, Adkins wrote. “It made about two circuits and landed nicely on the desk.”
The demonstration convinced Adkins, but the team still needed to figure out how to replicate a dragonfly’s wings, which flap 1,800 times per minute. To pull this off, scientists used a tiny fluidic oscillator, a device with no moving parts that’s completely driven by gas produced by lithium nitrate crystals. When initial tests showed that the prototype couldn’t carry the required 0.2-gram payload, designers added additional thrust by venting exhaust backward, much like jet propulsion. After a quick dragonfly-inspired paint job, the drone was ready for (covert) action, weighing just under a gram. Its glittering ‘eyes’ were the glass retroreflector beads destined to snoop on unsuspecting targets.
…While the CIA now had its robo-bug, it still needed a way to control it. Radio control was out of the question because any extra weight would doom the small insectothopter. So CIA scientists turned to the same lasers used for the retroreflectors. This was a portable laser unit, known as ROME, that produced an invisible infrared beam. The idea was that the laser would heat a bimetallic strip that would then open or close the dragonfly’s exhaust, effectively throttling the ‘engine’, while another laser, acting as a kind of rudder, would steer the drone to its desired destination. With its gas-pumping engine and laser-based navigation system, the insectothopter could fly for only 60 seconds. But this was more than enough to get the dragonfly—and its payload—to a target some 200 meters away.
…The biggest problem with the insectothopter’s design was that an operator had to keep a laser manually trained on the drone during flight. Easily done in a static wind tunnel, less so in blustery and unpredictable conditions…In theory, the insectothopter could still be flown in winds of less than 7 mph, but “the ultimate demonstration of controlled powered flight has not yet been achieved”, Adkins ultimately reported. “Though the flight tests were impressive, control in any kind of crosswind was too difficult.”
There are two prevailing technical theories about what it will take to reach AGI. In one, all the necessary techniques already exist; it’s just a matter of figuring out how to scale and assemble them. In the other, there needs to be an entirely new paradigm; deep learning, the current dominant technique in AI, won’t be enough. Most researchers fall somewhere between these extremes, but OpenAI has consistently sat almost exclusively on the scale-and-assemble end of the spectrum. Most of its breakthroughs have been the product of sinking dramatically greater computational resources into technical innovations developed in other labs.
Brockman and Sutskever deny that this is their sole strategy, but the lab’s tightly guarded research suggests otherwise. A team called “Foresight” runs experiments to test how far they can push AI capabilities forward by training existing algorithms with increasingly large amounts of data and computing power. For the leadership, the results of these experiments have confirmed its instincts that the lab’s all-in, compute-driven strategy is the best approach. For roughly six months, these results were hidden from the public because OpenAI sees this knowledge as its primary competitive advantage. Employees and interns were explicitly instructed not to reveal them, and those who left signed nondisclosure agreements. It was only in January that the team, without the usual fanfare, quietly posted a paper on one of the primary open-source databases for AI research. People who experienced the intense secrecy around the effort didn’t know what to make of this change. Notably, another paper with similar results from different researchers had been posted a month earlier.
…One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI. [See DALL·E, CLIP.]
…The man driving OpenAI’s strategy is Dario Amodei, the ex-Googler who now serves as research director. When I meet him, he strikes me as a more anxious version of Brockman. He has a similar sincerity and sensitivity, but an air of unsettled nervous energy. He looks distant when he talks, his brows furrowed, a hand absentmindedly tugging his curls. Amodei divides the lab’s strategy into two parts. The first part, which dictates how it plans to reach advanced AI capabilities, he likens to an investor’s “portfolio of bets.” Different teams at OpenAI are playing out different bets. The language team, for example, has its money on a theory postulating that AI can develop a substantial understanding of the world through mere language learning. The robotics team, in contrast, is advancing an opposing theory that intelligence requires a physical embodiment to develop. As in an investor’s portfolio, not every bet has an equal weight. But for the purposes of scientific rigor, all should be tested before being discarded. Amodei points to GPT-2, with its remarkably realistic auto-generated texts, as an instance of why it’s important to keep an open mind. “Pure language is a direction that the field and even some of us were somewhat skeptical of”, he says. “But now it’s like, ‘Wow, this is really promising.’” Over time, as different bets rise above others, they will attract more intense efforts. Then they will cross-pollinate and combine. The goal is to have fewer and fewer teams that ultimately collapse into a single technical direction for AGI. This is the exact process that OpenAI’s latest top-secret project has supposedly already begun.
Covariant.ai has developed a platform that consists of off-the-shelf robot arms equipped with cameras, a special gripper, and plenty of computer power for figuring out how to grasp objects tossed into warehouse bins. The company, emerging from stealth Wednesday, announced the first commercial installations of its AI-equipped robots: picking boxes and bags of products for a German electronics retailer called Obeta.
…The company was founded in 2017 by Pieter Abbeel, a prominent AI professor at UC Berkeley, and several of his students. Abbeel pioneered the application of machine learning to robotics, and he made a name for himself in academic circles in 2010 by developing a robot capable of folding laundry (albeit very slowly). Covariant uses a range of AI techniques to teach robots how to grasp unfamiliar objects. These include reinforcement learning, in which an algorithm trains itself through trial and error, a little like the way animals learn through positive and negative feedback…Besides reinforcement learning, Abbeel says his company’s robots make use of imitation learning, a way of learning by observing demonstrations of perception and grasping by another algorithm, and meta-learning, a way of refining the learning process itself. Abbeel says the system can adapt and improve when a new batch of items arrives. “It’s training on the fly”, he says. “I don’t think anybody else is doing that in the real world.”
…But reinforcement learning is finicky and needs lots of computer power. “I used to be skeptical about reinforcement learning, but I’m not anymore”, says Hinton, a professor at the University of Toronto who also works part time at Google. Hinton says the amount of computer power needed to make reinforcement learning work has often seemed prohibitive, so it is striking to see commercial success. He says it is particularly impressive that Covariant’s system has been running in a commercial setting for a prolonged period.
…Peter Puchwein, vice president of innovation at Knapp, says he is particularly impressed by the way Covariant.ai’s robots can grasp even products in transparent bags, which can be difficult for cameras to perceive. “Even as a human being, if you have a box with 20 products in poly bags, it’s really hard to take just one out”, he says…Late last year, the international robot maker ABB ran a contest. It invited 20 companies to design software for its robot arms that could sort through bins of random items, from cubes to plastic bags filled with other objects. Ten of the companies were based in Europe, and the other half were in the United States. Most came nowhere close to passing the test. A few could handle most tasks but failed on the trickier cases. Covariant was the only company that could handle every task as swiftly and efficiently as a human. “We were trying to find weaknesses”, said Marc Segura, managing director of service robotics at ABB. “It is easy to reach a certain level on these tests, but it is super difficult not to show any weaknesses.”
Obtaining venous access for blood sampling or intravenous (IV) fluid delivery is an essential first step in patient care. However, success rates rely heavily on clinician experience and patient physiology. Difficulties in obtaining venous access result in missed sticks and injury to patients, and typically require alternative access pathways and additional personnel that lengthen procedure times, thereby creating unnecessary costs to healthcare facilities.
Here, we present the first-in-human assessment of an automated robotic venipuncture device designed to safely perform blood draws on peripheral forearm veins. The device combines ultrasound imaging and miniaturized robotics to identify suitable vessels for cannulation and robotically guide an attached needle toward the lumen center. The device demonstrated results comparable to or exceeding those of clinical standards, with a success rate of 87% on all participants (n = 31), a 97% success rate on participants without difficult venous access (n = 25), and an average procedure time of 93 ± 30 s (n = 31).
In the future, this device can be extended to other areas of vascular access such as IV catheterization, central venous access, dialysis, and arterial line placement.
The AI community has a long-term goal of building intelligent machines that interact effectively with the physical world, and a key challenge is teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination—without a preprovided map. We are announcing today that Facebook AI has created a new large-scale distributed reinforcement learning (RL) algorithm called DD-PPO, which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data. Agents trained with DD-PPO (which stands for decentralized distributed proximal policy optimization) achieve nearly 100% success in a variety of virtual environments, such as houses and office buildings. We have also successfully tested our model with tasks in real-world physical settings using a LoCoBot and Facebook AI’s PyRobot platform. An unfortunate fact about maps is that they become outdated the moment they are created. Most real-world environments evolve—buildings and structures change, objects are moved around, and people and pets are in constant flux. By learning to navigate without a map, DD-PPO-trained agents will accelerate the creation of new AI applications for the physical world.
Previous systems reached a 92% success rate on these tasks, but even failing 1 out of 100 times is not acceptable in the physical world, where a robot agent might damage itself or its surroundings by making an error. DD-PPO-trained agents reach their goal 99.9% of the time. Perhaps even more impressive, they do so with near-maximal efficiency, choosing a path that comes within 3% (on average) of matching the shortest possible route from the starting point to the goal. It is worth stressing how uncompromising this task is. There is no scope for mistakes of any kind—no wrong turn at a crossroads, no backtracking from a dead end, no exploration or deviation of any kind from the most direct path. We believe that the agent learns to exploit the statistical regularities in the floor plans of real indoor environments (apartments, houses, and offices) that are also present in our data sets. This improved performance is powered by a new, more effective system for distributed training (DD-PPO), along with the state-of-the-art speed and fidelity of Facebook AI’s open source AI Habitat platform.
…We propose a simple, synchronous, distributed RL method that scales well. We call this method decentralized distributed proximal policy optimization, as it is decentralized (has no parameter server) and distributed (runs across many machines), and we use it to scale proximal policy optimization, a previously developed technique (Schulman et al 2017). In DD-PPO, each worker alternates between collecting experience in a resource-intensive, GPU-accelerated simulated environment and then optimizing the model. This distribution is synchronous—there is an explicit communication stage in which workers synchronize their updates to the model.
The variability in experience-collection runtime presents a challenge to using this method in RL. In supervised learning, all gradient computations take about the same time. In RL, some resource-intensive environments can take substantially longer to simulate. This introduces substantial synchronization overhead, as every worker must wait for the slowest to finish collecting experience. To address this, we introduced a preemption threshold: the rollout-collection stage of these stragglers is forced to end early once some percentage p of the other workers (we find 60% to work well) have finished collecting their rollouts, thereby dramatically improving scaling. Our system weighs all workers’ contributions to the loss equally and limits the minimum number of steps before preemption to one-fourth the maximum, to ensure that all environments contribute to learning.
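A toy illustration of that preemption rule under the stated numbers (60% threshold, minimum rollout length of one-quarter the maximum); purely illustrative, not the production system:

```python
# Simulate straggler preemption across 16 workers collecting 128-step rollouts.
import numpy as np

rng = np.random.default_rng(0)
num_workers, target_len = 16, 128
steps = np.full(num_workers, target_len)

finish_order = rng.permutation(num_workers)   # simulated order of completion
n_done = int(0.6 * num_workers)               # preemption fires at 60% done
stragglers = finish_order[n_done:]            # still collecting at that point
steps[stragglers] = rng.integers(target_len // 4, target_len,
                                 size=stragglers.size)  # forced to stop early
print(steps)   # every environment still contributes >= 32 steps to the update
```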
We transform reinforcement learning (RL) into a form of supervised learning (SL) by turning traditional RL on its head, calling this Upside Down RL (UDRL). Standard RL predicts rewards, while UDRL instead uses rewards as task-defining inputs, together with representations of time horizons and other computable functions of historic and desired future data. UDRL learns to interpret these input observations as commands, mapping them to actions (or action probabilities) through SL on past (possibly accidental) experience.
UDRL generalizes to achieve high rewards or other goals, through input commands such as: get lots of reward within at most so much time! A separate paper [63] on first experiments with UDRL shows that even a pilot version of UDRL can outperform traditional baseline algorithms on certain challenging RL problems.
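A minimal sketch of the UDRL mapping, assuming PyTorch and a discrete action space: the command (desired return, desired horizon) is simply appended to the observation, and the network is fit by supervised learning on past episodes:

```python
# Command-conditioned policy: SL maps (observation, command) -> action.
import torch
import torch.nn as nn

class UDRLPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2, hidden), nn.ReLU(),   # +2 for the command
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, desired_return, desired_horizon):
        cmd = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([obs, cmd], dim=-1))   # action logits

policy = UDRLPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(8, 4)
logits = policy(obs, desired_return=torch.full((8,), 200.0),
                desired_horizon=torch.full((8,), 100.0))
# plain supervised learning on (command -> action) pairs from past episodes
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()
```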
We also conceptually simplify an approach [60] for teaching a robot to imitate humans. First videotape humans imitating the robot’s current behaviors, then let the robot learn through SL to map the videos (as input commands) to these behaviors, then let it generalize and imitate videos of humans executing previously unknown behavior. This Imitate-Imitator concept may actually explain why biological evolution has resulted in parents who imitate the babbling of their babies.
Procedural Content Generation (PCG) refers to the practice, in videogames and other games, of generating content such as levels, quests, or characters algorithmically. Motivated by the need to make games replayable, as well as to reduce authoring burden, limit storage space requirements, and enable particular aesthetics, a large number of PCG methods have been devised by game developers. Additionally, researchers have explored adapting methods from machine learning, optimization, and constraint solving to PCG problems. Games have been widely used in AI research since the inception of the field, and in recent years have been used to develop and benchmark new machine learning algorithms. Through this practice, it has become more apparent that these algorithms are susceptible to overfitting. Often, an algorithm will not learn a general policy, but instead a policy that will only work for a particular version of a particular task with particular initial parameters. In response, researchers have begun exploring randomization of problem parameters to counteract such overfitting and to allow trained policies to more easily transfer from one environment to another, such as from a simulated robot to a robot in the real world. Here we review the large amount of existing work on PCG, which we believe has an important role to play in increasing the generality of machine learning methods. Our main goal is to present RL/AI practitioners with new tools from the PCG toolbox; a secondary goal is to explain to game developers and researchers how their work is relevant to AI research.
We propose a novel system, SwarmCloak, for the landing of a fleet of four flying robots on human arms using light-sensitive landing pads with vibrotactile feedback. We developed two types of wearable tactile displays with vibromotors that are activated by the light emitted from the LED array at the bottom of the quadcopters. In a user study, participants were asked to adjust the position of their arms to land up to two drones, having only visual feedback, only tactile feedback, or visual-tactile feedback. The experiment revealed that as the number of drones increases, tactile feedback plays a more important role in accurate landing and operator convenience. An important finding is that the best landing performance is achieved with a combination of tactile and visual feedback. The proposed technology could have a strong impact on human-swarm interaction, providing a new level of intuitiveness and engagement in swarm deployment right from the skin surface.
Large, richly annotated datasets have accelerated progress in fields such as computer vision and natural language processing, but replicating these successes in robotics has been challenging. While prior data collection methodologies such as self-supervision have resulted in large datasets, the data can have poor signal-to-noise ratio. By contrast, previous efforts to collect task demonstrations with humans provide better quality data, but they cannot reach the same data magnitude. Furthermore, neither approach places guarantees on the diversity of the data collected, in terms of solution strategies.
In this work, we leverage and extend the RoboTurk platform to scale up data collection for robotic manipulation using remote teleoperation. The primary motivation for our platform is two-fold: (1) to address the shortcomings of prior work and increase the total quantity of manipulation data collected through human supervision by an order of magnitude without sacrificing the quality of the data and (2) to collect data on challenging manipulation tasks across several operators and observe a diverse set of emergent behaviors and solutions.
We collected over 111 hours of robot manipulation data across 54 users and 3 challenging manipulation tasks in 1 week, resulting in the largest robot dataset collected via remote teleoperation. We evaluate the quality of our platform, the diversity of demonstrations in our dataset, and the utility of our dataset via quantitative and qualitative analysis. For additional results, supplementary videos, and to download our dataset, visit http://roboturk.stanford.edu/realrobotdataset.
We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling—achieving a speedup of 107× on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience)—over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially solves the task: near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks—the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).
Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks.
Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 7 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (eg. with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.
We demonstrate that models trained only in simulation can be used to solve a manipulation problem of unprecedented complexity on a real robot.
This is made possible by two key components: a novel algorithm, which we call automatic domain randomization (ADR), and a robot platform built for machine learning. ADR automatically generates a distribution over randomized environments of ever-increasing difficulty.
Control policies and vision state estimators trained with ADR exhibit vastly improved sim2real transfer. For control policies, memory-augmented models trained on an ADR-generated distribution of environments show clear signs of emergent meta-learning at test time. The combination of ADR with our custom robot platform allows us to solve a Rubik’s cube with a humanoid robot hand, which involves both control and state estimation problems.
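A sketch of the ADR expansion rule, reduced to one scalar range per randomization parameter; names and thresholds here are illustrative, and the real system tracks per-boundary performance buffers over many more parameters:

```python
# Widen each randomization range when the policy succeeds at its boundary,
# narrow it when the policy struggles.
import random

ranges = {"cube_size": [1.0, 1.0], "friction": [1.0, 1.0]}  # start unrandomized
STEP, HI, LO = 0.05, 0.8, 0.2

def adr_update(param, boundary_success_rate):
    lo, hi = ranges[param]
    if boundary_success_rate > HI:        # policy handles the boundary: widen
        lo, hi = lo - STEP, hi + STEP
    elif boundary_success_rate < LO:      # policy struggles: narrow
        mid = (lo + hi) / 2
        lo, hi = min(lo + STEP, mid), max(hi - STEP, mid)
    ranges[param] = [lo, hi]

def sample_env_params():
    # each episode draws its physics from the current randomization ranges
    return {p: random.uniform(lo, hi) for p, (lo, hi) in ranges.items()}

adr_update("friction", 0.9)   # success at the boundary -> harder distribution
print(sample_env_params())
```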
We’ve trained a pair of neural networks to solve the Rubik’s Cube with a human-like robot hand. The neural networks are trained entirely in simulation, using the same reinforcement learning code as OpenAI Five paired with a new technique called Automatic Domain Randomization (ADR). The system can handle situations it never saw during training, such as being prodded by a stuffed giraffe. This shows that reinforcement learning isn’t just a tool for virtual tasks, but can solve physical-world problems requiring unprecedented dexterity.
…Since May 2017, we’ve been trying to train a human-like robotic hand to solve the Rubik’s Cube. We set this goal because we believe that successfully training such a robotic hand to do complex manipulation tasks lays the foundation for general-purpose robots. We solved the Rubik’s Cube in simulation in July 2017. But as of July 2018, we could only manipulate a block on the robot. Now, we’ve reached our initial goal. Solving a Rubik’s Cube one-handed is a challenging task even for humans, and it takes children several years to gain the dexterity required to master it. Our robot still hasn’t perfected its technique though, as it solves the Rubik’s Cube 60% of the time (and only 20% of the time for a maximally difficult scramble).
We present a framework for data-driven robotics that makes use of a large dataset of recorded robot experience and scales to several tasks using learned reward functions. We show how to apply this framework to accomplish three different object manipulation tasks on a real robot platform. Given demonstrations of a task together with task-agnostic recorded experience, we use a special form of human annotation as supervision to learn a reward function, which enables us to deal with real-world tasks where the reward signal cannot be acquired directly. Learned rewards are used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We show that using our approach it is possible to train agents to perform a variety of challenging manipulation tasks including stacking rigid objects and handling cloth.
ROBEL is an open-source platform of cost-effective robots designed for reinforcement learning in the real world. ROBEL introduces two robots, each aimed to accelerate reinforcement learning research in different task domains: D’Claw is a three-fingered hand robot that facilitates learning dexterous manipulation tasks, and D’Kitty is a four-legged robot that facilitates learning agile legged locomotion tasks. These low-cost, modular robots are easy to maintain and are robust enough to sustain on-hardware reinforcement learning from scratch with over 14000 training hours registered on them to date.
To leverage this platform, we propose an extensible set of continuous control benchmark tasks for each robot. These tasks feature dense and sparse task objectives, and additionally introduce score metrics for hardware safety. We provide benchmark scores on an initial set of tasks using a variety of learning-based methods. Furthermore, we show that these results can be replicated across copies of the robots located in different institutions. Code, documentation, design files, detailed assembly instructions, final policies, baseline details, task videos, and all supplementary materials required to reproduce the results are available at www.roboticsbenchmarks.org.
We present fully autonomous source seeking onboard a highly constrained nano quadcopter, contributing application-specific system and observation-feature design to enable onboard inference of a deep-RL policy.
Our deep-RL algorithm finds a high-performance solution to a challenging problem, even in the presence of high noise levels, and generalizes across real and simulated environments with different obstacle configurations.
We verify our approach with simulation and in-field testing on a Bitcraze CrazyFlie using only the cheap and ubiquitous Cortex-M4 microcontroller unit.
The results show that, by end-to-end application-specific system design, our solution consumes almost three times less additional power than a competing learning-based navigation approach onboard a nano quadcopter. Thanks to our observation space, which we carefully design within the resource constraints, our solution achieves a 94% success rate in cluttered and randomized test environments, compared to the previously achieved 80%. We also compare our strategy to a simple finite state machine (FSM) geared towards efficient exploration, and demonstrate that our policy is more robust and resilient at obstacle avoidance as well as up to 70% more efficient in source seeking.
To this end, we contribute a cheap and lightweight end-to-end tiny robot learning (tinyRL) solution, running onboard a nano quadcopter, that proves to be robust and efficient in a challenging task using limited sensory input.
As researchers teach robots to perform more and more complex tasks, the need for realistic simulation environments is growing. Existing techniques for closing the reality gap by approximating real-world physics often require extensive real world data and/or thousands of simulation samples.
This paper presents TuneNet, a new machine learning-based method to directly tune the parameters of one model to match another using an iterative residual tuning technique. TuneNet estimates the parameter difference between two models using a single observation from the target and minimal simulation, allowing rapid, accurate and sample-efficient parameter estimation. The system can be trained via supervised learning over an auto-generated simulated dataset.
We show that TuneNet can perform system identification, even when the true parameter values lie well outside the distribution seen during training, and demonstrate that simulators tuned with TuneNet outperform existing techniques for predicting rigid body motion. Finally, we show that our method can estimate real-world parameter values, allowing a robot to perform sim-to-real task transfer on a dynamic manipulation task unseen during training.
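A sketch of the iterative residual-tuning idea, with placeholder simulator and regressor so the loop is runnable; in the paper the regressor is a trained network and the simulator is a rigid-body physics engine:

```python
# Iterative residual tuning: repeatedly predict a parameter *difference* from
# (simulated observation, target observation) and apply it.
import numpy as np

def simulate(params):                     # placeholder simulator rollout
    return np.tanh(params)                # "observation" of the system

def residual_net(obs_sim, obs_target):    # placeholder for the trained net:
    return 0.5 * (obs_target - obs_sim)   # predicts a parameter residual

params = np.zeros(3)                              # initial parameter guess
obs_target = simulate(np.array([0.8, -0.3, 0.5])) # one observation of the target
for _ in range(10):                               # iterative residual tuning
    delta = residual_net(simulate(params), obs_target)
    params += delta                               # apply the residual, repeat
print(params)   # converges toward the true parameter values
```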
As a new general-purpose technology, robots have the potential to radically transform employment and organizations.
In contrast to prior studies that predict dramatic employment declines, we find that investments in robotics are associated with increases in total firm employment, but decreases in the total number of managers. Similarly, we find that robots are associated with an increase in the span of control for supervisors remaining within the organization. We also provide evidence that robot adoption is not motivated by the desire to reduce labor costs, but is instead related to improving product and service quality.
Our findings are consistent with the notion that robots reduce variance in production processes, diminishing the need for managers to monitor worker activities to ensure production quality. As additional evidence, we also find robot investments predict improved performance measurement and increased adoption of incentive pay based on individual employee performance. With respect to changes in skill composition within the organization, robots predict decreases in employment for middle-skilled workers, but increases in employment for low-skilled and high-skilled workers. We also find robots not only predict changes in employment, but also corresponding adaptations in organizational structure. Robot investments are associated with both centralization and decentralization of decision-making authority depending upon the task, but decision rights in either case are reassigned away from the managerial level of the hierarchy.
Overall, our results suggest that robots have distinct and profound effects on employment and organizations that require fundamental changes in firm practices and organizational design.
Perhaps the most ambitious scientific quest in human history is the creation of general artificial intelligence, which roughly means AI that is as smart or smarter than humans. The dominant approach in the machine learning community is to attempt to discover each of the pieces required for intelligence, with the implicit assumption that some future group will complete the Herculean task of figuring out how to combine all of those pieces into a complex thinking machine. I call this the “manual AI approach”. This paper describes another exciting path that ultimately may be more successful at producing general AI. It is based on the clear trend in machine learning that hand-designed solutions eventually are replaced by more effective, learned solutions. The idea is to create an AI-generating algorithm (AI-GA), which automatically learns how to produce general AI. Three Pillars are essential for the approach: (1) meta-learning architectures, (2) meta-learning the learning algorithms themselves, and (3) generating effective learning environments. I argue that either approach could produce general AI first, and both are scientifically worthwhile irrespective of which is the fastest path. Because both are promising, yet the ML community is currently committed to the manual approach, I argue that our community should increase its research investment in the AI-GA approach. To encourage such research, I describe promising work in each of the Three Pillars. I also discuss AI-GA-specific safety and ethical considerations. Because it may be the fastest path to general AI, and because it is inherently scientifically interesting to understand the conditions in which a simple algorithm can produce general AI (as happened on Earth, where Darwinian evolution produced human intelligence), I argue that the pursuit of AI-GAs should be considered a new grand challenge of computer science research.
Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent’s observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial?
We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent.
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast—when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms—defining tasks (eg. navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents.
These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or ‘merely’ impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works and find evidence for the opposite conclusion—that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
“Long-Range Indoor Navigation with PRM-RL”, Anthony Francis, Aleksandra Faust, Hao-Tien Lewis Chiang, Jasmine Hsu, J. Chase Kew, Marek Fiser, Tsang-Wei Edward Lee et al (2019-02-25):
Long-range indoor navigation requires guiding robots with noisy sensors and controls through cluttered environments along paths that span a variety of buildings. We achieve this with PRM-RL, a hierarchical robot navigation method in which reinforcement learning agents that map noisy sensors to robot controls learn to solve short-range obstacle avoidance tasks, and then sampling-based planners map where these agents can reliably navigate in simulation; these roadmaps and agents are then deployed on robots, guiding them along the shortest path where the agents are likely to succeed. Here we use Probabilistic Roadmaps (PRMs) as the sampling-based planner, and AutoRL as the reinforcement learning method in the indoor navigation context. We evaluate the method in simulation for kinematic differential drive and kinodynamic car-like robots in several environments, and on differential-drive robots at three physical sites. Our results show PRM-RL with AutoRL is more successful than several baselines, is robust to noise, and can guide robots over hundreds of meters in the face of noise and obstacles in both simulation and on robots, including over 5.8 kilometers of physical robot navigation. Video: https://www.youtube.com/watch?v=xN-OWX5gKvQ
Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots has mainly been limited to simulation, and only a few comparably simple examples have been deployed on real systems. The primary reason is that training with real robots, particularly with dynamically balancing systems, is complicated and expensive.
In the present work, we introduce a method for training a neural network policy in simulation and transferring it to a state-of-the-art legged system, thereby leveraging fast, automated, and cost-effective data generation schemes. The approach is applied to the ANYmal robot, a sophisticated medium-dog-sized quadrupedal system. Using policies trained in simulation, the quadrupedal machine achieves locomotion skills that go beyond what had been achieved with prior methods: ANYmal is capable of precisely and energy-efficiently following high-level body velocity commands, running faster than before, and recovering from falling even in complex configurations.
Enter the RobotriX, an extremely photorealistic indoor dataset designed to enable the application of deep learning techniques to a wide variety of robotic vision problems. The RobotriX consists of hyperrealistic indoor scenes which are explored by robot agents that also interact with objects in a visually realistic manner in that simulated world. Photorealistic scenes and robots are rendered by Unreal Engine into a virtual reality headset which captures gaze so that a human operator can move the robot and use controllers for the robotic hands; scene information is dumped on a per-frame basis so that it can be reproduced offline to generate raw data and ground truth labels. By taking this approach, we were able to generate a dataset of 38 semantic classes totaling 8M stills recorded at 60+ frames per second with full HD resolution. For each frame, RGB-D and 3D information is provided with full annotations in both spaces. Thanks to the high quality and quantity of both raw information and annotations, the RobotriX will serve as a new milestone for investigating 2D and 3D robotic vision tasks with large-scale data-driven techniques.
Real world data, especially in the domain of robotics, is notoriously costly to collect. One way to circumvent this can be to leverage the power of simulation to produce large amounts of labeled data. However, training models on simulated images does not readily transfer to real-world ones. Using domain adaptation methods to cross this “reality gap” requires a large amount of unlabeled real-world data, whilst domain randomization alone can waste modeling power. In this paper, we present Randomized-to-Canonical Adaptation Networks (RCANs), a novel approach to crossing the visual reality gap that uses no real-world data. Our method learns to translate randomized rendered images into their equivalent non-randomized, canonical versions. This in turn allows for real images to also be translated into canonical sim images. We demonstrate the effectiveness of this sim-to-real approach by training a vision-based closed-loop grasping reinforcement learning agent in simulation, and then transferring it to the real world to attain 70% zero-shot grasp success on unseen objects, a result that almost doubles the success rate of learning the same task directly on domain randomization alone. Additionally, by jointly finetuning in the real world with only 5,000 real-world grasps, our method achieves 91%, attaining performance comparable to a state-of-the-art system trained with 580,000 real-world grasps, a reduction in real-world data of more than 99%.
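The core of the RCAN recipe is easy to state: because the randomized and canonical renders come from the same simulator state, the translator can be trained with ordinary paired supervised regression. A minimal numpy sketch of that idea, where a linear map stands in for the image-to-image network and the paper's GAN and auxiliary losses are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired data for free from the simulator: each randomized render x has a
# canonical render y of the *same* scene state.
H = W = 16                   # toy resolution
n = 512
canonical = rng.uniform(0, 1, size=(n, H * W))       # stand-in canonical renders
clutter = rng.uniform(-0.5, 0.5, size=(n, H * W))    # stand-in randomization
randomized = np.clip(canonical + clutter, 0, 1)

# A linear map stands in for RCAN's image-to-image generator; train it by
# plain pixel regression.
G = np.zeros((H * W, H * W))
lr = 1e-3
for step in range(200):
    residual = randomized @ G - canonical
    G -= lr * (randomized.T @ residual) / n

# Deployment: real camera frames go through the same generator, so the
# grasping policy only ever sees canonical-looking inputs.
real_frame = rng.uniform(0, 1, size=(1, H * W))      # placeholder real image
canonicalized = real_frame @ G
print(canonicalized.shape)
```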
Deep reinforcement learning (RL) algorithms can learn complex robotic skills from raw sensory inputs, but have yet to achieve the kind of broad generalization and applicability demonstrated by deep learning methods in supervised domains. We present a deep RL method that is practical for real-world robotics tasks, such as robotic manipulation, and generalizes effectively to never-before-seen tasks and objects. In these settings, ground truth reward signals are typically unavailable, and we therefore propose a self-supervised model-based approach, where a predictive model learns to directly predict the future from raw sensory readings, such as camera images. At test time, we explore three distinct goal specification methods: designated pixels, where a user specifies desired object manipulation tasks by selecting particular pixels in an image and corresponding goal positions, goal images, where the desired goal state is specified with an image, and image classifiers, which define spaces of goal states. Our deep predictive models are trained using data collected autonomously and continuously by a robot interacting with hundreds of objects, without human supervision. We demonstrate that visual MPC can generalize to never-before-seen objects—both rigid and deformable—and solve a range of user-defined object manipulation tasks using the same model.
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.
Clear directions for a robotic platform: The chemistry literature contains more than a century’s worth of instructions for making molecules, all written by and for humans.
Steiner et al 2018 developed an autonomous compiler and robotic laboratory platform to synthesize organic compounds on the basis of standardized methods descriptions (see the Perspective by Milo). The platform comprises conventional equipment such as round-bottom flasks, separatory funnels, and a rotary evaporator to maximize its compatibility with extant literature. The authors showcase the system with short syntheses of 3 common pharmaceuticals that proceeded comparably to manual synthesis.
The synthesis of complex organic compounds is largely a manual process that is often incompletely documented.
To address these shortcomings, we developed an abstraction that maps commonly reported methodological instructions into discrete steps amenable to automation.
These unit operations were implemented in a modular robotic platform by using a chemical programming language that formalizes and controls the assembly of the molecules.
Yields and purities of products and intermediates were comparable to or better than those achieved manually. The syntheses are captured as digital code that can be published, versioned, and transferred flexibly between platforms with no modification, thereby greatly enhancing reproducibility and reliable access to complex molecules.
Introduction: Outside of a few well-defined areas such as polypeptide and oligonucleotide chemistry, the automation of chemical synthesis has been limited to large-scale bespoke industrial processes, with laboratory-scale and discovery-scale synthesis remaining predominantly a manual process. These areas are generally defined by the ability to synthesize complex molecules by the successive iteration of similar sets of reactions, allowing the synthesis of products by the automation of a relatively small palette of standardized reactions. Recent advances in areas such as flow chemistry, oligosaccharide synthesis, and iterative cross-coupling are expanding the number of compounds synthesized by automated methods. However, there is no universal and interoperable standard that allows the automation of chemical synthesis more generally.
Rationale: We developed a standard approach that mirrors how the bench chemist works and how the bulk of the open literature is reported, with the round-bottomed flask as the primary reactor. We assembled a relatively small array of equipment to accomplish a wide variety of different syntheses, and our abstraction of chemical synthesis encompasses the 4 key stages of synthetic protocols: reaction, workup, isolation, and purification. Further, taking note of the incomplete way chemical procedures are reported, we hypothesized that a standardized format for reporting a chemical synthesis procedure, coupled with an abstraction and formalism linking the synthesis to physical operations of an automated robotic platform, would yield a universal approach to a chemical programming language. We call this architecture and abstraction the Chemputer.
Results: For the Chemputer system to accomplish the automated synthesis of target molecules, we developed a program, the Chempiler, to produce specific, low-level instructions for modular hardware of our laboratory-scale synthesis robot. The Chempiler takes information about the physical connectivity and composition of the automated platform, in the form of a graph using the open-source GraphML format, and combines it with a hardware-independent scripting language [chemical assembly (ChASM) language], which provides instructions for the machine operations of the automated platform. The Chempiler software allows the ChASM code for a protocol to be run without editing on any unique hardware platform that has the correct modules for the synthesis. Formalization of a written synthetic scheme by using a chemical descriptive language (XDL) eliminates the ambiguous interpretation of the synthesis procedures. This XDL scheme is then translated into the ChASM file for a particular protocol. An automated robotic platform was developed, consisting of a fluidic backbone connected to a series of modules capable of performing the operations necessary to complete a synthetic sequence. The backbone allows the efficient transfer of the required chemicals into and out of any module of the platform, as well as the flushing and washing of the entire system during multistep procedures in which the modules are reused multiple times. The modules developed for the system consist of a reaction flask, a jacketed filtration setup capable of being heated or cooled, an automated liquid-liquid separation module, and a solvent evaporation module. With these 4 modules, it was possible to automate the synthesis of the pharmaceutical compounds diphenhydramine hydrochloride, rufinamide, and sildenafil without human interaction, in yields comparable to those achieved in traditional manual syntheses.
Conclusion: The Chemputer allows for an abstraction of chemical synthesis, when coupled with a high-level chemical programming language, to be compiled by our Chempiler into a low-level code that can run on a modular standard robotic platform for organic synthesis. The software and modular hardware standards permit synthetic protocols to be captured as digital code. This code can be published, versioned, and transferred flexibly between physical platforms with no modification. We validated this concept by the automated synthesis of 3 pharmaceutical compounds. This represents a step toward the automation of bench-scale chemistry more generally and establishes a standard aiming at increasing reproducibility, safety, and collaboration.
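The XDL → ChASM → machine-operation chain is essentially a compiler pipeline. The sketch below is a hypothetical, heavily simplified Python analogue; the real XDL and ChASM syntaxes differ, and `Step`, `compile_step`, and the platform graph here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Step:
    op: str        # "add", "heat", "separate", ...
    params: dict

# An XDL-like declarative protocol says *what* to do, not how any pump does it.
protocol = [
    Step("add", {"vessel": "reactor", "reagent": "amine", "volume_ml": 10}),
    Step("heat", {"vessel": "reactor", "temp_c": 80, "time_min": 60}),
    Step("separate", {"src": "reactor", "dst": "separator"}),
]

# GraphML-style platform graph: what the fluidic backbone connects.
graph = {
    "reactor": {"backbone"},
    "separator": {"backbone"},
    "backbone": {"reactor", "separator"},
}

def compile_step(step):
    """Chempiler-like lowering of one declarative step into machine ops,
    checked against the physical connectivity of this particular platform."""
    p = step.params
    if step.op == "add":
        return [f"PUMP {p['reagent']} -> {p['vessel']} {p['volume_ml']}ml"]
    if step.op == "heat":
        return [f"SET_TEMP {p['vessel']} {p['temp_c']}C", f"WAIT {p['time_min']}min"]
    if step.op == "separate":
        assert "backbone" in graph[p["src"]] and "backbone" in graph[p["dst"]]
        return [f"TRANSFER {p['src']} -> {p['dst']} via backbone"]
    raise ValueError(f"unknown operation: {step.op}")

for step in protocol:
    for instruction in compile_step(step):
        print(instruction)
```

The same declarative step list would compile to different low-level instructions on a differently connected platform, which is the portability claim in miniature.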
We learn end-to-end point-to-point and path-following navigation behaviors that avoid moving obstacles. These policies receive noisy lidar observations and output robot linear and angular velocities. The policies are trained in small, static environments with AutoRL, an evolutionary automation layer around Reinforcement Learning (RL) that searches for a deep RL reward and neural network architecture with large-scale hyper-parameter optimization. AutoRL first finds a reward that maximizes task completion, and then finds a neural network architecture that maximizes the cumulative value of that reward. Empirical evaluations, both in simulation and on-robot, show that AutoRL policies do not suffer from the catastrophic forgetting that plagues many other deep reinforcement learning algorithms, generalize to new environments and moving obstacles, are robust to sensor, actuator, and localization noise, and can serve as robust building blocks for larger navigation tasks. Our path-following and point-to-point policies are respectively 23% and 26% more successful than comparison methods across new environments. Video at: https://www.youtube.com/watch?v=0UwkjpUEcbI
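AutoRL's two-phase search is simple to express in outline: an outer evolutionary loop over reward parameters scored by true task completion, then a second search over network shape scored by the found reward. A toy numpy sketch, with `train_and_eval` standing in for what is, in AutoRL, a full RL training run per candidate:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_and_eval(reward_weights, net_width):
    """Stand-in for 'train a navigation agent with this proxy reward and
    network, return its task-completion rate'; here a made-up score so the
    loop executes."""
    w = np.asarray(reward_weights)
    return -np.sum((w - 0.3) ** 2) - 0.01 * abs(net_width - 64)

# Phase 1: evolve proxy-reward weights (e.g. goal-distance, clearance and
# step-penalty terms) to maximize true task completion.
population = rng.uniform(0, 1, size=(20, 3))
for generation in range(10):
    scores = np.array([train_and_eval(p, net_width=64) for p in population])
    elite = population[np.argsort(scores)[-5:]]
    population = elite[rng.integers(0, 5, size=20)] + rng.normal(0, 0.05, size=(20, 3))
scores = np.array([train_and_eval(p, net_width=64) for p in population])
best_reward = population[np.argmax(scores)]

# Phase 2: with the found reward fixed, search the architecture that
# maximizes the cumulative value of that reward.
best_width = max([16, 32, 64, 128, 256], key=lambda w: train_and_eval(best_reward, w))
print(best_reward.round(2), best_width)
```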
Through many recent successes in simulation, model-free reinforcement learning has emerged as a promising approach to solving continuous control robotic tasks. The research community is now able to reproduce, analyze and build quickly on these results due to open source implementations of learning algorithms and simulated benchmark tasks. To carry forward these successes to real-world applications, it is crucial to avoid relying on the unique advantages of simulation that do not transfer to the real world, and to experiment directly with physical robots. However, reinforcement learning research with physical robots faces substantial resistance due to the lack of benchmark tasks and supporting source code. In this work, we introduce several reinforcement learning tasks with multiple commercially available robots that present varying levels of learning difficulty, setup, and repeatability. On these tasks, we test the learning performance of off-the-shelf implementations of four reinforcement learning algorithms and analyze sensitivity to their hyper-parameters to determine their readiness for applications in various real-world tasks. Our results show that with a careful setup of the task interface and computations, some of these implementations can be readily applicable to physical robots. We find that state-of-the-art learning algorithms are highly sensitive to their hyper-parameters and their relative ordering does not transfer across tasks, indicating the necessity of re-tuning them for each task for best performance. On the other hand, the best hyper-parameter configuration from one task may often result in effective learning on held-out tasks even with different robots, providing a reasonable default. We make the benchmark tasks publicly available to enhance reproducibility in real-world reinforcement learning.
In this paper, we present a decentralized sensor-level collision avoidance policy for multi-robot systems, which shows promising results in practical applications. In particular, our policy directly maps raw sensor measurements to an agent’s steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to learn an optimal policy. The policy is trained over a large number of robots in rich, complex environments simultaneously using a policy-gradient-based reinforcement learning algorithm. The learning algorithm is also integrated into a hybrid control framework to further improve the policy’s robustness and effectiveness.
We validate the learned sensor-level collision avoidance policy in a variety of simulated and real-world scenarios with thorough performance evaluations for large-scale multi-robot systems. The generalization of the learned policy is verified in a set of unseen scenarios, including the navigation of a group of heterogeneous robots and a large-scale scenario with 100 robots. Although the policy is trained using simulation data only, we have successfully deployed it on physical robots whose shapes and dynamics differ from the simulated agents, demonstrating the controller’s robustness against sim-to-real modeling error. Finally, we show that the collision-avoidance policy learned from multi-robot navigation tasks provides an excellent solution for safe and effective autonomous navigation of a single robot working in a dense, real human crowd. Our learned policy enables a robot to make effective progress in a crowd without getting stuck. Videos are available at https://sites.google.com/view/hybridmrca
We use reinforcement learning (RL) to learn dexterous in-hand manipulation policies which can perform vision-based object reorientation on a physical Shadow Dexterous Hand. The training is performed in a simulated environment in which we randomize many of the physical properties of the system like friction coefficients and an object’s appearance. Our policies transfer to the physical robot despite being trained entirely in simulation. Our method does not rely on any human demonstrations, but many behaviors found in human manipulation emerge naturally, including finger gaiting, multi-finger coordination, and the controlled use of gravity. Our results were obtained using the same distributed RL system that was used to train OpenAI Five. We also include a video of our results: YouTube
Data-driven approaches to solving robotic tasks have gained a lot of traction in recent years. However, most existing policies are trained on large-scale datasets collected in curated lab settings. If we aim to deploy these models in unstructured visual environments like people’s homes, they will be unable to cope with the mismatch in data distribution.
In such light, we present the first systematic effort in collecting a large dataset for robotic grasping in homes. First, to scale and parallelize data collection, we built a low-cost mobile manipulator assembled for under 3K USD. Second, data collected using low-cost robots suffers from noisy labels due to imperfect execution and calibration errors. To handle this, we develop a framework which factors out the noise as a latent variable. Our model is trained on 28K grasps collected in several houses under an array of different environmental conditions.
We evaluate our models by physically executing grasps on a collection of novel objects in multiple unseen homes. The models trained with our home dataset showed a marked improvement of 43.7% over a baseline model trained with data collected in lab. Our architecture which explicitly models the latent noise in the dataset also performed 10% better than one that did not factor out the noise.
We hope this effort inspires the robotics community to look outside the lab and embrace learning-based approaches to handle inaccurate, cheap robots.
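The latent-noise idea can be read as a mixture model: a label's reliability depends on an unobserved nuisance variable (which robot, which calibration regime produced it), which is inferred and marginalized out at prediction time. A toy numpy sketch of that marginalization; every model below is an invented stand-in, and only the marginalization pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(2)

# A discrete latent z indexes the nuisance mode.
dim, n_z = 8, 4
weights = rng.normal(size=(n_z, dim))        # per-mode grasp predictors
mode_weights = rng.normal(size=(dim, n_z))   # posterior over modes

def p_success_given_z(x, z):
    """Per-noise-mode grasp-success probability (toy logistic model)."""
    return 1.0 / (1.0 + np.exp(-(x @ weights[z])))

def p_z(x):
    """Posterior over noise modes for this sample (toy softmax model)."""
    logits = x @ mode_weights
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.normal(size=dim)   # placeholder image features
# Predict by marginalizing the nuisance out:
# p(success | x) = sum_z p(z | x) * p(success | x, z)
p_success = sum(p_z(x)[z] * p_success_given_z(x, z) for z in range(n_z))
print(float(p_success))
```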
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised “practice” phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
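The "imagined goals" plus retroactive relabeling machinery is compact in pseudocode: encode observations into the learned latent space, reward by latent distance, and re-insert each transition with a later achieved state relabeled as the goal. A minimal sketch, with `encode` standing in for the learned VAE encoder:

```python
import numpy as np

rng = np.random.default_rng(3)

def encode(obs):
    """Stand-in for the learned VAE encoder; rewards live in its latent space."""
    return obs[:4]

def reward(z_state, z_goal):
    # Negative latent distance as the goal-reaching reward signal.
    return -float(np.linalg.norm(z_state - z_goal))

trajectory = [rng.normal(size=8) for _ in range(10)]   # one "practice" rollout
imagined_goal = rng.normal(size=4)                     # sampled from the prior

replay = []
for t, obs in enumerate(trajectory):
    z = encode(obs)
    # Original transition, rewarded against the self-proposed goal...
    replay.append((z, imagined_goal, reward(z, imagined_goal)))
    # ...plus a retroactively relabeled copy: pretend a state actually
    # reached later in the rollout was the goal all along. This densifies
    # the reward signal and is the sample-efficiency booster.
    relabeled_goal = encode(trajectory[rng.integers(t, len(trajectory))])
    replay.append((z, relabeled_goal, reward(z, relabeled_goal)))
print(len(replay), "transitions from one 10-step episode")
```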
Review of Roland & Shiman 2002 history of a decade of ARPA/DARPA involvement in AI and supercomputing, and the ARPA philosophy of technological acceleration; it yielded mixed results, perhaps due to ultimately insurmountable bottlenecks—the time was not yet ripe for many goals.
Review of DARPA history book, Strategic Computing: DARPA and the Quest for Machine Intelligence, 1983–1993, Roland & Shiman 2002, which reviews a large-scale DARPA effort to jumpstart real-world uses of AI in the 1980s by a multi-pronged research effort into more efficient computer chip R&D, supercomputing, robotics/self-driving cars, & expert system software. Roland & Shiman 2002 particularly focus on the various ‘philosophies’ of technological forecasting & development, which guided DARPA’s strategy in different periods, ultimately endorsing a weak technological determinism in which the bottlenecks are too large for an organization small in comparison to the global economy & global R&D to force through; the best a DARPA can hope for is a largely agnostic & reactive strategy in which granters ‘surf’ technological changes, rapidly exploiting new technology while investing their limited funds in targeted research to patch up any gaps or lags that accidentally open up and block broader applications. (For broader discussion of progress, see “Lessons from the Media Lab” & Bakewell.)
In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success.
To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.
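QT-Opt has no explicit actor network: both the greedy policy and the Bellman targets come from optimizing the Q-function over actions with the cross-entropy method. A numpy sketch of that inner loop, with a toy `q_function` standing in for the 1.2M-parameter network:

```python
import numpy as np

rng = np.random.default_rng(4)

def q_function(state, actions):
    """Toy stand-in for the grasp Q-network Q(s, a)."""
    return -np.sum((actions - 0.2) ** 2, axis=-1)

def cem_argmax_q(state, action_dim=4, iters=3, pop=64, n_elite=6):
    """Approximate argmax_a Q(s, a) with the cross-entropy method: fit a
    Gaussian to the top-scoring sampled actions and resample around them."""
    mu, sigma = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=(pop, action_dim))
        elites = candidates[np.argsort(q_function(state, candidates))[-n_elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

state = rng.normal(size=16)          # placeholder for the camera image
best_action = cem_argmax_q(state)    # re-run at every control step: closed loop
print(best_action.round(2))
```

Re-running this optimization after every new camera frame is what makes the grasping closed-loop: regrasping and dynamic responses fall out of replanning rather than being scripted.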
We present adversarial active exploration for inverse dynamics model learning, a simple yet effective learning scheme that incentivizes exploration in an environment without any human intervention. Our framework consists of a deep reinforcement learning (DRL) agent and an inverse dynamics model contesting with each other. The former collects training samples for the latter, with an objective to maximize the error of the latter. The latter is trained with samples collected by the former, and generates rewards for the former when it fails to predict the actual action taken by the former. In such a competitive setting, the DRL agent learns to generate samples that the inverse dynamics model fails to predict correctly, while the inverse dynamics model learns to adapt to the challenging samples. We further propose a reward structure that ensures the DRL agent collects only moderately hard samples, not overly hard ones that would prevent the inverse model from predicting effectively. We evaluate the effectiveness of our method on several robotic arm and hand manipulation tasks against multiple baseline models. Experimental results show that our method is comparable to those directly trained with expert demonstrations, and superior to the other baselines even without any human priors.
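The competitive reward structure fits in a few lines: the explorer is paid the inverse model's prediction error, but the payment is tapered past a difficulty threshold so it does not chase unlearnably hard transitions. A sketch; the exact functional form here is illustrative, not the paper's:

```python
import numpy as np

def explorer_reward(pred_action, true_action, threshold=1.0):
    """Pay the exploring agent the inverse model's squared prediction error,
    tapering past `threshold` so it stops seeking transitions too hard for
    the inverse model to ever fit. (Illustrative form only.)"""
    err = float(np.sum((pred_action - true_action) ** 2))
    return err if err <= threshold else max(0.0, 2 * threshold - err)

print(explorer_reward(np.array([0.1, 0.2]), np.array([0.5, 0.2])))  # moderately hard: rewarded
print(explorer_reward(np.array([2.0, 2.0]), np.array([0.0, 0.0])))  # too hard: no reward
```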
Learning to control robots directly from images is a primary challenge in robotics. However, many existing reinforcement learning approaches require iteratively obtaining millions of robot samples to learn a policy, which can take significant time. In this paper, we focus on learning a realistic world model capturing the dynamics of scene changes conditioned on robot actions. Our dreaming model can emulate samples equivalent to a sequence of images from the actual environment by learning an action-conditioned future representation/scene regressor. This allows the agent to learn action policies (ie. visuomotor policies) by interacting with the dreaming model rather than the real world. We experimentally confirm that our dreaming model enables robot learning of policies that transfer to the real world.
Advances in deep generative networks have led to impressive results in recent years. Nevertheless, such models can often waste their capacity on the minutiae of datasets, presumably due to weak inductive biases in their decoders. This is where graphics engines may come in handy since they abstract away low-level details and represent images as high-level programs. Current methods that combine deep learning and renderers are limited by hand-crafted likelihood or distance functions, a need for large amounts of supervision, or difficulties in scaling their inference algorithms to richer datasets. To mitigate these issues, we present SPIRAL, an adversarially trained agent that generates a program which is executed by a graphics engine to interpret and sample images. The goal of this agent is to fool a discriminator network that distinguishes between real and rendered data, trained with a distributed reinforcement learning setup without any supervision. A surprising finding is that using the discriminator’s output as a reward signal is the key to allow the agent to make meaningful progress at matching the desired output rendering. To the best of our knowledge, this is the first demonstration of an end-to-end, unsupervised and adversarial inverse graphics agent on challenging real world (MNIST, OMNIGLOT, CELEBA) and synthetic 3D datasets. A video of the agent can be found at YouTube.
I apply recent work on “learning to think” (2015) and on PowerPlay (2011) to the incremental training of an increasingly general problem solver, continually learning to solve new tasks without forgetting previous skills. The problem solver is a single recurrent neural network (or similar general purpose computer) called ONE. ONE is unusual in the sense that it is trained in various ways, eg. by black box optimization / reinforcement learning / artificial evolution as well as supervised / unsupervised learning. For example, ONE may learn through neuroevolution to control a robot through environment-changing actions, and learn through unsupervised gradient descent to predict future inputs and vector-valued reward signals as suggested in 1990. User-given tasks can be defined through extra goal-defining input patterns, also proposed in 1990. Suppose ONE has already learned many skills. Now a copy of ONE can be re-trained to learn a new skill, eg. through neuroevolution without a teacher. Here it may profit from re-using previously learned subroutines, but it may also forget previous skills. Then ONE is retrained in PowerPlay style (2011) on stored input/output traces of (a) ONE’s copy executing the new skill and (b) previous instances of ONE whose skills are still considered worth memorizing. Simultaneously, ONE is retrained on old traces (even those of unsuccessful trials) to become a better predictor, without additional expensive interaction with the environment. More and more control and prediction skills are thus collapsed into ONE, like in the chunker-automatizer system of the neural history compressor (1991). This forces ONE to relate partially analogous skills (with shared algorithmic information) to each other, creating common subroutines in form of shared subnetworks of ONE, to greatly speed up subsequent learning of additional, novel but algorithmically related skills.
Humans and animals are capable of learning a new behavior by observing others perform the skill just once. We consider the problem of allowing a robot to do the same—learning from raw video pixels of a human, even when there is substantial domain shift in the perspective, environment, and embodiment between the robot and the observed human. Prior approaches to this problem have hand-specified how human and robot actions correspond and often relied on explicit human pose detection systems. In this work, we present an approach for one-shot learning from a video of a human by using human and robot demonstration data from a variety of previous tasks to build up prior knowledge through meta-learning. Then, combining this prior knowledge and only a single video demonstration from a human, the robot can perform the task that the human demonstrated.
We show experiments on both a PR2 arm and a Sawyer arm, demonstrating that after meta-learning, the robot can learn to place, push, and pick-and-place new objects using just one video of a human performing the manipulation.
This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and adapt its gaze control strategy for human-robot interaction without the use of external sensors or human supervision. The robot learns to focus its attention onto groups of people from its own audio-visual experiences, independently of the number of people, of their positions and of their physical appearances. In particular, we use a recurrent neural network architecture in combination with Q-learning to find an optimal action-selection policy; we pre-train the network using a simulated environment that mimics realistic scenarios involving speaking/silent participants, thus avoiding the need for tedious sessions of a robot interacting with people. Our experimental evaluation suggests that the proposed method is robust to parameter estimation, i.e. the parameter values yielded by the method do not have a decisive impact on the performance. The best results are obtained when both audio and visual information are jointly used. Experiments with the Nao robot indicate that our framework is a step forward towards the autonomous learning of socially acceptable gaze behavior.
The importance of robotic assistive devices grows in our work and everyday life. Cooperative scenarios involving both robots and humans require safe human-robot interaction. One important aspect here is the management of robot errors, including fast and accurate online robot-error detection and correction. Analysis of brain signals from a human interacting with a robot may help identify robot errors, but the accuracies of such analyses still have substantial room for improvement. In this paper we evaluate whether a novel framework based on deep convolutional neural networks (deep ConvNets) could improve the accuracy of decoding robot errors from the EEG of a human observer, both during an object grasping and a pouring task. We show that deep ConvNets reached significantly higher accuracies than both regularized Linear Discriminant Analysis (rLDA) and filter bank common spatial patterns (FB-CSP) combined with rLDA, both widely used EEG classifiers. Deep ConvNets reached mean accuracies of 75% ± 9%, rLDA 65% ± 10% and FB-CSP + rLDA 63% ± 6% for decoding of erroneous vs. correct trials. Visualization of the time-domain EEG features learned by the ConvNets to decode errors revealed spatiotemporal patterns that reflected differences between the two experimental paradigms. Across subjects, ConvNet decoding accuracies were statistically-significantly correlated with those obtained with rLDA, but not CSP, indicating that in the present context ConvNets behaved more ‘rLDA-like’ (but consistently better), while in a previous decoding study with another task but the same ConvNet architecture, they were found to behave more ‘CSP-like’. Our findings thus provide further support for the assumption that deep ConvNets are a versatile addition to the existing toolbox of EEG decoding techniques, and we discuss steps by which ConvNet EEG decoding performance could be further optimized.
Simulations are attractive environments for training agents as they provide an abundant source of data and alleviate certain safety concerns during the training process. But the behaviours developed by agents in simulation are often specific to the characteristics of the simulator. Due to modeling error, strategies that are successful in simulation may not transfer to their real world counterparts. In this paper, we demonstrate a simple method to bridge this “reality gap”. By randomizing the dynamics of the simulator during training, we are able to develop policies that are capable of adapting to very different dynamics, including ones that differ significantly from the dynamics on which the policies were trained. This adaptivity enables the policies to generalize to the dynamics of the real world without any training on the physical system. Our approach is demonstrated on an object pushing task using a robotic arm. Despite being trained exclusively in simulation, our policies are able to maintain a similar level of performance when deployed on a real robot, reliably moving an object to a desired location from random initial configurations. We explore the impact of various design decisions and show that the resulting policies are robust to significant calibration error.
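Dynamics randomization itself is a very small amount of code: draw a fresh set of physical parameters each episode and train one policy across all draws. A sketch under assumed parameter ranges; the `sim` object and its API are placeholders, not a real simulator interface:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ranges; the paper randomizes masses, friction, damping,
# latencies, etc. of the simulated arm and object.
PARAM_RANGES = {
    "object_mass_kg": (0.05, 0.5),
    "table_friction": (0.2, 1.2),
    "joint_damping": (0.1, 2.0),
    "action_latency_s": (0.0, 0.04),
}

def sample_dynamics():
    """Draw one full set of simulator dynamics parameters."""
    return {k: float(rng.uniform(lo, hi)) for k, (lo, hi) in PARAM_RANGES.items()}

def run_episode(policy, sim, params):
    """Placeholder episode runner: `sim` and `policy` stand in for a physics
    simulator and the (recurrent) policy being trained."""
    sim.set_dynamics(**params)      # new dynamics every single episode
    return sim.rollout(policy)

print(sample_dynamics())
```

Because the dynamics change under the policy every episode, a memoryless controller cannot win; a recurrent policy is pushed to identify and adapt to the current dynamics online, which is exactly the capability that survives the transfer to the real arm.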
We present PRM-RL, a hierarchical method for long-range navigation task completion that combines sampling-based path planning with reinforcement learning (RL). The RL agents learn short-range, point-to-point navigation policies that capture robot dynamics and task constraints without knowledge of the large-scale topology. Next, the sampling-based planners provide roadmaps which connect robot configurations that can be successfully navigated by the RL agent. The same RL agents are used to control the robot under the direction of the planner, enabling long-range navigation. We use Probabilistic Roadmaps (PRMs) as the sampling-based planner. The RL agents are constructed using feature-based and deep neural net policies in continuous state and action spaces. We evaluate PRM-RL, both in simulation and on-robot, on two navigation tasks with non-trivial robot dynamics: end-to-end differential drive indoor navigation in office environments, and aerial cargo delivery in urban environments with load displacement constraints. Our results show improvement in task completion over both RL agents on their own and traditional sampling-based planners. In the indoor navigation task, PRM-RL successfully completes up to 215 m long trajectories under noisy sensor conditions, and the aerial cargo delivery completes flights over 1000 m without violating the task constraints in an environment 63 million times larger than that used in training.
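The roadmap-construction rule is the heart of PRM-RL: an edge is added between two sampled configurations only if the short-range RL agent can reliably navigate between them in simulated rollouts. A toy numpy sketch of that rule, with `rl_agent_success_rate` standing in for actual policy rollouts:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

def rl_agent_success_rate(a, b, trials=20):
    """Stand-in for 'roll out the trained point-to-point RL policy from a
    to b in simulation `trials` times'; here short hops always succeed and
    long ones always fail, so `trials` goes unused."""
    return 1.0 if np.linalg.norm(a - b) < 0.3 else 0.0

# Sample collision-free configurations (here: random points in the unit square).
nodes = rng.uniform(0, 1, size=(30, 2))

# Keep an edge only where the RL agent is reliable; long-range navigation is
# then shortest-path search over this roadmap, with the same RL agent
# executing each edge at run time.
edges = {i: [] for i in range(len(nodes))}
for i, j in combinations(range(len(nodes)), 2):
    if rl_agent_success_rate(nodes[i], nodes[j]) >= 0.9:
        edges[i].append(j)
        edges[j].append(i)
print(sum(len(v) for v in edges.values()) // 2, "roadmap edges")
```

Because edges encode "the executing agent succeeds here" rather than mere geometric clearance, the resulting plans are feasible for the actual controller, noise and dynamics included.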
Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. We study how randomized simulated environments and domain adaptation methods can be extended to train a grasping system to grasp novel objects from raw monocular RGB images. We extensively evaluate our approaches with a total of more than 25,000 physical test grasps, studying a range of simulation conditions and domain adaptation methods, including a novel extension of pixel-level domain adaptation that we term the GraspGAN. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples needed to achieve a given level of performance by up to 50×, using only randomly generated simulated objects. We also show that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with 939,777 labeled real-world samples.
The most data-efficient algorithms for reinforcement learning in robotics are model-based policy search algorithms, which alternate between learning a dynamical model of the robot and optimizing a policy to maximize the expected return given the model and its uncertainties. Among the few proposed approaches, the recently introduced Black-DROPS algorithm exploits a black-box optimization algorithm to achieve both high data-efficiency and good computation times when several cores are used; nevertheless, like all model-based policy search approaches, Black-DROPS does not scale to high dimensional state/action spaces. In this paper, we introduce a new model learning procedure in Black-DROPS that leverages parameterized black-box priors to (1) scale up to high-dimensional systems, and (2) be robust to large inaccuracies of the prior information. We demonstrate the effectiveness of our approach with the “pendubot” swing-up task in simulation and with a physical hexapod robot (48D state space, 18D action space) that has to walk forward as fast as possible. The results show that our new algorithm is more data-efficient than previous model-based policy search algorithms (with and without priors) and that it can allow a physical 6-legged robot to learn new gaits in only 16 to 30 seconds of interaction time.
In order for a robot to be a generalist that can perform a wide range of jobs, it must be able to acquire a wide variety of skills quickly and efficiently in complex unstructured environments. High-capacity models such as deep neural networks can enable a robot to represent complex skills, but learning each skill from scratch then becomes infeasible. In this work, we present a meta-imitation learning method that enables a robot to learn how to learn more efficiently, allowing it to acquire new skills from just a single demonstration. Unlike prior methods for one-shot imitation, our method can scale to raw pixel inputs and requires data from significantly fewer prior tasks for effective learning of new skills. Our experiments on both simulated and real robot platforms demonstrate the ability to learn new tasks, end-to-end, from a single visual demonstration.
Brain-controlled robots are a promising new type of assistive device for severely impaired persons. Little is, however, known about how to optimize the interaction of humans and brain-controlled robots. Information about the human’s perceived correctness of robot performance might provide a useful teaching signal for adaptive control algorithms and thus help enhance robot control. Here, we studied whether watching robots perform erroneous vs. correct actions elicits differential brain responses that can be decoded from single trials of electroencephalographic (EEG) recordings, and whether brain activity during human-robot interaction is modulated by the robot’s visual similarity to a human.
To address these topics, we designed two experiments. In experiment I, participants watched a robot arm pour liquid into a cup. The robot performed the action either erroneously or correctly, i.e. it either spilled some liquid or not. In experiment II, participants observed two different types of robots, humanoid and non-humanoid, grabbing a ball. The robots either managed to grab the ball or not. We recorded high-resolution EEG during the observation tasks in both experiments to train a Filter Bank Common Spatial Pattern (FBCSP) pipeline on the multivariate EEG signal and decode for the correctness of the observed action, and for the type of the observed robot.
Our findings show that it was possible to decode both correctness and robot type for the majority of participants significantly, although often just slightly, above chance level. Our findings suggest that non-invasive recordings of brain responses elicited when observing robots indeed contain decodable information about the correctness of the robot’s action and the type of observed robot.
Advances in deep learning over the last decade have led to a flurry of research in the application of deep artificial neural networks to robotic systems, with at least thirty papers published on the subject between 2014 and the present. This review discusses the applications, benefits, and limitations of deep learning vis-à-vis physical robotic systems, using contemporary research as exemplars. It is intended to communicate recent advances to the wider robotics community and inspire additional interest in and application of deep learning in robotics.
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
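PPO's clipped surrogate objective is short enough to state exactly; a numpy sketch:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate L = E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    with r = pi_new(a|s) / pi_old(a|s). The clip removes any incentive to
    push the policy far from the data-collecting policy, which is what
    makes multiple minibatch epochs per batch of experience safe."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return float(np.minimum(unclipped, clipped).mean())

# For A > 0 the objective stops growing once r exceeds 1+eps; for A < 0,
# pushing r below 1-eps yields no further gain.
print(ppo_clip_objective(np.array([0.8, 1.0, 1.5]), np.array([1.0, 1.0, 1.0])))
```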
This paper introduces the Intentional Unintentional (IU) agent. This agent endows the deep deterministic policy gradients (DDPG) agent for continuous control with the ability to solve several tasks simultaneously. Learning to solve many tasks simultaneously has been a long-standing, core goal of artificial intelligence, inspired by infant development and motivated by the desire to build flexible robot manipulators capable of many diverse behaviours. We show that the IU agent not only learns to solve many tasks simultaneously but also learns faster than agents that target a single task at a time. In some cases, where the single-task DDPG method completely fails, the IU agent successfully solves the task. To demonstrate this, we build a playroom environment using the MuJoCo physics engine, and introduce a grounded formal language to automatically generate tasks.
We propose a technique for multi-task learning from demonstration that trains the controller of a low-cost robotic arm to accomplish several complex picking and placing tasks, as well as non-prehensile manipulation. The controller is a recurrent neural network using raw images as input and generating robot arm trajectories, with the parameters shared across the tasks. The controller also combines VAE-GAN-based reconstruction with autoregressive multimodal action prediction. Our results demonstrate that it is possible to learn complex manipulation tasks, such as picking up a towel, wiping an object, and depositing the towel to its previous position, entirely from raw images with direct behavior cloning. We show that weight sharing and reconstruction-based regularization substantially improve generalization and robustness, and training on multiple tasks simultaneously increases the success rate on all tasks.
Material recognition enables robots to incorporate knowledge of material properties into their interactions with everyday objects. For example, material recognition opens up opportunities for clearer communication with a robot, such as “bring me the metal coffee mug”, and recognizing plastic versus metal is crucial when using a microwave or oven. However, collecting labeled training data with a robot is often more difficult than unlabeled data. We present a semi-supervised learning approach for material recognition that uses generative adversarial networks (GANs) with haptic features such as force, temperature, and vibration. Our approach achieves state-of-the-art results and enables a robot to estimate the material class of household objects with ~90% accuracy when 92% of the training data are unlabeled. We explore how well this approach can recognize the material of new objects and we discuss challenges facing generalization. To motivate learning from unlabeled training data, we also compare results against several common supervised learning classifiers. In addition, we have released the dataset used for this work which consists of time-series haptic measurements from a robot that conducted thousands of interactions with 72 household objects.
Deep learning and reinforcement learning methods have recently been used to solve a variety of problems in continuous control domains. An obvious application of these techniques is dexterous manipulation tasks in robotics which are difficult to solve using traditional control theory or hand-engineered approaches. One example of such a task is to grasp an object and precisely stack it on another. Solving this difficult and practically relevant problem in the real world is an important long-term goal for the field of robotics.
Here we take a step towards this goal by examining the problem in simulation and providing models and techniques aimed at solving it. We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it statistically-significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them.
Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.
“One-Shot Imitation Learning”, Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel et al (2017-03-21):
Imitation learning has been commonly applied to solve different tasks in isolation. This usually requires either careful feature engineering, or a significant number of samples. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given task, and instantly generalize to new situations of the same task, without requiring task-specific engineering. In this paper, we propose a meta-learning framework for achieving such capability, which we call one-shot imitation learning.
Specifically, we consider the setting where there is a very large set of tasks, and each task has many instantiations. For example, a task could be to stack all blocks on a table into a single tower, another task could be to place all blocks on a table into two-block towers, etc. In each case, different instances of the task would consist of different sets of blocks with different initial states. At training time, our algorithm is presented with pairs of demonstrations for a subset of all tasks. A neural net is trained that takes as input one demonstration and the current state (which initially is the initial state of the other demonstration of the pair), and outputs an action with the goal that the resulting sequence of states and actions matches as closely as possible with the second demonstration. At test time, a demonstration of a single instance of a new task is presented, and the neural net is expected to perform well on new instances of this new task. The use of soft attention allows the model to generalize to conditions and tasks unseen in the training data. We anticipate that by training this model on a much greater variety of tasks and settings, we will obtain a general system that can turn any demonstrations into robust policies that can accomplish an overwhelming variety of tasks.
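The mechanism that lets a single demonstration condition the policy is soft attention over the demonstration's timesteps. A minimal numpy sketch of that core; the paper's architecture adds much more (temporal processing, attention over block positions, a learned action decoder), which this omits:

```python
import numpy as np

rng = np.random.default_rng(7)

def soft_attention(query, keys, values):
    """Dot-product attention over demonstration timesteps."""
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

def one_shot_policy(state, demo_states, demo_actions):
    # Use the current state to index into the reference demonstration and
    # read out an action; a real model decodes this further.
    return soft_attention(state, demo_states, demo_actions)

demo_states = rng.normal(size=(50, 8))    # a single demo of an unseen task
demo_actions = rng.normal(size=(50, 3))
action = one_shot_policy(rng.normal(size=8), demo_states, demo_actions)
print(action.round(2))
```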
The most data-efficient algorithms for reinforcement learning (RL) in robotics are based on uncertain dynamical models: after each episode, they first learn a dynamical model of the robot, then they use an optimization algorithm to find a policy that maximizes the expected return given the model and its uncertainties. It is often believed that this optimization can be tractable only if analytical, gradient-based algorithms are used; however, these algorithms require using specific families of reward functions and policies, which greatly limits the flexibility of the overall approach. In this paper, we introduce a novel model-based RL algorithm, called Black-DROPS (Black-box Data-efficient RObot Policy Search) that: (1) does not impose any constraint on the reward function or the policy (they are treated as black-boxes), (2) is as data-efficient as the state-of-the-art algorithm for data-efficient RL in robotics, and (3) is as fast (or faster) than analytical approaches when several cores are available. The key idea is to replace the gradient-based optimization algorithm with a parallel, black-box algorithm that takes into account the model uncertainties. We demonstrate the performance of our new algorithm on two standard control benchmark problems (in simulation) and a low-cost robotic manipulator (with a real robot).
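The key move in Black-DROPS is to treat the Monte-Carlo estimate of expected return under the learned probabilistic model as a noisy black box, so a population-based optimizer can search policy parameters directly. A toy numpy sketch; a plain evolutionary loop stands in for the noise-handling CMA-ES variant the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(8)

def model_step(state, action):
    """Stand-in for the learned probabilistic dynamics model: returns a
    *sampled* next state, so model uncertainty flows into the return."""
    return state + 0.1 * action + rng.normal(0, 0.01, size=state.shape)

def expected_return(theta, horizon=50, rollouts=10):
    """Noisy Monte-Carlo return under the model, treated as a pure black
    box: both the reward and the policy may be arbitrary, non-differentiable
    functions, which is the flexibility the paper is after."""
    total = 0.0
    for _ in range(rollouts):
        s = np.zeros(2)
        for _ in range(horizon):
            a = np.tanh(theta[:2] @ s + theta[2])   # tiny black-box policy
            s = model_step(s, a)
            total += -np.sum((s - 1.0) ** 2)        # black-box reward
    return total / rollouts

theta = np.zeros(3)
for _ in range(20):
    pop = theta + rng.normal(0, 0.1, size=(16, 3))
    theta = pop[np.argmax([expected_return(p) for p in pop])]
print(theta.round(2))
```

Each candidate evaluation is embarrassingly parallel across rollouts and population members, which is where the "as fast as analytical approaches when several cores are available" claim comes from.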
The overarching goal of this work is to efficiently enable end-users to correctly anticipate a robot’s behavior in novel situations. Since a robot’s behavior is often a direct result of its underlying objective function, our insight is that end-users need to have an accurate mental model of this objective function in order to understand and predict what the robot will do. While people naturally develop such a mental model over time through observing the robot act, this familiarization process may be lengthy. Our approach reduces this time by having the robot model how people infer objectives from observed behavior, and then it selects those behaviors that are maximally informative. The problem of computing a posterior over objectives from observed behavior is known as Inverse Reinforcement Learning (IRL), and has been applied to robots learning human objectives. We consider the problem where the roles of human and robot are swapped. Our main contribution is to recognize that unlike robots, humans will not be exact in their IRL inference. We thus introduce two factors to define candidate approximate-inference models for human learning in this setting, and analyze them in a user study in the autonomous driving domain. We show that certain approximate-inference models lead to the robot generating example behaviors that better enable users to anticipate what it will do in novel situations. Our results also suggest, however, that additional research is needed in modeling how humans extrapolate from examples of robot behavior.
We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions.
Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.
Applying end-to-end learning to solve complex, interactive, pixel-driven control tasks on a robot is an unsolved problem. Deep Reinforcement Learning algorithms are too slow to achieve performance on a real robot, but their potential has been demonstrated in simulated environments. We propose using progressive networks to bridge the reality gap and transfer learned policies from simulation to the real world. The progressive net approach is a general framework that enables reuse of everything from low-level visual features to high-level policies for transfer to new tasks, enabling a compositional, yet simple, approach to building complex skills.
We present an early demonstration of this approach with a number of experiments in the domain of robot manipulation that focus on bridging the reality gap. Unlike other proposed approaches, our real-world experiments demonstrate successful task learning from raw visual input on a fully actuated robot manipulator. Moreover, rather than relying on model-based trajectory optimisation, the task learning is accomplished using only deep reinforcement learning and sparse rewards.
There has been a recent paradigm shift in robotics to data-driven learning for planning and control. Due to the large number of experiences required for training, most of these approaches use a self-supervised paradigm: using sensors to measure success/failure. However, in most cases, these sensors provide weak supervision at best. In this work, we propose an adversarial learning framework that pits an adversary against the robot learning the task. In an effort to defeat the adversary, the original robot learns to perform the task with more robustness, leading to overall improved performance. We show that this adversarial framework forces the robot to learn a better grasping model in order to overcome the adversary. By grasping 82% of presented novel objects compared to 68% without an adversary, we demonstrate the utility of creating adversaries. We also demonstrate via experiments that having robots in an adversarial setting might be a better learning strategy than having multiple collaborative robots.
A key challenge in scaling up robot learning to many skills and environments is removing the need for human supervision, so that robots can collect their own data and improve their own performance without being limited by the cost of requesting human feedback. Model-based reinforcement learning holds the promise of enabling an agent to learn to predict the effects of its actions, which could provide flexible predictive models for a wide range of tasks and environments, without detailed human supervision.
We develop a method for combining deep action-conditioned video prediction models with model-predictive control that uses entirely unlabeled training data. Our approach does not require a calibrated camera, an instrumented training set-up, or precise sensing and actuation. Our results show that our method enables a real robot to perform nonprehensile manipulation—pushing objects—and can handle novel objects not seen during training.
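At test time the controller is sampling-based model-predictive control: sample candidate action sequences, score each by where the video-prediction model says the user's designated pixel will end up, execute the best first action, replan. A toy sketch with an invented linear "predictor" in place of the learned video-prediction model:

```python
import numpy as np

rng = np.random.default_rng(9)

def predict_pixel(pixel, actions):
    """Invented stand-in for the action-conditioned video-prediction model:
    where does the user's designated pixel end up after this action sequence?"""
    return pixel + 5.0 * actions.sum(axis=0)[:2]

def plan(current_pixel, goal_pixel, horizon=5, samples=200):
    """Sampling-based MPC: score random action sequences by predicted
    pixel-to-goal distance, return the first action of the best sequence,
    and (in the real system) replan after every executed action."""
    best_cost, best_seq = np.inf, None
    for _ in range(samples):
        seq = rng.uniform(-1, 1, size=(horizon, 4))
        cost = np.linalg.norm(predict_pixel(current_pixel, seq) - goal_pixel)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]

action = plan(np.array([12.0, 30.0]), np.array([20.0, 22.0]))
print(action.round(2))
```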
Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.
Two less-addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency, i.e. the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to apply to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows it to generalize better. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects, and hence we can collect a huge number of training samples efficiently.
We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), (4) is end-to-end trainable and does not need feature engineering, feature matching between frames or 3D reconstruction of the environment.
The supplementary video can be accessed at the following link: YouTube.
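[The key architectural idea, a policy conditioned on the goal as well as the current state, fits in a few lines. A toy numpy version with random weights; the paper uses deep siamese networks over observation and target images, so everything here is illustrative:]

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.01, size=(64, 84 * 84))  # shared image encoder
W_pi = rng.normal(scale=0.01, size=(4, 128))        # actor: 4 nav actions
W_v = rng.normal(scale=0.01, size=(1, 128))         # critic

def embed(image):
    """Toy encoder applied to both the observation and the goal image."""
    return np.tanh(W_enc @ image.ravel())

def policy_and_value(obs_image, goal_image):
    # Goal-conditioning: the fused embedding drives both actor and critic,
    # so changing the goal changes the policy without retraining.
    h = np.concatenate([embed(obs_image), embed(goal_image)])
    logits = W_pi @ h
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), float(W_v @ h)

probs, value = policy_and_value(rng.random((84, 84)), rng.random((84, 84)))
```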
We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps.
To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
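[The servoing mechanism reduces to: score candidate gripper motions with the grasp-success CNN and repeatedly execute the best one. A hedged sketch; the paper searches candidates more carefully (a cross-entropy-method optimizer), and `predict_success`/`move_gripper` are hypothetical interfaces:]

```python
import numpy as np

def servo_step(grasp_net, image, robot, n_candidates=64):
    """One closed-loop servoing step: choose the task-space gripper motion
    the CNN predicts is most likely to end in a successful grasp."""
    # Candidate end-effector motions (dx, dy, dz, rotation), toy sampling.
    candidates = np.random.uniform(-1, 1, size=(n_candidates, 4))
    scores = [grasp_net.predict_success(image, v) for v in candidates]
    robot.move_gripper(candidates[int(np.argmax(scores))])
    # Running this every control cycle lets the robot correct mistakes
    # mid-approach, since each new image updates the predictions.
```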
This paper describes a collection of optimization algorithms for achieving dynamic planning, control, and state estimation for a bipedal robot designed to operate reliably in complex environments.
To make challenging locomotion tasks tractable, we describe several novel applications of convex, mixed-integer, and sparse nonlinear optimization to problems ranging from footstep placement to whole-body planning and control. We also present a state estimator formulation that, when combined with our walking controller, permits highly precise execution of extended walking plans over non-flat terrain.
We describe our complete system integration and experiments carried out on Atlas, a full-size hydraulic humanoid robot built by Boston Dynamics, Inc.
Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot’s motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.
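[The core trick, transforming policy search into supervised learning, can be sketched as plain regression onto the torques proposed by the trajectory-centric RL stage. A toy linear stand-in for the 92,000-parameter CNN, with invented shapes:]

```python
import numpy as np

def supervised_policy_step(policy, trajectories, lr=1e-3):
    """Fit the policy to (observation, torque) pairs produced by simple
    local controllers from the RL stage; the policy network never needs
    reward signals directly, only regression targets."""
    for obs, torque in trajectories:
        pred = policy["W"] @ obs                # real code: conv net forward
        policy["W"] -= lr * np.outer(pred - torque, obs)  # squared-error step

policy = {"W": np.zeros((7, 32))}               # 7 torques from a 32-dim obs
data = [(np.random.randn(32), np.random.randn(7)) for _ in range(100)]
supervised_policy_step(policy, data)
```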
Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning reduces the amount of engineering knowledge otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems such as robots, where many interactions can be impractical and time-consuming. To address this problem, current learning approaches typically require task-specific knowledge in the form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
Technological developments can be foreseen but the knowledge is largely useless because startups are inherently risky and require optimal timing. A more practical approach is to embrace uncertainty, taking a reinforcement learning perspective.
How do you time your startup? Technological forecasts are often surprisingly prescient in terms of predicting that something was possible & desirable and what they predict eventually happens; but they are far less successful at predicting the timing, and almost always fail, with the success (and riches) going to another.
Why is their knowledge so useless? Why are success and failure so intertwined in the tech industry? The right moment cannot be known exactly in advance, so attempts to forecast will typically be off by years or worse. For many claims, there is no way to invest in an idea except by going all in and launching a company, resulting in extreme variance in outcomes, even when the idea is good and the forecasts correct about the (eventual) outcome.
Progress can happen and can be foreseen long before, but the details and exact timing due to bottlenecks are too difficult to get right. Launching too early means failure, but being conservative & launching later is just as bad because regardless of forecasting, a good idea will draw overly-optimistic researchers or entrepreneurs to it like moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything, at which point everyone can see that the optimal time is past. All major success stories overshadow their long list of predecessors who did the same thing, but got unlucky. The lesson of history is that for every lesson, there is an equal and opposite lesson. So, ideas can be divided into the overly-optimistic & likely doomed, or the fait accompli. On an individual level, ideas are worthless because so many others have them too—‘multiple invention’ is the rule, and not the exception. Progress, then, depends on the ‘unreasonable man’.
This overall problem falls under the reinforcement learning paradigm, and successful approaches are analogous to Thompson sampling/posterior sampling: even an informed strategy can’t reliably beat random exploration which gradually shifts towards successful areas while continuing to take occasional long shots. Since people tend to systematically over-exploit, how is this implemented? Apparently by individuals acting suboptimally on the personal level, but optimally on the societal level, by serving as random exploration.
A major benefit of R&D, then, is that ideas lie fallow until the ‘ripe time’ when they can be immediately exploited in previously-unpredictable ways; applied R&D or VC strategies should focus on maintaining diversity of investments, while continuing to flexibly revisit previous failures which forecasts indicate may have reached ‘ripe time’. This balances overall exploitation & exploration to progress as fast as possible, showing the usefulness of technological forecasting on a global level despite its uselessness to individuals.
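[The Thompson-sampling analogy above is concrete enough to demonstrate. A minimal Bernoulli-bandit sketch, with made-up success probabilities standing in for competing ‘ideas’: early choices are near-random exploration, later ones concentrate on the best idea while occasional long shots continue:]

```python
import random

true_p = [0.02, 0.05, 0.10]       # hypothetical: most ideas rarely pan out
alpha = [1] * len(true_p)         # Beta posterior: successes + 1
beta = [1] * len(true_p)          # Beta posterior: failures + 1

for t in range(10_000):
    # Sample a plausible success rate for each idea from its posterior,
    # then 'found a startup' on whichever idea drew the highest sample.
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    arm = draws.index(max(draws))
    if random.random() < true_p[arm]:
        alpha[arm] += 1
    else:
        beta[arm] += 1
```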
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way.
By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.
We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
[Remarkably, PILCO can learn your standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximally-informative experiments to run. Cartpole is quite difficult for a human, incidentally; there’s an installation of one in the SF Exploratorium, and I just had to try it out once I recognized it. (My sample-efficiency was not better than PILCO’s.)]
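[A schematic of the PILCO loop, with the paper’s closed-form moment-matching replaced by Monte-Carlo rollouts for brevity; the analytic uncertainty propagation is precisely what buys the data-efficiency, and the `env`/`policy` interfaces below are hypothetical:]

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def pilco_like_loop(env, policy, n_trials=5, horizon=50, n_rollouts=30):
    X, Y = [], []                          # (state, action) -> next state
    for _ in range(n_trials):
        # 1. Run the current policy on the real system; record transitions.
        s = env.reset()
        for _ in range(horizon):
            a = policy.act(s)
            s2 = env.step(s, a)
            X.append(np.concatenate([s, a])); Y.append(s2)
            s = s2
        # 2. Fit a probabilistic GP dynamics model to all data so far.
        gp = GaussianProcessRegressor().fit(np.array(X), np.array(Y))
        # 3. Improve the policy against the model, averaging over model
        #    uncertainty (PILCO does this analytically; we sample instead).
        def expected_cost(params):
            policy.set_params(params)
            total = 0.0
            for _ in range(n_rollouts):
                s = env.reset()
                for _ in range(horizon):
                    x = np.concatenate([s, policy.act(s)])[None]
                    mu, sd = gp.predict(x, return_std=True)
                    s = mu[0] + sd[0] * np.random.randn(*mu[0].shape)
                    total += env.cost(s)
            return total / n_rollouts
        policy.optimize(expected_cost)     # e.g. scipy.optimize.minimize
    return policy
```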
[previous: 1988, 1998; later: 2008] Here’s how my anticipations about the time of arrival of human-level robotic intelligence evolved:
In the early 1970s, doing simple computer stereoscopic vision, it rapidly became obvious that the computer power in our mainframe PDP-10 was hugely insufficient to do even that basic function in real time, implying that doing the job of the whole nervous system was even further out of reach. Besides enormously more speed, we needed enormously more memory.
This was contrary to the orthodoxy in AI at the time. My advisor John McCarthy wrote in many essays that existing computers were sufficient for human-intelligent AI, but we needed some theoretical breakthroughs (humorously 2 Newtons and 3 Einsteins) to achieve it.
I tried to quantify the power needed, first by estimating the number of switching operations in the brain and comparing it to switching in computer circuits, and got a rough number of about one trillion (10^12) operations per second (ops).
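[An estimate in this style is one line of arithmetic. The round numbers below are illustrative assumptions, not Moravec’s exact figures:]

```python
# Illustrative switching-operation estimate; inputs are assumptions.
neurons = 1e11            # ~10^11 neurons in a human brain (assumed)
switches_per_sec = 10     # ~10 firings per neuron per second (assumed)

print(f"{neurons * switches_per_sec:.0e} ops/sec")   # 1e+12: the 'one
# trillion ops' ballpark. Counting individual synapses instead of whole
# neurons would raise this by several orders of magnitude, which is why
# such estimates carry wide error bars.
```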
…By the end of the 1970s, it was pretty clear there wouldn’t be an AI Apollo project, but I thought that if people were willing to put as much effort as was expended in the weapons labs, we could have the requisite power in a supercomputer-class machine in about 20 years. That’s the glimpse you got in that TV show [Love Machine 2000?]…Even though our AI and Robotics computers were still stuck at 1 MIPS (but much cheaper: in the 1970s we used million-dollar computers [$1M in 1970 dollars; roughly $4.8M inflation-adjusted], by the mid-1980s equivalent power could be had in workstations for tens of thousands of dollars [$10,000 in 1985 dollars; roughly $30,000 inflation-adjusted]), I was able to plot the historical decrease in computing cost to predict that we would have 10 trillion ops in a $10,000 computer by about 2020 or 2030 [in Mind Children]…By the early 2000s also there were several supercomputers in existence that could do more than 10 trillion ops, though not available for robotics work. [Now] In 2004, VA Tech connected 1,100 dual-processor Macintosh G5 machines for the record low cost of about $6,000,000 (2004 dollars; roughly $9.3M inflation-adjusted), and benchmarked it at over 10 trillion ops. SEEGRID is building visual self-navigating vehicles using onboard computing of a billion ops or so, about the brainpower of a guppy by my numbers.
We investigated why self-produced tactile stimulation is perceived as less intense than the same stimulus produced externally. A tactile stimulus on the palm of the right hand was either externally produced by a robot, or self-produced by the subject. In the conditions in which the tactile stimulus was self-produced, subjects moved the arm of a robot with their left hand to produce the tactile stimulus on their right hand via a second robot. Subjects were asked to rate the intensity of the tactile sensation, and consistently rated self-produced tactile stimuli as less tickly, intense, and pleasant than externally produced tactile stimuli. Using this robotic setup we were able to manipulate the correspondence between the action of the subjects’ left hand and the tactile stimulus on their right hand. First, we parametrically varied the delay between the movement of the left hand and the resultant movement of the tactile stimulus on the right hand. Second, we implemented varying degrees of trajectory perturbation and varied the direction of the tactile stimulus movement as a function of the direction of left-hand movement. The tickliness rating increased significantly with increasing delay and trajectory perturbation. This suggests that self-produced movements attenuate the resultant tactile sensation, and that a necessary requirement of this attenuation is that the tactile stimulus and its causal motor command correspond in time and space. We propose that the extent to which self-produced tactile sensation is attenuated (i.e., its tickliness) is proportional to the error between the sensory feedback predicted by an internal forward model of the motor system and the actual sensory feedback produced by the movement.
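[The forward-model proposal has a compact computational reading: rated tickliness tracks the mismatch between the predicted and the actual touch, and both experimental manipulations (delay, trajectory rotation) inflate that mismatch. A toy illustration; the mappings and units are invented:]

```python
import numpy as np

def tickliness(trajectory, delay_steps=0, angle_deg=0.0):
    """Predicted touch = the self-movement trajectory itself (a perfect
    forward model); actual touch = a delayed, rotated copy; the rating
    grows with the prediction error between the two."""
    theta = np.radians(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    actual = np.roll(trajectory, delay_steps, axis=0) @ R.T
    return float(np.linalg.norm(actual - trajectory))

stroke = np.cumsum(np.random.randn(100, 2), axis=0)   # a random 2D stroke
print(tickliness(stroke))              # 0.0: perfect prediction, least tickly
print(tickliness(stroke, 10, 30.0))    # > 0: delay + rotation feel ticklier
```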
This paper describes how the performance of AI machines tends to improve at the same pace that AI researchers get access to faster hardware. The processing power and memory capacity necessary to match general intellectual performance of the human brain are estimated. Based on extrapolation of past trends and on examination of technologies under development, it is predicted that the required hardware will be available in cheap machines in the 2020s…At the present rate, computers suitable for human-like robots will appear in the 2020s. Can the pace be sustained for another three decades?
…By 1990, entire careers had passed in the frozen winter of 1-MIPS computers, mainly from necessity, but partly from habit and a lingering opinion that the early machines really should have been powerful enough. In 1990, 1 MIPS cost $1,000 (1990 dollars; roughly $2,467 inflation-adjusted) in a low-end personal computer. There was no need to go any lower. Finally spring thaw has come. Since 1990, the power available to individual AI and robotics programs has doubled yearly, to 30 MIPS by 1994 and 500 MIPS by 1998. Seeds long ago alleged barren are suddenly sprouting. Machines read text, recognize speech, even translate languages. Robots drive cross-country, crawl across Mars, and trundle down office corridors. In 1996 a theorem-proving program called EQP running five weeks on a 50 MIPS computer at Argonne National Laboratory found a proof of a boolean algebra conjecture by Herbert Robbins that had eluded mathematicians for sixty years. And it is still only spring. Wait until summer.
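[The extrapolation is one line of arithmetic. Taking the numbers above at face value (500 MIPS in 1998, yearly doubling) together with Moravec’s roughly 100-million-MIPS estimate for human equivalence:]

```python
import math

mips_1998 = 500          # power available to AI programs in 1998 (from text)
human_mips = 1e8         # Moravec's ~100 million MIPS brain estimate
doubling_years = 1.0     # "power ... has doubled yearly" (from text)

years = math.log2(human_mips / mips_1998) * doubling_years
print(f"~{1998 + years:.0f}")   # ~2016 at yearly doubling; a slower
# two-year doubling time would instead land near 2033, bracketing the
# paper's '2020s' prediction.
```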
…The mental steps underlying good human chess playing and theorem proving are complex and hidden, putting a mechanical interpretation out of reach. Those who can follow the play naturally describe it instead in mentalistic language, using terms like strategy, understanding and creativity. When a machine manages to be simultaneously meaningful and surprising in the same rich way, it too compels a mentalistic interpretation. Of course, somewhere behind the scenes, there are programmers who, in principle, have a mechanical interpretation. But even for them, that interpretation loses its grip as the working program fills its memory with details too voluminous for them to grasp.
As the rising flood reaches more populated heights, machines will begin to do well in areas a greater number can appreciate. The visceral sense of a thinking presence in machinery will become increasingly widespread. When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident.
Faster than Exponential Growth in Computing Power: The number of MIPS available per $1,000 (1998 dollars; roughly $1,956 inflation-adjusted) of computer, from 1900 to the present. Steady improvements in mechanical and electromechanical calculators before World War II had increased the speed of calculation a thousandfold over manual methods from 1900 to 1940. The pace quickened with the appearance of electronic computers during the war, and 1940 to 1980 saw a million-fold increase. The pace has been even quicker since then, a pace which would make human-like robots possible before the middle of the next century. The vertical scale is logarithmic; the major divisions represent thousandfold increases in computer performance. Exponential growth would show as a straight line; the upward curve indicates faster-than-exponential growth, or, equivalently, an accelerating rate of innovation. The reduced spread of the data in the 1990s is probably the result of intensified competition: underperforming machines are more rapidly squeezed out. The numerical data for this power curve are presented in the appendix.
The big freeze: From 1960 to 1990 the cost of the computers used in AI research declined, and this dilution absorbed the computer-efficiency gains of the period, so the power available to individual AI programs remained almost unchanged at 1 MIPS, barely insect power. AI computer cost bottomed in 1990, and since then power has doubled yearly, to several hundred MIPS by 1998. The major visible exception is computer chess (shown by a progression of knights), whose prestige lured the resources of major computer companies and the talents of programmers and machine designers. Exceptions also exist in less public competitions, like petroleum exploration and intelligence gathering, whose high return on investment gave them regular access to the largest computers.
[“Scale is an API for training data, providing access to human-powered data for a multitude of use cases. (San Francisco, California, United States; founded 2016-06-01.) Scale accelerates the development of AI applications by helping computer vision teams generate high-quality ground truth data. Our advanced LiDAR, video, and image annotation APIs allow self-driving, drone, and robotics teams at companies like Waymo, OpenAI, Lyft, Zoox, Pinterest, and Airbnb to focus on building differentiated models vs. labeling data.”]
[“Scale has around 100 employees, according to Wang, but its limited full-time staff is a small fraction of the human-power behind the services Scale offers. The startup has nearly 30,000 contractors aiding in the labeling process. ‘The humans are pretty critical to what we’re doing because they’re there to make sure that all the data we provide is really high quality,’ Wang says.
Companies provide Scale with data via their API and the startup puts its resources to work labeling the text, audio, pictures and video so that its customers’ machine learning models can be trained. The startup’s customers include Waymo, OpenAI, Airbnb and Lyft.
For a customer working with autonomous driving data, Scale’s services may mean taking collected video frames and manually segmenting out individual cars, humans or other obstacles. For another customer, it can mean making common sense language connections to ensure natural language processing models can understand language in context. The “human insight” can help minimize labeling bias and give customers data that is more precise and more accurate, though, as with just about all AI startups, the hope is that these insights will gradually usher in a future where reliance on these humans-in-the-loop will be lessened. In the meantime, Scale sits atop an army of contractors that might hold the key to bulking up Silicon Valley’s machine learning intelligence.”]