Agents that can acquire diverse skills to solve the same task have an advantage over other agents when, for example, unexpected environmental changes occur. However, Reinforcement Learning (RL) policies mainly rely on Gaussian parameterizations, preventing them from learning multi-modal, diverse skills. In this work, we propose a novel RL approach for training policies that exhibit diverse behavior. To this end, we propose a highly non-linear Mixture of Experts (MoE) as the policy representation, where each expert formalizes a skill as a contextual motion primitive. The context defines the task, which can be, for instance, the goal-reaching position of the agent or changing physical parameters like friction. Given a context, our trained policy first selects an expert out of the repertoire of skills and subsequently adapts the parameters of the contextual motion primitive. To incentivize our policy to learn diverse skills, we leverage a maximum entropy objective combined with a per-expert context distribution that we optimize alongside each expert. The per-expert context distribution allows each expert to focus on a context sub-space, boosting learning speed. However, these distributions need to be able to represent multi-modality and hard discontinuities in the environment's context probability space. We meet these requirements by leveraging energy-based models to represent the per-expert context distributions and show how we can efficiently train them using the standard policy gradient objective.
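A minimal sketch of the selection step described above, assuming toy linear energy functions and experts (all names, dimensions, and thresholds below are hypothetical illustrations, not the paper's implementation): the gating distribution over experts is a softmax over per-expert energies of the context, and the chosen expert maps the context to motion-primitive parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2-D context (e.g., a goal position), 4 experts,
# 8 motion-primitive parameters per expert.
CTX_DIM, N_EXPERTS, MP_DIM = 2, 4, 8

# Random linear scorers stand in for the learned per-expert energy functions.
W_energy = rng.normal(size=(N_EXPERTS, CTX_DIM))
b_energy = rng.normal(size=N_EXPERTS)

# Per-expert linear mappings from context to motion-primitive parameters.
W_expert = rng.normal(size=(N_EXPERTS, MP_DIM, CTX_DIM))


def select_and_act(context):
    """Sample an expert from a softmax over per-expert energies, i.e.
    p(k | c) proportional to exp(-E_k(c)), then let that expert output
    the parameters of its contextual motion primitive."""
    energies = W_energy @ context + b_energy
    gating = np.exp(-energies)
    gating /= gating.sum()
    k = rng.choice(N_EXPERTS, p=gating)
    mp_params = W_expert[k] @ context
    return k, mp_params, gating


context = np.array([0.3, -1.2])            # e.g., a desired goal position
expert, params, gating = select_and_act(context)
print("chosen expert:", expert, "gating:", np.round(gating, 3))
```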
Accepted Papers
We present XLand-Minigrid, a suite of tools and grid-world environments for meta-reinforcement learning research inspired by the diversity and depth of XLand and the simplicity and minimalism of MiniGrid. XLand-Minigrid is written in JAX, designed to be highly scalable, and can potentially run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources. To demonstrate the generality of our library, we have implemented some well-known single-task environments as well as new meta-learning environments capable of generating $10^8$ distinct tasks. We have empirically shown that the proposed environments can scale up to $2^{13}$ parallel instances on the GPU, reaching tens of millions of steps per second.
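The scaling claim rests on the usual JAX recipe of jit-compiling and vmap-ping a pure environment step over many instances. The sketch below illustrates only that pattern with a toy transition function; it is not XLand-Minigrid's actual API.

```python
import jax
import jax.numpy as jnp

# Toy, jittable stand-in for a grid-world transition; the real XLand-Minigrid
# API differs, this only shows the jit/vmap pattern that makes thousands of
# parallel instances cheap on a GPU or TPU.
def toy_step(state, action):
    new_state = state + action
    reward = jnp.where(new_state.sum() > 3.0, 1.0, 0.0)
    return new_state, reward

num_envs = 8192                            # 2**13 parallel instances
states = jnp.zeros((num_envs, 2))
actions = jnp.ones((num_envs, 2))

batched_step = jax.jit(jax.vmap(toy_step))
states, rewards = batched_step(states, actions)
print(states.shape, rewards.shape)         # (8192, 2) (8192,)
```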
The ability to rapidly acquire knowledge from humans is a fundamental skill for AI assistants. Traditional frameworks like imitation and reinforcement learning employ fixed, low-level communication protocols, making them inefficient for teaching complex tasks. In contrast, humans are capable of communicating nuanced ideas with progressive efficiency by establishing shared vocabularies with others and expanding those vocabularies with increasingly abstract words. Mimicking this phenomenon in human communication, we introduce a novel learning framework named Communication-Efficient Interactive Learning (CEIL). By equipping a learning agent with a rich, dynamic language and an intrinsic motivation to communicate with minimal effort, CEIL leads to the emergence of a human-like pattern where the learner and the teacher communicate more efficiently over time by exchanging increasingly more abstract intentions. CEIL demonstrates impressive learning efficiency on a 2D MineCraft domain featuring long-horizon decision-making tasks. In particular, it performs robustly with teachers modeled after human pragmatic communication behavior.
In this work, we present a general framework for continual learning of sequentially arriving tasks with the use of pre-training, which has emerged as a promising direction for artificial intelligence systems to accommodate real-world dynamics. From a theoretical perspective, we decompose its objective into three hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. We then propose an innovative approach to explicitly optimize these components with parameter-efficient fine-tuning (PEFT) techniques and representation statistics. We empirically demonstrate the superiority and generality of our approach in downstream continual learning, and further explore the applicability of PEFT techniques in upstream continual learning. We also discuss the biological basis of the proposed framework with recent advances in neuroscience.
Infants explore their complex physical and social environment in an organized way. To gain insight into what intrinsic motivations may help structure this exploration, we create a virtual infant agent and place it in a developmentally-inspired 3D environment with no external rewards. The environment has a virtual caregiver agent with the capability to interact contingently with the infant agent in ways that resemble play. We test intrinsic reward functions that are similar to motivations that have been proposed to drive exploration in humans: surprise, uncertainty, novelty, and learning progress. The reward functions that are proxies for novelty and uncertainty are the most successful in generating diverse experiences and activating the environment contingencies. We also find that learning a world model in the presence of an attentive caregiver helps the infant agent learn how to predict scenarios with challenging social and physical dynamics. Our findings provide insight into how curiosity-like intrinsic rewards and contingent social interaction lead to social behavior and the creation of a robust predictive world model.
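As a rough illustration of what such intrinsic signals can look like (these are generic proxies, not necessarily the exact reward functions used in this work): a count-based novelty bonus and an ensemble-disagreement uncertainty bonus.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
visit_counts = defaultdict(int)            # discretized observation -> count


def novelty_reward(obs_key):
    """Count-based novelty proxy: rarely visited observations pay more."""
    visit_counts[obs_key] += 1
    return 1.0 / np.sqrt(visit_counts[obs_key])


def uncertainty_reward(obs, ensemble):
    """Uncertainty proxy: disagreement across an ensemble of forward models."""
    preds = np.stack([W @ obs for W in ensemble])
    return float(preds.std(axis=0).mean())


ensemble = [rng.normal(size=(4, 3)) for _ in range(5)]
obs = rng.normal(size=3)
print(novelty_reward(tuple(np.round(obs, 1))),
      uncertainty_reward(obs, ensemble))
```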
We argue that reward-maximization is insufficient as an objective for open-ended agency due to the complexity of the control problems. Instead, we argue that the intrinsic motivation metric of hierarchical empowerment might be particularly powerful for generating goals for life-long agents.
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via black-box queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.
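A schematic of the iterative prompting loop with a growing skill library might look as follows; `query_llm` and `run_in_env` are stubs standing in for the GPT-4 query and the Minecraft environment, so this is an illustration of the control flow rather than Voyager's code.

```python
# Hypothetical sketch of an iterative prompting loop with a growing skill
# library; query_llm and run_in_env are placeholders, not Voyager's code.
skill_library = {}                         # skill name -> executable code string


def query_llm(prompt):
    """Stub standing in for a black-box GPT-4 query."""
    return "def mine_wood():\n    return 'wood'"


def run_in_env(code):
    """Execute a proposed skill and report (success, error/feedback)."""
    try:
        namespace = {}
        exec(code, namespace)
        return True, ""
    except Exception as err:
        return False, str(err)


def improve_skill(task, max_rounds=3):
    """Re-prompt with environment feedback until the skill self-verifies."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = (f"Task: {task}\n"
                  f"Known skills: {list(skill_library)}\n"
                  f"Feedback: {feedback}")
        code = query_llm(prompt)
        ok, feedback = run_in_env(code)
        if ok:
            skill_library[task] = code     # store for later retrieval and reuse
            return code
    return None


improve_skill("mine_wood")
print(list(skill_library))
```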
Social learning, a cornerstone of cultural evolution, allows individuals to acquire knowledge by observing and imitating others. Central to its efficacy is episodic memory, which records specific behavioral sequences to facilitate learning. This study examines their interrelation in the context of collaborative foraging. Specifically, we examine how variations in the frequency and fidelity of social learning impact collaborative foraging, and how the length of behavioral sequences preserved in agents’ episodic memory modulates these factors. To this end, we deploy Sequential Episodic Control agents capable of sharing with one another the behavioral sequences stored in their episodic memory. Our findings indicate that high-frequency, high-fidelity social learning promotes more distributed and efficient resource collection, a benefit that remains consistent regardless of the length of the shared episodic memories. In contrast, low-fidelity social learning shows no advantages over non-social learning in terms of resource acquisition. In addition, storing and disseminating longer episodic memories contributes to enhanced performance up to a certain threshold, beyond which increased memory capacity does not yield further benefits. Our findings emphasize the crucial role of high-fidelity social learning in collaborative foraging, and illuminate the intricate relationship between episodic memory capacity and the quality and frequency of social learning. This work aims to highlight the potential of neuro-computational models like episodic control algorithms in understanding social learning and offers a new perspective for investigating the cognitive mechanisms underlying open-ended cultural evolution.
Imprinting is a common survival strategy in which an animal learns a lasting preference for its parents and siblings early in life. To date, however, the origins and computational foundations of imprinting have not been formally established. What learning mechanisms generate imprinting behavior in newborn animals? Here, we used deep reinforcement learning and intrinsic motivation (curiosity), two learning mechanisms deeply rooted in psychology and neuroscience, to build autonomous artificial agents that imprint. When we raised our artificial agents together in the same environment, akin to the early social experiences of newborn animals, the agents spontaneously developed imprinting behavior. Our results provide a pixels-to-actions computational model of animal imprinting. We show that domain-general learning mechanisms—deep reinforcement learning and intrinsic motivation—are sufficient for embodied agents to rapidly learn core social behaviors from unsupervised natural experience.
We study intrinsically motivated exploration by artificially intelligent (AI) agents in animal-inspired settings. We construct virtual environments that are 3D, vision-based, physics-simulated, and based on two established animal assays: labyrinth exploration, and novel object interaction. We assess Plan2Explore (P2E), a leading model-based, intrinsically motivated deep reinforcement learning agent, in these environments. We characterize and compare the behavior of the AI agents to animal behavior, using measures devised for animal neuroethology. P2E exhibits some similarities to animal behavior, but is dramatically less efficient than mice at labyrinth exploration. We further characterize the neural dynamics associated with world modeling in the novel-object assay. We identify latent neural population activity axes linearly associated with representing object color and proximity. These results identify areas of improvement for existing AI agents, and make strides toward understanding the learned neural dynamics that guide their behavior.
Goal representation affects the performance of Hierarchical Reinforcement Learning (HRL) algorithms, which decompose complex problems into easier subtasks. Recent studies show that representations that preserve temporally abstract environment dynamics are successful in solving difficult problems with theoretical guarantees for optimality. These methods, however, cannot scale to tasks where environment dynamics increase in complexity. On the other hand, other efforts have tried to use spatial abstraction to mitigate these issues, but their limitations include scalability to high-dimensional environments and dependency on prior knowledge. In this work, we propose a novel three-layer HRL algorithm that introduces, at different levels of the hierarchy, both a spatial and a temporal goal abstraction. We provide a theoretical study of the regret bounds of the learned policies. We evaluate the approach on complex continuous control tasks, demonstrating the effectiveness of the spatial and temporal abstractions learned by this approach.
What drives exploration? Understanding intrinsic motivation is a long-standing question in both cognitive science and artificial intelligence (AI); numerous exploration objectives have been proposed and tested in human experiments and used to train reinforcement learning (RL) agents. However, experiments in the former are often set in simplistic environments that do not capture the complexity of real-world exploration. On the other hand, experiments in the latter use more complex environments, yet the trained RL agents fail to come close to human exploration efficiency. To study this gap, we propose a framework for directly comparing human and agent exploration in an open-ended environment, Crafter. We study how well commonly proposed information-theoretic objectives for intrinsic motivation relate to actual human and agent behaviours, finding that human exploration consistently shows a significant positive correlation with Entropy, Information Gain, and Empowerment. Surprisingly, we find that intrinsically-motivated RL agent exploration does not consistently show the same significant correlations, despite being designed to optimize objectives that approximate Entropy or Information Gain. In a preliminary analysis of verbalizations, we find that children's verbalizations of goals correlate strongly and positively with Empowerment, suggesting that goal-setting may be an important aspect of efficient exploration.
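For concreteness, one simple estimator of the Entropy objective referred to above is the entropy of a discretized state-visitation distribution; the sketch below is purely illustrative and assumes 2-D states in the unit square.

```python
import numpy as np
from collections import Counter


def visitation_entropy(states, bins=8):
    """Entropy of a discretized state-visitation distribution, a simple
    estimator for the Entropy objective discussed above (illustrative)."""
    keys = [tuple(np.floor(s * bins).astype(int)) for s in states]
    counts = np.array(list(Counter(keys).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
broad_explorer = rng.random((500, 2))               # covers the unit square
stuck_agent = 0.5 + 0.01 * rng.random((500, 2))     # barely moves
print(visitation_entropy(broad_explorer), visitation_entropy(stuck_agent))
```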
Intrinsic reward functions are widely used to improve exploration in reinforcement learning. We first examine the conditions and causes of catastrophic forgetting in intrinsic reward functions, and propose a new method, FarCuriosity, inspired by how humans and non-human animals learn. The method depends on fragmentation and recall: an agent fragments an environment based on surprisal signals and uses a different local curiosity module (a prediction-based intrinsic reward function) for each fragment, so that no single module is trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FarCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite. Thus, this work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a first solution.
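The fragmentation-and-recall control flow can be sketched as follows; the toy module, thresholds, and matching rule are hypothetical simplifications of the mechanism described above, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)


class CuriosityModule:
    """Toy prediction-based module: a linear model predicting the next observation."""

    def __init__(self, dim):
        self.W = rng.normal(scale=0.1, size=(dim, dim))
        self.anchor = None                 # observation at which the fragment began

    def surprisal(self, obs, next_obs):
        return float(np.linalg.norm(self.W @ obs - next_obs))


def step_module(obs, next_obs, current, ltm,
                surprisal_threshold=2.0, recall_threshold=1.0):
    """On high surprisal, store the current module in long-term memory and
    either recall the best-matching stored module or start a fresh one.
    Both thresholds here are hypothetical."""
    if current.surprisal(obs, next_obs) < surprisal_threshold:
        return current                     # stay within the current fragment
    current.anchor = obs
    ltm.append(current)
    distances = [np.linalg.norm(m.anchor - next_obs) for m in ltm]
    best = int(np.argmin(distances))
    if distances[best] < recall_threshold:
        return ltm[best]                   # recall a previously stored module
    return CuriosityModule(dim=len(obs))   # otherwise initialize a new one


ltm, module = [], CuriosityModule(dim=4)
obs, next_obs = rng.normal(size=4), 5.0 * rng.normal(size=4)   # a big surprise
module = step_module(obs, next_obs, module, ltm)
print("modules in LTM:", len(ltm))
```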
Understanding the world in terms of objects and the possible interactions with them is an important cognitive ability, especially in robotic manipulation. However, learning a structured world model that allows controlling the agent accurately remains a challenge. To address this, we propose FOCUS, a model-based agent that learns an object-centric world model. The learned representation makes it possible to provide the agent with an object-centric exploration mechanism, which encourages the agent to interact with objects and discover useful interactions. We apply FOCUS in several robotic manipulation settings where we show how our method fosters interactions such as reaching, moving, and rotating the objects in the environment. We further show how this ability to autonomously interact with objects can be used to quickly solve a given task using reinforcement learning with sparse rewards.
The ability of large language models (LLMs) to engage in credible dialogues with humans, taking into account the training data and the context of the conversation, has raised discussions about their ability to exhibit intrinsic motivations, agency, or even some degree of consciousness. We argue that the internal architecture of LLMs and their finite and volatile state cannot support any of these properties. By combining insights from complementary learning systems and global neuronal workspace theories, we propose to integrate LLMs and other deep learning systems into a new architecture that is able to exhibit properties akin to agency, self-motivation and even, more speculatively, some features of consciousness.
We propose regularity as a novel reward signal for intrinsically-motivated reinforcement learning. Taking inspiration from child development, we postulate that striving for structure and order helps guide exploration towards a subspace of tasks that are not favored by naive uncertainty-based intrinsic rewards. Our generalized formulation of Regularity as Intrinsic Reward (RaIR) allows us to operationalize it within model-based reinforcement learning. In a synthetic environment, we showcase the plethora of structured patterns that can emerge from pursuing this regularity objective. We also demonstrate the strength of our method in a multi-object robotic manipulation environment. We incorporate RaIR into free play and use it to complement the model’s epistemic uncertainty as an intrinsic reward. Doing so, we witness the autonomous construction of towers and other regular structures during free play, which leads to a substantial improvement in zero-shot downstream task performance on assembly tasks.
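One way to read "regularity" operationally (a simplified illustration, not necessarily the paper's exact formulation) is as the negative entropy of discretized pairwise relations between objects, so that repeated relations, e.g. equal vertical offsets in a tower, earn a higher reward:

```python
import numpy as np
from collections import Counter
from itertools import combinations


def regularity_reward(positions, bin_size=0.1):
    """Negative entropy of discretized pairwise object offsets: the more the
    relations between objects repeat, the higher the reward (a simplified
    reading of regularity, not the paper's exact operationalization)."""
    offsets = [tuple(np.round((positions[j] - positions[i]) / bin_size).astype(int))
               for i, j in combinations(range(len(positions)), 2)]
    counts = np.array(list(Counter(offsets).values()), dtype=float)
    p = counts / counts.sum()
    return float((p * np.log(p)).sum())    # equals -H(relations)


tower = np.array([[0.0, 0.0, 0.1 * k] for k in range(5)])      # a regular stack
scattered = np.random.default_rng(0).random((5, 3))            # random clutter
print(regularity_reward(tower), regularity_reward(scattered))  # tower scores higher
```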
In reinforcement learning, agents often need to make decisions between selecting actions that are familiar and have previously yielded positive results (exploitation), and seeking new information that could allow them to uncover more effective actions (exploration). Understanding how humans learn their sophisticated exploratory strategies over the course of their development remains an open question for both computer and cognitive science. Existing studies typically use classic bandit or gridworld tasks that confound the rewarding with the informative characteristics of an outcome. In this study, we adopt an observe-vs.-bet task that separates “pure exploration” from “pure exploitation” by giving participants the option to either observe an instance of an outcome and receive no reward, or to bet on one action that is eventually rewarding, but offers no immediate feedback. We collected data from 33 five-to-seven-year-old children who completed the task at one of three different bias levels. We compared how children performed with both approximate solutions to the partially-observable Markov decision process and meta-reinforcement learning models that were meta-trained on the same decision-making task across different probability levels. We found that the children observe significantly more than the two classes of algorithms and qualitatively more than adults in similar tasks. We then quantified how children’s policies differ between the different efficacy levels by fitting probabilistic programming models and by calculating the likelihood of the children’s actions under the task-driven model. The fitted parameters of the behavioral model, as well as the direction of the deviation from neural network policies, demonstrate that the primary way children adapt their behavior is by changing the amount of time that they bet on the most-recently-observed arm while maintaining a consistent frequency of observations across bias levels. This suggests both that children model the causal structure of the environment and that they exhibit a “hedging behavior” that would be impossible to detect in standard bandit tasks. The results shed light on how children reason about reward and information, providing an important developmental benchmark that can help shape our understanding of human behavior, which we hope to investigate further using recently developed neural network reinforcement learning models of reasoning about information and reward.
While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the skill discriminator, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, and we evaluate our approach on a Fetch tabletop manipulation task suite.
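The matching step can be sketched as follows, with toy stand-ins for the discriminator-derived intrinsic rewards and the downstream task reward; the point is that skill selection reduces to comparing reward functions on buffered states rather than running rollouts. All names and functions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SKILLS, N_STATES, STATE_DIM = 6, 256, 4

# Toy stand-ins: a buffer of states, per-skill intrinsic rewards derived from a
# (hypothetical) discriminator, and a downstream task reward.
states = rng.normal(size=(N_STATES, STATE_DIM))
skill_dirs = rng.normal(size=(N_SKILLS, STATE_DIM))


def intrinsic_reward(z, s):
    return float(s @ skill_dirs[z])        # discriminator-derived score for skill z


def task_reward(s):
    return float(s @ np.array([1.0, 0.0, 0.0, 0.0]))   # e.g., progress along x


def match_skill():
    """Select the skill whose intrinsic reward best matches the task reward
    over buffered states; no environment rollouts are needed."""
    task_r = np.array([task_reward(s) for s in states])
    gaps = []
    for z in range(N_SKILLS):
        intr = np.array([intrinsic_reward(z, s) for s in states])
        gaps.append(np.mean((intr - task_r) ** 2))
    return int(np.argmin(gaps))


print("selected skill:", match_skill())
```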
We study how reinforcement learning algorithms and children develop their causal curriculum to achieve a challenging goal that is not solvable at first. Adopting the Procgen environments, which comprise various tasks as challenging goals, we found that 5- to 7-year-old children actively used their current level progress to determine their next step in the curriculum and improved at solving the goal during this process. This suggests that children treat their level progress as an intrinsic reward, and are motivated to master easier levels in order to do better at more difficult ones, even without explicit reward. To evaluate RL agents, we exposed them to the same demanding Procgen environments as children and employed several curriculum learning methodologies. Our results demonstrate that RL agents that emulate children by incorporating level progress as an intrinsic reward signal exhibit greater stability and are more likely to converge during training, compared to RL agents solely reliant on extrinsic reward signals for game-solving. Curriculum learning may also offer a significant reduction in the number of frames needed to solve a target environment. Taken together, our human-inspired findings suggest a potential path forward for addressing catastrophic forgetting or domain shift during curriculum learning in RL agents.
Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the apparent appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by $2 \times$. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.
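A common way to instantiate such a contrastive, goal-conditioned objective is an InfoNCE-style loss between (state, action) embeddings and goal embeddings; the sketch below uses fixed random projections as placeholder encoders and is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, EMB = 32, 16

# Fixed random projections stand in for learned encoders (illustration only).
W_sa = rng.normal(size=(EMB, 8)) / np.sqrt(8)   # encodes (state, action) pairs
W_g = rng.normal(size=(EMB, 4)) / np.sqrt(4)    # encodes goal states/images


def info_nce(sa_batch, goal_batch):
    """InfoNCE-style loss: each (state, action) embedding should score highest
    against the goal it actually reached (its own row) versus other goals."""
    z_sa = sa_batch @ W_sa.T
    z_g = goal_batch @ W_g.T
    logits = z_sa @ z_g.T
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


sa = rng.normal(size=(BATCH, 8))
goals = rng.normal(size=(BATCH, 4))
print("contrastive loss:", round(info_nce(sa, goals), 3))
```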
As artificial intelligence advances, Large Language Models (LLMs) have evolved beyond being just tools, becoming more like human agents that can converse, reflect, plan, and set goals. However, these models still struggle with open-ended question answering and often fail to understand unfamiliar scenarios quickly. To address this, we ask: how do humans manage strange situations so effectively? We believe it’s largely due to our natural instinct for curiosity and a built-in desire to predict the future and seek explanations when those predictions don’t align with reality. Unlike humans, LLMs typically accept information passively without an inherent desire to question or doubt, which could be why they struggle to understand new situations. Focusing on this, our study explores the possibility of equipping LLM-agents with human-like curiosity. Can these models move from being passive processors to active seekers of understanding, reflecting human behaviors? And can this adaptation benefit them as it does humans? To explore this, we introduce an innovative experimental framework where generative agents navigate through strange and unfamiliar situations, and their understanding is then assessed through interview questions about those situations. Initial results show notable improvements when models are equipped with traits of surprise and inquiry compared to those without. This research is a step towards creating more human-like agents and highlights the potential benefits of integrating human-like traits in models.
Autotelic learning is the training setup where agents learn by setting their own goals and trying to achieve them. However, creatively generating freeform goals is challenging for autotelic agents. We present Codeplay, an algorithm casting autotelic learning as a game between a Setter agent and a Solver agent, where the Setter generates programming puzzles of appropriate difficulty and novelty for the Solver and the Solver learns to achieve them. Early experiments with the Setter demonstrate that one can effectively control the tradeoff between the difficulty of a puzzle and its novelty by tuning the reward of the Setter, a code language model finetuned with deep reinforcement learning.
Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
Both surprise-minimizing and surprise-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method can perform well across all entropy regimes. In an effort to find a single surprise-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective depending on the entropy conditions it faces, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit which captures the ability of the agent to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
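The bandit formulation can be sketched as a two-armed epsilon-greedy learner whose reward is the entropy-control feedback; the feedback function below is a synthetic placeholder for the intrinsic signal described above, so this is a sketch of the framing rather than the paper's agent.

```python
import numpy as np

rng = np.random.default_rng(0)
ARMS = ["minimize_surprise", "maximize_surprise"]
values, counts = np.zeros(2), np.zeros(2)


def entropy_control_feedback(arm):
    """Placeholder for the intrinsic feedback signal: how much the chosen
    objective let the agent change the entropy of its observations this
    round (numbers here are synthetic)."""
    return rng.normal(loc=0.8 if arm == 0 else 0.2, scale=0.1)


for t in range(200):                       # simple epsilon-greedy bandit
    arm = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(values))
    reward = entropy_control_feedback(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(dict(zip(ARMS, np.round(values, 2))))
```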
From birth, human infants engage in intrinsically motivated, open-ended learning, mainly by deciding what to attend to and for how long. Yet, existing formal models of the drivers of looking are very limited in scope. To address this, we present a new version of the Rational Action, Noisy Choice for Habituation (RANCH) model. This version of RANCH is a stimulus-computable, rational learning model that decides how long to look at sequences of stimuli based on expected information gain (EIG). The model captures key patterns of looking time documented in the literature: habituation and dishabituation. We evaluate RANCH quantitatively using large datasets from adult and infant looking time experiments. We argue that looking time in our experiments is well described by RANCH, and that RANCH is a general, interpretable and modifiable framework for the rational analysis of intrinsically motivated learning by looking.
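A minimal EIG computation consistent with this description (a generic Bayesian sketch over a Bernoulli stimulus, not necessarily RANCH's stimulus model) shows how repeated exposure drives expected information gain, and hence looking, down:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)        # grid over a Bernoulli stimulus parameter


def normalize(p):
    return p / p.sum()


def posterior(belief, outcome):
    likelihood = theta if outcome == 1 else (1.0 - theta)
    return normalize(belief * likelihood)


def expected_info_gain(belief):
    """EIG of one more look: the expected KL divergence between updated and
    current beliefs, averaged over the predictive distribution of outcomes."""
    p1 = float((belief * theta).sum())     # predictive probability of outcome 1
    eig = 0.0
    for outcome, p_out in ((1, p1), (0, 1.0 - p1)):
        post = posterior(belief, outcome)
        eig += p_out * float((post * np.log(post / belief)).sum())
    return eig


belief = normalize(np.ones_like(theta))    # flat prior: the stimulus is informative
print("EIG before habituation:", round(expected_info_gain(belief), 4))
for _ in range(20):                        # repeated exposure to the same outcome
    belief = posterior(belief, outcome=1)
print("EIG after habituation: ", round(expected_info_gain(belief), 4))
# A RANCH-style rule (sketch): keep looking while EIG exceeds a sampling cost,
# look away otherwise; a surprising outcome raises EIG again (dishabituation).
```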
Humans show a remarkable capacity to generate novel goals, for learning and play alike, and modeling this human capacity would be a valuable step toward more generally-capable artificial agents. We describe a computational model for generating novel human-like goals represented in a domain-specific language (DSL). We learn a ‘human-likeness’ fitness function over expressions in this DSL from a small (<100 games) human dataset collected in an online experiment. We then use a Quality-Diversity (QD) approach to generate a variety of human-like games with different characteristics and high fitness. We demonstrate that our method can generate synthetic games that are syntactically coherent under the DSL, semantically sensible with respect to environmental objects and their affordances, yet distinct from human games in the training set. We discuss key components of our model and its current shortcomings, in the hope that this work helps inspire progress toward self-directed agents with human-like goals.
A future sequence represents the outcome of executing actions in the environment. When driven by the information-theoretic concept of mutual information, an agent seeks maximally informative consequences. The explicit outcome may be a state, a return, or a trajectory, each serving different purposes such as credit assignment or imitation learning. However, how to inherently combine such intrinsic motivation with reward maximization is often neglected. In this work, we propose a variational approach to jointly learn the quantity necessary for estimating the mutual information and the dynamics model, providing a general framework for incorporating different forms of outcomes of interest. Integrated into a policy iteration scheme, our approach guarantees convergence to the optimal policy. While we mainly focus on theoretical analysis, our approach opens up the possibility of leveraging intrinsic control with model learning to enhance sample efficiency and to incorporate uncertainty about the environment into decision-making.
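For context, a standard variational lower bound of the kind such approaches typically build on (the exact quantities learned in the paper may differ) is

$$
I(Z; S' \mid S) \;=\; H(Z \mid S) - H(Z \mid S, S') \;\ge\; H(Z \mid S) + \mathbb{E}_{p(z, s, s')}\!\left[ \log q_\phi(z \mid s, s') \right],
$$

where $q_\phi$ is a learned variational distribution; the gap in the bound is the expected KL divergence between the true posterior $p(z \mid s, s')$ and $q_\phi$, so improving $q_\phi$ tightens the bound while the policy and model are trained to increase it.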
While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal lambda abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions. We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing neural and symbolic methods—including the state-of-the-art library learning algorithm DreamCoder—LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.