Accepted Papers


Reinforcement Learning of Diverse Skills using Mixture of Deep Experts

Authors: Onur Celik · Aleksandar Taranovic · Gerhard Neumann

Agents that can acquire diverse skills for solving the same task have an advantage over other agents when, for example, unexpected environmental changes occur. However, Reinforcement Learning (RL) policies mainly rely on Gaussian parameterization, preventing them from learning multi-modal, diverse skills. In this work, we propose a novel RL approach for training policies that exhibit diverse behavior. To this end, we propose a highly non-linear Mixture of Experts (MoE) as the policy representation, where each expert formalizes a skill as a contextual motion primitive. The context defines the task, for instance the goal position the agent must reach or physical parameters such as friction. Given a context, our trained policy first selects an expert from the repertoire of skills and subsequently adapts the parameters of the contextual motion primitive. To incentivize our policy to learn diverse skills, we leverage a maximum entropy objective combined with a per-expert context distribution that we optimize alongside each expert. The per-expert context distribution allows each expert to focus on a context sub-space, boosting learning speed. However, these distributions need to be able to represent multi-modality and hard discontinuities in the environment's context probability space. We meet these requirements by leveraging energy-based models to represent the per-expert context distributions and show how to train them efficiently using the standard policy gradient objective.
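
To make the selection mechanism concrete, here is a minimal sketch of a context-conditioned mixture-of-experts policy: per-expert energies over the context define the gating distribution, and the chosen expert maps the context to motion-primitive parameters. The architectures, shapes, and quadratic energies are our own illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' implementation): a context-conditioned
# mixture-of-experts policy that first picks an expert (skill) from per-expert
# energies over the context, then samples motion-primitive parameters from
# that expert's head.
import numpy as np

rng = np.random.default_rng(0)
n_experts, ctx_dim, param_dim = 4, 3, 8

# Per-expert energy models over contexts (here: simple quadratic energies).
centers = rng.normal(size=(n_experts, ctx_dim))
def expert_energy(ctx):                      # lower energy = expert prefers this context
    return np.sum((ctx - centers) ** 2, axis=-1)

# Per-expert heads mapping a context to motion-primitive parameters.
W = rng.normal(size=(n_experts, param_dim, ctx_dim))

def sample_action(ctx, temperature=1.0):
    logits = -expert_energy(ctx) / temperature       # softmax over negative energies
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()
    k = rng.choice(n_experts, p=gate)                # choose a skill
    mean = W[k] @ ctx                                # expert-specific primitive parameters
    return k, mean + 0.1 * rng.normal(size=param_dim)

print(sample_action(rng.normal(size=ctx_dim)))
```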

XLand-MiniGrid: Scalable Meta-Reinforcement Learning Environments in JAX

Authors: Alexander Nikulin · Vladislav Kurenkov · Ilya Zisman · Viacheslav Sinii · Artem Agarkov · Sergey Kolesnikov

We present XLand-MiniGrid, a suite of tools and grid-world environments for meta-reinforcement learning research inspired by the diversity and depth of XLand and the simplicity and minimalism of MiniGrid. XLand-MiniGrid is written in JAX, designed to be highly scalable, and can run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources. To demonstrate the generality of our library, we have implemented some well-known single-task environments as well as new meta-learning environments capable of generating $10^8$ distinct tasks. We have empirically shown that the proposed environments can scale up to $2^{13}$ parallel instances on the GPU, reaching tens of millions of steps per second.
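
As a rough illustration of why a JAX implementation scales this way (this is a toy environment, not the XLand-MiniGrid API), `jax.vmap` and `jax.jit` can step thousands of independent grid-world instances in a single accelerator call:

```python
# Illustrative sketch: vectorizing a toy grid-world step over 2**13 instances.
import jax
import jax.numpy as jnp

def step(state, action):
    # Toy dynamics: move an agent on a 9x9 grid, reward for reaching the corner.
    pos = jnp.clip(state + jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])[action], 0, 8)
    reward = jnp.all(pos == 8).astype(jnp.float32)
    return pos, reward

# Vectorize over a batch of 2**13 independent environment instances and compile.
batched_step = jax.jit(jax.vmap(step))
key = jax.random.PRNGKey(0)
states = jnp.zeros((2**13, 2), dtype=jnp.int32)
actions = jax.random.randint(key, (2**13,), 0, 4)
states, rewards = batched_step(states, actions)
print(states.shape, rewards.shape)  # (8192, 2) (8192,)
```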

Progressively Efficient Communication

Authors: Khanh Nguyen · Ruijie Zheng · Hal Daumé III · Furong Huang · Karthik Narasimhan

The ability to rapidly acquire knowledge from humans is a fundamental skill for AI assistants. Traditional frameworks like imitation and reinforcement learning employ fixed, low-level communication protocols, making them inefficient for teaching complex tasks. In contrast, humans can communicate nuanced ideas with progressive efficiency by establishing shared vocabularies with others and expanding those vocabularies with increasingly abstract words. Mimicking this phenomenon in human communication, we introduce a novel learning framework named Communication-Efficient Interactive Learning (CEIL). By equipping a learning agent with a rich, dynamic language and an intrinsic motivation to communicate with minimal effort, CEIL leads to the emergence of a human-like pattern in which the learner and the teacher communicate more efficiently over time by exchanging increasingly abstract intentions. CEIL demonstrates impressive learning efficiency on a 2D MineCraft domain featuring long-horizon decision-making tasks. In particular, it performs robustly with teachers modeled after human pragmatic communication behavior.

Towards a General Framework for Continual Learning with Pre-training

Authors: Liyuan Wang · Jingyi Xie · Xingxing Zhang · Hang Su · Jun Zhu

In this work, we present a general framework for continual learning of sequentially arriving tasks with the use of pre-training, which has emerged as a promising direction for artificial intelligence systems to accommodate real-world dynamics. From a theoretical perspective, we decompose its objective into three hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. We then propose an innovative approach to explicitly optimize these components with parameter-efficient fine-tuning (PEFT) techniques and representation statistics. We empirically demonstrate the superiority and generality of our approach in downstream continual learning, and further explore the applicability of PEFT techniques in upstream continual learning. We also discuss the biological basis of the proposed framework with recent advances in neuroscience.

Intrinsically Motivated Social Play in Virtual Infants

Authors: Chris Doyle · Sarah Shader · Michelle Lau · Megumi Sano · Dan Yamins · Nick Haber

Infants explore their complex physical and social environment in an organized way. To gain insight into what intrinsic motivations may help structure this exploration, we create a virtual infant agent and place it in a developmentally-inspired 3D environment with no external rewards. The environment has a virtual caregiver agent with the capability to interact contingently with the infant agent in ways that resemble play. We test intrinsic reward functions that are similar to motivations that have been proposed to drive exploration in humans: surprise, uncertainty, novelty, and learning progress. The reward functions that are proxies for novelty and uncertainty are the most successful in generating diverse experiences and activating the environment contingencies. We also find that learning a world model in the presence of an attentive caregiver helps the infant agent learn how to predict scenarios with challenging social and physical dynamics. Our findings provide insight into how curiosity-like intrinsic rewards and contingent social interaction lead to social behavior and the creation of a robust predictive world model.

Why Open-Ended Agency Should be Formalized on Hierarchical Empowerment-Gain Maximization

Authors: Thomas Ringstrom

We argue that reward-maximization is insufficient as an objective for open-ended agency due to the complexity of the control problems. Instead, we argue that the intrinsic motivation metric of hierarchical empowerment might be particularly powerful for generating goals for life-long agents.

Voyager: An Open-Ended Embodied Agent with Large Language Models

Authors: Guanzhi Wang · Yuqi Xie · Yunfan Jiang · Ajay Mandlekar · Chaowei Xiao · Yuke Zhu · Linxi Fan · Animashree Anandkumar

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.
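
A hedged sketch of the iterative prompting loop described above, with `query_llm`, `run_in_env`, and `verify` as hypothetical placeholders rather than Voyager's actual components:

```python
# Conceptual sketch (not Voyager's code): propose a skill program, run it,
# and refine it from environment feedback until self-verification passes.
def query_llm(prompt: str) -> str:
    return "def mine_wood(bot): ..."        # stand-in for a GPT-4 black-box query

def run_in_env(program: str) -> tuple[bool, str]:
    return True, "collected 1 oak_log"      # (success, environment feedback)

def verify(task: str, feedback: str) -> bool:
    return "oak_log" in feedback            # stand-in for LLM self-verification

def improve_skill(task: str, skill_library: dict, max_rounds: int = 4) -> str | None:
    context = ""
    for _ in range(max_rounds):
        program = query_llm(f"Task: {task}\nKnown skills: {list(skill_library)}\n{context}")
        ok, feedback = run_in_env(program)
        if ok and verify(task, feedback):
            skill_library[task] = program   # store executable code for later reuse
            return program
        context = f"Previous attempt failed.\nFeedback: {feedback}\n"
    return None

library = {}
improve_skill("collect one oak log", library)
print(list(library))
```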

High-fidelity social learning via shared episodic memories improves collaborative foraging

Authors: Ismael T. Freire · Paul Verschure

Social learning, a cornerstone of cultural evolution, allows individuals to acquire knowledge by observing and imitating others. Central to its efficacy is episodic memory, which records specific behavioral sequences to facilitate learning. This study examines their interrelation in the context of collaborative foraging. Specifically, we examine how variations in the frequency and fidelity of social learning impact collaborative foraging, and how the length of behavioral sequences preserved in agents’ episodic memory modulates these factors. To this end, we deploy Sequential Episodic Control agents that can share behavioral sequences stored in their episodic memory with one another. Our findings indicate that high-frequency, high-fidelity social learning promotes more distributed and efficient resource collection, a benefit that remains consistent regardless of the length of the shared episodic memories. In contrast, low-fidelity social learning shows no advantages over non-social learning in terms of resource acquisition. In addition, storing and disseminating longer episodic memories enhances performance up to a certain threshold, beyond which increased memory capacity yields no further benefits. Our findings emphasize the crucial role of high-fidelity social learning in collaborative foraging and illuminate the intricate relationship between episodic memory capacity and the quality and frequency of social learning. This work highlights the potential of neuro-computational models like episodic control algorithms for understanding social learning and offers a new perspective on the cognitive mechanisms underlying open-ended cultural evolution.

Imprinting in autonomous artificial agents using deep reinforcement learning

Authors: Donsuk Lee · Samantha Wood · Justin Wood

Imprinting is a common survival strategy in which an animal learns a lasting preference for its parents and siblings early in life. To date, however, the origins and computational foundations of imprinting have not been formally established. What learning mechanisms generate imprinting behavior in newborn animals? Here, we used deep reinforcement learning and intrinsic motivation (curiosity), two learning mechanisms deeply rooted in psychology and neuroscience, to build autonomous artificial agents that imprint. When we raised our artificial agents together in the same environment, akin to the early social experiences of newborn animals, the agents spontaneously developed imprinting behavior. Our results provide a pixels-to-actions computational model of animal imprinting. We show that domain-general learning mechanisms—deep reinforcement learning and intrinsic motivation—are sufficient for embodied agents to rapidly learn core social behaviors from unsupervised natural experience.

Neurobehavior of exploring AI agents

Authors: Isaac Kauvar · Chris Doyle · Nick Haber

We study intrinsically motivated exploration by artificially intelligent (AI) agents in animal-inspired settings. We construct virtual environments that are 3D, vision-based, physics-simulated, and based on two established animal assays: labyrinth exploration, and novel object interaction. We assess Plan2Explore (P2E), a leading model-based, intrinsically motivated deep reinforcement learning agent, in these environments. We characterize and compare the behavior of the AI agents to animal behavior, using measures devised for animal neuroethology. P2E exhibits some similarities to animal behavior, but is dramatically less efficient than mice at labyrinth exploration. We further characterize the neural dynamics associated with world modeling in the novel-object assay. We identify latent neural population activity axes linearly associated with representing object color and proximity. These results identify areas of improvement for existing AI agents, and make strides toward understanding the learned neural dynamics that guide their behavior.

Reconciling Spatial and Temporal Abstractions for Goal Representation

Authors: Mehdi Zadem · Sergio Mover · Sao Mai Nguyen

Goal representation affects the performance of Hierarchical Reinforcement Learning (HRL) algorithms by decomposing complex problems into easier subtasks. Recent studies show that representations that preserve temporally abstract environment dynamics succeed in solving difficult problems, with theoretical guarantees of optimality. These methods, however, cannot scale to tasks where environment dynamics grow in complexity. Other efforts have instead used spatial abstraction to mitigate these issues, but their limitations include poor scalability to high-dimensional environments and dependency on prior knowledge. In this work, we propose a novel three-layer HRL algorithm that introduces, at different levels of the hierarchy, both a spatial and a temporal goal abstraction. We provide a theoretical study of the regret bounds of the learned policies. We evaluate the approach on complex continuous control tasks, demonstrating the effectiveness of the spatial and temporal abstractions it learns.

What can AI Learn from Human Exploration? Intrinsically-Motivated Humans and Agents in Open-World Exploration

Authors: Yuqing Du · Eliza Kosoy · Alyssa L Dayan · Maria Rufova · Pieter Abbeel · Alison Gopnik

What drives exploration? Understanding intrinsic motivation is a long-standing question in both cognitive science and artificial intelligence (AI); numerous exploration objectives have been proposed and tested in human experiments and used to train reinforcement learning (RL) agents. However, experiments in the former are often set in simplistic environments that do not capture the complexity of real-world exploration. On the other hand, experiments in the latter use more complex environments, yet the trained RL agents fail to come close to human exploration efficiency. To study this gap, we propose a framework for directly comparing human and agent exploration in an open-ended environment, Crafter. We study how well commonly-proposed information-theoretic objectives for intrinsic motivation relate to actual human and agent behaviours, finding that human exploration consistently shows a significant positive correlation with Entropy, Information Gain, and Empowerment. Surprisingly, we find that intrinsically-motivated RL agent exploration does not show the same significant correlation consistently, despite being designed to optimize objectives that approximate Entropy or Information Gain. In a preliminary analysis of verbalizations, we find that children's verbalizations of goals correlate strongly and positively with Empowerment, suggesting that goal-setting may be an important aspect of efficient exploration.
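
For readers unfamiliar with the objectives being correlated, here is a worked toy example of the simplest one, the entropy of a visited-state distribution, computed from a made-up trajectory of discrete states:

```python
# Empirical entropy of an agent's (or participant's) visited-state distribution.
from collections import Counter
import math

def visitation_entropy(trajectory):
    counts = Counter(trajectory)
    n = len(trajectory)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(visitation_entropy(["tree", "tree", "water", "cave", "tree", "cow"]))
```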

Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic Forgetting in Curiosity

Authors: Jaedong Hwang · Zhang-Wei Hong · Eric Chen · Akhilan Boopathy · Pulkit Agrawal · Ila Fiete

Intrinsic reward functions are widely used to improve exploration in reinforcement learning. We first examine the conditions and causes of catastrophic forgetting in the intrinsic reward function, and then propose a new method, FarCuriosity, inspired by how humans and non-human animals learn. The method relies on fragmentation and recall: an agent fragments an environment based on surprisal signals and uses a different local curiosity module (a prediction-based intrinsic reward function) for each fragment, so that no module is trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FarCuriosity achieves less forgetting and better overall performance on games with varied and heterogeneous environments in the Atari benchmark suite. This work thus highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a first solution.
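
A conceptual sketch of fragmentation and recall follows, with a running-mean surprisal proxy standing in for a learned prediction-based module; the matching rule and thresholds are our assumptions, not the paper's algorithm.

```python
# Conceptual sketch: when surprisal exceeds a threshold, stash the current
# local curiosity module and either recall the stored module whose anchor
# state best matches the current observation, or start a fresh one.
import numpy as np

class FragmentedCuriosity:
    def __init__(self, obs_dim, threshold=2.0, recall_dist=1.0):
        self.threshold, self.recall_dist = threshold, recall_dist
        self.ltm = []                          # list of (anchor_obs, module) pairs
        self.obs_dim = obs_dim
        self.module = self._new_module()

    def _new_module(self):
        # Stand-in for a prediction-based intrinsic reward model (e.g., RND/ICM).
        return {"mean": np.zeros(self.obs_dim), "count": 1e-3}

    def intrinsic_reward(self, obs):
        # Surprisal proxy: distance of obs from the current module's running mean.
        surprise = np.linalg.norm(obs - self.module["mean"])
        if surprise > self.threshold:          # fragmentation event
            self.ltm.append((obs.copy(), self.module))
            self.module = self._recall_or_new(obs)
        m = self.module                        # update the (possibly new) local module
        m["count"] += 1
        m["mean"] += (obs - m["mean"]) / m["count"]
        return surprise

    def _recall_or_new(self, obs):
        for anchor, module in self.ltm:
            if np.linalg.norm(obs - anchor) < self.recall_dist:
                return module                  # recall a previously stored module
        return self._new_module()

fc = FragmentedCuriosity(obs_dim=2)
for o in np.random.default_rng(0).normal(size=(5, 2)) * 3:
    print(round(fc.intrinsic_reward(o), 2))
```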

FOCUS: Object-Centric World Models for Robotic Manipulation

Authors: Stefano Ferraro · Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt

Understanding the world in terms of objects and the possible interactions with them is an important cognitive ability, especially in robotic manipulation. However, learning a structured world model that allows accurate control of the agent remains a challenge. To address this, we propose FOCUS, a model-based agent that learns an object-centric world model. The learned representation makes it possible to provide the agent with an object-centric exploration mechanism, which encourages the agent to interact with objects and discover useful interactions. We apply FOCUS in several robotic manipulation settings where we show how our method fosters interactions such as reaching, moving, and rotating the objects in the environment. We further show how this ability to autonomously interact with objects can be used to quickly solve a given task using reinforcement learning with sparse rewards.

DeepThought: an architecture for autonomous self-motivated systems

Authors: Arlindo L Oliveira · Tiago Domingos · Mario Figueiredo · Pedro Lima

The ability of large language models (LLMs) to engage in credible dialogues with humans, taking into account the training data and the context of the conversation, has raised discussions about their ability to exhibit intrinsic motivations, agency, or even some degree of consciousness. We argue that the internal architecture of LLMs and their finite and volatile state cannot support any of these properties. By combining insights from complementary learning systems and global neuronal workspace theories, we propose to integrate LLMs and other deep learning systems into a new architecture that is able to exhibit properties akin to agency, self-motivation and even, more speculatively, some features of consciousness.

Regularity as Intrinsic Reward for Free Play

Authors: Cansu Sancaktar · Justus Piater · Georg Martius

We propose regularity as a novel reward signal for intrinsically-motivated reinforcement learning. Taking inspiration from child development, we postulate that striving for structure and order helps guide exploration towards a subspace of tasks that are not favored by naive uncertainty-based intrinsic rewards. Our generalized formulation of Regularity as Intrinsic Reward (RaIR) allows us to operationalize it within model-based reinforcement learning. In a synthetic environment, we showcase the plethora of structured patterns that can emerge from pursuing this regularity objective. We also demonstrate the strength of our method in a multi-object robotic manipulation environment. We incorporate RaIR into free play and use it to complement the model’s epistemic uncertainty as an intrinsic reward. Doing so, we witness the autonomous construction of towers and other regular structures during free play, which leads to a substantial improvement in zero-shot downstream task performance on assembly tasks.
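
A minimal sketch, under our own simplifying assumptions, of a regularity-style reward: discretize pairwise object distances and reward low entropy of the resulting relation histogram, so structured arrangements (e.g., evenly spaced lines) score higher than clutter. This is illustrative, not the paper's exact formulation.

```python
# Toy regularity reward: negative entropy of discretized pairwise distances.
import numpy as np

def regularity_reward(positions, bin_size=0.1):
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(positions), k=1)
    bins = np.round(dists[iu] / bin_size).astype(int)       # discretized relations
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log(p)))                     # negative entropy

line = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [0.6, 0.0]])
scatter = np.array([[0.0, 0.0], [0.33, 0.9], [1.4, 0.27], [0.81, 0.55]])
print(regularity_reward(line), regularity_reward(scatter))  # the line scores higher
```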

Children prioritize purely exploratory actions in observe-vs.-bet tasks

Authors: Eunice Yiu · Kai Sandbrink · Alison Gopnik

In reinforcement learning, agents often need to decide between selecting actions that are familiar and have previously yielded positive results (exploitation) and seeking new information that could allow them to uncover more effective actions (exploration). How humans develop their sophisticated exploratory strategies over the course of development remains an open question for both computer science and cognitive science. Existing studies typically use classic bandit or gridworld tasks that confound the rewarding and informative characteristics of an outcome. In this study, we adopt an observe-vs.-bet task that separates “pure exploration” from “pure exploitation” by giving participants the option either to observe an instance of an outcome and receive no reward, or to bet on one action that is eventually rewarding but offers no immediate feedback. We collected data from 33 five-to-seven-year-old children who completed the task at one of three different bias levels. We compared children’s performance with both approximate solutions to the partially observable Markov decision process and meta-reinforcement learning models that were meta-trained on the same decision-making task across different probability levels. We found that the children observe significantly more than both classes of algorithms and qualitatively more than adults in similar tasks. We then quantified how children’s policies differ across efficacy levels by fitting probabilistic programming models and by calculating the likelihood of the children’s actions under the task-driven model. The fitted parameters of the behavioral model, as well as the direction of the deviation from the neural network policies, show that children adapt their behavior primarily by changing how long they bet on the most recently observed arm while maintaining a consistent frequency of observations across bias levels. This suggests both that children model the causal structure of the environment and that they exhibit a “hedging behavior” that would be impossible to detect in standard bandit tasks. These results shed light on how children reason about reward and information, providing a developmental benchmark that can help shape our understanding of human behavior, which we hope to investigate further using recently developed neural network reinforcement learning models of reasoning about information and reward.

Skill-Based Reinforcement Learning with Intrinsic Reward Matching

Authors: Ademi Adeniji · Amber Xie · Pieter Abbeel

While unsupervised skill discovery has shown promise for autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the skill discriminator, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and thereby determine the optimal skill for an unseen task without environment samples, which we evaluate on a Fetch tabletop manipulation task suite.
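
To illustrate the matching idea (the details below are our assumptions, not the authors' objective), one can score each pretrained skill by how closely its discriminator-based intrinsic reward tracks the downstream reward over stored states and pick the best match without new rollouts:

```python
# Toy reward-matching sketch: select the skill with the smallest mismatch
# between intrinsic and downstream rewards over a batch of replay states.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))                  # states sampled from a replay buffer
skills = np.eye(3)                                  # three one-hot skill latents

def intrinsic_reward(s, z):
    # Stand-in for a learned skill discriminator log q(z | s).
    return s[:, :3] @ z

def task_reward(s):
    return s[:, 0]                                  # hypothetical downstream reward

def select_skill(states, skills):
    mismatches = [np.mean((intrinsic_reward(states, z) - task_reward(states)) ** 2)
                  for z in skills]
    return int(np.argmin(mismatches))

print("selected skill:", select_skill(states, skills))  # skill 0 tracks the task reward
```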

From Child's Play to AI: Insights into Automated Causal Curriculum Learning

Authors: Annya Dahmani · Eunice Yiu · Tabitha Lee · Nan Rosemary Ke · Oliver Kroemer · Alison Gopnik

We study how reinforcement learning algorithms and children develop their causal curriculum to achieve a challenging goal that is not solvable at first. Adopting the Procgen environments that comprise various tasks as challenging goals, we found that 5- to 7-year-old children actively used their current level progress to determine their next step in the curriculum and made improvements to solving the goal during this process. This suggests that children treat their level progress as an intrinsic reward, and are motivated to master easier levels in order to do better at the more difficult one, even without explicit reward. To evaluate RL agents, we exposed them to the same demanding Procgen environments as children and employed several curriculum learning methodologies. Our results demonstrate that RL agents that emulate children by incorporating level progress as an intrinsic reward signal exhibit greater stability and are more likely to converge during training, compared to RL agents solely reliant on extrinsic reward signals for game-solving. Curriculum learning may also offer a significant reduction in the number of frames needed to solve a target environment. Taken together, our human-inspired findings suggest a potential path forward for addressing catastrophic forgetting or domain shift during curriculum learning in RL agents.

Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

Authors: Chongyi Zheng · Benjamin Eysenbach · Homer Walke · Patrick Yin · Kuan Fang · Russ Salakhutdinov · Sergey Levine

Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite this apparent appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by $2 \times$. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.

Enhancing Understanding in Generative Agents through Active Inquiring

Authors: Jiaxin Ge · Kaiya Zhao · Manuel Cortes · Jovana Kondic · Shuying Luo · Michelangelo Naim · Andrew Ahn · Guangyu Robert Yang

As artificial intelligence advances, Large Language Models (LLMs) have evolved beyond being just tools, becoming more like human agents that can converse, reflect, plan, and set goals. However, these models still struggle with open-ended question answering and often fail to understand unfamiliar scenarios quickly. To address this, we ask: how do humans manage strange situations so effectively? We believe it’s largely due to our natural instinct for curiosity and a built-in desire to predict the future and seek explanations when those predictions don’t align with reality. Unlike humans, LLMs typically accept information passively without an inherent desire to question or doubt, which could be why they struggle to understand new situations. Focusing on this, our study explores the possibility of equipping LLM-agents with human-like curiosity. Can these models move from being passive processors to active seekers of understanding, reflecting human behaviors? And can this adaptation benefit them as it does humans? To explore this, we introduce an experimental framework in which generative agents navigate strange and unfamiliar situations, and their understanding is then assessed through interview questions about those situations. Initial results show notable improvements when models are equipped with traits of surprise and inquiry compared to those without. This research is a step towards creating more human-like agents and highlights the potential benefits of integrating human-like traits in models.

Codeplay: Autotelic Learning through Collaborative Self-Play in Programming Environments

Authors: Laetitia Teodorescu · Cédric Colas · Matthew Bowers · Thomas Carta · Pierre-Yves Oudeyer

Autotelic learning is the training setup where agents learn by setting their own goals and trying to achieve them. However, creatively generating freeform goals is challenging for autotelic agents. We present Codeplay, an algorithm casting autotelic learning as a game between a Setter agent and a Solver agent, where the Setter generates programming puzzles of appropriate difficulty and novelty for the Solver and the Solver learns to achieve them. Early experiments with the Setter demonstrate that one can effectively control the trade-off between the difficulty of a puzzle and its novelty by tuning the reward of the Setter, a code language model finetuned with deep reinforcement learning.

Learning Diverse Skills for Local Navigation under Multi-constraint Optimality

Authors: Jin Cheng · Marin Vlastelica Pogančić · Pavel Kolev · Chenhao Li · Georg Martius

Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
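
A toy sketch of an attract-repel diversity term inspired by a Van der Waals / Lennard-Jones-style potential; the exact functional form used in the paper may differ, and the behavior embeddings here are made up for illustration.

```python
# Attract-repel term over policy behavior embeddings: embeddings that are too
# close are pushed apart, while very distant ones are mildly pulled together,
# keeping the skill set spread out but cohesive.
import numpy as np

def attract_repel(embeddings, sigma=1.0):
    reward, n = 0.0, len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(embeddings[i] - embeddings[j]) + 1e-8
            reward += -((sigma / r) ** 12 - (sigma / r) ** 6)   # repel close, attract far
    return reward / (n * (n - 1) / 2)

clustered = np.random.default_rng(0).normal(scale=0.3, size=(4, 2))
spread = np.random.default_rng(0).normal(scale=2.0, size=(4, 2))
print(attract_repel(clustered), attract_repel(spread))  # the spread set scores higher
```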

Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning

Authors: Adriana Hugessen · Roger Creus Castanyer · Glen Berseth

Both surprise-minimizing and surprise-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method can perform well across all entropy regimes. In an effort to find a single surprise-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective depending on the entropy conditions it faces, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit which captures the ability of the agent to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
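
A conceptual sketch of the bandit formulation follows, with a stubbed training episode and an entropy-shift feedback signal standing in for the paper's intrinsic feedback; all numbers are illustrative assumptions.

```python
# A two-armed UCB bandit chooses between surprise-minimizing and
# surprise-maximizing objectives each episode, and is rewarded by how much
# the agent shifted the environment's entropy relative to a no-control baseline.
import numpy as np

rng = np.random.default_rng(0)
arms = ["minimize_surprise", "maximize_surprise"]
values, counts = np.zeros(2), np.zeros(2)

def run_episode(objective):
    # Stand-in for training with the chosen intrinsic objective; returns the
    # entropy of observed states under the agent and under a baseline policy.
    agent_entropy = 1.0 + (0.8 if objective == "maximize_surprise" else -0.6) + 0.1 * rng.normal()
    baseline_entropy = 1.0 + 0.1 * rng.normal()
    return agent_entropy, baseline_entropy

for t in range(1, 201):
    ucb = values + np.sqrt(2 * np.log(t) / np.maximum(counts, 1e-8))
    arm = int(np.argmax(ucb)) if counts.min() > 0 else int(counts.argmin())
    h_agent, h_base = run_episode(arms[arm])
    feedback = abs(h_agent - h_base)          # ability to move entropy away from baseline
    counts[arm] += 1
    values[arm] += (feedback - values[arm]) / counts[arm]

print(dict(zip(arms, counts)))                # the arm that shifts entropy more is favored
```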

Modeling habituation in infants and adults using rational curiosity over perceptual embeddings

Authors: Gal Raz · Anjie Cao · Rebecca Saxe · Michael C Frank

From birth, human infants engage in intrinsically motivated, open-ended learning, mainly by deciding what to attend to and for how long. Yet, existing formal models of the drivers of looking are very limited in scope. To address this, we present a new version of the Rational Action, Noisy Choice for Habituation (RANCH) model. This version of RANCH is a stimulus-computable, rational learning model that decides how long to look at sequences of stimuli based on expected information gain (EIG). The model captures key patterns of looking time documented in the literature, habituation and dishabituation. We evaluate RANCH quantitatively using large datasets from adult and infant looking time experiments. We argue that looking time in our experiments is well described by RANCH, and that RANCH is a general, interpretable and modifiable framework for the rational analyses of intrinsically motivated learning by looking.
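
As a worked toy example of EIG-driven looking (our own Beta-Bernoulli simplification, not the RANCH implementation), a learner keeps sampling a stimulus while the expected information gain of one more look exceeds a fixed cost, which yields habituation to a repeated stimulus and dishabituation when a novel one appears:

```python
# Beta-Bernoulli expected information gain (EIG) for deciding when to look away.
from math import lgamma

def kl_beta(a1, b1, a0, b0):
    # KL divergence between Beta(a1, b1) and Beta(a0, b0); digamma is
    # approximated by a central finite difference of lgamma.
    def digamma(x, eps=1e-6):
        return (lgamma(x + eps) - lgamma(x - eps)) / (2 * eps)
    logB = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    return (logB(a0, b0) - logB(a1, b1)
            + (a1 - a0) * digamma(a1) + (b1 - b0) * digamma(b1)
            + (a0 + b0 - a1 - b1) * digamma(a1 + b1))

def expected_info_gain(a, b):
    p1 = a / (a + b)                       # predictive probability of outcome 1
    return p1 * kl_beta(a + 1, b, a, b) + (1 - p1) * kl_beta(a, b + 1, a, b)

a, b, cost = 1.0, 1.0, 0.02
looks = 0
while expected_info_gain(a, b) > cost:     # habituation: EIG shrinks with repetition
    a += 1                                  # each look reveals the same outcome
    looks += 1
print("looks before habituating:", looks)
print("EIG for a novel stimulus:", round(expected_info_gain(1.0, 1.0), 3))
```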

Generating Human-Like Goals by Synthesizing Reward-Producing Programs

Authors: Guy Davidson · Graham Todd · Todd Gureckis · Julian Togelius · Brenden Lake

Humans show a remarkable capacity to generate novel goals, for learning and play alike, and modeling this human capacity would be a valuable step toward more generally-capable artificial agents. We describe a computational model for generating novel human-like goals represented in a domain-specific language (DSL). We learn a ‘human-likeness’ fitness function over expressions in this DSL from a small (<100 games) human dataset collected in an online experiment. We then use a Quality-Diversity (QD) approach to generate a variety of human-like games with different characteristics and high fitness. We demonstrate that our method can generate synthetic games that are syntactically coherent under the DSL and semantically sensible with respect to environmental objects and their affordances, yet distinct from human games in the training set. We discuss key components of our model and its current shortcomings, in the hope that this work helps inspire progress toward self-directed agents with human-like goals.

Generative Intrinsic Optimization: Intrinsic Control with Model Learning

Authors: Jianfei Ma

A future sequence represents the outcome of executing actions in the environment. When driven by the information-theoretic concept of mutual information, an agent seeks maximally informative consequences. Explicit outcomes may vary across states, returns, or trajectories, serving different purposes such as credit assignment or imitation learning. However, the question of how to incorporate intrinsic motivation into reward maximization in a principled way is often neglected. In this work, we propose a variational approach that jointly learns the quantities needed to estimate the mutual information and the dynamics model, providing a general framework for incorporating different forms of outcomes of interest. Integrated into a policy iteration scheme, our approach guarantees convergence to the optimal policy. While we focus mainly on theoretical analysis, our approach opens the possibility of leveraging intrinsic control with model learning to enhance sample efficiency and to incorporate environment uncertainty into decision-making.

Learning Interpretable Libraries by Compressing and Documenting Code

Authors: Gabriel Grand · Catherine Wong · Matthew Bowers · Theo X. Olausson · Muxin Liu · Josh Tenenbaum · Jacob Andreas

While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal lambda abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions. We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing neural and symbolic methods—including the state-of-the-art library learning algorithm DreamCoder—LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.