MEM: How Robots Remember Tasks That Take 15 Minutes
Multi-scale memory architecture for long-horizon embodied tasks, from kitchen cleaning to sandwich making
1. The 15-Minute Barrier in Robotics
There is a fascinating paradox at the heart of modern robotics: we can build robots that perform extraordinary feats of precision -- assembling microelectronics, performing minimally invasive surgery, navigating interplanetary surfaces -- yet most robots fail spectacularly at tasks that a five-year-old handles effortlessly. Making a sandwich. Cleaning a kitchen. Tidying a room. The gap is not in motor capability or sensory perception; it is in memory.
Most household tasks that humans consider trivial require sustained, organized effort over periods of 10 to 30 minutes. During that time, a human seamlessly tracks what has been done, what remains, where tools and ingredients are located, and how the overall goal maps to the current subtask. This continuous, multi-scale memory -- from moment-to-moment visual tracking to long-range task planning -- is the cognitive infrastructure that makes extended tasks possible.
Current robotic systems, by contrast, typically operate within narrow temporal windows. A vision-language-action (VLA) model might excel at executing a single manipulation primitive -- grasping a cup, opening a drawer -- but it has no mechanism for remembering what it did 30 seconds ago, let alone five minutes ago. The result is what researchers call the "15-minute barrier": tasks that require more than a few minutes of sustained, context-aware behavior are essentially out of reach for standard architectures.
A groundbreaking paper from early 2026, "MEM: Multi-scale Embodied Memory for Long-Horizon Tasks" (arXiv:2603.03596), presents an elegant solution to this problem. By introducing a biologically inspired multi-scale memory system, MEM enables robots to maintain coherent task execution across timescales ranging from seconds to tens of minutes. This article provides a comprehensive analysis of MEM's architecture, its experimental results, and its implications for the future of embodied AI.
2. Why Memory Is the Bottleneck
To appreciate why memory represents the critical bottleneck in embodied AI, consider the information flow during a seemingly simple task: cleaning a kitchen after cooking. The robot must identify all items that are out of place, determine where each belongs, plan an efficient sequence of actions, execute those actions while monitoring for obstacles and changes, verify that each subtask has been completed, and adapt when unexpected situations arise (a spill that was not initially visible, a cabinet that is already full).
This task generates an enormous volume of sensory data. A typical robot equipped with RGB-D cameras produces approximately 50-100 MB of visual data per second. Over 15 minutes, that amounts to 45-90 GB of raw sensory input. No current system can store and process all of this data in real time. The fundamental question, then, is not whether to use memory, but how to organize memory so that the right information is available at the right time, at the right level of abstraction.
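The back-of-envelope arithmetic above is easy to verify. The data rates are the ones cited in this section; the helper function is purely illustrative:

```python
def raw_sensor_volume_gb(rate_mb_per_s: float, minutes: float) -> float:
    """Total raw sensory data (in GB) produced at a given rate over a task."""
    return rate_mb_per_s * minutes * 60 / 1000

# A 15-minute task at the 50-100 MB/s rates cited above:
low = raw_sensor_volume_gb(50, 15)    # 45.0 GB
high = raw_sensor_volume_gb(100, 15)  # 90.0 GB
```

At 8 hours of continuous operation, the same rates yield 1.4-2.9 TB, which makes the case for abstraction rather than raw storage even starker.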
Previous approaches to this problem have generally fallen into one of two categories: pure reactive systems that maintain no explicit memory at all (relying entirely on the current sensory input and learned reflexes), and systems that attempt to maintain a complete world model that is updated at every timestep. The reactive approach fails because it cannot handle tasks with temporal dependencies -- it literally does not know what it has already done. The complete world model approach fails because maintaining and updating a comprehensive model of the environment in real time is computationally prohibitive and brittle.
Human cognition suggests a middle path. Neuroscience research has long established that human memory operates at multiple timescales simultaneously. Working memory holds the immediate sensory context (what is in my hand right now?), episodic memory tracks recent events (I just cleaned the countertop), and semantic memory provides general knowledge (plates go in the upper cabinet). These memory systems interact continuously, with each one providing context that helps the others operate more efficiently.
MEM is the first robotic system to systematically implement this multi-scale memory architecture in a way that enables long-horizon task execution in real-world environments.
3. MEM Architecture: Multi-Scale Memory in Detail
The MEM architecture comprises three interconnected memory components, each operating at a different temporal resolution and level of abstraction. Together, they create a coherent memory system that supports both fine-grained motor control and high-level task planning.
The first component is the Working Memory Module (WMM). This module maintains a rolling buffer of recent visual observations -- typically the last 5-10 seconds of egocentric video at reduced resolution. The WMM serves as the robot's immediate perceptual context, enabling it to track objects that are currently being manipulated, detect changes in the immediate environment, and provide visual grounding for the current action being executed. The WMM is implemented as a sliding window over a video encoder (specifically, a lightweight variant of VideoMAE), producing a continuous stream of visual embeddings that capture both spatial and temporal patterns in the immediate environment.
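The sliding-window behavior of the WMM can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` callable stands in for the VideoMAE-style encoder, and the frame rate and horizon are assumed values:

```python
from collections import deque

class WorkingMemoryModule:
    """Rolling buffer of recent frame embeddings (a sketch of the WMM idea;
    `embed` stands in for the paper's lightweight VideoMAE-style encoder)."""

    def __init__(self, embed, fps: int = 10, horizon_s: float = 5.0):
        self.embed = embed                        # frame -> embedding
        self.buffer = deque(maxlen=int(fps * horizon_s))

    def observe(self, frame):
        # Encode the new frame and push it; the deque drops the oldest
        # embedding automatically once the horizon is full.
        self.buffer.append(self.embed(frame))

    def context(self):
        # Current perceptual context: all embeddings in the window, oldest first.
        return list(self.buffer)
```

The bounded deque captures the essential property: memory cost is constant regardless of how long the robot runs, because anything older than the horizon is discarded by construction.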
The second component is the Episodic Memory Module (EMM). As the robot executes a task, the EMM automatically segments the continuous activity stream into discrete episodes -- coherent units of behavior such as "picked up the sponge," "wiped the counter," or "opened the dishwasher." Each episode is encoded as a text-grounded summary that captures what happened, what objects were involved, what the outcome was, and how long the episode lasted. These summaries are stored in a structured memory that supports efficient retrieval by content, time, or relevance to the current task context.
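A structured episode record with content- and time-based retrieval might look like the following. The field names and retrieval methods are illustrative assumptions, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One consolidated unit of behavior, e.g. "picked up the sponge"."""
    summary: str        # text-grounded description of what happened
    objects: list       # objects involved in the episode
    outcome: str        # e.g. "success" or "failure"
    start_s: float      # when the episode began
    duration_s: float   # how long it lasted

class EpisodicMemoryModule:
    def __init__(self):
        self.episodes: list[Episode] = []

    def commit(self, ep: Episode):
        self.episodes.append(ep)

    def recall_by_object(self, name: str) -> list[Episode]:
        # Content-based retrieval: every episode that touched this object.
        return [e for e in self.episodes if name in e.objects]

    def recall_recent(self, now_s: float, window_s: float) -> list[Episode]:
        # Time-based retrieval: episodes that ended within the last window.
        return [e for e in self.episodes
                if now_s - (e.start_s + e.duration_s) <= window_s]
```

In a real system the content query would run over text or vector embeddings rather than exact object names, but the structure, discrete summarized episodes indexed several ways, is the point.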
The third component is the Semantic Task Memory (STM). This module encodes general knowledge about how tasks are structured -- what steps are typically involved in cleaning a kitchen, what order they usually occur in, what common failure modes exist, and what recovery strategies are appropriate. The STM is initialized from a large language model's knowledge of household tasks and is progressively refined through experience. It provides the high-level planning context that guides the robot's behavior, while the WMM and EMM provide the real-time sensory and historical context that enables accurate execution.
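A minimal sketch of what the STM's task schema might hold, assuming a simple ordered-steps representation (the paper seeds this knowledge from an LLM and refines it with experience; the contents here are invented for illustration):

```python
class SemanticTaskMemory:
    """General task knowledge: an ordered step schema plus recovery hints.
    Contents are illustrative, not the paper's actual representation."""

    def __init__(self, steps, recovery=None):
        self.steps = list(steps)          # typical step ordering
        self.recovery = recovery or {}    # failure mode -> strategy

    def next_step(self, completed):
        # First schema step not yet marked complete.
        done = set(completed)
        for step in self.steps:
            if step not in done:
                return step
        return None

stm = SemanticTaskMemory(
    steps=["clear counter", "load dishwasher", "wipe surfaces"],
    recovery={"cabinet full": "use overflow shelf"},
)
```

Note that `next_step` only works when paired with the EMM: it is the episodic record of completed subtasks that supplies the `completed` argument, which is exactly the inter-module dependence the architecture is built around.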
The critical innovation in MEM is not any individual memory component, but the attention-based integration mechanism that allows all three components to interact at every decision step. When the robot needs to decide what to do next, the system attends simultaneously to the current visual context (WMM), the history of what has been done (EMM), and the general task plan (STM). This multi-scale attention produces a unified representation that captures both the immediate situation and the broader task context, enabling decisions that are locally appropriate and globally coherent.
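The integration step can be sketched numerically. The following is a bare-bones attention computation over three memory summaries, not the paper's learned module: score each memory's context vector against a query derived from the current state, softmax the scores, and fuse:

```python
import math

def multi_scale_attend(query, contexts):
    """Attention over memory summaries (e.g. WMM, EMM, STM vectors).
    A minimal numeric sketch of softmax-weighted fusion."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Score each memory's context against the query.
    scores = [dot(query, c) for c in contexts]

    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted combination: one unified representation for the policy.
    fused = [sum(w * c[i] for w, c in zip(weights, contexts))
             for i in range(len(query))]
    return fused, weights
```

The weights themselves are interpretable, which is what makes the attention-shift analysis reported later in this article (planning early, perception mid-task, progress tracking at subtask boundaries) possible to observe at all.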
4. Kitchen Cleaning: The 15-Minute Benchmark
The MEM paper introduces a rigorous benchmark for long-horizon embodied tasks centered on kitchen cleaning -- a task chosen specifically because it requires extended, multi-step behavior with complex dependencies between subtasks. The benchmark defines four kitchen configurations of increasing difficulty, from a simple two-counter kitchen with five items to displace, to a full kitchen with 20+ items, multiple storage locations, and deliberate distractors (items that look out of place but are actually in their correct positions).
In the simplest configuration, the robot must clean a kitchen by returning five items to their designated storage locations. The items include a cutting board, a knife, a bowl, a sponge, and a bottle of dish soap. The task requires approximately 8 minutes of continuous, coordinated behavior. The most complex configuration requires approximately 22 minutes and involves navigating between multiple rooms, operating appliances (dishwasher, trash compactor), and making judgment calls about items whose correct location is ambiguous.
The baseline comparisons are illuminating. A standard vision-language-action model (OpenVLA) without memory achieved a task completion rate of only 12% on the simplest configuration, failing primarily because it could not track which items had already been returned to their correct locations. After returning the cutting board, for example, it would frequently return to the cutting board's original location and attempt to pick it up again -- a clear manifestation of the absence of episodic memory.
MemoryVLA, a 2025 system that augments VLA models with a simple key-value memory store, improved completion rates to 38% on the simplest configuration. However, its flat memory structure struggled with tasks requiring more than 10 minutes, as the memory became cluttered with irrelevant historical observations and the system could not efficiently distinguish between recent, relevant memories and older, completed subtask records.
MEM achieved a completion rate of 79% on the simplest configuration and 52% on the most complex -- approximately double the next best system across all difficulty levels. The improvement was most dramatic for tasks exceeding 15 minutes, where MEM's multi-scale memory architecture allowed it to maintain coherent behavior long after simpler memory systems had degraded. Analysis of failure cases revealed that most MEM failures occurred not from memory issues but from motor execution errors -- the robot knew what to do but occasionally failed to physically accomplish it, a fundamentally different class of failure than the confusion exhibited by memoryless systems.
5. Sandwich Making: Creative Memory Under Uncertainty
The second major benchmark in the MEM paper is sandwich preparation -- a task that, while seemingly simpler than kitchen cleaning, introduces unique memory challenges related to ordering, ingredient tracking, and preference memory. Making a sandwich requires not just remembering what has been done, but maintaining a specific plan about what comes next and how the current state relates to the desired outcome.
The benchmark defines five sandwich types of increasing complexity: a simple PB&J (5 steps, ~4 minutes), a ham and cheese sandwich (8 steps, ~7 minutes), a club sandwich (12 steps, ~11 minutes), a veggie wrap (15 steps, ~14 minutes), and a custom sandwich defined by natural language description (variable steps and duration). The custom sandwich is particularly interesting because it requires the robot to translate a natural language description into a task plan on the fly, then execute that plan while tracking progress through memory.
In sandwich making, the working memory module proves essential for precise manipulation tasks like spreading condiments or layering ingredients. The system must visually track whether a spread has been applied evenly, whether a slice of cheese is properly aligned, or whether enough lettuce has been placed. These are inherently visual judgments that require comparison between the current state and a recent reference frame -- precisely what the WMM's rolling video buffer enables.
The episodic memory module handles a different challenge: ingredient tracking. When making a club sandwich with multiple layers, the robot must remember which ingredients have already been placed and which remain. Without episodic memory, the system frequently duplicated ingredients (placing tomato twice) or omitted them (forgetting the second layer of turkey). With MEM's structured episodic memory, these errors were reduced by 84% compared to the memoryless baseline.
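The duplicate-and-omission failure mode described above comes down to count tracking. A minimal sketch, assuming an invented place-event format recalled from episodic memory:

```python
from collections import Counter

def next_ingredient(plan, place_events):
    """Compare the planned ingredient sequence with place-events recalled
    from episodic memory, so the system neither duplicates nor skips a
    layer. Event format is illustrative, not the paper's."""
    placed = Counter(obj for verb, obj in place_events if verb == "place")
    needed = Counter()
    for item in plan:
        needed[item] += 1
        if needed[item] > placed[item]:
            return item   # first planned item not yet placed this many times
    return None           # every planned layer is accounted for
```

Counts matter here: a club sandwich plan legitimately contains "turkey" twice, so a set-based memory of placed items would skip the second layer, exactly the omission error the memoryless baseline made.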
The semantic task memory provides the scaffolding for the entire operation. It encodes general knowledge about sandwich construction (bread goes first and last, wet ingredients should not be placed directly on bread, proteins typically go before vegetables) while remaining flexible enough to accommodate the specific instructions for each sandwich type. For the custom sandwich benchmark, the STM's ability to generate a task plan from natural language and then update that plan based on execution feedback proved critical, enabling a 63% completion rate on arbitrary sandwich descriptions -- tasks that no previous robotic system had attempted.
6. The Biology Behind Multi-Scale Memory
MEM's architecture draws explicit inspiration from neuroscience research on human memory systems. The three-component structure maps directly to established models of human cognition: the Working Memory Module corresponds to Baddeley's model of working memory with its visuospatial sketchpad and phonological loop; the Episodic Memory Module corresponds to Tulving's episodic memory system, which encodes personally experienced events in temporal context; and the Semantic Task Memory corresponds to semantic memory, which stores general knowledge abstracted from specific experiences.
The biological parallels extend beyond the high-level architecture to the mechanisms of memory consolidation and retrieval. In the human brain, the hippocampus plays a central role in converting working memory traces into episodic memories through a process of consolidation. MEM implements an analogous process: when the WMM detects that a coherent behavioral episode has concluded (through a learned boundary detection model), the relevant working memory contents are summarized and committed to the EMM. This automatic segmentation and consolidation process ensures that the episodic memory contains meaningful, well-organized records rather than an undifferentiated stream of observations.
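The consolidation loop reduces to a simple pattern. In this sketch the boundary detector and summarizer are supplied as callables standing in for the paper's learned models:

```python
def consolidate(stream, is_boundary, summarize):
    """Hippocampus-style consolidation sketch: scan the observation stream,
    and whenever the boundary detector fires, flush the accumulated
    working-memory window into one episodic summary."""
    episodes, window = [], []
    for obs in stream:
        window.append(obs)
        if is_boundary(obs):
            episodes.append(summarize(window))
            window = []   # working-memory window cleared after commit
    return episodes
```

The key property is that episodic memory grows per episode, not per observation: a 15-minute task produces dozens of episodes rather than tens of thousands of frames.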
Similarly, the interaction between episodic and semantic memory in MEM mirrors the neural process of memory generalization. In the human brain, repeated episodic experiences are gradually abstracted into semantic knowledge -- a child who eats many sandwiches eventually develops general knowledge about how sandwiches work, independent of any specific sandwich-eating episode. MEM supports a similar process: as the system accumulates episodic memories from multiple task executions, the STM is periodically updated to incorporate lessons learned, improving its task plans based on actual experience.
The multi-scale attention mechanism that integrates all three memory components finds its biological counterpart in the prefrontal cortex, which is known to orchestrate the retrieval and integration of information from multiple memory systems to support goal-directed behavior. The attention weights learned by MEM's integration module show striking similarities to patterns observed in neuroimaging studies of human task performance: early in a task, attention is heavily weighted toward the STM (task planning); during execution, attention shifts to the WMM (perceptual monitoring); and at transition points between subtasks, attention peaks for the EMM (progress tracking).
7. Comparison with Existing Approaches
The landscape of memory-augmented embodied AI has evolved rapidly in recent years. Understanding where MEM fits requires examining several important prior systems.
MemoryVLA (2025) was among the first systems to demonstrate that explicit memory could improve VLA model performance on multi-step tasks. Its approach was straightforward: a key-value memory store where keys are visual embeddings and values are action-outcome pairs. While effective for short tasks (under 5 minutes), MemoryVLA's flat memory structure does not distinguish between different types of information or different timescales, leading to retrieval degradation as memory grows. MEM's hierarchical structure directly addresses this limitation, maintaining retrieval quality even after extended operation.
Embodied VideoAgent (2025) took a different approach, using a large video understanding model to process extended egocentric video and extract relevant information for task planning. This system demonstrated impressive performance on tasks requiring visual understanding but struggled with tasks requiring precise temporal tracking -- it could understand what the kitchen looks like in a video but had difficulty determining exactly which items had been moved and when. MEM's explicit episodic memory provides the temporal structure that Embodied VideoAgent lacks.
RoboMem (2024) introduced the concept of memory-conditioned policy generation, where a robot's action policy is explicitly conditioned on retrieved memories from previous experiences. While conceptually similar to MEM's approach, RoboMem's single-scale memory architecture required the policy to integrate temporal information across all timescales simultaneously, creating an excessive burden on the policy network. MEM's multi-scale decomposition simplifies this integration by presenting the policy with pre-organized information at the appropriate level of abstraction.
The key insight that distinguishes MEM from all prior work is the recognition that memory for embodied tasks is not a single problem but a family of related problems, each requiring a different representation, temporal resolution, and retrieval strategy. By decomposing the memory challenge into multiple specialized components that interact through a learned integration mechanism, MEM achieves performance that is substantially greater than the sum of its parts.
8. Implications for MemoryLake and Persistent Robot Memory
MEM's architecture, while implemented as a research prototype, points toward infrastructure requirements that align remarkably well with MemoryLake's capabilities. The transition from research to production embodied memory raises several challenges that a purpose-built memory infrastructure can address.
First, persistence across power cycles. A research robot can afford to start with empty memory at the beginning of each experiment. A deployed household robot cannot. It must remember where things belong, how the family organizes their kitchen, and what tasks have been completed today even after being turned off and on. MemoryLake's persistent, versioned storage provides the durability required for production embodied memory, with the added benefit of full provenance tracking that enables debugging and continuous improvement.
Second, memory sharing across robot instances. In a multi-robot deployment (increasingly common in commercial and industrial settings), individual robots' experiences can be consolidated into shared semantic memory, accelerating learning for all units. MemoryLake's merge and branch capabilities provide a natural mechanism for this kind of distributed memory management -- individual robot experiences can be accumulated on branches and periodically merged into a shared main line, with conflict resolution handling cases where robots have learned contradictory lessons.
Third, memory privacy and ownership. As robots become more prevalent in homes and workplaces, the memory they accumulate about their environments and the people in them becomes sensitive data. MemoryLake's access control and encryption capabilities ensure that embodied memory can be managed with the same rigor applied to other forms of personal data. The concept of a "robot memory passport" -- a portable, encrypted memory store that travels with the user rather than the hardware -- is a natural extension of MemoryLake's architecture.
The MEM paper demonstrates that multi-scale memory is the key to long-horizon embodied tasks. MemoryLake provides the infrastructure to make that memory persistent, shareable, and secure -- the foundation needed to move from 15-minute research demonstrations to 24/7 deployed robotic systems.
9. What Comes Next: From 15 Minutes to All Day
MEM represents a significant advance, but the 15-minute tasks it tackles are still far shorter than the continuous operation required of practical household or industrial robots. Extending the approach to hours-long or even day-long operation introduces additional challenges that the research community is only beginning to address.
Memory management at extended timescales requires sophisticated compression and forgetting mechanisms. A robot operating for 8 hours cannot store full episodic memories of every action; it must learn to distinguish between episodes worth remembering in detail (an unusual event, a new item placement, a user instruction) and routine episodes that can be compressed into statistical summaries (cleaned the counter 3 times today, each taking approximately 4 minutes). This echoes the human cognitive phenomenon of memory consolidation during sleep, where episodic memories are selectively transferred to semantic memory and the raw episodic traces are partially discarded.
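The keep-versus-compress policy described above can be sketched directly. This is a speculative illustration of the idea, not anything from the MEM paper: salient episodes survive verbatim, routine ones collapse into per-type statistics:

```python
from collections import defaultdict

def compress_day(episodes, is_salient):
    """Sketch of a forgetting policy for day-long operation: keep salient
    episodes in full, reduce routine ones to count and mean duration.
    Episode tuples are (kind, duration_s, salient_flag); format invented."""
    kept, stats = [], defaultdict(lambda: [0, 0.0])
    for kind, duration_s, salient in episodes:
        if is_salient(kind, salient):
            kept.append((kind, duration_s))       # remember in detail
        else:
            stats[kind][0] += 1                   # fold into statistics
            stats[kind][1] += duration_s
    summary = {k: {"count": n, "mean_s": total / n}
               for k, (n, total) in stats.items()}
    return kept, summary
```

The hard research question is, of course, the `is_salient` predicate itself: deciding at consolidation time which episodes will matter tomorrow is where the sleep-consolidation analogy stops being a metaphor and becomes a learning problem.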
Multi-task memory presents another frontier. MEM's current implementation treats each task (kitchen cleaning, sandwich making) as an independent memory context. A truly general household robot must maintain memories that span across tasks: the knowledge that the family prefers organic ingredients (learned during sandwich making) should inform grocery-related tasks; the observation that a cabinet hinge is loose (noticed during kitchen cleaning) should trigger a maintenance notification. Cross-task memory requires a richer semantic memory structure that can represent and retrieve knowledge across multiple task domains.
The path from MEM to production embodied memory passes through infrastructure. The research establishes the cognitive architecture; what remains is the engineering to make that architecture robust, scalable, and persistent. This is precisely the gap that MemoryLake is designed to fill.
10. Conclusion: Memory Makes Robots Real
The MEM paper demonstrates a profound truth about embodied AI: the difference between a robot that can perform isolated actions and one that can accomplish meaningful tasks is memory. Not faster processors, not better manipulators, not more training data -- memory. The ability to know what you have done, what remains to be done, and how to translate high-level goals into appropriate immediate actions is the cognitive infrastructure that enables extended, purposeful behavior.
The multi-scale architecture introduced by MEM -- working memory for immediate perceptual context, episodic memory for historical tracking, and semantic memory for task knowledge -- provides a principled framework for organizing the information that robots need to operate over extended periods. The experimental results, with task completion rates roughly doubling compared to the next best approach, demonstrate that this architecture is not merely theoretically elegant but practically effective.
As robots move from research labs into homes and workplaces, the quality of their memory will determine the quality of their service. MEM shows us the architecture; the next step is building the infrastructure to support it at scale. For those of us working on persistent, structured memory systems, the message from MEM is clear: the robots are ready for real memory. It is time to give it to them.
11. Computation and External Data: The Missing Dimensions of Robot Memory
MEM's multi-scale architecture addresses the remembering pillar of embodied memory with elegance and rigor. But production robot memory requires two additional capabilities that the paper does not explore: computation over memories and integration of external data sources. Consider a robot that has cleaned the same kitchen fifty times. Its episodic memory contains fifty trajectories, but without memory computation, it cannot synthesize these into an optimized cleaning strategy. Computation means reasoning over stored experiences: identifying that the counter-to-dishwasher path is fastest when approached from the left, that fragile items near the edge should be moved first, that the family member who cooks on Wednesdays leaves a predictable pattern of items to clean. This is trajectory optimization, preference modeling, and temporal pattern synthesis -- operations performed over memory, not just within it.
The computation pillar becomes critical for multi-robot coordination. When two robots share a kitchen, their combined memory must compute collision-free task allocation: if Robot A remembers that it cleaned the counter while Robot B loaded the dishwasher last time, the memory system should infer an efficient division of labor and detect potential conflicts (both robots reaching for the same item). MemoryLake's conflict detection and multi-hop reasoning engines can perform exactly this kind of inference over shared embodied memory, turning accumulated experience into coordinated behavior.
External data integration is equally essential for embodied systems. A robot's memory should not be limited to what it directly observes. External data sources -- updated floor plans from building management systems, product recall notices from manufacturer APIs, weather data that predicts mud tracked into the house, smart home sensor readings from rooms the robot has not yet visited -- all enrich the memory graph with information the robot could not acquire through its own sensors alone. A robot that integrates a notification that a family member is arriving home in 10 minutes (from a calendar API) with its memory of that person's preferences (likes the living room tidy, prefers dim lighting) can proactively prepare the environment. This is memory that grows from the world, not just from the robot's direct experience.
References
- [1] Chen, W., et al. "MEM: Multi-scale Embodied Memory for Long-Horizon Tasks." arXiv preprint arXiv:2603.03596, 2026.
- [2] Liu, H., et al. "MemoryVLA: Memory-Augmented Vision-Language-Action Models for Robotic Manipulation." ICRA, 2025.
- [3] Wang, J., et al. "Embodied VideoAgent: Persistent Memory from Egocentric Video for Embodied Task Completion." CoRL, 2025.
- [4] Baddeley, A. "Working Memory: Theories, Models, and Controversies." Annual Review of Psychology, 2012.