Exploring AI Agent Frameworks: Deconstructing OpenHands (1) --- Core Concepts
This article begins a new series that examines Agent frameworks, using OpenHands (formerly OpenDevin) as the example.
I am writing this series because, when I discussed agent systems with some classmates, my explanations felt too superficial and they did not seem to fully understand. I therefore decided to review the subject systematically.
Because this series draws on a large number of articles, there may be some omissions in the references. If so, please point them out.
0x00 Summary
Understanding the underlying logic of Agents is fundamental to using them well, and it is also crucial for designing, evaluating, and scaling them. For product managers, AI engineers, and technology decision-makers, only a deep understanding of the Agent’s technical roadmap allows for precise strategic planning when implementing AI applications and for seizing future opportunities. We aim to examine the following key questions from the ground up:
- What core technology modules are needed to build a practical AI Agent?
- How do these core modules work together to form a complete closed loop for task execution?
- What are the key challenges that AI Agent systems face during deployment, and how does OpenHands address these engineering challenges?
We hope that through this in-depth “disassembly” journey, we can go beyond the demonstration of surface functions and directly touch the cornerstone of its architecture.
The reason for analyzing OpenHands is that, as an open-source AI software development agent framework, OpenHands (formerly OpenDevin) has distinctive features in terms of functionality, architecture, and compatibility. Furthermore, considering its ease of entry, learning value, and practical applications, it is highly suitable for learning agent frameworks. Its specific features are as follows:
- Low entry barrier and easy to learn: Supports natural language interaction, allowing developers to assign development tasks without complex coding, reducing initial learning costs. It also provides user-friendly interfaces such as GUI and CLI, along with detailed practical examples and clear documentation, facilitating understanding of the collaborative logic between Agents, tools, and LLMs. Furthermore, it is fully open-source under the MIT license, allowing free access to the core source code for Agent scheduling, task orchestration, and other aspects.
- It covers the core knowledge system of Agents: studying it helps you master the core technologies of Agent frameworks, such as multi-Agent collaboration, task decomposition, tool invocation, and sandbox security control. It is closely tied to practical scenarios such as web page interaction, API integration, and code processing, which helps learners understand how Agents are implemented in real-world development processes.
- It combines learning and practical application value: it can be used for building simple agent prototypes to quickly validate ideas, and it can also support the reliable deployment of large-scale agents, adapting to the entire process from learning and testing to production applications. Furthermore, compared to mainstream agent SDKs, its unique sandboxed production server and remote execution features allow learners to access more comprehensive advanced agent capabilities, enhancing their technical competitiveness.
0x01 Background
When discussing the technology of AI Agent systems, we first need to answer a key question: what truly makes a system competitive, the “intelligent functions” that are immediately visible, or the “workflow architecture” that keeps everything running smoothly behind the scenes? The answer not only determines what to develop first, but also whether the system can build a long-term advantage that others cannot copy.
1.1 What is an Agent?
The definition of an agent has multiple perspectives:
- Some define it as an independent system that can operate autonomously for a long time and complete complex tasks with the help of various tools;
- Some believe that an agent is a system that can complete the closed loop of “perception -> planning -> action -> feedback” with minimal human intervention, and can both parse natural language goals and call external tools such as search engines and databases;
- Some people interpret it as a standardized implementation scheme that follows a pre-set workflow.
Anthropic categorizes these different types of systems under the umbrella term “Agent systems,” while clearly distinguishing between the two core concepts of “Workflows” and “Agent” at the architectural level. Specifically:
- Workflows encode “how to do it”: Workflow systems are like stage plays performed from a fixed script; the collaboration between Large Language Models (LLMs) and tools follows pre-written code paths.
- Agents encode “what to do”: Agent systems are more like project managers with autonomous decision-making capabilities, dynamically guiding the process and tool calls through a large language model, and controlling the execution of tasks throughout the process.
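The distinction can be illustrated with a small sketch. Everything here is hypothetical: the `TOOLS` table, the function names, and `scripted_policy`, which stands in for the LLM call that a real agent would make. The point is only that the workflow fixes the step order in code, while the agent asks a policy for the next step.

```python
from typing import Callable, Dict, List

# Hypothetical tools: plain functions standing in for real integrations.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: text[:7],
}


def run_workflow(query: str) -> str:
    """Workflow: the code fixes the order of steps ("how to do it")."""
    found = TOOLS["search"](query)
    return TOOLS["summarize"](found)


def run_agent(goal: str, choose: Callable[[str, List[str]], str],
              max_steps: int = 5) -> List[str]:
    """Agent: a policy (normally an LLM) picks the next tool ("what to do")."""
    trace: List[str] = []
    for _ in range(max_steps):
        tool = choose(goal, trace)
        if tool == "finish":
            break
        # Feed the goal to the first tool, and the previous result afterwards.
        tool_input = goal if not trace else trace[-1]
        trace.append(TOOLS[tool](tool_input))
    return trace


def scripted_policy(goal: str, trace: List[str]) -> str:
    """Stand-in for an LLM decision; a real agent would call a model here."""
    return ["search", "summarize", "finish"][len(trace)]
```

With the scripted policy both paths produce the same result, but only `run_agent` would behave differently if the policy chose a different tool order.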
As large language models mature in four key capabilities (complex input understanding, logical reasoning and task planning, reliable tool invocation, and error recovery), Agents are gradually being deployed in production scenarios. Their workflow typically begins with instructions or interactive communication from a human user. Once the task objective is clear, they autonomously plan and execute operations. If additional information or judgment is needed, they proactively request it from the human user. At each execution step, the agent must obtain “ground truth” data from the environment, such as tool call results or code execution output, to assess task progress. When reaching a preset checkpoint or encountering execution obstacles, the agent can pause to obtain human feedback. Tasks generally terminate upon completion, and stopping conditions (such as a maximum number of iterations) are set to keep the process under control.
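A minimal sketch of this loop follows, assuming hypothetical `decide` and `execute` callables: `decide` plays the role of the model, and `execute` returns the ground-truth output of a tool call. The `"ASK_HUMAN"` sentinel and the `LoopResult` type are illustrative conventions, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    status: str  # "done", "needs_human", or "max_iterations"
    observations: List[str] = field(default_factory=list)


def agent_loop(
    decide: Callable[[List[str]], Optional[str]],
    execute: Callable[[str], str],
    max_iterations: int = 10,
) -> LoopResult:
    """Run decide/execute until the task ends or a stop condition is hit."""
    observations: List[str] = []
    for _ in range(max_iterations):
        command = decide(observations)
        if command is None:          # the model considers the task complete
            return LoopResult("done", observations)
        if command == "ASK_HUMAN":   # checkpoint or obstacle: hand control back
            return LoopResult("needs_human", observations)
        # Ground truth from the environment, used in the next decision.
        observations.append(execute(command))
    return LoopResult("max_iterations", observations)
```

The iteration cap is what keeps a confused model from looping forever; without it the only exit paths depend on the model's own judgment.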
1.2 The Importance of Agent Engineering
The implementation logic of the Agent is relatively simple, essentially involving the continuous operation of a large language model within a “context feedback - tool invocation” loop. The core building block of the Agent system is an enhanced large language model (LLM) that proactively leverages retrieval, tooling, and memory capabilities to optimize its functionality. For most applications, optimizing individual LLM calls through retrieval (RAG) and adding contextual examples in the prompt (Prompt Engineering) is usually sufficient.
Giving an agent a set of tools and prompts and letting it explore, analyze, decide, and execute until the task is complete is not inherently wrong. However, letting the LLM control everything can lead to endless loops and unchecked growth when prompts and tools are incomplete. Therefore, although the AI model is the “brain” of the agent, agent engineering is the key support for production-ready deployments. Rakesh Gohel’s “AI Agent Iceberg Model” points out that building a truly usable agent involves roughly 90% software engineering and only 10% AI technology.
- “10% AI”: The AI model is just the “brain” of the agent: understanding the task, planning the steps, and generating content.
- “90% Engineering”: Engineering is the entire “body and nervous system” that supports the Agent, including user interaction, access control, task orchestration, tool calls, logging, and exception rollback.
Behind this lies a systematic agent architecture that quietly determines the efficiency, scalability, and evolutionary direction of these agents. If we compare Large Language Models (LLMs) to the engine of AI, then the “Agent architecture” is the chassis and driving system that determines how far AI can go.
1.3 Architecture is the key to competitive advantage
AI Agents are not standalone products, but a completely new form of software: not “smarter robots,” but “autonomous, collaborative digital individuals.” The technical challenge lies not in “imagination,” but in “engineering implementation capabilities.” In the future, those who truly lead the development of Agents will be “Agent craftsmen” who understand both AI technology and system architecture. Agent architecture has become a core competitive arena for next-generation AI applications. The ability to understand the collaborative logic of “Memory-Plan-Tool-Reflection” and to build “transparent, controllable, and scalable” task systems determines whether a team has the core capabilities to create practical Agent applications.
Take browser agents as an example. They can generate code and interact with web pages, which seems impressive, but these capabilities rest on three external conditions, all of which are becoming increasingly common in the industry.
- Basic model capabilities are accessible to everyone. As large language models become increasingly powerful in the future, the functions that AI agents can perform will become easier to implement, gradually becoming industry standards. In this way, the gap in “intelligence” between different systems will gradually narrow.
- Prompting techniques are easy to copy, so they cannot be a unique advantage.
- Tool usage is becoming increasingly standardized. The ability of an agent to invoke tools essentially means using someone else’s API. Any team can directly integrate existing tools to fill its functional gaps, and competition in this area is becoming increasingly homogeneous.
The real technological barrier lies in building a stable, controllable, monitorable, and scalable underlying operating environment. This requires a team with deep technical expertise, the ability to handle complex systems, and continuous iteration to address new challenges. These challenges are inherent to AI’s interaction with the real world and will not disappear automatically as models become more powerful; rather, they determine the system’s stability and reliability. Whoever can solve these “system-level pain points” (state management, tool fault tolerance, plan controllability, and behavioral transparency) will gain the upper hand in the agent technology revolution.
In the future, the competition among AI agent systems will shift from a functional contest of “who can do what” to a contest of “who can do it more stably, better, and for longer.” In short, engineering strength determines long-term competitiveness.
0x02 AI Agent System
Building a mature AI Agent Framework essentially means creating an infrastructure that supports an agent’s autonomous perception, environmental awareness, decision-making, action execution, and self-evolution, in effect a “digital life form.” The work lies in breaking down the agent’s core capability modules, clarifying the responsibilities and collaboration logic of each component, and overcoming key bottlenecks in system stability, flexibility, and usability. An agent system encompasses four main capabilities:
- Environmental interaction awareness is achieved through tools such as Function Call and MCP protocol.
- Autonomous reasoning and decision-making can be accomplished using methods such as ReAct and Reflexion.
- Knowledge management employs a paradigm that combines short-term memory (based on Prompt/context window) and long-term memory (managed through RAG and vector databases);
- Communication and collaboration among multiple agents are achieved through protocols such as A2A, ACP, and ANP.
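As an illustration of the first capability, the sketch below dispatches a model-issued tool call to a registered Python function. The `{"name": ..., "arguments": ...}` JSON shape is an assumption loosely modeled on common function-calling formats; it is not the MCP wire format, and `REGISTRY`, `tool`, and `dispatch` are hypothetical names.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to Python callables.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that makes a function callable by name from a tool call."""
    REGISTRY[fn.__name__] = fn
    return fn


@tool
def add(a: int, b: int) -> int:
    return a + b


def dispatch(tool_call_json: str) -> Any:
    """Parse a tool call emitted by the model and run the matching function."""
    call = json.loads(tool_call_json)
    fn = REGISTRY.get(call["name"])
    if fn is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])
```

For example, `dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')` calls `add(2, 3)`. Rejecting unknown names explicitly matters because the model may produce a tool name that was never registered.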
2.1 Challenges in Architecture Design
Large models are essentially probabilistic outputs, and this “probabilistic nature” brings with it a three-pronged risk.
- Inconsistency: Multiple sampling results for the same input diverge.
- Unreality: Hallucinations lead to factual errors.
- Staleness: Static training data goes out of date.
This results in four major bottlenecks for the Agent:
- Memory fragmentation. Simple agents lack a dedicated task state awareness mechanism, and relying solely on context splicing cannot reliably track long-process tasks. For example, in multi-step tasks, the “intermediate state” simply lies in the context window of the LLM, without structured storage or explicit semantic indexing.
- A single error is fatal. A simple agent cannot detect or tolerate errors in tool calls, so one failed call can derail the whole task.
- Black-box planning. The LLM directly outputs a natural language plan that developers cannot intervene in or verify; problems are discovered only at execution time.
- An audit black hole. The agent’s actions (which tools it calls, which fields it retrieves, and which user prompt its decisions are based on) leave no trace, making corporate compliance impossible.
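Two of these bottlenecks, fragile tool calls and missing audit trails, can be addressed together by wrapping every tool invocation. The sketch below is one possible shape under stated assumptions: `call_tool` and the in-memory `AUDIT_LOG` are hypothetical, and a production system would write records to durable storage instead.

```python
import time
from typing import Any, Callable, Dict, List

# In-memory stand-in for durable audit storage.
AUDIT_LOG: List[Dict[str, Any]] = []


def call_tool(name: str, fn: Callable[..., Any], args: Dict[str, Any],
              retries: int = 2, delay: float = 0.0) -> Dict[str, Any]:
    """Call a tool, retry on failure, and record every attempt."""
    record: Dict[str, Any] = {}
    for attempt in range(1, retries + 2):
        try:
            result = fn(**args)
            record = {"tool": name, "args": args, "attempt": attempt,
                      "ok": True, "result": result}
            AUDIT_LOG.append(record)
            return record
        except Exception as exc:
            record = {"tool": name, "args": args, "attempt": attempt,
                      "ok": False, "error": repr(exc)}
            AUDIT_LOG.append(record)
            time.sleep(delay)  # simple fixed delay between attempts
    return record  # the last failure, so the caller can decide what to do
```

Returning the failure record instead of raising lets the agent treat an error as just another observation to reason about.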
Therefore, the competition among agents is shifting from “who can call more APIs” to “who can encapsulate uncertain models into deterministic systems.” Putting probability into state machines, confining hallucinations within audit cages, and pushing costs into budgets: this is the real barrier to entry for agent systems.
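To make “putting probability into state machines” and “pushing costs into budgets” concrete, here is a minimal sketch. The state names, the `ALLOWED` transition table, and the `Task` class are all illustrative assumptions and do not come from OpenHands.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    FINISHED = "finished"
    FAILED = "failed"


# Explicit transition table: anything not listed here is rejected.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.PAUSED, TaskState.FINISHED, TaskState.FAILED},
    TaskState.PAUSED: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.FINISHED: set(),
    TaskState.FAILED: set(),
}


class Task:
    def __init__(self, max_cost: float) -> None:
        self.state = TaskState.PENDING
        self.max_cost = max_cost
        self.spent = 0.0

    def transition(self, new_state: TaskState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def charge(self, cost: float) -> None:
        """Record the cost of one model call; fail the task if over budget."""
        self.spent += cost
        if self.spent > self.max_cost and self.state is TaskState.RUNNING:
            self.transition(TaskState.FAILED)
```

Whatever the model outputs, the task can only be in one of five states and can only move along listed edges, which is what makes the surrounding system deterministic.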
2.2 Core Components
An Agent is by no means an enhanced version of a single language model. The LLM is only its “cognitive center”; what truly supports its full functionality is a multi-module architecture working in concert. A mature, deployable Agent system contains at least the following core modules, each like a different role in a well-run team, performing its own duties while working closely with the others.
- The Planning and Decision Engine (Planner/Policy) acts as the “cognitive hub,” much like the team’s “think tank.” Its core responsibilities include analyzing the user’s task intent, breaking the user’s complex ultimate goal into an executable sequence of subtasks, clarifying the dependencies between subtasks, and generating outputs such as code and reports. It supports both one-time static planning and step-by-step expansion via ReAct/CodeAct. In the latter case the plan is adjusted dynamically: based on tool call results, environmental changes, or user feedback, the subtask flow is modified in real time to avoid a disconnect between the plan and actual execution.
- The Memory module acts as a “context continuator.” It stores dialogue history, intermediate results, key nodes and progress of task execution, and long-term experience. By introducing summaries or key data from previous steps into the prompt, the Agent can “remember” earlier results, ensuring state consistency and contextual continuity across steps.
- The Planner module acts as an “action roadmap.” Its implementation mechanisms include rule-based fixed processes (suitable for standardized tasks), dynamic generation based on LLM reasoning capabilities (suitable for flexible needs), and hybrid scheduling that combines the advantages of both (such as building a directed graph task flow based on LangGraph).
- The Tool-use module acts as the Agent’s “hands and feet,” allowing the model to “see” the tools instead of calling them “in the dark.” This breaks the limitation of simply outputting text and connects to external resources by calling third-party APIs, retrieving external knowledge bases, and reading and writing local files to complete actual operations.
- The Reflection module empowers the Agent with “metacognitive capabilities,” acting like a “debriefing specialist” for the team. When a task fails or is hindered, the Agent compares the execution results with expectations, assesses the execution effectiveness, identifies the cause of the problem, and adjusts the strategy.
These modules complement each other, forming an organic whole of “state-driven + intent decomposition + tool invocation + self-learning”, rather than a simple superposition.
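The Memory behavior described above (carrying summaries or key data from earlier steps into the prompt) can be sketched as follows. This is a hypothetical `Memory` class: truncating to 30 characters stands in for a real summarization step, which would typically be an LLM call.

```python
from typing import List


class Memory:
    """Keeps recent steps verbatim and older ones as short summaries."""

    def __init__(self, keep_recent: int = 2) -> None:
        self.keep_recent = keep_recent
        self.steps: List[str] = []

    def add(self, step_result: str) -> None:
        self.steps.append(step_result)

    def build_prompt(self, task: str) -> str:
        # Split history into older steps (summarized) and recent ones (verbatim).
        split = max(len(self.steps) - self.keep_recent, 0)
        older, recent = self.steps[:split], self.steps[split:]
        lines = [f"Task: {task}"]
        if older:
            lines.append("Summary of earlier steps:")
            # Truncation is a placeholder for a real summarizer.
            lines.extend(f"- {s[:30]}" for s in older)
        if recent:
            lines.append("Recent steps:")
            lines.extend(f"- {s}" for s in recent)
        return "\n".join(lines)
```

The Planner would call `build_prompt` before each decision, so the model sees a bounded view of the history rather than the full transcript.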
In addition, in actual implementations, the following components are usually included:
- Event Bus: Used for decoupling, serializing user messages, tool returns, status changes, and exception reports into events and broadcasting them to subscribers.
- Runtime Sandbox: Provides a “hands-on” space for the model: file system, network, shell, Python interpreter, and third-party libraries.
- Observability: Provides a visual task execution interface that displays the task flow, subtask progress, and module interaction logs. It supports real-time monitoring by developers and lets operators see inside the “black box”: every call chain, token consumption figure, exception stack, and decision reason is written to disk.
- Configuration & Storage module: Manages the framework’s global configuration, such as LLM parameters, task budgets (maximum number of iterations, cost cap), storage paths, and tool lists, and supports dynamic configuration updates. It also ensures data persistence, including task status, execution history, stored data, and configuration information, so that the system can recover after a restart.
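The Event Bus item above can be sketched as a small publish/subscribe class. This is a generic illustration with hypothetical names (`EventBus`, `subscribe`, `publish`); it is not the OpenHands event stream API.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.log: List[str] = []  # serialized events, in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        event = {"type": event_type, "payload": payload}
        # Serialize first, so the log holds every event even if a handler fails.
        self.log.append(json.dumps(event, sort_keys=True))
        for handler in self._subscribers[event_type]:
            handler(event)
```

Publishers and subscribers only share event type names, which is the decoupling the list describes; the serialized log doubles as raw material for observability and replay.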
2.3 Devin & OpenHands (formerly OpenDevin)
Devin, developed by Cognition AI, was billed as the world’s first AI programmer, with full-stack development capabilities: it can autonomously complete tasks such as writing code, debugging, deployment, and training AI models. Devin represents an advanced autonomous agent designed to address the complexities of software engineering. It combines tools such as a shell, a code editor, and a web browser to demonstrate the underutilized potential of LLMs in software development.
The OpenDevin project was born out of a desire to replicate, enhance, and innovate upon the original Devin model. OpenDevin aims to explore and extend Devin’s capabilities, identify its strengths and areas for improvement, and guide the progress of open code models. By engaging the open-source community, OpenDevin aims to address the challenges faced by code LLMs in real-world scenarios, produce work that contributes significantly to the community, and pave the way for future advancements.
OpenHands currently has over 65,000 stars on GitHub.
The OpenHands website is: https://docs.openhands.dev/
The GitHub link is: https://github.com/OpenHands/OpenHands
Software Agent SDK: https://github.com/OpenHands/software-agent-sdk
Benchmarks: https://github.com/OpenHands/benchmarks
The latest OpenHands paper is: The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents (arXiv:2511.03690v1)
0x03 OpenHands Architecture Concept Diagram
To accurately unlock the deep logic of the OpenHands source code, we need to construct a clear architectural concept map, connecting the scattered core components and design concepts into an organic whole. Each link in this map carries a key mission, collectively building the foundation for the efficient operation of the AI Agent.
[Figure: OpenHands components]
3.1 System Architecture
The diagram below provides a high-level overview of the OpenHands system architecture. The system is divided into two main parts: the front-end and the back-end. The front-end handles user interactions and displays results. The back-end handles business logic and executes Agent operations. A key advantage of this architecture is its flexibility and scalability: new Agent types, action types, or runtime environments can be easily integrated into existing systems.
[Figure: OpenHands overview]
The backend architecture is as follows:
[Figure: OpenHands class diagram]
3.2 Code Repository Directory Structure
OpenHands’ codebase is clearly organized, and the following are the main directories:
- Agent Center (agenthub): The core area of the Agent and the carrier of the platform’s core capabilities. It contains the Agent logic focused on code generation and execution, as well as the Agent implementation responsible for browser interaction.
- Events System: This is the core directory of the event-driven architecture, defining the system’s “communication language.” observation/ encapsulates feedback events such as task execution results, and action/ defines the various action commands that agents can execute. stream.py implements the event stream management mechanism, responsible for event distribution, storage, and subscription, and is a key hub for component collaboration.
- Runtime Environment (runtime/): Provides the underlying environment support for Agent execution. The impl/ directory contains specific implementations of different runtime environments, such as Docker containers and local environments; plugins/ supports extending runtime functionality through plugins to improve system flexibility.
- Memory Management (memory/): Responsible for managing the Agent’s historical data and memory. conversation_memory.py handles the storage and retrieval of dialogue history; condenser/ implements history compression logic to address the limited context window of LLMs and ensure the continuity of long-running tasks.
- Language Model Integration (llm/): Enables integration with various large language models.
- Controller: The system’s “command center,” responsible for the scheduling and management of Agents.
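To give a feel for what history compression in condenser/ is for, here is a deliberately simple sketch. It is not the project's implementation: the `condense` function and its parameters are hypothetical, and it only shows one possible strategy (keep the first and the most recent events, replace the middle with a placeholder).

```python
from typing import List


def condense(history: List[str], keep_first: int = 1,
             keep_last: int = 3) -> List[str]:
    """Keep the first and last events; replace the middle with a placeholder."""
    if len(history) <= keep_first + keep_last:
        return list(history)  # short enough, nothing to drop
    dropped = len(history) - keep_first - keep_last
    head = history[:keep_first]
    tail = history[len(history) - keep_last:]
    return head + [f"[{dropped} earlier events condensed]"] + tail
```

Keeping the first event preserves the original task statement, while the tail preserves the most recent observations the agent needs for its next step.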
This structure reflects OpenHands’ modular design, with clear division of responsibilities among components, facilitating maintenance and expansion. The logical structure between these directories is roughly as follows:
[Figure: OpenHands overall process]
3.3 Core Components
OpenHands employs an event-based architecture that decouples the Agent, runtime environment, and user interface, enabling flexible interaction patterns. Key classes in OpenHands include:
- LLM: Responsible for all interactions with large language models. Thanks to LiteLLM, it can work with any underlying completion model.
- Agent: Responsible for inspecting the current state and producing an action that moves the task closer to the final goal.
- AgentController: Initializes the Agent, manages its state, and drives the main loop, propelling the Agent forward step by step.
- State: Represents the current state of the Agent task. This includes the current step, the history of recent events, and the Agent’s long-term plan. The State module is like the Agent’s “memory brain,” not only recording various state data during task execution but also supporting breakpoint recovery. Even if a long-running task is interrupted, it can continue from where it left off without having to start over.
- EventStream: The central hub for events. Any component can publish events or listen to events published by other components.
- Event: Action or observation
- Action: Represents a request, such as editing a file, running a command, or sending a message.
- Observation: Represents information collected from the environment, such as file content or command output.
- Action and Observation act as a “common language” between components: Action carries the instructions to be executed, and Observation returns the execution results. Information flows smoothly only when the two work together. These modules do not work independently; they are connected through the “core hub” of the EventStream, where events converge and are distributed, pushing tasks forward step by step and ultimately enabling the AI Agent to handle complex tasks.
- The ReAct paradigm acts as a “code of conduct” for the agent, establishing the core logic of “think first, then act, and receive feedback,” ensuring that its decision-making is methodical and not chaotic. The event-driven model, on the other hand, builds the “skeleton” of the system, with all interactions driven by the flow of events. The modules are not rigidly bound together, allowing for flexible responses to various situations.
- Runtime: Responsible for executing actions and sending back the observation results.
- Sandbox: The part of the runtime that actually runs commands, for example inside Docker.
- Runtime and Memory are key “organs” of the system: Runtime provides an isolated execution environment to ensure that the code runs safely and stably without disturbing other parts; Memory manages historical data clearly and provides past experience for the Agent to make decisions.
- Server: Manages OpenHands sessions over HTTP, for example to drive the front end.
- Session: Stores a single EventStream, a single AgentController, and a single Runtime. It typically represents a single task (but may include several user prompts).
- ConversationManager: Maintains a list of active sessions and ensures that requests are routed to the correct session.
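The relationship between Event, Action, Observation, EventStream, and Runtime can be sketched in a few lines. The class names mirror the list above, but the fields and method signatures are simplified assumptions for illustration; the real OpenHands classes carry more data and have different signatures.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    source: str  # e.g. "user", "agent", "runtime"


@dataclass
class Action(Event):
    command: str  # what the agent wants done


@dataclass
class Observation(Event):
    content: str  # what the environment reported back


class EventStream:
    """Central hub: stores every event and notifies all subscribers."""

    def __init__(self) -> None:
        self.history: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def add_event(self, event: Event) -> None:
        self.history.append(event)
        for callback in list(self._subscribers):
            callback(event)


class Runtime:
    """Executes Actions and publishes Observations back to the stream."""

    def __init__(self, stream: EventStream) -> None:
        self.stream = stream
        stream.subscribe(self.on_event)

    def on_event(self, event: Event) -> None:
        if isinstance(event, Action):
            # A real runtime would run the command inside a sandbox.
            result = f"ran: {event.command}"
            self.stream.add_event(Observation(source="runtime", content=result))
```

Because the Runtime reacts only to Action events, its own Observation does not trigger another execution, and the stream history ends up as an alternating Action/Observation record.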
The core component interaction relationships are as follows:
[Figure: OpenHands core component interaction relationships]
This process demonstrates the complete architecture of OpenHands as an AI software development agent platform, showing the complete data flow and component relationships from user interaction to core execution.
- User input phase
  - Users make requests via the web interface, CLI, or API.
  - The server receives the request and creates a session.
  - Events are transmitted through the event stream.
- Agent processing phase
  - The controller initializes the corresponding agent based on the configuration.
  - The agent makes decisions based on the current state and historical data.
  - Specific operations are performed through the tool system.
- Execution phase
  - The runtime environment executes specific commands or code.
  - The plug-in system provides additional feature support.
  - The security system monitors the execution process.
- Feedback phase
  - The execution result is returned as an Observation.
  - The Agent’s state and memory are updated.
  - The results are returned to the user.
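The four phases can be traced in a toy end-to-end sketch. All classes here are simplified stand-ins: `EchoAgent` replaces an LLM-backed agent, execution is faked with a string, and the `route` and `handle` methods are hypothetical rather than the OpenHands API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class State:
    history: List[str] = field(default_factory=list)
    iteration: int = 0


class EchoAgent:
    """Stand-in agent: turns the latest user message into one action."""

    def step(self, state: State) -> Optional[str]:
        last = state.history[-1]
        if last.startswith("user: "):
            return last[len("user: "):]
        return None  # nothing left to do after the observation


class Session:
    def __init__(self, max_iterations: int = 5) -> None:
        self.state = State()
        self.agent = EchoAgent()
        self.max_iterations = max_iterations

    def handle(self, user_message: str) -> List[str]:
        # 1. User input phase
        self.state.history.append(f"user: {user_message}")
        for _ in range(self.max_iterations):
            # 2. Agent processing phase
            action = self.agent.step(self.state)
            if action is None:
                break
            self.state.history.append(f"action: {action}")
            # 3. Execution phase (a real runtime would use a sandbox)
            result = f"executed {action}"
            # 4. Feedback phase: record the observation and update state
            self.state.history.append(f"observation: {result}")
            self.state.iteration += 1
        return self.state.history


class ConversationManager:
    """Routes each request to the session that owns it."""

    def __init__(self) -> None:
        self.sessions: Dict[str, Session] = {}

    def route(self, session_id: str, user_message: str) -> List[str]:
        if session_id not in self.sessions:
            self.sessions[session_id] = Session()
        return self.sessions[session_id].handle(user_message)
```

Calling `route` twice with the same session id appends to one history, while a different id gets its own isolated session, which matches the one-session-per-task description above.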
0xFF Reference
https://docs.all-hands.dev/openhands/usage/architecture/backend
As AI agents evolve from “toys” to “tools,” what should we focus on? OpenHands Architecture Analysis [Part 2: Core Concepts Related to Agents] (Kerry)
As AI agents evolve from “toys” to “tools,” what should we focus on? OpenHands Architecture Analysis [Part 1: Series Introduction] (Kerry)
Coding Agent OpenHands Analysis (with code) (Arrow)
OpenHands Source Code Analysis (Yi Lihui)
[Agent Engineering] 01 - Agent Engineering Technology Insights, Challenges, and Solutions (Wang Lei)
From code generation to autonomous decision-making: Building a Coding-Driven “Self-Programming” Agent (Alibaba Cloud Developers)
From Prompt to Context: A Systematic Overview of Context Engineering Based on 1400+ Papers (Alibaba Cloud Developers)
Claude Code In-Depth Analysis: The Core Architecture of a Top-Tier AI Programming Tool (Alibaba Cloud Developers)
From Thinking to Act: In-depth Analysis of the Architecture Design and Future Evolution of Manus-like AI Agents (AnthroTech AI)
AI Programmer’s Guide to OpenDevin Source Code Analysis (goofy)
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents 2511.03690v1
Machine Learning Systems Tutorial: AI Engineering Principles and Practice
Agentic Design Patterns (Chinese Translation Project)
AI Native Application Architecture White Paper
Must-Read Series for 2025: How AI is Redefining Research? A 10,000-Word Article Explaining “Deep Research”
Agent Development Practice: From Idea to Product - Overcoming Key Technologies in SSE, Context Engineering, and Streaming Parsing
In-depth analysis of the Google gemini-cli source code: A deep dive into the core of AI Agents.
Agent Development Practices: From Idea to Product - System Architecture Design Practices