SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents
Recent advancements in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks across various domains, including software engineering, robotics, and scientific experimentation. These agents typically operate by proposing and executing actions through APIs. As tasks grow in complexity, LM agent frameworks have evolved to incorporate multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance.
A central challenge in this field is effectively exploring and understanding the environment. This has led to the development of engineered scaffolds that rely on tools, memory mechanisms, and custom pipelines. However, many existing methods operate under the assumption of partial observability, requiring agents to collect observations incrementally. That assumption is reasonable in dynamic or unfamiliar environments, but it is less relevant in fully observable settings like SWE-bench, where all pertinent information is accessible from the outset.
Strategies in Software Engineering
Research on LM agents in software engineering has primarily focused on two strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, let LMs interact autonomously with codebases through custom interfaces and retrieval tools. Other frameworks, such as Moatless and AutoCodeRover, improve localization through search techniques, while SpecRover refines scaffolding design. In contrast, structured pipelines, such as Agentless and CodeMonkeys, decompose the task into sequential phases like localization, repair, and validation.
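To make the contrast concrete, the sketch below shows how a structured pipeline in the Agentless style splits the task into fixed localization, repair, and validation phases instead of a free-form agent loop. It is an illustration only, not the authors' code: `llm` and `run_tests_with` are hypothetical stand-ins for a chat-completion call and a test harness.

```python
# Illustrative sketch of a structured (Agentless-style) pipeline.
# `llm` and `run_tests_with` are hypothetical stubs, not real APIs.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model API call here")

def run_tests_with(patch: str) -> bool:
    raise NotImplementedError("apply the patch and run the repo's test suite")

def localize(issue: str, file_paths: list[str]) -> list[str]:
    # Phase 1: shortlist the files that likely need edits.
    reply = llm(
        f"Issue:\n{issue}\n\nFiles:\n" + "\n".join(file_paths)
        + "\n\nList the files most likely to need changes, one per line."
    )
    return [p for p in reply.splitlines() if p in file_paths]

def repair(issue: str, files: dict[str, str]) -> list[str]:
    # Phase 2: ask the model for one or more candidate patches.
    context = "\n\n".join(f"### {path}\n{src}" for path, src in files.items())
    reply = llm(f"{context}\n\nIssue:\n{issue}\n\nReturn a unified diff that fixes the issue.")
    return [reply]

def validate(patches: list[str]) -> str | None:
    # Phase 3: keep the first candidate patch that passes the tests.
    for patch in patches:
        if run_tests_with(patch):
            return patch
    return None

def structured_pipeline(issue: str, repo: dict[str, str]) -> str | None:
    shortlisted = localize(issue, list(repo))
    candidates = repair(issue, {p: repo[p] for p in shortlisted})
    return validate(candidates)
```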
Current research suggests leveraging long-context LMs (LCLMs) to directly interpret the entire task environment, potentially replacing complex agentic designs with a single powerful model. Advances in LCLM architecture and infrastructure have shown that these models can outperform retrieval-augmented systems in many contexts, reducing reliance on intricate external scaffolding.
Research Findings
Researchers from Stanford, IBM, and the University of Toronto investigated whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. Their findings indicate that an LCLM such as Gemini-1.5-Pro, given appropriate prompting and no scaffolding, achieves competitive performance, reaching 38% on SWE-bench Verified. Notably, Gemini-2.5-Pro, using the same straightforward setup, achieved a 50.8% solve rate. This suggests that many complex agentic designs could be simplified significantly.
Additionally, a hybrid two-stage approach pairing Gemini-1.5-Pro with Claude-3.7-Sonnet reached a 48.6% solve rate, further supporting the case for a simplified architecture.
State-in-Context Agents
Traditional LM agents often rely on interactive exploration due to partial observability. However, many tasks, such as software debugging, allow for full observability. The study proposes state-in-context agents that utilize LCLMs to process full or compressed environment states directly, eliminating the need for complex agentic scaffolding. For large codebases, a ranking-based compression method selects relevant files to fit within context limits.
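A minimal sketch of such ranking-based compression, assuming a simple lexical-overlap relevance score and a rough character-based token estimate (the paper's actual ranking and budgeting details may differ):

```python
# Minimal sketch of ranking-based context compression: score each file's
# relevance to the issue, then greedily pack the highest-scoring files
# until a token budget is reached. Scoring and token counting are toy
# assumptions, not the paper's exact method.

import re

def rough_token_count(text: str) -> int:
    # Crude proxy: roughly 4 characters per token.
    return len(text) // 4

def relevance(issue: str, source: str) -> float:
    # Toy lexical-overlap score; a real system might use BM25 or an LM ranker.
    issue_terms = set(re.findall(r"\w+", issue.lower()))
    file_terms = set(re.findall(r"\w+", source.lower()))
    return len(issue_terms & file_terms) / (len(issue_terms) or 1)

def compress_state(issue: str, files: dict[str, str], budget_tokens: int) -> dict[str, str]:
    # Rank all files by relevance, then keep as many as fit in the budget.
    ranked = sorted(files.items(), key=lambda kv: relevance(issue, kv[1]), reverse=True)
    selected, used = {}, 0
    for path, source in ranked:
        cost = rough_token_count(source)
        if used + cost > budget_tokens:
            continue
        selected[path] = source
        used += cost
    return selected
```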
Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the complete context, and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both methods employ targeted patch formats and validation to ensure accuracy and minimize hallucination.
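The sketch below illustrates the two control flows under stated assumptions: `call_lclm` and `call_sclm` are hypothetical stand-ins for long-context and short-context model APIs, and the prompt wording and search/replace patch format are illustrative rather than quoted from the paper.

```python
# Hedged sketch of the two proposed control flows. `call_lclm` / `call_sclm`
# are hypothetical stand-ins for model API calls; prompts are illustrative.

def call_lclm(prompt: str) -> str:
    raise NotImplementedError("long-context model API goes here")

def call_sclm(prompt: str) -> str:
    raise NotImplementedError("short-context model API goes here")

def directsolve(issue: str, state: dict[str, str]) -> str:
    # DIRECTSOLVE: the LCLM sees the (possibly compressed) codebase state
    # and emits the patch itself, reasoning step by step first.
    context = "\n\n".join(f"### {p}\n{src}" for p, src in state.items())
    return call_lclm(
        f"{context}\n\nIssue:\n{issue}\n\n"
        "Think step by step, then output a search/replace edit for each fix."
    )

def selectsolve(issue: str, state: dict[str, str], top_k: int = 5) -> str:
    # SELECTSOLVE: the LCLM only localizes; a stronger short-context model
    # then generates the patch from the selected files.
    context = "\n\n".join(f"### {p}\n{src}" for p, src in state.items())
    reply = call_lclm(
        f"{context}\n\nIssue:\n{issue}\n\n"
        f"List the {top_k} files most likely to need edits, one path per line."
    )
    chosen = [p for p in reply.splitlines() if p in state][:top_k]
    focused = "\n\n".join(f"### {p}\n{state[p]}" for p in chosen)
    return call_sclm(
        f"{focused}\n\nIssue:\n{issue}\n\n"
        "Output a search/replace edit for each fix."
    )
```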
Experimental Evaluation
The experiments assessed the simplified framework on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks. The proposed methods, DIRECTSOLVE and SELECTSOLVE, used LCLMs such as Gemini-1.5-Pro and Gemini-2.5-Pro, with SELECTSOLVE also calling an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results indicated that DIRECTSOLVE outperformed complex agentic approaches such as Agentless and CodeAct with minimal engineering effort. SELECTSOLVE further improved accuracy by delegating patching to the stronger model.
Ablation studies underscored the significance of chain-of-thought (CoT) prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the beginning of the prompt improved performance, highlighting limitations in long-context processing.
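As a hedged illustration of those findings, the snippet below assembles a prompt that places the most relevant files first and asks the model to reason step by step and restate the lines it will change before emitting edits. The exact wording and format are assumptions for illustration, not copied from the paper.

```python
# Illustrative prompt assembly reflecting the ablation findings:
# relevant files first, chain-of-thought instruction, and code
# restatement before the edit. Wording is an assumption.

def build_prompt(issue: str, ranked_files: list[tuple[str, str]]) -> str:
    parts = []
    # Most relevant files go first, since performance dropped when
    # they appeared later in the long context.
    for path, source in ranked_files:
        parts.append(f"### {path}\n{source}")
    parts.append(f"Issue:\n{issue}")
    parts.append(
        "First explain, step by step, what causes the issue. "
        "Then restate the exact lines you will modify. "
        "Finally, output the edits in search/replace format."
    )
    return "\n\n".join(parts)
```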
Cost Considerations
Currently, the cost of LCLM-based methods is higher than that of existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid reductions in inference costs and growing context lengths are making LCLMs more practical. Techniques such as key-value (KV) caching significantly lower costs on repeated runs, bringing the per-instance cost down to approximately $0.725. Although even small changes to a codebase can limit caching benefits, further advances could improve this. The study also indicates that LCLMs can manage long interaction histories, diminishing the need for complex memory and retrieval mechanisms.
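A quick back-of-the-envelope check on the reported figures (both numbers come directly from the comparison above):

```python
# KV caching cuts the reported per-instance cost to roughly 28% of the
# uncached figure, about a 3.6x reduction.
uncached = 2.60   # USD per instance, reported average without caching
cached = 0.725    # USD per instance, reported with KV caching
print(f"cached cost is {cached / uncached:.0%} of the uncached cost")
```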
In conclusion, unscaffolded LCLMs can perform competitively on SWE-bench tasks, suggesting a shift toward simpler agent architectures in the future.
Check out the Paper. All credit for this research goes to the researchers of this project.