Google AI Introduces DS STAR: A Multi-Agent Data Science System That Plans, Codes, and Verifies End-to-End Analytics
Google has unveiled DS STAR (Data Science Agent via Iterative Planning and Verification), a multi-agent framework that turns open-ended data science questions into executable Python scripts. DS STAR operates directly on mixed data formats, such as CSV, JSON, Markdown, and unstructured text, rather than relying solely on structured databases.
Transforming Text to Python Over Heterogeneous Data
Existing data science agents typically use Text to SQL, which limits them to structured tables. DS STAR instead generates Python code that loads and integrates data from a variety of file types. It first summarizes each file, then uses those summaries as context to plan, implement, and verify a multi-step solution. This design lets DS STAR tackle benchmarks such as DABStep, KramaBench, and DA Code, whose tasks require multi-step analysis across different file formats.
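As a rough illustration of what answering a question over mixed files looks like in Python, the sketch below joins rows from a CSV table with metadata from a JSON document using only the standard library. The file contents, field names, and the question itself are hypothetical and are not taken from DS STAR or its benchmarks.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for two files in a data lake.
PAYMENTS_CSV = """merchant,amount
acme,120.0
acme,80.0
globex,50.0
"""
MERCHANTS_JSON = '{"acme": {"country": "NL"}, "globex": {"country": "US"}}'


def total_amount_by_country(payments_csv: str, merchants_json: str) -> dict:
    """Join CSV rows with JSON metadata and sum amounts per country."""
    merchants = json.loads(merchants_json)
    totals = {}
    for row in csv.DictReader(io.StringIO(payments_csv)):
        country = merchants[row["merchant"]]["country"]
        totals[country] = totals.get(country, 0.0) + float(row["amount"])
    return totals


totals = total_amount_by_country(PAYMENTS_CSV, MERCHANTS_JSON)
print(totals)
```

A query like this cannot be expressed as a single SQL statement over one database, because part of the needed information lives in a JSON document rather than a table.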
Stage 1: Data File Analysis with Aanalyzer
The first stage of DS STAR builds a structured representation of the data lake. For each file (Dᵢ), the Aanalyzer agent generates a Python script (sᵢ_desc) that extracts key information, such as column names, data types, metadata, and textual summaries. This step applies to both structured and unstructured data, and its output serves as shared context for the subsequent agents.
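To make the idea of a per-file description concrete, here is a minimal sketch of the kind of summary such a script might print. In DS STAR the script is written by an LLM agent for each file; the hand-written `describe_file` function, its output format, and the type-guessing rule below are assumptions made purely for illustration.

```python
import csv
import io
import json


def _guess_type(values: list) -> str:
    """Label a column 'numeric' if every value parses as a float."""
    if not values:
        return "unknown"
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return "text"
    return "numeric"


def describe_file(name: str, text: str) -> str:
    """Return a short, prompt-friendly description of one file (hypothetical format)."""
    if name.endswith(".csv"):
        reader = csv.DictReader(io.StringIO(text))
        rows = list(reader)
        columns = reader.fieldnames or []
        types = {c: _guess_type([r[c] for r in rows]) for c in columns}
        return f"{name}: CSV, {len(rows)} rows, columns={types}"
    if name.endswith(".json"):
        data = json.loads(text)
        if isinstance(data, dict):
            return f"{name}: JSON object, keys={sorted(data)}"
        return f"{name}: JSON {type(data).__name__}"
    # Fallback for Markdown and other unstructured text.
    lines = text.splitlines()
    first = lines[0] if lines else ""
    return f"{name}: text, {len(lines)} lines, first line={first!r}"


print(describe_file("sales.csv", "region,amount\nEU,10\nUS,12.5\n"))
print(describe_file("notes.md", "# Fees\nFees apply per transaction."))
```

Keeping each description to a line or two matters because every later agent receives these summaries in its prompt.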
Stage 2: Iterative Planning, Coding, and Verification
Following data analysis, DS STAR enters an iterative loop that mimics human interaction with a data notebook. The process involves:
- Aplanner creates an executable initial step (p₀) based on the query and file descriptions.
- Acoder translates the current plan (p) into Python code (s).
- DS STAR executes the code to gather an observation (r).
- Averifier assesses the cumulative plan, query, current code, and execution result, providing a binary evaluation: sufficient or insufficient.
- If the plan is deemed insufficient, Arouter decides how to refine it: either append a new step, or identify an erroneous step so the plan can be truncated there and regenerated.
This loop continues until the verifier confirms that the plan is sufficient or until a maximum of 20 refinement rounds is reached. The final plan is then converted into solution code by a separate agent, Afinalyzer, which ensures strict adherence to output formats.
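The control flow of this loop can be sketched in a few lines. Everything below is a simplified assumption rather than the authors' implementation: the `Agents` container and its call signatures are hypothetical, the agents are plain callables standing in for LLM calls, and the router's two outcomes (append a step, or truncate the plan at a flawed step) are modeled as an `Optional[int]` return value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 20  # refinement cap mentioned in the article


@dataclass
class Agents:
    """Hypothetical callables standing in for the LLM-backed agents."""
    planner: Callable[[str, List[str]], str]               # (query, plan so far) -> next step
    coder: Callable[[List[str]], str]                      # plan -> Python source
    execute: Callable[[str], str]                          # source -> observation
    verifier: Callable[[str, List[str], str, str], bool]   # True means "sufficient"
    router: Callable[[str, List[str], str], Optional[int]] # None = add a step, k = redo from step k


def solve(query: str, agents: Agents) -> Tuple[List[str], str, str]:
    """Plan, code, execute, verify; refine until sufficient or out of rounds."""
    plan = [agents.planner(query, [])]
    code = agents.coder(plan)
    result = agents.execute(code)
    for _ in range(MAX_ROUNDS):
        if agents.verifier(query, plan, code, result):
            break
        bad_step = agents.router(query, plan, result)
        if bad_step is not None:
            plan = plan[:bad_step]  # drop the flawed step and everything after it
        plan.append(agents.planner(query, plan))
        code = agents.coder(plan)
        result = agents.execute(code)
    return plan, code, result


# Toy stubs: the "verifier" is satisfied once the plan has three steps.
toy = Agents(
    planner=lambda query, plan: f"step {len(plan) + 1}",
    coder=lambda plan: "\n".join(f"# {step}" for step in plan),
    execute=lambda source: f"ran {len(source.splitlines())} lines",
    verifier=lambda query, plan, code, result: len(plan) >= 3,
    router=lambda query, plan, result: None,
)
plan, code, result = solve("toy question", toy)
print(plan, result)
```

Separating the verifier from the router keeps the stopping decision independent of the decision about how to repair the plan, which is the structure the bullet list describes.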
Robustness Modules: Adebugger and Retriever
Real-world data pipelines often break on issues such as schema drift and missing columns, so DS STAR includes an Adebugger agent to repair broken scripts. When code fails, Adebugger generates a corrected script from the detailed schema descriptions, the original code, and the error traceback.
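A repair loop of this kind might look roughly like the sketch below. The `run_with_debugger` helper, the `fix` callable, and the retry limit are hypothetical; in DS STAR the fix would come from an LLM, whereas here a toy function simply renames a column. Note that `exec` is used only for brevity, and a real system would run generated code in a sandbox.

```python
import traceback
from typing import Callable


def run_with_debugger(
    source: str,
    fix: Callable[[str, str, str], str],  # (code, traceback, schema notes) -> new code
    schema_notes: str,
    max_attempts: int = 3,                # hypothetical limit, not from the article
) -> dict:
    """Execute a script; on failure, ask `fix` for a corrected version and retry."""
    last_trace = ""
    for attempt in range(max_attempts):
        namespace: dict = {}
        try:
            exec(source, namespace)  # illustration only: never exec untrusted code unsandboxed
            return namespace
        except Exception:
            last_trace = traceback.format_exc()
            if attempt < max_attempts - 1:
                source = fix(source, last_trace, schema_notes)
    raise RuntimeError(f"script still failing after {max_attempts} attempts:\n{last_trace}")


# Toy example: the script refers to a column name that does not exist.
BROKEN = 'row = {"amount": 10}\nvalue = row["amt"]\n'


def toy_fix(code: str, trace: str, schema: str) -> str:
    """Stand-in for the LLM: rename the column if a KeyError was raised."""
    if "KeyError" in trace and "amount" in schema:
        return code.replace('"amt"', '"amount"')
    return code


namespace = run_with_debugger(BROKEN, toy_fix, schema_notes="columns: amount")
print(namespace["value"])
```

Passing the schema notes alongside the traceback is the important part: the traceback says what failed, while the file descriptions say what the correct column or key actually is.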
DS STAR also includes a Retriever module for data lakes with too many files to describe in a single prompt. The Retriever compares the user query against the file descriptions and selects the top 100 most relevant files as context for the task. The research team used Gemini Embedding 001 for this similarity search.
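The selection step can be sketched as a top-k ranking over embedding vectors. The sketch assumes the query and file descriptions have already been embedded (no embedding API is called here), uses cosine similarity as the ranking score, which the article does not specify, and works on made-up two-dimensional vectors and file names.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_files(query_vec: List[float], file_vecs: Dict[str, List[float]], k: int = 100) -> List[str]:
    """Return the names of the k files whose description vectors best match the query."""
    ranked = sorted(
        file_vecs,
        key=lambda name: cosine(query_vec, file_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" (hypothetical values) for three file descriptions.
file_vecs = {
    "payments.csv": [1.0, 0.1],
    "readme.md": [0.0, 1.0],
    "fees.json": [0.7, 0.7],
}
selected = top_k_files([1.0, 0.0], file_vecs, k=2)
print(selected)
```

With k set to 100 as in the article, only the highest-ranked descriptions are passed on, which keeps the prompt size bounded regardless of how large the data lake is.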
Benchmark Results on DABStep, KramaBench, and DA Code
In experiments with Gemini 2.5 Pro as the underlying model and up to 20 refinement rounds, DS STAR outperformed the reported baselines on all three benchmarks:
- DABStep: DS STAR achieved a hard level accuracy of 45.24%, compared to 12.70% for Gemini 2.5 Pro alone.
- KramaBench: DS STAR scored 44.69% normalized, surpassing the previous best of 39.79%.
- DA Code: DS STAR reached 37.1% accuracy on hard tasks, compared to 32.0% from other agents.
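For a quick sense of scale, the snippet below computes the absolute gain in percentage points and the ratio to the baseline for each pair of figures quoted above. The three baselines are not the same kind of system (the model alone, the previous best, and other agents), so the gains are not directly comparable across benchmarks.

```python
# (DS STAR score, baseline score) in percent, as quoted in the list above.
RESULTS = {
    "DABStep (hard)": (45.24, 12.70),
    "KramaBench (normalized)": (44.69, 39.79),
    "DA Code (hard)": (37.1, 32.0),
}


def gains(score: float, baseline: float) -> tuple:
    """Return (absolute gain in percentage points, score / baseline), rounded to 2 places."""
    return round(score - baseline, 2), round(score / baseline, 2)


summary = {name: gains(score, baseline) for name, (score, baseline) in RESULTS.items()}
for name, (points, ratio) in summary.items():
    print(f"{name}: +{points} points, {ratio}x the baseline")
```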
Key Takeaways
DS STAR is a multi-agent architecture built for heterogeneous data sources. It generates Python code through a systematic process of file analysis, planning, coding, and verification, and its Adebugger and Retriever modules add robustness against broken scripts and large data lakes. The performance improvements on DABStep, KramaBench, and DA Code suggest that the approach has potential in real-world enterprise applications.
For further exploration, refer to the original research paper and the related technical details.