PokeeResearch-7B: An Open 7B Deep-Research Agent Trained with Reinforcement Learning from AI Feedback (RLAIF) and a Robust Reasoning Scaffold
Target Audience Analysis
PokeeResearch-7B targets AI researchers, business managers, and data analysts who want to apply advanced AI to deep research tasks. Their pain points include unreliable research outputs, the difficulty of synthesizing information from many sources, and the lack of tools that improve productivity and decision-making. Their goals are stronger research capabilities, high accuracy in data interpretation, and integration of AI into business processes. They prefer clear, concise communication that is technically detailed yet accessible to non-experts.
Overview of PokeeResearch-7B
Pokee AI has open-sourced PokeeResearch-7B, a 7B-parameter deep-research agent capable of executing full research loops. The agent decomposes queries, issues search and read calls, verifies candidate answers, and synthesizes multiple research threads into a final response.
Research and Verification Loop
The agent operates through a structured research-and-verification loop. In the research phase, it either calls external tools for web search and page reading or proposes an interim answer. During verification, it checks the candidate answer against the retrieved evidence, accepting it or restarting research. This loop reduces unsupported answers and improves the reliability of final outputs.
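The loop below is a minimal Python sketch of this structure under stated assumptions, not PokeeResearch's actual implementation; `llm_propose`, `web_search`, `read_page`, and `verify` are hypothetical stand-ins for the policy model and its tool backends.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str     # "search", "read", or "answer"
    payload: str  # search query, page URL, or candidate answer text

# Hypothetical stubs standing in for the policy model and its tool backends.
def llm_propose(question: str, evidence: list[str]) -> Action: ...
def web_search(query: str) -> list[str]: ...
def read_page(url: str) -> str: ...
def verify(question: str, answer: str, evidence: list[str]) -> bool: ...

def research_loop(question: str, max_turns: int = 100) -> str:
    """Alternate research (tool calls) and verification until a candidate
    answer survives the evidence check or the turn budget runs out."""
    evidence: list[str] = []
    for _ in range(max_turns):
        action = llm_propose(question, evidence)       # research phase
        if action.kind == "search":
            evidence.extend(web_search(action.payload))
        elif action.kind == "read":
            evidence.append(read_page(action.payload))
        else:  # "answer": enter the verification phase
            if verify(question, action.payload, evidence):
                return action.payload                  # accepted
            # Rejected: fall through and resume researching.
    return "no verified answer within the turn budget"
```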
Training Methodology: RLAIF with RLOO
PokeeResearch-7B is fine-tuned from Qwen2.5-7B-Instruct using an annotation-free Reinforcement Learning from AI Feedback (RLAIF) method with the REINFORCE Leave-One-Out (RLOO) algorithm. Training rewards semantic correctness, citation faithfulness, and adherence to instructions rather than mere token overlap. Key training settings (a sketch of the RLOO advantage computation follows the list):
- Batch size: 64
- Research threads per prompt: 8
- Learning rate: 3e-6
- Training steps: 140
- Context length: 32,768 tokens
- Precision: bf16
- Checkpoint size: ~13 GB
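As referenced above, here is a minimal sketch of the RLOO advantage computation under these settings; `rloo_advantages` and the toy reward values are illustrative, and in PokeeResearch the per-thread reward comes from an AI judge scoring semantic correctness and citation faithfulness rather than token overlap.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE Leave-One-Out: each sampled thread is baselined against
    the mean reward of the other threads drawn for the same prompt.

    rewards: shape (batch, k), AI-feedback rewards for the k research
             threads sampled per prompt (k = 8 during training).
    returns: shape (batch, k), advantages r_i - mean_{j != i} r_j.
    """
    k = rewards.size(-1)
    loo_mean = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - loo_mean

# Toy example with 8 threads for one prompt: the advantage is positive
# where a thread beat its peers, negative where it fell short.
rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
print(rloo_advantages(rewards))
```

Because each thread's baseline is built only from its sibling samples, the gradient estimate stays unbiased without a learned value network, which keeps the training recipe simple at 7B scale.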
Reasoning Scaffold and Research Threads Synthesis
The reasoning scaffold comprises three key mechanisms: self-correction, self-verification, and research threads synthesis. The agent can detect malformed tool calls and retry, inspect its own answers against evidence, and run multiple independent threads per question, summarizing and synthesizing them into a final answer. This synthesis has been shown to improve accuracy on challenging benchmarks.
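A hypothetical sketch of how research threads synthesis could be wired up, reusing the `research_loop` from the earlier sketch; `llm_summarize` and `llm_synthesize` are illustrative stand-ins for summarization and synthesis model calls, not PokeeResearch's API.

```python
# Hypothetical stubs for the summarization and synthesis model calls.
def llm_summarize(question: str, thread_answer: str) -> str: ...
def llm_synthesize(question: str, summaries: list[str]) -> str: ...

def answer_with_rts(question: str, n_threads: int = 4) -> str:
    """Run independent research threads, summarize each, then synthesize
    the summaries into a single final answer."""
    threads = [research_loop(question) for _ in range(n_threads)]
    summaries = [llm_summarize(question, t) for t in threads]
    return llm_synthesize(question, summaries)
```

Because the threads run independently, one bad retrieval trajectory cannot dominate the final answer; the synthesis step lets the model reconcile agreement and disagreement across threads.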
Evaluation Protocol
The evaluation uses text-only questions from 10 benchmarks, including NQ, TriviaQA, and HotpotQA. A total of 1,228 questions are sampled, with four independent research threads run per question under a maximum of 100 interaction turns; accuracy is averaged over threads (mean@4), with Gemini-2.5-Flash-Lite judging answer correctness.
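As a rough illustration, the mean@4 metric can be sketched as below; `judge_correct` stands in for the Gemini-2.5-Flash-Lite correctness judgment and `research_loop` is the earlier sketch, both hypothetical.

```python
# Hypothetical stub for the LLM-as-judge correctness call.
def judge_correct(question: str, answer: str, gold: str) -> bool: ...

def mean_at_k(dataset: list[tuple[str, str]], k: int = 4) -> float:
    """Average judged correctness over every (question, thread) pair."""
    correct, total = 0, 0
    for question, gold in dataset:
        for _ in range(k):
            answer = research_loop(question)   # one independent thread
            correct += judge_correct(question, answer, gold)
            total += 1
    return correct / total
```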
Results at 7B Scale
PokeeResearch-7B achieves the best mean@4 accuracy among 7B-scale deep research agents across the evaluated datasets. Notable results with research threads synthesis (RTS) include:
- HLE: 17.6
- GAIA: 41.3
- BrowseComp: 8.4
Improvements also hold across the broader QA benchmarks, with the largest gains from RTS on the hardest sets: HLE, GAIA, and BrowseComp.
Key Takeaways
- PokeeResearch-7B is optimized for factual accuracy, citation faithfulness, and instruction adherence through RLAIF and RLOO.
- The agent’s reasoning scaffold enhances research reliability via self-verification and synthesis of independent research threads.
- The evaluation uses mean@4 accuracy across 10 datasets, providing a robust performance estimate.
- PokeeResearch-7B is released under the Apache-2.0 license, with public access to code and model weights.
Further Exploration
For more information, see the research paper, the model weights on Hugging Face, and the GitHub repository.