Fractional Reasoning in LLMs: A New Way to Control Inference Depth

Introduction: Challenges in Uniform Reasoning During Inference

Large Language Models (LLMs) have advanced significantly across many domains, and test-time compute has become crucial to their performance. This approach enhances reasoning during inference by allocating additional computational resources—for example, generating multiple candidate responses and selecting the most suitable one, or refining answers iteratively through self-reflection. However, existing test-time compute strategies apply the same amount of reasoning to every problem, disregarding the varying reasoning needs of different queries. This mismatch can degrade answers on hard problems or waste computation on easy ones. Consequently, LLMs need a way to adjust their reasoning depth or level of reflection dynamically to optimize performance.
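The "generate many, keep the best" strategy described above can be sketched in a few lines. The `generate` and `score` functions below are hypothetical stand-ins for a stochastic LLM sampler and a reward model, not part of any real API:

```python
import random

def generate(prompt, seed):
    """Hypothetical stand-in for a stochastic LLM call: returns one candidate answer."""
    random.seed(seed)
    return f"{prompt}-candidate-{random.randint(0, 99)}"

def score(candidate):
    """Hypothetical stand-in for an outcome reward model (ORM); toy heuristic only."""
    return len(candidate)

def best_of_n(prompt, n=4):
    """Breadth-based test-time compute: sample N candidates, keep the top-scoring one."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)
```

The point of the sketch is the control flow, not the scorer: extra compute goes into sampling breadth, and a verifier-like function picks the winner.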

Prior Work: Latent Steering and Representation Control

Previous research has explored various methods to enhance LLM reasoning through inference-time scaling and latent state control. Techniques like Chain-of-Thought (CoT) prompting guide models in decomposing complex problems into intermediate steps, improving reasoning performance. Additionally, outcome reward models (ORMs) and process reward models (PRMs) assess generated responses based on correctness or the quality of internal reasoning. Representation engineering methods employ steering vectors in LLM latent spaces for controlled generation, while solutions like In-Context Vectors (ICV) extract latent vectors from demonstrations to guide internal states during inference. Representation Finetuning (ReFT) learns task-specific low-rank interventions over latent representations.
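The steering-vector idea behind methods like ICV can be illustrated with a minimal numpy sketch, assuming the common difference-of-means construction: the vector is the average activation gap between demonstrations that exhibit a behavior and ones that do not, and it is added to hidden states at inference. The arrays here are toy stand-ins for real transformer activations:

```python
import numpy as np

def steering_vector(pos_states, neg_states):
    """Mean activation difference between with-behavior and without-behavior demos."""
    return pos_states.mean(axis=0) - neg_states.mean(axis=0)

def apply_steering(hidden, vec, strength=1.0):
    """Shift a hidden state along the steering direction by a chosen strength."""
    return hidden + strength * vec
```

In practice the vector would be extracted per layer from real model activations (e.g. via forward hooks); the sketch only shows the arithmetic.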

The Proposed Framework: Fractional Reasoning for Adaptive Inference

Researchers from Stanford University have introduced Fractional Reasoning (FR), a training-free and model-agnostic framework aimed at enhancing test-time compute through adaptive reasoning control. FR modifies reasoning behavior by adjusting the model’s internal representations, extracting the latent shift induced by reasoning-promoting inputs such as CoT or reflection prompts, and applying this shift with a tunable scaling factor. This allows models to vary the depth of reasoning during inference without altering the input text or requiring fine-tuning. FR supports two key forms of test-time scaling: (a) Breadth-based scaling, such as Best-of-N and Majority vote, and (b) Depth-based scaling, like self-reflection.
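The core mechanism can be sketched as follows, hedged as an illustration rather than the paper's exact implementation: take the latent shift that a reasoning prompt (e.g. CoT) induces on a hidden state, and re-apply it with a tunable scale `alpha` instead of the all-or-nothing shift the prompt itself produces. Here `alpha < 1` dampens the reasoning behavior and `alpha > 1` amplifies it; the norm-preservation step is an assumed detail for the sketch:

```python
import numpy as np

def fractional_shift(h_plain, h_prompted, alpha):
    """Apply the prompt-induced latent shift at a fractional strength alpha."""
    shift = h_prompted - h_plain          # latent effect of the reasoning prompt
    steered = h_plain + alpha * shift     # scale that effect up or down
    # keep the activation norm stable (illustrative choice, not from the paper)
    norm = np.linalg.norm(h_plain)
    return steered * (norm / (np.linalg.norm(steered) + 1e-8))
```

With `alpha = 0` the state is unchanged; with `alpha = 1` it matches the prompted direction; intermediate and larger values interpolate or extrapolate the reasoning behavior continuously.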

Benchmarking: Performance Gains on Reasoning Tasks

FR has been evaluated on three benchmarks requiring multi-step reasoning: GSM8K, MATH500, and GPQA. The evaluation uses the test sets of GSM8K and MATH500 and the diamond split of GPQA. Main experiments employ two competitive open-source instruction-tuned models, Qwen2.5-7B-Instruct and LLaMA-3.1-8B-Instruct, both of which offer strong reasoning capabilities and expose the latent state representations the method requires. FR consistently outperforms standard test-time compute methods across all benchmarks and both models. By adjusting the influence of prompts, FR broadens the solution space, increasing the efficiency of traditional test-time compute methods.

Behavior and Model-Agnostic Generality of Fractional Reasoning

Further analysis of FR reveals its behavioral dynamics and general applicability across different models. Findings indicate that increasing the scaling parameter results in longer outputs with more detailed multi-step reasoning. This confirms that the framework effectively steers model behavior in a predictable and continuous manner. FR remains effective even when applied to reasoning-specialized models, such as DeepSeek-R1-Distill-Qwen-7B, improving accuracy over standard prompting baselines. Performance scaling analysis shows consistent improvements with an increasing number of generations, with FR achieving higher accuracy across most sampling budgets compared to the majority vote baseline.

Conclusion: Towards More Dynamic and Efficient LLM Inference

In summary, the introduction of Fractional Reasoning (FR) provides a training-free and model-agnostic framework designed to improve test-time compute through adaptive control of reasoning behavior in LLMs. This approach offers a general and interpretable method for more precise and efficient allocation of computational resources during inference, addressing the limitations of uniform reasoning in current test-time compute strategies. Future research may focus on developing adaptive policies for fully dynamic inference, as the framework currently relies on predefined reasoning directions and lacks automatic selection of scaling factors.

Check out the Paper. All credit for this research goes to the researchers of this project.