How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge
Understanding the Target Audience
The target audience for this content primarily comprises AI researchers, business managers, and professionals in fields such as healthcare and finance who are interested in the functioning and evaluation of large language models (LLMs). These readers are typically involved in decision-making processes regarding AI implementations and are keen to understand the underlying mechanisms that drive LLM performance.
Pain Points
- Lack of clarity on how LLMs process and reason through information.
- Challenges in evaluating the effectiveness and trustworthiness of AI models in critical applications.
- Need for reliable frameworks to assess reasoning capabilities in various domains.
Goals
- To gain insights into the reasoning processes of LLMs to improve their applications in business and healthcare.
- To develop metrics that accurately reflect the performance and reliability of AI systems.
- To inform strategies for training LLMs that enhance both factual accuracy and logical reasoning.
Interests
- Advancements in AI and machine learning technologies.
- Application of AI in industry-specific scenarios, particularly in high-stakes fields like medicine and finance.
- Research on improving AI transparency and interpretability.
Communication Preferences
The target audience prefers concise, technical content that is well-researched and backed by peer-reviewed studies. They appreciate actionable insights and practical examples that can be directly applied to their fields of work.
Introduction
Recent advancements in reasoning-focused LLMs, such as OpenAI’s o1/o3 and DeepSeek-R1, have led to significant improvements on complex tasks. However, how these models reason step by step remains unclear. Most evaluations focus on final-answer accuracy, which obscures the reasoning process and fails to reveal how models combine knowledge and logic.
The Shortcomings of Final-Answer Evaluations in Math and Medicine
While LLMs have made impressive strides in reasoning tasks, particularly in math and medicine, progress has largely centered on improving final-answer accuracy at the expense of understanding the reasoning process. Earlier assessments have flagged factual errors in reasoning chains or measured how similar each reasoning step is to the original question. Such similarity, however, guarantees neither logical soundness nor factual accuracy, since LLMs often rely on internal knowledge or prior deductions.
A New Framework for Separating Knowledge and Logic in LLM Reasoning
Researchers from UC Santa Cruz, Stanford, and Tongji University propose a framework that breaks down LLM reasoning into two key components: factual knowledge and logical steps. This framework utilizes two metrics: the Knowledge Index (KI) for assessing factual accuracy and Information Gain (InfoGain) for evaluating reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills do not easily transfer between domains. Although supervised fine-tuning enhances accuracy, it can compromise reasoning depth. In contrast, reinforcement learning can refine reasoning by filtering out irrelevant information.
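To make the two metrics concrete, here is a minimal Python sketch under simplified assumptions: InfoGain is treated as the drop in the model's uncertainty (negative log-likelihood) about the ground-truth answer once a reasoning step is added, and KI as the fraction of steps whose factual claims a verifier confirms. The helpers `answer_nll` and `is_factually_correct` are hypothetical placeholders, not the paper's exact implementation.

```python
# Hedged sketch of the two metrics, assuming simplified definitions:
# InfoGain_t ~ reduction in uncertainty about the ground-truth answer after step t;
# KI ~ fraction of steps judged factually correct against expert sources.
from typing import Callable, List

def info_gain_per_step(
    question: str,
    steps: List[str],
    answer: str,
    answer_nll: Callable[[str, str], float],  # hypothetical: NLL of `answer` given a context
) -> List[float]:
    """InfoGain_t = NLL(answer | question + steps[:t-1]) - NLL(answer | question + steps[:t])."""
    gains = []
    prev_nll = answer_nll(question, answer)  # uncertainty before any reasoning step
    for t in range(1, len(steps) + 1):
        ctx = question + "\n" + "\n".join(steps[:t])
        cur_nll = answer_nll(ctx, answer)
        gains.append(prev_nll - cur_nll)  # positive = the step reduced uncertainty
        prev_nll = cur_nll
    return gains

def knowledge_index(
    steps: List[str],
    is_factually_correct: Callable[[str], bool],  # hypothetical: e.g., retrieval + judge model
) -> float:
    """KI = share of reasoning steps whose factual claims check out."""
    if not steps:
        return 0.0
    return sum(is_factually_correct(s) for s in steps) / len(steps)
```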
Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models
The researchers evaluate LLM reasoning by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled counterpart, each trained with supervised fine-tuning (SFT) and reinforcement learning (RL). Using tasks from both math and medical domains, they decompose model responses into logical steps and assess them with two metrics: Information Gain (how much uncertainty about the answer each reasoning step removes) and Knowledge Index (how factually accurate each step is, verified against expert sources). This approach reveals how models reason and pinpoints where they lose accuracy or logical soundness.
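The evaluation loop just described can be sketched as follows, reusing the metric helpers from the snippet above. The naive sentence- and line-based step splitter and the per-response aggregation are illustrative assumptions rather than the authors' exact pipeline.

```python
import re
from statistics import mean
from typing import Callable, List

def split_into_steps(response: str) -> List[str]:
    """Naive decomposition: treat each non-empty line or sentence as one reasoning step."""
    parts = re.split(r"(?:\n+|(?<=[.!?])\s+)", response.strip())
    return [p.strip() for p in parts if p.strip()]

def evaluate_response(
    question: str,
    response: str,
    answer: str,
    answer_nll: Callable[[str, str], float],
    is_factually_correct: Callable[[str], bool],
) -> dict:
    """Score one model response with per-step InfoGain and an overall Knowledge Index."""
    steps = split_into_steps(response)
    gains = info_gain_per_step(question, steps, answer, answer_nll)  # from the sketch above
    return {
        "num_steps": len(steps),
        "avg_info_gain": mean(gains) if gains else 0.0,  # average uncertainty reduction per step
        "knowledge_index": knowledge_index(steps, is_factually_correct),
    }
```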
Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks
The study compares two variants of Qwen2.5-7B (Qwen-Base and the distilled Qwen-R1) on medical tasks. Results indicate that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles because its prior training skews toward math and code, a mismatch for medical applications. Notably, SFT boosts medical knowledge retention more effectively than RL, though it can slightly weaken reasoning efficiency. RL, applied after SFT, improves both reasoning and knowledge retention. Compared with math-oriented tasks, medical benchmarks also lean more heavily on factual knowledge than on abstract reasoning.
Conclusion: Toward More Interpretable and Trustworthy LLMs
This study introduces a framework that separates knowledge from reasoning, aiming to enhance the evaluation of LLMs, particularly in high-stakes domains like medicine and math. The research demonstrates that while supervised fine-tuning improves factual accuracy—an essential element in medical applications—it may compromise reasoning depth. In contrast, reinforcement learning contributes positively to reasoning by filtering out inaccuracies. The framework has potential applications in various fields, including law and finance, where structured thought processes are vital. Overall, this approach clarifies the decision-making mechanisms of LLMs and suggests methods for tailoring their training to specific domains.
Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project.