
Dynamic Fine-Tuning (DFT): Bridging the Generalization Gap in Supervised Fine-Tuning (SFT) for LLMs

Supervised Fine-Tuning (SFT) is a standard technique for adapting large language models (LLMs) to new tasks by training them on expert demonstration datasets. It is valued for its simplicity and its ability to instill expert-like behavior quickly. However, SFT often generalizes worse than reinforcement learning (RL). RL lets models explore diverse strategies, which tends to yield stronger generalization, but it demands heavy computation, careful hyperparameter tuning, and access to reward signals, none of which is always practical.

Existing attempts to address the challenges of SFT and RL have led to various hybrid methods. A common strategy combines an initial SFT phase with subsequent RL refinement, as seen in methods like InstructGPT. Alternative methods, such as interleaving SFT and RL steps or Direct Preference Optimization (DPO), aim to integrate imitation and reinforcement signals more efficiently. Techniques like Negative-aware Fine-Tuning (NFT) allow models to self-improve by modeling incorrect outputs. However, theoretical work has struggled to establish a precise mathematical equivalence between SFT and offline policy gradients.

A team of researchers from Southeast University, UC Berkeley, Shanghai Jiao Tong University, Nanyang Technological University, and Wuhan University has proposed Dynamic Fine-Tuning (DFT) to address the limited generalization of SFT-trained LLMs. Through mathematical analysis, they show that the standard SFT gradient implicitly encodes a flawed reward structure, in effect weighting each token by the inverse of its predicted probability, which limits the model's capacity to generalize. DFT corrects this by dynamically rescaling each token's objective with the model's probability of that token, stabilizing gradient updates and improving generalization across multiple benchmarks and base models.
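To make the reweighting concrete, below is a minimal PyTorch sketch of a probability-weighted token loss in the spirit of DFT. The function name, padding convention, and normalization are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def dft_style_loss(logits: torch.Tensor, target_ids: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Cross-entropy where each token's loss is rescaled by the model's own
    (detached) probability of that token, so the weight acts as a constant."""
    # logits: (batch, seq_len, vocab_size); target_ids: (batch, seq_len)
    log_probs = F.log_softmax(logits, dim=-1)
    safe_targets = target_ids.clamp(min=0)  # avoid indexing with ignore_index
    token_log_probs = log_probs.gather(-1, safe_targets.unsqueeze(-1)).squeeze(-1)
    mask = (target_ids != ignore_index).float()
    # Dynamic rescaling: multiply each token's loss by its current probability,
    # detached from the graph so no gradient flows through the weight itself.
    weights = token_log_probs.detach().exp()
    loss = -(weights * token_log_probs * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```

Relative to standard SFT, tokens the model currently assigns low probability contribute proportionally less gradient, which is the behavior the paper's analysis associates with a better-conditioned implicit reward and more stable updates.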

DFT is first evaluated in a standard SFT setting, where only expert demonstration data is available, with no negative samples, reward models, or verification signals. Training uses the NuminaMath CoT dataset, which contains 860,000 mathematical problems and solutions drawn from sources ranging from Chinese high school exercises to U.S. and international mathematical olympiads. In an offline RL setting, DFT is tested within the rejection-sampling fine-tuning (RFT) framework: candidate responses are sampled for 10,000 math questions and the verified ones are kept, yielding 140,000 training examples.
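For context, a rejection-sampling data-construction step in the spirit of RFT might look like the sketch below. The function names, the verifier, and the samples-per-question value are hypothetical placeholders; the article only states that 10,000 questions yielded 140,000 training examples.

```python
from typing import Callable, Dict, List

def build_rft_dataset(
    questions: List[str],
    references: List[str],
    generate_fn: Callable[[str], str],      # samples one candidate solution for a question
    verify_fn: Callable[[str, str], bool],  # checks the candidate's final answer against the reference
    samples_per_question: int = 16,         # hypothetical; not specified in the article
) -> List[Dict[str, str]]:
    """Keep only sampled solutions whose final answers verify; the resulting
    prompt/response pairs are then used for fine-tuning (e.g., with DFT)."""
    dataset: List[Dict[str, str]] = []
    for question, reference in zip(questions, references):
        for _ in range(samples_per_question):
            candidate = generate_fn(question)
            if verify_fn(candidate, reference):
                dataset.append({"prompt": question, "response": candidate})
    return dataset
```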

In SFT settings, DFT outperforms standard SFT across all evaluated LLMs, showing superior generalization and robustness on challenging benchmarks where standard SFT yields minimal or even negative gains. It also learns more efficiently and converges faster, outperforming Importance-Weighted SFT (iw-SFT) in most scenarios. In offline RL settings, DFT surpasses both offline and online RL baselines, reaching an average score of 35.43 across the evaluated math benchmarks, 11.46 points above the best offline method (RFT) and 3.43 points above the strongest online RL algorithm (GRPO). DFT also scores 64.71 on Math500, slightly ahead of GRPO, and posts sizable gains on harder tasks such as AMC23 (+7.19 over GRPO) and Minerva Math (+6.23 over GRPO).

In summary, the researchers address the generalization gap between SFT and RL by introducing Dynamic Fine-Tuning (DFT), a method that dynamically reweights the SFT loss using token probabilities. This modification stabilizes learning and improves generalization, as evidenced by gains across mathematical reasoning benchmarks. However, evaluations are limited to math-focused datasets and models of up to 7 billion parameters, with no testing on other domains or larger models. Future work aims to extend DFT to broader benchmarks, larger models, and vision-language tasks to validate its cross-modal effectiveness.


