Unbabel Introduces TOWER+: A Unified Framework for High-Fidelity Translation and Instruction-Following in Multilingual LLMs
Understanding the Target Audience
TOWER+ is aimed at business leaders, AI researchers, and developers working in machine translation and natural language processing. Their main pain points are obtaining high-quality translations that preserve context and comply with specific formatting requirements, improving user experience in multilingual environments, and maintaining operational efficiency. They are interested in advances in AI technology, practical applications of language models, and strategies for improving translation accuracy, and they tend to prefer technical documentation, case studies, and data-driven insights.
Current Challenges in Machine Translation
Large language models have significantly advanced machine translation, utilizing extensive training datasets to translate multiple languages while capturing linguistic subtleties. However, fine-tuning these models for translation accuracy often compromises their instruction-following and conversational capabilities. Broad-purpose models frequently fall short of meeting professional fidelity standards. The challenge lies in balancing precise, culturally aware translations with the ability to perform tasks such as code generation and problem-solving, while also maintaining terminological consistency and adhering to formatting guidelines across diverse audiences. Stakeholders require systems that can adapt dynamically to domain-specific needs and user preferences without sacrificing fluency.
Current Approaches to Tailoring Language Models
Various strategies have been employed to enhance translation accuracy in language models. Fine-tuning pre-trained models on parallel corpora improves the adequacy and fluency of translations. Continued pretraining on a mix of monolingual and parallel data enhances multilingual fluency. Some teams have incorporated reinforcement learning from human feedback to align outputs with quality preferences. Proprietary systems like GPT-4o and Claude 3.7 have demonstrated superior translation quality, while open-weight adaptations such as TOWER V2 and Gemma 2 have shown comparable or superior performance in specific language scenarios.
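To make the fine-tuning-on-parallel-corpora approach concrete, the sketch below wraps source–target pairs in a translation prompt and computes the loss only on the target tokens. It is a minimal illustration with a generic small model and toy data; the model id, prompt template, and hyperparameters are placeholders rather than the actual recipe used by any of the systems mentioned above.

```python
# Illustrative supervised fine-tuning step on parallel data (not an actual TOWER+ recipe).
# The model id, prompt template, and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in for a multilingual base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [("The cat sleeps.", "Die Katze schläft."),
         ("Good morning.", "Guten Morgen.")]

model.train()
for src, tgt in pairs:
    prompt = f"Translate English to German.\nEnglish: {src}\nGerman: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(tgt + tokenizer.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    # Mask the prompt so the loss is computed only on the translation tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.3f}")
```

The same target-only masking idea scales to large parallel corpora with standard data loaders and a trainer of choice.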
Introducing TOWER+: A Unified Training Framework
Researchers from Unbabel and academic partners have introduced TOWER+, a suite of models designed to balance translation specialization and general-purpose utility. Variants are available at multiple parameter scales: 2 billion, 9 billion, and 72 billion. The unified training pipeline aims to position TOWER+ models on the Pareto frontier, achieving high translation performance alongside robust general capabilities.
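For readers who want to try the released checkpoints, here is a minimal inference sketch with Hugging Face transformers. It assumes the models are published on the Hub under ids like Unbabel/Tower-Plus-9B; the exact repository id, license, and recommended prompt format should be confirmed on the model cards.

```python
# Minimal inference sketch using Hugging Face transformers (not an official Unbabel example).
# The repository id "Unbabel/Tower-Plus-9B" is an assumption; check the model card for the
# exact id, license terms, and recommended chat/prompt template before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/Tower-Plus-9B"  # hypothetical id; 2B and 72B variants are also described
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# TOWER-style models are typically prompted with an explicit translation instruction.
messages = [{"role": "user",
             "content": "Translate the following text from English into German.\n"
                        "English: The invoice must be paid within 30 days.\n"
                        "German:"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs.to(model.device), max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```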
TOWER+ Training Pipeline
The training pipeline consists of several stages:
- Continued pretraining on curated data (66% monolingual, 33% parallel, 1% instruction).
- Supervised fine-tuning that includes translation tasks and diverse instruction-following scenarios.
- Preference optimization using weighted preference optimization and group-relative policy updates.
- Reinforcement learning with verifiable rewards to ensure compliance with translation guidelines.
This comprehensive approach yields a balance between specialized translation accuracy and versatile language proficiency.
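To give a feel for the preference-optimization stage, the snippet below implements a simplified DPO-style preference loss over chosen and rejected translations. It is an intuition-building stand-in, not the weighted preference optimization or group-relative policy updates described for TOWER+.

```python
# Simplified DPO-style preference loss (illustrative; not the exact WPO/GRPO objectives
# used in TOWER+). All tensors are placeholders for per-sequence log-probabilities.
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Encourage the policy to rank the preferred translation above the rejected one,
    relative to a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)).
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with fake log-probabilities for a batch of three preference pairs.
loss = preference_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5, -15.2]),
    policy_rejected_logps=torch.tensor([-13.1, -9.0, -16.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8, -15.0]),
    ref_rejected_logps=torch.tensor([-12.9, -9.4, -15.8]),
)
print(loss.item())
```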
Benchmark Results
The TOWER+ 9B model achieved a 33.47% win rate on multilingual general chat prompts and an XCOMET-XXL score of 84.38 across 24 language pairs. The flagship 72-billion-parameter variant secured a 54.52% win rate on M-ArenaHard, an IFEval instruction-following score of 89.02, and an XCOMET-XXL score of 83.29 on the full WMT24++ benchmark. On IF-MT, a benchmark that jointly evaluates instruction following and translation, the model scored 5.55 for instruction adherence and 88.95 for translation quality, establishing state-of-the-art results among open-weight models.
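Because several of these figures are XCOMET-XXL scores, here is a brief sketch of scoring translations with Unbabel's open-source unbabel-comet package. The XCOMET-XXL checkpoint is large and gated on the Hugging Face Hub, so access requirements and exact usage should be verified against its model card.

```python
# Scoring translations with XCOMET-XXL via the unbabel-comet package (pip install unbabel-comet).
# The checkpoint is large and gated; a Hugging Face token with access may be required.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XXL")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "The invoice must be paid within 30 days.",
        "mt": "Die Rechnung muss innerhalb von 30 Tagen bezahlt werden.",
        "ref": "Die Rechnung ist innerhalb von 30 Tagen zu begleichen.",
    }
]

# Returns segment-level scores plus a corpus-level system score in [0, 1]; multiplied by 100
# it roughly corresponds to the 0-100 scale reported in the benchmarks above.
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores, output.system_score)
```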
Key Technical Highlights of TOWER+
- TOWER+ models span 2 B, 9 B, and 72 B parameters, exploring the performance frontier between translation specialization and general-purpose utility.
- The post-training pipeline integrates four stages: continued pretraining, supervised fine-tuning, weighted preference optimization, and reinforcement learning.
- Continued pretraining covers 27 languages and dialects and 47 language pairs, across 32 billion tokens.
- The 9 B variant achieved a 33.47% win rate on M-ArenaHard and an XCOMET-XXL score of 84.38 across 24 language pairs.
- The 72 B model recorded a 54.52% win rate on M-ArenaHard and an IFEval score of 89.02.
- The 2 B model matched larger baselines, posting a 6.33% win rate on M-ArenaHard.
Conclusion
TOWER+ demonstrates that translation excellence and conversational versatility can coexist within a single open-weight suite. By unifying large-scale pretraining with specialized alignment stages, the models achieve a Pareto-optimal balance across translation fidelity, instruction-following, and general chat capabilities, offering a scalable blueprint for future domain-specific LLM development.
Check out the Paper and Models for full details. All credit for this research goes to the researchers of this project.