«`html

TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

Understanding the Target Audience

The target audience for TransEvalnia includes researchers, developers, and business professionals involved in machine translation (MT) and language processing technologies. Their primary pain points are:

Difficulty in evaluating translation quality accurately.
Need for transparency in evaluation metrics beyond traditional numerical scores.
Challenges in aligning automated evaluations with human judgments.

Their goals include:

Improving translation quality assessments.
Utilizing advanced metrics to enhance decision-making.
Staying updated with the latest advancements in AI and MT technologies.

Interests may include:

Research in AI and natural language processing.
Applications of LLMs in various industries.
Best practices in translation evaluation and quality assurance.

Communication preferences lean towards technical documentation, peer-reviewed studies, and data-driven insights.

Overview of TransEvalnia

Translation systems powered by large language models (LLMs) have advanced significantly, and on some tasks their output is reported to rival or exceed that of human translators. As LLMs take on more complex tasks such as document-level or literary translation, measuring their progress becomes harder. Traditional automated metrics such as BLEU remain prevalent, but they produce a single number with no explanation of why a translation scored as it did. With translation quality nearing human levels, users want evaluations that go beyond numerical metrics and address dimensions such as accuracy, terminology, and audience suitability.

Researchers at Sakana AI have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. The system gives detailed feedback across selected dimensions of the Multidimensional Quality Metrics (MQM) framework, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. TransEvalnia has shown competitive performance against leading models like MT-Ranker across various language pairs and tasks, including English-Japanese and Chinese-English.
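To make the output format concrete, here is a minimal sketch of how per-dimension Likert scores and an overall rating might be stored and used to rank candidates. The `Evaluation` class, the dimension names in the example, and the tie-breaking rule in `rank` are illustrative assumptions, not the structures used by the TransEvalnia code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Evaluation:
    """Hypothetical container: Likert scores (1-5) for one candidate translation."""
    translation: str
    scores: dict[str, int]  # one score per quality dimension
    overall: int            # overall 1-5 rating reported alongside the dimensions

    def __post_init__(self) -> None:
        # Reject anything outside the 5-point scale described in the article.
        for name, value in {**self.scores, "overall": self.overall}.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside the 1-5 scale")


def rank(evaluations: list[Evaluation]) -> list[Evaluation]:
    """Order candidates best-first by overall score.

    Assumption: ties on the overall score are broken by the mean of the
    per-dimension scores. The article does not say how ties are resolved.
    """
    return sorted(
        evaluations,
        key=lambda e: (e.overall, mean(e.scores.values())),
        reverse=True,
    )


candidates = [
    Evaluation("Translation C", {"accuracy": 3, "terminology": 3, "audience_suitability": 4}, overall=3),
    Evaluation("Translation B", {"accuracy": 5, "terminology": 3, "audience_suitability": 4}, overall=4),
    Evaluation("Translation A", {"accuracy": 4, "terminology": 5, "audience_suitability": 4}, overall=4),
]
ranked = rank(candidates)
print([e.translation for e in ranked])
```

Keeping the per-dimension scores next to the overall rating is what lets a reader see why one candidate outranks another, which is the transparency a single BLEU-style number lacks.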

Methodology and Evaluation

The methodology evaluates translations on key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts, emotional tone replaces standard grammar checks. Translations are assessed span by span, scored on a 1–5 scale, and ranked. To mitigate position bias, where a judge favors a translation because of the order in which it is presented, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested, though it has limitations in transparency and bias.
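Position bias can be illustrated with a generic order-swap check: judge the same pair twice with the presentation order reversed, and accept the verdict only if it survives the swap. This is a common diagnostic, not the interleaving method from the study, whose details the article does not give. The `Judge` signature and both toy judges below are hypothetical stand-ins for an LLM call.

```python
from typing import Callable

# Hypothetical judge interface: takes (source, first, second) and returns
# "first" or "second". A real system would wrap an LLM prompt here.
Judge = Callable[[str, str, str], str]


def order_robust_verdict(judge: Judge, source: str, a: str, b: str) -> str:
    """Return "a", "b", or "inconsistent" after judging both presentation orders."""
    forward = judge(source, a, b)   # a is shown first
    backward = judge(source, b, a)  # b is shown first
    pick_forward = "a" if forward == "first" else "b"
    pick_backward = "b" if backward == "first" else "a"
    return pick_forward if pick_forward == pick_backward else "inconsistent"


def always_first(source: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: it always prefers the first slot."""
    return "first"


def prefers_longer(source: str, first: str, second: str) -> str:
    """Toy judge whose choice depends on content (length), not on position."""
    return "first" if len(first) >= len(second) else "second"


biased = order_robust_verdict(always_first, "src", "short", "a longer one")
stable = order_robust_verdict(prefers_longer, "src", "short", "a longer one")
print(biased, stable)
```

The fraction of pairs that come back as `"inconsistent"` gives a rough measure of how strongly a judge depends on presentation order.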

Human experts reviewed selected translations so that their judgments could be compared with the system's, showing how well it aligns with professional standards. The ranking evaluation used datasets with human scores and compared two TransEvalnia configurations (built on Qwen and Sonnet) against MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker excelled due to rich training data, but on most other datasets TransEvalnia matched or surpassed it. For instance, Qwen’s no-reasoning approach came out ahead on WMT-2023 en-de.
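One simple way to score a ranking system against human-scored data is pairwise accuracy: the share of translation pairs for which the system prefers the same candidate as the human scores do. The sketch below assumes each item holds two human scores and the system's pick, and it skips pairs the humans scored equally. Both the data layout and the tie handling are assumptions, since the article does not describe the exact protocol.

```python
def pairwise_accuracy(items: list[tuple[float, float, str]]) -> float:
    """Fraction of pairs where the system's pick matches the human preference.

    Each item is (human score for A, human score for B, system pick "A" or "B").
    Assumption: pairs with tied human scores carry no preference and are skipped.
    """
    correct = 0
    total = 0
    for human_a, human_b, system_pick in items:
        if human_a == human_b:
            continue  # no human preference to agree or disagree with
        total += 1
        human_pick = "A" if human_a > human_b else "B"
        if system_pick == human_pick:
            correct += 1
    if total == 0:
        raise ValueError("no pairs with a human preference")
    return correct / total


# Made-up scores, for illustration only.
example = [
    (80.0, 60.0, "A"),  # agrees with humans
    (50.0, 70.0, "A"),  # disagrees
    (90.0, 90.0, "B"),  # tie in human scores, skipped
    (30.0, 40.0, "B"),  # agrees
]
accuracy = pairwise_accuracy(example)
print(f"{accuracy:.3f}")
```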

Conclusion

TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs such as Claude 3.5 Sonnet and Qwen. It provides detailed scores across key quality dimensions inspired by the MQM framework and selects the better translation among the candidates. The system matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs reliable, and its scores correlated strongly with human judgments. The team also explored solutions to position bias, a persistent challenge in ranking systems, and has made all evaluation data and code publicly available.
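Agreement between system scores and human scores is often summarized with a rank correlation. The article does not say which coefficient was reported, so the sketch below uses Spearman's rho as an assumed example, implemented in plain Python with average ranks for ties. The score lists are made up.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


system_scores = [4, 5, 2, 3, 1]       # hypothetical 1-5 system ratings
human_scores = [80, 95, 40, 60, 20]   # hypothetical human scores
rho = spearman(system_scores, human_scores)
print(rho)
```

Rank correlation suits this comparison because Likert ratings and human scores sit on different scales, and only their ordering is compared.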

«`