DrBenchmark: The First-Ever Publicly Available French Biomedical Large Language Understanding Benchmark

A group of researchers in France introduced DrBenchmark to address the need for evaluating masked language models in French, particularly in the biomedical domain. NLP has advanced significantly, especially in pre-trained language models (PLMs), but evaluating these models remains difficult because evaluation protocols vary from one study to the next. The scarcity of biomedical evaluation benchmarks in languages other than English and Chinese makes this even harder. Together, these issues left a gap in evaluating the latest French biomedical models.

Existing approaches to evaluating French language models lacked standardized protocols and comprehensive benchmark datasets, leading to inconsistent results and slowing progress in NLP research. DrBenchmark is the first publicly available French biomedical language understanding benchmark. It comprises 20 diverse tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. Its primary contribution is aggregating these downstream tasks into a single benchmark, so that the intrinsic qualities of pre-trained language models can be assessed from several perspectives. The paper also evaluates eight state-of-the-art pre-trained masked language models (MLMs) trained on general or biomedical data: French generalist models, cross-lingual generalist models, French biomedical models, and an English biomedical model.

DrBenchmark offers a modular, reproducible, and easily customizable automated protocol for fair comparison among language models. It relies on the HuggingFace Datasets and Transformers libraries for data loading, fine-tuning, and evaluation. The experimental protocol ensures consistency by fine-tuning all models with the same hyperparameters for each downstream task. The experiments show that no single model excels across all tasks, while also highlighting the importance of domain-specific models for achieving peak performance in the biomedical field. Interestingly, even though French biomedical models perform best on most tasks, certain out-of-domain models and models trained in other languages remain competitive on specific tasks.
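The idea of fixing hyperparameters per task and reusing them for every model can be sketched in a few lines of plain Python. This is only an illustration of the protocol as described above, not the project's actual code: the model identifiers, task names, hyperparameter values, and the `build_runs` helper are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparams:
    """Fine-tuning settings shared by every model on a given task."""
    learning_rate: float
    batch_size: int
    epochs: int


# Hypothetical values: one fixed configuration per downstream task.
TASK_HYPERPARAMS = {
    "named-entity-recognition": Hyperparams(2e-5, 16, 10),
    "part-of-speech-tagging": Hyperparams(2e-5, 16, 10),
    "classification": Hyperparams(1e-5, 32, 5),
}

# Placeholder identifiers standing in for the four model families.
MODELS = [
    "french-generalist-model",
    "cross-lingual-generalist-model",
    "french-biomedical-model",
    "english-biomedical-model",
]


def build_runs(models, task_hparams):
    """Pair every model with every task, reusing the task's settings."""
    return [
        {"model": model, "task": task, "hparams": hparams}
        for task, hparams in task_hparams.items()
        for model in models
    ]


runs = build_runs(MODELS, TASK_HYPERPARAMS)
for run in runs:
    print(run["task"], run["model"], run["hparams"])
```

Because the settings are looked up by task rather than by model, any difference in scores on a given task comes from the model itself and not from tuning. In a real pipeline, each entry in `runs` would be handed to a fine-tuning routine.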

In conclusion, the paper presents DrBenchmark to address the lack of evaluation resources for French biomedical NLP models. By aggregating diverse downstream tasks into a comprehensive benchmark, DrBenchmark enables fair comparison among pre-trained language models. The evaluation results highlight the importance of domain-specific models for optimal performance on biomedical NLP tasks. The study also shows that certain out-of-domain models and models trained in other languages can still compete on specific tasks, underscoring the need for further study in this field.

Check out the Paper and Project page. All credit for this research goes to the researchers of this project.

The post DrBenchmark: The First-Ever Publicly Available French Biomedical Large Language Understanding Benchmark appeared first on MarkTechPost.