DrBenchmark


A Large Language Understanding Evaluation Benchmark for French Biomedical Domain

LIA - Avignon University
LS2N - Nantes University
CHU - Nantes University
STL CNRS - Lille University
Zenidoc


DrBenchmark is the first publicly available French biomedical language understanding benchmark. It encompasses 20 diverse tasks, including named-entity recognition, part-of-speech tagging, question answering, semantic textual similarity, and classification. The datasets and evaluated models are listed below, each list followed by a short loading sketch.

Datasets on HuggingFace

CAS

CLISTER

QUAERO

DEFT2020

DEFT2021

DiaMED

E3C

ESSAI

FrenchMedMCQA

MANTRAGSC

MORFITT

PxCorpus
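All of these corpora are hosted on the Hugging Face Hub. As a minimal loading sketch (assuming the standard `datasets` library; the repository identifier and the "emea" configuration shown here are illustrative and should be checked against the actual dataset cards):

from datasets import load_dataset

# Illustrative identifiers: check the dataset card for the exact repository
# name and available configurations (e.g. EMEA vs MEDLINE for QUAERO).
quaero = load_dataset("DrBenchmark/QUAERO", "emea")

print(quaero)              # DatasetDict with train / validation / test splits
print(quaero["train"][0])  # one annotated example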

Models on HuggingFace

DrBERT 7GB

DrBERT 4GB

CamemBERT-bio

CamemBERT

FlauBERT

CamemBERTa

PubMedBERT

XLM-RoBERTa
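Each of these checkpoints can be loaded with the `transformers` library before being fine-tuned on the benchmark tasks. A minimal sketch using the Dr-BERT/DrBERT-7GB identifier from the tutorial below (the head to attach, token classification here, depends on the target task):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Identifier taken from the tutorial's models.txt example.
model_name = "Dr-BERT/DrBERT-7GB"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder: set it to the tag-set size of the target task.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)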

Tutorial

Set up and execute the benchmark on your own model
# Load the PyTorch environment (Jean Zay / SLURM cluster modules)
module purge
module load pytorch-gpu/py3/1.12.1

# Fetch the benchmark and install its dependencies
git clone https://github.com/DrBenchmark/DrBenchmark.git
cd DrBenchmark
pip install -r requirements.txt

# List the Hugging Face identifiers of the models to evaluate
echo "Dr-BERT/DrBERT-7GB" > ./models.txt

# If the node is offline (default), pre-download the datasets and models locally
python download_datasets_locally.py
python download_models_locally.py

# Adjust the SLURM script, then submit it
nano run_all_jean_zay.sh # replace the account identifier, models and array size
sbatch run_all_jean_zay.sh
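Since the compute nodes are assumed to be offline, the two download scripts cache the resources beforehand. A quick way to verify that a model listed in models.txt is usable without network access (a sketch assuming the model was cached through the Hugging Face hub; adapt it if the download script stores files elsewhere):

from transformers import AutoModel, AutoTokenizer

# local_files_only forces transformers to rely on the local cache only,
# mimicking the offline compute node.
tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB", local_files_only=True)
model = AutoModel.from_pretrained("Dr-BERT/DrBERT-7GB", local_files_only=True)
print("Model and tokenizer are available from the local cache.")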

Model evaluation

DrBenchmark Overview

Our proposed benchmark comprises 20 French biomedical language understanding tasks, one of which was created specifically for this benchmark. The descriptions and statistics of these tasks are presented below; a short example of the SeqEval F1 computation follows the table:

| Dataset | Task | Metric | Train | Validation | Test | License |
|---|---|---|---|---|---|---|
| CAS | POS tagging | SeqEval F1 | 2,653 | 379 | 758 | DUA |
| ESSAI | POS tagging | SeqEval F1 | 5,072 | 725 | 1,450 | DUA |
| QUAERO | NER - EMEA | SeqEval F1 | 429 | 389 | 348 | GFDL 1.3 |
| QUAERO | NER - MEDLINE | SeqEval F1 | 833 | 832 | 833 | GFDL 1.3 |
| E3C | NER - Clinical | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
| E3C | NER - Temporal | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
| MorFITT | Multi-label Classification | Weighted F1 | 1,514 | 1,022 | 1,088 | CC BY-SA 4.0 |
| FrenchMedMCQA | Question-Answering | Hamming / EMR | 2,171 | 312 | 622 | Apache 2.0 |
| FrenchMedMCQA | Multi-class Classification | Weighted F1 | 2,171 | 312 | 622 | Apache 2.0 |
| Mantra-GSC | NER - EMEA | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
| Mantra-GSC | NER - Medline | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
| Mantra-GSC | NER - Patents | SeqEval F1 | 35 | 5 | 10 | CC BY 4.0 |
| CLISTER | Semantic Textual Similarity | EDRM / Spearman | 499 | 101 | 400 | DUA |
| DEFT-2020 | Semantic Textual Similarity | EDRM / Spearman | 498 | 102 | 410 | DUA |
| DEFT-2020 | Multi-class Classification | Weighted F1 | 460 | 112 | 530 | DUA |
| DEFT-2021 | Multi-label Classification | Weighted F1 | 118 | 49 | 108 | DUA |
| DEFT-2021 | NER | SeqEval F1 | 2,153 | 793 | 1,766 | DUA |
| DiaMed | Multi-class Classification | Weighted F1 | 509 | 76 | 154 | CC BY-SA 4.0 |
| PxCorpus | NER | SeqEval F1 | 1,386 | 198 | 397 | CC BY 4.0 |
| PxCorpus | Multi-class Classification | Weighted F1 | 1,386 | 198 | 397 | CC BY 4.0 |
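The sequence-labeling tasks (POS tagging and NER) are scored with SeqEval F1. An illustrative computation using the `evaluate` wrapper around seqeval (the entity labels below are made up for the example):

import evaluate

seqeval = evaluate.load("seqeval")

# Toy IOB2-tagged sentences; real tag sets come from each dataset.
references  = [["O", "B-DISO", "I-DISO", "O"], ["B-CHEM", "O"]]
predictions = [["O", "B-DISO", "O", "O"], ["B-CHEM", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # entity-level F1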

Comparison of model performance

All models are fine-tuned following a strict protocol that uses the same hyperparameters for each downstream task. The reported results are obtained by averaging the scores of four separate runs, ensuring robustness and reliability. A minimal, illustrative fine-tuning sketch follows the results table.

| Dataset | Task | Baseline | CamemBERT | CamemBERTa | FlauBERT | DrBERT-FS | DrBERT-CP | CamemBERT-bio | PubMedBERT | XLM-RoBERTa |
|---|---|---|---|---|---|---|---|---|---|---|
| CAS | POS | 23.50 | 95.53 | 96.56 | 95.22 | 96.93 | 96.46 | 95.22 | 94.82 | 96.91 |
| ESSAI | POS | 26.31 | 97.38 | 98.08 | 97.05 | 98.41 | 98.01 | 97.39 | 97.42 | 98.34 |
| QUAERO | NER EMEA | 8.37 | 62.68 | 64.86 | 74.86 | 64.11 | 67.05 | 66.59 | 53.19 | 64.47 |
| QUAERO | NER MEDLINE | 4.92 | 55.25 | 55.60 | 48.98 | 55.82 | 60.10 | 58.94 | 53.26 | 51.12 |
| E3C | NER Clinical | 4.47 | 54.70 | 55.53 | 47.61 | 54.45 | 56.55 | 56.96 | 38.34 | 52.87 |
| E3C | NER Temporal | 21.74 | 83.45 | 83.22 | 61.64 | 81.48 | 83.43 | 83.44 | 80.86 | 82.60 |
| MorFITT | Multi-Label CLS | 3.24 | 64.21 | 66.28 | 70.25 | 68.70 | 70.99 | 67.53 | 68.58 | 67.28 |
| FrenchMedMCQA | MCQA | 21.83 / 11.57 | 28.53 / 2.25 | 29.77 / 2.57 | 27.88 / 2.09 | 31.07 / 3.22 | 32.41 / 2.89 | 35.30 / 1.45 | 32.90 / 1.61 | 34.74 / 2.09 |
| FrenchMedMCQA | CLS | 8.37 | 66.21 | 64.44 | 61.88 | 65.38 | 66.22 | 65.79 | 65.41 | 64.69 |
| MantraGSC | NER FR EMEA | 0.00 | 29.14 | 40.84 | 66.20 | 66.23 | 60.88 | 30.63 | 40.14 | 52.64 |
| MantraGSC | NER FR Medline | 7.78 | 23.20 | 22.55 | 20.69 | 42.38 | 35.52 | 23.66 | 27.53 | 18.73 |
| MantraGSC | NER FR Patents | 6.20 | 0.00 | 44.16 | 31.47 | 57.34 | 39.68 | 0.00 | 4.51 | 8.58 |
| CLISTER | STS | 0.44 / 0.00 | 0.55 / 0.33 | 0.56 / 0.47 | 0.50 / 0.29 | 0.62 / 0.57 | 0.60 / 0.49 | 0.54 / 0.26 | 0.70 / 0.78 | 0.49 / 0.23 |
| DEFT-2020 | STS | 0.49 / 0.00 | 0.59 / 0.58 | 0.59 / 0.43 | 0.58 / 0.51 | 0.72 / 0.81 | 0.73 / 0.86 | 0.58 / 0.32 | 0.78 / 0.86 | 0.60 / 0.26 |
| DEFT-2020 | CLS | 14.00 | 96.31 | 97.96 | 42.37 | 82.38 | 95.71 | 94.78 | 95.33 | 67.66 |
| DEFT-2021 | Multi-Label CLS | 24.49 | 18.04 | 18.04 | 39.21 | 34.15 | 30.04 | 17.82 | 25.53 | 24.46 |
| DEFT-2021 | NER | 0.00 | 62.76 | 62.61 | 33.51 | 60.44 | 63.43 | 64.36 | 60.27 | 60.32 |
| DiaMED | CLS | 15.36 | 30.40 | 24.05 | 34.08 | 60.45 | 54.43 | 39.57 | 54.96 | 26.69 |
| PxCorpus | NER | 10.00 | 92.89 | 95.05 | 47.57 | 95.88 | 71.38 | 93.08 | 94.66 | 95.80 |
| PxCorpus | CLS | 84.78 | 94.41 | 93.95 | 93.45 | 94.43 | 94.52 | 94.49 | 93.12 | 93.91 |
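As a rough sketch of this protocol (not the benchmark's actual scripts; the hyperparameter values are placeholders), fine-tuning one model on one classification task with the Hugging Face Trainer and averaging four seeded runs could look like:

import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def weighted_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

def run_one_seed(seed, train_ds, eval_ds, num_labels):
    # Same hyperparameters for every task and every seed; the values below
    # are placeholders, not the benchmark's actual settings.
    model = AutoModelForSequenceClassification.from_pretrained(
        "Dr-BERT/DrBERT-7GB", num_labels=num_labels)
    args = TrainingArguments(output_dir=f"./runs/seed_{seed}", seed=seed,
                             num_train_epochs=3, learning_rate=2e-5,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, compute_metrics=weighted_f1,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()["eval_weighted_f1"]

# train_ds / eval_ds are assumed to be already tokenized datasets.
# scores = [run_one_seed(s, train_ds, eval_ds, num_labels) for s in (0, 1, 2, 3)]
# print(np.mean(scores))  # reported scores are the average of four runs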

Recent Publications

Labrak et al. (2024). DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain. LREC-COLING 2024.

arXiv | HAL | Code | HuggingFace

Citation (BibTeX)

@inproceedings{labrak:hal-04470938,
  TITLE = {{DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain}},
  AUTHOR = {Labrak, Yanis and Bazoge, Adrien and El Khettari, Oumaima and Rouvier, Mickael and Constant Dit Beaufils, Pac{\^o}me and Grabar, Natalia and Daille, B{\'e}atrice and Quiniou, Solen and Morin, Emmanuel and Gourraud, Pierre-antoine and Dufour, Richard},
  BOOKTITLE = {{Fourteenth Language Resources and Evaluation Conference (LREC-COLING 2024)}},
  ADDRESS = {Torino, Italy},
  YEAR = {2024},
}

License

The DrBenchmark toolkit and the models have been publicly released online under the CC0 1.0 license.