DrBenchmark is the first publicly available French biomedical language understanding benchmark. It encompasses 20 diverse tasks, covering named-entity recognition, part-of-speech tagging, question answering, semantic textual similarity, and classification.
Datasets: CAS, CLISTER, QUAERO, DEFT2020, DEFT2021, DiaMED, E3C, ESSAI, FrenchMedMCQA, MANTRAGSC, MORFITT, PxCorpus.
Evaluated models: DrBERT 7GB, DrBERT 4GB, CamemBERT-bio, CamemBERT, FlauBERT, CamemBERTa, PubMedBERT, XLM-RoBERTa.
To install and run the benchmark on the Jean Zay cluster:
module purge
module load pytorch-gpu/py3/1.12.1
git clone https://github.com/DrBenchmark/DrBenchmark.git
cd DrBenchmark
pip install -r requirements.txt
echo "Dr-BERT/DrBERT-7GB" > ./models.txt
# If the node is offline (default)
python download_datasets_locally.py
python download_models_locally.py
nano run_all_jean_zay.sh # replace the account identifier, models and array size
sbatch run_all_jean_zay.sh
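To evaluate several of the models listed above in a single submission, `models.txt` simply contains one Hugging Face identifier per line. The sketch below mirrors the `echo` step in Python; apart from `Dr-BERT/DrBERT-7GB`, which appears in the commands above, the model identifiers are assumptions and should be checked against the actual repositories on the Hugging Face Hub.

```python
# Write the list of models to evaluate, one Hugging Face identifier per line,
# mirroring the `echo ... > ./models.txt` step above.
# Only "Dr-BERT/DrBERT-7GB" comes from the instructions; the other identifiers
# are illustrative assumptions and may differ from the real repository names.
models = [
    "Dr-BERT/DrBERT-7GB",            # DrBERT 7GB (from the example above)
    "camembert-base",                # assumed identifier for CamemBERT
    "almanach/camembert-bio-base",   # assumed identifier for CamemBERT-bio
]

with open("models.txt", "w", encoding="utf-8") as handle:
    handle.write("\n".join(models) + "\n")
```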
Our benchmark comprises 20 French biomedical language understanding tasks, one of which was created specifically for this benchmark. The descriptions and statistics of these tasks are presented below:
Dataset | Task | Metric | Train | Validation | Test | License |
---|---|---|---|---|---|---|
CAS | POS tagging | SeqEval F1 | 2,653 | 379 | 758 | DUA |
ESSAI | POS tagging | SeqEval F1 | 5,072 | 725 | 1,450 | DUA |
QUAERO | NER - EMEA | SeqEval F1 | 429 | 389 | 348 | GFDL 1.3 |
QUAERO | NER - MEDLINE | SeqEval F1 | 833 | 832 | 833 | GFDL 1.3 |
E3C | NER - Clinical | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
E3C | NER - Temporal | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
MorFITT | Multi-label Classification | Weighted F1 | 1,514 | 1,022 | 1,088 | CC BY-SA 4.0 |
FrenchMedMCQA | Question-Answering | Hamming / EMR | 2,171 | 312 | 622 | Apache 2.0 |
FrenchMedMCQA | Multi-class Classification | Weighted F1 | 2,171 | 312 | 622 | Apache 2.0 |
Mantra-GSC | NER - EMEA | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
Mantra-GSC | NER - Medline | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
Mantra-GSC | NER - Patents | SeqEval F1 | 35 | 5 | 10 | CC BY 4.0 |
CLISTER | Semantic Textual Similarity | EDRM / Spearman | 499 | 101 | 400 | DUA |
DEFT-2020 | Semantic Textual Similarity | EDRM / Spearman | 498 | 102 | 410 | DUA |
DEFT-2020 | Multi-class Classification | Weighted F1 | 460 | 112 | 530 | DUA |
DEFT-2021 | Multi-label Classification | Weighted F1 | 118 | 49 | 108 | DUA |
DEFT-2021 | NER | SeqEval F1 | 2,153 | 793 | 1,766 | DUA |
DiaMed | Multi-class Classification | Weighted F1 | 509 | 76 | 154 | CC BY-SA 4.0 |
PxCorpus | NER | SeqEval F1 | 1,386 | 198 | 397 | CC BY 4.0 |
PxCorpus | Multi-class Classification | Weighted F1 | 1,386 | 198 | 397 | CC BY 4.0 |
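Most tasks in the table above are scored with the entity-level F1 computed by the seqeval library, listed as "SeqEval F1" in the Metric column. The snippet below is a minimal, self-contained illustration of that metric on toy label sequences; the entity labels are invented and this is not the benchmark's own evaluation code.

```python
# Minimal illustration of the "SeqEval F1" metric used for the NER and POS tasks.
# Requires: pip install seqeval. The labels below are invented for the example.
from seqeval.metrics import f1_score

gold = [["B-DISO", "I-DISO", "O", "B-CHEM"]]   # two gold entities
pred = [["B-DISO", "I-DISO", "O", "O"]]        # only one of them is retrieved

# Entity-level micro F1: precision = 1/1, recall = 1/2, F1 ~ 0.67
print(round(f1_score(gold, pred), 2))
```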
All models are fine-tuned following a strict protocol that uses the same hyperparameters for every downstream task. The reported results are obtained by averaging the scores of four separate runs, which improves robustness and reliability.
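As a concrete illustration of this aggregation, each reported number is simply the mean of the metric over the four runs; the values below are invented for the example.

```python
# Example aggregation of one task's metric over four fine-tuning runs
# (the scores are made up for illustration).
run_scores = [70.1, 71.4, 70.8, 71.7]
print(sum(run_scores) / len(run_scores))  # 71.0
```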
Dataset | Task | Baseline | CamemBERT | CamemBERTa | FlauBERT | DrBERT-FS | DrBERT-CP | CamemBERT-bio | PubMedBERT | XLM-RoBERTa |
---|---|---|---|---|---|---|---|---|---|---|
CAS | POS | 23.50 | 95.53 | 96.56 | 95.22 | 96.93 | 96.46 | 95.22 | 94.82 | 96.91 |
ESSAI | POS | 26.31 | 97.38 | 98.08 | 97.05 | 98.41 | 98.01 | 97.39 | 97.42 | 98.34 |
QUAERO | NER EMEA | 8.37 | 62.68 | 64.86 | 74.86 | 64.11 | 67.05 | 66.59 | 53.19 | 64.47 |
QUAERO | NER MEDLINE | 4.92 | 55.25 | 55.60 | 48.98 | 55.82 | 60.10 | 58.94 | 53.26 | 51.12 |
E3C | NER Clinical | 4.47 | 54.70 | 55.53 | 47.61 | 54.45 | 56.55 | 56.96 | 38.34 | 52.87 |
E3C | NER Temporal | 21.74 | 83.45 | 83.22 | 61.64 | 81.48 | 83.43 | 83.44 | 80.86 | 82.60 |
MorFITT | Multi-Label CLS | 3.24 | 64.21 | 66.28 | 70.25 | 68.70 | 70.99 | 67.53 | 68.58 | 67.28 |
FrenchMedMCQA | MCQA | 21.83 / 11.57 | 28.53 / 2.25 | 29.77 / 2.57 | 27.88 / 2.09 | 31.07 / 3.22 | 32.41 / 2.89 | 35.30 / 1.45 | 32.90 / 1.61 | 34.74 / 2.09 |
FrenchMedMCQA | CLS | 8.37 | 66.21 | 64.44 | 61.88 | 65.38 | 66.22 | 65.79 | 65.41 | 64.69 |
MantraGSC | NER FR EMEA | 0.00 | 29.14 | 40.84 | 66.20 | 66.23 | 60.88 | 30.63 | 40.14 | 52.64 |
MantraGSC | NER FR Medline | 7.78 | 23.20 | 22.55 | 20.69 | 42.38 | 35.52 | 23.66 | 27.53 | 18.73 |
MantraGSC | NER FR Patents | 6.20 | 0.00 | 44.16 | 31.47 | 57.34 | 39.68 | 0.00 | 4.51 | 8.58 |
CLISTER | STS | 0.44 / 0.00 | 0.55 / 0.33 | 0.56 / 0.47 | 0.50 / 0.29 | 0.62 / 0.57 | 0.60 / 0.49 | 0.54 / 0.26 | 0.70 / 0.78 | 0.49 / 0.23 |
DEFT-2020 | STS | 0.49 / 0.00 | 0.59 / 0.58 | 0.59 / 0.43 | 0.58 / 0.51 | 0.72 / 0.81 | 0.73 / 0.86 | 0.58 / 0.32 | 0.78 / 0.86 | 0.60 / 0.26 |
DEFT-2020 | CLS | 14.00 | 96.31 | 97.96 | 42.37 | 82.38 | 95.71 | 94.78 | 95.33 | 67.66 |
DEFT-2021 | Multi-Label CLS | 24.49 | 18.04 | 18.04 | 39.21 | 34.15 | 30.04 | 17.82 | 25.53 | 24.46 |
DEFT-2021 | NER | 0.00 | 62.76 | 62.61 | 33.51 | 60.44 | 63.43 | 64.36 | 60.27 | 60.32 |
DiaMED | CLS | 15.36 | 30.40 | 24.05 | 34.08 | 60.45 | 54.43 | 39.57 | 54.96 | 26.69 |
PxCorpus | NER | 10.00 | 92.89 | 95.05 | 47.57 | 95.88 | 71.38 | 93.08 | 94.66 | 95.80 |
PxCorpus | CLS | 84.78 | 94.41 | 93.95 | 93.45 | 94.43 | 94.52 | 94.49 | 93.12 | 93.91 |
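For the FrenchMedMCQA rows above, the two numbers in each cell are the Hamming score and the exact match ratio (EMR). The sketch below illustrates one common definition of each for multi-answer prediction; the exact formulas implemented in the toolkit may differ, so treat this as an explanatory example rather than the benchmark's evaluation code.

```python
# Illustrative computation of EMR and a Hamming-style score for multi-answer
# MCQA predictions. These are common definitions, not necessarily the exact
# formulas used by DrBenchmark. The answer sets are invented.
gold = [{"a", "c"}, {"b"}, {"a", "d", "e"}]
pred = [{"a", "c"}, {"b", "c"}, {"a"}]

# Exact match ratio: fraction of questions whose predicted answer set is exactly right.
emr = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hamming-style score: per-question overlap |gold & pred| / |gold | pred|, averaged.
hamming = sum(len(g & p) / len(g | p) for g, p in zip(gold, pred)) / len(gold)

print(f"EMR = {emr:.2f}, Hamming score = {hamming:.2f}")  # EMR = 0.33, Hamming score = 0.61
```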
@inproceedings{labrak:hal-04470938,
  title     = {{DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain}},
  author    = {Labrak, Yanis and Bazoge, Adrien and El Khettari, Oumaima and Rouvier, Mickael and Constant Dit Beaufils, Pac{\^o}me and Grabar, Natalia and Daille, B{\'e}atrice and Quiniou, Solen and Morin, Emmanuel and Gourraud, Pierre-antoine and Dufour, Richard},
  booktitle = {{Fourteenth Language Resources and Evaluation Conference (LREC-COLING 2024)}},
  address   = {Torino, Italy},
  year      = {2024},
}
The DrBenchmark toolkit and the models have been publicly released online under a CC0 1.0 license.