DrBenchmark


A Large Language Understanding Evaluation Benchmark for French Biomedical Domain

LIA - Avignon University
LS2N - Nantes University
CHU - Nantes University
STL CNRS - Lille University
Zenidoc


DrBenchmark is the first publicly available French biomedical language understanding benchmark. It encompasses 20 diverse tasks, including named-entity recognition, part-of-speech tagging, question answering, semantic textual similarity, and classification. The datasets and evaluated models are listed below, each list followed by a short loading sketch.

Datasets on HuggingFace

CAS

CLISTER

QUAERO

DEFT2020

DEFT2021

DiaMED

E3C

ESSAI

FrenchMedMCQA

MANTRAGSC

MORFITT

PxCorpus
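All of these corpora are hosted on the Hugging Face Hub. As a minimal loading sketch (assuming the standard `datasets` library; the repository identifier and the "emea" configuration shown here are illustrative and should be checked against the actual dataset cards):

from datasets import load_dataset

# Illustrative identifiers: check the dataset card for the exact repository
# name and available configurations (e.g. EMEA vs MEDLINE for QUAERO).
quaero = load_dataset("DrBenchmark/QUAERO", "emea")

print(quaero)              # DatasetDict with train / validation / test splits
print(quaero["train"][0])  # one annotated example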

Models on HuggingFace

DrBERT 7GB

DrBERT 4GB

CamemBERT-bio

CamemBERT

FlauBERT

CamemBERTa

PubMedBERT

XLM-RoBERTa
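Each of these checkpoints can be loaded with the `transformers` library before being fine-tuned on the benchmark tasks. A minimal sketch using the Dr-BERT/DrBERT-7GB identifier from the tutorial below (the head to attach, token classification here, depends on the target task):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Identifier taken from the tutorial's models.txt example.
model_name = "Dr-BERT/DrBERT-7GB"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder: set it to the tag-set size of the target task.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)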

Tutorial

Set up and execute the benchmark on your own model
# Load the PyTorch environment (Jean Zay / SLURM cluster modules)
module purge
module load pytorch-gpu/py3/1.12.1

# Fetch the benchmark and install its dependencies
git clone https://github.com/DrBenchmark/DrBenchmark.git
cd DrBenchmark
pip install -r requirements.txt

# List the Hugging Face identifiers of the models to evaluate
echo "Dr-BERT/DrBERT-7GB" > ./models.txt

# If the node is offline (default), pre-download the datasets and models locally
python download_datasets_locally.py
python download_models_locally.py

# Adjust the SLURM script, then submit it
nano run_all_jean_zay.sh # replace the account identifier, models and array size
sbatch run_all_jean_zay.sh
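Since the compute nodes are assumed to be offline, the two download scripts cache the resources beforehand. A quick way to verify that a model listed in models.txt is usable without network access (a sketch assuming the model was cached through the Hugging Face hub; adapt it if the download script stores files elsewhere):

from transformers import AutoModel, AutoTokenizer

# local_files_only forces transformers to rely on the local cache only,
# mimicking the offline compute node.
tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB", local_files_only=True)
model = AutoModel.from_pretrained("Dr-BERT/DrBERT-7GB", local_files_only=True)
print("Model and tokenizer are available from the local cache.")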

Model evaluation

DrBenchmark Overview

Our proposed benchmark comprises 20 French biomedical language understanding tasks, one of which was created specifically for this benchmark. The descriptions and statistics of these tasks are presented below; a short example of the SeqEval F1 computation follows the table:

| Dataset | Task | Metric | Train | Validation | Test | License |
|---|---|---|---|---|---|---|
| CAS | POS tagging | SeqEval F1 | 2,653 | 379 | 758 | DUA |
| ESSAI | POS tagging | SeqEval F1 | 5,072 | 725 | 1,450 | DUA |
| QUAERO | NER - EMEA | SeqEval F1 | 429 | 389 | 348 | GFDL 1.3 |
| QUAERO | NER - MEDLINE | SeqEval F1 | 833 | 832 | 833 | GFDL 1.3 |
| E3C | NER - Clinical | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
| E3C | NER - Temporal | SeqEval F1 | 969 | 140 | 293 | CC BY-NC |
| MorFITT | Multi-label Classification | Weighted F1 | 1,514 | 1,022 | 1,088 | CC BY-SA 4.0 |
| FrenchMedMCQA | Question-Answering | Hamming / EMR | 2,171 | 312 | 622 | Apache 2.0 |
| FrenchMedMCQA | Multi-class Classification | Weighted F1 | 2,171 | 312 | 622 | Apache 2.0 |
| Mantra-GSC | NER - EMEA | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
| Mantra-GSC | NER - Medline | SeqEval F1 | 70 | 10 | 20 | CC BY 4.0 |
| Mantra-GSC | NER - Patents | SeqEval F1 | 35 | 5 | 10 | CC BY 4.0 |
| CLISTER | Semantic Textual Similarity | EDRM / Spearman | 499 | 101 | 400 | DUA |
| DEFT-2020 | Semantic Textual Similarity | EDRM / Spearman | 498 | 102 | 410 | DUA |
| DEFT-2020 | Multi-class Classification | Weighted F1 | 460 | 112 | 530 | DUA |
| DEFT-2021 | Multi-label Classification | Weighted F1 | 118 | 49 | 108 | DUA |
| DEFT-2021 | NER | SeqEval F1 | 2,153 | 793 | 1,766 | DUA |
| DiaMed | Multi-class Classification | Weighted F1 | 509 | 76 | 154 | CC BY-SA 4.0 |
| PxCorpus | NER | SeqEval F1 | 1,386 | 198 | 397 | CC BY 4.0 |
| PxCorpus | Multi-class Classification | Weighted F1 | 1,386 | 198 | 397 | CC BY 4.0 |
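The sequence-labeling tasks (POS tagging and NER) are scored with SeqEval F1. An illustrative computation using the `evaluate` wrapper around seqeval (the entity labels below are made up for the example):

import evaluate

seqeval = evaluate.load("seqeval")

# Toy IOB2-tagged sentences; real tag sets come from each dataset.
references  = [["O", "B-DISO", "I-DISO", "O"], ["B-CHEM", "O"]]
predictions = [["O", "B-DISO", "O", "O"], ["B-CHEM", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # entity-level F1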

Comparison of model performance

All models are fine-tuned following a strict protocol that uses the same hyperparameters for each downstream task. The reported results are obtained by averaging the scores of four separate runs, ensuring robustness and reliability. A minimal, illustrative fine-tuning sketch follows the results table.

| Dataset | Task | Baseline | CamemBERT | CamemBERTa | FlauBERT | DrBERT-FS | DrBERT-CP | CamemBERT-bio | PubMedBERT | XLM-RoBERTa |
|---|---|---|---|---|---|---|---|---|---|---|
| CAS | POS | 23.50 | 95.53 | 96.56 | 95.22 | 96.93 | 96.46 | 95.22 | 94.82 | 96.91 |
| ESSAI | POS | 26.31 | 97.38 | 98.08 | 97.05 | 98.41 | 98.01 | 97.39 | 97.42 | 98.34 |
| QUAERO | NER EMEA | 8.37 | 62.68 | 64.86 | 74.86 | 64.11 | 67.05 | 66.59 | 53.19 | 64.47 |
| QUAERO | NER MEDLINE | 4.92 | 55.25 | 55.60 | 48.98 | 55.82 | 60.10 | 58.94 | 53.26 | 51.12 |
| E3C | NER Clinical | 4.47 | 54.70 | 55.53 | 47.61 | 54.45 | 56.55 | 56.96 | 38.34 | 52.87 |
| E3C | NER Temporal | 21.74 | 83.45 | 83.22 | 61.64 | 81.48 | 83.43 | 83.44 | 80.86 | 82.60 |
| MorFITT | Multi-Label CLS | 3.24 | 64.21 | 66.28 | 70.25 | 68.70 | 70.99 | 67.53 | 68.58 | 67.28 |
| FrenchMedMCQA | MCQA | 21.83 / 11.57 | 28.53 / 2.25 | 29.77 / 2.57 | 27.88 / 2.09 | 31.07 / 3.22 | 32.41 / 2.89 | 35.30 / 1.45 | 32.90 / 1.61 | 34.74 / 2.09 |
| FrenchMedMCQA | CLS | 8.37 | 66.21 | 64.44 | 61.88 | 65.38 | 66.22 | 65.79 | 65.41 | 64.69 |
| MantraGSC | NER FR EMEA | 0.00 | 29.14 | 40.84 | 66.20 | 66.23 | 60.88 | 30.63 | 40.14 | 52.64 |
| MantraGSC | NER FR Medline | 7.78 | 23.20 | 22.55 | 20.69 | 42.38 | 35.52 | 23.66 | 27.53 | 18.73 |
| MantraGSC | NER FR Patents | 6.20 | 0.00 | 44.16 | 31.47 | 57.34 | 39.68 | 0.00 | 4.51 | 8.58 |
| CLISTER | STS | 0.44 / 0.00 | 0.55 / 0.33 | 0.56 / 0.47 | 0.50 / 0.29 | 0.62 / 0.57 | 0.60 / 0.49 | 0.54 / 0.26 | 0.70 / 0.78 | 0.49 / 0.23 |
| DEFT-2020 | STS | 0.49 / 0.00 | 0.59 / 0.58 | 0.59 / 0.43 | 0.58 / 0.51 | 0.72 / 0.81 | 0.73 / 0.86 | 0.58 / 0.32 | 0.78 / 0.86 | 0.60 / 0.26 |
| DEFT-2020 | CLS | 14.00 | 96.31 | 97.96 | 42.37 | 82.38 | 95.71 | 94.78 | 95.33 | 67.66 |
| DEFT-2021 | Multi-Label CLS | 24.49 | 18.04 | 18.04 | 39.21 | 34.15 | 30.04 | 17.82 | 25.53 | 24.46 |
| DEFT-2021 | NER | 0.00 | 62.76 | 62.61 | 33.51 | 60.44 | 63.43 | 64.36 | 60.27 | 60.32 |
| DiaMED | CLS | 15.36 | 30.40 | 24.05 | 34.08 | 60.45 | 54.43 | 39.57 | 54.96 | 26.69 |
| PxCorpus | NER | 10.00 | 92.89 | 95.05 | 47.57 | 95.88 | 71.38 | 93.08 | 94.66 | 95.80 |
| PxCorpus | CLS | 84.78 | 94.41 | 93.95 | 93.45 | 94.43 | 94.52 | 94.49 | 93.12 | 93.91 |
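As a rough sketch of this protocol (not the benchmark's actual scripts; the hyperparameter values are placeholders), fine-tuning one model on one classification task with the Hugging Face Trainer and averaging four seeded runs could look like:

import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def weighted_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

def run_one_seed(seed, train_ds, eval_ds, num_labels):
    # Same hyperparameters for every task and every seed; the values below
    # are placeholders, not the benchmark's actual settings.
    model = AutoModelForSequenceClassification.from_pretrained(
        "Dr-BERT/DrBERT-7GB", num_labels=num_labels)
    args = TrainingArguments(output_dir=f"./runs/seed_{seed}", seed=seed,
                             num_train_epochs=3, learning_rate=2e-5,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, compute_metrics=weighted_f1,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()["eval_weighted_f1"]

# train_ds / eval_ds are assumed to be already tokenized datasets.
# scores = [run_one_seed(s, train_ds, eval_ds, num_labels) for s in (0, 1, 2, 3)]
# print(np.mean(scores))  # reported scores are the average of four runs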

Recent Publications

Labrak et al. (2024). DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain. LREC-COLING 2024.

arXiv | HAL | Code | HuggingFace

Citation (BibTeX)

@inproceedings{labrak:hal-04470938,
  TITLE = {{DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain}},
  AUTHOR = {Labrak, Yanis and Bazoge, Adrien and El Khettari, Oumaima and Rouvier, Mickael and Constant Dit Beaufils, Pac{\^o}me and Grabar, Natalia and Daille, B{\'e}atrice and Quiniou, Solen and Morin, Emmanuel and Gourraud, Pierre-antoine and Dufour, Richard},
  BOOKTITLE = {{Fourteenth Language Resources and Evaluation Conference (LREC-COLING 2024)}},
  ADDRESS = {Torino, Italy},
  YEAR = {2024},
}

License

The DrBenchmark toolkit and the models have been publicly released online under the CC0 1.0 license.