In recent years, Large Language Models (LLMs) have shown significant flexibility, opening up possibilities in niche areas like healthcare and medicine. Despite the existence of several open-source LLMs designed for healthcare applications, customizing general LLMs for the medical sector remains a challenge. This paper introduces BioMistral, a specialized open-source LLM for the biomedical field, built on the Mistral model and further refined with data from PubMed Central. We perform an extensive evaluation of BioMistral LLM against 10 recognized medical question-answering (QA) tasks in English. Additionally, we examine more compact versions of the model created through techniques such as quantization and merging. Our findings highlight BioMistral’s enhanced effectiveness over other open-source models for medical use and its ability to compete with proprietary options. To tackle the scarcity of non-English data and to evaluate the cross-language applicability of medical LLMs, we translated our benchmark into seven additional languages, conducting the first broad multilingual assessment in this field. We make all datasets, benchmarks, scripts, and models from our study available for public use.
Please note that while BioMistral aims to encapsulate and disseminate medical knowledge from reliable sources, it is not yet optimized to do so within the strict professional standards required for medical practice. We caution against using BioMistral for direct medical applications until it has been rigorously tested and aligned with specific use scenarios, including being validated through randomized controlled trials in clinical settings. There are unexplored risks and biases within BioMistral 7B, and its performance in actual clinical environments is untested. Therefore, we advise treating BioMistral 7B solely as a tool for academic research and recommend against its use in operational settings for generating natural language or for any formal healthcare and medical purposes.
BioMistral Models Overview
BioMistral comprises a collection of models based on Mistral, further refined with biomedical texts from PubMed Central Open Access under various licenses. These models have been developed with the support of the CNRS Jean Zay French HPC.
| Model Name | Base Model | Model Type | Sequence Length | Download Location |
|---|---|---|---|---|
| BioMistral-7B | Mistral-7B-Instruct-v0.1 | Further Pre-trained | 2048 | HuggingFace |
| BioMistral-7B-DARE | Mistral-7B-Instruct-v0.1 | Merge DARE | 2048 | HuggingFace |
| BioMistral-7B-TIES | Mistral-7B-Instruct-v0.1 | Merge TIES | 2048 | HuggingFace |
| BioMistral-7B-SLERP | Mistral-7B-Instruct-v0.1 | Merge SLERP | 2048 | HuggingFace |
Quantized Models
The table below lists the quantized versions of the BioMistral-7B models, the method used to reduce each model’s size and computational requirements, and the resulting VRAM usage and execution time relative to the base model. These models are intended for efficient deployment, particularly in environments with limited hardware resources.
- Base Model: This column specifies the original model before quantization.
- Method: The numeric format or quantization approach used: FP16/BF16 (unquantized 16-bit floating point), AWQ (Activation-aware Weight Quantization), or BnB (bitsandbytes).
- q_group_size: For methods involving quantization groups, this column indicates the size of those groups.
- w_bit: The bit-width used for weight quantization, indicating the precision level of the model weights after quantization.
- version: Specifies the particular technique or variant used within a method, such as GEMM (General Matrix Multiply) or GEMV (General Matrix-Vector Multiply) for the AWQ method.
- VRAM GB: The amount of Video RAM (in gigabytes) required to run the model, demonstrating the model’s memory efficiency.
- Time: The execution time relative to the base model (x1.00). Values above x1.00 mean the model takes longer to run than the base model.
- Download: Points to the location where these models can be accessed, in this case, HuggingFace, a popular hub for sharing machine learning models.
| Base Model | Method | q_group_size | w_bit | version | VRAM GB | Time | Download |
|---|---|---|---|---|---|---|---|
| BioMistral-7B | FP16/BF16 | | | | 15.02 | x1.00 | HuggingFace |
| BioMistral-7B | AWQ | 128 | 4 | GEMM | 4.68 | x1.41 | HuggingFace |
| BioMistral-7B | AWQ | 128 | 4 | GEMV | 4.68 | x10.30 | HuggingFace |
| BioMistral-7B | BnB.4 | | 4 | | 5.03 | x3.25 | HuggingFace |
| BioMistral-7B | BnB.8 | | 8 | | 8.04 | x4.34 | HuggingFace |
| BioMistral-7B-DARE | AWQ | 128 | 4 | GEMM | 4.68 | x1.41 | HuggingFace |
| BioMistral-7B-TIES | AWQ | 128 | 4 | GEMM | 4.68 | x1.41 | HuggingFace |
| BioMistral-7B-SLERP | AWQ | 128 | 4 | GEMM | 4.68 | x1.41 | HuggingFace |
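To make the q_group_size and w_bit columns more concrete, here is a minimal sketch of group-wise weight quantization in plain Python. It uses simple round-to-nearest quantization with one scale and offset per group. This is an illustrative assumption, not the AWQ or bitsandbytes algorithm itself (AWQ, for instance, also rescales weights using activation statistics). The function names and the toy weights are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Round-to-nearest quantization of one group to unsigned w_bit integers."""
    levels = 2 ** w_bit - 1                           # 15 steps when w_bit = 4
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0    # constant group: any scale works
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def quantize(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups, each with its own scale and offset."""
    return [
        quantize_group(weights[i:i + q_group_size], w_bit)
        for i in range(0, len(weights), q_group_size)
    ]


def dequantize(groups):
    """Map integer codes back to approximate floating-point weights."""
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


# Toy weights in [-1.28, 1.27]; real models hold billions of such values.
weights = [((i * 37) % 256 - 128) / 100 for i in range(256)]
groups = quantize(weights, q_group_size=128, w_bit=4)
restored = dequantize(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(len(groups), "groups, max reconstruction error:", round(max_error, 4))
```

Smaller groups give each scale fewer weights to cover, which tends to lower the reconstruction error at the cost of storing more scales and offsets.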
Details
Here’s a simplified interpretation of the entries:
- The FP16/BF16 entry is the unquantized reference in 16-bit floating point. It requires 15.02 GB of VRAM and defines the baseline execution time (x1.00).
- Models using AWQ with a quantization group size of 128 and 4-bit weights reduce VRAM usage to 4.68 GB. Depending on the kernel version, execution takes 1.41 times (GEMM) or 10.30 times (GEMV) as long as the base model.
- BnB with 4-bit and 8-bit quantization offers a middle ground in VRAM usage (5.03 GB and 8.04 GB, respectively), with execution taking 3.25 and 4.34 times as long as the base model.
- The DARE, TIES, and SLERP models, quantized with AWQ (GEMM version), require 4.68 GB of VRAM and run at 1.41 times the base execution time, matching the AWQ GEMM figures for BioMistral-7B.
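The memory figures above can be turned into relative savings with a few lines of arithmetic. The sketch below copies the VRAM and Time values for BioMistral-7B from the table and computes how much VRAM each method saves compared with the FP16/BF16 row. The variable and function names are hypothetical, chosen only for this illustration.

```python
# (method, VRAM in GB, relative execution time) copied from the table above
ROWS = [
    ("FP16/BF16", 15.02, 1.00),
    ("AWQ GEMM", 4.68, 1.41),
    ("AWQ GEMV", 4.68, 10.30),
    ("BnB.4", 5.03, 3.25),
    ("BnB.8", 8.04, 4.34),
]


def vram_saving_percent(vram_gb, baseline_gb):
    """Percentage of VRAM saved relative to the baseline row."""
    return 100.0 * (1.0 - vram_gb / baseline_gb)


baseline_gb = ROWS[0][1]  # the FP16/BF16 row is the reference
summary = {
    name: round(vram_saving_percent(vram, baseline_gb), 1)
    for name, vram, _ in ROWS
}

for name, vram, rel_time in ROWS:
    print(f"{name:10s} {vram:6.2f} GB  saves {summary[name]:5.1f}%  time x{rel_time:.2f}")
```

By this calculation the 4-bit AWQ models use roughly a third of the baseline memory, while the 8-bit BnB model uses a little over half.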
Using BioMistral
You can use BioMistral with Hugging Face’s Transformers library as follows.
Loading the model and tokenizer:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BioMistral/BioMistral-7B")
model = AutoModel.from_pretrained("BioMistral/BioMistral-7B")
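The snippet above uses AutoModel, which loads the transformer without a language-modeling head. For research experiments that need text generation, a variant along the following lines may be more convenient. This is a sketch under assumptions rather than the authors’ recommended setup: it assumes `torch`, `transformers`, and `accelerate` are installed (for `device_map="auto"`), and that `bitsandbytes` and a CUDA GPU are available when `load_in_4bit=True`. The helper names `load_for_generation` and `complete` are hypothetical.

```python
MODEL_ID = "BioMistral/BioMistral-7B"


def load_for_generation(model_id=MODEL_ID, load_in_4bit=False):
    """Load the tokenizer and a causal-LM model; heavy imports stay local."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
    if load_in_4bit:
        # Assumption: bitsandbytes is installed and a CUDA GPU is present.
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
        )
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    return tokenizer, model


def complete(prompt, tokenizer, model, max_new_tokens=64):
    """Greedy continuation of a prompt; returns the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_for_generation()` followed by `complete("Your prompt here", tokenizer, model)`. Loading downloads several gigabytes of weights on first use.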
CAUTION! It is essential to communicate the model’s potential risks, biases, and limitations to both direct users and those affected downstream. Although the model can generate natural language text, our understanding of its full range of abilities and limitations is still at an early stage, which matters particularly in sensitive areas like medicine. Therefore, we strongly discourage using this model for generating natural language in production environments or for professional activities within the healthcare and medical sectors.