Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert

BioReason addresses a significant challenge in utilizing AI for genomics: the need for interpretable, step-by-step reasoning from complex DNA data. While DNA foundation models have shown proficiency in learning diverse sequence patterns for tasks like variant prediction and gene regulation, they often function as black boxes, providing limited insight into the underlying biological mechanisms. Conversely, large language models (LLMs) display remarkable reasoning capabilities across various domains, yet they are not tailored to work directly with raw genomic sequences. This disconnect between effective DNA representation and in-depth biological reasoning hampers AI’s ability to achieve expert-level comprehension and inhibits its potential for driving scientific discovery through meaningful, hypothesis-driven insights.

DNA foundation models have made notable strides by learning rich representations directly from genomic sequences, demonstrating strong performance across numerous biological tasks. For instance, models such as Evo2 capture long-range dependencies within sequences; however, their lack of interpretability restricts deeper biological understanding. Meanwhile, LLMs excel at processing biomedical text but typically do not work with raw genomic data directly. Early attempts to bridge this gap, such as GeneGPT and TxGemma, have emerged, yet current genomic benchmarks primarily assess task performance without adequately evaluating reasoning and hypothesis generation.

Researchers from the Vector Institute, University Health Network, Arc Institute, Cohere, University of California, San Francisco, and Google DeepMind have developed BIOREASON, an AI system that combines a DNA foundation model with an LLM. This integration allows BIOREASON to analyze raw genomic sequences while applying LLM-based reasoning to generate clear, biologically grounded insights. Trained through supervised fine-tuning and reinforcement learning, BIOREASON achieves a performance gain of more than 15% over single-modality baselines (DNA-only or LLM-only models), reaching up to 97% accuracy in KEGG-based disease pathway prediction. The model provides interpretable, step-by-step outputs that enhance biological understanding and facilitate hypothesis generation.

The BIOREASON model is a multimodal framework that combines genomic sequences with natural language queries to support interpretable biological reasoning. A DNA foundation model first extracts contextual embeddings from the raw DNA input. A learnable projection layer then maps these embeddings into the LLM's embedding space, where they are augmented with positional encoding and combined with the tokenized text query to form a single input for the LLM, specifically Qwen3. From this unified input, the LLM generates step-by-step explanations of biological processes. Finally, reinforcement learning with Group Relative Policy Optimization (GRPO) refines the model's reasoning capabilities.
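
To make the data flow concrete, here is a minimal sketch in plain Python of the two mechanisms described above: projecting DNA embeddings into the LLM's embedding space, and the group-relative advantage at the core of GRPO. It is not the authors' implementation. The function names, the toy dimensions, the sinusoidal form of the positional encoding, the choice to place DNA vectors before the text vectors, and the use of a population standard deviation are all assumptions made for illustration.

```python
import math


def linear_project(vectors, weight, bias):
    """Map each DNA embedding x to the LLM dimension via y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


def sinusoidal_encoding(position, dim):
    """Sinusoidal positional encoding for one position (an assumed choice)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def build_llm_input(dna_embeddings, text_embeddings, weight, bias):
    """Project DNA embeddings, add positions, then prepend them to the text."""
    projected = linear_project(dna_embeddings, weight, bias)
    dim = len(bias)
    positioned = [
        [v + p for v, p in zip(vec, sinusoidal_encoding(pos, dim))]
        for pos, vec in enumerate(projected)
    ]
    # Assumed ordering: DNA vectors first, text query vectors after.
    return positioned + text_embeddings


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its own group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: two DNA embeddings of size 2 projected to an LLM size of 3.
dna_embeddings = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[0.5, 0.5, 0.5]]  # one hypothetical text token embedding
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # would be learned in practice
bias = [0.0, 0.0, 0.0]

sequence = build_llm_input(dna_embeddings, text_embeddings, weight, bias)

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this sketch the projection is the only piece that would carry learned parameters. The advantage function shows why GRPO needs no separate value network: each sampled answer is scored only relative to the other answers drawn for the same prompt.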

On three datasets centered on DNA variant interpretation and biological reasoning, BIOREASON outperformed both DNA-only and LLM-only models in predicting disease outcomes from genomic variants. The top-performing version, which combined Evo2 and Qwen3-4B, demonstrated high accuracy and F1-scores across all tasks. A notable case study involved a PFN1 mutation associated with ALS, where BIOREASON accurately predicted the disease and provided a ten-step explanation linking the variant's effect on actin dynamics to motor neuron degeneration. This illustrates its strength not only in making accurate predictions but also in delivering transparent, biologically grounded reasoning pathways.
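
For readers less familiar with the metrics mentioned above, the following sketch shows how accuracy and F1 can be computed for a disease-prediction task. The labels and predictions are hypothetical toy values, not results from the paper, and macro-averaging the per-class F1 is an assumption about how per-class scores would be combined.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lb) for lb in labels) / len(labels)


# Hypothetical toy labels, for illustration only.
y_true = ["ALS", "ALS", "Parkinson", "Parkinson"]
y_pred = ["ALS", "Parkinson", "Parkinson", "Parkinson"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Reporting F1 alongside accuracy matters when disease classes are imbalanced, since a model can reach high accuracy by favoring the most common class while still missing rarer ones.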

In conclusion, BIOREASON merges DNA encoders with large language models to enable detailed, interpretable reasoning over genomic data. Unlike conventional models, it not only makes accurate predictions but also explains the biological logic behind them with step-by-step outputs. This capability helps scientists better understand disease mechanisms and formulate new research questions. BIOREASON still faces challenges, including high computational costs and limited uncertainty measures. Future work aims to address these by improving scalability, incorporating additional biological data such as RNA and proteins, and extending the approach to broader tasks, including Genome-Wide Association Studies (GWAS). Overall, BIOREASON represents a significant step forward for precision medicine and genomic research.
