Google AI Releases C2S-Scale 27B Model for Single-Cell Gene Expression Analysis
A team of researchers from Google Research, Google DeepMind, and Yale has introduced the C2S-Scale 27B, a 27-billion-parameter foundation model designed for single-cell analysis, built on Gemma-2. This model translates single-cell RNA-seq (scRNA-seq) profiles into “cell sentences”—ordered lists of gene symbols—allowing a language model to effectively parse and reason over cellular states.
Understanding the Model
The C2S-Scale model transforms high-dimensional expression vectors into text by rank-ordering genes and emitting the top-K gene symbols as a sequence (a minimal encoding sketch follows the list below). This representation aligns single-cell data with standard LLM toolchains, facilitating tasks such as:
- Cell-type prediction
- Tissue classification
- Cluster captioning
- Perturbation prediction
- Biological Q&A
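To make the encoding concrete, here is a minimal sketch of the rank-order transformation. It is illustrative only, assuming a simple NumPy pipeline; function and variable names are hypothetical, and the official implementation lives in the vandijklab cell2sentence codebase.

```python
import numpy as np

def expression_to_cell_sentence(expr_vector, gene_names, top_k=100):
    """Convert one cell's expression vector into a 'cell sentence':
    gene symbols rank-ordered by expression, keeping the top-K.
    Illustrative helper, not the official cell2sentence code."""
    expr_vector = np.asarray(expr_vector)
    # Indices of genes sorted by descending expression
    order = np.argsort(expr_vector)[::-1]
    # Drop unexpressed genes, then keep the top-K
    order = [i for i in order if expr_vector[i] > 0][:top_k]
    return " ".join(gene_names[i] for i in order)

# Toy example: five genes, one cell
genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"]
counts = [0.0, 7.2, 3.1, 12.5, 0.4]
print(expression_to_cell_sentence(counts, genes, top_k=3))
# -> "LYZ MS4A1 NKG7"
```

Because the output is plain text, the resulting sentences can be fed to any decoder-only LLM without architecture changes.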
Training Data, Stack, and Release
The C2S-Scale-Gemma-2-27B model is built on Gemma-2 27B (decoder-only Transformer), trained on Google TPU v5, and released under CC-BY-4.0. The training corpus aggregates over 800 public scRNA-seq datasets, encompassing more than 57 million cells (human and mouse) with associated metadata and textual context. Pretraining unifies transcriptomic tokens and biological text into a single multimodal corpus.
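Since the weights are released on Hugging Face, loading should follow the standard transformers pattern. The sketch below assumes the repository ID vandijklab/C2S-Scale-Gemma-2-27B and an illustrative prompt template; check the model card for the exact repository name and documented prompt formats.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID is an assumption -- verify against the vandijklab
# model card on Hugging Face before use.
model_id = "vandijklab/C2S-Scale-Gemma-2-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",            # shard across available GPUs
    torch_dtype=torch.bfloat16,   # 27B weights are large; use bf16
)

# A cell-sentence prompt: gene symbols rank-ordered by expression.
# The template here is illustrative, not the documented one.
prompt = (
    "The following is a cell sentence of a human cell: "
    "LYZ S100A8 S100A9 FCN1 CD14. What is the cell type?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```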
Key Results: An Interferon-Conditional Amplifier
The research team conducted a dual-context virtual screen over more than 4,000 drugs to identify compounds that enhance antigen presentation (the MHC-I program) specifically in immune-context-positive settings, i.e., primary patient samples with low interferon tone, while exhibiting negligible effects in immune-context-neutral cell-line data. The model predicted a pronounced context split for silmitasertib, a CK2 inhibitor: strong MHC-I upregulation when combined with low-dose interferon, and minimal effect without interferon. The team validated this prediction in human neuroendocrine models, observing a marked synergistic increase in antigen presentation (approximately 50% in their assays).
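The screening logic reduces to scoring each drug in both contexts and ranking by the difference. The sketch below is a schematic under that assumption; `predict_mhc1_score` is a hypothetical placeholder for a model query, not an API from the released code.

```python
def context_split_screen(drugs, predict_mhc1_score):
    """Rank drugs by the gap between their predicted MHC-I effect
    in immune-context-positive vs. immune-context-neutral data."""
    results = []
    for drug in drugs:
        # Immune-context-positive: patient samples, low interferon tone
        pos = predict_mhc1_score(drug, context="immune_positive")
        # Immune-context-neutral: cell lines without immune signaling
        neu = predict_mhc1_score(drug, context="immune_neutral")
        results.append({"drug": drug, "positive": pos,
                        "neutral": neu, "split": pos - neu})
    # Large split = strong effect only when immune context is present
    return sorted(results, key=lambda r: r["split"], reverse=True)

# Toy usage with a made-up scorer
scores = {("silmitasertib", "immune_positive"): 0.9,
          ("silmitasertib", "immune_neutral"): 0.1}
ranked = context_split_screen(["silmitasertib"],
                              lambda d, context: scores[(d, context)])
print(ranked[0]["split"])  # -> 0.8
```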
The amplifier lowers the response threshold to interferon rather than initiating antigen presentation from scratch. Flow-cytometry readouts show HLA-A/B/C upregulation only under the combined treatment (tested with both IFN-β and IFN-γ) across two neuroendocrine models, with representative MFI gains of 13.6% at 10 nM and 34.9% at 1000 nM silmitasertib in one model.
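For clarity on how those percentages are defined, the snippet below computes a percent MFI gain for the combination over interferon alone. The input values are hypothetical placeholders, not the paper's raw data.

```python
def pct_mfi_gain(mfi_combo, mfi_ifn_alone):
    """Percent gain in median fluorescence intensity (MFI) of surface
    HLA-A/B/C for combination treatment relative to interferon alone."""
    return 100.0 * (mfi_combo - mfi_ifn_alone) / mfi_ifn_alone

# Hypothetical readouts: combo MFI 1136 vs. 1000 for IFN alone
print(pct_mfi_gain(1136.0, 1000.0))  # -> 13.6 (up to float rounding)
```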
Key Takeaways
- C2S-Scale 27B encodes scRNA-seq profiles as textual “cell sentences,” enabling LLM-native single-cell analysis workflows.
- In a two-context virtual screen of over 4,000 compounds, the model predicted an interferon-conditional amplifier: CK2 inhibition (silmitasertib) enhances MHC-I antigen presentation only in conjunction with low-dose interferon.
- Wet-lab tests in human neuroendocrine cell models confirmed the prediction, with an approximately 50% increase in antigen presentation for silmitasertib combined with interferon relative to either treatment alone; the evidence remains preclinical and in vitro.
- Open weights and usage documentation are available on Hugging Face (vandijklab) with both 27B and 2B Gemma variants for research purposes.
Conclusion
The C2S-Scale 27B model represents a significant advance for LLMs in biology, translating scRNA-seq into “cell sentences” that allow programmatic queries over cell states and perturbations. The identification of an interferon-conditional amplifier, silmitasertib (CK2 inhibition), that increases MHC-I antigen presentation in the presence of low-dose interferon is a promising step toward converting immune-“cold” tumors into more responsive targets for immunotherapy. All evidence remains preclinical, however, so the model’s contribution is best read as hypothesis generation rather than a clinical claim.
For further details, refer to the technical paper, the model on Hugging Face, and the GitHub page for tutorials, code, and notebooks.