Foundation Models for Bioinformatics – a Primer

(Original article published on Medium)

Foundation models for biology are one of the most significant technological advances in bioinformatics in recent years. However, they are built on concepts that can be relatively unfamiliar to people in the field, as they come from other areas of AI. This article summarises what you need to know about foundation models in biology and how they can be useful to you.

Most people are now familiar with natural language models such as ChatGPT, trained on huge quantities of text and capable of generating responses that mimic human language. Many people are also acquainted with image and video generation using other architectures. However, biology also has its own language, which can be modelled using machine learning. In fact, biology has many languages, from DNA to proteins to transcription factors and regulation, interactions between cell types and tissues, and much else.

Foundation models for biology attempt to model these languages, using large quantities of data for training and applying this knowledge to generate new data or make predictions.

A list of Foundation Models for Biology

This list is quite incomplete, as new models get published every week. For a more complete list, check this repository: https://github.com/HICAI-ZJU/Scientific-LLM-Survey (I hope they accept my pull requests adding some recent ones). I also like following https://kiinai.substack.com/ for recent news.

[Foundation models for biology](https://www.notion.so/17ec9afb0bec80489d86ed3a453a05c9?pvs=21)

What are Foundation Models for Biology?

  • Generally speaking, they are models trained on large quantities of biological data, such as genomic sequences or chemical structures, using architectures like the Transformer, which is widely used for training Large Language Models (LLMs) like ChatGPT. The term has become popular thanks to the paper https://arxiv.org/abs/2402.04286 published in 2024.
  • Instead of training a Transformer on a dataset of texts or human conversations, these models are trained on biological sequences or other types of biological data. The model learns the “language of life” by capturing hierarchical patterns in the data (a toy example of how a sequence becomes model input follows this list).
    • For example, a model may be trained on gene expression data from blood tissue single-cell sequencing. The model learns which genes are typically expressed together in specific cell types, discovering these interactions purely from the training data, without any prior knowledge of biological pathways or gene sets.
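To make the analogy with language models concrete, here is a toy sketch of how a DNA sequence could be turned into the token IDs a transformer consumes. The vocabulary and special tokens are invented for illustration; real models each define their own scheme (single nucleotides, k-mers, or gene tokens).

```python
# Toy tokeniser: each nucleotide becomes an integer ID, just as words or sub-words
# become token IDs in a text language model. Vocabulary and special tokens are
# illustrative only, not taken from any specific published model.
VOCAB = {"<pad>": 0, "<cls>": 1, "<mask>": 2, "A": 3, "C": 4, "G": 5, "T": 6, "N": 7}

def tokenize(sequence, max_len=16):
    """Map each base to its ID, prepend <cls>, and pad to a fixed length."""
    ids = [VOCAB["<cls>"]] + [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]
    ids = ids[:max_len]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))

print(tokenize("ACGTTGCA"))
# [1, 3, 4, 5, 6, 6, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0]
```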

What Are “Pre-training” and “Fine-tuning”?

Training a Foundation model is very expensive and requires large quantities of data. However, once a model has been pre-trained, it can be used and fine-tuned for other tasks. This means that we can take advantage of big atlases (provided the model is made public), and use them on smaller datasets, without having to download all the original data and train on it from scratch.

  • Pre-training: This is the initial phase where the foundation model is trained on a large dataset of biological data. For example, the model might learn to predict the next nucleotide in a sequence or fill in missing elements within a sequence. During this phase, the model learns the patterns underlying the input data – essentially building an understanding of the “language of biology”.
  • Fine-tuning: Once pre-trained, the model can be adapted to specific tasks by modifying its parameters and training it further on smaller, task-specific datasets. The advantage, compared to training a model on your data from scratch, is that the model has already learned the relationships between the elements during pre-training (see the sketch after this list).
    • A pre-trained model might have learned general genomic patterns. By fine-tuning, you could train it to classify patients as healthy or diseased based on their genomic data.
    • Fine-tuning doesn’t necessarily involve modifying only the last layer. Techniques like updating multiple layers or using adapter modules can help leverage the full depth of the model.
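As a minimal sketch of what fine-tuning can look like in code, assuming a pre-trained encoder that maps an input profile to a fixed-size embedding. The encoder below is a randomly initialised stand-in; in practice you would load a published checkpoint and its weights instead.

```python
# Fine-tuning sketch: attach a new classification head to a pre-trained encoder.
# The encoder here is a random stand-in for a real published checkpoint.
import torch
import torch.nn as nn

EMB_DIM = 512
pretrained_encoder = nn.Sequential(nn.Linear(20_000, EMB_DIM), nn.ReLU())

class DiseaseClassifier(nn.Module):
    def __init__(self, encoder, emb_dim=EMB_DIM, n_classes=2):
        super().__init__()
        self.encoder = encoder                     # knowledge from pre-training
        self.head = nn.Linear(emb_dim, n_classes)  # new, task-specific layer

    def forward(self, x):
        return self.head(self.encoder(x))

model = DiseaseClassifier(pretrained_encoder)

# One common strategy: freeze the encoder and train only the new head
# (alternatively, unfreeze some layers or add adapter modules, as noted above).
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

x = torch.randn(8, 20_000)                         # a batch of 8 expression profiles
loss = nn.functional.cross_entropy(model(x), torch.randint(0, 2, (8,)))
loss.backward()
optimizer.step()
```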

Example 1: Using a Pre-trained Model to Predict Disease

Let’s imagine you are developing a model to predict whether a patient is affected by a disease. You start from a gene expression dataset of cases and controls from previous experiments.

A traditional bioinformatics approach would be to create a data frame of gene expression across all samples and train a model to distinguish cases from controls, using, for example, a random forest, XGBoost, or logistic regression. The predicted variable is “disease”, and the gene expression values are the input.
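For reference, a toy version of that traditional pipeline in scikit-learn might look like this; the random matrix simply stands in for a real expression table and phenotype labels.

```python
# Traditional approach: fit a classifier directly on the (samples x genes) matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5_000))     # 40 samples, 5,000 genes (toy data)
y = rng.integers(0, 2, size=40)      # 0 = control, 1 = case

baseline = LogisticRegression(max_iter=1000)
print(cross_val_score(baseline, X, y, cv=5).mean())   # cross-validated accuracy
```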

The problem with this approach is that it doesn’t consider the relationships between genes. Gene A may be in the same pathway as Gene B, and the pair may be frequently expressed together. Gene C may be a transcription factor for Gene D, but only when Gene E is expressed, and so on. Biology is extremely complex, and we are far from fully understanding it.

If your training dataset is big enough, your model may be able to learn all of these gene relationships by itself, or you may tweak it with some feature engineering to simplify the training. However, in most cases, your training dataset will be far too small. You may have a cohort of 10 or 20 samples at most, and even hundreds of samples may not be enough.

Here is where a foundation model comes into play. We can download a model previously trained on a large dataset of gene expression, like the Gene Expression Atlas, and apply it to our small dataset. We transform the gene expression values into embeddings: numerical representations that encode the knowledge of gene relationships learned during pre-training. By using these embeddings instead of the original values, our predictions should be more accurate, because they take into account all the biology that the foundation model has learned.
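Schematically, the workflow differs from the baseline above only in what goes into the classifier. Note that `pretrained_model.encode` below is a placeholder: every published model exposes its own interface for computing embeddings, so check its documentation.

```python
# Foundation-model workflow: replace raw expression values with embeddings from a
# pre-trained model, then fit the same simple classifier on top of them.
# `pretrained_model.encode` is a placeholder, not the API of any specific model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X (samples x genes) and y (case/control labels) as in the previous snippet
embeddings = pretrained_model.encode(X)       # shape: (samples, embedding_dim)

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, embeddings, y, cv=5).mean())
```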

Example 2: Imputation and Upscaling of Microarrays

I like this example from the CpGPT paper https://www.biorxiv.org/content/10.1101/2024.10.24.619766v1.full, which presents a model trained on methylation data.

Despite the advent of NGS technologies, microarrays are still one of the most cost-effective ways to profile methylation. There is a vast repository of data generated using older Illumina microarray chips, like the HumanMethylation BeadChips, which can profile only 27K or 450K CpGs. There must be thousands of publications in PubMed, and datasets in GEO, generated using these older technologies. The newest arrays can assay more than 900K sites, so it is a shame that so much existing data is tied to old technology, with so many CpG sites missing.

Is there a way to use the data from all these old chips and “upgrade” it to the newest one, predicting the sites that were not present in the original assay? The answer, proposed in the CpGPT paper, is yes. They trained a transformer model on a large quantity of methylation data, including old and new chips. The model learned how to impute new sites, predicting the methylation values of the missing CpGs from the measured ones. To validate the results, they used the imputed datasets to predict age, and obtained much lower errors than without imputation.
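Conceptually, the upscaling step amounts to aligning an old-chip sample to the probe set of a newer chip and letting the model fill in the positions that were never measured. A toy illustration (probe IDs and beta values invented, and the model call is only described in the comments):

```python
# Align a sample from an old chip to the probe set of a newer chip and mark the
# CpGs that need to be imputed. Probe IDs and beta values are invented.
import numpy as np

measured = {"cg0001": 0.82, "cg0002": 0.11, "cg0003": 0.56}       # old-chip sample
target_cpgs = ["cg0001", "cg0002", "cg0003", "cg0004", "cg0005"]  # newer chip probes

values = np.array([measured.get(cpg, np.nan) for cpg in target_cpgs])
mask = np.isnan(values)       # True where the model must predict a value

print(values)   # [0.82 0.11 0.56  nan  nan]
print(mask)     # [False False False  True  True]
# A pre-trained methylation model is then asked to fill the masked positions,
# conditioning on the observed CpGs (and, in CpGPT, on their genomic context).
```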

Similarly, a foundation model can be used to impute missing SNPs from genotype data, and to fill gaps in other data types. Foundation models can also be used for generating new datasets synthetically – the Precious3GPT paper (https://www.biorxiv.org/content/10.1101/2024.07.25.605062v1) describes this neatly.

What Are the Key Concepts of Transformers?

Transformers are at the core of foundation models. Here are the features that differentiate them from other architectures (a toy example combining all three follows the list):

  1. Embeddings:
    – Input data (e.g., nucleotides, amino acids) is converted into numerical representations called embeddings.
    – Embeddings encode both the identity and context of each element in the data. For example, in a nucleotide sequence, embeddings might capture the base type, its genomic position, and nearby elements.
    – The art of creating embeddings lies in designing features relevant to the task. For example:
      – Evo uses embeddings optimized for evolutionary data.
      – scGPT encodes single-cell RNA-seq data, including metadata like cell type and experimental batch.
      – CpGPT incorporates features like CpG methylation status, genomic position, and neighboring base context.
  2. Attention Mechanism:
    – Transformers use self-attention to understand relationships between different parts of the input. For example, the attention mechanism allows the model to connect distant elements in a sequence, such as regulatory elements and target genes. This ability to capture long-range dependencies is particularly crucial for biological data.
  3. Generative Training:
    – Many Transformers are trained using tasks like masked language modeling (e.g., predicting masked bases) or next-token prediction. These tasks help the model generalize across datasets and tasks by building a rich internal representation of the data.
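Putting the three ingredients together, here is a toy masked-language-modelling training step on nucleotide tokens (reusing the toy vocabulary sketched earlier, where the <mask> token has ID 2). The dimensions are arbitrary and orders of magnitude smaller than in any real foundation model.

```python
# Toy masked-language-modelling step: embeddings + self-attention + masked objective.
# Positional encodings are omitted for brevity.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 8, 64, 32, 4     # tiny, illustrative sizes
embed = nn.Embedding(vocab_size, d_model)              # 1. embeddings
encoder = nn.TransformerEncoder(                       # 2. stacked self-attention
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
to_vocab = nn.Linear(d_model, vocab_size)              # predicts the original token

tokens = torch.randint(3, vocab_size, (batch, seq_len))    # fake nucleotide IDs (3..7)
masked = tokens.clone()
mask_pos = torch.rand(batch, seq_len) < 0.15               # 3. hide ~15% of positions
masked[mask_pos] = 2                                       # ID 2 = <mask>

logits = to_vocab(encoder(embed(masked)))                  # predict every position
loss = nn.functional.cross_entropy(logits[mask_pos], tokens[mask_pos])  # score masked only
loss.backward()
print(float(loss))
```

A real model adds positional information and uses far larger dimensions and datasets, but the pre-training loop has essentially this shape.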
