Large Language Models in DNA

Large language models (LLMs) have transformed natural language processing and are now redefining genomic analysis by treating DNA as a “biological language.” In this context, the four nucleotide bases—adenine (A), guanine (G), cytosine (C), and thymine (T)—serve as the letters of its alphabet. By leveraging self-supervised learning and self-attention mechanisms, these models generate deep embeddings and tokenized representations of DNA sequences, enabling them to detect complex patterns and long-range dependencies that traditional alignment-based methods or motif analyses may overlook.

The importance of LLMs in genomics stems from their ability to capture subtle, context-dependent relationships in genetic data. Unlike rule-based approaches, LLMs dynamically learn from the data itself, which allows for enhanced gene annotation, improved functional genomics analyses, and more robust evolutionary studies. This flexibility and power provide critical insights into gene function, mutation patterns, and regulatory mechanisms that are essential for advancing genetic medicine, disease research, and synthetic biology.

A breakthrough in this field was demonstrated with DNABERT[1] in 2021, which adapted BERT’s bidirectional attention using k-mer tokenization to effectively process genomic sequences. Following this, models such as HyenaDNA[2] and StripedHyena[3] have further pushed the boundaries, enabling the processing of extremely long genomic sequences—up to 1 million tokens—at tractable computational cost. These developments underscore the pivotal role LLMs play in offering a holistic and integrative view of genomic complexity, paving the way for transformative advances in our understanding of biology.

Background

Deoxyribonucleic acid (DNA) is the molecule that carries genetic information essential for the development and functioning of organisms. This information is stored as a code composed of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The human genome consists of approximately 3 billion bases, with only about 1.5% encoding proteins, while the remaining 98.5% consists of noncoding regions. These noncoding regions include regulatory elements such as enhancers and promoters, which play crucial roles in gene expression and cellular function[4].

Historically, computational genomics relied on statistical methods and Hidden Markov Models (HMMs) for motif detection. While effective for many tasks, these methods struggled with capturing long-range dependencies in DNA sequences. Early machine learning models improved upon these approaches by enabling tasks such as gene classification, but they lacked the complexity needed for capturing intricate genomic patterns.

The emergence of deep learning and Large Language Models (LLMs) has transformed DNA sequence analysis by providing a global and transferable understanding of genomic sequences. LLMs are deep learning-based AI models originally designed for processing and generating human-like text. They function by tokenizing inputs, converting sequences into numerical representations, and are trained on massive datasets using self-supervised learning to recognize patterns in sequential data. This ability allows LLMs to model complex genomic interactions by leveraging upstream and downstream nucleotide contexts.
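
For illustration, the Python sketch below shows how a DNA string can be converted into overlapping k-mer tokens and then into integer IDs. The vocabulary, special tokens, and choice of k = 3 are arbitrary choices for this example and do not reproduce the tokenizer of any specific published model.

```python
from itertools import product

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Illustrative vocabulary: every possible 3-mer over the DNA alphabet,
# plus special tokens similar in spirit to those used by BERT-style models.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
kmers = ["".join(p) for p in product("ACGT", repeat=3)]
vocab = {tok: i for i, tok in enumerate(special_tokens + kmers)}

sequence = "ATGCGTAC"
tokens = kmer_tokenize(sequence, k=3)        # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]

print(tokens)
print(token_ids)
```

The resulting integer IDs are what a genomic language model actually consumes during self-supervised pre-training.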

In 2021, DNABERT[1] became one of the first LLMs specifically designed for DNA sequences, utilizing k-mer tokenization to adapt BERT’s bidirectional attention mechanism for genomic data. Building on this foundation, HyenaDNA[2] introduced memory-efficient architectures capable of processing long genomic sequences up to 1 million tokens in length. Meanwhile, Evo[5], a 7-billion-parameter model trained on over 2.7 million prokaryotic and phage genomes, has demonstrated remarkable capabilities in zero-shot function prediction and generative tasks, uncovering evolutionary patterns and aiding pathogen surveillance[6].

These advancements mark a paradigm shift in genomics, moving from rule-based and alignment-heavy methods to deep learning-driven sequence analysis. By leveraging self-attention mechanisms and scalable architectures, LLMs have opened new avenues for research in functional genomics, evolutionary biology, and personalized medicine, fundamentally redefining how scientists interpret the vast complexity of genetic information.

Scientific Principles and Mechanisms

Large language models rely on different underlying architectures to process sequences and make predictions. In genomics, where inputs are very long and relevant signals may lie far apart in the sequence, the Transformer and the Hyena hierarchy are the most commonly used architectures.

Transformers

Transformers are a type of deep learning model introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need[7]." They represent a significant departure from traditional recurrent neural networks (RNNs) by relying entirely on self-attention mechanisms rather than sequential processing. This self-attention allows transformers to evaluate the relationships between all tokens in an input sequence simultaneously, enabling them to capture long-range dependencies more effectively.

The architecture of transformers is based on an encoder-decoder structure, where the encoder processes the input data to generate a set of continuous representations and the decoder uses these representations to generate output sequences. Key components of the transformer model include multi-head self-attention, positional encoding, and feed-forward neural networks. These features have made transformers the foundation for state-of-the-art models in natural language processing, such as BERT, GPT, and many others, and they are increasingly being applied across various domains including computer vision, genomics, and beyond.
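
The following minimal NumPy sketch illustrates single-head scaled dot-product self-attention with random, untrained projection weights; production transformers add multiple heads, positional encodings, residual connections, layer normalization, and feed-forward sublayers.

```python
import numpy as np

def self_attention(X: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Single-head scaled dot-product self-attention.
    X: (L, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)               # (L, L): every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (L, d_head) context-aware representations

rng = np.random.default_rng(0)
L, d_model, d_head = 6, 16, 8                        # toy sizes; real models are far larger
X = rng.normal(size=(L, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (6, 8)
```

The (L, L) attention-score matrix built here is the source of the quadratic cost in sequence length discussed in the following sections.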


Hyena Hierarchy

The Hyena model is a neural network architecture that was developed to address the scalability issues associated with traditional self‐attention mechanisms.[8] It is designed to efficiently handle very long sequences by replacing the quadratic-complexity self‐attention with a subquadratic operator that interleaves implicit long convolutions with data-controlled gating.

Motivation and Context

Traditional Transformer models rely on self-attention to allow each token in a sequence to interact with every other token. Although this mechanism is highly effective for capturing dependencies, its computational cost scales quadratically, O(L²), with the sequence length L. This quadratic scaling creates significant challenges when processing long sequences, such as entire documents, long time series, or high-resolution images.

The need for more efficient models that can process long-range dependencies has led researchers to explore alternatives that reduce computational and memory requirements. The Hyena model was introduced as a drop-in replacement for self-attention, aiming to maintain the global receptive field and expressive power of attention while scaling subquadratically with sequence length.

Architecture

At the core of the Hyena model is the concept of implicit long convolutions. Traditional convolutions use fixed kernels that are explicitly defined and stored, resulting in a parameter count that scales linearly with the kernel size. In contrast, Hyena generates convolutional filters implicitly using a parameterized function—typically implemented as a small feed-forward network. This allows the model to synthesize long filters on the fly, effectively decoupling the filter length from the number of parameters.

In addition to implicit convolutions, the Hyena operator incorporates data-controlled multiplicative gating. In this mechanism, each token is modulated by gating signals that are derived from learned linear projections of the input. The gating operation is performed element-wise and serves to dynamically adjust the influence of the convolutional output, effectively tailoring the operator to the specific input context.

The overall Hyena operator is defined as a recurrence that alternates between implicit long convolutions and element-wise gating. For an order-N Hyena operator, the recurrence is expressed as follows:

  1. z_t^1 = v_t, where v is one of the linear projections of the input.
  2. For n = 1, …, N:
    • z_t^(n+1) = x_t^n · (h^n * z^n)_t, where x^n represents a gating projection and h^n is an implicitly parameterized long convolution filter.
  3. The final output is given by y_t = z_t^(N+1).

where

  • z_t^n is the intermediate state at recurrence step n and time position t.
  • v_t is a linear projection of the input at time position t, analogous to the "value" in self-attention.
  • x_t^n is the gating projection at recurrence step n.
  • h^n is the implicit long convolution filter for step n.
  • The operator * denotes convolution, so (h^n * z^n)_t is the result of convolving filter h^n with the signal z^n at time t.
  • The dot "·" indicates element-wise multiplication.
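
This recurrence can be sketched in a few lines of NumPy. The snippet below assumes a single channel and uses random, untrained arrays in place of the learned projections and filters; it illustrates the structure of the operator rather than reproducing the published implementation.

```python
import numpy as np

def causal_conv_fft(h: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Causal convolution of filter h with signal z via FFT, truncated to length L."""
    L = len(z)
    n_fft = 2 * L                                   # zero-pad so circular conv equals linear conv
    return np.fft.irfft(np.fft.rfft(h, n_fft) * np.fft.rfft(z, n_fft), n_fft)[:L]

def hyena_operator(v, gates, filters):
    """Order-N Hyena recurrence: z1 = v; z_{n+1} = x_n · conv(h_n, z_n); y = z_{N+1}."""
    z = v
    for x_n, h_n in zip(gates, filters):
        z = x_n * causal_conv_fft(h_n, z)           # element-wise gating of the convolved signal
    return z

rng = np.random.default_rng(0)
L, N = 1024, 2                                      # sequence length, operator order
v = rng.normal(size=L)                              # "value" projection of the input
gates = [rng.normal(size=L) for _ in range(N)]      # data-controlled gating projections x^n
filters = [rng.normal(size=L) * np.exp(-0.01 * np.arange(L)) for _ in range(N)]  # decaying long filters h^n
print(hyena_operator(v, gates, filters).shape)      # (1024,)
```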

Mathematical Formulation

The implicit convolution filters in Hyena are typically parameterized as functions of time. For each filter h^n, the response at time position t is given by:

h_t^n = Window(t) · (FFN ∘ PositionalEncoding)(t), where ∘ is the composition operator, meaning that the positional encoding is first applied to t and then processed by the FFN.

Here, the window function serves to modulate the filter (for example, by imposing an exponential decay), and the feed-forward network (FFN) together with positional encodings generate the filter values. This implicit parameterization is a key design choice that allows Hyena to capture long-range dependencies without a proportional increase in parameter count.
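
The sketch below illustrates this idea, assuming a simple sinusoidal positional encoding, a tiny feed-forward network with random (untrained) weights, and an exponential-decay window; the shapes and decay rate are arbitrary choices for illustration.

```python
import numpy as np

def positional_encoding(L: int, dim: int = 8) -> np.ndarray:
    """Sinusoidal features of the time index t, shape (L, dim)."""
    t = np.arange(L)[:, None]
    freqs = np.pi * 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(t / L * freqs), np.cos(t / L * freqs)], axis=1)

def implicit_filter(L: int, dim: int = 8, decay: float = 0.005, seed: int = 0) -> np.ndarray:
    """h_t = Window(t) · FFN(PositionalEncoding(t)); the filter length never enters the parameter count."""
    rng = np.random.default_rng(seed)
    W1, W2 = rng.normal(size=(dim, 16)), rng.normal(size=(16, 1))   # tiny FFN parameters
    feats = positional_encoding(L, dim)
    ffn_out = np.tanh(feats @ W1) @ W2                              # (L, 1) raw filter response
    window = np.exp(-decay * np.arange(L))[:, None]                 # exponential-decay window
    return (window * ffn_out).ravel()                               # filter values h_0 … h_{L-1}

h = implicit_filter(L=4096)     # a 4096-tap filter synthesized from a handful of FFN parameters
print(h.shape, h[:5])
```

Note that the parameter count (the small FFN weights) does not grow with the filter length L, which is exactly the decoupling described above.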

Efficiency and Scalability

By replacing the quadratic self-attention mechanism with a sequence of FFT-based convolutions and element-wise multiplications, the Hyena operator achieves an overall time complexity of O(N L log L), where N is the number of recurrence steps and L is the sequence length. This subquadratic scaling is particularly advantageous for long sequences, allowing the model to process inputs that are orders of magnitude longer than those feasible with conventional attention.

The operations in the Hyena model—both the implicit convolutions and the gating functions—are highly parallelizable and amenable to optimization on modern hardware accelerators. Techniques such as fast Fourier transforms (FFT) further enhance the efficiency, making the model well-suited for large-scale applications where both speed and memory efficiency are critical.
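
The practical gap between direct and FFT-based convolution can be seen with a small timing experiment; the snippet below is illustrative only, and absolute timings depend on hardware and library versions.

```python
import time
import numpy as np

L = 1 << 15                                    # 32,768 samples
rng = np.random.default_rng(0)
h, z = rng.normal(size=L), rng.normal(size=L)

t0 = time.perf_counter()
np.convolve(h, z)                              # direct convolution: roughly L*L multiply-adds
t1 = time.perf_counter()
np.fft.irfft(np.fft.rfft(h, 2 * L) * np.fft.rfft(z, 2 * L), 2 * L)  # FFT convolution: O(L log L)
t2 = time.perf_counter()
print(f"direct: {t1 - t0:.3f}s   fft: {t2 - t1:.5f}s")
```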

Comparison with Transformer Models

While Transformer models use self-attention to achieve a global receptive field, this comes at the cost of quadratic complexity with respect to the sequence length. In contrast, the Hyena model achieves a similar global context through its recurrence of long convolutions and gating, but with much lower computational cost. This makes Hyena a promising alternative in settings where long-range dependencies need to be modeled efficiently.

Aspect | Hyena Model | Transformer
Computational complexity | O(N L log L) (subquadratic) | O(L²) (quadratic)
Memory footprint | Lower; uses FFT-based convolutions and implicit filters | Higher; requires storing the full self-attention matrix
Global context | Yes; achieved via interleaved implicit convolutions and gating | Yes; achieved through dense pairwise interactions in self-attention
Scalability to long sequences | Highly efficient; can process sequences of millions of tokens (e.g., genomic data) | Limited by quadratic scaling; effective only up to a few thousand tokens
Parameter scaling | Decoupled from filter length due to implicit parameterization | Parameter count fixed and independent of sequence length, but compute and memory scale poorly with length
Speed on long sequences | Significantly faster (e.g., 160× faster at 1M tokens in certain cases) | Slower due to quadratic cost in computation and memory
Hardware utilization | High; FFT and element-wise gating are highly parallelizable | Optimized for dense matrix operations, but efficiency drops with very long sequences

[8]

Model Architectures

Model | Architecture & Tokenization | Key Features & Benefits
DNABERT[1] | Transformer-based BERT architecture with overlapping k-mer tokenization | Converts DNA into "words" (k-mers) to capture local and long-range context; utilizes self-attention for deep, context-aware embeddings; effective for gene annotation and mutation detection.
DNABERT2[9] | Refined transformer-based architecture replacing fixed k-mer tokenization with Byte Pair Encoding (BPE) | Dynamically segments DNA into variable-length subunits, reducing computational costs; offers a more flexible representation that captures complex patterns, improving scalability and performance in genomic analyses.
HyenaDNA[2] | Hyena[8]-based architecture built on implicit convolutions with single-nucleotide (character) tokenization | Processes DNA at single-nucleotide resolution by replacing traditional self-attention with efficient convolutional operators; scales subquadratically with sequence length, handling sequences up to one million tokens and training up to 160× faster.
StripedHyena[3] | Advanced variant of the Hyena architecture integrating rotary self-attention layers with gated convolution operators | Combines the efficiency of implicit convolutions with targeted pattern recall from attention mechanisms; maintains single-nucleotide tokenization and supports sequences up to one million tokens with enhanced training speed and scalability.

  • DNABERT[1] adapts the transformer-based BERT architecture for genomic sequence analysis by converting DNA into overlapping k-mer tokens. In this setup, each k-mer functions like a “word” in a sentence, allowing the model to capture both local patterns and longer-range contextual relationships within the sequence. This approach leverages self-attention to build deep, context-aware embeddings of genomic data, facilitating tasks such as gene annotation and mutation detection.
  • DNABERT2[9] refines the original architecture by replacing fixed k-mer tokenization with Byte Pair Encoding (BPE). Unlike k-mer tokenization, BPE dynamically segments DNA sequences into variable-length subunits, resulting in a more flexible and efficient representation. This not only reduces computational costs but also enhances the model's ability to capture complex patterns in the data, improving its scalability and overall performance in genomic analyses. (A toy illustration of the BPE merging procedure appears after this list.)
  • HyenaDNA[2] is a model that leverages a Hyena[8]-based architecture—built on implicit convolutions—to process DNA at single nucleotide resolution. By replacing traditional self-attention with highly efficient convolutional operators, HyenaDNA scales sub-quadratically with sequence length. This efficiency allows it to model extraordinarily long genomic sequences—up to one million tokens—while training up to 160× faster than standard transformer models. Its single-character tokenizer ensures that every nucleotide is represented without loss of resolution, capturing long-range dependencies crucial for understanding genomic regulation.
  • StripedHyena[3] is an advanced variant of the Hyena architecture that enhances sequence modeling by integrating specialized components such as rotary self-attention layers with its core gated convolution operators. This hybrid design combines the benefits of efficient implicit convolutions with targeted pattern recall from attention mechanisms, further improving training speed and scalability. Like HyenaDNA, StripedHyena supports single nucleotide tokenization and can handle sequences as long as one million tokens, making it exceptionally well-suited for large-scale genomic datasets and long-range interaction analysis.
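
To make the contrast between tokenization schemes concrete, the toy sketch below applies a hand-rolled byte-pair-encoding procedure to a short DNA string. It mimics the idea behind DNABERT2's tokenizer, but not its actual learned vocabulary or merge rules, which are derived from a large multi-species corpus.

```python
from collections import Counter

def bpe_merges(sequence: str, num_merges: int = 3):
    """Toy BPE on a DNA string: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(sequence)                        # start from single-nucleotide tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)               # apply the new merge rule
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges("ATATCGATATCGGC", num_merges=3)
print(merges)   # ['AT', 'ATAT', 'ATATC'] for this toy input
print(tokens)   # ['ATATC', 'G', 'ATATC', 'G', 'G', 'C']: variable-length subunits
```

Frequent subsequences collapse into single tokens, shortening the input and letting the model allocate capacity to less predictable regions.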

Advantages

Autonomous Pattern Recognition: LLMs are capable of learning intricate patterns within genomic sequences. They excel at detecting subtle regulatory elements such as motifs, enhancers, and transcription factor binding sites. This automated recognition eliminates the need for manual feature engineering, thereby reducing human bias and accelerating the discovery process.

Efficient Feature Extraction: By pre-training on vast amounts of genomic data, LLMs automatically extract essential features from DNA sequences. This efficiency in feature extraction allows them to identify important genomic markers without relying on hand-crafted features. As a result, researchers can focus on downstream analysis rather than the labor-intensive process of designing features.

Transferability and Fine-Tuning: Pre-trained models encapsulate universal genomic features that can be fine-tuned for specific applications—such as mutation detection, gene annotation, or regulatory element prediction—with relatively little additional data. This transfer learning capability enables rapid adaptation to new challenges and facilitates the development of versatile diagnostic and research tools.[10][11]
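
A sketch of this fine-tuning workflow using the Hugging Face transformers API is shown below. The checkpoint name is a placeholder for whichever pre-trained genomic model is being adapted, and the two-sequence promoter dataset is invented purely for illustration.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder checkpoint name: substitute the actual pre-trained DNA model of interest.
checkpoint = "some-org/pretrained-dna-lm"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # e.g. promoter vs. non-promoter

# Tiny invented dataset: DNA windows paired with binary labels.
sequences = ["TATAAAAGGCGCGTACGT", "GCGCGATCGATTACAGGA"]
labels = torch.tensor([1, 0])
encodings = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

class DNADataset(Dataset):
    def __init__(self, enc, y):
        self.enc, self.y = enc, y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, i):
        return {**{k: v[i] for k, v in self.enc.items()}, "labels": self.y[i]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=DNADataset(encodings, labels),
)
trainer.train()   # adapts the pre-trained backbone plus a small classification head
```

In practice the same pre-trained backbone can be reused across tasks by swapping the task-specific head and fine-tuning on modest amounts of labeled data.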

Applications

Regulatory Element Identification: One of the primary applications of DNA LLMs is in the identification of regulatory elements. Regulatory elements such as promoters, enhancers, and silencers are crucial for controlling gene expression, and their precise location in the genome can greatly influence cellular function[12][13]. Models like DNABERT and DNABERT2 have been fine-tuned to predict these regions, enabling researchers to annotate genomes more accurately. By learning the patterns associated with active regulatory sites, these models offer improved detection capabilities over traditional sequence alignment methods, providing a deeper understanding of transcriptional regulation.

Transcription Factor Binding Site Prediction: DNA LLMs play an important role in predicting transcription factor binding sites (TFBS)[14]. Transcription factors are proteins that bind to specific regions in the DNA to regulate gene expression, and identifying their binding sites is essential for mapping gene regulatory networks[15]. These models capture subtle nucleotide-level features that indicate potential TFBS, offering insights into protein–DNA interactions. The enhanced resolution of models like HyenaDNA allows for a more detailed examination of how these interactions are modulated, which is crucial for understanding cellular responses and disease mechanisms.

Epigenetic Modification and Chromatin State Analysis: DNA LLMs are also applied to the prediction of epigenetic modifications and the analysis of chromatin states. Epigenetic marks, including DNA methylation and various histone modifications, influence the structure of chromatin and, consequently, gene expression[16]. DNA LLMs can be fine-tuned to predict these modifications by recognizing the sequence features that correlate with specific epigenetic states. This capability not only aids in understanding how genes are turned on or off but also provides valuable insights into how epigenetic alterations may contribute to diseases, making these models powerful tools in both research and clinical settings[17].

Variant Effect and Mutation Impact Prediction: The fine granularity offered by single nucleotide resolution is particularly beneficial for assessing the impact of genetic variants[18]. DNA LLMs, especially those designed for long-range context such as HyenaDNA, can evaluate the functional consequences of single nucleotide polymorphisms (SNPs) and other mutations. By predicting how specific alterations in the DNA sequence might disrupt gene function or regulatory processes, these models support efforts in precision medicine and disease research. They can, for example, help determine whether a particular mutation is likely to be deleterious, thereby guiding further experimental investigation and clinical decision-making[19].
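
One common zero-shot strategy, sketched below, scores a single-nucleotide variant by how much the model's embedding of a sequence window changes when the reference base is swapped for the alternate base. The checkpoint name is again a placeholder, and this heuristic is an illustration rather than the specific scoring method of any model discussed above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "some-org/pretrained-dna-lm"      # placeholder pre-trained genomic language model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding of a DNA window."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)

window = "GGCTATCACGTTAGCCA"                   # invented reference window centred on the variant
ref, alt = window, window[:8] + "A" + window[9:]   # C>A substitution at the centre position

# Larger embedding shift = larger predicted functional impact (a crude but common heuristic).
score = 1 - torch.nn.functional.cosine_similarity(embed(ref), embed(alt), dim=0)
print(f"variant effect score: {score.item():.4f}")
```

Likelihood-based variants of this idea compare the model's predicted probabilities for the reference and alternate alleles instead of embedding distances.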

Challenges and Limitations

Computational Complexity: Genomic datasets are vast, and training or performing inference on such data requires substantial computational resources. This is particularly pronounced when dealing with models designed to process extremely long sequences. The computational cost not only affects the training time but also limits the feasibility of real-time analysis and the deployment of these models in resource-constrained environments.

Data Bias and Generalization: DNA LLMs depend heavily on the quality and diversity of their training data. There is a risk that these models may inadvertently learn biases present in the training datasets, which can result in suboptimal performance when generalizing to unseen genomic sequences. This challenge is compounded by the complexity and variability of genomic data, where even small discrepancies can lead to significant differences in biological function.

Interpretability: Unlike traditional bioinformatics tools that often provide clear, rule-based insights, deep learning models tend to operate as "black boxes." The opacity in their decision-making processes makes it difficult to ascertain the specific reasons behind their predictions. This lack of transparency can be a significant drawback, especially in applications such as clinical diagnostics or research, where understanding the underlying rationale is as important as the prediction itself.


References

  1. ^ a b c d Ji, Yanrong; Zhou, Zhihan; Liu, Han; Davuluri, Ramana V (2021-08-09). "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome". Bioinformatics. 37 (15): 2112–2120. doi:10.1093/bioinformatics/btab083. ISSN 1367-4803. PMC 11025658. PMID 33538820.
  2. ^ a b c d Nguyen, Eric; Poli, Michael; Faizi, Marjan; Thomas, Armin; Birch-Sykes, Callum; Wornow, Michael; Patel, Aman; Rabideau, Clayton; Massaroli, Stefano (2023-11-14), "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution", arXiv:2306.15794, PMC 10327243, PMID 37426456
  3. ^ a b c togethercomputer/stripedhyena, Together, 2025-02-25, retrieved 2025-02-26
  4. ^ Liu, Guanqing (December 6, 2024). "Initial impact of the sequencing of the human genome".
  5. ^ evo-design/evo, Laboratory of Evolutionary Design, 2025-03-07, retrieved 2025-03-07
  6. ^ Liu, Guanqing (December 6, 2024). "Sequence modeling and design from molecular to genome scale with Evo". Science. Vol. 386, no. 6723. doi:10.1126/science.ado9336.
  7. ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2023-08-02), Attention Is All You Need, arXiv:1706.03762
  8. ^ a b c d e Poli, Michael; Massaroli, Stefano; Nguyen, Eric; Fu, Daniel Y.; Dao, Tri; Baccus, Stephen; Bengio, Yoshua; Ermon, Stefano; Ré, Christopher (2023-04-19), Hyena Hierarchy: Towards Larger Convolutional Language Models, arXiv:2302.10866
  9. ^ a b Zhou, Zhihan; Ji, Yanrong; Li, Weijian; Dutta, Pratik; Davuluri, Ramana; Liu, Han (2024-03-18), DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome, arXiv:2306.15006
  10. ^ Sarumi, Oluwafemi (September 30, 2024). "Large language models and their applications in bioinformatics".
  11. ^ Benegas, Gonzalo (July 16, 2024). "Genomic Language Models: Opportunities and Challenges".
  12. ^ Rojano, E.; Seoane, P.; Ranea JAG; Perkins, J. R. (December 6, 2024), "Regulatory variants: from detection to predicting impact", Briefings in Bioinformatics, 20 (5): 1639–1654, doi:10.1093/bib/bby039, PMC 6917219, PMID 29893792
  13. ^ Worsley-Hunt, R.; Bernard, V.; Wasserman, W. W. (December 6, 2024), "Identification of cis-regulatory sequence variations in individual genome sequences", Genome Medicine, 3 (10): 65, doi:10.1186/gm281, PMC 3239227, PMID 21989199
  14. ^ Worsley-Hunt, R.; Bernard, V.; Wasserman, W. W. (December 6, 2024), "Identifying regulatory elements in eukaryotic genomes", Genome Medicine, 3 (10): 65, doi:10.1186/gm281, PMC 3239227, PMID 21989199
  15. ^ He, H.; Yang, M.; Li, S.; Zhang, G.; Ding, Z.; Zhang, L.; Shi, G.; Li, Y. (December 6, 2024), "Mechanisms and biotechnological applications of transcription factors", Synthetic and Systems Biotechnology, 8 (4): 565–577, doi:10.1016/j.synbio.2023.08.006, PMC 10482752, PMID 37691767
  16. ^ Moore, Lisa (July 11, 2012). "DNA Methylation and Its Basic Function". Neuropsychopharmacology. Vol. 38.
  17. ^ Nguyen, Eric; Poli, Michael; Faizi, Marjan; Thomas, Armin; Birch-Sykes, Callum; Wornow, Michael; Patel, Aman; Rabideau, Clayton; Massaroli, Stefano; Bengio, Yoshua; Ermon, Stefano; Baccus, Stephen A.; Ré, Chris (December 6, 2024), HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, arXiv:2306.15794
  18. ^ Kwok, P. Y.; Chen, X. (December 6, 2024), "Detection of single nucleotide polymorphisms", Current Issues in Molecular Biology, 5 (2): 43–60, PMID 12793528
  19. ^ Nguyen, Eric; Poli, Michael; Faizi, Marjan; Thomas, Armin; Birch-Sykes, Callum; Wornow, Michael; Patel, Aman; Rabideau, Clayton; Massaroli, Stefano; Bengio, Yoshua; Ermon, Stefano; Baccus, Stephen A.; Ré, Chris (December 6, 2024), HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, arXiv:2306.15794