Draft:Large Language Models on DNA
Introduction
Large Language Models (LLMs) were originally developed for natural language processing. A growing class of LLMs is now being developed to interpret genomic data by treating DNA as a "biological language."[1] In this framing, the four nucleotide bases, adenine (A), guanine (G), cytosine (C), and thymine (T), form an alphabet, and short subsequences of bases play the role of words in a sentence. Leveraging self-supervised learning and self-attention mechanisms, these models build deep, tokenized embeddings of DNA sequences, revealing patterns that traditional methods such as alignment-based tools or motif analyses often miss.
Unlike rule-based approaches, these models excel at uncovering long-range dependencies and subtle contextual relationships in genomic data, making them particularly useful for tasks such as gene annotation, functional genomics, and evolutionary studies. An early landmark in this field was DNABERT, introduced by Ji et al. in 2021, which adapted BERT's bidirectional attention to genomic sequences through k-mer tokenization.
Since then, models such as HyenaDNA and StripedHyena have extended the ability to process extremely long genomic sequences, up to 1 million tokens, without sacrificing computational efficiency. Meanwhile, the Evo model, a 7-billion-parameter architecture trained on 2.7 million prokaryotic and phage genomes, uses self-supervised learning to predict evolutionary dynamics and identify regulatory elements important for microbial adaptation.
These approaches provide new insights into gene function, mutation patterns, and regulatory mechanisms, and are expected to find further application in genetic medicine, disease research, and synthetic biology by offering a more integrative view of genomic complexity.
Background
Deoxyribonucleic acid (DNA) is the molecule that carries genetic information essential for the development and functioning of organisms. This information is stored as a code composed of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The human genome consists of approximately 3 billion bases, with only about 1% encoding proteins, while the remaining 99% consists of noncoding regions. These noncoding regions include regulatory elements such as enhancers and promoters, which play crucial roles in gene expression and cellular function.
Historically, computational genomics relied on rule-based algorithms such as BLAST (Basic Local Alignment Search Tool) for sequence alignment, and Hidden Markov Models (HMMs) for motif detection. While effective for many tasks, these methods struggled with capturing long-range dependencies in DNA sequences. Early machine learning models improved upon these approaches by enabling tasks such as gene classification, but they lacked the complexity needed for capturing intricate genomic patterns.
The emergence of deep learning and Large Language Models (LLMs) has transformed DNA sequence analysis by providing a global and transferable understanding of genomic sequences. LLMs are deep learning-based AI models originally designed for processing and generating human-like text. They function by tokenizing inputs, converting sequences into numerical representations, and are trained on massive datasets using self-supervised learning to recognize patterns in sequential data. This ability allows LLMs to model complex genomic interactions by leveraging upstream and downstream nucleotide contexts.
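The tokenization step can be illustrated with a short example. The following Python sketch is illustrative only (the function names, the choice of k, and the vocabulary handling are not taken from any particular model): it splits a DNA sequence into overlapping k-mers and maps each one to an integer ID of the kind consumed by a model's embedding layer.

```python
# Illustrative sketch: tokenize a DNA sequence into overlapping k-mers and
# convert them to integer IDs (the numerical representation a language
# model's embedding layer expects). Not taken from any specific model.
from itertools import product

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Return the overlapping k-mers of a DNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocab(k: int = 3) -> dict[str, int]:
    """Enumerate all 4**k possible k-mers and assign each an integer ID."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

sequence = "ATGCGTAC"
tokens = kmer_tokenize(sequence, k=3)   # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
vocab = build_vocab(k=3)
token_ids = [vocab[t] for t in tokens]  # numerical representation of the sequence
print(tokens)
print(token_ids)
```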
In 2021, DNABERT became one of the first LLMs specifically designed for DNA sequences, utilizing k-mer tokenization to adapt BERT's bidirectional attention mechanism for genomic data. Building on this foundation, HyenaDNA and StripedHyena introduced memory-efficient architectures capable of processing genomic sequences up to 1 million tokens in length. Meanwhile, Evo, a 7-billion-parameter model trained on over 2.7 million prokaryotic and phage genomes, has demonstrated zero-shot function prediction and generative capabilities, uncovering evolutionary patterns and aiding pathogen surveillance.
These advancements mark a paradigm shift in genomics, moving from rule-based and alignment-heavy methods to deep learning-driven sequence analysis. By leveraging self-attention mechanisms and scalable architectures, LLMs have opened new avenues for research in functional genomics, evolutionary biology, and personalized medicine, fundamentally redefining how scientists interpret the vast complexity of genetic information.
Scientific Principles and Mechanisms
Large language models can be built on different architectures for processing and predicting sequence data. In genomics, where inputs are long and functionally related elements may lie far apart in the sequence, the Transformer and the Hyena hierarchy are the most commonly used frameworks.
Transformers
Transformers are a type of deep learning model introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need".[2] They represent a significant departure from traditional recurrent neural networks (RNNs) by relying entirely on self-attention mechanisms rather than sequential processing. This self-attention allows transformers to evaluate the relationships between all tokens in an input sequence simultaneously, enabling them to capture long-range dependencies more effectively.
The architecture of transformers is based on an encoder-decoder structure, where the encoder processes the input data to generate a set of continuous representations and the decoder uses these representations to generate output sequences. Key components of the transformer model include multi-head self-attention, positional encoding, and feed-forward neural networks. These features have made transformers the foundation for state-of-the-art models in natural language processing, such as BERT, GPT, and many others, and they are increasingly being applied across various domains including computer vision, genomics, and beyond.
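The core computation can be illustrated with a minimal single-head example. The sketch below implements scaled dot-product self-attention in the form described by Vaswani et al.; the toy dimensions and random weights are placeholders for learned parameters.

```python
# Minimal single-head scaled dot-product self-attention (Vaswani et al., 2017).
# Random weights stand in for learned projection matrices.
import numpy as np

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """X: (seq_len, d_model). Returns contextualized outputs of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise token interactions, (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                  # every token attends to every other token

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                                 # e.g. 6 tokens embedded in 8 dimensions
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
print(self_attention(X, Wq, Wk, Wv).shape)              # (6, 8)
```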
Hyena Hierarchy
The Hyena model is a neural network architecture that was developed to address the scalability issues associated with traditional self‐attention mechanisms.[3] It is designed to efficiently handle very long sequences by replacing the quadratic-complexity self‐attention with a subquadratic operator that interleaves implicit long convolutions with data-controlled gating.
Motivation and Context
Traditional Transformer models rely on self-attention to allow each token in a sequence to interact with every other token. Although this mechanism is highly effective for capturing dependencies, its computational cost scales quadratically, O(L²), with the sequence length L. This quadratic scaling creates significant challenges when processing long sequences, such as entire documents, long time series, or high-resolution images.
The need for more efficient models that can process long-range dependencies has led researchers to explore alternatives that reduce computational and memory requirements. The Hyena model was introduced as a drop-in replacement for self-attention, aiming to maintain the global receptive field and expressive power of attention while scaling subquadratically with sequence length.
Architecture
At the core of the Hyena model is the concept of implicit long convolutions. Traditional convolutions use fixed kernels that are explicitly defined and stored, resulting in a parameter count that scales linearly with the kernel size. In contrast, Hyena generates convolutional filters implicitly using a parameterized function—typically implemented as a small feed-forward network. This allows the model to synthesize long filters on the fly, effectively decoupling the filter length from the number of parameters.
In addition to implicit convolutions, the Hyena operator incorporates data-controlled multiplicative gating. In this mechanism, each token is modulated by gating signals that are derived from learned linear projections of the input. The gating operation is performed element-wise and serves to dynamically adjust the influence of the convolutional output, effectively tailoring the operator to the specific input context.
The overall Hyena operator is defined as a recurrence that alternates between implicit long convolutions and element-wise gating. For an order-N Hyena operator, the recurrence is expressed as follows:
- z^1_t = v_t, where v is one of the linear projections of the input.
- For n = 1, …, N:
- z^{n+1}_t = x^n_t · (h^n ∗ z^n)_t, where x^n represents a gating projection and h^n is an implicitly parameterized long convolution filter (∗ denotes convolution).
- The final output is given by y_t = z^{N+1}_t.
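The recurrence above can be written directly in code. The following minimal Python sketch uses random placeholder projections and filters (in the actual model these are learned, and the filters are generated implicitly as described in the next section) to show how element-wise gating and FFT-based long convolutions alternate.

```python
# Minimal sketch of an order-N Hyena recurrence: alternate long causal
# convolutions with element-wise (data-controlled) gating.
# Projections and filters are random placeholders, not learned parameters.
import numpy as np

def causal_conv(h: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Causal convolution (h * z)_t = sum_{s<=t} h[t-s] * z[s], computed via FFT."""
    L = len(z)
    n = 2 * L                                    # zero-pad to avoid circular wrap-around
    out = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(z, n), n)
    return out[:L]

def hyena_operator(v, x, h):
    """v: projection of the input (length L); x: list of N gating projections;
    h: list of N long convolution filters. Returns the operator output y."""
    z = v                                        # z^1 = v
    for x_n, h_n in zip(x, h):                   # for n = 1, ..., N
        z = x_n * causal_conv(h_n, z)            # z^{n+1} = x^n (element-wise) * (h^n * z^n)
    return z                                     # y = z^{N+1}

rng = np.random.default_rng(0)
L, N = 16, 2                                     # sequence length and operator order
v = rng.normal(size=L)
x = [rng.normal(size=L) for _ in range(N)]       # data-controlled gates
h = [rng.normal(size=L) for _ in range(N)]       # long filters, as long as the sequence
print(hyena_operator(v, x, h).shape)             # (16,)
```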
Mathematical Formulation
The implicit convolution filters in Hyena are typically parameterized as functions of time. For each filter h^n, the response at time t is given by:

h^n_t = Window(t) · (FFN ∘ PositionalEncoding)(t)

Here, the window function Window(t) serves to modulate the filter (for example, by imposing an exponential decay), and the feed-forward network (FFN), applied to positional encodings of t, generates the filter values. This implicit parameterization is a key design choice that allows Hyena to capture long-range dependencies without a proportional increase in parameter count.
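A rough illustration of this parameterization is sketched below; the network sizes, decay rate, and positional features are illustrative choices rather than the exact ones used in Hyena, and the FFN weights are random stand-ins for learned parameters.

```python
# Illustrative implicit filter: h_t = Window(t) * FFN(PositionalEncoding(t)).
# A long filter is produced by a small network, so filter length is decoupled
# from parameter count. Sizes and the decay rate are illustrative choices.
import numpy as np

def positional_encoding(t: np.ndarray, n_feats: int = 8) -> np.ndarray:
    """Sinusoidal features of the normalized time index t, shape (L, n_feats)."""
    freqs = np.arange(1, n_feats // 2 + 1)
    angles = np.outer(t, freqs) * 2 * np.pi
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def implicit_filter(L: int, n_feats: int = 8, hidden: int = 16, decay: float = 4.0) -> np.ndarray:
    """Generate a length-L filter from a tiny random FFN and an exponential-decay window."""
    rng = np.random.default_rng(0)
    t = np.arange(L) / L                          # normalized time index in [0, 1)
    feats = positional_encoding(t, n_feats)       # (L, n_feats)
    W1 = rng.normal(size=(n_feats, hidden))
    W2 = rng.normal(size=(hidden, 1))
    raw = np.tanh(feats @ W1) @ W2                # FFN applied independently at each t
    window = np.exp(-decay * t)                   # window modulating the filter (exponential decay)
    return window * raw[:, 0]

h = implicit_filter(L=1024)                       # a 1024-tap filter from ~150 parameters
print(h.shape)                                    # (1024,)
```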
Efficiency and Scalability
By replacing the quadratic self-attention mechanism with a sequence of FFT-based convolutions and element-wise multiplications, the Hyena operator achieves an overall time complexity of O(N L log₂ L), where N is the number of recurrence steps (the order of the operator) and L is the sequence length. This subquadratic scaling is particularly advantageous for long sequences, allowing the model to process inputs that are orders of magnitude longer than those feasible with conventional attention.
The operations in the Hyena model—both the implicit convolutions and the gating functions—are highly parallelizable and amenable to optimization on modern hardware accelerators. Techniques such as fast Fourier transforms (FFT) further enhance the efficiency, making the model well-suited for large-scale applications where both speed and memory efficiency are critical.
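The efficiency argument can be made concrete by computing the same causal convolution two ways, directly in O(L²) time and via the FFT in O(L log L) time. The sketch below is illustrative only; the measured times depend on hardware.

```python
# Compare a direct O(L^2) causal convolution with the FFT-based O(L log L)
# route used by Hyena-style operators; both compute the same result.
import time
import numpy as np

def conv_direct(h: np.ndarray, z: np.ndarray) -> np.ndarray:
    L = len(z)
    return np.array([np.dot(h[:t + 1][::-1], z[:t + 1]) for t in range(L)])  # O(L^2)

def conv_fft(h: np.ndarray, z: np.ndarray) -> np.ndarray:
    L, n = len(z), 2 * len(z)
    return np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(z, n), n)[:L]        # O(L log L)

rng = np.random.default_rng(0)
L = 1 << 14                                  # a 16,384-element sequence
h, z = rng.normal(size=L), rng.normal(size=L)

t0 = time.perf_counter()
y_direct = conv_direct(h, z)
t_direct = time.perf_counter() - t0

t0 = time.perf_counter()
y_fft = conv_fft(h, z)
t_fft = time.perf_counter() - t0

print(np.allclose(y_direct, y_fft))                       # both routes agree
print(f"direct: {t_direct:.3f} s, fft: {t_fft:.4f} s")    # FFT route is far faster for large L
```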
Comparison with Transformer Models
While Transformer models use self-attention to achieve a global receptive field, this comes at the cost of quadratic complexity with respect to the sequence length. In contrast, the Hyena model achieves a similar global context through its recurrence of long convolutions and gating, but with much lower computational cost. This makes Hyena a promising alternative in settings where long-range dependencies need to be modeled efficiently.
Model Architectures
- DNABERT[4] adapts the transformer-based BERT architecture for genomic sequence analysis by converting DNA into overlapping k-mer tokens. In this setup, each k-mer functions like a “word” in a sentence, allowing the model to capture both local patterns and longer-range contextual relationships within the sequence. This approach leverages self-attention to build deep, context-aware embeddings of genomic data, facilitating tasks such as gene annotation and mutation detection.
- DNABERT2[5] refines the original architecture by replacing fixed k-mer tokenization with Byte Pair Encoding (BPE). Unlike k-mer tokenization, BPE dynamically segments DNA sequences into variable-length subunits, resulting in a more flexible and efficient representation (a toy contrast of the two tokenization schemes is sketched after this list). This not only reduces computational costs but also enhances the model's ability to capture complex patterns in the data, improving its scalability and overall performance in genomic analyses.
- HyenaDNA[6] is a model that leverages a Hyena[3]-based architecture—built on implicit convolutions—to process DNA at single nucleotide resolution. By replacing traditional self-attention with highly efficient convolutional operators, HyenaDNA scales sub-quadratically with sequence length. This efficiency allows it to model extraordinarily long genomic sequences—up to one million tokens—while training up to 160× faster than standard transformer models. Its single-character tokenizer ensures that every nucleotide is represented without loss of resolution, capturing long-range dependencies crucial for understanding genomic regulation.
- StripedHyena[7] is an advanced variant of the Hyena architecture that enhances sequence modeling by integrating specialized components such as rotary self-attention layers with its core gated convolution operators. This hybrid design combines the benefits of efficient implicit convolutions with targeted pattern recall from attention mechanisms, further improving training speed and scalability. Like HyenaDNA, StripedHyena supports single nucleotide tokenization and can handle sequences as long as one million tokens, making it exceptionally well-suited for large-scale genomic datasets and long-range interaction analysis.
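The difference between fixed k-mer tokenization and learned BPE segmentation can be illustrated with a toy example. The sketch below is not the actual DNABERT or DNABERT-2 tokenizer; it simply contrasts overlapping k-mers with a few greedy BPE merges learned from a tiny illustrative corpus.

```python
# Toy contrast of the two tokenization schemes: fixed overlapping k-mers
# (DNABERT-style) versus BPE-style merges learned from data (as in DNABERT-2).
# The corpus and merge count are illustrative, not real training data.
from collections import Counter

def kmers(seq: str, k: int = 6) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def learn_bpe_merges(corpus: list[str], n_merges: int = 5) -> list[tuple[str, str]]:
    """Greedily merge the most frequent adjacent token pair, BPE-style."""
    sequences = [list(seq) for seq in corpus]          # start from single nucleotides
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for toks in sequences:
            pairs.update(zip(toks, toks[1:]))          # count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for toks in sequences:                         # apply the merge in place
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

corpus = ["ATGCGATGCGTT", "ATGCATGCGAAA"]
print(kmers(corpus[0], k=6))       # fixed-length overlapping tokens
print(learn_bpe_merges(corpus))    # variable-length subunits learned from frequency
```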
Advantages
Autonomous Pattern Recognition: LLMs are capable of learning intricate patterns within genomic sequences. They excel at detecting subtle regulatory elements such as motifs, enhancers, and transcription factor binding sites. This automated recognition eliminates the need for manual feature engineering, thereby reducing human bias and accelerating the discovery process.
Efficient Feature Extraction: By pre-training on vast amounts of genomic data, LLMs automatically extract essential features from DNA sequences. This efficiency in feature extraction allows them to identify important genomic markers without relying on hand-crafted features. As a result, researchers can focus on downstream analysis rather than the labor-intensive process of designing features.
Transferability and Fine-Tuning: Pre-trained models encapsulate universal genomic features that can be fine-tuned for specific applications—such as mutation detection, gene annotation, or regulatory element prediction—with relatively little additional data. This transfer learning capability enables rapid adaptation to new challenges and facilitates the development of versatile diagnostic and research tools.
Challenges and Limitations
Computational Complexity: Genomic datasets are vast, and training or performing inference on such data requires substantial computational resources. This is particularly pronounced when dealing with models designed to process extremely long sequences. The computational cost not only affects the training time but also limits the feasibility of real-time analysis and the deployment of these models in resource-constrained environments.
Data Bias and Generalization: Model performance depends heavily on the quality and diversity of the training data. There is a risk that these models may inadvertently learn biases present in the training datasets, which can result in suboptimal performance when generalizing to unseen genomic sequences. This challenge is compounded by the complexity and variability of genomic data, where even small discrepancies can lead to significant differences in biological function.
Interpretability: Unlike traditional bioinformatics tools that often provide clear, rule-based insights, deep learning models tend to operate as "black boxes." The opacity in their decision-making processes makes it difficult to ascertain the specific reasons behind their predictions. This lack of transparency can be a significant drawback, especially in applications such as clinical diagnostics or research, where understanding the underlying rationale is as important as the prediction itself.
Applications
Regulatory Element Identification: One of the primary applications of DNA LLMs is in the identification of regulatory elements. Regulatory elements such as promoters, enhancers, and silencers are crucial for controlling gene expression, and their precise location in the genome can greatly influence cellular function. Models like DNABERT and DNABERT2 have been fine-tuned to predict these regions, enabling researchers to annotate genomes more accurately. By learning the patterns associated with active regulatory sites, these models offer improved detection capabilities over traditional sequence alignment methods, providing a deeper understanding of transcriptional regulation.
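In practice, such fine-tuning typically follows a standard sequence-classification recipe. The sketch below shows the general pattern using the Hugging Face transformers library; the checkpoint name, example sequences, and labels are hypothetical placeholders, and real DNA models may require model-specific tokenizers or additional arguments.

```python
# Generic fine-tuning sketch: attach a classification head to a pretrained DNA
# encoder and train it to label sequences (e.g. promoter vs. background).
# "dna-encoder-checkpoint", the sequences, and the labels are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "dna-encoder-checkpoint"   # hypothetical identifier for a pretrained DNA LLM
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2, trust_remote_code=True    # binary: regulatory vs. non-regulatory
)

# Toy labeled examples standing in for a curated training set.
sequences = ["TATAAAGGCGCGCC", "ACGTACGTACGTAA"]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):                                  # a few gradient steps on the toy batch
    outputs = model(**batch, labels=labels)             # loss from the classification head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(epoch, outputs.loss.item())
```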
Transcription Factor Binding Site Prediction: DNA LLMs play an important role in predicting transcription factor binding sites (TFBS). Transcription factors are proteins that bind to specific regions in the DNA to regulate gene expression, and identifying their binding sites is essential for mapping gene regulatory networks. These models capture subtle nucleotide-level features that indicate potential TFBS, offering insights into protein–DNA interactions. The enhanced resolution of models like HyenaDNA allows for a more detailed examination of how these interactions are modulated, which is crucial for understanding cellular responses and disease mechanisms.
Epigenetic Modification and Chromatin State Analysis: DNA LLMs are also applied to the prediction of epigenetic modifications and the analysis of chromatin states. Epigenetic marks, including DNA methylation and various histone modifications, influence the structure of chromatin and, consequently, gene expression. DNA LLMs can be fine-tuned to predict these modifications by recognizing the sequence features that correlate with specific epigenetic states. This capability not only aids in understanding how genes are turned on or off but also provides insights into how epigenetic alterations may contribute to disease, making these models useful tools in both research and clinical settings.
Variant Effect and Mutation Impact Prediction: The fine granularity offered by single nucleotide resolution is particularly beneficial for assessing the impact of genetic variants. DNA LLMs, especially those designed for long-range context such as HyenaDNA and StripedHyena, can evaluate the functional consequences of single nucleotide polymorphisms (SNPs) and other mutations. By predicting how specific alterations in the DNA sequence might disrupt gene function or regulatory processes, these models support efforts in precision medicine and disease research. They can, for example, help determine whether a particular mutation is likely to be deleterious, thereby guiding further experimental investigation and clinical decision-making.
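One common zero-shot approach, not tied to any single model named here, scores a variant by comparing how plausible the model finds the reference and alternate alleles in their sequence context. The sketch below assumes a masked DNA language model with single-nucleotide tokens; the checkpoint name is a hypothetical placeholder, and models that use k-mer or BPE tokens require extra bookkeeping to locate the variant position.

```python
# Sketch of zero-shot variant scoring with a masked DNA language model:
# mask the variant position and compare the log-probability of the reference
# and alternate nucleotides. "dna-mlm-checkpoint" is a hypothetical identifier,
# and single-nucleotide tokenization is assumed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CHECKPOINT = "dna-mlm-checkpoint"   # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT, trust_remote_code=True)
model.eval()

def variant_score(sequence: str, position: int, ref: str, alt: str) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at the masked variant position.
    More negative values suggest the alternate allele is less expected in context."""
    masked = sequence[:position] + tokenizer.mask_token + sequence[position + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

print(variant_score("ACGTACGTACGTACGT", position=8, ref="A", alt="T"))
```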
Comparative and Evolutionary Genomics: DNA LLMs have broad applications in comparative and evolutionary genomics. Models that are pre-trained on multi-species datasets, such as DNABERT2, have the capacity to differentiate between species by recognizing distinct mutational profiles and conserved genomic features. This ability facilitates evolutionary studies and biodiversity assessments by identifying regions of the genome that are under selective pressure. Moreover, the comparative analysis enabled by these models helps in understanding the genetic basis of phenotypic diversity across organisms, which is fundamental to both evolutionary biology and conservation efforts.
References
- ^ Liu, Guanqing (December 6, 2024). "PDLLMs: A group of tailored DNA large language models for analyzing plant genomes".
- ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2023-08-02), Attention Is All You Need, arXiv, doi:10.48550/arXiv.1706.03762, arXiv:1706.03762, retrieved 2025-02-28
- ^ a b Poli, Michael; Massaroli, Stefano; Nguyen, Eric; Fu, Daniel Y.; Dao, Tri; Baccus, Stephen; Bengio, Yoshua; Ermon, Stefano; Ré, Christopher (2023-04-19), Hyena Hierarchy: Towards Larger Convolutional Language Models, arXiv, doi:10.48550/arXiv.2302.10866, arXiv:2302.10866, retrieved 2025-02-27
- ^ Ji, Yanrong; Zhou, Zhihan; Liu, Han; Davuluri, Ramana V (2021-08-09). "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome". Bioinformatics. 37 (15): 2112–2120. doi:10.1093/bioinformatics/btab083. ISSN 1367-4803. PMC 11025658. PMID 33538820.
- ^ Zhou, Zhihan; Ji, Yanrong; Li, Weijian; Dutta, Pratik; Davuluri, Ramana; Liu, Han (2024-03-18), DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome, arXiv:2306.15006, retrieved 2025-02-26
- ^ Nguyen, Eric; Poli, Michael; Faizi, Marjan; Thomas, Armin; Birch-Sykes, Callum; Wornow, Michael; Patel, Aman; Rabideau, Clayton; Massaroli, Stefano (2023-11-14), HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, arXiv:2306.15794, PMID 37426456, retrieved 2025-02-26
- ^ togethercomputer/stripedhyena, Together, 2025-02-25, retrieved 2025-02-26