RARELINK
2025 Genomic Diagnostics Winner
The RareLink project uses deep learning to uncover genetic interactions—specifically epistasis—linked to rare diseases, which traditional methods often miss. By comparing neural activations from DNA sequences containing single and paired variants, the team quantifies epistasis and interprets functional effects, such as loss or gain of protein function. They demonstrate their method on the LMNA gene, associated with congenital muscular dystrophy, revealing how one variant can repress another’s effect. Future directions include improving mechanistic interpretability and building clinician-friendly tools to aid diagnosis and therapeutic development.
PROJECT SUMMARY
Introduction
Although rare diseases cause immense suffering for patients and their families, much about their genetic architecture remains to be discovered. One of the primary challenges is pervasive genetic interaction, a phenomenon known as epistasis [1]. Through epistasis, variants associated with a disease can modify the effects of other variants to create a unique combined effect. Furthermore, single variants are often insufficient to explain the total heritability of a trait, which suggests a need to model interactions [2]. Epistasis has historically been difficult to quantify because the number of interaction terms grows combinatorially, but recent advances in large language modeling for DNA sequences [3, 4] permit detection of these elusive genetic interactions. These interactions are particularly important for autosomal recessive genetic disorders, where two variants are needed for a diagnosis, a requirement that leads to many missed diagnoses.
We built RareLink, a quantitative framework that leverages deep learning to detect genetic interactions among rare disease variants. With our model, we demonstrate the pervasiveness of interactions within rare-disease-associated regions and create a simple statistical test for interactions between variants, which allows loss of function or gain of function through interaction to be quantified. We also created a prototype for mechanistic interpretability of these interactions at the biological level. Altogether, we expect RareLink to help uncover new drug targets that center on variant interactions and to support more nuanced diagnosis that takes multiple variants into account.
Approach
Our approach is based on comparing neural activations between single-variant sequences and double-variant sequences. RareLink uses a sparse autoencoder to return a minimal set of activating features for each mutant sequence [5]. We pass in three mutated sequences (one for each single variant and one carrying both variants) and calculate the differences in activated features between the two single-variant sequences and the double-variant sequence. This allows us to quantify the level of epistasis between two variants in a rare disease gene and provides insight into the gain or loss of function that a pair of variants produces in that gene. We provide a schematic of our approach to the right:
Epistasis scoring predicts an interaction between two disease-associated loci in LMNA
To make our analysis specific to the function of paired variants, we demonstrated the workflow on a single gene, LMNA, which is associated with LMNA-related congenital muscular dystrophy [6], and examined the nature of the epistasis between two of its disease-associated variants. We applied a sparse autoencoder to three sequences passed through GPN-MSA: the first two sequences each held a single variant, and the third held both variants. We then compared the sets of activated and deactivated features across the three sequences, noting which features that activated on variant 1 or variant 2 alone stayed active on the variant-pair sequence and which features newly activated or deactivated on the pair. We observed that features that originally activated on variant 2 alone behave differently on the pair, and that this behavior was not observed in the reverse direction. For these reasons, we argue that variant 1 represses the effect of variant 2 in LMNA.
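The set comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RareLink implementation: the function names, the activation threshold, and the toy activation vectors are all hypothetical, and a feature is treated as "active" simply when its value exceeds the threshold.

```python
def active_features(activations, threshold=0.0):
    """Return the indices of features whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def compare_pair(single_1, single_2, double, threshold=0.0):
    """Partition features by how they behave on the double-variant sequence."""
    f1 = active_features(single_1, threshold)
    f2 = active_features(single_2, threshold)
    fd = active_features(double, threshold)
    return {
        "retained_from_1": f1 & fd,    # active on variant 1 and still active on the pair
        "retained_from_2": f2 & fd,    # active on variant 2 and still active on the pair
        "lost_from_1": f1 - fd,        # active on variant 1 but silent on the pair
        "lost_from_2": f2 - fd,        # active on variant 2 but silent on the pair
        "new_on_pair": fd - (f1 | f2), # active only when both variants are present
    }


# Toy activation vectors (made-up values, not RareLink outputs).
variant_1 = [0.9, 0.0, 0.4, 0.0, 0.0]
variant_2 = [0.0, 0.7, 0.0, 0.5, 0.0]
variant_pair = [0.8, 0.0, 0.3, 0.0, 0.2]

result = compare_pair(variant_1, variant_2, variant_pair)
for name, features in result.items():
    print(name, sorted(features))
```

In this toy case the features of variant 1 survive on the pair while those of variant 2 are lost, which is the kind of one-directional asymmetry that would be read as variant 1 repressing variant 2.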
Next, we apply a transform to our function score to obtain an interpretable epistasis score. We use the hyperbolic tangent transformation f(x) = 10 tanh(x/a), where x is our function score and a is a steepness value that we set to 7 based on our data. This bounds the epistasis score between -10 and 10. We calculate the epistasis score for LMNA as -9.912, which suggests loss of function as a result of the genetic interaction. Negative scores correspond to negative epistasis, where the interaction neutralizes the function of the two mutations. Positive scores correspond to synergistic epistasis, where the interaction produces novel function [7]. To refine our algorithm, we generated a dataset of 1 million variant pairs on chromosome 1 in which both variants fall within a 128 bp window. From this dataset, we can determine a better steepness value for the transform and calculate summary statistics for epistasis scores, which will allow us to assess statistical significance.
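As a small illustration, the transform f(x) = 10 tanh(x/a) can be written as follows. The function name and the input values are hypothetical and chosen only to show the shape of the curve; how the function score itself is computed is not specified here, so it is taken as a given number.

```python
import math


def epistasis_score(function_score, steepness=7.0, scale=10.0):
    """Map an unbounded function score to a bounded epistasis score.

    Implements f(x) = scale * tanh(x / steepness), with the defaults
    scale = 10 and steepness a = 7 taken from the text.
    """
    if steepness <= 0:
        raise ValueError("steepness must be positive")
    return scale * math.tanh(function_score / steepness)


# Illustrative inputs only (not RareLink function scores).
for x in (-20.0, -7.0, 0.0, 7.0, 20.0):
    print(f"function score {x:6.1f} -> epistasis score {epistasis_score(x):7.3f}")
```

A smaller steepness makes the score saturate toward -10 or 10 more quickly, so the choice of a controls how much resolution is kept among strong interactions.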
Future Developments
While language models are powerful tools for studying biological sequences, a consistent obstacle to their use has been interpretability. For this reason, we leverage sparse autoencoders (SAEs) to create interpretable features from the neural activations within intermediate layers. As a proof of concept, we applied SAEs to the GPN-MSA model to extract interpretable features, which we found to correspond to specific DNA sequence motifs. We aim to extend this work to Evo, whose larger input window would allow pairs of variants that lie farther apart to be analyzed. Mechanistic interpretability will pave the way for detailed biological insights into interactions, which may provide new therapeutic targets.
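For readers unfamiliar with sparse autoencoders, the sketch below shows the generic encode and decode steps in the style of the dictionary-learning setup in [5]: features are f = ReLU(W_enc (x - b_dec) + b_enc) and the reconstruction is W_dec f + b_dec. It is a toy in plain Python with hand-picked weights, not a trained model and not RareLink's code; all names and numbers are assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Compute sparse features: ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_activation = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre_activation]


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation: W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


# Toy weights: a 2-dimensional activation expanded into 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

activation = [2.0, -1.0]  # stand-in for one intermediate-layer activation
features = sae_encode(activation, w_enc, b_enc, b_dec)
reconstruction = sae_decode(features, w_dec, b_dec)
print(features, reconstruction)
```

In a trained SAE the weights are learned with a sparsity penalty so that only a few features fire for any input; those few active features are what would be inspected for correspondence to sequence motifs.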
We would also like to build more functionality for clinicians, because finding groups of disease-implicated variants will improve diagnosis, particularly for autosomal recessive disorders. We can accomplish this goal by building on the user-friendly interface we have created and by interviewing clinicians and researchers to understand which features would be most important for our platform.
Citations
[1] Why epistasis is important for tackling complex human disease genetics.
[2] The mystery of missing heritability: Genetic interactions create phantom heritability.
[3] Sequence modeling and design from molecular to genome scale with Evo.
[4] GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.
[5] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.
[6] Laminopathies: One Gene, Two Proteins, Five Diseases.
[7] Evolutionary constraints in fitness landscapes.
MEET THE TEAM
David An
Harvard College
Undergraduate (2027)
Biology
Eren Shin
Talkowski Lab @ MGH
Computational Associate
Computational Biology
Xichen Zhang
University at Buffalo
Undergraduate (2025)
Applied Mathematics
Shivam Gandhi
Harvard Medical School
Graduate Student
Genetics & AI