A new method for genomics analysis doesn’t require reference data

The framework, SPLASH, can detect strain-defining viral mutations, alternative splicing, and other genetic variations using much less computing power than traditional methods.
Elizabeth Gribkoff
December 14, 2023

In 2003, scientists finished sequencing almost all of the three billion nucleotide base pairs that make up the human genome. This feat led to an explosion in genomics analysis, which to this day, for humans and other species alike, relies on aligning sequencing data to a “reference genome,” a composite assembled from the DNA of different individuals of the same species.

Now, researchers at Stanford University and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a genomics analysis framework, SPLASH, that directly analyzes raw sequencing samples, eliminating the need for reference data. The method can perform genomic analyses more quickly and with less computing power than traditional methods. SPLASH should prove especially useful for analyzing genomes of understudied or rapidly mutating species.

In a study published earlier this month in Cell, the team showed that the framework can detect different strains of SARS-CoV-2 and find sequence diversity in adaptive immune receptors, among other applications. Kaitlin Chaung and Tavor Baharav, former PhD students at Stanford University, were co-first authors on the paper, and Julia Salzman, associate professor of biomedical science and biochemistry at Stanford, was the lead author. All research was performed in Salzman's group, whose lab combines statistics and genomics.

“A lot of sequencing analysis is done with implicit priors, meaning that your pipeline is only going to identify the one feature that it was designed to find,” said Baharav, who is now an Eric and Wendy Schmidt Center postdoctoral fellow. “With SPLASH, we’ve developed a method for unbiased, reference-free hypothesis generation.”

From alignment- to statistics-first

While genomics has revolutionized both medicine and ecology, its dependence on reference genomes has limitations. For example, only 5% of mammalian species have had their genomes sequenced, a percentage that drops even further for organisms like bacteria and viruses. Additionally, because the human reference genome only contains samples from a handful of individuals, it does not reflect global genomic diversity.

Eric and Wendy Schmidt Center postdoctoral fellow Tavor Baharav

Also, traditional genomics analysis aligns samples with references before comparing the samples to each other, discarding outliers. “When you're trying to detect an interesting, novel event, it almost by definition isn't going to align well to the reference,” said Baharav.

To address these and other limitations, researchers in the Salzman Lab at Stanford University came up with a way to analyze raw sequencing data without having to first align it to a reference genome. 

Their framework, SPLASH, identifies unchanging "anchor" subsequences in the raw sequencing data that are followed by "target" sequences that vary by sample. SPLASH, which stands for “Statistically Primary aLignment Agnostic Sequence Homing,” uses a new statistical test to determine which stretches of RNA reads exhibit the most variation across samples.
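
The anchor-and-target idea can be illustrated with a short Python sketch. This is not the SPLASH implementation: the k-mer length, the function names, and the toy reads are hypothetical, and a plain Pearson chi-square statistic stands in for SPLASH's own statistical test, which the article does not describe in detail.

```python
from collections import Counter, defaultdict

K = 4  # hypothetical k-mer length, kept small so the toy reads stay readable


def anchor_target_counts(samples, k=K):
    """Map each anchor k-mer to {sample name: Counter of the k-mers that follow it}."""
    table = defaultdict(lambda: defaultdict(Counter))
    for name, reads in samples.items():
        for read in reads:
            for i in range(len(read) - 2 * k + 1):
                anchor = read[i:i + k]
                target = read[i + k:i + 2 * k]
                table[anchor][name][target] += 1
    return table


def chi_square(per_sample):
    """Pearson chi-square statistic for one anchor's sample-by-target table.

    Stand-in for SPLASH's test (an assumption made for illustration only).
    """
    names = list(per_sample)
    targets = sorted({t for counts in per_sample.values() for t in counts})
    row = {n: sum(per_sample[n].values()) for n in names}
    col = {t: sum(per_sample[n][t] for n in names) for t in targets}
    total = sum(row.values())
    stat = 0.0
    for n in names:
        for t in targets:
            expected = row[n] * col[t] / total
            stat += (per_sample[n][t] - expected) ** 2 / expected
    return stat


# Toy reads, invented for this example. The anchor ACGT is followed by a
# different target in each sample; the anchor CCCC is followed by the same one.
samples = {
    "sample_A": ["ACGTTTTT"] * 3 + ["CCCCAAAA"] * 3,
    "sample_B": ["ACGTGGGG"] * 3 + ["CCCCAAAA"] * 3,
}

table = anchor_target_counts(samples)
scores = {anchor: chi_square(per_sample) for anchor, per_sample in table.items()}

# Anchors whose targets depend most on the sample come first.
for anchor, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(anchor, round(score, 2))
```

In this toy case the anchor whose targets differ between the two samples gets the larger score, and the anchor followed by the same target everywhere scores zero. No reference genome is consulted at any point; the comparison is purely between samples.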

"This work illustrates how interdisciplinary teams with diverse perspectives and skill sets are powerful and needed for scientific progress,” said Salzman. “Initially, the team questioned why such a straightforward approach hadn't been implemented before, but we gradually came to realize that rethinking conventions can sometimes yield simple solutions that could work better than ingrained approaches.”

Unlike traditional methods, which can each detect only certain types of genetic variation, SPLASH can detect a wide variety. It is also much more computationally efficient: an updated version of the framework can complete the entire analysis in an hour while using much less computing power than alignment-first approaches.

Detecting viral mutations and microalgae growing on eelgrass

To test the effectiveness of SPLASH, the team used it to perform a range of genomic analyses. In one, they compared nasal swab samples taken from patients at different periods during the COVID-19 pandemic, when different viral strains were dominant. SPLASH was able to identify which anchors had low p-values and high effect sizes, both indicators of viral mutations. They then mapped these reads to control samples from different COVID strains, determining that almost all of the anchors that SPLASH homed in on were indeed strain-defining mutations.
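
The selection step described above, keeping anchors with low p-values and high effect sizes, might look like the following Python sketch. The thresholds, field names, and values are invented for illustration and are not taken from the study.

```python
from typing import NamedTuple


class AnchorResult(NamedTuple):
    anchor: str         # anchor sequence
    p_value: float      # significance of target variation across samples
    effect_size: float  # how strongly the targets separate the samples


def select_anchors(results, max_p=0.05, min_effect=0.5):
    """Keep anchors that pass both cutoffs, most significant first.

    The cutoff values are hypothetical defaults, not SPLASH's.
    """
    kept = [r for r in results if r.p_value <= max_p and r.effect_size >= min_effect]
    return sorted(kept, key=lambda r: r.p_value)


# Invented values, for illustration only.
results = [
    AnchorResult("ACGTACGT", 1e-8, 0.9),
    AnchorResult("TTGACCAA", 1e-6, 0.1),  # significant, but small effect
    AnchorResult("GGCATTCA", 0.4, 0.8),   # large effect, but not significant
    AnchorResult("CATGGTAC", 1e-3, 0.7),
]

selected = select_anchors(results)
for r in selected:
    print(r.anchor, r.p_value, r.effect_size)
```

Requiring both conditions filters out anchors that are statistically significant but barely differ between samples, as well as anchors that look different but could easily be noise.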

Eelgrass provides foraging areas and shelter for fish. Adam Obaza/NOAA.

Given that very few species have reference genomes, the team also tested how well SPLASH can detect variations between samples from two species with limited reference data available, eelgrass and octopus. They compared RNA from samples of eelgrass, a common seagrass, collected in the Mediterranean and Norway, finding that almost 6% of targets did not align to eelgrass references. In particular, they noticed that the target sequences for one anchor varied by location and season.

The team theorized that these discrepancies could indicate the presence of different species of diatoms, microalgae that grow on other plants, as the anchor was less abundant in samples taken at night, when diatoms reduce expression of this particular type of gene.

“On its own, SPLASH does not provide immediately interpretable results, but it points researchers to interesting questions that they can investigate further,” said Baharav.

Next steps

Baharav, who completed his PhD in electrical engineering at Stanford earlier this year, is now applying his computational background to cancer research. As white blood cells develop, they shuffle around parts of their genome through a process called “V(D)J recombination.” This genetic reshuffling allows them to produce a huge array of antibodies and T-cell receptors, which they use to recognize and kill millions of microbes. 

Cancer researchers like Baharav’s mentor, Rafael Irizarry, chair of the Department of Data Science at Dana-Farber Cancer Institute, want to better understand how V(D)J recombination works so that they can design cancer vaccines. As a Schmidt Center fellow, Baharav is developing a reference-free way to analyze these adaptive immune receptors.

“SPLASH provides an exciting new statistical and computational framework for genomic analysis. I'm looking forward to building on this work to expand the scope of reference-free analysis, allowing researchers to perform unbiased inference on their data,” said Baharav. “As discussed in SPLASH, reference-based methods fall short in analyzing highly diverse genomic regions such as T cell receptors, which I'm looking to change.”
