A spatially aware likelihood test to detect sweeps from haplotype distributions

Michael DeGiorgio, Zachary A. Szpiech

https://www.biorxiv.org/content/10.1101/2021.05.12.443825v1

Article now published in PLOS Genetics at http://dx.doi.org/10.1371/journal.pgen.1010134

A powerful test for genomic signatures of selective sweeps from haplotype frequency spectra.

Selected by Matteo Fumagalli

Background

One of the key questions in evolutionary biology is how much adaptation has shaped the history of a species or population. Identifying and characterizing signatures of natural selection is a powerful way to pinpoint putative genomic loci associated with adaptive phenotypes. Among all forms of natural selection, positive selection is the most widely studied. Its genomic signals are typically characterized by low nucleotide diversity and long, common haplotypes. These signatures also differ depending on whether a “hard sweep” (a single haplotype rising in frequency) or a “soft sweep” (multiple haplotypes rising in frequency) has occurred. Accordingly, several methods have been developed to detect positive selection by analyzing patterns of extended haplotypes [Sabeti et al. 2002, Voight et al. 2006]. Supervised machine learning has also been proposed as a powerful tool to detect positive selection from alignments of haplotypes [Torada et al. 2019], although it requires extensive simulations from known demographic models. More recently, the software LASSI was introduced to calculate a likelihood test statistic T and infer the number of sweeping haplotypes [Harris and DeGiorgio, 2020].

In this preprint, the authors present an improved version of LASSI that accounts for the spatial distribution of haplotype frequencies along the genome [DeGiorgio and Szpiech, 2021]. They do so by comparing the local haplotype frequency spectrum, as distorted by sweeping events, with the expected distribution obtained from genome-wide data. The authors define the spatial distribution in terms of the number of single nucleotide polymorphisms (SNPs) surrounding the targeted variant, although other definitions are possible in principle. The modeling of haplotype frequency spectra under various scenarios of selection is derived from previous findings [Cheng and DeGiorgio, 2020]. Notably, this parameterization allows for the inference of two variables: the number of sweeping haplotypes (m) and the sweep footprint size (A), the latter being related to the age and strength of the selection event. The authors performed extensive simulations to benchmark the power of the new statistic, called Λ, against commonly used methods based on haplotype diversity. Finally, they applied Λ to publicly available human genomes of African and European descent from the 1000 Genomes Project.
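To make the idea of a local versus genome-wide haplotype frequency spectrum more concrete, here is a minimal Python sketch. It is not the authors' implementation (that is lassip) and it does not compute Λ; it only illustrates how a truncated spectrum could be tabulated in sliding windows of SNPs and averaged across windows. The function names, the input format (one string of 0/1 alleles per phased haplotype), the truncation to K haplotypes and the renormalisation step are all assumptions made for illustration.

```python
from collections import Counter


def truncated_hfs(haplotypes, k=10):
    """Frequencies of the k most common haplotypes, sorted in descending order.

    Assumption for this sketch: the truncated spectrum is renormalised to
    sum to 1 and padded with zeros when fewer than k haplotypes are present.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)[:k]
    total = sum(freqs)
    freqs = [f / total for f in freqs]
    return freqs + [0.0] * (k - len(freqs))


def window_spectra(sample, window=51, step=1, k=10):
    """Truncated spectra for sliding windows of `window` SNPs.

    `sample` is a list of equal-length strings, one per phased haplotype,
    with one character per SNP (hypothetical input format).
    """
    n_snps = len(sample[0])
    spectra = []
    for start in range(0, n_snps - window + 1, step):
        haps = [h[start:start + window] for h in sample]
        spectra.append(truncated_hfs(haps, k))
    return spectra


def mean_spectrum(spectra):
    """Average spectrum across windows, a stand-in for the genome-wide expectation."""
    k = len(spectra[0])
    return [sum(s[i] for s in spectra) / len(spectra) for i in range(k)]


# Toy data: four haplotypes typed at four SNPs, scanned with 2-SNP windows.
toy = ["0101", "0101", "0111", "1000"]
local = window_spectra(toy, window=2, k=3)
background = mean_spectrum(local)
```

In this toy example the first window is dominated by one haplotype (frequency 0.75), whereas the average across windows is flatter. A sweep statistic of this family asks, in a likelihood framework, whether such a local distortion is larger than expected given the background spectrum.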

Key findings

Λ outperforms all tested competing statistics based on haplotype information under various degrees of sweep “softness” and age or strength. Notably, Λ has higher power to detect older sweeps when applying a relatively short window (51 SNPs). This highlights the importance of smaller windows to detect selection as they are less affected by historical recombination events.

For recent sweeps, Λ can provide reliable information on the number of sweeping haplotypes. Importantly, this suggests that the method not only detects signatures of selection, but can also characterize notable features of the sweeping event, such as its “softness” and, through the footprint size, its strength or age.

While the power of Λ is reduced when genotypes are unphased or under extreme demographic scenarios, it generally outperforms competing statistics under the same conditions.

When applied to human genomes from African (YRI) and European (CEU) samples, Λ was able to recapitulate known targets of selection, namely LCT and MHC, with evidence of soft sweeps acting on the latter locus.

What I liked about this preprint

Among the plethora of handcrafted statistics to detect signatures of positive selection, Λ stands out for several reasons. First, it clearly outperforms existing methods under a wide range of evolutionary scenarios. Second, it requires neither extensive simulations nor detailed knowledge of the demographic model of the population of interest. Third, it appears to be less sensitive to assumptions about data phasing, extreme demographic history and background selection. Fourth, it provides statistical evidence on the number of sweeping haplotypes and the age or strength of the selection event. Finally, a user-friendly open-source implementation is available at https://github.com/szpiech/lassip.

Questions to authors

  1. Tests for positive selection are usually more powerful when deployed on pairs of populations, with one used as a “control”. Do you see any scope for extending Λ to test for locally distorted joint haplotype frequency spectra?
  2. While it is worth assessing the performance of Λ on unphased genotypes, large-scale genomic data produced nowadays are also affected by mapping and sequencing errors, limited sample sizes, and, when a low-coverage sequencing strategy is employed, genotype uncertainty and missing data. How do you see the applicability of Λ to such sequencing data sets?
  3. It is reassuring to observe that Λ is able to confirm previous findings of selection. What would be the next steps for this method to be able to produce novel biological insights? In which species and systems do you think Λ is most likely to generate breakthrough discoveries?

References

X Cheng and M DeGiorgio. Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection. Mol Biol Evol, 37:3267–3291, 2020.

M DeGiorgio and ZA Szpiech. A spatially aware likelihood test to detect sweeps from haplotype distributions. bioRxiv, 2021. https://www.biorxiv.org/content/10.1101/2021.05.12.443825v1

AM Harris and M DeGiorgio. A likelihood approach for uncovering selective sweep signatures from haplotype data. Mol Biol Evol, 215:143–171, 2020.

PC Sabeti, DE Reich, JM Higgins, HZP Levine, DJ Richter, SF Schaffner, SB Gabriel, JV Platko, NJ Patterson, GJ McDonald, HC Ackerman, SJ Campbell, D Altshuler, R Cooper, D Kwiatkowski, R Ward, and ES Lander. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419:832–837, 2002.

L Torada, L Lorenzon, A Beddis, U Isildak, L Pattini, S Mathieson, and M Fumagalli. ImaGene: a convolutional neural network to quantify natural selection from genomic data. BMC Bioinformatics, 20(9):1–12, 2019.

BF Voight, S Kudaravalli, X Wen, and JK Pritchard. A map of recent positive selection in the human genome. PLoS Biol, 4:e72, 2006.

Posted on: 1 June 2021

doi: https://doi.org/10.1242/prelights.29309

Author's response

Zachary shared

Tests for positive selection are usually more powerful when deployed on pairs of populations, with one used as a “control”. Do you see any scope for extending Λ to test for locally distorted joint haplotype frequency spectra?

As it happens, we are currently working on an expanded two-population version of this method for precisely this reason. Although the computational demands will likely grow, we anticipate that including a second population may indeed assist in learning about the characteristics of sweeps in one or both populations.

While it is worth assessing the performance of Λ on unphased genotypes, large-scale genomic data produced nowadays are also affected by mapping and sequencing errors, limited sample sizes, and, when a low-coverage sequencing strategy is employed, genotype uncertainty and missing data. How do you see the applicability of Λ to such sequencing data sets?

These are great points. Mapping errors are challenging to address, and our current approach is to filter these regions out. However, with the first complete human genome assembly (https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1) combined with improved assembly methodology, we may see some improvement in this area soon. Sequencing errors, genotype uncertainty and missing data could conceivably be modeled in our approach, and this is something we are thinking about for future updates to this (and related) methods. Small sample sizes will, of course, introduce noise into the estimation of the haplotype frequency spectrum (HFS), but as our method only considers the top 10-20 most frequent haplotypes, we anticipate that this will pose less of a problem. Finally, we should mention that when we applied Λ to the 1000 Genomes Project (TGP) data, we used the low-coverage calls without incorporating any uncertainty parameters with respect to genotype calls and were still able to identify sensible results.

It is reassuring to observe that Λ is able to confirm previous findings of selection. What would be the next steps for this method to be able to produce novel biological insights? In which species and systems do you think Λ is most likely to generate breakthrough discoveries?

Understanding the genomic basis of adaptation to changing/extreme environments would be an important area of application. This would likely reveal important biology and aid in conservation efforts.
