A spatially aware likelihood test to detect sweeps from haplotype distributions
Article now published in PLOS Genetics at http://dx.doi.org/10.1371/journal.pgen.1010134
One of the key questions in evolutionary biology is how much adaptation has affected the history of a species or population. Identifying and characterizing signatures of natural selection is a powerful way to pinpoint putative genomic loci associated with adaptive phenotypes. Among all forms of natural selection, positive selection is the most widely studied. The genomic signals of positive selection are typically characterized by low nucleotide diversity and long common haplotypes. Furthermore, genomic signatures of selection vary depending on whether a “hard sweep” (a single haplotype rising in frequency) or a “soft sweep” (multiple haplotypes rising in frequency) has occurred. As such, several methods have been developed to detect signatures of positive selection by analyzing patterns of extended haplotypes [Sabeti et al. 2002, Voight et al. 2006]. Supervised machine learning has also been proposed as a powerful tool to detect positive selection from alignments of haplotypes [Torada et al. 2019], although this requires extensive simulations from known demographic models. More recently, the software LASSI has been proposed to calculate a likelihood test statistic T and infer the number of sweeping haplotypes [Harris and DeGiorgio, 2020].
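To make the hard versus soft distinction concrete, the following minimal Python sketch computes a truncated haplotype frequency spectrum (the frequencies of the most common haplotypes in a window) for three toy windows. The function name `truncated_hfs`, the truncation level `k`, and the toy haplotypes are illustrative assumptions and are not taken from LASSI or any other published tool.

```python
from collections import Counter

def truncated_hfs(haplotypes, k):
    """Return the k largest haplotype frequencies in a window.

    Frequencies are sorted in descending order and padded with zeros
    if the window contains fewer than k distinct haplotypes.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(tuple(h) for h in haplotypes)
    n = len(haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)[:k]
    return freqs + [0.0] * (k - len(freqs))

# Toy windows: 10 sampled haplotypes over 4 SNPs (made-up data).
hard = ["0110"] * 8 + ["1001", "0011"]                  # one dominant haplotype
soft = ["0110"] * 4 + ["1100"] * 4 + ["1001", "0011"]   # two common haplotypes
neutral = ["0110", "0110", "1100", "1100", "1001",
           "1001", "0011", "0101", "1010", "1111"]      # no dominant haplotype

hard_hfs = truncated_hfs(hard, 3)
soft_hfs = truncated_hfs(soft, 3)
neutral_hfs = truncated_hfs(neutral, 3)

for name, hfs in [("hard", hard_hfs), ("soft", soft_hfs), ("neutral", neutral_hfs)]:
    print(name, hfs)
```

In this toy example, the hard sweep concentrates frequency in a single haplotype, the soft sweep spreads it over two, and the neutral window shows a flatter spectrum.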
In this preprint, the authors present an improved version of LASSI that accounts for the spatial distribution of haplotype frequencies along the genome [DeGiorgio and Szpiech, 2021]. They do so by comparing the local haplotype frequency spectrum, as distorted by sweeping events, with the expected distribution obtained from genome-wide data. The authors measure spatial distance as the number of single nucleotide polymorphisms (SNPs) surrounding the targeted variant, although other definitions are possible in principle. The modeling of haplotype frequency spectra under various scenarios of selection is derived from previous findings [Cheng and DeGiorgio, 2020]. Notably, this parameterization allows for the inference of two variables, the number of sweeping haplotypes (m) and the sweep footprint size (A), the latter being related to the age and strength of the selection event. The authors performed extensive simulations to detect selective sweeps and benchmarked the power of the new statistic, called Λ, against commonly used methods based on haplotype diversity. Finally, they applied Λ to publicly available human genome data (the 1000 Genomes Project) from populations of African and European descent.
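The idea of a spatially aware comparison can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes that the local spectrum in each window is a mixture of a sweep spectrum `q` and the genome-wide spectrum `p`, with a mixing weight that decays exponentially with distance from the test site at rate `a` (a stand-in for the footprint parameter A). The decay form, the grid search, all function names, and the toy counts are assumptions made for illustration. In particular, `q` is supplied directly here rather than being constructed from the number of sweeping haplotypes m.

```python
import math

def mixture_spectrum(p, q, alpha):
    """Blend the sweep spectrum q with the genome-wide spectrum p."""
    return [alpha * qi + (1.0 - alpha) * pi for pi, qi in zip(p, q)]

def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood up to a constant that cancels in ratios."""
    return sum(c * math.log(pr) for c, pr in zip(counts, probs) if c > 0)

def sweep_lrt(window_counts, distances, p, q, a_grid):
    """Composite log-likelihood ratio across windows around a test site.

    Assumption (illustrative only): the sweep weight in a window at
    distance d from the test site is alpha = exp(-a * |d|).
    Returns the statistic and the grid value of a that maximises it.
    """
    ll_null = sum(multinomial_loglik(c, p) for c in window_counts)
    best_ll, best_a = -math.inf, None
    for a in a_grid:
        ll = 0.0
        for c, d in zip(window_counts, distances):
            alpha = math.exp(-a * abs(d))
            ll += multinomial_loglik(c, mixture_spectrum(p, q, alpha))
        if ll > best_ll:
            best_ll, best_a = ll, a
    return 2.0 * (best_ll - ll_null), best_a

# Made-up example: counts of the 3 most common haplotypes in 5 windows,
# 20 haplotypes each, at distances -2..2 (in windows) from the test site.
p = [0.4, 0.3, 0.3]            # genome-wide (neutral) truncated spectrum
q = [0.8, 0.1, 0.1]            # hypothetical sweep-distorted spectrum
distances = [-2, -1, 0, 1, 2]
window_counts = [[8, 6, 6], [12, 4, 4], [16, 2, 2], [12, 4, 4], [8, 6, 6]]
a_grid = [0.1, 0.5, 1.0, 2.0, 5.0]

stat, best_a = sweep_lrt(window_counts, distances, p, q, a_grid)
print(f"statistic = {stat:.2f}, best a = {best_a}")
```

Because this toy model always assigns full sweep weight to the central window, the statistic can be negative when the data look neutral. A small best-fitting `a` corresponds to a wide footprint, and a large one to a narrow footprint.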
Λ outperforms all tested competing statistics based on haplotype information under various degrees of sweep “softness” and age or strength. Notably, Λ has higher power to detect older sweeps when applying a relatively short window (51 SNPs). This highlights the importance of smaller windows to detect selection as they are less affected by historical recombination events.
For recent sweeps, Λ can provide reliable information on the number of sweeping haplotypes. Importantly, this suggests that the method not only detects signatures of selection, but is also able to characterize notable features of the sweeping event, such as its “softness” and, through the footprint size, its strength or age.
While the power of Λ is reduced when genotypes are unphased or under extreme demographic scenarios, it generally outperforms competing statistics under the same conditions.
When applied to human genomes from African (YRI) and European (CEU) samples, Λ was able to recapitulate known targets of selection, namely LCT and MHC, with evidence of soft sweeps acting on the latter locus.
What I liked about this preprint
Among the plethora of handcrafted statistics to detect signatures of positive selection, Λ stands out for several reasons. First, it clearly outperforms existing methods under a wide range of evolutionary scenarios. Second, it requires neither extensive simulations nor detailed knowledge of the demographic model of the population of interest. Third, it appears to be less sensitive to unphased data, extreme demographic history, and background selection. Fourth, it provides statistical evidence on the number of sweeping haplotypes and the age or strength of the selection event. Finally, a user-friendly open-source implementation is available at https://github.com/szpiech/lassip.
Questions to authors
- Tests for positive selection are usually more powerful when deployed on pairs of populations, with one used as a “control”. Do you see any scope for extending Λ to test for locally distorted joint haplotype frequency spectra?
- While it is worth assessing the performance of Λ on unphased genotypes, large-scale genomic data produced nowadays are also affected by mapping and sequencing errors, limited sample sizes, and genotype uncertainty and data missingness when a low-coverage sequencing strategy is employed. How do you see the applicability of Λ on such sequencing data sets?
- It is reassuring to observe that Λ is able to confirm previous findings of selection. What would be the next steps for this method to be able to produce novel biological insights? In which species and systems do you think Λ is most likely to generate breakthrough discoveries?
X Cheng and M DeGiorgio. Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection. Mol Biol Evol, 37:3267–3291, 2020.
M DeGiorgio and ZA Szpiech. A spatially aware likelihood test to detect sweeps from haplotype distributions. bioRxiv, 2021. https://www.biorxiv.org/content/10.1101/2021.05.12.443825v1
AM Harris and M DeGiorgio. A likelihood approach for uncovering selective sweep signatures from haplotype data. Mol Biol Evol, 37:3023–3046, 2020.
PC Sabeti, DE Reich, JM Higgins, HZP Levine, DJ Richter, SF Schaffner, SB Gabriel, JV Platko, NJ Patterson, GJ McDonald, HC Ackerman, SJ Campbell, D Altshuler, R Cooper, D Kwiatkowski, R Ward, and ES Lander. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419:832–837, 2002.
L Torada, L Lorenzon, A Beddis, U Isildak, L Pattini, S Mathieson, and M Fumagalli. ImaGene: a convolutional neural network to quantify natural selection from genomic data. BMC Bioinformatics, 20(9):1–12, 2019.
BF Voight, S Kudaravalli, X Wen, and JK Pritchard. A map of recent positive selection in the human genome. PLoS Biol, 4:e72, 2006.
Posted on: 1 June 2021