
Introduction to Genomics Sequencing and Data Methods
Genomics projects that use next-generation paired-end read data need to be planned around the organismal system and the specific research questions being addressed. Each sequencing method has its own advantages, depending on the availability of genomic resources, the nature of the study organism (whether a model or non-model system), and the project's goals and budget.
In this post, we will explore three key sequencing approaches that are commonly used in population genomics and phylogenomics: low-coverage whole genome sequencing, DD-Radseq, and target enrichment/capture.
Project Considerations
To determine the optimal sequencing method for your project, it’s essential to answer a core set of questions:
What genomic resources are already available for your organism? Are reference genomes accessible?
Is your study system a model or non-model organism? Understudied non-model organisms may present additional challenges, such as polyploidy, hybridization, or cryptic lineages that complicate genomic analysis.
What is your budget? Different methods vary in cost and scalability.
Does your research focus on identifying specific genes or regions of interest, or are you more interested in assessing population-level differentiation?
How old is the lineage of your focal species? Younger species may benefit from methods that maximize SNP variation for deeper insights into recent evolution.
By addressing these key questions, you can better navigate the available sequencing options. Below, we will provide an overview of each method and a summary of the pros, cons, and budget considerations for each approach.
Low-Coverage Whole Genome Sequencing
Low-coverage whole genome sequencing is a flexible and powerful approach that provides genome-wide insights into population structure and diversity metrics. Additionally, it allows researchers to identify genomic regions of interest, particularly through Genome-Wide Association Studies (GWAS). This method is especially valuable for systems where published genomes are available for read mapping, enabling comprehensive analysis across the entire genome.
The primary advantage of low-coverage sequencing is that it maximizes the number of samples analyzed by sacrificing read-depth across the genome. Typically, this results in an average read-depth of 0.5x to 3x. While this is insufficient to call genotypes confidently for any single individual, the power of this approach lies in leveraging genotype likelihoods across multiple individuals within a population. Bioinformatic tools like ANGSD (Analysis of Next Generation Sequencing Data) have made this approach feasible by working with genotype likelihoods rather than hard genotype calls, so the uncertainty at each site is carried into downstream estimates. This allows researchers to overcome the limitations of low read-depth by relying on statistical models that aggregate data across the population.
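To make the idea of a genotype likelihood concrete, here is a minimal sketch using a simple binomial read model with a fixed per-read error rate. This is an illustration only: it is not ANGSD's implementation, and the function names and the 1% error rate are assumptions chosen for the example.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error=0.01):
    """Likelihood of the observed read counts under each diploid genotype.

    Simplified binomial model (an assumption for illustration): every read
    is independent and miscalled with the same probability `error`.
    """
    n = ref_reads + alt_reads
    likelihoods = {}
    # alt_copies = number of alternate alleles carried by the genotype
    for name, alt_copies in (("ref/ref", 0), ("ref/alt", 1), ("alt/alt", 2)):
        # Probability that a single read shows the alternate allele
        p_alt = (alt_copies / 2) * (1 - error) + (1 - alt_copies / 2) * error
        likelihoods[name] = (
            comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods

def normalize(likelihoods):
    """Rescale likelihoods so they sum to 1, for easier comparison."""
    total = sum(likelihoods.values())
    return {genotype: value / total for genotype, value in likelihoods.items()}

# A single reference read (1x depth) cannot rule out a heterozygote.
one_read = normalize(genotype_likelihoods(ref_reads=1, alt_reads=0))
# Ten reference reads make the homozygous reference genotype far more likely.
ten_reads = normalize(genotype_likelihoods(ref_reads=10, alt_reads=0))
```

With one read, the heterozygous genotype keeps about a third of the normalized likelihood, which is why low-coverage analyses keep the full set of likelihoods instead of committing to a single call per individual.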
When a reference genome is available, low-coverage whole genome sequencing offers cutting-edge methods to analyze entire genomes across populations, with the added benefit of identifying genomic regions associated with specific traits or conditions. However, this approach is less feasible for understudied, non-model systems due to the high costs associated with generating novel genomes. For such systems, alternative sequencing approaches may be more appropriate.
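The trade-off between sample number and read-depth described in this section comes down to simple arithmetic. The sketch below assumes paired-end reads of equal length and ignores duplicates, unmapped reads, and uneven coverage; the genome size, read length, and run output are hypothetical numbers, not figures for any particular sequencer.

```python
def mean_depth(read_pairs, read_length, genome_size):
    """Expected average depth: total sequenced bases divided by genome size."""
    return 2 * read_pairs * read_length / genome_size

def samples_per_run(total_read_pairs, read_length, genome_size, target_depth):
    """Number of samples that fit in one run at a given target depth."""
    pairs_per_sample = target_depth * genome_size / (2 * read_length)
    return int(total_read_pairs // pairs_per_sample)

# Hypothetical project: 1.2 Gb genome, 150 bp paired-end reads,
# and a run that yields 400 million read pairs.
GENOME_SIZE = 1_200_000_000
READ_LENGTH = 150
RUN_OUTPUT = 400_000_000

samples_low = samples_per_run(RUN_OUTPUT, READ_LENGTH, GENOME_SIZE, target_depth=1)
samples_high = samples_per_run(RUN_OUTPUT, READ_LENGTH, GENOME_SIZE, target_depth=30)
depth_check = mean_depth(4_000_000, READ_LENGTH, GENOME_SIZE)
```

Under these assumptions the same run covers 100 individuals at 1x but only 3 at 30x, which is the budget argument for low-coverage designs.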
DD-Radseq Approach
The DD-Radseq (Double Digest Restriction-site Associated DNA Sequencing) approach offers a significant advantage for researchers: it does not require a reference genome, yet it still samples SNPs (Single Nucleotide Polymorphisms) from across the genome. This approach has gained widespread popularity over the past decade in population genomic studies and phylogenomics due to its affordability and its ability to generate large SNP datasets, typically thousands to tens of thousands of SNPs, for non-model organisms. DD-Radseq is especially powerful for young lineages with low levels of genetic differentiation.
The library preparation for DD-Radseq involves digesting genomic DNA with two restriction enzymes and selecting fragments within a set size range, after which these fragments are barcoded for each individual. The resulting sequences are then bioinformatically organized into putatively orthologous groups across samples using clustering algorithms, allowing researchers to obtain SNPs from a ‘random’ sampling across the genome. This process is facilitated by widely-used pipelines such as ipyrad.
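The digestion and size-selection step can be sketched in silico. The example below is a simplified illustration, not a lab protocol or part of any pipeline: the two recognition sites (GAATTC for EcoRI and CCGG for MspI) are one possible enzyme pair chosen for the example, cuts are placed at the start of each site for simplicity, and the size window is arbitrary.

```python
import re

def double_digest(sequence, site_a="GAATTC", site_b="CCGG", min_len=20, max_len=60):
    """Return fragments flanked by one site of each enzyme within a size window.

    Simplification (assumption): each enzyme "cuts" at the first base of its
    recognition site, and only adjacent cut sites define a fragment.
    """
    # Collect every recognition-site position, tagged by enzyme.
    cuts = sorted(
        [(m.start(), "A") for m in re.finditer(site_a, sequence)]
        + [(m.start(), "B") for m in re.finditer(site_b, sequence)]
    )
    fragments = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        length = end - start
        # Keep fragments with different enzymes at each end and a suitable size.
        if left != right and min_len <= length <= max_len:
            fragments.append(sequence[start:end])
    return fragments

# Toy sequence with four cut sites; only the first fragment passes both filters.
toy = "GAATTC" + "A" * 30 + "CCGG" + "T" * 10 + "CCGG" + "A" * 100 + "GAATTC"
fragments = double_digest(toy)
```

Because only fragments with the right enzyme pair and size are kept, changing either the enzymes or the size window changes which loci are sampled.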
Despite its strengths, DD-Radseq has certain limitations, primarily due to the lack of genomic resources for verifying the orthology of sequences. The assembly of sequences relies heavily on sequence similarity clustering, which can be problematic in the absence of a reference genome. Additionally, because DD-Radseq loci are typically short (around 300bp), obtaining haplotype-level information is challenging. This limitation can impact the accuracy of orthology inference, potentially leading to the erroneous grouping of paralogs due to insufficient sequence variation in a given region. If a reference genome is available, these challenges can be mitigated to some extent.
Another obstacle with DD-Radseq is the variability introduced during library preparation, particularly restriction digestion and size selection. The regions of the genome that are sequenced can vary significantly between batches, and differ entirely when different enzymes are used, which means that datasets generated from different projects may not be easily aggregated into larger datasets. This variability can pose issues for replicability and comparative studies.
Despite these challenges, DD-Radseq remains a powerful and cost-effective method for generating a high volume of SNP data. It is particularly useful for studying non-model organisms with no available genomic resources, providing critical insights into population genomics and evolutionary history.
Target-Enrichment/Capture, Hyb-Seq, Reduced Representation Methods
Target-capture and its variations (often referred to broadly as ‘reduced representation’ approaches) are powerful techniques optimized for studying specific genes or regions of a genome. In the last decade, several probe sets have been developed for specific organismal systems, such as the Angiosperms353 probe set for use across angiosperms, or the GoFlag 408 probe set for flagellate land plants, including ferns. These hybridization-based target capture methods leverage available genomic resources, like transcriptomes or genomes, to identify loci with high similarity across broader organismal groups or clades. The core goal of these methods is to target ‘low-copy’ genomic regions that can be reliably captured across species within a broader group, even if the study species lack specific genomic resources.
One of the significant strengths of reduced-representation or target capture (TC) methods is their ability to recover high read-depth across targeted loci, which can be used to phase reads and generate haplotype-level assemblies. For example, SORTER2 leverages this high read-depth to assemble homeologs and homoploid hybrid haplotypes, facilitating studies of hybrids within phylogenetic and population genetics frameworks.
However, depending on the system being studied and the probe set being used, target-capture methods can sometimes capture paralogs, that is, duplicated copies of a targeted gene that have undergone independent evolutionary histories. Although probe sets are designed to target ‘low-copy’ genes, the design may not anticipate gene duplications across all study systems and clades, especially in non-model organisms. This can complicate the assembly of orthologs across loci and species, as some samples may contain a single copy at a locus, while others may have unanticipated duplications, leading to the mixing of orthologs with paralogs.
Most bioinformatic pipelines attempt to detect paralogs by assessing allelic load or SNP density across loci, often leading to the exclusion of such loci from analyses. Unfortunately, this can result in the loss of valuable data that could be critical for resolving species or clade relationships. SORTER2 addresses this issue through identity clustering with USEARCH, separating paralogs into their own orthologous sets. This approach not only allows researchers to retain all targeted loci in their analyses but also generates additional loci from paralogs that can be used for further analysis.
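The general idea of separating sequences by identity clustering can be illustrated with a small greedy, centroid-based sketch. This is not USEARCH's algorithm or SORTER2's implementation: it assumes sequences are already aligned to the same length, and the 90% threshold and all names are hypothetical choices for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first centroid it matches at the threshold.

    A sequence that matches no existing centroid starts a new cluster,
    so divergent copies (putative paralogs) end up in separate groups.
    """
    clusters = []  # list of (centroid, members) pairs
    for seq in sequences:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return [members for _, members in clusters]

# Two toy gene copies, each recovered from two samples.
copy_a1 = "ACGTACGTACGTACGTACGT"
copy_a2 = "ACGTACGTACGTACGTACGA"  # 95% identical to copy_a1
copy_b1 = "ACGTTTGTACCTACGAACGG"  # 75% identical to copy_a1
copy_b2 = "ACGTTTGTACCTACGAACGC"  # 95% identical to copy_b1

clusters = greedy_cluster([copy_a1, copy_b1, copy_a2, copy_b2])
```

Here the two copies fall into separate clusters, so each can be treated as its own locus in downstream analyses. The result of a greedy scheme like this depends on the threshold and on input order, which is one reason real tools are more careful than this sketch.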
Overall, target capture methods offer a cost-effective option for studying non-model organisms, despite the challenges mentioned. These challenges, however, can be effectively managed with tools like the SORTER2 Toolkit, making these methods a viable choice for genomic research across diverse systems.
Summary Table
Method | Pros | Cons | Budget Considerations |
Low-Coverage Whole Genome Sequencing | Provides genome-wide coverage for population structure and diversity analysis; capable of identifying genomic regions of interest using GWAS; leverages genotype likelihoods to analyze SNPs across populations despite low depth. | Requires an available reference genome for effective read mapping; limited by lower read depth (0.5x-3x), which may miss some variants; not suitable for non-model systems without an existing genome due to the high cost of generating new genomes. | Cost-effective and powerful if a reference genome is available; high costs associated with generating novel genomes for non-model systems. |
DD-Radseq | Does not require a reference genome; generates a large number of SNPs, useful for population genomics and phylogenomics; affordable and effective for non-model organisms, especially in young lineages with low genetic differentiation. | Difficult to verify orthology of sequences without a genome; short loci (~300bp) limit haplotype-level information; variability between batches complicates data aggregation and reproducibility. | Generally affordable; ideal for non-model organisms with no available genomic resources. |
Target-Enrichment/Capture, Hyb-Seq | Optimized for studying specific genes or genome regions; high read-depth allows for haplotype-level assemblies; the SORTER2 pipeline allows for phasing of hybrid samples and improves paralog handling by separating paralogs into orthologous sets, retaining valuable loci. | May capture paralogs, complicating data assembly; designed probe sets may not anticipate duplication events in non-model organisms, potentially leading to mixed ortholog and paralog assemblies. | More affordable than whole-genome sequencing; costs vary depending on the probe set and system studied. |