
SORTER2 Tutorial: Stages 1A + 1B

Nov 8, 2024

9 min read




Tutorial Introduction

In this tutorial we will walk through how to run SORTER2 Stages 1A and 1B, which are the foundation for running downstream Stages 2 and 3. We will first outline how to set up software and Python dependencies using the Anaconda environment manager, as well as how to set up USEARCH to properly run the pipeline.


1. Installing Software Dependencies

The SORTER2-Toolkit is almost fully compatible with Anaconda environments. All software dependencies (with the exception of USEARCH, as of writing) and Python packages can be installed with the conda install command after creating an Anaconda environment for SORTER2.


Install Software and Python Dependencies via Anaconda

We provide a pre-made conda environment .yml file for SORTER2 (only compatible with Linux systems) that will automatically install all required software and Python libraries. To install from this .yml file, use the following conda command:

conda env create -f SORTER2.yml


The SORTER2 environment can then be activated using: conda activate SORTER2


You can also set up the conda environment manually by first creating a new conda environment (conda create -n SORTER2), activating it (conda activate SORTER2), and then using the conda install command for each of the following software and Python packages:


Biopython https://anaconda.org/conda-forge/biopython

conda install conda-forge::biopython


NumPy https://anaconda.org/conda-forge/numpy

conda install conda-forge::numpy


Pandas https://anaconda.org/anaconda/pandas

conda install anaconda::pandas


TrimGalore https://anaconda.org/bioconda/trim-galore

conda install bioconda::trim-galore


SPAdes https://anaconda.org/bioconda/spades

conda install bioconda::spades


BWA https://anaconda.org/bioconda/bwa

conda install bioconda::bwa


SAMTools https://anaconda.org/bioconda/samtools

conda install bioconda::samtools


BCFTools https://anaconda.org/bioconda/bcftools

conda install bioconda::bcftools


VCFTools https://anaconda.org/bioconda/vcftools

conda install bioconda::vcftools


MAFFT https://anaconda.org/bioconda/mafft

conda install bioconda::mafft


Seqtk https://anaconda.org/bioconda/seqtk

conda install bioconda::seqtk


TrimAl https://anaconda.org/bioconda/trimal

conda install bioconda::trimal


EMBOSS https://anaconda.org/bioconda/emboss

conda install bioconda::emboss


FastQC https://anaconda.org/bioconda/fastqc

conda install bioconda::fastqc
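

Alternatively, all of the above packages can be installed in a single command by specifying both channels (a convenience sketch; the package names are exactly those listed above):

conda install -c conda-forge -c bioconda biopython numpy pandas trim-galore spades bwa samtools bcftools vcftools mafft seqtk trimal emboss fastqc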


Setting up USEARCH Software

As of writing (November 2024), the conda package for USEARCH does not appear to have the same functionality as the binaries provided on the USEARCH website, so this software needs to be set up outside of the conda environment:

  • Download the binary from the USEARCH website and store it in a local directory

  • Decompress the .gz file using gzip -d or gunzip

  • Rename the USEARCH file (which will be named something like usearch11.0.667_i86linux32) to 'usearch'

  • Make sure to add the directory containing the usearch binary to your PATH before running SORTER2 (e.g., using export PATH=$PATH:/path/to/usearch_directory/ or editing your .bash_profile to include the usearch directory path)
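

Putting these steps together, the setup might look like the following (a sketch assuming the v11 32-bit Linux binary; substitute the actual file name you downloaded, and note you may also need to make the file executable):

gunzip usearch11.0.667_i86linux32.gz

mv usearch11.0.667_i86linux32 usearch

chmod +x usearch

export PATH=$PATH:/path/to/usearch_directory/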


2. File Formatting

Now that we have successfully set up the required software and Python libraries, we need to format our raw reads and reference files properly.

Formatting Paired-End FASTQ Files

Paired-end raw read FASTQ files need to be renamed with a unique identifier for each sample followed by the sample name (e.g., the species name), as follows:

ID_Species_(R1/R2).fastq

  • ID: A unique identifier for each sample (do not include 'R1' or 'R2' in the ID).

  • Species: The species name or sample name. All samples belonging to the same species should have the same species name to facilitate downstream hybrid phasing.

  • (R1/R2): Denotes the paired-end reads.
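
For example, a properly named pair for one sample (hypothetical names) would be ID1_SpeciesA_R1.fastq and ID1_SpeciesA_R2.fastq.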


Important Note: Ensure that your raw FASTQ files are decompressed (i.e., they do not have the .gz extension). You can decompress them all at once using gzip -d or gunzip with the * wildcard to match all files with the .gz extension: gunzip *.gz


Renaming Read Files with SORTER2_FormatReads.py

To streamline the renaming process for raw reads, we provide the SORTER2_FormatReads.py script, which automates file renaming based on an input CSV file. The CSV file should have the following columns (no header required), with one row per sample:

  1. Full file name of the R1 FASTQ file (e.g., Sample1_R1.fastq)

  2. Unique ID for the sample (e.g., ID1)

  3. Species name (e.g., SpeciesA)


The pipeline will use the species label for downstream hybrid phasing, so all samples belonging to the same species (to the best of your knowledge) should be labeled with the same species name (but with different IDs for separate individuals). If you are unsure whether a sample represents a hybrid or introgressed individual, use your best guess; you can further assess the genetic make-up of the sample when running the SORTER2 Processor script, which distinguishes hybrid from non-hybrid samples.
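

For example, a reformatfile.csv for three samples representing two species might look like this (hypothetical file and sample names):

Sample1_R1.fastq,ID1,SpeciesA

Sample2_R1.fastq,ID2,SpeciesA

Sample3_R1.fastq,ID3,SpeciesB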


Running the Script

Place the SORTER2_FormatReads.py script in the folder containing all the paired-end FASTQ files and the CSV file you created (denoted below as reformatfile.csv), then run the script with:

python SORTER2_FormatReads.py -i reformatfile.csv


3. Stage 1A: Trimming Reads and Assembling Contigs

After decompressing and renaming your paired-end read files, you should have a folder and file structure similar to:
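
For example, with two hypothetical samples:

ID1_SpeciesA_R1.fastq
ID1_SpeciesA_R2.fastq
ID2_SpeciesA_R1.fastq
ID2_SpeciesA_R2.fastq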

Within this folder, add the SORTER2_Stage1A_TrimSPAdes.py script, which trims reads using TrimGalore and assembles contigs using SPAdes. Typically, you would run this script within a bash/SLURM/PBS shell script so that the process continues running even if you close or log out of your session. If using a shell script for PBS or SLURM, make sure to activate the SORTER2 conda environment within the script using:

conda init bash

conda activate SORTER2


This can also be set up prior to running a shell script if you are running it with bash on a local machine. We have provided a template shell script including PBS and SLURM settings, if needed, as well as Python commands for each SORTER2 script with example settings.
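

As a rough reference, a minimal SLURM submission script might look like the following (a sketch only; the scheduler directives and resource values are placeholders to adjust for your cluster, and the conda.sh line is one common way to make conda activate available in a non-interactive shell):

#!/bin/bash
#SBATCH --job-name=SORTER2_Stage1A
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00

# make conda available in this non-interactive shell, then activate the environment
source $(conda info --base)/etc/profile.d/conda.sh
conda activate SORTER2

# run Stage 1A with both trimming and assembly enabled
python SORTER2_Stage1A_TrimSPAdes.py -n SORTER2_Tutorial -trim T -spades T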


Running the Script

The SORTER2_Stage1A_TrimSPAdes.py script takes three input flags as follows:

python SORTER2_Stage1A_TrimSPAdes.py -n SORTER2_Tutorial -trim T -spades T


Flags

  • -n: Specifies the project name, which will be used as the name of the folder containing all SORTER2 output for the set of samples being processed.

  • -trim: Set to T (True) to run TrimGalore for read trimming.

  • -spades: Set to T (True) to run SPAdes for contig assembly.


The script also allows you to run TrimGalore and SPAdes in separate runs, provided the project name is the same for reads that were already trimmed but not yet assembled into contigs with SPAdes. This was included to facilitate the processing of large datasets that might run into wall-time limits on HPCs if trimming and contig assembly take too long in a single run:

  • To run only trimming:

    • python SORTER2_Stage1A_TrimSPAdes.py -n ProjectName -trim T -spades F

  • To run only SPAdes (after trimming is completed):

    • python SORTER2_Stage1A_TrimSPAdes.py -n ProjectName -trim F -spades T

  • Ensure that you use the same project name (-n flag) for both runs

You may also parallelize trimming and assembly this way by separating paired FASTQs into separate sets for trimming/SPAdes runs. The trimmed and SPAdes assembly folders can then be combined into the same SORTER2 run by simply placing all output assembly folders into the same project folder.


After running the Stage 1A script, you will have a new project folder (named according to the -n flag, e.g., ProjectName) containing subfolders for each sample as below:
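
For example (hypothetical sample names; one subfolder per sample):

ProjectName/
  ID1_SpeciesA/
  ID2_SpeciesA/
  ID3_SpeciesB/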

Each sample folder includes:

  • Trimmed reads (files ending with val_1.fq and val_2.fq)

  • FastQC reports (if FastQC is installed)

  • SPAdes assemblies (within the spades_hybrid_assembly folder)


With the sample reads trimmed, SPAdes assemblies completed, and assembly folders for SORTER2 created, we are ready to run Stage 1B.


4. Stage 1B: Assembling Orthologs


Reference File Format for Stage 1B

Before running Stage 1B, ensure that your locus reference FASTA file is properly formatted. The sequence IDs must designate targeted loci based on an index:

>L1_*

ATGAGAC...

>L2_*

GCTAGTC...

>L3_*

TTAGGCA...


  • Locus Index: Each sequence ID starts with L, followed by the locus number (e.g., L1, L2) and then an underscore _.

  • Asterisk *: Can be any combination of valid characters used to give the reference a specific name. This is useful for differentiating variants of the same targeted locus.

  • Multiple References: If you have multiple references for the same locus, ensure they share the same locus number but have different identifiers after the underscore (see the example below).
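

For instance, two references for locus 1 (hypothetical names) could be formatted as:

>L1_refA

ATGAGAC...

>L1_refB

ATGAGAT...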


Setting Up Stage 1B

Place your properly formatted FASTA reference file and the SORTER2_Stage1B_AssembleOrthologs.py script in the SORTER2 project folder that contains the assembly folders for your samples.

Running the Script

There are several user-modifiable settings associated with SORTER2; an example command might look like:

python SORTER2_Stage1B_AssembleOrthologs.py -wd /path/to/ProjectName/ -ref /path/to/reference.fasta -loci 455 -cl 0.70 -reclust F -csn 20 -csl 300 -al 1000 -indel 0.1 -idformat onlysample


Flags

  • -wd: Path to the working directory (project folder). Make sure to include the trailing /.

  • -ref: Path to the reference FASTA file.

  • -loci: The largest index number for reference loci (e.g., 455).

  • -cl: Clustering identity threshold to generate orthologs (e.g., 0.70 for 70% identity).

  • -csn: Number of SPAdes contigs per reference locus per sample to keep for consensus allele clustering (e.g., 20).

  • -csl: Minimum sequence length of SPAdes contigs to be considered for consensus-alleles (e.g., 300).

  • -reclust: Set to F (False) to assemble consensus alleles from scratch. Set to T (True) to use previous -csn and -csl settings and only re-cluster at a different identity threshold.

  • -al: Number of MAFFT alignment iterations (e.g., 1000).

  • -indel: Indel trimming threshold for TrimAl. A value of 0.1 means indels represented in at least 10% of samples are kept.

  • -idformat: Determines how samples are annotated in final output alignments. Use onlysample for sequence IDs like >ID_Species, suitable for concatenation.


Sequence Identity for Orthologs from Consensus Alleles

In the above example, we are clustering our targeted loci among samples at 70% identity with USEARCH to generate orthologs, given the presence of paralogous sequences. For each sample, contigs that map to the same reference are first clustered at 99% identity into 'consensus alleles'. This step combines sample contigs representing heterozygous variants of the same ortholog, and the resulting consensus alleles serve as the sequences that are clustered among samples to determine orthologs and separate paralogous sets.

For datasets at the species level (i.e., different species of a genus or family), I recommend using a clustering identity between 70% and 80%, and 80% (or more) when dealing with shallower divergences below the species level (i.e., populations of the same species, or subspecies/varieties). These values can vary across systems depending on how divergent/diverse the samples in the dataset are. The -csn flag allows the user to modify how many contigs are kept when multiple sample contigs map to a single reference; in practice I set this to a high value like 20 to maximize the number of orthologs generated, which also improves inference of consensus alleles and orthologs across samples. The -csl flag sets a lower limit on how long each of these contigs must be in order to be processed by the pipeline; we probably don't want contig sequences shorter than 200-300 bp, as we begin to lose the sequence variation required for generating robust consensus alleles and parsing paralogs. To optimize clustering, I would rerun Stage 1B at a few different identity values to assess which threshold results in a reasonable number of consensus alleles per sample per ortholog (i.e., around one or two per ortholog, assuming diploid samples), which can be assessed via the ALLsamples_consensusallele_cslX_csnX_SUMMARY_TABLE.csv and ALLsamples_allclusterbaits_cid.xx_SUMMARY_TABLE.csv summary files. The -reclust flag is included for this purpose: when set to 'T', it keeps the -csl and -csn settings from a previous Stage 1B run and only repeats the ortholog clustering step, saving time and computational resources.
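
For example, to re-cluster the consensus alleles from the earlier example run at 80% identity without regenerating them, you could run:

python SORTER2_Stage1B_AssembleOrthologs.py -wd /path/to/ProjectName/ -ref /path/to/reference.fasta -loci 455 -cl 0.80 -reclust T -csn 20 -csl 300 -al 1000 -indel 0.1 -idformat onlysample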

If paralogs don't appear to be a significant factor in your data, you can also choose a lower value like 50%-60%, which should limit unnecessary clustering while still removing contaminant or erroneous sequences mapped to references.

Output from Stage 1B

After running Stage 1B, two main folders are created within your project directory:

  1. diploids

  2. diploid_clusters


The final output alignment files are stored in the 'diploids' folder and follow the naming scheme LX_clX_duprem_trimmed_deint.fasta, where LX refers to the targeted locus and clX is the specific locus cluster for that locus based on the clustering identity used; 'duprem' indicates that duplicate sequences for samples, if present, were filtered (keeping only the longest consensus-allele sequence), and 'trimmed' signifies that the alignment was trimmed based on the indel representation threshold used. These FASTA alignments can be used as input for phylogenetic analyses (i.e., either as a concatenated supermatrix for maximum likelihood analyses with software like RAxML, or to generate gene trees with IQ-TREE 2 for input into summary tree methods such as ASTRAL).
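
For instance, gene trees for all of the final alignments could be generated with IQ-TREE 2 as input for ASTRAL (a sketch, assuming iqtree2 is installed and the diploids alignments have been copied into their own folder, here called loci_alignments/):

iqtree2 -S loci_alignments/ --prefix loci_genetrees -T 4

The resulting loci_genetrees.treefile would then contain one gene tree per alignment.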

Other files found in the diploids folder include:

  • Consensus Allele Files: Files named LX_allsamples_allcontigs.fasta contain consensus alleles for each targeted locus prior to clustering.

  • Sample-Specific Files: Files named with the sample ID and species (e.g., ID_Species_) contain all consensus alleles for that sample.

  • Summary CSV: The file ALLsamples_consensusallele_cslX_csnX_SUMMARY_TABLE.csv summarizes the number of consensus alleles retrieved per sample per targeted locus before clustering; the file name denotes the -csl and -csn values used to generate the consensus alleles.


Contents of the diploid_clusters Folder

This folder contains intermediate files used for downstream stages and scripts as well as additional post-clustering summary statistics.

  • Clustering Output: Files ending with LX_clX are raw clustering outputs from USEARCH.

  • Unaligned Cluster Files: Files named allsamples_allcontigs.fasta contain all sequences for each locus cluster, including duplicates.

  • Degapped Files: Files ending with _degap.fasta are unaligned, untrimmed sequences used as USEARCH references in Stage 3 for phasing hybrids.

  • Aligned Cluster Files: Files ending with duprem_al.fasta are aligned but untrimmed, and are used for generating consensus references when using SORTER2_Processor.py.

  • Detailed Alignment Files: Files named LX_X__allsamples_allcontigs.fasta contain fully annotated alignments of clustered loci, including duplicate sequences.

  • Summary CSV: The file ALLsamples_allclusterbaits_cid.X_SUMMARY_TABLE.csv summarizes consensus allele copy numbers per locus cluster after clustering.


You have now successfully completed Stages 1A and 1B of the SORTER2 pipeline, which are required if you are interested in running Stages 2 and 3. The alignment files generated in the diploids folder are ready for phylogenetic analyses, though I personally prefer the phased haplotype alignments output by Stage 2 (particularly if using ASTRAL to infer a tree, as it requires haplotypes rather than consensus sequences). Be sure to review the summary CSV files to assess the number of consensus alleles per sample and per locus/locus-cluster, which can provide insight into the presence of paralogs or samples with low sequence representation. If you need to re-cluster at different identity thresholds, you can rerun Stage 1B with the -reclust T flag to save time by reusing previously generated consensus alleles.

