
Read-Depth and Coverage Metrics with awk and SAMtools

Aug 20, 2024

5 min read


awk is a programming language for text processing, commonly used as a command-line tool in Unix/Linux environments. It excels at processing and analyzing text files, particularly those structured into columns, such as CSV files or other delimited files.


Table of Contents

Basics of awk

Read depth vs coverage

Calculating average read depth

Calculating read coverage


Basics of awk

  • Fields: awk splits each line of input into fields based on a delimiter (whitespace by default; use awk -F 'delimiter' to designate a different one). You can access these fields using $1, $2, $3, etc., where $1 is the first field, $2 the second, and so on.

  • Actions: awk allows you to perform actions on the text, such as printing, summing, or counting. These actions can be specified within curly braces {}.

  • Patterns: awk can match patterns in text and execute actions only on lines that match the pattern.
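All three pieces can be seen in a short, hypothetical example (the data and the depth-above-5 cutoff are made up for illustration):

```shell
# Hypothetical tab-separated input: sequence name, position, depth.
# The pattern ($3 > 5) selects lines; the action prints fields and counts matches.
printf 'chr1\t1\t10\nchr1\t2\t0\nchr2\t1\t8\n' |
  awk '$3 > 5 {count++; print $1, $2}
       END {print "matching lines:", count}'
```

This prints the name and position of the two lines whose depth exceeds 5, then a total count once all input has been read.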


Using awk can be valuable for generating quick summary statistics from a variety of outputs. SAMtools is often used to output statistics on sites stored within a .bam file generated after mapping reads or sequences to a specific reference. Two common statistics reported for such BAM files are read depth and read coverage:


  • Read Depth: The number of reads/sequences that were mapped to a specific locus or site in the reference. This metric reflects the level of confidence we have when calling variants at specific sites or regions of the reference, where a higher read depth usually correlates with a more accurate ability to call SNPs or haplotypes at that locus.

  • Read Coverage: How much of the reference sequence was 'covered' or represented by the mapped reads. This gives us a sense of how well our sequenced reads as a whole match the regions in our reference. Assuming the reference represents the loci you were targeting for sequencing, read coverage reflects what percentage of the reference sequence regions were recovered.


These two metrics are usually assessed in tandem: high read depth alone may be concentrated in a poorly covered region of the reference, and, conversely, high reference coverage may reflect good recovery of reference loci but with a read depth too low to confidently call variants across the reference. What counts as a 'good' average read depth varies with the research goals. An average read depth of 3x-5x might be reasonable for a low-coverage whole-genome project, where researchers optimize for coverage across an entire genome with a larger sequence space. Conversely, targeted-capture methods focusing on specific regions of the genome may recover read depths of 100x or more, giving researchers more power to detect alternative variants associated with the reference (i.e. alleles or paralogs). Ideally, a sequencing project recovers both a read depth as high as the sequencing approach allows and a reference coverage representing most, if not all, of the targeted regions.


Calculating Average Read Depth with awk and SAMtools

The following command-line pipeline will output the average read depth of a BAM file given the output from samtools depth:

samtools depth -a bamfile.bam | awk '{c++;s+=$3}END{print s/c}' >> read_depth.txt


  • samtools depth: Used to calculate the depth of coverage at each position in the reference genome for a given BAM file (bamfile.bam).

  • -a: This option ensures that all positions in the reference are output, even if there is zero coverage (i.e. no reads mapped to that position).

  • bamfile.bam: The input BAM file, which contains the alignment data, often generated using bwa software with fastq read files and a reference.


The output of samtools depth is three-column text where the first column ($1) is the sequence name, column 2 ($2) is the position in the reference, and column 3 ($3) is the depth at that position (i.e., how many reads align to that specific position). For example:

chr1 1 10 

chr1 2 12 

chr1 3 8


After piping (using |) this output from samtools, we can use awk to perform specific calculations on it in order to get an average. awk '{c++;s+=$3}END{print s/c}' calculates the average depth of coverage across all positions by summing the depths and dividing by the number of positions:


awk '{c++;s+=$3}END{print s/c}':

  • c++: This increments a counter defined as c for each line processed by awk and is used to count the total number of positions.

  • s+=$3: This adds the value in the third column ($3, which is the depth) to a running total defined as s. So, s will be the sum of all depths across all positions.

  • END{print s/c}: After processing all lines (END block), awk prints the average depth of coverage by dividing the total sum s by the number of positions c.
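One edge case worth noting: if samtools depth produces no output at all (for example, an empty BAM file), s/c divides by zero and awk reports an error. A defensive variant, sketched here with printf simulating empty input, guards the END block:

```shell
# Guard against division by zero when no positions were processed;
# printf '' stands in for an empty samtools depth stream.
printf '' | awk '{c++; s+=$3} END{if (c > 0) print s/c; else print "no positions"}'
```

With real input the command prints the average as before; with empty input it prints "no positions" instead of failing.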


With this calculation we can get the average depth across all reference positions in the .bam file without having to calculate the average in another program like R or Excel. Using >> read_depth.txt appends this value to a specific file, creating the file if it does not already exist.
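The same pattern extends to per-reference averages using awk associative arrays. This is a sketch: in practice you would pipe samtools depth -a bamfile.bam into the awk program; here printf simulates its three-column output:

```shell
# Average depth per reference sequence; in practice, replace printf with:
#   samtools depth -a bamfile.bam
printf 'chr1\t1\t10\nchr1\t2\t12\nchr2\t1\t8\n' |
  awk '{s[$1] += $3; c[$1]++}                  # per-sequence sum and count
       END {for (r in s) print r, s[r]/c[r]}'  # average depth for each reference
```

Note that the order in which for (r in s) visits the array keys is unspecified in awk, so pipe the result through sort if you need a stable ordering.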


Calculating Reference Coverage with awk and SAMtools

We can also use awk to calculate the reference coverage from the output generated by samtools depth using:

samtools depth -a bamfile.bam | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}' >> ref_coverage.txt

As in the previous example, we use samtools depth with the -a flag, and c++ counts the total number of positions regardless of coverage.


if($3>0) total+=1:

  • if($3>0) checks if the value in the third column (the read-depth) is greater than 0, meaning that the position has some coverage.

  • If the condition is true ($3 > 0), total+=1 increments the total counter by 1, thus counting the number of positions with non-zero coverage.

END{print (total/c)*100}:

  • END{} specifies an action to perform after processing all input lines.

  • (total/c)*100 calculates the percentage of positions with non-zero coverage out of the total number of positions, using our previously defined variables: we divide the number of positions with coverage (total) by the total number of positions (c), then multiply by 100 to get a percentage.

  • print (total/c)*100 prints the calculated percentage.


As in the previous example, we can take the printed output and append it (i.e. >> ref_coverage.txt) to a new or existing file.
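Since both commands read the same samtools depth stream, the two metrics can also be computed in a single pass. This is a hypothetical combination of the two awk programs above (printf stands in for the samtools depth output):

```shell
# One pass computing mean depth and percent coverage together;
# in practice, replace printf with: samtools depth -a bamfile.bam
printf 'chr1\t1\t10\nchr1\t2\t0\nchr1\t3\t8\n' |
  awk '{c++; s+=$3; if ($3 > 0) covered++}
       END {print "mean depth:", s/c; print "coverage %:", (covered/c)*100}'
```

Reading the depth file once instead of twice halves the I/O, which adds up when processing many large BAM files.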


Using awk to streamline your bioinformatic processing and output statistics is invaluable for saving time and for building steps that can be incorporated into more complex bioinformatic pipelines.


#SAMTools #awk #bioinformatics #genomics

