Aligning the sequenced reads to the reference genome is the most crucial task of any NGS analysis. Fast aligners like Bowtie2 and BWA-MEM are widely used for alignment, among others, ChIPSeq and ATACseq data. We will exemplarily show how to align reads using Bowtie2
Due to the long computation time and memory constraints, we have precomputed for you all the alignments using Bowtie2. You find these data in data/processed/ATACseq/Bowtie2
#------------------------------------------
# Read alignment with Bowtie2 - Paired end
#------------------------------------------
bowtie2 \
--phred33 \
--mm \
--maxins 2000 \
--very-sensitive \
--threads 10 \
-x <hg38 genome index> \
-1 <A_R1.fastq.gz> \
-2 <A_R2.fastq.gz> \
2> \
<bowtie2.log> \
| samtools view -h -b - > <aligned.bam>
#--------------------------------------------
After the alignment, some read have a low mapping quality (possibly with many mismatches); this mapping quality is indicated by the MAPQ score. This score takes into account the number of mismatches, but also the quality of the sequenced bases, as indicated by the Phred score.
Here, we will:
#----------------------------
# Low quality read filtering
#----------------------------
# For paired end
samtools view -h -b -F 1804 -f 2 -q 30 -@ 10 -o <aligned_filtered.bam> <aligned.bam>
#------------------------------------------------------------
# NOTE:[Only for ATACseq] - Filtering Mitochondrial reads
# NOTE: With the new ATACseq protocol - OMNI-ATAC,
# mitochondrial contamination should no longer be a problem
#------------------------------------------------------------
samtools view -h -@ 15 <aligned_filtered.bam> | grep -v "chrM" | samtools view -h -b -@ 15 - > <aligned_filt_noMT.bam>
A typical problem in libraries sequenced using a PCR amplification is the presence of PCA duplicates among the reads.
These should be filtered out, as they do not bring any additional insights. Usually, these duplicates can be identified when many reads have the exact same coordinates. We will delete these duplicates from the bam file using the samtools markdup
tool.
#---------------------------------------
# Sorting, Fixmate and Duplicate removal
#---------------------------------------
# NOTE: fixmate needs bam files sorted by read name but
# the downstream processes need co-ordinate sorted bam files, hence the two sorting steps
# Fixmate is needed to add mate tags (-m), which is used for proper duplicate identification by markdup
samtools sort -n -O BAM -@ 10 <aligned_filt_noMT.bam> \
| samtools fixmate -m -@ 10 - - \
| samtools sort -O BAM -@ 10 - > \
| samtools markdup -r -S -@ 10 - <aligned_filtered_sorted_duprmv.bam>
Finally, now that we have generated a new bam file by filtering out reads, we need to re-index the bam file. Indexing helps in quickly accessing the reads.
#----------
# Indexing
#----------
samtools index <aligned_filtered_sorted_duprmv.bam> <aligned_filtered_sorted_duprmv.bam.bam.bai>
A detailed explanation of the parameters used for alignment can be found here, it is important to notice the difference for paired end and single end sequencing.
After alignment, we also filter out poor quality, unmapped and duplicate reads using samtools. The parameters -F, -f, q
along with markdup
are used to filter out reads.
Using this tool, one can find the exact meaning of the filtering flags like
1796, 1804, 2, etc
.
SAMtools provides some functions like flagstat, stats and idxstats
to compute overall summary of read alignment statistics. You can find more information on the flagstat here
# Go to your home directory
cd
# Create a folder for your analysis
mkdir -p analysis/AlignmentStats/ATAC
# Flagstat analysis
# Pseudocode:
# samtools flagstat <aligned.bam> > <aligned.bam.log>
# samtools flagstat <aligned_filtered_sorted_duprmv.bam> > <aligned_filtered_sorted_duprmv.bam.log>
samtools flagstat -@ 3 data/processed/ATACseq/Bowtie2/ATAC_REP1_aligned_filt_sort_nodup.bam > \
analysis/AlignmentStats/ATAC/ATAC_REP1_aligned_filt_sort_nodup.flagstat.log
Can you modify the code above to run flagstat also on the corresponding filtered and unfiltered
.bam
and compare the numbers from the two files. What do you learn ?
The unflitered alignment file is named as
ATAC_REP1_aligned_nofilt.bam
Can you count duplicates from the unfiltered file. Follow the instructions from yesterday.
In the next section, we will perform peak calling using these aligned .bam
files.