Aligning the sequenced reads to the reference genome is the most crucial task of any NGS analysis. Fast aligners like Bowtie2 and BWA-MEM are widely used for aligning, among others, ChIPSeq and ATACseq data. We will exemplarily show how to align your sequencing reads using Bowtie2
Due to the long computation time and memory constraints, we have precomputed for you all the alignments using Bowtie2. You find these data in data/processed/<CTCF/H3K4me3>/Bowtie2
#------------------------------------------
# Read alignment with Bowtie2 - Single end
#------------------------------------------
bowtie2 \
--phred33 \
--mm \
--very-sensitive \
--threads 10 \
-x <hg38 genome index> \
-U <input fastq> \
2> \
<bowtie2.log> \
| samtools view -h -b - > <aligned.bam>
#--------------------------------------------
After the alignment, some read have a low mapping quality (possibly with many mismatches); this mapping quality is indicated by the MAPQ score. This score takes into account the number of mismatches, but also the quality of the sequenced bases, as indicated by the Phred score.
Here, we will:
#----------------------------
# Low quality read filtering
#----------------------------
# For Single end
samtools view -h -b -F 4 -q 30 -@ 10 -o <aligned_filtered.bam> <aligned.bam>
A typical problem in libraries sequenced using a PCR amplification is the presence of PCR duplicates among the reads.
These should be filtered out, as they do not bring any additional insights. Usually, these duplicates can be identifiied when many reads have the exact same coordinates. We will delete these duplicates from the bam file using the samtools markdup
tool.
#---------------------------------------
# Sorting, Fixmate and Duplicate removal
#---------------------------------------
# NOTE: fixmate needs bam files sorted by read name but
# the downstream processes need co-ordinate sorted bam files, hence the two sorting steps
# Fixmate is needed to add mate tags (-m), which is used for proper duplicate identification by markdup
samtools sort -n -O BAM -@ 10 <aligned_filtered.bam> \
| samtools fixmate -m -@ 10 - - \
| samtools sort -O BAM -@ 10 - > \
| samtools markdup -r -S -@ 10 - <aligned_filtered_sorted_duprmv.bam>
Finally, now that we have generated a new bam file by filtering out reads, we need to re-index the bam file. Indexing helps in quickly accessing the reads.
#----------
# Indexing
#----------
samtools index <aligned_filtered_sorted_duprmv.bam> <aligned_filtered_sorted_duprmv.bam.bam.bai>
A detailed explaination of the parameters used for alignment can be found here, it is important to notice the difference for paired end and single end sequencing.
After alignment, we also filter out poor quality, unmapped and duplicate reads using samtools. The parameters -F, -f, q
along with markdup
are used to filter out reads.
Using this tool, one can find the exact meaning of the filtering flags like
4, 1804, 2, etc
.
SAMtools provides some functions like flagstat, stats and idxstats
to compute overall summary of read alignment statistics. You can find more information on the flagstat here
# Go to your home directory
cd
# Create a folder for your analysis
mkdir -p analysis/AlignmentStats/ChIP
# Flagstat analysis
# Pseudocode:
# samtools flagstat <aligned.bam> > <aligned.bam.log>
# samtools flagstat <aligned_filtered_sorted_duprmv.bam> > <aligned_filtered_sorted_duprmv.bam.log>
samtools flagstat -@ 3 data/processed/CTCF/Bowtie2/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_filt_sort_nodup.bam > \
analysis/AlignmentStats/ChIP/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_filt_sort_nodup.flagstat.log
Can you modify the code above to run flagstat also on the corresponding filtered and unfiltered
.bam
and compare the numbers from the two files. What do you learn ? The unflitered alignment file is named asCTCF_Rep1_ENCFF001HLV_trimmed_aligned_nofilt.bam
cd
samtools flagstat -@ 3 data/processed/CTCF/Bowtie2/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_nofilt.bam > \
analysis/AlignmentStats/ChIP/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_nofilt.flagstat.log
Look at the duplicates field in both files. What does it say? This might be MISLEADING !! See next section why...
The aligned .bam
files output of Bowtie2
does not have duplicate reads marked by any Flags recognized by Samtools
. Thus, the flasgstat
output might give an impression that there are no duplicates in your data and that you don't need to perform duplicate removal. However, this is highly misleading !!.
These aligned files has to be
markdup
We will use one such unfiltered alignment file output from Bowtie2
, perform the steps above and using flagstat
again count the number of duplicates in our data.
cd
samtools sort -O BAM -n -@ 3 data/processed/CTCF/Bowtie2/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_nofilt.bam \
| samtools fixmate -m -@ 3 - - \
| samtools sort -O BAM -@ 3 \
| samtools markdup -@ 3 - - \
| samtools flagstat - > analysis/AlignmentStats/ChIP/CTCF_Rep1_ENCFF001HLV_trimmed_aligned_nofilt_dupmarked_flagstat.log
Compare this
flagstat
output with the alignment statistics identified in the previous section. Do you notice a difference
In the next section, we will perform peak calling using these aligned .bam
files.