Workshop ChIPATAC 2020

Computational analysis of ChIP-seq and ATAC-seq data

14-15 December 2020

2. ATAC-seq : Raw read quality control - FastQC

fastq file format

Raw sequencing reads are stored in text files that contain each nucleotide sequence together with its associated quality scores. These are called fastq files and are usually compressed, for example XYZ.fastq.gz. Each read is described by four lines. Let's look at the first four lines of an example fastq file:

cd

zcat data/fastqdata/ATACseq/ATAC_Rep1_ENCFF121EPT.fastq_R1.gz | head -n 4

@J00118:569:HGKLCBBXY:5:1101:1489:1261 1:N:0:GTAGAGGA+AGAGTANA
AATCAGCACCCTGTGTCTAGCTCANGGTTTGTAAANATACCANTCAGCACTCTNTATCTAGCTAATCNAGTGNAGANCTTTTGTGTCTAGCTNAGGGNTTG
+
AA-FFJJJJFJJJJJJFJJJJJJF#FFFJJJ7AJJ#FJJ<<F#-FJJ--FFFJ#JJFJJA<AAFJFF#-<JF#7JJ#FFAFJ7J<J-A-F<J#FAFF#JF7
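To make the four-line layout explicit, here is a minimal sketch that labels each line of a record. The function name `label_fastq_record` and the label text are chosen for this illustration only, and the tiny record is made up rather than taken from the workshop data.

```shell
# Label the four lines of every fastq record read from standard input.
# Line 1: read identifier, line 2: bases, line 3: separator, line 4: qualities.
label_fastq_record() {
    awk '
        NR % 4 == 1 { print "header:    " $0 }
        NR % 4 == 2 { print "sequence:  " $0 }
        NR % 4 == 3 { print "separator: " $0 }
        NR % 4 == 0 { print "quality:   " $0 }
    '
}

# Made-up record for illustration
printf '@read1\nACGT\n+\nFFJ#\n' | label_fastq_record

# With the workshop file, the same idea would be:
# zcat data/fastqdata/ATACseq/ATAC_Rep1_ENCFF121EPT.fastq_R1.gz | head -n 4 | label_fastq_record
```

Note that the sequence and quality lines always have the same length, one quality character per base.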

Do you know the difference between the fast[q] and fast[a] file formats?

Read quality encoding

phred 33 and phred 64

Phred 33 and Phred 64 are two encoding schemes. The former is used by current Illumina sequencers and is seen in most modern sequencing data, while the latter was used for reads from older Illumina pipelines. In both, quality values typically range from 0 to about 40; however, they are represented by different symbols in the fastq file. For instance, a quality score of 0 is represented by ! in Phred 33, while it is represented by @ in Phred 64.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
                  |                         |    |        |                              |                     |
                 33                        59   64       73                            104                   126
Phred 33:         0........................26...31.......40..                                
Phred 64:                                        ...3.....9..............................41. 
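The map above boils down to subtracting an offset from the ASCII code of each quality character. Here is a minimal sketch, assuming a hypothetical helper named `phred_score` that takes a single character and an optional offset (33 by default):

```shell
# Convert one quality character to its Phred score.
# $1: the quality character, $2: ASCII offset (33 for Phred 33, 64 for Phred 64).
phred_score() {
    # printf '%d' with a leading single quote prints the character's ASCII code
    printf '%d\n' "$(( $(printf '%d' "'$1") - ${2:-33} ))"
}

phred_score 'J' 33    # prints 41
phred_score '#' 33    # prints 2
phred_score 'h' 64    # prints 40
```

In the example read shown earlier, J therefore marks the best base calls and # the worst ones.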
                                 

Quality scores

A quality score represents the probability that a called nucleotide in a sequence is incorrect. It is defined mathematically as:

Q = -10 x log10(p), where p represents the probability of incorrect call

For example,

p=1 (100%),     Q=0
p=0.1 (10%),    Q=10
p=0.01 (1%),    Q=20
p=0.001 (0.1%), Q=30
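The formula can be checked numerically in both directions. This is a small sketch using awk; the function names `phred_to_prob` and `prob_to_phred` are made up for this example.

```shell
# Error probability for a given Phred score: p = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score for a given error probability: Q = -10 * log10(p)
# awk only provides the natural logarithm, hence log(p) / log(10)
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

phred_to_prob 20      # prints 0.01, i.e. 1 wrong call in 100
phred_to_prob 30      # prints 0.001, i.e. 1 wrong call in 1000
prob_to_phred 0.1     # prints 10
```

Every increase of 10 in Q corresponds to a tenfold drop in the error probability.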

The reads that we displayed above are encoded using Phred 33 (Illumina 1.9). Can you estimate how good the base calls are based on the map above?

Quality control

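We will use the FastQC tool to perform a basic quality assessment of the raw reads. This tool reports a set of summary metrics for each file, such as per-base sequence quality, per-sequence GC content, sequence duplication levels and adapter content.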

Based on these parameters one could estimate the sequencing quality and identify major problems right at the beginning of a project.

FastQC

Let us perform a simple FastQC analysis on the fastq file that we viewed above.

# Go to your home directory
cd 

# Create a folder for your analysis
mkdir -p analysis/FastQC/ATAC

# Check out all the available parameters in FastQC
# Note: when in doubt, it is often good practice to use the default settings
# Most options are optional and have defaults; focus on the essential parameters that have to be changed

fastqc --help

# FastQC analysis for one fastq file (list several files, separated by spaces, to analyze more)

# Pseudocode: 
# fastqc --outdir <name of output directory> <space separated list of fastq files>

# Actual analysis:
fastqc --outdir analysis/FastQC/ATAC \
data/fastqdata/ATACseq/ATAC_Rep1_ENCFF121EPT.fastq_R1.gz

# Find your results here
cd analysis/FastQC/ATAC
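Besides the html report, FastQC writes a zip archive per input file that contains a tab-separated summary.txt with one PASS/WARN/FAIL line per module. As a rough sketch, the modules that need attention can be listed from the terminal. The helper name `flagged_modules` and the example lines are made up for illustration.

```shell
# Print every module whose status is not PASS.
# Input: the content of a FastQC summary.txt (status <TAB> module <TAB> file name).
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Made-up summary lines for illustration
printf 'PASS\tBasic Statistics\texample.fastq.gz\nWARN\tPer base sequence content\texample.fastq.gz\n' | flagged_modules
# prints: WARN: Per base sequence content

# With real output, assuming the output folder holds a single archive:
# unzip -p analysis/FastQC/ATAC/*_fastqc.zip '*/summary.txt' | flagged_modules
```

A WARN or FAIL is a hint to look at the corresponding plot in the html report, not a verdict on the whole dataset.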

Analyzing the output

Now you can use Cyberduck to open the generated html files.

Try performing this analysis on other ATAC-seq fastq files of your choice. Compare the number of reads in R1 and R2 of the same replicate. Can you explain why you see these numbers?
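The read counts can also be cross-checked without FastQC, since every read occupies exactly four lines. Here is a minimal sketch, with `count_reads` as a hypothetical helper name and a made-up two-read file for the demonstration:

```shell
# Count the reads in a gzip-compressed fastq file: number of lines divided by 4.
count_reads() {
    gzip -dc < "$1" | awk 'END { print NR / 4 }'
}

# Self-contained demonstration on a made-up file with two reads
tmp=$(mktemp)
printf '@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nJJJJ\n' | gzip > "$tmp"
count_reads "$tmp"    # prints 2
rm -f "$tmp"

# With the workshop data:
# count_reads data/fastqdata/ATACseq/ATAC_Rep1_ENCFF121EPT.fastq_R1.gz
# count_reads <path to the matching R2 file>
```

The value should match the "Total Sequences" entry in the Basic Statistics section of the FastQC report for the same file.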

Refer to the ENCODE ATAC-seq good practices. Do the numbers look good?

In the next section, we will perform adapter trimming/clipping based on our knowledge of the FastQC results.