Workshop ChIPATAC 2020

Computational analysis of ChIP-seq and ATAC-seq data

14-15 December 2020

2. ChIP-seq : Raw read quality control - FastQC

fastq file format

Raw sequencing reads are stored in text files that contain the nucleotide sequence of each read and its associated quality scores. These are called fastq files and are usually compressed, e.g. XYZ.fastq.gz. Each read is described by 4 lines: an identifier line starting with @, the nucleotide sequence, a separator line starting with +, and the quality string. Let's look at the first four lines of an example fastq file:

cd

zcat data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz | head -n 4

@SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
ACAATAATAGGTTAGGTGGATTCCCAGGNNNNNNNN
+SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
afaagggg_ddffcfffc_cfffffBBBBBBBBBBB
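
Because every read occupies exactly four lines, the number of reads in a file is simply its line count divided by four. The following is a minimal sketch of that idea; the function name count_reads is made up for this example, and gzip -dc is used as a portable equivalent of zcat.

```shell
# count_reads (hypothetical helper): print the number of reads in a gzipped fastq file
count_reads() {
    # Decompress to stdout, count the lines, and divide by 4 lines per read
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example call on the workshop file:
# count_reads data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz
```

This assumes a well-formed fastq file in which sequence and quality strings are not wrapped over several lines.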

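Do you know the difference between the fast[q] and fast[a] file formats? A fasta record stores only an identifier and a sequence, whereas a fastq record additionally stores a quality score for every base.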

Read quality encoding

phred 33 and phred 64

Phred 33 and Phred 64 are two encoding schemes. The former is used by current Illumina sequencers and is seen in most modern sequencing data, while the latter was used in older Illumina sequenced reads. In both, quality values range from about 0 to 40; however, they are represented by different symbols in the fastq files. For instance, a quality score of 0 is represented by ! in Phred 33, while it is represented by @ in Phred 64.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
                  |                         |    |        |                              |                     |
                 33                        59   64       73                            104                   126
Phred 33:         0........................26...31.......40..                                
Phred 64:                                        ...3.....9..............................41. 
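
The table above boils down to one subtraction: the quality score is the ASCII code of the character minus the offset (33 or 64). As a small sketch, the helper below (qual_to_phred is a name invented here, not part of any tool) does this conversion in the shell.

```shell
# qual_to_phred (hypothetical helper): convert one quality character to its score
# Usage: qual_to_phred <character> <offset>, where offset is 33 or 64
qual_to_phred() {
    # printf '%d' with a leading single quote prints the ASCII code of the character
    code=$(printf '%d' "'$1")
    echo $(( code - $2 ))
}

# Examples:
# qual_to_phred '!' 33   # quality 0 in Phred 33
# qual_to_phred '@' 64   # quality 0 in Phred 64
```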
                                 

Quality scores

A quality score represents the probability that a called nucleotide in a sequence is incorrect. This is represented mathematically as:

Q = -10 x log10(p), where p represents the probability of incorrect call

For example,

p=1     (100%), Q=0
p=0.1   (10%),  Q=10
p=0.01  (1%),   Q=20
p=0.001 (0.1%), Q=30
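
To check values like these yourself, the formula can be evaluated with awk. This is only a sketch: the function names are invented for this example, p is given as a fraction (0.01 for 1%), and since awk has no log10 function the natural logarithm is divided by log(10).

```shell
# phred_from_p (hypothetical helper): quality score for an error probability p
phred_from_p() {
    # 10 * log10(1/p) is the same as -10 * log10(p)
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 10 * log(1 / p) / log(10) }'
}

# p_from_phred (hypothetical helper): error probability for a quality score Q
p_from_phred() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Examples:
# phred_from_p 0.01   # error probability of 1%
# p_from_phred 30     # quality score of 30
```

A quality score of 30 therefore means that, on average, about one base call in a thousand is expected to be wrong.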

Looking at the quality scores (4th line of the fastq), can you guess what quality encoding (Phred 64 or Phred 33) was used?


The read quality contains the following characters:

afaagggg_ddffcfffc_cfffffBBBBBBBBBBB

Looking at the encoding table above, this can only correspond to Phred 64, as the lowercase characters a, b, c, ... do not occur in the Phred 33 encoding. In Phred 64, we have a=33, b=34, ...
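
The same reasoning can be automated by looking at the lowest and highest quality characters in a file. The sketch below is a heuristic, not a standard tool: the function name guess_encoding and the cut-offs are our own assumptions (characters below ASCII 59 are taken to occur only in Phred 33, characters above ASCII 74 only in Phred 64), and they are only meant for Illumina-style short reads.

```shell
# guess_encoding (hypothetical helper): guess the quality encoding of a gzipped fastq file
# Usage: guess_encoding <file.fastq.gz> [number of reads to inspect, default 1000]
guess_encoding() {
    gzip -dc "$1" | awk -v n_reads="${2:-1000}" '
        BEGIN {
            # Lookup table from printable character to ASCII code
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            lo = 127; hi = 0
        }
        # Every 4th line is a quality string; only inspect the first n_reads reads
        NR % 4 == 0 && NR <= 4 * n_reads {
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if (!(ch in ord)) continue
                if (ord[ch] < lo) lo = ord[ch]
                if (ord[ch] > hi) hi = ord[ch]
            }
        }
        END {
            if (lo < 59) print "Phred 33"
            else if (hi > 74) print "Phred 64"
            else print "ambiguous"
        }'
}

# Example call on the workshop file:
# guess_encoding data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz
```

FastQC makes a similar guess and reports it in the Basic Statistics section of its report.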



Quality control

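We will use the FastQC tool to perform a basic quality assessment of the raw reads. This tool reports, among other things, the per-base sequence quality, the per-sequence GC content, sequence duplication levels, overrepresented sequences and adapter content.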

Based on these parameters one could estimate the sequencing quality and identify major problems right at the beginning of a project.

FastQC

Let us perform a simple FastQC analysis on the fastq file that we viewed above and its control.

# Go to your home directory
cd 

# Create a folder for your analysis
mkdir -p analysis/FastQC/ChIP

# Check out all the available parameters in FastQC
# Do note, when in doubt, it's often good practice to use default settings
# Most options are optional and set to default, focus on the essential parameters that have to be changed

fastqc --help

# FastQC analysis for one or more fastq files
# Pseudocode: 
# fastqc --outdir <name of output directory> <space separated list of fastq files>

# Actual analysis (to analyze the control as well, add its fastq file as a further argument):
fastqc --outdir analysis/FastQC/ChIP \
data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz

# Find your results here
cd analysis/FastQC/ChIP

Analyzing the output

Using Cyberduck, open the folder analysis/FastQC/ChIP and open the generated html files in your browser.
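
Besides the html report, FastQC writes a zip archive for each input file. To our knowledge this archive contains a tab-separated summary.txt with one PASS, WARN or FAIL flag per module, which allows a quick check on the command line. The sketch below assumes that layout; the function name flagged_modules is invented here, and the zip file name in the example is what we expect FastQC to produce for our input file.

```shell
# flagged_modules (hypothetical helper): read a FastQC summary.txt from stdin
# and print only the modules that were flagged WARN or FAIL
flagged_modules() {
    # Column 1 is the flag, column 2 the module name, column 3 the file name
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $1 ": " $2 }'
}

# Example call (the zip file name is an assumption based on the input file name):
# unzip -p analysis/FastQC/ChIP/H3K4me3_Rep1_ENCFF001FIS_fastqc.zip '*/summary.txt' | flagged_modules
```

A WARN or FAIL flag is not necessarily a problem for ChIP-seq data, so always look at the corresponding plot in the html report before drawing conclusions.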

Try performing this analysis on different ChIPseq fastq files of your choice and compare the read numbers and other QC properties.
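
To run FastQC on several files in one go, a simple loop is enough. The sketch below assumes that the fastq files are organized as data/fastqdata/ChIPseq/<experiment>/<file>.fastq.gz, which we infer from the single path used above; the function name run_fastqc_all and the FASTQC variable are made up for this example.

```shell
# Command used to run FastQC; can be overridden, e.g. FASTQC=echo for a dry run
FASTQC=${FASTQC:-fastqc}

# run_fastqc_all (hypothetical helper): run FastQC on every fastq.gz file
# found one directory level below <input directory>
# Usage: run_fastqc_all <input directory> <output directory>
run_fastqc_all() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    for fq in "$indir"/*/*.fastq.gz; do
        # Skip the unexpanded pattern if no file matches
        [ -e "$fq" ] || continue
        "$FASTQC" --outdir "$outdir" "$fq"
    done
}

# Example call:
# run_fastqc_all data/fastqdata/ChIPseq analysis/FastQC/ChIP
```

Setting FASTQC=echo before calling the function prints the commands instead of running them, which is a convenient way to check which files would be processed.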

Refer to ENCODE ChIPseq good practices

In the next section, we will perform adapter trimming/clipping based on our knowledge of the FastQC results.