2. ChIP-seq : Raw read quality control - FastQC

`fastq` file format

Raw sequencing reads are stored in text files that contain the nucleotide sequence of each read together with its associated quality scores. These fastq files are usually compressed, for example XYZ.fastq.gz. Each read is described by four lines. Let's look at the first four lines of an example fastq file:

cd

zcat data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz | head -n 4

@SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
ACAATAATAGGTTAGGTGGATTCCCAGGNNNNNNNN
+SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
afaagggg_ddffcfffc_cfffffBBBBBBBBBBB

Line 1 begins with a @ character and holds the sequence identifier. Within the identifier, each value separated by a : carries a piece of information about the read; the exact fields vary with the sequencing platform. You can find more detailed information here.
Line 2 contains the actual read sequence.
Line 3 begins with a + character and can repeat the information from Line 1, be empty, or hold an additional description of the read.
Line 4 encodes the quality values for the nucleotides in Line 2. It therefore has the same number of characters as Line 2.
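
The four-line layout can be checked directly in the shell. The following is a minimal sketch rather than part of the workshop pipeline: it writes the example record shown above to a temporary file (the name record.fastq is only an assumption for this demo), reads the four lines into separate variables, and compares the lengths of Line 2 and Line 4.

```shell
# Write the example record from above into a temporary file (demo only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/record.fastq" <<'EOF'
@SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
ACAATAATAGGTTAGGTGGATTCCCAGGNNNNNNNN
+SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
afaagggg_ddffcfffc_cfffffBBBBBBBBBBB
EOF

# Read the four lines of the first record into separate variables
{
  IFS= read -r id_line
  IFS= read -r seq_line
  IFS= read -r plus_line
  IFS= read -r qual_line
} < "$tmpdir/record.fastq"

# Line 2 (sequence) and Line 4 (quality) should have the same length
seq_len=${#seq_line}
qual_len=${#qual_line}
echo "sequence length: $seq_len, quality length: $qual_len"

# Remove the temporary demo file again
rm -r "$tmpdir"
```

If the two lengths differ, the record is truncated or corrupted and downstream tools will usually refuse to read the file.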

Do you know the difference between the fast[q] and fast[a] file formats? See here.

Read quality encoding

phred 33 and phred 64

Phred 33 and Phred 64 are two encoding schemes. The former is used by current Illumina sequencers and is seen in most modern sequencing data, while the latter was used for reads from older Illumina pipelines. In both, quality values typically range from 0 to about 40, but they are represented by different symbols in the fastq file. For instance, a quality score of 0 is represented by ! in Phred 33, while it is represented by @ in Phred 64.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
                  |                         |    |        |                              |                     |
                 33                        59   64       73                            104                   126
Phred 33:         0........................26...31.......40..                                
Phred 64:                                        ...3.....9..............................41.
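
The table can be reproduced with a little shell arithmetic: the quality score is the ASCII code of the character minus the offset (33 or 64). The function below is a small illustrative sketch; the name char_to_q is made up for this example and is not part of any tool used in the workshop.

```shell
# Convert one quality character to its Phred score for a given offset (33 or 64)
char_to_q() {
  ch=$1
  offset=$2
  # A leading single quote makes printf print the ASCII code of the character
  code=$(printf '%d' "'$ch")
  echo $((code - offset))
}

char_to_q '!' 33   # prints 0
char_to_q '@' 64   # prints 0
char_to_q 'I' 33   # prints 40
char_to_q 'h' 64   # prints 40
```

The same character therefore means very different things under the two schemes, which is why tools need to know (or guess) the encoding.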

Quality scores

A quality score represents the probability that a called nucleotide in a sequence is incorrect. This is expressed mathematically as:

Q = -10 x log10(p), where p is the probability of an incorrect call

For example,

p=1 (100%),     Q=0
p=0.1 (10%),    Q=10
p=0.01 (1%),    Q=20
p=0.001 (0.1%), Q=30
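
The conversion in both directions can be tried out with awk. This is only a sketch: the function names p_to_q and q_to_p are invented for this example, and p is given as a fraction between 0 and 1 (so 0.01 means 1%).

```shell
# Phred score from an error probability: Q = -10 * log10(p)
# awk only provides the natural logarithm, so log10(p) = log(p) / log(10)
p_to_q() {
  awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# Error probability from a Phred score: p = 10^(-Q/10)
q_to_p() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

p_to_q 0.01    # prints 20
p_to_q 0.001   # prints 30
q_to_p 40      # prints 0.0001
```

Every increase of 10 in Q corresponds to a tenfold drop in the error probability.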

Looking at the quality scores (4th line of the fastq), can you guess what quality encoding (Phred 64 or Phred 33) was used?

[Click to show answer!]

The read quality contains the following characters:

afaagggg_ddffcfffc_cfffffBBBBBBBBBBB

Looking at the encoding table above, this can only correspond to Phred 64, because lowercase characters such as a, b, c, ... lie outside the 0-40 range of the Phred 33 encoding. In Phred 64, they have a=33, b=34, ...
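
The same reasoning can be turned into a rough heuristic. The sketch below (the function name guess_encoding and the cut-offs are assumptions made for this illustration, not the method FastQC uses) looks at the lowest and highest ASCII codes in a quality string: codes below 59 can only occur in Phred 33, while codes above 74 (the character J) exceed the usual Phred 33 range and point to Phred 64.

```shell
# Guess the quality encoding from the range of ASCII codes in a quality string
guess_encoding() {
  # od prints each byte as an unsigned decimal; -v stops it collapsing repeated lines
  printf '%s' "$1" | od -An -v -tu1 | awk '
    {
      for (i = 1; i <= NF; i++) {
        c = $i + 0
        if (!seen || c < min) min = c
        if (!seen || c > max) max = c
        seen = 1
      }
    }
    END {
      if (min < 59)      print "Phred 33"
      else if (max > 74) print "Phred 64"
      else               print "ambiguous"
    }'
}

guess_encoding 'afaagggg_ddffcfffc_cfffffBBBBBBBBBBB'   # prints Phred 64
```

A single read is a thin basis for a decision; in practice the guess becomes more reliable when many reads are inspected, which is what FastQC does for the whole file.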

Quality control

We will use the FastQC tool to perform a basic quality assessment of the raw reads. This tool reports a set of quality metrics for each fastq file, for example the per base sequence quality, the per base sequence content and overrepresented sequences.

Based on these metrics, one can estimate the sequencing quality and identify major problems right at the beginning of a project.

FastQC

Let us perform a simple FastQC analysis on the fastq file that we viewed above and its control.

# Go to your home directory
cd 

# Create a folder for your analysis
mkdir -p analysis/FastQC/ChIP

# Check out all the available parameters in FastQC
# Note: when in doubt, it is often good practice to use the default settings
# Most options are optional and set to defaults; focus on the essential parameters that have to be changed

fastqc --help

# FastQC analysis of one or more fastq files
# (further files, such as the control, can be appended to the list)
# Pseudocode:
# fastqc --outdir <name of output directory> <space separated list of fastq files>

# Actual analysis:
fastqc --outdir analysis/FastQC/ChIP \
data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz

# Find your results here
cd analysis/FastQC/ChIP

Analyzing the output

Using Cyberduck, open the folder analysis/FastQC/ChIP and open the generated html files:

Can you identify the Phred encoding?

[Click to show answer!]

The encoding is Illumina 1.5, which corresponds to the Phred 64 encoding (see here)

What does the base quality look like?

[Click to show answer!]

Sequencing quality looks very good, with an average quality around 38.

Are there any adapter contaminations?

[Click to show answer!]

No sequencing adapters were found; however, there is contamination from the Illumina PCR Primer 2 (section "Overrepresented sequences").

Can you find any issues with the fastq files? Refer to the individual module descriptions.
[Click to show answer!]

Possible issues:
- PCR primer contamination
- some sequence bias at the 5' end of the reads (section "Per base sequence content")

Try performing this analysis on different ChIPseq fastq files of your choice and compare the read numbers and other QC properties.
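
Read numbers can also be obtained without FastQC, since every read occupies exactly four lines. The snippet below is a hedged sketch: count_reads is a helper name invented here, and the loop assumes that further fastq.gz files sit in the same data directory used above.

```shell
# Count the reads in a gzip-compressed fastq file (four lines per read)
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Print the read count for every fastq.gz file in the example data directory
for fq in data/fastqdata/ChIPseq/H3K4me3/*.fastq.gz; do
  [ -e "$fq" ] || continue   # skip if the pattern matched no file
  printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

The numbers should agree with the "Total Sequences" value in the Basic Statistics module of the corresponding FastQC report.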

Refer to ENCODE ChIPseq good practices

In the next section, we will perform adapter trimming/clipping based on our knowledge of the FastQC results.

Workshop ChIPATAC 2020

Computational analysis of ChIP-seq and ATAC-seq data
