fastq file format
Raw sequencing reads are stored in text files containing the sequence of nucleotides and their associated quality scores. These are called fastq text files and are usually stored in compressed form, e.g. XYZ.fastq.gz. Each read is described by 4 lines. Let's look at the first four lines of an example fastq file:
cd
zcat data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz | head -n 4
@SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
ACAATAATAGGTTAGGTGGATTCCCAGGNNNNNNNN
+SOLEXA-1GA-2_0051_FC62478:3:1:1371:1211#0/1
afaagggg_ddffcfffc_cfffffBBBBBBBBBBB
The @ character at the start of line 1 marks the sequence identifier. In this identifier, each value separated by a : carries a piece of information about the read. The exact fields vary with the sequencing platform. You can find more detailed information here.
Line 2 holds the nucleotide sequence and line 4 holds the quality scores, one character per nucleotide.
The + character on line 3 can be followed by the same information as Line 1, be left empty, or carry some additional description of the read.
Do you know the difference between the fast[q] and fast[a] file formats? See here.
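The four-line layout described above can be checked with a few lines of awk. This is only a minimal sketch: the function name fastq_check is made up for illustration, and it assumes an uncompressed fastq stream on standard input in which every read occupies exactly 4 lines.

```shell
# Hypothetical helper (not part of any tool): sanity-check the 4-line layout.
# Reads uncompressed fastq from stdin and prints the read count and the
# number of layout problems it noticed.
fastq_check() {
    awk '
        # Line 1 of each record must start with @
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }
        # Line 2 is the sequence; remember its length
        NR % 4 == 2 { seqlen = length($0) }
        # Line 3 must start with +
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }
        # Line 4 needs one quality character per nucleotide
        NR % 4 == 0 && length($0) != seqlen { bad++ }
        END {
            # An incomplete last record also counts as a problem
            if (NR % 4 != 0) bad++
            printf "reads=%d problems=%d\n", int(NR / 4), bad
        }'
}

# Possible usage on the first 1000 reads of the example file:
# zcat data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz | head -n 4000 | fastq_check
```

Dividing the line count by 4 is also the quickest way to get the number of reads in a file.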
Phred 33 and Phred 64 are two encoding schemes for quality scores. The former is used by current Illumina sequencers and is seen in most modern sequencing data, while the latter was used in older Illumina reads. In both, quality values typically range from 0 to about 40; however, they are represented by entirely different symbols in the fastq files. For instance, a quality score of 0 is represented by ! in Phred 33, while it is represented by @ in Phred 64 encoding.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
                  |                         |    |        |                              |                     |
                 33                        59   64       73                            104                   126
Phred 33:         0........................26...31.......40..
Phred 64:                                        ...3.....9..............................41.
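The table boils down to a subtraction: the quality value is the ASCII code of the character minus the offset of the encoding (33 or 64). A small sketch of that arithmetic follows; phred_score is a hypothetical helper name, and it relies on the standard printf behaviour of returning the character code for an argument that starts with a quote.

```shell
# Hypothetical helper: quality value of one character.
# $1: a single quality character, $2: encoding offset (33 or 64)
phred_score() {
    # printf '%d' "'x" prints the numeric code of the character x
    ascii=$(printf '%d' "'$1")
    echo $((ascii - $2))
}

phred_score '!' 33   # lowest score in Phred 33
phred_score '@' 64   # lowest score in Phred 64
phred_score 'I' 33   # ASCII 73 minus 33
```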
A quality score represents the probability that a called nucleotide in a sequence is incorrect. This is expressed mathematically as:
Q = -10 x log10(p), where p is the probability of an incorrect call
For example,
p = 1 (100%), Q = 0
p = 0.1 (10%), Q = 10
p = 0.01 (1%), Q = 20
p = 0.001 (0.1%), Q = 30
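To try other values, the formula can be evaluated directly with awk. The two function names below are made up for this sketch. awk only provides the natural logarithm, so log10(p) is computed as log(p)/log(10), and -log(p) is written as log(1/p).

```shell
# Hypothetical helpers for Q = -10 x log10(p) and its inverse.

# Error probability (as a fraction, e.g. 0.01 for 1%) -> quality score
phred_from_prob() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 10 * log(1 / p) / log(10) }'
}

# Quality score -> error probability, p = 10^(-Q/10)
prob_from_phred() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_from_prob 0.01   # an error rate of 1 in 100
prob_from_phred 30     # error probability behind Q = 30
```

Every increase of 10 in Q corresponds to a tenfold drop in the error probability.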
Looking at the quality scores (4th line of the fastq), can you guess what quality encoding (Phred 64 or Phred 33) was used?
The read quality contains the following characters:
afaagggg_ddffcfffc_cfffffBBBBBBBBBBB
Looking at the encoding table above, this can only correspond to Phred 64, as the characters a, b, c, ... do not occur in the Phred 33 encoding (there they would stand for quality values far above 40). In Phred 64, they have the values a=33, b=34, ...
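The same reasoning can be automated by looking for the lowest character code among the quality strings. The sketch below is a heuristic, not a definitive test: guess_phred_offset is a hypothetical name, and the cut-offs (below 59 means Phred 33, nothing below 64 suggests Phred 64) are taken from the markers in the encoding table. A Phred 33 file containing only very high qualities would be misjudged, hence the "likely".

```shell
# Hypothetical helper: guess the encoding from quality strings on stdin
# (one quality string per line).
guess_phred_offset() {
    awk '
        BEGIN {
            # awk has no ord(), so build a character -> ASCII code table
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        {
            # Track the lowest character code seen so far
            n = length($0)
            for (j = 1; j <= n; j++) {
                c = substr($0, j, 1)
                if ((c in ord) && ord[c] < min) min = ord[c]
            }
        }
        END {
            if (min == 127)     print "no quality characters seen"
            else if (min < 59)  print "Phred 33"
            else if (min >= 64) print "Phred 64 (likely)"
            else                print "ambiguous"
        }'
}

# Quality string of the example read shown above
printf '%s\n' 'afaagggg_ddffcfffc_cfffffBBBBBBBBBBB' | guess_phred_offset
```

On a real file, the quality lines can be extracted first, for example with awk 'NR % 4 == 0', and piped into the function.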
We will use the FastQC tool to perform a basic quality assessment of the raw reads. This tool gives us the following information -
Based on these parameters one could estimate the sequencing quality and identify major problems right at the beginning of a project.
Let us perform a simple FastQC analysis on the fastq file that we viewed above and its control.
# Go to your home directory
cd
# Create a folder for your analysis
mkdir -p analysis/FastQC/ChIP
# Check out all the available parameters in FastQC
# Note: when in doubt, it's often good practice to use the default settings
# Most options are optional and have defaults; focus on the essential parameters that have to be changed
fastqc --help
# FastQC analysis for two fastq files
# Pseudocode:
# fastqc --outdir <name of output directory> <space separated list of fastq files>
# Actual analysis:
fastqc --outdir analysis/FastQC/ChIP \
data/fastqdata/ChIPseq/H3K4me3/H3K4me3_Rep1_ENCFF001FIS.fastq.gz
# Find your results here
cd analysis/FastQC/ChIP
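Besides the html report, FastQC writes a zip archive per input file, which contains a tab-separated summary.txt with one line per module (status, module name, file name). The following sketch lists the modules that did not pass. The function name flagged_modules is made up, and the archive name in the usage comment is an assumption based on FastQC's usual <input name>_fastqc.zip naming; check the actual file names in your output folder.

```shell
# Hypothetical helper: print every module whose status is not PASS.
# Reads a FastQC summary.txt (status <TAB> module <TAB> file name) on stdin.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Possible usage from inside analysis/FastQC/ChIP (archive name assumed):
# unzip -p H3K4me3_Rep1_ENCFF001FIS_fastqc.zip '*/summary.txt' | flagged_modules
```

This is handy when many files have to be screened, since it avoids opening every html report by hand.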
Using Cyberduck, open the folder analysis/FastQC/ChIP and open the generated html files:
Can you identify the Phred encoding?
The encoding is Illumina 1.5, which corresponds to the Phred 64 encoding (see here)
What does the base quality look like?
Sequencing quality looks very good, with an average quality around 38.
Are there any adapter contaminations?
No sequencing adapters were found; however, we have contamination due to the Illumina PCA Primer 2 (section "Overrepresented sequences")
Can you find any issues with the fastq files? Refer to the individual module descriptions.
Possible issues:
Try performing this analysis on different ChIPseq fastq files of your choice and compare the read numbers and other QC properties.
Refer to the ENCODE ChIPseq good practices.
In the next section, we will perform adapter trimming/clipping based on our knowledge of the FastQC results.