IRTG Course

Introduction to R for genomics

Carl Herrmann & Carlos Ramirez

8-9 December 2021

Quality control

Filtering out cells with low sequencing quality is an important step, since such cells can strongly affect downstream analysis. Quality control often requires visualizing and inspecting the samples in order to determine appropriate thresholds. Threshold values may vary from one dataset to another, so no single hard threshold can be applied equally to every case.

We will examine the number of UMI counts, the number of detected RNA features (genes) and the percentage of reads mapped to mitochondrial genes.

We will first calculate the percentage of UMI counts from reads mapped to mitochondrial genes. This step must be done manually, since it relies on a priori knowledge of which genes are mitochondrial. In this case, genes are annotated with human Ensembl gene symbols, in which mitochondrial gene names start with the string MT-.

pbmc.seurat[[""]] <- PercentageFeatureSet(pbmc.seurat, pattern = "^MT-")
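To make explicit what this percentage is, here is a minimal base R sketch on a made-up count matrix. The gene names, cell names, counts and the variable name percent.mito are invented for illustration and are not taken from the course dataset. It assumes genes are in rows and cells in columns, and computes for each cell the share of counts that fall on genes whose symbol starts with MT-.

```r
# Toy count matrix: genes in rows, cells in columns (all values are invented)
counts <- matrix(
  c(10,  0,  5,   # MT-CO1
     5,  2,  0,   # MT-ND1
    60, 40, 80,   # ACTB
    25, 58, 15),  # CD3E
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "CD3E"),
                  c("cell1", "cell2", "cell3"))
)

# Rows whose gene symbol starts with "MT-"
is.mito <- grepl("^MT-", rownames(counts))

# Percentage of each cell's counts that come from mitochondrial genes
# (percent.mito is a hypothetical name used only in this sketch)
percent.mito <- 100 * colSums(counts[is.mito, , drop = FALSE]) / colSums(counts)
print(percent.mito)
```

Every toy cell has 100 counts in total, so the percentages can be checked by eye against the two MT- rows.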

Then, we can visualize these metrics using the function VlnPlot() as follows:

VlnPlot(pbmc.seurat, features = c('nFeature_RNA', 'nCount_RNA', ''))

The violin plots show the value of each metric for every cell, overlaid on a violin-shaped density estimate of the distribution.

Filtering out cells

Based on the previous violin plots, we can define some thresholds and filter out cells using the subset() function. In the following code we keep cells with nFeature_RNA < 1250, nCount_RNA < 4000 and a percentage of reads mapped to mitochondrial genes below 5 percent.

pbmc.filtered <- subset(pbmc.seurat,
                        nFeature_RNA < 1250 &
                        nCount_RNA < 4000 &
                        < 5)

We can check the number of cells that passed the QC.

## [1] 452
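The logic of thresholding and then counting the surviving cells can be sketched in base R on a toy per-cell table. The values and the column name pct.mito are invented for illustration; only the three thresholds come from the text above.

```r
# Toy per-cell QC table (values invented; pct.mito is a hypothetical column name)
qc <- data.frame(
  nFeature_RNA = c(800, 1500, 900, 1100),
  nCount_RNA   = c(2500, 6000, 3000, 3500),
  pct.mito     = c(2.1, 3.0, 7.5, 1.2),
  row.names    = c("cell1", "cell2", "cell3", "cell4")
)

# A cell is kept only if it passes all three thresholds used in the text
keep <- qc$nFeature_RNA < 1250 & qc$nCount_RNA < 4000 & qc$pct.mito < 5
qc.filtered <- qc[keep, ]

# Number of cells that passed the QC
n.passed <- nrow(qc.filtered)
print(n.passed)
```

In this toy table cell2 fails on features and counts and cell3 fails on the mitochondrial percentage. On the Seurat object itself, the cell count is typically obtained with ncol(pbmc.filtered), since cells are stored as columns.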


There are several methods for normalizing scRNA-Seq data. A commonly used strategy is log normalization, which corrects for differences in sequencing depth between cells: each feature count is divided by the total number of counts in that cell, the result is multiplied by a scale factor (usually 10000), and the values are finally log transformed.

Log normalization can be implemented by using the NormalizeData() function.

pbmc.filtered <- NormalizeData(pbmc.filtered)
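The three steps described above (divide by the cell total, multiply by a scale factor, log transform) can be written out in base R on a toy matrix. This is a sketch, not Seurat's actual implementation: the counts are invented, and the use of log1p (the log of 1 plus the value) is an assumption based on our understanding of the default "LogNormalize" method.

```r
# Toy count matrix: genes in rows, cells in columns (all values are invented)
counts <- matrix(c(1,  3,
                   0,  6,
                   9, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

scale.factor <- 10000

# Divide each count by the total of its cell (column), then multiply by the
# scale factor; sweep() with MARGIN = 2 applies the division column by column
scaled <- sweep(counts, 2, colSums(counts), "/") * scale.factor

# Log transform; log1p(x) = log(1 + x), so zero counts stay at zero
log.norm <- log1p(scaled)
print(round(log.norm, 3))
```

After the second step every cell sums to the scale factor, which is what removes the differences in sequencing depth.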

Then, in order to make gene measurements more comparable, the log-transformed values of each gene are scaled so that the mean is equal to zero and the variance is equal to 1, as follows:

pbmc.filtered <- ScaleData(pbmc.filtered)
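As a rough illustration of what this scaling means, the following base R sketch standardizes each gene of a toy matrix across cells. The values are invented, and the sketch only shows the centering and scaling idea; it does not reproduce any additional options of ScaleData().

```r
# Toy log-normalized matrix: genes in rows, cells in columns (values invented)
log.norm <- matrix(c(1, 2,  3,
                     4, 4, 10),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"),
                                   c("cell1", "cell2", "cell3")))

# scale() standardizes columns, so transpose to put genes in columns,
# standardize, then transpose back to a genes x cells matrix
scaled <- t(scale(t(log.norm)))

# Each gene now has mean 0 and variance 1 across cells
print(rowMeans(scaled))
print(apply(scaled, 1, var))
```

Because each gene is rescaled separately, a highly expressed gene and a lowly expressed gene end up on the same scale.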


Performing your own QC!


How are the number of features and UMI counts related?
a) They are not related and randomly distributed in a scatter plot
b) They are related in a non-linear way
c) They are linearly related
TIP: Use the function FeatureScatter() and inspect its manual page using ?FeatureScatter.

FeatureScatter(pbmc.seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

As expected, we observe a linear relationship between the number of UMI counts and the number of features detected.
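One way to put a number on such a relationship is the Pearson correlation together with a least-squares line. The sketch below uses invented per-cell values and hypothetical variable names (n.count, n.feature); it is not computed from the course dataset.

```r
# Toy per-cell metrics (values invented for illustration)
n.count   <- c(1000, 2000, 3000, 4000, 5000)
n.feature <- c(400, 700, 1050, 1300, 1650)

# Pearson correlation: values close to 1 indicate a strong linear relation
r <- cor(n.count, n.feature, method = "pearson")
print(round(r, 3))

# Intercept and slope of the least-squares line through the points
fit <- lm(n.feature ~ n.count)
print(coef(fit))
```

A correlation near 1 supports answer c), whereas a clearly curved scatter plot would point to a non-linear relation even if the correlation is high.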


Load a Seurat object using the following command:

Warning!! Check your Seurat version, and use one of the two commands:

pbmc.seurat <- readRDS(url(''))
pbmc.seurat <- readRDS(url(''))

Calculate the mean and median values of the percentage of mitochondrial reads.
  1. mean=2.2133 and median=2.0532
  2. mean=2.2246 and median=2.0639
  3. mean=0.0102 and median=1.1743

Filter out cells based on QC values

NOTE: Do not skip any step in the pipeline.