MoBi 6.FS - SoSe 2026
Carl Herrmann - IPMB & BioQuant
The goal of these sessions is to investigate how we can improve the prediction of binding sites for transcription factors by using additional information (epigenetic signals, sequence conservation, chromatin accessibility), besides sequence only informations based on using a sequence profile (PWM) for TFBS prediction.
This additional information could be for example histone modification datasets (ChIP-seq), information about chromatin opening given e.g. by DNAse hypersensitivity assays, information about DNA methylation, …
As a gold standard to determine whether a predicted TFBS is a true or false hit, we will consider ChIP-seq datasets for the transcription factor of interest: every predicted TFBS overlapping with a ChIP peak will be considered a true positive, while those outside ChIP peaks will be false positives.
In the following, we will
for each TF, there is an R rds object available containing a data frame: each line corresponds to a potential TFBS site on chromosome 21, predicted using a position weight matrix; each column corresponds to a feature; the numerical entries represent the scoring of the potential TFBS site with the corresponding feature.
In addition, the last columns (tfbs) indicates if this is a true-positive (i.e. if the predicted site overlaps with a ChIP-seq peak) or not (i.e. no overlap). This is the column we want to predict!
In order to use the file, download the file onto your computer (e.g. CTCF.rds) and load it as:
data = readRDS('/path/to/your/file/CTCF.rds')
head(data)
Here are the links to the specific datasets; select one TF and click on the link to download the .rds file. The first file contains the predicted TFBS the second one the true ChIP-seq peaks.
Goal: We want to evalute the baseline performance of a prediction model which is only based on using a position-weight matrix (pwm), hence is only based on sequence information.
We have scanned the sequence of the human chromosome 21 with a number of PWM for some transcription factors. The output files for each TF contain one line for each predicted TFBS, together with the log-likelihood score (column LLR) of the match, and whether the predicted TFBS can be considered as a true-positive (i.e. overlaps with a ChIP-seq peak for this TF) or a false-positive (i.e does NOT overlap with a ChIP-seq peak.) (column tfbs)
Hence, we have two sets of TFBS:
Goal:
PRROC package, check the vignette here)Stop and think!
- What is the performance of the sequence based prediction? Does the curve have the expected behavior?
- Compute the recall, using the number of ChIP-seq peaks as the ground truth
- Plot precision and recall on the same plot
- Compare the PR plot with the ROC plot; can you explain the difference?
We will now explore the additional genomic features for all the potential TFBS; in particular, we will look for correlations between the features.
We will explore the difference between positive and negative sites by looking at boxplots for each feature for positive/negative tfbs make a boxplot showing the distribution of the signal between positive and negative sites for each feature run a Wilcoxon test to check for significant differences.
Compute the correlation between the features using the cor function in R, and plot the correlation matrix as a heatmap
library(pheatmap)
cor.ft = cor(data[,-16],method="spearman")
diag(cor.ft) = NA
pheatmap(cor.ft)
Stop and think
Check the interpretation of the different marks here. Which features are most correlated to each other? Does this make sense?
We need to keep in mind the correlation between the variables, as they might impact the results of the regression analysis later. Correlated variables are usually harmfull in regression approaches, as they impact the accuray of the prediction for the regression coefficients of the correlated variables. One solution would be to use a ridge regression or a LASSO method from the glmnet package.
We now come to the core of the machine learning analysis, and train a logistic regression to distinguish optimally the positive from the negative TFBS predictions. This will reveal which variables are most suited to discriminate true from false positives. We will use the glm function in R, using the binomial model.
Training the model
We will now apply the glm function in R, assuming that our binary output variable is distributed according to a binomial distribution with parameter p. All available variables will be included into the model. Run the glm function by specifying the binomial distribution as family:
lrfit <- glm(tfbs ~ ., data=data,family="binomial")
summary(lrfit)
Stop and think
Check the output of thesummaryfunction
- what are the most relevant variables ?
- what does the AIC represent ?
Evaluate the performance
Now that we have a model, we can apply it to test how good it performs, by using the model as a predictor :
pred = predict(lrfit,newdata=data,type="response")
This yields the predicted probabilities.
PRROC package, compute the ROC and PR area under the curve (AUC). Plot the ROC and PR curves. Did we improve?Stop and think hard!
Comment on the value of the accuracy. What would be the value of the accuracy if you would predict all sites to be negative?
Of course we have cheated here: we are evaluating the performance of the model on the data that was used to train the model, thus biasing the evaluation…. The proper way to evaluate the performance of a machine-learning model is using a cross-validation : splitting the data into a training and testing set, learning the model on the training set and applying it to the test set, thus evaluating the performance on an independent dataset.
Creating a partition
We will create a training dataset with 75% of the data, and keep the remaining 25% for the testing dataset; we need to make sure to sample 75% of the positive and the negative samples!
i.pos = which(data$tfbs==1)
i.pos.train = sample(i.pos,floor(0.75*length(i.pos)))
i.neg = which(data$tfbs==0)
i.neg.train = sample(i.neg,floor(0.75*length(i.neg)))
i.train = c(i.pos.train,i.neg.train)
train.set = data[i.train,]
test.set = data[-i.train,]
We have created a balanced dataset, meaning that we have sampled proportionaly from each class. To do this, you can also use the caret package in R, which has a function createDataPartition which allows to do this. Check this web page for explanations on this function!
Train the model
Run the glm function to fit the model, while evaluating the performance using the cross-validations
lrFit = glm(tfbs~ .,data=train.set,family="binomial")
Applying the model
Remember that we had defined a training and a test set? Now that we have an optimal model, we can apply it to the left-out test dataset…
pred <- predict(lrFit,newdata=test.set,type="response")
Did we improve compared to the sequence-only performance?
We have now a good model, which includes all variables. We can now try to simplify the model by reducing it, removing “uninteresting” variables that do not contribute to the performance of the model …
Subset selection
We can apply a built in function step, which recursively eliminates variables
lrfit <- glm(tfbs ~ ., data=train.set,family="binomial")
lr.red <- step(lrfit)
Stop and think
- What is happening here? What criteria is used to eliminate variables?
- Check the final model stored in lr.red: what variables does it contain? Is this expected?
- Apply this reduced model to the left out
test.set- Did the performance in terms of precision or recall get worse?
The glmnet package implements ridge regression and LASSO shrinkage methods. If it is not installed, install the package with
install.packages('glmnet')
Check the tutorial for this package!
Ridge regression
Start by performing a ridge regression using the following code:
library(glmnet)
var = data[,-16]
out=as.integer(data[,16])
fit.ridge = glmnet(data.matrix(var),out,alpha=0,family="binomial")
Now you can graphically plot the coefficients, as the regularization parameter lambda is varied:
plot(fit.ridge,label=TRUE,lwd=2)
The number on the lines corespond to the column number of the corresponding variable. Can you interpret the plot? Is it consistent with the previous stepwise variable selection?
LASSO
LASSO implements a different regularization compared to ridge regression (i.e. L1 regularization), which has the effect to set a number of variables to zero, hence leading to a selection of relevant variables:
library(glmnet)
var = data[,-16]
out=as.integer(data[,16])
fit.lasso = glmnet(data.matrix(var),out,alpha=1,family="binomial")
Now you can graphically plot the coefficients, as the regularization parameter lambda is varied:
plot(fit.lasso,label=TRUE,lwd=2)
Is the selection of variables consistent with the backward procedure?
There are many other machine-learning approaches for classification tasks. One popular methd is Random forest, which is an ensemble method based on classification trees. You can install the R packages randomForest
install.packages('randomForest')
Filter the initial dataset using an appropriate cutoff on the LLR for the motif, the run the random forest:
library(randomForest)
rf = randomForest(tfbs ~., data = filtered_data, ntree=500)
print(rf)
Stop and think
- what is the number of variables at each split?
- What is the OOB error?
- What is the class error?
You might notice some weird results in the class error. Why is that? Create a dataset which is more balanced between positives and negatives, and run the random forest again!
Using randomForest, we can evaluate the importance of each variable for the classification. This is estimated using the Gini purity index
importance(rf)
varImpPlot(rf)
Machine-learning for TFBS Ernst, J., Plasterer, H. L., Simon, I., & Bar-Joseph, Z. (2010). Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Research, 20(4), 526–36
James, Witten, Hastie, Tibshirani (2013) An introduction to statistical learning
James, Witten, Hastie, Tibshirani (2017) The Elements of Statistical Learning