library(ggplot2)
library(reshape2)
library(Ckmeans.1d.dp)
library(corrplot)
library(bnlearn)
library(igraph)
library(Rgraphviz)
We have generated a dataset with 3 random variables A, B, C; using constraint-based and score-based approaches, we try to infer the structure of the network.
Using the ci.test function with the test="mi" (mutual information) option, test the marginal and conditional independencies between the three variables.
In each case, the null hypothesis is: “The two variables are (conditionally) independent”
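A minimal sketch of these tests, assuming the simulated data frame is called abc (a hypothetical name) with factor columns A, B and C:
## marginal independence tests on the hypothetical data frame 'abc'
ci.test("A", "B", data = abc, test = "mi")
ci.test("A", "C", data = abc, test = "mi")
ci.test("B", "C", data = abc, test = "mi")
## conditional independence tests, conditioning on the remaining variable
ci.test("A", "B", "C", data = abc, test = "mi")
ci.test("A", "C", "B", data = abc, test = "mi")
ci.test("B", "C", "A", data = abc, test = "mi")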
Can you determine the possible architecture(s) of the BN?
Using constraint-based approaches, we cannot fully resolve the structure of the network. IAMB (Incremental Association Markov Blanket) is one of the constraint-based algorithms implemented in bnlearn (see ?iamb).
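A sketch of running IAMB on the same hypothetical data frame abc:
## constraint-based structure learning with IAMB and the mutual information test
pdag = iamb(abc, test = "mi")
pdag
graphviz.plot(pdag)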
We can now try to orient the edges manually and compute the score (e.g. the AIC score) of the oriented network.
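For instance, a sketch assuming the partially directed graph from IAMB is stored in pdag and that we orient both edges towards C (this orientation is an arbitrary choice for illustration):
## orient the undirected edges manually
dag = set.arc(pdag, from = "A", to = "C")
dag = set.arc(dag, from = "B", to = "C")
## score of the oriented network (AIC, for discrete data)
score(dag, data = abc, type = "aic")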
We can determine the parameters (i.e. the conditional probabilities) of the network:
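For example, still assuming the oriented network is stored in dag and the data in abc:
## estimate the conditional probability tables by maximum likelihood
fit = bn.fit(dag, data = abc)
fit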
Test the other possible structures of the network; can you find the most likely one(s)?
Topics:
This dataset consists of single-cell cytometry data on 11 proteins in a primary T-cell signaling network, collected under normal conditions (“observational”) or perturbed conditions (“interventional”). http://www.sciencemag.org/cgi/doi/10.1126/science.1105809
We can plot the distribution of the data for each component of the pathway that was investigated.
Let’s load the data:
sachs = read.table('https://www.dropbox.com/s/fcq3cqxheklr2it/sachs.data.txt?dl=1',header=TRUE)
head(sachs)
tmp = melt(sachs)
p = ggplot(tmp,aes(x=value, fill=variable, color=variable)) +
geom_density(alpha=0.30,size=0.7) + theme_bw() +
facet_wrap(~variable, scales = "free") + labs(x="log2(value)")+
theme(legend.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
rm(tmp)
p
We discretize the data, since it obviously does not follow a normal distribution, as the previous plots show!
Again, several strategies are possible: the Hartemink method tries to maximize the mutual information between the original non-discretized data and the discretized data. The user must provide the final number of breaks (i.e. classes); the ibreaks parameter specifies the number of initial classes, which are then merged to obtain the final number of classes.
## discretize the data
dsachs = discretize(sachs,method='hartemink',breaks = 3, ibreaks = 60, idisc = 'quantile')
head(dsachs)
Let’s look at the data distribution:
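A possible sketch, reusing the ggplot2 pattern from above on the discretized data:
tmp = melt(dsachs, id.vars = c())
p = ggplot(tmp, aes(x = value, fill = variable, color = variable)) +
  geom_bar() + theme_bw() +
  facet_wrap(~variable, scales = "free") + labs(x = "Discretized level") +
  theme(legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
rm(tmp)
p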
Here, we apply a bootstrapping strategy combined with score-based structure learning (the hill-climbing method hc), using 500 bootstrap replicates.
boot = boot.strength(data=dsachs, R=500,
algorithm='hc',
algorithm.args = list(score='bde')
)
head(boot)
For every edge, we have the strength (i.e. the frequency with which this edge was observed) and the direction (i.e. the frequency with which it was observed in one direction or the other).
We can now apply a filtering on the edges according to the strength and direction of the edge:
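For example (the 0.85 and 0.5 thresholds are arbitrary choices for this sketch):
## keep edges present in at least 85% of the bootstrapped networks,
## with a consistent direction in at least half of them
head(boot[boot$strength > 0.85 & boot$direction >= 0.5, ])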
We have learned 500 networks, one from each bootstrap resample of the data; we can now average these networks.
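A sketch of the averaging step (again, the 0.85 threshold is an arbitrary choice):
avg.boot = averaged.network(boot, threshold = 0.85)
graphviz.plot(avg.boot)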
So far the network was built using observational data, i.e. data collected in normal conditions. The dataset however contains additional interventional data, resulting from perturbations of specific nodes in the network. These data can either be used for validation (do the perturbations have the effect predicted by the initial network?) or to strengthen the evidence on the learned edges.
isachs = read.table('https://www.dropbox.com/s/xegmoa1e04opc0t/sachs.interventional.txt?dl=1', header = TRUE, colClasses = "factor")
head(isachs)
Here, the last column indicates on which variable (i.e. column) the intervention has been performed.
INT = sapply(1:11, function(x) { which(isachs$INT == x) })
isachs2 = isachs[, 1:11]
nodes = names(isachs2)
names(INT) = nodes
## we start from a set of 200 random graphs
start = random.graph(nodes = nodes,
method = "melancon", # algorithm to generate random graphs
num = 200, # number of initial random graphs
burn.in = 10^5,
every = 100)
netlist = lapply(start, function(net) {
tabu(isachs2,
score = "mbde", # method to score the network
exp = INT, # information of the interventions
iss = 1,
start = net, # starting point (i.e. random graph)
tabu = 50) }
)
arcs = custom.strength(netlist, nodes = nodes, cpdag = FALSE)
bn.mbde = averaged.network(arcs, threshold = 0.85)
net = bn.mbde$arcs
net = graph_from_edgelist(net,directed=T)
graphviz.plot(bn.mbde)
Compare the network obtained here with the one learned above without the interventional data.
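One possible sketch, assuming the observational averaged network from the previous section is stored in avg.boot and that the node names match between the two datasets:
## side-by-side comparison of the two structures
graphviz.compare(avg.boot, bn.mbde)
## true positive / false positive / false negative arcs,
## taking the interventional network as the reference
compare(bn.mbde, avg.boot)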
Compare the networks with the real pathway structure!
If you have time, you can go to the next example!
Topics:
Here, we want to build a chromatin network where nodes are epigenetic components, and the observations are the state of these epigenetic modifications at the promoters of genes.
dat = read.table('https://www.dropbox.com/s/npb177z5ceod4fw/CEMT_26-nonCGI-allDataContinuous.txt?dl=1',
header = TRUE, stringsAsFactors = FALSE, sep="\t")
head(dat)
dim(dat)
dat[,3:ncol(dat)] = round(dat[,3:ncol(dat)]/dat$Input,3)
dat = dat[,-ncol(dat)]
corVals = round(cor(dat, method="spearman"),2)
summary(dat)
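The corrplot package loaded at the top can be used to display this correlation matrix, for instance:
corrplot(corVals, method = "color", type = "upper", tl.col = "black")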
log2dat = log2(dat+0.1)
tmp = melt(log2dat)
p = ggplot(tmp,aes(x=value, fill=variable, color=variable)) +
geom_density(alpha=0.30,size=0.7) + theme_bw() +
facet_wrap(~variable, scales = "free") + labs(x="log2(value)")+
theme(legend.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
rm(tmp)
p
Any comment about these distributions?
For details regarding Ckmeans.1d.dp, see https://cran.r-project.org/web/packages/Ckmeans.1d.dp/vignettes/Ckmeans.1d.dp.html. The method estimates an optimal number of clusters if no information is given.
Let’s check the discrete states for H3K4me3 as an example:
# Cluster H3K4me3 into at most 9 states and look at the predictions
res = Ckmeans.1d.dp(x = log2dat$H3K4me3, k = c(1,9))
plot(log2dat$H3K4me3, col= res$cluster, cex=0.5, pch=20, xlab="Genes", ylab= "H3K4me3 signal")
#abline(h=res$centers, lwd=2,lty=2,col="blue")
rm(res)
# Compute optimal states for all
states = apply(log2dat, 2, function(y){max(Ckmeans.1d.dp(x=y, k=c(1,9))$cluster)})
For methylation, many states are predicted; from the distribution plots it is clear that these are mostly intermediate states with little biological interpretation:
res = Ckmeans.1d.dp(log2dat$CPGfrac, k=states[1])
plot(log2dat$CPGfrac, col= res$cluster, cex=0.5, pch=20, xlab="Genes", ylab= "CPGfrac")
So let’s approximate by taking 3 states
res = Ckmeans.1d.dp(log2dat$CPGfrac, k=3)
plot(log2dat$CPGfrac, col= res$cluster, cex=0.5, pch=20, xlab="Genes", ylab= "CPGfrac")
We accept all the numbers of states predicted by Ckmeans.1d.dp, except for methylation (which we set to 3 states).
states[1] = 3
states
# Creating the discretized version of the data
disDat = matrix(ncol=ncol(log2dat), nrow=nrow(log2dat))
colnames(disDat) = colnames(log2dat)
rownames(disDat) = rownames(log2dat)
disDat = as.data.frame(disDat)
for (i in 1:length(states))
{
tmp = suppressWarnings(Ckmeans.1d.dp(log2dat[,i], k=c(1,states[i])))
disDat[,i] = factor(tmp$cluster)
rm(tmp)
}
# Final view of the discretized data
summary(disDat)
Plot the discretized data:
# Plot the discretized data
tmp=melt(disDat,id.vars=c())
p = ggplot(tmp,aes(x=value, fill=variable, color=variable)) +
geom_bar() + theme_bw() +
facet_wrap(~variable, scales = "free") + labs(x="Categories")+
theme(legend.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
p
We use a blacklist of edges: no edge should go out of the expression node (RPKM).
# Black list .. no interaction should originate from gene expression
bl=data.frame(from="RPKM",to=colnames(disDat))
bl=bl[-which(bl$to == "RPKM"),]
Now learn the network using a bootstrapping strategy:
# Learn bayesian network via bootstrapping
strength = boot.strength(disDat,
algorithm="tabu",
algorithm.args=list(score="aic", tabu=10, blacklist=bl),
R=1000) # takes about 2mins
See the first edges:
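For instance:
## first rows of the edge table: from, to, strength and direction
head(strength)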
# Select only the interactions meeting our filtering criteria
selarcs = strength[strength$direction > 0.5 ,]
selarcs = selarcs[order(selarcs$strength,decreasing=TRUE),]
selarcs
# Get an averaged network
dag.average.filt = averaged.network(selarcs, threshold = 0.90)
knitr::kable(dag.average.filt$arcs, caption = "Interactions meeting our threshold for strength and direction")
# Compute correlations for the selected interactions
net = dag.average.filt$arcs
corVals = apply(net,1, function(x){
a = log2dat[,x[1]]
b = log2dat[,x[2]]
c = round(cor(a,b,method="spearman"),2)
return(c)
})
# Coloring the correlation values
col = rep("forestgreen",nrow(net))
col[which(corVals < 0)] = "firebrick"
# Setting up the network in igraph
net = graph_from_edgelist(net,directed=T)
E(net)$weight = corVals
V(net)$label.color="black"
rm(corVals)
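The igraph object can then be drawn, colouring each edge by the sign of its correlation (a sketch; the layout and styling choices are arbitrary):
# Plot the network, edges coloured by correlation sign (green = positive, red = negative)
set.seed(1)
plot(net,
     edge.color = col,
     edge.arrow.size = 0.5,
     vertex.color = "lightblue",
     vertex.size = 25)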
Now that the network is defined, we can learn the parameters of the edges:
# Fitting the learnt network to the data to obtain
# conditional probability tables
fitted = bn.fit(dag.average.filt, disDat)
With the structure of the network and the parameters learned, we can start simulating interventions, using the cpdist function. For example, we will predict the gene expression (variable RPKM) if we set DNA methylation (variable CPGfrac) to high or low.
# Predict Expression level by setting DNAme to high/low
x.high = as.numeric(table(cpdist(fitted,nodes=c('RPKM'),CPGfrac=='3')))
x.low = as.numeric(table(cpdist(fitted,nodes=c('RPKM'),CPGfrac=='1')))
pred = data.frame(prop=c(x.high/sum(x.high),x.low/sum(x.low)),
condition=c(rep('DNAme high',length(x.high)),
rep('DNAme low',length(x.low))),
Expression=c(1:length(x.high),1:length(x.low))
)
ggplot(pred,aes(x=Expression,fill=condition,y=prop)) + geom_bar(stat='identity',position=position_dodge())
Can you interpret the two distributions?
Same experiment, this time setting H3K27me3 to low/high and looking at the effect on expression:
# Predict Expression level by setting H3K27me3 to high/low
x.high = as.numeric(table(cpdist(fitted,nodes=c('RPKM'),H3K27me3=='3')))
x.low = as.numeric(table(cpdist(fitted,nodes=c('RPKM'),H3K27me3=='1')))
pred = data.frame(prop=c(x.high/sum(x.high),x.low/sum(x.low)),
condition=c(rep('K27me3 high',length(x.high)),rep('K27me3 low',length(x.low))),
Expression=c(1:length(x.high),1:length(x.low))
)
ggplot(pred,aes(x=Expression,fill=condition,y=prop)) + geom_bar(stat='identity',position=position_dodge())