24  Worked example: predicting gene expression from histone marks

Published

June 1, 2024

Modified

June 2, 2026

A gene’s DNA sequence doesn’t change from one cell type to the next, yet the same gene can be switched loudly on in one tissue and silent in another. Much of that control is written in histone modifications — chemical marks on the proteins that DNA wraps around, concentrated near each gene’s promoter. This example asks a question at the heart of gene regulation: can the pattern of histone modifications near a gene predict that gene’s expression level? (Figure 24.1)

This is the third of three worked examples, and like the methylation age example it is a regression problem — the target is a continuous expression level. We use the same mlr3 framework throughout; for the four building blocks see The mlr3verse, and for the general train/predict/assess workflow see the cancer example. Here the new idea is penalized regression with cross-validation, and a peek under the hood at how R² is actually computed.

This chapter views the same three ideas through the lens of regularization:

NoteFrom a manual filter to an automatic penalty

The cancer chapter selected features by hand (keep the most variable genes); this chapter lets the model do it. A lasso penalty is, in effect, automatic feature selection baked into the fit — and cross-validation, run only on the training data, is how we pick how aggressive that penalty should be. The held-out test set is never consulted while we tune; it is saved for the single honest score at the end.

24.1 What you’ll learn

  • Center and scale features, and visualize them with boxplots and a heatmap.
  • Build a regression task and split it with mlr3’s partition() helper.
  • Fit penalized regression (ridge) with regr.cv_glmnet, and explain how cross-validation within the training set chooses the penalty strength λ.
  • Read a penalized model’s coefficients, contrast ridge with lasso, and see how the lasso penalty performs automatic feature selection — the job you did by hand in the cancer chapter.
  • Compute R² by hand and confirm it matches the regr.rsq measure.

24.2 Setup

library(mlr3verse)
library(mlr3learners) # for cv_glmnet
library(mlr3viz)      # for plotting predictions
library(ComplexHeatmap)
set.seed(789)

\[ y \sim h_1 + h_2 + h_3 + \ldots \tag{24.1}\]

Figure 24.1: What is the combined effect of histone marks on gene expression?

The data summarize the histone marks within each gene’s promoter (here, the region 2 kb upstream of the transcription start site), paired with that gene’s measured expression. The target is the expression level; the features are the histone-mark summaries. As Equation 24.1 sketches, we are modelling expression \(y\) as a combination of histone marks \(h_1, h_2, \ldots\). We’ll try two approaches: penalized regression and a random forest.

24.3 The data

The data were developed by Anshul Kundaje. We won’t dwell on the collection details; we’ll focus on the modelling.

feature_set <- read.table("http://seandavi.github.io/ITR/expression-prediction/features.txt")

What are the predictor columns?

colnames(feature_set)
 [1] "Control"  "Dnase"    "H2az"     "H3k27ac"  "H3k27me3" "H3k36me3"
 [7] "H3k4me1"  "H3k4me2"  "H3k4me3"  "H3k79me2" "H3k9ac"   "H3k9me1" 
[13] "H3k9me3"  "H4k20me1"

These histone-mark summaries are our predictors. Some learners assume the predictors are on a similar scale, so we center and scale each column (subtract its mean, divide by its standard deviation) by converting to a matrix and using scale().

par(mfrow = c(1, 2))
scaled_features <- scale(as.matrix(feature_set))
boxplot(feature_set, main = 'Original data')
boxplot(scaled_features, main = 'Centered and scaled data')
Figure 24.2: Boxplots of the original and the centered-and-scaled features. After scaling, every feature sits on a comparable scale.

After scaling, every feature is centered near zero with comparable spread, as the right-hand boxplot shows. We can also spot structure across genes with a heatmap (Figure 24.3).

set.seed(789)
sampled_features <- scaled_features[sample(nrow(scaled_features), 500), ]
Heatmap(sampled_features, name = 'histone marks', show_row_names = FALSE)
Figure 24.3: Heatmap of 500 randomly sampled genes (rows) across the histone marks (columns). Blocks of similar color reveal groups of genes with similar mark patterns.

Now we read in the matching gene-expression values — our prediction target — and combine them with the scaled features into one data frame.

target <- scan(url("http://seandavi.github.io/ITR/expression-prediction/target.txt"), skip = 1)
# combine target and features into one data frame
exp_pred_data <- data.frame(gene_expression = target, scaled_features)

The first few rows:

head(exp_pred_data, 3)
                            gene_expression    Control      Dnase       H2az
ENSG00000000419.7.49575069         6.082343  0.7452926  0.7575546  1.0728432
ENSG00000000457.8.169863093        2.989145  1.9509786  1.0216546  0.3702787
ENSG00000000938.7.27961645        -5.058894 -0.3505542 -1.4482958 -1.0390775
                               H3k27ac   H3k27me3   H3k36me3    H3k4me1
ENSG00000000419.7.49575069   1.0950440 -0.5125312  1.1334793  0.4127984
ENSG00000000457.8.169863093  0.7142157 -0.4079244  0.8739005  1.1649282
ENSG00000000938.7.27961645  -1.0173283  1.4117293 -0.5157582 -0.5017450
                               H3k4me2    H3k4me3   H3k79me2     H3k9ac
ENSG00000000419.7.49575069   1.2136176  1.1202901  1.5155803  1.2468256
ENSG00000000457.8.169863093  0.6456572  0.6508561  0.7976487  0.5792891
ENSG00000000938.7.27961645  -0.1878255 -0.6560973 -1.3803974 -1.0067972
                              H3k9me1   H3k9me3   H4k20me1
ENSG00000000419.7.49575069  0.1426980  1.185622  1.9599992
ENSG00000000457.8.169863093 0.3630902  1.014923 -0.2695111
ENSG00000000938.7.27961645  0.6564520 -1.370871 -1.8773178

24.4 Create the task and split

The target is a number, so this is a regression task. This time we use mlr3’s own partition() helper to make the train/test split, which by default holds out one-third for testing.

set.seed(7)
exp_pred_task <- as_task_regr(exp_pred_data, target = 'gene_expression')
split <- partition(exp_pred_task)

24.5 Linear regression

As a baseline, fit an ordinary linear regression.

learner <- lrn("regr.lm")

24.5.1 Train and predict

learner$train(exp_pred_task, split$train)
pred_train <- learner$predict(exp_pred_task, split$train)
pred_test <- learner$predict(exp_pred_task, split$test)

24.5.2 Assess

pred_train

── <PredictionRegr> for 5789 observations: ─────────────────────────────────────
 row_ids     truth  response
       2  2.989145  2.996674
       3 -5.058894 -5.364407
       5  5.558023  4.105793
     ---       ---       ---
    8639 -5.058894 -5.238131
    8640 -3.114148 -4.756313
    8641  5.731075  4.198905
plot(pred_train)

The training R²:

measures <- msrs(c('regr.rsq'))
pred_train$score(measures)
 regr.rsq 
0.7507266 

And the test R² — the number that matters:

pred_test$score(measures)
 regr.rsq 
0.7505519 
plot(pred_test)

A moderate R² here is a real biological result: histone marks in the promoter carry genuine, but partial, information about how strongly a gene is expressed.

24.6 Penalized regression

When a model has many predictors, it can become too flexible and start fitting noise — the problem of overfitting. Penalized regression combats this by adding a penalty for complexity, nudging the model toward simpler, more general solutions.

NoteThree flavours of penalty
  • Ridge (L2): shrinks all coefficients toward zero, like a rubber band pulling them in. It keeps every feature but tames large coefficients.
  • Lasso (L1): can shrink some coefficients all the way to zero, like scissors that cut the least useful features out of the model entirely. This doubles as automatic feature selection.
  • Elastic net: a blend of the two — rubber band and scissors — combining shrinkage with feature elimination.

In mlr3, the alpha setting picks the flavour: alpha = 0 is ridge, alpha = 1 is lasso, and values in between are elastic net.

We’ll use the regr.cv_glmnet learner, which automatically uses cross-validation to choose the penalty strength.

NoteWhat is cross-validation?

Cross-validation estimates how a model will perform on unseen data by splitting the training data into k equal folds. The model trains on k − 1 folds and is validated on the held-out fold, repeating until each fold has served as the validation set once. Averaging the k scores gives a stable performance estimate, and cv_glmnet uses it to pick the best penalty (lambda). The nfolds setting controls k; 5 or 10 are common choices.

With alpha = 0 we fit a ridge model with 10-fold cross-validation:

set.seed(789)
learner <- lrn("regr.cv_glmnet", nfolds = 10, alpha = 0)

24.6.1 Train

learner$train(exp_pred_task, split$train)

For a penalized model we can inspect the coefficients directly:

coef(learner$model)
15 x 1 sparse Matrix of class "dgCMatrix"
             lambda.1se
(Intercept)  0.08832251
Control     -0.05020310
Dnase        0.85954551
H2az         0.31828260
H3k27ac      0.26566165
H3k27me3    -0.27170626
H3k36me3     0.65845540
H3k4me1     -0.02832563
H3k4me2      0.17902035
H3k4me3      0.37064260
H3k79me2     0.86371398
H3k9ac       0.48125738
H3k9me1     -0.05440077
H3k9me3     -0.16204169
H4k20me1     0.13611422

Each coefficient tells us how strongly its histone mark pushes predicted expression up or down. With lasso (alpha = 1) many of these would be exactly zero — the model would have selected a subset of marks for us, which is useful not just for prediction but for understanding which marks matter.

24.6.2 Predict and assess

pred_train <- learner$predict(exp_pred_task, split$train)
pred_test <- learner$predict(exp_pred_task, split$test)
pred_train

── <PredictionRegr> for 5789 observations: ─────────────────────────────────────
 row_ids     truth  response
       2  2.989145  2.933043
       3 -5.058894 -4.471610
       5  5.558023  4.221675
     ---       ---       ---
    8639 -5.058894 -5.581904
    8640 -3.114148 -4.105914
    8641  5.731075  4.427548
plot(pred_train)

The training R²:

measures <- msrs(c('regr.rsq'))
pred_train$score(measures)
regr.rsq 
0.740439 

And the test R²:

pred_test$score(measures)
 regr.rsq 
0.7405317 
plot(pred_test)

We can also compute the test R² by hand, to demystify the measure: it is one minus the ratio of the residual sum of squares to the total sum of squares.

# calculate the R-squared value by hand
truth <- pred_test$truth
predicted <- pred_test$response
rss <- sum((truth - predicted)^2)          # residual sum of squares
tss <- sum((truth - mean(truth))^2)        # total sum of squares
r_squared <- 1 - (rss / tss)
r_squared
[1] 0.7405317

This matches the regr.rsq measure above — reassuring confirmation that you understand exactly what R² is counting.

24.7 Summary

You have now used machine learning to probe a question in gene regulation — whether promoter histone marks predict expression — through the same mlr3 workflow as the previous two chapters:

  • You scaled the histone-mark features, visualized them with boxplots and a heatmap, and built a regression task split with partition().
  • A baseline linear regression gave a moderate R² — a real, if partial, biological signal that marks carry information about expression.
  • Penalized regression (ridge via regr.cv_glmnet) used cross-validation to choose its penalty and fight overfitting; with lasso it would also perform feature selection, turning the model into a tool for understanding which marks matter.
  • Computing R² by hand confirmed exactly what the regr.rsq measure counts.

Across all three worked examples the same disciplines recurred: split the data, select features thoughtfully, train then predict, and judge every model by its test-set score. That last habit — never trusting the training score — is the single most important thing to carry forward.

24.8 Exercises

Each exercise reuses objects you already built, so the solutions are quick to run.

  1. Change the number of CV folds. The ridge model used nfolds = 10. Refit it with nfolds = 5 and see whether the test R² changes much.

    ridge5 <- lrn("regr.cv_glmnet", nfolds = 5, alpha = 0)
    ridge5$train(exp_pred_task, split$train)
    ridge5$predict(exp_pred_task, split$test)$score(msrs(c('regr.rsq')))

    Fewer folds means each training fold is a little smaller and the chosen penalty is a little noisier, but the test R² usually shifts only slightly. The penalty selection is fairly robust to the exact number of folds.

  2. Try lasso instead of ridge. We used alpha = 0 (ridge). Refit with alpha = 1 (lasso) and inspect the coefficients. How many histone marks does lasso drop to exactly zero?

    lasso <- lrn("regr.cv_glmnet", nfolds = 10, alpha = 1)
    lasso$train(exp_pred_task, split$train)
    coef(lasso$model)               # zeros are the dropped marks
    lasso$predict(exp_pred_task, split$test)$score(msrs(c('regr.rsq')))

    Lasso sets the coefficients of the least useful marks to exactly zero, so the non-zero entries are the marks it “selected.” This is feature selection built into the model, and it makes the result easier to interpret biologically.

  3. Add a random forest. The chapter opening promised a random forest too. Fit regr.ranger on the same task and split, and compare its test R² to ridge.

    rf <- lrn("regr.ranger")
    rf$train(exp_pred_task, split$train)
    rf$predict(exp_pred_task, split$test)$score(msrs(c('regr.rsq')))

    Swapping the learner is a one-line change to lrn(). The forest can capture non-linear interactions between marks that a linear model cannot, so it sometimes edges out ridge — but, as always, the honest comparison is the held-out test R².