23  Worked example: predicting age from DNA methylation

Published

June 1, 2024

Modified

June 2, 2026

How old is a tissue sample? Not the donor’s age on a birthday cake, but the biological age written into the DNA itself. It turns out you can read it off the DNA-methylation pattern: at certain sites the methylation level drifts steadily with age, and a model trained on those sites becomes an epigenetic clock. This example builds one, following the work of Naue et al. (2017).

This is the second of three worked examples. Unlike the cancer-classification example, where the target was a category, here the target — a person’s age — is a number, so this is a regression problem. We use the same mlr3 framework throughout; if you need the refresher on tasks, learners, measures, and the train/test split, see The mlr3verse and the cancer example for the general workflow. Here we keep our eyes on the biology and on how regression differs from classification.

The three core ideas from the cancer chapter all return, but each takes on a new shade once the goal is an epigenetic clock rather than a cancer label:

WarningMore features than samples: the pn problem

When the number of features p dwarfs the number of samples n, a flexible model can find some combination of CpG sites that fits the training ages perfectly by chance alone — pure memorization dressed up as signal. This is overfitting in its most dangerous form, because the training score looks fantastic. The defences are the same ones you already know: limit the features (keep only the CpG sites most associated with age), prefer simpler models, and always judge on the held-out test set. The next chapter shows a second defence — letting the model shrink its own coefficients.

23.1 What you’ll learn

  • Build a regression task from a real methylation dataset with as_task_regr().
  • Fit and compare three regression learners — linear regression, a regression tree, and a random forest — through the same mlr3 calls.
  • Read regression diagnostic plots and a truth-versus-prediction scatter.
  • Interpret as the fraction of variation explained (and RMSE as the typical error in years), and spot overfitting when the test R² falls below the training R².
  • Explain why having far more CpG features than people (the pn problem) makes feature selection and an honest test set non-negotiable for a biomarker.

23.2 Setup

23.3 The data

The biology behind it: within each individual, the level of DNA methylation at certain sites changes steadily with age. The CpG sites whose methylation correlates most strongly with age make excellent biomarkers — and excellent features for a regression model. The data come from this Galaxy tutorial.

meth_age <- rbind(
    fread('https://zenodo.org/record/2545213/files/test_rows_labels.csv'),
    fread('https://zenodo.org/record/2545213/files/train_rows.csv')
)

A quick look at the data:

head(meth_age)
   RPA2_3 ZYG11A_4  F5_2 HOXC4_1 NKIRAS2_2 MEIS1_1 SAMD10_2 GRM2_9 TRIM59_5
    <num>    <num> <num>   <num>     <num>   <num>    <num>  <num>    <num>
1:  65.96    18.08 41.57   55.46     30.69   63.42    40.86  68.88    44.32
2:  66.83    20.27 40.55   49.67     29.53   30.47    37.73  53.30    50.09
3:  50.30    11.74 40.17   33.85     23.39   58.83    38.84  35.08    35.90
4:  65.54    15.56 33.56   36.79     20.23   56.39    41.75  50.37    41.46
5:  59.01    14.38 41.95   30.30     24.99   54.40    37.38  30.35    31.28
6:  81.30    14.68 35.91   50.20     26.57   32.37    32.30  55.19    42.21
   LDB2_3 ELOVL2_6 DDO_1 KLF14_2   Age
    <num>    <num> <num>   <num> <int>
1:  56.17    62.29 40.99    2.30    40
2:  58.40    61.10 49.73    1.07    44
3:  58.81    50.38 63.03    0.95    28
4:  58.05    50.58 62.13    1.99    37
5:  65.80    48.74 41.88    0.90    24
6:  70.15    61.36 33.62    1.87    43

Each row is a person, each cg… column is the methylation level at one CpG site, and Age is what we want to predict. Because the target is a number, this is a regression task — so we use as_task_regr().

task <- as_task_regr(meth_age, target = 'Age')

We split into training and test sets exactly as in the cancer example, seeding for reproducibility.

set.seed(7)
train_set <- sample(task$row_ids, 0.67 * task$nrow)
test_set <- setdiff(task$row_ids, train_set)

23.4 Linear regression

We start with the simplest regression model: a straight-line fit relating methylation to age.

learner <- lrn("regr.lm")

23.4.1 Train

learner$train(task, row_ids = train_set)

After fitting a linear model, the standard diagnostic plots (Figure 23.1) help check whether the model’s assumptions hold.

par(mfrow = c(2, 2))
plot(learner$model)
Figure 23.1: Regression diagnostic plots. Top left: residuals vs. fitted values (look for no pattern). Top right: normal Q-Q plot (points should follow the line). Bottom left: scale-location. Bottom right: residuals vs. leverage.

23.4.2 Predict

pred_train <- learner$predict(task, row_ids = train_set)
pred_test <- learner$predict(task, row_ids = test_set)

23.4.3 Assess

Printing a regression prediction shows the predicted (response) and true (truth) values side by side.

pred_train

── <PredictionRegr> for 209 observations: ──────────────────────────────────────
 row_ids truth response
     298    29 31.40565
     103    58 56.26019
     194    53 48.96480
     ---   ---      ---
     312    48 52.61195
     246    66 67.66312
     238    38 39.38414

A scatter plot of truth versus prediction makes the fit visible: a perfect model would put every point on the diagonal.

ggplot(pred_train, aes(x = truth, y = response)) +
    geom_point() +
    geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'

We summarize the fit with (the fraction of variation in age the model explains; 1.0 is perfect, 0 is no better than guessing the mean).

measures <- msrs(c('regr.rsq'))
pred_train$score(measures)
 regr.rsq 
0.9376672 
pred_test

── <PredictionRegr> for 103 observations: ──────────────────────────────────────
 row_ids truth response
       4    37 37.64301
       5    24 28.34777
       7    34 33.22419
     ---   ---      ---
     306    42 41.65864
     307    63 58.68486
     309    68 70.41987
pred_test$score(measures)
 regr.rsq 
0.9363526 

The training R² is high, but the test R² is the honest measure of how well the clock would predict the age of a new person. A drop from training to test is, again, the signature of overfitting.

23.5 Regression tree

We can use a tree for regression too. Instead of voting on a class, each leaf of the tree reports a predicted number.

learner <- lrn("regr.rpart", keep_model = TRUE)
learner$train(task, row_ids = train_set)
learner$model
n= 209 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 209 45441.4500 43.27273  
   2) ELOVL2_6< 56.675 98  5512.1220 30.24490  
     4) ELOVL2_6< 47.24 47   866.4255 24.23404  
       8) GRM2_9< 31.3 34   289.0588 22.29412 *
       9) GRM2_9>=31.3 13   114.7692 29.30769 *
     5) ELOVL2_6>=47.24 51  1382.6270 35.78431  
      10) F5_2>=39.295 35   473.1429 33.28571 *
      11) F5_2< 39.295 16   213.0000 41.25000 *
   3) ELOVL2_6>=56.675 111  8611.3690 54.77477  
     6) ELOVL2_6< 65.365 63  3101.2700 49.41270  
      12) KLF14_2< 3.415 37  1059.0270 46.16216 *
      13) KLF14_2>=3.415 26  1094.9620 54.03846 *
     7) ELOVL2_6>=65.365 48  1321.3120 61.81250 *

Notice something odd: a single regression tree can only output a handful of discrete age values — one per leaf. It cannot predict, say, 43.7 if no leaf sits there.

23.5.1 Predict

pred_train <- learner$predict(task, row_ids = train_set)
pred_test <- learner$predict(task, row_ids = test_set)

23.5.2 Assess

pred_train

── <PredictionRegr> for 209 observations: ──────────────────────────────────────
 row_ids truth response
     298    29 33.28571
     103    58 61.81250
     194    53 46.16216
     ---   ---      ---
     312    48 54.03846
     246    66 61.81250
     238    38 41.25000

The discreteness is obvious in the scatter plot: predictions stack up into horizontal bands, one per leaf.

ggplot(pred_train, aes(x = truth, y = response)) +
    geom_point() +
    geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'

measures <- msrs(c('regr.rsq'))
pred_train$score(measures)
 regr.rsq 
0.8995351 
pred_test

── <PredictionRegr> for 103 observations: ──────────────────────────────────────
 row_ids truth response
       4    37 41.25000
       5    24 33.28571
       7    34 33.28571
     ---   ---      ---
     306    42 46.16216
     307    63 61.81250
     309    68 61.81250
pred_test$score(measures)
 regr.rsq 
0.8545402 

The R² differs noticeably from the linear model — the single tree’s blocky predictions are a poor fit for a continuous quantity like age.

23.6 Random forest

A random forest is also tree-based, but averaging over many trees smooths away the blocky, discrete behaviour of a single tree, giving genuinely continuous predictions.

set.seed(789)
learner <- lrn("regr.ranger", mtry = 2, min.node.size = 20)
learner$train(task, row_ids = train_set)
learner$model
$model
Ranger result

Call:
 ranger::ranger(dependent.variable.name = task$target_names, data = data,      min.node.size = 20L, mtry = 2L, num.threads = 1L) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      209 
Number of independent variables:  13 
Mtry:                             2 
Target node size:                 20 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       17.72152 
R squared (OOB):                  0.918883 

23.6.1 Predict

pred_train <- learner$predict(task, row_ids = train_set)
pred_test <- learner$predict(task, row_ids = test_set)

23.6.2 Assess

pred_train

── <PredictionRegr> for 209 observations: ──────────────────────────────────────
 row_ids truth response
     298    29 30.48368
     103    58 58.25574
     194    53 48.05929
     ---   ---      ---
     312    48 51.18667
     246    66 64.54711
     238    38 37.65249
ggplot(pred_train, aes(x = truth, y = response)) +
    geom_point() +
    geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'

The points now scatter smoothly around the diagonal — no more horizontal bands.

measures <- msrs(c('regr.rsq'))
pred_train$score(measures)
 regr.rsq 
0.9623924 
pred_test

── <PredictionRegr> for 103 observations: ──────────────────────────────────────
 row_ids truth response
       4    37 37.10820
       5    24 29.29540
       7    34 33.02805
     ---   ---      ---
     306    42 40.41656
     307    63 58.32185
     309    68 63.00852
pred_test$score(measures)
 regr.rsq 
0.9226935 

Compare the test R² across the three regression models. The forest typically predicts age more accurately than either the linear model or the single tree, and without the single tree’s discrete artifacts.

23.7 Summary

You have now built an epigenetic clock — a complete regression analysis on real methylation data, through the same mlr3 workflow you used for cancer classification:

  • You turned the methylation table into an mlr3 regression task with as_task_regr() and split it into training and test sets.
  • You fit a linear regression, a regression tree, and a random forest, and judged each with and a truth-versus-prediction scatter.
  • The single tree’s predictions came in discrete bands — a poor fit for a continuous target — while the random forest’s smooth predictions typically beat both the straight line and the single tree.

The change from the cancer example was only the kind of target: a number instead of a category, so as_task_regr() instead of as_task_classif() and R² instead of accuracy. The discipline of trusting only the test-set score carries over unchanged.

23.8 Exercises

Each exercise reuses objects you already built, so the solutions are quick to run.

  1. Tune the forest. The random forest used mtry = 2 and min.node.size = 20. Refit it with the defaults (drop both arguments) and compare the test R². Did the hand-chosen settings help?

    rf_default <- lrn("regr.ranger")
    rf_default$train(task, row_ids = train_set)
    rf_default$predict(task, row_ids = test_set)$score(msrs(c('regr.rsq')))

    Changing mtry and min.node.size is a one-line edit to the lrn() call; everything else is identical. On this dataset the defaults are usually close, which is part of why random forests are a low-fuss default — but the honest comparison is always the held-out test R².

  2. Add a second measure. Alongside R², score the test predictions with root-mean-squared error, msr("regr.rmse"), which reports the typical prediction error in years. Which number is easier to explain to a biologist?

    pred_test$score(msrs(c('regr.rsq', 'regr.rmse')))

    R² is unitless and says what fraction of the variation in age the model explains; RMSE is in the target’s own units (years) and says how far off a typical prediction is. For an audience that thinks in years, RMSE is often the more intuitive headline number.