FAQ about the epigenetic clock and DNAm age
by Steve Horvath
Human Genetics and Biostatistics, University of California, Los Angeles
shorvath (at) mednet (dot) ucla (dot) edu
This page provides a list
of Frequently Asked Questions and frequently given answers. Please read these
before emailing me about a problem.
Contents
What is DNAm age?
Is there a web interface where I can upload my data?
Should I use the web version or the R version of the age calculator?
Why does the web based calculator not return any results for my data set?
How should one estimate age acceleration?
Should one increase/decrease DNA methylation in order to stay young?
Should I remove batch effects in my data?
Do I need to impute missing data?
What could explain systematic differences between predicted age and chronological age?
What is the relationship with the age predictor by Hannum et al 2013 in Mol Cell?
What is the relationship with the blood based age predictor by Weidner et al 2014 in Genome Biology?
What is the relationship with the saliva based age predictor by Bocklandt et al 2011 or the predictor by Koch et al 2011?
What is new in the article? Where is the beef?
What is DNAm age?
Answer: DNA methylation age is the predicted age based on the DNA methylation
levels measured from a DNA source (e.g. a tissue). In other words, it is the
result of applying the epigenetic clock to DNA methylation levels from a human
cell, tissue, fluid, or organ. It assumes the data represent beta values measured
using either the Illumina 450K or 27K platform. Additional information can be
found in Horvath 2013 or on the following Wikipedia page: https://en.wikipedia.org/wiki/Biological_clock_(aging)
Is there a web interface where I can upload my data?
Answer: Yes, you can find a link on the DNAmAge webpage:
http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/
An R software tutorial can be found in Additional file 20 of the article or,
alternatively, on the same webpage. Nobody will ever see your uploaded data.
Should I use the web version or the R version of the age
calculator?
Answer: The web version is much easier to use and, importantly, it outputs
various statistics for identifying array outliers (corSampleVSgoldstandard,
predicted gender, predicted tissue). Your data get automatically deleted after
you upload them (in order to avoid storage overflow). Nobody will ever see your
uploaded data. This online calculator is completely safe. R is often more
convenient for expert programmers.
Why does the web based calculator not return any results
for my data set?
Answer: Most likely your data caused an error. Here are some common remedies.
a) Make sure that you upload numeric data (missing values should be coded as
NA and not as null or NULL). Apart from probe identifiers (cg numbers) in the
first column, don't include other annotation (e.g. remove chromosome number,
CpG island status etc). If need be, run the following R code before you upload
the data.
# convert every data column (all but the first) to numeric
for (i in 2:ncol(dat0)) { dat0[, i] <- as.numeric(as.character(dat0[, i])) }
b) Make sure that your
DNA methylation data file contains all the necessary probes. While it is OK to
have missing DNA methylation levels, it is not OK to have missing probe IDs.
Unless you use all probes on the 450K array or the 27K array, please make sure
that your file includes all CpGs listed in datMiniAnnotation27k.csv. Probes that were not measured in your data
set should lead to a row filled with NAs. But the probe name needs to be
listed.
If you want to use the
advanced analysis in blood, you need to specify all CpGs mentioned in the file
datMiniAnnotation.csv (which includes more probes than
datMiniAnnotation27k.csv).
c) If you uploaded a sample annotation file, make sure that its number of rows
equals the number of samples, i.e. the number of columns of dat0 minus 1.
d) Sometimes users upload .csv files whose line breaks do not consist of a
carriage return followed by a line feed. To avoid this bug, I suggest you open
your file in Excel and save it as a .csv file for Windows.
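Remedies a) and c) can be checked in R before uploading. The following is only a minimal sketch on toy data: dat0 is the name used in the R code above, while datSample, the probe values, and the column names are hypothetical and chosen for illustration.

```r
# Toy stand-in for a methylation file: first column holds probe IDs,
# the remaining columns hold one sample each (hypothetical values).
dat0 <- data.frame(
  ProbeID = c("cg00000292", "cg00002426", "cg00003994"),
  Sample1 = c("0.81", "null", "0.05"),
  Sample2 = c("0.79", "0.42", "NULL"),
  stringsAsFactors = FALSE
)

# Toy sample annotation file: one row per sample (hypothetical).
datSample <- data.frame(SampleID = c("Sample1", "Sample2"), Age = c(40, 52))

# Remedy a): recode null/NULL/empty entries as NA, then make each
# data column (all but the first) numeric.
for (i in 2:ncol(dat0)) {
  x <- as.character(dat0[, i])
  x[x %in% c("null", "NULL", "")] <- NA
  dat0[, i] <- as.numeric(x)
}

# Remedy c): the annotation file needs one row per sample column.
rows_match <- nrow(datSample) == ncol(dat0) - 1
if (!rows_match) warning("Sample annotation rows do not match sample columns")
```

Recoding before the numeric conversion avoids the coercion warnings that as.numeric would otherwise produce for entries such as "null".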
How should one estimate age acceleration?
Answer: I recommend using the measure AgeAccelerationResidual that results
from the online tool, i.e. the studentized residual resulting from a
linear regression model where DNAm age is regressed on chronological age. An
alternative and more intuitive approach is to define age acceleration as the
difference between DNAm age and chronological age. Thus, if the blood of a 40
year old man has a DNAm age of 30, then his age acceleration is minus 10 years
(he looks younger than expected).
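Both definitions can be reproduced in base R. This is a sketch on made-up numbers, not the code of the online tool: the data frame datOut and its column names are hypothetical, and rstudent() is assumed here as one common way to obtain studentized residuals.

```r
# Hypothetical chronological ages and DNAm ages for six samples.
datOut <- data.frame(
  Age     = c(25, 32, 40, 47, 55, 63),
  DNAmAge = c(27, 30, 30, 50, 53, 66)
)

# Regress DNAm age on chronological age.
fit <- lm(DNAmAge ~ Age, data = datOut)

# Residual-based age acceleration: raw residuals are in years,
# studentized residuals are unitless.
datOut$AgeAccelResidual    <- residuals(fit)
datOut$AgeAccelStudentized <- rstudent(fit)

# Difference-based age acceleration (in years).
datOut$AgeAccelDifference <- datOut$DNAmAge - datOut$Age
```

The third sample mirrors the example in the text: a chronological age of 40 and a DNAm age of 30 give a difference of minus 10 years. Unlike the difference, the residual-based measure is uncorrelated with chronological age within the data set by construction.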
Should one increase/decrease DNA methylation in order to
stay young?
Answer: This is an open research question. I am not sure that the question will
have a meaningful answer. At this point, there is no simple answer. Note that
the epigenetic clock is based on a total of 353 epigenetic markers (CpGs). The
DNA methylation levels of 193 of these markers increase with age but the
remaining 160 markers show the opposite behavior.
Should I remove batch effects in my data?
Answer: As a rule, I advise against it. There is a danger that batch
removal and other fancy pre-processing make things worse. Remember that the
epigenetic clock implements a normalization method (based on re-purposing the
BMIQ method from Teschendorff). This approach is implemented in the age
calculator.
Do I need to impute missing data?
Answer: Try to submit data without any missing data, i.e. don't mask
probes based on insignificant detection p-values. But to directly answer your
question: no, the age calculator automatically imputes missing data. The
default (and preferred) method is based on k-nearest neighbor imputation.
However, I strongly recommend you use raw beta values or use GenomeStudio
without masking probes based on detection p-values. If you want to obtain the
predicted age, you will need to upload all the 27k CpGs on the Illumina 27K
array (see the file datMiniAnnotation27k.csv).
If you want to use the advanced analysis in blood, please use all probes
in datMiniAnnotation.csv.
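Probes that are required but were not measured can be listed with NA values. The sketch below shows one way to do this on toy data. It assumes that the probe identifiers sit in the first column of both tables; datMiniAnnotation here is a hypothetical stand-in for the contents of the annotation file, and all values are made up.

```r
# Hypothetical stand-in for the required probe list (in practice read from
# datMiniAnnotation27k.csv or datMiniAnnotation.csv).
datMiniAnnotation <- data.frame(
  Name = c("cg00000292", "cg00002426", "cg00003994", "cg00005847"),
  stringsAsFactors = FALSE
)

# Toy methylation data that lacks two of the required probes.
dat0 <- data.frame(
  ProbeID = c("cg00003994", "cg00000292"),
  Sample1 = c(0.05, 0.81),
  Sample2 = c(0.07, 0.79),
  stringsAsFactors = FALSE
)

# Position of each required probe in dat0 (NA if it was not measured).
idx <- match(datMiniAnnotation[, 1], dat0[, 1])

# Indexing rows by NA yields rows filled with NA; afterwards the probe
# names are restored so that every required probe is listed.
dat1 <- dat0[idx, ]
dat1[, 1] <- datMiniAnnotation[, 1]
rownames(dat1) <- NULL
```

The result dat1 lists the probes in the order of the annotation table, with measured values where available and NA elsewhere.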
Include CpG probes not located on the 450K array by simply specifying
"NA" for the respective missing CpGs.
What could explain systematic differences between
predicted age and chronological age?
Answer: Please check the following:
a) You are using the
current age of the person instead of their age when the sample was collected.
For example, if a blood sample was collected 10 years ago and the current age
of the subject is 50, then the predicted age will be around 40.
b) DNA storage effects.
c) Poor DNA quality or
insufficient DNA was used.
d) Excessive numbers of
missing data.
e) Failure to follow
the standard Illumina protocol for generating the beta values.
f) You used a special
normalization method, e.g. to remove batch or chip effects.
g) You use M values
instead of beta values. Make sure the values lie between 0 and 1.
h) You are using a DNA
source (e.g. tissue) that is very different from any of the tissues used in the
training data sets. If so, please contact SH (shorvath at mednet.ucla.edu) so
that he can update the epigenetic clock.
i) You are using
malignant (cancerous) tissue. As a rule, the age predictor works well in
healthy tissues.
j) The DNA does not
come from humans or chimpanzees.
k) You accidentally
permuted the sample annotation file or the methylation data. It is crucial that the
columns in the methylation data set correspond to the rows in the sample annotation
file.
l) Let SH know if you
can figure out additional reasons.
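Items g) and k) lend themselves to a quick check in R. The following sketch uses toy data. The conversion from M values assumes the common definition M = log2(beta / (1 - beta)), and the object names (datM, datSample, SampleID) are hypothetical.

```r
# Toy data in which M values were supplied instead of beta values.
datM <- data.frame(
  ProbeID = c("cg00000292", "cg00002426"),
  Sample1 = c(2.0, -1.0),
  Sample2 = c(0.0, 3.0),
  stringsAsFactors = FALSE
)
datSample <- data.frame(
  SampleID = c("Sample1", "Sample2"),
  Age = c(40, 52),
  stringsAsFactors = FALSE
)

# Item g): beta values must lie between 0 and 1 (first column = probe IDs).
is_beta <- function(dat) {
  vals <- as.matrix(dat[, -1, drop = FALSE])
  all(vals >= 0 & vals <= 1, na.rm = TRUE)
}

# Convert M values back to beta values, assuming M = log2(beta / (1 - beta)).
m_to_beta <- function(m) 2^m / (2^m + 1)

datBeta <- datM
datBeta[, -1] <- lapply(datM[, -1, drop = FALSE], m_to_beta)

# Item k): columns of the methylation data must line up with the rows
# of the sample annotation file.
same_order <- identical(colnames(datBeta)[-1], datSample$SampleID)
```

Here is_beta(datM) is FALSE because of the values outside the unit interval, while is_beta(datBeta) is TRUE after the conversion. Whenever possible it is preferable to go back to the original beta values rather than to convert.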
What is the relationship with the age predictor by Hannum
et al 2013 in Mol Cell?
There are several differences.
First, the age predictor from Hannum et al (2013) only works in blood
tissue and typically requires adjustments in other tissues. For example, it
does not apply to buccal epithelium (unpublished data). As mentioned in Hannum
(2013), each tissue led to a clear linear offset (intercept and slope). In
other words, if the Hannum predictor is applied to breast tissue then an
intercept and slope term will be chosen so that there is no systematic difference
between blood and breast tissue. Therefore, the authors had to use a linear
model to adjust for each tissue trend. When the Hannum predictor is applied to
other tissues, it leads to an unacceptably high error (due to poor calibration)
as can be seen from their Figure 4A. Hannum et al have to "calibrate"
their age prediction (by adjusting the slope and the intercept) in order to get
it to work in other tissues. The defining characteristic of the epigenetic
clock (Horvath 2013) is that one does not have to carry out such a calibration
step. Hannum et al also fit separate predictors for each tissue but the
corresponding sets of covariates (CpGs) have very little overlap. In contrast,
the epigenetic clock uses the same set of 353 CpGs for each tissue and the same
coefficient values.
Second, the blood based predictor from Hannum et al cannot be used to
compare the ages of different tissues/organs because it calibrates differences
away as described in the first point. The most exciting aspect of the epigenetic clock is
that it can be used to compare the ages of different tissues/cells/organs from
the same individual. Hannum et al never claimed that they could compare the
ages of different normal tissues. But their predictor lends itself to comparing
cancer with normal tissue. Hannum et al evaluated accelerated aging effects in
cancer and reported pronounced age acceleration in cancers. Horvath (2013) did not
corroborate this claim using the epigenetic clock. As a matter of fact, a related
statement had to be retracted as can be seen from the following correction:
http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/correction/
Third, Hannum's multivariate age predictor makes use of covariates such
as gender, BMI, diabetes status, ethnicity, and batch. Hannum et al did
not fully report all the information regarding their prediction method: neither
the coefficient values for the above mentioned covariates nor the covariates.
Strictly speaking, one cannot apply their prediction method to predict age in
independent data since new data involve different batches etc. However, the
authors present coefficient values for their CpGs in the Supplementary Table.
Fourth, Hannum's age predictor works very well for blood samples from
middle aged and older subjects. However, it works poorly for blood samples from
subjects who are younger than 20 as can be seen from Figure 1B (where the
estimated ages tend to be very negative). In contrast, the epigenetic clock is
well calibrated (panel A) in younger subjects.
Figure 1: Comparison of the 3 age predictors
described in A) Horvath (2013), B) Hannum (2013), and C) Weidner (2014),
respectively. The x-axis depicts
the chronological age in years whereas the y-axis shows the predicted age. The
solid black line corresponds to y=x. The whole blood samples represent an
independent blood methylation data set (generated in Nov 2014).
Unlike Hannum's predictor, the epigenetic clock (Horvath 2013) does not require
covariate information and can be applied to independent data without any
adjustments.
Regarding the supervised machine learning method: Hannum et al used a penalized
regression model similar to our saliva based predictor in Bocklandt et al 2011
and similar to that of the epigenetic clock. Differences between the elastic
net and lasso approach are described here:
https://en.wikipedia.org/wiki/Elastic_net_regularization
What is the relationship with the blood based age
predictor by Weidner et al 2014 in Genome Biology?
The epigenetic clock is much more
accurate than the predictor by Weidner et al 2014 even in blood tissue.
Further, the age predictor by Weidner et al only applies to blood tissue and
cannot be used to compare the ages of different parts of the human body.
Here is a detailed analysis that
corroborates these statements:
http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/weidener2014
An additional comparison is shown in Figure 1 (see panel C).
What is the relationship with the saliva based age
predictor by Bocklandt et al 2011 or the predictor by Koch et al 2011?
The epigenetic clock is more accurate than our age predictor in Bocklandt
et al 2011 even in saliva tissue. Further, our original predictor only applies
to saliva and cannot be used to compare the ages of different parts of the
human body. The epigenetic clock is much more accurate than the predictor by
Koch et al 2011. A detailed comparison can be found in Additional file 2 of
Horvath 2013.
What is new in the article? Where is the beef?
The epigenetic clock is the first age prediction method based on DNAm
levels that accurately predicts age in more than one tissue or fluid. As a
matter of fact, it works in the vast majority of tissues/fluids/organs. It is
arguably the first accurate measure of age that allows one to compare the ages
of different parts of the human body. Researchers who develop genomic
biomarkers will appreciate its astonishing accuracy, the fact that it works
across two Illumina array platforms and that it is remarkably robust to batch
effects. The epigenetic clock yields many interesting insights. Here is a top
10 list:
1) stem cells and iPS cells are perfectly young,
2) the epigenetic clock works in chimpanzees,
3) age acceleration effects (measured by the clock) are highly heritable,
4) normal female breast tissue (adjacent to tumor) exhibits positive age
acceleration effects while heart tissue appears younger,
5) cell passaging increases DNAm age,
6) the ticking rate of the epigenetic clock is fastest during
development,
7) tumor morphology does not relate to age acceleration,
8) there is a weak inverse relationship between the number of somatic
mutations and age acceleration in many cancer types,
9) mutations in steroid receptors are associated with lower age
acceleration effects in breast cancer tissue,
10) TP53 mutation status relates to age acceleration.
The big picture: Most (but certainly not all) prior articles propose that
age effects on DNA methylation levels represent noise or epigenetic drift, see
for example the excellent recent article by A. Teschendorff et al (2013). Hum
Mol Genet. PMID: 23918660. While epigenetic drift may explain age related
changes for most CpGs, Horvath (2013) presents compelling data that the
epigenetic clock relates to a purposeful biological process. Further, it
proposes the epigenomic maintenance system (EMS) model of DNAm age.