DNAm age: Frequently Asked Questions

FAQ about the epigenetic clock and DNAm age

by Steve Horvath

Human Genetics and Biostatistics, University of California, Los Angeles

shorvath (at) mednet (dot) ucla (dot) edu

This page provides a list of Frequently Asked Questions and frequently given answers. Please read these before emailing me about a problem.

Contents

What is DNAm age?

Is there a web interface where I can upload my data?

Should I use the web version or the R version of the age calculator?

Why does the web based calculator not return any results for my data set?

How should one estimate age acceleration?

Should one increase/decrease DNA methylation in order to stay young?

Should I remove batch effects in my data?

Do I need to impute missing data?

What could explain systematic differences between predicted age and chronological age?

What is the relationship with the age predictor by Hannum et al 2013 in Mol Cell?

What is the relationship with the blood based age predictor by Weidener et al 2014 in Genome Biology?

What is the relationship with the saliva based age predictor by Bocklandt et al 2011 or the predictor by Koch et al 2011?

What is new in the article? Where is the beef?

What is DNAm age?

Answer: DNA methylation (DNAm) age is the age predicted from the DNA methylation levels measured in a DNA source (e.g. a tissue). In other words, it is the result of applying the epigenetic clock to DNA methylation levels from a human cell, tissue, fluid, or organ. The clock assumes that the data are beta values measured on either the Illumina 450K or 27K platform. Additional information can be found in Horvath 2013 or on the following Wikipedia page: https://en.wikipedia.org/wiki/Biological_clock_(aging)
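To make "applying the epigenetic clock" a bit more concrete, here is a minimal R sketch of a linear age predictor, i.e. an intercept plus a weighted sum of beta values. The coefficients and the intercept are made up purely for illustration. They are not the published coefficients, and the normalization and calibration steps of the actual clock (see Horvath 2013) are not reproduced here.

```r
# Hypothetical coefficients for three probes (NOT the published clock coefficients).
toyCoef <- c(cg00000292 = 12.5, cg00002426 = -8.0, cg00003994 = 20.0)
toyIntercept <- 30

# Beta values (between 0 and 1): rows are probes, columns are samples.
betas <- rbind(
  cg00000292 = c(Sample1 = 0.80, Sample2 = 0.60),
  cg00002426 = c(Sample1 = 0.10, Sample2 = 0.30),
  cg00003994 = c(Sample1 = 0.50, Sample2 = 0.40)
)

# Predicted age per sample = intercept + weighted sum of beta values.
toyPredictedAge <- toyIntercept + as.vector(toyCoef[rownames(betas)] %*% betas)
names(toyPredictedAge) <- colnames(betas)
```

Indexing the coefficients by rownames(betas) keeps probes and weights aligned even if the rows of the data arrive in a different order.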

Is there a web interface where I can upload my data?

Answer: Yes, you can find a link on the webpage. An R software tutorial can be found in Additional file 20 of the article or, alternatively, on the DNAmAge webpage (http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/). Nobody will ever see your uploaded data.

Should I use the web version or the R version of the age calculator?

Answer: The web version is much easier to use and, importantly, it outputs various statistics for identifying array outliers (corSampleVSgoldstandard, predicted gender, predicted tissue). Your data are deleted automatically after you upload them (in order to avoid storage overflow), and nobody will ever see them. The online calculator is completely safe. R is often more convenient for expert programmers.

Why does the web based calculator not return any results for my data set?

Answer: Most likely your data caused an error. Here are some common remedies.

a) Make sure that you upload numeric data (missing values should be coded as NA and not as null or NULL). Apart from probe identifiers (cg numbers) in the first column, don't include other annotation (e.g. remove chromosome number, CpG island status etc). If need be, run the following R code before you upload the data.

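# Convert every column except the first (probe IDs) to numeric; non-numeric entries become NA.
for (i in seq_len(ncol(dat0))[-1]) { dat0[, i] <- as.numeric(as.character(dat0[, i])) }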

b) Make sure that your DNA methylation data file contains all the necessary probes. While it is OK to have missing DNA methylation levels, it is not OK to have missing probe IDs. Unless you use all probes on the 450K array or the 27K array, please make sure that your file includes all CpGs listed in datMiniAnnotation27k.csv. A probe that was not measured in your data set should appear as a row filled with NAs, but the probe name needs to be listed.

If you want to use the advanced analysis in blood, you need to specify all CpGs mentioned in the file datMiniAnnotation.csv (which includes more probes than datMiniAnnotation27k.csv).

c) If you uploaded a sample annotation file, make sure that its number of rows equals the number of samples, i.e. the number of columns of dat0 minus 1.

d) Sometimes users upload .csv files whose line breaks do not include both a line feed and a carriage return. To avoid this bug, I suggest you open your file in Excel and save it as a .csv file for Windows.
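Remedies a) to c) can be sketched in R as follows. This is only an illustration on toy data. The helper padMissingProbes is hypothetical (it is not part of the age calculator), and the sketch assumes that the required probe IDs are available as a character vector, for instance taken from the probe ID column of datMiniAnnotation27k.csv.

```r
# Hypothetical helper: keep one row per required probe, in the required
# order, adding all-NA rows for probes that are absent from dat0.
padMissingProbes <- function(dat0, requiredProbes) {
  idx <- match(requiredProbes, as.character(dat0[[1]]))
  padded <- dat0[idx, , drop = FALSE]   # an NA index yields a row of NAs
  padded[[1]] <- requiredProbes         # restore the probe name in those rows
  rownames(padded) <- NULL
  padded
}

# Toy data: probe IDs in column 1, one column per sample.
dat0 <- data.frame(
  ProbeID = c("cg00002426", "cg00000292"),
  Sample1 = c(0.12, 0.81),
  Sample2 = c(NA, 0.79),
  stringsAsFactors = FALSE
)
# Assumption: in practice this vector would be read from the annotation file.
requiredProbes <- c("cg00000292", "cg00002426", "cg00003994")

datPadded <- padMissingProbes(dat0, requiredProbes)

# Remedy c): the sample annotation needs exactly one row per sample column.
datSample <- data.frame(SampleID = c("Sample1", "Sample2"), Age = c(40, 52))
stopifnot(nrow(datSample) == ncol(datPadded) - 1)

# Remedy a): write missing values as NA (not null or NULL).
outFile <- tempfile(fileext = ".csv")
write.csv(datPadded, outFile, row.names = FALSE, na = "NA")
```

Note that this helper also drops probes that are not in requiredProbes, so pass the full probe list of your array if you want to keep everything.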

How should one estimate age acceleration?

Answer: I recommend using the measure AgeAccelerationResidual returned by the online tool, i.e. the studentized residual from a linear regression model in which DNAm age is regressed on chronological age. An alternative and more intuitive approach is to define age acceleration as the difference between DNAm age and chronological age. Thus, if the blood of a 40 year old man has a DNAm age of 30, then his age acceleration is minus 10 years (he looks younger than expected).
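Both definitions can be computed in a few lines of R. The sketch below uses made-up ages and DNAm ages. The variable names are illustrative, and the sketch does not claim to reproduce exactly how the online tool computes AgeAccelerationResidual.

```r
# Made-up example values (years).
age     <- c(25, 40, 52, 61, 70, 33)   # chronological age at sample collection
dnamAge <- c(27, 30, 55, 60, 74, 31)   # predicted (DNAm) age

# Intuitive definition: difference between DNAm age and chronological age.
ageAccelDiff <- dnamAge - age

# Regression-based definition: residuals from regressing DNAm age on age.
fit <- lm(dnamAge ~ age)
ageAccelResid       <- residuals(fit)  # raw residuals, in years
ageAccelStudentized <- rstudent(fit)   # studentized residuals, unitless
```

The second subject corresponds to the example in the text: a chronological age of 40 and a DNAm age of 30 give a difference of minus 10 years. Unlike the simple difference, the regression residuals are uncorrelated with chronological age within the data set by construction.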

Should one increase/decrease DNA methylation in order to stay young?

Answer: This is an open research question, and I am not sure that it has a meaningful answer. At this point, there is no simple answer. Note that the epigenetic clock is based on a total of 353 epigenetic markers (CpGs). The DNA methylation levels of 193 of these markers increase with age, but the remaining 160 markers show the opposite behavior.

Should I remove batch effects in my data?

Answer: As a rule, I advise against it. There is a danger that batch removal and other fancy pre-processing steps make things worse. Remember that the epigenetic clock already includes a normalization method (based on re-purposing the BMIQ method from Teschendorff), which is implemented in the age calculator.

Do I need to impute missing data?

Answer: Try to submit data without any missing values, i.e. don't mask probes based on insignificant detection p-values. But to directly answer your question: no, the age calculator automatically imputes missing data. The default (and preferred) method is based on k-nearest neighbor imputation. However, I strongly recommend that you use raw beta values or use GenomeStudio without masking probes based on detection p-values. If you want to obtain the predicted age, you will need to upload all the CpGs on the Illumina 27K array (see the file datMiniAnnotation27k.csv).

If you want to use the advanced analysis in blood, please use all probes in datMiniAnnotation.csv.

Include CpG probes not located on the 450K array by simply specifying "NA" for the respective missing CpGs.

What could explain systematic differences between predicted age and chronological age?

Answer: Please check the following:

a) You are using the current age of the person instead of their age when the sample was collected. For example, if a blood sample was collected 10 years ago and the current age of the subject is 50, then the predicted age will be around 40.

b) DNA storage effects.

c) Poor DNA quality or insufficient DNA was used.

d) An excessive amount of missing data.

e) Failure to follow the standard Illumina protocol for generating the beta values.

f) You used a special normalization method, e.g. to remove batch or chip effects.

g) You use M values instead of beta values. Make sure the values lie between 0 and 1.

h) You are using a DNA source (e.g. tissue) that is very different from any of the tissues used in the training data sets. If so, please contact SH (shorvath at mednet.ucla.edu) so that he can update the epigenetic clock.

i) You are using malignant (cancerous) tissue. As a rule, the age predictor works well in healthy tissues.

j) The DNA does not come from humans or chimpanzees.

k) You accidentally permuted the sample annotation file or the methylation data. It is crucial that the columns in the methylation data set correspond to the rows in the sample annotation file.

l) Let SH know if you can figure out additional reasons.
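A few of the items above (d, g and k) can be checked automatically before uploading. The following R sketch uses toy data and a hypothetical helper, checkMethylationInput, that is not part of the age calculator. The M-to-beta conversion shown at the end is the usual logit-based relationship and ignores any platform-specific offsets.

```r
# Hypothetical helper: flag a few of the problems listed above.
checkMethylationInput <- function(dat0, sampleIDs) {
  betas <- as.matrix(dat0[, -1, drop = FALSE])
  list(
    # g) beta values must lie between 0 and 1; M values usually do not.
    outsideUnitInterval = any(betas < 0 | betas > 1, na.rm = TRUE),
    # d) fraction of missing values per sample.
    fractionMissing = colMeans(is.na(betas)),
    # k) columns of the methylation data must match the annotation rows, in order.
    sampleOrderMatches = identical(colnames(betas), as.character(sampleIDs))
  )
}

# Toy data: Sample2 contains values that look like M values.
datToy <- data.frame(
  ProbeID = c("cg00000292", "cg00002426", "cg00003994", "cg00005847"),
  Sample1 = c(0.81, 0.12, NA, 0.45),
  Sample2 = c(2.10, -2.90, 0.00, -0.30),
  stringsAsFactors = FALSE
)
# The annotation lists the samples in a different order than the data columns.
checks <- checkMethylationInput(datToy, sampleIDs = c("Sample2", "Sample1"))

# Usual conversion from M values back to beta values: beta = 2^M / (2^M + 1).
mToBeta <- function(m) 2^m / (2^m + 1)
```

In this toy example the check flags values outside the unit interval and a mismatch in sample order, and reports that a quarter of the Sample1 values are missing.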

What is the relationship with the age predictor by Hannum et al 2013 in Mol Cell?

There are several differences.

First, the age predictor from Hannum et al (2013) only works in blood tissue and typically requires adjustments in other tissues. For example, it does not apply to buccal epithelium (unpublished data). As mentioned in Hannum (2013), each tissue led to a clear linear offset (intercept and slope). In other words, if the Hannum predictor is applied to breast tissue, then an intercept and slope term will be chosen so that there is no systematic difference between blood and breast tissue. Therefore, the authors had to use a linear model to adjust for each tissue trend. When the Hannum predictor is applied to other tissues without this adjustment, it leads to an unacceptably high error (due to poor calibration), as can be seen from their Figure 4A. Hannum et al have to "calibrate" their age prediction (by adjusting the slope and the intercept) in order to get it to work in other tissues. The defining characteristic of the epigenetic clock (Horvath 2013) is that one does not have to carry out such a calibration step. Hannum et al also fit separate predictors for each tissue, but the corresponding sets of covariates (CpGs) have very little overlap. In contrast, the epigenetic clock uses the same set of 353 CpGs for each tissue and the same coefficient values.
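The effect of such a calibration step can be illustrated with a small R sketch. The numbers are made up and the code is not the procedure used by Hannum et al. It only shows what adjusting the intercept and slope within one tissue does to a systematic offset.

```r
# Made-up numbers: chronological ages and the raw output of a blood-trained
# predictor applied to another tissue, with a systematic linear offset.
age     <- c(20, 30, 40, 50, 60, 70)
rawPred <- 5 + 1.2 * age + c(1, -1, 0.5, -0.5, 1, -1)

# Per-tissue calibration: regress chronological age on the raw prediction
# and use the fitted values as the adjusted prediction.
calibFit       <- lm(age ~ rawPred)
calibratedPred <- fitted(calibFit)

meanOffsetRaw        <- mean(rawPred - age)         # systematic offset before calibration
meanOffsetCalibrated <- mean(calibratedPred - age)  # essentially zero afterwards
```

Because the intercept and slope are re-estimated within the tissue, the average difference between predicted and chronological age is removed by construction. This is why a predictor calibrated in this way cannot reveal whether one tissue is systematically older or younger than another.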

Second, the blood based predictor from Hannum et al cannot be used to compare the ages of different tissues/organs because it calibrates differences away, as described in the first point. The most exciting aspect of the epigenetic clock is that it can be used to compare the ages of different tissues/cells/organs from the same individual. Hannum et al never claimed that they could compare the ages of different normal tissues. But their predictor lends itself to comparing cancer with normal tissue. Hannum et al evaluated accelerated aging effects in cancer and reported pronounced age acceleration in cancers. Horvath (2013) did not corroborate this claim using the epigenetic clock. As a matter of fact, a related statement had to be retracted, as can be seen from the following correction:

http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/correction/

Third, Hannum's multivariate age predictor makes use of covariates such as gender, BMI, diabetes status, ethnicity, and batch. Hannum et al did not fully report all the information regarding their prediction method: neither the coefficient values for the above-mentioned covariates nor the covariates themselves. Strictly speaking, one cannot apply their prediction method to predict age in independent data, since new data involve different batches etc. However, the authors present coefficient values for their CpGs in the Supplementary Table.

Fourth, Hannum's age predictor works very well for blood samples from middle aged and older subjects. However, it works poorly for blood samples from subjects who are younger than 20 as can be seen from Figure 1B (where the estimated ages tend to be very negative). In contrast, the epigenetic clock is well calibrated (panel A) in younger subjects.

Figure 1: Comparison of the 3 age predictors described in A) Horvath (2013), B) Hannum (2013), and C) Weidener (2014), respectively. The x-axis depicts the chronological age in years whereas the y-axis shows the predicted age. The solid black line corresponds to y=x. The whole blood samples represent an independent blood methylation data set (generated in Nov 2014).

In contrast, the epigenetic clock (Horvath 2013) does not require covariate information and can be applied to independent data without any adjustments.

Regarding the supervised machine learning method: Hannum et al used a penalized regression model similar to our saliva based predictor in Bocklandt et al 2011 and similar to that of the epigenetic clock. Differences between the elastic net and lasso approach are described here:

https://en.wikipedia.org/wiki/Elastic_net_regularization
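For quick reference, one common way of writing the elastic net objective is shown below. The notation follows the parametrization used by the glmnet R package and is not taken from any of the articles discussed here: y_i is the age of sample i, x_i its vector of CpG beta values, lambda the overall penalty strength and alpha the mixing parameter.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

Setting alpha = 1 gives the lasso (pure L1 penalty) and alpha = 0 gives ridge regression. Intermediate values combine the two penalties.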

What is the relationship with the blood based age predictor by Weidener et al 2014 in Genome Biology?

The epigenetic clock is much more accurate than the predictor by Weidener et al 2014 even in blood tissue. Further, the age predictor by Weidener et al only applies to blood tissue and cannot be used to compare the ages of different parts of the human body.

Here is a detailed analysis that corroborates these statements:

http://labs.genetics.ucla.edu/horvath/htdocs/dnamage/weidener2014

An additional comparison is shown in Figure 1 (see panel C).

What is the relationship with the saliva based age predictor by Bocklandt et al 2011 or the predictor by Koch et al 2011?

The epigenetic clock is more accurate than our age predictor in Bocklandt et al 2011, even in saliva. Further, our original predictor only applies to saliva and cannot be used to compare the ages of different parts of the human body. The epigenetic clock is also much more accurate than the predictor by Koch et al 2011. A detailed comparison can be found in Additional file 2 of Horvath 2013.

What is new in the article? Where is the beef?

The epigenetic clock is the first age prediction method based on DNAm levels that accurately predicts age in more than one tissue or fluid. As a matter of fact, it works in the vast majority of tissues/fluids/organs. It is arguably the first accurate measure of age that allows one to compare the ages of different parts of the human body. Researchers who develop genomic biomarkers will appreciate its astonishing accuracy, the fact that it works across two Illumina array platforms and that it is remarkably robust to batch effects. The epigenetic clock yields many interesting insights. Here is a top 10 list:

1) stem cells and iPS cells are perfectly young,

2) the epigenetic clock works in chimpanzees,

3) age acceleration effects (measured by the clock) are highly heritable,

4) normal female breast tissue (adjacent to tumor) exhibits positive age acceleration effects while heart tissue appears younger,

5) cell passaging increases DNAm age,

6) the ticking rate of the epigenetic clock is fastest during development,

7) tumor morphology does not relate to age acceleration,

8) there is a weak inverse relationship between the number of somatic mutations and age acceleration in many cancer types,

9) mutations in steroid receptors are associated with lower age acceleration effects in breast cancer tissue,

10) TP53 mutation status relates to age acceleration.

The big picture: Most (but certainly not all) prior articles propose that age effects on DNA methylation levels represent noise or epigenetic drift, see for example the excellent recent article by A. Teschendorff et al (2013). Hum Mol Genet. PMID: 23918660. While epigenetic drift may explain age related changes for most CpGs, Horvath (2013) presents compelling data that the epigenetic clock relates to a purposeful biological process. Further, it proposes the epigenomic maintenance system (EMS) model of DNAm age.