Calculating physiological age by data mining hospital records

How old are you?

(Are you sure?)

Obviously we each know our own age, in the sense that we know how long we’ve been alive, but what does this number really tell us? Sadly, it does not tell us how many years we have left to live; some would argue that our uncertainty about the answer to that question is a central aspect of the human condition.

But perhaps even more troubling, the amount of time that has passed since our birthdate—which we’ll call chronological age—doesn’t even tell us how old we are right now.

We have an intuitive ability to assess other people’s ages, but we also know that our assessments are less than perfectly accurate: some people simply look older (or younger) than other people born in the same year. Moreover, these differences go more than skin deep, reflecting general vigor and health.

Biogerontologists have sought to make this intuition more rigorous by identifying changes in the body that correlate strongly with chronological age but also capture the fact that some people age more rapidly than others. Measuring such changes allows calculation of what we’ll call physiological age. The best aging markers, then, have two features:

  1. Measuring them allows us to guess with reasonable accuracy how long it has been since someone was born.

  2. To the extent that the guesses are inaccurate, the error should reflect differences in longevity from the time of measurement. (So for example, if we measure the physiological ages of a population of 50-year-olds, the subjects who live to 100 should ideally have lower physiological ages than the ones who only make it to 60.)
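
To make the two features concrete, here is a toy calculation in Python. Every number in it is invented for illustration: a hypothetical cohort of five people, all aged 50 at measurement, with made-up marker-based age estimates and made-up ages at death. The names `gaps`, `remaining`, and `pearson` are my own and do not come from any study. Feature 1 asks that the average size of the gap be small; feature 2 asks that, among people of the same chronological age, a larger gap go with fewer remaining years.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical cohort: everyone is 50 years old at the time of measurement.
chronological = [50, 50, 50, 50, 50]
predicted = [44, 47, 50, 55, 58]        # invented marker-based age estimates
age_at_death = [100, 88, 80, 67, 60]    # invented outcomes

# The "age gap": positive means physiologically older than one's birth certificate.
gaps = [p - c for p, c in zip(predicted, chronological)]
remaining = [d - c for d, c in zip(age_at_death, chronological)]

# Feature 1: the marker should guess chronological age reasonably well.
mean_abs_error = sum(abs(g) for g in gaps) / len(gaps)

# Feature 2: whatever error remains should track remaining lifespan.
gap_vs_remaining = pearson(gaps, remaining)

print(f"mean absolute error: {mean_abs_error:.1f} years")
print(f"correlation of gap with remaining years: {gap_vs_remaining:.2f}")
```

In this invented cohort the correlation is strongly negative, which is what an ideal marker would show: the "older" a 50-year-old measures, the fewer years they have left.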


The challenge of rigorously defining physiological age has been tackled in a wide variety of ways, ranging from calculations based on easily observed aspects of a person’s life (“start with your mother’s age; add 7 if you smoked; subtract 3 if you eat fish once a week”) to “omics” techniques that measure gene expression or DNA methylation. Both approaches have issues: The actuarial techniques are cheap, easy, and imprecise, whereas the molecular methods are expensive and time-consuming—which is a problem because when the measurements are very dear, it’s hard to get enough data to train a model.

Wouldn’t it be great if there were a huge set of clinically relevant data for hundreds of thousands of people, just waiting to be analyzed? Good news: There is. Modern hospitals maintain an electronic medical record (EMR) for each patient, which contains a wealth of information including disease diagnoses, laboratory tests, and physiological traits. Although these records are inherently noisy and diverse, the sheer volume of information makes EMRs a tempting target for analysis, and the biomedical nature of the data makes them a good fit for efforts to calculate physiological age.

In a very recent study, a group of bioinformaticians at Mount Sinai analyzed EMR data from more than a third of a million patients. Using an artificial intelligence technique known as deep learning, they were able to predict chronological age with reasonable accuracy. Moreover, they found that the discrepancies between predicted and chronological age provided insights into the regulation of human longevity.

Attentive readers will have noted my qualification: the accuracy was “reasonable,” not “breathtaking.” The model could only predict the patients’ chronological age with an error of about 7 years in either direction—but then again, that’s about as well as a human observer can do with an adult of middle age or older, and as we can see from the figure below, the overall correlation was excellent.

Predicted age vs. chronological age (Figure 4 of Wang et al.)
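
It may seem odd that a 7-year error can coexist with an excellent correlation, so here is a small simulation of how that can happen. It is only a sketch under assumptions of my own, not a reanalysis of the study: the ages are synthetic and spread evenly from 20 to 90, the prediction noise is Gaussian, and I read "about 7 years" as a mean absolute error. The noise width of 8.75 years was picked because a Gaussian with that standard deviation has a mean absolute deviation of roughly 7.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed setup (not from the paper): ages uniform on 20-90, Gaussian noise.
chronological = [random.uniform(20, 90) for _ in range(5000)]
predicted = [age + random.gauss(0, 8.75) for age in chronological]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Typical size of the miss, in years.
mean_abs_error = sum(
    abs(p - c) for p, c in zip(predicted, chronological)
) / len(chronological)

# How well predicted age tracks chronological age across the whole range.
r = pearson(chronological, predicted)

print(f"mean absolute error: {mean_abs_error:.1f} years")
print(f"Pearson r: {r:.2f}")
```

The point of the sketch is that correlation depends on the error relative to the spread of ages. When the population spans seven decades, a typical miss of seven years is small by comparison, so the correlation stays high.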

More to the point, given the broad range of lifespans in human populations, it could be that the prediction error represents the real distribution of physiological ages for any given chronological age. (If it is, one corollary is that we shouldn’t be able to decrease the error term by adding orthogonal data like white blood cell telomere length or DNA methylation patterns…but I digress.)

As one might expect, patients who were physiologically older than their chronological age exhibited higher rates of age-related disease, whereas those who were “young for their age” had lower blood pressure and a generally lower disease burden. (As an interesting aside, the “younger” group tended to be shorter than the “older” group, reminiscent of the finding that for a given body plan, smaller animals tend to live longer—so, dwarf mice live longer than regular-sized mice; beagles longer than Great Danes; and ponies longer than Clydesdales.)

In an exciting sub-analysis, the authors examined genetic data from a relatively small fraction of the patients (“only” 10,000 or so). Using a genome-wide association study (GWAS), they mapped the variation in the rate of aging to genes that influence inflammation, hypertension, lipid metabolism, height (once again, size matters), and longevity in mice. In addition to known aging-related genes, the authors also identified a few previously unreported loci that represent candidate longevity regulators, and whose homologs could be experimentally tested for aging effects in model organisms.

An analysis I would have liked to see relates to feature #2 of an ideal aging biomarker that I mentioned above: the ability to predict longevity from the time of measurement. Unfortunately, for a variety of reasons, the EMR does not generally contain information about when a patient dies (unless they die in the hospital). One hopes that a clever demographer or bioinformatician will find some way to connect the treasure trove of information in EMRs to public records about time of death — with all due respect and heartfelt condolences to the families of the departed, I am dying to know whether the authors’ physiological age metric can forecast longevity.

Still, I found this study refreshing because it brings a new class of data to bear on the challenge of rigorously establishing a definition of physiological age, allowing us to understand how and why it differs from chronological age. EMR data are both abundant (to first order, there’s a record for every person in the US) and biomedically relevant. Therefore it stands to reason that models trained on such data should yield information that is clinically useful—both in guiding the care of individual patients and in teaching us about the overall trajectories of human aging.


Wang et al. “Predicting age by mining electronic medical records with deep learning characterizes differences between chronological and physiological age.” Journal of Biomedical Informatics 76: 59–69 (2018)