Imputation

No, not when someone imputes malevolent intentions to your attempt to help! In research terminology this is all about “missing data”, i.e. not having values for some variables, from some people, in your dataset, and what you do about it. Imputation is the term for the ways of handling missing data that fill in the gaps.

Details

Imputation involves replacing the missing values with “imputed” values. There are two groups of ways to do this:

  • single imputation: each missing value is replaced with one imputed value
  • multiple imputation: the dataset is duplicated multiple times, with some sensible but “stochastic”, i.e. partly random, process creating replacements for each missing value, so the replacements vary across the copies.

Single imputation methods are usually one of the following.

  • Within participant methods:
      • Replacing the missing values with the mean across the non-missing values from that person (on that occasion). This is the basis of pro-rating.
      • Where a participant opts out of a study that involved repeated completions of some measurement(s), replacing the missing values with the most recent value from that person. This is the LOCF (Last Observation Carried Forward) process.
      • Where a participant stays in a study that involved repeated completions of some measurement(s) but does not give data for some occasions, in principle LOCF could be used, but it’s more common and probably more logical to interpolate the missing values from the values before and after. That can be done simply by taking their average or, a better method except in almost impossibly bizarre models of the data generation, by interpolating taking into account the time between the non-missing values and the time of the occasion when no value was recorded. This will produce findings similar to using MLM (Multi-Level Modelling) assuming a linear model with the participant as a level in the model.
  • Across participant imputation:
      • I think for single imputation this is always replacing missing values for that variable with the arithmetic mean across the values from the participants who did have a recorded value. In principle the median or some other estimate of central location might be used but I don’t think I’ve ever seen this.
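
To make the single imputation methods above concrete, here is a minimal sketch in plain Python. The function names (prorate, locf, interpolate_by_time, mean_impute) and the use of None to mark a missing value are my own assumptions for illustration, not taken from any particular package, and real analyses would normally use an established library.

```python
from statistics import mean

# Illustrative helpers only: names are hypothetical, None marks a missing value.

def prorate(items):
    """Within participant: fill missing item scores with that person's own mean."""
    observed = [v for v in items if v is not None]
    if not observed:
        return list(items)  # nothing to base an imputation on
    person_mean = mean(observed)
    return [person_mean if v is None else v for v in items]

def locf(scores):
    """Carry the most recent observed value forward over later missing occasions."""
    filled, last = [], None
    for v in scores:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value has been seen
    return filled

def interpolate_by_time(times, scores):
    """Fill gaps between two observed occasions linearly, weighting by elapsed time."""
    filled = list(scores)
    observed = [i for i, v in enumerate(scores) if v is not None]
    for left, right in zip(observed, observed[1:]):
        for i in range(left + 1, right):
            fraction = (times[i] - times[left]) / (times[right] - times[left])
            filled[i] = scores[left] + fraction * (scores[right] - scores[left])
    return filled

def mean_impute(values):
    """Across participants: fill missing values with the mean of the recorded ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    overall_mean = mean(observed)
    return [overall_mean if v is None else v for v in values]

if __name__ == "__main__":
    print(prorate([2, None, 4, 3]))                       # one person's items
    print(locf([10, 12, None, None]))                     # dropout after occasion 2
    print(interpolate_by_time([0, 1, 4], [10, None, 18])) # missed middle occasion
    print(mean_impute([5, None, 7, 9]))                   # one variable, four people
```

In the interpolation example the missed occasion is at week 1, between scores of 10 at week 0 and 18 at week 4, so the time-weighted value is 12 where the simple average of the neighbours would be 14.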

Multiple imputation methods have been actively developed over the last few decades. They involve creating some model of how individual scores are assumed to arise and how they relate to other variables in the data. The most used methods are MICE (Multiple Imputation by Chained Equations) and MIDAS (I think Multiple Imputation by Denoising AutoencoderS), with the latter perhaps starting to replace the former, at least as the trendy new thing to do.
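
The core idea of multiple imputation, several copies of the dataset with replacements that vary, can be sketched in a few lines of plain Python. This is a deliberately naive toy and not MICE or MIDAS: the function name multiply_impute is hypothetical, and the "model" is just a normal distribution fitted to the observed values of one variable, where real methods model how the variable relates to the others in the data.

```python
import random
from statistics import mean, stdev

def multiply_impute(values, m=5, seed=0):
    """Return m completed copies of `values` (None marks a missing value).

    Toy assumption: each missing entry is an independent random draw from a
    normal distribution with the mean and SD of the observed values. Needs at
    least two observed values so that the SD can be computed.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = [v for v in values if v is not None]
    centre, spread = mean(observed), stdev(observed)
    return [
        [rng.gauss(centre, spread) if v is None else v for v in values]
        for _ in range(m)
    ]

if __name__ == "__main__":
    scores = [4.0, None, 6.0, 5.0, None]
    copies = multiply_impute(scores, m=3, seed=1)
    for copy in copies:
        # Observed values are identical in every copy; imputed ones differ.
        print([round(v, 2) for v in copy], "mean:", round(mean(copy), 2))
    # Run the analysis (here just a mean) on each copy, then combine.
    print("combined mean:", round(mean(mean(c) for c in copies), 2))
```

The spread of the per-copy results is what reflects the uncertainty introduced by the missing values, which is exactly what single imputation hides.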

One catch with all imputation is that it is vulnerable to the way the data were actually generated not fitting the model assumed by the imputation method. In my view the greater catch, particularly when the rate of missingness is small, is that it lends itself to “blinding with science” and can distract reports from considering potentially much larger biases that may have arisen from the sampling frame and from complete non-participation.

Try also …

Missing Completely at Random (MCAR)
Missing values
MICE (Multiple Imputation by Chained Equations)
MIDAS (Multiple Imputation by Denoising AutoencoderS)

Chapters

Not mentioned in the OMbook.

Online resources

I’m unlikely to create any, I think: the issues are too complex and too general to make it easy.

Dates

Created 14/9/24.
