| What is multiple imputation? | ||||
| The general idea | ||||
| Multiple imputation is a statistical technique for analyzing incomplete data sets, that is, data sets for which some entries are missing. Application of the technique requires three steps: imputation, analysis and pooling. The figure illustrates these steps. | ||||
|
|
||||
![]() |
||||
|
IMPUTATION
|
Impute (=fill in) the missing entries of the incomplete data set, not once, but m times (m=3 in the figure). Imputed values are drawn from a distribution (that can be different for each missing entry). This step results in m completed data sets. | |||
|
ANALYSIS
|
Analyze each of the m completed data sets. This step results in m analyses. | |||
|
POOLING
|
Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses. | |||
|
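The IMPUTATION and ANALYSIS steps can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not a recommended imputation method: a single variable, values missing completely at random, and a normal model with a noninformative prior. The names `impute_once` and `multiply_impute` and the example data are hypothetical. The parameters are drawn before the missing values so that their uncertainty carries over into the imputations.

```python
import random
import statistics


def impute_once(values, rng):
    """Return one completed copy of `values`; None marks a missing entry.

    Assumes a univariate normal model with a noninformative prior.
    """
    observed = [v for v in values if v is not None]
    n = len(observed)
    ybar = statistics.mean(observed)
    s2 = statistics.variance(observed)
    # Draw the variance: (n - 1) * s2 divided by a chi-square(n - 1) draw.
    # A chi-square(k) variate is a gamma variate with shape k / 2 and scale 2.
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2)
    # Draw the mean given the drawn variance.
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    # Fill each missing entry with its own draw; observed entries stay as they are.
    return [v if v is not None else rng.gauss(mu, sigma2 ** 0.5) for v in values]


def multiply_impute(values, m=3, seed=0):
    """IMPUTATION: create m completed data sets."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]


incomplete = [4.1, None, 5.3, 6.0, None, 4.8, 5.5]  # made-up example data
completed = multiply_impute(incomplete, m=3)

# ANALYSIS: run the same complete-data analysis (here, the mean) on each data set.
estimates = [statistics.mean(data) for data in completed]
```

Each of the m completed data sets keeps the observed values and differs only in the imputed entries, so `estimates` holds m slightly different answers to the same question.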
Rubin (1987) has shown that if the method to create imputations is 'proper', then the resulting inferences will be statistically valid. The most challenging step is IMPUTATION, that is, the construction of the m completed data sets. This step accounts for the process that created the missing data. Typical problems are:
The repeated ANALYSIS step on the imputed data is actually somewhat simpler than the same analysis without imputation, since there is no need to bother with the missing data. The POOLING step consists of computing the mean over the m repeated analyses, its variance, and its confidence interval or P value. In general, these computations are relatively simple. |
||||
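As a rough illustration of the POOLING step, the sketch below combines m point estimates and their squared standard errors using the combining rules commonly attributed to Rubin (1987). The function name `pool` and the example numbers are hypothetical, and the degrees-of-freedom formula is the classic large-sample version. The sketch assumes the m estimates are not all identical.

```python
import statistics


def pool(estimates, variances):
    """Pool m estimates and their variances (squared standard errors).

    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)   # pooled point estimate
    u_bar = statistics.mean(variances)   # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # total variance
    # Degrees of freedom of the reference t distribution (assumes b > 0).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df


# Made-up results from m = 3 analyses.
q_bar, t, df = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A confidence interval then follows as the pooled estimate plus or minus a t quantile with `df` degrees of freedom times the square root of the total variance.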
| Further reading | ||||
| A good place to start with multiple imputation is the recent introductory article by Joe Schafer, Multiple imputation: a primer. Statistical Methods in Medical Research, 8:3-15, 1999. For the impatient, an online alternative is Schafer's page of FAQs on multiple imputation.
A slightly broader text appeared in the famous green and cheap SAGE series. Check out P.D. Allison's Missing Data (2001), which contains a gentle introduction to multiple imputation. Another expository paper is Schafer, J.L. and Olsen, M.K. (1998) Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33, 545-571. An earlier version is available online in Acrobat .pdf (190 kB) or Postscript .ps (639 kB) formats. Joe Schafer's page contains a wealth of other introductory material, including notes of lectures and short courses. Other recommended reading for starters includes Chapter 1 of Rubin's book (Rubin, D.B., 1987, Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons), but please don't be put off by the later chapters, and Rubin's online introduction.
At an intermediate level, Rubin's 1996 JASA paper (Rubin, D.B., 1996, Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91, 473-489) provides excellent insight into many issues in MI, as well as a fairly complete account of the literature up to 1996. The discussants provide contrasting views. The paper by Meng (X.L. Meng, 1994, Multiple Imputation with Uncongenial Sources of Input (with discussion). Statistical Science, 9, 538-574) deals with the important issue of how the imputation model relates to the intended complete-data analysis. Little's 1992 review (Little, R.J.A., 1992, Regression with missing X's: a review. Journal of the American Statistical Association, 87, 1227-1238) clarifies a number of conceptual issues associated with imputation.
Little & Rubin's Statistical analysis with missing data (1987) is a must for anyone who seriously deals with incomplete data problems. The second edition (2002) incorporates much more material about MI. Also very informative is the three-volume series Incomplete data in sample surveys (I, II, III) dated 1983, edited by Madow and others. Another excellent book is Schafer's Analysis of incomplete multivariate data (1997), which, in some respects, is a follow-up to the Rubin (1987) book. |
||||
| Learning by example | ||||
| It is helpful to study how other people solve their incomplete data problems. Various insightful case studies have been published. For medical applications, we recommend:
Barnard, J. and Meng, X.L. (1999). Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, 8, 17-36.
van Buuren, S., Boshuizen, H.C. and Knook, D.L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681-694.
A major application in survey research can be found in Clogg, C.C., Rubin, D.B., Schenker, N., Schultz, B. and Weidman, L. (1991). Multiple Imputation of Industry and Occupation Codes in Census Public-Use Samples Using Bayesian Logistic Regression. Journal of the American Statistical Association, 86, 413, 68-78.
A political science application appears in Rubin, D.B., Stern, H. and Vehovar, V. (1995). Handling 'Don't Know' Survey Responses: The Case of the Slovenian Plebiscite. Journal of the American Statistical Association, 90, 822-828. |
||||