Apparently the mice function sets contrasts for factors. So you get the following (check out the column names):
    contrasts(nhanes$age)
    ##   1 2
    ## 0 0 0
    ## 1 1 0
    ## 2 0 1

    contrasts(imp$data$age)
    ##   2 3
    ## 0 0 0
    ## 1 1 0
    ## 2 0 1

You can just change the contrasts of the imputed data, then you get the same dummy coding:
    imp <- mice(nhanes)
    contrasts(imp$data$age) <- contrasts(nhanes$age)
    fit <- with(imp, lm(chl ~ age + bmi))
    pool(fit)
    ## Call: pool(object = fit)
    ##
    ## Pooled coefficients:
    ## (Intercept)        age1        age2         bmi
    ##   0.9771566  47.6351257  63.1332336   6.2589887
    ##
    ## Fraction of information about the coefficients missing due to nonresponse:
    ## (Intercept)        age1        age2         bmi
    ##   0.3210118   0.5554399   0.6421063   0.3036489

MICE (Multivariate Imputation by Chained Equations) is one of the most commonly used imputation packages among R users. Creating multiple imputations, rather than a single imputation (such as the mean), accounts for the uncertainty in the missing values.
MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted from them. It imputes data variable by variable, specifying a separate imputation model for each variable.
For example, suppose we have variables X1, X2, ..., Xk. If X1 has missing values, it is regressed on the other variables, X2 to Xk, and the missing values in X1 are replaced by the predicted values. Similarly, if X2 has missing values, then X1 and X3 to Xk are used as independent variables in the prediction model, and the missing values in X2 are replaced with the predictions.
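The regress-and-replace cycle described above can be sketched in base R. This is only a toy illustration, not how mice is implemented: the data, the variable names `x1` to `x3`, and the choice of five passes are all made up for the example, and it plugs in plain `lm()` predictions without the random draws that a real multiple-imputation procedure adds.

```r
# Toy chained-equations pass in base R (illustrative sketch only).
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- x1 - x2 + rnorm(n)
dat <- data.frame(x1, x2, x3)

# Knock out some values in x1 and x2 (x3 stays complete)
dat$x1[sample(n, 15)] <- NA
dat$x2[sample(n, 15)] <- NA

miss <- is.na(dat)  # logical matrix: which cells were originally missing

# Start from mean imputation so every regression has complete predictors
filled <- dat
for (v in names(filled)) {
  filled[[v]][miss[, v]] <- mean(dat[[v]], na.rm = TRUE)
}

# Cycle through the incomplete variables a few times
for (iter in 1:5) {
  for (v in names(filled)) {
    if (!any(miss[, v])) next
    # Regress v on all other variables, using rows where v was observed
    fit <- lm(reformulate(setdiff(names(filled), v), response = v),
              data = filled[!miss[, v], ])
    # Replace the originally missing entries with the predictions
    filled[[v]][miss[, v]] <- predict(fit, newdata = filled[miss[, v], ])
  }
}

anyNA(filled)  # no missing values remain
```

Observed values are never overwritten; only the cells flagged in `miss` are updated on each pass.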
By default, mice imputes numeric variables with predictive mean matching, binary factors with logistic regression, and factors with more than two levels with polytomous (unordered) or proportional odds (ordered) regression. Once this cycle is complete, multiple data sets are generated. These data sets differ only in the imputed values. It is generally considered good practice to fit the model on each data set separately and then combine the results.
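A minimal sketch of that fit-separately-then-combine workflow, using the `nhanes` data that ships with mice. The number of imputations (`m = 5`), the seed, and the model formula are arbitrary choices for illustration.

```r
library(mice)

# Create 5 imputed data sets; printFlag = FALSE suppresses the iteration log
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Imputation method chosen for each variable ("" means nothing to impute)
imp$method

# Fit the same linear model on each of the imputed data sets
fit <- with(imp, lm(chl ~ age + bmi))

# Combine the per-data-set estimates into one set of pooled estimates
pooled <- pool(fit)
summary(pooled)

# A single completed data set can be extracted for inspection
head(complete(imp, 1))
```

Note that in the package's own `nhanes` data, `age` is numeric; if you convert it to a factor first, the contrast issue described at the top applies.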