
Using mice in R changes dummy coding

  • I'm using the mice package in R for a project and found that the pooled output appears to change the dummy coding of one of my variables.

    To elaborate, say I have a factor, foo, with two levels: 0 and 1. A regular lm would typically yield an estimate for foo1. Using mice and the pool function, however, yields an estimate for foo2. I included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be occurring?

    library(mice)

    # Create age as: 0, 1, 2
    nhanes$age <- as.factor(nhanes$age - 1)
    head(nhanes)

    #   age  bmi hyp chl
    # 1   0   NA  NA  NA
    # 2   1 22.7   1 187
    # 3   0   NA   1 187
    # 4   2   NA  NA  NA
    # 5   0 20.4   1 113
    # 6   2   NA  NA 184

    # Use a regular lm with missing data just to see output
    # age1 and age2 come up as expected

    lm(chl ~ age + bmi, data = nhanes)

    # Call:
    # lm(formula = chl ~ age + bmi, data = nhanes)

    # Coefficients:
    # (Intercept)         age1         age2          bmi
    #     -28.948       55.810      104.724        6.921

    imp <- mice(nhanes)
    str(complete(imp)) # still the same coding

    fit <- with(imp, lm(chl ~ age + bmi))
    pool(fit)

    # Now the estimates are for age2 and age3

    # Call: pool(object = fit)

    # Pooled coefficients:
    # (Intercept)         age2         age3          bmi
    #    29.88431     43.76159     56.57606      5.05537
    This post was edited by Shivakumar Kota at May 23, 2019 1:07 PM IST
      May 23, 2019 1:06 PM IST
  • Apparently the mice function sets contrasts for factors. So you get the following (check out the column names):

    contrasts(nhanes$age)
    ##   1 2
    ## 0 0 0
    ## 1 1 0
    ## 2 0 1

    contrasts(imp$data$age)
    ##   2 3
    ## 0 0 0
    ## 1 1 0
    ## 2 0 1

    You can just change the contrasts of the imputed data, then you get the same dummy coding:

    imp <- mice(nhanes)
    contrasts(imp$data$age) <- contrasts(nhanes$age)
    fit <- with(imp, lm(chl ~ age + bmi))
    pool(fit)

    ## Call: pool(object = fit)
    ##
    ## Pooled coefficients:
    ## (Intercept)        age1        age2         bmi
    ##   0.9771566  47.6351257  63.1332336   6.2589887
    ##
    ## Fraction of information about the coefficients missing due to nonresponse:
    ## (Intercept)        age1        age2         bmi
    ##   0.3210118   0.5554399   0.6421063   0.3036489
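
    The relabelling can be reproduced without mice. The sketch below is base R only and assumes, as the contrasts() output above suggests, that the attached contrast matrix has its columns labelled by level position (2, 3) rather than by level name (1, 2). The variables y, age and age_pos are made up for illustration.

    ```r
    set.seed(1)

    # Hypothetical toy data: a three-level factor coded 0, 1, 2
    age <- factor(rep(c(0, 1, 2), each = 4))
    y <- rnorm(12)

    # Default treatment contrasts: columns are named after the levels
    fit_default <- lm(y ~ age)
    names(coef(fit_default))

    # Attach a contrast matrix whose columns are named by position ("2", "3")
    age_pos <- age
    contrasts(age_pos) <- contr.treatment(nlevels(age_pos))
    fit_pos <- lm(y ~ age_pos)
    names(coef(fit_pos))

    # Both fits use the same design matrix, so the estimates should agree
    all.equal(unname(coef(fit_default)), unname(coef(fit_pos)))
    ```

    If that assumption holds, only the coefficient labels differ between the two fits; the reference level and the estimates stay the same.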
    This post was edited by Rakesh Racharla at May 23, 2019 1:10 PM IST
      May 23, 2019 1:10 PM IST
  • MICE (Multivariate Imputation by Chained Equations) is one of the most commonly used imputation packages among R users. Creating multiple imputations, rather than a single imputation (such as the mean), accounts for the uncertainty in the missing values.

    MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on observed values and can be predicted from them. It imputes data on a variable-by-variable basis by specifying an imputation model for each variable.
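
    A small simulation may make the MAR assumption concrete. This is only an illustrative sketch in base R with made-up variables x and y and arbitrary coefficients: values of y are deleted with a probability that depends only on the fully observed x, so the missingness can be predicted from observed data.

    ```r
    set.seed(42)
    n <- 1000

    # Hypothetical data: x is always observed, y depends on x
    x <- rnorm(n)
    y <- 2 * x + rnorm(n)

    # MAR mechanism: the chance that y is missing depends only on x
    p_miss <- plogis(-1 + 1.5 * x)
    y_obs <- ifelse(runif(n) < p_miss, NA, y)

    # The missingness indicator can be modelled from the observed x
    miss_fit <- glm(is.na(y_obs) ~ x, family = binomial)
    coef(miss_fit)
    ```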

    For example, suppose we have variables X1, X2, ..., Xk. If X1 has missing values, it is regressed on the other variables X2 to Xk, and the missing values in X1 are replaced by the resulting predictions. Similarly, if X2 has missing values, X1 and X3 to Xk are used as predictors in its imputation model, and its missing values are then replaced by the predicted values.

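    By default, mice imputes numeric variables with predictive mean matching (pmm), two-level factors with logistic regression (logreg), unordered factors with more than two levels with polytomous regression (polyreg), and ordered factors with a proportional odds model (polr). Once the chained-equation cycles are complete, multiple completed data sets are generated (five by default). These data sets differ only in the imputed values. It is generally considered good practice to fit the model on each data set separately and then combine the results, which is what with() and pool() do in mice.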
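
    The variable-by-variable cycle described above can be sketched in base R. This is a deliberately simplified illustration, not what mice actually runs: it uses plain lm() predictions with no random draw and produces a single completed data set, whereas mice adds randomness and repeats the process to create several. The built-in airquality data and the choice of five iterations are assumptions made for the example.

    ```r
    # Built-in data with missing values in Ozone and Solar.R
    data <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
    miss <- is.na(data)

    # Start by filling every missing cell with the column mean
    filled <- data
    for (v in names(filled)) {
      filled[[v]][miss[, v]] <- mean(data[[v]], na.rm = TRUE)
    }

    # Variables that actually need imputing
    targets <- names(filled)[colSums(miss) > 0]

    # Cycle through the incomplete variables a few times
    for (iter in 1:5) {
      for (v in targets) {
        # Regress v on all other variables, using rows where v was observed
        form <- reformulate(setdiff(names(filled), v), response = v)
        fit <- lm(form, data = filled[!miss[, v], ])
        # Replace the originally missing entries of v with predictions
        filled[[v]][miss[, v]] <- predict(fit, newdata = filled[miss[, v], ])
      }
    }

    summary(filled)
    ```

    Observed values are never overwritten; only the cells that were missing in the original data are updated on each pass.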


      August 14, 2021 1:21 PM IST