Imputation

Raji Reddy A


Imputation is the process of replacing missing values in a dataset. When analyzing data, simply removing the data points with missing values is often not a good approach, as it may introduce bias in the results. Imputation helps retain all the data by estimating each missing value from other available information.

Missing data is usually classified by the mechanism behind it. Two types are discussed here; a third, MAR (missing at random), in which the missingness depends only on observed values, is the mechanism that MICE assumes:

MCAR - missing completely at random: This is the desirable scenario in case of missing data. The missingness is unrelated to any values in the data, so it most likely does not point to a serious process issue, although a root cause is difficult to establish.

MNAR - missing not at random: This is a more serious issue. In this case, it might be wise to check the data gathering process further and try to understand why the information is missing rather than imputing the values. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?

Even assuming the data is MCAR, too much missing data can be a problem.

As a rule of thumb, a safe maximum threshold is about 5% of the total for large datasets.

If more than 5% of the values are missing for a certain feature or sample, it is usually better to leave that feature or sample out.
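The 5% rule of thumb above can be checked with a few lines of base R. This is a minimal sketch on a small hypothetical data frame (df and its columns are made up for illustration):

```r
# Small hypothetical data frame with some gaps.
df <- data.frame(
  a = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  b = c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = 1:10
)

# Share of missing values per feature (column) and per sample (row).
col_missing <- colMeans(is.na(df))
row_missing <- rowMeans(is.na(df))

# Features and samples above the 5% threshold.
threshold <- 0.05
names(col_missing)[col_missing > threshold]
which(row_missing > threshold)
```

Whether a flagged feature or sample is actually dropped remains a judgment call; the threshold is only a guide.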

Options for handling missing data:

Leave the data as is and go for a model which can handle missing data

Drop the NAs. This might weaken the ability to detect a signal, as part of the data is discarded

Opt for a data imputation methodology

Check if mean, median, mode replacements can help

Go ahead with advanced imputations

Observe the imputed values by corroborating with domain expertise.
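For the simple replacements mentioned in the list above, a minimal base R sketch might look like the following. The data frame and its column names (income, city) are hypothetical:

```r
# Hypothetical data with one numeric and one categorical column.
df <- data.frame(
  income = c(30, 45, NA, 60, 1000, NA),
  city   = c("A", "B", "A", NA, "A", "B"),
  stringsAsFactors = FALSE
)

# Numeric column: the median is less sensitive to the outlier (1000) than the mean.
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

# Categorical column: replace with the most frequent value (the mode).
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city
```

Replacements of this kind are quick, but they shrink the variance of the column, which is one reason to consider the more advanced methods that follow.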

MICE (Multiple Imputation by Chained Equations)

MICE is the primary method of imputation when the missing data follow the missing at random mechanism. It is also known as “fully conditional specification” and “sequential regression multiple imputation”. In multiple imputation, m sets of imputed values are generated rather than just one. The m datasets are analyzed separately and the results are consolidated into one set based on Rubin’s method (Rubin, 1987). There are many statistical packages available to perform multiple imputation. In R, the mice package is used to perform multiple imputation using the MICE method.
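As an illustration of how results from the m datasets are consolidated, the sketch below applies Rubin's rules to a single coefficient. The estimates and variances are hypothetical numbers chosen for the example, not results from any dataset in this article:

```r
# Hypothetical estimates of one coefficient from m = 5 imputed datasets,
# with their squared standard errors (within-imputation variances).
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)
se2 <- c(0.010, 0.012, 0.011, 0.009, 0.013)
m   <- length(est)

pooled_est <- mean(est)            # pooled point estimate
within     <- mean(se2)            # average within-imputation variance
between    <- var(est)             # between-imputation variance
total_var  <- within + (1 + 1 / m) * between

c(estimate = pooled_est, std.error = sqrt(total_var))
```

The pooled standard error is larger than the within-imputation part alone, because the between-imputation term carries the extra uncertainty caused by the missing values.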

The imputation method of MICE can be briefly explained as follows:

All the NAs (missing values) across all variables are first imputed with simple replacements such as the mean (or median or mode).

The mean imputations for one variable, say var1, are set back to NA.

var1 now acts as the dependent variable in a regression model, and all the other variables act as independent variables. The missing values of var1 are then predicted using a regression model suited to its type, such as linear or logistic regression.

The same process is followed to impute each remaining variable one at a time, with var1 now acting as one of the independent variables.

The above process repeats for the specified number of iterations, and finally a dataset with no missing values is generated.

After a dataset is generated, the process is repeated to generate multiple datasets. Determining the number of datasets to generate depends on the size of the dataset, the amount of missing information and the computational resources available.
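The steps above can be sketched in base R. This is a simplified illustration, not the mice implementation: it produces a single dataset and fills gaps with plain linear-regression predictions, whereas real MICE draws imputations with random variation. The toy data (built-in iris measurements with values removed at random) is an assumption made for the example:

```r
# Toy data: copy of the built-in iris measurements with some values removed.
set.seed(1)
df <- iris[, 1:4]
df$Sepal.Length[sample(nrow(df), 15)] <- NA
df$Petal.Width[sample(nrow(df), 15)] <- NA

miss <- is.na(df)                      # remember where the gaps were

# Step 1: fill every gap with the column mean.
for (v in names(df)) df[[v]][miss[, v]] <- mean(df[[v]], na.rm = TRUE)

# Remaining steps: cycle through the incomplete variables for a few iterations.
for (iter in 1:10) {
  for (v in names(df)[colSums(miss) > 0]) {
    obs <- !miss[, v]
    # Regress v on all other variables, using rows where v was observed.
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df[obs, ])
    # Overwrite the placeholder values of v with the model's predictions.
    df[[v]][miss[, v]] <- predict(fit, newdata = df[miss[, v], ])
  }
}

anyNA(df)
```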

MICE in R:

Consider the following dataset which consists of measurements of 150 Iris flowers. The first 10 rows are as follows (Complete dataset can be downloaded from the vault):

Before imputation, it is important to consider whether the data is missing at random or missing not at random, since MICE assumes the former (MCAR is a special case of it). The md.pattern() command is used to check the missing value pattern in the data; the pattern does not prove the mechanism, but it shows where the gaps are.

library(mice)  # provides md.pattern(), mice(), complete() and pool()

md.pattern(mydata)

Output:

From the above table, the missing data pattern can be observed for each column. For example, row 1 states that there are 95 rows in which Petal.Width, Sepal.Width, Petal.Length and Sepal.Length are all present, i.e. no data is missing; row 2 states that there are 14 rows in which only Sepal.Length is missing, and so on.
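The same kind of pattern summary can be approximated in base R. The sketch below uses a stand-in for mydata built from the built-in iris data with values removed at random; this is an assumption, not the article's file, so the counts differ from the table described above:

```r
# Stand-in for mydata: iris measurements with 12 values removed per column.
set.seed(42)
mydata <- iris[, 1:4]
for (v in names(mydata)) mydata[[v]][sample(nrow(mydata), 12)] <- NA

colSums(is.na(mydata))            # missing values per column
sum(complete.cases(mydata))       # rows with nothing missing

# Each distinct pattern (1 = observed, 0 = missing) with the number of rows showing it.
pattern <- apply(!is.na(mydata), 1, function(r) paste(as.integer(r), collapse = ""))
sort(table(pattern), decreasing = TRUE)
```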

The aggr() function from the VIM package can be used to generate a graphical representation of the missing value pattern.

library(VIM)

mice_plot = aggr(mydata, col=c('grey','red'), numbers=TRUE, sortVars=TRUE,

           labels=names(mydata), cex.axis=.7, gap=3,

           ylab=c("Missing data","Pattern"))

The imputation of the missing values is done with the mice() function. The mice package offers several imputation methods; for numeric variables it uses PMM (Predictive Mean Matching) by default.

Some other methods are logreg (logistic regression) for binary variables (with 2 levels) and polyreg (polytomous logistic regression) for unordered factor variables (more than 2 levels).

imputed_Data = mice(mydata, m=5, maxit = 50)

In the above code, m is the number of imputed datasets to generate, so each missing value receives m imputed values, and maxit is the number of iterations run to produce each set of imputations. All the imputed values can be viewed with imputed_Data$imp, and the imputed values of a single column can be viewed in the following way:

imputed_Data$imp$Sepal.Width

Output:

After the 5 datasets are generated, we can either inspect each of them and select one, or fit a model on every imputed dataset and combine the results with the pool() function. For example, selecting dataset 2 as the completed data is shown below:

completeData = complete(imputed_Data,2)
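For readers without the article's data, here is a small sketch of the same step on nhanes, an example dataset shipped with the mice package. The seed and the reduced maxit are arbitrary choices for the illustration:

```r
library(mice)

# nhanes is a small example dataset shipped with mice (not the article's data).
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

second  <- complete(imp, 2)                 # the second completed dataset
stacked <- complete(imp, action = "long")   # all 5 stacked, with .imp and .id columns

anyNA(second)
table(stacked$.imp)
```

The stacked form can be convenient for comparing the imputed values across the 5 datasets before deciding how to use them.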

Example of building a predictive model with all the imputed datasets:

fit = with(data = imputed_Data, expr = lm(Sepal.Width ~ Sepal.Length + Petal.Width))

combine = pool(fit)

Here, using all the 5 imputed datasets, a model is built to predict the Sepal.Width values using Sepal.Length and Petal.Width. Similarly, prediction models can be built for missing Sepal.Length, Petal.Width, Petal.Length values.
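To inspect the pooled result, summary() can be called on the pooled object. The sketch below again uses the nhanes example data shipped with mice rather than the article's data; the model formula and seed are chosen only for illustration:

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Fit the same linear model on each completed dataset, then pool the results.
fit <- with(imp, lm(bmi ~ age + chl))
pooled <- pool(fit)

# One row per model term, with the pooled estimate and standard error.
summary(pooled)
```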

Business Lens:

Most of the datasets used for modeling are large and contain several missing values. Simply discarding those values is not the right solution, as it affects the results. Imputation plays a major role in making the data ready for further analysis and building prediction models.

Let’s look at a business problem of a toothpaste brand that wants to gain insight into the behavioral patterns that exist among its customers.

The company does a disguised survey where the customers can rate the relative importance of the following attributes on a scale from 1 (Least Important) to 7 (Most Important).

Cavity prevention

Improve shine of the teeth

Improve gum strength

Fresh breath

Price

Instant results

The survey was administered to 160 participants and the data was collected. The survey was initially designed as an optional response survey. The marketing research team has now realized that this was not the optimum method and has requested the company’s Data Science team to deal with the missing data.

Here is the sample data:

If we observe the whole dataset, the last 10 rows are completely filled with NAs. As the survey was initially optional, users might have submitted empty survey forms. These rows provide no value for our analysis and, as mentioned earlier, MICE should be used only for values missing at random, which is not the case here (MNAR - missing not at random). So, discard the rows that consist entirely of NAs from the data.

# Keep only the rows that have at least one answer (drops the all-NA rows)

newdata = mydata[rowSums(is.na(mydata)) < ncol(mydata), ]
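Before discarding anything, it can help to confirm which rows are entirely empty. A minimal sketch on a hypothetical miniature survey (the survey data frame and its columns q1 to q3 are made up for illustration):

```r
# Hypothetical survey responses; the last two respondents left everything blank.
survey <- data.frame(
  q1 = c(5, NA, 7, NA, NA),
  q2 = c(3, 4, NA, NA, NA),
  q3 = c(6, 2, 1, NA, NA)
)

all_missing <- rowSums(is.na(survey)) == ncol(survey)
which(all_missing)                 # positions of the empty responses

cleaned <- survey[!all_missing, ]  # partially answered rows are kept for imputation
```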

Use md.pattern() and aggr() to observe the missing value patterns.

md.pattern(newdata)

Output:

mice_plot = aggr(newdata, col=c('grey','red'),numbers=TRUE, sortVars=TRUE,

           labels=names(newdata), cex.axis=.7, gap=3,

           ylab=c("Missing data","Pattern"))

Output:

Next, impute the data using the mice() function:

imputed_Data = mice(newdata, m=5, maxit = 50)

Now 5 datasets of imputed values are available. As discussed above, we can use the pool() function to build a model from all the datasets, or we can analyse each dataset and select one of our choice.

fit = with(data = imputed_Data, expr = lm(PreCavity ~ ShinyTeeth+StrongGum+

                                        FreshBreath+Price+Instanteffect))

combine = pool(fit)

Here, using all the 5 datasets generated using mice, a model is built to predict PreCavity based on all the remaining variables. Similarly, models can be built from the imputed datasets to predict all the other variables.

After all the NAs have been imputed successfully, the dataset can be used for further processes such as factor analysis or building predictive models.
