Regression is the process of analyzing the relationship between variables and building a mathematical model that can be used to predict the value of one variable (the label) from the values of the remaining variables (the features). Regression analysis is helpful in understanding how the value of the dependent variable changes when any of the independent variables is varied.
Dependent variable: It is the variable to be predicted, also known as the label, and usually represented by Y.
Independent variable: It is the predictor or explanatory variable, also known as the feature, and usually represented by X.
Linear regression is a form of regression analysis in which the relationship between the independent variable X and the dependent variable Y is modelled as a 1st degree polynomial in X, i.e. the relationship between Y and X is modelled by a straight line.
Simple linear regression is a model with a single dependent variable and a single independent variable. The observed data consists of n pairs:
(X1, Y1), (X2, Y2), ..., (Xn, Yn)
Polynomial regression is a form of regression analysis in which the relationship between the independent variable X and the dependent variable Y is modelled as an nth degree polynomial in X, i.e. the relationship between Y and X is modelled by a curved line.
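As a quick sketch of the idea (using simulated data, not the dataset discussed later), a quadratic relationship can be fitted in R with lm() and poly():

```r
# Simulated data with a quadratic relationship (illustrative only)
set.seed(1)
x = runif(100, 0, 10)
y = 2 + 0.5 * x - 0.3 * x^2 + rnorm(100)

# poly() builds polynomial terms of x up to the given degree
quad_model = lm(y ~ poly(x, degree = 2))
summary(quad_model)$r.squared
```

The same idea extends to higher degrees, though high-degree polynomials tend to overfit.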
Logistic regression is intended for the modelling of dichotomous categorical outcomes. As an alternative to modelling the value of the outcome, logistic regression focuses instead upon the relative probability (odds) of obtaining a given result category.
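A minimal sketch of a logistic fit on simulated data (the variable names here are illustrative):

```r
# Simulated dichotomous outcome (illustrative only)
set.seed(2)
x = rnorm(200)
p = 1 / (1 + exp(-(0.5 + 2 * x)))   # true probability from the logistic function
y = rbinom(200, size = 1, prob = p)

# family = binomial tells glm() to fit a logistic regression;
# the fitted coefficients are on the log-odds scale
logit_model = glm(y ~ x, family = binomial)
coef(logit_model)
```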
Multiple linear regression is an extension of simple linear regression in which the dependent variable Y is predicted based on 2 or more independent variables (say X1, X2, X3, etc). With K features, the observed data consists of n tuples:
((X1)1, (X2)1, ..., (XK)1, Y1),
((X1)2, (X2)2, ..., (XK)2, Y2),
...,
((X1)n, (X2)n, ..., (XK)n, Yn)
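A minimal sketch of a multiple regression fit on toy data, assuming two features X1 and X2:

```r
# Toy data with two features (illustrative only)
set.seed(3)
X1 = rnorm(100)
X2 = rnorm(100)
Y = 1 + 2 * X1 - 3 * X2 + rnorm(100, sd = 0.5)

# lm() estimates an intercept plus one coefficient per feature
multi_model = lm(Y ~ X1 + X2)
coef(multi_model)
```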
The simple linear regression model has the form

yi = β0 + β1 xi + ui

The above model represents the dependent variable, yi, as a linear function of one independent variable, xi, subject to a random disturbance or error term, ui. The task of estimation is to determine estimates b0 and b1 of the unknown parameters β0 and β1 respectively. The estimated equation will have the form

ŷi = b0 + b1 xi
The difference between the observed value yi and the fitted (predicted) value ŷi gives the error, which is also called the residual.
The basic technique for estimating the coefficients is Ordinary Least Squares (OLS): the estimates are chosen so as to minimize the sum of the squared residuals (SSR).
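For simple linear regression the OLS estimates have a closed form, which can be verified against lm() on toy data:

```r
# Toy data (illustrative only)
set.seed(4)
x = rnorm(50)
y = 3 + 1.5 * x + rnorm(50, sd = 0.2)

# Closed-form OLS estimates for simple linear regression:
#   slope     b1 = cov(x, y) / var(x)
#   intercept b0 = mean(y) - b1 * mean(x)
b1 = cov(x, y) / var(x)
b0 = mean(y) - b1 * mean(x)

# The hand-computed estimates match lm()'s
c(b0, b1)
coef(lm(y ~ x))
```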
Linear Regression in R:
Consider the following example of predicting the salary of a CEO given other variables. Here are the first 10 rows of the dataset with their column details.
#salary - salary in thousands of $
#pcsalary - % change in salary
#sales - firm sales in millions of $
#roe - avg. of return on equity
#pcroe - % change in roe
#ros - return on firm's stock
Here, salary is the label or dependent variable, which needs to be predicted based on the remaining variables (features).
Correlation Matrix:
Check the correlations and partial correlations of the feature variables and do factor analysis if necessary.
cor(mydata[,2:6])
Output:
# cor2pcor() computes partial correlations; it is from the corpcor package
cor2pcor(cor(mydata[,2:6]))
As no significant correlations are present between the features, they can all be used as standalone variables instead of combining them into factors.
Treat any outliers and missing values before moving ahead (refer to the Imputation article).
Check whether the data for each variable is skewed by using histograms. For example, consider the salary and sales variables.
hist(mydata$salary)
As the salary data is positively skewed, log transformation can be applied to improve the distribution of the data.
mydata$salary = log(mydata$salary)
hist(mydata$salary)
Similarly, the distribution of sales data can be transformed
mydata$sales = log(mydata$sales)
hist(mydata$sales)
Once the distribution of the data is improved, divide the dataset into 2 groups, one for training the model and the other for testing.
set.seed(121)
train = sample(1:nrow(mydata),nrow(mydata)/2)
test = -train
training_data = mydata[train,]
testing_data = mydata[test,]
testing_salary = mydata$salary[test]
set.seed() makes the random sampling reproducible. Here, half of the data is used for training and the other half for testing; the train/test proportions can be varied depending on the variance in the available dataset.
Creating a model from the training data
model = lm(salary~ .,data=training_data)
summary(model)
Residual standard error: 0.4919 on 98 degrees of freedom
Multiple R-squared: 0.207, Adjusted R-squared: 0.1665
F-statistic: 5.116 on 5 and 98 DF, p-value: 0.0003278
par(mfrow=c(2,2))
plot(model)
confint(model,level=0.95)
# residualPlots() and influenceIndexPlot() below are from the car package
residualPlots(model)
influence.measures(model)
influenceIndexPlot(model, id.n=3)
Delete observation 21, create a new model, and recheck the diagnostic plot.
mydataf1 = mydata[c(-21),]
model1 = lm(salary~ .,data=mydataf1)
influenceIndexPlot(model1, id.n=3)
Predict the values with the model and calculate the RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percent Error); the rmse() and mape() functions used below are available in add-on packages such as Metrics.
new_predict = predict.lm(model1,testing_data,type = 'response')
rmse(testing_salary,new_predict)
mape(testing_salary,new_predict)
> rmse(testing_salary,new_predict)
[1] 0.4670868
> mape(testing_salary,new_predict)
[1] 0.04580912
The RMSE depends on the units of the predicted variable; here, as a log transformation was applied to the salary column, the RMSE is also on that scale. As per the MAPE, the model predicts with an error of about 4.58%. The model can be further improved by treating outliers, transforming the remaining variables if necessary, and testing different factor rotations such as varimax or oblique.
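The definitions behind rmse() and mape() are simple enough to compute by hand, which makes their interpretation concrete. A minimal sketch with made-up numbers:

```r
# Hand-rolled versions of the two error metrics
rmse_manual = function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape_manual = function(actual, predicted) mean(abs((actual - predicted) / actual))

actual    = c(6.2, 6.8, 7.1, 6.5)   # e.g. log-salaries (illustrative values)
predicted = c(6.0, 7.0, 7.0, 6.6)
rmse_manual(actual, predicted)
mape_manual(actual, predicted)
```

Note that MAPE is unit-free (a proportion), which is why it is easier to read off than RMSE when the label has been transformed.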
BUSINESS LENS:
Consider a business scenario where a data scientist at a company is assigned the task of building a house price prediction engine that helps buyers and sellers identify the right price for a property. The most recent historical sales data has been digitized and contains the following variables.
head(originaldata, n = 10)
houses = subset(originaldata, select = -c(id, sqft_above, sqft_basement, grade))
houses$zipcode = as.factor(houses$zipcode)
houses$waterfront = as.factor(houses$waterfront)
After imputation, zipcode and waterfront will be replaced by dummy variables. A dummy variable is a numeric variable used to represent a subgroup; dummy coding uses only ones and zeros to convey all of the necessary information about group membership.
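A minimal sketch of how R's dummy coding looks (the zip codes here are made up):

```r
# A factor with 3 zip codes expands into 2 dummy columns; the first
# level acts as the reference and is absorbed into the intercept
zip = factor(c("98001", "98002", "98003", "98001"))
model.matrix(~ zip)
```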
The 2 features yr_renovated and yr_built can be replaced with a single variable: the age of the house. If the house was never renovated, the difference between the present year and yr_built gives the age; otherwise, the present year minus yr_renovated gives it.
houses$age = ifelse(houses$yr_renovated == 0, 2018 - houses$yr_built,2018 - houses$yr_renovated)
houses = subset(houses, select = -c(yr_built, yr_renovated))
Check the missing value patterns in the data and impute them.
# md.pattern() and mice() are from the mice package; aggr() is from the VIM package
md.pattern(houses, plot = T)
mice_plot = aggr(houses, col=c('grey','red'),
numbers=TRUE, sortVars=TRUE,
labels=names(houses), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
imputed_Data = mice(houses, m=3, maxit = 10, method = "cart")
From the above patterns, it can be observed that there are missing values in sqft_lot and bedrooms.
Pooling using all the datasets generated by mice imputation:
fit_bedrooms = with(data = imputed_Data,
                    expr = lm(bedrooms ~ price+bathrooms+sqft_living+
                              sqft_lot+floors+waterfront+
                              view+condition+zipcode+age))
fit_sqftlot = with(data = imputed_Data,
                   expr = lm(sqft_lot ~ price+bedrooms+bathrooms+
                             sqft_living+floors+waterfront+view+
                             condition+zipcode+age))
combine_bedrooms = pool(fit_bedrooms)
combine_sqftlot = pool(fit_sqftlot)
For more accurate results, the pooled models combine_bedrooms and combine_sqftlot can be used to predict the respective missing values. For now, let us fill the missing values using one of the imputed datasets.
housesimp1 = complete(imputed_Data,3)
Check the correlations and partial correlations of the numeric variables.
cor(housesimp1[,c(2,3,4,5,6,8,9,11)])
cor2pcor(cor(housesimp1[,c(2,3,4,5,6,8,9,11)]))
Dummy Variables:
Replace zipcode and waterfront with dummy variables
# dummyVars() is from the caret package
dmy = dummyVars(" ~ zipcode+waterfront", data = housesimp1, fullRank=T)
trsf = data.frame(predict(dmy, newdata = housesimp1))
housesimp1 = subset(housesimp1, select = -c(zipcode, waterfront))
housesimp1 = cbind(housesimp1, trsf)
Check the distribution of each variable in the dataset and transform if necessary.
hist(housesimp1$bathrooms)
curve(dnorm(x,mean(housesimp1$bathrooms),sd(housesimp1$bathrooms)),col="red",add=TRUE)
qqnorm(housesimp1$bathrooms)
qqline(housesimp1$bathrooms, col= "red")
qqPlot(housesimp1$bathrooms, distribution="norm")
The above graphs indicate that the bathrooms variable is positively skewed, so it should be transformed using the log function. As log is not defined for zero values, we add 1 before transforming the data.
housesimp1$bathrooms = log(housesimp1$bathrooms+1)
hist(housesimp1$bathrooms)
curve(dnorm(x,mean(housesimp1$bathrooms),sd(housesimp1$bathrooms)),col="red",add=TRUE)
qqnorm(housesimp1$bathrooms)
qqline(housesimp1$bathrooms, col= "red")
qqPlot(housesimp1$bathrooms, distribution="norm")
Now, the distribution of the bathrooms variable has improved after the transformation. Similarly, check the distribution of all the variables and transform where necessary.
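One way to mechanise this check is to compute a sample skewness for each numeric column and log-transform only the clearly right-skewed ones. This sketch uses toy data and a hand-rolled skewness function rather than the article's dataset:

```r
set.seed(5)

# Hand-rolled sample skewness (avoids an extra package dependency)
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

# Toy data: one positively skewed column, one roughly symmetric column
df = data.frame(a = rexp(500), b = rnorm(500))

# Flag clearly right-skewed columns and apply log1p (i.e. log(x + 1))
skews = sapply(df, skewness)
to_transform = names(skews)[skews > 1]
df[to_transform] = lapply(df[to_transform], log1p)
to_transform
```

The threshold of 1 is a rule of thumb; in practice, inspect the histograms as done above before and after transforming.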
Create Model:
Divide the data into training and testing data
set.seed(121)
train = sample(1:nrow(housesimp1),nrow(housesimp1) * 0.7)
test = -train
training_data = housesimp1[train,]
testing_data = housesimp1[test,]
testing_price = housesimp1$price[test]
Create the linear regression model with training data
model1 = lm(price~ .,data=training_data)
summary(model1)
Residual standard error: 0.2009 on 13908 degrees of freedom
Multiple R-squared: 0.8564, Adjusted R-squared: 0.8556
F-statistic: 1064 on 78 and 13908 DF, p-value: < 2.2e-16
Plot the residuals and check for any non-linear relationships in the variables.
residualPlots(model1)
The graphs for bedrooms, bathrooms and sqft_living are curved, which indicates non-linear relationships, so quadratic terms are added for those variables:
model2 = lm(price~ .+poly(sqft_living,degree = 2)+
poly(bathrooms,degree = 2)+
poly(bedrooms,degree=2), data=training_data)
summary(model2)
Residual standard error: 0.1973 on 13905 degrees of freedom
Multiple R-squared: 0.8615, Adjusted R-squared: 0.8607
F-statistic: 1068 on 81 and 13905 DF, p-value: < 2.2e-16
residualPlots(model2)
Deletion Diagnostics:
influenceIndexPlot(model2, id.n=3)
adjdata = housesimp1[c(-15871),]
adjmodel = lm(price~ .+poly(sqft_living,degree = 2)+
poly(bathrooms,degree = 2)+
poly(bedrooms,degree = 2),data=adjdata)
summary(adjmodel)
Output:
Residual standard error: 0.1963 on 19899 degrees of freedom
Multiple R-squared: 0.8629, Adjusted R-squared: 0.8623
F-statistic: 1546 on 81 and 19899 DF, p-value: < 2.2e-16
Repeat the diagnostic process and remove the outliers. Then predict the values using the finalised model and check the RMSE and MAPE errors.
new_predict = predict.lm(adjmodel,testing_data,type = 'response')
rmse(testing_price,new_predict)
mape(testing_price,new_predict)
> rmse(testing_price,new_predict)
[1] 0.1939318
> mape(testing_price,new_predict)
[1] 0.009398599
The RMSE depends on the units of the predicted variable; here it is low because the price values were log transformed (as they are positively skewed), so the error is on the log scale. The accuracy of the model can be improved by better treatment of outliers, data transformation, and deletion diagnostics.