Regression is the process of analyzing the relationship between variables and building a mathematical model that can be used to predict the value of one variable (the label) from the values of the remaining variables (the features). Regression analysis is helpful in understanding how the value of the dependent variable changes when any of the independent variables is varied.
Dependent variable: It is the variable to be predicted, also known as the label, and usually represented by Y.
Independent variable: It is the predictor or explanatory variable, also known as the feature, and usually represented by X.
Linear regression is a form of regression analysis in which the relationship between the independent variable X and the dependent variable Y is modelled as a 1st degree polynomial in X, i.e. the relationship between Y and X is modelled by a straight line.
Simple linear regression is a model with a single dependent variable and a single independent variable. The observed data consists of n pairs:
(X1, Y1), (X2, Y2), ..., (Xn, Yn)
Polynomial regression is a form of regression analysis in which the relationship between the independent variable X and the dependent variable Y is modelled as an nth degree polynomial in X, i.e. the relationship between Y and X is modelled by a curved line.
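As a quick sketch of the idea (using simulated data, not the dataset discussed later), a quadratic relationship can be fitted in R with lm() and poly():

```r
# Simulated data with a quadratic relationship (illustrative only)
set.seed(1)
x = runif(100, 0, 10)
y = 2 + 0.5 * x - 0.3 * x^2 + rnorm(100)

# poly() builds polynomial terms of x up to the given degree
quad_model = lm(y ~ poly(x, degree = 2))
summary(quad_model)$r.squared
```

The same idea extends to higher degrees, though high-degree polynomials tend to overfit.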
Logistic regression is intended for the modelling of dichotomous categorical outcomes. As an alternative to modelling the value of the outcome, logistic regression focuses instead upon the relative probability (odds) of obtaining a given result category.
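A minimal sketch of a logistic fit on simulated data (the variable names here are illustrative):

```r
# Simulated dichotomous outcome (illustrative only)
set.seed(2)
x = rnorm(200)
p = 1 / (1 + exp(-(0.5 + 2 * x)))   # true probability from the logistic function
y = rbinom(200, size = 1, prob = p)

# family = binomial tells glm() to fit a logistic regression;
# the fitted coefficients are on the log-odds scale
logit_model = glm(y ~ x, family = binomial)
coef(logit_model)
```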
Multiple linear regression is an extension of simple linear regression in which the dependent variable Y is predicted based on 2 or more independent variables (say X1, X2, X3, etc). With K features, the observed data consists of n tuples:
((X1)1, (X2)1, ..., (XK)1, Y1),
((X1)2, (X2)2, ..., (XK)2, Y2),
...,
((X1)n, (X2)n, ..., (XK)n, Yn)
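A minimal sketch of a multiple regression fit on toy data, assuming two features X1 and X2:

```r
# Toy data with two features (illustrative only)
set.seed(3)
X1 = rnorm(100)
X2 = rnorm(100)
Y = 1 + 2 * X1 - 3 * X2 + rnorm(100, sd = 0.5)

# lm() estimates an intercept plus one coefficient per feature
multi_model = lm(Y ~ X1 + X2)
coef(multi_model)
```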
The simple linear regression model has the form

yi = β0 + β1 xi + ui

The above model represents the dependent variable, yi, as a linear function of one independent variable, xi, subject to a random disturbance or error term, ui. The task of estimation is to determine estimates b0 and b1 of the unknown parameters β0 and β1 respectively. The estimated equation will have the form

ŷi = b0 + b1 xi
The difference between the observed value yi and the fitted (predicted) value ŷi gives the error, which is also called the residual.
The basic technique for estimating the coefficients is Ordinary Least Squares (OLS): the estimates are chosen so as to minimize the sum of the squared residuals (SSR).
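For simple linear regression the OLS estimates have a closed form, which can be verified against lm() on toy data:

```r
# Toy data (illustrative only)
set.seed(4)
x = rnorm(50)
y = 3 + 1.5 * x + rnorm(50, sd = 0.2)

# Closed-form OLS estimates for simple linear regression:
#   slope     b1 = cov(x, y) / var(x)
#   intercept b0 = mean(y) - b1 * mean(x)
b1 = cov(x, y) / var(x)
b0 = mean(y) - b1 * mean(x)

# The hand-computed estimates match lm()'s
c(b0, b1)
coef(lm(y ~ x))
```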
Linear Regression in R:
Consider the following example of predicting the salary of a CEO given other variables. Here are the first 10 rows of the dataset with their column details.
#salary - salary in thousands of $
#pcsalary - % change in salary
#sales - firm sales in millions of $
#roe - avg. of return on equity
#pcroe - % change in roe
#ros - return on firm's stock
Here, salary is the label or dependent variable, which needs to be predicted based on the remaining variables (features).
Correlation Matrix:
Check the correlations and partial correlations of the feature variables and do factor analysis if necessary.
cor(mydata[,2:6])
Output:
# cor2pcor() computes partial correlations; it is from the corpcor package
cor2pcor(cor(mydata[,2:6]))
As no significant correlations are present between the features, they can all be used as standalone variables instead of combining them into factors.
Treat any outliers and missing values before moving ahead (refer to the Imputation article).
Check whether the data for each variable is skewed by using histograms. For example, consider the salary and sales variables.
hist(mydata$salary)
As the salary data is positively skewed, log transformation can be applied to improve the distribution of the data.
mydata$salary = log(mydata$salary)
hist(mydata$salary)
Similarly, the distribution of sales data can be transformed
mydata$sales = log(mydata$sales)
hist(mydata$sales)
Once the distribution of the data is improved, divide the dataset into 2 groups, one for training the model and the other for testing.
set.seed(121)
train = sample(1:nrow(mydata),nrow(mydata)/2)
test = -train
training_data = mydata[train,]
testing_data = mydata[test,]
testing_salary = mydata$salary[test]
set.seed() makes the random sampling reproducible. Here, half of the data is used for training and the other half for testing; the train/test proportions can be varied depending on the variance in the available dataset.
Creating a model from the training data
model = lm(salary~ .,data=training_data)
summary(model)
Residual standard error: 0.4919 on 98 degrees of freedom
Multiple R-squared: 0.207, Adjusted R-squared: 0.1665
F-statistic: 5.116 on 5 and 98 DF, p-value: 0.0003278
par(mfrow=c(2,2))
plot(model)
confint(model,level=0.95)
# residualPlots() and influenceIndexPlot() below are from the car package
residualPlots(model)
influence.measures(model)
influenceIndexPlot(model, id.n=3)
Delete observation 21, create a new model, and recheck the diagnostic plot.
mydataf1 = mydata[c(-21),]
model1 = lm(salary~ .,data=mydataf1)
influenceIndexPlot(model1, id.n=3)
Predict the values with the model and calculate the RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percent Error); the rmse() and mape() functions used below are available in add-on packages such as Metrics.
new_predict = predict.lm(model1,testing_data,type = 'response')
rmse(testing_salary,new_predict)
mape(testing_salary,new_predict)
> rmse(testing_salary,new_predict)
[1] 0.4670868
> mape(testing_salary,new_predict)
[1] 0.04580912
The RMSE depends on the units of the predicted variable; here, as a log transformation was applied to the salary column, the RMSE is also on that scale. As per the MAPE, the model predicts with an error of about 4.58%. The model can be further improved by treating outliers, transforming the remaining variables if necessary, and testing different factor rotations such as varimax or oblique.
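The definitions behind rmse() and mape() are simple enough to compute by hand, which makes their interpretation concrete. A minimal sketch with made-up numbers:

```r
# Hand-rolled versions of the two error metrics
rmse_manual = function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape_manual = function(actual, predicted) mean(abs((actual - predicted) / actual))

actual    = c(6.2, 6.8, 7.1, 6.5)   # e.g. log-salaries (illustrative values)
predicted = c(6.0, 7.0, 7.0, 6.6)
rmse_manual(actual, predicted)
mape_manual(actual, predicted)
```

Note that MAPE is unit-free (a proportion), which is why it is easier to read off than RMSE when the label has been transformed.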
BUSINESS LENS:
Consider a business scenario where a data scientist at a company is assigned the task of building a house price prediction engine that helps buyers and sellers identify the right price for a property. The most recent historical sales data has been digitized and contains the following variables.
head(originaldata, n = 10)
houses = subset(originaldata, select = -c(id, sqft_above, sqft_basement, grade))
houses$zipcode = as.factor(houses$zipcode)
houses$waterfront = as.factor(houses$waterfront)
After imputation, zipcode and waterfront will be replaced by dummy variables. A dummy variable is a numeric variable used to represent a subgroup; dummy coding uses only ones and zeros to convey all of the necessary information about group membership.
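A minimal sketch of how R's dummy coding looks (the zip codes here are made up):

```r
# A factor with 3 zip codes expands into 2 dummy columns; the first
# level acts as the reference and is absorbed into the intercept
zip = factor(c("98001", "98002", "98003", "98001"))
model.matrix(~ zip)
```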
The 2 features yr_renovated and yr_built can be replaced with a single variable: the age of the house. If the house was never renovated, the difference between the present year and yr_built gives the age; otherwise, the present year minus yr_renovated gives it.
houses$age = ifelse(houses$yr_renovated == 0, 2018 - houses$yr_built,2018 - houses$yr_renovated)
houses = subset(houses, select = -c(yr_built, yr_renovated))
Check the missing value patterns in the data and impute them.
# md.pattern() and mice() are from the mice package; aggr() is from the VIM package
md.pattern(houses, plot = T)
mice_plot = aggr(houses, col=c('grey','red'),
numbers=TRUE, sortVars=TRUE,
labels=names(houses), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
imputed_Data = mice(houses, m=3, maxit = 10, method = "cart")
From the above patterns, it can be observed that there are missing values in sqft_lot and bedrooms.
Pooling using all the datasets generated by mice imputation:
fit_bedrooms = with(data = imputed_Data,
                    expr = lm(bedrooms ~ price+bathrooms+sqft_living+
                              sqft_lot+floors+waterfront+
                              view+condition+zipcode+age))
fit_sqftlot = with(data = imputed_Data,
                   expr = lm(sqft_lot ~ price+bedrooms+bathrooms+
                             sqft_living+floors+waterfront+view+
                             condition+zipcode+age))
combine_bedrooms = pool(fit_bedrooms)
combine_sqftlot = pool(fit_sqftlot)
For more accurate results, the pooled models combine_bedrooms and combine_sqftlot can be used to predict the respective missing values. For now, let us fill the missing values using one of the imputed datasets.
housesimp1 = complete(imputed_Data,3)
Check the correlations and partial correlations of the numeric variables.
cor(housesimp1[,c(2,3,4,5,6,8,9,11)])
cor2pcor(cor(housesimp1[,c(2,3,4,5,6,8,9,11)]))
Dummy Variables:
Replace zipcode and waterfront with dummy variables
# dummyVars() is from the caret package
dmy = dummyVars(" ~ zipcode+waterfront", data = housesimp1, fullRank=T)
trsf = data.frame(predict(dmy, newdata = housesimp1))
housesimp1 = subset(housesimp1, select = -c(zipcode, waterfront))
housesimp1 = cbind(housesimp1, trsf)
Check the distribution of each variable in the dataset and transform if necessary.
hist(housesimp1$bathrooms)
curve(dnorm(x,mean(housesimp1$bathrooms),sd(housesimp1$bathrooms)),col="red",add=TRUE)
qqnorm(housesimp1$bathrooms)
qqline(housesimp1$bathrooms, col= "red")
qqPlot(housesimp1$bathrooms, distribution="norm")
The above graphs indicate that the bathrooms variable is positively skewed, so it should be transformed using the log function. As log is not defined for zero values, we add 1 before transforming the data.
housesimp1$bathrooms = log(housesimp1$bathrooms+1)
hist(housesimp1$bathrooms)
curve(dnorm(x,mean(housesimp1$bathrooms),sd(housesimp1$bathrooms)),col="red",add=TRUE)
qqnorm(housesimp1$bathrooms)
qqline(housesimp1$bathrooms, col= "red")
qqPlot(housesimp1$bathrooms, distribution="norm")
Now, the distribution of the bathrooms variable has improved after the transformation. Similarly, check the distribution of all the variables and transform where necessary.
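One way to mechanise this check is to compute a sample skewness for each numeric column and log-transform only the clearly right-skewed ones. This sketch uses toy data and a hand-rolled skewness function rather than the article's dataset:

```r
set.seed(5)

# Hand-rolled sample skewness (avoids an extra package dependency)
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

# Toy data: one positively skewed column, one roughly symmetric column
df = data.frame(a = rexp(500), b = rnorm(500))

# Flag clearly right-skewed columns and apply log1p (i.e. log(x + 1))
skews = sapply(df, skewness)
to_transform = names(skews)[skews > 1]
df[to_transform] = lapply(df[to_transform], log1p)
to_transform
```

The threshold of 1 is a rule of thumb; in practice, inspect the histograms as done above before and after transforming.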
Create Model:
Divide the data into training and testing data
set.seed(121)
train = sample(1:nrow(housesimp1),nrow(housesimp1) * 0.7)
test = -train
training_data = housesimp1[train,]
testing_data = housesimp1[test,]
testing_price = housesimp1$price[test]
Create the linear regression model with training data
model1 = lm(price~ .,data=training_data)
summary(model1)
Residual standard error: 0.2009 on 13908 degrees of freedom
Multiple R-squared: 0.8564, Adjusted R-squared: 0.8556
F-statistic: 1064 on 78 and 13908 DF, p-value: < 2.2e-16
Plot the residuals and check for any non-linear relationships in the variables.
residualPlots(model1)
The graphs for bedrooms, bathrooms and sqft_living are curved, which indicates non-linear relationships, so quadratic terms are added for those variables:
model2 = lm(price~ .+poly(sqft_living,degree = 2)+
poly(bathrooms,degree = 2)+
poly(bedrooms,degree=2), data=training_data)
summary(model2)
Residual standard error: 0.1973 on 13905 degrees of freedom
Multiple R-squared: 0.8615, Adjusted R-squared: 0.8607
F-statistic: 1068 on 81 and 13905 DF, p-value: < 2.2e-16
residualPlots(model2)
Deletion Diagnostics:
influenceIndexPlot(model2, id.n=3)
adjdata = housesimp1[c(-15871),]
adjmodel = lm(price~ .+poly(sqft_living,degree = 2)+
poly(bathrooms,degree = 2)+
poly(bedrooms,degree = 2),data=adjdata)
summary(adjmodel)
Output:
Residual standard error: 0.1963 on 19899 degrees of freedom
Multiple R-squared: 0.8629, Adjusted R-squared: 0.8623
F-statistic: 1546 on 81 and 19899 DF, p-value: < 2.2e-16
Repeat the diagnostic process and remove the outliers. Then predict the values using the finalised model and check the RMSE and MAPE errors.
new_predict = predict.lm(adjmodel,testing_data,type = 'response')
rmse(testing_price,new_predict)
mape(testing_price,new_predict)
> rmse(testing_price,new_predict)
[1] 0.1939318
> mape(testing_price,new_predict)
[1] 0.009398599
The RMSE depends on the units of the predicted variable; here it is low because the price values were log transformed (as they are positively skewed), so the error is on the log scale. The accuracy of the model can be improved by better treatment of outliers, data transformation, and deletion diagnostics.