
Multiple linear regression in Python

  • I can't seem to find any Python libraries that do multiple regression. The ones I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

    For example, with this data:

    print('y        x1      x2       x3       x4      x5     x6       x7')
    for t in texts:
        print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
              .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))
    (output for above:)

      y        x1       x2       x3        x4     x5     x6       x7
       -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
       -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
      -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
       -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
       -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
       -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
       -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
       -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
       -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

    How would I regress these in Python to get the linear regression formula:

    Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c

      December 19, 2020 11:02 AM IST
    0
  • Multiple Linear Regression

    Multiple linear regression simply means that we have several input features, such as f1, f2, f3, and f4, and one output feature, f5. As an example, suppose:

    f1 is the size of the house.

    f2 is the number of bedrooms in the house.

    f3 is the locality of the house.

    f4 is the condition of the house and,

    f5 is our output feature which is the price of the house.

    Each of these independent features has its own impact on the price of the house. In multiple linear regression, the simple linear regression equation y = A + Bx is extended to:

                                equation:  y = A + B1x1 + B2x2 + B3x3 + B4x4

    “If we have one dependent feature and multiple independent features, we call it multiple linear regression.”

     

    Our aim in multiple linear regression is to compute A, which is the intercept, and B1, B2, B3, B4, which are the slopes (coefficients) of the independent features. B1 tells us how much the price of the house changes when we increase x1 by one unit, and the same interpretation applies to B2, B3, and B4.
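
    To make the interpretation of the coefficients concrete, here is a minimal sketch with made-up values for A and B1..B4 (purely hypothetical, not fitted from any data): increasing x1 by one unit changes the prediction by exactly B1.

    # purely hypothetical intercept and slopes, just to show the interpretation
    A = 50.0                     # intercept
    B = [3.0, 10.0, 5.0, 2.0]    # slopes for f1..f4

    def predict_price(x1, x2, x3, x4):
        # y = A + B1*x1 + B2*x2 + B3*x3 + B4*x4
        return A + B[0]*x1 + B[1]*x2 + B[2]*x3 + B[3]*x4

    print(predict_price(101, 3, 7, 8) - predict_price(100, 3, 7, 8))   # 3.0 = B1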

     

    That is a short theoretical description of multiple linear regression; now we will use the scikit-learn LinearRegression class to solve a multiple linear regression problem.

     

    Dataset

    Now we apply multiple linear regression to the 50_Startups dataset.

     

    Reading dataset

    The dataset is a CSV file; to read it we use the pandas library:

    import pandas as pd

    df = pd.read_csv('50_Startups.csv')
    df

    You can see that there are 5 columns in the dataset; State holds categorical values and the rest are numerical features.
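
    A quick way to confirm this is to inspect the frame directly; a small sketch (it assumes the usual 50_Startups layout with the 'State' and 'Profit' columns used below):

    # quick inspection of the columns and their types
    df.info()                      # column names, dtypes, non-null counts
    print(df['State'].unique())    # the distinct categorical values in State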

     

    Now we have to separate the independent and dependent features:

    Independent and Dependent variables

    There are 5 columns in total; Profit is our dependent feature, and the rest are our independent features:

    # separate the other attributes from the predicting attribute
    x = df.drop('Profit', axis=1)
    # separate the predicting attribute into y for model training
    y = df['Profit']

     

    Handling categorical variables

    Our dataset has one categorical column, State. To encode the values in this column we use the pandas get_dummies() function:

     
    # encode the categorical variable
    states = pd.get_dummies(x['State'], drop_first=True)
    # dropping the original column
    x = x.drop('State', axis=1)
    # concatenation of the numerical variables and the new dummy variables
    x = pd.concat([x, states], axis=1)
    x
     

     

    Splitting Data

    Now we split the data into training and testing sets using the scikit-learn train_test_split() function.

    # importing train_test_split from sklearn
    from sklearn.model_selection import train_test_split
    # splitting the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

     

    Applying model

    Now we fit the linear regression model to our training data. First we import LinearRegression from scikit-learn; the same class is used for both simple and multiple linear regression, so no separate library is needed.

    # importing module
    from sklearn.linear_model import LinearRegression
    # creating an object of LinearRegression class
    LR = LinearRegression()
    # fitting the training data
    LR.fit(x_train,y_train)

    Once this runs the model is fitted; we can now use the x_test data to predict the profit.

    y_prediction =  LR.predict(x_test)
    y_prediction

     

    Now we compare the y_prediction values with the original values to measure how good our model is, using a metric called r2_score. Briefly:

    r2_score:

    It is a function in the sklearn.metrics module. The value of r2_score typically lies between 0 and 1 (often quoted as 0 to 100 percent), and it is closely related to the MSE.

    r2 is calculated by the formula given below:

                                formula:  r2 = 1 - (SSres / SStot)

    Here SSres is the sum of squared residuals between the original values y and the predicted values y^, and SStot is the total sum of squares (the sum of squared deviations of y from its mean).

    For a well-fitting model SSres is much smaller than SStot, so r2 comes out close to 1; in practice its value usually lies between 0.0 and 1.0.

    “The proportion of the variance in the dependent variable that is predictable from the independent variable(s).”

    The best possible score is 1.0 and it can be negative because the model can be arbitrarily worse. A constant model that always predicts the expected value of y, disregarding the input features, would get an R2 score of 0.0.
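
    To make the formula concrete, here is a minimal sketch with made-up observed and predicted values that computes r2 by hand and checks it against sklearn's r2_score:

    import numpy as np
    from sklearn.metrics import r2_score

    # made-up observed and predicted values, purely for illustration
    y_true = np.array([3.0, 5.0, 7.0, 9.0])
    y_pred = np.array([2.8, 5.3, 6.9, 9.2])

    ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    print(1 - ss_res / ss_tot)         # manual r2
    print(r2_score(y_true, y_pred))    # same value from sklearn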

    # importing r2_score and mean_squared_error
    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error
    import numpy as np
    # computing the accuracy score
    score = r2_score(y_test, y_prediction)
    print('r2 score is ', score)
    print('mean_sqrd_error is ', mean_squared_error(y_test, y_prediction))
    print('root_mean_squared_error is ', np.sqrt(mean_squared_error(y_test, y_prediction)))
     

    You can see that the r2 score is greater than 0.8, which means this multiple linear regression model fits the data well, and the mean squared error is also low.
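
    As a follow-up, the fitted intercept and slopes (the A and B1..Bn from the theory above) can be read straight off the model; a short sketch:

    # the fitted intercept (A) and one coefficient (Bi) per independent column
    print(LR.intercept_)
    for name, coef in zip(x_train.columns, LR.coef_):
        print(name, coef)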

      November 25, 2021 12:41 PM IST
    0
  • sklearn.linear_model.LinearRegression will do it:
    from sklearn import linear_model
    clf = linear_model.LinearRegression()
    clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
            [t.y for t in texts])

    Then clf.coef_ will have the regression coefficients.

    sklearn.linear_model also has similar interfaces for various kinds of regularized regression.
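
    If you want the full formula from the question (Y = a1x1 + ... + a7x7 + c), the fitted pieces are clf.coef_ and clf.intercept_; a small sketch:

    # reconstruct the regression formula from the fitted model
    terms = " + ".join("%.3f*x%d" % (a, i + 1) for i, a in enumerate(clf.coef_))
    print("Y = %s + %.3f" % (terms, clf.intercept_))

    The regularized variants mentioned above, such as linear_model.Ridge and linear_model.Lasso, expose the same fit/coef_ interface.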

      December 22, 2020 1:19 PM IST
    0
  • Use scipy.optimize.curve_fit. And not only for the linear fit.
    from scipy.optimize import curve_fit
    import numpy as np

    def fn(x, a, b, c):
        return a + b*x[0] + c*x[1]

    # y(x0,x1) data:
    #    x0=0 1 2
    # ___________
    # x1=0 |0 1 2
    # x1=1 |1 2 3
    # x1=2 |2 3 4

    x = np.array([[0,1,2,0,1,2,0,1,2],[0,0,0,1,1,1,2,2,2]])
    y = np.array([0,1,2,1,2,3,2,3,4])
    popt, pcov = curve_fit(fn, x, y)
    print(popt)
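
    To illustrate "not only for the linear fit", here is a small sketch fitting a made-up nonlinear (exponential) model with the same curve_fit call; the model and data are purely illustrative:

    import numpy as np
    from scipy.optimize import curve_fit

    def fn_exp(x, a, b, c):
        # hypothetical nonlinear model: exponential decay plus an offset
        return a * np.exp(-b * x) + c

    xdata = np.linspace(0, 4, 50)
    rng = np.random.default_rng(0)
    ydata = 2.5 * np.exp(-1.3 * xdata) + 0.5 + 0.05 * rng.normal(size=xdata.size)

    popt, pcov = curve_fit(fn_exp, xdata, ydata)
    print(popt)   # roughly [2.5, 1.3, 0.5]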
      December 22, 2020 2:13 PM IST
    0
  • I think this may be the easiest way to get this done:

    from random import random
    from pandas import DataFrame
    from statsmodels.api import OLS
    lr = lambda : [random() for i in range(100)]
    x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
    x['b'] = 1
    y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
    
    print(x.head())
    
             x1        x2        x3  b
    0  0.433681  0.946723  0.103422  1
    1  0.400423  0.527179  0.131674  1
    2  0.992441  0.900678  0.360140  1
    3  0.413757  0.099319  0.825181  1
    4  0.796491  0.862593  0.193554  1
    
    print(y.head())
    
    0    6.637392
    1    5.849802
    2    7.874218
    3    7.087938
    4    7.102337
    dtype: float64
    
    model = OLS(y, x)
    result = model.fit()
    print(result.summary())
    
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       1.000
    Model:                            OLS   Adj. R-squared:                  1.000
    Method:                 Least Squares   F-statistic:                 5.859e+30
    Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
    Time:                        15:17:32   Log-Likelihood:                 3224.9
    No. Observations:                 100   AIC:                            -6442.
    Df Residuals:                      96   BIC:                            -6431.
    Df Model:                           3                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
    x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
    x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
    b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
    ==============================================================================
    Omnibus:                        7.675   Durbin-Watson:                   1.614
    Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
    Skew:                           0.045   Prob(JB):                        0.210
    Kurtosis:                       2.140   Cond. No.                         6.89
    ==============================================================================
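
    Note that statsmodels' OLS does not add an intercept on its own, which is why the example sets x['b'] = 1 by hand. An equivalent sketch lets sm.add_constant append the intercept column instead (x and y as built above):

    import statsmodels.api as sm

    # drop the hand-made constant column and let statsmodels add one
    x_no_const = x.drop('b', axis=1)
    result2 = sm.OLS(y, sm.add_constant(x_no_const)).fit()
    print(result2.params)   # x1, x2, x3 coefficients plus the 'const' intercept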

     

      December 22, 2020 2:19 PM IST
    0
  • Here is a little workaround that I created. I checked it against R and it gives correct results.

    import numpy as np
    import statsmodels.api as sm
    
    y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
    
    x = [
         [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
         [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
         [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
         ]
    
    def reg_m(y, x):
        # build the design matrix: stack each predictor as a new leading column,
        # with a column of ones for the intercept (add_constant skips adding a
        # second constant once the ones column is present)
        ones = np.ones(len(x[0]))
        X = sm.add_constant(np.column_stack((x[0], ones)))
        for ele in x[1:]:
            X = sm.add_constant(np.column_stack((ele, X)))
        # ordinary least squares fit; the columns end up in reverse order of the
        # input lists, so x1 in the summary corresponds to the last list in x
        results = sm.OLS(y, X).fit()
        return results
    

    Result:

    print(reg_m(y, x).summary())
    

    Output:

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.535
    Model:                            OLS   Adj. R-squared:                  0.461
    Method:                 Least Squares   F-statistic:                     7.281
    Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
    Time:                        21:51:28   Log-Likelihood:                -26.025
    No. Observations:                  23   AIC:                             60.05
    Df Residuals:                      19   BIC:                             64.59
    Df Model:                           3                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    x1             0.2424      0.139      1.739      0.098        -0.049     0.534
    x2             0.2360      0.149      1.587      0.129        -0.075     0.547
    x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
    const          1.5704      0.633      2.481      0.023         0.245     2.895
    
    ==============================================================================
    Omnibus:                        6.904   Durbin-Watson:                   1.905
    Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
    Skew:                          -0.849   Prob(JB):                       0.0950
    Kurtosis:                       4.426   Cond. No.                         38.6
      December 22, 2020 3:43 PM IST
    0
  • Steps Involved in any Multiple Linear Regression Model

    Step #1: Data Pre Processing 

    1. Importing The Libraries.
    2. Importing the Data Set.
    3. Encoding the Categorical Data.
    4. Avoiding the Dummy Variable Trap.
    5. Splitting the Data set into Training Set and Test Set.

    Step #2: Fitting Multiple Linear Regression to the Training set 
    Step #3: Predicting the Test set results.

    import numpy as np
    import matplotlib as mpl
    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt

    def generate_dataset(n):
        x = []
        y = []
        random_x1 = np.random.rand()
        random_x2 = np.random.rand()
        for i in range(n):
            x1 = i
            x2 = i/2 + np.random.rand()*n
            x.append([1, x1, x2])
            y.append(random_x1 * x1 + random_x2 * x2 + 1)
        return np.array(x), np.array(y)

    x, y = generate_dataset(200)

    mpl.rcParams['legend.fontsize'] = 12

    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') is deprecated

    ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
    ax.legend()
    ax.view_init(45, 0)

    plt.show()
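
    The code above only generates and visualizes the data. A minimal sketch of Step #2 (fitting) and Step #3 (predicting) on the same generated arrays, using scikit-learn (an assumption here, since the snippet itself stops at the plot):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # x already contains a leading column of ones, so fit without an extra intercept
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    model = LinearRegression(fit_intercept=False)
    model.fit(x_train, y_train)          # Step #2: fit on the training set
    print(model.coef_)                   # approximately [1, random_x1, random_x2]
    print(model.predict(x_test[:5]))     # Step #3: predict on the test set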
    
      September 20, 2021 1:23 PM IST
    0
  • Here is an alternative and basic method:

    from patsy import dmatrices
    import statsmodels.api as sm
    
    y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
    ### y_data is the name of the dependent variable in your data ### 
    model_fit = sm.OLS(y,x)
    results = model_fit.fit()
    print(results.summary())
    

    Instead of sm.OLS you can also use sm.Logit, sm.Probit, and so on.
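
    Applied to the data in the question, the formula would simply list all seven predictors. A sketch, assuming the `texts` objects from the original post are available and their columns are named as below:

    import pandas as pd
    import statsmodels.api as sm
    from patsy import dmatrices

    # build a DataFrame from the question's records
    my_data = pd.DataFrame(
        [(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7) for t in texts],
        columns=['y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'])

    y, x = dmatrices("y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7", data=my_data)
    print(sm.OLS(y, x).fit().summary())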

     
    0

  • Finding a linear model such as this one can be handled with OpenTURNS.

    In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. To be more specific, it builds the following linear model:

    Y = a0 + a1.X1 + ... + an.Xn + epsilon,

    where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a CSV file, here is a simple script to get the regression coefficients ai:

    from __future__ import print_function
    import pandas as pd
    import openturns as ot
    
    # Assuming the data is a csv file with the given structure                          
    # Y X1 X2 .. X7
    df = pd.read_csv("./data.csv", sep="\s+")
    
    # Build a sample from the pandas dataframe
    sample = ot.Sample(df.values)
    
    # The observation points are in the first column (dimension 1)
    Y = sample[:, 0]
    
    # The input vector (X1,..,X7) of dimension 7
    X = sample[:, 1::]
    
    # Build a Linear model approximation
    result = ot.LinearModelAlgorithm(X, Y).getResult()
    
    # Get the coefficients ai
    print("coefficients of the linear regression model = ", result.getCoefficients())
    

    You can then easily get the confidence intervals with the following call:

    # Get the confidence intervals at 90% of the ai coefficients
    print(
        "confidence intervals of the coefficients = ",
        ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
    )
    

    You may find a more detailed example in the OpenTURNS examples.

      October 28, 2021 6:14 PM IST
    0