QBoard » Statistical modeling » Stats - Tech » Find p-value (significance) in scikit-learn LinearRegression

Find p-value (significance) in scikit-learn LinearRegression

  • How can I find the p-value (significance) of each coefficient?

    lm = sklearn.linear_model.LinearRegression()
    lm.fit(x,y)
      September 19, 2020 2:46 PM IST
    0
  • scikit-learn's LinearRegression doesn't calculate this information but you can easily extend the class to do it:

    from sklearn import linear_model
    from scipy import stats
    import numpy as np
    
    
    class LinearRegression(linear_model.LinearRegression):
        """
        LinearRegression class after sklearn's, but calculate t-statistics
        and p-values for model coefficients (betas).
        Additional attributes available after .fit()
        are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
        which is (n_features, n_coefs)
        This class sets the intercept to 0 by default, since usually we include it
        in X.
        """
    
        def __init__(self, *args, **kwargs):
            if not "fit_intercept" in kwargs:
                kwargs['fit_intercept'] = False
            super(LinearRegression, self)\
                    .__init__(*args, **kwargs)
    
        def fit(self, X, y, n_jobs=1):
            self = super(LinearRegression, self).fit(X, y, n_jobs)
    
            sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
            se = np.array([
                np.sqrt(np.diagonal(sse * np.linalg.inv(np.dot(X.T, X))))
                                                        for i in range(sse.shape[0])
                        ])
    
            self.t = self.coef_ / se
            self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
            return self

     

    You should take a look at statsmodels for this kind of statistical analysis in Python.

     
      September 19, 2020 4:12 PM IST
    0
  • An easy way to pull of the p-values is to use statsmodels regression:

    import statsmodels.api as sm
    mod = sm.OLS(Y,X)
    fii = mod.fit()
    p_values = fii.summary2().tables[1]['P>|t|']

    You get a series of p-values that you can manipulate (for example choose the order you want to keep by evaluating each p-value):

      September 19, 2020 4:13 PM IST
    0
  • p_value is among f statistics. if you want to get the value, simply use this few lines of code:

    import statsmodels.api as sm
    from scipy import stats
    
    diabetes = datasets.load_diabetes()
    X = diabetes.data
    y = diabetes.target
    
    X2 = sm.add_constant(X)
    est = sm.OLS(y, X2)
    print(est.fit().f_pvalue)
      September 19, 2020 4:15 PM IST
    0
  • You can use scipy for p-value. This code is from scipy documentation.

    >>> from scipy import stats
    >>> import numpy as np
    >>> x = np.random.random(10)
    >>> y = np.random.random(10)
    >>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
      September 19, 2020 4:16 PM IST
    0
  • This is kind of overkill but let's give it a go. First lets use statsmodel to find out what the p-values should be

    import pandas as pd
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from scipy import stats
    
    diabetes = datasets.load_diabetes()
    X = diabetes.data
    y = diabetes.target
    
    X2 = sm.add_constant(X)
    est = sm.OLS(y, X2)
    est2 = est.fit()
    print(est2.summary())
     

    and we get

                             OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.518
    Model:                            OLS   Adj. R-squared:                  0.507
    Method:                 Least Squares   F-statistic:                     46.27
    Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
    Time:                        10:08:24   Log-Likelihood:                -2386.0
    No. Observations:                 442   AIC:                             4794.
    Df Residuals:                     431   BIC:                             4839.
    Df Model:                          10                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const        152.1335      2.576     59.061      0.000     147.071     157.196
    x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
    x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
    x3           519.8398     66.534      7.813      0.000     389.069     650.610
    x4           324.3904     65.422      4.958      0.000     195.805     452.976
    x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
    x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
    x7           101.0446    212.533      0.475      0.635    -316.685     518.774
    x8           177.0642    161.476      1.097      0.273    -140.313     494.442
    x9           751.2793    171.902      4.370      0.000     413.409    1089.150
    x10           67.6254     65.984      1.025      0.306     -62.065     197.316
    ==============================================================================
    Omnibus:                        1.506   Durbin-Watson:                   2.029
    Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
    Skew:                           0.017   Prob(JB):                        0.496
    Kurtosis:                       2.726   Cond. No.                         227.
    ==============================================================================

    Ok, let's reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.

    lm = LinearRegression()
    lm.fit(X,y)
    params = np.append(lm.intercept_,lm.coef_)
    predictions = lm.predict(X)
    
    newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
    MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))
    
    # Note if you don't want to use a DataFrame replace the two lines above with
    # newX = np.append(np.ones((len(X),1)), X, axis=1)
    # MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
    
    var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
    sd_b = np.sqrt(var_b)
    ts_b = params/ sd_b
    
    p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX[0])))) for i in ts_b]
    
    sd_b = np.round(sd_b,3)
    ts_b = np.round(ts_b,3)
    p_values = np.round(p_values,3)
    params = np.round(params,4)
    
    myDF3 = pd.DataFrame()
    myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
    print(myDF3)
     

    And this gives us.

       Coefficients  Standard Errors  t values  Probabilities
    0       152.1335            2.576    59.061         0.000
    1       -10.0122           59.749    -0.168         0.867
    2      -239.8191           61.222    -3.917         0.000
    3       519.8398           66.534     7.813         0.000
    4       324.3904           65.422     4.958         0.000
    5      -792.1842          416.684    -1.901         0.058
    6       476.7458          339.035     1.406         0.160
    7       101.0446          212.533     0.475         0.635
    8       177.0642          161.476     1.097         0.273
    9       751.2793          171.902     4.370         0.000
    10       67.6254           65.984     1.025         0.306

    So we can reproduce the values from statsmodel.

      September 19, 2020 4:20 PM IST
    0