Calculating Pearson correlation and significance in Python

  •  

    I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

      June 11, 2019 4:02 PM IST
  • You can have a look at scipy.stats:

    from pydoc import help
    from scipy.stats import pearsonr
    help(pearsonr)
    
    >>>
    Help on function pearsonr in module scipy.stats.stats:
    
    pearsonr(x, y)
     Calculates a Pearson correlation coefficient and the p-value for testing
     non-correlation.
    
     The Pearson correlation coefficient measures the linear relationship
     between two datasets. Strictly speaking, Pearson's correlation requires
     that each dataset be normally distributed. Like other correlation
     coefficients, this one varies between -1 and +1 with 0 implying no
     correlation. Correlations of -1 or +1 imply an exact linear
     relationship. Positive correlations imply that as x increases, so does
     y. Negative correlations imply that as x increases, y decreases.
    
     The p-value roughly indicates the probability of an uncorrelated system
     producing datasets that have a Pearson correlation at least as extreme
     as the one computed from these datasets. The p-values are not entirely
     reliable but are probably reasonable for datasets larger than 500 or so.
    
     Parameters
     ----------
     x : 1D array
     y : 1D array the same length as x
    
     Returns
     -------
     (Pearson's correlation coefficient,
      2-tailed p-value)
    
     References
     ----------
     http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
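
    For the two lists in the question, a minimal usage sketch might look like this, assuming a reasonably recent SciPy is installed. The sample lists are made up for illustration:

```python
from scipy.stats import pearsonr

# Hypothetical sample data; any two equal-length numeric sequences work.
list1 = [1, 2, 3, 4, 5]
list2 = [2, 1, 4, 3, 5]

# The result unpacks as (correlation coefficient, two-tailed p-value).
r, p_value = pearsonr(list1, list2)
print("r = %.4f, p = %.4f" % (r, p_value))
```

    With only five points the p-value is not small even though r is fairly high, which is worth keeping in mind when reading the significance.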
    This post was edited by Rakesh Racharla at September 23, 2020 11:03 AM IST
      June 11, 2019 4:03 PM IST
  •  

    The Pearson correlation can be calculated with numpy's corrcoef.

    import numpy
    numpy.corrcoef(list1, list2)[0, 1]
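
    Note that numpy.corrcoef returns the full correlation matrix rather than a single number, and it does not report a p-value. A short sketch with made-up lists shows why the [0, 1] element is the one to read:

```python
import numpy

# Hypothetical sample data.
list1 = [1, 2, 3]
list2 = [1, 5, 7]

# 2x2 symmetric matrix: the off-diagonal cells hold r for (list1, list2).
matrix = numpy.corrcoef(list1, list2)
r = matrix[0, 1]
print(matrix.shape)
print(round(float(r), 6))
```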
      September 23, 2020 11:05 AM IST
  • If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

     

    def pearsonr(x, y):
      # Assume len(x) == len(y)
      n = len(x)
      sum_x = float(sum(x))
      sum_y = float(sum(y))
      sum_x_sq = sum(xi * xi for xi in x)
      sum_y_sq = sum(yi * yi for yi in y)
      # Sum of products of paired values (zip replaces Python 2's itertools.imap)
      psum = sum(xi * yi for xi, yi in zip(x, y))
      num = psum - (sum_x * sum_y / n)
      den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
      # The correlation is undefined for constant input; this hack returns 0 then
      if den == 0: return 0
      return num / den
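
    The hack above returns only the coefficient, not its significance. One scipy-free way to estimate a p-value is an exact permutation test, which is a different technique from the t-distribution-based p-value that scipy.stats.pearsonr reports. The sketch below is only practical for short lists, because it enumerates all n! orderings. The helper names pearson_r and permutation_p_value are made up for this example:

```python
from itertools import permutations
from math import sqrt

def pearson_r(x, y):
    # Definition-based coefficient; returns 0.0 for constant input.
    n = len(x)
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    den = sqrt(var_x * var_y)
    return cov / den if den else 0.0

def permutation_p_value(x, y):
    # Two-sided p-value: share of all orderings of y whose |r| is at
    # least as large as the observed |r| (small tolerance for rounding).
    observed = abs(pearson_r(x, y))
    hits = total = 0
    for shuffled in permutations(y):
        total += 1
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / float(total)

p = permutation_p_value([1, 2, 3], [1, 5, 7])
print(p)
```

    For longer lists, sampling random shuffles instead of enumerating every ordering would be the usual variation on this idea.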
      September 23, 2020 11:07 AM IST
  • The following code is a straight-up interpretation of the definition:

    import math
    
    def average(x):
        assert len(x) > 0
        return float(sum(x)) / len(x)
    
    def pearson_def(x, y):
        assert len(x) == len(y)
        n = len(x)
        assert n > 0
        avg_x = average(x)
        avg_y = average(y)
        diffprod = 0
        xdiff2 = 0
        ydiff2 = 0
        for idx in range(n):
            xdiff = x[idx] - avg_x
            ydiff = y[idx] - avg_y
            diffprod += xdiff * ydiff
            xdiff2 += xdiff * xdiff
            ydiff2 += ydiff * ydiff
    
        return diffprod / math.sqrt(xdiff2 * ydiff2)

    Test:

    print("%.12g" % pearson_def([1, 2, 3], [1, 5, 7]))

    returns

    0.981980506062

    This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

    R:

    > cor( c(1,2,3), c(1,5,7))
    [1] 0.9819805
    This post was edited by Laksh Nath at September 23, 2020 11:10 AM IST
      September 23, 2020 11:09 AM IST
  • You can do this with pandas.DataFrame.corr, too:

    import pandas as pd
    a = [[1, 2, 3],
         [5, 6, 9],
         [5, 6, 11],
         [5, 6, 13],
         [5, 3, 13]]
    df = pd.DataFrame(data=a)
    df.corr()

    This gives

              0         1         2
    0  1.000000  0.745601  0.916579
    1  0.745601  1.000000  0.544248
    2  0.916579  0.544248  1.000000
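
    DataFrame.corr returns the pairwise correlations between columns and, like numpy.corrcoef, no p-value. For just two lists, as in the question, Series.corr gives the single coefficient directly. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical sample data, one Series per list.
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 5, 7])

# Pearson is the default method; the result is a single number.
r = s1.corr(s2)
print(round(float(r), 6))
```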
      September 23, 2020 11:25 AM IST