Calculating Pearson correlation and significance in Python


Tags : statistics python

Raji Reddy A

I am looking for a function that takes two lists as input and returns both the Pearson correlation and the significance (p-value) of that correlation.

June 11, 2019 4:02 PM IST

Rakesh Racharla

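You can use pearsonr from scipy.stats, which returns both the correlation coefficient and a two-tailed p-value. The help text below is from an older SciPy release; newer releases print a longer docstring.

from scipy.stats import pearsonr
help(pearsonr)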

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
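
To answer the question directly, here is a minimal sketch of calling pearsonr on two plain lists. It assumes SciPy is installed; the sample values are illustrative only, reused from the test case further down this thread.

```python
from scipy.stats import pearsonr

# Illustrative sample data (not from the asker)
x = [1, 2, 3]
y = [1, 5, 7]

# pearsonr returns the coefficient and the two-tailed p-value
r, p_value = pearsonr(x, y)

print("r =", r)        # roughly 0.982 for this data
print("p =", p_value)  # two-tailed p-value for the null of no correlation
```

With only three points the p-value is not very informative; as the docstring notes, it becomes more trustworthy on larger datasets.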

This post was edited by Rakesh Racharla at September 23, 2020 11:03 AM IST

June 11, 2019 4:03 PM IST

Jasmine Chacko

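The Pearson correlation can be calculated with numpy's corrcoef. It returns the full correlation matrix, so index [0, 1] to get the coefficient between the two lists. Note that it does not return a p-value.

import numpy
numpy.corrcoef(list1, list2)[0, 1]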
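
As a small sketch of what corrcoef actually returns, the example below uses made-up sample lists (list1 and list2 are placeholders, not data from the question) and shows the 2x2 matrix shape alongside the extracted coefficient.

```python
import numpy as np

# Placeholder sample data for illustration
list1 = [1, 2, 3]
list2 = [1, 5, 7]

# corrcoef stacks the inputs as variables and returns a symmetric matrix
corr_matrix = np.corrcoef(list1, list2)
r = float(corr_matrix[0, 1])

print(corr_matrix.shape)  # (2, 2)
print(round(r, 6))        # 0.981981
```

The diagonal entries are the correlation of each list with itself, and the two off-diagonal entries are equal.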
September 23, 2020 11:05 AM IST

Rishi Pandya

If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  # Sums of squares of each list
  sum_x_sq = sum(xi ** 2 for xi in x)
  sum_y_sq = sum(yi ** 2 for yi in y)
  # Sum of element-wise products (zip replaces Python 2's itertools.imap)
  psum = sum(xi * yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y / n)
  den = ((sum_x_sq - sum_x ** 2 / n) * (sum_y_sq - sum_y ** 2 / n)) ** 0.5
  # A constant list has zero variance, so the correlation is undefined; return 0
  if den == 0:
    return 0
  return num / den
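
The function above gives only the coefficient, not the significance. One way to estimate a p-value without SciPy is a permutation test, sketched below. This is a different technique from the t-distribution-based p-value that scipy.stats.pearsonr reports, and the names pearson_r and permutation_p_value, along with the n_perm and seed parameters, are hypothetical helpers introduced for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation from the definition (assumes equal lengths)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    den = math.sqrt(var_x * var_y)
    return cov / den if den else 0.0

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value estimate: the share of shuffles of y whose
    |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        # Small tolerance guards against floating-point ties
        if abs(pearson_r(x, y_shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

# Illustrative data: y is an exact linear function of x
xs = list(range(10))
ys = [2 * v + 1 for v in xs]
p_est = permutation_p_value(xs, ys)
print(p_est)
```

The estimate depends on the number of permutations and the seed, so treat it as approximate; for small samples it will generally not match SciPy's analytic p-value exactly.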

September 23, 2020 11:07 AM IST

Laksh Nath

The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print(pearson_def([1,2,3], [1,5,7]))

prints approximately

0.981980506062

This agrees with Excel, a correlation calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively. R gives the same value:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

This post was edited by Laksh Nath at September 23, 2020 11:10 AM IST

September 23, 2020 11:09 AM IST

Pranav B

You can do this with pandas.DataFrame.corr, too:
```
import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
```
This gives
```
          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000
```
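
For the two-list case in the question, a minimal sketch using pandas.Series.corr may be more direct than building a DataFrame. The sample values are illustrative only. Like DataFrame.corr, this returns the coefficient but no p-value.

```python
import pandas as pd

# Illustrative sample data
x = pd.Series([1, 2, 3])
y = pd.Series([1, 5, 7])

# Series.corr uses the Pearson method by default
r = x.corr(y)
print(r)  # roughly 0.982
```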
September 23, 2020 11:25 AM IST
